Bookmark this page. Review it the morning of your interview. These tables are the cheat codes for sounding prepared.

Interview Day Quick Review

The essentials worth re-reading in the elevator on the way up.
| # | Remember This |
|---|---|
| 1 | RAM is ~1,000x faster than SSD, SSD is ~100x faster than HDD. Latency intuition wins system design rounds. |
| 2 | QPS = (DAU x actions/user) / 86,400. Always estimate peak at 3x average. |
| 3 | 99.99% uptime = 52 minutes downtime/year. Availabilities of dependencies in the critical path multiply, so each one adds downtime. |
| 4 | Cache-Aside is the default caching pattern. Know when Write-Behind or Read-Through is better. |
| 5 | Default to REST for public APIs, gRPC for internal service-to-service. Know why, not just what. |
| 6 | 401 = “who are you?” (authn), 403 = “you can’t do this” (authz). Getting this wrong is a red flag. |
| 7 | O(1) = hash map, O(log n) = binary search, O(n log n) = sort. Pattern-match complexity to data structure. |
| 8 | Canary + feature flags is the gold standard deployment strategy at scale. |
| 9 | Never roll your own auth. Say this in the interview. Mean it. |
| 10 | SOLID is about managing change, not about following rules. Explain the why, not just the acronym. |
| 11 | Raft = strong leader + majority quorum. Most modern consensus systems (etcd, CockroachDB, Consul) use Raft, not Paxos. |
| 12 | CRDTs merge without coordination. G-Counter, PN-Counter, OR-Set: know which to reach for. |
| 13 | epoll is O(1) per event; select is O(n). This is why Nginx and Node.js handle 100K+ connections. |
| 14 | DynamoDB: design for access patterns, not entities. Single-table design with PK/SK is the canonical pattern. |
| 15 | WebSocket = bidirectional, SSE = server-push only (simpler), WebRTC = peer-to-peer (lowest latency). Pick the simplest protocol that fits. |
| 16 | GraphQL solves over-fetching but creates N+1 problems. Always mention DataLoader. |
| 17 | Service mesh = sidecar proxy on every pod. It handles mTLS, retries, and circuit breaking so your app code doesn’t. |
| 18 | Lambda cold starts: Go/Rust ~50-100ms, Python/Node ~100-300ms, Java ~3-10s. Choose runtime by latency budget. |
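The QPS and uptime rules above are plain arithmetic, so a short sketch may help make them stick. The traffic figures (10M DAU, 20 actions per user) and the three-dependency chain are illustrative assumptions, not numbers from this page.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_YEAR = 365 * 24 * 60


def average_qps(dau: int, actions_per_user: float) -> float:
    """QPS = (DAU x actions/user) / 86,400."""
    return dau * actions_per_user / SECONDS_PER_DAY


def peak_qps(dau: int, actions_per_user: float, peak_factor: float = 3.0) -> float:
    """Rule of thumb from the table: estimate peak at 3x average."""
    return average_qps(dau, actions_per_user) * peak_factor


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes per year a service at this availability is allowed to be down."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial_availability(*availabilities: float) -> float:
    """Availability of a request path that needs every dependency to be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical product: 10M daily active users, 20 actions each.
print(round(average_qps(10_000_000, 20)))  # about 2,300 QPS on average
print(round(peak_qps(10_000_000, 20)))     # about 6,900 QPS at peak

# One 99.99% service vs. a chain of three 99.99% dependencies.
print(downtime_minutes_per_year(0.9999))
print(downtime_minutes_per_year(serial_availability(0.9999, 0.9999, 0.9999)))
```

Three "four nines" dependencies in series allow roughly three times the downtime of one, which is why trimming the critical path is often cheaper than hardening each hop.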

System Design Interview Framework

Use this five-step framework in every system design round. It keeps you structured and prevents the most common mistake: jumping into low-level details before establishing the big picture.

1. Latency Numbers Every Engineer Should Know

Memorize the order of magnitude — interviewers care that you know SSD is ~1000x slower than RAM, not the exact nanoseconds.
| Operation | Latency | Notes |
|---|---|---|
| L1 cache reference | 0.5 ns | Fastest memory access available |
| Branch mispredict | 5 ns | CPU pipeline flush penalty |
| L2 cache reference | 7 ns | ~14x slower than L1 |
| Mutex lock/unlock | 25 ns | Contention makes this much worse |
| Main memory (RAM) reference | 100 ns | ~200x slower than L1 |
| Compress 1 KB with Snappy | 3 μs | Fast compression for real-time use |
| Read 1 MB sequentially from RAM | 3 μs | RAM is fast for sequential access |
| SSD random read | 150 μs | ~1,500x slower than RAM |
| Read 1 MB sequentially from SSD | 1 ms | SSDs excel at sequential reads |
| Network round trip (same datacenter) | 500 μs | Assumes modern datacenter networking |
| HDD disk seek | 10 ms | Mechanical latency; avoid random reads |
| Read 1 MB sequentially from HDD | 20 ms | HDDs still viable for bulk sequential I/O |
| Network round trip (cross-continent) | 150 ms | Speed of light is the bottleneck |
| TLS handshake | 250 ms | 1–2 round trips depending on version |
| DNS lookup (uncached) | ~50 ms | Varies widely; caching helps enormously |
| TCP connection setup (3-way handshake) | ~1.5x RTT | One and a half round trips |
Key ratios to remember: RAM is ~1,000x faster than SSD. SSD is ~100x faster than HDD. Network within a datacenter is ~300x faster than cross-continent.
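As a sanity check, the ratios can be recomputed from the table's own figures. This is only a sketch: the dictionary keys are names chosen here for illustration, and the values are the table's approximate numbers converted to nanoseconds.

```python
# Approximate latencies from the table above, in nanoseconds.
LATENCY_NS = {
    "ram_reference": 100,
    "ssd_random_read": 150_000,
    "hdd_seek": 10_000_000,
    "datacenter_round_trip": 500_000,
    "cross_continent_round_trip": 150_000_000,
}


def ratio(slower: str, faster: str) -> float:
    """How many times slower one operation is than another."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


print(ratio("ssd_random_read", "ram_reference"))                     # RAM vs SSD
print(ratio("hdd_seek", "ssd_random_read"))                          # SSD vs HDD
print(ratio("cross_continent_round_trip", "datacenter_round_trip"))  # DC vs WAN
```

With these figures the SSD-to-HDD gap works out to about 67x rather than exactly 100x, which is fine: the point is the order of magnitude, not the digits.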
Deep dive: Performance & Scalability | Capacity Planning
Senior: Knows the order-of-magnitude ratios (RAM vs SSD vs HDD vs network) and can apply them to justify architecture decisions. Can reason about latency budgets for a request path.
Staff: Connects latency numbers to cost and organizational strategy. Asks “where in the stack is the user’s time actually being spent?” before optimizing. Understands that tail latency (P99/P999) matters more than averages for user-facing systems and can explain Little’s Law to reason about concurrency from latency. Drives adoption of latency SLOs at the team or org level.
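Little's Law says the average number of requests in flight equals throughput times average latency. A minimal sketch of using it in both directions follows; the request rate, latency, and worker count are made-up example values.

```python
def in_flight_requests(throughput_rps: float, avg_latency_ms: float) -> float:
    """Little's Law: L = lambda * W (concurrency = arrival rate x time in system)."""
    return throughput_rps * avg_latency_ms / 1000


def max_throughput_rps(workers: int, avg_latency_ms: float) -> float:
    """Same law rearranged: with a fixed worker pool, lambda = L / W."""
    return workers * 1000 / avg_latency_ms


# 2,000 requests/second at 50 ms average latency keeps ~100 requests in flight.
print(in_flight_requests(2000, 50))

# A pool of 200 workers at 50 ms per request tops out around 4,000 requests/second.
print(max_throughput_rps(200, 50))
```

If latency doubles under load, the same pool handles half the throughput, which is one way a slow dependency turns into a capacity problem.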
AI changes how you interact with latency in three ways:
  • LLM inference latency is a new tier. A GPT-4 class API call is 500ms-5s — slower than a cross-continent round trip. If you are adding AI features to a hot path, you need async patterns, streaming responses, or pre-computation. This single number reshapes architecture for AI-augmented products.
  • AI-powered profiling tools (Datadog’s Watchdog, Dynatrace Davis AI) can automatically detect latency anomalies and correlate them with deployments, infrastructure changes, or traffic patterns — replacing hours of manual investigation.
  • Copilot-assisted estimation. When doing back-of-envelope calculations in interviews, AI tools can sanity-check your math in real-time during design docs (not during the interview itself, obviously). Build the intuition so you do not need the crutch.
Q: Why is sequential disk I/O sometimes faster than random RAM access?
A: Sequential disk reads exploit prefetching, OS page cache, and DMA. At large enough block sizes, a modern NVMe SSD doing sequential reads (~3 GB/s) can outpace random RAM access patterns that thrash CPU cache lines and TLB. Kafka’s entire design exploits this.
Q: A cross-continent round trip is 150ms. Your SLA is 200ms. What do you do?
A: Move compute to the edge (CDN workers, regional deployments), pre-compute and cache at edge locations, or accept eventual consistency for reads and serve from a local replica. You cannot beat the speed of light; you work around it.
Q: Why does adding TLS add 1-2 round trips, and how do you reduce it?
A: TLS 1.2 requires two round trips for key exchange before data flows. TLS 1.3 reduces this to one (or zero with 0-RTT resumption). In practice, use TLS 1.3 everywhere, enable session tickets for resumption, and terminate TLS at the edge to keep internal traffic fast.

2. Database Selection Matrix

| Use Case | Recommended DB | Reasoning |
|---|---|---|
| Transactions, complex joins | PostgreSQL / MySQL | ACID guarantees, mature tooling, SQL standard |
| Flexible schema, rapid dev | MongoDB / DynamoDB | Document model maps to application objects, schema-on-read |
| Session store, caching, leaderboards | Redis / Memcached | Sub-ms latency, in-memory, simple key-value operations |
| Social networks, recommendations | Neo4j / Amazon Neptune | Native graph traversal, relationship-first data model |
| Metrics, IoT, monitoring | TimescaleDB / InfluxDB | Optimized for time-ordered writes and range queries |
| Full-text search, log analytics | Elasticsearch / OpenSearch | Inverted index, fuzzy matching, aggregation pipelines |
| Wide-column, massive scale | Cassandra / ScyllaDB | Linear horizontal scaling, tunable consistency |
| Embedded / edge devices | SQLite | Zero-config, single-file, surprisingly powerful |
| Multi-model (graph + doc + KV) | ArangoDB / SurrealDB | One engine for multiple access patterns |
No database is “best.” The right choice depends on your access patterns, consistency requirements, team expertise, and operational budget. Picking a DB because it is trendy is a career-limiting move.
Deep dive: APIs & Databases
Senior: Evaluates databases along access pattern, consistency, and scale dimensions. Can articulate trade-offs between 2-3 options for a given use case and justify a recommendation.
Staff: Thinks about database selection as an organizational decision, not just a technical one. Factors in team operational expertise, on-call burden, migration cost from the current stack, vendor lock-in risk, and the total cost of ownership over 3-5 years. Advocates for polyglot persistence (using different databases for different data paths) while managing the complexity tax. Pushes back on “one database to rule them all” and on unnecessary proliferation equally.
  • Vector databases are a new category. If your system needs semantic search or AI-powered recommendations, you need a vector store (Pinecone, pgvector, Weaviate, Qdrant). This did not exist as a mainstream choice two years ago. Know when to bolt vector search onto your existing DB (pgvector) vs when to use a dedicated vector store (high-volume similarity search at <10ms).
  • AI-generated query optimization. Tools like Amazon Q for databases and EverSQL use AI to analyze slow queries and suggest indexes, rewrites, or schema changes. A staff engineer evaluates these suggestions critically rather than blindly applying them.
  • Schema design with AI. LLMs are surprisingly good at generating initial schema designs from natural-language requirements. The staff-level skill is reviewing the AI-generated schema for normalization issues, missing indexes for write-heavy patterns, and access-pattern mismatches that the AI cannot anticipate without production traffic data.
Q: When would you pick SQLite over Postgres for a production service?
A: Embedded or edge use cases where zero-ops is the priority: mobile apps, IoT devices, single-tenant desktop applications, or read-heavy services where the dataset fits on one machine. Litestream adds streaming replication to S3. SQLite is also convenient for local test databases because it needs zero setup, but its SQL dialect and typing rules differ from Postgres, so it is not a faithful stand-in for a Postgres-backed service.
Q: Your team is debating MongoDB vs DynamoDB for a new microservice. What is the first question you ask?
A: “Do you need ad-hoc query flexibility during development, or are all access patterns known upfront?” If known, DynamoDB’s single-table design with zero ops is hard to beat. If the schema is evolving and developers need to explore data with complex queries, MongoDB’s flexible query language wins. Secondary consideration: are you all-in on AWS? DynamoDB’s vendor lock-in is absolute.
Q: What is the most common mistake teams make when choosing a database?
A: Choosing based on the write path and ignoring the read path. Your database serves reads 10-100x more than writes in most applications. Teams pick a write-optimized store like Cassandra then suffer when they need to do anything beyond primary-key lookups for reads.
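To make "zero-config, single-file" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The users table and the email value are invented for illustration; an in-memory database is used so the example leaves no file behind.

```python
import sqlite3

# ":memory:" keeps the database in RAM; pass a file path instead for a
# single-file on-disk database. No server process or setup is required.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)

# Using the connection as a context manager commits on success
# and rolls back if the block raises.
with conn:
    conn.execute("INSERT INTO users (email) VALUES (?)", ("ada@example.com",))

row = conn.execute(
    "SELECT id, email FROM users WHERE email = ?", ("ada@example.com",)
).fetchone()
print(row)
```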

3. Caching Strategy Decision Tree

| Pattern | How It Works | When to Use | Trade-off |
|---|---|---|---|
| Cache-Aside | App checks cache; on miss, reads DB, fills cache | Read-heavy, general purpose | Possible stale data; app manages logic |
| Read-Through | Cache fetches from DB on miss automatically | Transparent caching | Cache library must support DB integration |
| Write-Through | Write to cache and DB synchronously | Cannot tolerate stale reads | Higher write latency (two writes) |
| Write-Behind | Write to cache; async flush to DB | Write-heavy, low-latency needs | Data loss risk if cache crashes pre-flush |
| Refresh-Ahead | Proactively refresh before TTL expires | Predictable access, low latency | Wasted resources if prediction is wrong |
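The Cache-Aside row can be sketched in a few lines. This is an in-process illustration under stated assumptions: CacheAside, load_from_db, and the injectable clock are hypothetical names, and a real deployment would typically put Redis or Memcached where the dictionary is.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """App-managed cache: check cache, on miss read the DB, then fill the cache."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]                      # hit: still within TTL
        value = self._load(key)                  # miss or expired: go to the DB
        self._store[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read refetches fresh data."""
        self._store.pop(key, None)


# Stand-in "database" for the example.
database = {"user:1": "Ada"}
cache = CacheAside(load_from_db=lambda k: database[k], ttl_seconds=60)
print(cache.get("user:1"))  # first call loads from the DB, later calls hit the cache
```

The trade-off in the table shows up directly: if the database changes and nobody calls invalidate, readers see the old value until the TTL runs out.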
As Phil Karlton said: “There are only two hard things in Computer Science: cache invalidation and naming things.”
Strategies for invalidation:
  • TTL (Time-To-Live): Simple, but stale data during the window.
  • Event-driven invalidation: Publish a cache-bust event on write. Accurate but adds coupling.
  • Version keys: Append a version number to cache keys; bump version on write.
  • Lease-based: Cache entry holds a lease; writer must acquire lease before updating.
Rule of thumb: If your data changes less than once per minute, TTL is usually fine. If it changes per-second, use event-driven invalidation.
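Of the strategies listed, version keys are the least obvious, so here is a small sketch of the idea. VersionedCache and its method names are hypothetical; in practice the version counter usually lives in the cache or the database, and stale entries are left to expire by TTL or LRU eviction.

```python
from typing import Dict, Optional


class VersionedCache:
    """Version-key invalidation: bumping the version makes old entries unreachable."""

    def __init__(self) -> None:
        self._versions: Dict[str, int] = {}
        self._cache: Dict[str, str] = {}

    def _key(self, entity_id: str) -> str:
        # The current version is part of the cache key.
        return f"{entity_id}:v{self._versions.get(entity_id, 0)}"

    def get(self, entity_id: str) -> Optional[str]:
        return self._cache.get(self._key(entity_id))

    def put(self, entity_id: str, value: str) -> None:
        self._cache[self._key(entity_id)] = value

    def bump(self, entity_id: str) -> None:
        """Call on every write to the entity; nothing is deleted explicitly."""
        self._versions[entity_id] = self._versions.get(entity_id, 0) + 1


cache = VersionedCache()
cache.put("user:42", "old profile")
cache.bump("user:42")          # a write happened
print(cache.get("user:42"))    # None: readers now look under the new key
```

The appeal is that invalidation is a single counter increment rather than a search for every key derived from the entity.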
Deep dive: Caching & Observability
Caching is the single highest-leverage performance optimization in most systems. A cache hit ratio improvement from 90% to 99% means your database sees 10x fewer queries — that is the difference between needing one database instance and ten. But caching done wrong creates some of the hardest bugs to reproduce: stale data served to the wrong user, cache stampedes during traffic spikes, and “ghost data” that persists after deletion. Every production caching incident I have seen stems from one of three root causes: no invalidation strategy (relying on TTL alone), no stampede protection, or caching user-specific data with a key that does not include the user ID.
Senior: Knows the five caching patterns, can design a cache-aside strategy with TTL and event-driven invalidation, and understands cache stampede prevention.
Staff: Thinks about caching as a system-wide concern, not a per-service optimization. Designs cache warming strategies for cold-start scenarios (new region deployment, cache cluster replacement). Understands the economic model: “this Redis cluster costs $X/month and saves us $Y/month in database instances.” Knows when NOT to cache: write-heavy paths where invalidation cost exceeds read savings, data with high cardinality and low repeat access, or security-sensitive data where staleness creates compliance risk.
  • AI-powered cache warming. ML models can predict which keys will be requested next based on access patterns, pre-warming the cache proactively. Netflix uses this approach for content metadata caching.
  • Intelligent TTL tuning. Instead of static TTLs, AI can dynamically adjust TTL per key based on access frequency and change frequency. A product page viewed 10,000 times/hour gets a longer TTL than one viewed 3 times/day.
  • LLM response caching is a new challenge. Semantic caching (caching LLM responses by meaning, not exact input) using embedding similarity is an emerging pattern. If two prompts are semantically identical, serve the cached response instead of making a $0.03 API call.
Q: Your cache hit ratio is 99.5% and your P99 latency is still high. Why?
A: The 0.5% of cache misses are the long-tail requests that hit the database. If those misses correlate with complex queries (aggregation, full-text search, unindexed lookups), the P99 reflects the database’s worst-case performance, not the cache’s. Fix the slow queries or add a second-level cache (local in-process cache for the hottest keys, Redis for the rest).
Q: Name a situation where adding a cache makes your system slower.
A: When the cache miss penalty is higher than not having a cache at all. Example: cache-aside with a cold cache under load, where every request misses, queries the DB, then writes to the cache. The cache write adds latency to an already-slow DB path. Solution: warm the cache before cutting traffic over, or use a read-through cache that coalesces concurrent misses.
Q: Write-behind cache: when is the data loss risk acceptable?
A: When the data can be reconstructed or is not the source of truth. Examples: analytics counters, user activity logs, search index updates. Never acceptable for financial transactions, inventory state, or anything where losing a write means money or compliance.
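Coalescing concurrent misses (often called single-flight) is one way to blunt a stampede: the first caller for a key does the expensive load, and everyone else arriving meanwhile waits for that result. The sketch below uses threads and hypothetical names (SingleFlight, demo); it is not taken from any particular library.

```python
import threading
import time
from typing import Callable, Dict, List, Tuple


class _Call:
    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: object = None
        self.error: Exception | None = None


class SingleFlight:
    """Ensure only one loader runs per key at a time; followers share its result."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], object]) -> object:
        with self._lock:
            call = self._calls.get(key)
            is_leader = call is None
            if is_leader:
                call = _Call()
                self._calls[key] = call
        if is_leader:
            try:
                call.value = fn()               # only the leader hits the DB
            except Exception as exc:            # followers see the same failure
                call.error = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()
        else:
            call.done.wait()                    # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.value


def demo(num_callers: int = 5) -> Tuple[int, List[object]]:
    """Fire concurrent callers at one key; returns (loader_calls, results)."""
    loader_calls: List[int] = []
    results: List[object] = []
    flight = SingleFlight()
    barrier = threading.Barrier(num_callers)

    def slow_load() -> str:
        loader_calls.append(1)
        time.sleep(0.2)                         # simulated slow database query
        return "value"

    def worker() -> None:
        barrier.wait()                          # release all callers together
        results.append(flight.do("hot-key", slow_load))

    threads = [threading.Thread(target=worker) for _ in range(num_callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(loader_calls), results
```

Calling demo() should show one loader call serving all five callers. Note this only coalesces within one process; across many app servers you would still want a lock or lease in the shared cache.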

4. API Style Comparison

| Dimension | REST | gRPC | GraphQL | WebSocket |
|---|---|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/2 (always) | HTTP/1.1 or HTTP/2 | TCP (upgraded from HTTP) |
| Payload format | JSON (typically) | Protocol Buffers (binary) | JSON | Any (text or binary frames) |
| Best for | Public APIs, CRUD | Internal microservices, low-latency | Mobile/frontend with varied data needs | Real-time bidirectional communication |
| Streaming | Not native (SSE possible) | Bidirectional streaming built-in | Subscriptions via WebSocket | Full-duplex by design |
| Tooling | Excellent (Postman, curl) | Growing (grpcurl, BloomRPC) | Good (GraphiQL, Apollo) | Moderate (wscat) |
| Schema/Contract | OpenAPI / Swagger | .proto files (strict) | SDL (strongly typed) | No built-in contract |
| Overhead | Moderate (text-based) | Low (binary, multiplexed) | Moderate (single endpoint) | Low after handshake |
| Cacheability | Excellent (HTTP caching) | Hard (binary, no native HTTP cache) | Hard (POST requests) | Not applicable |
| Browser support | Native | Requires grpc-web proxy | Native | Native |
Default to REST for public APIs. Use gRPC for internal service-to-service communication where latency matters. Use GraphQL when clients have highly variable data needs. Use WebSockets only when you truly need server-push or bidirectional streaming.
Deep dive: APIs & Databases
Senior: Can compare REST, gRPC, GraphQL, and WebSocket across dimensions like caching, tooling, and browser support. Picks the right protocol for a given use case.
Staff: Designs API strategy at the organization level. Establishes conventions (REST for public, gRPC for internal, GraphQL for BFF) and builds tooling to enforce them. Understands API versioning as a product decision: breaking changes affect external developers, partner SLAs, and revenue. Thinks about API governance: schema registries for protobuf, OpenAPI linting in CI, deprecation policies.
  • AI-powered API generation. Tools like GitHub Copilot and Cursor can generate OpenAPI specs, protobuf definitions, and GraphQL schemas from natural-language descriptions. The staff-level skill is reviewing generated contracts for backward compatibility, proper error modeling, and missing edge cases (pagination, rate limiting, partial failures).
  • LLM-as-a-service APIs are creating a new API style: streaming JSON over SSE for token-by-token responses. If you are building AI features, you need to understand SSE streaming, chunked transfer encoding, and how to display partial results in UIs. This is a hybrid of REST and real-time protocols.
  • Automatic API documentation. AI tools can generate human-readable documentation from OpenAPI specs, and vice versa. But the staff engineer knows that the best API docs include example workflows, error handling guides, and rate limit explanations — context that AI cannot infer from the schema alone.
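The "streaming JSON over SSE" style mentioned above comes down to a simple wire format: each event is a `data:` line followed by a blank line. The sketch below frames and parses such a stream. The `{"token": ...}` payload shape and the `[DONE]` end marker are assumptions borrowed from common LLM APIs, not part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator, List


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as one Server-Sent Event ("data: ...", then a blank line)."""
    for token in tokens:
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"   # assumed end-of-stream marker, a convention only


def parse_sse(stream: str) -> List[str]:
    """Client side: split on blank lines and collect tokens until the end marker."""
    tokens: List[str] = []
    for frame in stream.split("\n\n"):
        if not frame.startswith("data: "):
            continue
        payload = frame[len("data: "):]
        if payload == "[DONE]":
            break
        tokens.append(json.loads(payload)["token"])
    return tokens


wire = "".join(sse_events(["Hel", "lo"]))
print(parse_sse(wire))  # ['Hel', 'lo']
```

A real server would send these frames with the `text/event-stream` content type and flush after each one so the UI can render partial output as it arrives.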
Q: When is gRPC a bad choice even for internal services?
A: When your team needs to debug requests easily (gRPC is binary, not curl-friendly), when you need HTTP caching (gRPC responses are not natively cacheable), or when services are polyglot with some languages having poor gRPC support. Also: if your internal services communicate through an API gateway that does not support HTTP/2 end-to-end, gRPC loses its performance advantage.
Q: GraphQL subscriptions vs WebSocket: what is the relationship?
A: GraphQL subscriptions typically USE WebSocket as the transport layer (via the graphql-ws protocol). The subscription is the GraphQL abstraction (declarative, typed, schema-driven), WebSocket is the underlying transport. You can also implement subscriptions over SSE for simpler use cases.
Q: You are building a public API. REST or GraphQL?
A: REST, almost always. Public APIs need aggressive HTTP caching (CDN-friendly), simple rate limiting (requests/second, not query-cost analysis), stable versioning (URL-based), and universal client compatibility (curl, any HTTP library). GraphQL’s advantages (flexible querying, no over-fetching) primarily benefit first-party clients where you control both sides.

5. Deployment Strategy Matrix

| Strategy | Risk Level | Downtime | Infra Cost | Complexity | Rollback Speed | Best For |
|---|---|---|---|---|---|---|
| Rolling | Medium | Zero | Low | Low | Slow | Stateless services, general use |
| Blue-Green | Low | Zero | High (2x) | Medium | Instant | Critical services needing instant rollback |
| Canary | Low | Zero | Medium | High | Fast | High-traffic services, gradual validation |
| Shadow | Very Low | Zero | High | Very High | N/A (no live traffic affected) | Testing new versions with real traffic patterns |
| Recreate | High | Yes | Low | Low | Slow | Dev/staging, or when in-place upgrade is required |
| A/B Testing | Low | Zero | Medium | High | Fast | Feature experiments, UX testing |
Canary + feature flags is the gold standard for production deployments at scale. Roll out to 1% of traffic, monitor error rates and latency, then gradually increase.
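One common way to implement "roll out to 1%, then gradually increase" is to hash each user into a stable bucket from 0 to 99 and compare it with the current percentage. The sketch below assumes that approach; bucket, in_rollout, and the "checkout-v2" salt are illustrative names, not any feature-flag product's API.

```python
import hashlib


def bucket(user_id: str, salt: str = "checkout-v2") -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag (the salt)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(user_id: str, percent: int, salt: str = "checkout-v2") -> bool:
    """True if this user falls inside the current rollout percentage."""
    return bucket(user_id, salt) < percent


# Raising the percentage only ever adds users; nobody flips back and forth.
for pct in (1, 10, 100):
    exposed = sum(in_rollout(f"user-{i}", pct) for i in range(10_000))
    print(pct, exposed)
```

Because the bucket depends only on the user and the flag name, a user who saw the new version at 1% keeps seeing it at 10%, and using a different salt per flag keeps the same users from always being the guinea pigs.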
Deep dive: Networking & Deployment
Your deployment strategy is your most frequent risk decision. A team deploying 10 times per day makes 3,650 risk decisions per year. The difference between a rolling deploy and a canary deploy is the difference between “we discovered the bug affected 100% of users for 3 minutes” and “we discovered the bug affected 1% of users for 30 seconds before auto-rollback.” At scale, this is measured in dollars: a 3-minute full outage at 10M requests/minute on an e-commerce platform can cost $500K+ in lost revenue.
Senior: Knows the strategies, can pick canary for a critical service and rolling for a low-risk one. Sets up basic automated rollback on error rate thresholds.
Staff: Designs the deployment platform for the organization. Builds progressive delivery pipelines with automated canary analysis (Netflix Kayenta, Flagger), integrates feature flags with deployment (decouple deploy from release), and establishes deployment SLOs (“99% of deploys complete in <15 minutes, 0% cause customer-visible incidents”). Thinks about deployment as a sociotechnical system: how fast can you ship without breaking trust?
  • AI-powered canary analysis. Tools like Harness and Dynatrace use ML to automatically compare canary metrics against baseline, detecting anomalies humans would miss (subtle latency distribution shifts, slow memory leaks, gradual error rate increase).
  • Predictive rollback. AI models trained on historical deployment data can predict whether a deploy will need rollback before the canary period completes, based on early signals (first 60 seconds of metrics).
  • AI-assisted rollback root cause analysis. When a deploy is rolled back, AI tools can correlate the diff, error logs, and metric changes to pinpoint the exact code change that caused the regression — reducing mean time to fix from hours to minutes.
Q: Canary shows 0.5% error rate increase but the baseline also has 0.5% variance. Do you proceed?
A: You need statistical significance, not just a delta. Check the sample size: if the canary has served <1000 requests, 0.5% is noise. Wait for a statistically significant sample. If you have 100K+ requests and the confidence interval does not overlap with the baseline, it is a real signal; investigate before proceeding.
Q: Blue-green deployment costs 2x infrastructure. When is it worth it?
A: When instant rollback is non-negotiable: payment processing, auth services, or any service where a bad deploy costs more per minute than the infrastructure savings per month. If a 30-second outage in your payment service costs $50K, the $2K/month for 2x infra is obviously worth it.
Q: Why do feature flags matter for deployment strategy?
A: Feature flags decouple deployment (code goes to production) from release (users see the feature). You can deploy risky code behind a flag, validate in production with 0% exposure, then gradually roll out to 1%, 10%, 100%. Rollback is flipping a flag (milliseconds), not redeploying (minutes). This is strictly better than canary for feature-level risk.
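The "statistical significance, not just a delta" point can be illustrated with a two-proportion z-test on error counts. This is one simple choice of test, named here as an assumption; production canary analysis tools generally use more robust methods, and the request and error counts below are invented.

```python
import math


def two_proportion_z(base_errors: int, base_total: int,
                     canary_errors: int, canary_total: int) -> float:
    """z-score for 'canary error rate is higher than baseline' (pooled variance)."""
    p_base = base_errors / base_total
    p_canary = canary_errors / canary_total
    pooled = (base_errors + canary_errors) / (base_total + canary_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / canary_total))
    if se == 0:
        return 0.0  # no errors (or all errors) on both sides: nothing to compare
    return (p_canary - p_base) / se


def one_sided_p_value(z: float) -> float:
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Baseline: 1.0% errors over 100,000 requests.
# Small canary: 1.5% errors, but only 800 requests.
p_small = one_sided_p_value(two_proportion_z(1000, 100_000, 12, 800))
# Large canary: the same 1.5% error rate over 100,000 requests.
p_large = one_sided_p_value(two_proportion_z(1000, 100_000, 1500, 100_000))
print(p_small, p_large)
```

The same 0.5-point increase is indistinguishable from noise at 800 requests and overwhelming at 100,000, which is the argument for waiting on sample size before promoting or rolling back.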

6. Authentication Method Decision Matrix

| Method | Use Case | Stateful? | Revocation | Complexity | Scalability |
|---|---|---|---|---|---|
| Session | Traditional web apps | Yes | Easy (delete from store) | Low | Requires shared store (Redis) |
| JWT | Stateless APIs, microservices | No | Hard (must wait for expiry or use blocklist) | Medium | Excellent (no central store) |
| OAuth 2.0 | Third-party access, SSO | Depends | Moderate (token revocation endpoint) | High | Good |
| API Key | Server-to-server, developer APIs | Yes | Easy (delete key) | Low | Good |
| mTLS | Zero-trust service mesh, internal | No | Hard (CRL/OCSP) | Very High | Excellent |
| SAML | Enterprise SSO | Yes | Moderate | High | Good |
| Passkeys/WebAuthn | Passwordless consumer auth | No | Easy (remove credential) | Medium | Excellent |
Never roll your own auth for production systems. Use battle-tested libraries and standards. The most common security breaches come from custom authentication implementations.
Deep dive: Auth & Security
Auth is the one domain where “it works in testing” means absolutely nothing. The most expensive production incidents in software history are auth failures: Equifax (unpatched Apache Struts — $700M settlement), SolarWinds (compromised build pipeline tokens), and countless JWT-related breaches where tokens were never revoked. The interview one-liner: “The most dangerous line of code is the one that checks whether a user is allowed to do something. Get it wrong and nothing else matters.”
Senior: Understands JWT vs sessions trade-offs, can implement OAuth 2.0 flows, knows to never roll your own auth. Can design a token refresh strategy.
Staff: Designs the auth architecture across the organization. Chooses between centralized auth service vs decentralized JWT validation based on latency and revocation requirements. Thinks about token lifecycle management: rotation policies, revocation infrastructure, key management (HSM vs software keys). Evaluates auth decisions through a compliance lens: SOC2, PCI-DSS, HIPAA all have specific requirements for session management, token storage, and credential rotation. Drives adoption of zero-trust networking where auth is verified at every hop, not just at the edge.
  • AI-powered anomaly detection for auth. ML models detect suspicious login patterns (impossible travel, credential stuffing, token abuse) in real-time. AWS GuardDuty and Datadog Security Monitoring use AI to flag compromised credentials before manual review could catch them.
  • AI code review for auth vulnerabilities. Tools like Snyk Code and Semgrep use AI to detect auth anti-patterns in code: hardcoded secrets, missing CSRF tokens, JWT validation bypass paths, insecure token storage. These should be in your CI pipeline.
  • Passkeys and the passwordless future. AI-powered biometric authentication (face, fingerprint) is replacing passwords. As an engineer, understand that WebAuthn/FIDO2 eliminates the entire class of password-related attacks (phishing, credential stuffing, brute force). This is not a trend — it is the end state.
Q: Your JWT has a 15-minute expiry. A user is mid-checkout when it expires. What happens?
A: The API returns 401. The client’s HTTP interceptor catches it, silently sends the refresh token to get a new JWT, retries the original request. The user never notices. If the refresh token is also expired, redirect to login. This “silent refresh” flow must be implemented correctly or users get logged out randomly, one of the most common JWT UX bugs.
Q: mTLS for internal services: overkill or essential?
A: Depends on your threat model. If you trust your network boundary (single VPC, no third-party access), mTLS between internal services may be overhead. If you are in a zero-trust environment (multi-tenant cloud, compliance mandates, or past breach), mTLS is essential. The real cost is operational: certificate rotation, debugging TLS handshake failures, and sidecar overhead if using a service mesh for it.
Q: Why is 403 Forbidden not always the right response for authorization failures?
A: Sometimes you should return 404 Not Found instead. If a user should not even know a resource exists (e.g., another tenant’s data in a multi-tenant system), returning 403 leaks information: the attacker now knows the resource exists. Return 404 to hide the resource’s existence entirely. This is called “authorization by obscurity” layered on top of real authorization.
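The 401 / 403 / 404 distinctions discussed above fit in one small decision function. The User and Resource shapes and the owner-only rule are simplifying assumptions for illustration; real systems would plug in their own policy check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    user_id: str
    tenant_id: str


@dataclass(frozen=True)
class Resource:
    tenant_id: str
    owner_id: str


def status_for(user: Optional[User], resource: Optional[Resource]) -> int:
    """Pick the HTTP status for a read request on a tenant-scoped resource."""
    if user is None:
        return 401  # not authenticated: "who are you?"
    if resource is None or resource.tenant_id != user.tenant_id:
        return 404  # missing, or another tenant's data: do not reveal it exists
    if resource.owner_id != user.user_id:
        return 403  # known to exist within the tenant, but not permitted
    return 200


alice = User(user_id="alice", tenant_id="acme")
print(status_for(None, Resource("acme", "alice")))      # 401
print(status_for(alice, Resource("globex", "bob")))     # 404
print(status_for(alice, Resource("acme", "bob")))       # 403
print(status_for(alice, Resource("acme", "alice")))     # 200
```

The ordering matters: the cross-tenant check runs before the permission check so that a missing resource and a hidden one are indistinguishable to the caller.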

7. Message Queue Comparison

| Dimension | Kafka | RabbitMQ | SQS | Redis Streams |
|---|---|---|---|---|
| Throughput | Millions/sec | Tens of thousands/sec | Nearly unlimited (managed) | Hundreds of thousands/sec |
| Ordering | Per-partition | Per-queue (with caveats) | Best-effort (FIFO available) | Per-stream |
| Persistence | Disk (configurable retention) | Optional (disk or memory) | Managed (AWS handles it) | AOF / RDB snapshots |
| Delivery | At-least-once / exactly-once | At-least-once / at-most-once | At-least-once / exactly-once (FIFO) | At-least-once |
| Consumer model | Pull-based consumer groups | Push-based (with prefetch) | Pull-based polling | Consumer groups (pull) |
| Best for | Event streaming, log pipelines | Task queues, RPC, routing | Serverless, AWS-native apps | Lightweight streaming with existing Redis |
| Operational cost | High (ZooKeeper/KRaft) | Medium (Erlang runtime) | Zero (fully managed) | Low (Redis add-on) |
Use a message queue when:
  • The downstream service can be temporarily unavailable
  • You need to decouple producers from consumers
  • Work can be processed asynchronously
  • You need to buffer traffic spikes
  • Multiple consumers need the same event
Use a direct API call when:
  • You need a synchronous response
  • The operation must complete before proceeding
  • Latency is critical (queues add latency)
  • The system is simple enough that a queue adds unjustified complexity
Deep dive: Messaging, Concurrency & State
Senior: Knows Kafka vs RabbitMQ trade-offs, understands consumer groups and offsets, can design a producer-consumer pipeline with retry and dead-letter handling.
Staff: Designs the messaging platform for the organization. Makes decisions like “Kafka is our event backbone, SQS for simple task queues, no RabbitMQ” and enforces consistency. Thinks about schema evolution for events (Avro + Schema Registry), cross-team event contracts, and event discoverability (event catalog). Understands that a message queue is an organizational boundary (the contract between teams) and treats event schema changes like public API changes.
  • AI-powered consumer lag prediction. ML models can predict consumer lag hours before it becomes critical, based on producer throughput trends and consumer processing patterns. This enables preemptive scaling of consumers.
  • Intelligent dead-letter queue triage. Instead of manually inspecting DLQ messages, AI can classify failure reasons, suggest fixes, and even auto-retry messages that failed due to transient issues vs routing truly poisonous messages for human review.
  • Event-driven AI pipelines. Modern AI systems use Kafka as the backbone for real-time feature engineering, model inference pipelines, and feedback loops. Understanding Kafka is now table stakes for ML engineers, not just backend engineers.
Q: Kafka guarantees ordering per partition. How do you handle a use case that needs global ordering?
A: Use a single partition. Yes, this limits throughput to one consumer, but if you truly need global ordering (e.g., financial ledger), that is the price. If you need high throughput AND ordering, partition by entity (user ID, account ID) to get per-entity ordering with parallelism across entities.
Q: SQS FIFO vs Kafka: when do you pick SQS FIFO?
A: When you are on AWS, need exactly-once processing, stay under roughly 3,000 messages/second (the standard FIFO limit with batching; high-throughput mode raises it), and do not want any operational overhead. SQS FIFO is zero-ops, has built-in deduplication, and native Lambda integration. If you need replay, multiple independent consumers, or throughput beyond those limits, Kafka wins.
Q: Your Kafka cluster has 100 partitions but only 10 consumers. What happens?
A: Each consumer handles 10 partitions. This is fine if consumers can keep up. If they cannot, you can add up to 90 more consumers (max = partition count). Beyond that, extra consumers sit idle. If the consumers still cannot keep up at that ceiling, increase parallelism within each consumer (thread pools) or increase partition count.
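Two ideas from these answers, keyed partitioning and the "max consumers = partition count" ceiling, are easy to see in code. This is a simplified sketch: CRC32 stands in for the hash (Kafka's own default partitioner uses a different hash function), the assignment is plain round-robin rather than any real Kafka assignor, and all names are hypothetical.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Same key -> same partition, so events for one entity stay ordered."""
    return zlib.crc32(key.encode()) % num_partitions


def assign(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin partitions across the consumers of one group."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# All events for account-42 land on one partition, whatever the event type.
print(partition_for("account-42", 100) == partition_for("account-42", 100))

# 100 partitions, 10 consumers: 10 partitions each.
ten = assign(100, [f"c{i}" for i in range(10)])
print({len(parts) for parts in ten.values()})

# 100 partitions, 120 consumers: 20 consumers get nothing and sit idle.
many = assign(100, [f"c{i}" for i in range(120)])
print(sum(1 for parts in many.values() if not parts))
```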

8. Container Orchestration Quick Reference

Core Kubernetes Objects

| Object | What It Does |
|---|---|
| Pod | Smallest deployable unit; one or more containers sharing network/storage |
| Deployment | Manages ReplicaSets; handles rolling updates and rollbacks |
| ReplicaSet | Ensures a specified number of pod replicas are running at all times |
| Service | Stable network endpoint that routes traffic to a set of pods |
| Ingress | HTTP/HTTPS routing rules from external traffic to internal services |
| ConfigMap | Injects non-sensitive configuration data into pods as env vars or files |
| Secret | Stores sensitive data (tokens, passwords) with base64 encoding |
| StatefulSet | Like Deployment but with stable pod identity and persistent storage |
| DaemonSet | Runs exactly one pod per node (logging agents, monitoring) |
| Job / CronJob | Runs a task to completion once (Job) or on a schedule (CronJob) |
| Namespace | Virtual cluster for isolating resources within the same physical cluster |
| PersistentVolume (PV) | A piece of storage provisioned in the cluster |
| PersistentVolumeClaim (PVC) | A request for storage by a pod |
| HorizontalPodAutoscaler | Scales pod count based on CPU, memory, or custom metrics |
| NetworkPolicy | Firewall rules controlling pod-to-pod and external traffic |
Mental model: Deployments manage ReplicaSets, which manage Pods. Services give Pods a stable DNS name. Ingress gives Services an external URL. Everything else is configuration, storage, or scheduling.
Deep dive: Cloud & Infrastructure | Leadership & Execution
Senior: Knows the core objects, can write Deployment/Service/Ingress YAML, understands pod lifecycle, can debug CrashLoopBackOff. Sets up HPA based on CPU/memory.
Staff: Designs the Kubernetes platform for the organization. Makes decisions about cluster topology (single large cluster vs multi-cluster), namespace strategy (per-team vs per-environment), RBAC policies, network policies, and resource quota governance. Understands the cost model: knows that a c5.2xlarge node running 30% utilized pods is burning money, and drives bin-packing optimization. Evaluates when Kubernetes is overkill (team of 3, two services) and when it is essential (50+ services, multi-region). Champions GitOps (ArgoCD/Flux) for deployment consistency.
  • AI-powered resource right-sizing. Tools like Kubecost and StormForge use ML to analyze historical usage and recommend optimal resource requests/limits — eliminating the manual guesswork that wastes 30-60% of cluster spend at most organizations.
  • AI-assisted YAML generation and debugging. LLMs can generate K8s manifests from natural language, but more importantly, they can explain why a pod is not scheduling (kubectl describe output is notoriously verbose — AI can extract the one relevant line from 200 lines of output).
  • Predictive autoscaling. Instead of reactive HPA (scale after CPU spikes), AI models predict load 10-30 minutes ahead based on historical traffic patterns and pre-scale. This eliminates cold-start latency during traffic ramps.
Q: Deployment vs StatefulSet: when does the choice matter?
A: Deployment for stateless services (web servers, API gateways), where pods are interchangeable. StatefulSet for stateful workloads (databases, Kafka brokers, ZooKeeper), where pods get stable network identities (pod-0, pod-1) and persistent volumes that survive rescheduling. If you run a database in a Deployment without persistent storage, you lose your data when the pod restarts.
Q: A pod is stuck in Pending state. What do you check?
A: Three things in order: (1) The Events section of kubectl describe pod, which usually tells you directly. (2) Insufficient resources: the scheduler cannot find a node with enough CPU/memory. Check kubectl describe node for allocatable vs allocated. (3) Affinity/anti-affinity rules or taints preventing scheduling. The most common cause in production: someone set resource requests too high and no node can satisfy them.
Q: Why do Kubernetes Secrets use base64 encoding instead of encryption?
A: Base64 is encoding, not encryption: anyone with kubectl get secret access can read them. Secrets are encrypted at rest in etcd only if you enable encryption at rest; the base64 encoding is just for safe YAML transport of binary data. For actual secret management, use an external provider: AWS Secrets Manager, HashiCorp Vault, or the External Secrets Operator.
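To make the "encoding, not encryption" point concrete, here is a minimal sketch using only Python's standard library. The secret value is made up for the example; the point is that decoding needs no key.

```python
import base64

# A made-up secret value, encoded the way it would appear in a Secret manifest.
encoded = base64.b64encode(b"s3cr3t-password").decode("ascii")

# Anyone who can read the manifest can reverse the encoding without any key.
decoded = base64.b64decode(encoded)

print(encoded)
print(decoded)
```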

9. Common HTTP Status Codes for Engineers

Success (2xx)

| Code | Name | When to Use |
| --- | --- | --- |
| 200 | OK | Standard success for GET, PUT, PATCH |
| 201 | Created | Resource successfully created (POST) |
| 202 | Accepted | Request accepted for async processing (not yet completed) |
| 204 | No Content | Success with no response body (DELETE, PUT with no return) |

Redirection (3xx)

| Code | Name | When to Use |
| --- | --- | --- |
| 301 | Moved Permanently | Resource URL has permanently changed (SEO-safe redirect) |
| 302 | Found | Temporary redirect (use 307 for strict method preservation) |
| 304 | Not Modified | Client cache is still valid (conditional GET) |

Client Error (4xx)

| Code | Name | When to Use |
| --- | --- | --- |
| 400 | Bad Request | Malformed syntax, invalid parameters, validation failure |
| 401 | Unauthorized | Missing or invalid authentication credentials |
| 403 | Forbidden | Authenticated but not authorized for this resource |
| 404 | Not Found | Resource does not exist at this URI |
| 405 | Method Not Allowed | HTTP method not supported on this endpoint |
| 409 | Conflict | State conflict (duplicate resource, concurrent edit) |
| 422 | Unprocessable Entity | Syntactically valid but semantically incorrect |
| 429 | Too Many Requests | Rate limit exceeded; include Retry-After header |

Server Error (5xx)

| Code | Name | When to Use |
| --- | --- | --- |
| 500 | Internal Server Error | Unhandled exception; generic server failure |
| 502 | Bad Gateway | Upstream service returned an invalid response |
| 503 | Service Unavailable | Server is overloaded or in maintenance (temporary) |
| 504 | Gateway Timeout | Upstream service did not respond in time |
401 vs 403: 401 means “I don’t know who you are” (authentication). 403 means “I know who you are, but you can’t do this” (authorization). Getting this wrong confuses every frontend developer on the team.
Deep dive: APIs & Databases | Networking & Deployment
Q: Your API returns 200 with {"success": false, "error": "payment failed"}. What is wrong with this?
A: You are lying to the HTTP layer. Monitoring tools, CDNs, load balancers, and retry logic all use status codes to determine success. A 200 with an error body means your dashboards show green while customers are failing. Use proper status codes: 402 Payment Required, 422 Unprocessable Entity, or 400 Bad Request depending on the failure reason.
Q: 502 vs 503 vs 504: how do you differentiate them operationally?
A: 502 (Bad Gateway): the load balancer reached your service but got garbage back; your app crashed mid-response or returned invalid HTTP. 503 (Service Unavailable): your service is explicitly saying “I am overloaded or in maintenance, try later.” 504 (Gateway Timeout): the load balancer waited for your service and gave up; your app is alive but too slow. Debugging priority: 502 = check app crash logs, 503 = check scaling/circuit breakers, 504 = check downstream dependencies and timeouts.
Q: When should you use 202 Accepted?
A: When the request was valid and accepted for processing, but the work is not done yet. Classic use cases: email sending, report generation, bulk imports, or any async workflow. Return 202 with a Location header pointing to a status endpoint where the client can poll for completion. This is how you design non-blocking APIs.
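One way to avoid the "200 with an error body" trap is to map each failure reason to a status code in one place. The sketch below uses the standard library's `http.HTTPStatus`; the failure-reason strings and the `status_for_payment` helper are hypothetical names for illustration, and which code fits which failure is a judgment call for your API.

```python
from http import HTTPStatus


def status_for_payment(result: str) -> HTTPStatus:
    """Translate a (hypothetical) payment outcome into an HTTP status code."""
    mapping = {
        "ok": HTTPStatus.OK,
        "card_declined": HTTPStatus.PAYMENT_REQUIRED,         # 402
        "invalid_amount": HTTPStatus.UNPROCESSABLE_ENTITY,    # 422
        "malformed_request": HTTPStatus.BAD_REQUEST,          # 400
    }
    # Anything unrecognized is treated as a server-side failure, never as 200.
    return mapping.get(result, HTTPStatus.INTERNAL_SERVER_ERROR)


for outcome in ("ok", "card_declined", "invalid_amount", "unknown_bug"):
    print(outcome, int(status_for_payment(outcome)))
```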

10. The “Nines” Table — Availability Reference

| Availability | Common Name | Downtime / Year | Downtime / Month | Downtime / Week |
| --- | --- | --- | --- | --- |
| 99% | Two nines | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% | Three nines | 8.77 hours | 43.83 minutes | 10.08 minutes |
| 99.95% | Three and a half | 4.38 hours | 21.92 minutes | 5.04 minutes |
| 99.99% | Four nines | 52.60 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% | Five nines | 5.26 minutes | 26.30 seconds | 6.05 seconds |
| 99.9999% | Six nines | 31.56 seconds | 2.63 seconds | 0.60 seconds |
Combining availability: If Service A (99.9%) depends on Service B (99.9%), the combined availability is at best 99.9% x 99.9% = 99.8%. Each dependency in the critical path multiplies downtime.
Improving availability:
  • Redundancy: Run multiple replicas across availability zones.
  • Eliminate single points of failure: Every component in the critical path needs failover.
  • Graceful degradation: Serve cached/stale data instead of failing entirely.
  • Health checks + auto-restart: Detect and recover from failures automatically.
Rule of thumb: Most production web apps target three nines (99.9%). Banks and telecom target four to five nines. Achieving five nines requires automated everything — humans are too slow.
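The nines arithmetic above is easy to reproduce. The following is a small sketch, assuming serial dependencies that fail independently and a 365.25-day year (which matches the table's figures); the function names are made up for this example.

```python
from math import prod

MINUTES_PER_YEAR = 365.25 * 24 * 60


def combined_availability(*availabilities: float) -> float:
    """Availability of a chain where every dependency must be up (independent failures)."""
    return prod(availabilities)


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


chain = combined_availability(0.999, 0.999, 0.999)
print(f"three services at 99.9%: {chain:.4%}")
print(f"four nines: {downtime_minutes_per_year(0.9999):.2f} minutes/year")
```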
Deep dive: Reliability Principles
Availability is a business decision, not an engineering one. Each additional nine costs roughly 10x more in engineering effort, redundancy, and operational discipline. Moving from 99.9% to 99.99% means going from “we can have 8 hours of downtime per year” to “we can have 52 minutes.” That delta requires automated failover, multi-AZ redundancy, zero-downtime deployments, and an on-call team that responds in minutes, not hours. The interview one-liner: “Every dependency in the critical path multiplies your downtime. Three services at 99.9% each give you 99.7% combined — not 99.9%.”
Senior: Knows the nines table, can calculate combined availability for a dependency chain, designs for redundancy and failover.
Staff: Sets SLOs (Service Level Objectives) for the organization and builds the culture around them. Understands that SLOs are error budgets, not targets: “we have 52 minutes of downtime budget this year, and we have spent 30 minutes. Do we slow down deployments?” Differentiates between SLI (the metric), SLO (the target), and SLA (the contract with consequences). Knows that chasing five nines for a service that does not need it is a waste of engineering time that could be spent on features.
  • AIOps for incident detection. AI-powered tools (PagerDuty AIOps, BigPanda, Moogsoft) correlate alerts across services to reduce noise and identify the root cause faster. Instead of 50 alerts firing during an incident, AI groups them into one actionable incident with a probable root cause.
  • Predictive failure detection. ML models trained on historical metrics can predict failures 15-60 minutes before they happen (disk filling up, memory leak trajectory, connection pool approaching exhaustion), enabling preemptive action.
  • SLO tracking automation. Tools like Nobl9 and Datadog SLOs automate error budget tracking, alerting when you are burning budget faster than expected, and recommending whether to freeze deployments or proceed.
Q: Your service is 99.99% available but depends on a database at 99.9%. What is your actual availability?
A: About 99.89% (99.99% x 99.9%), and never better than 99.9%: your availability is capped by your least-available critical dependency. To get back to 99.99%, you need either a cache in front of the database that can serve reads during database downtime (graceful degradation) or a multi-region database with automatic failover.
Q: Is “five nines” achievable without automation?
A: No. Five nines means <5.26 minutes of downtime per year. A human cannot even be paged, wake up, VPN in, and start diagnosing in 5 minutes. Five nines requires fully automated detection, failover, and recovery. If a human is in the critical path, you are at four nines at best.
Q: Your SLO is 99.9% and you have used 90% of your error budget in Q1. What do you do?
A: Freeze non-critical deployments, prioritize reliability work (fix the top 3 sources of errors), and require canary + automated rollback for any critical deploys that must ship. This is the error budget policy in action: it converts reliability from a vague goal into a concrete operational constraint.
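The error budget question reduces to simple arithmetic. Below is a rough sketch, assuming a time-based SLO measured over a 365-day window; the function names and the 473-minute figure are hypothetical values picked for the example.

```python
def error_budget_minutes(slo: float, period_days: float) -> float:
    """Total allowed downtime, in minutes, for an SLO over a period."""
    return (1 - slo) * period_days * 24 * 60


def budget_remaining(slo: float, period_days: float, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is blown)."""
    return 1 - downtime_minutes / error_budget_minutes(slo, period_days)


annual_budget = error_budget_minutes(0.999, 365)
remaining = budget_remaining(0.999, 365, downtime_minutes=473)
print(f"annual budget: {annual_budget:.1f} minutes")
print(f"remaining: {remaining:.0%}")
```

A policy check such as "freeze non-critical deploys when less than some fraction remains" can then be a single comparison against `remaining`.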

11. Back-of-Envelope Estimation Cheat Sheet

Powers of 2 — Capacity Reference

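| Power | Exact Value | Approximate Size |
| --- | --- | --- |
| 2^10 | 1,024 | ~1 Thousand (1 KB) |
| 2^20 | 1,048,576 | ~1 Million (1 MB) |
| 2^30 | 1,073,741,824 | ~1 Billion (1 GB) |
| 2^40 | 1,099,511,627,776 | ~1 Trillion (1 TB) |
| 2^50 | 1,125,899,906,842,624 | ~1 Petabyte (1 PB) |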

Common Estimation Building Blocks

| Metric | Value |
| --- | --- |
| Seconds in a day | ~86,400 (~10^5) |
| Seconds in a month | ~2.6 million (~2.5 x 10^6) |
| Seconds in a year | ~31.5 million (~3 x 10^7) |
| Average size of a tweet / text post | ~0.5 KB |
| Average size of a photo (compressed) | ~200 KB – 2 MB |
| Average size of a short video (1 min) | ~10 MB |
| Average HTTP request/response | ~1–10 KB |
| Characters in a URL | ~100 bytes |

QPS Quick Math

| Daily Active Users | Actions/User/Day | QPS (avg) | QPS (peak, ~3x avg) |
| --- | --- | --- | --- |
| 1 million | 10 | ~115 | ~350 |
| 10 million | 10 | ~1,150 | ~3,500 |
| 100 million | 10 | ~11,500 | ~35,000 |
| 1 billion | 10 | ~115,000 | ~350,000 |
The formula: QPS = (DAU x actions per user) / 86,400. Peak QPS is typically 2x–5x the average. Always calculate peak, not just average — systems must handle bursts.
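The QPS formula translates directly into code. This is a minimal sketch; the function names are invented here, and the default 3x peak factor is the rule-of-thumb multiplier from the table, not a measured value.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau: int, actions_per_user: float) -> float:
    """Average queries per second: (DAU x actions per user) / 86,400."""
    return dau * actions_per_user / SECONDS_PER_DAY


def peak_qps(dau: int, actions_per_user: float, peak_factor: float = 3.0) -> float:
    """Peak QPS as a multiple of the average (assumed 3x by default)."""
    return average_qps(dau, actions_per_user) * peak_factor


for dau in (1_000_000, 10_000_000, 100_000_000):
    print(f"{dau:>11,} DAU: avg {average_qps(dau, 10):,.0f} QPS, peak {peak_qps(dau, 10):,.0f} QPS")
```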

Storage Estimation Formula

Daily storage = DAU x actions/user x size per action
Monthly storage = Daily x 30
Yearly storage = Daily x 365
Plan for 3–5 years of growth + replication factor (usually 3x)
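The storage formula above can be sketched the same way. The inputs below (10M DAU, 2 uploads per user per day, 500 KB each, 3 years, 3x replication) are hypothetical numbers chosen for illustration, and the helper name is made up; growth in DAU over time is deliberately ignored to keep the sketch short.

```python
def storage_estimate_bytes(
    dau: int,
    actions_per_user: float,
    bytes_per_action: int,
    years: int = 3,
    replication: int = 3,
) -> dict:
    """Daily/monthly/yearly raw storage, plus a provisioned total with replication."""
    daily = dau * actions_per_user * bytes_per_action
    return {
        "daily": daily,
        "monthly": daily * 30,
        "yearly": daily * 365,
        "provisioned": daily * 365 * years * replication,
    }


TB = 10**12
estimate = storage_estimate_bytes(10_000_000, 2, 500_000)
for label, size in estimate.items():
    print(f"{label:>12}: {size / TB:,.0f} TB")
```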
Deep dive: Capacity Planning, Git & Pipelines | System Design Practice
Senior: Can produce reasonable QPS and storage estimates with explicit assumptions. Knows powers of 2, seconds-in-a-day, and the peak-to-average ratio.
Staff: Uses estimation as a strategic tool, not just an interview skill. Before greenlighting a project, estimates total cost of ownership: compute, storage, data transfer, and engineering time. Can do “Fermi estimation” for business metrics too: “if we capture 5% of this market at $10/user/month, that is $X ARR, which justifies Y engineers.” Connects technical capacity planning to business planning.
  • AI-assisted capacity planning. Cloud providers (AWS Compute Optimizer, Google Recommender) use ML to analyze your actual usage patterns and recommend right-sized instances, reserved instance purchases, and storage tier transitions. This replaces manual estimation with data-driven forecasting.
  • LLM cost estimation is a new skill. If you are building AI features, you must estimate: tokens per request x requests per day x cost per token = daily LLM API cost. A GPT-4 call at ~$0.03/1K input tokens, serving 1M requests/day at roughly 1K input tokens per request, = ~$30K/day. This estimation skill did not exist two years ago and is now essential for AI product budgeting.
  • Sanity-checking estimates with AI. LLMs are excellent at catching order-of-magnitude errors in back-of-envelope calculations. “Does it make sense that a URL shortener needs 110 TB over 5 years?” Use AI as a reviewer for your math, not a replacement for your intuition.
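The LLM cost formula in the list above (tokens per request x requests per day x cost per token) is a one-liner. In this sketch the function name is made up, the ~1,000 input tokens per request is an assumption needed to arrive at the ~$30K/day figure, and output tokens are ignored.

```python
def daily_llm_cost(requests_per_day: int, tokens_per_request: int, usd_per_1k_tokens: float) -> float:
    """Daily API cost in USD: requests x tokens per request x price per token."""
    return requests_per_day * tokens_per_request / 1000 * usd_per_1k_tokens


# Assumed inputs: 1M requests/day, ~1,000 input tokens each, ~$0.03 per 1K input tokens.
cost = daily_llm_cost(1_000_000, 1_000, 0.03)
print(f"~${cost:,.0f}/day, ~${cost * 30:,.0f}/month")
```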
Q: You need to estimate storage for 1 billion chat messages per day. Walk through it in 30 seconds.
A: Average message ~200 bytes (text + metadata). 1B x 200B = 200 GB/day. With 3x replication = 600 GB/day. Per year = ~219 TB. With compression (2-3x for text) = ~73-110 TB/year. This fits in a single Kafka cluster or a DynamoDB table comfortably.
Q: Your boss asks “can our system handle Black Friday traffic?” What do you need to know?
A: Three numbers: (1) Current peak QPS (from metrics, not estimates). (2) Expected Black Friday multiplier (typically 3-10x normal peak for e-commerce). (3) Current headroom: what percentage of capacity are you using at current peak? If you are at 60% of capacity at current peak and expect 5x, the load would be 300% of today's capacity, so you need at least 3x your capacity just to absorb it and roughly 5x to keep the same headroom. Then identify the bottleneck: is it compute, database, cache, or a third-party dependency?
Q: Why do we estimate peak at 3x average, not 2x or 5x?
A: 3x is a commonly observed ratio for most web applications (traffic concentrates during business hours, with a peak window of 2-4 hours). Social media and gaming can see 5-10x spikes (viral events, launches). E-commerce during Black Friday can see 10-20x. The right multiplier depends on your domain; 3x is a safe default when you do not have data.

12. Design Pattern Quick Reference

| Pattern | Problem It Solves | When NOT to Use |
| --- | --- | --- |
| Singleton | Ensures one instance globally (config, connection pool) | When it hides dependencies or makes testing difficult |
| Factory Method | Decouples object creation from usage | When there is only one concrete type and it will not change |
| Observer | One-to-many notifications on state change | When the order of notification matters or chains get deep |
| Strategy | Swap algorithms at runtime without changing client code | When there is only one algorithm and no foreseeable variation |
| Decorator | Adds behavior to objects dynamically without subclassing | When the combination explosion of wrappers becomes unreadable |
| Adapter | Makes incompatible interfaces work together | When you can modify the original interface instead |
| Builder | Constructs complex objects step-by-step | For simple objects where a constructor with parameters suffices |
| Proxy | Controls access to an object (lazy load, access control, caching) | When the indirection adds latency with no real benefit |
| Circuit Breaker | Prevents cascading failures by stopping calls to failing services | When failures are transient and retries are cheap |
| CQRS | Separates read and write models for scalability | For simple CRUD apps where read/write patterns are identical |
Beyond OOP design patterns, these distributed system patterns come up frequently:
| Pattern | Purpose |
| --- | --- |
| Saga | Manage distributed transactions across microservices |
| Event Sourcing | Store state changes as an immutable sequence of events |
| Sidecar | Attach utility processes alongside your main container |
| Bulkhead | Isolate failures to prevent one component from sinking all |
| Strangler Fig | Incrementally migrate from legacy to new system |
| Leader Election | Coordinate a single active node among replicas |
| Consistent Hashing | Distribute load evenly with minimal remapping on scaling |
| Outbox Pattern | Reliably publish events alongside database transactions |
Deep dive: Design Patterns
Senior: Knows the common patterns, can identify when to apply them, and, critically, knows when NOT to use them. Can explain a pattern with a real codebase example, not just the textbook UML diagram.
Staff: Views patterns as communication tools, not implementation prescriptions. “We are using the Outbox pattern here” conveys intent to the entire team in three words. Recognizes that distributed system patterns (Saga, CQRS, Event Sourcing, Circuit Breaker) are far more interview-relevant and production-impactful than GoF patterns. Evaluates pattern adoption at the organizational level: “Should we standardize on Circuit Breaker in every service, or only at network boundaries?”
  • AI code review for pattern misuse. LLMs can detect anti-patterns in code reviews: Singleton hiding dependencies, God objects violating SRP, deep inheritance hierarchies that should be composition. Use AI as a pattern smell detector in CI.
  • Pattern selection assistance. Describe your problem to an LLM and it can suggest applicable patterns with trade-offs. The staff-level skill is evaluating whether the suggested pattern is the simplest solution or over-engineered for the context.
  • Refactoring with AI. LLMs excel at mechanical refactoring: extracting a strategy pattern from a switch statement, converting inheritance to composition, or introducing a builder for a complex constructor. The human judgment is deciding WHEN to refactor, not how.
Q: What is the difference between the Outbox pattern and Event Sourcing?
A: The Outbox pattern reliably publishes events alongside database writes; the events are a side channel for integration. Event Sourcing makes events the source of truth: there is no separate database state, only the event log. Outbox adds a table; Event Sourcing replaces the table. Outbox is an integration pattern; Event Sourcing is a data modeling pattern.
Q: Circuit Breaker vs Retry: when do they conflict?
A: Retries without a circuit breaker hammer a failing service (3 retries x 1000 requests = 3000 requests to a dying service). The circuit breaker must wrap the retry logic: if the circuit is open, skip retries entirely and fail fast. The correct order, from outermost to innermost, is: Bulkhead, then Circuit Breaker, then Retry, then Timeout.
Q: Name a pattern that most teams adopt too early.
A: CQRS. Most teams do not have genuinely different read and write models. If your read and write paths both hit the same Postgres table and the same schema, CQRS adds a projection layer, an event bus, and eventual consistency for zero benefit. Adopt it when reads and writes have fundamentally different scaling needs or data shapes, not because a blog post said microservices need it.
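The "circuit breaker wraps the retry logic" answer can be sketched as below. This is a deliberately small illustration, not a production implementation: the class, its parameters, and the `flaky` function are all hypothetical, retries have no backoff, and the bulkhead and per-call timeout layers are omitted.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without trying."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0       # consecutive failed calls (each call includes its retries)
        self.opened_at = None   # time the circuit opened, or None when closed

    def call(self, func, retries=2):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open: failing fast, no retries")
            # Cool-down elapsed: half-open. Let a trial call through; one more
            # failed call re-opens the circuit immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        last_error = None
        for _ in range(retries + 1):   # the retry loop lives inside the breaker
            try:
                result = func()
            except Exception as exc:
                last_error = exc
                continue
            self.failures = 0
            return result
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
        raise last_error


attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    raise ConnectionError("downstream unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
for _ in range(4):
    try:
        breaker.call(flaky, retries=2)
    except (ConnectionError, CircuitOpenError) as exc:
        print(type(exc).__name__)
print("downstream calls:", attempts["count"])
```

In this run the first two calls each try three times and then trip the breaker; the last two are rejected without touching the downstream service, which is the behavior the answer above describes.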

13. SOLID Principles — One-Liner

| Principle | One-Liner | Code Smell It Prevents |
| --- | --- | --- |
| S: Single Responsibility | A class should have only one reason to change. | God classes that touch everything |
| O: Open/Closed | Open for extension, closed for modification. | Modifying existing code every time a new type appears |
| L: Liskov Substitution | Subtypes must be usable wherever their parent type is expected. | Subclasses that break parent behavior or throw unexpected errors |
| I: Interface Segregation | No client should be forced to depend on methods it does not use. | Fat interfaces where implementors stub out half the methods |
| D: Dependency Inversion | Depend on abstractions, not concretions. | Tightly coupled modules that cannot be tested or swapped |
S (Single Responsibility): Bad: A User class that handles authentication, database access, and email sending. Good: Separate UserAuth, UserRepository, and EmailService classes.
O (Open/Closed): Bad: A giant if/else chain that grows every time you add a payment method. Good: A PaymentProcessor interface with StripeProcessor, PayPalProcessor implementations.
L (Liskov Substitution): Bad: A Square that extends Rectangle but breaks when setWidth is called independently. Good: Use a common Shape interface instead of inheritance.
I (Interface Segregation): Bad: A Worker interface with work(), eat(), sleep(), even though robots do not eat. Good: Split into Workable, Eatable, Sleepable interfaces.
D (Dependency Inversion): Bad: OrderService creates new MySQLDatabase() directly. Good: OrderService accepts a Database interface via constructor injection.
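The Dependency Inversion example can be sketched in Python with a structural interface. `OrderService` and `Database` come from the example above; the `save` method, the `place_order` method, and `InMemoryDatabase` are assumptions added for illustration.

```python
from typing import Protocol


class Database(Protocol):
    """The abstraction OrderService depends on (method name is an assumption)."""

    def save(self, order_id: str, payload: dict) -> None: ...


class InMemoryDatabase:
    """A stand-in implementation, handy for tests; a MySQL-backed class would fit the same shape."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def save(self, order_id: str, payload: dict) -> None:
        self.rows[order_id] = payload


class OrderService:
    def __init__(self, db: Database) -> None:
        # The dependency is injected; OrderService never constructs a concrete database.
        self._db = db

    def place_order(self, order_id: str, items: list[str]) -> None:
        self._db.save(order_id, {"items": items})


db = InMemoryDatabase()
OrderService(db).place_order("order-1", ["book"])
print(db.rows)
```

Swapping the storage backend now means passing a different object to the constructor, with no change to `OrderService` itself.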
Deep dive: Design Patterns | Engineering Mindset
Senior: Can explain each principle with a code example and identify violations in a code review. Understands that SOLID is about managing change, not following rules dogmatically.
Staff: Knows when SOLID principles conflict with each other and makes judgment calls. Understands that the “I” in ISP can be taken too far (20 single-method interfaces are worse than one 5-method interface). Recognizes that DIP at every layer creates indirection that hurts readability. Uses SOLID as a diagnostic tool (“this class is hard to change because it violates SRP”) rather than a prescriptive rule (“all classes must have exactly one reason to change”). Teaches the team to think in terms of coupling and cohesion rather than memorized acronyms.
Q: Give a real-world Liskov Substitution violation you have actually seen.
A: A ReadOnlyRepository that extends Repository but throws UnsupportedOperationException on save() and delete(). Any code that accepts a Repository and calls save() will break. The fix: ReadOnlyRepository should be a separate interface, not a subclass. The LSP violation is that the subtype cannot be safely substituted where the parent is expected.
Q: When is the Open/Closed Principle actually harmful?
A: When you add extension points for variation that never comes. Building a PaymentProcessor interface with a plugin system when you only ever use Stripe adds indirection, makes debugging harder, and slows down onboarding, all for a hypothetical second payment provider. Add the abstraction when you need it, not before. YAGNI trumps OCP when there is no evidence of variation.
Q: Which SOLID principle does DI (dependency injection) implement?
A: Dependency Inversion (D): you depend on abstractions (interfaces), not concretions (implementations). But it also enables testability (substituting mocks), which is a practical benefit beyond the principle itself. The principle is the “why,” DI is the “how.”

14. Git Commands Engineers Actually Use

Beyond the Basics

| Command | What It Does |
| --- | --- |
| git log --oneline --graph --all | Visualize the entire branch topology in your terminal |
| git diff --staged | See exactly what will be committed (staged changes only) |
| git stash -u | Stash all changes including untracked files |
| git stash pop | Re-apply the most recent stash and remove it from the stash list |
| git cherry-pick <commit> | Apply a single commit from another branch onto current branch |
| git rebase -i HEAD~N | Interactively squash, reorder, or edit the last N commits |
| git bisect start / good / bad | Binary search through commits to find the one that introduced a bug |
| git reflog | View the full history of HEAD: your safety net for “I lost my work” |
| git reset --soft HEAD~1 | Undo last commit but keep changes staged |
| git blame -L 10,20 file.py | See who last modified lines 10–20 (great for understanding context) |
| git log -S "functionName" | Search commit history for when a string was added or removed |
| git shortlog -sn --no-merges | Leaderboard of contributors by commit count |
| git clean -fd | Remove all untracked files and directories (destructive) |
| git worktree add ../feature-branch feature | Check out a branch in a separate directory without switching |
| git commit --fixup <commit> | Mark a commit as a fixup for a previous commit (use with autosquash) |

Aliases Worth Setting Up

git config --global alias.co checkout
git config --global alias.br branch
git config --global alias.st status
git config --global alias.lg "log --oneline --graph --all --decorate"
git config --global alias.unstage "reset HEAD --"
git config --global alias.last "log -1 HEAD --stat"
git config --global alias.amend "commit --amend --no-edit"
Dangerous commands to use with caution: git reset --hard, git push --force, and git clean -fd are destructive and cannot be undone easily. Always prefer --force-with-lease over --force when pushing, as it prevents overwriting teammates’ work.
Deep dive: Capacity Planning, Git & Pipelines
  • AI-powered commit message generation. Tools like GitHub Copilot and conventional-commit plugins generate commit messages from diffs. Useful for routine changes but staff engineers write messages that explain “why,” not “what” — AI struggles with intent.
  • AI code review. GitHub Copilot code review, CodeRabbit, and similar tools can catch bugs, style issues, and security problems in PRs. The staff-level skill is configuring these tools to catch what matters (security, performance, breaking changes) and ignore what does not (style nitpicks the team has not agreed on).
  • AI-assisted git bisect. When git bisect finds the offending commit, AI can analyze the diff and explain why that change caused the regression — reducing debugging time from hours to minutes.
Q: git rebase vs git merge: when do you use each?
A: Rebase for a clean linear history on feature branches before merging to main. Merge for preserving the branch topology when multiple people worked on a branch or when you want an explicit merge commit for traceability. Golden rule: never rebase commits that have been pushed to a shared branch, because it rewrites history that others depend on.
Q: You accidentally committed a secret to git. It is already pushed. What do you do?
A: (1) Immediately rotate the secret; assume it is compromised the moment it hits the remote. (2) Use git filter-repo (the tool the Git project now recommends over git filter-branch) or BFG Repo-Cleaner to remove it from history. (3) Force-push the cleaned history. (4) Notify the team. Rotating the secret is step 1, not step 4: do not waste time cleaning history while the secret is live.
Q: What is git reflog and when is it your lifesaver?
A: Reflog records every change to HEAD, including operations that are not in git log (rebases, resets, amends). If you accidentally git reset --hard and lose commits, git reflog shows the commit hashes before the reset, and git checkout or git reset to that hash recovers your work. It is the safety net for destructive git operations.

15. Common Complexity Patterns

Know which data structure or algorithm gives you each time complexity. Pattern-matching this in interviews is a superpower.
| Complexity | Pattern | Typical Data Structure / Algorithm |
| --- | --- | --- |
| O(1) | Direct lookup | Hash map, array index access |
| O(log n) | Halve the search space each step | Binary search, balanced BST, skip list |
| O(n) | Touch every element once | Linear scan, single-pass counting |
| O(n log n) | Sort then process | Merge sort, heap sort, sort-based problems |
| O(n^2) | Nested comparison of all pairs | Brute-force pair matching, bubble sort |
| O(2^n) | All subsets / combinations | Backtracking, power set generation |
| O(n!) | All permutations | Permutation-based brute force (TSP) |
Interview shortcut: When you see “find a pair that sums to X” the brute force is O(n^2), but a hash map gives O(n). Recognizing that pattern is half the battle.
Deep dive: DSA Answer Framework
Senior: Can analyze the complexity of common algorithms and data structures. Recognizes patterns (two pointers, sliding window, BFS/DFS) and matches them to problems. Knows that O(n log n) is the lower bound for comparison-based sorting.
Staff: Thinks about complexity at the system level, not just the algorithm level. “This endpoint scans all N items in the database per request. At 1000 QPS, that is N x 1000 scans per second. We need an index or a pre-computed view.” Connects algorithmic complexity to operational cost: an O(n^2) batch job on 1M records runs in minutes; on 100M records (100x the data, 10,000x the work) it runs for days or weeks. Identifies the point where the current approach stops scaling and designs for it proactively.
Q: An interviewer says “optimize this O(n^2) solution.” What is your instinct?
A: Ask if sorting helps (O(n log n) + O(n) pass is common). Then consider: can a hash map give O(n) with O(n) space? Can two pointers or a sliding window reduce the nested loop? The pattern: trade space for time (hash map), or exploit ordering (sort first), or maintain a running state (sliding window).
Q: Why is amortized O(1) not the same as O(1)?
A: Amortized O(1) means “O(1) on average across many operations, but individual operations can be O(n).” Example: ArrayList.add() is O(1) amortized because most adds are O(1), but occasionally the array doubles in size (O(n) copy). In latency-sensitive systems, that occasional O(n) spike matters: it shows up as a P99 latency outlier.
Q: When does O(n) beat O(log n) in practice?
A: When n is small (linear scan of 100 elements beats binary search due to cache locality and branch prediction), or when the constant factor of O(log n) is large (a B-tree traversal with disk seeks per level vs a sequential scan in memory). Complexity notation hides constants, so always benchmark with real data before optimizing.

16. System Design Components — When to Reach for What

When the interviewer draws a blank architecture box, these are the building blocks you fill it with.
| Component | What It Does | Reach for It When… |
| --- | --- | --- |
| Load Balancer | Distributes traffic across servers | You have more than one instance of a service |
| CDN | Caches static assets at the edge | Users are geographically distributed; you serve images, JS, CSS, or video |
| Cache (Redis/Memcached) | Stores hot data in memory | Read-heavy workloads with tolerable staleness |
| Message Queue | Decouples producers from consumers | Async processing, traffic spikes, or unreliable downstream services |
| Database (SQL) | Structured, relational storage with ACID | Transactions, joins, or strong consistency requirements |
| Database (NoSQL) | Flexible schema, horizontal scale | High write throughput, variable data shapes, or key-value access patterns |
| Blob Storage (S3) | Stores large unstructured files | Images, videos, backups, logs: anything over ~1 MB |
| Search Index (ES) | Full-text search and aggregation | Users need fuzzy search, autocomplete, or faceted filtering |
| API Gateway | Single entry point for external traffic | Microservices needing auth, rate limiting, and routing in one place |
| Service Mesh | Handles inter-service networking (mTLS, retries) | Microservices at scale needing observability and zero-trust networking |
Interview move: When sketching architecture, name the component and justify it in one sentence. “I’ll add a CDN here because our users span three continents and static assets don’t change often.” That’s the level of reasoning interviewers look for.

Quick Reference Architecture

This is the “default” web application architecture you can draw in the first 2 minutes of any system design interview, then customize based on requirements.
Deep dive: Cloud & Problem Framing | System Design Practice
Knowing when to reach for a component is the core skill of system design interviews and real architecture work. The mistake is not choosing the wrong component — it is adding components you do not need. Every box on the whiteboard is operational cost: monitoring, alerting, scaling, debugging, on-call rotation. A load balancer you do not need is a failure point you did not need. A message queue for 10 requests/second is complexity that a synchronous call handles better. The interview one-liner: “Justify every box on the diagram. If you cannot explain why it is there in one sentence, remove it.”
  • AI-assisted architecture review. Describe your system design to an LLM and ask it to identify missing components, single points of failure, and over-engineering. LLMs are surprisingly good at catching “you have a cache but no invalidation strategy” or “you have three sequential service calls that could be parallelized.”
  • AI as an architecture search engine. “How does Uber handle ride matching at scale?” LLMs can synthesize architecture patterns from publicly available engineering blogs and conference talks, giving you starting points for specific system design problems.
  • New AI-specific components. Modern architectures increasingly include: vector databases for semantic search, embedding services for content understanding, inference endpoints for real-time ML, and feature stores for ML pipelines. These are becoming standard building blocks alongside caches and queues.
Q: When do you NOT need a message queue?
A: When the downstream service is highly available, the operation must complete before responding to the user (synchronous by nature), throughput is low enough that direct calls handle it, and you do not need to decouple teams or replay events. Adding a queue to a 10 QPS synchronous workflow adds latency, complexity, and a new failure mode for zero benefit.

Q: CDN vs edge computing: what is the difference?
A: A CDN caches static content at edge locations. Edge computing runs your code at edge locations. CDN serves a pre-built JavaScript bundle. Edge computing personalizes the HTML response for each user at the edge. Use CDN for static assets (table stakes), edge computing for dynamic personalization, A/B testing, or auth at the edge.

Q: API Gateway vs Load Balancer: why do you often need both?
A: Load Balancer distributes traffic across instances of the same service (L4/L7 routing). API Gateway handles cross-cutting concerns for external traffic: authentication, rate limiting, request transformation, API versioning. The gateway sits in front of the load balancer (or multiple load balancers for different services). Putting auth logic in a load balancer is an anti-pattern; putting request distribution logic in an API gateway is wasteful.

17. REST API Naming Conventions

A quick reference for designing clean, idiomatic REST endpoints. Get these right and reviewers stop nitpicking your API.
| Rule | Good | Bad | Why |
| --- | --- | --- | --- |
| Use nouns, not verbs | /users | /getUsers | HTTP methods already express the verb |
| Use plural nouns | /orders/42 | /order/42 | Consistent collection semantics |
| Use kebab-case | /user-profiles | /userProfiles | URL paths are case-sensitive; all-lowercase kebab-case avoids casing mismatches |
| Nest for relationships | /users/5/orders | /getUserOrders?id=5 | Expresses hierarchy in the URI |
| Use query params for filtering | /orders?status=shipped | /shipped-orders | Keeps the resource URI clean |
| Version in the URL or header | /v1/users | /users?version=1 | Clear, cacheable, hard to miss |
| Return 201 on POST create | 201 Created + Location header | 200 OK with body only | Signals resource creation; Location enables follow-up |
| Use PATCH for partial updates | PATCH /users/5 | PUT /users/5 with partial body | PUT implies full replacement |
Deep dive: APIs & Databases
Q: Should you use PUT or PATCH for an update? When does it matter?
A: PUT replaces the entire resource: you must send all fields, even unchanged ones. PATCH applies a partial update: send only the changed fields. Use PATCH for most updates (less bandwidth, no risk of accidentally nulling out omitted fields). Use PUT when the client truly intends to replace the entire resource and you want idempotent semantics.

Q: Nested resources: /users/5/orders vs /orders?user_id=5. When do you choose which?
A: Nested when the child makes no sense without the parent (orders belong to a user, comments belong to a post). Flat with query params when the child can exist independently or when you need to filter across parents (all orders with status=shipped regardless of user). Deep nesting beyond two levels (/users/5/orders/42/items/3) is a red flag; flatten it.

Q: URL versioning (/v1/users) vs header versioning. Which do you pick?
A: URL versioning for public APIs: it is explicit, cacheable, and impossible to miss. Header versioning for internal APIs where URL aesthetics matter and you control all clients. The industry has largely settled on URL versioning for public APIs. Stripe and Twilio both put a version in the URL path; GitHub's REST API is a notable exception that selects the version with a request header.
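The PUT vs PATCH distinction above is easy to see with plain dictionaries. The following is a minimal sketch, not a framework implementation: the resource fields and the two helper functions are hypothetical, and the PATCH shown is a simple merge-style update.

```python
from typing import Any

# Hypothetical stored resource; the field names are illustrative only.
stored_user = {"id": 5, "name": "Ada", "email": "ada@example.com", "plan": "pro"}


def apply_put(current: dict[str, Any], body: dict[str, Any]) -> dict[str, Any]:
    """PUT: the request body becomes the whole resource (only the ID is kept)."""
    return {"id": current["id"], **body}


def apply_patch(current: dict[str, Any], body: dict[str, Any]) -> dict[str, Any]:
    """PATCH (merge style): only the fields present in the body change."""
    return {**current, **body}


partial_body = {"email": "ada@new.example.com"}

after_put = apply_put(stored_user, partial_body)
after_patch = apply_patch(stored_user, partial_body)

print(after_put)    # name and plan are gone: the partial body replaced the resource
print(after_patch)  # name and plan survive; only email changed
```

Sending a partial body with PUT silently drops the omitted fields, which is the failure mode the table warns about.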

18. Consensus Algorithm Comparison

Consensus is how distributed systems get unreliable machines to agree on a value. The three algorithms you need to know — and the systems that use them.
| Aspect | Raft | Paxos | ZAB (ZooKeeper Atomic Broadcast) |
| --- | --- | --- | --- |
| Published | 2014 (Ongaro & Ousterhout) | 1989/1998 (Lamport) | 2011 (Junqueira, Reed, Serafini) |
| Design goal | Understandability | Correctness proof | ZooKeeper-specific total order broadcast |
| Leader | Strong leader required; all writes go through leader | No strict leader (leaderless or weak leader) | Designated leader; followers forward writes to leader |
| Log ordering | First-class log replication | Must be added on top (Multi-Paxos) | Total order broadcast of state changes |
| Election mechanism | Randomized timeout (150-300 ms), majority vote | Proposer with highest ballot number | Highest zxid (transaction ID) wins; TCP-based ordering |
| Membership changes | Joint consensus protocol defined in paper | Complex and underspecified | Dynamic reconfiguration (since ZK 3.5) |
| Correctness proof | Full TLA+ specification available | Safety proven; liveness depends on implementation | Formal proof in the ZAB paper |
| Implementation difficulty | High, but significantly more tractable than Paxos | Very high; many subtle edge cases | Moderate (purpose-built for ZooKeeper) |
| Used by | etcd, CockroachDB, TiKV, Consul, RethinkDB | Google Chubby, some older systems | Apache ZooKeeper, Kafka (pre-KRaft) |
| Read performance | Leader reads by default; follower reads possible with lease | Flexible (leaderless reads possible) | Reads from any follower (sequential consistency) |
| Typical cluster size | 3, 5, or 7 nodes | 3, 5, or 7 acceptors | 3, 5, or 7 nodes |
When do you choose which? You almost never implement consensus from scratch. You choose a system that uses consensus: etcd or Consul (Raft) for service discovery and config, ZooKeeper (ZAB) for coordination in legacy Kafka or Hadoop, CockroachDB (Raft) for distributed SQL. The algorithm matters for understanding failure modes and debugging, not for selection.
Deep dive: Distributed Systems Theory
Senior: Understands Raft at a conceptual level: leader election, log replication, majority quorum. Knows that etcd and CockroachDB use Raft. Can explain why you need an odd number of nodes.

Staff: Understands the operational implications of consensus. Knows that growing a Raft cluster from three to five nodes does not improve throughput (writes still go through one leader) but does improve fault tolerance (survive 2 failures instead of 1). Can reason about linearizable reads vs stale reads and the latency trade-off. Understands that consensus algorithms are for metadata and coordination, not bulk data: etcd should hold megabytes, not gigabytes. Has an opinion on the Raft vs Paxos debate and can explain why Raft “won” practically (understandability, complete specification) even though Paxos is more academically pure.
  • AI for distributed systems debugging. When a consensus cluster misbehaves (split-brain, leader flapping, stuck elections), AI tools can analyze logs from all nodes simultaneously and identify the timeline of events. This is a task that humans struggle with because it requires correlating logs across nodes with clock skew.
  • Formal verification with AI. TLA+ and similar formal methods are used to verify consensus implementations. AI is getting better at writing and checking TLA+ specs, which could make formal verification accessible to more teams — not just researchers at AWS and Google.
  • Managed consensus as a service. The trend is away from self-managed consensus toward managed services (AWS MemoryDB, DynamoDB, Aurora) that hide the consensus layer entirely. The staff engineer understands what is happening underneath but appreciates not having to operate it.
Q: Why does a 3-node Raft cluster tolerate only 1 failure, not 2?
A: Raft requires a majority (quorum) to commit writes. In a 3-node cluster, the majority is 2. If 2 nodes fail, the remaining 1 cannot form a majority. With 5 nodes, majority is 3, so you tolerate 2 failures. The formula: tolerate F failures requires 2F+1 nodes.

Q: Raft leader handles all writes. Is this a bottleneck?
A: Yes, Raft’s write throughput is bounded by the leader’s capacity and the replication round-trip time. For metadata stores (etcd, Consul) this is fine: you are writing kilobytes, not gigabytes. For databases (CockroachDB), the solution is range-based sharding with a separate Raft group per range, so different ranges have different leaders, distributing write load across nodes.

Q: Why did Kafka switch from ZooKeeper (ZAB) to KRaft (Raft)?
A: ZooKeeper was an external dependency that added operational complexity, limited Kafka’s scalability (metadata updates bottlenecked on ZK), and required a separate operational skillset. KRaft moves metadata management into Kafka itself using Raft, eliminating the external dependency, simplifying operations, and enabling faster controller failover. It also allows more partitions per cluster (millions vs hundreds of thousands).
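The quorum arithmetic in the first answer fits in a few lines. This is a small sketch of the majority rule only (the function names are made up for illustration), not of Raft itself:

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority of a cluster with the given number of voting nodes."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


def nodes_needed(failures: int) -> int:
    """The 2F+1 rule: cluster size required to survive F failures."""
    return 2 * failures + 1


for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that a 4-node cluster tolerates the same single failure as a 3-node cluster, which is why odd sizes are the norm: the extra even node adds replication cost without adding fault tolerance.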

19. CRDT Types Quick Reference

CRDTs (Conflict-free Replicated Data Types) allow replicas to be modified independently and merged automatically — no coordination, no conflicts.
| CRDT Type | What It Does | Merge Strategy | Limitations | Real-World Use |
| --- | --- | --- | --- | --- |
| G-Counter | Grow-only counter (increment only) | Element-wise max of per-node counters; value = sum | Cannot decrement | Distributed page view counts |
| PN-Counter | Counter with increment and decrement | Two G-Counters: P (positive) and N (negative); value = sum(P) - sum(N) | State grows with node count | Like/dislike counters, inventory adjustments |
| G-Set | Grow-only set (add only) | Union of sets | Cannot remove elements | Tag collections, participant lists |
| OR-Set | Observed-Remove Set (add and remove) | Each add tagged with unique ID; remove removes known tags; concurrent add wins | Metadata overhead from tombstones/tags | Shopping carts, collaborative editing |
| LWW-Register | Last-Writer-Wins single value | Higher timestamp wins | Clock skew can silently discard concurrent writes | User profile fields, configuration values |
| MV-Register | Multi-Value Register | Keeps all concurrent values; app resolves | Requires application-level conflict resolution | Systems needing explicit merge UI |
| RGA | Replicated Growable Array (ordered list) | Unique element IDs with causal ordering | Complex implementation; metadata overhead | Collaborative text editing |
State-based (CvRDTs): Nodes send their entire state to other nodes. The merge function (commutative, associative, idempotent) handles any message loss, duplication, or reordering. Simple but expensive for large data structures.

Operation-based (CmRDTs): Nodes send operations (like “add X” or “increment by 1”). Smaller messages, but the transport must deliver every operation to every replica, and must not deliver it twice unless the operation is idempotent (a duplicated “increment by 1” double-counts). If an operation is lost, replicas diverge permanently.

Rule of thumb: State-based for small data structures or unreliable networks. Operation-based for large data structures over reliable transports.
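As an illustration of the state-based approach, here is a minimal sketch of a G-Counter and a PN-Counter built from two of them, following the merge strategies in the table above. The class and method names are illustrative, not taken from any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one slot per node, merged with element-wise max."""
    node_id: str
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A node only ever writes to its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # max() is commutative, associative, and idempotent, so merges can
        # arrive late, out of order, or more than once.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


class PNCounter:
    """Increment/decrement counter: value = sum(P) - sum(N)."""

    def __init__(self, node_id: str) -> None:
        self.p = GCounter(node_id)
        self.n = GCounter(node_id)

    def increment(self, amount: int = 1) -> None:
        self.p.increment(amount)

    def decrement(self, amount: int = 1) -> None:
        self.n.increment(amount)

    def value(self) -> int:
        return self.p.value() - self.n.value()

    def merge(self, other: "PNCounter") -> None:
        self.p.merge(other.p)
        self.n.merge(other.n)


replica_a, replica_b = PNCounter("a"), PNCounter("b")
replica_a.increment(3)
replica_b.increment(2)
replica_b.decrement(1)

replica_a.merge(replica_b)
replica_b.merge(replica_a)
replica_a.merge(replica_b)  # a duplicate merge changes nothing

print(replica_a.value(), replica_b.value())  # both replicas converge to 4
```

Each replica ships its whole `counts` dictionary on every sync, which is the "simple but expensive" trade-off described above.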
Deep dive: Distributed Systems Theory — CRDTs
Senior: Knows what CRDTs are and can name a few types (G-Counter, OR-Set). Understands that they solve the “merge without coordination” problem.

Staff: Understands where CRDT-style merging shows up in production systems they actually interact with: DynamoDB Global Tables reconcile cross-region conflicts with last-writer-wins (the same trade-off as an LWW-Register). Redis CRDTs (Redis Enterprise) enable active-active geo-replication. Riak used CRDTs as a core feature. Can reason about CRDT limitations: metadata overhead grows with the number of replicas, tombstones in OR-Sets need garbage collection, and LWW-Register silently drops concurrent writes (which is a data loss pattern, not a conflict resolution pattern). Knows that most engineers should use systems that implement CRDTs rather than implementing CRDTs themselves.
  • Collaborative AI editing. AI-powered collaborative tools (Cursor, Replit) use CRDT-like data structures for real-time code collaboration. Understanding CRDTs helps you understand how your IDE handles concurrent edits from multiple developers and AI assistants simultaneously.
  • Offline-first AI applications. AI features in mobile apps (smart compose, local inference) generate state changes offline that must merge when reconnecting. CRDTs provide a principled merge strategy for these scenarios without the complexity of conflict resolution UIs.
Q: G-Counter vs PN-Counter: why not just use PN-Counter everywhere?
A: PN-Counter uses twice the space (two G-Counters). If you know the counter only grows (page views, event counts), G-Counter is simpler and more space-efficient. Use PN-Counter only when you genuinely need decrements (like/unlike, inventory adjustments).

Q: LWW-Register says “last writer wins.” Why is that dangerous?
A: “Last” is determined by timestamp. If two replicas have clock skew, the “last” writer might not be the one the user intended. A write at 12:00:01 on replica A (clock running 2 seconds fast) beats a write at 12:00:02 on replica B (accurate clock), even though B’s write was genuinely later. Clock skew silently discards the correct write. This is why CRDTs that do not depend on physical timestamps (OR-Set, G-Counter) are more robust.

Q: When would you reach for a CRDT instead of consensus?
A: When availability matters more than strong consistency and the data structure supports a natural merge. Counters, sets, and registers all have CRDT variants. Complex data with business invariants (account balance must not go negative) cannot be safely modeled as a CRDT because the merge function cannot enforce cross-replica invariants. For those, use consensus.
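The clock-skew scenario from the LWW-Register answer can be reproduced directly. This is a minimal sketch: timestamps are plain seconds after 12:00:00, and the `LWWRegister` class is illustrative rather than any library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    value: str
    timestamp: float  # wall-clock seconds as reported by the writing replica

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Higher timestamp wins. A real system also needs a deterministic
        # tie-breaker (for example, replica ID) for equal timestamps.
        return other if other.timestamp > self.timestamp else self


CLOCK_SKEW_A = 2.0  # replica A's clock runs 2 seconds fast

# Real time 12:00:01 on replica A, stamped as 12:00:03 because of the skew.
write_a = LWWRegister("written first, on A", timestamp=1.0 + CLOCK_SKEW_A)
# Real time 12:00:02 on replica B, whose clock is accurate.
write_b = LWWRegister("written later, on B", timestamp=2.0)

merged = write_a.merge(write_b)
print(merged.value)  # the earlier write from A survives; B's later write is dropped
```

Both replicas agree on the result, so the system converges, but it converges on the wrong value and nothing reports the loss.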

20. OS I/O Models Comparison

Understanding I/O models explains why different server architectures exist and what makes Nginx, Node.js, and Redis fast.
| I/O Model | How It Works | Scalability | Programming Complexity | Used By |
| --- | --- | --- | --- | --- |
| Blocking I/O | Thread calls read() and blocks until data arrives | Poor (~thousands of connections) | Lowest (sequential code) | Traditional Apache httpd, PHP-FPM |
| Non-blocking I/O | read() returns EAGAIN if no data; process must poll | Poor alone (busy-waiting) | Medium | Rarely used alone; combined with multiplexing |
| I/O Multiplexing (select/poll) | Monitor multiple fds; kernel scans all fds on each call | Moderate (O(n) per call, ~10K connections) | Medium | Older servers, legacy codebases |
| I/O Multiplexing (epoll/kqueue) | Kernel maintains interest set; returns only ready fds | Excellent (O(1) per event, ~100K+ connections) | Medium-High | Nginx, Node.js, Redis, HAProxy |
| Async I/O (io_uring) | Kernel performs I/O and notifies via shared ring buffers; zero syscalls in fast path | Highest (millions of ops/sec) | High | Next-gen storage engines, modern databases |
The key ratios: select() has a hard limit of ~1024 fds and is O(n). epoll is O(1) for events with no fd limit. This single difference is why Node.js can handle 100K connections on one thread. io_uring (Linux 5.1+) is the next leap — true async with no syscall overhead in the fast path.
Deep dive: Operating System Fundamentals
Senior: Understands why epoll is better than select for high-concurrency servers. Knows that Node.js and Nginx use event-driven I/O to handle many connections on few threads.

Staff: Can reason about I/O model choices at the architecture level. Understands that io_uring is the future of Linux I/O and can explain why it matters for database engines (reduces syscall overhead by 10-100x for storage-bound workloads). Knows that the thread-per-connection model (Apache httpd, traditional Java) is not “wrong”: it is simpler to program and debug, and for <10K connections, the performance difference is negligible. Makes I/O model choices based on actual connection counts and throughput requirements, not blog-post hype.
  • AI inference serving and I/O. LLM inference servers (vLLM, TensorRT-LLM) are fundamentally I/O-bound — they batch requests, manage GPU memory, and stream tokens. Understanding async I/O models explains why these servers use event-driven architectures and why batching requests improves GPU utilization.
  • AI-powered performance profiling. Tools like Intel VTune and Perf combined with AI analysis can identify I/O bottlenecks (excessive syscalls, page faults, context switches) in production workloads and suggest the right I/O model for your access pattern.
Q: Node.js is single-threaded but handles 100K connections. How?
A: Node.js uses epoll/kqueue (via libuv) to multiplex I/O events on a single thread. The event loop processes callbacks when I/O completes, never blocking. CPU-bound work blocks the event loop, which is why you offload it to worker threads. The key insight: I/O multiplexing works because most connections are idle most of the time (waiting for client input, waiting for DB response).

Q: io_uring vs epoll: when does the difference matter?
A: For network I/O (web servers, API servers), epoll is still excellent; the bottleneck is usually application logic, not syscall overhead. io_uring shines for storage I/O (database engines, file-processing pipelines), where the syscall overhead of read()/write() becomes the bottleneck at millions of ops/sec. RocksDB and some modern databases are adopting io_uring for this reason.

Q: Why does Redis use single-threaded I/O and still achieve millions of ops/sec?
A: Redis operations are in-memory and complete in microseconds. The bottleneck is network I/O, not CPU. A single thread with epoll can handle 100K+ connections because each operation is so fast that the thread is almost never blocked. Redis 6+ added I/O threading for the network layer (read/write to sockets) while keeping the command execution single-threaded to avoid lock contention.
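The "most connections are idle" insight can be shown with Python's standard `selectors` module, which picks epoll on Linux and kqueue on BSD/macOS. This is a small sketch, assuming local socket pairs as stand-ins for client connections; the `conn-N` labels are made up for the example.

```python
import selectors
import socket

selector = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS

# Three connected socket pairs stand in for three client connections.
pairs = [socket.socketpair() for _ in range(3)]
for index, (server_side, _client_side) in enumerate(pairs):
    server_side.setblocking(False)
    selector.register(server_side, selectors.EVENT_READ, data=f"conn-{index}")

# Only one "client" sends data; the other two stay idle.
pairs[1][1].sendall(b"hello")

# One call on one thread: the kernel reports only the sockets that are ready.
received = {}
for key, _events in selector.select(timeout=1.0):
    received[key.data] = key.fileobj.recv(1024)

print(received)  # only conn-1 shows up; idle connections cost nothing here

selector.close()
for server_side, client_side in pairs:
    server_side.close()
    client_side.close()
```

A real event loop wraps the `select()` call in a `while` loop and dispatches a callback per ready socket, which is roughly what libuv does for Node.js.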

21. Database Deep Dive Comparison

When interviewers ask “which database and why?” — this is the table you want in your head.
| Dimension | PostgreSQL | MongoDB | DynamoDB | Redis |
| --- | --- | --- | --- | --- |
| Model | Relational (tables, rows, SQL) | Document (JSON/BSON) | Wide-column / key-value (PK + SK) | In-memory data structures |
| Schema | Strict schema (schema-on-write) | Flexible (schema-on-read) | Schemaless (attribute-level) | Schemaless |
| Query language | SQL (full standard) | MQL + Aggregation Pipeline | PartiQL or API (GetItem, Query, Scan) | Commands (GET, SET, ZADD, etc.) |
| Consistency | Strong (ACID; isolation up to serializable, read committed by default) | Tunable (w:1 to w:majority, read concern local to linearizable) | Tunable (eventually consistent by default, strongly consistent reads optional) | Strong on single node; eventual across replicas |
| Scaling model | Vertical (read replicas for reads; app-level sharding for writes) | Horizontal (built-in sharding by shard key) | Horizontal (automatic partitioning by partition key) | Horizontal (Redis Cluster, 16384 hash slots) |
| Max practical size | ~10 TB single instance; larger with partitioning + Citus | Petabytes (with sharding) | Virtually unlimited (fully managed) | Limited by RAM (100s of GB per node) |
| Latency | 1-10 ms typical | 1-10 ms typical | 1-5 ms (single-digit ms typical) | <1 ms (sub-millisecond) |
| JOINs | Native, optimized | $lookup (slow, not recommended at scale) | None (design around access patterns) | None (application-level) |
| Replication | Streaming replication (WAL-based) | Replica sets (oplog-based) | Global Tables (multi-region, last-writer-wins reconciliation) | Sentinel (HA) or Cluster (sharding) |
| Best for | Transactions, complex queries, strong consistency, JOINs | Flexible schemas, hierarchical data, rapid iteration | Predictable latency at any scale, serverless, key-based access | Caching, sessions, leaderboards, rate limiting, pub/sub |
| Worst for | Horizontal write scaling without extensions | Complex joins, many-to-many relationships | Ad-hoc queries, analytics, complex relationships | Datasets larger than RAM, complex queries |
| Managed options | RDS, Aurora, Supabase, Neon | Atlas | DynamoDB (native) | ElastiCache, MemoryDB |
No database is best — only best for your access pattern. A senior engineer does not say “I like Postgres.” They say “Given our read-heavy workload with complex joins and ACID requirements, Postgres is the right fit — but we’ll use Redis in front of it for hot-path reads and DynamoDB for the event audit log where we need unlimited write throughput.”
Deep dive: Database Deep Dives
Q: PostgreSQL vs MySQL: when does the choice actually matter?
A: For most CRUD applications, either works fine. PostgreSQL wins when you need: JSON queries (jsonb), advanced indexing (GIN, GiST, BRIN), CTEs and window functions for analytics, or extensibility (PostGIS for geo, pgvector for embeddings). MySQL wins when you need: simpler replication setup, broader hosting compatibility, or your team already knows it deeply. The operational expertise factor outweighs the feature comparison 9 times out of 10.

Q: DynamoDB charges per read/write. When does this become painfully expensive?
A: When you do full-table scans (Scan operations read every item and charge for every 4KB read), when your access patterns do not match your key design (leading to hot partitions and throttling), or when you use on-demand pricing for steady-state workloads (on-demand is ~6.5x costlier per request than provisioned). The worst case: a team running analytics queries via DynamoDB Scan operations instead of exporting to S3 and querying with Athena.

Q: Redis persistence: AOF vs RDB. When do you choose each?
A: RDB (point-in-time snapshots): lower disk I/O, faster restarts, but you lose data since last snapshot. AOF (append-only file): logs every write, minimal data loss, but larger files and slower restarts. For caching (data is reconstructible), RDB or no persistence. For primary data store (sessions, rate limits), AOF with appendfsync everysec, which loses at most about 1 second of data. For zero data loss, use Redis with AOF appendfsync always (slower) or switch to a real database.

22. Real-Time Protocol Comparison

| Aspect | WebSocket | SSE (Server-Sent Events) | WebRTC | Long Polling |
| --- | --- | --- | --- | --- |
| Direction | Bidirectional (full-duplex) | Server to client only | Bidirectional (peer-to-peer) | Server to client (with request per update) |
| Protocol | TCP (upgraded from HTTP) | HTTP (text/event-stream) | UDP (SRTP/SCTP) | HTTP |
| Latency | Very low (~1-5 ms LAN) | Low (~5-50 ms) | Lowest (~10-50 ms P2P) | Medium (~50-500 ms per cycle) |
| Auto-reconnect | Manual (you implement it) | Built-in (EventSource API) | Manual (ICE restart) | Manual |
| Binary data | Yes (binary frames) | No (UTF-8 text only) | Yes (data channels) | Yes (HTTP body) |
| Frame overhead | 2-14 bytes after handshake | ~50 bytes per event | Varies (RTP headers) | Full HTTP headers per response (~200-800 bytes) |
| Scaling difficulty | Medium-Hard (stateful connections, pub/sub backbone needed) | Easy (stateless HTTP, works with CDNs) | Hard (STUN/TURN servers, SFUs at scale) | Easy (stateless HTTP) |
| Proxy/firewall | Sometimes blocked | Fully compatible (plain HTTP) | Often blocked (UDP) | Fully compatible |
| Server connections | ~50K-100K per server | Browser limit of 6 per domain (HTTP/1.1); multiplexed streams over one connection (HTTP/2) | Limited by CPU/bandwidth | Browser limit of 6 per domain (HTTP/1.1) |
| Best for | Chat, collaboration, gaming, live dashboards | Notifications, feeds, build logs, stock tickers | Voice, video, screen sharing, P2P file transfer | Legacy fallback, serverless, low-frequency updates |
Decision shortcut: Need server-push only? SSE (simplest). Need bidirectional? WebSocket. Need audio/video? WebRTC. Serverless or behind strict firewalls? Long Polling. Do not reach for WebSocket when SSE solves your problem.
Deep dive: Real-Time Systems
Senior: Knows WebSocket vs SSE vs WebRTC and can pick the right one. Can implement a basic WebSocket chat or SSE notification feed.

Staff: Designs real-time infrastructure at scale. Understands the stateful connection problem (WebSocket connections are stateful, which breaks stateless horizontal scaling, so you need a pub/sub backbone like Redis Pub/Sub or Kafka to fan out messages to the right connection server). Evaluates the operational cost of maintaining 100K+ persistent connections per server. Knows when to push back on real-time requirements: “Does this dashboard really need sub-second updates, or would polling every 30 seconds deliver the same user experience at 1/100th the infrastructure cost?”
  • Streaming LLM responses. Most AI chat APIs stream token-by-token responses over SSE. Understanding SSE is now a must-have skill for any engineer building AI-powered UIs. A common pattern: HTTP POST to /chat/completions, with a text/event-stream response made of data:-prefixed JSON chunks.
  • Real-time AI collaboration. Tools like Cursor and Copilot Chat use WebSocket for bidirectional communication — the user types, the AI responds, the user edits, the AI adapts. This is pushing WebSocket adoption beyond traditional chat apps into developer tooling.
  • WebRTC for AI voice. AI voice assistants (OpenAI Realtime API, Hume) use WebRTC for low-latency audio streaming. Understanding WebRTC’s STUN/TURN/ICE negotiation is becoming relevant for AI product engineers, not just video conferencing teams.
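The `data:`-prefixed chunk pattern from the streaming bullet above can be parsed in a few lines. This is a simplified sketch: it handles only `data:` fields and comment lines (not `event:`, `id:`, or `retry:`), and the `token` field name and the `[DONE]` sentinel are assumptions modeled on common chat-completion streams, not something the SSE format itself defines.

```python
import json
from typing import Iterator


def iter_sse_data(stream_text: str) -> Iterator[str]:
    """Yield the data payload of each event in a text/event-stream body."""
    data_lines: list[str] = []
    # SSE lines end with CRLF, LF, or CR; normalize before splitting.
    normalized = stream_text.replace("\r\n", "\n").replace("\r", "\n")
    for line in normalized.split("\n"):
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)


# Hypothetical stream body; real payload shapes vary by provider.
sample = 'data: {"token": "Hel"}\n\n: keep-alive\n\ndata: {"token": "lo"}\n\ndata: [DONE]\n\n'

tokens = [json.loads(p)["token"] for p in iter_sse_data(sample) if p != "[DONE]"]
print("".join(tokens))  # Hello
```

In a browser the EventSource API does this parsing for you; a hand-rolled parser like this is mainly needed when the stream comes back from a POST request, which EventSource cannot issue.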
Q: SSE has a 6-connection-per-domain limit in HTTP/1.1. How do you work around it?
A: Use HTTP/2 (it multiplexes streams over one connection, so the 6-connection cap no longer applies) or use a different subdomain for the SSE connection. In practice, most modern deployments use HTTP/2, making this a non-issue. If you are still on HTTP/1.1, a single SSE connection per tab is usually sufficient: multiplex events for different features over one connection.

Q: WebSocket connections survive a load balancer restart. True or false?
A: False. WebSocket connections are stateful TCP connections. If the load balancer restarts, drops, or rebalances, all WebSocket connections are severed. Your client must implement reconnection logic with exponential backoff. This is also why blue-green deployments with WebSocket services are tricky: you need to gracefully drain connections from the old version.

Q: When is long polling still the right choice in 2025?
A: Behind restrictive corporate firewalls that block WebSocket upgrades, in serverless environments where persistent connections are expensive (Lambda has a 15-minute max), or as a fallback transport when WebSocket/SSE fails. Socket.IO famously starts on HTTP long polling and upgrades to WebSocket only when the upgrade succeeds.
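The "reconnection logic with exponential backoff" mentioned above usually also adds jitter, so that thousands of clients dropped by the same restart do not all reconnect at the same instant. A minimal sketch of the delay schedule follows; the base, cap, and attempt count are illustrative defaults, not recommendations from the text.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    base: float = 0.5,
    cap: float = 30.0,
    attempts: int = 6,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one sleep duration (in seconds) per reconnect attempt."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The ceiling doubles each attempt, up to the cap: 0.5, 1, 2, 4, ...
        ceiling = min(cap, base * 2 ** attempt)
        # "Full jitter": pick uniformly between zero and the ceiling.
        yield rng.uniform(0, ceiling)


for attempt, delay in enumerate(backoff_delays(rng=random.Random(42)), start=1):
    print(f"attempt {attempt}: wait {delay:.2f}s before reconnecting")
```

A client would sleep for each yielded delay between connection attempts and restart the schedule after a successful reconnect.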

23. GraphQL vs REST Decision Matrix

| Dimension | REST | GraphQL |
| --- | --- | --- |
| Data fetching | Server decides response shape; often over-fetches or under-fetches | Client specifies exact fields; no over-fetching |
| Endpoints | One per resource (/users, /orders/42) | Single endpoint (/graphql) |
| Versioning | URL versioning (/v1/users) or header | Schema evolution with @deprecated directives; no versions needed |
| Caching | Excellent (HTTP caching, CDN-friendly, ETag, Cache-Control) | Hard (requests are typically POSTs to one endpoint; need persisted queries or APQ) |
| Error handling | HTTP status codes (400, 404, 500) | Typically returns 200 even on failure; errors in response body { errors: [...] } |
| Tooling | Excellent (Postman, curl, any HTTP client) | Good (GraphiQL, Apollo Studio, Insomnia) |
| N+1 problem | Server controls query; can optimize with JOINs | Field-level resolvers cause N+1 by default; requires DataLoader |
| Type safety | Optional (OpenAPI/Swagger) | Built-in (SDL schema is the contract) |
| Real-time | Polling or SSE (bolted on) | Subscriptions (defined in the spec; usually carried over WebSocket) |
| File uploads | Native (multipart/form-data) | Not in spec (requires multipart extension or pre-signed URL workaround) |
| Rate limiting | Simple (requests per second) | Complex (each query has different cost; need calculated query cost) |
| Best for | Public APIs, simple CRUD, server-to-server, teams wanting HTTP caching | Mobile/frontend with varied data needs, multiple client types, graph-shaped data |
The honest answer: If you have a simple CRUD API with one client type and want aggressive caching, REST wins. If you have mobile + web + watch clients all needing different subsets of the same data, GraphQL wins. If you are building server-to-server internal APIs, neither — use gRPC.
Deep dive: GraphQL at Scale | APIs & Databases
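The N+1 row in the matrix above is easier to explain with a query count in hand. The sketch below uses an in-memory stand-in for a database and a log of "queries"; all names and data are hypothetical. It shows the effect a DataLoader achieves (one batched lookup instead of one per item), not the DataLoader library itself, which collects keys requested within one tick and dispatches them together.

```python
query_log: list[str] = []

# Hypothetical data standing in for two database tables.
AUTHORS = {1: "Ada", 2: "Grace", 3: "Linus"}
POSTS = [
    {"id": 10, "author_id": 1},
    {"id": 11, "author_id": 2},
    {"id": 12, "author_id": 1},
]


def fetch_author(author_id: int) -> str:
    query_log.append(f"SELECT name FROM authors WHERE id = {author_id}")
    return AUTHORS[author_id]


def fetch_authors(author_ids: set[int]) -> dict[int, str]:
    query_log.append(f"SELECT id, name FROM authors WHERE id IN {sorted(author_ids)}")
    return {author_id: AUTHORS[author_id] for author_id in author_ids}


def resolve_naive(posts: list[dict]) -> list[str]:
    # One lookup per post: the "N" in N+1 (the "+1" fetched the posts).
    return [fetch_author(post["author_id"]) for post in posts]


def resolve_batched(posts: list[dict]) -> list[str]:
    # Collect the keys first, then issue a single batched lookup.
    by_id = fetch_authors({post["author_id"] for post in posts})
    return [by_id[post["author_id"]] for post in posts]


naive_names = resolve_naive(POSTS)
naive_queries = len(query_log)

query_log.clear()
batched_names = resolve_batched(POSTS)
batched_queries = len(query_log)

print(naive_queries, batched_queries)  # 3 author queries vs 1
```

With three posts the difference is small; with a page of 100 items and two nested fields, the naive resolver issues hundreds of queries per request.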

24. Service Mesh Comparison

| Aspect | Istio | Linkerd | Consul Connect |
| --- | --- | --- | --- |
| Data plane proxy | Envoy (C++) | linkerd2-proxy (Rust, purpose-built) | Envoy (C++) |
| Memory per sidecar | ~40 MB | ~10-20 MB | ~40 MB (Envoy) |
| Latency per hop | ~1-3 ms | ~1-2 ms | ~1-3 ms |
| Configuration complexity | High (many CRDs: VirtualService, DestinationRule, Gateway, etc.) | Low (opinionated defaults, fewer knobs) | Medium (integrates with Consul service catalog) |
| mTLS | Automatic with cert rotation (default 24h) | Automatic with cert rotation | Automatic via Consul CA or Vault |
| Traffic splitting | Yes (weight-based, header-based, fault injection) | Yes (basic traffic splitting) | Yes (via service-resolver and service-splitter) |
| Circuit breaking | Yes (outlier detection in Envoy) | Yes (built into proxy) | Yes (via Envoy) |
| Multi-cluster | Supported (complex setup) | Supported (simpler model) | Native (designed for multi-DC/multi-cloud) |
| Non-Kubernetes support | Limited (K8s-first) | K8s only | Yes (VMs, bare metal, Nomad, K8s) |
| Extensibility | Very high (Wasm/Lua filters in Envoy) | Limited (intentionally) | Moderate (Envoy filters + Consul intentions) |
| Learning curve | Weeks to months | Days to weeks | Days to weeks |
| Best for | Large orgs with platform teams needing advanced traffic management | Small-medium teams wanting mTLS + observability with minimal ops | Hybrid environments spanning K8s, VMs, and multi-cloud |
Decision shortcut: Small team, just want mTLS and observability? Linkerd. Large platform team needing fine-grained traffic control and extensibility? Istio. Mixed infrastructure (K8s + VMs + multi-cloud)? Consul Connect.
Deep dive: API Gateways & Service Mesh

25. Distributed Systems Numbers

Numbers every engineer should have at their fingertips for distributed systems design and interviews.

Consensus & Coordination

| Metric | Typical Value | Context |
| --- | --- | --- |
| Raft election timeout | 150-300 ms (randomized) | Prevents split votes; must be >> heartbeat interval |
| Raft heartbeat interval | 50-150 ms | Leader sends to suppress elections |
| Raft leader election time | ~200-500 ms | From leader failure to new leader |
| ZooKeeper session timeout | 6-30 s (default varies) | Client must heartbeat within this window; too low = false expiry |
| ZooKeeper write latency | 2-10 ms | Writes go through leader; commit requires majority |
| etcd write latency | 2-10 ms | Similar to ZooKeeper (Raft-based) |
| etcd recommended max DB size | 8 GB (suggested maximum; default quota is 2 GB) | Can be raised, but etcd is for metadata, not bulk data |

Replication & Messaging

| Metric | Typical Value | Context |
| --- | --- | --- |
| Kafka replication lag (in-rack) | <10 ms | Between leader and ISR followers |
| Kafka end-to-end latency | 5-50 ms | Producer to consumer, same datacenter |
| Kafka single-broker throughput | 800 MB/s+ | With zero-copy (sendfile()), sequential I/O, batching |
| PostgreSQL streaming replication lag | <1 ms (same AZ) to seconds (cross-region) | Depends on network and write volume |
| MongoDB replica set replication lag | 0-100 ms typical | Oplog-based; can spike under heavy write load |
| DynamoDB Global Tables replication | Typically <1 second cross-region | Last-writer-wins conflict resolution |
| Redis replication lag | Sub-millisecond (same DC) | Async replication; can lose data on failover |

Timeouts & Failure Detection

| Metric | Typical Value | Context |
| --- | --- | --- |
| TCP connect timeout | 1-5 s | Time to establish connection; the Linux kernel default (roughly two minutes of SYN retries) is too high for production |
| HTTP request timeout (internal) | 1-10 s | Service-to-service calls; longer = more resource holding |
| Health check interval | 5-30 s | Kubernetes liveness/readiness probes |
| Circuit breaker trip threshold | 5-10 consecutive failures | Open circuit, stop calling failing service |
| NTP clock skew (same DC) | 1-10 ms | NTP precision; not good enough for causal ordering |
| NTP clock skew (cross-DC) | 10-200 ms | Why logical clocks exist |
| Google TrueTime uncertainty | 1-7 ms | With atomic clocks + GPS; Spanner’s commit-wait |
Deep dive: Distributed Systems Theory | Messaging, Concurrency & State
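The "circuit breaker trip threshold" row above describes a small state machine. Here is a minimal sketch of one, assuming a threshold of 5 consecutive failures (the low end of the range in the table) and a 30-second reset timeout, which is an illustrative value not taken from the table. The class is hypothetical, not a specific library's API.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,  # assumed value, tune per service
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: reject until the timeout passes, then let a trial call through
        # (the "half-open" state).
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open and restart the timer


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])

for _ in range(5):
    breaker.record_failure()
print(breaker.allow_request())  # False: circuit is open

now[0] = 31.0
print(breaker.allow_request())  # True: trial call allowed after the timeout

breaker.record_success()
print(breaker.allow_request())  # True: circuit closed again
```

Callers check `allow_request()` before each downstream call and report the outcome afterwards; a failed trial call reopens the circuit and restarts the timer.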

26. Cloud Service Limits Cheat Sheet (AWS)

The numbers that matter when you’re sketching architecture on a whiteboard.

Lambda

| Limit | Value | Notes |
| --- | --- | --- |
| Max execution time | 15 minutes | Use Step Functions for longer workflows |
| Memory | 128 MB - 10,240 MB | CPU scales proportionally with memory |
| Deployment package (zip) | 50 MB zipped / 250 MB unzipped | Use container images for larger deps |
| Container image size | 10 GB | For ML models, large native binaries |
| Concurrent executions (default) | 1,000 per region | Raise via support ticket (commonly 10K-100K) |
| Concurrency scaling rate | +1,000 concurrent executions every 10 seconds per function (since late 2023; replaced the older 500-3,000 regional burst) | Cannot be increased |
| Ephemeral storage (/tmp) | 512 MB - 10,240 MB | Configurable since 2022 |
| Environment variables | 4 KB total | Use SSM Parameter Store for larger configs |
| Payload (sync invocation) | 6 MB request / 6 MB response | Use S3 for larger payloads |
| Payload (async invocation) | 256 KB | Events larger than this must reference S3 |
| Cold start (Go/Rust) | 50-100 ms | Compiled, minimal runtime |
| Cold start (Python/Node) | 100-300 ms | Interpreter startup + imports |
| Cold start (Java) | 3-10 s | JVM + class loading; use SnapStart |

S3

| Limit | Value | Notes |
| --- | --- | --- |
| Object size (single PUT) | 5 GB max | Use multipart upload for files >100 MB |
| Object size (multipart) | 5 TB max | Up to 10,000 parts |
| Bucket count per account | 10,000 by default since late 2024 (previously 100) | Soft limit; raise via Service Quotas |
| Request rate per prefix | 5,500 GET/s + 3,500 PUT/s | Distribute across prefixes to scale beyond this |
| Consistency model | Strong read-after-write (since Dec 2020) | No more eventual consistency surprises |
| Storage classes | 6 tiers (Standard to Deep Archive) | Use lifecycle policies to auto-transition |
| Minimum object charge | 128 KB for IA/Glacier classes | Small objects in IA cost more than Standard |
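The multipart limits in the table imply a minimum part size once objects get large. A minimal sketch of that arithmetic, assuming the 10,000-part and 5 TB limits above plus S3's 5 MiB minimum part size; `min_part_size` is a hypothetical helper, not an SDK function.

```python
import math

MAX_PARTS = 10_000          # from the table above
MIN_PART = 5 * 1024**2      # 5 MiB: minimum size for every part except the last
MAX_PART = 5 * 1024**3      # 5 GiB: same ceiling as a single PUT
MAX_OBJECT = 5 * 1024**4    # 5 TiB


def min_part_size(object_bytes: int) -> int:
    """Smallest part size (bytes) that fits the object into MAX_PARTS parts."""
    if not 0 < object_bytes <= MAX_OBJECT:
        raise ValueError("object size outside S3 multipart limits")
    return max(MIN_PART, math.ceil(object_bytes / MAX_PARTS))


# A 100 MiB object is bounded by the 5 MiB floor, not by the part count.
print(min_part_size(100 * 1024**2) // 1024**2, "MiB")
# A maximum-size object needs parts of roughly half a GiB each.
print(round(min_part_size(MAX_OBJECT) / 1024**2), "MiB")
```

In practice most SDKs pick the part size for you; the calculation matters when you configure it by hand for very large uploads.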

DynamoDB

| Limit | Value | Notes |
| --- | --- | --- |
| Item size | 400 KB max | Compress or offload large attributes to S3 |
| Partition key value throughput | 3,000 RCU + 1,000 WCU per partition | Hot partitions get throttled even if table capacity is not exhausted |
| GSIs per table | 20 (default quota, can be raised) | Plan access patterns carefully |
| LSIs per table | 5 (must be defined at table creation) | Cannot be added later |
| Partition + all LSIs per PK value | 10 GB max | Reason most practitioners avoid LSIs |
| Query/Scan result set | 1 MB per call (paginate for more) | Use LastEvaluatedKey for pagination |
| Batch operations | 25 items per BatchWriteItem; 100 items per BatchGetItem | Up to 16 MB per request; items can be up to 400 KB each |
| Transaction limit | 100 items per TransactWriteItems | 4 MB total request size |
| On-demand pricing | ~6.5x costlier per request than provisioned at steady state | Use for spiky/unknown traffic |
| Global Tables replication | Typically <1 second | CRDT-based; last-writer-wins per attribute |
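The BatchWriteItem limit above means bulk writes have to be split client-side. A minimal sketch of that chunking, with no AWS calls; `chunked` and the sample items are hypothetical.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")
BATCH_WRITE_MAX = 25  # items per BatchWriteItem call, from the table above


def chunked(items: Sequence[T], size: int = BATCH_WRITE_MAX) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


items = [{"pk": f"user#{i}"} for i in range(60)]
batches = list(chunked(items))
print([len(b) for b in batches])  # [25, 25, 10]
```

A real caller would also resend anything the service returns in `UnprocessedItems`, since a batch can be partially accepted.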

RDS / Aurora

| Limit | Value | Notes |
| --- | --- | --- |
| RDS max storage | 64 TB (PostgreSQL, MySQL) | Aurora auto-grows to 128 TB |
| RDS max connections (default) | ~5000 (depends on instance RAM) | Use PgBouncer or RDS Proxy |
| Aurora read replicas | Up to 15 | Replication lag typically <20 ms |
| RDS read replicas | Up to 15 (MySQL, MariaDB, PostgreSQL); 5 (Oracle, SQL Server) | Lag can be seconds under heavy write load |
| RDS Multi-AZ failover time | 60-120 seconds | Aurora: <30 seconds |
| Aurora Serverless v2 min ACU | 0.5 ACU (0 ACU with auto-pause on newer engine versions) | Resuming from a pause adds a delay |
| RDS Proxy connections | Up to 1,000 per proxy endpoint | Multiplexes many app connections to fewer DB connections |
| Automated backup retention | Up to 35 days | Point-in-time recovery within retention window |
| Max IOPS (gp3) | 16,000 | Provisioned IOPS (io2) goes up to 256,000 |
Limits change. AWS updates limits regularly. These numbers are accurate as of early 2025 but always verify against the AWS Service Quotas console for your specific region and account before making architecture decisions.
Deep dive: Cloud Service Patterns | Cloud & Problem Framing

27. API Gateway Comparison

| Gateway | Deployment | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- | --- |
| Kong | Self-hosted or Kong Cloud | Rich plugin ecosystem, Lua extensibility, PostgreSQL/Cassandra backing | Complex to operate at scale, Lua is niche | Teams wanting extensibility with a large plugin marketplace |
| AWS API Gateway | Fully managed | Zero ops, native AWS integration (Lambda, IAM), WebSocket support | 29-second default integration timeout, cold starts on Lambda backends, limited customization, vendor lock-in | AWS-native serverless architectures |
| Envoy | Self-hosted (often K8s) | Extreme performance, L7 protocol awareness, xDS dynamic config | Steep learning curve, YAML-heavy, not turnkey API management | K8s teams, performance-critical paths, service mesh data plane |
| Traefik | Self-hosted | Auto-discovery (Docker/K8s labels), Let’s Encrypt integration | Smaller plugin ecosystem, less enterprise tooling | Docker/K8s environments wanting automatic service discovery |
| APISIX | Self-hosted | High performance (Nginx/OpenResty), etcd-backed, CNCF ecosystem | Smaller community, less enterprise tooling | High-performance needs, CNCF-aligned teams |
| Nginx | Self-hosted | Battle-tested, extremely stable, massive community, low resources | Manual configuration, limited API management out of the box | Simple routing, teams already running Nginx |
The golden rule: A gateway handles infrastructure concerns — routing, auth, rate limiting, TLS termination. The moment you put business logic in the gateway (pricing rules, inventory checks), you are building a distributed monolith.
Deep dive: API Gateways & Service Mesh

Quick-Find Index

| Topic | Section |
| --- | --- |
| API Gateway comparison | 27 |
| API styles (REST, gRPC, etc.) | 4 |
| Authentication methods | 6 |
| Availability (“nines” table) | 10 |
| Back-of-envelope estimation | 11 |
| Caching strategies | 3 |
| Cloud service limits (Lambda, S3, DynamoDB, RDS) | 26 |
| Complexity patterns (Big O) | 15 |
| Consensus algorithms (Raft, Paxos, ZAB) | 18 |
| Container orchestration (K8s) | 8 |
| CRDT types | 19 |
| Database deep dive (Postgres, Mongo, Dynamo, Redis) | 21 |
| Database selection | 2 |
| Deployment strategies | 5 |
| Design patterns | 12 |
| Distributed systems numbers | 25 |
| Git commands | 14 |
| GraphQL vs REST | 23 |
| HTTP status codes | 9 |
| Interview Day Quick Review | Top |
| Latency numbers | 1 |
| Message queues | 7 |
| OS I/O models | 20 |
| Real-time protocols (WebSocket, SSE, WebRTC) | 22 |
| REST naming conventions | 17 |
| Service mesh comparison (Istio, Linkerd, Consul) | 24 |
| SOLID principles | 13 |
| System design components | 16 |

Interview Deep-Dive Questions

These questions are what separate a “read the cheatsheet” candidate from someone who has actually built and operated systems. Every question below has been asked in real senior/staff-level interviews. Practice answering them out loud before you walk into the room.

Q1. You are designing a system where latency matters. Walk me through how you would reason about where time is spent in a typical request from a user’s browser to a database and back.

The way I think about this is as a latency budget: you have a total time the user will tolerate (say 200ms for a web request), and every hop in the path eats into that budget. Let me walk through the chain.

DNS resolution is the first cost. If uncached, that is roughly 50ms. In practice, most browsers and OSes cache DNS aggressively, so this is usually zero for repeat requests, but it absolutely bites you on the first request or after a TTL expires.

TCP handshake is 1.5 round trips. Within the same datacenter that is under 1ms, but cross-continent that is roughly 225ms for the handshake alone (150ms RTT times 1.5). If TLS is involved, add another 1-2 round trips, which is another 150-300ms cross-continent. This is exactly why we use connection pooling, keep-alive connections, and TLS session resumption.

Network traversal through the load balancer adds 0.1-1ms typically. The load balancer itself is rarely the bottleneck, but if it is doing TLS termination, that adds CPU time.

Application processing is where your code runs. If you are hitting a cache (Redis), that is sub-millisecond. A simple database query to PostgreSQL is 1-10ms. But if your code makes three sequential service calls at 5ms each, you have already spent 15ms just on internal networking.

The database is where most time hides. An indexed query on Postgres is 1-5ms. A missing index on a table with millions of rows can blow that to 500ms or more. A sequential scan where an index lookup was expected is the most common performance regression I have seen in production.

The return path adds symmetric network cost. So the total for a cache-hit path within a datacenter might be: 0ms DNS + 0.5ms TCP (reused) + 0.1ms LB + 2ms app + 0.5ms Redis = roughly 3ms. For a cache-miss path with a cross-continent user on a new connection: 50ms DNS + 225ms TCP handshake + 0.1ms LB + 5ms app + 10ms DB + 150ms return = roughly 440ms, and a full TLS handshake adds another 150-300ms on top of that.

The key insight is that network latency dominates everything when users are far from your servers. That is why CDNs and edge computing exist. Inside the datacenter, the database is almost always the bottleneck. The ratio that matters: RAM access is 100ns, SSD random read is 150 microseconds (1,500x slower), cross-continent network is 150ms (1.5 million times slower than RAM). You architect differently depending on which part of the budget is your constraint.
  • Candidate only mentions “network” and “database” without quantifying anything
  • No awareness that TLS handshake can dominate the budget for cross-region requests
  • Talks about optimizing application code but ignores that network latency is usually the real bottleneck for geo-distributed users
  • Cannot explain why connection pooling matters (hint: it amortizes handshake cost)
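The latency budget from the answer above can be written down as plain arithmetic. A minimal sketch, with per-hop numbers copied from that walkthrough and a 200ms budget; the dictionary names and `summarize` are hypothetical.

```python
# Per-hop costs in milliseconds, copied from the walkthrough above.
CACHE_HIT_SAME_DC = {
    "dns": 0.0, "tcp_reused": 0.5, "load_balancer": 0.1, "app": 2.0, "redis": 0.5,
}
CACHE_MISS_CROSS_CONTINENT = {
    "dns": 50.0, "tcp_handshake": 225.0, "load_balancer": 0.1,
    "app": 5.0, "db": 10.0, "return_path": 150.0,
}


def summarize(hops: dict, budget_ms: float = 200.0):
    """Return (total latency, most expensive hop, whether the budget is met)."""
    total = sum(hops.values())
    worst = max(hops, key=hops.get)
    return total, worst, total <= budget_ms


for name, hops in [("cache hit", CACHE_HIT_SAME_DC),
                   ("cache miss", CACHE_MISS_CROSS_CONTINENT)]:
    total, worst, ok = summarize(hops)
    print(f"{name}: {total:.1f} ms, dominated by {worst}, within budget: {ok}")
```

Naming the dominant hop is the useful part: it tells you whether to reach for a CDN, connection reuse, or an index.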

Follow-up: Your P50 latency is 20ms but P99 is 800ms. What do you investigate first?

A 40x gap between P50 and P99 tells me something is bimodal: most requests take a fast path, but some requests hit a qualitatively different slow path. My investigation order:

1. Is it garbage collection? JVM-based services or languages with stop-the-world GC can cause periodic latency spikes. I would check GC logs and correlate pauses with the P99 spikes. If GC pauses match the spike pattern, you are looking at tuning heap size, switching GC algorithms (G1 to ZGC), or reducing allocation rate.
2. Is it a cache miss pattern? If 95% of requests hit Redis (sub-ms) but 5% miss and go to Postgres (10-50ms) or even worse trigger a cold computation, that perfectly explains a bimodal distribution. I would check cache hit ratios and see if the P99 correlates with cache misses.
3. Is it a specific query or endpoint? Not all endpoints are equal. One slow endpoint at 5% of traffic can dominate P99. I would segment latency by endpoint and by query. Often it is a single missing index or a query that does a sequential scan on a specific access pattern.
4. Is it connection pool exhaustion? If the pool is too small, requests queue for a connection. Under load, the queue time dominates. The telltale sign: latency is fine until QPS crosses a threshold, then P99 climbs sharply.
5. Is it contention (locks, mutex)? Database row-level locks, application-level mutexes, or even filesystem locks can create tail latency. One request holding a lock blocks N others.
6. Is it infrastructure jitter? Noisy neighbors on shared hardware, thermal throttling, or cross-AZ calls that sometimes route through a congested path. This is harder to diagnose and often requires correlating with infrastructure metrics.

The tooling I would reach for: distributed tracing (Jaeger/Zipkin) to see where the 800ms is actually spent, and I would look at the latency histogram, not just P50/P99. If the histogram is bimodal with two distinct peaks, that confirms two distinct code paths. If it is a long tail, it is more likely contention or resource exhaustion.
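To see how a small slow path produces the P50/P99 gap described above, here is a minimal simulation. The 95%/5% split and the 20ms and 800ms centers are illustrative assumptions, not measurements.

```python
import random
import statistics

random.seed(7)

# Hypothetical workload: 95% of requests take a fast path near 20 ms,
# 5% take a slow path near 800 ms.
fast = [random.gauss(20, 3) for _ in range(9_500)]
slow = [random.gauss(800, 60) for _ in range(500)]
latencies = fast + slow

cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]
mean = statistics.fmean(latencies)

print(f"P50={p50:.0f} ms  P99={p99:.0f} ms  mean={mean:.0f} ms")
```

The median sits in the fast mode and P99 sits in the slow mode, while the mean lands in a range almost no request actually experiences. That is why the histogram is more informative than any single number.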

Follow-up: How do CDNs and edge computing change this latency picture?

CDNs fundamentally change the game by moving the data closer to the user, which eliminates the cross-continent round trip that dominates latency for geo-distributed users.

Static content (images, JS, CSS) is the easy win. A CDN like CloudFront caches these at edge locations. Instead of 150ms RTT to your origin, the user gets a 5-20ms RTT to the nearest edge node. This is table stakes: every production system should do this.

Dynamic content at the edge is where it gets interesting. Services like Cloudflare Workers or AWS Lambda@Edge let you run logic at the edge, which means you can handle personalization, A/B testing, or auth token validation without going back to origin. The latency win is massive: you go from 300-500ms (cross-continent round trip plus processing) to 20-50ms.

The trade-off is cache invalidation complexity and data consistency. Static assets are easy because they rarely change. But if you are caching API responses at the edge, you need to think about TTLs carefully. Serving stale data from the edge is faster but might be unacceptable for some use cases (financial data, inventory counts). The pattern I have seen work well is: cache aggressively at the edge with short TTLs (5-30 seconds) for data that can tolerate eventual consistency, and go to origin for anything requiring strong consistency.

The gotcha that most people miss: CDNs do not help with authenticated or user-specific requests unless you set up cache keys that include auth tokens or user segments. If every request is unique (personalized feed, user dashboard), the CDN becomes a pass-through proxy and you only save the TLS termination time.

Q2. You need to choose a database for a new service. Walk me through your decision framework — not which database to pick, but how you decide.

My framework has five dimensions, and I evaluate them roughly in this order because each one eliminates options before you get to the next.

1. What are the access patterns? This is the single most important question. Am I doing key-value lookups by a known ID? Complex joins across multiple entities? Full-text search? Time-series aggregation? Graph traversals? The access pattern narrows the field immediately. If I need joins and transactions, I am in relational territory. If I need key-value at scale with predictable latency, I am looking at DynamoDB. If I need search, I need an inverted index (Elasticsearch). Trying to force the wrong access pattern onto a database is the number one architectural mistake I have seen.
2. What are the consistency requirements? Do I need ACID transactions? Can I tolerate eventual consistency? Is “read your own writes” sufficient? Financial systems and inventory management demand strong consistency, which points to PostgreSQL or CockroachDB. An activity feed or analytics pipeline can tolerate eventual consistency, which opens up Cassandra, DynamoDB, or MongoDB with weaker read concerns. The critical insight is that consistency is not binary. Most systems have some data that needs strong consistency (user balances, order state) and other data that does not (view counts, recommendations). You often need more than one database.
3. What is the scale trajectory? If I am building for a startup that will have 10K users for the next year, PostgreSQL handles almost everything. If I am building for a system that needs to handle 100K writes per second and will grow to petabytes, I need to think about horizontal write scaling from day one, which means DynamoDB, Cassandra, or CockroachDB. The mistake is over-engineering for scale you will never reach. Postgres with read replicas handles far more than most people think: many companies run their entire business on a single Postgres instance into the tens of millions of users.
4. What does the team know? Operational expertise matters enormously. A team that knows PostgreSQL inside and out will run a Postgres cluster more reliably than a team learning Cassandra from scratch, even if Cassandra is “theoretically better” for the use case. Operational maturity (knowing how to back up, restore, monitor, tune, and upgrade) is an underrated factor. I have seen teams pick the “right” database and then suffer months of operational incidents because nobody understood its failure modes.
5. What is the operational cost? This includes managed vs self-hosted, licensing, and the cost of the engineering time to operate it. DynamoDB is expensive per-request but zero ops. Self-hosted Cassandra is cheap per-request but requires dedicated engineers to keep it healthy. For most teams, I would default to a managed service unless there is a compelling reason not to.

In practice, most production systems end up with two or three databases: a relational database for transactional core data, Redis for caching and hot-path reads, and optionally a specialized store (Elasticsearch for search, a time-series DB for metrics, DynamoDB for a specific high-throughput use case).
  • Jumps straight to a specific database without asking about access patterns
  • Says “MongoDB because it is flexible” or “Postgres because it is good” without deeper reasoning
  • Does not mention operational expertise as a factor
  • Treats the choice as permanent — a senior engineer knows you can (and often should) use different databases for different parts of the system
  • No mention of consistency requirements as a decision axis

Follow-up: When would you use both PostgreSQL and DynamoDB in the same system?

This is actually a pattern I have seen work well in several production systems. The idea is to use each database for what it does best.

PostgreSQL for the transactional core. User accounts, order processing, payment state: anything that requires ACID transactions, complex queries, or ad-hoc reporting. Postgres gives you joins, constraints, and the ability to ask questions about your data that you did not anticipate at design time.

DynamoDB for high-throughput, access-pattern-driven data. Event logs, session state, API rate limiting counters, user activity feeds, feature flags: anything where you know the access pattern upfront, need single-digit millisecond latency at any scale, and do not need joins.

A concrete example: an e-commerce platform. Orders, inventory, and user accounts live in Postgres because you need transactions (decrement inventory AND create order atomically) and ad-hoc queries for business reporting. But the product catalog view counts, user session data, and the order event audit trail go to DynamoDB because those are pure key-value lookups at high volume where you need predictable latency regardless of scale.

The synchronization pattern between them is important. Typically, the Postgres write path publishes events (using the Outbox pattern to avoid dual-write problems), and a consumer writes the denormalized view to DynamoDB. This gives you the transactional source of truth in Postgres and the read-optimized view in DynamoDB.

The trade-off is operational complexity: you now have two databases to monitor, back up, and understand. But at sufficient scale, the alternative (trying to make one database do everything) is worse.

Going Deeper: How do you avoid the dual-write problem when keeping two databases in sync?

The dual-write problem is one of the most common mistakes in distributed systems. The naive approach is: write to Postgres, then write to DynamoDB. If the second write fails, your databases are out of sync. If you wrap them in a distributed transaction, you have coupled two systems and introduced a performance and availability bottleneck.

The Outbox pattern is the standard solution. Instead of writing to both databases, you write to Postgres only, but you write both the business data AND an event record into an “outbox” table in the same local transaction. A separate process (a CDC consumer or a poller) reads the outbox table and publishes those events to a message broker (Kafka, SNS), which then triggers the write to DynamoDB. Because the business data and the outbox event are in the same Postgres transaction, they are guaranteed to be consistent.

Change Data Capture (CDC) is the more modern approach. Tools like Debezium read the Postgres WAL (write-ahead log) and publish every change as an event. This avoids the need for the outbox table entirely: the WAL is your outbox. The consumer reads those events and writes to DynamoDB.

The trade-off with both approaches is that DynamoDB is now eventually consistent with Postgres. There is a propagation delay, typically milliseconds to low seconds. If you need strong consistency across both stores, you have a fundamentally harder problem that usually means you should not be using two databases for that specific data path.

The gotcha I have seen in production: idempotency. Your consumer must handle duplicate events (network retries, consumer restarts). Writing to DynamoDB must be idempotent: use conditional writes or version checks so that processing the same event twice does not corrupt data.
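The Outbox pattern and the idempotent consumer described above can be sketched in a few lines. This is a minimal illustration only: SQLite stands in for Postgres, a plain dict stands in for the DynamoDB table, the message broker is skipped, and every table and function name is hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for Postgres
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        event_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        payload   TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id: int) -> None:
    # Business row and event row commit or roll back together.
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'created')", (order_id,))
        event = {"type": "order_created", "order_id": order_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


read_model = {}          # stand-in for the DynamoDB table
applied_events = set()   # stand-in for a conditional write keyed on event_id


def apply_event(event_id: int, payload: dict) -> None:
    if event_id in applied_events:  # duplicate delivery: safe to ignore
        return
    read_model[payload["order_id"]] = {"status": "created"}
    applied_events.add(event_id)


def relay_once() -> int:
    """Poll unpublished outbox rows, apply them, then mark them published."""
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    for event_id, payload in rows:
        apply_event(event_id, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    return len(rows)


create_order(1)
create_order(2)
relay_once()
apply_event(1, {"type": "order_created", "order_id": 1})  # simulated redelivery
print(read_model)
```

If the relay crashes between applying an event and marking it published, the event is delivered again on the next poll. That is exactly the at-least-once behavior the idempotency check exists for.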

Q3. Explain the trade-offs between JWT and session-based authentication. When would you choose one over the other?

At the core, this is a trade-off between stateless scalability and immediate revocability.

Session-based auth stores the session state server-side (typically Redis or a database). The client holds an opaque session ID cookie. On every request, the server looks up the session. The advantage is full control: you can revoke a session instantly by deleting it from the store. The disadvantage is that every single request requires a round trip to the session store, and you need that store to be highly available. If Redis goes down, nobody can authenticate.

JWT-based auth puts the session state in the token itself: the server signs a token containing user claims, and the client sends it with every request. The server verifies the signature without any database lookup. The advantage is horizontal scalability: any server can validate the token independently, with no shared state needed. This is why JWTs are popular in microservices architectures where you do not want every service calling a central auth service. The disadvantage is revocation: once you issue a JWT, it is valid until it expires. If a user’s account is compromised, you cannot invalidate their existing token without extra infrastructure.

When I choose sessions: Traditional web applications with a single backend, systems where instant revocation is a security requirement (banking, healthcare), and when you already have Redis in your infrastructure for caching.

When I choose JWTs: Microservices architectures where you want to avoid a centralized auth bottleneck, mobile APIs where stateless auth simplifies the architecture, and short-lived tokens (5-15 minutes) paired with refresh tokens stored server-side.

The hybrid approach most production systems actually use: short-lived JWTs (5-15 minute expiry) for API authentication plus server-side refresh tokens for issuing new JWTs. This gives you the scalability of JWTs (most requests do not hit the auth server) with reasonable revocation (revoke the refresh token, and the JWT expires naturally within minutes). The security window is the JWT lifetime; that is your maximum exposure time.

What most people miss: JWT size. A JWT with a few claims is easily 800 bytes to 1 KB. That is sent with every single request. Compare that to a 32-byte session ID. In high-throughput systems, that bandwidth adds up. Also, JWTs are not encrypted by default; they are base64url-encoded. Anyone can read the claims. If you put sensitive data in a JWT, you need JWE (encrypted JWT), which adds complexity and size.
  • Says “JWT is more secure because it is signed” — signing proves integrity, not confidentiality
  • Does not mention the revocation problem with JWTs
  • Claims sessions “do not scale” without mentioning that Redis-backed sessions scale to millions of concurrent users
  • No awareness of the hybrid approach (short-lived JWT + server-side refresh token)
  • Puts sensitive data (roles, permissions, PII) in JWT claims without mentioning encryption
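The point that a signed JWT is readable by anyone is easy to demonstrate. A minimal sketch using only the standard library: it builds an HS256-signed token by hand and then reads the claims without the key. The claims, the secret, and the helper names are made up for illustration; production code should use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, key: bytes) -> str:
    """Build header.payload.signature with an HMAC-SHA256 signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"


token = sign_hs256({"sub": "user-42", "email": "a@example.com"}, key=b"demo-secret")

# Reading the claims requires no key at all: the payload is only encoded.
_, payload_part, _ = token.split(".")
print(json.loads(b64url_decode(payload_part)))
print(len(token), "bytes sent on every request")
```

The signature stops an attacker from changing the claims; it does nothing to stop them from reading the claims.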

Follow-up: A security team tells you a user’s JWT has been stolen. What do you do?

This is exactly why the revocation problem matters. My response depends on what infrastructure we have in place.

Immediate action: revoke the refresh token. If we are using the short-lived JWT + refresh token pattern, the stolen JWT will expire within minutes. The attacker cannot get a new one because the refresh token is revoked server-side.

If we need to invalidate the JWT before expiry, we have a few options, none of them free:

Token blocklist. Add the JWT’s jti (unique token ID) to a blocklist in Redis. Every service that validates JWTs must check this blocklist. This works but defeats the main advantage of JWTs: you are back to hitting a centralized store on every request. The blocklist only needs to persist until the token’s natural expiry, so it is a bounded problem.

Rotate the signing key. If you rotate the key, all existing JWTs become invalid. This is a nuclear option: it logs out every user, not just the compromised one. Appropriate only in catastrophic breach scenarios.

User-level version claim. Include a token_version claim in JWTs. Store the current version per user in a fast store. On revocation, increment the user’s version. Services compare the JWT’s version against the current version. This is a targeted blocklist: you only invalidate tokens for the compromised user, and the lookup is a simple key-value read.

The lesson: the right architecture depends on your threat model. If “a stolen token is valid for 15 minutes” is acceptable, short-lived JWTs with no blocklist is the simplest approach. If that window is unacceptable (banking, healthcare), you need the blocklist or you should reconsider whether JWTs are the right choice for your system.
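The user-level version claim above reduces to one key-value comparison per request. A minimal sketch, assuming signature and expiry checks have already passed; the dict stands in for a fast store such as Redis, and all names are hypothetical.

```python
from dataclasses import dataclass

# Stand-in for a fast key-value store: user_id -> current token version.
current_version = {}


@dataclass(frozen=True)
class Claims:
    sub: str
    token_version: int


def issue(user_id: str) -> Claims:
    """Issue claims stamped with the user's current version (starting at 1)."""
    return Claims(sub=user_id, token_version=current_version.setdefault(user_id, 1))


def is_accepted(claims: Claims) -> bool:
    # Assumes the token's signature and expiry were verified before this call.
    return claims.token_version == current_version.get(claims.sub)


def revoke_all(user_id: str) -> None:
    """Invalidate every outstanding token for one user."""
    current_version[user_id] = current_version.get(user_id, 0) + 1


stolen = issue("alice")
other = issue("bob")
revoke_all("alice")

print(is_accepted(stolen))          # False: alice's old tokens are rejected
print(is_accepted(other))           # True: bob is unaffected
print(is_accepted(issue("alice")))  # True: a fresh login gets the new version
```

Unlike a jti blocklist, the store holds one small record per user rather than one per revoked token, but every validating service still needs a lookup on each request.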

Q4. You are designing a notification system that needs to deliver messages in real-time to millions of connected users. Walk me through the architecture and the protocol choices you would make.

This is a problem of scale, statefulness, and fan-out. Let me work through it layer by layer.

Protocol choice: SSE over WebSocket for this use case. Notifications are server-to-client only; the user is not sending messages back. SSE (Server-Sent Events) is simpler than WebSocket for unidirectional push: it works over plain HTTP, auto-reconnects natively, generally passes through proxies and CDNs as long as they do not buffer the response, and is simpler to load-balance because it is just an HTTP connection. WebSocket would be the right choice if the user needed to send data back (like a chat system), but for notifications, SSE is the simpler tool that fits.

Connection management layer. Each server can hold maybe 50K-100K concurrent SSE connections, so every million concurrent users needs roughly 10-20 servers just for connections. I would put these behind a load balancer with sticky sessions (or IP hashing) so that reconnects go back to the same server when possible. Each connection server maintains an in-memory map of userId -> connection.

The fan-out problem. When a notification is generated, how does it reach the right connection server? The notification producer does not know which server holds the user’s connection. The pattern is: notification producer publishes to a message broker (Kafka or Redis Pub/Sub), and every connection server subscribes. Each server checks if the target user is connected to it, and if so, pushes the notification down the SSE stream. With Redis Pub/Sub, you can use channel-per-user or a fan-out pattern. With Kafka, you partition by user ID and route connections by the same hash, so that each connection server consumes only the partitions for users it is likely to hold.

Offline users. If the user is not connected, the notification goes to a persistent store (DynamoDB or Postgres) and is delivered on next connection. The client, on reconnect, sends a Last-Event-ID header (SSE supports this natively), and the server replays missed events.

Scaling to millions. At 1 million concurrent connections, I am looking at roughly 10-20 connection servers. The bottleneck shifts to the fan-out layer. Redis Pub/Sub works well up to maybe 100K messages per second, but beyond that you need Kafka with partitioning or a dedicated pub/sub system. The connection servers themselves are I/O-bound (holding connections), not CPU-bound, so they can be relatively small instances with high network capacity.

The trade-off I would discuss with the team: do we need exactly-once delivery or is at-least-once acceptable? For notifications, at-least-once with client-side deduplication (using a notification ID) is usually the pragmatic choice. Exactly-once is significantly harder and rarely worth the complexity for notifications.
  • Jumps to WebSocket without considering SSE for a server-push-only use case
  • Does not address the fan-out problem — how does a notification reach the right server
  • No mention of offline handling or message persistence
  • Assumes a single server can hold all connections
  • Does not discuss delivery guarantees (at-least-once vs exactly-once)
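The fan-out and offline-replay flow in the answer above can be modeled in-process. A minimal sketch, where a `Broker` class stands in for Redis Pub/Sub or Kafka, lists stand in for open SSE streams, and a per-user log stands in for the persistent store; all class names are hypothetical, and for simplicity every notification is logged, not only those for offline users.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for a pub/sub broker: every subscriber sees every message."""

    def __init__(self):
        self.subscribers = []

    def publish(self, user_id, event) -> bool:
        # Build the full list first so every server is offered the message.
        return any([srv.on_message(user_id, event) for srv in self.subscribers])


class ConnectionServer:
    def __init__(self, broker: Broker):
        self.connections = {}  # user_id -> list standing in for an open SSE stream
        broker.subscribers.append(self)

    def connect(self, user_id):
        self.connections[user_id] = []
        return self.connections[user_id]

    def on_message(self, user_id, event) -> bool:
        stream = self.connections.get(user_id)
        if stream is None:
            return False  # user is not connected to this server
        stream.append(event)
        return True


class NotificationService:
    def __init__(self, broker: Broker):
        self.broker = broker
        self.log = defaultdict(list)  # user_id -> [(event_id, body)]

    def notify(self, user_id, body) -> bool:
        event = (len(self.log[user_id]) + 1, body)
        self.log[user_id].append(event)  # persist first, then push
        return self.broker.publish(user_id, event)

    def replay(self, user_id, last_event_id: int):
        """Events the client missed, keyed off its Last-Event-ID."""
        return [e for e in self.log[user_id] if e[0] > last_event_id]


broker = Broker()
s1, s2 = ConnectionServer(broker), ConnectionServer(broker)
service = NotificationService(broker)

alice_stream = s1.connect("alice")
print(service.notify("alice", "build finished"), alice_stream)
print(service.notify("bob", "new comment"))  # bob is offline: stored, not pushed
print(service.replay("bob", last_event_id=0))
```

Every server inspects every message here, which is the cost the partition-by-user-ID variant is meant to avoid.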
A 10-million-user fan-out is a classic thundering herd. If you try to deliver all 10 million notifications at once, you will either overwhelm your message broker, saturate your connection servers, or both.

Staggered delivery. Not all 10 million users need the notification in the same millisecond. Introduce jitter: spread the delivery over 5-30 seconds. For each batch of users, add a random delay. The user perceives “instant” (a few seconds is fine for most notifications), but your system sees a smooth ramp instead of a spike.

Pre-computed fan-out vs on-demand fan-out. For a “celebrity posts” scenario (one user has 10M followers), you have two architectures. Fan-out-on-write: when the event happens, immediately write a notification record for each of the 10M users. This is fast for reads but expensive and slow for writes. Fan-out-on-read: do not pre-compute. When a user opens their notifications tab, query “what events happened from accounts I follow since my last check.” This is slow for reads but avoids the write amplification. The hybrid approach (what Twitter famously does) is: fan-out-on-write for normal users, fan-out-on-read for celebrities with millions of followers.

Rate limiting the producer. The notification-generation pipeline should have a rate limiter. Instead of emitting 10M messages at once, it emits them at a controlled rate, say 100K per second. A queue (Kafka) buffers the rest. Connection servers consume at their own pace.

Backpressure. If connection servers cannot keep up, they need to signal backpressure to the message broker. With Kafka, this happens naturally because consumers pull at their own rate. With Redis Pub/Sub, messages are fire-and-forget and can be dropped if the subscriber is slow, which is another reason to prefer Kafka for this scale.
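The effect of staggered delivery is easy to check numerically. A minimal sketch, assuming a 20-second window and a scaled-down population of 100,000 users in place of 10 million; `schedule_with_jitter` is a hypothetical helper.

```python
import random
from collections import Counter


def schedule_with_jitter(user_ids, window_seconds: float, rng: random.Random) -> dict:
    """Assign each user a random send offset (in seconds) inside the window."""
    return {uid: rng.uniform(0, window_seconds) for uid in user_ids}


rng = random.Random(42)
users = range(100_000)  # scaled-down stand-in for 10 million users
offsets = schedule_with_jitter(users, window_seconds=20, rng=rng)

# Count how many sends land in each one-second bucket.
per_second = Counter(int(t) for t in offsets.values())
peak = max(per_second.values())
print(f"without jitter: {len(offsets)} sends in one instant")
print(f"with jitter: peak of {peak} sends in any one second")
```

With a uniform spread, the peak rate is close to the total divided by the window length, so widening the window is a direct lever on broker and connection-server load.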

Q5. Explain the CAP theorem, then tell me why it is frequently misunderstood and what you actually think about when designing distributed systems.

The CAP theorem states that in a distributed system experiencing a network partition, you must choose between consistency and availability. You cannot have both simultaneously during the partition. Let me break down each term as CAP defines them, because this is where most misunderstandings start.

Consistency in CAP means linearizability: every read returns the most recent write. This is a very strong form of consistency, much stronger than what most people mean when they say “consistent.”

Availability in CAP means every request received by a non-failed node must result in a response, not an error or an indefinite hang. The formal definition puts no bound on how long that response may take.

Partition tolerance means the system continues to operate despite network partitions between nodes.

Why it is misunderstood: People treat CAP as a menu: “pick 2 of 3.” But in any distributed system, partitions will happen (networks are unreliable). So the real choice is always between C and A during a partition. The question is not “pick 2 of 3” but “when a partition occurs, do you sacrifice consistency or availability?”

What I actually think about: PACELC. CAP only describes behavior during partitions. But most of the time, your system is not partitioned. PACELC extends CAP: during a Partition, choose Availability or Consistency. Else (normal operation), choose Latency or Consistency. This is much more practical because it captures the daily trade-off: even without partitions, replicating data synchronously to ensure consistency adds latency. DynamoDB is PA/EL: during partitions it chooses availability, and during normal operation it chooses low latency (eventually consistent reads are the default). Spanner is PC/EC: it always chooses consistency, even at the cost of higher latency (commit-wait based on TrueTime).

In practice, I think about consistency on a spectrum, not a binary: strong consistency (linearizable), sequential consistency, causal consistency, read-your-writes, eventual consistency. Most systems do not need linearizability everywhere. The art is identifying which data needs which level. User account balances? Strong consistency. View counts on a post? Eventual consistency is more than fine.

The question I ask in design reviews is not “are we CP or AP?” but “what is the blast radius if this read returns stale data? And is that acceptable for this specific use case?”
  • Says “pick any 2 of 3” as if you choose to not have partition tolerance
  • Cannot define what consistency means in CAP (linearizability) versus general “data is correct”
  • Treats it as a system-wide choice rather than per-operation or per-data-path
  • No mention of PACELC or the latency vs consistency trade-off during normal operation
  • Cannot give a concrete example of when eventual consistency is acceptable

Follow-up: Give me a concrete example of a system that appears to violate CAP and explain why it does not.

Google Spanner is the classic example. Spanner claims to be a globally consistent, highly available, partition-tolerant database. That sounds like it violates CAP.

It does not, and here is why. Spanner trades latency for consistency. It uses TrueTime (atomic clocks + GPS receivers in every datacenter) to keep clock uncertainty to 1-7 milliseconds. When a write commits, Spanner performs a “commit-wait”: it waits for the TrueTime uncertainty window to pass before confirming the write. This ensures that any subsequent read, anywhere in the world, will see the write.

During a network partition, Spanner chooses consistency over availability. If a Paxos group loses its majority, it stops serving writes. Reads from a lagging replica will wait until it catches up or time out. So Spanner is a CP system that provides very high availability in practice because Google’s network is so reliable that partitions are exceedingly rare.

The real insight is that CAP is about what happens during partitions. If you can make partitions rare enough (by investing billions in network infrastructure, as Google does), a CP system can appear to also be highly available. Spanner reportedly achieves five nines of availability. But if a partition did occur, it would sacrifice availability to maintain consistency. CAP is not violated; partitions are just extremely rare on Google’s network.

Another example is CockroachDB, which takes a similar approach (Raft-based, strong consistency) and accepts that during a partition affecting a quorum, affected ranges become unavailable. In practice, with three or five replicas across availability zones, the probability of losing a majority is extremely low.

Going Deeper: How does CockroachDB handle a situation where a Raft leader becomes partitioned from its followers?

When a Raft leader is partitioned from the majority, the followers stop receiving heartbeats. After the election timeout (randomized; 150-300ms in the Raft paper’s examples, with production systems such as CockroachDB typically configuring longer values), the followers that can still communicate with each other trigger a leader election. Because they form a majority, they successfully elect a new leader and resume serving reads and writes for that range.

The old partitioned leader will continue to think it is the leader for a brief period, but any writes it attempts will fail because it cannot get acknowledgment from a majority of replicas. It will eventually step down after not receiving responses to its heartbeats. This is the key safety property of Raft: the old leader cannot commit anything because commits require majority acknowledgment.

The leaseholder complication. CockroachDB has a concept beyond Raft leader: the leaseholder, which is the node that serves reads. If the partitioned node holds the lease, reads to that range will also be affected until the lease expires and another replica acquires it. The lease duration is a tunable parameter: shorter leases mean faster recovery from this scenario but more overhead during normal operation (more frequent lease renewals).

Client impact. Clients connected to the partitioned node will see errors for the affected ranges. If the client retries against a different node in the cluster, the request will succeed against the new leader. This is why client-side retry logic with awareness of the cluster topology matters. CockroachDB’s client drivers have built-in retry and redirect logic.

The whole recovery typically completes within a few seconds, which is why CockroachDB can claim high availability despite being a CP system: partitions are resolved quickly enough that SLAs are maintained.
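
The majority rule behind that safety property is small enough to write down. The sketch below is illustrative only: `quorum` and `can_commit` are hypothetical helper names, not part of any Raft library, and the five-replica layout is an assumed example.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return cluster_size // 2 + 1


def can_commit(reachable_replicas: int, cluster_size: int) -> bool:
    """A leader can commit only if it, plus the followers it reaches, form a majority."""
    return reachable_replicas >= quorum(cluster_size)


# Assumed scenario: five replicas, and a partition isolates the old leader
# together with one follower.
old_leader_side = 2   # old leader + 1 follower
majority_side = 3     # the remaining three followers

print(can_commit(old_leader_side, 5))  # False
print(can_commit(majority_side, 5))    # True
```

Only one side of any partition can hold a majority, which is why two leaders can never both commit writes for the same range.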

Q6. Walk me through how you would design a caching strategy for a read-heavy API endpoint that serves user profile data. The data changes infrequently but must not be stale for more than 30 seconds.

The constraints give me clear guardrails: read-heavy workload, infrequent writes, 30-second staleness budget. Let me design this layer by layer.

Cache-aside with event-driven invalidation is my starting pattern. Cache-aside means the application checks Redis first, and on a miss, reads from the database and populates Redis. This is the simplest and most well-understood pattern.

TTL as a safety net, not the primary invalidation mechanism. I would set a TTL of 30 seconds on the cache entries. This ensures that even if everything else fails, the data is never staler than 30 seconds. But relying on TTL alone means that on average, data is 15 seconds stale. We can do better.

Event-driven invalidation for freshness. When a user updates their profile, the write path publishes an invalidation event (either directly to Redis via DEL or through a lightweight message like SNS/SQS). The cache entry is deleted immediately, and the next read triggers a cache miss and fetches fresh data. This gives us near-zero staleness for the common case, with the 30-second TTL as the backstop for edge cases (event lost, consumer lag).

Cache stampede protection. When a popular user’s cache expires, hundreds of concurrent requests might all miss the cache simultaneously and hit the database. Three mitigations: (1) Lock/lease: the first request that gets a cache miss acquires a short-lived lock (Redis SET NX with a TTL). Other requests either wait for the lock to release and then read the now-populated cache, or serve a slightly stale value. (2) Refresh-ahead: proactively refresh the cache before the TTL expires. If the TTL is 30 seconds, refresh at 25 seconds. The cache never actually expires for hot keys. (3) Request coalescing: at the application level, deduplicate identical in-flight requests so that only one database query executes.

Cache key design. user:profile:{userId} is the obvious key. I would also consider including a version if the profile schema evolves. For multi-region setups, consider whether caches should be regional (lower latency, possible cross-region staleness) or global (consistent but higher latency for cache operations).

What I would monitor: cache hit ratio (target above 95% for this use case), P99 latency for cache misses, invalidation event lag (time between write and cache delete), and the rate of cache stampedes (concurrent misses for the same key).
  • Uses only TTL-based expiration without event-driven invalidation
  • No awareness of cache stampede (thundering herd on cache miss)
  • Does not mention the consistency model — what happens during the window between write and invalidation
  • Proposes write-through without acknowledging the added write latency
  • No monitoring strategy
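
A minimal sketch of the strategy from the answer above, under loud assumptions: an in-memory dict stands in for Redis, a per-key `threading.Lock` stands in for the SET NX lease, and `ProfileCache`, `load_from_db`, and `invalidate` are hypothetical names chosen for illustration.

```python
import threading
import time

TTL_SECONDS = 30.0  # staleness budget from the question


class ProfileCache:
    """In-memory stand-in for Redis: cache-aside, TTL backstop, per-key load lock."""

    def __init__(self, load_from_db, clock=time.monotonic):
        self._load_from_db = load_from_db   # hypothetical database read function
        self._clock = clock
        self._entries = {}                  # key -> (value, expires_at)
        self._locks = {}                    # key -> lock guarding the database load
        self._guard = threading.Lock()

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry and entry[1] > self._clock():
            return entry[0]
        return None

    def get(self, user_id):
        key = f"user:profile:{user_id}"
        value = self._fresh(key)
        if value is not None:
            return value
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                          # only one caller loads a given key
            value = self._fresh(key)        # re-check: another caller may have filled it
            if value is None:
                value = self._load_from_db(user_id)
                self._entries[key] = (value, self._clock() + TTL_SECONDS)
        return value

    def invalidate(self, user_id):
        """Called from the write path; this is the event-driven invalidation."""
        self._entries.pop(f"user:profile:{user_id}", None)
```

The injectable `clock` is a design choice that makes TTL expiry testable without sleeping. A real deployment would replace the dict and locks with Redis calls and would also need a policy for caching "user not found" results, which this sketch deliberately omits.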

Follow-up: The cache hit ratio is 60% and you expected it to be above 95%. How do you diagnose this?

A 60% hit ratio when expecting 95%+ tells me a lot of data is being evicted or never cached in the first place. My investigation:

1. Is the working set larger than the cache? Check Redis INFO: compare used_memory against maxmemory in the memory section, and look at evicted_keys in the stats section. If maxmemory is hit and evicted_keys is high, the cache is too small. The solution is either to increase memory or to be more selective about what gets cached. Not all user profiles need to be in cache, only hot ones.
2. Is the TTL too short? If the TTL is 30 seconds and the average time between requests for the same user profile is 45 seconds, most entries will expire before being read again. Calculate the ratio: if the inter-request gap is longer than the TTL for most keys, you will have mostly misses. Consider increasing the TTL for this data (if the staleness budget allows) or using refresh-ahead for frequently accessed keys.
3. Is the invalidation too aggressive? If the event-driven invalidation is firing too frequently (maybe a background job is “updating” profiles even when nothing changed), you are evicting entries unnecessarily. Check whether the invalidation events correspond to actual data changes.
4. Is the cache key cardinality too high? If you have 10 million users but only 1 million are active daily, pre-warming everything wastes 90% of cache slots on profiles that will never be requested before the TTL expires. Consider only caching on read (cache-aside does this naturally), not pre-warming the entire dataset.
5. Are clients bypassing the cache? Check if some code paths hit the database directly without checking the cache. This is surprisingly common in large codebases: a new developer writes a query that does not go through the caching layer.

The diagnostic tool I would build: a breakdown of misses by reason, such as “key did not exist” (never cached), “key expired” (TTL), and “key evicted” (memory pressure). Redis does not give this out of the box, so I would instrument the application to log miss reasons. That single breakdown immediately tells you which of the above is the root cause.
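
One way the miss-reason breakdown could be instrumented is sketched below. Everything here is an assumption for illustration: `MissReasonTracker` and its hooks are hypothetical names, the bookkeeping lives in application memory, and an extra "invalidated" bucket is added to cover the over-aggressive invalidation case from item 3.

```python
from collections import Counter


class MissReasonTracker:
    """Classifies cache misses using bookkeeping kept by the application."""

    def __init__(self):
        self.expires_at = {}      # key -> expiry time of the last value we cached
        self.invalidated = set()  # keys deleted by an invalidation event
        self.reasons = Counter()

    def on_set(self, key, now, ttl):
        self.expires_at[key] = now + ttl
        self.invalidated.discard(key)

    def on_invalidate(self, key):
        self.invalidated.add(key)

    def on_miss(self, key, now):
        if key not in self.expires_at:
            reason = "never_cached"
        elif key in self.invalidated:
            reason = "invalidated"
        elif now >= self.expires_at[key]:
            reason = "expired"
        else:
            reason = "evicted"    # still within its TTL but gone: memory pressure
        self.reasons[reason] += 1
        return reason
```

In practice the bookkeeping maps would need bounding (or sampling), since tracking every key ever cached defeats the purpose of a small cache. The sketch only shows how the classification logic separates the candidate root causes.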

Q7. Compare Kafka with a traditional message queue like RabbitMQ. When would you reach for each, and what are the architectural implications of your choice?

The fundamental difference is the mental model. RabbitMQ is a message broker: it routes messages from producers to consumers and removes them after acknowledgment. Kafka is a distributed commit log: it appends messages to an immutable, ordered, persistent log that multiple consumers can read independently.

RabbitMQ: smart broker, dumb consumers. The broker manages message routing (direct, topic, fanout exchanges), tracks which messages have been delivered, and handles acknowledgment and redelivery. Consumers are simple: they connect, receive messages, and ack. This model is excellent for task queues where each message should be handled by exactly one worker (e.g., “send this email,” “process this payment”). Delivery is still at-least-once, so workers should tolerate redelivery. RabbitMQ excels when you need complex routing rules, priority queues, dead-letter queues, and message-level TTLs.

Kafka: dumb broker, smart consumers. The broker just appends to the log and serves reads. Consumers track their own position (offset) in the log. This means multiple consumer groups can independently read the same data at different speeds without affecting each other. The log is retained for a configurable period (hours, days, or forever). This is transformative for event streaming: you can replay events, add new consumers retroactively, and build event-sourcing architectures.

When I reach for RabbitMQ: task queues where order does not matter across the full queue, RPC-style request-reply patterns, systems where individual message routing logic is complex (route based on message content, headers, routing keys), and teams that want simpler operational overhead.

When I reach for Kafka: event streaming where multiple consumers need the same data, log-based architectures (CDC, event sourcing, CQRS), systems that need replay capability (a new analytics service that needs to process the last 30 days of events), high-throughput ingestion (Kafka handles millions of messages per second), and when you want a durable source of truth for what happened.

Architectural implications: choosing Kafka means you are committing to an event-driven architecture. Your consumers must handle out-of-order processing (across partitions), must be idempotent (at-least-once delivery), and must manage their own offsets. Choosing RabbitMQ means simpler consumer logic but no replay capability and a lower throughput ceiling. The operational cost difference is also significant: Kafka requires managing brokers, ZooKeeper/KRaft, and topic partitioning, while RabbitMQ is simpler to run but has clustering gotchas (network partition handling in Erlang is notoriously tricky).

The mistake I see most often: using Kafka as a simple task queue. If you do not need replay, do not need multiple consumer groups reading the same data, and your throughput is under 50K messages per second, Kafka is overengineered for that problem. RabbitMQ or even SQS will be simpler and cheaper.
  • Says “Kafka is better because it is faster” without nuance
  • Does not understand that Kafka retains messages after consumption while RabbitMQ deletes them
  • Cannot explain consumer groups or offsets in Kafka
  • No mention of replay capability as a key differentiator
  • Does not acknowledge operational complexity differences
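
The two mental models can be contrasted with a toy in-memory sketch. This is not how either system is implemented; `TaskQueue` and `CommitLog` are hypothetical classes, and the log models a single partition only.

```python
from collections import deque


class TaskQueue:
    """Broker-style queue: a message is gone once a consumer acknowledges it."""

    def __init__(self):
        self._pending = deque()

    def publish(self, message):
        self._pending.append(message)

    def consume_and_ack(self):
        return self._pending.popleft() if self._pending else None


class CommitLog:
    """Log-style topic (one partition): messages stay; each group tracks an offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}   # consumer group -> next offset to read

    def append(self, message):
        self._log.append(message)

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        if offset >= len(self._log):
            return None
        self._offsets[group] = offset + 1   # "commit" the new position
        return self._log[offset]

    def seek(self, group, offset):
        """Rewind a group to replay earlier messages."""
        self._offsets[group] = offset
```

With the queue, a second consumer never sees a message the first one acknowledged. With the log, a group added later starts from offset 0 and reads everything, and `seek` gives an existing group replay.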

Follow-up: A consumer is falling behind and lag is growing in Kafka. How do you diagnose and fix this?

Consumer lag in Kafka means the consumer’s committed offset is falling further behind the log’s latest offset. This is one of the most common Kafka operational issues.

Step 1: Quantify the lag. Use kafka-consumer-groups --describe or a monitoring tool (Burrow, Kafka Exporter for Prometheus) to see lag per partition. If lag is uniform across partitions, it is a throughput problem. If lag is concentrated on specific partitions, it is a hot partition or a stuck consumer problem.

Step 2: Is the consumer processing fast enough? Profile the consumer. Is it CPU-bound (complex processing per message)? Is it I/O-bound (writing to a slow database)? Is it blocking on external calls? The most common cause I have seen is that the consumer makes a synchronous database write per message when it should be batching. Switching from single-row inserts to batch inserts of 100-1000 rows can improve throughput by 10-50x.

Step 3: Is there enough parallelism? Kafka parallelism is bounded by the number of partitions. If you have 10 partitions and 10 consumers, you are maxed out, and adding more consumers does nothing. You would need to increase partitions (a disruptive operation, especially if you depend on key-based ordering). Alternatively, each consumer can process messages using internal thread pools, but you lose per-partition ordering guarantees.

Step 4: Is it a rebalance storm? If consumers are joining and leaving the group frequently (due to long processing time exceeding max.poll.interval.ms, or health check failures), the group constantly rebalances, and during rebalancing, no consumption happens. The fix is to increase max.poll.interval.ms, decrease max.poll.records, or use cooperative sticky rebalancing to minimize disruption.

Step 5: Is the consumer committing offsets efficiently? If using synchronous commits after every message, that adds latency. Switch to async commits or commit every N messages.

If lag is unrecoverable (days behind and growing), you may need to make a business decision: skip to the latest offset and accept data loss, or provision significantly more consumer capacity to catch up gradually.
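
The Step 1 triage (uniform lag versus lag concentrated on a few partitions) can be sketched as a small calculation. The function names are hypothetical, the inputs are assumed to be offsets already collected from a monitoring tool, and the `skew_factor` of 5 is an arbitrary illustrative threshold rather than a recommended value.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition = latest offset in the log minus the group's committed offset."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}


def classify_lag(lag, skew_factor=5.0):
    """Rough triage: 'skewed' if the worst partition is far above the median."""
    values = sorted(lag.values())
    if not values or values[-1] == 0:
        return "healthy"
    median = values[len(values) // 2]
    if values[-1] > skew_factor * max(median, 1):
        return "skewed"      # hot partition or stuck consumer
    return "uniform"         # overall throughput problem


end = {0: 1000, 1: 1000, 2: 1000}
print(classify_lag(partition_lag(end, {0: 900, 1: 910, 2: 890})))  # uniform
print(classify_lag(partition_lag(end, {0: 990, 1: 995, 2: 0})))    # skewed
```

A "uniform" result points toward Steps 2 and 3 (processing speed, parallelism), while "skewed" points toward key distribution or a single stuck consumer.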

Going Deeper: How does Kafka achieve its high throughput despite writing to disk?

This is one of the most elegant parts of Kafka’s design. The conventional wisdom is that disk is slow, but Kafka exploits the fact that sequential disk I/O is fast, in some cases comparable to or faster than random memory access.

Sequential writes only. Kafka appends messages to the end of a log file. There are no random writes, no seeks. A modern SSD can handle 1+ GB/s of sequential writes. Even HDDs can do 200+ MB/s sequentially. Kafka never modifies data in place.

Page cache exploitation. Kafka relies on the OS page cache rather than managing its own cache in JVM heap. Writes go to the page cache and are flushed to disk asynchronously by the OS. Reads of recent data are served directly from the page cache without hitting disk at all. This avoids GC pressure and double-buffering (which happens when Java apps cache data in heap that the OS also caches).

Zero-copy with sendfile(). When a consumer reads data, Kafka uses the sendfile() system call to transfer data directly from the page cache to the network socket, bypassing user space entirely. Normal I/O requires: disk to page cache to application buffer to socket buffer to NIC. Zero-copy skips the two copies through the application buffer. This is a huge win for throughput.

Batching everywhere. Producers batch multiple messages into a single network request. Brokers write batches as a single sequential append. Consumers fetch batches. This amortizes the per-message overhead (network round trips, syscalls, headers) across many messages.

Compression at the batch level. Producers can compress batches (Snappy, LZ4, zstd). Because similar messages in a batch compress well together, you get better compression ratios than compressing individual messages. The broker stores and replicates compressed batches without decompressing.

Put it all together: batched, compressed, sequential writes to the page cache with zero-copy reads. That is why a single Kafka broker on suitable hardware can sustain hundreds of MB/s of throughput. The disk is not the bottleneck; the network usually is.
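
The batch-compression claim is easy to check with a small experiment. This sketch uses `zlib` from the Python standard library as a stand-in for Snappy, LZ4, or zstd (an assumption made to keep it dependency-free), and the event payloads are made up.

```python
import json
import zlib

# Hypothetical event payloads that share most of their structure.
messages = [
    json.dumps({"event": "page_view", "user_id": i, "path": "/products/42"}).encode()
    for i in range(500)
]

# Compress each message on its own, then compress the same messages as one batch.
per_message = sum(len(zlib.compress(m)) for m in messages)
as_batch = len(zlib.compress(b"\n".join(messages)))

print(f"compressed one by one: {per_message} bytes")
print(f"compressed as a batch: {as_batch} bytes")
```

Individually compressed small messages barely shrink because each one pays the compressor's fixed overhead and has little repetition to exploit. The batch shares one dictionary across all 500 messages, so the repeated field names and values compress away.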

Q8. A canary deployment is showing a 2% error rate increase compared to the stable version. Walk me through your decision framework for whether to proceed, roll back, or investigate.

This is a judgment call, not a formula, and the right answer depends on context. Let me walk through how I think about it.

First: is the 2% increase statistically significant? If the canary is only serving 1% of traffic and has only been running for 5 minutes, 2% might be noise. I need to check the sample size and confidence interval. A 2% increase on 100 requests is meaningless. A 2% increase on 100,000 requests is a real signal. I would wait until the canary has processed enough requests that the error rate delta is outside the normal variance of the baseline.

Second: what kind of errors? A 2% increase in 5xx errors is very different from a 2% increase in 4xx errors. 5xx responses mean something is broken in my code or infrastructure. 4xx responses might mean a client contract changed. I would look at the specific error codes and the error messages. If I see a new exception type that does not exist in the stable version, that is a strong signal of a real bug.

Third: which endpoints are affected? If the errors are concentrated on one endpoint that the new version changed, that is a clear regression. If they are spread across all endpoints, it might be an infrastructure issue (the canary pod got placed on a degraded node) rather than a code issue.

Fourth: what are the user-facing symptoms? Are users seeing error pages, or are these internal retries that succeed on the second attempt? If there is no user-visible impact, the urgency is lower. Check downstream metrics: are conversion rates, latency, or business KPIs affected?

My decision framework:
  • Roll back immediately if: error rate is clearly significant AND affects user-facing functionality AND you do not immediately understand the cause. In production, you protect users first and debug later.
  • Investigate (pause rollout, do not roll back) if: errors are significant but isolated to a non-critical path, or you have a strong hypothesis about the cause that you can verify quickly (under 15 minutes).
  • Proceed with caution if: errors are within normal variance, you have high confidence they are unrelated to the change, and expanding to 5% traffic does not increase the rate.
The principle I follow: when in doubt, roll back. A delayed deployment costs hours. A broken production costs trust, revenue, and on-call engineer sleep. The asymmetry strongly favors caution.
  • Immediately says “roll back” without asking about statistical significance
  • Immediately says “proceed” without investigation — disregards error signals
  • No awareness that error rate on a small sample can be noise
  • Does not differentiate between types of errors (5xx vs 4xx, user-facing vs internal)
  • No mention of business impact as part of the decision
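
The sample-size point can be made concrete with a two-proportion z-test. This is a sketch under assumptions: the function name is hypothetical, the "2% increase" is read as two percentage points (1% baseline versus 3% canary), and the normal approximation is weak for very small counts, where Fisher's exact test would be the better tool.

```python
import math


def two_proportion_p_value(errors_a, total_a, errors_b, total_b):
    """Two-sided p-value for 'both groups share the same error rate' (z-test)."""
    pooled = (errors_a + errors_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 1.0
    z = (errors_b / total_b - errors_a / total_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Same error rates (1% vs 3%), very different sample sizes.
small = two_proportion_p_value(1, 100, 3, 100)
large = two_proportion_p_value(1_000, 100_000, 3_000, 100_000)
print(f"100 requests each:     p = {small:.3f}")
print(f"100,000 requests each: p = {large:.3g}")
```

With 100 requests per group the p-value is roughly 0.3, which is indistinguishable from noise. With 100,000 requests per group it is effectively zero, so the same delta becomes a clear signal.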

Follow-up: How would you build an automated canary analysis system that makes this decision without human intervention?

Automated canary analysis is what companies like Netflix (Kayenta) and Google (their internal system) have built to remove human judgment from the deployment hot path. The key components:

Metric selection. Define a set of canary metrics: error rate, latency (P50, P95, P99), saturation (CPU, memory), and business metrics (conversion rate, revenue per request). Not all metrics are equally important, so you assign weights or criticality levels.

Statistical comparison. For each metric, compare the canary population against the control (baseline). Use a statistical test appropriate for the data distribution. Mann-Whitney U test is common for latency (non-normal distribution). For error rates, a chi-squared test or Fisher’s exact test works. The key is setting a significance threshold (p-value < 0.05 typically) and a practical threshold (is the difference large enough to matter, not just statistically significant).

Warm-up period. Ignore the first N minutes after deployment. JVM warm-up, cache cold starts, and connection pool initialization cause temporary anomalies that do not represent steady-state behavior.

Phased rollout with automatic gates. The system rolls out in phases: 1% for 10 minutes, 5% for 10 minutes, 25% for 15 minutes, 100%. At each gate, the statistical analysis runs. If any critical metric fails, automatic rollback. If all pass, proceed to the next phase.

The Netflix Kayenta model specifically compares canary against a separate baseline deployed at the same time on similar infrastructure. This eliminates time-based confounders (traffic patterns change over the day): both canary and baseline experience the same traffic pattern, and the only difference is the code version.

Edge cases to handle: metric emission lag (some metrics take minutes to propagate), alerting fatigue (too-sensitive thresholds cause constant rollbacks), and non-comparable traffic (if canary gets disproportionately more bot traffic due to load balancer quirks).

The biggest risk of full automation is false rollbacks: rolling back a perfectly fine deployment because of a transient metric blip. You need tuning and iteration. Most teams start with automated analysis providing recommendations to a human, and only move to fully automated rollbacks after building confidence.
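
The phased rollout with automatic gates reduces to a short control loop. The sketch below is illustrative only: `run_rollout` and the `gate_passes` callback are hypothetical names, and the traffic shifting and waiting that a real system performs are left as a comment.

```python
# (traffic %, observation minutes), matching the phases described above.
PHASES = [(1, 10), (5, 10), (25, 15), (100, 0)]


def run_rollout(gate_passes, phases=PHASES):
    """Advance through phases; stop and roll back at the first failed gate.

    gate_passes(percent) is a hypothetical callback that runs the statistical
    comparison for the current phase and returns True when all critical
    metrics pass.
    """
    for percent, _minutes in phases:
        # A real system would shift traffic here and wait out the observation window.
        if not gate_passes(percent):
            return ("rolled_back", percent)
    return ("promoted", 100)


print(run_rollout(lambda percent: True))          # ('promoted', 100)
print(run_rollout(lambda percent: percent < 25))  # ('rolled_back', 25)
```

Keeping the gate as an injected callback is what lets a team start with a human approving each phase and later swap in the automated statistical check without changing the rollout loop.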

Q9. Explain the Circuit Breaker pattern. When is it essential, and when is it actually harmful?

The circuit breaker is a resilience pattern that prevents a service from repeatedly calling a downstream dependency that is failing. Think of it like an electrical circuit breaker: when current is too high (too many failures), it trips and stops the flow to prevent further damage.

Three states: Closed (normal operation, requests flow through), Open (failures exceeded threshold, requests are immediately rejected without calling the downstream), and Half-Open (after a timeout, a limited number of test requests are allowed through to see if the downstream has recovered).

How it works in practice: you configure a failure threshold (e.g., 5 consecutive failures or 50% error rate in a 10-second window). When the threshold is breached, the circuit opens. All subsequent calls immediately return an error (or a fallback response) without making the network call. After a configurable timeout (e.g., 30 seconds), the circuit moves to half-open. If the test request succeeds, the circuit closes. If it fails, the circuit opens again.

When it is essential:
  • Preventing cascading failures. If Service A calls Service B, and B is down, A’s threads/connections pile up waiting for B’s timeouts. Soon A is exhausted and fails too, which cascades to everything that depends on A. The circuit breaker fails fast, freeing A’s resources.
  • Protecting a recovering service. When a downstream service crashes and restarts, it is vulnerable. If all callers immediately slam it with backed-up requests, it crashes again. The circuit breaker gradually allows traffic back (half-open state), giving the recovering service time to warm up.
  • Reducing latency during outages. Instead of waiting for a 10-second timeout on every request, the circuit breaker returns instantly. This improves user experience — a fast error is better than a slow error.
When it is harmful:
  • Idempotent retry scenarios. If the downstream failure is transient (network blip, brief GC pause) and your retry strategy would handle it in 100ms, a circuit breaker that opens and blocks all requests for 30 seconds is a massive overreaction. You lose 30 seconds of availability to protect against a 100ms blip.
  • Systems with natural variance. If the downstream has a legitimately high error rate for some requests (e.g., 5% of requests fail because of bad user input, not downstream issues), the circuit breaker may trip on expected errors. You need to differentiate between errors that indicate downstream health problems (5xx, timeouts) and errors that are expected (4xx).
  • Single-dependency systems. If your service has exactly one downstream and it is the only thing you do, opening the circuit breaker means you are 100% unavailable. There is no graceful degradation possible. The circuit breaker only helps when you can serve partial functionality or a fallback.
The gotcha in production is tuning. Too sensitive and you get false trips that cause unnecessary outages. Too insensitive and the circuit never trips when it should. The right thresholds depend on your traffic patterns and can only be determined through load testing and production observation.
  • Can only describe the pattern at a textbook level without discussing when it hurts
  • Does not mention the half-open state or how recovery works
  • Cannot explain cascading failures as the primary motivation
  • Thinks circuit breaker and retry are the same thing
  • No awareness of the tuning challenge
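
The three-state machine can be sketched in a few lines. This is a minimal illustration, not a production library: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, it uses the consecutive-failure threshold (not the windowed error rate), it treats every exception as a downstream failure, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"                # allow one trial request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "half_open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        self._failures = 0                          # any success resets the count
        self.state = "closed"
        return result
```

To address the natural-variance problem described above, a real breaker would count only errors that indicate downstream ill health (timeouts, 5xx) and let expected errors (4xx) pass through without touching the failure count.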

Follow-up: How does a circuit breaker interact with retries, timeouts, and bulkheads in a resilience stack?

These four patterns form a layered resilience stack, and the order and interaction matter. Done wrong, they fight each other. Done right, they complement each other.

Timeouts are the foundation. Every external call must have a timeout. Without one, a hanging downstream can hold your threads indefinitely. The timeout should be set based on the downstream’s P99 latency plus a margin. If the downstream normally responds in 50ms at P99, a timeout of 200-500ms is reasonable. Setting it to 30 seconds (a common default) is almost always wrong.

Retries sit on top of timeouts. When a request times out or fails, you retry. But retries must be bounded (max 2-3 attempts) and use exponential backoff with jitter. Without jitter, all clients retry at the same time after a failure, creating a synchronized thundering herd. Retries multiply your load on the downstream: 3 attempts per request means the failing service sees up to 3x traffic, which can make the problem worse. This is called retry amplification.

Circuit breakers sit on top of retries. The circuit breaker monitors the aggregate failure rate across all requests, counting each request once its retries are exhausted. When the rate exceeds the threshold, it stops all requests, including retries. This is the key interaction: without a circuit breaker, retries keep hammering a failing service. The circuit breaker says “this service is clearly down, stop trying.”

Bulkheads are orthogonal: they isolate resources. Instead of sharing one thread pool or connection pool across all downstream services, you give each downstream its own pool. If Downstream A is slow and exhausts its pool, Downstream B’s pool is unaffected. The name comes from ships: watertight compartments keep one hull breach from flooding the whole vessel.

The correct layering: a request goes through the Bulkhead (ensuring resource isolation), then the Circuit Breaker (if open, fail immediately), then Retry (with backoff and jitter), then Timeout (per individual attempt). The order matters: the circuit breaker must wrap the retry logic so it counts final failures, not individual attempt failures.

The anti-pattern I see most often: retries without circuit breakers. The service retries every failed request 3 times, which means a downstream outage causes up to 4x load on the failing service (the original call plus three retries), accelerating its failure and preventing recovery. Adding a circuit breaker that opens after detecting the downstream is unhealthy breaks this feedback loop.
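
A bounded retry loop with exponential backoff and jitter might look like the sketch below. The names and default values are assumptions for illustration, the "full jitter" variant (uniform between zero and the exponential ceiling) is one common choice among several, and the per-attempt timeout is assumed to live inside `func`.

```python
import random
import time


def backoff_delays(max_attempts=3, base=0.1, cap=2.0, rng=random):
    """'Full jitter' delays: uniform in [0, min(cap, base * 2**n)] before each retry."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(max_attempts - 1)]


def call_with_retries(func, max_attempts=3, sleep=time.sleep, **backoff):
    """Try func up to max_attempts times, sleeping a jittered delay between attempts."""
    delays = backoff_delays(max_attempts, **backoff)
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise           # final failure: this is what a wrapping breaker counts
            sleep(delays[attempt])
```

In the layering described above, a circuit breaker would wrap `call_with_retries` as a whole, so it sees one success or one final failure per logical request rather than one per attempt.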

Q10. You need to do a back-of-the-envelope estimation for the storage requirements of a URL shortener serving 100 million new URLs per day. Walk me through your calculation.

Let me break this down systematically. The key to estimation in interviews is showing structured thinking and stating assumptions explicitly.

What we need to store per URL:
  • Short code: ~7 characters = 7 bytes (base62 encoding: 62^7 = ~3.5 trillion possible codes, more than enough)
  • Original URL: average ~100 bytes (URLs vary, but this is a reasonable average)
  • Creation timestamp: 8 bytes
  • Expiration timestamp: 8 bytes
  • User ID (optional): 8 bytes
  • Click count (optional): 8 bytes
Total per URL entry: 7 + 100 + 8 + 8 + 8 + 8 = 139 bytes, so roughly 150 bytes. Let me round up to 200 bytes to account for indexing overhead, metadata, and padding.

Daily storage:
  • 100 million URLs/day times 200 bytes = 20 GB per day
Monthly storage:
  • 20 GB/day times 30 = 600 GB per month
Yearly storage:
  • 20 GB/day times 365 = ~7.3 TB per year
With replication (3x for durability):
  • 7.3 TB times 3 = ~22 TB per year of raw storage
5-year projection:
  • 22 TB times 5 = ~110 TB over five years
QPS calculation for reads (assuming 100:1 read-to-write ratio):
  • Write QPS: 100M/86400 = ~1,150 QPS average, ~3,500 peak
  • Read QPS: ~115,000 average, ~350,000 peak
What this tells me architecturally:
  • The dataset is not enormous — 7.3 TB per year fits comfortably in a single sharded database cluster
  • The read QPS (350K peak) strongly suggests we need caching. Redis can easily handle this. If the hot set (recently created URLs) fits in a few hundred GB of RAM, we get sub-millisecond reads for most requests
  • The write QPS (3.5K peak) is very manageable for most databases — even a single Postgres instance can handle this
  • I would use DynamoDB or Cassandra for this because the access pattern is pure key-value (short code to URL), no joins, no complex queries. DynamoDB gives me single-digit millisecond latency at this scale with very little operational overhead
The estimate I would sanity-check: 100 bytes per URL. Short URLs (https://example.com) are about 20 bytes. Long URLs (with query strings) can be 2000 bytes. 100 bytes is a reasonable guess for a median. If the use case skews toward long URLs (analytics tracking URLs), I might revise to 500 bytes, which would push the record to roughly 550-600 bytes and about triple the storage estimate.
  • Cannot produce reasonable per-record size estimates
  • Forgets replication factor
  • Does not calculate QPS alongside storage — storage alone does not drive architecture
  • Gives exact numbers instead of orders of magnitude (the point is approximate reasoning, not precision)
  • Does not state assumptions explicitly
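
The arithmetic above can be reproduced in a few lines, which is a useful habit for checking a whiteboard estimate afterwards. All constants are the assumptions stated in the answer (200 bytes per record, 3x replication, 100:1 read-to-write ratio, 3x peak factor).

```python
URLS_PER_DAY = 100_000_000
BYTES_PER_URL = 200          # ~150 bytes of fields, rounded up for overhead
REPLICATION = 3
READ_WRITE_RATIO = 100
PEAK_FACTOR = 3
SECONDS_PER_DAY = 86_400

daily_gb = URLS_PER_DAY * BYTES_PER_URL / 1e9
yearly_tb = daily_gb * 365 / 1e3
replicated_tb = yearly_tb * REPLICATION
five_year_tb = replicated_tb * 5

write_qps = URLS_PER_DAY / SECONDS_PER_DAY
read_qps = write_qps * READ_WRITE_RATIO

print(f"{daily_gb:.0f} GB/day, {yearly_tb:.1f} TB/year, "
      f"{replicated_tb:.1f} TB/year replicated, {five_year_tb:.1f} TB over 5 years")
print(f"writes: {write_qps:,.0f} QPS avg, {write_qps * PEAK_FACTOR:,.0f} peak")
print(f"reads:  {read_qps:,.0f} QPS avg, {read_qps * PEAK_FACTOR:,.0f} peak")
```

The exact values (about 1,157 writes per second, 21.9 TB per year replicated) differ slightly from the rounded figures in the answer, which is expected: the interview version rounds at each step to keep the mental math tractable.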

Follow-up: How would you generate the short codes to ensure uniqueness at 100 million URLs per day?

Several approaches, each with different trade-offs.

Approach 1: Pre-generated key service. Generate a large batch of unique keys offline and store them in a database table. When a URL needs to be shortened, pop a key from the pool. This is simple and guarantees uniqueness (no collision checking needed), but you need to manage key exhaustion and the key generation service becomes a dependency.

Approach 2: Base62 encoding of an auto-incrementing ID. Use a centralized ID generator to produce a unique integer, then base62-encode it to get a short string. A 7-character base62 string gives 62^7 = ~3.5 trillion IDs, which at 100M/day lasts ~96 years. The challenge is that the centralized ID generator can become a bottleneck.

Approach 3: Hash the URL. MD5 or SHA-256 of the URL, then take the first 7 characters (base62-encoded). This is simple and stateless, but has a collision risk. With 7 base62 characters (3.5 trillion space) and 100M URLs/day, by the birthday paradox, you would expect a collision after roughly sqrt(3.5T) = ~1.87 million URLs, which at this write rate is reached within the first half hour. That is way too soon. You would need to handle collisions: check if the code exists, and if so, append a counter or re-hash. This adds latency and complexity.

Approach 4: Snowflake-style distributed IDs. Each application server is assigned a range or a node ID. It generates IDs locally using timestamp + node ID + local counter (the scheme popularized by Twitter’s Snowflake). No coordination needed, no central bottleneck, guaranteed unique. The cost is length: a full 64-bit Snowflake ID encodes to up to 11 base62 characters, because 7 characters only cover about 42 bits. This is what I would recommend for this scale.

My recommendation for this system: Snowflake IDs base62-encoded, accepting codes of up to 11 characters. It scales horizontally, has no central coordination, and the IDs are naturally time-sortable (useful for analytics). If I ever mix in a scheme that can collide (such as hashing), I would add a Bloom filter as a lightweight collision check. False positives are fine (just generate another ID), but budget for the memory: a Bloom filter for 36.5 billion URLs per year (100M/day * 365) needs roughly 44 GB at a 1% false-positive rate, not a few GB.
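
As a sanity check on the sizes involved, the sketch below encodes integers in base62 and applies the standard Bloom filter sizing formula. The alphabet ordering and the 1% false-positive rate are assumptions chosen for illustration, and both function names are hypothetical.

```python
import math

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_encode(number: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet above."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def bloom_filter_bytes(items: int, false_positive_rate: float) -> float:
    """Optimal Bloom filter size: m = -n * ln(p) / (ln 2)^2 bits, returned in bytes."""
    bits = -items * math.log(false_positive_rate) / (math.log(2) ** 2)
    return bits / 8


print(len(base62_encode(62**7 - 1)))   # 7: the largest ID that fits in 7 characters
print(len(base62_encode(2**63 - 1)))   # 11: a full 63-bit Snowflake-style ID
print(bloom_filter_bytes(36_500_000_000, 0.01) / 1e9)  # GB for one year of URLs
```

Two things fall out of the numbers: a full-width Snowflake-style ID needs 11 base62 characters rather than 7, and a Bloom filter covering a year of URLs at a 1% false-positive rate is roughly 44 GB. Both are worth stating out loud in the interview as explicit trade-offs.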

Q11. Explain the difference between horizontal and vertical scaling. Then tell me when vertical scaling is actually the right choice.

Vertical scaling means adding more resources (CPU, RAM, faster disks) to a single machine. Horizontal scaling means adding more machines. This is the textbook answer, but the real insight is about when each is appropriate.

Vertical scaling is the right choice more often than most people admit. The industry has a bias toward horizontal scaling because “that is what Google does,” but most companies are not Google. A modern single-server instance on AWS (e.g., r6g.16xlarge) gives you 64 vCPUs and 512 GB RAM. That is an enormous amount of compute. A single PostgreSQL instance on that hardware can handle tens of thousands of transactions per second, serve millions of users, and store terabytes of data.

When vertical scaling wins:
  • Operational simplicity. One database to back up, monitor, tune, and upgrade is massively simpler than a 10-node cluster. Distributed systems introduce failure modes (network partitions, split brains, rebalancing storms) that a single node simply does not have.
  • Strong consistency is trivial. A single node is always consistent with itself. No consensus protocols, no replication lag, no conflicting writes.
  • Cost efficiency at moderate scale. Running one large instance is often cheaper than running and coordinating many small ones, especially when you factor in the engineering time to operate a distributed system.
  • Stateful workloads that resist partitioning. Some workloads — like a game server managing shared state for 1000 players in a single match — do not naturally partition. Horizontal scaling requires rethinking the data model.
When vertical scaling fails:
  • You hit the ceiling. There is a maximum machine size. When you need more than the biggest available instance, you have no choice but to go horizontal.
  • Availability requirements. A single machine is a single point of failure. If you need four or five nines, you need redundancy, which is horizontal by definition.
  • Geographic distribution. You cannot put one machine in three continents simultaneously.
My rule of thumb: start vertical, go horizontal when you must. Premature horizontal scaling is a form of premature optimization. It adds complexity that you might never need. Companies like Stack Overflow famously ran their entire platform on a few beefy servers for years. Basecamp (Hey.com) runs on a handful of large instances. The “scale horizontally from day one” advice from distributed systems papers does not apply to most startups or mid-size companies.

The hybrid pattern most real systems use: vertically scale the database (one beefy primary + read replicas) and horizontally scale the stateless application servers (easy because they share no state). This gives you the simplicity of a single database with the throughput of many application instances.
  • Immediately dismisses vertical scaling as “not scalable” without nuance
  • Cannot name a single scenario where vertical scaling is preferable
  • Thinks horizontal scaling has no downsides
  • Does not mention operational complexity as a real cost of horizontal scaling
  • No mention of the hybrid approach (vertical DB, horizontal app)

Follow-up: Your PostgreSQL primary is at 80% CPU utilization. Walk me through your options before adding read replicas.

Before scaling horizontally with read replicas (which introduces replication lag and read-after-write consistency challenges), I would exhaust the vertical and optimization options first.

1. Query optimization. Query pg_stat_statements to find the top queries by total time and by frequency. In my experience, 80% of database load comes from 5% of queries. Look for sequential scans on large tables (add missing indexes), N+1 queries from the ORM (batch them), and queries that return more data than needed (add appropriate column selection and LIMIT clauses). This alone has taken me from 80% to 30% CPU before.

2. Connection management. Check pg_stat_activity for the number of connections. If you have hundreds of application connections directly to Postgres, the per-connection overhead (memory, context switching) is significant. Add PgBouncer or RDS Proxy in front of Postgres to multiplex connections. This can reduce effective connection count by 10-100x.

3. Caching hot queries. If the same query is executed thousands of times per second with the same parameters, cache the result in Redis. Profile-by-ID lookups, configuration values, and materialized aggregations are prime candidates. A 95% cache hit rate means Postgres sees 20x fewer of those queries.

4. Upgrade the instance. If you are on a db.r6g.xlarge (4 vCPU), move to r6g.4xlarge (16 vCPU). This is a restart but no application changes. Buying yourself 4x headroom with a 15-minute maintenance window is a great trade.

5. VACUUM and bloat. Check dead-tuple counts in pg_stat_user_tables. If autovacuum is not keeping up, dead tuples accumulate, tables and indexes bloat, and queries scan more data than necessary. Tuning autovacuum parameters, or running a manual VACUUM FULL on bloated tables during a maintenance window (it takes an exclusive lock while it rewrites the table), can dramatically reduce I/O.

6. Partitioning. If one table dominates the load and is very large, table partitioning (by date, by customer) can reduce query scope. Instead of scanning a 500M-row table, you scan a 10M-row partition.

Only after exhausting all of these would I add read replicas, and even then, I would be careful to route only truly read-only queries to them and ensure the application handles replication lag gracefully.

Q12. What is the Outbox pattern and why is it important in microservice architectures?

The Outbox pattern solves one of the most insidious problems in microservices: how do you reliably update a database and publish an event without the two operations getting out of sync?

The problem: your service needs to save an order to the database AND publish an “OrderCreated” event to Kafka. If you write to the database first and then publish to Kafka, the Kafka publish might fail, and now the order exists but nobody knows about it. If you publish to Kafka first and then write to the database, the database write might fail, and now downstream services are processing an order that does not exist. There is no atomic operation that spans a database and a message broker.

The Outbox solution: instead of publishing directly to Kafka, you write both the business data AND the event message into the same database, in the same local transaction. The event goes into an “outbox” table. A separate process reads the outbox table and publishes events to Kafka. Because the business data and the outbox entry are in the same transaction, they are guaranteed to be atomically consistent: either both exist or neither does.

Two implementation approaches:

Polling publisher. A background process polls the outbox table periodically (e.g., every 100ms), reads unpublished events, publishes them to Kafka, and marks them as published. Simple but has latency (up to the polling interval) and can struggle with throughput at scale.

Change Data Capture (CDC). Use Debezium or a similar tool to tail the database’s transaction log (WAL in Postgres, binlog in MySQL). Every insert into the outbox table is captured as a change event and published to Kafka in near-real-time. This is the preferred approach for production systems because it has lower latency, does not add query load to the database, and captures events in the same order they were committed.

Both approaches deliver at least once: the publisher can crash after publishing an event but before recording that it did, so the same event may be sent again and consumers must be idempotent.

Why it matters so much in microservices: without the Outbox pattern (or something equivalent), microservices that need to coordinate state across boundaries inevitably drift out of sync. I have seen systems where 0.1% of orders were “lost”: they existed in the order database but the notification and fulfillment services never received the event. At 10,000 orders per day, that is 10 lost orders daily. The Outbox pattern makes that number zero (assuming the message broker is reliable).

The trade-off: eventual consistency. The downstream services will see the event after a delay (milliseconds to seconds). For most use cases, this is perfectly acceptable. For cases where it is not, you need synchronous communication (which couples the services) or a distributed transaction (which has availability and performance implications).
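A minimal sketch of the polling-publisher variant, using SQLite from the Python standard library so it runs anywhere. The table names, column names, and the `publish` callback are hypothetical stand-ins; a real system would use its own schema and a Kafka producer in place of the callback.

```python
import json
import sqlite3
import uuid


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER)")
    conn.execute(
        "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, event_type TEXT, "
        "payload TEXT, published INTEGER DEFAULT 0)"
    )


def create_order(conn, total_cents):
    """Write the order and its event in ONE local transaction."""
    order_id = str(uuid.uuid4())
    with conn:  # commits both inserts, or rolls both back on error
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "OrderCreated", json.dumps({"order_id": order_id})),
        )
    return order_id


def publish_pending(conn, publish):
    """One polling pass: publish unpublished events in insertion order."""
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for event_id, event_type, payload in rows:
        # If publish() raises, the row stays unpublished and is retried on the
        # next pass. A crash after publish() but before the UPDATE produces a
        # duplicate, which is why delivery is at-least-once.
        publish(event_type, payload)
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,)
            )
    return len(rows)
```

Calling `publish_pending` on a timer gives the polling behaviour described above; a CDC tool replaces that loop by reading the same outbox inserts from the transaction log.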
  • Does not understand the dual-write problem that motivates the pattern
  • Suggests using a distributed transaction instead (2PC across a database and Kafka is extremely fragile)
  • Cannot explain the difference between polling and CDC approaches
  • Does not mention idempotency requirements for the consumer
  • Thinks “just write to Kafka first” is fine

Follow-up: How do you ensure exactly-once processing when the consumer reads from the Outbox via Kafka?

True exactly-once processing in a distributed system is extremely hard. In practice, we achieve effective exactly-once by combining at-least-once delivery with idempotent processing.

At-least-once delivery means Kafka will deliver every message at least once, but may deliver duplicates (due to producer retries, consumer rebalances, or network issues). This is the default and most reliable delivery guarantee.

Idempotent consumers ensure that processing the same message twice produces the same result as processing it once. There are several strategies:

1. Deduplication table. Store the event ID (from the outbox) in a “processed events” table in the consumer’s database. Before processing an event, check if the ID already exists. If it does, skip it. Use a unique constraint on the event ID so that even concurrent processing of the same event results in one success and one constraint violation.

2. Idempotent operations. Design the write operations to be naturally idempotent. For example, “set balance to $100” is idempotent (doing it twice gives the same result), while “add $50 to balance” is not (doing it twice adds $100). Where possible, use absolute state updates rather than relative deltas.

3. Transactional outbox on the consumer side. The consumer reads the Kafka message, processes it, writes the result to its database, AND records the Kafka offset it has reached in that same database, all in one database transaction. If the transaction fails, the offset is not advanced, and the consumer reprocesses the message. If it succeeds, the offset is advanced and the message is not processed again. This gives you effective exactly-once between Kafka and the consumer’s database.

The Kafka-specific approach: Kafka supports idempotent producers (eliminates duplicate sends) and transactional producers (atomic writes across multiple partitions). Combined with read_committed consumer isolation, you get exactly-once semantics within Kafka itself. But the consumer’s side-effects (database writes, API calls) still need application-level idempotency.

The practical answer: use event IDs for deduplication, make operations idempotent where possible, and accept that in rare edge cases (consumer crash between processing and committing offset), you might process an event twice. If your operations are idempotent, this is harmless.

Advanced Interview Scenarios

These questions are designed to be uncomfortable. They target areas where the obvious answer is wrong, where textbook knowledge collapses under production reality, and where only engineers who have been paged at 3 AM can give a convincing answer. If you can handle these, you can handle any senior/staff-level interview.

Q13. Your Kubernetes pod is in CrashLoopBackOff and the logs show nothing obvious. Walk me through your debugging process, step by step.

“I would check the logs with kubectl logs and maybe restart the pod.” They treat Kubernetes as a black box and have no mental model of the pod lifecycle. They stop investigating after one tool returns nothing useful.
The first thing I do is resist the urge to immediately look at application logs, because CrashLoopBackOff with “nothing obvious” in the logs usually means the problem is infrastructure, not application code.

Step 1: Describe the pod. kubectl describe pod <name> is the single most informative command. I am looking at three things: the Events section at the bottom (which tells me why the container is being restarted), the Last State section (which shows the exit code: 137 is SIGKILL and usually means OOMKilled, 1 is an application error, 139 is a segfault), and whether the Reason field says “OOMKilled,” “Error,” or “ContainerCannotRun.”

Step 2: Check previous container logs. kubectl logs <pod> --previous shows the logs from the last crashed container. Most people forget this flag. The current container might have zero logs because it just started and immediately crashed, but the previous one might have logged the actual error before dying.

Step 3: Exit code analysis. Exit code 137 is the most common surprise. It means the kernel sent SIGKILL, almost always because the container exceeded its memory limit. The application might not log anything because SIGKILL cannot be caught, so the process gets no chance to write a final message. I have seen this catch entire teams for days. The fix is to either increase the memory limit in the pod spec or fix the memory leak. I check kubectl top pod and compare against the resource limits in the deployment YAML.

Step 4: Check resource quotas and limits. kubectl describe node <node> shows if the node itself is under memory or CPU pressure. If the node is in MemoryPressure, pods get evicted regardless of their own limits. Also check if there is a LimitRange or ResourceQuota in the namespace that is silently constraining the pod below what it needs.

Step 5: Init container failures. If the pod has init containers (common for migration scripts, sidecar injection, secret fetching), the init container might be the one failing. kubectl describe pod will show init container status separately. I have seen production outages where the Vault sidecar init container could not reach Vault, and the application container never started.

Step 6: Image pull issues. If the image tag is wrong or the registry requires authentication, the container cannot even start. This shows as “ImagePullBackOff” initially but can transition to CrashLoopBackOff in some edge cases with init containers.

Step 7: If I am truly stuck, I attach a debug container with kubectl debug (ephemeral containers, available in K8s 1.23+) that shares the pod’s namespaces, or I temporarily override the container command with sleep infinity in the deployment spec so the container stays running and I can exec in and debug interactively.

War Story: At a previous company, we had a microservice that went into CrashLoopBackOff every Monday morning at 6 AM. Logs showed nothing. Exit code was 137. Turned out the service loaded a config file from a ConfigMap that was updated by a CronJob every Sunday night, and the new config included a larger in-memory lookup table that pushed the container over its 512MB limit. The fix took 30 seconds (bump the limit to 1GB), but the diagnosis took two weeks because nobody checked the exit code and everyone was looking at application logic.
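The exit codes mentioned above follow the shell convention that a process killed by signal N reports 128 + N. A small Python helper (the function name is made up for this sketch, and it assumes a POSIX system where these signal numbers are defined) makes the decoding explicit:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable cause."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128  # shell convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return "application error (non-zero exit)"


for code in (1, 137, 139, 143):
    print(code, "->", describe_exit_code(code))
```

The code alone only says SIGKILL was delivered; the Reason field in kubectl describe pod is what confirms whether it was an OOM kill.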

Follow-up: The exit code is 137 but your container’s memory limit is 4GB and your app only uses 800MB according to metrics. What is going on?

This is one of my favorite gotchas because it reveals whether someone understands how container memory accounting actually works in Linux.

Container memory is not just heap. The memory metric you see in your APM (800MB) is typically the application’s heap usage. But the kernel’s cgroup memory counter, which is what enforces the 4GB limit, includes everything: heap, memory-mapped files, page cache used by the process, native allocations from C libraries (like OpenSSL buffers or glibc malloc arenas), and thread stacks.

The usual suspects:
  • Native memory in JVM. If this is a Java service, the JVM heap might be 800MB, but Metaspace, thread stacks (1MB per thread by default, times 200 threads = 200MB), JIT compiled code cache, direct ByteBuffers (Netty loves these), and native memory from JNI calls can easily add another 1-2GB. Use -XX:NativeMemoryTracking=summary and jcmd <pid> VM.native_memory to see the full picture.
  • Page cache thrashing. If the application reads large files or does memory-mapped I/O, the kernel counts those pages against the cgroup. A service that sequentially reads a 3GB log file for processing can trigger OOM even though its heap is tiny.
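  • Sidecar containers. Kubernetes enforces memory limits per container, so first confirm which container was actually killed and whether the 4GB figure is the application container’s own limit or a total for the pod. If Istio’s Envoy proxy or a logging sidecar shares that budget, their memory comes out of it. Envoy alone uses 40-100MB. A Fluentd sidecar can use 200-500MB depending on buffer configuration. So your “4GB pod” might only have 3GB for the application.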
I would run cat /sys/fs/cgroup/memory/memory.usage_in_bytes inside the container right before it crashes (on cgroup v2 hosts the file is /sys/fs/cgroup/memory.current), or look at the cAdvisor/Prometheus container_memory_usage_bytes metric, to see the real cgroup usage, not just the application-reported heap.
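A small Python sketch of the same check, runnable inside the container. The two paths are the standard cgroup v2 and v1 memory counters; the function name and the fall-through order are choices made for this example.

```python
from pathlib import Path
from typing import Optional

CGROUP_V2 = "/sys/fs/cgroup/memory.current"
CGROUP_V1 = "/sys/fs/cgroup/memory/memory.usage_in_bytes"


def read_cgroup_memory_bytes(candidates=(CGROUP_V2, CGROUP_V1)) -> Optional[int]:
    """Return the cgroup's current memory usage in bytes, or None if unavailable."""
    for path in candidates:
        try:
            return int(Path(path).read_text().strip())
        except (OSError, ValueError):
            continue  # file missing or unreadable: try the next layout
    return None


usage = read_cgroup_memory_bytes()
if usage is None:
    print("no cgroup memory counter found (not running in a Linux container?)")
else:
    print(f"cgroup memory usage: {usage / 2**20:.0f} MiB")
```

Comparing this number with the heap figure from the APM shows how much of the limit is being consumed outside the heap.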

Follow-up: How do you set resource requests and limits properly? Most teams get this wrong.

Most teams do get this wrong, and the failure modes are opposite depending on the direction.

Requests too low: the scheduler packs too many pods onto a node (it thinks there is room), the node runs out of actual memory, and the kernel starts OOM-killing pods. You get cascading failures because one greedy pod starves its neighbors.

Limits too low: your pod gets OOMKilled during traffic spikes even though the node has plenty of free memory. You are artificially constraining the application.

Limits too high (or no limits): one misbehaving pod consumes all node memory, killing unrelated pods. This is the “noisy neighbor” problem.

My approach: set requests to the P95 steady-state usage (what the app uses under normal load), and set limits to 1.5-2x the request (headroom for spikes). For CPU, I often set requests but no limits, because CPU is compressible: a pod that hits its CPU limit is throttled, not killed, and a pod running above its request simply competes for spare cycles. Memory is not compressible: exceed the limit and you die. So memory limits are critical, CPU limits are debatable.

The tool I recommend: run the application under realistic load for a few days, then use the Vertical Pod Autoscaler (VPA) in recommendation mode. It analyzes actual usage and suggests request/limit values. Do not use it in auto-update mode for production initially. Let it recommend, then you decide.
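The “P95 request, 1.5-2x limit” rule can be written down as a few lines of Python. This is only a sketch of the arithmetic, not what VPA does internally; the function names and the nearest-rank percentile method are assumptions for this example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def suggest_memory(samples_mib, headroom=1.5):
    """Request = P95 of observed usage; limit = request * headroom."""
    request = percentile(samples_mib, 95)
    return {"request_mib": request, "limit_mib": math.ceil(request * headroom)}


# Hypothetical usage samples in MiB collected under realistic load.
observed = [610, 640, 655, 700, 720, 735, 750, 790, 805, 980]
print(suggest_memory(observed))
print(suggest_memory(observed, headroom=2.0))
```

Feeding it a few days of per-minute samples gives a starting point to compare against the VPA recommendation.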

Q14. Your team adopted GraphQL six months ago. The frontend team loves it. But your database is falling over. What is happening and how do you fix it?

“GraphQL is slow, we should switch back to REST.” They blame the tool without understanding the actual mechanism. They have never profiled a GraphQL server under real load.
This is the classic GraphQL N+1 problem, and it catches almost every team that adopts GraphQL without understanding how resolvers interact with data fetching.

What is happening: GraphQL resolves fields individually. If a client queries { users { posts { comments } } } and there are 50 users, each with 20 posts, here is what happens. The users resolver fires one query: SELECT * FROM users (1 query). Then the posts resolver fires for each user: SELECT * FROM posts WHERE user_id = ? (50 queries). Then the comments resolver fires for each post: SELECT * FROM comments WHERE post_id = ? (50 * 20 = 1,000 queries). Total: 1,051 queries for one GraphQL request. Scale that to 100 requests per second and you are hitting the database with roughly 105,000 queries per second. No database survives that.

Fix 1: DataLoader (essential, non-negotiable). DataLoader batches and deduplicates data-fetching calls within a single request. Instead of 50 individual SELECT ... WHERE user_id = ? calls, DataLoader collects all 50 user IDs and issues one SELECT ... WHERE user_id IN (?, ?, ..., ?) call. Those 1,051 queries collapse to 3 queries. At Facebook, where GraphQL was born, DataLoader is not optional; it is part of the architecture. Every team adopting GraphQL should implement DataLoader from day one.

Fix 2: Query depth and complexity limiting. Without limits, a client can craft a deeply nested query that causes exponential resolver calls. Implement query complexity analysis that assigns a cost to each field, and reject queries that exceed a budget. Libraries like graphql-query-complexity or graphql-cost-analysis handle this. At Shopify, they assign point costs to every field in their storefront API and reject queries above 1,000 points.

Fix 3: Persisted queries. Instead of letting clients send arbitrary query strings, pre-register allowed queries at build time and have clients send a query hash. This prevents malicious or accidental query abuse, enables server-side query optimization, and makes caching feasible (you can cache by query hash + variables).

Fix 4: Look-ahead optimization. Use GraphQL’s info parameter in resolvers to inspect what fields the client actually requested. If the client only asked for users { name, email } and did not request posts, do not eagerly load posts. In SQL terms, this means your root resolver can dynamically build a JOIN or a SELECT with only the necessary columns based on the requested fields.

War Story: I worked with a team that migrated from REST to GraphQL and their average database query count per API request went from 3 to 47. Their p99 latency went from 80ms to 2.3 seconds. They thought GraphQL was inherently slow. It took us one afternoon to add DataLoader and query complexity limits. The query count dropped to 4 per request, and p99 went back to 90ms. The lesson: GraphQL is not slow; naive resolver implementations are slow.
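The query-count arithmetic from the N+1 explanation can be reproduced with a tiny in-memory simulation. This is not the DataLoader library itself, only a sketch of the batching idea it implements; the `CountingDB` class and its methods are hypothetical and stand in for SQL calls.

```python
USERS = list(range(50))
# 20 posts per user, one comment per post; values are arbitrary placeholders.
POSTS = {u: [u * 100 + i for i in range(20)] for u in USERS}
COMMENTS = {p: [f"comment-on-{p}"] for posts in POSTS.values() for p in posts}


class CountingDB:
    """Fake database that counts how many queries it receives."""

    def __init__(self):
        self.queries = 0

    def users(self):                      # SELECT * FROM users
        self.queries += 1
        return USERS

    def posts_for_user(self, user_id):    # ... WHERE user_id = ?
        self.queries += 1
        return POSTS[user_id]

    def posts_for_users(self, user_ids):  # ... WHERE user_id IN (...)
        self.queries += 1
        return {u: POSTS[u] for u in user_ids}

    def comments_for_post(self, post_id):   # ... WHERE post_id = ?
        self.queries += 1
        return COMMENTS[post_id]

    def comments_for_posts(self, post_ids):  # ... WHERE post_id IN (...)
        self.queries += 1
        return {p: COMMENTS[p] for p in post_ids}


def resolve_naive(db):
    """One query per parent object at every level of the tree."""
    for user in db.users():
        for post in db.posts_for_user(user):
            db.comments_for_post(post)
    return db.queries


def resolve_batched(db):
    """Collect all keys for a level, then issue one IN (...) query per level."""
    users = db.users()
    posts_by_user = db.posts_for_users(users)
    all_posts = [p for posts in posts_by_user.values() for p in posts]
    db.comments_for_posts(all_posts)
    return db.queries


print("naive:", resolve_naive(CountingDB()))      # 1 + 50 + 1000
print("batched:", resolve_batched(CountingDB()))  # one query per level
```

A real DataLoader gets the same effect without restructuring resolvers: each resolver calls `load(key)`, and the loader gathers the keys requested in the same tick into one batch function call.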

Follow-up: How do you cache GraphQL responses? You cannot use standard HTTP caching because everything is a POST to a single endpoint.

This is GraphQL’s biggest operational weakness compared to REST, and there are several layers to address it.

Automatic Persisted Queries (APQ). The client sends a SHA-256 hash of the query instead of the full query string. On first request, the server does not recognize the hash and asks the client to resend with the full query (which gets cached). Subsequent requests use the hash. This does not solve caching directly, but it makes GET-based requests possible: the query hash plus variables can be encoded as a URL, enabling CDN caching for public data.

Response caching at the resolver level. Cache individual resolver results in Redis, not entire GraphQL responses. Because different queries may request user(id: 5) as part of completely different query shapes, caching at the field level gives you better hit rates. Apollo Server’s @cacheControl directive lets you set per-field TTLs, and a cache plugin stores/retrieves resolved values.

Normalized caching on the client. Apollo Client and urql maintain a normalized cache keyed by __typename:id. When a mutation updates a user, the client automatically updates that user everywhere it appears in any cached query result. This means the client often does not need to re-fetch after mutations, reducing server load.

CDN-level with edge caching. For public, non-authenticated queries (product catalogs, content), use persisted queries over GET and cache at the CDN layer. Fastly and Cloudflare both support this pattern. For authenticated queries, you are out of luck with CDN caching unless you segment by role or user tier.

The hard truth: if your GraphQL API serves highly personalized, authenticated data with complex nesting, caching is significantly harder than REST. That is a real trade-off you accept when choosing GraphQL.
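To make the APQ idea concrete, here is a sketch of how a client could build the cacheable GET URL. The `extensions.persistedQuery` shape follows the Apollo APQ convention; the endpoint, the helper name, and the choice to sort variable keys (so equal requests produce identical URLs and therefore identical cache keys) are assumptions for this example.

```python
import hashlib
import json
from urllib.parse import urlencode


def apq_get_url(endpoint, query, variables):
    """Build a GET URL carrying the query's SHA-256 hash instead of its text."""
    sha = hashlib.sha256(query.encode("utf-8")).hexdigest()
    params = {
        # Stable key order and separators keep the URL deterministic.
        "variables": json.dumps(variables, sort_keys=True, separators=(",", ":")),
        "extensions": json.dumps(
            {"persistedQuery": {"version": 1, "sha256Hash": sha}},
            separators=(",", ":"),
        ),
    }
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint and query, for illustration only.
print(apq_get_url(
    "https://api.example.com/graphql",
    "{ products(first: $first) { id name } }",
    {"first": 10},
))
```

Because the URL is a pure function of the query hash and variables, a CDN can treat it like any other GET resource.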

Follow-up: When would you argue against adopting GraphQL, even when the frontend team is pushing for it?

I would push back when the costs outweigh the benefits for the specific context.

When you have one client type with stable data needs. If you have a single web frontend and the API shapes rarely change, REST with well-designed endpoints gives you HTTP caching, simpler monitoring, and less operational complexity. GraphQL’s value is letting multiple clients (mobile, web, watch, third-party) each fetch exactly the data they need. One client means one “shape,” so just design the REST endpoint to return that shape.

When your team has no GraphQL operational experience. The learning curve is real: DataLoader, query complexity analysis, persisted queries, N+1 prevention, schema design, resolver testing. A team that ships a naive GraphQL server without these will have worse performance than REST, not better. I have seen this firsthand.

When you need aggressive HTTP caching. REST’s biggest superpower is native HTTP caching. ETags, Cache-Control headers, CDN caching: all of this works out of the box with GET endpoints. GraphQL requires significant additional work to achieve equivalent caching.

When the data model is not graph-shaped. GraphQL shines with interconnected, relational data. If your API is mostly flat CRUD operations (create a form submission, fetch a list of items), the overhead of a GraphQL schema, resolvers, and tooling is not justified.

Q15. You are asked to implement a distributed lock. Your first thought is to use Redis with SET NX EX. Why is this harder than it looks, and what goes wrong in production?

“Just use SETNX with a TTL and you’re done.” They describe the happy path and have never thought about what happens when a lock holder crashes, clocks drift, or the Redis node fails. They may not even know that Redlock exists or why it was invented.
The naive Redis lock (SET lock_key unique_value NX EX 30) works beautifully on a whiteboard and fails in production in at least three ways.

Problem 1: Lock expiry while the holder is still working. You acquire the lock with a 30-second TTL. Your processing takes 35 seconds (GC pause, slow downstream call, network partition to the DB). The lock expires after 30 seconds, another process acquires it, and now two processes are in the critical section simultaneously. The “fencing token” pattern mitigates this: the lock returns a monotonically increasing token, and the resource being protected (e.g., the database) rejects operations with a token lower than the highest it has seen. But this requires the downstream resource to cooperate, which is not always possible.

Problem 2: Redis failover. You write the lock to a Redis primary. Before the lock replicates to the replica, the primary crashes. The replica is promoted to primary, and the lock does not exist there. Another process acquires the same lock. You now have two lock holders. This is not theoretical: Redis replication is asynchronous by default. This exact failure mode is what motivated the Redlock algorithm.

Problem 3: Clock drift and Redlock. The Redlock algorithm (acquire the lock on a majority, N/2+1, of N independent Redis nodes) was designed to address Problem 2. But Martin Kleppmann showed in his well-known critique that it relies on the assumption that process pauses and clock drift are bounded. If a process pauses for longer than the lock TTL (GC, swap, CPU scheduling), Redlock’s safety guarantee breaks. Salvatore Sanfilippo (Redis creator) and Kleppmann had a public debate about this, and the conclusion is: Redlock provides better safety than a single Redis node, but it is not a substitute for a consensus-based system if you need absolute correctness.

What I actually recommend:
  • For efficiency locks (preventing duplicate work, rate limiting) where occasional double-execution is annoying but not catastrophic: single-node Redis SET NX EX is fine. Just accept the rare failure and make your operations idempotent.
  • For correctness locks (financial transactions, inventory decrements) where double-execution causes real damage: use a consensus-based system like etcd or ZooKeeper. etcd’s lease-based locks use Raft consensus and provide linearizable guarantees. The performance is lower (~2-10ms per operation vs sub-ms for Redis), but the correctness guarantee is real.
  • For database-specific locking: PostgreSQL’s SELECT ... FOR UPDATE or advisory locks are underrated. If your critical section is a database transaction anyway, use the database’s own locking. No external coordination needed.
War Story: A payments team I consulted for used Redis SETNX to prevent duplicate charge processing. It worked perfectly for 11 months. Then a Redis failover during a peak traffic period caused 23 duplicate charges in 4 minutes. Total customer impact: $47,000 in overcharges. They switched to PostgreSQL advisory locks the next week. The lock acquisition was 3ms slower, but correctness was absolute. The CTO’s quote: “I will trade 3ms for not calling 23 customers to apologize.”

Follow-up: Walk me through the Redlock algorithm and why Martin Kleppmann says it is fundamentally flawed.

Redlock algorithm: to acquire a lock, the client records the current time, then sends SET NX EX to N independent Redis nodes (typically 5). If it successfully sets the lock on N/2+1 (majority) nodes, AND the total time to acquire is less than the lock TTL, the lock is acquired. The effective lock validity is TTL minus the time spent acquiring.

Kleppmann’s core argument is about time. Redlock assumes that processes do not pause for longer than the lock validity period. But in a garbage-collected language, a GC pause can last hundreds of milliseconds or even seconds. Here is the attack scenario: Client A acquires the Redlock. Client A enters a long GC pause. The lock TTL expires while A is paused. Client B acquires the Redlock. Client A’s GC pause ends. Client A still believes it holds the lock (nothing tells a paused client that its lock has expired). Both A and B now execute the critical section.

Why fencing tokens fix this (but Redlock does not provide them natively): a fencing token is a monotonically increasing number returned with each lock acquisition. Say A got token 34 when it acquired the lock, and B got token 35 when it acquired it later. Once B has written with token 35, the storage system rejects A’s late write because 34 < 35. But if you have a storage system that supports fencing tokens, Kleppmann argues, you can use that same system for locking, and you do not need Redis at all.

Sanfilippo’s counterargument was that clock drift and process pauses are bounded in practice, and Redlock is designed for “practical distributed systems” not “theoretical adversarial environments.” Both are right; it depends on your failure tolerance.
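The fencing-token check is small enough to show in full. This is an in-memory simulation under stated assumptions: `LockService` and `FencedStore` are hypothetical classes, lock expiry is elided, and the starting token 34 is chosen only to match the numbers in the scenario above.

```python
class LockService:
    """Hands out a strictly increasing token with every acquisition."""

    def __init__(self, first_token=34):
        self._next = first_token

    def acquire(self):
        token = self._next
        self._next += 1
        return token


class FencedStore:
    """Storage that rejects writes carrying a token older than one it has seen."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale lock holder: refuse the write
        self.highest_token = token
        self.value = value
        return True


lock = LockService()
store = FencedStore()

token_a = lock.acquire()   # A gets 34, then stalls in a long GC pause
token_b = lock.acquire()   # A's lock expired; B gets 35
print(store.write(token_b, "written by B"))  # accepted
print(store.write(token_a, "written by A"))  # rejected: 34 < 35
```

The safety comes from the storage layer enforcing the comparison, not from the lock service, which is why the protected resource has to cooperate.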

Follow-up: If you need a truly correct distributed lock today, what do you use and what is the latency cost?

etcd with its built-in lock API. etcd uses Raft consensus, so the lock is linearizable. Acquiring a lock takes 2-10ms (one Raft round trip for the write). The lock is tied to an etcd lease: if the holder crashes, the lease expires and the lock is released automatically. The etcd client library handles lease keep-alives transparently.

ZooKeeper ephemeral nodes. Create an ephemeral sequential node under a lock path. The client with the lowest-numbered node holds the lock. ZooKeeper’s session mechanism handles failure detection: if the holder crashes, the ephemeral node is deleted after the session timeout (configurable, typically 6-30 seconds). Write latency is similar to etcd: 2-10ms.

The latency cost compared to Redis: Redis lock acquisition is sub-millisecond. etcd/ZooKeeper is 2-10ms. That is a 10-20x difference. For a lock you acquire once per user request, that 5ms is negligible in a 200ms total request budget. For a lock you acquire thousands of times per second in a hot loop, it matters, and you should reconsider whether a distributed lock is the right abstraction (maybe partition the workload so each partition has a local leader that does not need a distributed lock).

The practical middle ground I have used: Redis for the hot-path distributed rate limiting (speed matters, occasional double-count is tolerable) and etcd for the cold-path leader election / distributed cron (correctness matters, 5ms is nothing). Different tools for different correctness requirements within the same system.

Q16. You have a microservice that works perfectly in integration tests but fails intermittently in production with 502 errors. Tests pass, staging passes, production fails. What is your mental model for this class of bug?

“There must be a configuration difference between staging and production.” They throw out generic troubleshooting steps without a structured model for why environments diverge. They have not internalized the categories of production-only bugs.
Production-only bugs that pass all tests are my favorite category to debug because they reveal the gap between your test environment’s model of reality and actual reality. I think about them in five categories.

Category 1: Scale effects. The most common cause. Your integration tests run at 10 requests per second. Production runs at 10,000. Connection pool exhaustion, thread pool saturation, database lock contention, and memory pressure only manifest at scale. A 502 specifically means the upstream service (your microservice) returned something the load balancer or API gateway considered invalid, which often means your service crashed, hung, or closed the connection unexpectedly. I check: is the pod being OOMKilled? Is the connection pool maxed out? Are threads blocked on a slow downstream call?

Category 2: Timing and concurrency. Race conditions do not reproduce in single-threaded tests. Two requests hitting the same row with a read-modify-write pattern will occasionally collide in production but never in integration tests that run sequentially. I look for non-atomic operations that assume sequential execution: “check if exists, then insert” without a unique constraint, balance updates without optimistic locking, cache invalidation races.

Category 3: Data shape differences. Test data is clean. Production data is a horror show of edge cases accumulated over years. A field that is “always” a string is sometimes null. A user ID that is “always” a UUID is sometimes a legacy integer from the 2018 migration. A JSON payload that “always” has a nested object sometimes has that field as an empty string because a client from three API versions ago sends it that way. The 502 might be an unhandled null pointer exception on a data shape your tests never exercise.

Category 4: Infrastructure differences that staging does not replicate. Staging often has one replica. Production has twelve across three availability zones. Network calls that are same-node in staging are cross-AZ in production, adding 1-2ms of latency per hop. Timeouts that are generous enough for same-node latency start failing for cross-AZ calls. Also: staging often does not have the same TLS configuration, load balancer rules, network policies, or sidecar proxies (Istio/Envoy) that production has.

Category 5: Dependency behavior under load. Your service’s tests mock external dependencies. But the real dependency’s behavior changes under load. A third-party API that responds in 50ms during staging hours responds in 5 seconds during peak production hours. Your 10-second timeout means you wait 5 seconds, get a response, all is well. But each request now holds its connection 100x longer (5 seconds instead of 50ms), so at the same QPS your connection pool needs 100x more connections than usual. That causes pool exhaustion, which causes 502s on unrelated requests.

My debugging protocol for a production-only 502:
  1. kubectl describe pod — is the pod healthy? Restart count?
  2. Load balancer access logs — which upstream returned the 502? Was it a timeout or a connection reset?
  3. Distributed tracing (Jaeger/Datadog APM) — where did the failing requests spend their time?
  4. Thread dump / goroutine dump — are threads blocked? On what?
  5. Compare QPS at time of failures vs baseline — is this a load-triggered issue?
War Story: We had a Go microservice that passed every test, worked in staging for months, and threw 502s in production every day between 2-4 PM. Turned out a downstream Elasticsearch cluster had a daily merge cycle at 2 PM that increased its p99 latency from 20ms to 800ms. Our service’s HTTP client had a 1-second timeout — fine. But our connection pool had a max of 50 connections, and at 800ms per request instead of 20ms, we needed 40x more connections to handle the same QPS. The pool exhausted at exactly the time Elasticsearch slowed down. Fix: increase pool size to 200 and add circuit breaker on the Elasticsearch calls. Time to diagnose: 3 days. Time to fix: 20 minutes.
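The 40x figure in the story follows from Little's Law (average in-flight requests = arrival rate x time in system). A minimal sketch of that arithmetic follows; the request rate of 200 QPS is a hypothetical value chosen for illustration and is not part of the story, and using p99 latency gives a pessimistic estimate.

```python
import math


def connections_needed(qps: float, latency_ms: float) -> int:
    """Little's Law: concurrent connections ~= request rate x time each request holds one."""
    return math.ceil(qps * latency_ms / 1000)


QPS = 200        # hypothetical load; the story does not state the real QPS
POOL_MAX = 50    # pool limit from the story

normal = connections_needed(QPS, 20)     # Elasticsearch at 20 ms
degraded = connections_needed(QPS, 800)  # Elasticsearch at 800 ms during the merge cycle

print(f"normal: {normal}, degraded: {degraded}, ratio: {degraded // normal}x")
print("pool exhausted during merge:", degraded > POOL_MAX)
```

At this assumed load the pool of 50 is comfortable at 20ms and exhausted at 800ms, while a pool of 200 would absorb the slowdown. The circuit breaker still matters because a longer slowdown would outgrow any fixed pool size.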

Follow-up: How do you build test environments that actually catch these production-only bugs?

You cannot make a test environment identical to production. That is economically and technically impossible. Instead, I focus on narrowing the gap in the areas that matter most.

Load testing against staging with production traffic shape. Not synthetic load with uniform distribution: replay actual production traffic patterns (anonymized) against staging. Tools like GoReplay or Speedscale capture and replay real traffic. This catches scale effects and data shape issues.

Chaos engineering. Use Chaos Mesh or Litmus in staging to inject failures: kill pods randomly, add network latency between services, corrupt DNS, slow down the database. If your service handles these gracefully in chaos tests, it will handle the milder versions that happen in production.

Shadow traffic / dark launch. Fork production traffic to the new version alongside the stable version. Compare responses without serving the new version’s responses to users. This gives you production-scale, production-data testing without risk. The cost is double compute for the shadowed service.

Contract testing with production data samples. Periodically export anonymized production data samples and run them through your test suite. If a test that passes with synthetic data fails with production data, you have found a data shape gap.

The honest answer: you will never catch everything. That is why canary deployments, circuit breakers, and fast rollback exist. The goal is not zero production bugs. It is fast detection and minimal blast radius.
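The "compare responses" step of shadow traffic can be sketched as a field-by-field diff that ignores values expected to change on every request. This is an illustrative sketch only: the field names in `VOLATILE_FIELDS` and the sample payloads are assumptions, not taken from any particular system.

```python
from typing import Any, Dict, Tuple

# Fields that legitimately differ between two servers (hypothetical names).
VOLATILE_FIELDS = {"request_id", "timestamp"}
MISSING = "<missing>"


def diff_responses(stable: Dict[str, Any], candidate: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (stable_value, candidate_value)} for every field that disagrees."""
    diffs = {}
    for key in sorted((stable.keys() | candidate.keys()) - VOLATILE_FIELDS):
        left = stable.get(key, MISSING)
        right = candidate.get(key, MISSING)
        if left != right:
            diffs[key] = (left, right)
    return diffs


stable_resp = {"request_id": "a1", "total": "19.99", "currency": "USD"}
shadow_resp = {"request_id": "b2", "total": 19.99, "currency": "USD"}

# The candidate changed the type of "total" from string to number: a data shape gap.
print(diff_responses(stable_resp, shadow_resp))
```

Only the stable response is returned to the user; the diff is logged or counted so that mismatches show up before the new version serves real traffic.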

Q17. Your team wants to add a service mesh (Istio). You think it is a mistake. Make the argument against it.

“Service meshes add latency.” They give a one-liner without quantifying the cost, understanding the operational burden, or proposing alternatives. Or worse, they cannot argue against a popular technology at all because they have never been in a position to push back on architectural decisions.
I want to be clear: I am not anti-service-mesh. I am anti-premature-service-mesh. The argument depends entirely on where your team is on the maturity curve.

Argument 1: Operational complexity is extreme for small teams. Istio adds 30+ CRDs to your cluster (VirtualService, DestinationRule, Gateway, PeerAuthentication, AuthorizationPolicy, and more). Each has its own configuration surface. A misconfigured DestinationRule can silently route all traffic to a single pod. A PeerAuthentication set to STRICT before all services have sidecars will cause immediate outages. I have watched a team of 6 engineers spend 40% of their operational capacity managing Istio instead of building product features. If you have fewer than 20 microservices and fewer than 10 backend engineers, the cost-benefit is almost certainly negative.

Argument 2: Sidecar resource overhead is real. Each Envoy sidecar consumes 40-100MB of memory and adds 1-3ms of latency per hop. In a request that traverses 5 services, that is 5-15ms added to every request, and 200-500MB of cluster memory consumed per 5 pods. At 200 pods, the sidecars alone consume 8-20GB of cluster memory. For a team running on a constrained Kubernetes cluster, this is not trivial.

Argument 3: You can get 80% of the value without a mesh.
  • mTLS: Use cert-manager with SPIFFE identities, or just use network policies in Kubernetes to restrict pod-to-pod communication. If you are in a single VPC and trust your network boundary, mTLS between internal services may be security theater for your threat model.
  • Observability: Instrument with OpenTelemetry directly in your application code. You get more meaningful traces (business-level spans, not just HTTP hops) with less infrastructure overhead.
  • Retries and circuit breaking: Libraries like resilience4j (Java), Polly (.NET), or go-retryablehttp handle this at the application layer with more nuanced control than mesh-level policies.
  • Traffic splitting for canaries: Use Argo Rollouts or Flagger, which integrate with your existing ingress controller without requiring a full mesh.
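To make the "retries and circuit breaking at the application layer" bullet concrete, here is a minimal circuit breaker sketch in Python. It is not the API of resilience4j, Polly, or go-retryablehttp; the class name, thresholds, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests do not need to sleep
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call re-opens the circuit; otherwise open at the threshold.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each downstream client call in `breaker.call(...)` means a slow or failing dependency is rejected quickly instead of tying up threads and connections, which is the behavior a mesh-level outlier policy would otherwise provide.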
Argument 4: Debugging gets harder, not easier. When something goes wrong with Istio, you are debugging at the Envoy configuration level: reading Envoy’s xDS config dumps, understanding how Pilot translates CRDs to Envoy config, and tracing through sidecar proxy logs. This requires a specialized skill set. I have seen P1 incidents where the root cause was a stale Envoy config pushed by Istio’s control plane, and nobody on the team understood Envoy’s internals well enough to diagnose it within the SLA.

When I WOULD support Istio: 50+ microservices, dedicated platform engineering team, zero-trust security mandate from compliance, multi-cluster or multi-cloud networking requirements, and the team has already outgrown library-based resilience patterns.

War Story: A startup I advised (15 engineers, 12 microservices) adopted Istio because “Netflix uses it.” Within three months, they had two P1 incidents caused by Istio misconfigurations, their deploy pipeline time doubled because of sidecar injection and readiness checks, and three engineers had spent a combined 6 weeks learning Istio instead of shipping features. They ripped it out, replaced mTLS with network policies, added OpenTelemetry instrumentation, and used Argo Rollouts for canary deploys. Reliability improved. Shipping velocity doubled.

Follow-up: At what team size and service count does a service mesh start making sense?

There is no universal threshold, but the signals I look for:

50+ microservices with heterogeneous languages. When you have services in Go, Java, Python, and Node.js, implementing consistent retry policies, circuit breaking, and mTLS in four different library ecosystems becomes a maintenance nightmare. The mesh standardizes this at the infrastructure layer.

Dedicated platform team (3+ engineers). If nobody’s job is to operate the mesh, it will rot. The mesh needs care: upgrades, CRD migrations, Envoy version bumps, debugging config drift. Without a platform team, the mesh becomes tech debt.

Compliance mandates for mTLS everywhere. If your SOC2 or PCI audit requires encrypted service-to-service communication with certificate rotation, a mesh handles this automatically. Doing it application-by-application is error-prone and audit-unfriendly.

Multi-cluster or multi-cloud networking. When you need services in Cluster A to transparently call services in Cluster B across cloud providers, a mesh’s cross-cluster service discovery and traffic management becomes genuinely valuable.

As a rough heuristic: under 20 services, almost never. 20-50 services, evaluate carefully. 50+ services with a platform team, likely yes.

Q18. You inherit a system using the Saga pattern for a distributed transaction that spans three microservices. Orders are occasionally being left in an inconsistent state. What is going wrong?

“The saga is not implemented correctly.” They cannot name the specific failure modes of sagas or distinguish between choreography and orchestration approaches. They may not even know what compensating transactions are.
Sagas are the standard replacement for distributed transactions in microservices, but they have subtle failure modes that most teams discover in production, not in design reviews.

First, I need to know which saga pattern is in use. Choreography (each service publishes events and the next service reacts) or orchestration (a central orchestrator tells each service what to do). This matters because the failure modes are different.

Failure mode 1: Compensating transactions that do not fully compensate. The saga goes: Create Order -> Reserve Inventory -> Charge Payment. If payment fails, the compensating transaction must release the inventory reservation. But what if the “release inventory” compensating call also fails? Now you have reserved inventory that nobody is going to buy, and an order that was never completed. The fix: compensating transactions must be retried with exponential backoff until they succeed. They should be idempotent so retries are safe. And there must be a dead-letter mechanism for compensations that fail after N retries, alerting a human.

Failure mode 2: Choreography without a global view. In choreography-based sagas, no single component knows the overall state of the saga. If the “Inventory Reserved” event is published but the Payment Service is down, the saga stalls. Nobody notices because nobody is tracking it. The event sits in the queue. The inventory is reserved indefinitely. The fix: implement saga timeouts. If a saga has not completed within a defined SLA (say 30 seconds), a sweeper process detects the stuck saga and triggers compensations. This is effectively an orchestrator for failure cases, even in a choreography-based saga.

Failure mode 3: Non-idempotent saga steps. A network retry causes the “Charge Payment” step to execute twice. The customer is double-charged. This is the most common production bug in sagas. The fix: every saga step must be idempotent. Use idempotency keys. The Payment Service should accept the same saga_id + step_id combination and return the previous result instead of processing again.

Failure mode 4: Ordering violations in choreography. Event A and Event B are published to different partitions or topics. The consumer processes B before A. Now the saga is in an unexpected state. For example, “PaymentCharged” arrives before “InventoryReserved” at a service that expects them in order. The fix: design each saga step to handle events arriving in any order. Use a state machine per saga instance that transitions correctly regardless of event arrival order.

Failure mode 5: The “semantic rollback” problem. You cannot un-send an email. You cannot un-call an external API that triggers a real-world action. If the saga needs to compensate after a step that had irreversible side effects, the best you can do is a corrective action (send a “sorry, order cancelled” email), not a true rollback. This must be designed explicitly for each step.

What I would do for the inherited system:
  1. Add a saga_state table that tracks each saga instance: saga_id, current step, status (in_progress, completed, compensating, failed), last_updated timestamp.
  2. Add a sweeper that detects sagas stuck in any state for longer than the SLA and triggers compensations or alerts.
  3. Add idempotency keys to every saga step.
  4. Add compensating transaction retry logic with dead-letter alerting.
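Steps 1 and 2 above can be sketched with an in-process SQLite table. This is a minimal illustration, not a production design: the column types, the 30-second SLA, and the `sweep` function are assumptions, and a real sweeper would also publish the compensation commands and route stuck compensations to a dead-letter queue.

```python
import sqlite3

SLA_SECONDS = 30  # example SLA; the right value depends on the saga


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS saga_state (
            saga_id      TEXT PRIMARY KEY,
            current_step TEXT NOT NULL,
            status       TEXT NOT NULL CHECK (
                status IN ('in_progress', 'completed', 'compensating', 'failed')
            ),
            last_updated REAL NOT NULL
        )
        """
    )


def sweep(conn: sqlite3.Connection, now: float, sla_seconds: float = SLA_SECONDS) -> list:
    """Mark sagas stuck in 'in_progress' past the SLA as 'compensating' and return their ids."""
    cutoff = now - sla_seconds
    stuck = [
        row[0]
        for row in conn.execute(
            "SELECT saga_id FROM saga_state "
            "WHERE status = 'in_progress' AND last_updated < ? ORDER BY saga_id",
            (cutoff,),
        )
    ]
    for saga_id in stuck:
        conn.execute(
            "UPDATE saga_state SET status = 'compensating', last_updated = ? WHERE saga_id = ?",
            (now, saga_id),
        )
    conn.commit()
    return stuck
```

Running `sweep` on a schedule gives even a choreography-based saga a single place that notices stalls, which is the "orchestrator for failure cases" described earlier.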
War Story: An e-commerce team I worked with had a choreography-based saga for order processing. Every few days, 5-10 orders would show as “confirmed” to the customer but the warehouse never received the fulfillment event. Root cause: the Kafka consumer for the fulfillment service occasionally lagged, and the saga had no timeout mechanism. Orders would sit in “Inventory Reserved, awaiting fulfillment” state forever. They added a 60-second saga timeout with automatic compensation and a dead-letter queue for human review. Stuck orders went from 5-10 every few days to zero, with the occasional alert for genuinely weird cases that a human needed to triage.
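The idempotency fix from failure mode 3 (accept the same saga_id + step_id and return the previous result) can be sketched as below. The `PaymentService` class and its in-memory dictionary are hypothetical; a real service would keep the keys in durable storage and make the check-and-record step atomic, for example with a unique constraint.

```python
from typing import Dict, Tuple


class PaymentService:
    """Toy payment step that is safe to retry for the same (saga_id, step_id)."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, str], dict] = {}  # idempotency key -> stored result
        self.charges_made = 0  # counts real charges, for demonstration only

    def charge(self, saga_id: str, step_id: str, amount_cents: int) -> dict:
        key = (saga_id, step_id)
        if key in self._results:
            # Retry of a step that already ran: return the earlier result, do not charge again.
            return self._results[key]
        self.charges_made += 1
        result = {"status": "charged", "amount_cents": amount_cents}
        self._results[key] = result
        return result


service = PaymentService()
first = service.charge("saga-1", "charge_payment", 4999)
retry = service.charge("saga-1", "charge_payment", 4999)  # e.g. a network-level retry
print(service.charges_made, first == retry)
```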

Follow-up: When would you use a saga versus just using a distributed transaction (2PC)?

This is one of those questions where the “correct” answer (“never use 2PC”) is actually wrong in some contexts.

Use a saga when: your services are independently deployed and scaled, you can tolerate eventual consistency (the final state is correct but there is a window where it is not), the transaction spans services owned by different teams, or the individual operations are long-running (minutes, hours). Sagas are the microservices-native approach.

Use 2PC when: all participants are databases or resources you control, the transaction window is very short (milliseconds), and you need atomicity, not eventual consistency. 2PC works well within a single datacenter with controlled latency. PostgreSQL’s PREPARE TRANSACTION provides the participant side of 2PC, so an external coordinator can commit atomically across multiple Postgres instances. XA transactions in Java work across a database and a JMS broker. The problem is not 2PC itself. It is 2PC across unreliable network boundaries with independent failure modes.

The real answer: most systems should use neither. Design your service boundaries so that transactions do not span services. If “Create Order” and “Reserve Inventory” always happen together, maybe they should be the same service. The need for distributed transactions often reveals a service boundary that was drawn in the wrong place.

Q19. Someone on your team says “we should use event sourcing for our new service.” When is this a great idea, and when is it the worst architectural decision you could make?

“Event sourcing is good for audit logs.” They know the concept at the surface level but cannot articulate the actual operational cost or identify use cases where it is genuinely transformative versus cases where it becomes an albatross around the team’s neck.
Event sourcing means storing every state change as an immutable event rather than overwriting the current state. To get the current state, you replay all events. It is one of those patterns that is incredibly powerful in the right context and catastrophically overengineered in the wrong one.

When it is a great idea:

Financial systems and ledgers. A bank account is naturally a sequence of events: deposit $100, withdraw $30, transfer $50. The balance is derived, not stored. If there is ever a dispute, you have the complete, immutable history. You can answer “what was the balance at 3:47 PM on March 15th?” by replaying events up to that timestamp. Traditional state-based systems cannot answer this without separate audit tables that may drift from the actual state.

Collaborative editing and CRDT-based systems. Google Docs, Figma, and multiplayer game state are naturally event-sourced. Every keystroke, every move is an event. The current state is the result of applying all events. This enables undo/redo, conflict resolution, and offline-first architectures.

Regulatory compliance where auditability is mandatory. Healthcare (HIPAA), finance (SOX), and government systems where you must prove the history of every state change. Event sourcing gives you this for free: the event log IS the audit trail.

Systems that need to “time travel.” If you frequently need to answer “what would the state have been if X had not happened?” or “reconstruct the system state at any arbitrary past timestamp,” event sourcing gives you that capability naturally.

When it is the worst decision you could make:

Simple CRUD applications. If your service manages user profiles with name, email, and avatar, and updates are infrequent, event sourcing adds enormous complexity for zero benefit. You now need an event store, a projection layer, a snapshot mechanism, and event schema versioning. For what? To answer “what was John’s email address three weeks ago?” which nobody will ever ask.

High-frequency updates with large state. If an entity is updated 1,000 times per second (real-time metrics, IoT sensor readings), replaying every event to compute the current state is absurdly expensive: a single day of history is 86.4 million events. You need snapshots (store the current state every N events and replay from the snapshot). But now you have the complexity of both event sourcing AND state management. At that point, just store the state.

Teams without event-sourcing experience. The operational complexity is genuinely high: event schema evolution (what happens when the structure of your events changes?), eventual consistency between the event store and read projections, rebuilding projections from scratch (which can take hours for large event stores), and debugging issues where the projection diverges from the event stream. I have seen teams spend 3-4x longer building an event-sourced system than a state-based one, and the only benefit was a theoretical audit capability that the product never needed.

When the domain does not have natural events. If you find yourself creating events like UserUpdated with a before/after diff as the payload, you are not doing event sourcing. You are doing change data capture with extra steps. Real event sourcing has domain-meaningful events: OrderPlaced, PaymentReceived, ItemShipped. If your events are just EntityChanged, reconsider.

War Story: A team I consulted for event-sourced their user management service. They had events like UserCreated, UserEmailChanged, UserAvatarUpdated. After 18 months, they had 2 billion events for 500,000 users. Rebuilding the user projection (needed after a schema change) took 4 hours. Loading a single user required replaying an average of 4,000 events or maintaining snapshots. They eventually migrated to a standard PostgreSQL table with a separate CDC-based audit log. Total migration cost: 6 engineer-weeks. The lesson: event sourcing is not an audit log mechanism. If you only need an audit trail, use CDC or database triggers.
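The ledger example (balance derived by replaying events up to a timestamp, optionally starting from a snapshot) can be sketched as follows. The `Event` and `Snapshot` types, the integer timestamps, and the event kinds are assumptions made for illustration; the amounts match the deposit/withdraw/transfer example in the text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    timestamp: int  # hypothetical logical clock; events are assumed sorted by it
    kind: str       # "deposit", "withdraw", or "transfer_out"
    amount: int     # whole dollars, to keep the example simple


@dataclass(frozen=True)
class Snapshot:
    timestamp: int  # balance includes every event with timestamp <= this value
    balance: int


def apply_event(balance: int, event: Event) -> int:
    if event.kind == "deposit":
        return balance + event.amount
    if event.kind in ("withdraw", "transfer_out"):
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def balance_at(events: Iterable[Event], as_of: int, snapshot: Optional[Snapshot] = None) -> int:
    """Replay events up to and including `as_of`, starting from a snapshot when it is usable."""
    balance, after = 0, None
    if snapshot is not None and snapshot.timestamp <= as_of:
        balance, after = snapshot.balance, snapshot.timestamp
    for event in events:
        if after is not None and event.timestamp <= after:
            continue  # already folded into the snapshot
        if event.timestamp > as_of:
            break
        balance = apply_event(balance, event)
    return balance


ledger = [Event(1, "deposit", 100), Event(2, "withdraw", 30), Event(3, "transfer_out", 50)]
print(balance_at(ledger, as_of=2))                               # balance after the withdrawal
print(balance_at(ledger, as_of=3, snapshot=Snapshot(2, 70)))     # same answer as a full replay
```

The snapshot only shortens the replay; the event list remains the source of truth, which is why the war story's 4,000-event replays were a cost of the pattern rather than a bug.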

Follow-up: How do you handle event schema evolution when events are immutable?

This is the hardest operational problem in event sourcing and the one that bites teams hardest after they have been in production for a year.

Events are immutable, but their schema needs to evolve. You cannot go back and change historical events, but new code needs to read old events and old code might need to read new events.

Upcasting: maintain a chain of transformers that convert old event formats to the current format on read. Event stored as v1 is read through an upcaster that converts v1 to v2, then another that converts v2 to v3. This keeps the event store unchanged but adds processing cost and maintenance burden as the version chain grows.

Weak schema with optional fields: design events with optional fields from the start. Adding a new field is backward-compatible: old events just have that field as null. Removing a field is also safe because new code ignores it. This works for additive evolution but not for breaking changes.

Event versioning: store the schema version alongside the event. Consumers switch on the version and handle each accordingly. This is explicit but verbose.

Copy-transform (nuclear option): create a new event stream, replay all events from the old stream through a transformer that outputs events in the new schema, and switch consumers to the new stream. This gives you a clean slate but requires downtime or complex cutover logic.

My recommendation: design for weak schema evolution from day one. Use a schema registry (Confluent Schema Registry, or AWS Glue Schema Registry) to enforce backward/forward compatibility. Accept that breaking schema changes will require upcasters and plan for it.
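A minimal sketch of the upcasting chain described above, applied on read. The three schema versions of a UserEmailChanged-style event are invented for illustration (v2 adds an optional `changed_by`, v3 renames `email` to `new_email`); the stored event is never modified.

```python
from typing import Callable, Dict

CURRENT_VERSION = 3


def v1_to_v2(event: dict) -> dict:
    # v2 added an optional field; old events get it as None.
    return {**event, "version": 2, "changed_by": None}


def v2_to_v3(event: dict) -> dict:
    # v3 renamed "email" to "new_email" (a breaking change, hence the upcaster).
    upgraded = {k: v for k, v in event.items() if k != "email"}
    upgraded["new_email"] = event["email"]
    upgraded["version"] = 3
    return upgraded


# Maps a stored version to the function that lifts it one version higher.
UPCASTERS: Dict[int, Callable[[dict], dict]] = {1: v1_to_v2, 2: v2_to_v3}


def upcast(event: dict) -> dict:
    """Return the event in the current schema, leaving the stored event untouched."""
    while event["version"] < CURRENT_VERSION:
        event = UPCASTERS[event["version"]](event)
    return event


stored = {"version": 1, "email": "a@example.com"}
print(upcast(stored))
```

Each new breaking change adds one function to the chain, which is the maintenance burden the text refers to.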

Q20. Your monitoring shows everything green — all health checks passing, error rate under 0.1%, latency under SLA. But customers are complaining they cannot complete purchases. What is going on?

“The monitoring must be wrong.” Or they immediately start guessing without a systematic approach. They do not have a mental model for the gap between infrastructure metrics and user experience.
This is one of the most important debugging scenarios because it exposes the difference between infrastructure monitoring and user-experience monitoring. Green dashboards with angry customers means your monitoring has blind spots.

Category 1: Success that is not success. Your API returns HTTP 200, which your monitoring counts as “healthy.” But the response body says { "success": false, "error": "Payment processor declined" }. The health check passes. The error rate metric does not increment. The customer cannot buy anything. This is the number-one cause of “everything green but customers angry.” Fix: monitor business-level success metrics, not just HTTP status codes. Track “successful purchases per minute,” not just “non-5xx responses per minute.”

Category 2: The request never reaches your servers. DNS is resolving to an old IP. The CDN is serving a stale version of the JavaScript bundle that has a bug. A WAF rule is blocking legitimate requests from a specific geography. Your health checks pass because they hit the backend directly, bypassing the entire edge stack. Fix: synthetic monitoring (also called “synthetic transactions”; not to be confused with real user monitoring, which passively observes actual users) that exercises the full path: DNS resolution -> CDN -> WAF -> load balancer -> backend -> database -> response validation. Tools like Datadog Synthetics, Catchpoint, or even a simple curl-based check from an external location all work.

Category 3: Silent data corruption. The purchase flow completes, the API returns 200, but the order record in the database has a null shipping address because a recent migration introduced a bug in the address-parsing logic. The customer gets a confirmation email but the warehouse cannot fulfill the order. Fix: data integrity monitors that periodically validate business invariants: “every order has a non-null shipping address,” “every payment has a corresponding order,” “every order placed in the last hour has a fulfillment record.”

Category 4: Client-side failures invisible to backend monitoring. A JavaScript error in the checkout page prevents the “Place Order” button from firing the API request. Your backend sees nothing because the request was never made. Or: the API call is made but fails with a CORS error that the backend never logs. Fix: client-side error tracking (Sentry, Bugsnag) and real user monitoring (RUM) that captures client-side errors, JavaScript exceptions, and failed network requests.

Category 5: Partial failures in a multi-step flow. Step 1 (add to cart) works. Step 2 (enter shipping) works. Step 3 (payment) fails intermittently. Your aggregate error rate across all endpoints is 0.1%, but the payment endpoint’s error rate is 8%. The low-traffic endpoint’s failures are drowned in the aggregate metric. Fix: per-endpoint monitoring and funnel analysis. Track “started checkout” -> “entered shipping” -> “entered payment” -> “order confirmed” as a funnel. If 1,000 users start checkout but only 200 complete it, something is broken between steps.

My immediate debugging protocol when this happens:
  1. Check client-side error tracking (Sentry) for JavaScript errors in the purchase flow.
  2. Check the purchase completion funnel for where users are dropping off.
  3. Inspect the actual HTTP response bodies for the failing endpoint — are we returning 200 with an error payload?
  4. Run a synthetic transaction that exercises the full purchase flow end-to-end, including payment processor integration.
  5. Check if the issue is geography-specific (CDN, WAF, DNS).
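The Category 5 effect, where a low-traffic endpoint's failures vanish inside the aggregate, is easy to see with numbers. The request counts below are hypothetical, chosen so that the aggregate stays under 0.1% while the payment endpoint fails 8% of the time.

```python
from typing import Dict, List, Tuple


def error_rate(errors: int, total: int) -> float:
    return errors / total if total else 0.0


def endpoints_over(counts: Dict[str, Tuple[int, int]], threshold: float) -> List[str]:
    """Return endpoints whose own error rate exceeds the threshold."""
    return sorted(ep for ep, (errors, total) in counts.items() if error_rate(errors, total) > threshold)


# Hypothetical one-minute window: endpoint -> (errors, total requests)
counts = {
    "/cart": (2, 50_000),
    "/shipping": (2, 49_000),
    "/payment": (80, 1_000),
}

total_errors = sum(errors for errors, _ in counts.values())
total_requests = sum(total for _, total in counts.values())
aggregate = error_rate(total_errors, total_requests)

print(f"aggregate: {aggregate:.3%}")
print("over 1% individually:", endpoints_over(counts, 0.01))
```

An alert on the aggregate would stay silent here, while a per-endpoint alert on `/payment` fires immediately.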
War Story: A SaaS company I worked with had a 72-hour period where their MRR dashboard showed signups dropping 60%. All engineering metrics were green. CPU, memory, error rates, latency — everything looked perfect. The actual problem: their Stripe webhook endpoint had a TLS certificate that expired, so Stripe could not deliver payment confirmation webhooks. New customers would sign up, their credit card was charged, but the application never received the “payment_succeeded” webhook, so their account was never activated. From the application’s perspective, nothing was wrong — it was just waiting for a webhook that would never arrive. Their monitoring did not have a “webhook received rate” metric. They added one that day.

Follow-up: Design a monitoring strategy that catches this class of problem before customers complain.

The framework I use is four layers of monitoring, from infrastructure up to business outcomes.

Layer 1: Infrastructure metrics (the floor). CPU, memory, disk, network, pod health, node health. These catch hardware and orchestration problems. Necessary but far from sufficient.

Layer 2: Application metrics (the walls). Per-endpoint error rates, latency distributions (not just averages: P50, P95, P99), throughput, and dependency health (database connection pool utilization, Redis hit rate, external API response times). These catch application-level problems.

Layer 3: Synthetic transactions (the doors). Automated bots that execute critical user journeys every 1-5 minutes from external locations: sign up, log in, add to cart, complete purchase, upload a file. Each synthetic test validates the response body, not just the status code. If the synthetic purchase fails, you get paged, even if all infrastructure and application metrics are green.

Layer 4: Business metrics (the ceiling). Revenue per minute, signups per hour, orders per hour, payment success rate, funnel conversion rates. These are the ultimate source of truth. If signups drop 50% at 3 PM on a Tuesday with no marketing explanation, something is broken. Alert on anomaly detection for these metrics. Datadog’s anomaly detection works, as does a simple “less than 50% of last week’s same-hour value” threshold.

The key principle: each layer catches failures that the layers below miss. An expired TLS cert on a webhook endpoint is invisible to Layers 1-2 but immediately visible in Layer 4 (payment success rate drops). A CDN serving stale JavaScript is invisible to Layers 1-2 but caught by Layer 3 (synthetic test fails). Together, the four layers provide defense in depth.
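Two of the checks above are small enough to sketch directly: a Layer 3 assertion that validates the response body rather than the status code, and the Layer 4 "less than 50% of last week's same-hour value" threshold. The function names and the `success` field are assumptions based on the example payload earlier in this answer.

```python
def purchase_succeeded(status_code: int, body: dict) -> bool:
    """Layer 3: a 200 only counts if the body also reports success."""
    return status_code == 200 and body.get("success") is True


def is_anomalous(current: float, same_hour_last_week: float, floor_ratio: float = 0.5) -> bool:
    """Layer 4: flag a business metric that fell below a fraction of last week's value."""
    if same_hour_last_week <= 0:
        return False  # no baseline to compare against
    return current < floor_ratio * same_hour_last_week


# A 200 response carrying an error payload must fail the synthetic check.
print(purchase_succeeded(200, {"success": False, "error": "Payment processor declined"}))

# 40 signups this hour against 100 in the same hour last week should alert.
print(is_anomalous(current=40, same_hour_last_week=100))
```

A fixed ratio is crude compared with seasonal anomaly detection, but it would have caught the 60% signup drop in the war story within the first hour.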

Q21. A junior engineer proposes using a Singleton pattern for the database connection pool. It seems reasonable. Why might you push back, and what would you recommend instead?

“Singleton is an anti-pattern, never use it.” They repeat dogma without being able to articulate the specific problems in this context or acknowledge that Singletons have legitimate uses. Or they say “sure, Singleton is fine for a connection pool” without any nuance.
This is a question where the obvious answer (“Singleton is fine for a connection pool”) is partially correct but hides important problems. Connection pools are one of the more defensible uses of Singleton-like patterns, but the implementation details determine whether it helps or hurts.

Why the junior engineer’s instinct is reasonable: you genuinely want exactly one connection pool per database in your application. Creating a new pool per request would exhaust database connections in seconds. Sharing a single pool is correct behavior. So the intent is right.

Why a Singleton class is the wrong mechanism:

Testing becomes painful. A Singleton class method like ConnectionPool.getInstance() is a global mutable reference. In unit tests, you cannot substitute it with a mock or an in-memory database without reflection hacks or resetting global state between tests. Every test that touches any code path that uses the database now implicitly depends on the Singleton. Test isolation is destroyed.

Configuration rigidity. What happens when you need to connect to two different databases? The Singleton assumes exactly one instance. Now you refactor to ConnectionPool.getInstance("primary") and ConnectionPool.getInstance("analytics"), which is a dictionary of Singletons. At that point you have reinvented a service locator, which has its own problems.

Lifecycle management. When the application shuts down, who closes the connection pool? The Singleton has no natural lifecycle hook. In a web framework, you want the pool to initialize on startup and close gracefully on shutdown, draining active connections. A Singleton’s lifecycle is “created on first access, destroyed when the process exits,” which often means connections are not closed cleanly.

Hidden dependency. Any class in your codebase can call ConnectionPool.getInstance() anywhere. You cannot tell from a class’s constructor or interface what it depends on. This makes code harder to reason about and harder to refactor. If you need to split the application into two modules, which one gets the connection pool?

What I recommend: dependency injection. Create the connection pool at application startup (in your composition root: main(), the DI container setup, the framework bootstrap). Pass it as a constructor parameter to the services that need it. This gives you:
  • Testability: pass a mock pool in tests.
  • Flexibility: pass different pools to different services.
  • Explicit dependencies: a class’s constructor tells you exactly what it needs.
  • Lifecycle control: the startup code creates it, the shutdown hook closes it.
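The four properties above can be seen in a small constructor-injection sketch. Everything here is hypothetical: `Pool` is an illustrative interface rather than a real driver API, and `FakePool` stands in for a real connection pool so the example runs without a database.

```python
from typing import Dict, Optional, Protocol


class Pool(Protocol):
    """Minimal interface the service depends on (illustrative, not a real driver API)."""

    def fetch_one(self, query: str, params: tuple) -> Optional[dict]: ...
    def close(self) -> None: ...


class UserService:
    def __init__(self, pool: Pool) -> None:
        # Explicit dependency: the constructor shows exactly what this class needs.
        self._pool = pool

    def get_email(self, user_id: int) -> Optional[str]:
        row = self._pool.fetch_one("SELECT email FROM users WHERE id = ?", (user_id,))
        return row["email"] if row else None


class FakePool:
    """In-memory stand-in used in tests instead of a real pool."""

    def __init__(self, rows: Dict[int, dict]) -> None:
        self._rows = rows
        self.closed = False

    def fetch_one(self, query: str, params: tuple) -> Optional[dict]:
        return self._rows.get(params[0])

    def close(self) -> None:
        self.closed = True


def main(pool: Pool) -> Optional[str]:
    """Composition root: the pool is created once by the caller, injected, and closed on shutdown."""
    try:
        return UserService(pool).get_email(42)
    finally:
        pool.close()


fake = FakePool({42: {"email": "a@example.com"}})
print(main(fake), fake.closed)
```

Passing a second pool (for example, one pointed at a read replica) to a different service requires no change to `UserService`, which is the flexibility the Singleton version lacks.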
The pool itself is still a single instance at runtime. You just do not enforce that through the Singleton pattern. The “exactly one pool” constraint is a deployment-time decision, not a compile-time constraint.

When I would accept a Singleton: in a small script or CLI tool with no tests, no dependency injection framework, and no lifecycle requirements. Sometimes db.getInstance() in a 200-line Python script is the pragmatic choice. Software engineering principles exist to manage complexity. If there is no complexity, the principle is overhead.

War Story: A team had a Singleton connection pool in a Java service. When they needed to add a read-replica connection for reporting queries, they could not, because the Singleton assumed one database. They forked the Singleton into PrimaryPool and ReplicaPool, but now every DAO method needed to decide which pool to use, and the Singletons proliferated. When I joined, we refactored to dependency injection with Spring’s @Qualifier annotations. The refactor touched 40 files and took a week, but after that, adding a third database (Redis, for caching) took 20 minutes. The Singleton pattern was the bottleneck to architectural evolution.

Follow-up: Name a case where a Singleton is genuinely the right choice, not just the convenient one.

Hardware resource wrappers. If you are managing access to a physical device (a GPU, a serial port, a hardware security module (HSM)), there is genuinely one physical resource, and creating a second wrapper would cause conflicts at the hardware level. The Singleton is not just convenient; it models a physical reality.

Logger instances. Logging frameworks (Log4j, SLF4J, Python’s logging module) use Singletons internally, and this is correct. The logger is a write-only, stateless (from the caller’s perspective), globally-needed utility. You never need to mock the logger in tests (you mock the handler/appender instead). There is no lifecycle complexity. Multiple instances would just be wasteful.

Configuration registries (read-only after initialization). A global config object loaded once at startup and never modified is a safe Singleton because it has no mutable state to cause concurrency issues and no lifecycle management needs.

The pattern I use: Singletons are appropriate when the resource is physically singular, stateless or immutable, and has no testing implications. Connection pools fail on all three: they are logically singular (not physically), mutable (connections are checked in/out), and absolutely have testing implications.

Q22. You are designing a system that must work across three geographic regions with users who expect sub-100ms reads. Strong consistency is required for financial data, but you also have a social feed feature where eventual consistency is fine. How do you architect this?

“Use a global database like Spanner.” They reach for a single tool and apply it uniformly. They do not differentiate between data paths with different consistency requirements, and they cannot estimate whether Spanner’s cross-region latency budget fits within 100ms.
This question is really about understanding that different data paths within the same system deserve different architectures. There is no single database or pattern that optimally serves both requirements. Let me split the system by consistency needs.

For the financial data path (strong consistency):

CockroachDB or Google Spanner, deployed across three regions with a leaseholder in each region. Here is the latency math: cross-region round trip between US-East, Europe, and Asia is roughly 80-150ms per hop. A strongly consistent write in Raft requires the leader to replicate to a majority. If the leader is in US-East and the followers are in Europe and Asia, the write completes when the faster follower (Europe, ~80ms) acknowledges. So write latency is approximately 80-100ms from the leader’s perspective, but 160-250ms from a client in Asia writing to a US-East leader.

The key optimization: CockroachDB lets you pin leaseholders (read-serving nodes) to specific regions using zone configurations. You pin financial data leaseholders by user’s home region. A user in Europe reads from the European leaseholder with single-digit millisecond latency. Writes from that user also go to their local leaseholder first. This gives you sub-100ms reads everywhere and strongly consistent reads within a region.

The trade-off: cross-region writes (a user in Europe transferring money to a user in Asia) will take 100-200ms because the write must replicate to a majority of replicas across regions. This is the fundamental cost of strong consistency across geographies, and no architecture avoids it without violating physics (speed of light).

For the social feed path (eventual consistency):

Completely different architecture. DynamoDB Global Tables replicated across all three regions. Each region gets a full replica and can serve reads locally with sub-10ms latency. Writes are replicated asynchronously (typically under 1 second). If two users in different regions modify the same item simultaneously, DynamoDB reconciles the conflict with last-writer-wins: it makes a best effort to determine the latest write, and all replicas converge on it.

For the feed itself, I would use a regional Redis cluster per region as the hot-path read store. The feed is pre-computed (fan-out-on-write for normal users, fan-out-on-read for high-follower users) and pushed to the local Redis cluster. Reads are sub-millisecond from local Redis. If Redis misses, fall back to the regional DynamoDB replica.

The integration layer: How do these two systems interact? If a financial transaction (strong consistency path) generates a social event (“Alice sent Bob $50”), the financial write to CockroachDB succeeds first, then an event is published (via the Outbox pattern) to a Kafka cluster that replicates the event to all regions. The social feed consumers in each region read from their local Kafka and update the regional DynamoDB/Redis stores. There is a delay of 1-5 seconds between the transaction completing and the social feed updating, which is perfectly acceptable for a feed.

The architecture summary:
  • Financial data: CockroachDB multi-region with pinned leaseholders. Strong consistency. Reads sub-100ms from user’s home region.
  • Social feed: DynamoDB Global Tables + regional Redis. Eventual consistency. Reads sub-10ms.
  • Integration: Kafka event bridge with Outbox pattern for cross-domain events.
  • Total operational surface: 3 databases (CockroachDB, DynamoDB, Redis), 1 event backbone (Kafka). This is manageable with a competent platform team.
War Story: A fintech company I worked with initially tried to use a single CockroachDB cluster for everything — financial transactions and social features. The social feed queries (aggregations, fan-out reads) were putting so much load on the CockroachDB cluster that financial transaction latency spiked to 500ms during peak hours. Separating the workloads into purpose-built stores reduced financial transaction P99 from 500ms to 40ms and feed read latency from 200ms to 3ms. The operational complexity increased, but the performance characteristics matched the actual requirements.
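The latency math for the financial path can be checked with a few lines of Python. This is a rough sketch: the round-trip figures below are assumptions picked from the 80-150ms range quoted in the answer, not measurements, and the model ignores disk and processing time.

```python
from __future__ import annotations

# Assumed round-trip times in ms between regions (illustrative, not measured).
RTT_MS = {
    frozenset({"us-east", "europe"}): 80.0,
    frozenset({"us-east", "asia"}): 150.0,
    frozenset({"europe", "asia"}): 140.0,
}
REGIONS = ["us-east", "europe", "asia"]


def rtt(a: str, b: str) -> float:
    """Round trip between two regions; zero when both are the same region."""
    return 0.0 if a == b else RTT_MS[frozenset({a, b})]


def quorum_commit_ms(leader: str) -> float:
    """Time until the leader has acknowledgments from a majority of replicas."""
    majority = len(REGIONS) // 2 + 1
    acks_needed = majority - 1  # the leader's own copy counts toward the majority
    follower_rtts = sorted(rtt(leader, r) for r in REGIONS if r != leader)
    return follower_rtts[acks_needed - 1] if acks_needed else 0.0


def client_write_ms(client: str, leader: str) -> float:
    """Client-to-leader round trip plus the quorum wait at the leader."""
    return rtt(client, leader) + quorum_commit_ms(leader)
```

Under these assumptions, a client in Asia writing to a US-East leader pays 150 + 80 = 230ms, while a European client whose leaseholder is pinned to Europe pays only the 80ms quorum wait, which is the argument for pinning leaseholders by home region.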

Follow-up: A user in Europe makes a deposit and immediately checks their balance. Can you guarantee they see the updated balance if the leaseholder is in US-East?

This is the “read-your-own-writes” consistency problem, and it is the most user-visible consistency issue in geo-distributed systems.

If using CockroachDB with the leaseholder in US-East: the write from Europe travels to US-East and back (~80ms round trip), and the commit waits for a quorum acknowledgment (~80ms), for a total of roughly 160ms. If the user immediately reads, and the read goes to a European follower that has not yet received the replication, they could see a stale balance. But CockroachDB prevents this by default: reads go to the leaseholder, so the European user’s read also goes to US-East (80ms round trip). They see the updated balance, but with the cross-region latency cost.

The optimization: pin the leaseholder for European users’ financial data to Europe. Now both the write and the subsequent read are served by the local leaseholder in Europe. The write still needs a majority acknowledgment across regions (~80ms for the write to be considered durable), but the read after the write is local, with sub-10ms latency, and sees the data immediately because the local node holds the lease and has the latest write.

In CockroachDB specifically: zone configurations apply to a whole table, index, or partition, so you partition accounts by region and then run ALTER PARTITION eu OF TABLE accounts CONFIGURE ZONE USING lease_preferences = '[[+region=eu]]' to pin European user data leaseholders to the European region (the partition name eu is illustrative). In newer multi-region deployments, ALTER TABLE accounts SET LOCALITY REGIONAL BY ROW achieves the same effect declaratively. This gives European users fast reads after their own writes, while maintaining strong consistency guarantees.

The worst-case scenario: the leaseholder fails over to another region right after the write but before the read. The new leaseholder in the other region will have the write (because the write was acknowledged by a majority before the response was sent), so the user still sees their updated balance, but the read takes longer due to the cross-region hop. Correctness is maintained; latency temporarily spikes.

Interview-Ready One-Liners

Memorize these. They are the crisp, senior-engineer-sounding openers for high-frequency topics.
| Topic | One-Liner |
| --- | --- |
| CAP | “Partition tolerance is not a choice; during a partition I pick consistency or availability. I prefer PACELC because it tells me what to do when there is no partition.” |
| Consistency model | “Strong consistency for money, read-your-writes for user actions, eventual for feeds and analytics.” |
| Caching | “Cache-aside is the default; I add Redis only on the hot path and measure hit rate before committing to it.” |
| Rate limiting | “Token bucket for APIs, sliding window for per-user limits, leaky bucket when smoothing bursts matters.” |
| Queue choice | “Kafka for log-style fan-out at scale, RabbitMQ for per-message routing, SQS when I want to not operate a queue.” |
| Auth | “401 is authn, 403 is authz. I never roll my own auth; I use the IDP and focus on authorization policy.” |
| Microservices | “Services, not microservices. I split only when team boundaries or scaling axes justify the ops cost.” |
| Tests | “Unit tests for pure logic, integration tests against real infra, one e2e smoke per critical journey. I don’t mock the database for integration.” |
| Observability | “Metrics for rate/error/duration (RED), logs for forensics, traces for request flow. Alert on symptoms, not causes.” |
| Deployments | “Canary with automated rollback on error-rate regression, feature flags for decoupling deploy from release.” |
| Database choice | “Postgres until you prove you’ve outgrown it. At 50K writes/sec or multi-region writes, then I evaluate Spanner/CockroachDB/Cassandra.” |
| Kubernetes | “Autopilot for small teams and stateless, Standard for GPU/DaemonSet/custom. RBAC is code-reviewed; nobody gets cluster-admin.” |
| Schema migration | “Expand -> migrate -> contract. Never backward-incompatible in one deploy. pg_repack or gh-ost for online rewrites.” |
| Memory leak | “Reproduce, snapshot, diff, find the retainer. Five patterns: globals, timers, closures capturing too much, detached DOM, orphaned listeners.” |
| Thundering herd | “Jittered retry + circuit breaker + request coalescing. Don’t let every client retry at the same second.” |
| Idempotency | “Client supplies an idempotency key; server stores result keyed by it for a bounded window. Retry-safe by construction.” |
| Distributed lock | “Redis SET NX EX is not safe; use Redlock carefully or better, use Zookeeper/etcd leases with a fencing token.” |
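As an illustration of the rate-limiting one-liner, a token bucket fits in a few lines. This is a single-process sketch with a hypothetical TokenBucket class; the clock is injectable so the behavior can be tested deterministically, and a production limiter would usually keep its state in a shared store.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity controls how large a burst is tolerated and the rate controls the sustained throughput, which is why the token bucket suits APIs where short bursts are acceptable.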

Quick-Fire Q&A: 60-Second Answers

60-second answer: It pulls every column, including large TEXT/JSON/blob columns you don’t need, prevents the query planner from using covering indexes, and breaks callers when the schema changes (for example, a new column added silently). Projection matters: listing columns lets the planner pick an index-only scan, the network carries less data, and ORM hydration is faster. The only place SELECT * is okay is in ad-hoc debugging.
60-second answer: 502 (Bad Gateway) means the upstream returned garbage — malformed response, connection reset. 504 (Gateway Timeout) means the upstream did not respond in time. 502 is usually an upstream crash or protocol mismatch; 504 is usually load/latency. Different fixes: 502 means debug the upstream service; 504 means increase timeout, scale upstream, or shed load.
60-second answer: If 99% of rows have deleted_at IS NULL, an index on that column is not selective — the planner picks a sequential scan anyway. Solutions: (a) partial index WHERE deleted_at IS NULL which only indexes live rows and is much smaller, (b) don’t soft-delete if the deletion ratio is low, (c) move deleted rows to a separate archive table.
60-second answer: When the bottleneck moves to something not shard-scalable: (a) the database, (b) a shared cache, (c) a downstream third-party API, (d) network bandwidth, (e) coordination/consensus that requires majority agreement. The pattern: “we added 4x more API servers but QPS plateaued” -> the next bottleneck just revealed itself. Profile before scaling further.
60-second answer: Moving quality checks earlier in the dev lifecycle — security scans in IDE/pre-commit instead of pre-deploy, integration tests on every PR instead of nightly, performance tests in staging instead of prod. The goal: catch defects when they cost minutes to fix, not hours. Trap: shifting left without reducing right-side work just adds overhead.
60-second answer: Exponential backoff alone causes thundering herds when everyone retries at the same time after an outage. Add jitter: sleep = base * 2^attempt * random(0.5, 1.5). Also cap max attempts (give up eventually) and max backoff (don’t wait 30 minutes). Finally, combine with a circuit breaker so you stop hammering a dead service altogether.
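The jitter formula above translates directly into code. The following is a minimal sketch: jittered_delay and call_with_retry are hypothetical helper names, the sleep function and random generator are injectable for testing, and the circuit breaker is left out.

```python
import random
import time


def jittered_delay(base: float, attempt: int, max_delay: float, rng: random.Random) -> float:
    """sleep = base * 2^attempt * random(0.5, 1.5), capped at max_delay."""
    return min(max_delay, base * (2 ** attempt) * rng.uniform(0.5, 1.5))


def call_with_retry(fn, *, base=0.1, max_attempts=5, max_delay=5.0,
                    sleep=time.sleep, rng=None):
    """Call fn, retrying with jittered exponential backoff up to max_attempts calls."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # sketch only: real code should catch retryable errors only
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(jittered_delay(base, attempt, max_delay, rng))
```

Capping both the attempt count and the per-attempt delay keeps the worst-case wait bounded, while the random factor spreads clients out so they do not all retry in the same instant.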
60-second answer: Blue-green: full switch-over. Cleanest rollback (flip the LB back). Best for schema-coupled apps where mixed versions are dangerous. Doubles infrastructure during cutover. Canary: gradual rollout, 1% -> 10% -> 50% -> 100%. Best when you want to catch issues with a small blast radius. Requires metrics to automate promotion/rollback. Modern default: canary with automated health gates.
60-second answer: JWTs are self-contained — you can’t invalidate them server-side without a revocation list (which defeats the stateless benefit). For sessions you need logout to work instantly, password change to kill all sessions, and admin-force-logout. Sessions stored in Redis with a session ID in a cookie give you all three. JWT is for short-lived service-to-service tokens where revocation is handled by expiry.
60-second answer: At-least-once: messages may be delivered 2+ times if an ack is lost; consumer must be idempotent. Exactly-once: messages are delivered once and only once; requires deduplication at the broker or consumer (Kafka transactions, idempotent producers). Exactly-once is expensive and rarely worth it — most systems do at-least-once with idempotent handlers, which is operationally simpler.
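The "at-least-once with idempotent handlers" approach can be sketched as a consumer that remembers which message IDs it has already applied. IdempotentConsumer is a hypothetical in-memory example; a real handler would store the seen IDs in the same transaction as the side effect and expire them after a bounded window.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if the broker redelivers it."""

    def __init__(self) -> None:
        self._seen: set[str] = set()  # in-memory only; assumed durable in a real system
        self.balance = 0

    def handle(self, message_id: str, amount: int) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self._seen:
            return False  # redelivery after a lost ack: safely ignored
        self.balance += amount
        self._seen.add(message_id)
        return True
```

For example, if the broker delivers message "m1" twice because the first ack was lost, the second call returns False and the balance is credited only once.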
60-second answer: Two hard parts: (1) knowing when data changed (dual-write, TTL, pub/sub, CDC — each has drawbacks), (2) preventing stale reads during the window between write and invalidation. Real systems combine: TTL as a safety net + explicit invalidation for correctness + versioned keys to avoid ambiguity. The trap is thinking you can solve it perfectly; you design for bounded staleness and accept it.

AI-Assisted Lens per Concept

How would you use AI tools to work with each topic in a real interview / real job?
| Topic | AI-assisted lens |
| --- | --- |
| System design | Use Claude/GPT to stress-test your design by role-playing an adversarial reviewer: “What are 5 failure modes of this design?” |
| Debugging | Paste stack trace + recent code changes into the LLM for root-cause hypotheses, then verify each manually. |
| SQL optimization | Give the LLM your EXPLAIN ANALYZE output and ask for the top 3 fixes with reasoning. Validate with EXPLAIN (ANALYZE, BUFFERS) after each change. |
| Code review | Ask the LLM to review a PR with a specific lens: “Review for concurrency bugs only” or “Review for cost efficiency only.” More signal than a generic review. |
| Test generation | LLMs are excellent at generating property-based test cases and edge cases you would miss. Useful for serialization, parsers, and date/time math. |
| Documentation | Draft ADRs from bullet points, generate runbook skeletons from an incident report. You polish; AI does the boilerplate. |
| Learning new tech | Ask the LLM for a “concept map” of the new tech vs something you already know. “How is Temporal like and unlike Airflow?” |
| Interview prep | Have the LLM generate 10 adversarial follow-ups to your answer. Use it as an interview sparring partner. |
What strong candidates say about AI: “I use LLMs as a fast-feedback partner: stress-test my design, generate test cases, draft docs. I treat AI output as a junior engineer’s first draft: useful signal, but I verify everything that touches production. I also know its failure modes (hallucinated APIs, outdated library advice, and confident-but-wrong SQL), so I always run the code.”

What weak candidates say: “I’d just ask ChatGPT.” No verification mindset, no understanding of AI’s failure modes. Interviewers can tell the difference between someone using AI as a tool vs. someone using it as a crutch.