REST & Microservices Interview Questions (55+ Deep-Dive Q&A)

1. REST API Design

What interviewers are really testing: Whether you understand REST as an architectural style with specific constraints rather than just “HTTP + JSON”. Senior candidates know which constraints are routinely violated and why.
Answer:
  1. Client-Server — Separation of concerns. The client handles UI/UX, the server handles data storage and business logic. They evolve independently. Example: your React SPA can be rewritten to React Native without touching the API.
  2. Stateless — Every request from client to server must contain all the information needed to understand it. The server stores no session state between requests. This is what makes horizontal scaling possible — any server can handle any request because there is no “sticky” state. In practice, this means auth tokens (JWTs) travel with every request instead of relying on server-side sessions.
  3. Cacheable — Responses must implicitly or explicitly label themselves as cacheable or non-cacheable. This enables clients, CDNs, and intermediary proxies to cache responses and reduce server load. Headers like Cache-Control, ETag, and Last-Modified drive this.
  4. Uniform Interface — The most fundamental constraint. It requires: resource identification via URIs, manipulation through representations (JSON/XML), self-descriptive messages (content-type headers), and HATEOAS (hypermedia links in responses). This is what distinguishes REST from plain HTTP RPC.
  5. Layered System — The client cannot tell whether it is connected to the end server or an intermediary. Load balancers, CDNs, API gateways, and WAFs can sit between client and server transparently. This enables security policies, caching layers, and load distribution without client awareness.
  6. Code on Demand (Optional) — Servers can temporarily extend client functionality by transferring executable code (e.g., JavaScript). This is the only optional constraint and is rarely cited in API design.
Key insight most candidates miss: Nearly every “REST API” in production violates at least one constraint, most commonly HATEOAS. Roy Fielding himself has written that if your API does not use hypermedia, it is not actually REST; it is HTTP-based RPC. In practice, very few APIs are truly RESTful. Netflix, GitHub, and Stripe all make pragmatic trade-offs. The interview value here is showing you understand both the ideal and the pragmatic reality.
Red flag answer: “REST means using HTTP verbs with JSON responses.” This conflates REST with HTTP RPC and misses the architectural constraints entirely.
Follow-up questions:
Q: Which REST constraint do most production APIs violate, and is that acceptable?
HATEOAS. Almost no production API actually embeds hypermedia links that drive application state. GitHub’s API is one of the few that includes Link headers for pagination. The pragmatic reason is that most client developers prefer documented, hardcoded endpoints over dynamically discovered ones. It is acceptable because the trade-off between developer experience and theoretical purity favors DX in most business contexts. However, for long-lived APIs with many consumers (like a banking platform), HATEOAS can reduce breaking changes significantly because clients follow links rather than constructing URLs.
Q: How does the stateless constraint affect authentication in a microservices architecture?
Every request must carry its own auth context. This is why JWTs became the de facto standard: they are self-contained tokens that carry user claims. The server validates the signature without hitting a session store. The trade-off is that JWTs cannot be easily revoked mid-lifetime. Solutions include short expiry (15 minutes) plus refresh tokens, or maintaining a token blocklist in Redis (which technically reintroduces statefulness, but at the auth layer only). At scale, companies like Auth0 recommend keeping JWT lifetimes under 15 minutes and using token introspection for sensitive operations.
Q: If REST requires statelessness, how do you handle shopping carts or multi-step workflows?
The cart is a resource stored on the server, not session state. The distinction is subtle but critical: POST /carts creates a cart resource with an ID. The client sends that cart ID with subsequent requests. The server is still stateless because it does not “remember” the client between requests. The cart data lives in the database, not in server memory. Multi-step workflows work the same way: each step creates or updates a resource, and the client tracks its own progress.
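The stateless authentication idea above (a self-contained, signed token that the server can validate without a session store) can be sketched with the standard library alone. This is a simplified stand-in for a JWT, not a real JWT implementation: the token layout, `SECRET`, `issue_token`, and `verify_token` are all hypothetical names chosen for illustration, and a production service would use a maintained JWT library instead.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret"  # hypothetical key; a real service loads this from a secret manager


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    # Restore the padding that _b64 stripped before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(user_id: str, ttl_seconds: int = 900, now: Optional[float] = None) -> str:
    """Return 'payload.signature'; the payload carries the claims and an expiry."""
    issued = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": issued + ttl_seconds}).encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[dict]:
    """Check signature and expiry using only the token itself (no session lookup)."""
    try:
        payload_part, signature_part = token.split(".")
        payload = _unb64(payload_part)
        signature = _unb64(signature_part)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # tampered or signed with a different key
    claims = json.loads(payload)
    current = time.time() if now is None else now
    return claims if claims["exp"] > current else None
```

Because any server holding the key can run `verify_token`, any instance can handle any request. The cost is the one noted above: nothing in this sketch can revoke a token before its expiry.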
What interviewers are really testing: Whether you understand idempotency deeply enough to design reliable APIs, especially in distributed systems where network retries are inevitable.
Answer:
| Verb | Purpose | Safe | Idempotent | Request Body |
| --- | --- | --- | --- | --- |
| GET | Retrieve resource | Yes | Yes | No |
| POST | Create new resource | No | No | Yes |
| PUT | Full replacement | No | Yes | Yes |
| PATCH | Partial update | No | It depends | Yes |
| DELETE | Remove resource | No | Yes | Optional |
| HEAD | Like GET, headers only | Yes | Yes | No |
| OPTIONS | Capabilities/CORS preflight | Yes | Yes | No |
Key nuances:
  • Safe means the method does not modify server state. GET, HEAD, and OPTIONS are safe. You should never design a GET /deleteUser/123 endpoint: it violates safety and will be triggered by crawlers and prefetch.
  • Idempotent means making the same request N times produces the same result as making it once. DELETE /users/123 returns 200 the first time and 404 after, but the server state is the same (user is gone). That is still idempotent.
  • PATCH is tricky: A JSON Merge Patch ({"email": "new@example.com"}) is idempotent. But a JSON Patch operation like [{"op": "add", "path": "/tags/-", "value": "new"}] appends an element each time — that is NOT idempotent. The idempotency of PATCH depends entirely on the patch format used.
  • POST is never idempotent by default, which is why payment APIs (Stripe, PayPal) support client-supplied idempotency keys, such as Stripe's Idempotency-Key header, to make POST behave idempotently for critical operations.
Production example: Stripe supports an Idempotency-Key header on POST requests such as POST /v1/charges. If the client retries with the same key (for example after a network timeout), Stripe returns the stored result of the first request instead of processing it again. Without this, a flaky network could charge a customer twice. Stripe keeps idempotency keys for at least 24 hours.
Red flag answer: “PUT and PATCH are basically the same thing” or “Idempotent means the response is always the same” (it means the server state is the same, not the response code).
Follow-up questions:
Q: A mobile client sends a POST to create an order but gets a network timeout. It retries. How do you prevent duplicate orders without idempotency keys?
You could use a unique constraint on a business-level identifier (e.g., a client-generated order reference), but this is fragile. The robust solution is idempotency keys: the client generates a UUID before the first attempt and sends it with every retry. The server checks a key-value store (Redis with a TTL of 24-48 hours). If the key exists, it returns the stored response. If not, it processes the request and stores the result keyed by that UUID. This pattern is used by Stripe, Square, and Amazon Pay. The key expiry should match your business context; Stripe uses 24 hours, which covers any reasonable retry window.
Q: Can a GET request ever modify server state? When might this be acceptable?
Per the HTTP spec it should never modify state. In practice, tracking endpoints like GET /emails/track.png in marketing emails update a “read” counter; the side effect is non-destructive and the response (a 1x1 pixel) is always the same. Analytics logging on GET is another common case. The principle: incidental side effects (logging, metrics) are acceptable, but the GET should never be the primary mechanism for state mutation.
Q: You are designing an API for a banking transfer. Which verb do you use and why?
POST /transfers with an idempotency key. Not PUT, because we are creating a new resource, not replacing an existing one. The idempotency key is non-negotiable for financial operations. The response should include the transfer ID and status. The endpoint should validate funds, create the transfer record, and enqueue the actual money movement asynchronously. Return 201 Created on success with a Location header pointing to the transfer resource.
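The idempotency-key pattern described above can be sketched as a thin wrapper around a POST handler. This is a minimal illustration under stated assumptions: the in-memory dict stands in for Redis with a TTL, and `IdempotentEndpoint`, `create_order`, and the `orders` list are hypothetical names, not any payment provider's API.

```python
import uuid
from typing import Callable, Dict, List, Tuple

Response = Tuple[int, dict]  # (status code, JSON body)


class IdempotentEndpoint:
    """Replays the stored response when the same idempotency key is seen again."""

    def __init__(self, handler: Callable[[dict], Response]) -> None:
        self._handler = handler
        self._seen: Dict[str, Response] = {}  # stand-in for Redis with a 24-48 hour TTL

    def post(self, idempotency_key: str, body: dict) -> Response:
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # retry: do not run the handler again
        response = self._handler(body)
        self._seen[idempotency_key] = response
        return response


orders: List[dict] = []


def create_order(body: dict) -> Response:
    order = {"id": len(orders) + 1, **body}
    orders.append(order)
    return 201, order


endpoint = IdempotentEndpoint(create_order)
key = str(uuid.uuid4())                     # generated once by the client
first = endpoint.post(key, {"sku": "A1"})   # original attempt
retry = endpoint.post(key, {"sku": "A1"})   # retry after a timeout, same key
```

The sketch ignores concurrency: two simultaneous requests with the same key could both reach the handler. A real implementation would claim the key atomically before processing.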
What interviewers are really testing: Whether you design APIs that communicate clearly through status codes, or whether you are the kind of developer who returns 200 OK with {"error": "Something broke"} in the body.
Answer:
2xx — Success:
  • 200 OK — Standard success. Use for GET, PUT, PATCH responses with a body.
  • 201 Created — Resource created. Should include a Location header pointing to the new resource. Use after successful POST.
  • 202 Accepted — Request received but not yet processed. Used for async operations (batch jobs, long-running tasks). Client should poll or receive a webhook.
  • 204 No Content — Success with no response body. Typical for DELETE or PUT when you do not need to return the updated resource.
3xx — Redirection:
  • 301 Moved Permanently — Resource permanently moved. Browsers and clients cache this. Use when you deprecate an endpoint permanently.
  • 302 Found — Temporary redirect. Used in OAuth flows to redirect to the authorization server.
  • 304 Not Modified — Client’s cached version is still valid. Server checks If-None-Match (ETag) or If-Modified-Since headers and skips sending the body. Saves bandwidth — critical for mobile APIs.
4xx — Client Errors:
  • 400 Bad Request — Malformed request (invalid JSON, missing required fields). Include a structured error body with field-level details.
  • 401 Unauthorized — Actually means unauthenticated. No valid credentials provided. The naming is historically misleading.
  • 403 Forbidden — Authenticated but not authorized. You know who they are, but they lack permission.
  • 404 Not Found — Resource does not exist. Also commonly used to hide the existence of resources from unauthorized users (security through ambiguity).
  • 409 Conflict — State conflict. Example: trying to create a user with an email that already exists, or updating a resource that has been modified since you last read it (optimistic concurrency).
  • 422 Unprocessable Entity — Request is syntactically valid JSON but semantically wrong (e.g., "age": -5). Rails and many modern APIs prefer this over 400 for validation errors.
  • 429 Too Many Requests — Rate limit exceeded. Should include a Retry-After header so clients know when to try again. Clients that respect this are good citizens.
5xx — Server Errors:
  • 500 Internal Server Error — Unhandled exception. Your monitoring should alert on these. Never leak stack traces to clients in production.
  • 502 Bad Gateway — The gateway/proxy received an invalid response from upstream. Common when a microservice behind an API gateway crashes.
  • 503 Service Unavailable — Server is overloaded or in maintenance. Include Retry-After. Used during deployments or circuit breaker activations.
  • 504 Gateway Timeout — Upstream service did not respond in time. Common in microservices when a downstream dependency is slow.
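The split between 400 and 422 in the list above can be sketched as a single validation function: an unparseable body yields 400, while a parseable body with bad values yields 422 and field-level errors. The function name `validate_user_payload`, the field rules, and the error body shape are assumptions made for this example.

```python
import json
from typing import Tuple


def validate_user_payload(raw_body: str) -> Tuple[int, dict]:
    """Return (status, body) for a hypothetical 'create user' request."""
    try:
        data = json.loads(raw_body)
    except json.JSONDecodeError:
        # The body is not valid JSON at all: syntactic failure.
        return 400, {"title": "Malformed JSON", "status": 400}
    if not isinstance(data, dict):
        return 400, {"title": "Expected a JSON object", "status": 400}

    errors = []
    email = data.get("email")
    if not isinstance(email, str) or "@" not in email:
        errors.append({"field": "email", "message": "must be a valid email address"})
    age = data.get("age")
    if not isinstance(age, int) or age < 0:
        errors.append({"field": "age", "message": "must be a non-negative integer"})

    if errors:
        # Valid JSON, invalid values: semantic failure.
        return 422, {"title": "Invalid request", "status": 422, "errors": errors}
    return 201, data
```

Keeping the two failure modes on different status codes lets dashboards and clients tell them apart without parsing message strings.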
Production war story: A team returned 200 OK for every response and put the actual status in the JSON body as {"status": "error", "code": 500}. Their monitoring showed 0% error rate because it only checked HTTP status codes. When a database outage hit, they had no alerts for 45 minutes. Proper status codes are not just for clients; they drive your entire observability pipeline (Datadog, Grafana dashboards, alerting rules, SLO tracking).
Red flag answer: “I just use 200 for success and 500 for errors” or confusing 401 with 403.
Follow-up questions:
Q: When should you return 404 vs 403 for a resource the user cannot access?
It depends on whether revealing the resource’s existence is a security concern. For a banking API, if user A requests user B’s account, return 404 and do not confirm the account exists. For an internal admin panel, 403 is more helpful for debugging. GitHub does this: if you request a private repo you do not have access to, you get 404, not 403. This prevents enumeration attacks where an attacker probes for valid resource IDs.
Q: Your API returns 400 for every validation error. A frontend developer complains they cannot tell malformed JSON from a missing required field. How do you fix this?
Adopt a structured error response format. RFC 7807 (Problem Details for HTTP APIs, since superseded by RFC 9457) is the standard. Return {"type": "validation_error", "title": "Invalid request", "status": 422, "errors": [{"field": "email", "message": "must be a valid email address", "code": "invalid_format"}]}. Use 400 for truly malformed requests (unparseable JSON) and 422 for semantically invalid data. This separation lets frontend teams build field-level error highlighting without parsing error message strings.
Q: How do status codes interact with circuit breakers in a microservices architecture?
Circuit breakers (Resilience4j, Polly, Hystrix) typically count 5xx responses and timeouts toward the failure threshold. A spike of 502s or 503s from a downstream service will trip the breaker, causing the calling service to fail fast with its own error (typically 503 with a fallback response). The key configuration decision is: should 4xx errors count toward the circuit breaker threshold? Usually no, because 4xx means the client sent a bad request, not that the downstream service is unhealthy. But watch out for 429s at scale: if your downstream aggressively rate-limits you, a flood of 429s might warrant a brief back-off that resembles circuit-breaking behavior.
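The counting rule described above (5xx responses and timeouts count as failures, 4xx responses do not) can be sketched as a deliberately small breaker. This is an illustrative toy, not how Resilience4j, Polly, or Hystrix are implemented: it tracks consecutive failures only and has no half-open state or time window. The class and its parameters are hypothetical.

```python
from typing import Optional


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive downstream failures."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def record(self, status_code: Optional[int]) -> None:
        """Record one downstream call; `None` represents a timeout."""
        if status_code is None or status_code >= 500:
            self.consecutive_failures += 1
        else:
            # 2xx-4xx: the downstream answered, so it is considered healthy.
            self.consecutive_failures = 0


breaker = CircuitBreaker(failure_threshold=3)
for status in (404, 400, 422, 404):
    breaker.record(status)       # client errors never trip the breaker
closed_after_4xx = not breaker.is_open

for status in (502, 503, None):
    breaker.record(status)       # two 5xx responses and one timeout
open_after_5xx = breaker.is_open
```

Treating 429 as a failure or as a separate back-off signal is a policy choice this sketch leaves out.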
What interviewers are really testing: Whether you can articulate when each approach makes sense, not just list differences. SOAP is not dead: it is alive in banking, healthcare, and government.
Answer:
| Aspect | REST | SOAP |
| --- | --- | --- |
| Protocol | HTTP in practice (the style itself is protocol-agnostic) | Transport-independent (HTTP, SMTP, JMS, TCP) |
| Data Format | JSON, XML, or any format | XML only (with strict envelope structure) |
| Contract | Informal (OpenAPI is optional) | Formal WSDL (auto-generates clients) |
| Error Handling | HTTP status codes | SOAP Faults (structured XML errors) |
| Security | HTTPS + OAuth/JWT | WS-Security (message-level encryption, signing) |
| Statefulness | Stateless by design | Can be stateful (WS-ReliableMessaging) |
| Transactions | No built-in support | WS-AtomicTransaction, WS-Coordination |
| Performance | Lightweight, fast (JSON payloads are typically smaller than the equivalent XML) | Heavier (XML parsing, SOAP envelope overhead) |
| Tooling | Manual or code-gen from OpenAPI | Auto-generated strongly-typed clients from WSDL |
When SOAP still wins:
  • Financial services — WS-Security provides message-level encryption (the message is encrypted even if TLS terminates at a load balancer). Many core banking and interbank integrations still expose SOAP interfaces.
  • Healthcare — HL7 FHIR is REST-based, but many legacy hospital systems still use SOAP for compliance.
  • Enterprise integration — When you need guaranteed delivery (WS-ReliableMessaging) and distributed transactions (WS-AtomicTransaction) across organizational boundaries.
  • Formal contracts — WSDL allows auto-generation of strongly-typed client libraries. This matters when your API has 500+ operations and multiple consumer teams.
When REST wins: Basically everything else. Public APIs, mobile backends, SPAs, microservices. The developer experience advantage is overwhelming: JSON is human-readable, tooling is everywhere, and the learning curve is minimal.
Red flag answer: “SOAP is outdated and should never be used”, which shows a lack of enterprise experience. Also, “REST uses JSON and SOAP uses XML” without understanding the deeper architectural differences.
Follow-up questions:
Q: You join a fintech company that has a SOAP-based core banking API. A new team wants to build a REST API. What do you recommend?
Build a REST facade (Anti-Corruption Layer / BFF) in front of the SOAP core. Do not rewrite the core; it is battle-tested and compliant. The REST layer translates JSON to SOAP XML, handles auth translation (OAuth to WS-Security), and presents a modern interface to new consumers. Tools like MuleSoft, Apache CXF, or even a lightweight Node/Go service can do this translation. Over time, if the core is being rewritten, you can route traffic to new REST microservices (Strangler Fig pattern). This is what many banks like Capital One did during their modernization.
Q: What is the security difference between transport-level (TLS) and message-level (WS-Security) encryption?
TLS encrypts the transport pipe: data is encrypted in transit but decrypted at each hop (load balancer, API gateway, reverse proxy). If any intermediary is compromised, the message is exposed in plaintext. WS-Security encrypts the message itself, so only the intended recipient with the private key can decrypt it. In a zero-trust architecture with multiple intermediaries, message-level security provides end-to-end confidentiality. The trade-off is significant performance overhead (XML digital signatures are computationally expensive) and complexity. For most architectures, TLS with mutual authentication (mTLS) is sufficient.
What interviewers are really testing: Whether you understand the theory behind REST’s most controversial constraint and whether you have an informed opinion on its practical value.
Answer:
HATEOAS = Hypermedia As The Engine Of Application State.
The core idea: API responses include links that tell the client what actions are possible next. The client should not need to hardcode URL patterns; it discovers them from responses.
Example:
{
  "id": 42,
  "status": "pending",
  "amount": 150.00,
  "links": [
    { "rel": "self", "href": "/orders/42", "method": "GET" },
    { "rel": "cancel", "href": "/orders/42/cancel", "method": "POST" },
    { "rel": "payment", "href": "/orders/42/pay", "method": "POST" }
  ]
}
If the order status were "shipped", the cancel link would disappear: the client knows cancellation is no longer possible without any client-side business logic.
Why it matters in theory:
  • Decoupling — URLs can change without breaking clients (clients follow links, not hardcoded paths).
  • Discoverability — A new client can explore the API starting from just the root URL.
  • Server-driven workflows — The server controls what transitions are valid, reducing client-side state management.
Why it rarely works in practice:
  • Most frontend teams prefer explicit documentation (Swagger UI) over dynamically discovering endpoints.
  • Adding link generation to every response increases response size by 20-40% for simple resources.
  • Client SDKs are typically auto-generated from OpenAPI specs, making URL hardcoding a non-issue.
  • The industry standard (GitHub, Stripe, Twilio) is partial HATEOAS: pagination links and Location headers, but not full hypermedia-driven navigation.
Where it actually adds value:
  • Long-lived APIs with many consumer teams (banking, government) where URLs may evolve over decades.
  • Workflow-driven APIs where valid next actions depend on current state (order fulfillment, insurance claims processing).
  • HAL, JSON:API, Siren are established media types that standardize how links are embedded.
Red flag answer: “HATEOAS means the API is self-documenting so you don’t need Swagger”, which misunderstands both concepts. Or having never heard of HATEOAS at all.
Follow-up questions:
Q: You are designing an API for an insurance claims workflow with 12 possible states. Would you use HATEOAS?
Yes, this is actually a strong use case. Each claim state (filed, under_review, approved, denied, appealed, etc.) has different valid transitions. Embedding those transitions as links means the frontend does not need to replicate the state machine logic. When the business adds a new state or changes transition rules, only the backend changes. Without HATEOAS, every client must update their hardcoded state transition logic. The claims team at a Fortune 100 insurer I worked with saved significant cross-team coordination overhead by adopting this approach.
Q: What is the difference between HAL, JSON:API, and Siren?
All three are media types for hypermedia APIs but differ in philosophy. HAL (Hypertext Application Language) is the simplest: it adds _links and _embedded to JSON responses. Great for simple APIs. JSON:API is a full specification that standardizes filtering, sorting, pagination, sparse fieldsets, and relationships. It is opinionated and reduces bikeshedding but adds payload overhead. Siren goes furthest: it includes actions (forms) with field definitions, so the client knows what data to send for each action. Siren is closest to true HATEOAS but is the most complex. For most teams, HAL or simple link objects (like GitHub uses) are the pragmatic choice.
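The server-driven workflow idea above, where the available links follow from the resource's current state, can be sketched as a lookup table that maps each status to its valid transitions. The table `ACTIONS_BY_STATUS` and the function `order_representation` are hypothetical; the link shape mirrors the order example earlier in this section.

```python
from typing import Dict, List

# Which actions are valid in each state: {status: {rel: path suffix}}.
# Only the states from the order example are modelled here.
ACTIONS_BY_STATUS: Dict[str, Dict[str, str]] = {
    "pending": {"cancel": "cancel", "payment": "pay"},
    "shipped": {},  # no further transitions offered to the client
}


def order_representation(order_id: int, status: str, amount: float) -> dict:
    """Build the JSON body for an order, including only the currently valid links."""
    base = f"/orders/{order_id}"
    links: List[dict] = [{"rel": "self", "href": base, "method": "GET"}]
    for rel, suffix in ACTIONS_BY_STATUS.get(status, {}).items():
        links.append({"rel": rel, "href": f"{base}/{suffix}", "method": "POST"})
    return {"id": order_id, "status": status, "amount": amount, "links": links}


pending = order_representation(42, "pending", 150.00)
shipped = order_representation(42, "shipped", 150.00)
```

A client that renders buttons from `links` needs no copy of the state machine: when the order ships, the cancel button disappears because the link does.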
What interviewers are really testing: Whether you understand the performance implications of different pagination strategies and can choose the right one based on data characteristics and access patterns.
Answer:
Offset/Limit (Page-based):
GET /users?offset=1000&limit=20
  • The database executes SELECT * FROM users ORDER BY id LIMIT 20 OFFSET 1000 — this means the DB reads and discards 1000 rows before returning 20.
  • Performance degrades linearly with depth. Page 1 is fast. Page 501 (offset 10,000 at 20 rows per page) forces the DB to read and discard 10,000 rows.
  • Supports “jump to page 50” and total count (SELECT COUNT(*)).
  • Data drift problem: If items are inserted or deleted between page requests, users see duplicates or miss items. If a new user is inserted at position 5 after you have fetched page 1, the row that was #20 shifts to #21 and appears again on page 2.
Cursor-based (Keyset):
GET /users?after=eyJ1c2VyX2lkIjogNDJ9&limit=20
  • The cursor is typically a Base64-encoded pointer (e.g., the last seen id or created_at value).
  • The database executes SELECT * FROM users WHERE id > 42 ORDER BY id LIMIT 20 — uses an index seek, consistently fast regardless of depth.
  • No page jumping — you can only go forward (or backward with a before cursor). No total count without a separate query.
  • Stable under mutations — inserting or deleting rows does not cause drift because the cursor is anchored to a specific record.
Seek-based (Time-based):
GET /events?since=2024-01-15T10:30:00Z&limit=100
  • A variant of cursor pagination using timestamps. Works well for event streams and audit logs.
  • Gotcha: If multiple events share the same timestamp, you can miss records. Solution: use composite cursors (timestamp + tie-breaking ID).
When to use what:
  • Offset — Internal admin dashboards where users expect “Page 3 of 50” and datasets are small (< 100K rows).
  • Cursor — Public APIs, mobile feeds, any endpoint where deep pagination is expected or data mutates frequently. Used by Facebook, Twitter, Slack, and Stripe.
  • Seek/Time — Event sourcing, audit logs, changelog endpoints.
Production metric: At a company processing 2M API requests/day, switching from offset to cursor pagination on their search endpoint reduced p95 latency from 1,200ms to 45ms for deep pages (page 100+). The DB went from full table scans to index-only lookups.
Red flag answer: “Just use LIMIT and OFFSET” without acknowledging the performance cliff, or not knowing what cursor pagination is.
Follow-up questions:
Q: Your API uses cursor pagination but the product team demands a “total count” displayed as “Showing 1-20 of 14,523.” How do you handle this?
SELECT COUNT(*) on large tables is expensive (InnoDB has to scan an index or the table; PostgreSQL can sometimes do better with an index-only scan backed by its visibility map). Options: (1) Cache the count with a short TTL (30 seconds) and accept slight inaccuracy. (2) Use the planner's row estimate instead of an exact count (Postgres: SELECT reltuples FROM pg_class WHERE relname = 'users', which is fast but approximate). (3) Display “Showing 1-20 of ~14.5K” with an approximate count. (4) Track the count in a separate materialized counter that updates asynchronously. Slack’s API returns "has_more": true instead of a total count, a UX-friendly alternative.
Q: How does cursor pagination work when sorting by a non-unique field like created_at?
If two records have the same created_at, a cursor pointing to that timestamp is ambiguous. Solution: use a compound cursor that includes both the sort field and a unique tie-breaker (typically the primary key). The query becomes WHERE (created_at, id) > ('2024-01-15 10:30:00', 42) ORDER BY created_at, id LIMIT 20. This guarantees deterministic ordering. The cursor encodes both values: base64({"created_at": "2024-01-15T10:30:00Z", "id": 42}).
Q: A client paginates through a dataset that is being actively written to. How do you guarantee consistency?
With cursor pagination, new records inserted after the cursor position will appear on subsequent pages (which is usually desired for feeds). Records inserted before the cursor are naturally excluded. If the client needs a frozen snapshot, you can use a versioned cursor that includes a timestamp filter: WHERE created_at <= :snapshot_time AND (sort_key, id) > (:cursor_sort, :cursor_id). Alternatively, use database-level snapshot reads (PostgreSQL’s REPEATABLE READ isolation) for short-lived pagination sessions.
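The compound-cursor approach above can be sketched end to end with an in-memory SQLite database. The `events` table, the helper names, and the sample rows are assumptions for illustration. The row-value comparison `(created_at, id) > (?, ?)` is written in its expanded form so the query does not depend on row-value support in the database.

```python
import base64
import json
import sqlite3
from typing import List, Optional, Tuple


def encode_cursor(created_at: str, row_id: int) -> str:
    raw = json.dumps({"created_at": created_at, "id": row_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> Tuple[str, int]:
    data = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return data["created_at"], data["id"]


def fetch_page(conn: sqlite3.Connection, limit: int,
               after: Optional[str] = None) -> Tuple[List[tuple], Optional[str]]:
    """Return (rows, next_cursor), ordered by (created_at, id)."""
    if after is None:
        rows = conn.execute(
            "SELECT id, created_at FROM events ORDER BY created_at, id LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        created_at, row_id = decode_cursor(after)
        rows = conn.execute(
            "SELECT id, created_at FROM events "
            "WHERE created_at > ? OR (created_at = ? AND id > ?) "
            "ORDER BY created_at, id LIMIT ?",
            (created_at, created_at, row_id, limit),
        ).fetchall()
    # A full page may be followed by more rows, so hand back a cursor.
    next_cursor = encode_cursor(rows[-1][1], rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "2024-01-15T10:30:00Z"), (2, "2024-01-15T10:30:00Z"),
    (3, "2024-01-15T10:30:00Z"),  # three rows share one timestamp
    (4, "2024-01-15T10:31:00Z"), (5, "2024-01-15T10:32:00Z"),
])
page1, cursor1 = fetch_page(conn, 2)
page2, cursor2 = fetch_page(conn, 2, cursor1)
page3, cursor3 = fetch_page(conn, 2, cursor2)
```

Row 3 shares its timestamp with the last row of page 1, yet it still appears on page 2 because the tie-breaking `id` is part of the cursor. A timestamp-only cursor would have skipped it.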
What interviewers are really testing: Whether you can balance backwards compatibility with the need to evolve APIs, and whether you understand the organizational (not just technical) impact of versioning decisions.
Answer:
URI Path Versioning:
GET /v1/users
GET /v2/users
  • Most common in practice (used by Stripe, Twilio, Google).
  • Simple, explicit, cacheable (CDNs can cache /v1/ and /v2/ separately).
  • Downside: pollutes the URI space. Every version is essentially a new API surface to maintain.
Header Versioning:
GET /users
Accept-Version: v2
or using custom media types:
Accept: application/vnd.myapi.v2+json
  • Keeps URLs clean. GitHub uses media type versioning (application/vnd.github.v3+json).
  • Harder to test (cannot just paste a URL in a browser). Requires client education.
Query Parameter:
GET /users?version=2
  • Easy to implement and test. Used by some Google APIs.
  • Pollutes query strings. Can cause caching issues (CDNs might not key on query params by default).
No explicit versioning (Additive-only / Evolutionary):
  • Never remove or rename fields. Only add new ones.
  • Used by LinkedIn (Rest.li) and some internal APIs.
  • Requires rigorous discipline: every change must be backward-compatible.
  • Works best when you control most clients or have a small number of consumers.
The real strategic question is not how to version but when:
  • Non-breaking changes (add a field, add an endpoint): no version bump. Ever.
  • Breaking changes (remove a field, change a field type, rename a field): version bump.
  • Stripe’s approach: version pinning. Each API key is pinned to a version. Stripe maintains every version since 2011. They use an internal compatibility layer that transforms requests/responses between versions. This is engineering-expensive but customer-friendly.
Production reality: Maintaining multiple API versions is expensive. At Stripe’s scale, they have dedicated infrastructure for version translation. For most teams, maintain at most 2 concurrent versions (current + previous), with a deprecation timeline of 6-12 months and clear migration guides.
Red flag answer: “Just use /v1/ in the URL” without discussing when to actually bump versions, how to deprecate, or the cost of maintaining multiple versions.
Follow-up questions:
Q: Your API has 200 consumers and you need to make a breaking change. Walk me through your strategy.
(1) Ship the new version alongside the old one with a minimum 6-month overlap. (2) Add Sunset (RFC 8594) and Deprecation response headers to v1 responses. (3) Notify consumers via email, developer portal announcement, and dashboard warnings. (4) Track v1 usage per consumer in your analytics (Datadog/Mixpanel). (5) Offer a migration guide and ideally a compatibility shim. (6) As the sunset date approaches, personally reach out to the top 10 consumers still on v1. (7) Return 410 Gone after the sunset date. Stripe gives 12+ months and personally helps large customers migrate.
Q: How does versioning interact with your CI/CD pipeline? How do you test multiple versions?
Each version needs its own contract tests. Use consumer-driven contract testing (Pact) so that every consumer’s expected contract is verified against the current provider. In CI, run the test suite for every supported version. Maintain version-specific fixtures/snapshots. API version translation logic (if using Stripe-style version pinning) should have its own unit tests. Integration environments should support routing by version header or URL path.
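Steps (2) and (7) of the deprecation strategy above can be sketched as a small response wrapper: before the sunset date a deprecated version gets a Sunset header, and afterwards it answers 410 Gone. The `SUNSET_DATES` table, the chosen date, and `versioned_response` are hypothetical. The Deprecation header is omitted because its value format is not specified in this document.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple

# Hypothetical retirement schedule: version -> moment it stops being served.
SUNSET_DATES: Dict[str, datetime] = {
    "v1": datetime(2025, 6, 30, 23, 59, 59, tzinfo=timezone.utc),
}


def versioned_response(version: str, body: dict,
                       now: datetime) -> Tuple[int, Dict[str, str], dict]:
    """Return (status, headers, body) for a request made against `version`."""
    sunset = SUNSET_DATES.get(version)
    if sunset is None:
        return 200, {}, body  # current version: nothing to announce
    if now >= sunset:
        return 410, {}, {"title": "This API version has been retired", "status": 410}
    # Sunset carries an HTTP-date, which format_datetime produces with usegmt=True.
    headers = {"Sunset": format_datetime(sunset, usegmt=True)}
    return 200, headers, body


before = versioned_response("v1", {"ok": True}, datetime(2025, 1, 1, tzinfo=timezone.utc))
after = versioned_response("v1", {"ok": True}, datetime(2025, 7, 1, tzinfo=timezone.utc))
current = versioned_response("v2", {"ok": True}, datetime(2025, 7, 1, tzinfo=timezone.utc))
```

Emitting the header from one place also gives a natural hook for step (4), counting which consumers still call the deprecated version.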
What interviewers are really testing: Whether you understand how HTTP content negotiation actually works at the protocol level and can handle multi-format APIs gracefully.
Answer:
Content negotiation is the mechanism by which a client and server agree on the representation format of a resource.
Server-driven negotiation (most common):
  1. Client sends Accept: application/json in the request header.
  2. Server checks if it can produce JSON. If yes, responds with Content-Type: application/json.
  3. If the server cannot produce the requested format, it returns 406 Not Acceptable.
  4. If the client sends no Accept header, the server uses its default format (typically JSON).
Advanced Accept header features:
Accept: application/json; q=0.9, application/xml; q=0.8, text/csv; q=0.5
The q parameter (quality factor, 0 to 1) indicates client preference. The server should try to satisfy the highest-quality preference it supports. Most developers never use this, but load testing tools and enterprise integrations sometimes do.
Content-Type for requests: The Content-Type header on the request tells the server what format the body is in. If a client sends Content-Type: application/xml but the server only accepts JSON, return 415 Unsupported Media Type.
Real-world patterns:
  • Format suffix: GET /users/42.json or GET /users/42.csv. Used by Rails and some legacy APIs. Generally discouraged in modern API design because it mixes resource identity with representation.
  • Vendor media types: Accept: application/vnd.mycompany.user.v2+json — combines versioning and content negotiation in one header. Used by GitHub.
  • Query parameter override: Some APIs always return JSON but allow ?format=csv as a query parameter for data export endpoints. This is not standard content negotiation but is pragmatic for download features.
Red flag answer: Only knowing Content-Type but not Accept, or not knowing what a 406 or 415 status code means.
Follow-up questions:
Q: Your API needs to support JSON for the web app and Protocol Buffers for internal gRPC clients. How do you handle this?
Use content negotiation at the gateway level. The API gateway (Kong, Envoy) inspects the Accept header. For application/json, route to the REST handler. For application/protobuf, either route to a gRPC backend directly, or have the REST handler serialize the same domain model using a Protobuf serializer. Frameworks like gRPC-Gateway (Go) can automatically serve both REST/JSON and gRPC/Protobuf from the same service definition. The key is to keep the business logic format-agnostic and serialize at the edge.
Q: A client sends a request with no Accept header. What should your API do?
Return your default format (JSON for most modern APIs) with the appropriate Content-Type header. Do not return an error. The HTTP spec says the absence of an Accept header means the client accepts any media type. However, document your default clearly. Some older APIs default to XML, which surprises modern clients.
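Server-driven negotiation with q-values, as described above, can be sketched in a few lines. This is a simplified parser for illustration: it handles exact media types and the `*/*` wildcard, but not partial wildcards such as `text/*` or media type parameters other than `q`. The names `parse_accept` and `negotiate` are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Split an Accept header into (media_type, q) pairs, highest q first."""
    preferences = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when omitted
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unparseable q: treat as not acceptable
        preferences.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's original order.
    return sorted(preferences, key=lambda pref: pref[1], reverse=True)


def negotiate(accept_header: Optional[str], supported: List[str],
              default: str = "application/json") -> Optional[str]:
    """Pick a response media type, or None when the caller should return 406."""
    if not accept_header:
        return default  # no Accept header means any type is acceptable
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return default
        if media_type in supported:
            return media_type
    return None
```

For example, `negotiate("application/xml; q=0.8, application/json; q=0.9", ["application/json", "text/csv"])` selects JSON because XML is unsupported. A request that accepts only XML yields `None`, which the handler would translate into 406 Not Acceptable.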
What interviewers are really testing: Whether you understand the semantic difference deeply enough to avoid bugs in concurrent update scenarios and can articulate when each is appropriate.
Answer:
PUT — Full replacement of the resource:
PUT /users/42
{ "name": "Alice", "email": "alice@new.com", "role": "admin" }
If you omit role, the server should set it to null/default: you sent the full representation, so missing fields mean “intentionally absent.” The server replaces the entire resource with what you sent.
PATCH (partial modification):
PATCH /users/42
{ "email": "alice@new.com" }
Only the email field is updated. All other fields remain unchanged.
Two PATCH standards:
  1. JSON Merge Patch (RFC 7396, which obsoletes RFC 7386): Simple. Send a partial JSON object. Fields present are updated, fields absent are untouched, fields set to null are deleted. The gotchas: because null always means “delete,” you cannot set a field to a literal null value, and arrays can only be replaced wholesale, not modified element by element.
  2. JSON Patch (RFC 6902): Operation-based. A sequence of operations:
[
  { "op": "replace", "path": "/email", "value": "alice@new.com" },
  { "op": "add", "path": "/tags/-", "value": "premium" },
  { "op": "remove", "path": "/legacy_field" }
]
More powerful (supports move, copy, test operations) but more complex. The test operation enables optimistic concurrency: with {"op": "test", "path": "/version", "value": 5} in the sequence, the entire patch is rejected if the stored version is no longer 5.
Concurrency implications:
  • Two clients simultaneously PUT the full user object. Client A reads version 3, Client B reads version 3. Client A PUTs with updated email. Client B PUTs with updated name. Client B’s PUT overwrites Client A’s email change (lost update). Solution: If-Match header with ETags.
  • With PATCH, the same scenario is safer if both clients only send the fields they changed. Client A patches email, Client B patches name — no conflict. But this is not guaranteed safe for all operations.
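The merge semantics described above can be sketched in a few lines. This is a minimal illustration of the JSON Merge Patch algorithm (as specified in RFC 7396, which obsoletes RFC 7386), not a production library; the `merge_patch` function name and the sample user document are assumptions made for the example.

```python
from typing import Any


def merge_patch(target: Any, patch: Any) -> Any:
    """Apply a JSON Merge Patch to target and return the patched value."""
    if not isinstance(patch, dict):
        # A non-object patch (including an array) replaces the target wholesale.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null means "remove this member", not "store null".
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Hypothetical resource used only for illustration.
user = {"name": "Alice", "email": "alice@old.com", "role": "admin", "tags": ["a", "b"]}
patched = merge_patch(user, {"email": "alice@new.com", "role": None, "tags": ["premium"]})
print(patched)
```

Note that `role` disappears rather than becoming null, and `tags` is replaced rather than appended to. Those two behaviors are the usual reasons to reach for JSON Patch instead.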
Red flag answer: “PATCH is just PUT but with fewer fields.” This misses the semantic difference. Another red flag is not knowing about JSON Patch vs JSON Merge Patch.
Follow-up questions:
Q: How would you implement optimistic concurrency control with PUT?
Use ETags. On GET, the server returns ETag: "abc123" (a hash of the resource state). On PUT, the client sends If-Match: "abc123". The server checks if the current ETag matches. If yes, process the update and return the new ETag. If no, return 412 Precondition Failed, because the resource was modified between the client’s read and write. The client must re-read, merge changes, and retry. This is the standard HTTP mechanism for preventing lost updates and is used by S3, Azure Cosmos DB, and Google Cloud Storage.
Q: When would you choose JSON Patch over JSON Merge Patch?
JSON Patch when you need: (1) Array manipulation (add/remove specific elements by index). (2) Atomic multi-step operations (move a field from one location to another). (3) The test operation for server-side precondition checking. (4) The ability to set a field to an explicit null (in JSON Merge Patch, null always means “remove”). JSON Merge Patch when: you have a simple flat object structure and want minimal complexity. For most CRUD APIs, JSON Merge Patch is sufficient.
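The ETag / If-Match flow can be sketched with an in-memory store standing in for the server. Everything here is an assumption for illustration: the `store` dict, the `get`/`put` helpers, the `PreconditionFailed` exception (which a real handler would map to 412), and deriving the ETag from a SHA-256 hash of the serialized resource.

```python
import hashlib
import json
from typing import Tuple


class PreconditionFailed(Exception):
    """Would be returned as HTTP 412 by a real handler."""


def etag_of(resource: dict) -> str:
    body = json.dumps(resource, sort_keys=True).encode()
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


# Hypothetical in-memory resource store.
store = {"users/42": {"name": "Alice", "email": "alice@old.com"}}


def get(path: str) -> Tuple[dict, str]:
    resource = store[path]
    return dict(resource), etag_of(resource)


def put(path: str, new_resource: dict, if_match: str) -> str:
    # Reject the write if the resource changed since the client read it.
    if etag_of(store[path]) != if_match:
        raise PreconditionFailed(path)
    store[path] = dict(new_resource)
    return etag_of(new_resource)


# Both clients read the same version of the resource.
a_copy, a_tag = get("users/42")
b_copy, b_tag = get("users/42")

a_copy["email"] = "alice@new.com"
put("users/42", a_copy, if_match=a_tag)  # first writer wins

b_copy["name"] = "Alicia"
try:
    put("users/42", b_copy, if_match=b_tag)  # stale ETag
    lost_update_prevented = False
except PreconditionFailed:
    lost_update_prevented = True

print(lost_update_prevented, store["users/42"])
```

Client B's write is rejected instead of silently overwriting Client A's email change; B would then re-read, reapply its name change, and retry with the fresh ETag.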
What interviewers are really testing: Whether you can design async API patterns that are reliable, observable, and client-friendly, a critical skill for any non-trivial backend system.
Answer:
The async request pattern:
  1. Client sends POST /reports/generate with parameters.
  2. Server validates the request, enqueues the work, and immediately returns:
HTTP/1.1 202 Accepted
Location: /tasks/abc-123
Retry-After: 5
  3. Client polls GET /tasks/abc-123:
{
  "id": "abc-123",
  "status": "processing",
  "progress": 45,
  "estimated_completion": "2024-01-15T10:35:00Z"
}
  4. When complete:
{
  "id": "abc-123",
  "status": "completed",
  "result_url": "/reports/xyz-789",
  "completed_at": "2024-01-15T10:34:12Z"
}
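A client-side polling loop for the flow above might look like the following sketch. The `poll_task` helper, its parameters, and the canned `responses` iterator (standing in for repeated GET /tasks/abc-123 calls) are hypothetical; a real client would issue HTTP requests and read the wait time from the Retry-After header.

```python
import time
from typing import Callable, Dict, Iterator, List


def poll_task(fetch: Callable[[], Dict], retry_after: float = 5.0,
              max_attempts: int = 60,
              sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll until the task reaches a terminal status or the attempt budget runs out."""
    for _ in range(max_attempts):
        task = fetch()
        if task["status"] in ("completed", "failed"):
            return task
        sleep(retry_after)  # honor the server's suggested interval
    raise TimeoutError("task did not finish within the polling budget")


# Canned responses standing in for GET /tasks/abc-123.
responses: Iterator[Dict] = iter([
    {"id": "abc-123", "status": "processing", "progress": 45},
    {"id": "abc-123", "status": "processing", "progress": 90},
    {"id": "abc-123", "status": "completed", "result_url": "/reports/xyz-789"},
])

waits: List[float] = []  # record waits instead of actually sleeping
final = poll_task(lambda: next(responses), retry_after=5, sleep=waits.append)
print(final["result_url"], waits)
```

Injecting `sleep` keeps the loop testable without real delays; the bounded `max_attempts` prevents a client from polling forever if a task is stuck.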
Alternatives to polling:
  • Webhooks — Server POSTs the result to a client-provided callback URL when done. More efficient than polling but requires the client to expose an endpoint and handle retries/failures.
  • Server-Sent Events (SSE) — Client opens a long-lived HTTP connection and receives status updates as a stream. Good for dashboards and progress bars.
  • WebSockets — Bidirectional. Overkill for simple task status but useful if the client needs to send cancellation requests or adjust parameters mid-processing.
Production design considerations:
  • Idempotency — The initial POST should be idempotent (use an idempotency key). If the client retries, return the existing task ID, not create a duplicate.
  • Task expiry — Tasks should have a TTL. Do not store completed task results forever. Return 410 Gone after the TTL expires.
  • Cancellation — Expose POST /tasks/abc-123/cancel. The worker must check for cancellation at reasonable intervals.
  • Error handling — If the task fails, the status should include error details: {"status": "failed", "error": {"code": "TIMEOUT", "message": "Upstream payment processor timed out"}}.
  • Rate limiting on polls — Return Retry-After headers to prevent aggressive polling. If a client polls every 100ms, throttle them.
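The idempotency consideration above can be illustrated with a small in-memory sketch. The `submit_report` function, the two dictionaries, and the key format are assumptions for the example; a real service would persist the key-to-task mapping with a TTL and guard it against concurrent inserts.

```python
import uuid
from typing import Dict, Tuple

tasks: Dict[str, dict] = {}             # task_id -> task record
idempotency_index: Dict[str, str] = {}  # idempotency key -> task_id


def submit_report(idempotency_key: str, params: dict) -> Tuple[int, dict]:
    """Return (status_code, task). A retry with the same key gets the same task back."""
    existing_id = idempotency_index.get(idempotency_key)
    if existing_id is not None:
        return 202, tasks[existing_id]
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"id": task_id, "status": "queued", "params": params}
    idempotency_index[idempotency_key] = task_id
    return 202, tasks[task_id]


first = submit_report("key-1", {"month": "2024-01"})
retry = submit_report("key-1", {"month": "2024-01"})  # client retried after a timeout
other = submit_report("key-2", {"month": "2024-02"})
print(first[1]["id"] == retry[1]["id"], len(tasks))
```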
Real-world example: AWS S3 multipart upload, Stripe’s file upload API, GitHub’s code scanning analysis, and Google Cloud’s long-running operations API all follow this pattern. Google even published an API design guide (AIP-151) specifically for long-running operations that includes standardized Operation resource schemas.
Red flag answer: “Just make the endpoint synchronous with a longer timeout.” This blocks threads, wastes resources, and fails at scale. Another red flag is designing the async pattern without considering cancellation, expiry, or error states.
Follow-up questions:
Q: You have a report generation endpoint that takes 2-30 minutes. Some users call it 50 times a day. How do you design for efficiency?
(1) Deduplicate identical requests: hash the input parameters and return a cached result if the same report was generated recently. (2) Queue with priority: SQS has no per-message priority, so use separate high/low priority queues (in SQS or RabbitMQ) or RabbitMQ’s priority queues, so that small reports do not get blocked behind large ones. (3) Progressive delivery: if possible, stream partial results as they are available. (4) Resource limits: cap concurrent tasks per user (e.g., 3 concurrent report generations) and return 429 with a clear message. (5) Pre-computation: for popular reports, generate them on a schedule (cron) and serve cached results.
Q: The polling pattern wastes bandwidth. How would you design a push-based alternative?
Implement webhooks with a retry policy. The client registers a callback URL: POST /webhooks {"url": "https://client.com/callback", "events": ["task.completed", "task.failed"]}. When the task completes, your system POSTs the result to the callback URL. Implement exponential backoff retries (1s, 2s, 4s, 8s… up to 5 attempts) with a dead letter queue for permanently failed deliveries. Sign the payload with HMAC and send the signature in a header (e.g., X-Webhook-Signature) so the client can verify authenticity. Stripe, GitHub, and Shopify all use variants of this pattern.
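A minimal sketch of the signing and retry pieces of the webhook answer, using only Python's standard library. The function names, the shared secret, and the choice of SHA-256 hex digests are assumptions for illustration; each provider defines its own header name and signing scheme.

```python
import hashlib
import hmac
from typing import List


def sign(secret: bytes, payload: bytes) -> str:
    """Sender side: compute the value to place in the signature header."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(secret: bytes, payload: bytes, signature: str) -> bool:
    """Receiver side: recompute and compare in constant time."""
    return hmac.compare_digest(sign(secret, payload), signature)


def backoff_schedule(attempts: int = 5, base: float = 1.0) -> List[float]:
    """Delays in seconds before each retry, doubling every attempt."""
    return [base * 2 ** i for i in range(attempts)]


secret = b"shared-webhook-secret"  # hypothetical secret exchanged at registration
body = b'{"event": "task.completed", "id": "abc-123"}'
header_value = sign(secret, body)

print(verify(secret, body, header_value))          # True
print(verify(secret, body + b" ", header_value))   # False: payload was altered
print(backoff_schedule())                          # [1.0, 2.0, 4.0, 8.0, 16.0]
```

The receiver should verify against the raw request bytes, before any JSON parsing or re-serialization, since whitespace changes alter the digest.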

2. Microservices Architecture

What interviewers are really testing: Whether you have the judgment to know when microservices are the right choice vs. cargo-culting the latest architecture trend. The best answer acknowledges that most teams should start with a monolith.
Answer:
| Aspect | Monolith | Microservices |
| --- | --- | --- |
| Deployment | Single artifact (WAR/JAR/binary) | Independent per-service deployments |
| Scaling | Vertical or clone the whole app | Scale individual services independently |
| Team structure | One team, shared codebase | Team-per-service (Conway’s Law) |
| Data | Single shared database | Database-per-service |
| Complexity | In the code (big codebase) | In the infrastructure (networking, orchestration) |
| Debugging | Stack trace, single process | Distributed tracing across services |
| Transactions | ACID (simple) | Sagas, eventual consistency (complex) |
| Latency | In-process function calls (nanoseconds) | Network calls (milliseconds) |
| Ideal for | Small teams (< 10 devs), new products, startups | Large organizations, high-scale systems, independent team scaling |
The nuanced take that impresses interviewers:
  • Start monolith, extract later. Shopify is a 3M+ line Ruby monolith serving $200B+ in GMV. Basecamp runs on a monolith. The “microservices first” approach has killed startups by drowning small teams in operational complexity before finding product-market fit.
  • The modular monolith is often the right middle ground: a single deployment unit with strict module boundaries, separate database schemas per module, and well-defined internal APIs. When a module needs independent scaling, extract it.
  • Microservices are an organizational scaling strategy, not a technical one. You adopt them when you have enough teams that independent deployment and clear service ownership reduce coordination costs. Amazon’s “two-pizza team” rule was the driver, not a technical requirement.
  • The hidden cost: Microservices require a service mesh or API gateway, distributed tracing (Jaeger/Zipkin), centralized logging (ELK/Datadog), container orchestration (Kubernetes), CI/CD per service, contract testing, and saga orchestration. Maintaining this infrastructure can easily occupy 2-3 dedicated platform engineers.
Red flag answer: “Microservices are always better because they scale independently.” This shows no understanding of the operational complexity trade-off. The same goes for “monoliths are legacy and should be avoided.”
Follow-up questions:
Q: You are the CTO of a 5-person startup. A senior engineer proposes microservices from day one. What do you say?
No. At 5 engineers, the coordination overhead of microservices will consume more time than feature development. Start with a well-structured monolith (clean module boundaries, dependency injection, separate database schemas per domain). Use feature flags for deployment flexibility. When you hit ~20 engineers and deploy frequency becomes a bottleneck (teams stepping on each other, merge conflicts, slow CI), extract the highest-contention module into a service first. Etsy, Shopify, and GitHub all started as monoliths and extracted services gradually.
Q: How do you decide which service to extract first from a monolith?
Look for: (1) The module with the highest independent change frequency (deploys blocked by other teams). (2) The module with distinct scaling requirements (e.g., image processing needs GPU, while the rest is CPU-bound). (3) The module with the cleanest data boundary (fewest foreign keys to other modules’ tables). (4) The module owned by a distinct team. Do NOT extract modules that have heavy transactional dependencies with other modules: you will end up building a distributed transaction system (saga) before you have built a feature.
Q: What is the “distributed monolith” anti-pattern?
It is microservices done wrong: services that are packaged and run separately but are so tightly coupled that a change in one requires coordinated changes and deployments across multiple services. Symptoms: shared databases, synchronous call chains 5+ services deep, services that cannot be released on their own, shared libraries with domain logic. You get all the operational complexity of microservices with none of the benefits. The fix is usually to merge tightly coupled services back together or introduce async communication (events) to decouple them.
What interviewers are really testing: Whether you understand data ownership in microservices and can handle the hard problems (reporting, consistency) that arise when you cannot just JOIN across services.
Answer:
The principle: Each microservice owns its data store exclusively. No other service may read from or write to that database directly. All access goes through the service’s API.
Why this matters:
  • Loose coupling — Services can change their schema without coordinating with other teams. If the User service switches from PostgreSQL to MongoDB, no other service is affected.
  • Independent scaling — The Search service can use Elasticsearch while the Order service uses PostgreSQL. Each database is sized for its own workload.
  • Encapsulation — The service controls its data invariants. No other service can put the data in an inconsistent state by writing directly.
The hard problems this creates:
  1. Cross-service queries / Reporting:
    • You cannot JOIN users ON orders.user_id = users.id across two databases.
    • Solutions: (a) API Composition — the caller queries both services and joins in memory. Works for simple cases but creates latency and coupling. (b) CQRS read store — publish events to build a denormalized read model (Elasticsearch, data warehouse) that contains pre-joined data. (c) Data lake / ETL — for analytics, replicate data into a warehouse (Snowflake, BigQuery) where analysts can query freely.
  2. Distributed transactions:
    • You cannot use a single database transaction across two services.
    • Solution: Sagas (covered in Q15). Accept eventual consistency where possible.
  3. Data duplication:
    • The Order service needs the user’s name for the invoice. Should it call the User service on every request?
    • Pragmatic approach: Store a denormalized copy of frequently needed data (user name, email) in the Order service, updated via events. Accept that it may be slightly stale (eventual consistency).
  4. Referential integrity:
    • No foreign keys across services. If the User service deletes a user, the Order service still has orders referencing that user ID. Handle with soft deletes, event-driven cleanup, or orphan detection jobs.
Anti-pattern: Shared database. Two services reading/writing the same tables creates invisible coupling. When Service A changes a column, Service B breaks. Schema changes require cross-team coordination. You lose every benefit of microservices while keeping the complexity.
Red flag answer: “Each service has its own database” without addressing how you handle cross-service queries, reporting, or data consistency.
Follow-up questions:
Q: Your analytics team needs to run complex queries that join data from 5 services. Each service has its own database. What do you build?
Build a read-optimized data warehouse. Each service publishes domain events (UserCreated, OrderPlaced) to a message bus (Kafka). A dedicated ETL service (or Kafka Connect + dbt) consumes these events and writes denormalized data into a warehouse (Snowflake, BigQuery, or a dedicated PostgreSQL analytics database). The analytics team queries the warehouse with full SQL join capabilities. This decouples analytics from operational databases, so there is no risk of an analyst’s heavy query slowing down production reads. Update frequency depends on business needs: near-real-time (Kafka streaming) or batch (hourly/daily ETL jobs).
Q: Two services need to maintain consistent state (e.g., deducting inventory and creating an order). How do you handle this without a shared database?
Use the Saga pattern with compensating transactions. The Order service creates an order in PENDING state and publishes an OrderCreated event. The Inventory service reserves the items and publishes InventoryReserved. The Order service then confirms the order. If the Inventory service fails, it publishes InventoryReservationFailed, and the Order service executes a compensating transaction (cancels the order). Use the Outbox pattern (write the event to a local outbox table in the same transaction as the state change, then publish from the outbox) to avoid the dual-write problem where the DB write succeeds but the event publish fails.
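The Outbox pattern mentioned in the last answer can be sketched with SQLite from the standard library. The table layout, the `create_order` and `relay` functions, and the `publish` callback (standing in for a Kafka producer or similar) are assumptions for illustration, not a reference implementation.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id: str) -> None:
    # One local transaction covers both the state change and the event row,
    # so either both are committed or neither is.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'PENDING')", (order_id,))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay(publish) -> int:
    """Publish unpublished rows in order; mark each row only after publish succeeds."""
    rows = db.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # if this raises, the row stays pending
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


create_order("o-1")
sent = []
relay(lambda kind, data: sent.append((kind, data)))
print(sent)
```

If the relay crashes between publishing and marking the row, the event is sent again on the next run. This gives at-least-once delivery, so consumers need to be idempotent.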
What interviewers are really testing: Whether you understand the gateway as a critical infrastructure component, not just a “reverse proxy,” and can discuss its failure modes and organizational implications.
Answer:
An API Gateway is the single entry point for all client requests in a microservices architecture. It sits between clients and backend services, handling cross-cutting concerns.
Core responsibilities:
  • Request routing — Routes /users/* to the User service, /orders/* to the Order service. Can route based on path, headers, or query parameters.
  • Authentication & Authorization — Validates JWT tokens, API keys, or OAuth tokens before the request reaches any backend service. Centralizes auth logic.
  • Rate limiting & Throttling — Enforces per-client, per-endpoint, or per-tier rate limits. Returns 429 with Retry-After.
  • SSL/TLS termination — Handles HTTPS at the edge. Internal traffic can use plain HTTP or mTLS.
  • Request/Response transformation — Translates between external and internal formats (e.g., REST to gRPC, JSON to Protobuf).
  • Request aggregation — Combines multiple backend calls into a single client response (reduces mobile network round-trips).
  • Caching — Caches GET responses at the gateway layer. Reduces load on backend services.
  • Observability — Logs all requests, adds trace IDs, emits metrics (latency, error rates, request counts per endpoint).
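As one concrete example of these responsibilities, per-client rate limiting is often implemented with a token bucket. The sketch below is an assumption about how a gateway plugin might do it, not how any particular gateway does; the `TokenBucket` class, its parameters, and the injected `clock` are hypothetical.

```python
import math
import time
from typing import Callable, Tuple


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_sec` tokens per second."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> Tuple[int, int]:
        """Return (status_code, retry_after_seconds)."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 200, 0
        # Seconds until one full token is available, rounded up for Retry-After.
        wait = (1 - self.tokens) / self.refill_per_sec
        return 429, math.ceil(wait)


fake_now = [0.0]  # controllable clock so the example runs instantly
bucket = TokenBucket(capacity=2, refill_per_sec=0.5, clock=lambda: fake_now[0])
results = [bucket.allow() for _ in range(3)]
fake_now[0] = 2.0
after_wait = bucket.allow()
print(results, after_wait)
```

A gateway would keep one bucket per client key (API key, user ID, or IP), usually in a shared store such as Redis so that all gateway instances see the same counts.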
Popular tools:
  • Kong — Open-source, plugin-based, Lua/Go. Strong plugin ecosystem.
  • AWS API Gateway — Managed, serverless-friendly, integrates with Lambda.
  • Envoy — High-performance C++ proxy, foundation of Istio service mesh.
  • Traefik — Cloud-native, auto-discovers services in Docker/K8s.
  • NGINX / HAProxy — Traditional reverse proxies that can function as lightweight gateways.
  • Apigee (Google) — Enterprise API management with developer portal, monetization.
The single point of failure problem: The gateway is the most critical component in your architecture. If it goes down, everything goes down. Mitigation: deploy multiple gateway instances behind a load balancer, use health checks, implement graceful degradation (if the auth service is down, can you still serve cached public endpoints?), and ensure the gateway itself has circuit breakers for backend services.
Performance consideration: Every request passes through the gateway, adding latency (typically 1-5ms). For internal service-to-service communication, do NOT route through the gateway; use direct service-to-service calls or a service mesh. The gateway is for external-to-internal traffic.
Red flag answer: “The API Gateway is just a reverse proxy” or not considering it as a single point of failure.
Follow-up questions:
Q: You have a mobile app, a web app, and third-party API consumers. Should they all share one gateway?
No. Use the Backend for Frontend (BFF) pattern: separate gateways (or gateway configurations) per client type. The mobile BFF aggregates more data per request (fewer round trips over cellular) and returns smaller payloads (compressed images). The web BFF can afford more granular endpoints. The external API gateway has stricter rate limits, different auth (API keys vs session cookies), and versioning. AWS API Gateway and Kong support this via separate “stages” and “workspaces,” respectively.
Q: How do you prevent the API gateway from becoming a bottleneck?
(1) Keep the gateway thin: only cross-cutting concerns, no business logic. If you find yourself writing if (userType === 'premium') in the gateway, that logic belongs in a backend service. (2) Horizontal scaling: gateways should be stateless so you can run 10+ instances behind a load balancer. (3) Async where possible: use non-blocking I/O (Envoy, Kong, NGINX are all event-loop based). (4) Caching: cache auth token validation results (short TTL) and GET responses. (5) Monitor gateway latency as a first-class SLI: if p99 gateway overhead exceeds 10ms, investigate.
What interviewers are really testing: Whether you understand how services find each other in dynamic environments where IPs change constantly, and the trade-offs between discovery mechanisms.
Answer:
In microservices, services are deployed across dynamic infrastructure (containers, VMs) where IP addresses change on every deployment, scaling event, or failure recovery. Service discovery solves the problem of “how does Service A know where Service B is right now?”
Client-side discovery:
  1. Each service registers itself with a Service Registry (Consul, Eureka, etcd, ZooKeeper) on startup with its IP, port, and health check URL.
  2. When Service A needs to call Service B, it queries the registry for healthy instances of Service B.
  3. Service A’s client-side load balancer (historically Netflix Ribbon in Spring Cloud, since replaced by Spring Cloud LoadBalancer) picks an instance and makes the call directly.
  • Pro: No extra network hop through a load balancer. Lower latency.
  • Con: Every service needs a discovery client library. Language-coupling (Java-centric in Spring Cloud ecosystem).
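The client-side flow above (register, query for healthy instances, pick one locally) can be sketched with a toy in-memory registry. The `Registry` and `RoundRobinClient` classes, the 90-second TTL, and the addresses are all hypothetical; real registries such as Consul or Eureka expose this over a network API.

```python
import itertools
from typing import Dict, List


class Registry:
    """Toy registry: an instance is healthy if it sent a heartbeat within `ttl` seconds."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.last_seen: Dict[str, Dict[str, float]] = {}  # service -> {address: last heartbeat}

    def heartbeat(self, service: str, address: str, now: float) -> None:
        self.last_seen.setdefault(service, {})[address] = now

    def healthy(self, service: str, now: float) -> List[str]:
        instances = self.last_seen.get(service, {})
        return sorted(a for a, seen in instances.items() if now - seen <= self.ttl)


class RoundRobinClient:
    """Client-side load balancer: queries the registry, then rotates over the results."""

    def __init__(self, registry: Registry, service: str) -> None:
        self.registry = registry
        self.service = service
        self.counter = itertools.count()

    def pick(self, now: float) -> str:
        instances = self.registry.healthy(self.service, now)
        if not instances:
            raise LookupError(f"no healthy instances of {self.service}")
        return instances[next(self.counter) % len(instances)]


registry = Registry(ttl=90.0)
registry.heartbeat("service-b", "10.0.0.1:8080", now=0.0)
registry.heartbeat("service-b", "10.0.0.2:8080", now=0.0)
client = RoundRobinClient(registry, "service-b")
picks = [client.pick(now=10.0) for _ in range(3)]

registry.heartbeat("service-b", "10.0.0.2:8080", now=60.0)  # only .2 keeps heartbeating
later = client.pick(now=120.0)  # .1 has expired and is skipped
print(picks, later)
```

Timestamps are passed in explicitly so the expiry behavior is easy to follow; a real client would also cache the instance list briefly rather than querying the registry on every call.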
Server-side discovery:
  1. Services register with the registry (or are automatically registered).
  2. Service A calls a load balancer / reverse proxy (NGINX, Envoy, AWS ALB).
  3. The load balancer queries the registry and forwards the request to a healthy instance.
  • Pro: Services are dumb — they just make HTTP calls to a known address. Language-agnostic.
  • Con: Extra network hop. Load balancer is a potential bottleneck.
Kubernetes DNS (the modern default):
  • Kubernetes provides built-in server-side discovery. Each Service object gets a DNS name: user-service.default.svc.cluster.local.
  • kube-proxy programs iptables/IPVS rules on each node so that traffic to the Service’s virtual IP is spread across pods that pass their readiness probes.
  • No external registry needed. This is why Consul/Eureka usage has declined in K8s-native environments.
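  • Gotcha: For a regular ClusterIP Service, DNS resolves to a stable virtual IP, so stale DNS is rarely the problem; the brief failure window after a pod dies comes from endpoint updates propagating to kube-proxy. With headless services, DNS returns pod IPs directly, and client-side DNS caching can keep pointing at a dead pod. Solutions: readiness probes, client-side retries, and (for headless services with client-side load balancing) short DNS TTLs with periodic re-resolution.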
Health checks are critical: A registry with stale entries (dead instances still registered) causes requests to fail. To prevent this: (1) Services send heartbeats to the registry (Eureka: every 30s). (2) Services expose health check endpoints (/health or /readyz). (3) The registry deregisters instances that miss multiple heartbeats. Consul uses both TTL-based and script-based health checks.
Red flag answer: “Just hardcode the IP addresses” or not knowing how Kubernetes handles service discovery.
Follow-up questions:
Q: Your service calls another service via Consul discovery. The target service deploys a new version, and for 30 seconds, some requests fail. What happened?
During deployment, old instances deregister and new instances register. If the new instances have not passed their health checks yet (startup time + initial health check interval), there is a window where the registry has fewer healthy instances (or none), causing failures. Solutions: (1) Use rolling deployments with readiness probes, so new pods only receive traffic after health checks pass. (2) Increase the minimum number of healthy instances during deployment. (3) Use blue/green deployment so the old version keeps serving until the new version is fully healthy. (4) Client-side retry with a different instance on failure.
Q: How does service discovery interact with a service mesh like Istio?
In Istio, the sidecar proxy (Envoy) handles all service discovery transparently. Your application code does not need a discovery client. It makes calls to http://user-service:8080 and the Envoy sidecar intercepts the call, resolves the destination via Istio’s control plane (Pilot, now part of istiod, which watches the K8s API server for endpoint changes), and applies load balancing, retries, circuit breaking, and mTLS, all without any application code changes. This is the “service mesh” approach to discovery: push all networking concerns into the infrastructure layer.
What interviewers are really testing: Whether you can handle the hardest problem in microservices: maintaining data consistency across services without distributed transactions. This separates senior engineers from mid-level ones.
Answer:
A Saga is a sequence of local transactions where each transaction updates a single service’s database. If any step fails, the saga executes compensating transactions in reverse order to undo the preceding steps.
Example (e-commerce order flow):
  1. Order Service: Create order (status: PENDING)
  2. Payment Service: Charge customer’s credit card
  3. Inventory Service: Reserve items
  4. Shipping Service: Schedule delivery
If step 3 fails (out of stock):
  • Compensate step 2: Refund the credit card
  • Compensate step 1: Cancel the order (status: CANCELLED)
Two coordination approaches:
Choreography (Event-driven, decentralized):
  • Each service publishes events after completing its step. The next service reacts to that event.
  • Order Service publishes OrderCreated -> Payment Service listens, charges card, publishes PaymentCompleted -> Inventory Service listens, reserves items, publishes InventoryReserved -> etc.
  • Pro: No single coordinator. Services are loosely coupled. Simple for 3-4 step sagas.
  • Con: Hard to track the overall saga state. Debugging requires correlating events across services. Difficult to reason about as the number of steps grows. Cyclic event dependencies can emerge.
Orchestration (Centralized coordinator):
  • A Saga Orchestrator (dedicated service or state machine) tells each service what to do and handles responses.
  • Orchestrator sends ChargeCard command to Payment Service, waits for response, then sends ReserveInventory to Inventory Service, etc.
  • Pro: Clear flow. Easy to debug and monitor. Handles complex branching logic (if payment type is X, skip step Y).
  • Con: Orchestrator is a single point of failure and a coupling point. Can become a “god service” if not carefully scoped.
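A minimal in-process sketch of the orchestration approach, replaying the order flow from earlier with step 3 failing. The `run_saga` function, the step tuples, and the `state` dict are hypothetical; a real orchestrator would persist saga state and call remote services, typically via a workflow engine.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated {done_name}")
            return False, log
        completed.append(step)
        log.append(f"{name} ok")
    return True, log


def out_of_stock() -> None:
    raise RuntimeError("out of stock")


state = {"order": None, "charged": False}
steps: List[Step] = [
    ("create_order", lambda: state.update(order="PENDING"), lambda: state.update(order="CANCELLED")),
    ("charge_card", lambda: state.update(charged=True), lambda: state.update(charged=False)),
    ("reserve_inventory", out_of_stock, lambda: None),
    ("schedule_delivery", lambda: None, lambda: None),
]
ok, log = run_saga(steps)
print(ok, log, state)
```

The sketch assumes compensations never fail. In practice each compensation needs its own retries and must be idempotent.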
Critical implementation details:
  • Idempotency of compensations — Compensating transactions must be idempotent. If the “refund” message is delivered twice (at-least-once delivery), it should not refund twice.
  • The Outbox Pattern — Write the event/command to a local outbox table in the same database transaction as the state change. A separate process (Debezium, polling publisher) reads the outbox and publishes to the message broker. This prevents the dual-write problem (DB commits but message publish fails).
  • Semantic vs. technical compensation — Refunding a charge is not the same as “undoing” it. The customer sees two transactions on their statement. Some operations are not reversible (email sent, physical goods shipped). Design compensations for what is business-meaningful, not technically perfect.
  • Timeout handling — If a service does not respond, the orchestrator must decide: retry, compensate, or alert for manual intervention. Use a dead letter queue for permanently failed steps.
Tools: Temporal.io, AWS Step Functions, Camunda, Netflix Conductor, Apache Camel saga.
Red flag answer: “Just use a distributed transaction (2PC).” Two-phase commit does not scale in microservices due to lock contention and coordinator availability. Another red flag is describing sagas without mentioning compensating transactions.
Follow-up questions:
Q: A saga step fails but the compensation also fails. What do you do?
This is the “saga of the saga” problem. Options: (1) Retry the compensation with exponential backoff. Compensations must be idempotent. (2) After max retries, write to a dead letter queue and alert the operations team for manual intervention. (3) For critical financial operations, maintain a “pending compensation” state that a background reconciliation job retries periodically. (4) In extreme cases, have a human review queue. Stripe’s internal systems have manual review workflows for stuck payment reversals. The key insight: eventual consistency means “eventually” might involve human intervention for edge cases.
Q: When would you use choreography vs orchestration?
Choreography for simple sagas with 2-3 steps where services are independently developed by different teams who publish well-defined events. Orchestration for complex workflows with 4+ steps, conditional branching, timeout handling, and when you need clear visibility into saga state. A practical heuristic: if you cannot draw the saga flow on a whiteboard in under 2 minutes, use orchestration. Netflix uses Conductor (orchestration) for complex content delivery workflows. Many teams start with choreography and migrate to orchestration as complexity grows.
Q: How do you test sagas?
(1) Unit test each service’s local transaction and compensation independently. (2) Integration test the happy path end-to-end in a staging environment. (3) Chaos test failure scenarios: kill a service mid-saga, inject network delays, simulate duplicate messages. (4) For orchestration, test the state machine independently with mocked service responses. (5) Use contract testing (Pact) to ensure each service honors the expected event/command schema. (6) In production, implement saga-level observability: a dashboard showing in-progress sagas, their current step, age, and failure rate.
What interviewers are really testing: Whether you understand when CQRS is justified (not always) and can design the synchronization between write and read models without losing data.
Answer:
CQRS separates the write model (commands that mutate state) from the read model (queries that return data). Instead of one model that handles both reads and writes (the default CRUD approach), you have two optimized for their respective workloads.
How it works:
  1. Write side — Receives commands (CreateOrder, UpdateAddress). Validates business rules. Writes to a normalized, consistent database (PostgreSQL, DynamoDB).
  2. Sync mechanism — Changes are propagated to the read side via events (Kafka, RabbitMQ) or Change Data Capture (Debezium watching the write DB’s WAL).
  3. Read side — Maintains denormalized, query-optimized projections. Could be Elasticsearch (full-text search), Redis (fast lookups), a materialized view in PostgreSQL, or a dedicated read replica.
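The three parts above (write side, sync mechanism, read side) can be sketched in memory as follows. All names are hypothetical: `orders_write_db` stands in for the normalized write database, `event_bus` for Kafka or CDC, and `orders_by_user` for a denormalized projection.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

orders_write_db: Dict[str, dict] = {}   # write model: one row per order
event_bus: Deque[dict] = deque()        # stand-in for Kafka / CDC stream
orders_by_user = defaultdict(lambda: {"count": 0, "total": 0})  # read model


def handle_create_order(order_id: str, user_id: str, amount: int) -> None:
    """Command handler: validate business rules, write, then emit an event."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    orders_write_db[order_id] = {"user_id": user_id, "amount": amount}
    # In a real system this emit would go through an outbox or CDC,
    # not a second, separate write.
    event_bus.append({"type": "OrderCreated", "user_id": user_id, "amount": amount})


def project_pending_events() -> None:
    """Projector: drain events into the query-optimized read model."""
    while event_bus:
        event = event_bus.popleft()
        if event["type"] == "OrderCreated":
            summary = orders_by_user[event["user_id"]]
            summary["count"] += 1
            summary["total"] += event["amount"]


handle_create_order("o-1", "u-1", 40)
handle_create_order("o-2", "u-1", 60)
stale = dict(orders_by_user["u-1"])   # read before the projector has run
project_pending_events()
fresh = dict(orders_by_user["u-1"])
print(stale, fresh)
```

The gap between `stale` and `fresh` is the eventual-consistency window: the write has been accepted, but queries against the read model do not reflect it until the projector catches up.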
When CQRS makes sense:
  • Read/write ratio is heavily skewed — Most apps are 90%+ reads. Optimize the read model for query patterns without compromising write consistency.
  • Different read patterns — The search API needs Elasticsearch, the dashboard needs pre-aggregated analytics, the detail page needs a document store. Each is a different “projection” of the same data.
  • High-scale systems — Scale read replicas independently of the write database. Netflix serves billions of reads from cached/denormalized stores while writes go to a consistent master.
  • Complex domain models — When the write model (rich domain objects with business rules) is very different from what clients need to read (flat DTOs).
When CQRS is overkill:
  • Simple CRUD apps with basic query patterns.
  • Small teams where the operational complexity of maintaining two models and a sync mechanism is not justified.
  • When strong consistency is required everywhere (no tolerance for eventual consistency on reads).
The critical trade-off (eventual consistency): After a write, the read model may be stale for milliseconds to seconds. A user creates an order and immediately refreshes, and they might not see it. Solutions: (1) Read-your-own-writes: after a write, query the write DB directly for that user’s next read. (2) Client-side optimistic update: the UI shows the new order immediately before confirmation from the read model. (3) Tune sync latency (Kafka with low batch size can achieve sub-100ms propagation).
Red flag answer: “We should use CQRS for all our services.” CQRS adds significant complexity and should be applied selectively. Another red flag is not mentioning the eventual consistency challenge.
Follow-up questions:
Q: The read model gets out of sync with the write model. How do you detect and fix this?
(1) Build a reconciliation job that periodically compares write DB state with read model state and publishes corrective events. (2) Include version numbers or timestamps in both models; if the read model’s version is behind the write model’s, trigger a re-sync for that entity. (3) Rebuild the entire read model from scratch by replaying all events (this is where Event Sourcing + CQRS is powerful, because you have the full event history). (4) Monitor the sync lag as a metric: the time between a write and its appearance in the read model. Alert if lag exceeds your SLA (e.g., 5 seconds).
Q: How does CQRS relate to Event Sourcing? Do you need both?
They are complementary but independent. CQRS is about separating read and write models. Event Sourcing is about storing state as a sequence of events. They pair well because events from Event Sourcing naturally feed CQRS projections: each projection consumes the event stream and builds its own view. But you can do CQRS without Event Sourcing (use CDC on a traditional DB) and Event Sourcing without CQRS (replay events into a single model). Use both when you need auditability (Event Sourcing) AND optimized read performance (CQRS).
What interviewers are really testing: Whether you understand the paradigm shift from “storing current state” to “storing events” and can articulate when this complexity is worth it.
Answer: Instead of storing the current state of an entity (e.g., account_balance = 500), you store every event that ever happened to it:
Event 1: AccountCreated { id: 42, owner: "Alice" }
Event 2: MoneyDeposited { amount: 1000 }
Event 3: MoneyWithdrawn { amount: 300 }
Event 4: MoneyWithdrawn { amount: 200 }
Current state is derived by replaying events: 0 + 1000 - 300 - 200 = 500.
Key characteristics:
  • Events are immutable — Once written, never modified or deleted. This is an append-only log.
  • Complete audit trail — You know not just the current state but every change that ever happened, when, and why.
  • Temporal queries — “What was the account balance at 3 PM yesterday?” Replay events up to that timestamp.
  • Event replay — If you add a new projection (e.g., a fraud detection model), you can replay all historical events through it to backfill the data.
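As a minimal sketch of replay and temporal queries, the following folds the four events from the example above into a balance. The Event class, its field names, and the timestamps are illustrative assumptions, not the API of any particular event store.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)  # frozen mirrors "events are immutable"
class Event:
    kind: str      # "AccountCreated", "MoneyDeposited" or "MoneyWithdrawn"
    amount: int
    at: datetime   # when the event was recorded (hypothetical field)


def balance(events: List[Event], as_of: Optional[datetime] = None) -> int:
    """Fold the log into a balance, optionally stopping at a timestamp."""
    total = 0
    for e in events:
        # Assumes the log is append-only and therefore ordered by time.
        if as_of is not None and e.at > as_of:
            break
        if e.kind == "MoneyDeposited":
            total += e.amount
        elif e.kind == "MoneyWithdrawn":
            total -= e.amount
    return total


log = [
    Event("AccountCreated", 0, datetime(2024, 1, 15, 9, 0)),
    Event("MoneyDeposited", 1000, datetime(2024, 1, 15, 10, 0)),
    Event("MoneyWithdrawn", 300, datetime(2024, 1, 15, 14, 0)),
    Event("MoneyWithdrawn", 200, datetime(2024, 1, 15, 16, 0)),
]

current = balance(log)                                 # 0 + 1000 - 300 - 200 = 500
at_3pm = balance(log, datetime(2024, 1, 15, 15, 0))    # 1000 - 300 = 700
```

Calling balance(log) replays the whole log, while passing as_of replays only the events recorded up to that moment, which is how a temporal query is answered.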
Challenges (the hard parts):
  1. Event schema evolution — Events are stored forever. What happens when you need to add a field to MoneyDeposited? You need event versioning and upcasting (transforming old events to new schemas on read).
  2. Replay performance — An entity with 1 million events takes a long time to replay. Solution: Snapshots — periodically store the current state (e.g., every 100 events). Replay only events after the snapshot.
  3. Eventual consistency — Projections (read models) are rebuilt from the event stream and may lag behind the write model.
  4. Complexity — Developers must think in events, not state. Debugging requires replaying event sequences. This is a fundamental mindset shift.
  5. Storage growth — Events accumulate forever. For high-volume systems, this can be significant. Use event archiving strategies (move old events to cold storage).
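The snapshot idea from challenge 2 can be sketched as follows. This is a simplified model under stated assumptions: each event is reduced to an integer balance delta, and Snapshot, load, and take_snapshot are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    version: int  # how many events are already folded into `state`
    state: int    # derived state at that point (here: a balance)


def load(deltas: List[int], snapshot: Optional[Snapshot] = None) -> Tuple[int, int]:
    """Rebuild state; returns (state, number_of_events_replayed)."""
    start = snapshot.version if snapshot else 0
    state = snapshot.state if snapshot else 0
    for delta in deltas[start:]:   # replay only events after the snapshot
        state += delta
    return state, len(deltas) - start


def take_snapshot(deltas: List[int]) -> Snapshot:
    """Fold all given events and remember how many were folded."""
    state, _ = load(deltas)
    return Snapshot(version=len(deltas), state=state)


deltas = [1] * 1050                   # 1,050 one-unit deposits
snap = take_snapshot(deltas[:1000])   # snapshot taken at event 1,000

full_state, full_replayed = load(deltas)        # replays all 1,050 events
fast_state, fast_replayed = load(deltas, snap)  # replays only the last 50
```

Both calls return the same state; only the number of events replayed differs, which is the whole point of snapshotting.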
Where Event Sourcing shines:
  • Financial systems — Regulatory requirement for full audit trails. Banks, trading platforms, insurance.
  • Collaborative editing — Google Docs stores operations (events), not document state. Enables undo/redo and conflict resolution.
  • Gaming — Replay systems (watch a game replay) are event sourcing.
  • Compliance-heavy domains — Healthcare, legal, government where you must prove what happened and when.
Tools: EventStoreDB (purpose-built), Apache Kafka (as event log), Axon Framework (Java), Marten (.NET).Red flag answer: “Event Sourcing is just saving events to a table” without discussing replay, snapshots, schema evolution, or when NOT to use it. Or proposing Event Sourcing for a simple CRUD app.Follow-up questions:Q: You have an entity with 10 million events. Replaying takes 30 seconds. How do you fix this? Implement snapshotting. After every N events (e.g., 1000), store a snapshot of the current state. To rebuild, load the latest snapshot and replay only events after it. For 10M events with a snapshot every 1000, the worst case is replaying 999 events. Additionally, consider: (1) In-memory caching of frequently accessed entities. (2) Asynchronous projection rebuilds that do not block reads. (3) Partitioning the event store by aggregate ID for parallel replay.Q: A developer accidentally publishes a “MoneyDeposited” event with the wrong amount. Events are immutable. What do you do? Publish a corrective event: DepositCorrected { original_event_id: "xyz", corrected_amount: 500, reason: "Data entry error" }. The projection logic handles the correction. You NEVER delete or modify existing events — that would break the audit trail and any downstream consumers that have already processed the original event. In extreme cases (GDPR right to erasure), use “crypto-shredding” — encrypt events with a per-user key and destroy the key when erasure is required, making the events unreadable.
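One way a projection might honor the corrective event from the second follow-up is sketched below. The dictionary layout and the project_balance helper are assumptions for illustration; only the DepositCorrected field names come from the answer above.

```python
from typing import Dict, List


def project_balance(events: List[dict]) -> int:
    """Build a balance projection that honors DepositCorrected events."""
    deposits: Dict[str, int] = {}   # deposit event id -> effective amount
    withdrawals = 0
    for e in events:
        if e["type"] == "MoneyDeposited":
            deposits[e["id"]] = e["amount"]
        elif e["type"] == "DepositCorrected":
            # The original event stays in the log untouched; only the
            # projection's view of that deposit changes.
            deposits[e["original_event_id"]] = e["corrected_amount"]
        elif e["type"] == "MoneyWithdrawn":
            withdrawals += e["amount"]
    return sum(deposits.values()) - withdrawals


events = [
    {"type": "MoneyDeposited", "id": "xyz", "amount": 5000},  # wrong amount
    {"type": "MoneyWithdrawn", "id": "w1", "amount": 300},
    {"type": "DepositCorrected", "id": "c1", "original_event_id": "xyz",
     "corrected_amount": 500, "reason": "Data entry error"},
]
```

The log still contains the erroneous deposit, so the audit trail shows both the mistake and its correction.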
What interviewers are really testing: Whether you understand cascading failure prevention and can configure circuit breakers with real thresholds, not just describe the concept abstractly.
Answer: A circuit breaker wraps calls to an external service and monitors for failures. When failures exceed a threshold, it “trips” and stops making calls, returning a fallback response immediately. This prevents cascading failures where one slow/failing service takes down the entire system.
Three states:
  1. Closed (normal) — Requests flow through. Failures are counted. If failure count exceeds threshold within a time window (e.g., 5 failures in 10 seconds), transition to Open.
  2. Open (tripped) — All requests immediately fail without calling the downstream service. Returns a fallback response or error. After a timeout period (e.g., 30 seconds), transition to Half-Open.
  3. Half-Open (testing) — Allow a limited number of requests through (e.g., 1-3). If they succeed, transition back to Closed. If they fail, transition back to Open.
Configuration that matters in production:
failure_threshold: 50% failure rate over 10 requests (sliding window)
timeout_duration: 30 seconds in Open state
half_open_requests: 3 permitted test requests
slow_call_threshold: 60% of calls exceeding 2000ms
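The three states and the configuration above can be sketched as a small state machine. This is a single-threaded illustration, not the Resilience4j API: the class and parameter names are hypothetical, the window is count-based, and the slow-call threshold is left out for brevity.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, window=10, open_seconds=30.0,
                 half_open_calls=3, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.window = window
        self.open_seconds = open_seconds
        self.half_open_calls = half_open_calls
        self.clock = clock                    # injectable for testing
        self.state = "closed"
        self.results = deque(maxlen=window)   # True means the call failed
        self.opened_at = 0.0
        self.trial_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit open: failing fast")
            self.state = "half_open"          # timeout elapsed: probe again
            self.trial_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half_open":
            self.trial_successes += 1
            if self.trial_successes >= self.half_open_calls:
                self.state = "closed"         # downstream looks healthy
                self.results.clear()
        else:
            self.results.append(False)

    def _on_failure(self):
        if self.state == "half_open":
            self._trip()                      # any failed probe reopens
            return
        self.results.append(True)
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window >= self.failure_rate:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()
```

A caller would wrap each downstream call as breaker.call(fetch_recommendations, user_id) and catch CircuitOpenError to serve a fallback; fetch_recommendations is a placeholder name here.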
What counts as a failure?
  • 5xx responses from downstream
  • Connection timeouts (TCP connect timeout, typically 1-3 seconds)
  • Read timeouts (no response within expected time, typically 5-30 seconds)
  • NOT 4xx responses — those indicate client errors, not downstream health issues (exception: 429 may warrant circuit breaking)
Fallback strategies:
  • Return cached data (stale but available)
  • Return a default/degraded response (e.g., show a generic product recommendation instead of personalized ones)
  • Return an error with a meaningful message (503 Service Unavailable, please retry)
  • Queue the request for later processing (if the operation is not time-sensitive)
Production war story: Netflix’s 2012 architecture had a service call chain: API Gateway —> User Service —> Recommendation Service —> ML Model Service. When the ML Model Service became slow (not failing, just slow — 10s response times), it consumed all threads in the Recommendation Service, which consumed all threads in the User Service, which made the API Gateway unresponsive. Every single Netflix user saw errors. This led Netflix to develop Hystrix (now in maintenance, succeeded by Resilience4j), which introduced circuit breakers as a standard pattern.Tools: Resilience4j (Java), Polly (.NET), gobreaker (Go), opossum (Node.js), Istio (infrastructure-level circuit breaking via Envoy).Red flag answer: “If the downstream service fails, just retry” — retries without circuit breaking cause retry storms that amplify the failure. Or not knowing the three states (Closed, Open, Half-Open).Follow-up questions:Q: How do circuit breakers interact with retries? What is a “retry storm”? Retries and circuit breakers must be layered carefully. Retries happen inside the circuit breaker — if the circuit is closed, a failed request may be retried (with exponential backoff). If retries also fail, the failure is counted toward the circuit breaker threshold. Without a circuit breaker, retries amplify failures: if 100 clients each retry 3 times, a failing service receives 400 requests instead of 100, making it even more likely to fail. This is a retry storm. The pattern is: Circuit Breaker (outer) —> Retry with backoff (inner) —> Timeout (innermost). Resilience4j lets you compose these decorators in order.Q: Should you use application-level (library) or infrastructure-level (Istio) circuit breakers? Infrastructure-level (Istio/Envoy) is language-agnostic and applies uniformly across all services without code changes. It handles connection-level circuit breaking well. 
Application-level (Resilience4j) provides more granular control: per-method circuit breakers, custom fallback logic, access to business context. Ideal approach: use infrastructure-level for baseline protection (connection pooling, outlier detection) and application-level for business-critical paths where you need custom fallback responses.Q: A downstream service is not failing but is very slow (5-second responses instead of 100ms). How does this affect circuit breakers? This is the “gray failure” scenario and is actually more dangerous than hard failures. A slow service consumes threads/connections on the caller while waiting for responses, leading to thread pool exhaustion. Configure the circuit breaker with a slow call rate threshold (Resilience4j supports this): if more than 60% of calls in the sliding window exceed 2 seconds, trip the breaker. Also critical: set aggressive timeouts. A 30-second default timeout is far too long for most microservice calls. Use 1-3 second timeouts for synchronous service-to-service calls.
What interviewers are really testing: Whether you can plan and execute a monolith-to-microservices migration without a risky “big bang” rewrite.
Answer: Named after the strangler fig tree that grows around a host tree, gradually replacing it until the host dies. The pattern incrementally migrates a monolith to microservices.
Step-by-step approach:
  1. Place a facade (API Gateway/Reverse Proxy) in front of the monolith. All traffic routes through this facade. Initially, 100% of requests go to the monolith. This is zero-risk — behavior does not change.
  2. Identify the first module to extract. Choose based on: independent business domain, high change frequency, distinct scaling needs, clean data boundaries. Usually a peripheral feature (notifications, search) rather than core domain logic.
  3. Build the new service. Implement the same functionality in the new microservice. Write the same API contract (or an improved version with backwards compatibility).
  4. Route traffic gradually.
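    • Start in shadow mode: mirror requests to both implementations, serve the monolith’s response, and compare outputs. Then move to a 1% canary that serves real responses from the new service.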
    • Increase to 10%, 50%, 100% as confidence grows.
    • Use feature flags or gateway routing rules to control traffic splitting.
  5. Decommission the old module. Once 100% of traffic goes to the new service and the old code path has been inactive for a safe period (2-4 weeks), remove the old code from the monolith.
  6. Repeat for the next module.
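The gradual routing in step 4 might look like the sketch below at the facade. It assumes a hypothetical rollout table mapping path prefixes to a percentage, and it hashes the user ID so that each user consistently lands on the same side as the percentage grows. Real gateways express this as routing rules rather than application code.

```python
import hashlib
from typing import Dict


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(path: str, user_id: str, rollout: Dict[str, int]) -> str:
    """Pick a backend: the new service for the rolled-out share, else the monolith."""
    for prefix, percent in rollout.items():
        if path.startswith(prefix) and bucket(user_id) < percent:
            return "new-service"
    return "monolith"


# Hypothetical rollout state: 10% of notification traffic goes to the new service.
rollout = {"/notifications": 10}
backend = route("/notifications/send", "user-123", rollout)
```

Setting the percentage back to 0 is the rollback path: every request returns to the monolith without a deploy.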
Critical success factors:
  • Parallel run / Shadow testing — Run both old and new implementations simultaneously, compare outputs, and log discrepancies. Do this before routing real traffic to the new service.
  • Shared data migration — The hardest part. If the old module read/wrote to the monolith’s shared database, the new service needs its own database. Options: dual-write during transition (risky), event-based sync, or database replication with a cutover.
  • Strangling the database — Often harder than strangling the code. Use views or database triggers to redirect reads/writes during the transition.
  • Timeline — Real migrations take 12-24 months for a medium monolith. Amazon’s migration took years. Be patient.
Red flag answer: “Just rewrite the whole thing as microservices” — big bang rewrites are the #1 cause of failed migration projects. Or describing the pattern without addressing the data migration challenge.Follow-up questions:Q: You are mid-migration. The new User Service is handling 50% of traffic. You discover a subtle data inconsistency between the old and new implementations. What do you do? (1) Immediately route 100% traffic back to the monolith (rollback via the gateway). (2) Investigate the discrepancy using the shadow testing logs. (3) Fix the new service. (4) Replay the shadow comparison for a larger dataset to verify the fix. (5) Gradually ramp traffic back up. The beauty of the Strangler Fig pattern is that the monolith is always there as a fallback. This is exactly why you do gradual migration instead of big-bang.Q: How do you handle database access during the transition period when both the monolith and the new service need the same data? Options ordered by risk: (1) API-based access — The monolith calls the new service’s API for data it used to access directly. Safest but requires modifying the monolith. (2) Change Data Capture — Use Debezium to stream changes from the monolith’s database to the new service’s database. Both have their own copy. (3) Shared database with schemas — Temporarily share the database but use separate schemas. The new service reads/writes its own schema, with a sync process between schemas. (4) Dual writes — Write to both databases in the same request. Most dangerous due to consistency risks. Always prefer CDC over dual writes.
What interviewers are really testing: Whether you understand resource isolation as a resilience strategy and can apply it to thread pools, connection pools, and infrastructure.
Answer: Named after ship bulkheads: watertight compartments that prevent a hull breach in one section from sinking the entire ship. In software, the Bulkhead pattern isolates resources so that a failure in one component does not exhaust shared resources and cascade.
Types of bulkheads:
  1. Thread pool isolation:
    • Instead of one shared thread pool for all outgoing HTTP calls, create separate pools per downstream service.
    • Service A gets 20 threads, Service B gets 30 threads, Service C gets 10 threads.
    • If Service A is slow and saturates its 20 threads, Services B and C are unaffected — their pools are independent.
    • Netflix Hystrix popularized this approach.
  2. Connection pool isolation:
    • Separate database connection pools for different operations (reads vs writes, critical vs non-critical).
    • If a batch reporting query exhausts its connection pool, the user-facing API still has connections available.
  3. Semaphore isolation:
    • Lighter than thread pools. Uses a counter to limit concurrent calls to a resource.
    • If the semaphore limit is 10 and 10 calls are in-flight, the 11th call fails immediately.
    • Lower overhead than thread pools (no context switching) but no timeout control.
  4. Infrastructure-level isolation:
    • Deploy critical services on dedicated nodes (Kubernetes node affinity/taints).
    • Separate Kubernetes namespaces with resource quotas (CPU, memory limits).
    • Separate database instances for critical vs non-critical services.
    • Multi-AZ deployment with per-AZ isolation.
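Semaphore isolation (type 3 above) is the easiest variant to sketch. The Bulkhead class below is a hypothetical illustration built on the standard library: it rejects a call immediately when all slots are taken instead of queueing it.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the concurrency limit for a dependency is reached."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: fail fast instead of tying up the caller.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()   # always free the slot, even on errors


# One bulkhead per downstream dependency, sized independently.
payment_bulkhead = Bulkhead(max_concurrent=10)
search_bulkhead = Bulkhead(max_concurrent=30)
```

Because the call runs on the caller's thread, this variant cannot interrupt a slow call, which matches the "no timeout control" caveat above; pair it with client-side timeouts.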
Sizing bulkheads: The critical question is: how many threads/connections per pool? Too few and you throttle healthy traffic. Too many and the isolation is meaningless. Formula: pool_size = (requests_per_second * p99_latency_seconds) + buffer. Example: if Service A receives 100 RPS and p99 latency is 200ms, you need 100 * 0.2 = 20 threads + 30% buffer = 26 threads.Production example: An e-commerce platform had a shared thread pool of 200 threads for all downstream calls. The payment service had a transient issue causing 5-second timeouts. All 200 threads were blocked waiting for payment responses, meaning product search, inventory checks, and user authentication all failed. After implementing bulkheads — 50 threads for payment, 50 for product, 50 for inventory, 50 for auth — a payment service issue only affected checkout, while browsing continued normally.Red flag answer: Describing bulkheads only as “separate thread pools” without discussing connection pools, infrastructure isolation, or how to size the pools.Follow-up questions:Q: How do bulkheads interact with circuit breakers? They are complementary. The bulkhead limits resource consumption (prevents thread exhaustion). The circuit breaker prevents making futile calls (stops calling a known-failing service). Together: the bulkhead ensures that even if the circuit breaker is not yet tripped (failure threshold not reached), the slow service cannot consume more than its allocated resources. Think of bulkheads as preventive (limit blast radius) and circuit breakers as reactive (stop bleeding when failure is detected).Q: You have a Kubernetes cluster with 10 services. How do you apply the bulkhead pattern at the infrastructure level? (1) Set resource requests and limits on every pod (resources.requests.cpu: 500m, resources.limits.cpu: 1000m). This prevents one service from consuming all cluster CPU. (2) Use ResourceQuotas per namespace to cap total resource consumption per team. 
(3) Use PodDisruptionBudgets to ensure minimum availability during node maintenance. (4) For critical services, use node taints and tolerations to dedicate nodes. (5) Use PriorityClasses so that critical pods evict non-critical pods under resource pressure, not the other way around.
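The sizing formula from the bulkhead answer is short enough to check in code. The function name and the 30% default buffer are taken from the worked example above; rounding up to a whole thread is an assumption of this sketch.

```python
import math


def pool_size(rps: float, p99_latency_s: float, buffer: float = 0.30) -> int:
    """Threads needed to keep rps calls in flight at p99 latency, plus headroom."""
    in_flight = rps * p99_latency_s           # concurrent calls at steady state
    sized = in_flight * (1 + buffer)
    return math.ceil(round(sized, 6))         # round() guards against float noise


service_a = pool_size(rps=100, p99_latency_s=0.2)   # 20 in flight + 30% = 26
```

The same arithmetic shows why slow dependencies are dangerous: at 100 RPS, a jump to 5-second latency means about 500 calls in flight, far beyond a pool of 26.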

3. Communication Protocols

What interviewers are really testing: Whether you can choose the right communication protocol based on the use case, not just recite differences. The best candidates have used both and know the operational trade-offs.
Answer:
Aspect | gRPC | REST
Transport | HTTP/2 (required) | HTTP/1.1 or HTTP/2
Serialization | Protocol Buffers (binary) | JSON (text), XML
Schema | Strict .proto contract (required) | Optional (OpenAPI)
Streaming | Bidirectional, server, client streaming | Request/response only (SSE for server push)
Code generation | Auto-generated clients/servers in 10+ languages | Manual or OpenAPI codegen
Browser support | Limited (requires gRPC-Web proxy like Envoy) | Native
Performance | 5-10x faster serialization, 30-50% smaller payloads | Human-readable, easier to debug
Load balancing | Requires L7 LB (Envoy); HTTP/2 multiplexes on one TCP connection | Standard L4/L7 LB works
Debugging | Requires special tools (grpcurl, BloomRPC, Postman) | curl, browser, any HTTP client
When to use gRPC:
  • Internal service-to-service communication where performance matters and both sides are controlled by your org.
  • Streaming use cases — real-time data feeds, event streams, chat backends.
  • Polyglot environments — auto-generated clients eliminate hand-written HTTP client code in each language.
  • High-throughput systems — at thousands of RPS, the serialization savings compound. Google reports saving significant CPU by using Protobuf internally.
When to use REST:
  • Public/External APIs — every developer knows HTTP + JSON. No learning curve.
  • Browser-facing — gRPC does not work natively in browsers (gRPC-Web requires a proxy).
  • Simple CRUD — the overhead of Protobuf schemas, code generation, and HTTP/2 infrastructure is not justified.
The gRPC load balancing gotcha: HTTP/2 multiplexes multiple requests over a single TCP connection. If you use a Layer 4 load balancer (AWS NLB), every request on a given client connection hits the same backend server, so a long-lived channel gets no load distribution. You need a Layer 7 load balancer (Envoy, Linkerd proxy) that understands HTTP/2 frames and distributes requests across backends. This catches many teams moving from REST to gRPC.
Red flag answer: “gRPC is faster so we should use it for everything” without mentioning browser limitations, debugging difficulty, or the load balancing complexity.
Follow-up questions:
Q: You have a REST API and want to add gRPC for internal services. How do you run both without maintaining two codebases? Define your service in .proto files. Use gRPC-Gateway (Go), grpc-spring (Java), or similar libraries to auto-generate a REST/JSON reverse proxy that translates HTTP/JSON requests into gRPC calls. The business logic lives in the gRPC service implementation. External clients hit the REST gateway, internal services call gRPC directly. Google Cloud APIs use this pattern: many of them expose both REST and gRPC interfaces defined from the same .proto definitions.
Q: You switch from REST to gRPC for an internal service and latency increases. What went wrong? Most likely causes: (1) Connection setup overhead. gRPC runs over HTTP/2, and most production deployments enable TLS, although gRPC itself also works over plaintext HTTP/2. If you were previously reusing unencrypted HTTP/1.1 connections internally, the TLS handshake adds latency to the first call on each new connection. Solution: reuse channels and enable keep-alives. (2) Protobuf serialization of very small payloads may be comparable to JSON. The performance benefit shows at scale, not for individual small messages. (3) DNS resolution overhead: gRPC’s built-in name resolver may behave differently than your HTTP client’s DNS caching. (4) Missing connection reuse: the gRPC client is not reusing channels (connections), so every call pays for a new TCP + TLS + HTTP/2 handshake.
What interviewers are really testing: Whether you can evaluate GraphQL honestly, knowing both its power and its operational challenges, rather than treating it as a silver bullet.
Answer:
Aspect | GraphQL | REST
Endpoint | Single (/graphql) | Multiple (/users, /orders)
Data fetching | Client specifies exact fields needed | Server determines response shape
Over-fetching | Eliminated (get only what you asked for) | Common (fixed response includes all fields)
Under-fetching | Eliminated (one query can traverse relationships) | Common (need multiple requests for related data)
Versioning | Typically no versioning (deprecate fields) | URL or header versioning
Caching | Complex (queries usually sent via POST, dynamic query shapes) | Simple (HTTP caching on GET URLs)
File uploads | Not natively supported (multipart workaround) | Standard multipart form data
Error handling | Typically returns 200 with an errors array | HTTP status codes
N+1 problem | DataLoader pattern required | Does not apply (server controls queries)
Rate limiting | Difficult (query complexity varies wildly) | Simple (per-endpoint limits)
Where GraphQL excels:
  • Mobile apps with constrained bandwidth — fetch exactly what the screen needs in one request.
  • Aggregation layers (BFF) — one GraphQL server in front of multiple REST microservices.
  • Rapid frontend iteration — frontend teams can change data requirements without backend changes.
  • Complex, interconnected data — social graphs, e-commerce catalogs with deep relationships.
Where GraphQL is painful:
  • Caching — REST endpoints are trivially cacheable (CDN caches GET /users/42). GraphQL uses POST for queries, making CDN caching hard. Solutions: persisted queries (hash the query, cache by hash), Apollo cache.
  • Security — A malicious query can request deeply nested data: { users { posts { comments { author { posts { comments } } } } } }. Without query depth limiting and cost analysis, this can DoS your database. Tools: graphql-depth-limit, graphql-cost-analysis.
  • N+1 queries — A naive GraphQL resolver fetches related data one record at a time. If a query returns 100 users and each needs their orders, that is 1 query for users + 100 queries for orders = 101 DB queries. DataLoader (Facebook’s library) batches these into 2 queries.
  • Monitoring/Observability — All requests go to POST /graphql with status 200. Traditional HTTP monitoring (error rates by endpoint) does not work. You need GraphQL-aware monitoring (Apollo Studio, GraphQL Hive).
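The N+1 arithmetic above (101 queries versus 2) can be reproduced with a toy data layer. FakeDB and both fetch functions are hypothetical stand-ins; a real DataLoader additionally collects keys requested within one execution tick, which this sketch simplifies to a single batched call.

```python
from typing import Dict, List


class FakeDB:
    """In-memory stand-in for a database that counts the queries it serves."""

    def __init__(self, user_count: int):
        self.user_ids = list(range(user_count))
        self.queries = 0

    def all_users(self) -> List[int]:
        self.queries += 1                       # SELECT * FROM users
        return list(self.user_ids)

    def orders_for_user(self, user_id: int) -> List[str]:
        self.queries += 1                       # ... WHERE user_id = ?
        return [f"order-{user_id}"]

    def orders_for_users(self, user_ids: List[int]) -> Dict[int, List[str]]:
        self.queries += 1                       # ... WHERE user_id IN (...)
        return {uid: [f"order-{uid}"] for uid in user_ids}


def fetch_naive(db: FakeDB) -> Dict[int, List[str]]:
    """One query for users, then one query per user."""
    return {uid: db.orders_for_user(uid) for uid in db.all_users()}


def fetch_batched(db: FakeDB) -> Dict[int, List[str]]:
    """One query for users, then a single batched query for all orders."""
    return db.orders_for_users(db.all_users())


naive_db, batched_db = FakeDB(100), FakeDB(100)
naive_result = fetch_naive(naive_db)         # naive_db.queries is now 101
batched_result = fetch_batched(batched_db)   # batched_db.queries is now 2
```

Both paths return identical data; only the number of round trips to the database changes.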
Production reality: GitHub moved to GraphQL (v4 API) and it works well for them — their data is deeply interconnected (repos, issues, PRs, users, organizations). Shopify uses GraphQL for their admin API. But many teams have adopted GraphQL prematurely and spent more time fighting caching, authorization, and performance issues than building features.Red flag answer: “GraphQL is always better than REST because it prevents over-fetching” — shows no awareness of the caching, security, and operational complexity trade-offs.Follow-up questions:Q: You are building a public API. Would you choose GraphQL or REST? REST with OpenAPI for a public API. Reasons: (1) Universal familiarity — every developer knows HTTP. (2) HTTP-native caching (CDNs, browser caches) works out of the box. (3) Rate limiting is straightforward. (4) Security is well-understood (no query depth attacks). (5) Status codes provide clear error semantics. Consider offering GraphQL in addition for advanced consumers who need flexible queries (like GitHub does), but REST should be the primary interface. Stripe, Twilio, and AWS all use REST for their public APIs.Q: Your GraphQL API has terrible performance. The team says “GraphQL is slow.” How do you diagnose and fix it? GraphQL itself adds negligible overhead. The real issues: (1) Check for N+1 queries — add DataLoader for every resolver that fetches related data. (2) Profile the most expensive queries using Apollo tracing or your APM (Datadog). (3) Implement query complexity limits — reject queries above a cost threshold. (4) Add persisted queries — clients send a hash instead of the full query string, enabling caching and reducing payload size. (5) Check resolver implementation — each field resolution should be O(1) with batched data loading, not independent DB queries. (6) Consider field-level caching in the resolver layer.
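Query depth limiting, mentioned in the security discussion above, can be illustrated with a toy representation. This sketch assumes the query has already been parsed into nested dictionaries (field name to sub-selection); real tools such as graphql-depth-limit walk the GraphQL AST instead, and the functions here are hypothetical.

```python
def depth(selection: dict) -> int:
    """Depth of a selection set given as {field: sub_selection} dictionaries."""
    if not selection:
        return 0
    return 1 + max(depth(child) for child in selection.values())


def is_allowed(selection: dict, max_depth: int) -> bool:
    """Reject queries nested more deeply than max_depth before executing them."""
    return depth(selection) <= max_depth


# { users { posts { comments { author { posts { comments } } } } } }
malicious = {"users": {"posts": {"comments": {"author": {"posts": {"comments": {}}}}}}}
modest = {"users": {"name": {}, "email": {}}}
```

With a limit of 5, the nested query (depth 6) is rejected before any resolver runs, while the modest query (depth 2) passes.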
What interviewers are really testing: Whether you can design a reliable webhook system that handles the messy reality of network failures, slow consumers, and security concerns.
Answer: Webhooks are HTTP callbacks: when an event occurs on the provider’s side, the provider sends an HTTP POST request to a URL previously registered by the consumer. The model is “Don’t call us, we’ll call you”: instead of the consumer polling for changes, the provider pushes notifications.
How a production webhook system works:
  1. Registration — Consumer registers a callback URL and specifies event types:
POST /webhooks
{ "url": "https://consumer.com/callback", "events": ["order.created", "payment.completed"] }
  2. Event occurrence. An order is created on the provider’s side.
  3. Delivery. Provider sends:
POST https://consumer.com/callback
Headers:
  Content-Type: application/json
  X-Webhook-Signature: sha256=abc123...
  X-Webhook-ID: evt_abc123
Body:
  { "event": "order.created", "data": { "id": 42, ... }, "timestamp": "2024-01-15T10:30:00Z" }
  4. Acknowledgment. Consumer returns 200 OK within a timeout (typically 5-30 seconds).
Production-critical design decisions:
  • Security (HMAC signatures) — Sign the payload with a shared secret. Consumer verifies the signature before processing. Without this, anyone can POST to the callback URL. Stripe uses HMAC-SHA256; GitHub uses HMAC-SHA256 with a configurable secret.
  • Retry policy — If the consumer returns a non-2xx or times out, retry with exponential backoff: 1 min, 5 min, 30 min, 2 hours, 24 hours. After max retries (typically 5-10), disable the webhook and notify the consumer.
  • Idempotency — Webhooks may be delivered more than once (retry after timeout, network glitch). Include a unique event ID (X-Webhook-ID). Consumers must check for duplicate event IDs and skip already-processed events.
  • Ordering — Webhook delivery order is NOT guaranteed. Events may arrive out of order, especially with retries. Include a timestamp or sequence number. Consumers should use the timestamp or sequence to resolve conflicts, not assume ordering.
  • Timeout — If the consumer takes 60 seconds to process, the provider’s HTTP client times out and retries, causing duplicate processing. Best practice: consumer immediately returns 200 OK and processes the event asynchronously (enqueue to a local job queue).
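On the consumer side, the signature, idempotency, and timeout points above combine into a small handler. The sketch below assumes the "sha256=<hex digest>" header format shown in the delivery example; WebhookReceiver, its in-memory set, and its list-based queue are illustrative stand-ins for a durable store and a real job queue.

```python
import hashlib
import hmac
from typing import List, Set


def sign(secret: bytes, body: bytes) -> str:
    """Compute the signature header value for a raw request body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, header: str) -> bool:
    """Constant-time comparison to avoid leaking the signature via timing."""
    return hmac.compare_digest(sign(secret, body).encode(), header.encode())


class WebhookReceiver:
    def __init__(self, secret: bytes):
        self.secret = secret
        self.seen_ids: Set[str] = set()   # use a durable store in production
        self.queue: List[bytes] = []      # stand-in for an async job queue

    def handle(self, body: bytes, signature: str, event_id: str) -> int:
        """Return the HTTP status code the consumer should respond with."""
        if not verify(self.secret, body, signature):
            return 401                    # unsigned or tampered payload
        if event_id in self.seen_ids:
            return 200                    # duplicate: acknowledge, skip work
        self.seen_ids.add(event_id)
        self.queue.append(body)           # enqueue, process after responding
        return 200
```

Verifying against the raw bytes of the body matters: re-serializing parsed JSON can change whitespace or key order and invalidate the signature.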
Red flag answer: “Webhooks are just HTTP POST notifications” without discussing signatures, retries, idempotency, or ordering.Follow-up questions:Q: A consumer’s webhook endpoint is down for 6 hours. How does your system handle this? (1) Retries with exponential backoff queue up during the outage. (2) After max retries, the events go to a dead letter queue. (3) The webhook is marked as “failing” in the dashboard. (4) When the consumer comes back online, they can request replay of missed events via a recovery API: GET /events?since=2024-01-15T04:00:00Z. (5) The provider stores events for a retention window (7-30 days) to support replay. Stripe and Shopify both provide event replay APIs for exactly this scenario.Q: How do you prevent a slow webhook consumer from affecting your system’s throughput? (1) Use a separate delivery queue (SQS, RabbitMQ) per consumer or per consumer tier. (2) Set aggressive timeouts (5-10 seconds) on the delivery HTTP call. (3) Implement per-consumer rate limiting on delivery attempts. (4) Use a circuit breaker on each consumer’s endpoint — if a consumer consistently fails, stop attempting delivery and notify them. (5) Never deliver webhooks synchronously in the main request path — always enqueue for async delivery.
What interviewers are really testing: Whether you know when SSE is a simpler, better choice than WebSockets and understand its limitations.
Answer: Server-Sent Events (SSE) are a standard HTTP mechanism for the server to push events to the client over a single, long-lived HTTP connection.
How it works:
  1. Client opens a connection: GET /events with Accept: text/event-stream.
  2. Server keeps the connection open and sends events as they occur:
data: {"type": "price_update", "symbol": "AAPL", "price": 150.25}

data: {"type": "price_update", "symbol": "GOOG", "price": 2800.50}
id: 42

event: alert
data: {"message": "Market closing in 5 minutes"}
  3. The EventSource browser API auto-reconnects on disconnection and sends the Last-Event-ID header to resume from where it left off.
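The wire format and the resume step can be sketched on the server side as follows. format_sse and events_after are hypothetical helpers, and the sketch assumes event IDs are increasing integers so that "newer than Last-Event-ID" is a simple comparison.

```python
import json
from typing import List, Optional, Tuple


def format_sse(data: dict, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one event in text/event-stream format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"      # a blank line terminates the event


def events_after(buffer: List[Tuple[int, str]],
                 last_event_id: Optional[str]) -> List[Tuple[int, str]]:
    """Return buffered (id, frame) pairs newer than the client's Last-Event-ID."""
    if last_event_id is None:
        return list(buffer)
    return [(i, frame) for i, frame in buffer if i > int(last_event_id)]


frame = format_sse({"message": "Market closing in 5 minutes"},
                   event="alert", event_id="43")
```

On reconnect, the server reads the Last-Event-ID request header, calls something like events_after on its recent-event buffer, and writes the missed frames before resuming the live stream.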
SSE vs WebSockets:
Aspect | SSE | WebSockets
Direction | Unidirectional (server to client) | Bidirectional
Protocol | Standard HTTP | Custom protocol (ws://)
Reconnection | Automatic (built into EventSource API) | Manual (must implement)
Data format | Text (UTF-8) only | Text or binary
HTTP/2 compatibility | Excellent (multiplexed with other requests) | Separate TCP connection
Load balancer / Proxy support | Works through standard HTTP infrastructure | Requires WebSocket-aware proxies
Browser support | All modern browsers (no IE) | All modern browsers
Max connections | ~6 per domain in HTTP/1.1; in HTTP/2, bounded by the negotiated maximum of concurrent streams (commonly 100) | Not subject to the HTTP/1.1 per-domain limit
When to use SSE over WebSockets:
  • Live feeds — stock prices, news, social media feeds, notifications. Server pushes, client displays.
  • Progress updates — long-running task progress, build logs, deployment status.
  • Dashboard updates — metrics, monitoring data, real-time charts.
  • The key signal: if the client only needs to receive data and never sends data through the same channel, SSE is simpler and more robust than WebSockets.
When SSE falls short:
  • Chat / Collaborative editing — requires bidirectional communication.
  • Gaming — requires low-latency bidirectional binary data.
  • Binary data — SSE is text-only (though you can Base64-encode, at a size penalty).
Red flag answer: “Just use WebSockets for everything” without knowing that SSE exists or is simpler for unidirectional use cases.Follow-up questions:Q: You need to push real-time notifications to a web app. Would you choose SSE or WebSockets? Why? SSE. Notifications are unidirectional (server to client). SSE provides automatic reconnection with Last-Event-ID resumption — if the client disconnects (mobile network switch), it automatically reconnects and catches up on missed events. WebSockets would require implementing reconnection, resumption, and message buffering manually. SSE also works through corporate proxies and CDNs that may block WebSocket upgrades. The only reason to choose WebSockets here would be if the client also needs to send acknowledgments (read receipts) through the same channel.Q: How does SSE scale to 100,000 concurrent connections? Each SSE connection holds an open HTTP connection, consuming server resources (file descriptor, small memory per connection). At 100K connections: (1) Use an event-loop based server (Node.js, Go, Nginx push module) that handles many concurrent connections on few threads. (2) Fan out events via a pub/sub system (Redis Pub/Sub, Kafka) — the SSE server subscribes to event channels and pushes to connected clients. (3) Horizontally scale SSE servers behind a load balancer with sticky sessions or connection-aware routing. (4) With HTTP/2, multiple SSE streams multiplex over a single TCP connection, reducing resource usage. (5) Consider a managed service (AWS AppSync, Pusher, Ably) for very high connection counts.
What interviewers are really testing: Whether you understand the full lifecycle, when WebSockets are truly necessary, and the operational challenges they introduce.
Answer: WebSockets provide full-duplex, bidirectional communication over a single TCP connection. After an initial HTTP handshake (upgrade), the connection switches to the WebSocket protocol (ws:// or wss://).
Lifecycle:
  1. Handshake — Client sends HTTP GET with Upgrade: websocket header. Server responds with 101 Switching Protocols.
  2. Communication — Both sides can send text or binary frames at any time. No request/response pattern. Extremely low overhead per message (2-14 bytes header vs ~700+ bytes for HTTP headers).
  3. Keep-alive — Ping/pong frames maintain the connection. If no pong received, connection is considered dead.
  4. Close — Either side sends a close frame. Clean shutdown with status codes.
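As a small illustration of the handshake step, the server proves it understood the upgrade by echoing a hash of the client's `Sec-WebSocket-Key` in its `Sec-WebSocket-Accept` header. The sketch below follows RFC 6455; the function name `websocket_accept` is an assumption.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a client's Sec-WebSocket-Key."""
    # SHA-1 over key + GUID, then base64 of the raw digest.
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key from RFC 6455, section 1.3.
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```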
Use cases where WebSockets are the right choice:
  • Real-time chat — bidirectional, low-latency message exchange.
  • Multiplayer gaming — sub-10ms latency for game state updates.
  • Collaborative editing — Google Docs-style real-time cursor and text sync.
  • Financial trading — order book updates and trade execution.
  • Live sports / auction — bidirectional interaction (bidding) with real-time updates.
Operational challenges:
  • Scaling — Each WebSocket connection is stateful (tied to a specific server). If you have 10 servers and a user connects to server 3, their messages go through server 3. If server 3 crashes, the connection is lost. Solutions: Redis Pub/Sub or Kafka for cross-server message broadcasting, sticky sessions with graceful reconnection.
  • Load balancing — Standard HTTP load balancers round-robin each request. WebSocket connections are long-lived, so the initial connection routing determines which server handles it for the duration. Use session affinity or an L7 load balancer that understands WebSocket upgrade.
  • Authentication — WebSockets do not natively support headers after the handshake. Options: (a) authenticate during the HTTP upgrade handshake (check cookie/token), (b) send an auth message as the first frame after connection.
  • Reconnection — Connections will drop (mobile network switches, server deploys). The client must implement reconnection logic with exponential backoff and state recovery (what messages did I miss?).
  • Proxies and firewalls — Some corporate proxies terminate long-lived connections. wss:// (WebSocket Secure) usually passes through because it looks like HTTPS.
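The reconnection bullet above mentions exponential backoff. A minimal sketch of the delay schedule, using the "full jitter" variant (a random delay between zero and the exponential ceiling), might look like this. The function name and the default values are assumptions, and the actual reconnect call is left to the client library.

```python
import random
from typing import Callable, Iterator


def backoff_delays(
    base: float = 0.5,
    factor: float = 2.0,
    cap: float = 30.0,
    attempts: int = 6,
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield reconnect delays in seconds: exponential growth, capped, with jitter."""
    for attempt in range(attempts):
        # Ceiling grows 0.5s, 1s, 2s, ... until it hits the cap.
        ceiling = min(cap, base * factor ** attempt)
        # Full jitter spreads clients out so they do not reconnect in lockstep.
        yield rng() * ceiling
```

Passing `rng` explicitly keeps the schedule testable; in production the default random source is what provides the thundering-herd protection.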
Red flag answer: “WebSockets are the best way to do real-time” without acknowledging the scaling, reconnection, and operational complexity. Or using WebSockets for simple notifications that SSE handles better.
Follow-up questions:
Q: You are building a chat application that needs to support 1 million concurrent connections. How do you architect the WebSocket layer?
(1) Use event-loop servers (Go, Erlang/Elixir, Node.js); a single Go server can handle 500K+ concurrent WebSocket connections with proper tuning (file descriptor limits, TCP buffer sizes). (2) Use Redis Pub/Sub or Kafka as a message bus between WebSocket servers. When user A on Server 1 sends a message to user B on Server 3, Server 1 publishes to a Redis channel, Server 3 subscribes and delivers. (3) Group connections by chat room/channel to reduce broadcast scope. (4) Implement connection-aware load balancing (consistent hashing by user ID). (5) Use a dedicated presence service that tracks which server each user is connected to. Discord handles 10M+ concurrent connections using Elixir/Erlang for WebSocket servers with internal message routing.
Q: During a server deployment, all WebSocket connections are dropped and users see a disconnection. How do you fix this?
(1) Implement rolling deployments: drain connections from one server at a time. Send a close frame with a “server going away” reason before shutting down, giving clients a chance to reconnect to another server. (2) Client-side reconnection with exponential backoff and jitter (0.5s, 1s, 2s + random jitter to avoid thundering herd). (3) On reconnect, client sends a “last_event_id” or timestamp so the server can replay missed messages. (4) Use connection draining timeout (30-60 seconds) in the load balancer to let active messages complete before killing the old server. (5) Consider deploying WebSocket servers on a separate, less-frequently-deployed infrastructure from the REST API.
What interviewers are really testing: Whether you understand binary serialization deeply enough to know its advantages AND its complexities around schema evolution.
Answer:
Protocol Buffers (Protobuf) is Google’s language-neutral, platform-neutral, extensible binary serialization format. It is the default serialization for gRPC.
How it works:
  1. Define a schema in a .proto file:
syntax = "proto3";
message User {
  int32 id = 1;
  string name = 2;
  string email = 3;
  repeated string tags = 4;
}
  2. Run the protoc compiler to generate code in your target language (Go, Java, Python, C++, etc.).
  3. Serialize/deserialize using the generated code. The binary format uses field numbers (1, 2, 3), not field names, which is why renaming a field is free but changing its number breaks compatibility.
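To show why only field numbers reach the wire, here is a hand-rolled sketch of the Protobuf encoding for two field kinds: varint integers and length-delimited strings. It is for illustration only (generated code does this for you), handles only non-negative integers, and the function names are assumptions.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Tag is (field_number << 3) | wire type 0 (varint), then the value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Tag uses wire type 2 (length-delimited), then byte length, then UTF-8 bytes."""
    payload = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


# A User with id = 1 and name = "Al": 6 bytes, and no field names anywhere.
user = encode_int_field(1, 1) + encode_string_field(2, "Al")
print(user.hex())  # 080112024136c -> see assertion-style check below
```

The last line prints the hex form of `b"\x08\x01\x12\x02Al"`; the point is that the names `id` and `name` never appear in the output.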
Why Protobuf over JSON:
  • Size — Protobuf messages are 30-80% smaller than equivalent JSON. Field names are not transmitted (only numbers). Integers use variable-length encoding (small numbers use fewer bytes).
  • Speed — Binary parsing is 5-100x faster than JSON text parsing. No string-to-number conversion, no quote parsing, no whitespace handling.
  • Schema enforcement — The .proto file is the contract. Both sides must agree on the schema. Type mismatches are caught at compile time, not runtime.
  • Code generation — Strongly-typed serialization/deserialization code generated for every supported language. No hand-written JSON parsing.
Schema evolution rules (critical for production):
  • Safe: Add a new field (with a new field number). Old clients ignore it. New clients see default value from old messages.
  • Safe: Remove a field (mark it reserved so the number is not reused). Old clients sending this field are fine — new code ignores it.
  • Safe: Rename a field (field numbers are used in the binary format, not names).
  • Dangerous: Change a field’s type (e.g., int32 to string). Causes deserialization errors.
  • Dangerous: Reuse a field number for a different purpose. Old data with that field number will be misinterpreted.
Protobuf vs other binary formats:
  • Avro — Schema is sent with the data (or fetched from a schema registry). Better for big data pipelines (Kafka + Schema Registry). No field numbers needed.
  • MessagePack — JSON-like but binary. No schema required. Simpler but less type-safe.
  • FlatBuffers — Zero-copy deserialization (read fields directly from the buffer without parsing). Used in gaming and performance-critical mobile apps.
Red flag answer: “Protobuf is just a faster JSON” without understanding schema evolution, field numbers, or the code generation workflow.
Follow-up questions:
Q: You need to add a required field to a Protobuf message that is used by 50 services. How do you roll this out safely?
In proto3, there are no “required” fields; all fields have default values. Add the new field with a new field number. Deploy consumers first (they will handle the new field). Then deploy producers (they start sending it). Services that have not upgraded yet will simply see the default value for the new field. If you need to enforce that the field is always present, do it at the application layer (validation logic), not the protobuf layer. This is the key lesson from proto2 to proto3: “required” fields were a source of compatibility nightmares.
Q: Your team is debating between Protobuf and JSON for Kafka messages between microservices. What do you recommend?
Protobuf with a Schema Registry (Confluent Schema Registry supports Protobuf). Reasons: (1) Kafka messages at scale benefit enormously from the 30-80% size reduction: less network bandwidth, less disk storage, lower broker CPU. (2) Schema Registry enforces compatibility rules on every publish, preventing producers from pushing breaking changes. (3) The schema doubles as documentation. The trade-off: debugging Kafka messages requires decoding binary Protobuf (tools like kafkacat with schema registry integration help). For low-volume internal messaging or rapid prototyping, JSON with a JSON Schema registry is simpler.
What interviewers are really testing: Whether you understand asynchronous messaging patterns deeply enough to choose between queues and streams, configure delivery guarantees, and handle failure scenarios.
Answer:
Message queues decouple producers (senders) from consumers (receivers), enabling asynchronous communication. The producer publishes a message to the queue and continues without waiting for the consumer to process it.
Two fundamental models:
Message Queue (Point-to-Point):
  • One message is consumed by exactly one consumer (competing consumers pattern).
  • Once consumed, the message is removed from the queue.
  • Tools: RabbitMQ, Amazon SQS, ActiveMQ.
  • Use case: Task distribution — 10 worker pods each pulling jobs from the queue.
Message Stream (Publish/Subscribe):
  • Messages are appended to a durable, ordered log. Multiple consumer groups each read all messages independently.
  • Messages persist for a configurable retention period (not removed on consumption).
  • Tools: Apache Kafka, Amazon Kinesis, Redis Streams, Pulsar.
  • Use case: Event bus — Order service publishes OrderCreated, and Payment, Inventory, Notifications, and Analytics services all consume it independently.
Delivery guarantees:
| Guarantee | Meaning | Implementation | Use case |
| --- | --- | --- | --- |
| At-most-once | Fire and forget. Message may be lost. | No acknowledgment. | Metrics, logs (acceptable loss) |
| At-least-once | Message delivered one or more times. Duplicates possible. | Consumer acknowledges after processing. Broker redelivers unacknowledged. | Most business use cases (with idempotent consumers) |
| Exactly-once | Message processed exactly once. | Transactional outbox + idempotent consumer, or Kafka’s transactional API. | Financial transactions, inventory |
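At-least-once delivery only behaves like exactly-once processing if the consumer tolerates duplicates. A minimal sketch of an idempotent consumer that deduplicates on a message ID follows. The class name and the in-memory `set` are assumptions; a real system would keep seen IDs in a durable store, ideally updated in the same transaction as the side effect.

```python
from typing import Any, Callable, Hashable, Set


class IdempotentConsumer:
    """Wraps a handler so redelivered messages are processed only once."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        self._seen: Set[Hashable] = set()  # stand-in for a durable dedup store

    def on_message(self, message_id: Hashable, payload: Any) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False  # duplicate delivery: safe to acknowledge and skip
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a crash
        # mid-processing leads to a retry rather than a lost message.
        self._seen.add(message_id)
        return True


ledger = []
consumer = IdempotentConsumer(ledger.append)
consumer.on_message("msg-1", {"order": 42})
consumer.on_message("msg-1", {"order": 42})  # broker redelivery
print(len(ledger))  # 1
```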
RabbitMQ vs Kafka (the most common comparison):
  • RabbitMQ — Smart broker, dumb consumer. Broker tracks which messages are delivered and acknowledged. Supports complex routing (topic exchange, headers exchange, dead letter exchange). Lower throughput (~50K messages/sec) but more flexible routing. Best for task queues, RPC-style messaging.
  • Kafka — Dumb broker, smart consumer. Consumers track their offset (position in the log). Messages persist for days/weeks. Massive throughput (millions of messages/sec per cluster). Best for event streaming, log aggregation, real-time analytics.
Production patterns:
  • Dead Letter Queue (DLQ) — Messages that fail processing after max retries go to a DLQ for manual investigation. Without this, poison messages (that always fail) block the queue forever.
  • Backpressure — When consumers cannot keep up, the queue grows. Monitor queue depth and consumer lag as key metrics. Alert when lag exceeds SLA.
  • Message ordering — RabbitMQ guarantees ordering per queue. Kafka guarantees ordering per partition. To maintain order for a specific entity (all events for user 42), use a consistent partition key.
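The message-ordering bullet relies on a consistent partition key. The sketch below shows the idea: hash the key and take it modulo the partition count, so every event for the same entity lands in the same partition. It uses SHA-256 purely for illustration; Kafka's default partitioner uses a different hash, and the function name is an assumption.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (illustrative, not Kafka's hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a partition index.
    return int.from_bytes(digest[:8], "big") % num_partitions


# All events for user 42 go to one partition, so their relative order is preserved.
events = ["created", "updated", "deleted"]
partitions = {partition_for("user-42", 12) for _ in events}
print(len(partitions))  # 1
```

Because the partition index depends on `num_partitions`, changing the partition count can move a key to a different partition.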
Red flag answer: “Kafka and RabbitMQ are the same thing” or not understanding delivery guarantees.
Follow-up questions:
Q: Your Kafka consumer group has 3 consumers but the topic has 12 partitions. What happens? What if you scale to 15 consumers?
With 3 consumers and 12 partitions, each consumer processes 4 partitions. If you scale to 15 consumers, 12 will each get 1 partition, and 3 will be idle: within a consumer group each partition is assigned to at most one consumer, so the group cannot have more active consumers than partitions. This is why partition count is a capacity planning decision: it sets the upper bound on consumer parallelism. If you expect to scale to 20 consumers, create 20+ partitions upfront (adding partitions later changes the key-to-partition mapping, so per-key ordering is no longer guaranteed across the change). Common production rule: start with 2 * expected_max_consumers partitions.
Q: A consumer processes a message and crashes before acknowledging it. What happens?
With at-least-once delivery (the standard): the broker (RabbitMQ) redelivers the message to another consumer, or (Kafka) the consumer group rebalances and another consumer picks up from the last committed offset. The message is processed again, potentially as a duplicate. This is why consumers must be idempotent: use a unique message ID to detect duplicates, or design operations to be naturally idempotent (set a value, not increment it). In Kafka, every message between the last committed offset and the crash point (the “uncommitted offset gap”) will be reprocessed.
What interviewers are really testing: Whether you understand the specific performance problems HTTP/2 solves and the new problems it introduces.
Answer:
HTTP/2 is a major revision of the HTTP protocol focused on performance. It was standardized in 2015 based on Google’s SPDY protocol.
Key improvements:
  1. Multiplexing — Multiple requests and responses are interleaved over a single TCP connection. In HTTP/1.1, browsers open 6-8 parallel TCP connections per domain because each connection handles only one request at a time. HTTP/2 eliminates this — one connection carries hundreds of concurrent streams. This eliminates head-of-line blocking at the HTTP layer (but not at the TCP layer — see HTTP/3).
  2. Header Compression (HPACK) — HTTP headers are repetitive (same Cookie, User-Agent, Accept on every request). HPACK compresses headers using a static dictionary (common headers), dynamic table (previously seen headers), and Huffman encoding. Reduces header overhead by 30-90%. Critical for mobile where every byte matters.
  3. Server Push — Server proactively sends resources it knows the client will need (e.g., push CSS and JS along with the HTML response). In practice, this feature was rarely used correctly and is being deprecated in some contexts (Chrome removed support in 2022). Early hints (103 Early Hints) is the modern replacement.
  4. Stream Prioritization — Clients can assign priority to streams (e.g., CSS is more important than images). Servers can use this to optimize resource delivery order. In practice, browser implementations and server support vary, making this unreliable.
  5. Binary framing — HTTP/2 messages are split into binary frames, making parsing more efficient and eliminating the ambiguities of HTTP/1.1 text parsing.
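To illustrate binary framing and multiplexing together: every HTTP/2 frame starts with a fixed 9-byte header that carries the payload length, frame type, flags, and the stream ID that lets frames from different requests interleave on one connection. A minimal parser sketch follows; the function name and the returned dictionary shape are assumptions, and only a subset of frame types is named.

```python
import struct

# Subset of frame type codes from the HTTP/2 specification.
FRAME_TYPES = {
    0x0: "DATA",
    0x1: "HEADERS",
    0x3: "RST_STREAM",
    0x4: "SETTINGS",
    0x6: "PING",
    0x7: "GOAWAY",
    0x8: "WINDOW_UPDATE",
}


def parse_frame_header(header: bytes) -> dict:
    """Decode the 9-byte HTTP/2 frame header."""
    if len(header) != 9:
        raise ValueError("an HTTP/2 frame header is exactly 9 bytes")
    # 24-bit length (split as 8 + 16 bits), 8-bit type, 8-bit flags, 32-bit stream field.
    length_hi, length_lo, frame_type, flags, stream = struct.unpack(">BHBBI", header)
    return {
        "length": (length_hi << 16) | length_lo,
        "type": FRAME_TYPES.get(frame_type, f"UNKNOWN(0x{frame_type:x})"),
        "flags": flags,
        # The top bit is reserved; the remaining 31 bits identify the stream.
        "stream_id": stream & 0x7FFFFFFF,
    }


# A HEADERS frame with a 13-byte payload on stream 3.
print(parse_frame_header(bytes([0, 0, 13, 0x1, 0x5, 0, 0, 0, 3])))
```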
Performance numbers:
  • HTTP/2 typically reduces page load times by 15-30% for asset-heavy web pages.
  • API-to-API communication sees less dramatic improvement unless making many parallel requests (gRPC benefits significantly).
  • Connection setup is faster because one TCP + TLS handshake serves all requests (vs 6 handshakes for 6 connections).
The TCP head-of-line blocking problem: HTTP/2 multiplexes over one TCP connection. If a single TCP packet is lost, all streams are blocked until the packet is retransmitted (TCP guarantees ordered delivery). This can make HTTP/2 worse than HTTP/1.1 on lossy networks (mobile, satellite) where HTTP/1.1’s multiple connections provide natural isolation. This is the primary motivation for HTTP/3 (QUIC).
Red flag answer: “HTTP/2 is just faster” without being able to explain why (multiplexing, header compression) or the TCP head-of-line blocking trade-off.
Follow-up questions:
Q: Your HTTP/1.1 application uses domain sharding (static1.example.com, static2.example.com) to increase parallelism. Should you keep this with HTTP/2?
No, remove it. Domain sharding was a workaround for HTTP/1.1’s per-connection limitation. With HTTP/2, a single connection handles all requests in parallel. Domain sharding with HTTP/2 actually hurts performance because each domain requires a separate TCP + TLS connection setup, and HTTP/2’s header compression and stream prioritization only work within a single connection. Consolidate all assets to a single domain and let HTTP/2 multiplexing handle the parallelism.
Q: How does HTTP/2 impact your load balancer and reverse proxy configuration?
HTTP/2 changes the connection model: instead of many short-lived connections, you get fewer long-lived connections with many multiplexed streams. (1) Connection-based load balancing (round-robin per connection) becomes inefficient because one connection carries all of a client's traffic. Use request/stream-level (L7) load balancing. (2) Your reverse proxy (NGINX, Envoy) should terminate HTTP/2 from clients but can use HTTP/1.1 or HTTP/2 to backends; choose based on backend capabilities. (3) Monitor active streams per connection rather than active connections as a load metric.
What interviewers are really testing: Whether you understand the specific TCP limitation that HTTP/3 addresses and the broader implications of moving HTTP to UDP.
Answer:
HTTP/3 uses QUIC (originally “Quick UDP Internet Connections”) as its transport layer instead of TCP. QUIC was developed by Google and standardized by the IETF in 2021 (RFC 9000); HTTP/3 itself was published in 2022 (RFC 9114).
The problem HTTP/3 solves: HTTP/2 multiplexes streams over one TCP connection. If one TCP packet is lost, ALL streams are blocked until retransmission completes (TCP’s ordered delivery guarantee). This is TCP-level head-of-line blocking. On lossy networks (mobile, WiFi), this can make HTTP/2 slower than HTTP/1.1.
How QUIC fixes it: QUIC runs on UDP and implements its own reliable delivery per-stream. If a packet belonging to Stream A is lost, only Stream A blocks. Streams B, C, D continue receiving data uninterrupted. This is true stream-level independence.
Additional QUIC benefits:
  1. Faster handshake — TCP + TLS requires 2-3 round trips before data can flow. QUIC combines the transport and crypto handshake into 1 round trip. For repeat connections, 0-RTT resumption sends data in the very first packet (at the cost of replay attack risk for the 0-RTT data).
  2. Connection migration — QUIC connections are identified by a Connection ID, not the 4-tuple (source IP, source port, dest IP, dest port). When your phone switches from WiFi to cellular, the IP address changes, but the QUIC connection survives. TCP connections break on IP change.
  3. Built-in encryption — QUIC always encrypts (TLS 1.3 is mandatory). Even the packet headers are partially encrypted, making it resistant to middlebox interference.
  4. Userspace implementation — QUIC is implemented in user space (not the OS kernel like TCP). This enables faster iteration and deployment of improvements without waiting for OS updates.
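The head-of-line blocking difference described at the start of this answer can be shown with a toy model. Packets are `(sequence_number, stream)` pairs, one packet is lost, and each function returns what the application can read before the retransmission arrives. This is a deliberately simplified sketch (one packet per frame, no retransmission timing), and the function names are assumptions.

```python
from typing import List, Tuple

Packet = Tuple[int, str]  # (sequence number, stream name)


def deliverable_over_tcp(packets: List[Packet], lost_seq: int) -> List[Packet]:
    """One ordered byte stream: nothing after the gap reaches the application."""
    return [p for p in packets if p[0] < lost_seq]


def deliverable_over_quic(packets: List[Packet], lost_seq: int) -> List[Packet]:
    """Per-stream ordering: only the stream that owns the lost packet stalls."""
    lost_stream = next(stream for seq, stream in packets if seq == lost_seq)
    return [
        (seq, stream)
        for seq, stream in packets
        if stream != lost_stream or seq < lost_seq
    ]


packets = [(1, "A"), (2, "B"), (3, "A"), (4, "C"), (5, "B")]
print(deliverable_over_tcp(packets, lost_seq=2))   # [(1, 'A')]
print(deliverable_over_quic(packets, lost_seq=2))  # [(1, 'A'), (3, 'A'), (4, 'C')]
```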
Current adoption:
  • Google services (YouTube, Search) serve ~30% of traffic over QUIC.
  • Cloudflare, Fastly, and Akamai support HTTP/3 on their CDNs.
  • Chrome, Firefox, Safari, and Edge all support HTTP/3.
  • NGINX and Caddy support HTTP/3. HAProxy support is experimental.
Trade-offs:
  • Some corporate firewalls block UDP (they only allow TCP 80/443). HTTP/3 implementations fall back to HTTP/2 over TCP when UDP is blocked.
  • Debugging is harder — existing TCP analysis tools (tcpdump, Wireshark) need QUIC-specific dissectors.
  • CPU usage can be higher than TCP due to userspace processing and per-packet encryption.
Red flag answer: “HTTP/3 is UDP-based so it’s faster” without explaining why UDP helps (per-stream reliability, no HOL blocking) or acknowledging the fallback and debugging trade-offs.
Follow-up questions:
Q: If QUIC is better, why hasn’t everyone migrated from HTTP/2?
(1) Middlebox interference: many corporate networks, firewalls, and ISPs block or throttle UDP traffic (historically, UDP was mainly DNS and gaming). (2) Server-side support is still maturing: not all load balancers and reverse proxies have production-ready HTTP/3. (3) The performance improvement is most noticeable on lossy networks. On low-loss datacenter networks (service-to-service), the difference is negligible. (4) Operational tooling: network engineers are deeply familiar with TCP debugging. QUIC requires new skills and tools. (5) For most web applications, HTTP/2 is “good enough” and the migration effort to HTTP/3 is not justified by the marginal improvement.
Q: How does 0-RTT in QUIC affect security? When would you disable it?
0-RTT sends application data in the very first packet, before the handshake completes. This is vulnerable to replay attacks: an attacker who captures a 0-RTT packet can resend it, and the server processes the request again. Safe for idempotent requests (GET). Dangerous for non-idempotent requests (POST /transfer). Most implementations allow the server to reject 0-RTT data or mark it as potentially replayed. Disable 0-RTT for APIs handling financial transactions or state mutations. Enable it for static content delivery (CDNs) where replay has no side effects.
What interviewers are really testing: Whether you can design payment-grade reliability into APIs where duplicates can cost real money.
Answer:
An idempotency key is a unique identifier (typically a UUID) sent by the client in a request header to ensure that retrying the same request does not cause duplicate side effects.
How it works:
POST /v1/payments
Headers:
  Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Body:
  { "amount": 100, "currency": "usd", "customer": "cus_123" }
Server-side flow:
  1. Receive request. Extract Idempotency-Key.
  2. Check key-value store (Redis, DynamoDB) for the key.
  3. Key not found — Process the request. Store {key: "550e...", response: {status: 200, body: ...}, created_at: now} with a TTL (Stripe uses 24 hours). Return the response.
  4. Key found, request still processing — Return 409 Conflict or 425 Too Early to prevent concurrent duplicate execution.
  5. Key found, request completed — Return the stored response verbatim. No re-processing.
Critical implementation details:
  • Key storage must be atomic. Use SET key value NX EX 86400 in Redis (set-if-not-exists with TTL). This prevents race conditions where two concurrent retries both see “key not found.”
  • Store the full response, not just “processed.” The client expects the same response on retry, including the created resource ID.
  • TTL decision: 24 hours covers any reasonable retry window. Shorter TTLs risk the client retrying after key expiry and creating a duplicate. Longer TTLs waste storage.
  • Key scope: The key should be scoped to the API key or user. Two different users sending the same key should be treated as different requests.
  • What happens if the request partially succeeds? Example: payment is charged but the response never reaches the client. On retry, the server sees the key with a stored response (success) and returns it. The client gets the confirmation it missed. Without idempotency keys, the client would retry and the payment would be charged twice.
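The server-side flow and the implementation details above can be sketched as follows. This is an in-memory stand-in: a lock plus a dictionary plays the role of the atomic `SET ... NX` reservation, TTL handling is omitted, and the class name, record layout, and the 409 response body are assumptions for illustration.

```python
import threading
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    """Reserve a key atomically, run the operation once, replay the stored response."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def handle(self, scope: str, key: str, process: Callable[[], dict]) -> dict:
        record_key = (scope, key)  # scope the key to the API key or user
        with self._lock:
            record = self._records.get(record_key)
            if record is None:
                # Atomic reservation: only one caller ever sees "not found".
                self._records[record_key] = {"state": "in_progress", "response": None}
        if record is not None:
            if record["state"] == "in_progress":
                return {"status": 409, "body": "request with this key is still processing"}
            return record["response"]  # completed: replay verbatim
        try:
            response = process()
        except Exception:
            # Release the reservation so the client can retry after a failure.
            with self._lock:
                del self._records[record_key]
            raise
        with self._lock:
            self._records[record_key] = {"state": "completed", "response": response}
        return response


charges = []

def charge() -> dict:
    charges.append(100)
    return {"status": 200, "body": {"id": f"pay_{len(charges)}"}}

store = IdempotencyStore()
first = store.handle("cus_123", "550e8400", charge)
retry = store.handle("cus_123", "550e8400", charge)
print(first == retry, len(charges))  # True 1
```

Storing the whole response, not just a flag, is what lets the retry return the same resource ID as the original call.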
Who uses this: Stripe (all mutating operations), Amazon Pay, Square, PayPal, Adyen. It is considered a best practice for any API where duplicates cause financial or business harm.
Red flag answer: “Just check if the payment already exists by amount and customer”. This is fragile (two legitimate $100 payments to the same customer would be incorrectly deduplicated). Or not understanding why the response must be stored, not just a “processed” flag.
Follow-up questions:
Q: The client sends a POST with an idempotency key, but the server crashes mid-processing (after charging the card but before storing the response). What happens on retry?
This is the hardest edge case. The card is charged, but no idempotency record exists. On retry, without careful design, the card gets charged again. Solutions: (1) Use the Outbox pattern: write the idempotency record and the business state change in the same database transaction. If the transaction commits, both are recorded atomically. (2) Two-phase approach: first, atomically reserve the idempotency key (mark as “in-progress”). Then process the payment. Then update the key with the response. On retry, if the key is “in-progress,” check the payment processor for the existing charge (using a payment reference ID) before retrying. (3) Payment processor’s own idempotency: Stripe’s API is itself idempotent if you pass their idempotency key. Even if your server retries the Stripe call, Stripe will not double-charge.
Q: How would you implement idempotency keys for a distributed system with multiple API server instances?
The key-value store must be shared and consistent across all instances. Redis (single-instance or Redis Cluster) is the standard choice. The critical operation is the atomic check-and-set: SETNX (or SET ... NX) ensures only one instance processes a given key. All instances read/write the same Redis cluster. For higher availability, use DynamoDB with conditional writes (attribute_not_exists(idempotency_key)) or PostgreSQL with INSERT ... ON CONFLICT DO NOTHING. The choice depends on your existing infrastructure and latency requirements.

4. Security & Scaling

What interviewers are really testing: Whether you understand when to use each flow and the security implications of choosing the wrong one. This separates developers who have implemented auth from those who just configured a library.
Answer:
OAuth 2.0 is an authorization framework (not authentication). It allows a third-party application to access a user’s resources without knowing their password.
Key roles: Resource Owner (user), Client (app), Authorization Server (issues tokens), Resource Server (API).
Flows (Grant Types):
  1. Authorization Code Flow (Server-side apps):
    • The most secure flow for applications with a backend server.
    • User is redirected to the auth server, logs in, approves, and the auth server redirects back with a short-lived code.
    • The backend server exchanges the code + client_secret for tokens (server-to-server, the secret never touches the browser).
    • Use for: traditional web apps (Rails, Django, Express).
  2. Authorization Code + PKCE (Mobile/SPA):
    • Same as above but the client_secret is replaced by a dynamically generated code_verifier + code_challenge (SHA-256 hash).
    • PKCE (Proof Key for Code Exchange) prevents authorization code interception attacks.
    • The OAuth 2.1 draft recommends PKCE for ALL clients, even server-side.
    • Use for: mobile apps, single-page applications, desktop apps — any public client that cannot safely store a secret.
  3. Client Credentials (Machine-to-Machine):
    • No user involved. The application authenticates itself with client_id + client_secret.
    • Returns an access token directly (no user interaction).
    • Use for: service-to-service communication, cron jobs, internal APIs.
  4. Device Authorization (Smart TVs, IoT):
    • Device displays a code. User goes to a URL on their phone, enters the code, and approves.
    • Device polls the auth server until approval.
    • Use for: devices without a browser or keyboard.
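For the PKCE flow in item 2, the client generates a random `code_verifier`, sends its SHA-256 based `code_challenge` with the authorization request, and later proves possession by sending the verifier with the token request. A minimal sketch of that derivation (the S256 method from RFC 7636) follows; the function names are assumptions.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Generate a high-entropy verifier: 32 random bytes become 43 URL-safe characters."""
    return secrets.token_urlsafe(32)


def code_challenge_s256(code_verifier: str) -> str:
    """Derive the challenge: base64url(SHA-256(verifier)) without '=' padding."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Example values from RFC 7636, Appendix B.
verifier = "dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk"
print(code_challenge_s256(verifier))  # E9Melhoa2OwvFrEMTJguCHaoeK1t8URWbuGJSstw-cM
```

An attacker who intercepts the authorization code cannot redeem it, because the token endpoint requires the original verifier and only its hash ever travelled through the browser redirect.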
Deprecated flows:
  • Implicit — Returned tokens directly in the URL fragment. Vulnerable to token leakage via browser history and referrer headers. Replaced by Auth Code + PKCE.
  • Resource Owner Password Credentials (ROPC) — Client collects username/password directly. Only for first-party apps migrating from legacy. Avoid.
Token types:
  • Access token — Short-lived (15 min - 1 hour). Sent with every API request.
  • Refresh token — Long-lived (days to months). Used to get new access tokens without re-authenticating. Store securely (httpOnly cookie or secure storage).
Red flag answer: “We use the Implicit flow for our SPA” (deprecated and insecure) or “OAuth handles login” (it handles authorization, not authentication; OIDC adds authentication).
Follow-up questions:
Q: Your SPA uses Authorization Code + PKCE. Where do you store the tokens?
This is hotly debated. Options: (1) In-memory (JavaScript variable): most secure against XSS (tokens not in DOM or storage) but lost on page refresh. Best for high-security apps. (2) HttpOnly, Secure, SameSite cookie: protected from JavaScript access (XSS cannot read it). The BFF (Backend for Frontend) pattern handles tokens server-side and uses a session cookie for the SPA. (3) LocalStorage: persistent but vulnerable to XSS. Not recommended for sensitive tokens. (4) SessionStorage: slightly better (cleared on tab close) but still XSS-vulnerable. The emerging best practice for SPAs is the BFF pattern: the SPA never touches tokens directly.
Q: A refresh token is stolen. What is the blast radius and how do you mitigate it?
A stolen refresh token grants long-term access: the attacker can generate new access tokens indefinitely. Mitigations: (1) Refresh token rotation: issue a new refresh token with every access token refresh, and invalidate the old one. If the old one is used (by the attacker), invalidate the entire token family. (2) Bind refresh tokens to client identity (device fingerprint, IP range). (3) Short access token lifetime (15 minutes) limits the window of a stolen access token. (4) Anomaly detection: flag if a refresh token is used from a new IP/device. (5) Revocation: maintain a token revocation list checked on every use.
What interviewers are really testing: Whether you understand the distinction between authentication and authorization, and why OAuth alone is not enough for “login.”
Answer:
OIDC = Authentication layer built on top of OAuth 2.0.
OAuth 2.0 answers: “Is this app allowed to access this user’s photos?” (authorization). OIDC answers: “Who is this user?” (authentication).
How OIDC extends OAuth 2.0:
  1. ID Token — A JWT that contains user identity claims (who they are):
{
  "iss": "https://auth.example.com",
  "sub": "user_12345",
  "aud": "my_app_client_id",
  "exp": 1705312200,
  "iat": 1705308600,
  "name": "Alice Smith",
  "email": "alice@example.com",
  "email_verified": true
}
The access token says what you can do. The ID token says who you are.
  2. UserInfo endpoint: GET /userinfo with the access token returns additional user profile information.
  3. Standard scopes: openid (required), profile (name, picture), email, address, phone.
  4. Discovery: GET /.well-known/openid-configuration returns the provider’s endpoints, supported scopes, signing algorithms, etc. This enables auto-configuration of OIDC clients.
Why this matters in practice: Before OIDC, developers abused OAuth access tokens to identify users by calling the provider’s user info API with the access token. This was fragile, non-standard, and insecure (access tokens are meant for resource servers, not identity). OIDC standardized the “login” flow.
OIDC providers: Google, Microsoft (Entra ID), Okta, Auth0, Keycloak, AWS Cognito.
The JWT in the ID Token must be validated:
  1. Verify the signature (using the provider’s public key from JWKS endpoint).
  2. Check iss (issuer) matches the expected provider.
  3. Check aud (audience) matches your client ID.
  4. Check exp (expiration) is in the future.
  5. Check iat (issued at) is not too far in the past.
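Steps 2 through 5 are plain claim checks and can be sketched without any library. Step 1 (signature verification against the provider's JWKS) is deliberately left out here and should be done with a maintained JWT library. The function name, the `max_age_seconds` and `leeway` defaults, and the error strings are assumptions.

```python
from typing import Any, Dict, List


def validate_id_token_claims(
    claims: Dict[str, Any],
    expected_issuer: str,
    client_id: str,
    now: int,
    max_age_seconds: int = 600,
    leeway: int = 60,
) -> List[str]:
    """Return a list of problems; an empty list means the claim checks passed."""
    problems = []
    if claims.get("iss") != expected_issuer:
        problems.append("iss does not match the expected provider")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]  # aud may be a string or a list
    if client_id not in audiences:
        problems.append("aud does not include this client")
    if claims.get("exp", 0) + leeway <= now:
        problems.append("token has expired")
    if now - claims.get("iat", 0) > max_age_seconds + leeway:
        problems.append("iat is too far in the past")
    return problems


claims = {
    "iss": "https://auth.example.com",
    "sub": "user_12345",
    "aud": "my_app_client_id",
    "exp": 1705312200,
    "iat": 1705308600,
}
# Checked 100 seconds after issuance: no problems reported.
print(validate_id_token_claims(claims, "https://auth.example.com", "my_app_client_id", now=1705308700))
```

The small `leeway` tolerates clock skew between your servers and the identity provider.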
Red flag answer: Confusing OAuth with OIDC, or “we use OAuth for login” without understanding that OAuth alone does not provide identity.
Follow-up questions:
Q: You are building a microservices system where users log in via OIDC. How do you propagate the user’s identity to downstream services?
The API gateway validates the token issued through the OIDC flow on the initial request. Internally, you can: (1) Pass the validated JWT to downstream services in a header (each service verifies the signature; stateless, but adds latency when every service re-validates). (2) Exchange the external OIDC token for an internal JWT with only the claims downstream services need (reduces token size, limits exposure). (3) Use mTLS between services and embed user context in a custom header that only the gateway can set (downstream services trust the gateway implicitly). Option 2 is the most common in production. Istio can automate the internal token exchange.
Q: What is the difference between an ID Token and an Access Token? Can you use the ID Token to call APIs?
No. The ID Token is meant for the client application to learn about the user. The access token is meant for the resource server (API) to authorize requests. Sending the ID Token to an API is a security anti-pattern: (1) The ID Token’s aud claim is your client ID, not the API. The API would need to accept your client ID as a valid audience, which breaks the security model. (2) ID Tokens may contain sensitive PII (email, name) that should not be transmitted to every API. (3) The token lifetimes and validation rules are different. Always use access tokens for API calls.
What interviewers are really testing: Whether you can design rate limiting that protects your system without punishing legitimate users, and whether you know the algorithmic trade-offs.
Answer:
Rate limiting controls how many requests a client can make in a given time window. It protects against abuse, ensures fair resource allocation, and prevents cascading overload.
Algorithms:
  1. Fixed Window — Count requests per fixed time window (e.g., 100 requests per minute, window resets at :00, :01, :02…). Simple but has a burst problem: a client can send 100 requests at 0:59 and 100 at 1:00 — 200 requests in 2 seconds.
  2. Sliding Window Log — Track the timestamp of each request. Count requests in the last 60 seconds from now. Smooth but expensive (stores every timestamp).
  3. Sliding Window Counter — Hybrid: combine the current fixed window count with a weighted previous window count. rate = prev_count * (1 - elapsed/window_size) + current_count. Good balance of accuracy and memory.
  4. Token Bucket — A bucket holds up to N tokens. Tokens are added at a fixed rate (e.g., 10/sec). Each request consumes a token. If the bucket is empty, the request is rejected. Allows controlled bursts (bucket fills up during idle periods). Used by AWS API Gateway and most CDNs.
  5. Leaky Bucket — Requests enter a queue (bucket) processed at a fixed rate. Excess requests overflow (rejected). Smooths bursty traffic into a constant rate. Used in network traffic shaping.
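Two of the algorithms above are small enough to sketch directly. The following is a single-process illustration only: the class and function names are made up for this example, and the injectable `clock` parameter is an assumption added to make the bucket testable.

```python
import time


class TokenBucket:
    """Token bucket: refills at a fixed rate, allows bursts up to `capacity`."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)   # start full so an idle client can burst
        self.last_refill = clock()

    def allow(self, cost=1):
        """Consume `cost` tokens if available; return False to reject."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill lazily based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def sliding_window_estimate(prev_count, current_count, elapsed, window_size):
    """Weighted request estimate used by the sliding window counter."""
    return prev_count * (1 - elapsed / window_size) + current_count
```

For example, with 100 requests in the previous window, 20 in the current one, and 15 seconds elapsed in a 60-second window, the estimate is 100 * 0.75 + 20 = 95 requests. The `cost` argument is one way to express weighted limits, where an expensive endpoint consumes more than one token.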
Implementation in distributed systems: The challenge: if you have 10 API servers, each must enforce the same limit. Solutions:
  • Centralized counter (Redis) — INCR + EXPIRE on a key like rate:user_123:2024-01-15T10:30. All servers check Redis. Adds 1-2ms latency per request. Use Lua scripts for atomic check-and-increment.
  • Local counter with sync — Each server maintains its own counter, periodically syncing with a central store. Allows brief over-limit bursts between syncs.
  • Cell-based / Approximate — Each server gets total_limit / num_servers. Simple but brittle when servers have uneven traffic.
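The centralized-counter approach can be illustrated without a Redis server. In this sketch, `InMemoryCounterStore` is a hypothetical stand-in that mimics the INCR + EXPIRE pair as one atomic step (the role a Lua script would play in Redis); the key layout and function names are assumptions for the example.

```python
class InMemoryCounterStore:
    """Stand-in for Redis: increment a key that expires after `ttl` seconds."""

    def __init__(self):
        self._data = {}  # key -> (count, expires_at)

    def incr_with_expiry(self, key, ttl, now):
        count, expires_at = self._data.get(key, (0, now + ttl))
        if now >= expires_at:
            # The key has expired, so start a fresh counter.
            count, expires_at = 0, now + ttl
        count += 1
        self._data[key] = (count, expires_at)
        return count


def allow_request(store, user_id, limit, now, window_seconds=60):
    """Fixed-window check: one shared counter per user per window."""
    window_start = int(now // window_seconds) * window_seconds
    key = f"rate:{user_id}:{window_start}"
    return store.incr_with_expiry(key, window_seconds, now) <= limit
```

Because every API server would call the same store, all of them see the same count. The trade-off is the extra network round trip per request mentioned above.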
Response headers (standard practice):
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1705312200
Rate limit dimensions: Per API key, per user, per IP, per endpoint, per tier (free: 100/min, pro: 1000/min, enterprise: custom).
Red flag answer: “Return 429 when there are too many requests” without discussing algorithms, distributed implementation, or how to communicate limits to clients.
Follow-up questions:
Q: Your API has a global rate limit of 10,000 requests/min. A single customer is sending 8,000 requests/min, starving others. How do you fix this?
Implement per-customer rate limits in addition to global limits. The global limit protects your infrastructure; the per-customer limit ensures fairness. Create tiered limits: free tier (100/min), paid tier (1000/min), enterprise (10,000/min). For the heavy customer, offer an enterprise plan or a dedicated endpoint. Also consider: (1) Weighted rate limiting: expensive endpoints (report generation) consume more “tokens” than cheap ones (GET by ID). (2) Burst allowance: allow 2x the rate for short bursts (token bucket) but enforce the average over longer windows.
Q: How would you implement rate limiting at the API gateway level vs the application level? What are the trade-offs?
Gateway-level (Kong, AWS API Gateway): catches abusive traffic before it reaches your services. Lower latency, simpler implementation, but coarser-grained (hard to rate-limit based on business logic like subscription tier). Application-level: access to business context (user tier, endpoint cost), but every service must implement it. Best practice: use gateway-level for global protection (IP-based, total RPS) and application-level for business-logic rate limiting (per-user tier, per-endpoint cost).
What interviewers are really testing: Whether you understand that different clients have fundamentally different API needs and can design accordingly.
Answer:
The BFF (Backend for Frontend) pattern creates a separate backend (or API gateway configuration) for each type of client application: web, mobile, third-party developers, etc.
Why BFF exists: A single API cannot optimally serve all clients. Mobile needs: smaller payloads (save bandwidth), fewer round trips (high latency), offline-friendly responses. Web needs: rich data, real-time updates. Third-party APIs need: stability, versioning, strict rate limits.
Architecture:
Mobile App  --> Mobile BFF  --> Microservices
Web App     --> Web BFF     --> Microservices
3rd Party   --> Public API Gateway --> Microservices
What each BFF handles:
  • Response shaping — Mobile BFF returns low-res image URLs. Web BFF returns high-res.
  • Aggregation — Mobile BFF combines 3 microservice calls into 1 response (saves round trips). Web BFF can afford 3 parallel calls.
  • Auth translation — Mobile BFF handles token refresh transparently. Web BFF manages session cookies. Public API uses API keys.
  • Caching strategy — Mobile BFF caches aggressively (users tolerate slightly stale data). Web BFF uses shorter cache TTLs.
  • Error handling — Mobile BFF returns simplified error codes (offline-friendly). Web BFF returns detailed error messages.
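Aggregation and response shaping, the first two responsibilities above, might look like the following in a mobile BFF. This is a sketch under stated assumptions: the three `fetch_*` coroutines are hypothetical stand-ins for real microservice calls and return canned data, and the `?w=200` image parameter is an invented convention for requesting a low-resolution image.

```python
import asyncio


# Hypothetical downstream calls; a real BFF would issue HTTP or gRPC requests.
async def fetch_profile(user_id):
    return {"id": user_id, "name": "Ada", "email": "ada@example.com"}


async def fetch_orders(user_id):
    return [{"id": 1, "total": 30, "items": ["a", "b"]},
            {"id": 2, "total": 12, "items": ["c"]}]


async def fetch_recommendations(user_id):
    return [{"sku": "X1", "image": "https://cdn.example.com/x1"}]


async def mobile_home_screen(user_id):
    """One round trip for the mobile client: fan out, then trim the payload."""
    profile, orders, recs = await asyncio.gather(
        fetch_profile(user_id),
        fetch_orders(user_id),
        fetch_recommendations(user_id),
    )
    return {
        # Only the fields the mobile screen renders are returned.
        "name": profile["name"],
        "recent_orders": [{"id": o["id"], "total": o["total"]} for o in orders[:5]],
        "recommendations": [
            {"sku": r["sku"], "image": r["image"] + "?w=200"} for r in recs
        ],
    }
```

Calling `asyncio.run(mobile_home_screen(42))` returns a single compact payload. A web BFF could expose the same three services with richer fields and different caching.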
BFF ownership model: The BFF should be owned by the frontend team, not the backend team. The iOS team owns the mobile BFF. The web team owns the web BFF. This eliminates the bottleneck where frontend teams wait for backend teams to add “just one more field” to a shared API.When to use BFF vs a single API:
  • Single API — 1-2 client types with similar needs, small team.
  • BFF — 3+ client types with different needs, separate frontend teams, mobile + web + third-party.
Tools: GraphQL can serve as a BFF (each client queries for exactly what it needs from a single endpoint). Netflix uses “Federated GraphQL” where each microservice exposes a GraphQL subgraph, and a gateway composes them.
Red flag answer: “Just make one API that returns everything and let the client filter.” This leads to over-fetching, security risks (returning sensitive fields to mobile), and performance problems.
Follow-up questions:
Q: You have a mobile BFF and a web BFF. Both call the same 5 microservices. You notice a lot of duplicated logic. How do you handle it?
Extract shared logic into a shared library (DTO mappers, auth helpers, retry logic) or an internal “experience API” layer between the BFFs and microservices. Keep client-specific logic (response shaping, caching) in the BFF. Key principle: DRY the infrastructure and business logic, but let each BFF customize the client experience independently. Do NOT merge the BFFs back into one, because the whole point is client-specific optimization.
Q: How does the BFF pattern interact with GraphQL? Could GraphQL replace BFF?
GraphQL can partially replace BFF by letting each client request exactly the fields it needs. But GraphQL alone does not solve: (1) Different auth strategies per client. (2) Different caching strategies. (3) Response format differences (image sizes, pagination styles). (4) Different security policies (rate limits, field-level access). A practical hybrid: use a GraphQL API as the aggregation layer behind client-specific BFFs. The BFF handles auth, caching, and security, then queries GraphQL for data. Netflix and Airbnb use variations of this approach.
What interviewers are really testing: Whether you understand why service meshes exist, how they work (sidecar proxy model), and when the operational overhead is justified.
Answer:
A service mesh is a dedicated infrastructure layer for managing service-to-service communication. It handles networking concerns (security, observability, traffic management) without requiring application code changes.
How it works (the sidecar model): Every service pod gets an additional container (the sidecar proxy, typically Envoy). All inbound and outbound network traffic passes through this proxy. The proxy handles:
  • mTLS — Mutual TLS between every service pair. Automatic certificate rotation. Zero-trust networking without application code changes.
  • Traffic management — Retries, timeouts, circuit breaking, load balancing algorithms — all configured via YAML, not code.
  • Observability — The proxy emits metrics (latency, error rates, request counts), traces (distributed tracing spans), and access logs for every request. No instrumentation code needed.
  • Traffic splitting — Route 5% of traffic to v2 (canary deployment), 95% to v1. A/B testing by header value. Blue/green deployments.
  • Access policy — “Service A can call Service B but not Service C.” Enforced at the mesh level.
Control plane vs Data plane:
  • Data plane — The sidecar proxies (Envoy). Handles actual traffic. Runs on every pod.
  • Control plane — The management layer (Istio’s istiod, which since Istio 1.5 consolidates the former Pilot, Citadel, and Galley components; or Linkerd’s control plane). Configures the proxies, manages certificates, collects telemetry.
Istio vs Linkerd:
Aspect | Istio | Linkerd
Proxy | Envoy (C++, feature-rich) | linkerd2-proxy (Rust, lightweight)
Complexity | High (many components, steep learning curve) | Lower (simpler architecture, easier setup)
Resource overhead | Higher (~50-100MB per sidecar) | Lower (~10-20MB per sidecar)
Features | More extensive (Wasm plugins, multi-cluster) | Focused on core features (simpler but sufficient)
Community | Larger, CNCF graduated | Smaller, CNCF graduated
When to use a service mesh:
  • 20+ microservices where managing retries, mTLS, and observability per-service is unsustainable.
  • Zero-trust security requirements (mTLS everywhere).
  • Complex deployment strategies (canary, traffic mirroring) needed regularly.
  • Polyglot environment where you cannot mandate a single language’s resilience library.
When to avoid:
  • Fewer than 10 services — the operational overhead is not justified.
  • Performance-sensitive workloads where the sidecar’s ~1-2ms latency overhead per hop matters (it adds up across a 5-service call chain).
  • Teams without Kubernetes expertise (service meshes are deeply tied to K8s).
Red flag answer: “We need Istio for our 3 microservices” (massive overkill) or not understanding the sidecar proxy model.
Follow-up questions:
Q: How does a service mesh add latency and how would you measure its impact?
Each sidecar proxy adds 1-3ms of latency per hop (inbound proxy + outbound proxy = 2-6ms per service-to-service call). For a request traversing 5 services, the mesh adds 10-30ms total. Measure by: (1) Comparing p50/p99 latency with and without the mesh (deploy a baseline without sidecars). (2) Checking the mesh’s own metrics (Istio exposes istio_request_duration_milliseconds). (3) Using distributed tracing to see time spent in the proxy vs the application. Linkerd’s Rust-based proxy typically adds less latency than Istio’s Envoy proxy.
Q: Your team is debating between implementing resilience patterns (retries, circuit breakers) in application code vs the service mesh. What do you recommend?
Use the mesh for infrastructure-level resilience (connection-level retries, TCP circuit breaking, mTLS) and application code for business-level resilience (custom fallback responses, per-endpoint timeout tuning, semantic retry conditions like “retry on 503 but not 409”). The mesh gives you baseline protection without code changes across all services. Application-level libraries (Resilience4j) give you fine-grained control where business logic matters. They are complementary, not competing.
What interviewers are really testing: Whether you think about security proactively when designing APIs, not just as an afterthought.
Answer:
The OWASP API Security Top 10 (2023) identifies the most critical API security risks:
  1. Broken Object Level Authorization (BOLA / IDOR) — The #1 API vulnerability. GET /api/users/42/orders — can user 43 access user 42’s orders by changing the ID? Every endpoint that takes an object ID must verify the requesting user has permission to access that specific object. Not just “is the user authenticated?” but “does this user own this resource?”
  2. Broken Authentication — Weak password policies, missing brute-force protection, tokens in URLs, missing token expiry. Fix: use established auth frameworks (OAuth2, OIDC), enforce MFA for sensitive operations, rate-limit login attempts.
  3. Broken Object Property Level Authorization — The API returns fields the user should not see (e.g., is_admin: true in the user profile response) or accepts fields they should not set (mass assignment: POST /users {"role": "admin"}). Fix: explicit allowlists for input and output fields. Never blindly serialize entire database rows.
  4. Unrestricted Resource Consumption — No rate limiting, no pagination limits, no payload size limits. An attacker sends GET /users?limit=1000000 and your server OOMs. Fix: enforce max page sizes, rate limits, request body size limits, and timeout all operations.
  5. Broken Function Level Authorization — Regular user accessing admin endpoints (DELETE /admin/users/42). Fix: role-based access control enforced at every endpoint, not just the UI.
  6. Unrestricted Access to Sensitive Business Flows — Automated attacks on business logic: bot buying limited stock, automated account creation for spam. Fix: CAPTCHA, rate limiting on business-critical flows, device fingerprinting.
  7. Server Side Request Forgery (SSRF) — API accepts a URL parameter and the server fetches it: POST /fetch {"url": "http://169.254.169.254/metadata"} — accesses the cloud metadata endpoint, potentially leaking IAM credentials. Fix: URL allowlists, block private IP ranges, use a sandboxed fetch service.
  8. Security Misconfiguration — Missing CORS restrictions, verbose error messages in production (stack traces), default credentials, unnecessary HTTP methods enabled.
  9. Improper Inventory Management — Old API versions running unpatched, undocumented debug endpoints still accessible in production.
  10. Unsafe Consumption of APIs — Your API calls a third-party API and blindly trusts the response without validation. The third party gets compromised, and your system starts processing malicious data.
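Items 1 and 3 share a simple remedy in code: check ownership on every object lookup, and serialize through an explicit allowlist. The sketch below uses an in-memory dictionary in place of a database; `ORDERS`, `NotFound`, and the field names are hypothetical and exist only for the example.

```python
class NotFound(Exception):
    """Raised for missing objects and for objects owned by someone else."""


# Hypothetical data store; a real service would query a database.
ORDERS = {
    101: {"id": 101, "owner_id": 42, "total": 30},
    102: {"id": 102, "owner_id": 43, "total": 99},
}


def get_order(current_user_id, order_id, orders=ORDERS):
    """BOLA guard: authentication alone is not enough, ownership is checked."""
    order = orders.get(order_id)
    # Same error for "missing" and "not yours" avoids leaking which IDs exist.
    if order is None or order["owner_id"] != current_user_id:
        raise NotFound(f"order {order_id} not found")
    return order


# Output allowlist: new columns stay hidden until someone adds them here.
PUBLIC_USER_FIELDS = ("id", "name", "email")


def to_public_user(row):
    """Serialize only allowlisted fields instead of the whole database row."""
    return {field: row[field] for field in PUBLIC_USER_FIELDS if field in row}
```

Here `get_order(43, 101)` raises `NotFound` even though order 101 exists, and `to_public_user` drops fields such as `is_admin` or `password_hash` without needing to know about them.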
Red flag answer: “We handle security at the API gateway level.” The gateway handles perimeter security, but BOLA, mass assignment, and business logic vulnerabilities must be addressed in the application layer.
Follow-up questions:
Q: Walk me through how you would prevent BOLA in a multi-tenant SaaS application.
(1) Every database query must include a tenant ID filter: WHERE tenant_id = :current_tenant AND id = :requested_id. Never rely solely on the object ID. (2) Create a middleware/interceptor that extracts the tenant from the auth token and injects it into every query. (3) Use database row-level security (PostgreSQL RLS) as a defense-in-depth layer. (4) Write integration tests that specifically test cross-tenant access: “user in tenant A requests resource in tenant B, expect 404.” (5) Code review checklist: “Does this endpoint verify object ownership?”
Q: A security audit found that your API returns is_admin and password_hash in user responses. How do you fix this systematically?
(1) Never serialize database models directly to API responses. Create explicit response DTOs (Data Transfer Objects) with only the fields clients need. (2) Use a serialization allowlist, not a blocklist. An allowlist (include: [:id, :name, :email]) is safer than a blocklist (exclude: [:password_hash]) because new sensitive fields are hidden by default. (3) Add automated tests that snapshot API responses and fail if unexpected fields appear. (4) For GraphQL, use field-level authorization that checks permissions before resolving each field.
What interviewers are really testing: Whether you understand JWT security beyond basic usage and can identify and prevent common attacks.
Answer:
JWTs (JSON Web Tokens) are widely used for stateless authentication, but several well-known vulnerabilities exist:
  1. "alg": "none" attack — The JWT header specifies the signing algorithm. If the server accepts "alg": "none", an attacker can forge tokens by removing the signature entirely. Fix: always validate the algorithm on the server side. Maintain an explicit allowlist of acceptable algorithms (e.g., only RS256 or ES256). Never rely on the token’s self-declared algorithm.
  2. Algorithm confusion (RS256 to HS256) — Server uses RSA (asymmetric: private key signs, public key verifies). Attacker changes the header to HS256 (symmetric: same key signs and verifies). If the server blindly trusts the algorithm, it uses the RSA public key (which is publicly known) as the HMAC secret to verify. The attacker signs the forged token with the public key, and verification passes. Fix: the server must enforce the expected algorithm, not read it from the token.
  3. Weak signing secrets — HMAC secrets like "secret" or "password123" can be brute-forced. Tools like jwt-cracker and hashcat can crack weak secrets in minutes. Fix: use a minimum 256-bit random secret for HS256. Better: use asymmetric keys (RS256, ES256) where the private key is never exposed.
  4. Missing expiration / Replay attacks — Tokens without exp (expiry) are valid forever. Stolen tokens can be replayed indefinitely. Fix: set short exp (15 min for access tokens). Use jti (JWT ID) for one-time-use tokens. For sensitive operations, also check nbf (not before) and iat (issued at).
  5. Token stored in localStorage — XSS attacks can read localStorage, stealing the JWT. Fix: store tokens in httpOnly cookies (inaccessible to JavaScript) or in memory (lost on refresh but most secure). Use the BFF pattern for SPAs.
  6. No revocation mechanism — JWTs are stateless — once issued, they are valid until expiry. You cannot “log out” a JWT. Fix: (a) Short expiry + refresh tokens. (b) Token blocklist in Redis (checked on every request — adds statefulness). (c) Token versioning: store a token_version per user in the database; increment on logout; reject tokens with old version.
  7. Overly permissive claims — Tokens containing sensitive data (SSN, credit card) in the payload. JWT payloads are Base64-encoded, not encrypted (anyone can decode them). Fix: keep claims minimal (user ID, roles, expiry). Encrypt the JWT if sensitive data must be included (JWE — JSON Web Encryption).
Red flag answer: “JWTs are secure because they are signed.” Signing prevents tampering but does not prevent the algorithm confusion attack, the none attack, or payload exposure.
Follow-up questions:
Q: Your application uses JWTs for authentication. A user changes their password. How do you invalidate all their existing tokens?
(1) Maintain a token_version (integer) per user in the database. Include it as a claim in the JWT. On password change, increment token_version. On every request, compare the token’s version with the database. Mismatch means the token is invalidated. (2) Alternative: add the token’s jti to a Redis blocklist on password change. Set the blocklist entry’s TTL to match the token’s remaining lifetime. (3) If using refresh tokens: revoke all refresh tokens for the user. Existing access tokens remain valid until they expire (keep access token lifetime short, around 15 minutes).
Q: Should you use symmetric (HS256) or asymmetric (RS256/ES256) JWT signing? What are the trade-offs?
Asymmetric (RS256/ES256) for almost all production use cases. The private key stays on the auth server; the public key is distributed to all services (via JWKS endpoint). Services can verify tokens without having the private key. This means a compromised microservice cannot forge tokens. Symmetric (HS256) means every service that needs to verify a token must have the shared secret. If any one is compromised, the attacker can forge tokens for all users. HS256 is acceptable only in single-service architectures or when the auth server is the only token verifier.
What interviewers are really testing: Whether you understand CORS deeply enough to debug it (instead of just adding * to the Access-Control-Allow-Origin header) and understand the security implications.
Answer:
CORS (Cross-Origin Resource Sharing) is a browser security mechanism that controls which origins (scheme, host, and port) can read responses from your API. It exists because of the Same-Origin Policy: by default, JavaScript running on app.example.com cannot read responses from api.different.com.
How CORS works:
Simple requests (GET, HEAD, POST with standard content types):
  1. Browser sends the request with an Origin: https://app.example.com header.
  2. Server responds with Access-Control-Allow-Origin: https://app.example.com.
  3. Browser checks the header. If it matches, JavaScript can access the response. If not, the browser blocks it.
Preflight requests (PUT, DELETE, PATCH, custom headers, JSON content type):
  1. Browser sends an OPTIONS request (preflight) with:
    • Origin: https://app.example.com
    • Access-Control-Request-Method: DELETE
    • Access-Control-Request-Headers: Authorization, Content-Type
  2. Server responds with:
    • Access-Control-Allow-Origin: https://app.example.com
    • Access-Control-Allow-Methods: GET, POST, PUT, DELETE
    • Access-Control-Allow-Headers: Authorization, Content-Type
    • Access-Control-Max-Age: 86400 (cache preflight for 24 hours)
  3. If the preflight succeeds, the browser sends the actual request.
Critical headers:
  • Access-Control-Allow-Origin — Which origins are allowed. Never use * with credentials.
  • Access-Control-Allow-Credentials: true — Allows cookies/auth headers. When set, Allow-Origin MUST be a specific origin, not *.
  • Access-Control-Expose-Headers — By default, JavaScript can only read a few “simple” response headers. This header exposes additional ones (e.g., X-RateLimit-Remaining).
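  • Access-Control-Max-Age — How long the browser caches the preflight response. Set to 86400 (24h) to reduce OPTIONS request overhead. Browsers cap the value they honor (Chromium-based browsers at 7200 seconds), so the effective cache time may be shorter.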
Common mistakes:
  • Access-Control-Allow-Origin: * with Access-Control-Allow-Credentials: true — browsers reject this combination. You must specify the exact origin.
  • Reflecting the Origin header directly into Access-Control-Allow-Origin without validation — this effectively allows any origin, defeating CORS entirely. An attacker’s site can make authenticated requests to your API.
  • Not handling OPTIONS preflight — returning 404 or 405 for OPTIONS breaks all non-simple requests from browsers.
Important context: CORS is a browser security mechanism. Server-to-server calls, cURL, Postman, and mobile apps do not send CORS headers and are not subject to CORS restrictions. CORS protects against malicious websites making requests on behalf of a logged-in user.
Red flag answer: “Just set Access-Control-Allow-Origin: * to fix the CORS error” without understanding the security implications, or thinking CORS applies to server-to-server communication.
Follow-up questions:
Q: A developer complains: “My API works in Postman but gets CORS errors in the browser.” What is happening and how do you fix it?
Postman is not a browser, so it does not enforce the Same-Origin Policy or send CORS preflight requests. The browser does. The fix depends on the error: (1) If “No Access-Control-Allow-Origin header”, the server needs to return the CORS headers. Configure your framework’s CORS middleware (express cors(), Django django-cors-headers, Spring @CrossOrigin). (2) If “blocked by preflight”, the server must respond to OPTIONS requests with the correct Allow headers. (3) If the server is a third party you do not control, use a proxy: your backend fetches from the third-party API and serves it to the frontend (bypassing CORS since the browser only talks to your origin).
Q: Your API needs to support requests from https://app.example.com and https://staging.example.com. How do you configure CORS?
Maintain a server-side allowlist of permitted origins: ["https://app.example.com", "https://staging.example.com"]. On each request, read the Origin header, check it against the allowlist, and if it matches, set Access-Control-Allow-Origin to that specific origin. Include Vary: Origin in the response so CDNs and proxies cache responses per-origin, not as a single cached version. Never use regex matching on origins without extreme care: https://app.example.com.evil.com would match a naive regex for example.com.
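The allowlist approach from the last answer fits in a few lines. This is a framework-agnostic sketch: `cors_headers` is a hypothetical helper that returns the headers to attach to a response, and sending `Access-Control-Allow-Credentials` for every allowed origin is an assumption that suits cookie-based APIs only.

```python
ALLOWED_ORIGINS = frozenset({
    "https://app.example.com",
    "https://staging.example.com",
})


def cors_headers(request_origin, allowed_origins=ALLOWED_ORIGINS):
    """Return CORS response headers for a request's Origin header value."""
    # Vary: Origin is always set so caches keep one entry per origin.
    headers = {"Vary": "Origin"}
    # Exact string comparison: no regex and no substring matching.
    if request_origin in allowed_origins:
        headers["Access-Control-Allow-Origin"] = request_origin
        headers["Access-Control-Allow-Credentials"] = "true"
    return headers
```

An origin such as `https://app.example.com.evil.com` fails the exact-match test, so it receives no `Access-Control-Allow-Origin` header and the browser blocks the response.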
What interviewers are really testing: Whether you can configure autoscaling based on the RIGHT metrics, not just CPU. The default CPU-based autoscaling is often wrong for API workloads.
Answer:
Kubernetes Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on observed metrics.
Standard metrics (built-in):
  • CPU utilization — Scale when average CPU across pods exceeds target (e.g., 70%). Works for CPU-bound workloads (computation, image processing).
  • Memory utilization — Scale when average memory exceeds target. Rarely useful as a scaling trigger because memory does not decrease when load decreases (most runtimes retain allocated memory).
Why CPU/Memory is often wrong for APIs: An API pod at 30% CPU might have 500 connections waiting on database I/O. CPU is low but the pod is effectively overloaded.
Custom metrics (what you should actually scale on):
  • Requests per second (RPS) — Scale when RPS per pod exceeds capacity. Requires Prometheus adapter or Datadog metrics.
  • Request latency (p95) — Scale when response time degrades. Catches I/O-bound overload that CPU misses.
  • Queue depth — For async workers, scale based on messages in the queue (SQS queue length, Kafka consumer lag). This is the most accurate metric for workers.
  • Active connections — Scale when concurrent connections per pod exceed a threshold.
HPA configuration example (custom metrics):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa  # illustrative name; every manifest needs metadata.name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      policies:
      - type: Percent
        value: 50
        periodSeconds: 30
Scaling behavior tuning:
  • Scale up fast, scale down slow. Scale up aggressively on load spikes (50% more pods per 30 seconds). Scale down slowly (10% per minute with a 5-minute stabilization window) to avoid flapping.
  • Minimum replicas — Never scale below 3 replicas for production services. With only 2 replicas, one pod being updated plus one pod failing leaves 0 available; a third replica keeps one pod serving.
  • Cooldown periods — Prevent thrashing (scaling up and down repeatedly). The stabilization window considers the last N minutes of recommendations and picks the safest.
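The replica count HPA aims for follows the formula in the Kubernetes documentation: desired = ceil(currentReplicas * currentMetricValue / targetMetricValue), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that core calculation; the function name is hypothetical, and the rate limits from the `behavior` section and the stabilization window are deliberately left out.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=3, max_replicas=50, tolerance=0.1):
    """Core HPA calculation, clamped to the configured replica bounds."""
    ratio = current_value / target_value
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With the configuration above (target of 100 requests per second per pod), 5 pods averaging 250 RPS each yield ceil(5 * 2.5) = 13 replicas, while 5 pods at 105 RPS stay at 5 because the ratio is inside the tolerance band.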
KEDA (Kubernetes Event-Driven Autoscaling): For queue-based workloads, KEDA extends HPA with scalers for SQS, Kafka, RabbitMQ, Prometheus, and 50+ event sources. It can scale to zero (no pods when no messages) and scale from zero when messages arrive. This is ideal for async workers.
Red flag answer: “We autoscale on CPU at 80%” for an API workload without considering that API pods are typically I/O-bound, not CPU-bound.
Follow-up questions:
Q: Your API scales from 5 to 50 pods in 2 minutes during a traffic spike. Users report errors during the spike. What went wrong?
The new pods were not ready to serve traffic immediately. Causes: (1) Container startup time: pulling images, initializing connections, JVM warmup. Solution: use pre-pulled images, init containers, and readiness probes that only pass when the app is truly ready. (2) Connection pool warming: new pods need to establish database/Redis connections. Cold connection pools cause the first requests to be slow or fail. Solution: warm up connections in the readiness probe. (3) Load balancer registration lag: even after the pod is ready, it takes time for the endpoints to propagate. Solution: use pod readiness gates. (4) If using JVM: JIT compilation makes the first requests slow. Solution: use GraalVM native image or longer readiness probe delays.
Q: How does HPA interact with Cluster Autoscaler? What happens when HPA wants 50 pods but the cluster only has capacity for 30?
HPA creates pending pods (in Pending state due to insufficient resources). Cluster Autoscaler detects pending pods, provisions new nodes (AWS EC2 instances, GKE node pools), and schedules the pending pods once nodes are ready. The lag between “HPA requests pods” and “pods are actually running” can be 3-10 minutes (node provisioning + pod scheduling + startup). Mitigation: (1) Use overprovisioned nodes (placeholder pods that get evicted to make room). (2) Use Karpenter instead of Cluster Autoscaler for faster provisioning. (3) Set HPA max to a value your cluster can handle within your scaling budget.
What interviewers are really testing: Whether you understand how to debug problems in distributed systems where a single request touches 5-10 services.
Answer:
Distributed tracing tracks a single request as it flows through multiple services, providing a complete picture of latency, errors, and dependencies.
Core concepts:
  • Trace — The end-to-end journey of a single request. Has a unique Trace ID (e.g., a 128-bit value written as 32 hex characters: 4bf92f3577b34da6a3ce929d0e0e4736).
  • Span — A single operation within a trace (e.g., “user-service.getUser” or “postgres.query”). Each span records: start time, duration, status (OK/ERROR), tags (http.method, http.status_code), and logs.
  • Parent-child relationship — Spans form a tree. The API gateway span is the root. It has child spans for each downstream service call. Each downstream service has child spans for its database queries, cache lookups, etc.
How trace context propagates:
  1. The entry point (API gateway) generates a Trace ID and Span ID.
  2. These are passed to the next service via HTTP headers: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 (W3C Trace Context standard).
  3. Each service extracts the trace context, creates a child span, and propagates the context to its downstream calls.
  4. All spans are sent (asynchronously) to a collector which assembles them into the full trace.
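Steps 2 and 3 can be made concrete with the W3C `traceparent` header, whose layout is version, trace ID, parent span ID, and flags. The sketch below parses an incoming header and builds the header for an outgoing call; the function names are hypothetical, and marking a brand-new trace as sampled is an assumption, since real SDKs leave that to the configured sampler.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Build the header for a downstream call: same trace, new span ID."""
    ctx = parse_traceparent(incoming) if incoming else None
    # No valid context means this service is the entry point of a new trace.
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    flags = "01" if (ctx is None or ctx["sampled"]) else "00"
    span_id = secrets.token_hex(8)  # 64-bit ID for the span making the call
    return f"00-{trace_id}-{span_id}-{flags}"
```

Every service in the chain keeps the trace ID and replaces the span ID with its own, which is what lets the collector rebuild the parent-child tree.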
Header standards:
  • W3C Trace Context (traceparent, tracestate) — The standard. Supported by all major tracing tools.
  • B3 Headers (X-B3-TraceId, X-B3-SpanId) — Zipkin’s legacy format. Still widely used.
  • Jaeger headers (uber-trace-id) — Jaeger’s native format. Being migrated to W3C.
Tools:
  • Jaeger — Open source (Uber). CNCF graduated. Strong K8s integration.
  • Zipkin — Open source (Twitter). Simpler than Jaeger.
  • Datadog APM — Commercial. Combines traces with metrics and logs. Excellent correlation.
  • AWS X-Ray — Managed. Integrates with Lambda and ECS.
  • OpenTelemetry — The emerging standard for instrumentation. Vendor-neutral SDK that exports to any backend (Jaeger, Zipkin, Datadog, Honeycomb).
Sampling strategies: At high traffic (millions of RPS), storing every trace is prohibitively expensive. Sampling strategies:
  • Head-based sampling — Decide at the start whether to trace this request (e.g., 1% of requests). Simple but misses rare errors.
  • Tail-based sampling — Collect all spans, then decide after the request completes whether to keep the trace. Keep all traces with errors or high latency, sample normal traces. More useful but requires buffering all spans temporarily. Jaeger and OpenTelemetry Collector support tail-based sampling.
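A tail-based decision is easy to express once all spans of a trace are buffered. The following sketch is illustrative only: the span dictionary shape (`status`, `duration_ms`), the thresholds, and the hash-based baseline sampling are assumptions for the example, and it uses the longest span as a proxy for total request latency.

```python
import hashlib


def keep_trace(trace_id, spans, latency_threshold_ms=1000, baseline_rate=0.01):
    """Decide after completion whether a buffered trace should be stored."""
    # Always keep traces that contain an error.
    if any(span["status"] == "ERROR" for span in spans):
        return True
    # Always keep slow traces (the root span usually has the longest duration).
    if max((span["duration_ms"] for span in spans), default=0) >= latency_threshold_ms:
        return True
    # Otherwise keep a deterministic fraction, keyed on the trace ID so every
    # collector instance reaches the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("ascii")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < baseline_rate
```

Head-based sampling would apply only the last step, at the start of the request, which is why it cannot favor error or slow traces.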
Red flag answer: “We use logging for debugging” without understanding that logs from different services have no correlation unless trace IDs are included.
Follow-up questions:
Q: A user reports that the API is slow. You open Jaeger and see the trace. The total request took 2.5 seconds. How do you identify the bottleneck?
Look at the trace waterfall view. Each span shows its duration and timing relative to the parent. (1) Find the longest span, which is your primary bottleneck. (2) Check if spans are sequential (serialized calls that could be parallelized) or parallel. (3) Look for gaps between spans: time not accounted for by child spans is time spent in the parent service (processing, serialization, garbage collection). (4) Check for N+1 patterns: 50 small database query spans under one service span. (5) Look at span tags for errors or retries. Common findings: a database query that lacks an index (long DB span), a downstream service with high latency (long HTTP span), or sequential calls that should be parallel.
Q: How do you connect distributed traces to your logs and metrics (the “three pillars of observability”)?
Include the Trace ID in every log line: logger.info("Processing order", extra={"trace_id": span.context.trace_id}). Configure your log aggregation (ELK, Datadog Logs) to make Trace ID a clickable link to the trace view. For metrics, use exemplars (OpenTelemetry/Prometheus feature): attach a Trace ID to specific metric data points. When a dashboard shows a latency spike, click the data point to jump to an example trace. This correlation (metrics -> trace -> logs) is the gold standard of observability. Datadog, Honeycomb, and Grafana Tempo all support this three-way correlation.

5. Operations

What interviewers are really testing: Whether you can design a logging strategy for distributed systems that is actually useful for debugging, not just “we write to stdout.”
Answer:
In a microservices architecture with 50+ services running across hundreds of containers, you cannot SSH into individual machines to read logs. Centralized log aggregation collects, processes, stores, and makes searchable all logs from all services in one place.
The standard pipeline:
  1. Emit — Services write structured logs (JSON) to stdout/stderr. Include: timestamp, service name, trace ID, severity, message, and contextual fields.
{"timestamp": "2024-01-15T10:30:00Z", "service": "order-service", "trace_id": "abc123", "level": "ERROR", "message": "Payment failed", "order_id": 42, "error_code": "INSUFFICIENT_FUNDS"}
  2. Collect — A log shipper (Fluentd, Fluent Bit, Filebeat, Vector) runs as a DaemonSet (one per node) or sidecar. It reads stdout from containers, adds metadata (pod name, namespace, node), and forwards to the aggregator.
  3. Process — The aggregator (Logstash, Fluentd, Vector) parses, filters, enriches, and routes logs. Example: parse JSON, extract trace ID into a searchable field, drop debug-level logs in production, route error logs to a separate alert stream.
  4. Store — Elasticsearch (most common), Loki (Grafana’s lightweight alternative), Datadog Logs, Splunk, CloudWatch Logs. Elasticsearch indexes every field for fast search. Loki only indexes labels (cheaper but less flexible).
  5. Visualize & Alert — Kibana (with Elasticsearch), Grafana (with Loki), Datadog dashboards. Create dashboards for error rates, alert on patterns (5xx spike, specific error messages).
The ELK stack (Elasticsearch + Logstash + Kibana) is the classic setup but expensive at scale (Elasticsearch is memory-hungry). The modern alternative: PLG stack (Promtail + Loki + Grafana) — significantly cheaper because Loki does not index full text, only labels.
Structured logging is non-negotiable. Unstructured logs ("Error processing order 42") require regex parsing and break on format changes. Structured logs (JSON with consistent fields) enable reliable filtering, aggregation, and alerting. Use a logging library that enforces structure (Winston, Logback, zerolog).
Critical fields every log must include:
  • Timestamp (ISO 8601 with timezone)
  • Service name
  • Trace ID / Correlation ID (links logs to distributed traces)
  • Log level (ERROR, WARN, INFO, DEBUG)
  • Request context (user ID, request ID, endpoint)
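As an illustration of the fields listed above, here is a minimal structured-logging sketch using only Python's standard `logging` module. The `JsonFormatter` and `build_logger` names and the choice of optional context keys are assumptions for this example; in practice a dedicated structured-logging library may be a better fit.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            # ISO 8601 timestamp with an explicit UTC offset.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
        }
        # Optional request context, included only when the caller supplied it.
        for key in ("request_id", "user_id", "endpoint"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Create a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    # StreamHandler defaults to stderr; pass sys.stdout to follow the pipeline above.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("order-service", sys.stdout).error("Payment failed", extra={"trace_id": "abc123"})` then emits one JSON line that a shipper can parse without regexes.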
Red flag answer: “We use console.log for debugging” or not including trace IDs in logs (makes cross-service debugging impossible).
Follow-up questions:
Q: Your Elasticsearch cluster is growing 50GB/day and costs are out of control. How do you reduce it?
(1) Drop DEBUG and TRACE logs in production at the shipper level (Fluent Bit filter). (2) Sample verbose INFO logs (keep 10% of health check logs). (3) Use index lifecycle management (ILM): hot tier for 7 days (fast SSD), warm tier for 30 days (cheaper storage), delete after 90 days. (4) Switch to Loki for non-critical services (80% cheaper than Elasticsearch). (5) Reduce log payload size: do not log full request/response bodies for successful requests. (6) Deduplicate repeated error logs (log the first occurrence, then count).
Q: An incident is in progress. A user reports an error. Walk me through how you debug it using logs and traces.
(1) Get the user’s request ID or error correlation ID from their error message or support ticket. (2) Search centralized logs for that correlation ID — find the originating service and trace ID. (3) Open the distributed trace (Jaeger/Datadog) using the trace ID — see the full request path and identify which service errored. (4) In the erroring service’s logs, search by trace ID to find the detailed error message, stack trace, and context. (5) Check if the error is isolated (one user) or widespread (check error rate dashboard). (6) Check the erroring service’s metrics (CPU, memory, error rate) for anomalies. Total time: 2-5 minutes for an experienced engineer with proper tooling.
What interviewers are really testing: Whether you understand the difference between “the process is alive” and “the process can serve traffic” — getting this wrong causes cascading failures.
Answer:
Kubernetes uses three types of probes (liveness, readiness, and startup) to manage the container lifecycle:
Liveness Probe — “Is this process alive or stuck?”
  • If the liveness probe fails, the kubelet kills the container and restarts it according to the pod’s restart policy. The pod itself is not rescheduled.
  • Use for: detecting deadlocks, infinite loops, memory leaks that cause hangs.
  • Should check: “Can this process respond at all?” — a lightweight /healthz endpoint that returns 200 OK.
  • Must NOT check external dependencies. If the liveness probe checks the database and the database goes down, Kubernetes restarts ALL pods simultaneously. The pods restart, still cannot reach the database, fail liveness again, enter a restart loop. Meanwhile, your service is completely down instead of partially degraded.
  • Keep it simple: respond to HTTP request, maybe check that the main thread is alive. Nothing more.
Readiness Probe — “Can this pod serve traffic right now?”
  • If the readiness probe fails, the pod is removed from the Service’s endpoints (stops receiving traffic) but is NOT restarted.
  • Use for: checking if external dependencies are available (database connected, cache warm, config loaded).
  • Should check: database connection pool, required cache availability, background initialization completion.
  • A pod that fails readiness is expected to recover on its own (e.g., when the database comes back up). Traffic routes to other healthy pods in the meantime.
Startup Probe — “Has this application finished starting up?”
  • Used for slow-starting applications (JVM apps, apps that load large models).
  • While the startup probe is running, liveness and readiness probes are disabled. This prevents Kubernetes from killing a pod that is just slow to start (not actually dead).
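  • Example: a Java app that takes 60 seconds to initialize. Without a startup probe, a liveness probe that begins checking after 15 seconds and tolerates 3 failures at 10-second intervals would restart the container before it ever finishes starting, over and over.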
Configuration best practices:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3     # 3 consecutive failures before restart
  timeoutSeconds: 1       # fast timeout - if the app hangs, this catches it

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
  timeoutSeconds: 2

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10       # 30 * 10 = 300 seconds to start up
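On the application side, the two endpoints referenced in the probe configuration could look like the following sketch. It uses only the standard library; the `make_handler` factory and the shape of `dependency_checks` (a dict of name to zero-argument callable) are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler


def liveness():
    """Liveness only proves the process can respond. No dependency checks."""
    return 200, "ok"


def readiness(dependency_checks):
    """Readiness returns 503 if any required dependency check fails or raises."""
    failed = []
    for name, check in dependency_checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False
        if not healthy:
            failed.append(name)
    if failed:
        return 503, "not ready: " + ", ".join(sorted(failed))
    return 200, "ready"


def make_handler(dependency_checks):
    """Build a request handler class serving /healthz and /readyz."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                status, body = liveness()
            elif self.path == "/readyz":
                status, body = readiness(dependency_checks)
            else:
                status, body = 404, "not found"
            payload = body.encode()
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler
```

Serving it is one line, for example `HTTPServer(("", 8080), make_handler({"database": db_ping})).serve_forever()`, where `db_ping` is a hypothetical function that returns `True` when the connection pool is usable. Only optional dependencies should be left out of `dependency_checks`.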
Red flag answer: “The liveness probe checks the database” — this is the #1 most common and dangerous misconfiguration. Or not knowing the difference between liveness and readiness.
Follow-up questions:
Q: Your application connects to both a database and a Redis cache. Redis goes down. What should your health checks report?
It depends on how critical Redis is. If Redis is a required dependency (cache-aside for all reads), the readiness probe should fail — the pod cannot serve useful responses without it. If Redis is optional (caching for performance but fallback to database works), the readiness probe should still pass, but the pod should log warnings and emit metrics showing degraded mode. The liveness probe should pass in both cases — the process itself is healthy, it just cannot serve traffic effectively. This nuance is critical: liveness checks the process, readiness checks the capability to serve.
Q: During a rolling deployment, users see errors for 10-15 seconds. Both liveness and readiness probes pass. What is happening?
The old pod receives a SIGTERM and starts shutting down, but in-flight requests are dropped. The readiness probe might pass right up until termination. Fix: (1) Handle SIGTERM gracefully — stop accepting new requests, finish in-flight requests, then exit. (2) Add a preStop hook with a brief sleep (5 seconds) to allow the Kubernetes endpoints controller to remove the pod from the Service before the pod actually stops. (3) Set terminationGracePeriodSeconds high enough for in-flight requests to complete. (4) Immediately fail the readiness probe on SIGTERM so the pod is removed from the endpoint list before it shuts down.
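The SIGTERM handling from the rolling-deployment answer can be sketched as a small helper that fails readiness immediately and lets in-flight requests drain. The `GracefulShutdown` class and its method names are hypothetical; a real server would call `request_started` and `request_finished` from its request middleware.

```python
import signal
import threading


class GracefulShutdown:
    """Tracks in-flight requests and flips readiness off when SIGTERM arrives."""

    def __init__(self):
        self.shutting_down = False
        self._in_flight = 0
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)

    def install(self):
        # Must be called from the main thread, as required by signal.signal.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.begin())

    def begin(self):
        # A plain attribute write keeps the signal handler trivial.
        self.shutting_down = True

    def is_ready(self):
        """Readiness probe result: fail as soon as shutdown has begun."""
        return not self.shutting_down

    def request_started(self):
        with self._lock:
            self._in_flight += 1

    def request_finished(self):
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def wait_for_drain(self, timeout):
        """Block until no requests are in flight; False if the timeout expires."""
        with self._idle:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

The `timeout` passed to `wait_for_drain` should stay below `terminationGracePeriodSeconds`, since Kubernetes sends SIGKILL once that period expires.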
What interviewers are really testing: Whether you understand the operational difference between policy-based traffic control (rate limiting) and reactive system protection (throttling).
Answer:
These terms are often used interchangeably, but they serve different purposes:
Rate Limiting — Policy enforcement:
  • Applied per client, per API key, per tier, per endpoint.
  • “Free tier users get 100 requests/minute. Pro tier gets 1,000.”
  • Enforced regardless of server load. Even if your servers are at 10% CPU, a free-tier user exceeding 100/min gets 429.
  • Configured in advance based on business rules and capacity planning.
  • Returns: 429 Too Many Requests with Retry-After header and rate limit headers.
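A per-client, per-tier limiter of the kind described above is often built on a token bucket. The sketch below is one possible in-memory version; the class names, the injectable `clock`, and the tier dictionary are assumptions, and a multi-instance deployment would need shared state (for example in Redis) rather than a local dict.

```python
import math
import time


class TokenBucket:
    """Refills at `rate_per_sec` tokens per second up to `capacity`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def allow(self):
        """Return (allowed, retry_after_seconds)."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0
        # Seconds until one full token is available again, rounded up.
        return False, math.ceil((1 - self.tokens) / self.rate)


class RateLimiter:
    """One bucket per client, sized by the client's tier (requests per minute)."""

    def __init__(self, tier_limits_per_minute, clock=time.monotonic):
        self.tier_limits = tier_limits_per_minute
        self.clock = clock
        self.buckets = {}

    def check(self, client_id, tier):
        """Return (status_code, headers) for one incoming request."""
        if client_id not in self.buckets:
            limit = self.tier_limits[tier]
            self.buckets[client_id] = TokenBucket(limit / 60, limit, self.clock)
        allowed, retry_after = self.buckets[client_id].allow()
        if allowed:
            return 200, {}
        return 429, {"Retry-After": str(retry_after)}
```

Note that the decision never looks at server load: a client over its quota gets 429 even when the servers are idle, which is the policy-enforcement behaviour described above.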
Throttling — System protection:
  • Applied globally based on current system health.
  • “Server CPU is at 95%. Reject 50% of incoming requests to prevent a crash.”
  • Kicks in dynamically when the system is under stress. Traffic that would normally be allowed is rejected.
  • Based on real-time metrics: CPU, memory, queue depth, response time.
  • Returns: 503 Service Unavailable with Retry-After.
Key differences:
| Aspect | Rate Limiting | Throttling |
| --- | --- | --- |
| Trigger | Client exceeds quota | System is overloaded |
| Scope | Per-client | Global or per-service |
| Configuration | Static policy | Dynamic, metric-driven |
| Purpose | Fairness, monetization | Survival, stability |
| When applied | Always (even under low load) | Only during stress |
| Response code | 429 | 503 |
Production example: An API serves 3 tiers: Free (100/min), Pro (1,000/min), Enterprise (10,000/min). Rate limiting enforces these tiers. During a Black Friday traffic spike, the system hits 90% capacity. Throttling kicks in: Enterprise clients start getting 503s for 10% of requests, Pro clients for 30%, Free clients for 70%. Rate limiting handles fairness; throttling handles survival.
Advanced patterns:
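  • Adaptive throttling (Google SRE approach): Each client tracks how many requests it sent and how many the backend accepted over a recent window. When rejections exceed a threshold, the client starts rejecting a proportional share of new requests locally, before they reach the overloaded backend. Google’s Doorman applies a related idea: cooperative, client-side rate limiting against a shared capacity.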
  • Priority-based throttling: During overload, protect high-priority traffic (checkout, payments) and shed low-priority traffic (analytics, recommendations) first.
  • Load shedding: The extreme form of throttling. Proactively drop requests to keep the system alive. Netflix’s concurrency limiter dynamically adjusts the number of in-flight requests per server.
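The adaptive throttling described in the Google SRE book is client-side: each client rejects new requests locally with probability max(0, (requests - K * accepts) / (requests + 1)). The sketch below implements that formula with simplifying assumptions: plain counters instead of the sliding time window the book uses, and hypothetical class and method names.

```python
import random


class AdaptiveThrottle:
    """Client-side throttle: rejects locally as the backend's accept rate drops."""

    def __init__(self, k=2.0, rng=random.random):
        self.k = k          # multiplier: lower values reject more aggressively
        self.rng = rng      # injectable random source, returns a value in [0, 1)
        self.requests = 0   # requests the application attempted
        self.accepts = 0    # requests the backend accepted

    def rejection_probability(self):
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def should_send(self):
        """False means: fail fast locally instead of calling the backend."""
        return self.rng() >= self.rejection_probability()

    def record(self, accepted):
        """Record one attempted request and whether the backend accepted it."""
        self.requests += 1
        if accepted:
            self.accepts += 1
```

With K = 2, a client only starts rejecting locally once fewer than half of its recent requests were accepted, so a healthy backend sees no change in traffic.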
Red flag answer: Using “rate limiting” and “throttling” interchangeably, or only knowing about rate limiting without understanding reactive load shedding.
Follow-up questions:
Q: Your payment API is getting overloaded. You need to shed load. How do you decide which requests to reject?
Implement priority-based shedding: (1) Tag each request with a priority (payment completion = critical, product recommendations = low). (2) Under load, start rejecting low-priority requests first. (3) Use a token bucket per priority class. When the system is overloaded, reduce tokens for low-priority buckets while maintaining critical bucket allocation. (4) Return 503 with Retry-After for shed requests. (5) Monitor the shed rate as a key SLI — if you are shedding critical traffic, something is very wrong.
Q: How would you implement global throttling across 20 API server instances?
(1) Centralized approach: all instances check a Redis counter tracking total RPS. Simple but adds latency and creates a single point of failure. (2) Decentralized approach: each instance tracks local metrics (CPU, response time, queue depth) and makes independent throttling decisions. If local CPU exceeds 80%, start rejecting. No coordination needed. (3) Hybrid: use a lightweight gossip protocol (like Consul) where instances share their load metrics. Each instance makes decisions based on both local and cluster-wide health. (4) Infrastructure-level: use Kubernetes HPA and pod disruption budgets to prevent overload, and use Envoy’s circuit breaking as the throttling mechanism.
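Priority-based shedding can be reduced to a per-priority utilization threshold. The following is a deliberately simple sketch: the threshold values, the `admit` function, and the fixed 5-second `Retry-After` are assumptions, and it uses thresholds rather than the per-priority token buckets mentioned in the answer.

```python
# Hypothetical thresholds: the utilization at which each class starts being shed.
SHED_THRESHOLDS = {"low": 0.70, "normal": 0.85, "critical": 0.98}


def admit(priority, utilization):
    """Return (status_code, headers) given current utilization in [0, 1]."""
    if utilization >= SHED_THRESHOLDS[priority]:
        # Shed request: tell the client when it is reasonable to retry.
        return 503, {"Retry-After": "5"}
    return 200, {}


def shed_rate(decisions):
    """Fraction of requests shed; worth tracking as an SLI per priority class."""
    if not decisions:
        return 0.0
    return sum(1 for status, _ in decisions if status == 503) / len(decisions)
```

At 80% utilization this sheds only low-priority traffic; critical requests are rejected only when the system is almost saturated.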
What interviewers are really testing: Whether you can manage API evolution without breaking consumers, and whether you understand what constitutes a “breaking change.”
Answer:
Semantic Versioning (SemVer) for APIs: MAJOR.MINOR.PATCH
  • MAJOR (v1 to v2): Breaking changes. Clients must update.
  • MINOR (v1.1 to v1.2): New features, backward-compatible. Clients do not need to update.
  • PATCH (v1.1.1 to v1.1.2): Bug fixes, backward-compatible.
What constitutes a breaking change in an API:
  • Removing a field from the response.
  • Renaming a field.
  • Changing a field’s type ("price": 10 to "price": "10.00").
  • Removing an endpoint.
  • Changing the authentication mechanism.
  • Making an optional field required.
  • Changing the URL structure.
  • Changing the meaning of a status code.
What is NOT a breaking change (additive changes):
  • Adding a new field to a response (clients should ignore unknown fields).
  • Adding a new optional field to a request.
  • Adding a new endpoint.
  • Adding a new enum value (controversial — some clients break on unknown enum values).
  • Adding new headers.
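The two lists above can be turned into a simple automated check. The sketch below compares two response schemas expressed as hypothetical `{field name: type name}` dictionaries; it covers only removed, retyped, and added fields, not endpoints, authentication, or status-code semantics.

```python
def classify_changes(old_fields, new_fields):
    """Compare response schemas given as {field name: type name}."""
    breaking, additive = [], []
    for name, old_type in old_fields.items():
        if name not in new_fields:
            # A rename shows up here as a removal plus an addition.
            breaking.append(f"removed field '{name}'")
        elif new_fields[name] != old_type:
            breaking.append(
                f"changed type of '{name}': {old_type} -> {new_fields[name]}"
            )
    for name in new_fields:
        if name not in old_fields:
            additive.append(f"added field '{name}'")

    if breaking:
        bump = "MAJOR"
    elif additive:
        bump = "MINOR"
    else:
        bump = None  # no schema change detected
    return {"breaking": breaking, "additive": additive, "bump": bump}
```

Running a check like this in CI against the previously released schema catches accidental breaking changes before consumers do.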
API versioning in practice:
  • Most public APIs only expose the MAJOR version: /v1/, /v2/. Minor and patch versions are invisible to clients.
  • Breaking changes are expensive: every consumer must update. Good API teams ship 1-2 major versions in the API’s entire lifetime.
  • Deprecation lifecycle: (1) Announce the new version. (2) Add Sunset and Deprecation headers to old version responses. (3) Monitor usage of the old version. (4) Give 6-12 months notice. (5) Return 410 Gone after sunset.
The “robustness principle” (Postel’s Law): “Be conservative in what you send, liberal in what you accept.” Clients should ignore unknown fields. Servers should accept requests with extra unknown fields. This principle is what makes additive changes non-breaking.
Red flag answer: “Every new feature gets a new version” — this shows no understanding of backward compatibility. Or “just bump the version whenever we deploy.”
Follow-up questions:
Q: You need to rename a response field from userName to user_name (inconsistent casing fix). Is this a breaking change? How do you handle it?
Yes, it is a breaking change. Clients parsing userName will break when it becomes user_name. Handle with a transition period: (1) Return BOTH fields simultaneously: {"userName": "Alice", "user_name": "Alice"}. (2) Mark userName as deprecated in documentation. (3) After all consumers have migrated (track via analytics or consumer surveys), remove userName in a new major version. GitHub did exactly this during their API standardization — they returned both field names for months.
Q: A new enum value is added to a response field. A client’s switch statement hits the default case and crashes. Whose fault is it and how do you prevent this?
Both sides share responsibility. The API should document that new enum values may be added (making it a non-breaking change by contract). The client should handle unknown enum values gracefully (default case should log and continue, not crash). To prevent: (1) Document in the API contract that enums are “open” — new values can be added. (2) Client SDKs should generate enum types with an UNKNOWN value. (3) Consider using strings instead of enums if the set of values is expected to grow frequently.
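On the client side, both ideas (ignore unknown fields, treat enums as open) fit in a few lines. This is a sketch of a tolerant reader; the `Order` and `OrderStatus` types are hypothetical, and `_missing_` is the standard `enum` hook for values that match no member.

```python
from dataclasses import dataclass, fields
from enum import Enum


class OrderStatus(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    UNKNOWN = "unknown"

    @classmethod
    def _missing_(cls, value):
        # Called by Enum when `value` matches no member: map it to UNKNOWN
        # instead of raising ValueError, so new server values do not crash us.
        return cls.UNKNOWN


@dataclass
class Order:
    id: int
    status: OrderStatus

    @classmethod
    def from_payload(cls, payload):
        """Build an Order from a JSON-decoded dict, ignoring unknown fields."""
        known = {f.name for f in fields(cls)}
        data = {k: v for k, v in payload.items() if k in known}
        data["status"] = OrderStatus(data["status"])
        return cls(**data)
```

A payload such as `{"id": 1, "status": "refunded", "gift_wrap": true}` then parses to an order with `OrderStatus.UNKNOWN` rather than raising, which is what makes the server's additive change non-breaking for this client.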
What interviewers are really testing: Whether you treat documentation as a first-class engineering artifact, not an afterthought.
Answer:
OpenAPI Specification (OAS) — formerly known as Swagger — is the industry standard for describing REST APIs. It is a YAML/JSON file that defines endpoints, request/response schemas, authentication, and examples.
Why it matters:
  • Single source of truth — The spec defines what the API does. Code, documentation, and client SDKs are all generated from it.
  • Interactive UI — Swagger UI and Redoc render the spec into a browsable, testable documentation site. Developers can try API calls directly from the docs.
  • Client SDK generation: openapi-generator produces typed client libraries in 40+ languages from the spec. No hand-written HTTP clients.
  • Server stub generation — Generate server boilerplate in Go, Java, Python, etc. Ensure the implementation matches the spec.
  • Contract testing — Validate that the running API matches the spec. Tools like Dredd and Schemathesis test every endpoint against the spec.
  • Gateway configuration — Import the spec into API gateways (Kong, AWS API Gateway) to auto-configure routing and validation.
Design-first vs Code-first:
  • Design-first: Write the OpenAPI spec before writing code. Review the API contract with consumers (frontend team, mobile team). Once agreed, generate server stubs and implement. Best for public APIs and APIs with multiple consumers.
  • Code-first: Write the code, annotate with decorators/annotations, and auto-generate the spec. Best for internal APIs and rapid prototyping. Tools: Springdoc (Java), FastAPI (Python — auto-generates from type hints), tsoa (TypeScript).
A good API spec includes:
  • Descriptive endpoint summaries and detailed descriptions.
  • Request/response examples for every endpoint (not just schemas).
  • Error response schemas with all possible error codes.
  • Authentication requirements per endpoint.
  • Deprecation notices for old endpoints/fields.
  • Rate limit documentation.
Tools ecosystem:
  • Swagger UI / Redoc — Render the spec as interactive docs.
  • Stoplight / Bump.sh — API design platforms with visual editors.
  • Postman — Import OpenAPI specs to auto-generate API collections.
  • Spectral — Lint your OpenAPI spec for consistency (naming conventions, missing descriptions).
Red flag answer: “We document our API in a Confluence page” or “the code is the documentation.” Both decay instantly and are not machine-readable.
Follow-up questions:
Q: Your team uses code-first documentation generation. The docs are often out of date because developers forget to update annotations. How do you fix this?
(1) Add a CI step that generates the spec from code, compares it with the committed spec, and fails if they diverge. (2) Use contract testing (Dredd, Schemathesis) in CI — it runs every endpoint in the spec against the running API and fails if the response does not match. (3) Use a framework that generates docs from types automatically (FastAPI, tsoa) so documentation cannot drift from implementation. (4) Require spec updates in PR review checklists.
Q: How does API documentation fit into the developer experience of your platform?
The best API platforms (Stripe, Twilio, Algolia) treat docs as a product. Elements: (1) Interactive “try it” in the docs (Swagger UI). (2) Language-specific code examples (curl, Python, JavaScript, Go). (3) Quick-start guides. (4) Changelog with every API change documented. (5) Status page integration (show if an endpoint is currently degraded). (6) API explorer in the dashboard. Stripe’s documentation is often cited as the gold standard — every endpoint has copy-paste code examples in 7 languages with real API responses.
What interviewers are really testing: Whether you can design APIs that degrade gracefully rather than failing entirely when one component is unavailable.
Answer:
In a microservices architecture, a single API request may depend on 5+ backend services. If one fails, should the entire request fail? Usually no.
Patterns for partial failure handling:
  1. Partial response with degradation indicators:
{
  "user": { "id": 42, "name": "Alice" },
  "recommendations": null,
  "_meta": {
    "degraded": ["recommendations"],
    "message": "Recommendation service unavailable"
  }
}
The client gets user data (from User Service) even though Recommendations Service is down. The _meta section tells the client which parts are degraded.
  2. GraphQL’s native error handling:
{
  "data": {
    "user": { "id": 42, "name": "Alice" },
    "recommendations": null
  },
  "errors": [
    { "message": "Recommendation service timeout", "path": ["recommendations"] }
  ]
}
GraphQL returns 200 OK with both data and errors. The client can render what is available and show a fallback for the failed section.
  3. Fallback responses: When Recommendations Service is down, return cached recommendations (stale but better than nothing), popular items (generic fallback), or an empty list. Netflix’s API returns generic “Top 10” recommendations when personalization fails — users see content, not errors.
  4. HTTP 207 Multi-Status (WebDAV): Used when a batch request has mixed results:
{
  "results": [
    { "id": 1, "status": 200, "data": { "..." } },
    { "id": 2, "status": 404, "error": "Not found" },
    { "id": 3, "status": 200, "data": { "..." } }
  ]
}
Implementation pattern (Scatter-Gather):
  1. API Gateway fans out requests to 5 services in parallel.
  2. Set a timeout for each (e.g., 500ms).
  3. After all responses return (or timeout), assemble the composite response.
  4. Include what succeeded, mark what failed, apply fallbacks for failed components.
  5. Return 200 OK with the partial response (not 500 — the request partially succeeded).
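The five steps above map naturally onto `asyncio`. The sketch below is one way to write the fan-out; `scatter_gather`, the `calls` dict of zero-argument coroutine functions, and the `fallbacks` dict are assumptions for illustration, and the `_meta.degraded` shape follows the partial-response example earlier in this answer.

```python
import asyncio


async def scatter_gather(calls, timeout_s=0.5, fallbacks=None):
    """Fan out to all services in parallel and assemble a partial response.

    calls:     {name: zero-argument coroutine function}
    fallbacks: {name: value to use when that call fails or times out}
    """
    fallbacks = fallbacks or {}

    async def guarded(name, call):
        try:
            # Per-service timeout: a slow dependency cannot stall the others.
            value = await asyncio.wait_for(call(), timeout_s)
            return name, value, False
        except Exception:  # includes the timeout raised by wait_for
            return name, fallbacks.get(name), True

    results = await asyncio.gather(*(guarded(n, c) for n, c in calls.items()))

    body, degraded = {}, []
    for name, value, failed in results:
        body[name] = value
        if failed:
            degraded.append(name)
    if degraded:
        body["_meta"] = {"degraded": degraded}
    # Partial success is still a success from the client's point of view.
    return 200, body
```

Because every call is wrapped in its own guard, total latency is bounded by roughly `timeout_s` rather than by the slowest dependency.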
Red flag answer: Returning 500 Internal Server Error when one of five downstream services fails, or not considering graceful degradation at all.
Follow-up questions:
Q: The product team says “if payment service is down, the checkout page should not render at all.” But the recommendations service going down should not affect anything. How do you implement this?
Classify downstream dependencies as critical (payment, inventory, user auth — page cannot function without them) and non-critical (recommendations, analytics, personalization — page degrades gracefully). For critical dependencies: fail the request if they fail (return 503). For non-critical dependencies: use circuit breakers with fallbacks. Implement this as a middleware or decorator pattern: @critical("payment-service") vs @degradable("recommendations-service", fallback=default_recs). This classification should be documented and reviewed with the product team.
Q: Your API aggregates data from 3 services. Two respond in 50ms, one takes 3 seconds. How do you handle this?
(1) Set a per-service timeout (e.g., 500ms). After 500ms, proceed without the slow service’s data. (2) Return the partial response immediately with the slow service’s data marked as degraded. (3) If the slow service is non-critical, apply a fallback. (4) If the slow service is critical, consider: can you return a loading state and let the client poll? Or use SSE to push the remaining data when it arrives? (5) Investigate the slow service — 3-second responses suggest a performance problem that should be fixed, not just tolerated.
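The `@critical` and `@degradable` decorators named in the first answer are not from any particular library. One possible way to implement them, with a hypothetical `DependencyUnavailable` exception and `checkout_page` handler, is sketched below; a production version would add circuit breaking and metrics.

```python
import functools


class DependencyUnavailable(Exception):
    """Raised when a critical dependency fails; maps to HTTP 503."""

    def __init__(self, service):
        super().__init__(f"{service} unavailable")
        self.service = service


def critical(service):
    """Any failure in the wrapped call fails the whole request."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                raise DependencyUnavailable(service) from exc
        return wrapper
    return decorator


def degradable(service, fallback):
    """Any failure in the wrapped call is replaced by the fallback value."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                return fallback
        return wrapper
    return decorator


def checkout_page(load_payment, load_recommendations):
    """Return (status_code, body) for the checkout page."""
    try:
        payment = load_payment()
    except DependencyUnavailable as exc:
        return 503, {"error": f"{exc.service} unavailable"}
    return 200, {"payment": payment, "recommendations": load_recommendations()}
```

Keeping the classification in the decorator arguments makes it visible in code review, which helps when the list is revisited with the product team.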
What interviewers are really testing: Whether you understand how infrastructure concerns are separated from application logic in containerized environments.
Answer:
The sidecar pattern deploys a helper container alongside the main application container within the same pod (Kubernetes) or task (ECS). They share the same network namespace (localhost), filesystem (via shared volumes), and lifecycle.
Why sidecars exist: Infrastructure concerns (logging, metrics, TLS, tracing, config management) are the same across all services. Instead of embedding these in every application (in every language), extract them into a reusable sidecar container.
Common sidecar use cases:
  1. Service mesh proxy (Envoy in Istio, linkerd2-proxy in Linkerd) — Handles mTLS, retries, circuit breaking, load balancing, observability. The application makes plain HTTP calls to localhost; the sidecar handles everything else. This is how Istio and Linkerd work.
  2. Log collection (Fluent Bit / Fluentd) — Reads log files from a shared volume and ships them to the centralized logging system. The application writes to a file; the sidecar handles collection, parsing, and shipping.
  3. Configuration management (Vault Agent) — Fetches secrets from HashiCorp Vault, writes them to a shared volume or injects them as environment variables. The application reads secrets from a file; the sidecar handles auth and renewal.
  4. Metrics collection (Prometheus exporter) — Exposes application metrics in Prometheus format. The application writes metrics to a local endpoint; the sidecar reformats and exposes them.
  5. Database proxy (Cloud SQL Proxy, PgBouncer) — The application connects to localhost:5432; the sidecar handles connection pooling, TLS, and authentication to the actual database.
Sidecar vs init container:
  • Init container — Runs ONCE before the main container starts. Use for: database migrations, config file generation, dependency checks.
  • Sidecar — Runs alongside the main container for the pod’s entire lifetime. Use for: ongoing concerns (proxying, logging, metrics).
Resource and latency considerations: Each sidecar consumes CPU and memory. With Envoy sidecars (50-100MB RAM each) across 200 pods, that is 10-20GB of memory just for proxies. The sidecar also adds ~1-2ms latency per hop. For latency-critical paths, this matters. Some teams use “proxyless” service mesh (gRPC xDS protocol) to avoid sidecar overhead.
Red flag answer: Describing sidecars as “just another container” without understanding why co-location in the same pod matters (shared network, shared lifecycle) or the resource implications at scale.
Follow-up questions:
Q: Your team has 100 microservices, each with an Envoy sidecar. Resource usage is significant. How do you reduce it?
(1) Tune Envoy resource limits — most sidecars do not need the default allocation. Profile actual usage and right-size. (2) Use Envoy’s hot restart to reduce memory during configuration updates. (3) Consider “ambient mesh” (Istio ambient mode) which uses node-level proxies instead of per-pod sidecars, significantly reducing resource overhead. (4) For services that do not need the full mesh (internal batch jobs), exclude them from sidecar injection. (5) Evaluate Linkerd, which uses a Rust-based proxy consuming ~10MB vs Envoy’s ~50-100MB.
Q: What happens if the sidecar crashes but the main application container is still running?
In Kubernetes, if the sidecar container crashes, the kubelet restarts it (based on the container’s restart policy). The main application continues running but may experience connectivity issues (if the sidecar was the proxy for all traffic). With Istio, if the Envoy sidecar crashes, the application cannot make or receive network calls (all traffic routes through the sidecar via iptables rules). This is a failure mode to plan for: configure appropriate health checks and alerts on sidecar container restarts.
What interviewers are really testing: Whether you understand the principles that make applications cloud-native and easy to operate in containerized environments.
Answer:
The Twelve-Factor App methodology (by Heroku co-founder Adam Wiggins) defines 12 principles for building cloud-native SaaS applications. These principles are the foundation of modern Docker/Kubernetes deployments.
The 12 factors with practical implications:
  1. Codebase — One codebase tracked in version control, many deploys. Do not have separate repos for staging vs production. Use branches and environment config.
  2. Dependencies — Explicitly declare and isolate dependencies (package.json, requirements.txt, go.mod). Never rely on system-level packages being pre-installed.
  3. Config — Store configuration in environment variables, not in code. Never commit database URLs, API keys, or feature flags to the repo. Use: Kubernetes ConfigMaps/Secrets, Vault, AWS Parameter Store.
  4. Backing services — Treat databases, caches, queues, email services as attached resources accessed via URL/connection string. Swapping a local PostgreSQL for Amazon RDS should require only changing an environment variable, not code changes.
  5. Build, Release, Run — Strictly separate build (compile + package), release (combine build with config), and run (execute in the environment) stages. Use immutable Docker images as the build artifact.
  6. Processes — Execute the app as one or more stateless processes. Store all persistent data in backing services (database, Redis), not in local files or in-memory. Any in-memory cache is a performance optimization, not the source of truth. This enables horizontal scaling.
  7. Port binding — Export services via port binding. The app is self-contained and binds to a port (e.g., EXPOSE 8080). No dependency on an external web server.
  8. Concurrency — Scale out via the process model. Run multiple instances (pods) rather than one large process with many threads. Different process types for different workloads (web process, worker process, scheduler process).
  9. Disposability — Maximize robustness with fast startup and graceful shutdown. Processes can be started or killed at any time (Kubernetes rolling deployments, preemptible VMs). Handle SIGTERM gracefully.
  10. Dev/prod parity — Keep development, staging, and production as similar as possible. Use Docker to ensure identical environments. Do not use SQLite in dev and PostgreSQL in prod.
  11. Logs — Treat logs as event streams. Write to stdout, not to files. Let the execution environment (Docker, Kubernetes) handle log routing and aggregation.
  12. Admin processes — Run admin/management tasks (database migrations, console commands) as one-off processes in the same environment as the app. Use kubectl exec or Kubernetes Jobs, not SSH.
The factors most commonly violated:
  • #3 (Config) — Hardcoded config in code. Fix: use environment variables.
  • #6 (Processes) — Storing state in local filesystem. Fix: use shared backing services.
  • #9 (Disposability) — Not handling SIGTERM. Fix: implement graceful shutdown in the application.
  • #10 (Dev/prod parity) — “Works on my machine.” Fix: Docker compose for local development.
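For factor #3, the usual fix is to read configuration from the environment once at startup and fail fast when something required is missing. The sketch below assumes three hypothetical variables (`DATABASE_URL`, `PORT`, `LOG_LEVEL`); the names and defaults are illustrative only.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    port: int
    log_level: str


def load_settings(env=None):
    """Build Settings from environment variables, failing fast if one is missing."""
    env = os.environ if env is None else env
    required = ("DATABASE_URL",)
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        port=int(env.get("PORT", "8080")),      # default is an assumption
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

The same image can then be promoted from staging to production unchanged, with only the injected environment (ConfigMap, Secret, or parameter store) differing between deploys.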
Red flag answer: Listing the 12 factors without understanding why they matter or not knowing which ones are most commonly violated.
Follow-up questions:
Q: Factor #6 says processes should be stateless. But your application caches user sessions in memory. Is this a violation? How do you fix it?
Yes, it violates factor #6. If the pod is killed or a new request routes to a different pod, the session is lost. Fix: externalize session storage to Redis or a database. Every pod reads/writes sessions from the shared store. In Kubernetes, do not rely on sticky sessions (session affinity) — this couples the client to a specific pod and breaks horizontal scaling. The performance overhead of Redis for session storage is minimal (sub-millisecond latency).
Q: Factor #9 says processes should start fast. Your Java application takes 90 seconds to start. How do you reconcile this with Kubernetes rolling deployments?
(1) Use GraalVM native image to compile ahead of time — reduces startup to 1-2 seconds. (2) If native image is not feasible, use Kubernetes startup probes with generous thresholds (failureThreshold * periodSeconds = 120+ seconds). (3) Minimize startup work: lazy-initialize connections, defer non-essential initialization. (4) Use CDS (Class Data Sharing) in Java 17+ to speed up class loading. (5) Adjust rolling deployment strategy: set maxUnavailable: 0, maxSurge: 1 so old pods keep serving until new ones pass readiness probes.
What interviewers are really testing: Whether you understand the data transformation challenge in microservices and how to avoid N-squared integration complexity.
Answer:
A Canonical Data Model (CDM) defines a standard, shared data format for communication between services. Instead of each service pair agreeing on a custom format (N services = N*(N-1) possible mappings), all services translate to and from the canonical model.
The problem it solves: Without a CDM, if 5 services need to exchange user data, each pair may represent “user” differently:
  • Service A: { "firstName": "Alice", "lastName": "Smith" }
  • Service B: { "name": "Alice Smith" }
  • Service C: { "user_name": "asmith", "full_name": "Smith, Alice" }
With 5 services, you potentially need 20 (5*4) different transformation mappings. With a CDM:
  • Canonical: { "given_name": "Alice", "family_name": "Smith", "display_name": "Alice Smith" }
  • Each service has 2 mappings: to-canonical and from-canonical. Total: 10 (5*2) mappings.
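The boundary translation described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the field names come from the examples in this answer, while the function names (`a_to_canonical`, `canonical_to_b`) and the rule for building `display_name` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalUser:
    given_name: str
    family_name: str
    display_name: str


def a_to_canonical(payload: dict) -> CanonicalUser:
    """Translate Service A's format ({"firstName", "lastName"}) to canonical."""
    first, last = payload["firstName"], payload["lastName"]
    return CanonicalUser(first, last, f"{first} {last}")


def canonical_to_b(user: CanonicalUser) -> dict:
    """Translate canonical to Service B's format ({"name"})."""
    return {"name": user.display_name}


def point_to_point_mappings(n: int) -> int:
    # Every ordered pair of services needs its own translator.
    return n * (n - 1)


def canonical_mappings(n: int) -> int:
    # Each service needs one to-canonical and one from-canonical translator.
    return 2 * n


user = a_to_canonical({"firstName": "Alice", "lastName": "Smith"})
service_b_payload = canonical_to_b(user)
```

Service A never learns Service B's format; each side only knows its own model and the canonical one, which is why the mapping count grows linearly instead of quadratically.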
When CDM makes sense:
  • Enterprise integration — Many systems (ERP, CRM, billing, shipping) need to exchange customer/order data. The CDM is the “lingua franca.”
  • Event-driven architectures — Events on the message bus use the canonical schema. Each service translates at the boundary.
  • Industry standards — Healthcare (HL7 FHIR), finance (FIX protocol), e-commerce (GS1) provide canonical models.
When CDM is overkill:
  • Small teams with 3-5 services where direct communication is manageable.
  • When services are loosely coupled and rarely exchange data.
Anti-patterns to avoid:
  • Over-centralized CDM — If the canonical model is maintained by one team and every change requires their approval, it becomes a bottleneck. Use a federated ownership model where each domain owns its portion of the canonical model.
  • CDM as a “god object” — The canonical model grows to include every field from every service, becoming a massive, unwieldy schema. Keep it focused on shared concepts. Service-specific fields stay internal.
  • Forcing CDM on internal service APIs — The CDM is for inter-service events and integration, not for every API call. Internal APIs can use domain-specific models.
Red flag answer: “All services should use the same data format” without understanding the complexity of maintaining a shared schema across teams or when it is not worth the overhead.
Follow-up questions:
Q: Your canonical data model needs to change (add a new field). How do you coordinate this across 20 services?
(1) Use schema evolution rules (like Protobuf): new fields are always optional with defaults. Old services ignore them, new services use them. (2) Manage the canonical schema in a Schema Registry (Confluent, AWS Glue) with compatibility rules (backward or forward compatible). (3) The registry rejects changes that break compatibility, preventing accidental breaking changes. (4) Version the canonical model with SemVer. Non-breaking additions are minor versions. Breaking changes require a major version with a migration plan. (5) Communicate changes via an API/schema changelog that all teams subscribe to.
Q: How does a Canonical Data Model relate to Domain-Driven Design (DDD) bounded contexts?
In DDD, each bounded context (service) has its own internal model that is optimized for its domain. The CDM lives at the boundary between contexts — in the “Anti-Corruption Layer” (ACL). The ACL translates between the service’s internal model and the canonical model. This allows each service to evolve its internal model freely while maintaining a stable external contract. The CDM represents the “Published Language” in DDD terms — a shared, versioned schema that all contexts agree on for cross-boundary communication.
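The "new fields are optional with defaults, old services ignore them" rule from the first follow-up can be sketched as a tolerant reader. The `locale` and `loyalty_tier` fields and the `en-US` default are hypothetical, chosen only to illustrate the idea.

```python
def read_user_event(event: dict) -> dict:
    """Tolerant reader: take known fields, default optional ones, ignore the rest."""
    return {
        "given_name": event["given_name"],        # required since the first version
        "family_name": event["family_name"],      # required since the first version
        "locale": event.get("locale", "en-US"),   # hypothetical optional field with a default
    }


# An event from a producer that predates the new field.
old_event = {"given_name": "Alice", "family_name": "Smith"}

# An event from a newer producer, including a field this consumer does not know.
new_event = {
    "given_name": "Alice",
    "family_name": "Smith",
    "locale": "fr-FR",
    "loyalty_tier": "gold",
}
```

Both events parse without a coordinated deployment: the old one picks up the default, and the unknown field in the new one is simply dropped.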
What interviewers are really testing: Whether you understand how to prevent microservice integration breakages without relying solely on expensive end-to-end tests.
Answer:
Contract testing verifies that the API provider honors the contract expected by the consumer, and vice versa. It catches breaking changes before deployment, without requiring a running integration environment.
How it works (using Pact as an example):
  1. Consumer writes a contract: The consumer (e.g., frontend or another service) defines what it expects from the provider:
    • “When I send GET /users/42, I expect a response with {"id": 42, "name": "string", "email": "string"}.”
    • This is saved as a “pact file” (JSON contract).
  2. Provider verifies the contract: The provider runs the pact file against its actual API. For each consumer expectation, it sends the described request and checks if the response matches.
  3. If verification fails: The provider knows its change will break a consumer. The change is blocked.
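The verification step can be illustrated without any contract-testing library. The sketch below is not Pact's API; `CONSUMER_CONTRACT` and `verify_contract` are hypothetical names, and the contract is reduced to "field name maps to expected type" for brevity.

```python
# Consumer expectation for GET /users/42, expressed as field name -> type.
CONSUMER_CONTRACT = {"id": int, "name": str, "email": str}


def verify_contract(contract: dict, response_body: dict) -> list:
    """Return a list of violations; an empty list means the contract is honored."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response_body:
            problems.append(f"missing field: {field}")
        elif not isinstance(response_body[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    # Fields the consumer did not ask for are ignored, so additive
    # provider changes do not fail verification.
    return problems


ok = verify_contract(
    CONSUMER_CONTRACT,
    {"id": 42, "name": "Alice", "email": "alice@example.com", "plan": "pro"},
)
broken = verify_contract(CONSUMER_CONTRACT, {"id": "42", "name": "Alice"})
```

A provider pipeline would run a check like this for every consumer contract and fail the build when the list of violations is non-empty.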
Why contract testing beats end-to-end (E2E) testing for microservices:
| Aspect | Contract Testing | E2E Testing |
| --- | --- | --- |
| Speed | Seconds (unit test speed) | Minutes to hours |
| Reliability | Highly reliable (no flaky infra) | Flaky (network, data, timing) |
| Scope | One interaction at a time | Full system |
| Maintenance | Low (auto-generated from consumer) | High (shared test env, test data) |
| Feedback speed | Immediate in CI | Slow (requires full deployment) |
| Environment needs | None (mocked/stubbed) | Full staging environment |
The Pact workflow in CI/CD:
  1. Consumer generates pact files, publishes to Pact Broker.
  2. Provider CI pulls consumer pacts, runs verification.
  3. Pact Broker tracks compatibility matrix: which provider versions are compatible with which consumer versions.
  4. can-i-deploy tool checks the matrix before deployment: “Is this provider version compatible with all consumers currently in production?”
What contract tests catch:
  • Provider removes a field consumers depend on.
  • Provider changes a field type (number to string).
  • Provider changes the URL structure.
  • Provider changes the auth mechanism.
What contract tests do NOT catch:
  • Business logic bugs.
  • Performance issues.
  • Full workflow correctness.
Beyond Pact: Schema-based contract testing using OpenAPI specs. Tools like Schemathesis generate tests from OpenAPI specs and run them against the API, checking that every response matches the schema. Dredd does similar validation.
Red flag answer: “We test integration with end-to-end tests in staging” — this does not scale to 50+ services. Or not knowing what contract testing is.
Follow-up questions:
Q: You have 30 microservices. Adding a new field to a response works fine in your service but breaks a downstream consumer. How could contract testing have prevented this?
Strictly speaking, if the downstream consumer had a pact describing the response shape it relies on, adding a field would NOT break it (additive changes are non-breaking in Pact). The changes that contract tests catch are removals and renames: the consumer’s pact verification would fail in CI. The provider’s CI pipeline would show: “Consumer X expects field userName but the current provider returns user_name.” The provider developer sees this before merging, adds backward compatibility (return both fields), and the pact passes.
Q: Your team resists contract testing because “it’s too much overhead.” How do you make the case?
(1) Calculate the cost of the last integration-related outage. A single production integration failure often costs more in engineering time than setting up contract testing for the services involved. (2) Start small: add contract testing to the one integration that breaks most frequently. Show the ROI. (3) Keep authoring cost low: Pact generates the pact file from the consumer’s own tests, and Spring Cloud Contract can produce contracts and stubs from existing Spring REST Docs tests. (4) The larger overhead is usually maintaining shared staging environments for E2E tests, and contract testing reduces how much you depend on them.

6. Advanced Topics (Bonus Questions)

What interviewers are really testing: Whether you can model real-world domains as resources with clean, consistent, and intuitive URL structures.
Answer:
Good API URL design follows consistent patterns that are intuitive for developers:
Core principles:
  • Nouns, not verbs: /users/42/orders not /getOrdersForUser?userId=42. Resources are nouns. HTTP methods provide the verbs.
  • Plural collection names: /users, /orders, /products. Not /user/42 (inconsistent when the collection is also /user).
  • Hierarchical relationships: /users/42/orders/7 — Order 7 belongs to User 42. Maximum 2-3 levels deep. Deeper nesting becomes unwieldy.
  • Use query parameters for filtering: /orders?status=shipped&created_after=2024-01-01, not /orders/shipped/after/2024-01-01.
  • Consistent naming convention: lowercase with hyphens (/user-profiles) or lowercase without (/userprofiles). Pick one and enforce it.
Common design challenges:
  1. Actions that do not map to CRUD:
    • “Cancel an order” is not a DELETE (the order still exists, it is in a cancelled state).
    • Options: POST /orders/42/cancel (sub-resource action), PATCH /orders/42 {"status": "cancelled"} (state change), or POST /orders/42/actions/cancel (explicit action resource).
    • Rule of thumb: for simple state changes, use PATCH. For complex actions with side effects, use a sub-resource POST.
  2. Search across multiple resource types:
    • GET /search?q=alice&type=users,orders — a dedicated search endpoint.
    • Not /users?search=alice and /orders?search=alice separately (forces the client to make multiple calls).
  3. Batch operations:
    • POST /users/batch with [{...}, {...}, {...}] in the body.
    • Return individual results per item (some may succeed, some may fail).
  4. Tenant-scoped resources:
    • /tenants/acme/users/42 (explicit tenant in URL).
    • Or extract tenant from auth token and scope implicitly (cleaner URLs but less explicit).
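The "return individual results per item" point for batch operations can be sketched as below. The handler name, the validation rule, and the generated IDs are all hypothetical; the only idea taken from the text is that each item gets its own status.

```python
def create_users_batch(items: list) -> dict:
    """Handle POST /users/batch: process each item independently."""
    results = []
    for index, item in enumerate(items):
        if "email" not in item:  # hypothetical validation rule
            results.append(
                {"index": index, "status": 422, "error": "email is required"}
            )
        else:
            # Hypothetical ID scheme, for illustration only.
            results.append({"index": index, "status": 201, "id": f"user_{index}"})
    return {"results": results}


batch_response = create_users_batch(
    [{"email": "alice@example.com"}, {"name": "no email"}, {"email": "bob@example.com"}]
)
```

Including the item index in every result lets the client match outcomes to its original request without relying on response ordering.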
Real-world examples of good API design:
  • Stripe: /v1/customers/cus_123/payment_methods — clean hierarchy, consistent naming.
  • GitHub: /repos/owner/repo/pulls/42/reviews — deep hierarchy but intuitive.
  • Twilio: /2010-04-01/Accounts/AC123/Messages.json — date-versioned, explicit format suffix.
Red flag answer: Using verbs in URLs (/getUser, /createOrder) or inconsistent naming patterns across endpoints.
Follow-up questions:
Q: A designer asks for an endpoint that “archives all orders older than 30 days.” How do you model this?
POST /orders/actions/archive with {"older_than_days": 30} in the body. This is a bulk action, not a CRUD operation on a single resource. Return 202 Accepted with a task ID since it is a long-running operation. The alternative (DELETE /orders?older_than=30d) is dangerous for two reasons: archiving is a state change rather than a deletion, and a client that drops or mangles the query parameter would be issuing DELETE against the entire collection. An explicit POST with the filter as a required body field lets the server reject any request that does not state its scope. Either way, never expose a state-changing bulk operation over GET, which a browser prefetch or a link click can trigger.
Q: Should you nest resources (/users/42/orders) or use flat URLs (/orders?user_id=42)?
Both are valid. Nested resources make the ownership relationship explicit and enable URL-based authorization (/users/42/* requires being user 42). Flat resources are simpler when the child resource has a globally unique ID (order 7 is unique regardless of user). Practical guideline: nest one level deep for strong ownership (user’s orders), use flat with query params for loose relationships (orders by status). GitHub uses nested (/repos/owner/repo/issues) and Stripe uses flat with filtering (/invoices?customer=cus_123).
What interviewers are really testing: Whether you can keep an API functioning when things go wrong, and whether you understand progressive delivery.
Answer:
Graceful degradation means the API continues to function at reduced capability when components fail, rather than returning errors for the entire request.
Strategies:
  1. Circuit breaker + fallback: When the Recommendations service trips its circuit breaker, return cached recommendations or popular items instead of an error.
  2. Feature flags for API behavior:
    • Enable/disable features without deploying: if (featureFlags.isEnabled("new_search_algorithm")) { ... }.
    • Use feature flags to control rollout of new API versions: 1% of traffic gets the new behavior.
    • Tools: LaunchDarkly, Unleash, Split.io, Flagsmith, AWS AppConfig.
    • Kill switches: instant rollback of a misbehaving feature without deployment.
  3. Read-only mode: During a database failover, switch the API to read-only mode. All write operations return 503 Service Unavailable with Retry-After. Reads continue from replicas.
  4. Stale-while-revalidate: Return cached (potentially stale) data immediately while asynchronously refreshing the cache in the background. HTTP Cache-Control: stale-while-revalidate=60 enables this at the CDN/proxy level.
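The first strategy, circuit breaker plus fallback, can be sketched in a few lines. This is a simplified, single-threaded illustration rather than a production breaker; the class, its thresholds, and the two helper functions are assumptions made for the example.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()      # open: skip the failing dependency
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0              # success closes the circuit
        return result


def recommendations_down():
    raise ConnectionError("recommendations service unavailable")


def popular_items():
    return ["popular-1", "popular-2"]


breaker = CircuitBreaker(failure_threshold=2)
response = breaker.call(recommendations_down, popular_items)
```

The caller always receives a usable list; once the threshold is reached, the breaker stops calling the failing service until `reset_after` seconds have passed.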
Feature flag best practices for APIs:
  • Short-lived flags — Feature flags are for release management, not permanent configuration. Remove flags after rollout completes (within 2-4 weeks). Permanent flags become tech debt.
  • Server-side evaluation — Evaluate feature flags on the server, not the client. The client should not know about flags.
  • Targeted rollout — Roll out by: percentage of traffic, specific user IDs (dogfooding), user attributes (region, subscription tier), header value (for QA testing).
  • Monitoring per flag — Track error rates and latency separately for flag-on vs flag-off cohorts. If flag-on shows higher errors, automatically disable it (automated rollback).
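Percentage-based targeted rollout is commonly implemented by hashing the user ID into a stable bucket, so the same user always gets the same decision. The sketch below assumes 100 buckets and SHA-256; the function name is hypothetical and real flag services may bucket differently.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percentage: int) -> bool:
    """Deterministically place a user in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    # Users in buckets below the percentage see the new behavior.
    return bucket < percentage


enabled_at_10_percent = sum(
    in_rollout("new_search_algorithm", f"user-{i}", 10) for i in range(10000)
)
```

Including the flag name in the hash input keeps cohorts independent across flags, and raising the percentage only adds users: nobody who already has the feature loses it.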
Production example: Netflix has publicly described designing for this kind of degradation. Its Hystrix library wraps calls to backend dependencies and serves a fallback when one fails or is overloaded, so a non-critical feature such as personalized recommendations can degrade to a generic list while basic playback keeps working.
Red flag answer: “We deploy features and if they break, we roll back the whole deployment” — this is slow (minutes to hours vs seconds for a flag flip) and high-risk.
Follow-up questions:
Q: A feature flag is evaluated 10,000 times per second. How do you ensure it does not become a performance bottleneck?
(1) Cache the flag evaluation result locally (in-memory) with a short TTL (10-30 seconds). (2) Use a streaming connection (SSE/WebSocket) from the flag service to receive updates in near-real-time, rather than polling on every request. (3) LaunchDarkly’s server-side SDKs cache flag rules locally and evaluate in-process (microsecond latency). Only configuration updates come from the network. (4) For ultra-high-performance paths, precompute flag values at startup and refresh periodically.
Q: Your team has 200 feature flags in production, half of which are stale (feature fully rolled out months ago). What do you do?
Feature flag debt is real and dangerous (dead code paths, confusing logic). (1) Add a created_at and expected_removal_date to every flag. (2) Automated alerts when a flag exceeds its expected lifetime. (3) Quarterly “flag cleanup” sprints. (4) Use a flag management tool that tracks flag age and usage. (5) Make flag creation require an owner and a removal plan. (6) In code reviews, enforce that removing a flag is part of the feature completion definition-of-done.
What interviewers are really testing: Whether you understand the distinct roles of these three networking components and can explain when each applies.
Answer:
These three components are frequently confused because they share overlapping capabilities. The key distinction is where they sit in the traffic flow and what traffic they manage.
| Aspect | Load Balancer | API Gateway | Service Mesh |
| --- | --- | --- | --- |
| Traffic type | Any TCP/HTTP traffic | External-to-internal (north-south) | Internal service-to-service (east-west) |
| Layer | L4 (TCP) or L7 (HTTP) | L7 (HTTP/gRPC) | L7 (HTTP/gRPC/TCP) |
| Scope | Distributes traffic across backends | Single entry point for all external clients | All inter-service communication |
| Features | Health checks, round-robin, sticky sessions | Auth, rate limiting, routing, transformation | mTLS, retries, tracing, traffic splitting |
| Examples | AWS ALB/NLB, HAProxy, NGINX | Kong, AWS API Gateway, Apigee | Istio, Linkerd, Consul Connect |
| Deployed as | Standalone infrastructure | Standalone service/cluster | Sidecar proxy per pod |
When you need each:
  • Load balancer only: Simple applications. One or a few services behind a single entry point. No auth, rate limiting, or complex routing needed.
  • API Gateway: Multiple services, external consumers, need for centralized auth, rate limiting, and routing. This is the standard for any microservices architecture exposed to the internet.
  • Service mesh: 10+ microservices where managing mTLS, observability, and resilience policies per-service becomes unsustainable. The mesh handles service-to-service concerns that the API gateway does not cover.
How they work together:
Internet --> Load Balancer --> API Gateway --> Service Mesh --> Microservices
                                              (sidecar proxies)
The Load Balancer distributes traffic across API Gateway instances. The API Gateway handles external concerns (auth, rate limiting, routing). The Service Mesh handles internal concerns (mTLS, retries, circuit breaking between services).
Red flag answer: “API gateway and service mesh do the same thing” or not knowing the north-south vs east-west traffic distinction.
Follow-up questions:
Q: You have Kong as your API gateway and Istio as your service mesh. A request hits Kong, then passes through Istio’s sidecar. Is there redundant processing?
Yes, potentially. Both Kong and Envoy (Istio’s sidecar) can do TLS termination, rate limiting, retries, and tracing. The solution: (1) Clear responsibility separation — Kong handles external concerns (API key validation, external rate limits, request transformation). Istio handles internal concerns (mTLS between services, internal retries, observability). (2) Disable overlapping features in one layer. For example, let Istio handle all retries and remove retry logic from Kong. (3) Some architectures use Envoy as both the gateway and the mesh proxy (Istio’s ingress gateway), eliminating the separate API gateway layer entirely.
Q: A startup with 5 microservices asks if they need a service mesh. What do you advise?
No. 5 services do not justify the operational overhead of a service mesh (control plane management, sidecar resource consumption, steep learning curve). Use: (1) An API gateway (Kong or AWS API Gateway) for external traffic. (2) Application-level resilience (Resilience4j, Polly) for retries and circuit breaking. (3) Mutual TLS via Kubernetes secrets or cert-manager if you need encryption. (4) OpenTelemetry SDKs for distributed tracing. Revisit the service mesh decision when you hit 20+ services or when managing per-service resilience configuration becomes a bottleneck.
What interviewers are really testing: Whether you understand the unique constraints of mobile environments and can design APIs that perform well on unreliable, high-latency networks.
Answer:
Mobile clients operate under constraints that web clients do not face: high latency (100-500ms RTT on cellular), unreliable connections (tunnels, elevators, network switches), limited bandwidth (metered data plans), and battery constraints (network radio is expensive).
API design principles for mobile:
  1. Minimize round trips: Each HTTP request on cellular costs 100-500ms in RTT plus radio wake-up time. If a screen needs data from 3 services, the BFF should aggregate them into a single response. Reduce the number of API calls per screen to 1-2.
  2. Reduce payload size:
    • Return only fields the client needs (GraphQL or BFF with field selection).
    • Use compression (gzip/brotli — 60-80% size reduction for JSON).
    • Use efficient serialization (Protobuf if using gRPC — 30-80% smaller than JSON).
    • Return image URLs at the appropriate resolution for the device screen, not full-resolution.
  3. Design for offline-first:
    • Return ETag and Last-Modified headers for conditional requests. On the next request for the same resource, the client sends If-None-Match (or If-Modified-Since) — if data has not changed, it gets 304 Not Modified (no body, saves bandwidth).
    • Include Cache-Control: max-age=300 for data that can be cached client-side.
    • Design resources to be independently cacheable (avoid deeply nested responses that invalidate too frequently).
  4. Handle unreliable connections:
    • Every mutating endpoint must be idempotent (idempotency keys for POST).
    • Client retries are inevitable. The server must handle them gracefully.
    • Return partial responses (degraded) rather than errors when possible.
  5. Pagination that works offline:
    • Cursor-based pagination works better for mobile because clients can resume from where they left off after a disconnect.
    • Include a sync endpoint: GET /orders?updated_since=2024-01-15T10:30:00Z so the client can fetch only what changed since last sync.
  6. Delta / Patch responses:
    • Instead of returning the full resource on every request, return only changed fields since the client’s last known version.
    • GET /users/42?since_version=5 returns {"version": 7, "changes": {"email": "new@example.com"}}.
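The conditional-request flow from principle 3 can be sketched on the server side as follows. This is a framework-free illustration: `make_etag` and `conditional_get` are hypothetical helpers, and deriving the ETag from a hash of the JSON body is one common choice, not the only one.

```python
import hashlib
import json
from typing import Optional


def make_etag(resource: dict) -> str:
    """Derive a strong ETag from the serialized resource."""
    body = json.dumps(resource, sort_keys=True).encode()
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def conditional_get(resource: dict, if_none_match: Optional[str]):
    """Return (status, headers, body) for a GET with an optional If-None-Match."""
    etag = make_etag(resource)
    headers = {"ETag": etag, "Cache-Control": "max-age=300"}
    if if_none_match == etag:
        return 304, headers, None   # unchanged: no body is sent
    return 200, headers, resource


order = {"id": 42, "status": "shipped"}
first_status, first_headers, first_body = conditional_get(order, None)
repeat_status, _, repeat_body = conditional_get(order, first_headers["ETag"])
```

The second call carries the ETag from the first response and receives a 304 with no body, which is where the bandwidth saving on cellular comes from.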
Real-world example: Instagram’s API returns different image resolutions based on the device’s screen density. The BFF aggregates feed data, user info, and stories into a single response. They use aggressive client-side caching with conditional requests to minimize data transfer.
Red flag answer: “Just use the same API for web and mobile” without considering bandwidth, latency, and offline constraints.
Follow-up questions:
Q: A mobile user submits a form, goes into a tunnel (no network), then comes back online. How do you ensure the submission is not lost?
(1) The mobile app queues the request locally (SQLite or local storage) when offline. (2) When connectivity returns, the app retries with the original idempotency key. (3) The server processes the request (or returns the cached response if already processed). (4) The app reconciles local state with the server response. This is the “offline queue” pattern. Libraries like Workbox (web) and custom sync managers (mobile) implement this. The idempotency key is generated at form submission time, not at send time, ensuring retries are deduplicated.
Q: Your mobile app makes 15 API calls to render the home screen. Users on 3G networks see a 6-second load time. How do you fix this?
(1) Build a BFF that aggregates all 15 calls into 1-2 server-side calls (server-to-server is fast, 1-5ms per call). (2) The BFF returns a single optimized response for the home screen. (3) Use HTTP/2 multiplexing so remaining calls share one connection. (4) Implement server-side caching in the BFF for frequently accessed data. (5) Prioritize above-the-fold data: return critical content first, load supplementary content lazily. (6) Compress responses (gzip reduces JSON by 60-80%). (7) Consider using Protobuf for the most data-heavy endpoints. Target: < 1 second for first meaningful paint.
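The server half of the offline-queue pattern in the first follow-up can be sketched like this. It is an in-memory illustration only: `SubmissionHandler` is a hypothetical name, and a real service would keep the key-to-response map in a shared store with an expiry.

```python
import uuid


class SubmissionHandler:
    def __init__(self):
        self._responses = {}  # idempotency key -> stored response
        self.processed = 0    # how many submissions actually ran

    def handle(self, idempotency_key: str, payload: dict) -> dict:
        if idempotency_key in self._responses:
            # Retry of a request already processed: replay, no side effects.
            return self._responses[idempotency_key]
        self.processed += 1
        response = {"status": 201, "submission_id": self.processed}
        self._responses[idempotency_key] = response
        return response


# The client generates the key once, when the user submits the form,
# and reuses it for every retry of that submission.
idempotency_key = str(uuid.uuid4())
handler = SubmissionHandler()
first = handler.handle(idempotency_key, {"comment": "hello"})
retry = handler.handle(idempotency_key, {"comment": "hello"})
```

Because the key is tied to the submission rather than to the network attempt, the retry after the tunnel returns the original response instead of creating a duplicate.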
What interviewers are really testing: Whether you can design a comprehensive observability system that lets you answer “what is happening and why” across a distributed system.
Answer:
Observability is the ability to understand the internal state of a system by examining its external outputs. For microservices, this requires three pillars working together: metrics, logs, and traces.
The three pillars:
  1. Metrics — Numerical measurements aggregated over time. “What is happening?”
    • RED method (for services): Rate (requests/sec), Errors (error rate), Duration (latency distribution).
    • USE method (for resources): Utilization, Saturation, Errors.
    • Tools: Prometheus + Grafana, Datadog, CloudWatch.
    • Key metrics for APIs: p50/p95/p99 latency per endpoint, error rate (4xx and 5xx separately), RPS, concurrent connections, queue depth, consumer lag.
  2. Logs — Structured event records. “What happened in this specific request?”
    • Structured JSON logs with trace IDs (covered in Q41).
    • Tools: ELK, Loki + Grafana, Datadog Logs, Splunk.
  3. Traces — Distributed request flow visualization. “Where did the time go?”
    • Spans across services showing latency breakdown (covered in Q40).
    • Tools: Jaeger, Zipkin, Datadog APM, Honeycomb.
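To make the RED metrics concrete, the sketch below computes rate, error rate, and latency percentiles from raw samples for one time window. It uses the nearest-rank percentile definition as an assumption; monitoring systems such as Prometheus estimate percentiles from histograms instead, and the function names here are hypothetical.

```python
import math


def percentile(samples: list, p: int):
    """Nearest-rank percentile: the smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(latencies_ms: list, error_count: int, window_seconds: int) -> dict:
    total = len(latencies_ms)
    return {
        "rate_rps": total / window_seconds,   # Rate
        "error_rate": error_count / total,    # Errors
        "p50_ms": percentile(latencies_ms, 50),  # Duration
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }


# 100 requests in a 10-second window with latencies of 1..100 ms, 5 of them errors.
summary = red_summary(list(range(1, 101)), error_count=5, window_seconds=10)
```

Reporting p95 and p99 alongside p50 matters because an average or median can look healthy while the slowest few percent of requests are seconds long.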
The fourth pillar — Profiling (emerging): Continuous profiling shows where CPU and memory are spent across your fleet. Tools: Pyroscope, Datadog Continuous Profiler, Parca. Useful when metrics show high CPU but you need to know which function is consuming it.
How the pillars connect:
  1. A dashboard (metrics) shows a latency spike on the Order service at 3:15 PM.
  2. Click the spike data point — exemplar links to a specific trace.
  3. The trace shows the Order service waiting 4 seconds on a database call.
  4. Click the slow span — linked logs show a specific SQL query with a missing index.
  5. Total diagnosis time: 3 minutes.
SLIs, SLOs, and Error Budgets (Google SRE approach):
  • SLI (Service Level Indicator) — A measurable metric, without a target attached. Example: “the percentage of requests that complete in < 200ms.”
  • SLO (Service Level Objective) — The target for the SLI. Example: “99.9% availability over 30 days.”
  • Error Budget — The acceptable amount of unreliability. If your SLO is 99.9%, you have a 0.1% error budget (~43 minutes of downtime per month). When the error budget is exhausted, freeze deployments and focus on reliability.
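The error budget figure above follows directly from the SLO and the window length. A small worked calculation, with hypothetical function names:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def remaining_budget_minutes(
    slo: float, consumed_fraction: float, window_days: int = 30
) -> float:
    """Budget left after a given fraction has already been burned."""
    return error_budget_minutes(slo, window_days) * (1 - consumed_fraction)


monthly_budget = error_budget_minutes(0.999)              # 30 days = 43,200 minutes
after_bad_week = remaining_budget_minutes(0.999, 0.80)    # 80% of the budget burned
```

For a 99.9% SLO over 30 days this gives roughly 43.2 minutes; after burning 80% of it, under 9 minutes remain for the rest of the window.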
Alerting best practices:
  • Alert on symptoms (high error rate, high latency) not causes (CPU high, disk full). Symptoms directly impact users.
  • Use multi-window alerts: fire only if the error rate exceeds the threshold for both the last 5 minutes AND the last 1 hour (reduces false positives).
  • Page (wake someone up) only for customer-impacting issues. Everything else is a ticket.
Anti-patterns:
  • Dashboard without alerts — Nobody watches dashboards 24/7. If a panel tracks an SLO or a customer-facing symptom, it needs an alert behind it.
  • Alerting on everything — Alert fatigue. Engineers start ignoring alerts. Only alert on SLO violations and customer-impacting symptoms.
  • Metrics without context — “CPU is high” is useless without: which service, which pod, what is it doing, and which customers are affected.
Red flag answer: “We have Grafana dashboards” without discussing how metrics, logs, and traces connect, or not knowing what SLOs are.
Follow-up questions:
Q: You join a team with 30 microservices and zero observability. Where do you start?
Priority order: (1) Structured logging with trace IDs to stdout (1-2 days per service with a shared library). (2) Log aggregation (deploy Loki + Grafana or ELK — 1 week). (3) Basic metrics: RED metrics per service via Prometheus + Grafana (1 week). (4) Distributed tracing: instrument with OpenTelemetry, deploy Jaeger (1-2 weeks). (5) Define SLOs for the top 3 customer-facing services. (6) Set up alerting on SLO violations. Total time to basic observability: 4-6 weeks. Do not try to build perfect observability for all 30 services at once — start with the 5 most critical services and expand.
Q: Your SLO is 99.9% availability. You have burned through 80% of your error budget in the first week of the month. What do you do?
(1) Immediately investigate what caused the error budget burn (deploy gone wrong? dependency failure? traffic spike?). (2) Pause all non-essential deployments until the error budget recovers or the root cause is fixed. (3) If a specific deploy caused it, roll back. (4) Review the incident and add preventive measures (better canary analysis, more integration tests). (5) If the SLO is too tight for the system’s current reliability, have a conversation with stakeholders about adjusting it. Google SRE teams literally freeze feature deployments when error budgets are exhausted — this incentivizes teams to invest in reliability.
What interviewers are really testing: Whether you understand that microservices communication is not just REST calls and can design systems that use asynchronous events for loose coupling.
Answer:
Request-Response (Synchronous):
  • Service A sends a request to Service B and waits for a response.
  • Direct coupling: A must know about B, and A blocks until B responds.
  • Simple, predictable, easy to debug (single request/response flow).
  • Fragile: if B is down, A fails.
Event-Driven (Asynchronous):
  • Service A publishes an event (“OrderCreated”) to a message broker (Kafka, RabbitMQ).
  • Services B, C, D independently subscribe and react.
  • Loose coupling: A does not know (or care) who consumes the event.
  • Resilient: if B is down, events queue up and are processed when B recovers.
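The decoupling in the event-driven model can be shown with a toy broker. This is an in-process, synchronous stand-in for Kafka or RabbitMQ, so it does not demonstrate queuing while a consumer is down; the class and handler names are hypothetical.

```python
from collections import defaultdict


class InMemoryBroker:
    def __init__(self):
        self._subscribers = defaultdict(list)  # event type -> handlers

    def subscribe(self, event_type: str, handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # The publisher does not know who, if anyone, is listening.
        for handler in self._subscribers[event_type]:
            handler(payload)


payments_seen = []
inventory_seen = []

broker = InMemoryBroker()
broker.subscribe("OrderCreated", payments_seen.append)   # Service B
broker.subscribe("OrderCreated", inventory_seen.append)  # Service C

# Service A publishes once; both consumers react independently.
broker.publish("OrderCreated", {"order_id": 42})
```

Adding a third consumer is one more `subscribe` call; the publishing service does not change, which is the loose coupling the list above describes.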
When to use each:
| Scenario | Best approach |
| --- | --- |
| Client needs an immediate response | Request-Response |
| User-facing API call | Request-Response |
| Notifying multiple services of a change | Event-Driven |
| Operations that can be processed later | Event-Driven |
| Long-running workflows | Event-Driven (Saga) |
| Data sync between services | Event-Driven (CDC) |
| Audit logging | Event-Driven |
Event-driven challenges:
  • Debugging — An event is published, and something goes wrong downstream. Tracing requires correlation IDs across the event chain.
  • Ordering — Events may arrive out of order. Design consumers to be order-independent or use Kafka’s partition ordering guarantees.
  • Schema evolution — Events are contracts. Changing an event’s schema affects all consumers. Use a Schema Registry.
  • Eventual consistency — After publishing an event, the system is temporarily inconsistent. Clients may not see changes immediately.
  • Zombie events — Consumers that are never removed continue processing events, wasting resources.
Hybrid approach (most common in practice): Use synchronous request-response for queries (reads) and event-driven for commands (writes/mutations). The Order API responds synchronously to GET /orders/42 but publishes an OrderCreated event for downstream processing (payment, inventory, notifications).
Red flag answer: “Everything should be event-driven for loose coupling” without acknowledging the debugging, consistency, and operational complexity. Or not knowing when synchronous communication is the better choice.
Follow-up questions:
Q: Your team publishes 50 different event types. Documentation is scattered and consumers do not know what events are available. How do you fix this?
(1) Create an Event Catalog (AsyncAPI spec is the standard — like OpenAPI but for async APIs). Document every event type with schema, example payload, producer, and expected consumers. (2) Use a Schema Registry (Confluent, AWS Glue) that enforces schema compatibility and serves as a discovery mechanism. (3) Build an internal “event storefront” — a UI where teams can browse available events, subscribe, and see example payloads. (4) Require every new event type to be registered in the catalog before it can be published (CI check).
Q: Service A publishes an event, Service B consumes it and publishes another event, Service C consumes that and calls Service D synchronously. Service D fails. How do you debug this?
This is why correlation IDs (trace IDs) are non-negotiable in event-driven architectures. (1) Every event must carry a correlation ID originating from the initial trigger. (2) When Service B publishes its event, it includes the same correlation ID. (3) When Service C makes a synchronous call, it passes the correlation ID in a header. (4) Search your centralized logging by correlation ID to see the full chain: A’s event, B’s processing, B’s event, C’s processing, C’s call to D, D’s failure. (5) Distributed tracing (OpenTelemetry) supports async span propagation — the trace links all of these together into one visual timeline.
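Steps (1) to (3) of that answer can be sketched as plain functions. The event shape, the function names, and the `X-Correlation-ID` header name are assumptions for illustration; the header is a common convention rather than a standard, and OpenTelemetry uses its own `traceparent` header.

```python
import uuid
from typing import Optional


def new_event(event_type: str, payload: dict, parent: Optional[dict] = None) -> dict:
    """Create an event, inheriting the correlation ID from the triggering event."""
    correlation_id = parent["correlation_id"] if parent else str(uuid.uuid4())
    return {"type": event_type, "payload": payload, "correlation_id": correlation_id}


def outbound_headers(event: dict) -> dict:
    """Headers for a synchronous call made while handling this event."""
    return {"X-Correlation-ID": event["correlation_id"]}


# Service A: the initial trigger mints the correlation ID.
order_created = new_event("OrderCreated", {"order_id": 42})

# Service B: reacts and publishes its own event with the same ID.
payment_requested = new_event("PaymentRequested", {"order_id": 42}, parent=order_created)

# Service C: calls Service D synchronously and forwards the ID in a header.
headers_to_d = outbound_headers(payment_requested)
```

With the same ID on both events and on the HTTP call, a single log search by that value returns the whole chain from A through D.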