Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Part VIII — APIs and Integration

Real-World Story: Stripe’s API Design Philosophy. When Stripe launched in 2011, the payments landscape was dominated by clunky SOAP APIs with 70-page integration guides. Stripe made a radical bet: treat the API itself as the product. Every endpoint followed consistent conventions — predictable resource naming, uniform error formats, idempotency keys on every mutating operation, and versioning baked in from day one. The result? Developers could go from zero to processing payments in under an hour. Stripe pins each merchant to the API version they integrated with, so deploying a new version never breaks existing integrations. Their API changelog is a masterclass in backward compatibility. Today, Stripe’s API is widely regarded as the gold standard for public API design — not because it uses exotic technology, but because every design decision was made from the developer-consumer’s perspective. The lesson: API design is UX design for developers.
Real-World Story: Airbnb’s Migration from Monolithic MySQL to Service-Oriented Data Architecture. By 2015, Airbnb’s entire platform ran on a single monolithic MySQL database. Every team — search, payments, messaging, trust — shared the same database, the same connection pool, and the same on-call rotation when things broke. A slow query from the messaging team could starve the payments team of connections, and schema migrations required cross-team coordination that could take weeks. Airbnb began decomposing their monolith into domain-specific services, each owning its own data store. Search moved to Elasticsearch. The messaging team got its own database. Payment data was isolated behind a service boundary with strict access controls. The key insight was not “use microservices” — it was “data ownership must follow team ownership.” When one team’s schema change can break another team’s queries, you do not have independent services, you have a distributed monolith with extra network hops. Airbnb’s migration took years and required building internal tooling for data replication between services (they open-sourced SpinalTap, a MySQL changelog stream). The payoff: teams could evolve their schemas independently, scale their data stores to match their specific access patterns, and on-call pages dropped dramatically because failures were isolated.
Cross-chapter context. This chapter covers API design fundamentals and database foundations. Several companion chapters go deeper on specific topics introduced here:
  • Database Deep Dives — PostgreSQL internals (MVCC, VACUUM, query planner), DynamoDB single-table design and partition strategies, MongoDB aggregation pipelines, and Redis architecture. When this chapter says “add an index,” that chapter explains why the B-tree is 4 levels deep and what happens during a page split.
  • GraphQL at Scale — Federation, persisted queries, query complexity analysis, and schema governance. This chapter introduces GraphQL as an API paradigm; that chapter covers what it takes to run it in production with multiple teams and billions of requests.
  • API Gateways & Service Mesh — Rate limiting, authentication, TLS termination, and request routing at the gateway level. This chapter covers rate limiting from the API designer’s perspective; that chapter covers it from the infrastructure perspective (Kong, Envoy, Istio).
  • Cloud Service Patterns — RDS vs Aurora, DynamoDB capacity modes, ElastiCache deployment patterns, and RDS Proxy for serverless connection management. When this chapter discusses database selection, that chapter covers the specific AWS managed services and their operational characteristics.

Chapter 15: API Design

The 3 Rules of Great API Design. Every well-designed API — Stripe, Twilio, GitHub — follows the same three principles whether they know it or not:
  1. Predictable — Consistent naming conventions everywhere. If one endpoint uses created_at, every endpoint uses created_at (not createdAt in some and creation_date in others). If GET /users returns a list, GET /orders returns a list in the same envelope. A developer who has used one endpoint can guess the shape of any other endpoint.
  2. Forgiving — When something goes wrong, the API tells you exactly what happened and how to fix it. Not 400 Bad Request with no body — but { "error": { "code": "invalid_currency", "message": "Currency 'usd' is not valid. Did you mean 'USD'? Currencies must be uppercase ISO 4217 codes.", "param": "currency" } }. Good error messages are the best API documentation.
  3. Evolvable — Versioning is baked in from day one, not bolted on when the first breaking change happens. Every response includes only additive fields. Removing or renaming a field goes through a deprecation cycle. The API you ship on launch day should be designed assuming it will still be running in five years with clients you have never heard of.
These are not nice-to-haves. If your API is unpredictable, developers waste hours guessing conventions. If it is unforgiving, every integration turns into a support ticket. If it is not evolvable, your first breaking change becomes a crisis.
Over-Designing APIs. Building a perfectly RESTful API with HATEOAS, content negotiation, and every status code used correctly — for an internal service called by one frontend. Match the API design investment to the consumer count. Internal APIs between two services you own: keep it simple, use a few conventions, move fast. Public APIs consumed by hundreds of third-party developers: invest heavily in design, documentation, versioning, and backward compatibility.
Strong answer:
  1. Authentication mechanism — API keys for simplicity, OAuth 2.0 if delegated access is needed.
  2. Versioning strategy — URL-based (/v1/) for discoverability, with a deprecation policy from day one.
  3. Pagination — cursor-based (not offset) for consistency at scale; every list endpoint returns pagination metadata.
  4. Error format — consistent error response schema across all endpoints (error code, message, details, request_id for support).
  5. Rate limiting — per API key, with 429 Too Many Requests and Retry-After header.
Also: use consistent naming conventions (snake_case or camelCase, not a mix), return request_id in every response for debugging, and write OpenAPI spec before writing code.What weak candidates say vs what strong candidates say:
  • Weak: “I would use REST with JSON and add authentication.” — Too vague, no specifics on versioning, pagination, or error handling. Shows no awareness of the decisions that cause real pain later.
  • Strong: “My first five decisions would be versioning, auth, pagination, error format, and rate limiting — because those are the five things that are hardest to change after launch and cause the most partner friction when done wrong.” — Shows prioritization and understanding of irreversibility.
Senior vs Staff lens. A senior engineer focuses on getting the five decisions right for the initial launch: choosing URL-path versioning, picking OAuth 2.0, implementing cursor pagination, defining the error schema, and configuring rate limits. A staff/principal engineer adds: “I would also define our backward compatibility contract in writing before launch — what constitutes a breaking change, our deprecation timeline, how we handle additive changes — because the decisions we make in month 1 constrain every API evolution for the next 5 years. I would set up a schema registry with CI-enforced breaking-change detection from day one, not after the first incident.”
Follow-up chain:
  • Failure mode: “What happens if you skip the idempotency decision at launch?” — Every retry creates duplicates. Retrofitting idempotency keys onto an existing API is a breaking change (new required header). Partners who integrated without it will resist.
  • Rollout: “How do you roll out rate limiting to existing consumers without breaking them?” — Announce 30 days before enforcement, start in monitoring-only mode (log but do not reject), then enforce with generous initial limits. Tighten over time.
  • Measurement: “How do you know your API design is working?” — Track time-to-first-successful-call for new integrators (target: <5 minutes with docs), support ticket rate per 1,000 API calls, and error rate breakdown by status code.
  • Cost: “What is the cost of getting versioning wrong?” — Stripe maintains hundreds of API versions through a transformation chain. This is viable at $14B revenue. For most companies, rearchitecting versioning after launch costs 3-6 months of engineering time.
  • Security: “What auth decision would you avoid for a public API?” — Never use API keys alone without HTTPS. Never embed secrets in URLs (they appear in logs, browser history, referer headers). Always support key rotation without downtime.
  • Governance: “Who owns the API design decisions?” — A single API design lead or working group, not each team independently. Inconsistency across endpoints is the number-one developer complaint.
Work-sample prompt: “You are reviewing an API spec draft where GET /users returns camelCase JSON but GET /orders returns snake_case. The team says it is fine because different developers wrote them. Write the Slack message you would send to the team explaining why this must be fixed before launch.”
Structured Answer Template.
  1. Name the five decisions — versioning, auth, pagination, error format, rate limiting. Lead with the list; defend each in one sentence.
  2. Explain irreversibility — each decision becomes a contract once partners integrate. Retrofitting breaks them.
  3. Tie each to a failure mode — “no idempotency = duplicate charges on retry,” “offset pagination = timeouts at scale.”
  4. Mention the artifacts — OpenAPI spec, changelog, deprecation policy. Artifacts outlive conversations.
  5. Close with governance — one design lead, not each team winging it.
Real-World Example. Stripe’s API was designed at launch (2011) with exactly this set of decisions locked in — API keys for auth, cursor pagination, consistent error envelope with code/message/param/type, Idempotency-Key header on every mutation, per-key rate limits. Over a decade later, they have never had to rewrite the basics. Contrast with Twitter’s v1 API, which lacked idempotency keys and proper cursor pagination — retrofitting both required breaking changes that took years to resolve.
Big Word Alert — cursor pagination. Opaque, base64-encoded tokens that encode the position of the last item instead of a row offset. Immune to insertion/deletion shifts and fast at any depth. Say “cursor pagination with stable sort key” when contrasting with offset — it signals you understand why offset breaks at scale.
Big Word Alert — idempotency key. A client-generated unique identifier (UUID) sent with mutating requests so the server can deduplicate retries. Retried requests with the same key return the stored response, not re-execute. Mention the 24-hour TTL (Stripe’s standard) when discussing implementation.
Follow-up Q&A Chain:Q: What is the single most expensive mistake you can make on day 1 of API design? A: Shipping without idempotency keys on mutating endpoints. Every retry — and there will be millions of retries across network blips, client timeouts, and load balancer failovers — creates a duplicate. Retrofitting idempotency keys is a breaking change that requires every consumer to send a new header. Stripe’s entire financial integrity depends on having shipped with idempotency keys from day one.Q: How do you enforce API consistency across 50 engineers working on different endpoints? A: Three layers: (1) a written style guide (Zalando or Microsoft’s API guidelines are excellent starting points), (2) CI linting with spectral (OpenAPI linter) that fails builds on style violations like camelCase/snake_case mixing, and (3) an API design review board that signs off on new endpoints before they ship. The board is the backstop for judgment calls the linter cannot catch.Q: A developer pushes back and says “versioning is premature optimization; we’ll add it when we need it.” What do you say? A: Adding versioning later is a breaking change — every existing client now has to send a version header they did not send before, or migrate from unversioned /users to /v1/users. Adding versioning on day one is a 30-minute decision that costs nothing. The asymmetry is obvious: cheap to do now, impossible to do later without breakage. It is not premature — it is the cheapest point in the project’s lifetime to make the decision.
Further Reading:
LLMs and AI coding assistants are changing how engineers approach API design:
  • OpenAPI spec generation: Tools like GitHub Copilot and Claude can generate a complete OpenAPI 3.0 spec from a natural-language description of your resources and relationships. This accelerates the “spec-first” approach — you describe the domain, the LLM generates the spec, and you refine it. What used to take a day of YAML editing takes 30 minutes of iterating with an AI assistant.
  • Consistency enforcement: Ask an LLM to review your OpenAPI spec for naming inconsistencies, missing error codes, or pagination gaps. It catches the “camelCase here, snake_case there” problems faster than manual review.
  • Client SDK generation: With a clean OpenAPI spec, tools like openapi-generator produce client SDKs in 40+ languages. LLMs can customize the generated code to match your internal patterns.
  • What LLMs cannot do: They cannot make the judgment calls — whether to use URL-path or header versioning, whether your error format should include stack traces in development mode, or whether your rate limits should allow bursts. These decisions require understanding your consumers, your team’s operational maturity, and your business model. AI accelerates the execution; humans own the strategy.

15.1 REST API Design

Resources as nouns (/users), HTTP methods for actions, plural nouns, hierarchy for relationships (/users/123/orders).
Status codes: 200 OK, 201 Created, 204 No Content, 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 409 Conflict, 422 Unprocessable Entity, 429 Too Many Requests, 500 Internal Server Error. Pagination: Offset-based (?page=2&size=20) is simple but slow on large datasets. Cursor-based (?cursor=abc&size=20) is consistent regardless of size. Always return pagination metadata.

API Pagination Patterns — Cursor vs Offset vs Keyset

Pagination seems simple until you have 50 million rows, real-time inserts, and clients that expect stable page results. The strategy you choose has direct consequences for database performance, client complexity, and data consistency. Here is what actually works. Three approaches compared:
CriteriaOffset-BasedCursor-Based (Opaque)Keyset (Seek)
Request format?page=3&size=20 or ?offset=40&limit=20?cursor=eyJpZCI6MTIzfQ&limit=20?after_id=123&limit=20
SQL generatedSELECT ... ORDER BY id LIMIT 20 OFFSET 40SELECT ... WHERE id > 123 ORDER BY id LIMIT 20 (decoded from cursor)SELECT ... WHERE id > 123 ORDER BY id LIMIT 20
Performance at page 1FastFastFast
Performance at page 10,000Slow — DB scans and discards 200,000 rowsFast — index seek, O(log n)Fast — index seek, O(log n)
Stable results during inserts/deletesNo — inserting a row shifts all subsequent pages (users see duplicates or miss items)Yes — cursor is anchored to a specific positionYes — keyset is anchored to a specific value
Jump to arbitrary pageYes (?page=500)No — must traverse sequentiallyNo — must traverse sequentially
Sort flexibilityAny sort orderAny sort order (encoded in cursor)Requires a unique, sortable column in the WHERE clause
Client complexityLow — just increment page numberLow — pass opaque cursor from previous responseMedium — client must track the last-seen sort key
CacheableYes (each page URL is stable if data does not change)Difficult (cursor encodes position, varies per user)Difficult (same issue as cursor)
Best forAdmin dashboards, small datasets (<100K rows), paginated UI with page numbersPublic APIs, mobile feeds, any large datasetInternal APIs, streaming through large datasets, data exports
Used byMany internal toolsStripe, Slack, GitHub GraphQL (as Connections)Twitter timeline (originally), many analytics APIs
Why offset pagination breaks at scale:
-- Page 1: fast (reads 20 rows)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 0;

-- Page 10,000: slow (reads 200,020 rows, discards 200,000)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 200000;
PostgreSQL must scan 200,000 index entries just to throw them away. At 500M rows, deep offset pagination can take seconds or minutes. There is no index optimization that fixes this — it is fundamental to how OFFSET works. Cursor-based pagination done right:
-- First page
SELECT id, name, created_at FROM orders
ORDER BY created_at DESC, id DESC
LIMIT 21;  -- fetch one extra to determine has_more

-- Subsequent pages (cursor decodes to created_at + id)
SELECT id, name, created_at FROM orders
WHERE (created_at, id) < ('2025-12-01T10:30:00Z', 9847)
ORDER BY created_at DESC, id DESC
LIMIT 21;
This uses a composite (created_at, id) comparison for stable ordering even when multiple rows share the same created_at. The cursor is an opaque base64 encoding of these values — clients never parse it. Response envelope for cursor pagination:
{
  "data": [ ... ],
  "pagination": {
    "has_more": true,
    "next_cursor": "eyJjcmVhdGVkX2F0IjoiMjAyNS0xMi0wMVQxMDozMDowMFoiLCJpZCI6OTg0N30=",
    "previous_cursor": "eyJjcmVhdGVkX2F0IjoiMjAyNS0xMi0wMVQxMTo0NTowMFoiLCJpZCI6OTg2N30="
  }
}
Decision rule: If you need page numbers in the UI (admin tables, search results) and the dataset is small (<100K rows), offset is fine — do not over-engineer. For everything else — public APIs, mobile feeds, anything with large datasets or real-time inserts — use cursor-based pagination. The performance difference is not theoretical: offset pagination at page 10,000 on a 50M-row table typically takes 2-5 seconds; cursor pagination at any depth takes 2-5 milliseconds.
The “fetch N+1” trick. Request limit + 1 rows from the database. If you get limit + 1 results, there is a next page — return has_more: true and drop the extra row. If you get limit or fewer, there is no next page. This avoids a separate COUNT(*) query, which on a 500M-row table can take 30+ seconds.
GraphQL pagination. The GraphQL community standardized on the Relay Connection specification, which is cursor-based pagination with a specific shape: edges (list of nodes with cursors), pageInfo (hasNextPage, hasPreviousPage, startCursor, endCursor). If you are building a GraphQL API, follow this spec — it is what every GraphQL client library expects. See the GraphQL at Scale chapter for implementing Connections efficiently with DataLoader and avoiding N+1 problems in paginated resolvers.

15.2 gRPC

Binary protocol using Protocol Buffers over HTTP/2. Created by Google to replace their internal RPC framework (Stubby) — designed for high-throughput, low-latency service-to-service communication. How it works: Define your service in a .proto file (service methods + message types). The protoc compiler generates client and server code in your language. The client calls methods on a generated stub that looks like a local function call — the gRPC framework handles serialization, HTTP/2 framing, and network transport. Why it is faster: Protocol Buffers are 3-10x smaller than JSON (binary, no field names in payload). HTTP/2 multiplexes multiple RPCs over a single TCP connection (no connection-per-request overhead). Generated code eliminates serialization bugs and reduces boilerplate. Streaming: Server-side streaming (server sends a stream of responses to one request — live updates, large result sets). Client-side streaming (client sends a stream of messages — file upload, batch ingestion). Bidirectional streaming (both stream simultaneously — chat, real-time collaboration). Trade-offs: Not browser-friendly (needs gRPC-Web or Envoy proxy). Not human-readable (cannot curl and read the response — need grpcurl or Postman). No HTTP caching (all requests are POST). Steeper learning curve (proto files, code generation, streaming complexity). For public APIs consumed by third-party developers, REST is still the standard.
Protocol Buffers. A language-neutral, platform-neutral, extensible mechanism for serializing structured data. You define your data structures in .proto files, and the compiler generates code in your language. Much smaller and faster to serialize/deserialize than JSON.

15.3 Idempotency

An idempotent operation produces the same result regardless of how many times it is called. GET /user/123 is naturally idempotent (reading does not change state). POST /orders is NOT naturally idempotent (calling it twice creates two orders). Making non-idempotent operations safe for retries is one of the most important distributed systems skills. Implementation: Client generates a unique idempotency key (UUID) per request. Server checks if it has processed that key. If yes, returns the stored response. If no, processes and stores. Essential for payments, order creation, any operation where duplicates have real consequences.

Idempotency Key Patterns in Practice

Idempotency keys are more nuanced than “store a UUID.” Here is how production systems handle them: The basic pattern:
  1. Client generates a UUID v4 and sends it in the Idempotency-Key header.
  2. Server checks a lookup table (Redis or database) for the key.
  3. If key exists and the request completed: return the stored response (same status code, same body).
  4. If key exists and the request is in-progress: return 409 Conflict (prevents concurrent retries from double-processing).
  5. If key does not exist: create the record with status processing, execute the operation, store the response, update status to completed.
How payment systems handle this (Stripe’s model):
  • Idempotency keys are scoped to the API key (two different merchants can use the same key string without collision).
  • Keys expire after 24 hours — retrying after that creates a new operation.
  • If a retried request has a different body than the original, the server returns an error (the key is bound to the original parameters).
  • The stored response includes the exact HTTP status code and body — a retried POST /charges that originally returned 402 Payment Failed returns 402 again, not 200.
Database schema for idempotency:
CREATE TABLE idempotency_keys (
  key         TEXT PRIMARY KEY,
  api_key_id  BIGINT NOT NULL,
  request_path TEXT NOT NULL,
  request_hash TEXT NOT NULL,  -- hash of the request body
  status      TEXT NOT NULL DEFAULT 'processing',  -- processing | completed | failed
  response_code INT,
  response_body JSONB,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  expires_at  TIMESTAMPTZ NOT NULL DEFAULT now() + INTERVAL '24 hours'
);
For critical operations like payments, wrap the idempotency check and the business operation in a single database transaction. This prevents a crash between “recorded the key” and “completed the operation” from leaving the system in a state where the key exists but the operation never finished. Use the status field to detect and recover from this — a processing key older than a timeout threshold should be retried.
Senior vs Staff lens. A senior engineer implements idempotency keys correctly: UUID generation, database lookup, stored responses, TTL expiry. A staff/principal engineer additionally considers: “What happens when the idempotency store itself becomes a scaling bottleneck at 10M writes/day? I would partition the idempotency table by created_at, use the database (not Redis) as the authoritative store for payment operations, add Redis as a fast-path cache, and design the TTL and cleanup strategy to keep the table bounded. I would also instrument the idempotency hit rate — if 15% of requests are retries hitting the idempotency store, that tells me we have a reliability problem upstream causing excessive retries.”
Work-sample prompt: “Debug this: A payment service uses Redis for idempotency key storage. During a Redis failover (30-second window), three customers report being charged twice for the same order. The team says Redis has persistence enabled. Explain the failure mode and propose a fix that survives Redis outages.”
LLMs accelerate idempotency implementation in specific ways:
  • Schema generation: Describe your idempotency requirements to an LLM and it generates the database schema, the Lua script for Redis fast-path checks, and the application middleware — in minutes rather than hours of manual coding.
  • Edge-case discovery: Ask an LLM “what are the failure modes of this idempotency implementation?” and it will enumerate: crash between key insert and business logic, Redis/DB inconsistency, concurrent requests with the same key, key collision across API consumers, TTL expiry during long-running operations. This is faster than discovering these edge cases in production.
  • Code review: LLMs are excellent at spotting idempotency gaps in pull requests — “this POST endpoint modifies state but does not check for an idempotency key” or “the idempotency check and the database write are not in the same transaction.”
  • Limitation: LLMs may suggest overly complex solutions (distributed locks, two-phase commits) when a simple database transaction with an idempotency table is sufficient. Always prefer the simplest correct implementation.

15.4 API Versioning and Backward Compatibility

Safe changes: Adding optional fields, new endpoints, new optional parameters. Breaking changes: Removing fields, renaming fields, changing types, making optional fields required.

Versioning Strategies Compared

StrategyExampleProsConsUsed By
URL path/v1/users, /v2/usersMost discoverable, easy to route, easy to cacheURL pollution, harder to sunsetStripe, GitHub, Twitter
Custom headerX-API-Version: 2Clean URLs, flexibleHidden from browser, easy to forgetAzure, Jira
Accept header (content negotiation)Accept: application/vnd.myapi.v2+jsonHTTP-standards compliant, fine-grainedComplex, hard to test in browserGitHub (also supports URL)
Query parameter/users?version=2Easy to add, discoverableBreaks caching, messyRarely recommended
Recommendation: Use URL path versioning (/v1/) for public APIs — it is the most discoverable and requires no special client configuration. Reserve header-based versioning for internal APIs where you control all clients. Content negotiation is standards-purist but creates friction for third-party developers.
The Expand and Contract Pattern for APIs. To rename a field: (1) add the new field alongside the old one, (2) update clients to use the new field, (3) remove the old field. Never do step 1 and 3 together.
Real-world deprecation policy example (modeled after Stripe):
  1. Announce deprecation in API changelog and response headers (Deprecation: true, Sunset: 2026-12-01).
  2. Give at least 12 months notice for major version sunsetting.
  3. New accounts default to the latest version; existing accounts are pinned to their current version.
  4. Provide a migration guide for every breaking change.

API Deprecation Programs — Running Them at Scale

Deprecation announcements without enforcement are wishful thinking. A real deprecation program has teeth and timelines. The deprecation funnel (what actually happens):
PhaseTimelineActionResponse Rate (typical)
AnnounceMonth 0Changelog entry, Sunset header on responses, email to all consumers5% migrate immediately
NudgeMonth 3Dashboard banner, targeted emails to top-20 consumers by volume, office hours for migration questions15-20% cumulative
PressureMonth 6Rate limits on deprecated endpoints tightened by 25%, new features only on latest version40-50% cumulative
WarnMonth 9Weekly emails to remaining consumers with countdown, Warning: 299 header on every response70-80% cumulative
BrownoutMonth 11Disable deprecated endpoints for 1-hour windows on a published schedule — consumers see real failures in non-prod first90-95% cumulative
SunsetMonth 12Return 410 Gone with migration guide URL, keep transformation code for emergency re-enable100% (forced)
Handling clients you do not control: The hardest part of deprecation is not the 200 partners you emailed — it is the 5,000 developers who built integrations without registering. They discover the deprecation when their integration breaks. Mitigations: (1) return machine-readable deprecation data in response headers on every call for months before sunset, (2) publish deprecation to status page RSS/webhooks that monitoring tools consume, (3) maintain a “legacy compatibility” tier where deprecated endpoints return the old format but with a 410 after 24 months (not 12). Clients you literally cannot reach: SDKs embedded in mobile apps have a long tail. An app released 2 years ago with your SDK v1 may still have 100K monthly active users who never update. You cannot email them — you do not know who they are. You cannot force an update — the app store controls that. Mitigations: (1) Instrument your API to track SDK version in a custom header (X-SDK-Version). Monitor the version distribution weekly. If 15% of traffic still comes from a deprecated SDK version, sunsetting the underlying API will break those users. (2) For mobile SDKs, build a “minimum supported version” mechanism: the SDK phones home on startup and receives a “please update” prompt if below minimum. (3) Accept that some old versions will exist forever. Your API versioning strategy must account for a 3-5 year tail, not just the 12-month deprecation cycle. This is why Stripe pins merchants to the API version they integrated with indefinitely rather than forcing upgrades. Deprecation of internal APIs between teams: Internal deprecation is harder socially and easier technically. Technically, you control all the clients — you can grep every codebase for the deprecated endpoint and submit PRs. Socially, teams resist migration because it is “not on their roadmap.” Effective internal deprecation requires executive sponsorship (the deprecation is a company priority, not just the API team’s preference), migration tooling (provide a codemod or migration script, not just documentation), and a hard deadline enforced by the platform (after date X, the API gateway returns 410 — no exceptions, no extensions). Cost of eternal backward compatibility: Stripe maintains hundreds of API versions simultaneously through a transformation chain. This is viable when API revenue justifies the engineering cost (~$14B revenue). For most companies, supporting 3-5 active versions is the practical ceiling. Each additional version costs roughly 2-4 engineering weeks per year in maintenance, testing, and incident response. Sunset aggressively — the cost of not sunsetting is not just code maintenance, it is the constraint on your ability to evolve the data model underneath.
Breaking Change Traps — API Edition.These look safe but will break real clients in production:
  1. Adding a required field to a request body. Even on a new version — existing clients on older SDKs will send requests missing the field. Always make new fields optional with sensible defaults.
  2. Changing a field from null to absent. JSON {"score": null} and {} (missing score) are different. Clients that check if response.score is not None behave differently than if "score" in response. Picking one convention and switching is breaking.
  3. Changing enum values or error codes. If clients switch on error.code == "card_declined" and you rename it to payment_declined, their error handling breaks silently — no parse error, just wrong code path.
  4. Changing the sort order of list endpoints. Clients that assume “first item is newest” will surface wrong data if you switch from created_at DESC to updated_at DESC.
  5. Tightening validation. A field that accepted 500 characters now rejects at 255. Existing data created under the old rules fails on update. Clients that worked yesterday fail today.
  6. Changing numeric precision. Returning 10.50 as 10.5 breaks clients doing string comparison on monetary amounts. Returning 10.500 breaks clients parsing with fixed-width expectations.
  7. Adding pagination to a previously unpaginated endpoint. Clients that expected the full list now get page 1. Their totals, reports, and reconciliation break silently.
  8. Wrapping a response in an envelope. Going from [{...}] to {"data": [{...}], "meta": {...}} breaks every client, even though the data is identical.
  9. Changing error response structure. Clients build error-handling logic around your error format. If {"error": "not_found"} becomes {"errors": [{"code": "not_found", "detail": "..."}]}, every client’s error parser breaks. Error format changes are breaking changes, even if the HTTP status code stays the same.
  10. Changing a timestamp format. Going from "2025-12-01T10:30:00Z" (ISO 8601 with Z) to "2025-12-01T10:30:00+00:00" (with explicit offset) breaks clients doing string comparison or using strict parsers. Both are valid ISO 8601, but they are different strings. Pick a format in v1 and never change it.
  11. Introducing rate limiting on a previously unlimited endpoint. Clients that poll every second suddenly get 429s. Even if the limit is generous, clients without retry logic will fail. Always announce rate limits at least 30 days before enforcement and provide Retry-After headers from day one.
  12. Making a synchronous endpoint asynchronous. Changing POST /exports from returning the result in the response body to returning 202 Accepted with a status URL breaks every client that expected the data in the response. This is an architectural change disguised as an API change.

Compatibility Under Rolling Deploys

During a rolling deployment, old and new versions of your service run simultaneously for minutes (or hours in large fleets). API responses from old instances and new instances must be compatible with clients calling either. The dual-compatibility window: When you deploy version N+1, the fleet runs a mix of N and N+1 for the duration of the rollout. Any client request might be routed to either version. If N+1 returns a new field that N does not, clients see the field appear and disappear randomly depending on which instance they hit. This causes intermittent test failures, flaky UIs, and hard-to-diagnose production bugs. Rules for safe rolling deploys:
  • New fields must be present in all instances before any client code references them — deploy the API change first, then the client change.
  • Removed fields must be absent from all client code before any instance stops returning them — deploy the client change first, then the API change.
  • Database schema changes must be compatible with both N and N+1 application code simultaneously (see the “Breaking Change Traps — DB Schema Edition” section below).
  • If a deploy takes 30 minutes to roll across 100 instances, any endpoint called during that window must return a response that satisfies clients expecting either version.
Contract Ownership Across Teams In a microservices architecture, the question “who owns this API contract?” determines how fast you can evolve.
ModelHow It WorksProsCons
Provider-drivenThe team owning the service defines the contract. Consumers adapt.Fast iteration for the provider, clear ownershipConsumers discover breaks at integration time, not design time
Consumer-driven (Pact)Each consumer publishes expectations. Provider runs consumer tests before deploying.Consumers are protected from unexpected breaksSlow provider evolution — every change needs consumer approval. Does not scale past ~10 consumers
Schema registry (shared artifact)A versioned schema (OpenAPI, protobuf, AsyncAPI) lives in a shared repo. CI checks breaking changes on every PR.Single source of truth, automated compatibility checksGovernance overhead, schema repo becomes a bottleneck
Recommendation for most teams: Provider-driven with a schema registry and CI-enforced breaking change detection. Use buf breaking for protobuf, openapi-diff for REST, or Pact for the 2-3 most critical consumer relationships. Full consumer-driven contracts are expensive to maintain and create coupling that slows both sides.

15.5 GraphQL

A query language for APIs where the client specifies exactly what data it needs. A single endpoint (/graphql) serves all queries. The client sends a query describing the shape of the response. Benefits: No over-fetching (client gets only requested fields). No under-fetching (client gets all needed data in one request, unlike REST which may require multiple endpoints). Strongly typed schema acts as documentation. Great for mobile clients where bandwidth matters. Trade-offs: Caching is harder (every request is a POST with a unique body — HTTP caching does not work). Rate limiting is harder (one query can be trivial or extremely expensive). N+1 problems move to the server (resolving nested fields can trigger many database calls without careful dataloader implementation). Security: clients can craft expensive nested queries — implement query depth limits and complexity analysis. When to use: Multiple client types needing different data shapes (web shows full details, mobile shows summary). Rapidly evolving frontend requirements. When the relationship between data is graph-like. When NOT to use: Simple CRUD APIs, file uploads, real-time streaming (use subscriptions or SSE), when your team is small and REST is working fine.
Tools: Apollo (client and server, most popular GraphQL ecosystem). Hasura (instant GraphQL over PostgreSQL). graphql-codegen (TypeScript type generation from schema).
Going deeper with GraphQL. This section covers when and why to choose GraphQL. For production-scale concerns — federation across multiple services, persisted queries for performance, query cost analysis for rate limiting, schema governance across teams, and security hardening against malicious queries — see the dedicated GraphQL at Scale chapter. That chapter covers what GitHub, Shopify, and Netflix learned after running GraphQL at billions of requests per month.
Connection: Idempotency (Part VIII) connects to retries (Part V — retry a failed request safely only if the operation is idempotent), duplicate messages (Part XV — at-least-once delivery means consumers must be idempotent), optimistic locking (Part IX — a retry that conflicts is safely rejected), and the outbox pattern (Part VII — an idempotent relay process can safely re-publish).

15.6 Rate Limiting

Control how many requests a client can make in a given time window. Protects against abuse, ensures fair resource distribution, and prevents cascade failures. Algorithms: Fixed window (simple, allows bursts at window boundaries). Sliding window (smoother, more complex). Token bucket (allows controlled bursts up to a limit, then steady rate). Leaky bucket (smoothest output rate, constant processing).

Rate Limiting Algorithms — Deep Comparison

AlgorithmHow It WorksBurst HandlingMemoryPrecisionBest For
Fixed windowCounter resets every N secondsAllows 2x burst at window boundary (end of one + start of next)Very low (1 counter per key)LowSimple internal APIs where boundary bursts are acceptable
Sliding window logStores timestamp of each request, counts within rolling windowNo boundary burstHigh (stores every timestamp)ExactLow-volume APIs where precision matters more than memory
Sliding window counterWeighted average of current and previous windowMinimal boundary burstLow (2 counters per key)ApproximateMost production APIs — good balance of accuracy and memory
Token bucketBucket fills at steady rate, each request takes a token, bucket has max capacityAllows controlled bursts up to bucket size, then steady rateLow (token count + last refill time)GoodAPIs where short bursts are acceptable (Stripe, AWS)
Leaky bucketRequests enter a queue processed at constant rateNo bursts — excess requests are queued or droppedModerate (queue)Smoothest outputTraffic shaping, when downstream systems need constant rate
Token bucket vs sliding window — the practical choice:
  • Token bucket is best when you want to allow short bursts (a mobile app that syncs 20 items at once, then goes idle). Set bucket size = max burst, refill rate = sustained limit.
  • Sliding window counter is best when you want strictly even distribution and bursts are harmful (payment processing API).
Implementation: API gateway level (Kong, AWS API Gateway — easiest). Application middleware level (express-rate-limit, ASP.NET rate limiting). Distributed rate limiting with Redis (INCR with TTL, or Redis Cell module for token bucket). Return 429 Too Many Requests with Retry-After header.
Connection: Rate limiting at the API gateway is a form of backpressure (Part IV, Ch 7.5) that protects your circuit breakers (Part V, Ch 9.2) from opening unnecessarily. If your rate limiter is well-tuned, your circuit breakers should rarely trip.
Gateway-level rate limiting. In a microservices architecture, rate limiting should live at the API gateway, not inside each service. The gateway sees the full request flow, can enforce per-consumer limits across all services, and can shed load before requests consume backend resources. Kong, AWS API Gateway, and Envoy all support configurable rate limiting with Redis-backed distributed counters. For a deep dive on gateway architecture — including how to configure rate limiting policies, authentication at the edge, TLS termination, and request routing across service versions — see the API Gateways & Service Mesh chapter.
Per what? Per IP (simplest, breaks for users behind shared IPs). Per API key (standard for authenticated APIs). Per user (fairest for logged-in users). Per tenant (essential in multi-tenant SaaS). Combine multiple: per-user within per-tenant limits. Work-sample prompt: “You are on-call and see this alert: ‘429 responses spiked from 0.1% to 12% in the last 5 minutes. Top consumer: partner-api-key-xyz.’ The partner has not changed their integration. Walk through your diagnosis, including what you check in Redis, the API gateway logs, and whether this is a real problem or a false positive.”

Multi-Tenant Rate Limiting and Fairness

In a multi-tenant SaaS API, “one tenant per rate limit” is not enough. You need a fairness model that prevents a single large tenant from degrading the experience for everyone else, while still allowing large tenants to use the capacity they are paying for. Tiered rate limiting with burst and sustained limits:
TierSustained RateBurst AllowanceMonthly Cost (typical)Overage Handling
Free10 req/s20 req/s for 10s$0Hard reject at burst limit
Growth100 req/s200 req/s for 30s$99-499/moSoft throttle with Retry-After
Enterprise1,000 req/s2,000 req/s for 60s$2,000-10,000/moGraceful degradation, priority queue
PlatformCustomCustomCustom contractDedicated capacity, SLA-backed
The fairness problem: Enterprise tenants that burst to 2,000 req/s consume shared backend resources (database connections, CPU, memory). If 5 enterprise tenants burst simultaneously, the shared infrastructure handles 10,000 req/s of demand — potentially saturating the database and causing latency spikes for all tenants, including free-tier users who are well within their limits. Solutions for multi-tenant fairness:
  1. Weighted fair queuing. Assign each tenant a weight proportional to their tier. During resource contention, the request scheduler serves higher-tier tenants proportionally more, but never starves lower tiers. AWS API Gateway supports this through usage plans.
  2. Admission control. When backend utilization exceeds 80%, start rejecting requests from tenants above their sustained rate (even if below burst). This preserves capacity for tenants within their sustained limit.
  3. Per-tenant connection pools. Dedicate a slice of database connections to each tier. Enterprise gets 40% of the pool, Growth gets 40%, Free gets 20%. No tier can exhaust connections for another.
  4. Cost attribution. Track per-tenant infrastructure cost (compute seconds, database IOPS, bandwidth). Tenants whose usage consistently exceeds their tier’s cost profile are candidates for upsell — this is how API-first companies like Twilio and Stripe align pricing with resource consumption.
Operational economics of rate limiting: Rate limiting is not just about protection — it is about cost control. At 0.0000004perLambdainvocationand0.0000004 per Lambda invocation and 0.09 per GB of data transfer, an unthrottled free-tier tenant making 1M requests/day costs you 0.40/dayincomputealonebeforeanydatabaseorbandwidthcosts.Multiplyby10,000freetiertenantsandyouareburning0.40/day in compute alone before any database or bandwidth costs. Multiply by 10,000 free-tier tenants and you are burning 4,000/day on users who pay nothing. Rate limiting at the free tier is a business requirement, not just a technical one.
Strong answer:The core problem is a throughput mismatch — our system produces events 100x faster than the partner can consume them. Sending callbacks at our rate would overwhelm their system, causing dropped events, cascading retries, and a worse situation.Design approach:
  1. Decouple with an outbound queue. Instead of calling the partner synchronously as events occur, publish events to a message queue (SQS, Kafka, or Redis Streams) dedicated to this partner.
  2. Rate-limited consumer. A dedicated worker consumes from the queue at the partner’s rate — 100 RPS. Use a token bucket or leaky bucket rate limiter on the consumer side. The queue absorbs the burst; the consumer drains it steadily.
  3. Batching. If the partner supports batch callbacks, aggregate events into batches of 50-100 per request. This delivers 5,000-10,000 events per second within their 100 RPS limit. This is the highest-leverage optimization.
  4. Backpressure and overflow. Set a queue depth limit. If the partner falls behind (their system is down), the queue grows. Add dead-letter queue (DLQ) for events that fail after N retries. Alert when queue depth exceeds a threshold — this means the partner cannot keep up even at steady state.
  5. Webhook with retry and exponential backoff. For the actual callback, use exponential backoff with jitter (1s, 2s, 4s, …, max 5 minutes). Include an Idempotency-Key header so the partner can safely deduplicate retries.
  6. Pull alternative. Offer the partner a poll-based API (GET /events?since=cursor) as an alternative to push. This lets them control their own consumption rate. Many partners prefer this because it puts them in control of backpressure.
Key trade-offs: Push (webhook) is real-time but requires careful rate management. Pull (polling) is simpler but introduces latency up to the polling interval. Offer both; let the partner choose. Batching is the single biggest win — it is always worth proposing.
Structured Answer Template.
  1. Frame it as a throughput mismatch — 100x gap between producer and consumer is not solvable by just “sending slower.”
  2. Decouple with a queue — per-partner SQS or Kafka topic absorbs bursts, consumer drains at partner rate.
  3. Batch if supported — this is the highest-leverage single change; a 50x batch at 100 RPS delivers 5,000 events/second.
  4. Define backpressure & DLQ — what happens when the partner is down for an hour?
  5. Offer pull as an alternative — lets the partner control their own consumption rate.
Real-World Example. Shopify’s webhook infrastructure serves exactly this pattern: a dedicated SQS queue per merchant, a consumer that respects per-merchant rate limits (many merchants have slower endpoints), exponential backoff with 48-hour retry window, and automatic migration to a “pull-based Events API” after repeated delivery failures. They also support webhook batching for merchants who opt in, bundling up to 50 events per POST.
Big Word Alert — backpressure. A signal from a downstream consumer that it cannot keep up, causing the upstream producer to slow down or queue. In a webhook system, backpressure manifests as a growing outbound queue; in HTTP streaming, as TCP window exhaustion. Name it when describing how you prevent upstream overload.
Big Word Alert — dead-letter queue (DLQ). A secondary queue where messages that fail delivery after N retries are parked for human investigation. Without a DLQ, stuck messages either block the queue (head-of-line blocking) or get silently dropped. Always pair a retrying consumer with a DLQ — it is the “I gave up, please look at this” channel.
Follow-up Q&A Chain:Q: The partner’s endpoint is down for 4 hours. What should your system do? A: Three stages: (1) exponential backoff with jitter, capping around 5 minutes per retry, for the first 30-60 minutes. (2) After sustained failure, move to a “deferred” state with retries every 15 minutes for up to 48 hours. (3) After the 48-hour window, route messages to a DLQ and alert the partner via an out-of-band channel (email/dashboard). Do not retry indefinitely — it eventually becomes spam for legitimate outages that the partner knows about.Q: How do you verify a webhook is actually from you and not a spoofed request? A: HMAC signatures. Include a header like X-Signature: sha256=abc... where the value is HMAC-SHA256(shared_secret, request_body). The partner computes the same HMAC and compares. Rotate the shared secret periodically, and support two active secrets during rotation. Include a timestamp in the signed payload with a short acceptance window (5 minutes) to prevent replay attacks.Q: If your queue grows to 10 million messages, what does that tell you? A: Either the partner has been down for a sustained period, or your consumer is genuinely unable to keep up with the partner’s 100 RPS limit. Check the consumer’s metrics: is it throttled at 100 RPS but the producer is generating 200 RPS? Then batching is the fix. Is the consumer sitting idle because the partner is returning 5xx? Then the partner is the blocker and you should open a ticket. The queue depth is a symptom; the diagnosis is upstream or downstream.
Further Reading:
  • Shopify Engineering — “How we designed Shopify’s webhook infrastructure”
  • Stripe Docs — Best practices for using webhooks
  • RFC 8131 — “Authentication-Scheme Name Registrations” (for webhook signature conventions)
Strong answer:A single POST with a 5GB body is a non-starter — any network interruption means restarting from zero, most proxies and load balancers have body size limits (often 1-10MB), and you cannot report meaningful progress.Design: Chunked Resumable Upload ProtocolStep 1: Initiate the upload.
POST /uploads
Content-Type: application/json
{ "filename": "dataset.parquet", "size_bytes": 5368709120, "content_type": "application/octet-stream" }

Response 201:
{ "upload_id": "upl_abc123", "chunk_size": 5242880, "total_chunks": 1024, "upload_url": "/uploads/upl_abc123" }
The server calculates chunks (5MB each for a 5GB file = ~1024 chunks). Returns an upload_id that the client uses for all subsequent requests.Step 2: Upload chunks.
PUT /uploads/upl_abc123/chunks/0
Content-Type: application/octet-stream
Content-Range: bytes 0-5242879/5368709120
[binary data]

Response 200: { "chunk_index": 0, "status": "received", "checksum": "sha256:abc..." }
Each chunk includes its index and byte range. The server stores each chunk and returns a checksum for client-side verification.Step 3: Resume after failure. If the connection drops, the client queries which chunks were received:
GET /uploads/upl_abc123/status
Response: { "upload_id": "upl_abc123", "received_chunks": [0, 1, 2, 3, 4], "missing_chunks": [5, 6, ..., 1023], "progress_pct": 0.49 }
The client resumes from chunk 5. No re-uploading completed chunks.Step 4: Finalize.
POST /uploads/upl_abc123/complete
Response 200: { "file_id": "file_xyz789", "size_bytes": 5368709120, "checksum": "sha256:final...", "url": "/files/file_xyz789" }
The server assembles chunks, verifies the full-file checksum, and returns the final file resource.Key design decisions:
  • Chunk size: 5MB is a good default — small enough that a failed chunk is cheap to retry, large enough that 1024 chunks is manageable. Let the server decide (clients with slow connections could request smaller chunks).
  • Idempotency: Uploading the same chunk index twice is idempotent (server overwrites or ignores if checksum matches). This makes retries safe.
  • Progress: The client knows received_chunks / total_chunks. For real-time progress, the client can calculate bytes uploaded per second from the chunk upload times.
  • Expiry: Incomplete uploads expire after 24 hours. The client can extend with a PATCH /uploads/{id} to refresh the TTL.
  • Parallel uploads: Chunks are independent, so the client can upload 3-5 chunks in parallel for better throughput on high-bandwidth connections.
This is essentially the protocol used by Google’s resumable upload API, tus.io (an open protocol for resumable uploads), and AWS S3 multipart uploads.
Structured Answer Template.
  1. Reject the naive single-POST design — explain why a 5GB single request is a non-starter (LB timeouts, proxy body limits, no resume).
  2. Propose chunked protocol — initiate, upload chunks, resume, finalize. Four endpoints, clean lifecycle.
  3. Choose chunk size — 5 MB is the sweet spot (S3-compatible, survivable for retries, manageable chunk count).
  4. Handle resume and verification — client queries which chunks arrived; server returns per-chunk checksums.
  5. Close with prior art — tus.io, S3 multipart, Google resumable uploads all follow this shape. Do not reinvent.
Real-World Example. The tus.io protocol (adopted by Vimeo, Cloudflare Stream, Supabase Storage) implements this exact lifecycle as an open standard. Dropbox’s upload API uses a near-identical chunked protocol with a 150 MB chunk size for their desktop client, and their engineering blog has described debugging resume edge cases across flaky mobile networks — specifically the “client thinks chunk 47 uploaded, server never received it” race that HEAD-based verification resolves.
Big Word Alert — multipart upload. Splitting a large file into independently uploaded parts that the server reassembles at finalization. S3’s CreateMultipartUpload / UploadPart / CompleteMultipartUpload API is the canonical example. Say “multipart upload with 5 MB parts and parallel chunk upload” — the numeric specificity tells the interviewer you have shipped this.
Big Word Alert — content-addressable storage. Storing a file under a key derived from its hash (SHA-256) so that duplicate uploads are deduplicated automatically and corruption is detectable. S3 combined with client-side SHA-256 hashing gives you CAS semantics over a flat object store. Worth mentioning when discussing integrity verification and dedup.
Follow-up Q&A Chain:Q: What if two clients upload the same file concurrently? Do you want two copies or one? A: Content-addressable storage solves this for free: both clients compute sha256(file) client-side, both try to upload to /files/<hash>. The server checks if the hash already exists — if yes, skip the upload and return the existing URL; if no, store it. This gives you dedup for free. Dropbox, Git, and IPFS all use this model.Q: A 5 GB upload is 95% complete and the client loses power. What happens on retry? A: The client restarts and calls GET /uploads/:id/status, which returns the list of received chunks. The client uploads only the missing chunks. No re-uploading the 4.75 GB that already arrived. This is the entire point of chunked uploads — partial progress survives interruption. TTL on the upload session (24 hours) bounds how long the server holds partial state.Q: How do you prevent a malicious client from uploading 10,000 bogus 5 GB uploads and filling your storage? A: Three defenses: (1) require authentication on POST /uploads — no anonymous uploads. (2) Enforce per-user storage quotas enforced at POST /uploads time (reject if user is at 90% of quota). (3) Expire incomplete uploads after 24 hours (the TTL) — a bogus upload that never finalizes is garbage-collected. For very large files (>1 GB), require payment authorization up front or rate-limit the number of concurrent upload sessions per user.
Further Reading:

15.7 REST vs gRPC vs GraphQL — When to Use Each

CriteriaRESTgRPCGraphQL
ProtocolHTTP/1.1 or HTTP/2HTTP/2 (always)HTTP (typically POST)
FormatJSON (text)Protocol Buffers (binary)JSON (text)
Type safetyOpenAPI/Swagger (optional)Strong (code-generated from .proto)Strong (schema-defined)
StreamingSSE or WebSocket (separate)Native bidirectional streamingSubscriptions (over WebSocket)
Browser supportNativeRequires gRPC-Web proxyNative
CachingHTTP caching works nativelyNo HTTP caching (binary POST)Hard (every request is unique POST)
Payload sizeLarger (JSON text + field names)3-10x smaller (binary, no field names)Variable (client picks fields)
Best forPublic APIs, browser clientsInternal service-to-service, high-throughputMultiple client types needing different data
Worst forMobile (over-fetching)Browser clients, public APIsSimple CRUD, real-time streaming
VersioningURL-based (/v1/), easyProto file evolution, carefulSchema evolution, deprecation directives

Decision Matrix with Concrete Scenarios

ScenarioChooseWhy
Public API for third-party developersRESTUniversal tooling, discoverable, HTTP caching, curl-friendly
Mobile app + web app consuming same backend, different data needsGraphQLEach client fetches exactly the fields it needs, one endpoint
Microservice A calls microservice B 10,000 times/secondgRPCBinary serialization, HTTP/2 multiplexing, code-generated stubs, streaming
Internal service with complex data relationships (social graph)GraphQLNatural fit for graph-like data traversal
Real-time bidirectional communication (chat, collab editing)gRPC streamingNative bidi streaming over HTTP/2, no WebSocket needed
Simple CRUD API, small team, getting startedRESTLowest complexity, best documentation, largest talent pool
Backend-for-frontend (BFF) aggregating multiple servicesGraphQLSingle query resolves data from multiple sources
IoT device sending telemetry at very high volumegRPCMinimal payload size, efficient binary serialization
Decision rule: Public API -> REST. Internal service-to-service with high throughput -> gRPC. Multiple client types with different data needs -> GraphQL. When in doubt -> REST (simplest, most understood, best tooling).
Analogy: Restaurant vs Buffet. REST is like ordering from a restaurant menu — you pick from fixed options, and the kitchen decides what comes on the plate (maybe you get fries you did not want). GraphQL is like a buffet — you walk up and pick exactly what you want, nothing more, nothing less. The restaurant is simpler to run (fixed menu, predictable load), but the buffet is better when every diner has different dietary needs. gRPC is like a professional catering service — incredibly efficient, highly optimized, but you need a contract (proto file) agreed upon in advance, and it is not something you set up for a casual lunch.
You do not have to pick just one. Many production systems use REST for public APIs, gRPC between internal microservices, and GraphQL for the frontend gateway. Netflix, Airbnb, and Shopify all run hybrid architectures.

15.8 Contract Testing

Consumer defines expectations, provider verifies it satisfies them. Catches breaking changes before deployment.
Tools: Pact for consumer-driven contract testing. OpenAPI/Swagger for API documentation and code generation. Postman for API testing and collaboration.

15.9 Shadow Traffic, Canary Validation, and Safe Rollout

Deploying API changes to production without testing against real traffic patterns is like testing a bridge with toy cars. Shadow traffic and canary validation let you test with real load before real users are affected. Shadow traffic (dark launching): Duplicate a percentage of production traffic to the new API version. The shadow receives real requests but its responses are discarded — only the current version’s response goes to the client. Compare shadow responses against production responses for correctness (field-level diff), latency (P50, P95, P99 comparison), and error rate. Tools: Envoy’s mirroring filter, nginx mirror directive, Istio’s traffic mirroring, or application-level fork (publish request to a queue consumed by the shadow). When shadow traffic catches what tests miss:
  • A database query that performs fine on test data but times out on production data volumes.
  • A response field that returns null on 0.3% of real records due to a legacy data edge case your test fixtures do not cover.
  • Serialization overhead that adds 50ms at the 99th percentile on real payload sizes but is invisible on small test payloads.
Canary deployments for APIs: Route 1-5% of real traffic to the new version. Unlike shadow traffic, canary responses go to real users — but the blast radius is limited. Monitor error rate, latency, and business metrics (conversion rate, checkout completion) for the canary cohort vs the baseline. If metrics degrade beyond a threshold, automatically roll back. AWS API Gateway supports canary releases natively with weighted routing. Kubernetes-based setups use Argo Rollouts or Flagger. Rollback strategy for API deploys:
  • Stateless changes (new logic, bug fixes): Roll back the deployment. Previous version handles traffic. Straightforward.
  • Stateful changes (new database columns, new cache keys): Rolling back the deployment is not enough if the new version wrote data in a new format. The old version must be able to read or ignore the new data. This is why the expand-and-contract pattern is non-negotiable — the “expand” phase must be complete and backward-compatible before any behavior change deploys.
  • Schema-coupled changes: If the API change depends on a database migration that has already run, rolling back the API requires either (a) the migration being backward-compatible (expand phase only) or (b) running a reverse migration, which is risky under production load. Always decouple schema migrations from application deploys — migrate the schema first, deploy the app change later.
Handling backfills during API evolution: When a new field is computed from existing data (full_name derived from first_name + last_name), the backfill must complete before clients start depending on the field. A half-backfilled table means some API responses return the field and some do not — this breaks client assumptions about field presence. Run backfills as idempotent batch jobs with progress tracking. Only flip the API to return the new field after 100% of rows are backfilled and verified.
Cross-Chapter Connections for API Design.
  • Caching — API responses are prime caching candidates. Use HTTP cache headers (Cache-Control, ETag, Last-Modified) for GET endpoints, and application-level caching (Redis) for expensive computations behind endpoints. A well-cached API can serve 100x the traffic without scaling the backend. See the Caching chapter for cache invalidation patterns and HTTP caching strategies.
  • Performance and Reliability — Rate limiting (covered above) is only one layer of API protection. Circuit breakers protect your API from slow downstream dependencies. Timeouts prevent a single slow request from consuming a connection indefinitely. See the Performance chapter for the full resilience stack: timeouts, retries, circuit breakers, and bulkheads.
  • Messaging — For APIs that trigger long-running operations (video processing, report generation, bulk imports), return 202 Accepted immediately and publish the work to a message queue. The client polls a status endpoint or receives a webhook callback when complete. This pattern prevents HTTP timeouts and lets you scale processing independently from request handling. See the Messaging chapter for async job patterns and webhook delivery guarantees.
  • GraphQL at Scale — If you are evaluating GraphQL as an alternative to REST, that chapter covers the production reality: federation for multi-team schemas, persisted queries for performance, calculated query costs for rate limiting, and the organizational challenges (schema governance, breaking change detection) that only emerge at scale.
  • API Gateways & Service Mesh — Many of the concerns discussed in this chapter — rate limiting, authentication, versioning, TLS — are best handled at the gateway layer rather than inside each service. That chapter covers how Kong, Envoy, and Istio handle these cross-cutting concerns, plus service mesh patterns for internal service-to-service communication.
Further reading: RESTful Web APIs by Leonard Richardson and Mike Amundsen — deep dive into REST API design principles. API Design Patterns by JJ Geewax — reusable patterns for designing consistent, developer-friendly APIs. Stripe API Reference — the gold standard for public API design; study the consistency of naming, error formats, pagination, and versioning across every resource.REST best practices: Microsoft REST API Guidelines — comprehensive, opinionated guidelines from Microsoft covering naming, versioning, pagination, error handling, and long-running operations; the most thorough public API style guide available. Zalando RESTful API Guidelines — battle-tested guidelines from Zalando’s engineering team covering URL design, JSON conventions, HTTP headers, deprecation, and event-driven APIs; excellent complement to Microsoft’s guidelines with a European engineering perspective.gRPC: gRPC Official Documentation — the canonical reference for gRPC concepts, language-specific quickstarts, authentication, error handling, and performance best practices; start with the “What is gRPC?” overview and the language tutorial for your stack. Google’s Introduction to gRPC — Google’s comparison of gRPC, OpenAPI, and REST with guidance on when each is appropriate; valuable for understanding Google’s own reasoning since they created gRPC.GraphQL: GraphQL Official Learning Guide — the best starting point for GraphQL fundamentals; covers queries, mutations, schemas, validation, and execution in a clear, progressive tutorial written by the GraphQL Foundation. Apollo GraphQL Documentation — comprehensive docs for the most popular GraphQL ecosystem; covers client-side caching, server-side schema design, federation for microservices, and performance monitoring; particularly valuable for understanding how GraphQL works in production. GraphQL in Action by Samer Buna — practical guide to building GraphQL APIs from schema design through production deployment.Rate limiting: Cloudflare Blog: What is Rate Limiting? — clear explanation of rate limiting algorithms (fixed window, sliding window, token bucket) with visual diagrams; Cloudflare processes millions of requests per second, so their perspective on rate limiting at scale is authoritative. Stripe’s Rate Limiting Approach — how Stripe implements multi-tier rate limiting (request rate limiter, concurrent request limiter, fleet usage load shedder, worker utilization load shedder); the most practical production-grade rate limiting write-up available.Idempotency: Stripe’s Idempotent Requests Documentation — Stripe’s official documentation on idempotency keys, including key scoping, expiration, and error handling; the industry reference implementation for idempotent API design. Brandur Leach’s blog on idempotency keys — deep, production-tested writing on implementing idempotency keys with database transactions, rocket rides (the canonical example), and designing APIs that survive distributed failures; the most thorough technical deep dive on the subject.API versioning: Stripe’s API Versioning Approach — how Stripe versions their API by pinning each integration to a version, running version-specific request/response transforms, and maintaining backward compatibility across hundreds of versions; essential reading for anyone building a long-lived public API.

Part IX — Databases and Data Systems

The database is where most systems spend the most time, have the most bugs, and face the hardest scaling challenges. The reason databases are hard is not the SQL — it is the trade-offs: consistency vs performance (isolation levels), read speed vs write speed (indexes), flexibility vs integrity (schema), and single-node simplicity vs distributed resilience (replication/sharding). Every database decision is a trade-off decision.
Real-World Story: Slack’s Migration from MySQL to Vitess. By 2017, Slack’s MySQL infrastructure was hitting a wall. Their architecture used one MySQL database per workspace (customer), which worked well at first — it provided natural data isolation and meant a problem in one workspace could not affect another. But as Slack grew to millions of workspaces, they were managing tens of thousands of MySQL instances. Schema migrations had to be rolled out across all of them. Connection pooling was a nightmare. Large enterprise workspaces were outgrowing single-node MySQL, but their per-workspace-database model did not support sharding within a workspace. Slack adopted Vitess — a MySQL-compatible horizontal sharding layer originally built at YouTube. Vitess sits between the application and MySQL, routing queries to the correct shard transparently. It handles connection pooling (solving the thousands-of-connections problem), online schema migrations (applying DDL changes across shards without downtime), and horizontal resharding (splitting a hot shard into smaller ones). The migration was not a rewrite — Slack’s application still spoke MySQL. Vitess acted as an intelligent proxy that made a cluster of MySQL instances behave like a single, horizontally scalable database. The lesson: you do not always need to abandon your database when you hit scale limits. Sometimes the answer is a layer that makes your existing database distributed.
Real-World Story: GitHub’s Zero-Downtime Schema Migrations with gh-ost. GitHub runs one of the world’s largest MySQL deployments — over 300 million repositories backed by MySQL. When you need to add a column or change an index on a table with billions of rows, traditional ALTER TABLE locks the table for hours or days — completely unacceptable for a service that developers depend on 24/7. GitHub built gh-ost (GitHub Online Schema Transmogrifier), an open-source tool that performs schema migrations without locking the original table. How it works: gh-ost creates a “ghost” table with the new schema, then uses MySQL’s binary log (binlog) stream to copy rows from the original table to the ghost table while simultaneously capturing and replaying any new writes. When the ghost table is caught up, it performs an atomic table swap. The entire process runs with minimal impact on production traffic because it is throttle-aware — it monitors replication lag and slows down or pauses the migration if the database is under stress. gh-ost replaced the previous approach (pt-online-schema-change from Percona) because it does not use triggers, which were a source of unpredictable load spikes. GitHub now runs thousands of schema migrations per year across their fleet with zero downtime. The lesson: at scale, schema migration is not a DBA task — it is a distributed systems problem that requires dedicated tooling.

Chapter 16: Relational Foundations

16.1 Normalization and Denormalization

Normalization: Eliminate redundancy — each fact stored exactly once. A customer’s address is in the customers table, not duplicated in every orders row. Benefits: no update anomalies (change address once, reflected everywhere), smaller storage, data integrity. Cost: queries need JOINs to reassemble related data. Denormalization: Introduce controlled redundancy for read performance. Store the customer’s name directly in the orders table so displaying an order does not require a JOIN. Benefits: faster reads (no JOINs), simpler queries. Cost: update anomalies (customer changes name -> must update it in every order row), more storage, risk of inconsistency. When to denormalize: When reads vastly outnumber writes (product catalog read 1000x per write). When JOIN performance is unacceptable (joining 5 tables for a dashboard query). When you are building a read-optimized model (CQRS read side, search index, analytics). When NOT to: When write consistency is critical (financial data). When the data changes frequently (denormalized copies go stale). As a first resort (optimize queries and indexes first).

16.2 Indexing In Depth

Analogy: The Table of Contents. A database index is like a book’s table of contents — without one, you would have to read every page to find what you need (a full table scan). The table of contents tells you “Chapter 7 is on page 142,” so you jump straight there. A composite index is like a table of contents organized by topic AND subtopic — you can quickly find “Networking > TCP Handshake” without scanning every networking entry. A partial index is like an index that only covers chapters 5-8, because those are the ones readers look up most often. And just like a real book, adding too many indexes (imagine a 50-page table of contents for a 200-page book) makes the book harder to update and physically larger — that is write amplification.

Index Types — When Each Shines

Index TypeData StructureSupportsBest ForAvoid When
B-treeBalanced treeEquality, range (>, <, BETWEEN), sorting, prefix LIKE 'foo%'Default choice for almost everything — primary keys, foreign keys, most WHERE clausesYou only ever do exact-match lookups at extreme scale (hash might edge it out)
HashHash tableEquality only (=)Exact-match lookups in theory (O(1) vs B-tree O(log n))Range queries, sorting, prefix matching. In PostgreSQL, hash indexes were not crash-safe before v10 and remain uncommon. B-tree handles equality efficiently enough that hash rarely justifies losing range support
CompositeB-tree (multi-column)Queries matching the leftmost prefix of columnsMulti-column WHERE clauses, covering queriesColumn order does not match query patterns (leftmost prefix rule)
PartialB-tree (filtered)Queries matching the WHERE condition of the indexIndexing a subset — e.g., WHERE status = 'active' when 95% of rows are inactiveThe filtered subset is most of the table (no space savings)
CoveringB-tree with INCLUDE columnsIndex-only scans (database never touches the heap)Read-heavy queries where all selected columns fit in the indexWide rows or frequently updated INCLUDE columns
GIN (Generalized Inverted Index)Inverted indexArray containment, full-text search, JSONB key/value queriesPostgreSQL full-text search (to_tsvector), JSONB @> operators, array && overlapSimple scalar equality — GIN has higher write cost
GiST (Generalized Search Tree)Balanced tree (extensible)Range overlap, containment, nearest-neighborPostGIS geospatial queries (ST_Within, ST_DWithin), range types, ltree hierarchiesStandard equality/range on scalar types (B-tree is faster)
Write amplification: Each index adds a write operation to every INSERT, UPDATE, and DELETE. A table with 8 indexes means a single row insert causes ~9 disk writes (1 table + 8 indexes). In concrete terms: a single-index INSERT on PostgreSQL typically takes 0.1-0.5 ms. With 8 indexes, that same INSERT takes 0.8-4 ms — not a problem for individual writes, but at 10,000 inserts/second, the difference is between 5 seconds and 40 seconds of cumulative write time per second, which can saturate disk I/O. For write-heavy tables, every index must justify its existence. This is why write-optimized databases (LSM-tree based: Cassandra, RocksDB, LevelDB) use append-only writes and defer index maintenance — they trade read latency for write throughput. Common mistakes: Over-indexing (slows writes due to write amplification). Under-indexing foreign keys. Functions on indexed columns preventing index usage (WHERE UPPER(name) = 'JOHN' does not use an index on name — use a functional index instead).
How to decide: Start with B-tree for every new index. Switch to GIN only if you are searching inside JSONB, arrays, or full-text. Switch to GiST only for geospatial or range-type queries. Use partial indexes to shrink index size when you only query a subset of rows. Use composite indexes when your queries consistently filter on 2-3 columns together — put the highest-cardinality column first for equality checks, and the range column last.

16.3 Transactions and Isolation Levels

ACID: Atomicity, Consistency, Isolation, Durability.

Isolation Levels, Anomalies, and When to Use Each

Isolation LevelDirty ReadNon-Repeatable ReadPhantom ReadWrite SkewPerformanceDefault In
Read UncommittedPossiblePossiblePossiblePossibleFastestAlmost never used
Read CommittedPreventedPossiblePossiblePossibleFastPostgreSQL, Oracle, SQL Server
Repeatable ReadPreventedPreventedPossible (SQL standard) / Prevented (PostgreSQL)PossibleModerateMySQL InnoDB
SerializablePreventedPreventedPreventedPreventedSlowestNone by default
Anomalies explained:
  • Dirty read: Transaction A reads data written by Transaction B before B commits. If B rolls back, A used data that never existed. Prevented at Read Committed and above.
  • Non-repeatable read: Transaction A reads a row, Transaction B updates and commits that row, Transaction A reads again and gets a different value. Within a single transaction, the same query returns different results. Prevented at Repeatable Read and above.
  • Phantom read: Transaction A queries rows matching a condition (WHERE price < 100), Transaction B inserts a new row matching that condition and commits, Transaction A re-queries and gets an extra row that was not there before. Prevented at Serializable (and at PostgreSQL’s Repeatable Read due to snapshot isolation).
  • Write skew: Two transactions each read a value, make a decision based on it, and write — but their combined writes violate a constraint that neither individual write violates. Example: two doctors both check that at least one doctor is on-call, each decides to go off-call, both commit — now zero doctors are on-call. Only Serializable prevents this.
Practical guidance:
  • Read Committed: Use for typical web requests — showing a product page, listing orders, updating a user profile. Slight staleness within a transaction is fine because each statement sees the latest committed data.
  • Repeatable Read: Use for reports that run multiple queries and need consistent numbers — “total revenue this month” should not change between the SELECT for orders and the SELECT for refunds within the same report transaction.
  • Serializable: Use for financial operations where correctness is critical — transferring money between two accounts (both balances must be read and written atomically), inventory reservation (two users trying to buy the last item must not both succeed).
PostgreSQL’s REPEATABLE READ is actually snapshot isolation (MVCC), which prevents phantom reads. MySQL’s REPEATABLE READ uses gap locks and behaves differently. Know your database’s specific implementation.

16.4 Locking

Pessimistic locking: SELECT ... FOR UPDATE acquires a lock on the row. Other transactions wait. Use when contention is high and conflicts are expected. Optimistic locking: Include a version column. On update, check WHERE version = expected_version. If no row is updated, someone else changed it — retry or fail. Use when contention is low. Advisory locking: Application-level locks using the database as a coordination point. PostgreSQL provides pg_advisory_lock(key). Use for distributed locking when you want to coordinate across application instances — e.g., ensuring only one instance runs a scheduled job.
Advisory Lock. Unlike row locks that protect data, advisory locks protect application logic. They do not lock any table or row — they are purely cooperative. Your code must explicitly acquire and release them. Useful for job scheduling, migration coordination, and singleton processing.
Deadlocks: Transaction A holds Row 1, waits for Row 2. Transaction B holds Row 2, waits for Row 1. Prevention: consistent lock ordering, short transactions, lock timeouts.
Gap Locks in MySQL. MySQL InnoDB’s REPEATABLE READ uses “gap locks” that lock ranges of index records, not just specific rows. This can cause unexpected blocking and deadlocks in high-concurrency write scenarios. PostgreSQL’s MVCC approach avoids this by using snapshot isolation instead.

16.5 CAP Theorem and Distributed Data

CAP theorem: Consistency, Availability, Partition Tolerance. The common “pick two” framing is misleading — network partitions are not optional, they will happen. The real choice is: during a partition, do you sacrifice consistency or availability? In normal operation (no partition), you can have both.

PACELC — The More Practical Model

The PACELC model (Daniel Abadi) extends CAP to capture what matters during normal operation: during a Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency. This captures the fact that even without partitions, there is a trade-off between response speed and consistency (synchronous replication is consistent but slower).
SystemDuring Partition (PAC)Normal Operation (ELC)Classification
PostgreSQL (sync replication)Sacrifice Availability (CP)Sacrifice Latency for Consistency (EC)PC/EC
CassandraSacrifice Consistency (AP)Sacrifice Consistency for Latency (EL)PA/EL
DynamoDB (eventual)Sacrifice Consistency (AP)Sacrifice Consistency for Latency (EL)PA/EL
DynamoDB (strong reads)Sacrifice Availability (CP)Sacrifice Latency for Consistency (EC)PC/EC
MongoDB (majority writes)Sacrifice Availability (CP)Sacrifice Latency for Consistency (EC)PC/EC
CockroachDBSacrifice Availability (CP)Sacrifice Latency for Consistency (EC)PC/EC
PACELC is more useful than CAP for real engineering decisions because most of the time your system is NOT partitioned. The daily trade-off you actually make is latency vs consistency — and PACELC makes that explicit. When someone says “we chose Cassandra for availability,” the more honest statement is “we chose Cassandra because we value low latency over strong consistency in normal operation, AND we prefer availability over consistency during partitions.”
CP systems: ZooKeeper, etcd, traditional RDBMS with synchronous replication. Refuse to respond rather than return stale data. AP systems: Cassandra, DynamoDB (eventual consistency mode), CouchDB. Return potentially stale data rather than refusing. Practical nuance: Real systems are not purely CP or AP. They make different trade-offs per operation. Write CP, read AP is a common hybrid approach.
Eventual Consistency. After an update, replicas may return stale data for a period, but will eventually converge to the latest value. “Eventually” could be milliseconds or seconds. The question is whether your application can tolerate the window of inconsistency.

16.6 Database Replication

Synchronous vs Asynchronous vs Semi-Synchronous

Replication ModeHow It WorksData Loss RiskWrite LatencyAvailability Impact
SynchronousWrite confirmed only after replica acknowledgesZero (for acknowledged replicas)High (network round-trip to replica)Replica outage blocks writes
AsynchronousWrite confirmed after primary writes locally; replica catches up laterPossible (primary fails before replica receives write)LowestReplica outage has no impact on writes
Semi-synchronousWrite confirmed after at least one replica acknowledges; remaining replicas are asyncMinimal (at least one replica has the data)Moderate (one round-trip, not all)Tolerates all-but-one replica failures
Synchronous replication: Write is confirmed only after at least one replica (or a configurable quorum) acknowledges. Guarantees no data loss if the primary fails. Adds latency to writes. Waiting for all replicas is possible but rarely done — it makes writes as slow as the slowest replica and means a single replica outage blocks all writes. Asynchronous replication: Write is confirmed after primary acknowledges. Replicas catch up later. Risk of data loss if primary fails before replication completes. Semi-synchronous replication (the practical middle ground): Used in MySQL and PostgreSQL (via synchronous_standby_names with ANY 1). The primary waits for at least one replica to confirm, then acknowledges the write. Remaining replicas replicate asynchronously. This gives you durability (one replica always has the latest data) without the fragility of full synchronous replication (one slow replica does not block all writes). This is the most common production configuration for databases that need strong durability without sacrificing availability. Read replicas: Direct read traffic to replicas, reducing load on the primary. Reads may be slightly stale (replication lag).
Replication Lag. A user creates a record, is redirected to a page that reads it, and the read hits a replica that has not received the write yet. Solutions: “read your own writes” consistency (route reads to primary for the writing user for a short period), or use the primary for reads immediately after writes.
Strong answer: First, verify that reads are the bottleneck (not write contention or slow queries — adding replicas does not help those). Check pg_stat_activity — if you see 80%+ of active connections running SELECT queries and the primary CPU is pegged at 90%+, read replicas will help. If you see lock waits and slow writes, replicas will not.Then set up asynchronous replication to one read replica. On a typical cloud instance (e.g., RDS db.r6g.xlarge with 4 vCPUs), a single PostgreSQL node handles roughly 5,000-10,000 simple read queries per second. Adding one read replica doubles that read capacity to ~10,000-20,000 QPS. Route read-only queries from the application to the replica — this can be done at the ORM level (Rails connected_to, Django using), at the connection pool level, or via a proxy like PgBouncer with read/write splitting. Monitor replication lag — in normal operation on the same availability zone, expect 1-10 ms lag; cross-region replicas may see 50-200 ms lag depending on distance. Keep all writes on the primary.
Classic replication lag issue. The user wrote to the primary, then the feed page read from the replica, but the replica had not received the write yet. Fixes: (1) Route the writing user’s reads to the primary for the next 5 seconds after a write (track this in the session — “read your own writes” consistency). (2) After creating a post, return the new post in the API response and have the frontend optimistically insert it into the feed without waiting for a server round-trip. (3) For the specific “just wrote” scenario, include a ?after_write=true parameter that forces a primary read. Option 2 is best UX. Option 1 is the safest server-side fix.
Structured Answer Template.
  1. Name the phenomenon — “read-your-own-writes consistency violation due to replication lag.”
  2. Explain the cause — async replication; replica was behind at read time.
  3. Give three fixes ordered by cost — optimistic UI (cheapest, best UX), session stickiness (safe server-side), primary read on hot path.
  4. Quantify acceptable lag — same-AZ: under 10ms, cross-AZ: under 100ms, cross-region: up to seconds.
  5. Measurepg_stat_replication.replay_lag metric; alert at P95 > 1s.
Real-World Example. LinkedIn published a 2019 engineering post about exactly this issue on their feed service — users posted an update, refreshed, and saw an empty feed. They adopted a “causal consistency token” pattern: the write returns an LSN, the client sends it on the next read, and the replica waits up to 500ms for that LSN before responding (falling back to primary if slower). This eliminated the “my post disappeared” complaint class without forcing primary reads globally.
Big Word Alert — replication lag. The time delay between a write committing on the primary and appearing on a replica. Measured in milliseconds in same-AZ setups, seconds to minutes for cross-region or under heavy write load. Always measure it as a distribution (P50/P95/P99) — averages hide the spikes that cause user complaints.
Big Word Alert — read-your-own-writes (RYOW) consistency. A consistency guarantee that a client always sees its own writes, even if other clients see stale data on the same replicas. Weaker than linearizability, stronger than eventual consistency. Name it when proposing session-sticky routing as the fix.
Follow-up Q&A Chain:Q: Instead of routing to primary, can you wait on the replica until the write arrives? A: Yes — causal consistency tokens. The write returns a position marker (LSN in PostgreSQL, GTID in MySQL). The client sends it on the next read, and the replica blocks briefly until it has replayed to that position. Cheaper than routing to primary (primary capacity is precious), and works even without session affinity. Downside: tail latency increases by the replication lag.Q: A user profile update must be instantly visible globally across all regions. Is that realistic? A: No — physics says no. Cross-region replication lag is bounded by the speed of light plus network routing (~100-200ms trans-Atlantic). For “instant globally,” you need either synchronous cross-region replication (adds 200ms to every write — usually unacceptable) or a design that tolerates the delay (optimistic UI, eventual convergence with conflict resolution). Promise users “visible within a few seconds,” not “instant everywhere.”Q: Your replica is 30 seconds behind during a write spike. What broke? A: Most likely: (1) the replica’s single-threaded WAL apply cannot keep up with the write rate — solution is parallel apply or scaling up the replica disk. (2) A long-running query on the replica is holding a snapshot that blocks WAL replay past that point (hot_standby_feedback trade-off). (3) Cross-region network saturation — check whether 100% of the WAL bandwidth is being consumed. The diagnosis path is pg_stat_replication.replay_lag + pg_stat_activity for long queries on the replica.
Further Reading:
  • PostgreSQL Docs — Hot Standby and Replication Lag
  • LinkedIn Engineering Blog — “Consistency as a Service: Auditing Cloud Consistency”
  • AWS RDS Docs — Replication lag metrics and alerting

16.7 Backup and Recovery

The principle: Backups are worthless. Restores are priceless. If you have never tested restoring from your backups, you do not have backups — you have hope. Backup types: Full backup (complete copy of all data — slow, large, but simple to restore). Incremental backup (only changes since last backup — fast, small, but restore requires the full backup plus all increments). Differential backup (changes since last full backup — middle ground). Point-in-Time Recovery (PITR): Restore the database to any specific moment. PostgreSQL achieves this through WAL (Write-Ahead Log) archiving — the base backup plus archived WAL segments let you replay transactions up to any timestamp. MySQL uses binary logs (binlogs) similarly. Cloud-managed databases (RDS, Cloud SQL) provide automated PITR with configurable retention (typically 1-35 days). Backup testing: Schedule regular restore drills (monthly minimum). Measure restore time — does it meet your RTO? Verify data integrity after restore. Automate the verification (restore to a test instance, run validation queries, compare row counts). If you only discover your backup is corrupted during a real disaster, you have failed. Cloud-managed backup: RDS automated backups (daily snapshots + continuous WAL archiving). Cloud SQL automated backups. Azure SQL geo-redundant backups. These are good defaults but understand their RPO (usually 5 minutes for PITR) and retention limits.
Logical backups (pg_dump, mysqldump) are portable but slow for large databases. Physical backups (filesystem-level snapshots, pg_basebackup) are fast but tied to the same database version. For databases over 100 GB, always use physical backups for disaster recovery.

16.8 Schema Versioning and Migrations

Use the expand and contract pattern: add new column, deploy app writing to both, backfill, deploy app reading from new column, remove old column. Never rename a column in one step. Always test migrations on production-sized data. Include rollback scripts.
Tools: Flyway, Liquibase (JVM). Alembic (Python). Knex migrations (Node.js). Entity Framework migrations (.NET). golang-migrate (Go). dbmate (language-agnostic).

Online Schema Evolution — What Actually Happens at Scale

On a 500M-row table serving 5,000 QPS, ALTER TABLE ADD COLUMN is not a command — it is a coordinated operation with production-safety implications. Different databases handle this differently, and the differences matter. PostgreSQL (13+): ALTER TABLE ADD COLUMN with a non-volatile default is metadata-only — it completes in milliseconds regardless of table size. PostgreSQL stores the default in the catalog and returns it on read without rewriting existing rows. This is one of PostgreSQL’s best features for online schema evolution. However: ALTER TABLE ADD COLUMN ... NOT NULL DEFAULT now() (volatile default) does rewrite the table in older versions. Always use a non-volatile default for online adds. MySQL InnoDB (8.0+): Most ALTER TABLE operations support “instant” DDL (ALGORITHM=INSTANT) for column additions at the end of the table. Adding a column in the middle, changing a column type, or adding an index requires either ALGORITHM=INPLACE (online, allows concurrent DML) or ALGORITHM=COPY (blocks writes). The ALGORITHM hint controls behavior — always specify it explicitly in production migrations to prevent surprises. Online schema migration tools comparison:
ToolDatabaseMechanismReplication-AwareThrottlingUsed By
gh-ostMySQLBinlog stream to ghost table, atomic renameYes (pauses if replica lag exceeds threshold)Adaptive based on replication lagGitHub, Shopify
pt-online-schema-changeMySQLTriggers on original table copy to new tableLimitedManual or lag-basedPercona users, many MySQL shops
pg_repackPostgreSQLTrigger-based table rewrite without locksNoManualPostgreSQL shops needing VACUUM FULL without locks
ReshapePostgreSQLDual-write to old and new schema simultaneouslyNoNoneNewer PostgreSQL projects
LHM (Large Hadron Migrator)MySQLChunked copy with triggersYesConfigurableShopify (legacy, replaced by gh-ost)
VitessMySQLDistributed DDL across shards via VReplicationYes (built-in)Shard-awareSlack, PlanetScale, HubSpot
Migration rollback strategies:
Migration TypeRollback ApproachRisk LevelTime to Rollback
Add columnDrop the column (metadata-only in PG, fast)LowSeconds
Add indexDrop the index (DROP INDEX CONCURRENTLY in PG)LowSeconds to minutes
Rename columnAdd the old name back as an alias or reverse the renameMedium — any code deployed referencing new name breaksMinutes
Change column typeAdd old-type column back, backfill, swapHigh — data loss possible if narrowing typeHours
Drop columnRestore from backup or WAL replay to recover dataCritical — data is goneHours to impossible
Drop tableRestore from backupCriticalHours
The golden rule of migration rollbacks: Every migration you run forward must have a tested reverse migration. “Tested” means you have run it against a production-sized dataset and verified both data integrity and application compatibility. Untested rollback scripts are fiction.

Backfill Operations — The Hidden Landmine

A backfill is a bulk data update that populates a new column, transforms existing data, or computes derived values. Backfills look trivial in development (UPDATE orders SET new_col = old_col * 1.1) and become production incidents at scale. Why backfills are dangerous:
  1. Transaction duration. A single UPDATE ... SET ... WHERE ... on 100M rows opens one transaction that runs for hours. During that time: VACUUM cannot clean dead tuples behind it (PostgreSQL), replication lag spikes (the replica must replay the entire transaction atomically), and any crash requires replaying the entire transaction from the WAL.
  2. Lock contention. While the backfill holds row-level locks on millions of rows, application writes to those rows queue behind the locks. If the backfill touches a hot table (orders, users), write latency spikes for the duration.
  3. Checkpoint storms. A massive UPDATE dirties millions of pages in the buffer cache. When a checkpoint fires, all dirty pages are flushed to disk simultaneously, causing I/O saturation for 30-60 seconds.
Safe backfill patterns:
-- WRONG: Single transaction, runs for hours
UPDATE orders SET priority = 'normal' WHERE priority IS NULL;

-- RIGHT: Batched with explicit commits and throttling
DO $$ 
DECLARE
  batch_size INT := 10000;
  affected INT := 1;
BEGIN
  WHILE affected > 0 LOOP
    UPDATE orders SET priority = 'normal'
    WHERE id IN (
      SELECT id FROM orders WHERE priority IS NULL
      ORDER BY id LIMIT batch_size
      FOR UPDATE SKIP LOCKED  -- skip rows locked by app writes
    );
    GET DIAGNOSTICS affected = ROW_COUNT;
    COMMIT;  -- release locks, allow VACUUM, reset transaction
    PERFORM pg_sleep(0.1);  -- throttle: 100ms pause between batches
  END LOOP;
END $$;
Backfill coordination with API deploys: When a new API field is computed from existing data (full_name derived from first_name + last_name), the deploy sequence matters:
  1. Deploy 1: Add the column (NULLable, no default). No API change — the field is not returned yet.
  2. Deploy 2: Application writes to the new column on all new INSERTs and UPDATEs. Existing rows still have NULL.
  3. Backfill: Run the batched backfill. Monitor progress (SELECT COUNT(*) WHERE new_col IS NULL vs total).
  4. Deploy 3: Only after 100% of rows are backfilled, update the API to return the new field. If the API returns the field while rows are still NULL, clients see inconsistent data — some responses have the field, some do not. This breaks client assumptions about field presence.
  5. Deploy 4 (optional): Add a NOT NULL constraint after verifying zero NULLs remain.
Skipping step 3 or deploying step 4 before step 3 completes is how half-backfilled columns create production bugs that are extremely hard to diagnose (intermittent NULL responses depending on which row the API happens to fetch).

Change Data Capture (CDC) in Practice

CDC captures row-level changes from the database’s transaction log (WAL in PostgreSQL, binlog in MySQL) and streams them to consumers. This is the foundation for event-driven architectures, real-time data pipelines, cache invalidation, and search index synchronization. CDC tools and approaches:
ToolSource DBOutput TargetsExactly-OnceOperational Complexity
DebeziumPostgreSQL, MySQL, MongoDB, SQL Server, OracleKafka, Pulsar, Kinesis, Redis StreamsAt-least-once (consumers must be idempotent)Medium (runs on Kafka Connect)
AWS DMSMost relational DBsKinesis, Kafka, S3, target RDSAt-least-onceLow (managed service)
PostgreSQL logical replicationPostgreSQLPostgreSQL (subscriber)At-least-onceLow (built-in)
MaxwellMySQL (binlog)Kafka, Kinesis, RabbitMQ, stdoutAt-least-onceLow
SpinalTap (Airbnb)MySQLKafkaAt-least-onceMedium
CDC and the API/database mismatch problem: When your API returns data derived from multiple tables and you use CDC to sync a read-optimized store (Elasticsearch, Redis, denormalized view), the CDC events arrive at different times for different tables. A single API “order” response might combine data from orders, order_items, customers, and products. If the CDC event for order_items arrives 200ms after orders, the search index briefly contains an order with no items. Consumers querying during that window see inconsistent data. Mitigation patterns:
  • Debounce and aggregate. Buffer CDC events for 1-5 seconds before updating the read store. Group all events for the same entity (order ID) and apply them together.
  • Version vector. Include a monotonic version in the read store. Only update if the incoming version is higher. Reprocess out-of-order events.
  • Outbox pattern. Instead of relying on CDC from business tables, write a single outbox event per business operation with the complete, pre-assembled payload. CDC on the outbox table produces one consistent event per operation.
Cost of CDC: Debezium on Kafka Connect requires a dedicated Kafka cluster (or managed Confluent Cloud at ~0.100.30/GB).Adatabaseproducing10GB/dayofWALgenerates 0.10-0.30/GB). A database producing 10GB/day of WAL generates ~1-3/day in CDC streaming costs. At 1TB/day (large e-commerce), that is $100-300/day just for the CDC pipeline before downstream processing costs. Factor this into architectural decisions — CDC is not free.
Breaking Change Traps — DB Schema Edition.These look like safe migrations but cause production incidents during rolling deploys:
  1. Adding a NOT NULL column without a default. Old application instances (still deploying) INSERT without the new column — database rejects the INSERT. Fix: always add columns as NULLable or with a default. Add the NOT NULL constraint in a later migration after all instances write the column.
  2. Renaming a column. Old instances SELECT/INSERT the old name and fail. Use expand-and-contract: add new column, dual-write, backfill, migrate readers, drop old column. Minimum 3 deploys.
  3. Changing a column type. ALTER TABLE ALTER COLUMN price TYPE numeric(10,2) rewrites the table in MySQL and acquires a heavy lock. Even in PostgreSQL, some type changes require a table rewrite. Use a new column, not an in-place type change.
  4. Dropping a column that is still referenced. Any ORM that does SELECT * or references the column explicitly will fail. Verify zero references (grep the codebase, check CDC consumers, check reporting queries, check ETL pipelines) before dropping.
  5. Adding a unique constraint to existing data with duplicates. The migration fails. Always add the constraint as NOT VALID first, clean duplicates, then VALIDATE CONSTRAINT. In PostgreSQL, ADD CONSTRAINT ... NOT VALID does not scan the table; VALIDATE CONSTRAINT scans without a full lock.
  6. Creating an index without CONCURRENTLY. On a 500M-row table, CREATE INDEX holds a SHARE lock (blocks writes) for 10-60 minutes. Use CREATE INDEX CONCURRENTLY. If the concurrent build fails, clean up the invalid index.
  7. Running a backfill in a single transaction. UPDATE orders SET new_col = computed_value on 100M rows holds a transaction open for hours, prevents VACUUM, bloats the table, and causes replication lag. Batch in chunks of 10,000-50,000 rows with COMMIT between batches.
  8. Adding a foreign key to a large table. ALTER TABLE ADD CONSTRAINT ... REFERENCES ... scans the entire table to validate existing data while holding a lock. Use NOT VALID then VALIDATE CONSTRAINT separately.
  9. Changing the primary key type. Migrating from INT to UUID or BIGINT requires rewriting every foreign key on every referencing table. On a table with 10 foreign key relationships, this is 10 table rewrites. Plan this as a multi-week migration with expand-and-contract on every referencing table.
  10. Adding a trigger on a high-write table. A BEFORE INSERT trigger that calls a function runs on every INSERT. On a table receiving 10,000 inserts/second, even a 0.1ms trigger adds 1 second of cumulative overhead per second of wall-clock time. Triggers on high-write tables must be benchmarked at production write rates, not on empty test tables.
  11. Enabling pg_stat_statements or auto_explain without load testing. These extensions add overhead to every query execution. pg_stat_statements is lightweight (~1-2% overhead) but auto_explain with log_min_duration = 0 can add 5-15% overhead on high-QPS databases. Always load-test observability extensions before enabling in production.
  12. Changing the default value of an existing column. ALTER TABLE ALTER COLUMN SET DEFAULT 'new_value' only affects future INSERTs — existing rows keep the old value. If application logic assumes all rows have the new default, existing data violates the assumption. Always pair a default change with a backfill of existing rows.

Replica Lag — Realities and Measurement

Replication lag is not a number — it is a distribution. “Average lag is 50ms” means nothing if P99 lag is 30 seconds during write spikes. What actually causes replica lag:
  • Write spikes. A bulk import of 1M rows creates a WAL burst. The replica’s single-threaded apply process falls behind. PostgreSQL 15+ supports parallel replay but only for separate tables.
  • Long-running transactions on the replica. A reporting query holding a snapshot on the replica prevents WAL replay from advancing past that snapshot’s horizon (PostgreSQL’s hot_standby_feedback and max_standby_streaming_delay control this trade-off).
  • Network saturation. Cross-region replicas compete with application traffic for bandwidth. A 1Gbps link carrying 100MB/s of WAL and 100MB/s of application traffic has no headroom for bursts.
  • Disk I/O on the replica. The replica writes WAL and applies it to data files. If the replica’s disk is slower than the primary’s (common in cost-optimized setups), it falls behind during sustained write periods.
Measuring lag accurately:
  • PostgreSQL: SELECT now() - pg_last_xact_replay_timestamp() AS lag on the replica. But this only updates when WAL is applied — if no writes happen for 5 minutes, the lag appears to grow even though the replica is caught up. Better: use pg_stat_replication.replay_lag on the primary (available in PG 10+).
  • MySQL: SHOW SLAVE STATUS reports Seconds_Behind_Master. This metric is notoriously unreliable — it measures the timestamp difference of the last applied event, not the actual queue depth. Use pt-heartbeat (Percona) for accurate lag measurement: it writes a timestamp to the primary every second and reads it from the replica.
  • Alert threshold: Alert at P95 lag > 1 second for same-region replicas, > 5 seconds for cross-region. Investigate immediately at > 30 seconds — the replica may be falling permanently behind.
Read-your-own-writes consistency patterns:
PatternHow It WorksOverheadBest For
Session-sticky routingAfter a write, route all reads from the same session to the primary for N secondsLow (routing logic only)Web apps with session affinity
Causal consistency tokenWrite returns a LSN/GTID token. Client sends it on next read. Replica waits until it has applied that position before responding.Medium (replica may block briefly)Mobile apps, APIs where session affinity is hard
Primary-only for critical readsRoute specific endpoints (balance check, order status after creation) to the primary alwaysLowFinancial, order management
Optimistic UIReturn the written data in the write response. Client renders it immediately without a subsequent read.Zero server overheadModern SPAs, mobile apps

16.9 Database Connection Management in Production

Connection management is the database topic most likely to wake you up at 3 AM. It is not glamorous, it rarely comes up in tutorials, and it is the single most common cause of production database outages that have nothing to do with queries or data. Understanding connection pooling, limits, and serverless connection challenges is non-negotiable for production engineering.

Why Connections Are a Scarce Resource

Every database connection is expensive. A PostgreSQL connection consumes approximately 5-10 MB of server memory (the backend process, work_mem allocations, shared buffer mappings). MySQL’s per-connection overhead is lower (~1-3 MB) but still significant at scale. A database server with 16 GB of RAM can comfortably handle roughly 200-500 connections before memory pressure degrades query performance. PostgreSQL’s default max_connections is just 100. The math gets bad quickly: if you have 20 application servers, each running 4 worker processes, each holding 5 connections — that is 400 connections. Add a few cron jobs, migration scripts, monitoring tools, and admin sessions, and you are at 450+ connections. Your database is now spending more resources managing connections than executing queries.

Connection Pooling Strategies

Pooling StrategyHow It WorksConnections to DBBest ForTrade-offs
Application-level poolingEach app instance maintains a local pool (HikariCP, SQLAlchemy pool, pg pool in Node.js)app_instances x pool_sizeMonoliths, small deploymentsTotal connections = instances x pool_size; scales poorly as you add instances
External connection poolerA proxy sits between app and DB, multiplexing many app connections onto fewer DB connectionsConfigurable (e.g., 50) regardless of app instancesPostgreSQL at scale (PgBouncer, PgCat, Odyssey)Additional infrastructure; transaction-mode pooling breaks session-level features (prepared statements, SET, LISTEN/NOTIFY)
Database-native poolingBuilt into the database or its proxy (ProxySQL for MySQL, RDS Proxy for AWS)ConfigurableMySQL at scale, serverless architecturesAdds latency (~1-2 ms); vendor-specific configuration
PgBouncer pooling modes — critical distinction:
ModeWhen Connection Is Returned to PoolSupportsBreaks
Session poolingWhen the client disconnectsEverything (prepared statements, SET, LISTEN)Nothing — but minimal connection savings (one DB connection per client session)
Transaction poolingWhen the transaction ends (COMMIT/ROLLBACK)Most applicationsPrepared statements (unless using protocol-level), session-level SET commands, LISTEN/NOTIFY, advisory locks
Statement poolingAfter each statementSimple queries onlyMulti-statement transactions, everything in transaction mode, plus more
Recommendation: Transaction-mode pooling is the standard for web applications. It provides the best connection multiplexing (100 app connections can share 10 DB connections if transactions are short). If your application uses prepared statements, either disable them at the ORM level or use PgBouncer 1.21+ which supports protocol-level prepared statement handling.

Connection Pool Sizing — The HikariCP Formula

Most teams set their connection pool too large. The counterintuitive truth: a smaller pool almost always outperforms a larger one. The formula (from the HikariCP wiki, applicable to any pool):
optimal_pool_size = (core_count * 2) + effective_spindle_count
For an SSD-backed server with 4 cores: (4 * 2) + 1 = 9 connections. That is not a typo. A pool of 9-10 connections on a 4-core machine will outperform a pool of 100 connections, because:
  • With 100 connections, the database juggles 100 concurrent queries, each getting 1% of CPU time, context-switching constantly, thrashing the disk scheduler.
  • With 10 connections, each query gets 10% of CPU time, executes 10x faster, and the total throughput is higher because the CPU spends time executing queries instead of switching between them.
The waiting-is-faster paradox: 90 requests waiting in the application’s pool queue (in-memory, nanosecond operations) while 10 execute at full speed is faster than 100 requests all executing simultaneously with constant context switching and disk contention.
The “just increase max_connections” anti-pattern. When connection errors appear, the reflexive fix is max_connections = 1000. This is like widening a highway without adding lanes — you get more cars in, but they all go slower. At 1,000 connections, PostgreSQL allocates 5-10 GB just for connection overhead, leaving less memory for shared_buffers and work_mem. Query latency increases, lock contention increases, and the next failure mode is worse (OOM kill instead of connection refused). The correct fix is almost always a connection pooler, not a higher limit.

Serverless Connection Challenges

Serverless functions (AWS Lambda, Vercel Functions, Cloudflare Workers) fundamentally break traditional connection pooling because:
  1. No persistent process. Each Lambda invocation may be a new container with no existing connection. Cold starts establish a new database connection (TCP handshake + TLS + authentication = 50-100 ms for PostgreSQL).
  2. Unbounded concurrency. A traffic spike can spawn 1,000 Lambda instances simultaneously, each opening a database connection. Your database goes from 10 connections to 1,000 in seconds — far beyond max_connections.
  3. No shared pool. Each Lambda instance has its own pool of 1 connection. There is no cross-instance connection sharing. Application-level pooling is useless when every instance is isolated.
Solutions by effectiveness:
SolutionHow It WorksLatency AddedComplexityBest For
RDS ProxyAWS-managed connection pooler that sits between Lambda and RDS/Aurora. Maintains a warm connection pool, multiplexes Lambda connections.~1-5 ms per queryLow (managed service)AWS Lambda + RDS/Aurora — the default recommendation
PgBouncer on ECS/FargateRun PgBouncer as a sidecar or standalone container in your VPC. Lambda connects to PgBouncer, which maintains a fixed pool to PostgreSQL.~1-2 ms per queryMedium (you manage PgBouncer)Non-AWS environments, or when RDS Proxy pricing is prohibitive
HTTP-based database proxyServices like Neon, PlanetScale, or Supabase that expose a database over HTTP/WebSocket, eliminating TCP connection overhead entirely.Variable (~5-20 ms)Low (managed service)Edge functions (Cloudflare Workers), prototyping, low-throughput serverless
Aurora Serverless v2 Data APIAWS HTTP-based API to Aurora that manages connections internally. No VPC, no connection pooling needed.~10-30 ms per queryLowestSimple serverless apps where latency tolerance is higher
RDS Proxy deep dive. RDS Proxy pins a session to a specific database connection when certain features are used (prepared statements, SET commands, temporary tables). Pinning defeats the purpose of pooling because that connection is no longer shared. Monitor the DatabaseConnectionsCurrentlySessionPinned CloudWatch metric — if it is consistently high, your application is defeating the proxy. Common fixes: avoid session-level SET commands (use connection-level defaults), use parameterized queries instead of prepared statements, and avoid temporary tables in Lambda functions. For detailed RDS Proxy configuration and cost analysis, see the Cloud Service Patterns chapter.
Strong answer:The root cause is that Lambda scales horizontally without limit, and each instance opens its own database connection. At 500 concurrent Lambda invocations, you have 500 connections — PostgreSQL’s default limit is 100, and even tuned instances rarely go above 500.Immediate fix (minutes): Add RDS Proxy between Lambda and the database. RDS Proxy maintains a pool of connections to PostgreSQL (configurable, default is max_connections * 0.9) and multiplexes all Lambda connections through that pool. Lambda connects to the RDS Proxy endpoint instead of the database directly. This is the AWS-recommended pattern and requires only changing the connection string.Configuration: Set RDS Proxy’s MaxConnectionsPercent to 90% of the database’s max_connections. Set ConnectionBorrowTimeout to 120 seconds — this is how long a Lambda invocation waits for a connection from the pool before failing. This is better than failing immediately because most Lambda functions hold a connection for <100 ms, so the queue drains quickly.Application-level optimization: Reuse connections across invocations within the same Lambda container. Initialize the database client outside the handler function — Lambda reuses the container for subsequent invocations, and the existing connection is reused. This reduces connection churn from “one connection per invocation” to “one connection per container lifecycle.”
// WRONG: new connection per invocation
export const handler = async (event) => {
  const client = new Pool({ connectionString: process.env.DATABASE_URL });
  const result = await client.query('SELECT ...');
  await client.end();
};

// RIGHT: connection reused across invocations in same container
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 1,  // single connection per Lambda instance
});

export const handler = async (event) => {
  const result = await pool.query('SELECT ...');
  return result.rows;
};
Long-term architecture: If connection pressure remains an issue, evaluate whether the workload should be on Lambda at all. A constantly running ECS/Fargate service with a proper connection pool (10-20 connections shared across all requests) is often more efficient and cheaper than 500 Lambda instances each holding their own connection. Lambda is ideal for bursty, stateless workloads — database-heavy CRUD operations with sustained traffic may be better served by containers.
Structured Answer Template.
  1. Name the root cause — Lambda scales horizontally without limit; each invocation opens a connection; PostgreSQL max_connections is a hard ceiling.
  2. Minutes-to-fix solution — RDS Proxy pools connections on the AWS side; Lambda connects to proxy endpoint.
  3. Code-level reuse — initialize DB client outside the handler so the same container reuses its connection across invocations.
  4. Pool sizing via HikariCP formula(cores * 2) + spindles. Surprising but correct: small pools outperform large ones.
  5. Last resort architectural — move sustained workloads off Lambda to ECS/Fargate with a shared pool.
Real-World Example. PlanetScale’s engineering blog documents a customer who scaled Lambda from 50 to 2,000 concurrent executions and saw too many connections errors cascade into full outages. The fix was RDS Proxy plus moving the DB client initialization outside the handler — net result: from 2,000 connections down to ~50 multiplexed through the proxy, with latency overhead of 1-2ms per query. The “connection reuse” code change was four lines.
Big Word Alert — connection pooling. Maintaining a shared pool of open DB connections that is reused across requests, rather than opening a new connection per request. PgBouncer (external proxy), RDS Proxy (AWS-managed), and HikariCP (JVM-native) are the three canonical implementations. Say “transaction-mode pooling” specifically when recommending PgBouncer — session-mode loses most of the benefit.
Big Word Alert — connection pinning (RDS Proxy). When a Lambda uses a feature incompatible with pooling (prepared statements, session-level SET, temp tables), RDS Proxy pins the connection to that Lambda for the session duration. Pinned connections cannot be shared — defeating the purpose of the proxy. Monitor DatabaseConnectionsCurrentlySessionPinned; high values mean your application is unknowingly defeating pooling.
Follow-up Q&A Chain:Q: Why does a smaller pool often outperform a larger one? A: Beyond roughly 2-4x the CPU core count, concurrent connections compete for the same CPUs, disk I/O, and lock resources. Context switching overhead grows. PostgreSQL’s per-connection memory (5-10 MB) eats into shared_buffers. The HikariCP wiki has the classic benchmark: on a 4-core server, a pool of 10 outperforms a pool of 100 because the 10 connections execute at full CPU speed while the 100 thrash. Little’s Law says pool_size = throughput × latency — anything more is wasted.Q: If RDS Proxy adds 1-2ms of latency, when is it NOT worth it? A: When your Lambda workload has very low concurrency (single-digit invocations) and you have raised max_connections enough to accommodate it. The 1-2ms is the cost of pooling; the benefit is unbounded scaling. At low scale, connect directly; at high scale, always proxy. The crossover is usually around 20-50 concurrent Lambdas.Q: Your team considers moving off RDS to DynamoDB to avoid connection management entirely. Reasonable? A: DynamoDB has no connections — every request is a stateless HTTP call — so the connection problem vanishes. But you trade it for access pattern rigidity: no JOINs, no ad-hoc queries, single-table design. The architectural question becomes “does our data model fit DynamoDB?” If yes and you are already struggling with RDS scaling, it is a rational migration. If no (relational data with many JOINs), you are trading one problem for a worse one. RDS Proxy solves the connection problem without the migration; pick it first.
Further Reading:

16.10 SQL vs NoSQL — Decision Framework

CriteriaSQL (PostgreSQL, MySQL)NoSQL (MongoDB, DynamoDB, Cassandra)
Data modelRelational (tables, rows, joins)Document, key-value, wide-column, graph
SchemaFixed schema, enforced by DBFlexible/schemaless (schema-on-read)
TransactionsFull ACID, multi-tableLimited (single-document in MongoDB, none in Cassandra)
JoinsNative, efficientApplication-side or denormalize
Write scalingVertical first, sharding is hardHorizontal built-in (DynamoDB, Cassandra)
Query flexibilitySQL — ad-hoc queries, aggregationsLimited (design for your access pattern upfront)
ConsistencyStrong by defaultEventual by default (tunable in some)
Best forComplex queries, transactions, relational data, reportingHigh write throughput, flexible schema, key-based access, massive scale
Decision rule: If you need transactions across multiple entities -> SQL. If you need flexible schema with key-based access at massive scale -> NoSQL. If unsure -> SQL (PostgreSQL). Adding a NoSQL store later is easy. Discovering you need transactions after choosing NoSQL is painful.

16.11 Redis vs Memcached

CriteriaRedisMemcached
Data structuresStrings, hashes, lists, sets, sorted sets, streams, HyperLogLogStrings only (key-value)
PersistenceRDB snapshots + AOFNone (pure cache)
Pub/SubBuilt-inNo
ClusteringRedis Cluster (auto-sharding)Client-side sharding
Lua scriptingYes (atomic operations)No
Multi-threadingSingle-threaded main loop (I/O threads since 6.0)Multi-threaded
Best forCaching + data structures + pub/sub + queuesPure caching, max memory efficiency
Decision rule: Need just caching -> either works. Need data structures, pub/sub, or persistence -> Redis. Need maximum memory efficiency for simple key-value at massive scale -> Memcached. Most teams choose Redis.

16.12 Specialized Data Stores

Document databases (MongoDB, CouchDB): Flexible schema, good for content management, catalogs. Key-value stores (Redis, DynamoDB): Ultra-fast lookups by key, session stores, caching. Time-series databases (InfluxDB, TimescaleDB): Optimized for time-stamped data, IoT, metrics. Search engines (Elasticsearch, OpenSearch): Full-text search, log analytics. Vector databases (Pinecone, Weaviate, pgvector): Similarity search for ML embeddings. Graph databases (Neo4j): Relationship-heavy queries (social networks, fraud detection).

Database Selection Guide

Use this table when choosing a database for a new project or feature. Start from your use case, not from a technology. This guide gives you the “what to pick” — for the “how it actually works under the hood” for each database, see the Database Deep Dives chapter.
Use CaseRecommended DB TypeSpecific OptionsReasoningDeep Dive
General web app (users, orders, products)Relational (SQL)PostgreSQL, MySQLACID transactions, JOINs, flexible queries, mature toolingPG internals: MVCC, VACUUM, query planner
High-write event stream (clickstream, IoT telemetry)Time-series or wide-columnTimescaleDB, InfluxDB, CassandraAppend-optimized writes, time-range queries, horizontal scale
Session store / caching layerKey-value (in-memory)Redis, MemcachedSub-millisecond reads, TTL expiry, simple key-based accessRedis architecture: eviction, persistence, clustering
Product catalog with variable attributesDocument storeMongoDB, DynamoDBFlexible schema (each product has different attributes), nested documentsDynamoDB single-table design, partition strategies
Full-text search (site search, log analytics)Search engineElasticsearch, OpenSearchInverted indexes, relevance scoring, faceted search
Social network (friends, followers, recommendations)Graph databaseNeo4j, Amazon NeptuneRelationship traversal (friends-of-friends) is O(1) per hop vs expensive JOINs
ML feature store / semantic searchVector databasePinecone, Weaviate, pgvectorApproximate nearest neighbor (ANN) search on high-dimensional embeddings
Financial ledger (strict consistency, audit trail)Relational (SQL)PostgreSQL, CockroachDBSerializable transactions, multi-row ACID, immutable audit logPG transaction internals, SSI
Global multi-region low-latency readsDistributed NoSQLDynamoDB Global Tables, CockroachDB, CassandraMulti-region replication, tunable consistency, automatic failoverDynamoDB Global Tables patterns
Queue / task brokerStream or message brokerRedis Streams, Kafka, SQSOrdered delivery, consumer groups, at-least-once semanticsRedis Streams internals
Cloud-managed database selection. If you are deploying on AWS, the choice is not just “PostgreSQL vs DynamoDB” — it is “RDS PostgreSQL vs Aurora PostgreSQL vs Aurora Serverless.” Each has different cost models, connection limits, failover behavior, and scaling characteristics. Aurora’s storage layer is fundamentally different from RDS (6-way replicated across 3 AZs, automatic storage scaling). For serverless workloads, Aurora Serverless v2 eliminates capacity planning but has cold-start considerations. The Cloud Service Patterns chapter covers these managed database services in depth — including when RDS Proxy is essential for Lambda-to-database connectivity.
The polyglot persistence principle: Most real systems use multiple databases, each for what it does best. A typical SaaS app might use PostgreSQL for transactional data, Redis for caching and sessions, Elasticsearch for search, and S3 for blob storage. Do not force one database to do everything. But also do not add a new database for every feature — each database is an operational burden (backups, monitoring, upgrades, expertise). Start with PostgreSQL for everything, then split out when you have a proven need.

Operational Economics — What Databases Actually Cost

Most teams track database performance but not database cost. Senior engineers think about both, because the cheapest query is the one you do not run, and the most expensive infrastructure is the infrastructure you over-provisioned. Real-world database cost breakdown (AWS, typical SaaS):
ComponentCost DriverMonthly RangeOptimization Lever
RDS instanceInstance type, Multi-AZ$200-5,000/moRight-size (most are over-provisioned by 2-4x). Use Reserved Instances for 30-60% savings.
Storage (gp3/io2)GB provisioned + IOPS$50-2,000/mogp3 with provisioned IOPS is 3-5x cheaper than io2 for most workloads. Monitor actual IOPS vs provisioned.
Read replicasSame as primary instance cost per replica$200-5,000/mo eachOne replica is usually enough. Two replicas “just in case” doubles cost with no benefit if the second is idle.
Backup storageGB beyond free tier + PITR retention$20-500/moDefault 7-day retention is fine. 35-day retention is 5x the storage cost — only if compliance requires it.
Data transferCross-AZ and cross-region replication$50-1,000/moCross-region replicas cost 0.02/GBintransfer.AwriteheavyDBproducing100GB/dayofWAL=0.02/GB in transfer. A write-heavy DB producing 100GB/day of WAL = 60/mo just in replication transfer.
RDS ProxyPer vCPU-hour$50-500/moOnly needed for serverless. Remove if you migrated away from Lambda.
DynamoDBRCU/WCU (provisioned) or per-request (on-demand)$50-50,000/moOn-demand is 5-7x more expensive per request than provisioned at steady state. Switch to provisioned with auto-scaling after traffic stabilizes.
ElastiCache (Redis)Node type, cluster size$100-3,000/mocache.t3.medium (65/mo)handles25Kops/sec.Teamscommonlyusecache.r6g.xlarge(65/mo) handles 25K ops/sec. Teams commonly use cache.r6g.xlarge (450/mo) when they do not need the memory.
The hidden cost: operational overhead per database technology. Every distinct database technology in your stack requires: monitoring dashboards, alerting rules, backup verification, upgrade runbooks, on-call expertise, and capacity planning. A rough estimate: each database technology costs 10-20% of one engineer’s time in operational overhead. A stack with PostgreSQL + Redis + Elasticsearch + DynamoDB means 40-80% of an engineer’s time goes to database operations. This is why “just add MongoDB for that one feature” is more expensive than it sounds — you are not adding a database, you are adding an operational commitment. Cost-aware query patterns:
  • DynamoDB: A scan of a 10GB table with 50M items costs ~6.25inRCU(atondemandpricing).Thesamedataretrievedbypartitionkeycosts6.25 in RCU (at on-demand pricing). The same data retrieved by partition key costs 0.000125. Design your access patterns to avoid scans.
  • Aurora: Aurora charges per I/O operation (0.20permillionI/Os).Aquerythatdoesasequentialscanon1Mpagescosts0.20 per million I/Os). A query that does a sequential scan on 1M pages costs 0.20. The same query with an index hitting 10 pages costs $0.000002. Indexes are not just about performance — they are about cost.
  • Cross-AZ queries: Reading from a replica in a different AZ costs 0.01/GBindatatransfer.At10GB/dayofqueryresults,thatis0.01/GB in data transfer. At 10GB/day of query results, that is 3/day or ~$90/month. Route reads to same-AZ replicas when possible.
Database Red Flags in Interviews — Things That Immediately Signal You Are Junior.
  1. Using SELECT * in production queries. It fetches every column including BLOBs and text fields you do not need, defeats covering indexes (which can serve the query entirely from the index without touching the table), and breaks when someone adds a column. Always specify the columns you need.
  2. Not mentioning indexes. If someone asks you to optimize a slow query and you do not say “index” in your first sentence, the interviewer has already mentally downgraded you. Indexes are the single highest-leverage optimization in all of database engineering.
  3. Ignoring connection pooling. Every database connection costs ~5-10 MB of memory on the server side. A web app with 100 application servers each opening 20 connections means 2,000 connections — PostgreSQL defaults to max_connections = 100. Without a connection pooler (PgBouncer, PgCat), your database falls over before your queries even execute.
  4. Not considering read replicas for read-heavy loads. If your answer to “the database is slow” is always “add an index” or “throw more hardware at the primary,” you are missing the architecture-level solution. Most web applications are 80-95% reads. Routing reads to replicas is table-stakes knowledge.
  5. Saying “just use NoSQL” without understanding the trade-offs. NoSQL is not faster. It makes different trade-offs. If you cannot articulate what you are giving up (transactions, JOINs, ad-hoc query flexibility) and what you are gaining (horizontal write scaling, flexible schema, key-based access patterns), you do not understand the choice.
  6. Never mentioning EXPLAIN ANALYZE. Optimizing a query without looking at the execution plan is like debugging code without reading the error message. If your optimization strategy is “add indexes and hope,” you are guessing.
  7. Ignoring data volume in your estimates. When someone says “we have 500 million rows” and your answer does not change compared to “we have 50,000 rows,” you are not thinking about scale. A full table scan on 500M rows takes minutes. On 50K rows, it takes milliseconds. The strategy is completely different.
Strong answer:“Indexes exist” does not mean “indexes are working.” This is a systematic diagnosis, not a guessing game. Let me put scale in perspective first: at 500M rows with an average row size of 200 bytes, we are talking about ~100 GB of raw data. A full table scan at typical cloud SSD throughput (~200 MB/s sequential) takes roughly 8-10 minutes. With a well-designed B-tree index, a point lookup on the same table takes 2-5 ms. That is a 100,000x difference — and it is the difference between “the page loads” and “the page times out.”Phase 1: Understand the queries.
  • Run EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) on the slow queries. This is non-negotiable — you cannot optimize what you have not measured. Look at the actual execution plan, not the estimated one.
  • Check if the planner is using your indexes at all. If you see Seq Scan on a 500M-row table, either the index is missing for that query pattern, the statistics are stale, or the planner decided a sequential scan was cheaper (which can happen if the query returns a large fraction of the table).
Reading a real query plan — concrete example. Suppose you run:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';
And get back:
Seq Scan on orders  (cost=0.00..12847623.00 rows=4821 width=196)
                    (actual time=287401.22..287892.55 rows=4793 loops=1)
  Filter: ((customer_id = 42) AND (status = 'shipped'::text))
  Rows Removed by Filter: 499995207
  Buffers: shared hit=1024 read=8472981
Planning Time: 0.12 ms
Execution Time: 287893.01 ms
This plan is screaming at you. Here is how to read it: (1) Seq Scan means PostgreSQL is reading every single row — all 500M of them. At roughly 10 MB/s sustained disk throughput on a typical cloud instance, that is approximately 4-5 minutes of wall-clock time, which matches the 287893 ms (~4.8 minutes) execution time. (2) Rows Removed by Filter: 499,995,207 means it examined 500M rows to find 4,793 matches — a selectivity of 0.001%. This is an index’s dream scenario. (3) Buffers: shared read=8,472,981 means it read ~8.5 million 8 KB pages from disk — about 65 GB of I/O for a query that should touch a few MB. (4) The cost=0.00..12847623.00 is the planner’s estimated cost in arbitrary units — higher means more expensive, and 12.8 million is astronomical.Now add a composite index:
CREATE INDEX idx_orders_customer_status ON orders(customer_id, status);
Re-run EXPLAIN ANALYZE:
Index Scan using idx_orders_customer_status on orders
    (cost=0.57..5241.33 rows=4821 width=196)
    (actual time=0.03..4.87 rows=4793 loops=1)
  Index Cond: ((customer_id = 42) AND (status = 'shipped'::text))
  Buffers: shared hit=4821
Planning Time: 0.15 ms
Execution Time: 5.12 ms
Same query, same data: from 287 seconds to 5 milliseconds — a 57,000x improvement. The Index Scan touches exactly the rows it needs (4,821 buffer hits vs 8.5 million disk reads). At 500M rows, a B-tree index is about 4-5 levels deep, so the lookup requires roughly 5 random I/O operations to find the first matching leaf page, then a sequential scan through adjacent leaf pages for the remaining matches. This is why EXPLAIN ANALYZE is non-negotiable — without it, you are guessing. Every database interview answer about query optimization that does not start with “I would run EXPLAIN ANALYZE” is incomplete.
How to read any EXPLAIN ANALYZE output in 30 seconds: (1) Scan for Seq Scan on large tables — that is your red flag. (2) Check actual time — the second number is total time in milliseconds. (3) Compare rows (estimated) to actual rows — a big mismatch means stale statistics (ANALYZE the table). (4) Look at Buffers: shared read — high numbers mean cold cache and heavy disk I/O. (5) Check for Sort or Hash Join with high row counts — those eat memory and spill to disk.
Phase 2: Statistics and maintenance.
  • Run ANALYZE on the table. PostgreSQL’s query planner uses statistics about data distribution to choose execution plans. On a 500M-row table, the default statistics target (100) may be too low — increase it with ALTER TABLE SET STATISTICS 1000 on high-cardinality columns used in WHERE clauses.
  • Check for bloat. On a heavily-updated table, dead tuples accumulate (MVCC). Run SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'your_table'. If dead tuples are a significant fraction of live tuples, VACUUM FULL or pg_repack (non-locking) to reclaim space and reduce I/O.
  • Check index bloat. B-tree indexes on frequently-updated columns become bloated over time. REINDEX CONCURRENTLY rebuilds without locking.
Phase 3: Index optimization.
  • Verify composite index column order matches query patterns (leftmost prefix rule). An index on (status, created_at) helps WHERE status = 'active' AND created_at > '2025-01-01' but NOT WHERE created_at > '2025-01-01' alone.
  • Consider partial indexes if queries consistently filter a subset: CREATE INDEX idx_active ON orders(created_at) WHERE status = 'active' — if only 5% of rows are active, this index is 20x smaller and 20x faster to scan.
  • Consider covering indexes (INCLUDE) to enable index-only scans — if the query only needs id and amount, CREATE INDEX idx ON orders(status, created_at) INCLUDE (id, amount) avoids touching the heap entirely.
Phase 4: Table-level strategies for 500M+ rows.
  • Partitioning. If queries consistently filter by a time range (and they usually do), partition by month or quarter using PostgreSQL declarative partitioning. A query for “last 30 days” now scans one partition (maybe 15M rows) instead of 500M. Partition pruning is automatic. The numbers are dramatic: scanning 15M rows takes ~2-3 seconds vs ~8-10 minutes for 500M. With an index on the partition, point lookups drop to under 1 ms because the index is 30x smaller and fits entirely in memory.
  • Read replicas. Route analytics and reporting queries to a read replica. Keep the primary for OLTP.
  • Materialized views. For expensive aggregation queries that do not need real-time data, create a materialized view refreshed every hour. A dashboard that aggregates 500M rows every page load is a bug, not a feature.
Phase 5: Application-level changes.
  • Add pagination. No query should return 500M rows. Cursor-based pagination with WHERE id > last_seen_id ORDER BY id LIMIT 100 is O(1) regardless of offset. At 500M rows, an OFFSET 10000000 LIMIT 20 query still scans and discards 10 million rows — cursor-based pagination avoids this entirely.
  • Cache hot queries in Redis with a TTL. If the same aggregation is requested 1000 times per minute, compute it once. A Redis GET takes ~0.1 ms vs a complex PostgreSQL aggregation at 50-500 ms — that is a 500-5000x improvement for repeat queries. See the Caching chapter for cache invalidation strategies (write-through, write-behind, cache-aside) and how to avoid the thundering herd problem on cache expiry.
  • Review the ORM. ORMs often generate suboptimal queries — N+1 problems, unnecessary SELECT *, missing WHERE clauses. EXPLAIN ANALYZE the actual SQL the ORM produces, not what you think it produces. A classic N+1 on a 500M-row table turns a single 5 ms indexed query into 100 separate queries taking 500 ms total — plus the connection overhead.
What NOT to do: Do not throw hardware at it before understanding the query plans. Do not add random indexes hoping one helps — every index slows writes (on a table with 8 indexes, each INSERT pays ~9 disk writes). Do not jump to sharding before exhausting single-node optimizations — PostgreSQL can handle billions of rows on a single node with proper indexing and partitioning.
Structured Answer Template.
  1. Put scale in perspective first — full scan math: 500M rows × row size ÷ disk throughput = X minutes. Makes the stakes concrete.
  2. Measure, do not guessEXPLAIN (ANALYZE, BUFFERS), pg_stat_statements, pg_stat_user_tables.
  3. Escalate cheapest fixes first — stats refresh, missing indexes, partial/covering indexes before partitioning.
  4. Match index to query pattern — leftmost prefix, INCLUDE for covering, partial for skewed data.
  5. Architecture layer last — partitioning, read replicas, caching. These are months of work; exhaust cheaper options.
Real-World Example. Use The Index, Luke (Markus Winand) documents a case study where a 400M-row orders table had a query taking 45 seconds. The fix was a single covering index: CREATE INDEX idx ON orders(customer_id, status) INCLUDE (total, created_at). Query time dropped to 3ms — a 15,000x improvement from one line of DDL. The lesson: indexes are not “optimization”; they are the difference between a working system and a broken one at this scale.
Big Word Alert — index bloat. B-tree indexes on frequently updated columns accumulate empty leaf-node slots over time as rows are updated (MVCC) or deleted. A bloated index is slower to scan, uses more memory, and produces worse query plans. REINDEX CONCURRENTLY or pg_repack rebuilds it without a lock. Monitor via pg_stat_user_indexes and the pgstattuple extension.
Big Word Alert — covering index / index-only scan. A covering index contains all columns a query needs (via INCLUDE clause), so PostgreSQL answers the query entirely from the index without touching the heap. Index-only scans are 10-100x faster than regular index scans because they skip the heap fetch. Name it when discussing optimization beyond basic indexing.
Follow-up Q&A Chain:Q: You added an index and the planner still ignores it. Why? A: Five common reasons: (1) stats are stale — run ANALYZE. (2) The index does not match the query’s leftmost prefix rule. (3) Selectivity is so low that a sequential scan is actually cheaper (>15% of the table). (4) random_page_cost is at the HDD default of 4.0 — on SSDs, set it to 1.1. (5) The query uses UPPER(col) or another function that prevents index usage — use a functional index. EXPLAIN tells you which reason; do not guess.Q: You have 500M rows and a query that filters by a range of timestamps. Partitioning by month — helps or hurts? A: It helps only if the query includes the timestamp predicate. If the query filters by user_id without a date range, partitioning makes it worse — now the planner must scan every partition. The precondition for partitioning is: all hot-path queries include the partition key in the WHERE clause. Violate it and you trade one big index scan for N small ones — net worse.Q: Your p99 is 5 seconds but p50 is 20ms. Where is the tail coming from? A: Three common culprits at scale: (1) cold cache — the first query to touch a partition/index that is not in shared_buffers pays the disk read tax. (2) Autovacuum I/O contention — a VACUUM on a large table saturates disk for minutes. (3) Checkpoint flushes — every checkpoint_timeout interval (default 5 minutes), dirty pages flood the disk. The tail tells you which one: regular spikes every 5 min = checkpoints; sporadic spikes on specific queries = cold cache; long plateaus = autovacuum. auto_explain catches the slow plans; pg_stat_bgwriter confirms checkpoints.
Further Reading:
  • Markus Winand — Use The Index, Luke — the definitive free guide to SQL indexing
  • PostgreSQL Docs — Using EXPLAIN
  • pganalyze — “Advanced EXPLAIN ANALYZE: How to read PostgreSQL query plans”
Cross-Chapter Connections for Database Optimization.
  • Caching — The fastest database query is the one you never make. For read-heavy workloads, a caching layer (Redis, Memcached) in front of the database can reduce read load by 90%+. See the Caching chapter for strategies on cache invalidation, TTL design, and avoiding the thundering herd problem when a hot cache key expires and 1,000 requests simultaneously hit the database.
  • Performance and Reliability — Query tuning is one dimension of performance, but connection pooling (PgBouncer, ProxySQL), circuit breakers around database calls, and timeout configuration are equally critical. A single slow query without a timeout can hold a connection for minutes, starving the pool. See the Performance chapter for connection pool sizing, timeout strategies, and circuit breaker patterns around data stores.
  • Messaging and Async Processing — For write-heavy workloads, consider decoupling writes through a message queue. Instead of synchronously inserting into the database on every API request, publish to Kafka or SQS and let a consumer batch-insert. This absorbs traffic spikes (Black Friday: 10x normal write volume) without overwhelming the database. See the Messaging chapter for the outbox pattern, which ensures reliable publishing of database changes to a message broker without distributed transactions.
  • Database Deep Dives — This chapter covers database concepts (indexing, transactions, replication) across all databases. The Database Deep Dives chapter goes engine-specific: PostgreSQL MVCC internals and why VACUUM matters, DynamoDB partition key design and hot-partition avoidance, MongoDB aggregation pipeline optimization, and Redis memory management and eviction policies. If you are asked “how does PostgreSQL actually implement MVCC?” or “why is my DynamoDB table throttling?”, that is where to go.
  • Cloud Service Patterns — For AWS-specific database operations: RDS instance sizing and storage types (gp3 vs io2), Aurora’s distributed storage architecture and why failover is faster than RDS, DynamoDB on-demand vs provisioned capacity cost modeling, ElastiCache deployment patterns (cluster mode vs non-cluster), and RDS Proxy configuration for serverless workloads.
Further reading: Designing Data-Intensive Applications by Martin Kleppmann — chapters on replication, partitioning, and transactions are exceptional. Chapters 3 (Storage and Retrieval), 5 (Replication), 6 (Partitioning), and 7 (Transactions) are the most directly relevant to this section. Use The Index, Luke — the best free resource on SQL indexing; covers B-tree internals, composite index design, partial indexes, and how to read execution plans across PostgreSQL, MySQL, Oracle, and SQL Server. Percona’s Database Performance Blog — deep technical posts on MySQL and PostgreSQL performance tuning, replication, and operational best practices from the team behind pt-online-schema-change and Percona Server. GitHub Engineering Blog on Database Infrastructure — real-world case studies on zero-downtime migrations with gh-ost, MySQL fleet management at scale, and the operational challenges of running databases for 100M+ developers. Uber Engineering: Schemaless — Uber’s custom database layer built on top of MySQL that provides schemaless storage, append-only writes, and horizontal sharding; a fascinating case study in building application-level database abstractions. Google’s Spanner Paper — the foundational paper on globally distributed databases with external consistency; introduces TrueTime and explains how Google achieves strong consistency across data centers worldwide.Database indexing: Use The Index, Luke — SQL Indexing and Tuning e-Book — free, comprehensive guide to database indexing by Markus Winand; covers B-tree structure, composite index design, partial indexes, covering indexes, and how to read execution plans; essential for anyone writing SQL in production.PostgreSQL: PostgreSQL Wiki: Performance Optimization — community-maintained collection of PostgreSQL performance tuning techniques covering configuration (shared_buffers, work_mem, effective_cache_size), query optimization, and connection management. pganalyze Blog — in-depth articles on PostgreSQL performance including EXPLAIN plan analysis, index advisor strategies, vacuum tuning, bloat management, and connection pooling; written by the team behind the pganalyze monitoring tool.Transactions and ACID: PostgreSQL Documentation: Transaction Isolation — the authoritative reference on how PostgreSQL implements Read Committed, Repeatable Read (snapshot isolation), and Serializable isolation levels; essential for understanding MVCC behavior, write skew prevention, and serialization failure handling. Jepsen.io: Consistency Analyses — Kyle Kingsbury’s rigorous distributed systems correctness testing; tests real databases (PostgreSQL, MySQL, MongoDB, CockroachDB, Cassandra, and more) for consistency violations under network partitions and failures; the most honest source of truth about whether a database actually delivers the consistency guarantees it claims.CAP theorem and PACELC: Martin Kleppmann: Please Stop Calling Databases CP or AP — Kleppmann’s widely-cited critique explaining why the CAP theorem is frequently misunderstood, why real systems do not fit neatly into CP or AP categories, and why you should reason about specific consistency models (linearizability, causal consistency, eventual consistency) rather than using CAP as a classification tool.Replication: MySQL Documentation: Replication — the canonical reference for MySQL replication covering binary log replication, GTID-based replication, semi-synchronous replication, group replication, and troubleshooting replication lag. PostgreSQL Documentation: Streaming Replication — official guide to PostgreSQL’s WAL-based streaming replication, covering synchronous and asynchronous modes, replication slots, hot standby configuration, and monitoring replication lag.Database migrations: Flyway Documentation — the most popular SQL-based migration tool for JVM projects; covers versioned migrations, repeatable migrations, undo migrations, and baseline; well-suited for teams that prefer writing raw SQL migration scripts. Liquibase Documentation — database-agnostic migration tool supporting SQL, XML, YAML, and JSON changelog formats; particularly strong for teams managing multiple database vendors or needing rollback tracking. gh-ost: GitHub’s Online Schema Migration Tool — GitHub’s triggerless online schema migration tool for MySQL; performs non-blocking ALTER TABLE on tables with billions of rows by using the binary log stream; essential reading for anyone running MySQL at scale.Connection pooling: PgBouncer Documentation — the standard connection pooler for PostgreSQL; covers session, transaction, and statement pooling modes, configuration tuning, and monitoring; critical for any PostgreSQL deployment serving more than a handful of application instances. HikariCP Wiki — documentation for the fastest JVM connection pool; the “About Pool Sizing” article is particularly valuable — it explains why a small pool (often 10 connections) outperforms a large pool (100+) and provides the formula for calculating optimal pool size based on core count and disk characteristics. AWS RDS Proxy Documentation — essential for serverless architectures connecting to RDS/Aurora; covers connection multiplexing, IAM authentication, session pinning behavior, and monitoring with CloudWatch; see also the Cloud Service Patterns chapter for RDS Proxy sizing and cost patterns in production.

API Design Checklist

Use this checklist before shipping any API — internal or external. The items are ordered roughly by “pain caused if you skip this.” Every item you defer is technical debt that compounds with every new consumer.

1. Versioning Strategy

  • Choose a versioning scheme before writing the first endpoint (/v1/ URL path for public APIs, header-based for internal)
  • Define what constitutes a breaking change for your API (removing fields, renaming fields, changing types, changing status codes)
  • Establish a deprecation policy with a timeline (minimum 6 months notice for public APIs, 2 weeks for internal)
  • Pin each consumer to a version — new consumers get the latest, existing consumers keep theirs until they opt in
  • Include version in every response header (X-API-Version: 2024-01-15) so consumers can verify

2. Authentication and Authorization

  • Choose auth mechanism: API keys (simple, per-consumer), OAuth 2.0 (delegated access, scopes), JWT (stateless, short-lived)
  • Require HTTPS everywhere — no exceptions, even for internal APIs (zero-trust networking)
  • Implement API key rotation without downtime (support two active keys per consumer during rotation)
  • Return 401 Unauthorized (identity unknown) vs 403 Forbidden (identity known, insufficient permissions) — these are different
  • Log every authentication failure with IP, key prefix (never the full key), and timestamp for security auditing

3. Rate Limiting

  • Define rate limits per consumer tier (free: 100 req/min, paid: 10,000 req/min, enterprise: custom)
  • Return 429 Too Many Requests with Retry-After header (in seconds) and X-RateLimit-Remaining, X-RateLimit-Reset headers
  • Choose algorithm: token bucket for APIs allowing bursts, sliding window for strict rate enforcement
  • Implement at the API gateway level (not per-application-instance) for accurate distributed counting
  • Document rate limits prominently — developers should not discover limits by hitting them

4. Pagination

  • Choose cursor-based pagination for all list endpoints (not offset-based — offset degrades at scale)
  • Return pagination metadata in every list response: { "data": [...], "has_more": true, "next_cursor": "abc123" }
  • Set a maximum page_size (e.g., 100) and a sensible default (e.g., 20) — never return unbounded results
  • Ensure cursors are opaque to clients (base64-encoded internal identifiers, not raw database IDs)
  • Document the sort order — cursors are meaningless without a defined, stable ordering

5. Error Format

  • Define a single, consistent error schema used by every endpoint:
{
  "error": {
    "code": "invalid_parameter",
    "message": "The 'currency' field must be an uppercase ISO 4217 code (e.g., 'USD'). Received: 'usd'.",
    "param": "currency",
    "type": "validation_error",
    "request_id": "req_abc123"
  }
}
  • Include a machine-readable code (for programmatic handling) AND a human-readable message (for debugging)
  • Include request_id in every response (success and error) — this is the lifeline for support debugging
  • Use appropriate HTTP status codes: 400 (bad input), 401 (not authenticated), 403 (not authorized), 404 (not found), 409 (conflict), 422 (validation failed), 429 (rate limited), 500 (server error)
  • Never leak internal details (stack traces, SQL errors, internal service names) in production error responses

6. Idempotency

  • Require an Idempotency-Key header on all non-idempotent mutations (POST, PATCH)
  • Store idempotency keys with the response — retries return the stored response, not a new execution
  • Scope keys to the API consumer (two different consumers can use the same key string)
  • Set a TTL on idempotency records (24 hours is standard — Stripe’s model)
  • Return an error if a retried request has a different body than the original (the key is bound to the parameters)
  • For critical operations (payments, order creation), wrap the idempotency check and business logic in a single database transaction

7. Documentation

  • Write an OpenAPI (Swagger) spec before writing code — the spec is the contract
  • Include a runnable example for every endpoint (curl command or code snippet in 2-3 languages)
  • Document every error code with cause, example, and suggested fix
  • Provide a “Getting Started” guide that gets a developer from zero to first successful API call in under 5 minutes
  • Include a changelog that documents every change with the date, affected endpoints, and migration path
  • Host interactive API docs (Swagger UI, Redoc, or Stoplight) — developers should be able to try endpoints without writing code

8. Additional Production Concerns

  • CORS — Configure allowed origins, methods, and headers for browser-based consumers
  • Request/response logging — Log every request with method, path, status code, latency, and consumer ID (redact sensitive fields)
  • Health check endpointGET /health returns 200 with dependency status (database connected, cache reachable) for load balancer probes
  • Compression — Support Accept-Encoding: gzip — JSON compresses 5-10x, which matters for bandwidth-constrained mobile clients
  • Timeouts — Set and document server-side request timeouts (e.g., 30 seconds) — a client should never wait indefinitely
  • Webhooks (if applicable) — Support event notifications with retry logic, signature verification (HMAC-SHA256), and a test mode

Interview Deep-Dive Questions

These questions simulate what a senior or staff-level interviewer would ask in a 45-60 minute technical interview covering APIs and databases. They are ordered from foundational to staff-level, and each includes follow-up chains that mirror how a real interview escalates. The answers are written as a strong candidate would actually speak — structured, specific, grounded in production experience.
What the interviewer is really testing: Can you systematically diagnose a performance problem across the full stack (network, serialization, query, payload design) rather than jumping to a single silver bullet? Do you think about the client’s perspective, not just the server’s?Strong answer:The first thing I would do is decompose where the latency is actually coming from, because “slow API” could mean five different things.
  1. Measure the breakdown. Instrument the request lifecycle: DNS resolution, TCP/TLS handshake, time-to-first-byte (TTFB), and content transfer time. On a 50KB JSON response over a typical 4G connection (~10 Mbps), the transfer itself is only about 40ms. If the total latency is 2 seconds, the transfer is not the bottleneck — the server processing time is. But on a 3G connection in a developing market (~1 Mbps), that 50KB takes 400ms just to transfer, and now payload size actually matters.
  2. Enable gzip/brotli compression. JSON compresses extremely well — typically 5-10x. A 50KB response compresses to roughly 5-8KB. This is the single highest-leverage, lowest-risk change. Configure it at the API gateway or reverse proxy (nginx, CloudFront). This alone might cut perceived latency by 30-50% for bandwidth-constrained clients.
  3. Analyze the payload. Are mobile clients actually using all 50KB of that nested data? In my experience, the answer is almost always no. A mobile feed view might need 5 fields from a response that includes 40. This is the classic REST over-fetching problem. Solutions: (a) Add a ?fields=id,name,thumbnail sparse fieldset parameter so clients request only what they need. (b) Create a dedicated mobile-optimized endpoint or a Backend-for-Frontend (BFF) that returns a slim payload. (c) If the problem is widespread across many endpoints, this is where GraphQL earns its keep — the client specifies exactly the shape it needs.
  4. Check the query layer. Run EXPLAIN ANALYZE on the queries backing the endpoint. A deeply nested JSON response often means the server is doing multiple JOINs or N+1 queries to assemble the object graph. If the endpoint returns an order with line items, each with a product, each with a category — that could be 4 sequential queries or a massive JOIN. Use eager loading (include the joins upfront) or DataLoader-style batching if the backend is GraphQL.
  5. Cache aggressively. If this endpoint serves data that does not change every second (product catalog, user profiles), add HTTP cache headers (Cache-Control: max-age=300, stale-while-revalidate=60) and consider a CDN (CloudFront, Fastly) in front of the API. For authenticated endpoints where CDN caching is not appropriate, use server-side caching with Redis — cache the serialized response for 30-60 seconds.
The key trade-off: Sparse fieldsets and BFFs solve the problem at the API layer but add maintenance cost. GraphQL solves it architecturally but introduces its own complexity (caching, query cost analysis, N+1 on the server). For a single problematic endpoint, I would start with compression + sparse fieldsets. If multiple endpoints have the same problem, I would evaluate GraphQL or a BFF pattern.

Follow-up: The team proposes moving to GraphQL to solve the over-fetching. What concerns would you raise?

GraphQL solves over-fetching elegantly, but it introduces problems that REST does not have:
  1. HTTP caching breaks. Every GraphQL request is a POST with a unique body. You cannot use a CDN to cache responses without persisted queries (pre-registered query strings with an ID, so GET /graphql?id=abc123 becomes cacheable). That is additional infrastructure.
  2. Query cost is unpredictable. A single GraphQL query can be { user(id: 1) { name } } (trivial) or { users { orders { items { product { reviews { author { orders { ... } } } } } } } } (a denial-of-service attack disguised as a query). You need query depth limits, complexity analysis, and cost-based rate limiting from day one. This is non-trivial to implement correctly.
  3. N+1 problems move to the server. In REST, the server controls what data is fetched. In GraphQL, the client’s query determines the resolver call graph. Without DataLoader (or equivalent batching), resolving users { orders { ... } } for 100 users triggers 100 separate getOrders calls. This is the most common GraphQL performance trap.
  4. Organizational cost. Every frontend developer now needs to understand the schema, write queries, and manage a client-side cache (Apollo Client’s normalized cache is powerful but has a steep learning curve). The backend team needs to maintain the schema, handle schema evolution without breaking existing queries, and monitor query performance.
My recommendation: if over-fetching is the primary pain point and you have multiple client types (web, mobile, partner), GraphQL is worth the investment. If it is one or two endpoints, sparse fieldsets or a BFF are simpler and lower-risk.

Follow-up: How would you implement sparse fieldsets in a REST API without it becoming a maintenance nightmare?

The cleanest approach is a generic ?fields= query parameter that the framework interprets:
  • The client sends GET /orders/123?fields=id,total,customer.name,items.product_name
  • The server parses the fields parameter into a projection tree
  • The ORM generates a SELECT with only the requested columns (and the necessary JOINs for nested fields)
  • The serializer filters the response to include only the requested fields
Practically, I would implement this as middleware that wraps the serialization layer, not by hand-coding field filtering in every controller. Django REST Framework has this built-in with DynamicFieldsMixin. In Express/Node, you can write a response transformer that prunes the JSON based on the fields parameter.The pitfall is nested fields. customer.name requires joining the customer table even though the top-level orders query might not need it. You need the fields parser to understand the object graph and translate field requests into appropriate database includes. This is where it can become complex — and if it gets too complex, you have reinvented half of GraphQL’s resolver engine. That is the signal to just adopt GraphQL.
Structured Answer Template.
  1. Decompose the latency — DNS, TLS, TTFB, content transfer. Do not treat “slow” as one thing.
  2. Quantify the transfer cost — 50KB on 4G is 40ms; on 3G it is 400ms. Network context matters.
  3. Cheapest fix first — gzip/brotli compresses JSON 5-10x. One config change, huge win.
  4. Then attack over-fetching — sparse fieldsets, BFF, or GraphQL. Match the tool to the scale of the problem.
  5. Then the query layer — N+1, JOIN count, missing indexes. Network optimizations are wasted if the server takes 1.8s to assemble the response.
Real-World Example. Shopify’s mobile team documented a case where their REST product detail API returned 120KB payloads with 60+ nested fields. After enabling gzip (6KB compressed), adding sparse fieldsets (?fields=id,title,thumbnail,price), and adding a mobile-optimized BFF endpoint, mobile p95 dropped from 2.8s to 340ms on 4G networks. The most impactful change was the BFF — it cut over-fetching by 85%, far beyond what gzip alone achieved.
Big Word Alert — over-fetching. Returning more data than the client actually needs, common in REST where endpoints have fixed response shapes. The mobile/REST mismatch is the textbook case. GraphQL and sparse fieldsets (?fields=) both solve it, with different complexity.
Big Word Alert — Backend-for-Frontend (BFF). A dedicated API layer between clients and core services that aggregates and reshapes data for a specific client type (mobile, web, TV). Popularized by SoundCloud. Avoids over-fetching without adopting GraphQL wholesale. Mention it as the middle-ground solution between REST and GraphQL.
Follow-up Q&A Chain:Q: Why not use HTTP/2 server push to preload related resources? A: HTTP/2 server push is deprecated in practice (Chrome removed it in 2022) because the server cannot know what the client has already cached. It caused more bandwidth waste than it saved. The modern equivalent is <link rel=preload> hints, which let the client decide what to preload — much better fit. Do not recommend server push in 2026.Q: Your BFF adds a service hop. Is the latency trade worth it? A: If the BFF aggregates 5 downstream calls into 1 client-facing response, you trade one round-trip (BFF to client) plus 5 internal round-trips (BFF to services) for 5 round-trips from the client directly. Internal round-trips are sub-millisecond; client round-trips are 50-200ms. The BFF wins massively on mobile networks. Only when the BFF becomes a bottleneck itself (adding 100ms of processing) does the trade-off invert.Q: How do you cache BFF responses when they aggregate per-user data? A: Two-tier caching. L1: cache each downstream call with its own TTL and key (user-agnostic where possible). L2: cache the BFF’s composed response with a short TTL (30-60s) keyed by user + endpoint. Even if user-specific, 60s of caching eliminates 95% of redundant work during a traffic spike. Use stale-while-revalidate so clients never wait on a cache refresh.
Further Reading:
  • Sam Newman — Pattern: Backends For Frontends
  • Shopify Engineering — “How Shopify Reduced Mobile API Payload Size by 80%”
  • Cloudflare Blog — “HTTP/2 Server Push: Lessons Learned”
What the interviewer is really testing: Can you explain a concept at multiple levels of abstraction? Do you understand the distributed systems implications, not just the basic pattern?Strong answer:The simple version: Imagine you are at an ATM withdrawing 200.Youpress"confirm,"andthescreenfreezes.Didthetransactiongothrough?Youdonotknow.Soyoupress"confirm"again.Withoutidempotency,thebankmightdeduct200. You press "confirm," and the screen freezes. Did the transaction go through? You do not know. So you press "confirm" again. Without idempotency, the bank might deduct 400 — two successful withdrawals. With idempotency, your ATM request carries a unique token (the idempotency key). The bank says, “I have already processed token ABC-123 — here is the original result.” You get $200 once, no matter how many times you press the button.The implementation: Client generates a UUID v4 per logical operation (not per HTTP request — per business intent). Sends it in an Idempotency-Key header. Server flow: (1) Check if the key exists in the store. (2) If it exists and the operation completed — return the stored response verbatim (same status code, same body). (3) If it exists and the operation is in-progress — return 409 Conflict to prevent concurrent duplicate processing. (4) If it does not exist — record the key with status “processing,” execute the operation, store the result, mark “completed.”What breaks without idempotency in a distributed system:
  1. Network retries create duplicates. The client sends a POST to create an order. The server processes it, inserts the row, but the response is lost (TCP connection reset, load balancer timeout, anything). The client retries. Without idempotency, a second order is created. At Stripe’s scale — processing billions of dollars — even a 0.01% duplicate rate means millions in double charges.
  2. Message queue redelivery. Most message brokers provide at-least-once delivery, not exactly-once. If a consumer processes a message, crashes before acknowledging it, the broker redelivers it. The consumer must be idempotent or it processes the same event twice — double-sending an email, double-crediting an account, double-triggering a webhook.
  3. Load balancer failover. The client sends a request to Server A. Server A processes it but crashes before responding. The load balancer retries to Server B. Without a shared idempotency store, Server B has no idea Server A already processed it. Both servers execute the operation. This is why the idempotency store must be shared (Redis or the primary database), not in-process memory.
  4. Distributed transactions get messy. In a saga pattern where creating an order involves charging a payment, reserving inventory, and sending a notification — if the payment step succeeds but inventory reservation fails and you need to compensate, the compensation (refund) must also be idempotent. Otherwise, retrying a compensation can issue multiple refunds.

Follow-up: How would you handle idempotency in a system where the primary database and Redis (your idempotency store) might have different states after a crash?

This is the critical implementation detail that separates a blog-post-level understanding from a production-level one. The core problem: if you check the idempotency key in Redis, then write to the database, a crash between those two steps leaves them inconsistent.The solution is to use the database itself as the idempotency store — not Redis. Wrap the idempotency check and the business operation in a single database transaction:
BEGIN;
-- Try to insert the idempotency key (fails if duplicate)
INSERT INTO idempotency_keys (key, status, request_hash)
VALUES ('uuid-123', 'processing', 'sha256:abc')
ON CONFLICT (key) DO NOTHING;

-- Check if we won the insert (key is new) or lost (key exists)
-- If key already existed, SELECT the stored response and return it
-- If key is new, execute the business logic within this same transaction:
INSERT INTO orders (customer_id, amount) VALUES (42, 99.99);

-- Store the response alongside the key
UPDATE idempotency_keys SET status = 'completed', response_body = '...' WHERE key = 'uuid-123';
COMMIT;
If the transaction commits, both the business operation and the idempotency record are persisted atomically. If it crashes at any point before commit, neither is persisted, and a retry starts fresh. There is no window of inconsistency.When to use Redis: As a fast-path check before hitting the database. Check Redis first (sub-millisecond). If the key exists in Redis, return the cached response without touching the database. If the key does not exist in Redis, fall through to the database transaction. This gives you Redis-speed for retries (the common case) with database-grade consistency for first-time operations.

Going Deeper: What happens when the idempotency key store itself becomes a scaling bottleneck?

At very high scale (millions of writes per second), the idempotency table becomes a hot write path. Every mutating API call writes to it. Strategies:
  1. TTL and cleanup. Set keys to expire after 24 hours (Stripe’s approach). Run a periodic job to delete expired keys. This keeps the table size bounded. At 1 million API calls per hour with 24-hour TTL, you have roughly 24 million active keys — manageable for PostgreSQL with proper indexing.
  2. Partition the idempotency table. Partition by created_at (daily or hourly). Dropping old partitions is O(1) — far faster than DELETE on millions of rows. This also keeps the index small and fast.
  3. Use a hash as the primary key. Idempotency keys are UUIDs (128 bits). Store them as uuid type, not text — this saves 20+ bytes per row and makes the B-tree index denser (more keys per page, fewer levels, faster lookups).
  4. Separate the hot path. For extremely high-throughput systems, use a dedicated Redis cluster just for idempotency with a TTL. Accept that a Redis failure means you lose deduplication for a brief window — for most operations this is an acceptable trade-off. For payment operations where duplicates are catastrophic, keep the database as the authoritative store and use Redis as a cache layer in front of it.
Structured Answer Template.
  1. Explain the intuition first — ATM analogy, press confirm twice, should be charged once. Makes the concept stick.
  2. Describe the four-case server flow — new key, in-progress key, completed key, mismatched body for same key.
  3. Distributed failure modes — network retry, message broker redelivery, LB failover, saga retries. All four cause duplicates without idempotency.
  4. Consistency fix — the idempotency check and the business write must be in one DB transaction.
  5. Scaling the idempotency store — TTL, partition by date, UUID type for key storage, Redis fast-path.
Real-World Example. Stripe’s 2017 blog post “Designing robust and predictable APIs with idempotency” is the canonical reference. They describe a real payment processor incident where a network timeout caused the merchant to retry, resulting in $1.2M of duplicate charges before detection. Post-incident, they moved idempotency key storage from Redis to PostgreSQL (with Redis as an L1 cache), wrapping key-check and business-write in the same transaction. Duplicate charge rate dropped from a few per month to zero.
Big Word Alert — at-least-once delivery. The guarantee that a message will be delivered one or more times, but possibly duplicates. All practical distributed systems (SQS, Kafka, Kinesis, webhooks) provide at-least-once by default. Exactly-once is either marketing or limited to in-system paths (Kafka Streams). Consumers must be idempotent.
Big Word Alert — transactional outbox pattern. Instead of writing to a DB and publishing to a queue (two operations that can partially fail), write both to the DB atomically (business row + outbox row), then a separate relay process publishes outbox rows to the queue. Solves the dual-write consistency problem. Name it when discussing cross-system idempotency.
Follow-up Q&A Chain:Q: Should the idempotency key live in Redis or PostgreSQL? A: For correctness, PostgreSQL — same transaction as the business write guarantees atomicity. For speed, Redis is 10x faster but introduces dual-write consistency problems. The production answer is usually: PostgreSQL as source of truth, Redis as an L1 cache for the check. If Redis says “key exists,” trust it; if Redis says “new,” fall through to PostgreSQL. Worst case on Redis miss is the request takes 5ms instead of 0.5ms.Q: Can you use the request body hash as the idempotency key instead of requiring a client header? A: You could, but it breaks on semantically identical requests with different encoding (trailing whitespace, map ordering, numeric precision). And you cannot distinguish “retry of the same operation” from “identical two operations deliberately submitted in sequence.” Client-generated keys are explicit about intent. Body-hash auto-keys are the 80% solution that fails in the 20% edge cases that matter.Q: How long should idempotency keys live? A: Stripe uses 24 hours. The logic: if a client has not retried within 24 hours, they have given up or lost state. Keeping keys longer wastes storage without preventing duplicates (the client is not going to retry day-old requests). Partition the idempotency table by day and drop old partitions — DROP TABLE idempotency_keys_2026_04_18 is instant and reclaims storage; DELETE FROM ... WHERE created_at < ... is slow and creates bloat.
Further Reading:
  • Stripe Blog — “Designing robust and predictable APIs with idempotency”
  • Brandur Leach — “Implementing Stripe-like idempotency keys in Postgres”
  • Martin Fowler — Idempotent Receiver pattern
What the interviewer is really testing: Do you understand the algorithmic trade-offs at a practical level, and can you reason about distributed state coordination for a seemingly simple feature?Strong answer:Token bucket works like a literal bucket of tokens. The bucket has a maximum capacity (say 100 tokens) and refills at a steady rate (say 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected. The key behavior: if a client has been idle, the bucket fills up to capacity, allowing a burst of 100 requests. Then it settles to the steady rate of 10/second.Sliding window counter divides time into fixed windows (say 1-minute intervals) and counts requests in each. For the current moment, it calculates a weighted count: (previous window count * overlap%) + current window count. If a client made 60 requests in the last minute and the limit is 100/minute, they have 40 remaining.When to pick each:
  • Token bucket when bursts are natural and acceptable. A mobile app that syncs 50 items on launch, then makes occasional requests. An API where clients batch operations. Stripe and AWS use token bucket because their clients have bursty access patterns and penalizing bursts would hurt developer experience. The burst size is the bucket capacity — it is a tunable safety valve.
  • Sliding window when you need strictly even distribution. Payment processing where 1,000 simultaneous requests could overwhelm a downstream bank API. Rate limiting writes to a database where sustained throughput matters more than peak. Sliding window prevents the “double burst at window boundary” problem that fixed windows have, without allowing the bursty behavior of token bucket.
Distributed rate limiting — the hard part:When you have 10 API gateway instances, each seeing a fraction of traffic, you need shared state. There are three approaches:
  1. Centralized counter in Redis (most common). Each gateway instance increments a counter in Redis on every request. For token bucket, store {tokens_remaining, last_refill_timestamp} and use a Lua script to atomically check-and-decrement:
-- Token bucket in Redis (atomic Lua script)
local tokens = tonumber(redis.call('get', KEYS[1]) or ARGV[1])  -- max_tokens default
local last_refill = tonumber(redis.call('get', KEYS[2]) or ARGV[3])
local now = tonumber(ARGV[3])
local refill_rate = tonumber(ARGV[2])
local elapsed = now - last_refill
tokens = math.min(tonumber(ARGV[1]), tokens + elapsed * refill_rate)
if tokens >= 1 then
  tokens = tokens - 1
  redis.call('set', KEYS[1], tokens)
  redis.call('set', KEYS[2], now)
  return 1  -- allowed
else
  return 0  -- rejected
end
The Lua script executes atomically in Redis, so there are no race conditions. Latency overhead: 0.1-0.5ms per request.
  1. Local counters with periodic sync. Each gateway maintains a local counter and syncs to Redis every N seconds. Faster (no per-request Redis call) but less accurate — during the sync interval, the total across all instances can exceed the limit by up to instances * local_threshold. Acceptable for high-throughput internal APIs where approximate limiting is fine.
  2. Redis Cell module. A native Redis module that implements token bucket with a single CL.THROTTLE command. Most operationally simple — one command, atomic, no Lua scripts. But requires loading a Redis module, which some managed Redis services do not support.
The practical choice for most teams: Redis with a Lua script for token bucket, or Redis INCR with TTL for sliding window. The latency overhead of a Redis round-trip per request (0.1-0.5ms) is negligible compared to the API request itself (typically 50-500ms).

Follow-up: What happens when your Redis rate limiting cluster goes down? Do you fail open (allow all traffic) or fail closed (reject all traffic)?

This is one of those decisions that should be made deliberately, not discovered during an incident.Fail open (allow traffic): Safest for user experience. If Redis is down for 30 seconds, you accept potentially unlimited traffic. For most APIs, this is acceptable because (a) Redis outages are rare and short (seconds to minutes), (b) your backend can handle some burst without rate limiting, and (c) rejecting all traffic during a Redis blip is worse than allowing some extra traffic.Fail closed (reject traffic): Necessary when the rate limit protects a critical downstream resource — a payment processor with hard limits, a third-party API that will ban you for exceeding their rate. In this case, a Redis failure should return 503 Service Unavailable with a Retry-After header rather than allowing uncontrolled traffic to hit the downstream system.The hybrid approach (what I would recommend): Fail open with a local fallback. Each gateway instance maintains a rough in-memory rate limit (even just “allow 100 requests per second per instance”) that activates when Redis is unreachable. This provides a coarse safety net without completely dropping rate limiting. When Redis recovers, switch back to centralized counting.

Follow-up: A client complains they are being rate-limited unfairly — they say they are well under the limit. How do you debug this?

First, check the basics: is their key being shared across multiple integrations (a common mistake where a team uses the same API key for staging and production)? Check the rate limit headers in their responses — X-RateLimit-Remaining and X-RateLimit-Reset — to confirm what the system thinks their usage is.Then check for clock skew. If your rate limiter uses sliding windows based on timestamps, and the Redis cluster’s clock is slightly off from the gateway’s clock, window boundaries shift and counts can be attributed to the wrong window. Use Redis server time, not gateway local time.Check if they are being rate-limited per-IP instead of per-key. If they are behind a corporate proxy or NAT, hundreds of users share one IP. Per-IP rate limiting will throttle the whole organization unfairly.Finally, check for “shadow traffic” — health check probes, monitoring synthetic requests, or integration test suites running against production. These all count against the rate limit but the client might not consider them “real” requests.
Structured Answer Template.
  1. Explain each algorithm’s mechanics — token bucket (bucket of tokens, refills steadily), sliding window (weighted prev + current window).
  2. Map each to a use case — token bucket for bursty clients, sliding window for strict enforcement.
  3. Distributed coordination — Redis Lua scripts for atomicity, or Redis Cell module for token bucket natively.
  4. Fail-open vs fail-closed decision — critical downstream protection changes the answer.
  5. Per-what — per-IP, per-key, per-user, per-tenant. Tiered limits for multi-tenant.
Real-World Example. Stripe’s rate limiter is a four-tier architecture documented in their “Scaling your API with rate limiters” post: (1) request rate limiter (how many per second), (2) concurrent request limiter (how many in flight simultaneously), (3) fleet usage load shedder (protect overall API from overload), (4) worker utilization load shedder (protect individual workers). All are implemented in Redis with Lua scripts — sub-millisecond enforcement on every API call. The tiered design is what lets them allow bursts (tier 1 token bucket) while preventing a slow client from holding 1000 concurrent requests (tier 2 concurrency limiter).
Big Word Alert — leaky bucket. Variant of token bucket where excess requests are queued and drained at a fixed rate, smoothing the output. Good for protecting downstream systems that cannot tolerate bursts. Less common in API rate limiting (because clients prefer a fast reject over a queued delay), more common in traffic shaping at the network layer.
Big Word Alert — fair queuing / weighted fair queuing (WFQ). Scheduling algorithm that allocates shared resources proportional to pre-assigned weights, preventing any one consumer from starving others. Applied to multi-tenant rate limiting: enterprise tenants get more weight, free-tier tenants get less, but free-tier is never completely starved under load.
Follow-up Q&A Chain:Q: Why is a naive INCR + EXPIRE not atomic in Redis? A: Because INCR and EXPIRE are two separate commands. If the server crashes between them, you have a counter with no TTL — it lives forever, and the rate limit becomes permanent for that key. The fix is a Lua script that executes INCR and EXPIRE atomically, or using Redis Cell module’s CL.THROTTLE which is atomic by design.Q: How do you handle rate limit counter drift across Redis replicas during a failover? A: During a Redis primary failure, the replica may be a few milliseconds behind. When it promotes, a client might briefly be allowed extra requests (replica did not see the most recent INCRs). For most rate limiters, this is acceptable — a few extra requests during a failover window is not a security or stability issue. For critical enforcement, use Redis with WAIT command (block until N replicas ack the write) at the cost of latency.Q: Should rate limiting happen at the API gateway or inside each service? A: Gateway, almost always. The gateway sees all traffic, can enforce per-consumer limits globally, and sheds load before requests consume backend capacity. Service-level rate limiting only catches requests that already made it through the gateway — too late to protect from abuse. The exception: internal service-to-service rate limiting (bulkheading) where you want to cap how many requests Service A sends to Service B regardless of overall traffic.
Further Reading:
  • Stripe Blog — “Scaling your API with rate limiters”
  • Cloudflare — “How we built rate limiting capable of scaling to millions of domains”
  • Redis Docs — Patterns: Rate Limiting (see rate limiting section)
What the interviewer is really testing: Do you understand the mechanics of synchronous replication, can you reason about the latency implications, and can you propose practical solutions that balance durability with performance?Strong answer:The latency jump is entirely expected and is fundamental to how synchronous replication works.Why it happens: Without sync replication, a write is confirmed as soon as the primary flushes to its local WAL (Write-Ahead Log) — that is a local disk fsync, typically 1-5ms on SSDs. With synchronous replication, the write is not confirmed until the standby also acknowledges that it has received and flushed the WAL data. This adds a network round-trip to the critical path: primary flushes locally (~3ms) + network transfer to standby (~1-2ms same AZ, 50-100ms cross-AZ, 100-300ms cross-region) + standby flushes (~3ms) + acknowledgment back (~1-2ms). If the standby is in a different availability zone (common for HA), the round-trip alone is 5-10ms. If it is cross-region, it could easily be 100-150ms — which explains the developer’s observation.Solutions, ordered by pragmatism:
  1. Semi-synchronous replication. PostgreSQL supports synchronous_standby_names = 'ANY 1 (standby1, standby2)'. With two standbys, the primary waits for only one to acknowledge. If one standby is fast (same AZ) and one is slow (cross-region), the fast one usually responds first. Write latency drops to the fastest standby’s round-trip. You still get durability — at least one replica has the data.
  2. Classify writes by criticality. Not every write needs synchronous replication. For financial transactions, payment records, and audit logs — yes, synchronous is warranted. For session updates, analytics events, and activity logs — asynchronous is fine. PostgreSQL allows setting synchronous_commit at the transaction level:
-- Critical write: wait for standby (default when sync replication is on)
SET LOCAL synchronous_commit = 'on';
INSERT INTO payments (...) VALUES (...);

-- Non-critical write: do not wait for standby
SET LOCAL synchronous_commit = 'local';
INSERT INTO activity_log (...) VALUES (...);
This gives you the best of both worlds: durability where it matters, performance where it does not.
  1. Move the standby closer. If the standby is cross-region (which explains the 150ms), consider whether you actually need cross-region replication for your synchronous standby. Put the synchronous standby in the same region (different AZ for fault tolerance — typically 1-3ms round-trip). Add an asynchronous replica cross-region for disaster recovery. Same durability benefit, much lower latency.
  2. Batch writes. If the application makes many small writes, batch them into fewer transactions. The sync replication overhead is per-commit, not per-row. Inserting 100 rows in a single transaction pays the round-trip penalty once. Inserting 100 rows in 100 separate transactions pays it 100 times.

Follow-up: The standby goes down. What happens to writes on the primary?

With synchronous replication, writes block. The primary waits for standby acknowledgment that will never come. After the configured timeout (wal_sender_timeout, default 60 seconds), the connection to the standby is dropped, and the behavior depends on your configuration:
  • If synchronous_standby_names lists only one standby and it is down, the primary has two choices: (a) block all writes indefinitely (safest for data, catastrophic for availability), or (b) fall back to asynchronous mode (maintain availability, accept that new writes are not replicated until the standby recovers).
  • With ANY 1 (standby1, standby2) and two standbys, losing one standby is fine — the primary continues with the other. This is why production setups should always have at least two synchronous standbys configured in an ANY 1 group.
In practice, most teams configure automatic failover (Patroni for PostgreSQL, RDS Multi-AZ for managed) that promotes the standby to primary and spins up a new standby. The key metric to monitor is pg_stat_replication.sync_state — if your synchronous standby shows as disconnected, you are either blocking writes or running with degraded durability.

Going Deeper: How does PostgreSQL’s WAL-based replication actually work at the byte level?

Every change to the database (INSERT, UPDATE, DELETE, even index modifications) is first written as a WAL record — a sequential, append-only log entry that describes the change. The WAL is divided into 16MB segment files (configurable). The primary streams these WAL bytes to standbys over a persistent TCP connection.The standby’s walreceiver process receives the bytes and writes them to its local WAL. The startup process then replays the WAL records against the standby’s data files, bringing it up to date. In “hot standby” mode, the standby can serve read queries while simultaneously replaying WAL — this is how read replicas work in PostgreSQL.For synchronous replication, the critical flag is remote_apply vs remote_write vs on:
  • synchronous_commit = on (default): primary waits until the standby has flushed WAL to disk. Durable against standby crash, but the standby has not yet replayed the records — queries on the standby might not see the data yet.
  • synchronous_commit = remote_apply: primary waits until the standby has replayed the WAL records. Queries on the standby immediately see the committed data. This is strongest consistency but adds replay latency on top of network latency.
  • synchronous_commit = remote_write: primary waits until the standby has received and written WAL to OS cache (not necessarily flushed to disk). Faster but vulnerable to standby OS crash.
In most production systems, on (flush to standby disk) is the right default. remote_apply is used when read-after-write consistency on replicas is required (the “read your own writes” scenario).
Structured Answer Template.
  1. Break down the added latency — local fsync + network RTT + remote fsync + ack. Quantify each hop.
  2. Distinguish sync levelson, remote_write, remote_apply. Each is a different point on the durability/latency curve.
  3. Propose semi-sync firstANY 1 (standby1, standby2) gives durability without fragility.
  4. Per-transaction synchronous_commit — mix sync and async based on business criticality.
  5. Question the topology — does the standby need to be cross-region? Same-AZ standby is typically 1-3ms RTT.
Real-World Example. Crunchy Data’s PostgreSQL blog documents a customer whose same-region sync replication was fine (3ms) until a cross-region replica was added and incorrectly listed in synchronous_standby_names as FIRST 1 (cross_region, same_az). The primary waited for the 180ms cross-region replica on every write. Fix: change to ANY 1 (same_az, cross_region) — whichever acks first unblocks the commit. Write latency returned to 3-5ms with cross-region durability retained.
Big Word Alert — semi-synchronous replication. A middle ground where the primary waits for at least one (not all) replicas to acknowledge. With three replicas using ANY 1, one slow/failed replica does not block commits. Name it as the practical production default — full sync across all replicas is too fragile.
Big Word Alert — synchronous_commit levels. PostgreSQL’s per-transaction durability knob: off (async, risk data loss), local (primary fsync only), remote_write (standby has received WAL in OS cache), on (standby has flushed WAL to disk — default), remote_apply (standby has replayed — strongest but slowest). Mention that you tune this per-transaction, not globally.
Follow-up Q&A Chain:Q: Your primary waits on a crashed standby. How long until writes resume? A: Until wal_sender_timeout fires (default 60s), the primary keeps waiting. Writes are blocked for up to 60 seconds — catastrophic for availability. The fix is to configure synchronous_standby_names = 'ANY 1 (standby1, standby2)' with multiple standbys, so losing one does not block commits. Or lower wal_sender_timeout at the cost of flapping during transient network blips.Q: Semi-sync gives durability for the standby that acked. What about the ones that did not? A: Asynchronous catch-up. They receive WAL in the background with a few seconds of lag. Durability is at least one replica has the data at commit time; additional replicas converge eventually. This is fine for most workloads. For scenarios needing multiple AZ durability (like financial), use ANY 2 with three standbys across three AZs.Q: When is remote_apply (strictest mode) genuinely worth the latency cost? A: When your read replicas must show the freshly-committed data immediately after a write (read-your-own-writes across nodes, not just same-session). Example: user submits a form, backend redirects to a “confirmation” page that reads from the replica. Without remote_apply, the replica may not have applied the write yet. Alternatives: session-sticky routing to primary post-write (cheaper), or causal consistency tokens.
Further Reading:
  • PostgreSQL Docs — synchronous_commit
  • Crunchy Data Blog — “PostgreSQL Synchronous Replication: A Deep Dive”
  • 2ndQuadrant — “Synchronous replication: the gritty details”
What the interviewer is really testing: Can you think through a seemingly simple feature request and identify the cascade of technical implications? Do you have the judgment to push back when the design has hidden costs?Strong answer:Soft delete sounds trivial — add a deleted_at TIMESTAMPTZ column, change DELETE to UPDATE, filter WHERE deleted_at IS NULL everywhere. But the real-world implications are significant, and I would want to discuss them before implementing.The implementation is straightforward. The consequences are not.
  1. Every query in the application must filter by deleted_at IS NULL. This is the most dangerous part. Miss one query, and deleted records appear in search results, reports, or API responses. In a codebase with 200 queries touching the users table, every single one needs the filter. Global query scopes (Rails default_scope, Django Manager, Hibernate @Where) help but are fragile — they can be accidentally bypassed, they do not apply to raw SQL, and they create confusion when you actually need to see deleted records (admin views, audit, analytics).
  2. Unique constraints break. If you have UNIQUE(email) and user A with email alice@example.com soft-deletes their account, user B cannot sign up with that same email because the soft-deleted row still occupies the unique constraint. The fix is a partial unique index: CREATE UNIQUE INDEX idx_unique_email ON users(email) WHERE deleted_at IS NULL. This works but is PostgreSQL-specific and needs to be applied to every unique constraint on every soft-delete table.
  3. Foreign keys become complicated. If a user is soft-deleted but their orders still reference user_id, what happens? Cascade soft-delete all orders? Leave them orphaned pointing to a “deleted” user? Neither is clean. In my experience, most teams end up with ad-hoc “if the parent is soft-deleted, treat the child as…” logic scattered throughout the codebase.
  4. Table bloat. Soft-deleted records never leave the table. A table that accumulates 10 million “deleted” records but only has 500K active records is 95% dead weight. Indexes include the dead rows (unless you use partial indexes for everything). Full table scans are 20x slower than they need to be. You end up needing a periodic “hard delete” job that archives old soft-deleted records to a separate table or data warehouse — which is basically implementing what you tried to avoid.
  5. Performance. A partial index WHERE deleted_at IS NULL helps for queries that filter active records, but the table itself grows unbounded. VACUUM still has to process the full table. Statistics calculations include soft-deleted rows. The query planner might make suboptimal decisions because the row count statistics do not reflect the actual working set.
What I would push back on:First, I would ask: why do we need soft delete? The answer matters:
  • “Users want an undo/recycle bin” — Implement a time-limited undo (30 days), then hard delete. Store the deleted record in a separate deleted_users table or an archive. This keeps the main table clean.
  • “Legal/compliance requires data retention” — Use an audit log or event sourcing pattern. Every state change is an immutable event. You can hard-delete from the operational database while maintaining a complete history in the audit log. This separates operational data from compliance data.
  • “We need to preserve foreign key relationships” — Consider using a status field (status: active | suspended | deactivated) instead of delete semantics. This is more honest about what is happening and avoids the mental model mismatch of “deleted but still there.”
  • “We are afraid of accidental deletes” — Use database backups and point-in-time recovery. That is what they are for.
My recommendation: If soft delete is truly needed, use it sparingly — only on the specific tables that require it, not as a blanket pattern. Use partial indexes on every query path. Implement a hard-delete archival job that moves records older than 90 days to an archive table. And use a query scope at the ORM level with explicit opt-in for accessing deleted records, not opt-out.

Follow-up: The team decides to go with soft delete. Six months later, the table has 50M soft-deleted rows out of 52M total. The API is slow. What do you do?

This is exactly the scenario I warned about. The table is 96% dead weight. Here is the recovery plan:
  1. Immediate relief: Create partial indexes for every query that filters active records: CREATE INDEX CONCURRENTLY idx_users_active ON users(id) WHERE deleted_at IS NULL. This gives the query planner a small, efficient index to work with instead of scanning the full 52M-row B-tree. For covering queries, include the needed columns: CREATE INDEX CONCURRENTLY idx_users_active_email ON users(email) WHERE deleted_at IS NULL INCLUDE (name, created_at).
  2. Archive the dead rows. Create a users_archive table with identical schema. Move soft-deleted rows older than 90 days in batches:
WITH moved AS (
  DELETE FROM users WHERE deleted_at < now() - INTERVAL '90 days'
  RETURNING *
)
INSERT INTO users_archive SELECT * FROM moved;
Do this in batches of 10,000 rows to avoid long-running transactions that block VACUUM and create replication lag. After archival, the main table drops from 52M to roughly 2.5M rows — a 20x reduction. Indexes shrink proportionally.
  1. Table maintenance. After bulk deletion, run VACUUM FULL users (locks the table — do during maintenance window) or pg_repack (no lock, requires extension) to reclaim disk space and rebuild indexes compactly.
  2. Prevent recurrence. Implement the archival job as a periodic cron that runs daily, moving records older than the retention period.

Follow-up: How does the event sourcing approach compare to soft delete for audit compliance?

In event sourcing, you never mutate or delete. Every state change is an immutable event: UserCreated, UserEmailUpdated, UserDeactivated, UserDeleted. The current state is derived by replaying events.For compliance, this is superior to soft delete because: (a) you have a complete, immutable audit trail — not just “is deleted” but “who deleted it, when, why, and what the state was before deletion.” (b) You can hard-delete from the operational database (projection/read model) while retaining the event log — GDPR right-to-erasure is handled by crypto-shredding the events (encrypting per-user, deleting the key). (c) The operational database only contains current state, so it stays lean.The trade-off is complexity. Event sourcing requires an event store, event replay logic, projections for read models, and a team that understands the pattern. For most applications, an append-only audit_log table that records every change (table name, row ID, old value, new value, timestamp, actor) gives 80% of the compliance benefit at 20% of the complexity. That is what I would recommend for most teams.
Structured Answer Template.
  1. Start with the ask: why — “undo/recycle bin” vs “compliance” vs “fear of deletes” each has a different right answer.
  2. Name the five real consequences — missed WHERE filters, broken unique constraints, FK ambiguity, bloat, index churn.
  3. Propose alternatives for each motivation — time-limited undo, audit log, status enum, backups.
  4. If soft delete is still chosen — enforce with partial unique indexes, ORM scopes, and archival jobs.
  5. Name the event-sourcing escape hatch — it is the “real” solution for compliance, at a cost.
Real-World Example. GitHub’s engineering team famously migrated away from soft-delete for their repositories table after it grew to ~70% dead rows. The migration: move soft-deleted rows older than 90 days to a repositories_archive table, drop the deleted_at filter from all hot-path queries, and keep partial unique indexes for uniqueness. Query latency on the repositories endpoint dropped 40% with no functional change. The lesson: soft-delete at scale is a tax you pay on every query, forever.
Big Word Alert — partial unique index. CREATE UNIQUE INDEX idx ON users(email) WHERE deleted_at IS NULL — enforces uniqueness only on active rows, letting soft-deleted rows have duplicate emails. Essential for any soft-delete schema. Without it, user B cannot reuse the email of a soft-deleted user A.
Big Word Alert — event sourcing. An architectural pattern where the database of record is an append-only log of events (UserCreated, UserEmailUpdated, UserDeleted), and current state is a derived projection. Immutable audit trail for free. Name it as the “right” solution for regulatory compliance — but caveat the complexity cost.
Follow-up Q&A Chain:Q: The soft-delete table is 96% dead rows. What is the cheapest recovery? A: Partial indexes for the 4% active rows, then batched archival of rows deleted more than 90 days ago to a separate archive table. After archival, pg_repack to reclaim disk space. Do not run VACUUM FULL on a 52M-row table — it locks for hours. Total recovery time: days of batched work, not an outage.Q: GDPR right-to-erasure requires hard delete. How do you combine that with soft-delete patterns? A: Two-tier: soft-delete for short-term undo (30 days), hard-delete for regulatory erasure. The scheduled archival job that moves soft-deleted rows past the undo window either (a) hard-deletes them if GDPR-scope data, or (b) moves to archive if retention is required. Event sourcing handles this with “crypto-shredding” — encrypt events per-user, delete the key, events become unreadable.Q: Should a deleted_at filter be in the ORM or the database? A: Both. ORM default scopes (Rails default_scope, Django Manager) prevent accidental inclusion, but they leak: raw SQL bypasses them, reporting queries forget, new developers unaware. Belt-and-suspenders: enforce at the database with Row-Level Security (CREATE POLICY active_only ON users USING (deleted_at IS NULL)). The database becomes the final authority — even a forgotten WHERE clause cannot see deleted rows.
Further Reading:
  • Martin Fowler — Event Sourcing
  • 2ndQuadrant — “Soft deletes considered harmful”
  • GitHub Engineering — “How we handle deleted repositories at scale”
What the interviewer is really testing: Do you understand CAP beyond the oversimplified “pick two” framing? Can you reason about consistency models in practical terms with real database examples?Strong answer:The CAP theorem (Brewer, 2000; proved by Gilbert and Lynch, 2002) states that a distributed data store can provide at most two of three guarantees simultaneously: Consistency (every read returns the most recent write), Availability (every request receives a non-error response), and Partition tolerance (the system continues operating despite network partitions between nodes).But “pick two” is misleading for three reasons:
  1. Partition tolerance is not optional. Network partitions happen in any distributed system — a cable gets cut, a switch fails, a cloud AZ has connectivity issues. You do not get to “pick” partition tolerance. The real choice is: when a partition occurs, do you sacrifice consistency (return potentially stale data) or availability (return an error)?
  2. The choice is not binary or permanent. Real databases make different trade-offs for different operations, different configurations, and even different queries. DynamoDB with ConsistentRead: false is AP (eventual consistency, always available). DynamoDB with ConsistentRead: true is CP (strong consistency, may reject reads during a partition if the leader is unreachable). Same database, different behavior per request.
  3. CAP says nothing about the common case. Most of the time, there is no partition. During normal operation, the real trade-off is latency vs consistency — which is what PACELC captures. Cassandra is fast because it does not wait for all replicas to agree (low latency, weaker consistency). CockroachDB is slower because it uses distributed consensus on every write (higher latency, strong consistency). The daily engineering experience is dominated by this latency/consistency trade-off, not by partition behavior.
Concrete examples of why “AP database” is an oversimplification:Cassandra is often called “AP,” but with quorum reads and writes (consistency level = QUORUM), it provides strong consistency — every read sees the most recent write as long as a majority of replicas are available. With consistency level = ONE, it is highly available but eventually consistent. The same Cassandra cluster is CP or AP depending on the consistency level you choose per query.MongoDB with writeConcern: { w: "majority" } and readConcern: "linearizable" provides strong consistency (CP behavior). With writeConcern: { w: 1 } and readConcern: "local", it is faster but may return stale data during a failover (AP behavior). MongoDB also has a concept of causal consistency sessions — where a client is guaranteed to see its own writes in order, without requiring full linearizability.PostgreSQL with synchronous replication is CP — it sacrifices availability (writes block) rather than return inconsistent data. But PostgreSQL with asynchronous replication has a window where the standby is behind, and reads from the standby are stale — which is an AP-like behavior for read-heavy workloads. Most production PostgreSQL setups are a hybrid: writes are always strongly consistent (go to primary), reads can be “eventually consistent” (from async replicas).The better framework is PACELC: During a Partition, choose Availability or Consistency. Else (normal operation), choose Latency or Consistency. This captures the daily trade-off that CAP ignores. Cassandra is PA/EL (available during partitions, low latency normally — at the cost of consistency in both cases). CockroachDB is PC/EC (consistent always, at the cost of availability during partitions and latency normally).

Follow-up: A colleague says “we need Serializable consistency for our entire application.” How do you respond?

I would push back, respectfully. Serializable isolation is the strongest guarantee — it ensures that concurrent transactions behave as if they executed one at a time. But it comes with real costs:
  1. Performance penalty. PostgreSQL’s Serializable Snapshot Isolation (SSI) tracks read and write dependencies across all concurrent transactions. When it detects a potential serialization anomaly, it aborts one of the transactions with a serialization failure. Your application must catch these failures and retry. Under high concurrency, the abort rate can be significant — I have seen 5-15% of transactions aborted in write-heavy workloads. Each abort wastes all the work done by that transaction.
  2. Throughput reduction. Because SSI sometimes aborts transactions speculatively (it errs on the side of caution), total throughput drops. Benchmarks typically show 20-40% lower throughput compared to Read Committed for write-heavy workloads.
  3. Most operations do not need it. Displaying a user’s dashboard (Read Committed is fine — slight staleness is invisible). Updating a user’s email (Read Committed with a unique constraint is sufficient). Generating a report (Repeatable Read to get a consistent snapshot). The operations that truly need Serializable are specific: double-booking prevention, inventory reservation where two buyers compete for the last item, financial transfers with multi-account balance constraints (write skew).
My recommendation: Use Read Committed as the default (it is PostgreSQL’s default for a reason). Escalate to Repeatable Read for reports and analytics queries that need a consistent snapshot. Use Serializable only for the specific transactions where write skew or phantom reads would cause a business logic violation. Set it per-transaction with SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, not as a database-wide default.

Follow-up: How does CockroachDB achieve strong consistency across regions with acceptable latency?

CockroachDB uses the Raft consensus protocol — every write goes through a Raft group where a majority of replicas must agree before the write is committed. In a 3-region setup (us-east, us-west, europe), every write requires at least 2 of 3 replicas to acknowledge.The latency trick is “leaseholder locality.” CockroachDB assigns a “leaseholder” for each range (chunk of data). The leaseholder serves all reads and coordinates writes. If you configure the leaseholder to be in the same region as most readers and writers for that data (using zone configurations and locality-aware placement), reads are local (fast) and writes only need one cross-region round-trip for Raft consensus (the leaseholder + one nearby follower form a majority).For truly global applications, CockroachDB supports “stale reads” (AS OF SYSTEM TIME) that read from the local replica without a round-trip to the leaseholder. This trades consistency for latency on a per-query basis — similar to reading from an async replica in PostgreSQL, but with a bounded staleness guarantee.Google’s Spanner (which inspired CockroachDB) goes further with TrueTime — GPS and atomic clock synchronized timestamps that allow globally consistent reads without coordination. CockroachDB approximates this with hybrid-logical clocks (HLC), which are less precise but do not require specialized hardware.
What the interviewer is really testing: This is a staff-level question testing system design judgment, understanding of data isolation patterns, and the ability to reason about tail-latency effects in shared infrastructure.Strong answer:This is one of the hardest practical database design problems because you are balancing operational simplicity (one database for all tenants) against isolation guarantees (one tenant’s query should never impact another).Architecture options, in order of isolation strength:Option 1: Shared tables with a tenant_id column (pool model). All tenants share the same tables. Every table has a tenant_id column. Every query includes WHERE tenant_id = ?. Every index is prefixed with tenant_id.
CREATE TABLE orders (
  id BIGSERIAL,
  tenant_id UUID NOT NULL,
  customer_id BIGINT NOT NULL,
  total NUMERIC(12,2),
  created_at TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (tenant_id, id)
);

CREATE INDEX idx_orders_tenant_customer ON orders(tenant_id, customer_id);
CREATE INDEX idx_orders_tenant_created ON orders(tenant_id, created_at DESC);
Making tenant_id part of the primary key and every index means queries for one tenant always hit a contiguous range of the B-tree. PostgreSQL’s query planner can use index scans that skip directly to the tenant’s section.Option 2: Schema-per-tenant (silo model). Each tenant gets their own PostgreSQL schema (CREATE SCHEMA tenant_abc). Tables within the schema have no tenant_id column. Application sets search_path to the tenant’s schema on each request.Option 3: Database-per-tenant. Maximum isolation. Each tenant is a separate database instance (or a separate database within a PostgreSQL cluster). Used by Slack (one MySQL DB per workspace) and some enterprise SaaS products.My recommendation for most SaaS products: shared tables with tenant_id (Option 1) for small/medium tenants, with the ability to shard large tenants to dedicated infrastructure.Preventing noisy neighbors (the core of the question):
  1. Row-Level Security (RLS) in PostgreSQL. Enforce tenant isolation at the database level, not just the application level:
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.current_tenant')::UUID);
Set app.current_tenant at the start of each request. Even if application code forgets the WHERE clause, the database enforces isolation. This prevents the most dangerous failure mode: one tenant seeing another tenant’s data.
  1. Query-level resource limits. PostgreSQL’s statement_timeout kills queries that run too long. Set it per-role: large enterprise tenants get 30 seconds, small tenants get 5 seconds. This prevents a poorly-written report from a large tenant from holding connections and locks for minutes.
  2. Connection pool isolation. Dedicate a portion of the connection pool to each tenant tier. Large enterprise tenants get 20 dedicated connections. Small tenants share a pool of 50 connections. This prevents a single tenant from exhausting all connections during a traffic spike.
  3. Partitioning by tenant_id for the largest tenants. When a tenant grows past a threshold (say 10M rows), create a dedicated partition for them:
-- Declarative partitioning by tenant_id
CREATE TABLE orders (
  id BIGSERIAL, tenant_id UUID NOT NULL, ...
) PARTITION BY LIST (tenant_id);

CREATE TABLE orders_tenant_large PARTITION OF orders
  FOR VALUES IN ('large-tenant-uuid');
CREATE TABLE orders_default PARTITION OF orders DEFAULT;
The large tenant’s queries hit only their partition (10M rows), not the full 500M-row table. Small tenants share the default partition. PostgreSQL’s partition pruning ensures each query only touches the relevant partition.
  1. Monitoring per-tenant query patterns. Track P99 query latency per tenant. If a tenant’s queries degrade, investigate before it affects others. pg_stat_statements with a tenant_id label (via the application, logged as a comment in the query) gives you per-tenant query performance visibility.

Follow-up: A tenant with 50M rows wants to run heavy analytics queries. This is destroying performance for other tenants. What do you do?

This is the classic OLTP vs OLAP collision. Never run analytics on the operational database.
  1. Immediate mitigation: Route the tenant’s analytics queries to a read replica. This takes the load off the primary immediately. The replica might have 1-10 seconds of replication lag, which is acceptable for analytics.
  2. Medium-term solution: Set up a dedicated analytics pipeline. Replicate the tenant’s data to a columnar data warehouse (BigQuery, Redshift, Snowflake, or even DuckDB for smaller datasets). Analytics queries run there, completely isolated from the OLTP database. The cost of running Redshift for one tenant’s analytics is far less than the cost of degraded performance for all tenants.
  3. Product-level solution: Offer “advanced analytics” as a premium tier feature. Enterprise tenants who need heavy analytics get a dedicated read replica or data warehouse access. This aligns the cost (infrastructure) with the value (revenue).

Going Deeper: How does DynamoDB handle multi-tenancy differently from PostgreSQL, and when would you prefer it?

DynamoDB’s multi-tenancy is built into its architecture. Each table is automatically partitioned by the partition key. If tenant_id is your partition key (or partition key prefix), each tenant’s data is physically isolated on different partitions, handled by different storage nodes. A hot tenant (high traffic) triggers automatic partition splits — DynamoDB adds more partitions for that tenant without affecting others. There is no shared B-tree, no shared connection pool, no noisy neighbor problem at the storage level.The trade-off: DynamoDB forces you to design your access patterns upfront. With PostgreSQL, you can add an index later when a new query pattern emerges. With DynamoDB, if you did not design the partition key and sort key for a particular access pattern, adding it later means creating a Global Secondary Index (which is eventually consistent) or restructuring the table. Single-table design in DynamoDB is powerful but requires deep upfront design work.I would prefer DynamoDB for multi-tenant SaaS when: (a) access patterns are well-defined and key-based (get tenant’s orders, get tenant’s users), (b) tenants vary wildly in size and you want automatic scaling without manual partitioning, (c) you want zero operational overhead for scaling. I would prefer PostgreSQL when: (a) access patterns involve complex queries, JOINs, or ad-hoc analytics, (b) you need strong multi-row transactions, (c) the team has deep SQL expertise and limited DynamoDB experience.
What the interviewer is really testing: Do you understand database internals deeply enough to make informed decisions about schema changes in production? Can you reason about the impact of DDL operations on a running system?Strong answer:This is one of those operations that looks identical in the application but is radically different under the hood.Regular CREATE INDEX: PostgreSQL acquires a SHARE lock on the table. This blocks all writes (INSERT, UPDATE, DELETE) for the entire duration of the index build. Reads continue. On a 500M-row table, building a B-tree index takes 10-60 minutes depending on disk speed, row width, and available maintenance_work_mem. During those 10-60 minutes, every write to that table queues and waits. For a production system handling 1,000 writes per second, that means 600,000 to 3.6 million queued writes. Connection pools exhaust. Application timeouts fire. Cascading failures ensue. This is why you never run a bare CREATE INDEX on a large production table.CREATE INDEX CONCURRENTLY: PostgreSQL builds the index in three phases without blocking writes:
  1. Phase 1 — Initial scan. PostgreSQL scans the entire table and builds the index from the current snapshot. During this scan, new writes continue normally — they modify the table but the in-progress index does not see them yet. This phase takes roughly the same time as a regular index build (10-60 minutes on 500M rows) but the table is fully writable. PostgreSQL acquires only a ShareUpdateExclusiveLock, which blocks other DDL (you cannot run two concurrent index builds simultaneously) but allows all DML (INSERT/UPDATE/DELETE).
  2. Phase 2 — Second scan. PostgreSQL performs a second pass to pick up any rows that were inserted, updated, or deleted during Phase 1. This catches the “delta” between the snapshot used in Phase 1 and the current state. For a table with moderate write volume, this phase is fast — it only processes the changes, not the full table.
  3. Phase 3 — Validation. PostgreSQL validates that the index is consistent with the table. For unique indexes, it checks uniqueness constraints against the final state. Once validated, the index is marked as valid and the query planner starts using it.
The trade-offs:
  • Takes longer overall. Because it does two table scans instead of one, CREATE INDEX CONCURRENTLY takes roughly 1.5-2x as long as a regular CREATE INDEX. On a 500M-row table, expect 15-90 minutes.
  • Uses more I/O. The double scan means double the disk reads. On an I/O-constrained system, this can impact query performance during the build. Monitor disk utilization and consider running it during off-peak hours.
  • Cannot run in a transaction block. CREATE INDEX CONCURRENTLY cannot be wrapped in a BEGIN ... COMMIT block. If it fails partway through (out of disk space, unique violation), it leaves an invalid index. You need to check pg_index.indisvalid and clean up with DROP INDEX CONCURRENTLY if it failed.
  • maintenance_work_mem matters. The sort phase of index building uses maintenance_work_mem (default 64MB). For a 500M-row index build, increasing it to 1-2GB dramatically reduces sort time because more of the sort can happen in memory instead of spilling to disk.
What I would actually do in production:
-- Increase sort memory for this session
SET maintenance_work_mem = '2GB';

-- Build the index concurrently
CREATE INDEX CONCURRENTLY idx_orders_customer_status
ON orders(customer_id, status);

-- Verify it was built successfully
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE indexrelid = 'idx_orders_customer_status'::regclass;
Monitor pg_stat_progress_create_index (PostgreSQL 12+) to track progress in real time — it tells you which phase you are in, how many tuples have been processed, and the total tuple count.

Follow-up: The concurrent index build fails halfway through because of a unique constraint violation. What state is the database in? How do you recover?

The database has an “invalid” index. It exists in the catalog (pg_index) with indisvalid = false. The query planner will not use it, but it still consumes disk space and still slows down writes (because PostgreSQL maintains it on inserts/updates even though it is invalid).Recovery: DROP INDEX CONCURRENTLY idx_orders_customer_status; — this removes the invalid index without blocking writes. Then fix the data issue (find and resolve the duplicate values), and retry the CREATE INDEX CONCURRENTLY.The dangerous mistake: not cleaning up invalid indexes. I have seen production databases with multiple invalid indexes from failed builds — each one adding write overhead without providing any read benefit. Periodically audit for invalid indexes:
SELECT indexrelid::regclass AS index_name, indrelid::regclass AS table_name
FROM pg_index WHERE NOT indisvalid;

Follow-up: How does MySQL handle this differently? Can you create indexes without blocking writes in MySQL?

MySQL InnoDB handles this differently. Since MySQL 5.6, most ALTER TABLE ... ADD INDEX operations are “online DDL” — they allow concurrent reads and writes during the index build. MySQL uses a mechanism similar to PostgreSQL’s concurrent build: it scans the table, builds the index, captures the delta of writes during the build, then applies the delta.The key difference: MySQL’s online DDL acquires brief metadata locks at the start and end of the operation (which can still cause brief stalls if there are long-running transactions holding locks), but the bulk of the index build allows concurrent DML. PostgreSQL’s CREATE INDEX CONCURRENTLY is more explicitly designed for zero-write-blocking, but MySQL’s online DDL achieves a similar practical result.For very large MySQL tables (billions of rows), teams often use gh-ost or pt-online-schema-change instead of native online DDL, because these tools provide better throttling (they pause when replication lag exceeds a threshold), more predictable impact, and the ability to abort without leaving the table in an intermediate state.
What the interviewer is really testing: Can you make a practical technology choice based on specific constraints rather than defaulting to the “fashionable” option? Do you understand the operational implications beyond the performance numbers?Strong answer:At 15,000 RPS between two services you own, gRPC is the stronger choice — but it is not automatic. Let me walk through the reasoning.Why gRPC fits here:
  1. Wire efficiency. Protocol Buffers are 3-10x smaller than JSON. At 15,000 RPS, if the average JSON payload is 2KB, that is 30MB/second of serialization overhead. Protobuf reduces that to 3-6MB/second. More importantly, serialization/deserialization CPU cost drops significantly — protobuf serialization is 5-20x faster than JSON parsing because it is a simple binary format with no parsing ambiguity (no string escaping, no whitespace handling, no number type inference).
  2. HTTP/2 multiplexing. REST over HTTP/1.1 uses one TCP connection per concurrent request (or connection pooling with head-of-line blocking). At 15,000 RPS with 10ms average latency, you need ~150 concurrent connections. gRPC runs over HTTP/2, which multiplexes all 15,000 RPS over a handful of TCP connections (typically 1-4 per client instance). This reduces TCP overhead, reduces connection management complexity, and plays better with load balancers.
  3. Code-generated clients and servers. The .proto file is the contract. Both teams generate strongly-typed client and server stubs from it. No hand-written HTTP clients, no JSON parsing bugs, no argument about field naming conventions. At 15,000 RPS, even a 0.01% error rate from a parsing bug is 1.5 errors per second — enough to fire alerts.
  4. Streaming capability. If the use case evolves to need server-side streaming (one service pushing updates to the other), gRPC supports it natively. Adding WebSocket or SSE to a REST API is a fundamentally different protocol.
But I would also consider the counter-arguments:
  1. Operational complexity. gRPC requires proto file management, code generation pipelines, and proto file versioning. If the team has no gRPC experience, the ramp-up cost is real — 2-4 weeks before the team is productive. REST with JSON is universally understood.
  2. Debugging difficulty. You cannot curl a gRPC endpoint. You need grpcurl or Postman with gRPC support. Reading a protobuf response in a network trace requires decoding tools. When something goes wrong at 3 AM, the ability to curl an endpoint and read the JSON response is genuinely valuable.
  3. Load balancer compatibility. gRPC requires HTTP/2-aware load balancing. Traditional L4 load balancers work, but L7 load balancing (routing based on gRPC method, header-based routing) requires specific support (Envoy, AWS ALB with gRPC support, nginx with grpc_pass). If your infrastructure is not already HTTP/2 ready, this is a prerequisite.
  4. Is 15,000 RPS actually stressing REST? A well-implemented REST API with connection pooling, HTTP/2 (yes, REST can use HTTP/2 too), and efficient JSON serialization (simdjson, orjson) can handle 15,000 RPS without breaking a sweat. The serialization overhead at 15K RPS is probably 2-5% of total CPU. If the service is doing meaningful work (database queries, business logic), the JSON vs protobuf difference is noise.
My actual recommendation: If both teams have gRPC experience and your infrastructure supports HTTP/2 routing — use gRPC. The strong typing, code generation, and wire efficiency are clear wins for a high-throughput internal service.If the team is new to gRPC and the service is launching soon — start with REST over HTTP/2 with a well-defined OpenAPI spec. The performance difference at 15K RPS will be negligible compared to the delivery speed difference. Migrate to gRPC later if serialization overhead becomes a measured bottleneck (not a hypothetical one).The decision that would concern me most: choosing gRPC purely for resume-driven development when the team has no experience with it and a deadline in 6 weeks. Technology choices should be driven by team capability and operational readiness, not just theoretical performance.

Follow-up: The team picks gRPC. Six months later, a new frontend team wants to call this service from the browser. What is your approach?

gRPC does not work natively in browsers — browsers cannot make HTTP/2 requests with the gRPC framing that gRPC requires. There are three options:
  1. gRPC-Web. A protocol adaptation that works over HTTP/1.1. The browser sends gRPC-Web requests to an Envoy proxy (or any gRPC-Web compatible proxy), which translates them to standard gRPC calls to the backend. The frontend team uses a generated gRPC-Web client. This preserves strong typing and the proto contract but requires the proxy layer.
  2. REST gateway. Add a thin REST API that translates HTTP/JSON to gRPC calls. Tools like grpc-gateway (Go) or Envoy’s gRPC-JSON transcoding can auto-generate REST endpoints from proto definitions. The frontend team uses a standard REST client. This is the lowest friction for the frontend team but adds a translation layer.
  3. Connect protocol (Buf). A newer protocol from the Buf team (creators of protobuf tooling) that is wire-compatible with gRPC but also supports standard HTTP/1.1 with JSON. A single service implementation serves both gRPC clients (binary, HTTP/2) and browser clients (JSON, HTTP/1.1). This is increasingly the recommended approach for new projects that need both internal gRPC and browser access.
I would recommend option 2 (REST gateway via grpc-gateway or Envoy transcoding) if the frontend team is already productive with REST. Option 3 (Connect) if we are early enough to adopt it without rewriting existing clients.

Going Deeper: How does protobuf backward compatibility work, and what are the rules for evolving a proto file without breaking clients?

Protobuf’s backward compatibility rules are critical for service evolution:Safe changes (never break existing clients):
  • Adding new fields (with a new field number). Old clients ignore unknown fields. New clients use default values for missing fields.
  • Adding new enum values. Old clients treat unknown enum values as the default (0 in proto3).
  • Adding new RPC methods to a service. Old clients never call them.
Breaking changes (will break clients):
  • Changing a field’s type (int32 to string).
  • Changing a field number (the field number is the wire identifier, not the name).
  • Removing a field (old clients that set this field will silently lose data; old clients that read it will get the default value, which may violate assumptions).
  • Renaming a field (safe on the wire since protobuf uses numbers, but breaks generated code that references the old name).
The critical rule: Never reuse a field number. If you remove field 7, mark it as reserved 7; so no one accidentally reuses it later. A new field with the same number but a different type would silently corrupt data.In practice, I manage this with a CI check: buf breaking --against .git#branch=main (from the Buf tool) compares the current proto file against the main branch and fails the build if any breaking change is detected. This is the proto equivalent of contract testing for REST APIs.
What the interviewer is really testing: Do you understand concurrency control beyond textbook definitions? Can you reason about which strategy fits which contention pattern, and have you seen the consequences of getting it wrong?Strong answer:Pessimistic locking assumes conflicts are likely. Before modifying a resource, you acquire an exclusive lock (SELECT ... FOR UPDATE in SQL). Other transactions attempting to modify the same resource block until you release the lock (on COMMIT or ROLLBACK). Think of it like locking a bathroom door — only one person at a time, everyone else waits in line.Optimistic locking assumes conflicts are rare. You read the resource with a version number, do your work, and at write time, check if the version has changed (UPDATE ... SET version = version + 1 WHERE id = ? AND version = ?). If the update affects 0 rows, someone else modified it first — you retry or fail. Think of it like a collaborative document where you edit your copy and check at save time whether anyone else changed it.When to use each:
  • Pessimistic: High contention — multiple users frequently modifying the same resource. Example: seat reservation for a popular concert (thousands of people trying to grab the same seats simultaneously). Inventory decrements for a flash sale. Account balance updates in a banking system. The cost of contention is high, and you would rather wait briefly than retry repeatedly.
  • Optimistic: Low contention — conflicts are rare. Example: editing a user profile (how often do two people edit the same profile simultaneously?). Updating configuration settings. Modifying a shopping cart. Most of the time, there is no conflict, and the version check is just a cheap WHERE clause addition.
A production incident scenario:A team I know about built an e-commerce flash sale system using optimistic locking on inventory. The product had 100 units. When the sale launched, 5,000 concurrent users tried to buy simultaneously. The flow was: read inventory count (version 1) -> process payment -> decrement inventory with WHERE version = 1.The first buyer succeeded: inventory went to 99, version to 2. The remaining 4,999 requests all had version 1, so their updates affected 0 rows. They retried: read inventory (version 2), retry payment, decrement with WHERE version = 2. But only one of those 4,999 retries succeeded. The remaining 4,998 retried again. This created an avalanche — each “round” sold one unit but required every remaining buyer to retry. By the time 100 units were sold, the system had processed roughly 100 * 5,000/2 = 250,000 retry cycles. The database was overwhelmed with read-modify-write cycles. Payment timeouts cascaded. Some users were charged but did not get their item because the inventory update failed after payment succeeded.The fix: Switch to pessimistic locking for flash sale inventory. SELECT quantity FROM inventory WHERE product_id = ? FOR UPDATE SKIP LOCKED acquires the lock or immediately skips (returns nothing) if already locked. The SKIP LOCKED is key — it means concurrent buyers do not queue up waiting for each other. They either get the lock and proceed, or immediately get “sold out.” The database handles this efficiently because it is a single-row lock.Even better: use a queue-based approach. Funnel all purchase requests into a queue. A single consumer processes them sequentially — no locking needed because there is no concurrency. The queue absorbs the burst, the consumer processes at a steady rate, and buyers get a “your order is being processed” response immediately.

Follow-up: How does SELECT ... FOR UPDATE SKIP LOCKED actually work, and what is the difference from NOWAIT?

FOR UPDATE acquires a row-level exclusive lock. Three flavors:
  • FOR UPDATE (plain): Blocks until the lock is available. If another transaction holds it, you wait. Simplest, but under high contention, you get long wait queues and potential deadlocks.
  • FOR UPDATE NOWAIT: Tries to acquire the lock. If it is already held, immediately throws an error (ERROR: could not obtain lock on row). Useful when you want to fail fast rather than wait — the application catches the error and retries or returns an error to the user.
  • FOR UPDATE SKIP LOCKED: Tries to acquire the lock. If the row is already locked, skips that row entirely (as if it did not exist in the result set). This is designed for “grab the next available item” patterns — worker queue processing, ticket reservation, any scenario where you want to grab any available row, not a specific row.
Example — job queue with SKIP LOCKED:
-- Worker grabs the next unprocessed job, skipping any locked by other workers
WITH next_job AS (
  SELECT id, payload FROM jobs
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
UPDATE jobs SET status = 'processing', locked_by = 'worker-1'
WHERE id = (SELECT id FROM next_job)
RETURNING *;
Multiple workers can run this concurrently without blocking each other. Each gets a different pending job. No deadlocks, no retries. This pattern is so effective that it is the foundation of PostgreSQL-based job queues like pgboss, Oban (Elixir), and Que (Ruby).

Going Deeper: How does MVCC relate to locking in PostgreSQL? Do readers block writers?

In PostgreSQL, readers never block writers and writers never block readers. This is the fundamental benefit of MVCC (Multi-Version Concurrency Control).When a transaction reads a row, it sees the version of the row that was committed before the transaction’s snapshot timestamp. It does not acquire any lock that would block a concurrent writer. When a transaction writes a row, it creates a new version of the row — the old version remains visible to transactions that started before the write.The exception: SELECT ... FOR UPDATE explicitly requests a lock. This is an advisory statement saying “I intend to modify this row, so block other writers.” Without FOR UPDATE, a concurrent reader and writer operate completely independently.This is fundamentally different from MySQL InnoDB in REPEATABLE READ mode, which uses gap locks and next-key locks that can cause readers to block writers in some scenarios (particularly range queries). PostgreSQL’s snapshot isolation is purely version-based — it never uses lock-based read isolation (except at SERIALIZABLE, where it uses predicate locks for SSI detection, but even those do not cause blocking — they only cause transaction aborts if an anomaly is detected).The cost of MVCC: dead tuples. Every update creates a new row version and marks the old one as dead. VACUUM must periodically clean up dead tuples to reclaim space and prevent index bloat. On a write-heavy table with 50,000 updates per second, VACUUM must keep pace or the table grows unboundedly. This is the single most important PostgreSQL operational concern — autovacuum tuning.
What the interviewer is really testing: Can you plan and execute a complex, multi-phase migration that involves external stakeholders? Do you understand the organizational and technical dimensions of backward compatibility?Strong answer:This is equal parts technical problem and communication problem. Breaking 200 third-party integrations is not just a bug — it is a trust violation that damages your platform’s reputation. The migration must be invisible to consumers who are not ready to move.Phase 1: Expand (months 1-3)
  1. Deploy v2 alongside v1. Both versions run simultaneously. The v2 endpoints live at /v2/resources. The v1 endpoints remain untouched at /v1/resources. This is the “expand” phase of expand-and-contract.
  2. Internally, both versions share the same backend logic. Do not maintain two separate codebases. Implement a version transformation layer — v1 requests are transformed to the internal canonical format, processed, and the response is transformed back to v1 format. v2 requests use the canonical format directly (or a different transformation). Stripe does exactly this — they maintain hundreds of API versions through a chain of version-specific request/response transforms, all running against a single backend.
  3. Use a version router. The API gateway or a middleware layer inspects the version (URL path, header, or the consumer’s pinned version) and applies the appropriate transformations. This means the business logic does not need to know about versions at all.
Phase 2: Communicate (months 2-6)
  1. Publish a migration guide for every breaking change. Not just “field X was renamed to Y” — but working code examples showing the v1 call, the equivalent v2 call, and how to handle the migration in common languages (Python, Node, Go, Java). Stripe’s migration guides are the gold standard here.
  2. Add deprecation headers to v1 responses:
Deprecation: true
Sunset: 2027-06-01
Link: <https://docs.example.com/migration-v2>; rel="successor-version"
These are machine-parsable — well-built clients can surface warnings automatically.
  1. Email all 200 integrators with a timeline: v2 available now, v1 deprecated on date X, v1 sunset on date Y. Give at least 12 months between deprecation and sunset for public APIs.
  2. Provide a migration testing sandbox. Let integrators test their updated code against v2 in a staging environment before committing. Offer to review their migration code if they are a large partner.
Phase 3: Monitor and nudge (months 6-12)
  1. Track per-consumer version usage. Dashboard showing which consumers are still on v1, their traffic volume, and last request timestamp. Reach out to high-volume v1 consumers individually.
  2. Default new consumers to v2. New API keys get v2 by default. New documentation defaults to v2. Move the v1 docs to a “legacy” section but keep them accessible.
  3. Progressive rate limit tightening on v1. Optional and aggressive, but effective: reduce v1 rate limits by 10% per quarter after the deprecation date. This creates gentle pressure without breaking anyone. Communicate this in advance.
Phase 4: Contract (month 12+)
  1. Sunset v1. After the announced date, v1 endpoints return 410 Gone with a response body containing the migration guide URL. Do not return 404 — 410 explicitly means “this existed but has been intentionally removed.”
  2. Keep the v1 transformation code for at least 6 more months after sunset. If a critical partner missed the migration, you can temporarily re-enable it.
The technical key insight: The version transformation layer is the critical investment. Once you have it, adding v3 in the future is just adding new transforms, not building a new API. Stripe processes billions of requests across hundreds of API versions through this pattern, and it is why they can evolve rapidly without breaking anyone.

Follow-up: How do you handle a breaking change in the database schema that affects both v1 and v2 responses?

The expand-and-contract pattern applies at the database level too. Suppose v1 returns { "name": "Alice Smith" } and v2 splits this into { "first_name": "Alice", "last_name": "Smith" }.
  1. Expand the schema: Add first_name and last_name columns alongside the existing name column. Backfill by parsing the existing name field. New writes populate all three columns.
  2. Version transforms: v1 response continues to read from name (or concatenate first_name + last_name if name has been deprecated). v2 response reads from first_name and last_name. Both versions work against the same database.
  3. Contract: After v1 is sunset and all code paths reading name are removed, drop the name column. Do this with a non-locking migration (on large tables, use ALTER TABLE ... DROP COLUMN which in PostgreSQL is just a catalog update — the column is hidden, not physically removed, so it is instant).
The danger is step 3: dropping the column before all code paths are verified. I would add a “column access audit” — log any query that references the deprecated column for 30 days before dropping it. If the log is empty, it is safe to drop.

Follow-up: A partner refuses to migrate and threatens to leave the platform. What do you do?

This is a business decision, not a technical one. The answer depends on the partner’s revenue contribution.If the partner represents significant revenue: offer to maintain their v1 integration as a “legacy support” tier — possibly with a dedicated support fee. The version transformation layer makes this cheap technically (it is just maintaining one more transform in the chain). Set a clear end-of-life date (24 months instead of 12) and assign a dedicated solutions engineer to help them migrate.If the partner is low-revenue: empathize, provide migration assistance, offer a reasonable extension (3-6 months), but hold the sunset date. Maintaining a legacy API version indefinitely for one consumer creates a maintenance burden that compounds over time.The meta-lesson: API versioning is a product management problem. The version transformation architecture makes it technically feasible to support many versions simultaneously, but the product decision is how long to support each version and at what cost. This is why Stripe treats their API version lifecycle as a first-class product concern, not just an engineering task.
What the interviewer is really testing: Can you explain complex performance and design trade-offs in a way that is clear, concrete, and changes someone’s mental model? This tests your mentoring ability as much as your technical knowledge.Strong answer:Storage is cheap — that part is correct. But the cost of “just add everything” is not storage. It is I/O, memory, network, and maintainability. Let me break down each with numbers.1. I/O cost scales with row width, not row count.PostgreSQL stores data in 8KB pages. A narrow row (50 bytes — id, name, email) fits ~150 rows per page. A wide row (2KB — add a JSON blob, a text bio, several timestamps, some denormalized fields) fits only ~4 rows per page. To scan 1 million rows:
  • Narrow table: ~6,700 pages, ~52 MB of I/O
  • Wide table: ~250,000 pages, ~1.95 GB of I/O
Same number of rows, but the wide table requires 37x more disk reads. When the database’s buffer cache (shared_buffers, typically 25% of RAM) is 4GB, the narrow table fits entirely in cache; the wide table fills half of it — evicting other useful data.SELECT * on the wide table reads all 2KB per row even if you only need the 50 bytes of id, name, email. SELECT id, name, email on a table with a covering index never touches the table at all — it reads the answer directly from the index, which is compact and almost certainly cached in memory.2. Network cost is not free.The database sends results over the network to the application server. At 10,000 queries per second returning 20 rows each:
  • SELECT id, name, email at ~100 bytes per row: 20 MB/second
  • SELECT * with a 2KB row: 400 MB/second
That is the difference between “fits comfortably in a 1 Gbps network link” and “saturates a 4 Gbps link.” In cloud environments where database and application server are in different subnets, this matters.3. Memory cost on the application side.The ORM deserializes every returned column into language objects. A Python dict with 5 fields uses ~400 bytes. A dict with 50 fields uses ~4KB. At 10,000 requests per second, each fetching 20 rows: 5-field rows consume ~80 MB of live memory; 50-field rows consume ~800 MB. This is the difference between “runs on a 2GB container” and “needs 4GB and still GC-pauses.”4. Covering indexes stop working.A covering index is one that contains all the columns your query needs — the database answers the query entirely from the index without touching the table heap (an “index-only scan”). This is the fastest possible query path. SELECT id, name FROM users WHERE email = 'alice@example.com' with an index on (email) INCLUDE (id, name) is an index-only scan: one B-tree lookup, ~0.1ms. But SELECT * always touches the heap because no index contains every column. Even if you have the perfect index, SELECT * defeats it.5. Schema evolution becomes dangerous.Someone adds a profile_image BYTEA column to the users table (a 500KB blob). Every SELECT * in the codebase now returns 500KB per row, silently. No code changed, no deployment happened, but API response sizes 100x overnight. With explicit column lists, the new column is invisible to existing queries until someone deliberately adds it.How I would explain it to the junior developer:“Storage is cheap, but reading is not free. Every SELECT * forces the database to read, serialize, and transmit every column over the network to your application, which then allocates memory for every field. The cost is not the disk — it is the I/O, the network bandwidth, the application memory, and the fact that adding a large column to a table silently breaks every SELECT * in the codebase. Always select the columns you need. It is like ordering the entire menu at a restaurant because food is cheap — you still have to carry it all to the table, and most of it goes to waste.”

Follow-up: When IS it acceptable to use SELECT *?

There are a few legitimate cases:
  1. Ad-hoc exploration. When debugging in a psql terminal, SELECT * FROM users WHERE id = 42 LIMIT 1 is fine. You are looking at data, not running production queries.
  2. EXISTS subqueries. SELECT 1 FROM ... WHERE EXISTS (SELECT * FROM orders WHERE ...) — the SELECT * inside EXISTS is never materialized. The database just checks for row existence. Practically identical to SELECT 1.
  3. Very early prototyping. When you are sketching a feature and the schema is changing hourly. But this should be replaced with explicit column lists before merging to main.
  4. When you genuinely need every column and the table is narrow. A 5-column config table where every query needs all columns — SELECT * is fine. But even here, explicit columns protect you against future schema changes.
The rule of thumb: SELECT * is for humans in terminals. Production code should always specify columns.

Advanced Interview Scenarios

These questions go beyond foundational knowledge into the territory where real production experience separates candidates. Each scenario is designed so that the “textbook” answer is incomplete or wrong. The questions reward engineers who have been on-call, debugged 3 AM incidents, and shipped migrations on tables with billions of rows. If you can answer these with specific tool names, real metrics, and trade-offs you have personally lived through — you will stand out.

The trap: “Additive changes are always safe” is the textbook answer. It is also wrong in practice.

What weak candidates say: “Additive changes are non-breaking by definition. The partner must have a bug in their code. We should tell them to fix their deserialization.”What strong candidates say:This is one of the most common real-world API incidents, and it catches teams who believe the “only removals are breaking changes” rule without understanding how client code actually works.How an additive change breaks clients — five real mechanisms:
  1. Strict deserialization. Many clients use statically typed languages with strict JSON parsers. Java’s Jackson with FAIL_ON_UNKNOWN_PROPERTIES = true (the default in some frameworks) throws an exception when it encounters a field not defined in the model class. .NET’s System.Text.Json behaves similarly by default. The partner integrated six months ago, generated their DTOs, and never updated them. Your new field is “unknown” to their parser, and their code throws. I have personally seen this break a $2M/year partner integration at a fintech company — a single new metadata field on an invoice response crashed their nightly reconciliation pipeline for 14 hours.
  2. Response size increase triggering limits. If the new field is a nested object or array, the response payload grows. Partners with a 64KB response body limit in their HTTP client configuration, or a Lambda function with a 6MB response payload limit, suddenly start failing. At one company, adding a line_items array to an order response pushed the payload for large orders from 45KB to 180KB, exceeding a partner’s nginx proxy_buffer_size of 128KB. The response was silently truncated, producing invalid JSON.
  3. Database/cache key changes. If the partner hashes the response body for caching or deduplication, a new field changes the hash. Their cache hit rate drops to zero overnight. Their “is this a duplicate webhook?” check starts treating every callback as new because the signature of the payload changed.
  4. Snapshot testing and CI pipelines. The partner’s integration test suite does exact-match comparison against a stored response fixture. The new field causes every test to fail, blocking their deployments. They cannot ship any code until they update all fixtures — which for a large integration could be hundreds of test files.
  5. Field name collisions. The new field name collides with a field the partner’s application uses internally. If they flatten the API response into a flat namespace (common in ETL pipelines), status from your API collides with their internal status field.
Whose fault is it? Both sides share responsibility. The API provider should document that additive changes are part of the contract and are not considered breaking. Every API’s terms should state: “Clients MUST ignore unknown fields.” Stripe’s API reference explicitly states this. But the provider should also announce additive changes in a changelog (even if they are not “breaking”), give partners a way to preview upcoming changes, and never add a field that changes the semantic meaning of existing fields.The defensive architecture: This is why Stripe pins each integration to an API version. Even “additive” changes are only visible in the version they were introduced. A partner on 2025-01-15 never sees a field added in 2025-06-01 until they explicitly upgrade their version. This is the gold standard — but it requires the version transformation layer described in Q11.War Story: At a payments platform processing 40M API calls/day, we added an enhanced_risk_score field to the transaction response. The field was optional, nullable, well-documented in the changelog. Within 4 hours, three partners reported failures. Partner A had strict deserialization in their Java service. Partner B’s webhook signature verification failed because they computed HMAC over the raw response body (new field changed the hash). Partner C’s Snowflake ingestion pipeline broke because their COPY command expected a fixed number of JSON keys per row. We ended up rolling back the field, adding it only to v3 of the API, and sending a “breaking change notice” even though, by our own documentation, it was not breaking. The lesson: the theoretical contract (“ignore unknown fields”) and the practical contract (what client code actually does) are different things.

Follow-up: How do you prevent this class of incident going forward?

Build a consumer compatibility testing pipeline. Before deploying any API change — even additive ones — replay the last 24 hours of each partner’s actual request/response pairs through the new code. If the response shape changes in a way that would break strict deserialization (new fields appear), flag it for review. Stripe, Twilio, and Plaid all have internal “compatibility gates” that catch this. For smaller teams: maintain a set of “canonical partner response fixtures” in your test suite that are validated on every deploy. Any fixture diff requires explicit approval.

Follow-up: A partner demands you never add fields to existing endpoints. How do you respond?

This is an unreasonable demand that would freeze your API’s evolution. But the underlying concern is valid — they want stability. Offer them two things: (1) pin them to a versioned API where response shapes are frozen per version, and (2) provide a schema registry (OpenAPI spec published per version) so their code generation tools can stay in sync. If they refuse both and insist on a frozen response format forever, that is a contract negotiation, not an engineering problem. Involve product and legal.

This is a debugging war story question. Candidates who have never been on-call will flounder.

What weak candidates say: “I would increase the lock timeout or add retry logic.” — This treats the symptom, not the cause. You will still have 15 deadlocks per minute, just with more retries consuming more resources.What strong candidates say:Deadlocks at 15/min means two or more transactions are consistently acquiring locks in conflicting orders on the same rows. This is not random — it is a deterministic code pattern. Here is exactly how I would diagnose it, step by step, using real PostgreSQL tooling.Step 1: Confirm the deadlocks and capture the details (2 minutes).PostgreSQL logs deadlock details automatically when log_lock_waits = on and deadlock_timeout is set (default 1 second). Check the PostgreSQL log:
ERROR: deadlock detected
DETAIL: Process 12345 waits for ShareLock on transaction 67890; blocked by process 12346.
         Process 12346 waits for ShareLock on transaction 67891; blocked by process 12345.
HINT: See server log for query details.
STATEMENT: UPDATE orders SET status = 'shipped' WHERE id = 4567
This tells you exactly which queries are deadlocking and which processes are involved. In RDS, these logs are in CloudWatch Logs under the PostgreSQL log group. In self-hosted, check log_directory.Also query pg_stat_activity for currently blocked sessions:
SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query,
       blocked.wait_event_type,
       blocked.state
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid
JOIN pg_locks bk ON bk.locktype = bl.locktype
  AND bk.database IS NOT DISTINCT FROM bl.database
  AND bk.relation IS NOT DISTINCT FROM bl.relation
  AND bk.page IS NOT DISTINCT FROM bl.page
  AND bk.tuple IS NOT DISTINCT FROM bl.tuple
  AND bk.pid != bl.pid
JOIN pg_stat_activity blocking ON blocking.pid = bk.pid
WHERE NOT bl.granted;
Step 2: Identify the pattern (5 minutes).Deadlocks in production almost always come from one of three patterns:Pattern A: Inconsistent lock ordering. Transaction 1 updates Order A then Order B. Transaction 2 updates Order B then Order A. If they interleave, deadlock. This is the classic case and accounts for ~70% of production deadlocks I have seen. The fix: sort the rows by ID before updating. If you are updating orders 4567 and 4568, always update 4567 first, regardless of business logic order.Pattern B: Foreign key lock escalation. When you insert into a child table, PostgreSQL acquires a shared lock on the parent row (to verify the FK exists). If another transaction holds an exclusive lock on that parent row (updating it), and is also trying to insert a child row that references the row the first transaction locked — deadlock. This is insidious because the developer writing INSERT INTO order_items has no idea they are locking a row in the orders table.Pattern C: Index contention on HOT updates. Less common, but on tables with many indexes, an UPDATE that changes indexed columns can acquire locks on multiple index entries. Two concurrent updates on adjacent rows can deadlock on the same index page.Step 3: Fix it (15-30 minutes for a hotfix).For Pattern A (most likely): identify the application code path doing multi-row updates in inconsistent order. Add explicit ordering:
# WRONG: order depends on business logic, inconsistent across requests
for order_id in batch_of_orders:
    update_order(order_id, status='shipped')

# RIGHT: sort by ID to guarantee consistent lock ordering
for order_id in sorted(batch_of_orders):
    update_order(order_id, status='shipped')
For Pattern B: restructure the transaction to update the parent row first, then insert children. Or batch the inserts so all children for one parent are inserted in a single statement, minimizing the lock window. In extreme cases, defer the FK check with SET CONSTRAINTS ... DEFERRED within the transaction.Step 4: Prevent recurrence.Add a monitoring alert on pg_stat_database.deadlocks (a cumulative counter). If deadlocks/minute exceeds 1, alert. Zero deadlocks is the goal — any deadlock in production indicates a code pattern that should be fixed, not tolerated.Add SET lock_timeout = '5s' to application transactions as a safety net. This causes a transaction to fail fast rather than wait indefinitely for a lock, reducing the deadlock detection window.War Story: At an e-commerce company processing 3,000 orders/minute during Black Friday, we hit 200+ deadlocks/minute at peak. The root cause: a batch fulfillment service was updating order statuses in the order they appeared in the warehouse pick list (sorted by bin location), while a customer-facing service was updating the same orders in the order customers placed them (sorted by timestamp). Same orders, different lock ordering. The fix was a one-line change — ORDER BY order_id in both code paths. Deadlocks dropped to zero within 60 seconds of deployment. Total time from alert to fix: 23 minutes. Total revenue at risk during those 23 minutes: approximately $150K based on our average order value and error rate.

Follow-up: How do deadlocks differ between PostgreSQL and MySQL InnoDB?

PostgreSQL detects deadlocks by running a deadlock detector every deadlock_timeout seconds (default 1s). When detected, it kills the youngest transaction (least work done) and lets the other proceed. The detection is a waits-for graph cycle check.MySQL InnoDB checks for deadlocks on every lock wait (not periodically). It uses a waits-for graph and immediately rolls back the transaction with the fewest rows modified. However, MySQL’s gap locks in REPEATABLE READ mode create deadlocks that would not happen in PostgreSQL. A range query like SELECT * FROM orders WHERE amount BETWEEN 100 AND 200 FOR UPDATE locks the index gap, which can block inserts that PostgreSQL would allow. This is why MySQL deadlocks are more frequent and more surprising than PostgreSQL deadlocks in equivalent workloads.

Follow-up: When are deadlocks actually acceptable?

When the retry cost is low and the alternative (global lock ordering) is impractical. In a system where thousands of transactions operate on overlapping sets of rows and the sets are not known in advance (think: a graph database updating connected nodes), enforcing a global lock order is architecturally infeasible. In this case, design for deadlock tolerance: wrap every transaction in a retry loop (3 retries with jitter), monitor the deadlock rate, and alert if it exceeds a threshold (e.g., 0.1% of transactions). PostgreSQL’s deadlock detection is fast enough that the overhead of occasional deadlock retries is negligible compared to the cost of global lock ordering in complex workloads.

The “obvious answer is wrong” question. Caching everything is a common anti-pattern.

What weak candidates say: “Sounds right. We should add Redis caching to every endpoint with a 5-minute TTL. That will reduce database load by 80%.”What strong candidates say:“Just add Redis in front of everything” is the database equivalent of “just throw more hardware at it.” It might reduce load, but it creates an entirely new class of problems and may not even fix the root cause. Here is why I would push back and what I would do instead.First, diagnose why the database is at 90% CPU.90% CPU does not tell you the story. Is it 90% from thousands of efficient queries at high volume? Or 90% from three terrible queries doing sequential scans on 500M-row tables? The fix is radically different.
-- Find the most expensive queries by total time
SELECT query, calls, total_exec_time, mean_exec_time,
       rows, shared_blks_hit, shared_blks_read
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
In my experience, the Pareto principle applies aggressively to database load: 3-5 queries account for 80%+ of total execution time. If three queries are doing sequential scans because they lack indexes, adding Redis does not fix anything — it just means you are caching slow queries and still paying the CPU cost when the cache misses.What is wrong with “cache everything”:
  1. Cache invalidation becomes your new primary problem. Every write now needs to invalidate the right cache keys. Miss one, and users see stale data. With 50 endpoints, you need 50 invalidation paths. The joke “there are only two hard things in computer science: cache invalidation and naming things” exists because cache invalidation in a complex system is genuinely harder than the database queries it replaces.
  2. Thundering herd on cache expiry. If a hot cache key expires and 1,000 requests simultaneously hit the database, you get a load spike worse than having no cache. With TTL-based caching on every endpoint, you get mini thundering herds constantly as different keys expire. Solutions exist (cache stampede locks, stale-while-revalidate), but they add complexity to every endpoint.
  3. Memory cost is non-trivial. A Redis instance with 50GB of cached API responses is expensive and operationally complex (Redis cluster, failover, persistence, monitoring). If most of those responses are accessed once per TTL window, you are paying for memory to store data nobody will read before it expires.
  4. You now have two data stores to keep consistent. Every bug investigation now requires checking both the database and the cache. “Is the user seeing stale data because of a cache issue or a replication lag issue?” Debugging complexity increases quadratically with the number of data stores.
What I would actually do, in order:Step 1: Fix the queries (hours, biggest impact). Use pg_stat_statements to find the worst queries. Add missing indexes. Rewrite N+1 patterns. This alone typically reduces CPU by 40-60%. Cost: zero infrastructure, just SQL.Step 2: Add connection pooling if missing. If each request opens a new connection, the overhead is enormous. PgBouncer in transaction mode with a pool of 20 connections serving 200 application instances eliminates connection management overhead and can reduce CPU by 10-20%.Step 3: Add targeted caching for specific hot paths. Not “cache everything” — cache the 5 endpoints that account for 80% of traffic AND whose data tolerates staleness. A product catalog page that updates once per hour? Cache it for 5 minutes. A user’s account balance? Never cache it — stale financial data is a support ticket.Step 4: Read replicas for read-heavy paths. If 85% of queries are reads, route them to a read replica. This halves the primary’s CPU immediately. Combine with caching: cache in Redis for the hottest data, read replica for the long tail.Step 5: THEN consider broader caching. Only after steps 1-4 are exhausted. And use cache-aside (lazy loading) with explicit TTLs per endpoint, not a blanket “cache everything for 5 minutes.”War Story: At a B2B SaaS company, the database hit 95% CPU. The team’s instinct was to add Redis. I spent 30 minutes with pg_stat_statements and found three things: (1) A dashboard endpoint was running SELECT COUNT(*) FROM events WHERE tenant_id = ? on a 200M-row table with no index on tenant_id — 45 seconds per call, called every time the dashboard loaded. Adding a composite index reduced it to 3ms. (2) An ORM was generating N+1 queries for a list endpoint — 1 query for the list, then 1 query per item for a related entity. 50 items = 51 queries. Adding .include(:related_entity) collapsed it to 2 queries. (3) A background job was running ANALYZE on the entire database every 5 minutes (someone copy-pasted a Stack Overflow answer). Changing it to rely on autovacuum freed 15% CPU. After these three fixes, CPU dropped from 95% to 35%. We never added Redis. Total cost: zero dollars. Time: one afternoon.

Follow-up: When IS “add Redis everywhere” the right answer?

When the query layer is already optimized, and the bottleneck is genuinely read volume at a scale that a single database (even with replicas) cannot handle. If you have 500,000 reads/second of a product catalog that changes once per hour, caching it in Redis with a 60-second TTL is correct. The pattern: high read volume + low write frequency + tolerance for staleness = cache aggressively. But that describes maybe 10-20% of endpoints in a typical application, not “everything.”

Follow-up: How do you handle cache invalidation for data that changes unpredictably?

Use cache-aside with short TTLs for unpredictable data (30-60 seconds). The cache absorbs burst reads, and the TTL guarantees eventual freshness. For data that MUST be fresh on write (user updates their name, should see it immediately), use write-through caching: on every write, update the database AND the cache in the same operation. If the cache update fails, delete the cache key (so the next read fetches from the database). Never let a stale cache key persist after a successful write — this is the most common caching bug in production.

This question tests whether you recognize when NOT to do something, and how you manage up when a technical decision is being driven by unrealistic timelines.

What weak candidates say: “We can use hash-based sharding on the primary key. Set up two PostgreSQL instances, split the data by hash, and update the application to route queries.” — This is technically not wrong, but the timeline confidence and the lack of risk awareness is the red flag.What strong candidates say:The first thing I would do is have an honest conversation with the CEO about what “done in two weeks” actually means and what the risks are. Sharding a production 2TB PostgreSQL database is a 3-6 month project for an experienced team. Rushing it is how you get a week-long outage that costs more than the original performance problem. Let me explain why, and then what I would actually do in two weeks.Why sharding is a 3-6 month project, not 2 weeks:
  1. Shard key selection is an irreversible architectural decision. Choose the wrong shard key and you get hot shards, cross-shard queries for common operations, and eventual data migration that is harder than the original sharding. You need to analyze every query pattern, every JOIN, every foreign key relationship. For a 3B-row database that has accumulated years of access patterns, this analysis alone takes weeks.
  2. Cross-shard queries. Any query that does not include the shard key becomes a scatter-gather across all shards. If you shard orders by customer_id, the query SELECT * FROM orders WHERE created_at > '2025-01-01' must hit every shard. If 40% of your queries are date-range queries, sharding by customer_id makes 40% of your queries slower, not faster. You need to audit every query path before choosing the shard key.
  3. Application changes are pervasive. Every database query in the application needs to be shard-aware — either the application routes to the correct shard, or a proxy (like Vitess or Citus) handles routing. If you use application-level sharding, every ORM query, every raw SQL call, every migration script, every background job needs to know about shards. This is months of engineering.
  4. Data migration without downtime. Moving 2TB of data from one database to multiple shards while the application continues to serve traffic requires dual-writing, backfilling, verification, and cutover. At 3B rows and typical migration speeds (50K-100K rows/second with backfill), the migration itself takes 8-16 hours. During that window, any write to the old database must be replicated to the new shards.
  5. Operational complexity permanently increases. Instead of one database to monitor, back up, upgrade, and troubleshoot, you now have N databases. Schema migrations run across all shards (Vitess handles this; DIY sharding does not). Backup and restore procedures multiply. On-call complexity increases.
What I would actually do in two weeks (to buy time):Week 1 — Immediate relief without sharding:
  • Run pg_stat_statements analysis. Find and fix the top 10 worst queries (missing indexes, sequential scans, N+1 patterns). On a 3B-row database, a single missing index can account for 30-40% of CPU.
  • Add read replicas. Route all reporting, analytics, and read-heavy API endpoints to replicas. This can cut primary load by 60-70%.
  • Implement connection pooling with PgBouncer if not already in place.
  • Partition the largest tables by date range (declarative partitioning). If the 3B rows are time-series-ish (orders, events, logs), partitioning by month means queries for “last 30 days” scan 1 partition instead of 36 months of data. This is a massive win with far less risk than sharding.
Week 2 — Caching and architectural planning:
  • Add Redis caching for the top 5 read-heavy endpoints.
  • Begin the shard key analysis: catalog all query patterns, JOIN paths, and access patterns. This is the prerequisite for sharding, and starting it now means the eventual sharding project starts from a position of knowledge rather than guessing.
  • Present a realistic sharding timeline (3-6 months) with the query analysis as evidence for why it cannot be rushed.
If sharding is truly needed, the approach:Use Citus (PostgreSQL extension for distributed tables) or Vitess (MySQL sharding proxy) rather than DIY sharding. Citus lets you SELECT create_distributed_table('orders', 'customer_id') and it handles shard routing, cross-shard joins, and distributed transactions transparently. The application changes are minimal compared to DIY sharding. Migration is still a multi-week project, but the operational overhead is dramatically lower.War Story: A fintech company I worked with had a 1.5TB PostgreSQL database doing 8K QPS. Leadership wanted to shard within a month. Instead, we spent two weeks on: (1) adding a composite index to the transactions table that eliminated a sequential scan on 800M rows — this alone dropped P99 query latency from 12 seconds to 40ms. (2) Partitioning the transactions table by quarter — queries for recent data stopped scanning historical partitions. (3) Adding two read replicas and routing 70% of traffic to them. CPU on the primary dropped from 92% to 28%. The database ran fine for another 18 months before sharding became genuinely necessary — and by then, the team had done a proper shard key analysis and used Citus for a controlled 3-month migration with zero downtime. The two-week “emergency sharding” would have been a disaster.

Follow-up: How do you choose a shard key? What is the worst shard key you have seen?

The shard key should satisfy three criteria: (1) high cardinality — enough distinct values to distribute evenly across shards. (2) Included in most queries — so the router can send each query to a single shard rather than all shards. (3) Even distribution — no single value should have disproportionate data.The worst shard key I have encountered: country_code. A SaaS product sharded by country, thinking it was “natural isolation.” The US shard had 70% of all data. The Japan shard had 15%. The remaining 40 shards shared 15%. The US shard was larger than the original unsharded database within a year. They had to re-shard by tenant_id — a 4-month project.Second worst: created_date. All current traffic hits the latest shard. Historical shards are idle. You have not distributed load; you have concentrated it on one shard with extra infrastructure complexity.

Follow-up: What is the difference between sharding and partitioning?

Partitioning splits a table into multiple physical segments within the same database instance. PostgreSQL declarative partitioning keeps all partitions on the same server, managed by the same PostgreSQL process. The query planner does partition pruning (only scans relevant partitions). There is zero application change — queries work identically. It solves query performance (smaller indexes, fewer rows scanned) but not write throughput or storage limits — you are still on one server.Sharding splits data across multiple database instances on different servers. Each shard is an independent database. It solves write throughput (each shard handles its own writes), storage (each shard stores a fraction of the data), and read throughput (queries route to one shard). But it introduces cross-shard query complexity, distributed transaction challenges, and operational overhead.Rule of thumb: partition first, shard only when a single server cannot handle the write throughput or storage requirements. Most teams shard too early. PostgreSQL with partitioning, proper indexing, and read replicas can handle multi-TB databases and tens of thousands of QPS on a single server.

This tests API design instinct — the ability to spot violations of REST conventions and think through the consequences.

What weak candidates say: “Looks fine. It sends a notification to a user.” — Missing the core design problem entirely.What strong candidates say:There are multiple design issues here, ranging from convention violations to real operational risks. Let me walk through them in order of severity.Issue 1: Verb in the URL. REST resources are nouns, not verbs. send-notification is a command, not a resource. This matters because it makes the API unpredictable. If send-notification is a verb, what about update-profile, reset-password, calculate-invoice? You end up with an API that is a collection of RPC-style commands disguised as REST endpoints, with no consistency in naming, no standard HTTP method semantics, and no way for a developer to guess the URL structure.The fix: Model the notification as a resource.
POST /users/:id/notifications
Body: { "type": "welcome", "channel": "email" }
Response 202: { "notification_id": "ntf_abc123", "status": "queued" }
Now notifications are a first-class resource. You can GET /users/:id/notifications to list them. GET /users/:id/notifications/ntf_abc123 to check status. DELETE /users/:id/notifications/ntf_abc123 to cancel a queued one. The entire CRUD lifecycle emerges naturally from treating notifications as resources.Issue 2: POST implies creation, but should this be synchronous? Sending a notification is inherently async — it involves email delivery, push notification services, SMS gateways, any of which can take seconds or fail. A synchronous POST that blocks until the notification is delivered will timeout on slow channels. The 202 Accepted pattern is correct: accept the request, return immediately with a resource ID, process asynchronously. The client polls the status or receives a webhook callback.Issue 3: No idempotency. What if the client retries? Without an idempotency key, retrying the POST sends the notification twice. A user getting two identical “Your password was reset” emails is confusing. A user getting two identical “Your payment of $5,000 was processed” SMS messages causes panic and support tickets.Issue 4: Missing rate limiting considerations. This endpoint triggers an external side effect (sending a notification). If a bug or a malicious actor calls it in a loop, you spam the user with thousands of notifications. Add a per-user rate limit (e.g., max 10 notifications per user per hour) separate from the API-level rate limit. Also add a global rate limit per notification type to protect downstream delivery services.Issue 5: The :id in the URL should be validated. Can user A send a notification to user B? If this is an admin endpoint, that might be fine. If it is a user-facing endpoint, the :id must match the authenticated user. This is an authorization check that is easy to miss in review but is a critical security vulnerability (IDOR — Insecure Direct Object Reference).How I would coach the developer:I would not just request changes on the PR. I would schedule a 15-minute pairing session and walk through the design together, using Stripe’s API or GitHub’s API as references. The goal is to build the developer’s API design intuition, not just fix this one endpoint. I would frame it as: “Every endpoint you add becomes a contract. Once a client depends on POST /users/:id/send-notification, renaming it is a breaking change. Let us design it as a resource from the start so the API grows naturally.”War Story: At a startup, we shipped an endpoint called POST /orders/:id/refund. Six months later, we needed partial refunds, refund reasons, refund-to-store-credit, and refund status tracking. The verb-based endpoint could not accommodate any of this without becoming POST /orders/:id/refund?amount=50&reason=damaged&type=credit. It was an unreadable mess. We redesigned it as POST /orders/:id/refunds (resource), which naturally supported GET /orders/:id/refunds (list), GET /orders/:id/refunds/:refund_id (detail), and a clean body with all the parameters. The migration from the old endpoint to the new one took 3 months because 80 partners had integrated with the verb-based URL.

Follow-up: When IS it acceptable to use verbs in REST URLs?

Two scenarios. First, genuinely non-resource operations that do not create, read, update, or delete anything: POST /translate (stateless transformation), POST /validate (check without side effects), POST /search (complex query that does not fit in a GET query string). Even here, you can often model them as resources (POST /translations, POST /validations), but the verb form is pragmatically acceptable for stateless operations.Second, when your API is not actually REST and you have decided that is fine. An internal RPC-style API between two services you own, where neither discoverability nor resource modeling matters, can use verb-based endpoints. Just be honest about it — do not call it REST.

Follow-up: How would you design the notification endpoint to support multiple channels (email, SMS, push)?

Add a channels array to the request body:
POST /users/:id/notifications
{
  "template": "order_shipped",
  "channels": ["email", "push"],
  "data": { "order_id": "ord_123", "tracking_url": "..." }
}
The server fans out to each channel asynchronously. The response includes per-channel status:
{
  "notification_id": "ntf_abc123",
  "status": "queued",
  "channels": [
    { "channel": "email", "status": "queued" },
    { "channel": "push", "status": "queued" }
  ]
}
Each channel can succeed or fail independently. The client checks GET /users/:id/notifications/ntf_abc123 for updated per-channel statuses. This is cleaner than separate endpoints per channel (/send-email, /send-sms) because the business logic (which channels to use, deduplication, user preferences) stays on the server.

This is a “common misconception with real money on the line” question. The obvious answer — “Kafka has exactly-once” — is dangerously incomplete.

What weak candidates say: “Enable Kafka’s exactly-once semantics with enable.idempotence=true and transactional producers. Problem solved.”What strong candidates say:The team’s understanding is incomplete in a way that cost real money. Let me untangle what “exactly-once” actually means in Kafka and why it did not prevent duplicate charges.What Kafka’s exactly-once actually guarantees:Kafka’s exactly-once semantics (EOS), introduced in 0.11, guarantees that a producer-to-Kafka-to-consumer pipeline processes each message exactly once within Kafka. Specifically:
  • Idempotent producer (enable.idempotence=true): the producer can retry sends without creating duplicate messages in the topic. Kafka deduplicates by producer ID and sequence number.
  • Transactional producer/consumer: a consume-transform-produce pipeline can atomically read from topic A, process, and write to topic B, with consumer offsets committed in the same transaction. If the process crashes mid-transaction, Kafka rolls back both the output message and the offset commit. No message is double-processed within Kafka.
What it does NOT guarantee:Exactly-once stops at the Kafka boundary. The moment your consumer writes to an external system (a database, a payment API, an HTTP endpoint), you are outside Kafka’s transaction scope. Here is the failure mode:
  1. Consumer reads order event from Kafka.
  2. Consumer calls the payment API — payment succeeds, customer is charged $50.
  3. Consumer attempts to commit the Kafka offset.
  4. Consumer crashes (or the network blips) before the offset commit succeeds.
  5. Kafka redelivers the message (offset was not committed).
  6. Consumer processes it again — calls payment API again — customer is charged another $50.
Kafka’s exactly-once cannot prevent this because the payment API is not part of Kafka’s transaction. The charge at step 2 happened outside Kafka’s scope. No message broker can provide exactly-once delivery to an external system — this is a fundamental distributed systems limitation, not a Kafka bug.How to actually fix it — three approaches:Approach 1: Consumer-side idempotency (most common).Before processing each message, check if you have already processed it. Use the Kafka message’s (topic, partition, offset) as a unique identifier, or better, use a business-level idempotency key (order ID + event type).
def process_order(event):
    idempotency_key = f"{event.order_id}:{event.event_type}"

    # Atomic check-and-insert in the same DB transaction as the business logic
    with db.transaction():
        if db.exists("processed_events", key=idempotency_key):
            log.info(f"Skipping duplicate: {idempotency_key}")
            return  # Already processed

        # Process the order
        charge_payment(event.order_id, event.amount)

        # Record that we processed it
        db.insert("processed_events", key=idempotency_key, processed_at=now())
The critical detail: the idempotency check and the business logic MUST be in the same database transaction. If you check in Redis and process in PostgreSQL, a crash between them leaves you inconsistent (same problem as Q2).Approach 2: Outbox pattern with transactional writes.Instead of calling the payment API directly from the consumer, write the “intent to charge” to a local database table (the outbox) in the same transaction as the offset tracking. A separate process reads the outbox and calls the payment API with its own idempotency key. If the charge fails, the outbox entry is retried. If the charge succeeds but the outbox update fails, the retry calls the payment API again with the same idempotency key — the payment API deduplicates it.Approach 3: Exactly-once with Kafka Streams.If your consumer writes its output to another Kafka topic (not an external database), Kafka Streams’ exactly-once processing handles everything. But the moment you have a side effect (database write, API call, email), you are back to needing consumer-side idempotency.War Story: At a food delivery platform processing 50K orders/hour, we discovered duplicate charges during a routine financial reconciliation. 0.03% of orders had been charged twice — 15 orders per hour, $450/hour in overcharges. The consumer was using Kafka’s enable.idempotence=true and the team believed they were protected. The root cause: the consumer committed offsets asynchronously (enable.auto.commit=true with a 5-second interval). During a rolling deployment, containers were killed before the auto-commit fired. On restart, messages since the last commit were redelivered and reprocessed. The fix was threefold: (1) switch to manual offset commits, (2) add an idempotency table keyed on order_id + event_type with the processing wrapped in a database transaction, (3) add a daily reconciliation job that detects and flags duplicate charges for automatic refund. The idempotency table added 0.3ms of overhead per event (one indexed lookup). The duplicate rate dropped from 0.03% to zero. The reconciliation job remained as a safety net.

Follow-up: Why not just use manual offset commits after processing?

Manual commits reduce but do not eliminate duplicates. The failure window is between “processing completes” and “offset commit completes.” If the consumer crashes in that window (or the broker rejects the commit due to a rebalance), the message is redelivered. The window is small (milliseconds), but at 50K messages/hour, even a 0.001% failure rate produces 18 duplicates per day. For payments, zero is the only acceptable number.

Follow-up: How does the idempotency table perform at high throughput?

At 50K events/hour (~14 events/second), the overhead is negligible. At 1M events/hour (~278 events/second), you need to think about table management. The idempotency table grows continuously. Partition it by day, drop partitions older than 7 days (your Kafka retention period). Index on the idempotency key for O(log n) lookups. At 278 inserts/second with 7-day retention, the table holds ~170M rows — PostgreSQL handles this comfortably with a proper index. If throughput exceeds 10K events/second, consider using Redis as a fast-path check with the database as a fallback (check Redis first, if miss then check DB, write to both).

This tests distributed database internals knowledge. Candidates who only know SQL will struggle. Candidates who have operated DynamoDB in production will nail it.

What weak candidates say: “Increase the provisioned capacity. If 10,000 WCU is not enough, set it to 50,000.” — This fundamentally misunderstands how DynamoDB distributes capacity.What strong candidates say:This is the classic DynamoDB hot partition problem, and it is one of the most misunderstood aspects of DynamoDB’s architecture. Increasing provisioned capacity will not help because the problem is not total capacity — it is how capacity is distributed across partitions.How DynamoDB partitioning works:DynamoDB divides your table into partitions based on the partition key hash. Each partition handles a maximum of 3,000 RCU and 1,000 WCU (these are hard limits per partition, not configurable). When you provision 10,000 WCU, DynamoDB creates at least 10 partitions (10,000 / 1,000 = 10) and distributes the capacity evenly: 1,000 WCU per partition.If your access pattern sends 80% of writes to a single partition key value (e.g., tenant_id = "big_enterprise_customer"), that one partition receives 8,000 WCU of demand but can only handle 1,000 WCU. The other 9 partitions sit idle with 9,000 WCU unused. You are throttled at 1,000 WCU effective capacity despite provisioning 10,000.DynamoDB adaptive capacity (introduced 2018) partially mitigates this — it can boost a hot partition’s throughput by borrowing from underutilized partitions. But adaptive capacity has limits: it takes minutes to kick in, cannot exceed the table-level throughput, and does not help if the hot partition is at the physical node’s I/O limit.Root cause analysis for hot partitions:Run a query against CloudWatch ConsumedWriteCapacityUnits with the TableName and partition key dimension. Or use DynamoDB Contributor Insights (must be enabled per table) — it shows the top 20 most-accessed partition keys over the last 24 hours. If one key dominates, you have found your hot partition.Common causes:
  1. Bad partition key design. Using date as the partition key means all of today’s writes hit one partition. Using status (with 3 possible values) means 3 partitions max, with the most common status being hot.
  2. One big tenant in a multi-tenant table. If tenant_id is the partition key and one tenant has 100x the traffic of others, their partition is permanently hot.
  3. Time-based access patterns. An IoT system where device_id is the partition key, but all devices report at the top of every minute. The write pattern is bursty, concentrating on a few partitions simultaneously.
Fixes, ordered by effort:
  1. Write sharding (adding randomness). Append a random suffix to the partition key: instead of tenant_id = "big_customer", use tenant_id = "big_customer#3" where the suffix is a random number from 0-9. This spreads the big customer’s writes across 10 partitions. Reads require a scatter-gather across all 10 suffixed keys, but writes are distributed. This is DynamoDB’s official recommended pattern for hot partitions.
import random
# Write: spread across 10 logical partitions
suffix = random.randint(0, 9)
table.put_item(Item={
    'pk': f'TENANT#{tenant_id}#{suffix}',
    'sk': f'ORDER#{order_id}',
    ...
})

# Read all orders for a tenant: query all 10 partitions
for suffix in range(10):
    results = table.query(
        KeyConditionExpression=Key('pk').eq(f'TENANT#{tenant_id}#{suffix}')
    )
  1. Switch to on-demand capacity mode. On-demand mode provisions capacity automatically based on traffic and does not have per-partition limits as strict as provisioned mode (it still has limits, but they are higher and auto-adjust). For bursty workloads, on-demand eliminates the capacity planning problem entirely. The trade-off: on-demand is 5-7x more expensive per request than provisioned at steady-state traffic. But if you are throttling, the lost requests cost more than the price premium.
  2. Redesign the partition key. If the hot partition is caused by a fundamental design flaw (using date or status as the partition key), no amount of capacity will fix it. Redesign with a high-cardinality key. For time-series data, use device_id (high cardinality) as the partition key and timestamp as the sort key — not the reverse.
  3. Use a GSI with a different partition key. If the hot path is a specific query pattern, create a Global Secondary Index with a partition key that distributes more evenly. The GSI has its own partitions and throughput.
War Story: At an IoT platform ingesting telemetry from 100K devices, we used device_type as the partition key (thinking it was a logical grouping). There were 12 device types. 60% of devices were type “sensor_v2.” That one partition handled 60% of all writes — 180K WCU of demand on a partition that could handle 3K WCU. Contributor Insights showed the hot key immediately. We redesigned the table to use device_id as the partition key (100K distinct values = excellent distribution). The migration required dual-writing to old and new tables for 48 hours, then cutting over reads. Throttling dropped to zero. The table went from needing 300K provisioned WCU (to brute-force the hot partition) to 50K WCU (actual demand, evenly distributed). Cost savings: $18K/month.

Follow-up: How does DynamoDB’s on-demand mode handle sudden traffic spikes?

On-demand mode maintains a “previous peak” high-water mark. It can instantly serve up to double the previous peak traffic. If traffic exceeds double the previous peak, DynamoDB throttles while it provisions additional capacity (which takes minutes). For truly unpredictable spikes (0 to 50K WCU in seconds), on-demand may still throttle briefly. The mitigation: “pre-warm” by running a load test that establishes the high-water mark before the expected spike. Or use provisioned mode with auto-scaling and set the max capacity high enough.

Follow-up: When would you use DynamoDB over PostgreSQL?

When access patterns are known and key-based (get by ID, query by partition key + sort key range), write throughput needs to be massive (>100K writes/second), you want zero-operational-overhead scaling, and you do not need ad-hoc queries or multi-table joins. DynamoDB excels at: session stores, user profiles, shopping carts, IoT telemetry, gaming leaderboards. PostgreSQL excels at: anything requiring JOINs, complex queries, transactions across multiple tables, or access patterns that evolve unpredictably. The wrong choice: using DynamoDB when you need to run SELECT orders.*, customers.name FROM orders JOIN customers ON ... — you will end up doing the JOIN in application code, which is slower and more error-prone.

This is a “numbers that sound good but are not” question. It tests whether candidates think in absolute numbers, not just percentages.

What weak candidates say: “99.9% is three nines of reliability. That is industry standard. Customers should not be complaining.” — This shows the candidate has never operated a high-traffic API.What strong candidates say:99.9% sounds great as a percentage. But percentages without volume are meaningless. Let me do the math.The math that changes the narrative:If the API serves 10 million requests per day:
  • 99.9% success = 0.1% failure = 10,000 failed requests per day
  • That is 416 failures per hour, or 7 per minute
If each failed request represents a user-visible error (a payment that did not go through, a page that did not load, a notification that was not sent), 10,000 users per day are having a bad experience. If even 1% of those users contact support, that is 100 support tickets per day. If 10% of those users churn, you lose 1,000 users per day. At an average customer LTV of 500,thatis500, that is 500K/day in lost future revenue.Now compare to 99.99% (four nines):
  • 0.01% failure = 1,000 failed requests per day
  • 42 per hour, less than 1 per minute
And 99.99% at 100M requests/day:
  • 10,000 failures per day — you are right back where you started
This is why SLOs must be defined in absolute terms, not just percentages. “99.9% success rate” is not an SLO. “Fewer than 1,000 user-visible errors per day” is an SLO.How to improve from 99.9% to 99.99%:Going from three nines to four nines is a 10x reduction in failures. This is where the diminishing returns kick in — each additional nine is an order of magnitude harder and more expensive.
  1. Categorize the failures. Not all 10,000 daily failures have the same cause. Pull the error logs and categorize:
    • Timeouts to downstream services (often 40-50% of failures): add circuit breakers, increase timeout budgets for critical paths, add fallback responses for non-critical dependencies.
    • Database connection exhaustion (often 15-25%): add PgBouncer, tune pool sizes, add connection timeout with retry.
    • Rate limiting / load shedding (often 10-20%): if you are rate-limiting legitimate traffic, your limits are too low or your capacity is too small.
    • Application bugs (often 10-15%): null pointer exceptions, unhandled edge cases in business logic. Each one is a unique fix.
    • Infrastructure failures (often 5-10%): node crashes, deployment rollouts, cloud provider transient errors.
  2. Implement retries with backoff for transient failures. Many of the timeouts and connection errors are transient — the downstream service was busy for 200ms but would succeed on retry. A single retry with 100ms backoff converts a transient failure into a success, reducing the failure count by 30-50%. But retries must be idempotent (see Q2) and must have a retry budget (max 1 retry per request, max 10% of traffic as retries) to avoid amplification.
  3. Add graceful degradation. If the recommendation engine is down, show popular products instead of a 500 error. If the analytics service is slow, skip analytics tracking and serve the page. Not every dependency failure needs to be a user-visible error. This alone can improve the user-perceived success rate from 99.9% to 99.95%.
  4. Reduce blast radius. A single bad deployment that crashes 1 of 10 servers causes a 10% error spike for the duration of the rollout. Canary deployments (route 5% of traffic to the new version, monitor for 10 minutes, then proceed) limit the blast radius. A bad deploy now affects 5% of traffic for 10 minutes instead of 100% for 5 minutes.
War Story: At a payment processing company doing 50M API calls/day, our SLO was 99.95% (four nines target). We were at 99.91% — 45,000 failures/day. The CEO thought we were “basically there.” I showed the team that 45,000 vs our target of 25,000 meant 20,000 excess failures per day. We categorized them: 18,000 were timeouts to a single downstream fraud detection service that had a P99 latency of 8 seconds. Our timeout was 5 seconds. The fraud service team tuned their slowest query path (a GIN index lookup on a 2B-row table that had bloated), dropping P99 to 800ms. Our timeout failures dropped from 18,000 to 1,200. Overall success rate went to 99.97%. A single index optimization on a downstream service was responsible for 0.06% of our error rate. That is why you categorize failures before optimizing.

Follow-up: How do you define and measure SLOs for an API?

Define SLOs using the SLI/SLO/error budget framework (Google SRE book):
  • SLI (Service Level Indicator): The metric you measure. For an API: (successful responses / total responses) * 100 where “successful” means HTTP 2xx/3xx within the latency threshold (e.g., <500ms). Exclude 4xx client errors (those are the client’s fault, not yours) unless they indicate a server-side issue (429s that should not happen).
  • SLO (Service Level Objective): The target for the SLI. “99.95% of requests succeed within 500ms over a 30-day rolling window.”
  • Error budget: 100% - SLO = error budget. At 99.95% SLO with 50M requests/day, your error budget is 25,000 failures/day. If you burn through the budget, freeze deployments and focus on reliability.
Measure with: Datadog APM, Grafana + Prometheus, or AWS CloudWatch. The key is measuring at the edge (load balancer or API gateway), not at the application, because infrastructure-level failures (connection resets, TLS errors, load balancer 502s) are invisible to application-level metrics.

Follow-up: What is the difference between SLO and SLA?

SLO is an internal engineering target. Missing it triggers an engineering response (freeze deploys, fix reliability). SLA is an external contractual commitment, usually with financial penalties (service credits) for violations. Your SLO should always be stricter than your SLA. If your SLA is 99.9%, your SLO should be 99.95% — the gap is your safety margin. You should never be in a situation where engineering only finds out about a reliability problem because a customer claims an SLA credit.

This is a “both sides have merit and the context determines the answer” question. It tests architectural judgment, not religious preference.

What weak candidates say: Either extreme — “Never use PostgreSQL as a queue, always use SQS” or “PostgreSQL does everything, why add another service?” Both are dogmatic rather than analytical.What strong candidates say:Both are right in different contexts, and the decision depends on scale, operational complexity budget, and the team’s existing infrastructure. Let me argue both sides, then give my recommendation.The case for PostgreSQL as a job queue:It genuinely works well at moderate scale, and the operational simplicity is significant.
  1. Transactional enqueue. When an API request creates an order AND enqueues a notification job, both happen in the same database transaction. If the transaction fails, neither persists. With SQS, you have the dual-write problem: the order is committed to the database, but the SQS send fails (or vice versa). You need the outbox pattern to make this reliable, which adds complexity.
  2. No additional infrastructure. No SQS configuration, no IAM policies, no dead-letter queue configuration, no CloudWatch alarms for queue depth. The job table is just another table in the database you already monitor, back up, and maintain.
  3. SKIP LOCKED is remarkably efficient. PostgreSQL’s SELECT ... FOR UPDATE SKIP LOCKED is purpose-built for this pattern. Multiple workers dequeue concurrently without blocking each other. At 100-500 jobs/second, a well-indexed job table with regular cleanup of completed jobs handles this without breaking a sweat. Libraries like pgboss (Node.js), Oban (Elixir), GoodJob (Ruby), and Procrastinate (Python) are production-tested implementations.
  4. Rich querying. You can SELECT COUNT(*) FROM jobs WHERE status = 'failed' AND created_at > now() - interval '1 hour' — try that with SQS. You can add indexes, run analytics on job performance, and JOIN job data with business data. SQS is an opaque pipe.
The case for a dedicated queue (SQS, Kafka, RabbitMQ):
  1. Scale. PostgreSQL as a queue works at 100-500 jobs/second. At 10,000+ jobs/second, the job table becomes a write hotspot. Every enqueue is an INSERT, every dequeue is a SELECT + UPDATE, every completion is an UPDATE or DELETE. That is 3 write operations per job on a single table. At 10K jobs/second, that is 30K write operations/second on the jobs table alone — enough to saturate a mid-tier PostgreSQL instance. SQS handles 3,000 messages/second per queue by default and scales transparently to nearly unlimited throughput.
  2. Isolation. A surge in job processing should not affect API response times. If job dequeuing competes with API queries for database connections, CPU, and I/O, a job backlog can degrade API performance. SQS is completely isolated — a million queued messages does not affect your database.
  3. Retention and replay. SQS retains messages for up to 14 days. Kafka retains indefinitely. A PostgreSQL job table fills up and needs cleanup. If you need to replay failed jobs from last week, SQS/Kafka has them. PostgreSQL requires you to not delete completed/failed jobs (which bloats the table).
  4. Dead-letter queues and visibility timeouts. SQS has built-in DLQ, visibility timeout (message becomes invisible to other consumers for N seconds while being processed), and automatic re-delivery after timeout. Implementing all of this in PostgreSQL is possible but is 500+ lines of code that you have to maintain and test.
My recommendation framework:
FactorUse PostgreSQL QueueUse SQS/Dedicated Queue
Job throughput<500/second>1,000/second
Team already operates SQS/KafkaNoYes
Need transactional enqueue (job + business data atomically)Yes (this is the killer feature)Only with outbox pattern
Job processing affects API latencyAcceptableUnacceptable
Number of servicesMonolith or <5 servicesMicroservices (>5 services)
Need exactly-once processingEasier (DB transaction)Harder (need idempotency)
For a startup or small team with <500 jobs/second: use PostgreSQL. The operational simplicity and transactional guarantees outweigh the scaling limitations you do not yet have. Use Oban, pgboss, or GoodJob — do not build your own.For a team at scale with >1,000 jobs/second or microservices architecture: use SQS. The isolation and scaling benefits outweigh the operational complexity.War Story: A team I advised started with SQS for job processing on day one because “that is what you are supposed to do.” Six months in, they had: 3 separate SQS queues, a DLQ for each, CloudWatch alarms for queue depth, a Lambda consumer for each queue, and the outbox pattern with a Debezium CDC pipeline to reliably publish database events to SQS. Total infrastructure: 15 AWS resources for what was 200 jobs/hour. They were spending more engineering time maintaining the queuing infrastructure than the actual application. I helped them rip it all out and replace it with pgboss — one PostgreSQL table, one worker process, zero additional infrastructure. Deployment complexity dropped dramatically. The system ran without incident for two years before they genuinely needed SQS at 5,000 jobs/second. The lesson: match the tool to the actual scale, not the imagined future scale.

Follow-up: What about using Redis (Bull/BullMQ) as the queue instead of PostgreSQL or SQS?

Redis with BullMQ is the middle ground. Faster than PostgreSQL for high-throughput dequeuing (Redis operations are sub-millisecond), simpler than SQS (no IAM, no DLQ configuration), and has a beautiful dashboard (Bull Board) for monitoring. The trade-off: Redis is not durable by default. If Redis crashes and AOF persistence is not configured (or has a gap), queued jobs are lost. For jobs where loss is acceptable (send analytics event, update search index), Redis is excellent. For jobs where loss is unacceptable (process payment, send invoice), use PostgreSQL or SQS with their durability guarantees.

Follow-up: How does the outbox pattern solve the dual-write problem with SQS?

Instead of writing to the database AND sending to SQS (two separate operations that can partially fail), you write the business data AND the message to the database in the same transaction:
BEGIN;
INSERT INTO orders (id, customer_id, amount) VALUES (...);
INSERT INTO outbox (id, topic, payload) VALUES (..., 'order_created', '{"order_id": ...}');
COMMIT;
A separate process (the “relay”) polls the outbox table, sends each message to SQS, and deletes it from the outbox after SQS confirms receipt. If the relay crashes, it restarts and re-sends un-relayed messages (SQS consumers must be idempotent). An alternative to polling is CDC (Change Data Capture) with Debezium, which streams the outbox inserts to Kafka/SQS in near-real-time via the database’s WAL/binlog. This is more complex but eliminates polling latency.

This tests deep query planner knowledge. The “obvious” answer — “the index is missing” — is wrong. The index exists. The planner chose to ignore it.

What weak candidates say: “The index must be corrupt. Rebuild it.” or “Force an index scan with SET enable_seqscan = off.” — The first is extremely unlikely. The second is a brute-force hack that masks the real problem and can make other queries worse.What strong candidates say:The PostgreSQL query planner is cost-based. It estimates the cost of every possible execution plan (sequential scan, index scan, bitmap index scan, etc.) and picks the cheapest one. If it chose a sequential scan over an index scan on a 200M-row table, it is telling you something. Either the planner’s cost estimate is wrong, or the planner is actually right and the sequential scan really is cheaper for this specific query. Let me walk through the common reasons.Reason 1: The query returns too many rows (low selectivity).If the query matches 30%+ of the table, a sequential scan is genuinely faster than an index scan. Why? An index scan requires: (1) traverse the B-tree to find matching keys, (2) for each key, follow a pointer to the heap (table) to fetch the full row. At 30% selectivity on 200M rows, that is 60M random I/O operations to the heap. A sequential scan reads the entire table in one linear pass — sequential I/O is 10-100x faster than random I/O on both HDDs and SSDs.The threshold where the planner switches from index scan to sequential scan is typically around 5-15% of the table, depending on random_page_cost and seq_page_cost settings.Diagnosis: Check EXPLAIN ANALYZE for the estimated row count. If the planner estimates returning 50M+ rows from a 200M-row table, the sequential scan is correct. The fix is not to force an index — it is to add more selective filters to the query or rethink the access pattern.Reason 2: Stale statistics.PostgreSQL’s planner uses table statistics (row counts, value distribution, most common values, histogram of column values) to estimate selectivity. If the statistics are stale — the table had 10M rows when ANALYZE last ran but now has 200M rows — the planner’s estimates are wrong.Diagnosis: Compare EXPLAIN (estimated) with EXPLAIN ANALYZE (actual). If estimated rows = 50,000 but actual rows = 500, the statistics are wrong. The planner overestimated the result set and chose a sequential scan based on the wrong estimate.Fix:
ANALYZE your_table;
-- For high-cardinality columns, increase statistics granularity:
ALTER TABLE your_table ALTER COLUMN important_column SET STATISTICS 1000;
ANALYZE your_table;
The default statistics target is 100 (PostgreSQL samples 100 * 300 = 30,000 rows to build histograms). For a 200M-row table with skewed data distribution, increasing to 1000 gives much better estimates for columns with many distinct values.Reason 3: Correlated columns confusing the planner.If your WHERE clause filters on two columns (WHERE status = 'active' AND region = 'us-east') and the planner assumes independence (5% of rows are active, 20% are us-east, so 1% match both), but in reality 90% of active rows are in us-east (correlation), the planner dramatically underestimates the result set. With the wrong estimate, it might choose an index scan expecting 2M rows but actually get 36M rows, or it might choose a sequential scan expecting 20M rows when only 2M match.Fix: PostgreSQL 10+ supports multivariate statistics:
CREATE STATISTICS stat_status_region (dependencies, ndistinct, mcv)
ON status, region FROM your_table;
ANALYZE your_table;
This tells the planner about the correlation between status and region, leading to more accurate row estimates.Reason 4: Cost parameters are misconfigured.PostgreSQL’s default random_page_cost = 4.0 and seq_page_cost = 1.0 assumes that random I/O is 4x more expensive than sequential I/O. This was accurate for spinning disks. On SSDs (which most production databases use), random I/O is only 1.1-1.5x more expensive than sequential I/O. With the default settings, the planner overestimates the cost of index scans (which involve random I/O) and favors sequential scans.Fix:
-- For SSD-backed storage:
SET random_page_cost = 1.1;
-- For data mostly in cache:
SET effective_cache_size = '48GB';  -- ~75% of total RAM
Setting random_page_cost closer to seq_page_cost makes the planner more willing to use index scans, which is correct for SSD-backed databases.Reason 5: The index is on the wrong column order.A composite index on (status, created_at) helps WHERE status = 'active' AND created_at > '2025-01-01' but the planner cannot use it for WHERE created_at > '2025-01-01' alone (leftmost prefix rule). The index exists, it matches the columns, but the column order does not match the query pattern.What I would NEVER do:SET enable_seqscan = off — this forces the planner to avoid sequential scans globally. It might “fix” your one query but degrades every other query where a sequential scan was genuinely optimal. It is a debugging tool, not a production setting. Use it only to verify “would an index scan be faster?” then fix the root cause.War Story: At a healthcare SaaS company, a report query on a 150M-row patient records table was doing a sequential scan despite a perfect composite index on (provider_id, visit_date). EXPLAIN showed estimated rows = 12M, actual rows = 3,200. The statistics were wildly off because autovacuum had not analyzed the table in 3 weeks — the table had grown from 80M to 150M rows in that period due to a data migration, and autovacuum’s threshold (autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * n_live_tup) had not triggered because the scale factor was set to 10% (the default). At 80M rows, 10% means autovacuum waits for 8M new rows before re-analyzing. We ran ANALYZE manually — query time dropped from 4 minutes (sequential scan) to 80ms (index scan). Then we tuned autovacuum for that table: ALTER TABLE patient_records SET (autovacuum_analyze_scale_factor = 0.01) — now autovacuum re-analyzes after every 1.5M new rows instead of every 8M.

Follow-up: How do you investigate whether the planner’s estimate is wrong without running the full query?

Use EXPLAIN (without ANALYZE) to see the planner’s estimates without executing the query. Compare the estimated rows to a quick approximation:
-- Quick row count estimate from statistics (instant, no table scan):
SELECT reltuples::bigint FROM pg_class WHERE relname = 'your_table';

-- Estimated selectivity for a specific condition:
SELECT * FROM pg_stats
WHERE tablename = 'your_table' AND attname = 'status';
-- Look at most_common_vals and most_common_freqs
If pg_stats shows status = 'active' has frequency 0.05 (5%) and the EXPLAIN estimates 10M rows for a 200M-row table, the estimate is correct. If it estimates 100M rows, the statistics are stale.

Follow-up: What is a bitmap index scan and when does PostgreSQL use it instead of a regular index scan?

A bitmap index scan is the middle ground between a sequential scan and an index scan. PostgreSQL first scans the index and builds a bitmap of all matching heap pages (not individual rows). Then it scans the heap pages in physical order (sequential I/O, not random). This combines the selectivity of an index with the I/O efficiency of sequential access.The planner uses a bitmap scan when selectivity is moderate (1-15% of the table). Too selective -> regular index scan (few random reads). Not selective enough -> sequential scan (read everything). In the middle -> bitmap scan. It is also used when combining multiple indexes: WHERE status = 'active' AND region = 'us-east' with separate indexes on status and region -> bitmap AND of the two indexes, then heap scan. This is why PostgreSQL can sometimes use two single-column indexes instead of a composite index, though the composite index is typically faster.

This question separates people who have run migrations on toy databases from people who have sweated through a 12-hour online DDL on a production table.

What weak candidates say: “Use ALTER TABLE ADD INDEX and ALTER TABLE ADD COLUMN. MySQL supports online DDL.” — Technically true for simple cases, dangerously naive at 2B rows.What strong candidates say:At 2 billion rows, every migration operation is a multi-hour event with production-safety implications. Here is the operational plan I would run, broken into phases with explicit go/no-go checkpoints.Phase 0: Assessment (1-2 days before migration).Measure the table: row count, average row size, total data size, number of existing indexes, current write rate (inserts/updates/deletes per second), peak hours, and replication lag baseline. On a 2B-row InnoDB table with an average row size of 200 bytes, the data is roughly 400GB. Adding a B-tree index will create an additional ~100-200GB of index data. Ensure the disk has 1.5x the table size as free space (600GB) — the migration tool creates a copy.Phase 1: Add the index using gh-ost (6-18 hours).Do NOT use native ALTER TABLE ADD INDEX at this scale. Even with MySQL 8’s online DDL, the metadata lock acquisition at the start and end of the operation can stall if there are long-running transactions. gh-ost avoids this entirely.
gh-ost \
  --host=primary.db.internal \
  --database=production \
  --table=events \
  --alter="ADD INDEX idx_events_user_created (user_id, created_at)" \
  --allow-on-master \
  --chunk-size=1000 \
  --max-lag-millis=1500 \
  --critical-lag=10000 \
  --throttle-query="SELECT UNIX_TIMESTAMP() - MAX(ts) FROM heartbeat.heartbeat WHERE server_id=1" \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --exact-rowcount \
  --execute
Key gh-ost flags explained:
  • --max-lag-millis=1500: gh-ost pauses if replica lag exceeds 1.5 seconds. This prevents the migration from causing lag that affects read traffic on replicas.
  • --critical-lag=10000: If lag exceeds 10 seconds, gh-ost aborts (something is seriously wrong).
  • --chunk-size=1000: Copy 1,000 rows per chunk. Smaller chunks = less lock contention but slower migration. At 2B rows with 1,000 rows/chunk, that is 2M chunks. At 50ms per chunk, the migration takes ~28 hours. Increase to 5,000 rows/chunk to bring it to ~6 hours, but monitor write latency.
  • --throttle-query: Custom query that gh-ost evaluates periodically. If it returns a value exceeding a threshold, gh-ost pauses. Use this to pause during peak hours.
Monitoring during the migration:
  • Watch gh-ost progress output (rows copied, ETA, chunks per second).
  • Watch replica lag on all replicas (pt-heartbeat is more reliable than SHOW SLAVE STATUS).
  • Watch primary write latency (P95 should not increase by more than 20% during the migration).
  • Watch disk I/O utilization — if it hits 80%+, reduce chunk size.
  • Watch connection count — gh-ost uses 2-3 connections, but the increased I/O can cause other queries to queue.
Phase 2: Split the column using expand-and-contract (3 deploys over 1-2 weeks).Splitting name into first_name and last_name is a multi-step operation:Deploy 1 — Expand: Add first_name and last_name columns (NULLable, no default). This is instant in MySQL 8 with ALGORITHM=INSTANT for column additions at the end. Deploy application code that writes to all three columns (name, first_name, last_name) on every INSERT and UPDATE. Old and new application instances coexist safely during rolling deploy because the old code writes only name (new columns are NULLable).Deploy 2 — Backfill: Run a batch job to populate first_name/last_name from name for existing rows. At 2B rows with 50,000 rows per batch and 100ms per batch, the backfill takes ~1.1 hours. Run during off-peak. Verify with SELECT COUNT(*) FROM events WHERE first_name IS NULL AND name IS NOT NULL — should be zero.Deploy 3 — Contract (weeks later): Switch application reads from name to first_name/last_name. After all instances are on the new code and you have verified zero references to the old column (grep the codebase, check CDC consumers, check analytics pipelines), drop the name column. In MySQL, dropping a column on a 2B-row table requires an online DDL or gh-ost — it is NOT instant.Rollback plan:
  • Phase 1 (index): gh-ost creates the ghost table alongside the original. If the migration fails or causes problems, it can be cancelled without affecting the original table. The ghost table is dropped. No data is lost, no schema is changed. This is the safest possible rollback.
  • Phase 2 (column split): If Deploy 1 causes issues, the new columns are NULLable and unused by old code — they can be dropped without impact. If Deploy 2 (backfill) has bugs, rerun it (the backfill should be idempotent). Deploy 3 is the point of no return — once the old column is dropped, you cannot roll back without restoring from backup.
Cost analysis: gh-ost on a 2B-row table generates significant I/O. On RDS, estimate 500M-1B additional I/O operations during the migration. At Aurora pricing (0.20/millionI/Os),thatis0.20/million I/Os), that is 100-200 in I/O costs. The ghost table temporarily doubles storage usage — if you are on gp3 with auto-scaling, storage costs increase by ~50/monthduringthemigrationwindow.Totalmigrationcost:50/month during the migration window. Total migration cost: 150-250 in AWS charges plus 2-3 days of engineering time.

Follow-up: The gh-ost migration is 80% complete after 14 hours. A P0 incident requires a deployment. Can you deploy, or does the migration block it?

gh-ost does not block deployments. The original table is fully writable throughout the migration. You can deploy application changes normally. However: if the deployment includes a schema migration on the same table, that will conflict — two concurrent schema changes on one table is a recipe for disaster. Coordinate with the team: no DDL on the migrating table until gh-ost completes. For application code changes, deploy freely.If the P0 requires killing the migration (freeing I/O for emergency queries), gh-ost supports graceful pause: echo throttle | nc -w 1 127.0.0.1 5678 (if using the socket interface). Resume when the incident is resolved. The migration picks up where it left off.

Follow-up: How do you validate that the migration did not corrupt data?

After gh-ost completes the table swap:
  1. Row count verification. SELECT COUNT(*) FROM events should match the pre-migration count (plus any rows inserted during the migration — gh-ost reports this delta). On a 2B-row table, COUNT(*) is expensive. Use SELECT reltuples FROM pg_class for PostgreSQL or SELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES for MySQL (approximate but instant).
  2. Checksum verification. pt-table-checksum (Percona) computes CRC32 checksums across primary and replicas. Run it on a sample of rows (1% = 20M rows) to verify data consistency between primary and replicas after the migration.
  3. Application-level verification. Run your integration test suite against the migrated table. Spot-check specific rows that are known to have edge cases (NULLs, unicode, maximum-length values, boundary dates).

This is an organizational architecture question disguised as a technical one. It tests whether you understand that API contract ownership is a people problem with technical solutions.

What weak candidates say: “Team A should not have changed the API without telling Team B.” — Correct but insufficient. Communication alone does not prevent the next incident.What strong candidates say:Everyone is at fault, and the root cause is not a code change — it is the absence of a contract governance system. Let me break down what went wrong and how to fix it permanently.The failure cascade:
  1. Team A changed their response format. They probably considered it a minor change (adding a field, changing a null to an empty array, renaming an internal field that “nobody uses”). They deployed without checking consumer compatibility because there was no mechanism to do so.
  2. Team B’s deserialization broke because they either had strict parsing, cached the old format, or depended on an undocumented field/behavior. They had no early warning because there was no contract test, no schema validation, and no pre-deploy compatibility check.
  3. Team C’s service degraded because it depended on both A and B. A’s change did not break C directly, but B’s failure cascaded. Team C had no circuit breaker or fallback for B’s outage.
The organizational fix — contract ownership model:Step 1: Every service publishes a machine-readable schema (OpenAPI, protobuf, AsyncAPI). The schema is versioned and lives in a schema registry (shared git repo, Confluent Schema Registry, or Buf Schema Registry for protobuf). Changes to the schema must pass CI checks before merging.Step 2: Breaking change detection is automated. CI runs openapi-diff, buf breaking, or a custom schema comparator on every PR that modifies a schema. Breaking changes fail the build. The team must either make the change backward-compatible or explicitly bump the version with a migration guide.Step 3: Consumer-driven contract tests for critical relationships. Team B writes a Pact contract: “I expect field X to be a string, field Y to be an integer, field Z to be present.” Team A runs these contracts in their CI pipeline. If a change violates any consumer contract, the build fails. This does not scale to 50 consumers, but for the 3-5 most critical relationships, it is the strongest protection available.Step 4: Separate schema changes from behavior changes. The schema registry enforces: additive changes (new fields, new endpoints) are always safe and can be released without consumer notification. Breaking changes require a version bump and a deprecation timeline. This is the Stripe model — schema evolution rules baked into tooling, not tribal knowledge.Step 5: Service owners, not consumers, are accountable for backward compatibility. The producing team is responsible for not breaking consumers. This creates a natural incentive to invest in versioning, compatibility testing, and deprecation programs. If the incentive is reversed (“consumers should handle any change”), producers will break things constantly because the cost falls on someone else.The technical fix — defensive consuming:Team B’s service should survive a provider’s format change without failing:
  1. Lenient deserialization. Configure JSON/protobuf parsers to ignore unknown fields (DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES = false in Jackson). Never fail on extra data.
  2. Null-safe field access. Use optional types, null-coalescing operators, or explicit presence checks for every field from an external service. Assume any field might be absent, null, or a different type.
  3. Circuit breaker on upstream calls. If Team A’s service returns unexpected responses, Team B’s circuit breaker opens after N failures and returns a fallback (cached data, default values, or a graceful error) instead of cascading the failure.
  4. Contract monitoring in production. Deploy a lightweight response validator that checks actual production responses against the expected schema. Alert when a field disappears, changes type, or a new unknown field appears — before it breaks business logic.
The cost of not doing this: Every cross-team API break costs 2-8 hours of engineering time across 2-3 teams (diagnosis, hotfix, deploy, post-mortem). At a loaded engineering cost of 150/hour,eachincidentcosts150/hour, each incident costs 900-3,600. If it happens monthly, that is $11K-43K/year — more than the cost of setting up a schema registry and contract testing pipeline. The ROI on contract governance is almost always positive within 6 months.

Follow-up: Team A says “we have 30 consumers, we cannot run 30 contract test suites in our CI.” How do you scale this?

You do not need 30 Pact contracts. The 80/20 rule applies: identify the 3-5 highest-traffic consumers that would cause the most damage if broken. Run their contracts. For the remaining 25, use schema-level compatibility checks (detect any structural breaking change) rather than consumer-specific contracts. A schema compatibility check is one tool running once, not 30 test suites.Also: move from “consumer writes the contract” to “schema registry enforces compatibility rules.” The registry rejects any schema change that removes a required field, changes a type, or renames a field — regardless of which consumer cares. This is the difference between “test against specific consumers” (does not scale) and “enforce universal compatibility rules” (scales to any number of consumers).

Follow-up: How do you handle the politics when Team A says the change was documented in their changelog and Team B “should have read it”?

Changelogs are not contracts. If compliance depends on humans reading a document, it will fail at scale. The fix is mechanical: if Team A’s CI cannot merge a breaking change without consumers acknowledging it (via a contract test update, a Pact verification, or an explicit version bump), the “I did not read the changelog” failure mode is eliminated. Shift the enforcement from process (people reading docs) to tooling (CI preventing merges). The political discussion ends when both teams agree on the tooling — the tool becomes the neutral arbiter instead of Slack threads and postmortems.

This tests the intersection of technical architecture, business judgment, and operational decision-making under pressure. It is a staff-level question.

What weak candidates say: “Rate-limit the tenant” or “Tell them to optimize their queries.” — Both are tone-deaf to the business reality.What strong candidates say:This is a noisy-neighbor problem that requires simultaneous action on three fronts: immediate technical mitigation, medium-term architectural isolation, and a commercial conversation that aligns infrastructure cost with revenue.Immediate mitigation (hours, not days):
  1. Identify the hot queries. Use pg_stat_statements filtered by the tenant’s application user or connection pool to find the top 10 queries by total I/O. In my experience, 1-3 queries account for 80%+ of the I/O. A single missing index or a report query doing a sequential scan on a 50M-row partition is often the entire problem.
  2. Optimize the obvious. If the tenant’s top query is SELECT * FROM events WHERE tenant_id = 'big-corp' AND created_at > '2025-01-01' doing a sequential scan, adding CREATE INDEX CONCURRENTLY idx_events_tenant_date ON events(tenant_id, created_at) is a 10-minute fix that might reduce their I/O by 90%. This is the highest-leverage action.
  3. Route the tenant’s read traffic to a dedicated replica. If they have heavy analytics or reporting queries, route those to a read replica that other tenants do not share. This isolates the I/O impact immediately. Cost: one additional replica ($200-2,000/month depending on instance size).
  4. Apply a per-tenant statement timeout. SET statement_timeout = '10s' for this tenant’s database role. Their long-running queries get killed instead of consuming I/O for minutes. This is a blunt instrument but prevents the worst runaway queries from affecting others.
Medium-term architecture (weeks):
  1. Partition the tenant’s data. If they have 50M rows in a shared table, create a dedicated partition: ALTER TABLE events ATTACH PARTITION events_bigcorp FOR VALUES IN ('big-corp-uuid'). Their queries now hit a smaller, isolated B-tree. PostgreSQL’s partition pruning ensures other tenants’ queries never touch the big tenant’s partition.
  2. Per-tenant connection pool limits. Allocate a fixed slice of the connection pool (e.g., 20 out of 100 connections) to the enterprise tenant. Even if they saturate their slice, the other 80 connections remain available for other tenants. PgBouncer supports this via pool configuration per database/user.
  3. Per-tenant I/O quotas. PostgreSQL does not natively support I/O quotas, but you can approximate them with pg_qualstats (to identify expensive query patterns), auto_explain (to log queries exceeding a threshold), and application-level query throttling (a middleware that limits the tenant to N concurrent queries).
The commercial conversation:This is not a “tell them to stop” conversation. It is a “let us align your usage with the right tier” conversation.Frame it positively: “Your usage has grown significantly — congratulations, that means your business is scaling. You have outgrown the shared infrastructure tier, and we want to make sure your performance is great. We are recommending a dedicated infrastructure tier that gives you isolated database capacity, guaranteed IOPS, and a dedicated read replica for analytics. The cost is $X/month, which includes [specific resources].”Cost modeling for the commercial conversation:
ResourceShared Tier Cost (allocated)Dedicated Tier CostDifference
Database IOPS (60% of shared)~$600/mo (their share of the shared instance)$1,200/mo (dedicated db.r6g.xlarge)+$600/mo
Read replica for analytics$0 (shared)$800/mo (dedicated replica)+$800/mo
Dedicated monitoring/support$0~$500/mo (eng time)+$500/mo
Total~$600/mo~$2,500/mo+$1,900/mo
If the tenant pays 10,000/month,a10,000/month, a 2,500 infrastructure cost is 25% gross margin — acceptable. If they pay $500/month, the infrastructure cost exceeds revenue — you need to upsell or accept a loss-leader. This is the conversation the account team needs to have, armed with specific cost data that engineering provides.War Story: At a B2B analytics platform, a Fortune 500 tenant’s nightly ETL job consumed 85% of the shared PostgreSQL cluster’s IOPS for 4 hours every night. Other tenants’ dashboards were unusable from 11 PM to 3 AM. The initial instinct was to throttle them. Instead, we: (1) added a composite index that reduced their ETL’s read I/O by 70% (they were doing a sequential scan on a 200M-row table — a single index addition), (2) moved their analytics queries to a dedicated read replica (400/month),(3)renegotiatedtheircontracttoincludea"dedicatedanalyticstier"at400/month), (3) renegotiated their contract to include a "dedicated analytics tier" at 1,500/month additional revenue. Net result: other tenants’ P99 latency dropped from 8 seconds to 200ms, the enterprise tenant got faster analytics, and the company gained $1,500/month in revenue. The index optimization alone would have solved 70% of the problem for zero cost.

Follow-up: The tenant refuses to pay for dedicated infrastructure. They say “the platform should handle it.” What do you do?

This is a product decision, not an engineering decision. The platform team defines what is included in each tier. If the shared tier includes “shared infrastructure with fair-use limits,” enforce those limits. Implement per-tenant I/O monitoring with a soft limit (alert + dashboard showing their usage relative to their tier) and a hard limit (throttle queries exceeding 2x their tier’s allocated IOPS). The soft limit creates awareness; the hard limit prevents impact to others.Present the data: “Your usage is 12x what your tier includes. Here is a graph showing your I/O vs other tenants in the same tier. We can either move you to a tier that matches your usage, or apply fair-use limits that bring your usage in line with the tier you are paying for.”Most enterprise customers will pay for dedicated resources when shown the data. The small percentage that refuse are candidates for churn that improves margins — not all revenue is good revenue.

Follow-up: How would you measure and attribute per-tenant database cost in a shared environment?

This is the “FinOps for databases” problem, and most teams do not do it well.Measurement approach:
  1. Tag queries with tenant ID. Use PostgreSQL’s application_name or a query comment (/* tenant_id=abc123 */) to attribute queries to tenants. pg_stat_statements with pgaudit extension can capture per-query stats with tenant attribution.
  2. Calculate per-tenant IOPS. Sum shared_blks_read + shared_blks_hit per tenant from pg_stat_statements. Convert to IOPS: total_blocks / time_window.
  3. Calculate per-tenant storage. SELECT pg_total_relation_size('table') * (SELECT COUNT(*) FROM table WHERE tenant_id = ?) / (SELECT COUNT(*) FROM table) gives an approximation. For exact figures, partition by tenant and measure partition sizes.
  4. Allocate shared costs proportionally. The database instance cost, replica cost, backup cost, and operational cost are shared. Allocate proportionally to IOPS or query volume, not row count (a tenant with 1M rows but 100K queries/day costs more than a tenant with 10M rows and 100 queries/day).

This is a “rolling deploy + schema migration mismatch” incident question — one of the most common causes of API outages at companies that have moved past the basics.

What weak candidates say: “Roll back the migration.” — At what cost? Rolling back a migration on a live table that already has the new column, with two versions of the app running, is not straightforward.What strong candidates say:This is a textbook API/database mismatch during a rolling deploy, and it is entirely preventable. Let me walk through the incident response, the root cause, and the systemic fix.Incident response (minute by minute):Minute 0-2: Detect. Alerts fire on elevated 5xx rate. Error logs show INSERT INTO orders (...) VALUES (...) failing with ERROR: null value in column "priority" violates not-null constraint. The new column priority was added as NOT NULL without a default. Old application instances (still running on 7 of 10 pods) do not include priority in their INSERT statements.Minute 2-5: Mitigate. Two options, pick the fastest:Option A: Fix forward. Add a DEFAULT to the column immediately:
ALTER TABLE orders ALTER COLUMN priority SET DEFAULT 'normal';
In PostgreSQL, this is a catalog-only change — instant, no table rewrite. Old instances’ INSERTs now succeed (the default fills in priority). New instances explicitly set priority. This is the fastest fix — under 60 seconds to execute.Option B: Roll back the deploy. Stop the rollout and roll back to the previous application version. All instances now run old code. The new column exists but is unused (old code ignores it, and the DEFAULT set in Option A means it does not block inserts). This is slower (5-10 minutes for a full rollback) and only necessary if the new code has additional issues.Minute 5-10: Verify. Error rate should drop to zero within seconds of Option A. Confirm by watching pg_stat_activity — no more failed INSERT statements. Check that the default value is semantically correct (does 'normal' priority make sense for orders created during the incident window?). If not, plan a backfill to correct the values later.Root cause analysis:The migration and the application deploy were coupled in a way that assumed atomic deployment — all instances switch from old to new simultaneously. In a rolling deploy, this assumption is false. For minutes (or hours in large fleets), old and new code run against the same database. The migration must be compatible with both versions.The systemic fix — decouple migrations from deploys:Rule 1: Schema migrations must be backward-compatible with the currently running application version. The migration runs first (or as part of the deploy), but old application instances must continue to work against the new schema. This means:
  • Adding a column: must be NULLable or have a DEFAULT.
  • Removing a column: only after all instances have stopped referencing it (a separate deploy).
  • Renaming a column: never in one step. Add the new name, dual-write, migrate reads, drop the old name.
Rule 2: Separate “migrate schema” from “deploy code.” Run the migration as a pre-deploy step. Wait for it to complete. Then start the rolling deploy. The migration stands alone and must be safe against the old code. The deploy stands alone and must work against the new schema.Rule 3: CI must enforce backward compatibility. Before merging any migration, a CI check deploys the migration against a copy of the current schema and runs the current application’s test suite against it. If any test fails, the migration is not backward-compatible and must be revised.Rule 4: Every migration has a reverse migration, and the reverse is tested. The reverse of “add column with NOT NULL” is “ALTER TABLE ALTER COLUMN DROP NOT NULL” (or “DROP COLUMN” if you want to fully revert). The reverse is tested in CI against the new application version to ensure it is safe to execute.Measuring the blast radius:During the incident, how many requests failed? SELECT COUNT(*) FROM api_logs WHERE status = 500 AND path LIKE '/orders%' AND timestamp BETWEEN 'incident_start' AND 'incident_end'. If the rolling deploy was at 30% (3 of 10 instances on old code), approximately 30% of write requests failed for the duration of the incident (10 minutes). At 100 RPS to the orders endpoint, that is ~30 * 600 = 18,000 failed requests. If each failed request is an order submission, the business impact is 18,000 lost or delayed orders — this is the number the post-mortem report must include.Cost of the incident: 18,000 failed orders * average order value * conversion rate loss from retry friction. If AOV is 50and1050 and 10% of affected users abandon, that is 90,000 in lost revenue from a missing DEFAULT clause. This is why migration safety is not a “nice to have.”

Follow-up: How do you handle a migration that is NOT backward-compatible — a column type change, for example?

A type change (e.g., VARCHAR(50) to TEXT, or INTEGER to BIGINT) requires the expand-and-contract approach spread across multiple deploys:Deploy 1 (schema only): Add a new column with the target type: ALTER TABLE orders ADD COLUMN amount_v2 NUMERIC(15,2). No application changes.Deploy 2 (dual-write): Application writes to both amount (old) and amount_v2 (new) on every INSERT and UPDATE. Reads still use amount. Both old and new application versions work because amount is still the source of truth.Deploy 3 (backfill): Batch job copies amount to amount_v2 for all existing rows: UPDATE orders SET amount_v2 = amount::NUMERIC(15,2) WHERE amount_v2 IS NULL in batches of 50,000.Deploy 4 (switch reads): Application reads from amount_v2. Writes still go to both columns (for rollback safety).Deploy 5 (drop old column): Remove the old column after verifying zero references. This is the point of no return.Minimum timeline: 5 deploys over 1-3 weeks. There is no shortcut that is safe under rolling deploys. Each deploy is independently rollbackable until Deploy 5.

Follow-up: What about security? A NOT NULL constraint was missing for 10 minutes. Could an attacker have exploited this?

Yes. If the priority column is used in authorization logic (e.g., priority = 'urgent' routes to a faster processing queue, or priority = 'admin' grants elevated access), an attacker who discovers the missing constraint during the incident window could INSERT rows with NULL priority and observe how the system handles them. If the application code treats NULL as a special case (or fails open), this is an exploit vector.More realistically: the security risk is in the data integrity violation, not direct exploitation. 18,000 orders with NULL priority might bypass business rules that depend on priority being set. A post-incident data audit should verify: are any rows in an invalid state? Do any NULL-priority orders need to be corrected?The deeper security lesson: migration validation should include a security review for any constraint change. Adding or removing a NOT NULL, CHECK, or FOREIGN KEY constraint changes what data the database will accept. This is a security boundary, not just a schema change.
Strong Answer Framework:Step 1 - Reject the v3 framing: A v3 forces every consumer to migrate simultaneously, which is operationally impossible when you have no contact info for half of them and the long tail integrated 4 years ago. Treat the breaking change as a Stripe-style dated version (for example 2026-04-21) layered on top of v2 rather than a new major. Each consumer is pinned to the version they integrated on, and new behavior is opt-in per endpoint, not per API.Step 2 - Build a request/response transformation chain: Internally, the service implements only the newest shape. Every dated version installs an upgrade transformer (old request to new) and a downgrade transformer (new response to old). A request from a consumer pinned to 2023-01-15 runs through the chain of transformers between 2023-01-15 and HEAD before hitting the resolver, and the response walks back down. This keeps the core code free of version branching and localizes every change to a single, reviewable file per version.Step 3 - Drive migration with telemetry, not email blasts: Every response carries X-Stripe-Version-style headers and a Sunset header with a concrete date. Dashboards show request volume per pinned version per consumer. Deprecation starts when a version hits under 1 percent of volume, and account managers are given a named list of laggards. Only versions on which no consumer depends get removed. Nothing is retired on a calendar; it is retired on a usage threshold.Real-World Example: Stripe has shipped over 100 dated API versions since 2011 without a v3. Their engineering blog post “APIs as infrastructure” describes the transformer chain approach. GitHub, by contrast, issued a hard v3-to-v4 cutover for GraphQL and spent years running both in parallel because they could not unilaterally move enterprise customers.Senior Follow-up Questions:
  • “How do you stop the transformer chain from becoming a 100-step performance disaster?” - Strong answer: Cache a compiled transformer per pinned version at boot so per-request cost is one function call, not a loop, and snapshot the chain every 2 years by rebasing old versions onto a new baseline.
  • “A consumer says they cannot upgrade because their auditors froze the integration. What do you do?” - Strong answer: Offer a paid long-term-support tier with a contractual sunset 24 months out, which puts a price on the operational cost of keeping their version alive and makes their procurement team the one asking for the upgrade.
  • “How do you detect a breaking change was accidentally shipped inside a minor version?” - Strong answer: Run consumer-driven contract tests in CI using recorded production traffic samples per pinned version, so any resolver change that produces a different response shape for a recorded request fails the build before merge.
Common Wrong Answers:
  • “Just cut a v3 and give consumers 6 months.” - You do not control the consumer’s release cycle; enterprise integrators run on yearly planning cycles and regulated industries on multi-year ones. A 6-month window forces a hard break.
  • “Use feature flags per consumer.” - Feature flags are for progressive rollout of behavior, not for maintaining a stable contract. They quickly degenerate into thousands of per-consumer branches with no audit trail.
Further Reading:
  • “API Versioning at Stripe” - Brandur Leach (engineering blog, stripe.com/blog)
  • RFC 8594 - The Sunset HTTP Header Field
  • Related chapter: API Gateways and Service Mesh
Strong Answer Framework:Step 1 - Frame it as a read/write ratio decision, not a religion: Normalization optimizes write-time consistency at the cost of read-time work. Denormalization inverts that trade. The right answer is the one that matches the traffic: if the surface does 500 reads per write, denormalization wins even if each write fans out to 5 tables. If the ratio is 5 reads per write, keep the joins and fix the query plan instead. Measure first, argue second.Step 2 - Pick a denormalization shape that matches the update pattern: If the denormalized fields come from a slow-changing parent (user display name, country), use synchronous denormalization at write time in the same transaction. If they come from a fast-changing aggregate (follower count, unread count), use an async materialized view or a CDC-driven projection (Debezium to Kafka to a projection service). Never mix these. Synchronous fanout across 5 fast-changing sources is how you get write amplification and deadlocks in production.Step 3 - Budget for the consistency hole: Every denormalized field is a second source of truth that can drift. Budget for a reconciliation job (nightly batch or continuous CDC verifier) that compares the denormalized value to the normalized source and emits a metric for drift rate. If drift is above a threshold (say 0.01 percent), that is your early warning that a writer is bypassing the fanout path, and you fix it before users notice.Real-World Example: Facebook’s TAO paper (2013) documents exactly this trade: they denormalized the social graph into a single read-optimized store and accepted eventual consistency for a 1000x read-to-write advantage. Discord did the opposite for messages in 2017 and kept a normalized Cassandra model because messages are append-only and read-heavy per-channel, so joins were cheap.Senior Follow-up Questions:
  • “Your async denormalization lags by 30 seconds during peak. A user updates their name and it stays old on the profile page. How do you fix without going synchronous?” - Strong answer: Read-your-own-writes via a session-scoped cache that overlays the projection for the writing user only, leaving other viewers on the eventually consistent path.
  • “How do you denormalize when the source table has row-level security policies?” - Strong answer: You do not denormalize the raw fields; you denormalize the already-authorized projection output, so every reader still goes through the authorization boundary, just against the denormalized store.
  • “What metric tells you the denormalization was a mistake?” - Strong answer: Write latency P99 climbing faster than read latency P99 is dropping, meaning you have moved the pain rather than removed it, plus a rising reconciliation drift rate.
Common Wrong Answers:
  • “Always denormalize for performance.” - Ignores write amplification, cache invalidation cost, and the consistency budget. At sufficient write volume, synchronous fanout exceeds the latency of the original join.
  • “Add more indexes first and do not denormalize.” - Indexes do not help when the query plan is join-bound rather than scan-bound. At some point a 14-way join cannot be index-rescued; the planner still has to do the join work.
Further Reading:
  • “TAO: Facebook’s Distributed Data Store for the Social Graph” - USENIX ATC 2013
  • “Designing Data-Intensive Applications” chapter 3 - Martin Kleppmann
  • Related chapter: Database Deep Dives
Strong Answer Framework:Step 1 - Accept that 2PC is off the table: Two-phase commit across Postgres, DynamoDB, and a third-party HTTP API is impossible because Stripe will never participate as a resource manager in your transaction coordinator. The real design is a saga: a sequence of local transactions, each with a compensating transaction, coordinated by a durable workflow engine rather than a database. Every step must be idempotent because every step will be retried.Step 2 - Design the happy path and the compensation path as one unit: For each forward step (reserve inventory, authorize payment, create order), define the compensator (release inventory, void authorization, cancel order). Store the saga state in a single source of truth (Temporal, AWS Step Functions, or a Postgres saga_state table) so that after a crash the orchestrator can resume from the last committed step rather than replaying from the start. The orchestrator, not any of the participant databases, owns atomicity.Step 3 - Make every external call idempotent with a deterministic key: Use the saga ID as the idempotency key on Stripe charges, the DynamoDB conditional write condition, and the Postgres insert’s unique constraint. That way a network timeout that you cannot tell apart from a failure becomes safe to retry. Without idempotency keys, your compensator can charge twice or release inventory you never reserved.Real-World Example: Uber’s Cadence (now Temporal) was built explicitly for this class of problem after their payments flow suffered exactly this failure mode in 2016. Amazon’s 2018 re:Invent talk on Step Functions for order orchestration describes their version of the same pattern for retail checkout.Senior Follow-up Questions:
  • “The compensator for ‘release inventory’ itself fails. Now what?” - Strong answer: The saga enters a stuck state that pages a human; there is no compensator-for-the-compensator, so you build a DLQ of stuck sagas with tooling to let an operator force-resolve with full audit trail.
  • “How is this different from eventual consistency via CDC?” - Strong answer: CDC propagates state between your own stores where you control both sides; sagas coordinate mutations across systems where one side is external or cannot participate in a replication stream, and the business requires an explicit compensation semantic, not just read-side convergence.
  • “Stripe already charged the card, but the order insert is failing because Postgres is in read-only mode during failover. What do you do?” - Strong answer: The saga pauses and retries with backoff rather than compensating, because voiding a successful authorization is worse customer experience than a 30 second delay; compensation is for logical failures, not infrastructure blips.
Common Wrong Answers:
  • “Use distributed transactions with XA.” - DynamoDB does not speak XA, Stripe does not participate in any transaction protocol, and even Postgres XA has well-documented recovery bugs under coordinator failure.
  • “Just retry until everything succeeds.” - Without compensators, a permanent failure in step 3 leaves steps 1 and 2 committed forever, which is exactly the charged-money-no-inventory bug you are trying to avoid.
Further Reading:
  • “Sagas” - Garcia-Molina and Salem, 1987 (the original paper)
  • Temporal documentation on saga pattern
  • Related chapter: Cloud Service Patterns