Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Part VIII — APIs and Integration
- Database Deep Dives — PostgreSQL internals (MVCC, VACUUM, query planner), DynamoDB single-table design and partition strategies, MongoDB aggregation pipelines, and Redis architecture. When this chapter says “add an index,” that chapter explains why the B-tree is 4 levels deep and what happens during a page split.
- GraphQL at Scale — Federation, persisted queries, query complexity analysis, and schema governance. This chapter introduces GraphQL as an API paradigm; that chapter covers what it takes to run it in production with multiple teams and billions of requests.
- API Gateways & Service Mesh — Rate limiting, authentication, TLS termination, and request routing at the gateway level. This chapter covers rate limiting from the API designer’s perspective; that chapter covers it from the infrastructure perspective (Kong, Envoy, Istio).
- Cloud Service Patterns — RDS vs Aurora, DynamoDB capacity modes, ElastiCache deployment patterns, and RDS Proxy for serverless connection management. When this chapter discusses database selection, that chapter covers the specific AWS managed services and their operational characteristics.
Chapter 15: API Design
- Predictable — Consistent naming conventions everywhere. If one endpoint uses
created_at, every endpoint usescreated_at(notcreatedAtin some andcreation_datein others). IfGET /usersreturns a list,GET /ordersreturns a list in the same envelope. A developer who has used one endpoint can guess the shape of any other endpoint. - Forgiving — When something goes wrong, the API tells you exactly what happened and how to fix it. Not
400 Bad Requestwith no body — but{ "error": { "code": "invalid_currency", "message": "Currency 'usd' is not valid. Did you mean 'USD'? Currencies must be uppercase ISO 4217 codes.", "param": "currency" } }. Good error messages are the best API documentation. - Evolvable — Versioning is baked in from day one, not bolted on when the first breaking change happens. Every response includes only additive fields. Removing or renaming a field goes through a deprecation cycle. The API you ship on launch day should be designed assuming it will still be running in five years with clients you have never heard of.
You are designing a public API for a SaaS product. What are the top 5 design decisions you make upfront?
You are designing a public API for a SaaS product. What are the top 5 design decisions you make upfront?
- Authentication mechanism — API keys for simplicity, OAuth 2.0 if delegated access is needed.
- Versioning strategy — URL-based (
/v1/) for discoverability, with a deprecation policy from day one. - Pagination — cursor-based (not offset) for consistency at scale; every list endpoint returns pagination metadata.
- Error format — consistent error response schema across all endpoints (error code, message, details, request_id for support).
- Rate limiting — per API key, with
429 Too Many RequestsandRetry-Afterheader.
request_id in every response for debugging, and write OpenAPI spec before writing code.What weak candidates say vs what strong candidates say:- Weak: “I would use REST with JSON and add authentication.” — Too vague, no specifics on versioning, pagination, or error handling. Shows no awareness of the decisions that cause real pain later.
- Strong: “My first five decisions would be versioning, auth, pagination, error format, and rate limiting — because those are the five things that are hardest to change after launch and cause the most partner friction when done wrong.” — Shows prioritization and understanding of irreversibility.
- Failure mode: “What happens if you skip the idempotency decision at launch?” — Every retry creates duplicates. Retrofitting idempotency keys onto an existing API is a breaking change (new required header). Partners who integrated without it will resist.
- Rollout: “How do you roll out rate limiting to existing consumers without breaking them?” — Announce 30 days before enforcement, start in monitoring-only mode (log but do not reject), then enforce with generous initial limits. Tighten over time.
- Measurement: “How do you know your API design is working?” — Track time-to-first-successful-call for new integrators (target: <5 minutes with docs), support ticket rate per 1,000 API calls, and error rate breakdown by status code.
- Cost: “What is the cost of getting versioning wrong?” — Stripe maintains hundreds of API versions through a transformation chain. This is viable at $14B revenue. For most companies, rearchitecting versioning after launch costs 3-6 months of engineering time.
- Security: “What auth decision would you avoid for a public API?” — Never use API keys alone without HTTPS. Never embed secrets in URLs (they appear in logs, browser history, referer headers). Always support key rotation without downtime.
- Governance: “Who owns the API design decisions?” — A single API design lead or working group, not each team independently. Inconsistency across endpoints is the number-one developer complaint.
GET /users returns camelCase JSON but GET /orders returns snake_case. The team says it is fine because different developers wrote them. Write the Slack message you would send to the team explaining why this must be fixed before launch.”Real-World Example. Stripe’s API was designed at launch (2011) with exactly this set of decisions locked in — API keys for auth, cursor pagination, consistent error envelope with code/message/param/type, Idempotency-Key header on every mutation, per-key rate limits. Over a decade later, they have never had to rewrite the basics. Contrast with Twitter’s v1 API, which lacked idempotency keys and proper cursor pagination — retrofitting both required breaking changes that took years to resolve.spectral (OpenAPI linter) that fails builds on style violations like camelCase/snake_case mixing, and (3) an API design review board that signs off on new endpoints before they ship. The board is the backstop for judgment calls the linter cannot catch.Q: A developer pushes back and says “versioning is premature optimization; we’ll add it when we need it.” What do you say?
A: Adding versioning later is a breaking change — every existing client now has to send a version header they did not send before, or migrate from unversioned /users to /v1/users. Adding versioning on day one is a 30-minute decision that costs nothing. The asymmetry is obvious: cheap to do now, impossible to do later without breakage. It is not premature — it is the cheapest point in the project’s lifetime to make the decision.- Microsoft REST API Guidelines
- Zalando RESTful API Guidelines
- Stripe Blog — “APIs as infrastructure: future-proofing Stripe with versioning” by Brandur Leach
AI-Assisted Engineering Lens: API Design with LLMs
AI-Assisted Engineering Lens: API Design with LLMs
- OpenAPI spec generation: Tools like GitHub Copilot and Claude can generate a complete OpenAPI 3.0 spec from a natural-language description of your resources and relationships. This accelerates the “spec-first” approach — you describe the domain, the LLM generates the spec, and you refine it. What used to take a day of YAML editing takes 30 minutes of iterating with an AI assistant.
- Consistency enforcement: Ask an LLM to review your OpenAPI spec for naming inconsistencies, missing error codes, or pagination gaps. It catches the “camelCase here, snake_case there” problems faster than manual review.
- Client SDK generation: With a clean OpenAPI spec, tools like
openapi-generatorproduce client SDKs in 40+ languages. LLMs can customize the generated code to match your internal patterns. - What LLMs cannot do: They cannot make the judgment calls — whether to use URL-path or header versioning, whether your error format should include stack traces in development mode, or whether your rate limits should allow bursts. These decisions require understanding your consumers, your team’s operational maturity, and your business model. AI accelerates the execution; humans own the strategy.
15.1 REST API Design
Resources as nouns (/users), HTTP methods for actions, plural nouns, hierarchy for relationships (/users/123/orders).
Status codes: 200 OK, 201 Created, 204 No Content, 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 409 Conflict, 422 Unprocessable Entity, 429 Too Many Requests, 500 Internal Server Error.
Pagination: Offset-based (?page=2&size=20) is simple but slow on large datasets. Cursor-based (?cursor=abc&size=20) is consistent regardless of size. Always return pagination metadata.
API Pagination Patterns — Cursor vs Offset vs Keyset
Pagination seems simple until you have 50 million rows, real-time inserts, and clients that expect stable page results. The strategy you choose has direct consequences for database performance, client complexity, and data consistency. Here is what actually works. Three approaches compared:| Criteria | Offset-Based | Cursor-Based (Opaque) | Keyset (Seek) |
|---|---|---|---|
| Request format | ?page=3&size=20 or ?offset=40&limit=20 | ?cursor=eyJpZCI6MTIzfQ&limit=20 | ?after_id=123&limit=20 |
| SQL generated | SELECT ... ORDER BY id LIMIT 20 OFFSET 40 | SELECT ... WHERE id > 123 ORDER BY id LIMIT 20 (decoded from cursor) | SELECT ... WHERE id > 123 ORDER BY id LIMIT 20 |
| Performance at page 1 | Fast | Fast | Fast |
| Performance at page 10,000 | Slow — DB scans and discards 200,000 rows | Fast — index seek, O(log n) | Fast — index seek, O(log n) |
| Stable results during inserts/deletes | No — inserting a row shifts all subsequent pages (users see duplicates or miss items) | Yes — cursor is anchored to a specific position | Yes — keyset is anchored to a specific value |
| Jump to arbitrary page | Yes (?page=500) | No — must traverse sequentially | No — must traverse sequentially |
| Sort flexibility | Any sort order | Any sort order (encoded in cursor) | Requires a unique, sortable column in the WHERE clause |
| Client complexity | Low — just increment page number | Low — pass opaque cursor from previous response | Medium — client must track the last-seen sort key |
| Cacheable | Yes (each page URL is stable if data does not change) | Difficult (cursor encodes position, varies per user) | Difficult (same issue as cursor) |
| Best for | Admin dashboards, small datasets (<100K rows), paginated UI with page numbers | Public APIs, mobile feeds, any large dataset | Internal APIs, streaming through large datasets, data exports |
| Used by | Many internal tools | Stripe, Slack, GitHub GraphQL (as Connections) | Twitter timeline (originally), many analytics APIs |
(created_at, id) comparison for stable ordering even when multiple rows share the same created_at. The cursor is an opaque base64 encoding of these values — clients never parse it.
Response envelope for cursor pagination:
edges (list of nodes with cursors), pageInfo (hasNextPage, hasPreviousPage, startCursor, endCursor). If you are building a GraphQL API, follow this spec — it is what every GraphQL client library expects. See the GraphQL at Scale chapter for implementing Connections efficiently with DataLoader and avoiding N+1 problems in paginated resolvers.15.2 gRPC
Binary protocol using Protocol Buffers over HTTP/2. Created by Google to replace their internal RPC framework (Stubby) — designed for high-throughput, low-latency service-to-service communication. How it works: Define your service in a.proto file (service methods + message types). The protoc compiler generates client and server code in your language. The client calls methods on a generated stub that looks like a local function call — the gRPC framework handles serialization, HTTP/2 framing, and network transport.
Why it is faster: Protocol Buffers are 3-10x smaller than JSON (binary, no field names in payload). HTTP/2 multiplexes multiple RPCs over a single TCP connection (no connection-per-request overhead). Generated code eliminates serialization bugs and reduces boilerplate.
Streaming: Server-side streaming (server sends a stream of responses to one request — live updates, large result sets). Client-side streaming (client sends a stream of messages — file upload, batch ingestion). Bidirectional streaming (both stream simultaneously — chat, real-time collaboration).
Trade-offs: Not browser-friendly (needs gRPC-Web or Envoy proxy). Not human-readable (cannot curl and read the response — need grpcurl or Postman). No HTTP caching (all requests are POST). Steeper learning curve (proto files, code generation, streaming complexity). For public APIs consumed by third-party developers, REST is still the standard.
.proto files, and the compiler generates code in your language. Much smaller and faster to serialize/deserialize than JSON.15.3 Idempotency
An idempotent operation produces the same result regardless of how many times it is called.GET /user/123 is naturally idempotent (reading does not change state). POST /orders is NOT naturally idempotent (calling it twice creates two orders). Making non-idempotent operations safe for retries is one of the most important distributed systems skills.
Implementation: Client generates a unique idempotency key (UUID) per request. Server checks if it has processed that key. If yes, returns the stored response. If no, processes and stores. Essential for payments, order creation, any operation where duplicates have real consequences.
Idempotency Key Patterns in Practice
Idempotency keys are more nuanced than “store a UUID.” Here is how production systems handle them: The basic pattern:- Client generates a UUID v4 and sends it in the
Idempotency-Keyheader. - Server checks a lookup table (Redis or database) for the key.
- If key exists and the request completed: return the stored response (same status code, same body).
- If key exists and the request is in-progress: return
409 Conflict(prevents concurrent retries from double-processing). - If key does not exist: create the record with status
processing, execute the operation, store the response, update status tocompleted.
- Idempotency keys are scoped to the API key (two different merchants can use the same key string without collision).
- Keys expire after 24 hours — retrying after that creates a new operation.
- If a retried request has a different body than the original, the server returns an error (the key is bound to the original parameters).
- The stored response includes the exact HTTP status code and body — a retried
POST /chargesthat originally returned402 Payment Failedreturns402again, not200.
created_at, use the database (not Redis) as the authoritative store for payment operations, add Redis as a fast-path cache, and design the TTL and cleanup strategy to keep the table bounded. I would also instrument the idempotency hit rate — if 15% of requests are retries hitting the idempotency store, that tells me we have a reliability problem upstream causing excessive retries.”AI-Assisted Engineering Lens: Idempotency Implementation with LLMs
AI-Assisted Engineering Lens: Idempotency Implementation with LLMs
- Schema generation: Describe your idempotency requirements to an LLM and it generates the database schema, the Lua script for Redis fast-path checks, and the application middleware — in minutes rather than hours of manual coding.
- Edge-case discovery: Ask an LLM “what are the failure modes of this idempotency implementation?” and it will enumerate: crash between key insert and business logic, Redis/DB inconsistency, concurrent requests with the same key, key collision across API consumers, TTL expiry during long-running operations. This is faster than discovering these edge cases in production.
- Code review: LLMs are excellent at spotting idempotency gaps in pull requests — “this POST endpoint modifies state but does not check for an idempotency key” or “the idempotency check and the database write are not in the same transaction.”
- Limitation: LLMs may suggest overly complex solutions (distributed locks, two-phase commits) when a simple database transaction with an idempotency table is sufficient. Always prefer the simplest correct implementation.
15.4 API Versioning and Backward Compatibility
Safe changes: Adding optional fields, new endpoints, new optional parameters. Breaking changes: Removing fields, renaming fields, changing types, making optional fields required.Versioning Strategies Compared
| Strategy | Example | Pros | Cons | Used By |
|---|---|---|---|---|
| URL path | /v1/users, /v2/users | Most discoverable, easy to route, easy to cache | URL pollution, harder to sunset | Stripe, GitHub, Twitter |
| Custom header | X-API-Version: 2 | Clean URLs, flexible | Hidden from browser, easy to forget | Azure, Jira |
| Accept header (content negotiation) | Accept: application/vnd.myapi.v2+json | HTTP-standards compliant, fine-grained | Complex, hard to test in browser | GitHub (also supports URL) |
| Query parameter | /users?version=2 | Easy to add, discoverable | Breaks caching, messy | Rarely recommended |
/v1/) for public APIs — it is the most discoverable and requires no special client configuration. Reserve header-based versioning for internal APIs where you control all clients. Content negotiation is standards-purist but creates friction for third-party developers.
- Announce deprecation in API changelog and response headers (
Deprecation: true,Sunset: 2026-12-01). - Give at least 12 months notice for major version sunsetting.
- New accounts default to the latest version; existing accounts are pinned to their current version.
- Provide a migration guide for every breaking change.
API Deprecation Programs — Running Them at Scale
Deprecation announcements without enforcement are wishful thinking. A real deprecation program has teeth and timelines. The deprecation funnel (what actually happens):| Phase | Timeline | Action | Response Rate (typical) |
|---|---|---|---|
| Announce | Month 0 | Changelog entry, Sunset header on responses, email to all consumers | 5% migrate immediately |
| Nudge | Month 3 | Dashboard banner, targeted emails to top-20 consumers by volume, office hours for migration questions | 15-20% cumulative |
| Pressure | Month 6 | Rate limits on deprecated endpoints tightened by 25%, new features only on latest version | 40-50% cumulative |
| Warn | Month 9 | Weekly emails to remaining consumers with countdown, Warning: 299 header on every response | 70-80% cumulative |
| Brownout | Month 11 | Disable deprecated endpoints for 1-hour windows on a published schedule — consumers see real failures in non-prod first | 90-95% cumulative |
| Sunset | Month 12 | Return 410 Gone with migration guide URL, keep transformation code for emergency re-enable | 100% (forced) |
410 after 24 months (not 12).
Clients you literally cannot reach: SDKs embedded in mobile apps have a long tail. An app released 2 years ago with your SDK v1 may still have 100K monthly active users who never update. You cannot email them — you do not know who they are. You cannot force an update — the app store controls that. Mitigations: (1) Instrument your API to track SDK version in a custom header (X-SDK-Version). Monitor the version distribution weekly. If 15% of traffic still comes from a deprecated SDK version, sunsetting the underlying API will break those users. (2) For mobile SDKs, build a “minimum supported version” mechanism: the SDK phones home on startup and receives a “please update” prompt if below minimum. (3) Accept that some old versions will exist forever. Your API versioning strategy must account for a 3-5 year tail, not just the 12-month deprecation cycle. This is why Stripe pins merchants to the API version they integrated with indefinitely rather than forcing upgrades.
Deprecation of internal APIs between teams: Internal deprecation is harder socially and easier technically. Technically, you control all the clients — you can grep every codebase for the deprecated endpoint and submit PRs. Socially, teams resist migration because it is “not on their roadmap.” Effective internal deprecation requires executive sponsorship (the deprecation is a company priority, not just the API team’s preference), migration tooling (provide a codemod or migration script, not just documentation), and a hard deadline enforced by the platform (after date X, the API gateway returns 410 — no exceptions, no extensions).
Cost of eternal backward compatibility: Stripe maintains hundreds of API versions simultaneously through a transformation chain. This is viable when API revenue justifies the engineering cost (~$14B revenue). For most companies, supporting 3-5 active versions is the practical ceiling. Each additional version costs roughly 2-4 engineering weeks per year in maintenance, testing, and incident response. Sunset aggressively — the cost of not sunsetting is not just code maintenance, it is the constraint on your ability to evolve the data model underneath.
Compatibility Under Rolling Deploys
During a rolling deployment, old and new versions of your service run simultaneously for minutes (or hours in large fleets). API responses from old instances and new instances must be compatible with clients calling either. The dual-compatibility window: When you deploy version N+1, the fleet runs a mix of N and N+1 for the duration of the rollout. Any client request might be routed to either version. If N+1 returns a new field that N does not, clients see the field appear and disappear randomly depending on which instance they hit. This causes intermittent test failures, flaky UIs, and hard-to-diagnose production bugs. Rules for safe rolling deploys:- New fields must be present in all instances before any client code references them — deploy the API change first, then the client change.
- Removed fields must be absent from all client code before any instance stops returning them — deploy the client change first, then the API change.
- Database schema changes must be compatible with both N and N+1 application code simultaneously (see the “Breaking Change Traps — DB Schema Edition” section below).
- If a deploy takes 30 minutes to roll across 100 instances, any endpoint called during that window must return a response that satisfies clients expecting either version.
| Model | How It Works | Pros | Cons |
|---|---|---|---|
| Provider-driven | The team owning the service defines the contract. Consumers adapt. | Fast iteration for the provider, clear ownership | Consumers discover breaks at integration time, not design time |
| Consumer-driven (Pact) | Each consumer publishes expectations. Provider runs consumer tests before deploying. | Consumers are protected from unexpected breaks | Slow provider evolution — every change needs consumer approval. Does not scale past ~10 consumers |
| Schema registry (shared artifact) | A versioned schema (OpenAPI, protobuf, AsyncAPI) lives in a shared repo. CI checks breaking changes on every PR. | Single source of truth, automated compatibility checks | Governance overhead, schema repo becomes a bottleneck |
buf breaking for protobuf, openapi-diff for REST, or Pact for the 2-3 most critical consumer relationships. Full consumer-driven contracts are expensive to maintain and create coupling that slows both sides.
15.5 GraphQL
A query language for APIs where the client specifies exactly what data it needs. A single endpoint (/graphql) serves all queries. The client sends a query describing the shape of the response.
Benefits: No over-fetching (client gets only requested fields). No under-fetching (client gets all needed data in one request, unlike REST which may require multiple endpoints). Strongly typed schema acts as documentation. Great for mobile clients where bandwidth matters.
Trade-offs: Caching is harder (every request is a POST with a unique body — HTTP caching does not work). Rate limiting is harder (one query can be trivial or extremely expensive). N+1 problems move to the server (resolving nested fields can trigger many database calls without careful dataloader implementation). Security: clients can craft expensive nested queries — implement query depth limits and complexity analysis.
When to use: Multiple client types needing different data shapes (web shows full details, mobile shows summary). Rapidly evolving frontend requirements. When the relationship between data is graph-like. When NOT to use: Simple CRUD APIs, file uploads, real-time streaming (use subscriptions or SSE), when your team is small and REST is working fine.
15.6 Rate Limiting
Control how many requests a client can make in a given time window. Protects against abuse, ensures fair resource distribution, and prevents cascade failures. Algorithms: Fixed window (simple, allows bursts at window boundaries). Sliding window (smoother, more complex). Token bucket (allows controlled bursts up to a limit, then steady rate). Leaky bucket (smoothest output rate, constant processing).Rate Limiting Algorithms — Deep Comparison
| Algorithm | How It Works | Burst Handling | Memory | Precision | Best For |
|---|---|---|---|---|---|
| Fixed window | Counter resets every N seconds | Allows 2x burst at window boundary (end of one + start of next) | Very low (1 counter per key) | Low | Simple internal APIs where boundary bursts are acceptable |
| Sliding window log | Stores timestamp of each request, counts within rolling window | No boundary burst | High (stores every timestamp) | Exact | Low-volume APIs where precision matters more than memory |
| Sliding window counter | Weighted average of current and previous window | Minimal boundary burst | Low (2 counters per key) | Approximate | Most production APIs — good balance of accuracy and memory |
| Token bucket | Bucket fills at steady rate, each request takes a token, bucket has max capacity | Allows controlled bursts up to bucket size, then steady rate | Low (token count + last refill time) | Good | APIs where short bursts are acceptable (Stripe, AWS) |
| Leaky bucket | Requests enter a queue processed at constant rate | No bursts — excess requests are queued or dropped | Moderate (queue) | Smoothest output | Traffic shaping, when downstream systems need constant rate |
- Token bucket is best when you want to allow short bursts (a mobile app that syncs 20 items at once, then goes idle). Set bucket size = max burst, refill rate = sustained limit.
- Sliding window counter is best when you want strictly even distribution and bursts are harmful (payment processing API).
429 Too Many Requests with Retry-After header.
Multi-Tenant Rate Limiting and Fairness
In a multi-tenant SaaS API, “one tenant per rate limit” is not enough. You need a fairness model that prevents a single large tenant from degrading the experience for everyone else, while still allowing large tenants to use the capacity they are paying for. Tiered rate limiting with burst and sustained limits:| Tier | Sustained Rate | Burst Allowance | Monthly Cost (typical) | Overage Handling |
|---|---|---|---|---|
| Free | 10 req/s | 20 req/s for 10s | $0 | Hard reject at burst limit |
| Growth | 100 req/s | 200 req/s for 30s | $99-499/mo | Soft throttle with Retry-After |
| Enterprise | 1,000 req/s | 2,000 req/s for 60s | $2,000-10,000/mo | Graceful degradation, priority queue |
| Platform | Custom | Custom | Custom contract | Dedicated capacity, SLA-backed |
- Weighted fair queuing. Assign each tenant a weight proportional to their tier. During resource contention, the request scheduler serves higher-tier tenants proportionally more, but never starves lower tiers. AWS API Gateway supports this through usage plans.
- Admission control. When backend utilization exceeds 80%, start rejecting requests from tenants above their sustained rate (even if below burst). This preserves capacity for tenants within their sustained limit.
- Per-tenant connection pools. Dedicate a slice of database connections to each tier. Enterprise gets 40% of the pool, Growth gets 40%, Free gets 20%. No tier can exhaust connections for another.
- Cost attribution. Track per-tenant infrastructure cost (compute seconds, database IOPS, bandwidth). Tenants whose usage consistently exceeds their tier’s cost profile are candidates for upsell — this is how API-first companies like Twilio and Stripe align pricing with resource consumption.
Your API serves 10K RPS. A partner wants to integrate but their system can only handle 100 RPS callbacks. Design the integration.
Your API serves 10K RPS. A partner wants to integrate but their system can only handle 100 RPS callbacks. Design the integration.
- Decouple with an outbound queue. Instead of calling the partner synchronously as events occur, publish events to a message queue (SQS, Kafka, or Redis Streams) dedicated to this partner.
- Rate-limited consumer. A dedicated worker consumes from the queue at the partner’s rate — 100 RPS. Use a token bucket or leaky bucket rate limiter on the consumer side. The queue absorbs the burst; the consumer drains it steadily.
- Batching. If the partner supports batch callbacks, aggregate events into batches of 50-100 per request. This delivers 5,000-10,000 events per second within their 100 RPS limit. This is the highest-leverage optimization.
- Backpressure and overflow. Set a queue depth limit. If the partner falls behind (their system is down), the queue grows. Add dead-letter queue (DLQ) for events that fail after N retries. Alert when queue depth exceeds a threshold — this means the partner cannot keep up even at steady state.
- Webhook with retry and exponential backoff. For the actual callback, use exponential backoff with jitter (1s, 2s, 4s, …, max 5 minutes). Include an
Idempotency-Keyheader so the partner can safely deduplicate retries. - Pull alternative. Offer the partner a poll-based API (
GET /events?since=cursor) as an alternative to push. This lets them control their own consumption rate. Many partners prefer this because it puts them in control of backpressure.
X-Signature: sha256=abc... where the value is HMAC-SHA256(shared_secret, request_body). The partner computes the same HMAC and compares. Rotate the shared secret periodically, and support two active secrets during rotation. Include a timestamp in the signed payload with a short acceptance window (5 minutes) to prevent replay attacks.Q: If your queue grows to 10 million messages, what does that tell you?
A: Either the partner has been down for a sustained period, or your consumer is genuinely unable to keep up with the partner’s 100 RPS limit. Check the consumer’s metrics: is it throttled at 100 RPS but the producer is generating 200 RPS? Then batching is the fix. Is the consumer sitting idle because the partner is returning 5xx? Then the partner is the blocker and you should open a ticket. The queue depth is a symptom; the diagnosis is upstream or downstream.- Shopify Engineering — “How we designed Shopify’s webhook infrastructure”
- Stripe Docs — Best practices for using webhooks
- RFC 8131 — “Authentication-Scheme Name Registrations” (for webhook signature conventions)
Design the API for a file upload service that handles files up to 5GB. Consider resume, progress, and failure.
Design the API for a file upload service that handles files up to 5GB. Consider resume, progress, and failure.
POST with a 5GB body is a non-starter — any network interruption means restarting from zero, most proxies and load balancers have body size limits (often 1-10MB), and you cannot report meaningful progress.Design: Chunked Resumable Upload ProtocolStep 1: Initiate the upload.upload_id that the client uses for all subsequent requests.Step 2: Upload chunks.- Chunk size: 5MB is a good default — small enough that a failed chunk is cheap to retry, large enough that 1024 chunks is manageable. Let the server decide (clients with slow connections could request smaller chunks).
- Idempotency: Uploading the same chunk index twice is idempotent (server overwrites or ignores if checksum matches). This makes retries safe.
- Progress: The client knows
received_chunks / total_chunks. For real-time progress, the client can calculate bytes uploaded per second from the chunk upload times. - Expiry: Incomplete uploads expire after 24 hours. The client can extend with a
PATCH /uploads/{id}to refresh the TTL. - Parallel uploads: Chunks are independent, so the client can upload 3-5 chunks in parallel for better throughput on high-bandwidth connections.
CreateMultipartUpload / UploadPart / CompleteMultipartUpload API is the canonical example. Say “multipart upload with 5 MB parts and parallel chunk upload” — the numeric specificity tells the interviewer you have shipped this.sha256(file) client-side, both try to upload to /files/<hash>. The server checks if the hash already exists — if yes, skip the upload and return the existing URL; if no, store it. This gives you dedup for free. Dropbox, Git, and IPFS all use this model.Q: A 5 GB upload is 95% complete and the client loses power. What happens on retry?
A: The client restarts and calls GET /uploads/:id/status, which returns the list of received chunks. The client uploads only the missing chunks. No re-uploading the 4.75 GB that already arrived. This is the entire point of chunked uploads — partial progress survives interruption. TTL on the upload session (24 hours) bounds how long the server holds partial state.Q: How do you prevent a malicious client from uploading 10,000 bogus 5 GB uploads and filling your storage?
A: Three defenses: (1) require authentication on POST /uploads — no anonymous uploads. (2) Enforce per-user storage quotas enforced at POST /uploads time (reject if user is at 90% of quota). (3) Expire incomplete uploads after 24 hours (the TTL) — a bogus upload that never finalizes is garbage-collected. For very large files (>1 GB), require payment authorization up front or rate-limit the number of concurrent upload sessions per user.- tus.io — Open protocol for resumable file uploads
- AWS S3 Docs — Uploading and copying objects using multipart upload
- “Designing Resumable File Upload APIs” — Dropbox engineering blog
15.7 REST vs gRPC vs GraphQL — When to Use Each
| Criteria | REST | gRPC | GraphQL |
|---|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/2 (always) | HTTP (typically POST) |
| Format | JSON (text) | Protocol Buffers (binary) | JSON (text) |
| Type safety | OpenAPI/Swagger (optional) | Strong (code-generated from .proto) | Strong (schema-defined) |
| Streaming | SSE or WebSocket (separate) | Native bidirectional streaming | Subscriptions (over WebSocket) |
| Browser support | Native | Requires gRPC-Web proxy | Native |
| Caching | HTTP caching works natively | No HTTP caching (binary POST) | Hard (every request is unique POST) |
| Payload size | Larger (JSON text + field names) | 3-10x smaller (binary, no field names) | Variable (client picks fields) |
| Best for | Public APIs, browser clients | Internal service-to-service, high-throughput | Multiple client types needing different data |
| Worst for | Mobile (over-fetching) | Browser clients, public APIs | Simple CRUD, real-time streaming |
| Versioning | URL-based (/v1/), easy | Proto file evolution, careful | Schema evolution, deprecation directives |
Decision Matrix with Concrete Scenarios
| Scenario | Choose | Why |
|---|---|---|
| Public API for third-party developers | REST | Universal tooling, discoverable, HTTP caching, curl-friendly |
| Mobile app + web app consuming same backend, different data needs | GraphQL | Each client fetches exactly the fields it needs, one endpoint |
| Microservice A calls microservice B 10,000 times/second | gRPC | Binary serialization, HTTP/2 multiplexing, code-generated stubs, streaming |
| Internal service with complex data relationships (social graph) | GraphQL | Natural fit for graph-like data traversal |
| Real-time bidirectional communication (chat, collab editing) | gRPC streaming | Native bidi streaming over HTTP/2, no WebSocket needed |
| Simple CRUD API, small team, getting started | REST | Lowest complexity, best documentation, largest talent pool |
| Backend-for-frontend (BFF) aggregating multiple services | GraphQL | Single query resolves data from multiple sources |
| IoT device sending telemetry at very high volume | gRPC | Minimal payload size, efficient binary serialization |
15.8 Contract Testing
Consumer defines expectations, provider verifies it satisfies them. Catches breaking changes before deployment.15.9 Shadow Traffic, Canary Validation, and Safe Rollout
Deploying API changes to production without testing against real traffic patterns is like testing a bridge with toy cars. Shadow traffic and canary validation let you test with real load before real users are affected. Shadow traffic (dark launching): Duplicate a percentage of production traffic to the new API version. The shadow receives real requests but its responses are discarded — only the current version’s response goes to the client. Compare shadow responses against production responses for correctness (field-level diff), latency (P50, P95, P99 comparison), and error rate. Tools: Envoy’s mirroring filter, nginxmirror directive, Istio’s traffic mirroring, or application-level fork (publish request to a queue consumed by the shadow).
When shadow traffic catches what tests miss:
- A database query that performs fine on test data but times out on production data volumes.
- A response field that returns
nullon 0.3% of real records due to a legacy data edge case your test fixtures do not cover. - Serialization overhead that adds 50ms at the 99th percentile on real payload sizes but is invisible on small test payloads.
- Stateless changes (new logic, bug fixes): Roll back the deployment. Previous version handles traffic. Straightforward.
- Stateful changes (new database columns, new cache keys): Rolling back the deployment is not enough if the new version wrote data in a new format. The old version must be able to read or ignore the new data. This is why the expand-and-contract pattern is non-negotiable — the “expand” phase must be complete and backward-compatible before any behavior change deploys.
- Schema-coupled changes: If the API change depends on a database migration that has already run, rolling back the API requires either (a) the migration being backward-compatible (expand phase only) or (b) running a reverse migration, which is risky under production load. Always decouple schema migrations from application deploys — migrate the schema first, deploy the app change later.
full_name derived from first_name + last_name), the backfill must complete before clients start depending on the field. A half-backfilled table means some API responses return the field and some do not — this breaks client assumptions about field presence. Run backfills as idempotent batch jobs with progress tracking. Only flip the API to return the new field after 100% of rows are backfilled and verified.
- Caching — API responses are prime caching candidates. Use HTTP cache headers (
Cache-Control,ETag,Last-Modified) for GET endpoints, and application-level caching (Redis) for expensive computations behind endpoints. A well-cached API can serve 100x the traffic without scaling the backend. See the Caching chapter for cache invalidation patterns and HTTP caching strategies. - Performance and Reliability — Rate limiting (covered above) is only one layer of API protection. Circuit breakers protect your API from slow downstream dependencies. Timeouts prevent a single slow request from consuming a connection indefinitely. See the Performance chapter for the full resilience stack: timeouts, retries, circuit breakers, and bulkheads.
- Messaging — For APIs that trigger long-running operations (video processing, report generation, bulk imports), return
202 Acceptedimmediately and publish the work to a message queue. The client polls a status endpoint or receives a webhook callback when complete. This pattern prevents HTTP timeouts and lets you scale processing independently from request handling. See the Messaging chapter for async job patterns and webhook delivery guarantees. - GraphQL at Scale — If you are evaluating GraphQL as an alternative to REST, that chapter covers the production reality: federation for multi-team schemas, persisted queries for performance, calculated query costs for rate limiting, and the organizational challenges (schema governance, breaking change detection) that only emerge at scale.
- API Gateways & Service Mesh — Many of the concerns discussed in this chapter — rate limiting, authentication, versioning, TLS — are best handled at the gateway layer rather than inside each service. That chapter covers how Kong, Envoy, and Istio handle these cross-cutting concerns, plus service mesh patterns for internal service-to-service communication.
Part IX — Databases and Data Systems
The database is where most systems spend the most time, have the most bugs, and face the hardest scaling challenges. The reason databases are hard is not the SQL — it is the trade-offs: consistency vs performance (isolation levels), read speed vs write speed (indexes), flexibility vs integrity (schema), and single-node simplicity vs distributed resilience (replication/sharding). Every database decision is a trade-off decision.ALTER TABLE locks the table for hours or days — completely unacceptable for a service that developers depend on 24/7. GitHub built gh-ost (GitHub Online Schema Transmogrifier), an open-source tool that performs schema migrations without locking the original table. How it works: gh-ost creates a “ghost” table with the new schema, then uses MySQL’s binary log (binlog) stream to copy rows from the original table to the ghost table while simultaneously capturing and replaying any new writes. When the ghost table is caught up, it performs an atomic table swap. The entire process runs with minimal impact on production traffic because it is throttle-aware — it monitors replication lag and slows down or pauses the migration if the database is under stress. gh-ost replaced the previous approach (pt-online-schema-change from Percona) because it does not use triggers, which were a source of unpredictable load spikes. GitHub now runs thousands of schema migrations per year across their fleet with zero downtime. The lesson: at scale, schema migration is not a DBA task — it is a distributed systems problem that requires dedicated tooling.Chapter 16: Relational Foundations
16.1 Normalization and Denormalization
Normalization: Eliminate redundancy — each fact stored exactly once. A customer’s address is in thecustomers table, not duplicated in every orders row. Benefits: no update anomalies (change address once, reflected everywhere), smaller storage, data integrity. Cost: queries need JOINs to reassemble related data.
Denormalization: Introduce controlled redundancy for read performance. Store the customer’s name directly in the orders table so displaying an order does not require a JOIN. Benefits: faster reads (no JOINs), simpler queries. Cost: update anomalies (customer changes name -> must update it in every order row), more storage, risk of inconsistency.
When to denormalize: When reads vastly outnumber writes (product catalog read 1000x per write). When JOIN performance is unacceptable (joining 5 tables for a dashboard query). When you are building a read-optimized model (CQRS read side, search index, analytics). When NOT to: When write consistency is critical (financial data). When the data changes frequently (denormalized copies go stale). As a first resort (optimize queries and indexes first).
16.2 Indexing In Depth
Index Types — When Each Shines
| Index Type | Data Structure | Supports | Best For | Avoid When |
|---|---|---|---|---|
| B-tree | Balanced tree | Equality, range (>, <, BETWEEN), sorting, prefix LIKE 'foo%' | Default choice for almost everything — primary keys, foreign keys, most WHERE clauses | You only ever do exact-match lookups at extreme scale (hash might edge it out) |
| Hash | Hash table | Equality only (=) | Exact-match lookups in theory (O(1) vs B-tree O(log n)) | Range queries, sorting, prefix matching. In PostgreSQL, hash indexes were not crash-safe before v10 and remain uncommon. B-tree handles equality efficiently enough that hash rarely justifies losing range support |
| Composite | B-tree (multi-column) | Queries matching the leftmost prefix of columns | Multi-column WHERE clauses, covering queries | Column order does not match query patterns (leftmost prefix rule) |
| Partial | B-tree (filtered) | Queries matching the WHERE condition of the index | Indexing a subset — e.g., WHERE status = 'active' when 95% of rows are inactive | The filtered subset is most of the table (no space savings) |
| Covering | B-tree with INCLUDE columns | Index-only scans (database never touches the heap) | Read-heavy queries where all selected columns fit in the index | Wide rows or frequently updated INCLUDE columns |
| GIN (Generalized Inverted Index) | Inverted index | Array containment, full-text search, JSONB key/value queries | PostgreSQL full-text search (to_tsvector), JSONB @> operators, array && overlap | Simple scalar equality — GIN has higher write cost |
| GiST (Generalized Search Tree) | Balanced tree (extensible) | Range overlap, containment, nearest-neighbor | PostGIS geospatial queries (ST_Within, ST_DWithin), range types, ltree hierarchies | Standard equality/range on scalar types (B-tree is faster) |
WHERE UPPER(name) = 'JOHN' does not use an index on name — use a functional index instead).
16.3 Transactions and Isolation Levels
ACID: Atomicity, Consistency, Isolation, Durability.Isolation Levels, Anomalies, and When to Use Each
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew | Performance | Default In |
|---|---|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Possible | Fastest | Almost never used |
| Read Committed | Prevented | Possible | Possible | Possible | Fast | PostgreSQL, Oracle, SQL Server |
| Repeatable Read | Prevented | Prevented | Possible (SQL standard) / Prevented (PostgreSQL) | Possible | Moderate | MySQL InnoDB |
| Serializable | Prevented | Prevented | Prevented | Prevented | Slowest | None by default |
- Dirty read: Transaction A reads data written by Transaction B before B commits. If B rolls back, A used data that never existed. Prevented at Read Committed and above.
- Non-repeatable read: Transaction A reads a row, Transaction B updates and commits that row, Transaction A reads again and gets a different value. Within a single transaction, the same query returns different results. Prevented at Repeatable Read and above.
- Phantom read: Transaction A queries rows matching a condition (
WHERE price < 100), Transaction B inserts a new row matching that condition and commits, Transaction A re-queries and gets an extra row that was not there before. Prevented at Serializable (and at PostgreSQL’s Repeatable Read due to snapshot isolation). - Write skew: Two transactions each read a value, make a decision based on it, and write — but their combined writes violate a constraint that neither individual write violates. Example: two doctors both check that at least one doctor is on-call, each decides to go off-call, both commit — now zero doctors are on-call. Only Serializable prevents this.
- Read Committed: Use for typical web requests — showing a product page, listing orders, updating a user profile. Slight staleness within a transaction is fine because each statement sees the latest committed data.
- Repeatable Read: Use for reports that run multiple queries and need consistent numbers — “total revenue this month” should not change between the SELECT for orders and the SELECT for refunds within the same report transaction.
- Serializable: Use for financial operations where correctness is critical — transferring money between two accounts (both balances must be read and written atomically), inventory reservation (two users trying to buy the last item must not both succeed).
16.4 Locking
Pessimistic locking:SELECT ... FOR UPDATE acquires a lock on the row. Other transactions wait. Use when contention is high and conflicts are expected.
Optimistic locking: Include a version column. On update, check WHERE version = expected_version. If no row is updated, someone else changed it — retry or fail. Use when contention is low.
Advisory locking: Application-level locks using the database as a coordination point. PostgreSQL provides pg_advisory_lock(key). Use for distributed locking when you want to coordinate across application instances — e.g., ensuring only one instance runs a scheduled job.
16.5 CAP Theorem and Distributed Data
CAP theorem: Consistency, Availability, Partition Tolerance. The common “pick two” framing is misleading — network partitions are not optional, they will happen. The real choice is: during a partition, do you sacrifice consistency or availability? In normal operation (no partition), you can have both.PACELC — The More Practical Model
The PACELC model (Daniel Abadi) extends CAP to capture what matters during normal operation: during a Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency. This captures the fact that even without partitions, there is a trade-off between response speed and consistency (synchronous replication is consistent but slower).| System | During Partition (PAC) | Normal Operation (ELC) | Classification |
|---|---|---|---|
| PostgreSQL (sync replication) | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
| Cassandra | Sacrifice Consistency (AP) | Sacrifice Consistency for Latency (EL) | PA/EL |
| DynamoDB (eventual) | Sacrifice Consistency (AP) | Sacrifice Consistency for Latency (EL) | PA/EL |
| DynamoDB (strong reads) | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
| MongoDB (majority writes) | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
| CockroachDB | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
16.6 Database Replication
Synchronous vs Asynchronous vs Semi-Synchronous
| Replication Mode | How It Works | Data Loss Risk | Write Latency | Availability Impact |
|---|---|---|---|---|
| Synchronous | Write confirmed only after replica acknowledges | Zero (for acknowledged replicas) | High (network round-trip to replica) | Replica outage blocks writes |
| Asynchronous | Write confirmed after primary writes locally; replica catches up later | Possible (primary fails before replica receives write) | Lowest | Replica outage has no impact on writes |
| Semi-synchronous | Write confirmed after at least one replica acknowledges; remaining replicas are async | Minimal (at least one replica has the data) | Moderate (one round-trip, not all) | Tolerates all-but-one replica failures |
Your PostgreSQL database is getting slow under read load. Walk me through your approach to adding read replicas.
Your PostgreSQL database is getting slow under read load. Walk me through your approach to adding read replicas.
pg_stat_activity — if you see 80%+ of active connections running SELECT queries and the primary CPU is pegged at 90%+, read replicas will help. If you see lock waits and slow writes, replicas will not.Then set up asynchronous replication to one read replica. On a typical cloud instance (e.g., RDS db.r6g.xlarge with 4 vCPUs), a single PostgreSQL node handles roughly 5,000-10,000 simple read queries per second. Adding one read replica doubles that read capacity to ~10,000-20,000 QPS. Route read-only queries from the application to the replica — this can be done at the ORM level (Rails connected_to, Django using), at the connection pool level, or via a proxy like PgBouncer with read/write splitting. Monitor replication lag — in normal operation on the same availability zone, expect 1-10 ms lag; cross-region replicas may see 50-200 ms lag depending on distance. Keep all writes on the primary.Follow-up: A user creates a post and immediately sees their feed, but the post is not there. Users are complaining. What happened?
Follow-up: A user creates a post and immediately sees their feed, but the post is not there. Users are complaining. What happened?
?after_write=true parameter that forces a primary read. Option 2 is best UX. Option 1 is the safest server-side fix.Real-World Example. LinkedIn published a 2019 engineering post about exactly this issue on their feed service — users posted an update, refreshed, and saw an empty feed. They adopted a “causal consistency token” pattern: the write returns an LSN, the client sends it on the next read, and the replica waits up to 500ms for that LSN before responding (falling back to primary if slower). This eliminated the “my post disappeared” complaint class without forcing primary reads globally.hot_standby_feedback trade-off). (3) Cross-region network saturation — check whether 100% of the WAL bandwidth is being consumed. The diagnosis path is pg_stat_replication.replay_lag + pg_stat_activity for long queries on the replica.- PostgreSQL Docs — Hot Standby and Replication Lag
- LinkedIn Engineering Blog — “Consistency as a Service: Auditing Cloud Consistency”
- AWS RDS Docs — Replication lag metrics and alerting
16.7 Backup and Recovery
The principle: Backups are worthless. Restores are priceless. If you have never tested restoring from your backups, you do not have backups — you have hope. Backup types: Full backup (complete copy of all data — slow, large, but simple to restore). Incremental backup (only changes since last backup — fast, small, but restore requires the full backup plus all increments). Differential backup (changes since last full backup — middle ground). Point-in-Time Recovery (PITR): Restore the database to any specific moment. PostgreSQL achieves this through WAL (Write-Ahead Log) archiving — the base backup plus archived WAL segments let you replay transactions up to any timestamp. MySQL uses binary logs (binlogs) similarly. Cloud-managed databases (RDS, Cloud SQL) provide automated PITR with configurable retention (typically 1-35 days). Backup testing: Schedule regular restore drills (monthly minimum). Measure restore time — does it meet your RTO? Verify data integrity after restore. Automate the verification (restore to a test instance, run validation queries, compare row counts). If you only discover your backup is corrupted during a real disaster, you have failed. Cloud-managed backup: RDS automated backups (daily snapshots + continuous WAL archiving). Cloud SQL automated backups. Azure SQL geo-redundant backups. These are good defaults but understand their RPO (usually 5 minutes for PITR) and retention limits.16.8 Schema Versioning and Migrations
Use the expand and contract pattern: add new column, deploy app writing to both, backfill, deploy app reading from new column, remove old column. Never rename a column in one step. Always test migrations on production-sized data. Include rollback scripts.Online Schema Evolution — What Actually Happens at Scale
On a 500M-row table serving 5,000 QPS,ALTER TABLE ADD COLUMN is not a command — it is a coordinated operation with production-safety implications. Different databases handle this differently, and the differences matter.
PostgreSQL (13+): ALTER TABLE ADD COLUMN with a non-volatile default is metadata-only — it completes in milliseconds regardless of table size. PostgreSQL stores the default in the catalog and returns it on read without rewriting existing rows. This is one of PostgreSQL’s best features for online schema evolution. However: ALTER TABLE ADD COLUMN ... NOT NULL DEFAULT now() (volatile default) does rewrite the table in older versions. Always use a non-volatile default for online adds.
MySQL InnoDB (8.0+): Most ALTER TABLE operations support “instant” DDL (ALGORITHM=INSTANT) for column additions at the end of the table. Adding a column in the middle, changing a column type, or adding an index requires either ALGORITHM=INPLACE (online, allows concurrent DML) or ALGORITHM=COPY (blocks writes). The ALGORITHM hint controls behavior — always specify it explicitly in production migrations to prevent surprises.
Online schema migration tools comparison:
| Tool | Database | Mechanism | Replication-Aware | Throttling | Used By |
|---|---|---|---|---|---|
| gh-ost | MySQL | Binlog stream to ghost table, atomic rename | Yes (pauses if replica lag exceeds threshold) | Adaptive based on replication lag | GitHub, Shopify |
| pt-online-schema-change | MySQL | Triggers on original table copy to new table | Limited | Manual or lag-based | Percona users, many MySQL shops |
| pg_repack | PostgreSQL | Trigger-based table rewrite without locks | No | Manual | PostgreSQL shops needing VACUUM FULL without locks |
| Reshape | PostgreSQL | Dual-write to old and new schema simultaneously | No | None | Newer PostgreSQL projects |
| LHM (Large Hadron Migrator) | MySQL | Chunked copy with triggers | Yes | Configurable | Shopify (legacy, replaced by gh-ost) |
| Vitess | MySQL | Distributed DDL across shards via VReplication | Yes (built-in) | Shard-aware | Slack, PlanetScale, HubSpot |
| Migration Type | Rollback Approach | Risk Level | Time to Rollback |
|---|---|---|---|
| Add column | Drop the column (metadata-only in PG, fast) | Low | Seconds |
| Add index | Drop the index (DROP INDEX CONCURRENTLY in PG) | Low | Seconds to minutes |
| Rename column | Add the old name back as an alias or reverse the rename | Medium — any code deployed referencing new name breaks | Minutes |
| Change column type | Add old-type column back, backfill, swap | High — data loss possible if narrowing type | Hours |
| Drop column | Restore from backup or WAL replay to recover data | Critical — data is gone | Hours to impossible |
| Drop table | Restore from backup | Critical | Hours |
Backfill Operations — The Hidden Landmine
A backfill is a bulk data update that populates a new column, transforms existing data, or computes derived values. Backfills look trivial in development (UPDATE orders SET new_col = old_col * 1.1) and become production incidents at scale.
Why backfills are dangerous:
-
Transaction duration. A single
UPDATE ... SET ... WHERE ...on 100M rows opens one transaction that runs for hours. During that time: VACUUM cannot clean dead tuples behind it (PostgreSQL), replication lag spikes (the replica must replay the entire transaction atomically), and any crash requires replaying the entire transaction from the WAL. - Lock contention. While the backfill holds row-level locks on millions of rows, application writes to those rows queue behind the locks. If the backfill touches a hot table (orders, users), write latency spikes for the duration.
- Checkpoint storms. A massive UPDATE dirties millions of pages in the buffer cache. When a checkpoint fires, all dirty pages are flushed to disk simultaneously, causing I/O saturation for 30-60 seconds.
full_name derived from first_name + last_name), the deploy sequence matters:
- Deploy 1: Add the column (NULLable, no default). No API change — the field is not returned yet.
- Deploy 2: Application writes to the new column on all new INSERTs and UPDATEs. Existing rows still have NULL.
- Backfill: Run the batched backfill. Monitor progress (
SELECT COUNT(*) WHERE new_col IS NULLvs total). - Deploy 3: Only after 100% of rows are backfilled, update the API to return the new field. If the API returns the field while rows are still NULL, clients see inconsistent data — some responses have the field, some do not. This breaks client assumptions about field presence.
- Deploy 4 (optional): Add a NOT NULL constraint after verifying zero NULLs remain.
Change Data Capture (CDC) in Practice
CDC captures row-level changes from the database’s transaction log (WAL in PostgreSQL, binlog in MySQL) and streams them to consumers. This is the foundation for event-driven architectures, real-time data pipelines, cache invalidation, and search index synchronization. CDC tools and approaches:| Tool | Source DB | Output Targets | Exactly-Once | Operational Complexity |
|---|---|---|---|---|
| Debezium | PostgreSQL, MySQL, MongoDB, SQL Server, Oracle | Kafka, Pulsar, Kinesis, Redis Streams | At-least-once (consumers must be idempotent) | Medium (runs on Kafka Connect) |
| AWS DMS | Most relational DBs | Kinesis, Kafka, S3, target RDS | At-least-once | Low (managed service) |
| PostgreSQL logical replication | PostgreSQL | PostgreSQL (subscriber) | At-least-once | Low (built-in) |
| Maxwell | MySQL (binlog) | Kafka, Kinesis, RabbitMQ, stdout | At-least-once | Low |
| SpinalTap (Airbnb) | MySQL | Kafka | At-least-once | Medium |
orders, order_items, customers, and products. If the CDC event for order_items arrives 200ms after orders, the search index briefly contains an order with no items. Consumers querying during that window see inconsistent data.
Mitigation patterns:
- Debounce and aggregate. Buffer CDC events for 1-5 seconds before updating the read store. Group all events for the same entity (order ID) and apply them together.
- Version vector. Include a monotonic version in the read store. Only update if the incoming version is higher. Reprocess out-of-order events.
- Outbox pattern. Instead of relying on CDC from business tables, write a single outbox event per business operation with the complete, pre-assembled payload. CDC on the outbox table produces one consistent event per operation.
Replica Lag — Realities and Measurement
Replication lag is not a number — it is a distribution. “Average lag is 50ms” means nothing if P99 lag is 30 seconds during write spikes. What actually causes replica lag:- Write spikes. A bulk import of 1M rows creates a WAL burst. The replica’s single-threaded apply process falls behind. PostgreSQL 15+ supports parallel replay but only for separate tables.
- Long-running transactions on the replica. A reporting query holding a snapshot on the replica prevents WAL replay from advancing past that snapshot’s horizon (PostgreSQL’s
hot_standby_feedbackandmax_standby_streaming_delaycontrol this trade-off). - Network saturation. Cross-region replicas compete with application traffic for bandwidth. A 1Gbps link carrying 100MB/s of WAL and 100MB/s of application traffic has no headroom for bursts.
- Disk I/O on the replica. The replica writes WAL and applies it to data files. If the replica’s disk is slower than the primary’s (common in cost-optimized setups), it falls behind during sustained write periods.
- PostgreSQL:
SELECT now() - pg_last_xact_replay_timestamp() AS lagon the replica. But this only updates when WAL is applied — if no writes happen for 5 minutes, the lag appears to grow even though the replica is caught up. Better: usepg_stat_replication.replay_lagon the primary (available in PG 10+). - MySQL:
SHOW SLAVE STATUSreportsSeconds_Behind_Master. This metric is notoriously unreliable — it measures the timestamp difference of the last applied event, not the actual queue depth. Usept-heartbeat(Percona) for accurate lag measurement: it writes a timestamp to the primary every second and reads it from the replica. - Alert threshold: Alert at P95 lag > 1 second for same-region replicas, > 5 seconds for cross-region. Investigate immediately at > 30 seconds — the replica may be falling permanently behind.
| Pattern | How It Works | Overhead | Best For |
|---|---|---|---|
| Session-sticky routing | After a write, route all reads from the same session to the primary for N seconds | Low (routing logic only) | Web apps with session affinity |
| Causal consistency token | Write returns a LSN/GTID token. Client sends it on next read. Replica waits until it has applied that position before responding. | Medium (replica may block briefly) | Mobile apps, APIs where session affinity is hard |
| Primary-only for critical reads | Route specific endpoints (balance check, order status after creation) to the primary always | Low | Financial, order management |
| Optimistic UI | Return the written data in the write response. Client renders it immediately without a subsequent read. | Zero server overhead | Modern SPAs, mobile apps |
16.9 Database Connection Management in Production
Connection management is the database topic most likely to wake you up at 3 AM. It is not glamorous, it rarely comes up in tutorials, and it is the single most common cause of production database outages that have nothing to do with queries or data. Understanding connection pooling, limits, and serverless connection challenges is non-negotiable for production engineering.Why Connections Are a Scarce Resource
Every database connection is expensive. A PostgreSQL connection consumes approximately 5-10 MB of server memory (the backend process, work_mem allocations, shared buffer mappings). MySQL’s per-connection overhead is lower (~1-3 MB) but still significant at scale. A database server with 16 GB of RAM can comfortably handle roughly 200-500 connections before memory pressure degrades query performance. PostgreSQL’s defaultmax_connections is just 100.
The math gets bad quickly: if you have 20 application servers, each running 4 worker processes, each holding 5 connections — that is 400 connections. Add a few cron jobs, migration scripts, monitoring tools, and admin sessions, and you are at 450+ connections. Your database is now spending more resources managing connections than executing queries.
Connection Pooling Strategies
| Pooling Strategy | How It Works | Connections to DB | Best For | Trade-offs |
|---|---|---|---|---|
| Application-level pooling | Each app instance maintains a local pool (HikariCP, SQLAlchemy pool, pg pool in Node.js) | app_instances x pool_size | Monoliths, small deployments | Total connections = instances x pool_size; scales poorly as you add instances |
| External connection pooler | A proxy sits between app and DB, multiplexing many app connections onto fewer DB connections | Configurable (e.g., 50) regardless of app instances | PostgreSQL at scale (PgBouncer, PgCat, Odyssey) | Additional infrastructure; transaction-mode pooling breaks session-level features (prepared statements, SET, LISTEN/NOTIFY) |
| Database-native pooling | Built into the database or its proxy (ProxySQL for MySQL, RDS Proxy for AWS) | Configurable | MySQL at scale, serverless architectures | Adds latency (~1-2 ms); vendor-specific configuration |
| Mode | When Connection Is Returned to Pool | Supports | Breaks |
|---|---|---|---|
| Session pooling | When the client disconnects | Everything (prepared statements, SET, LISTEN) | Nothing — but minimal connection savings (one DB connection per client session) |
| Transaction pooling | When the transaction ends (COMMIT/ROLLBACK) | Most applications | Prepared statements (unless using protocol-level), session-level SET commands, LISTEN/NOTIFY, advisory locks |
| Statement pooling | After each statement | Simple queries only | Multi-statement transactions, everything in transaction mode, plus more |
Connection Pool Sizing — The HikariCP Formula
Most teams set their connection pool too large. The counterintuitive truth: a smaller pool almost always outperforms a larger one. The formula (from the HikariCP wiki, applicable to any pool):(4 * 2) + 1 = 9 connections. That is not a typo. A pool of 9-10 connections on a 4-core machine will outperform a pool of 100 connections, because:
- With 100 connections, the database juggles 100 concurrent queries, each getting 1% of CPU time, context-switching constantly, thrashing the disk scheduler.
- With 10 connections, each query gets 10% of CPU time, executes 10x faster, and the total throughput is higher because the CPU spends time executing queries instead of switching between them.
Serverless Connection Challenges
Serverless functions (AWS Lambda, Vercel Functions, Cloudflare Workers) fundamentally break traditional connection pooling because:- No persistent process. Each Lambda invocation may be a new container with no existing connection. Cold starts establish a new database connection (TCP handshake + TLS + authentication = 50-100 ms for PostgreSQL).
- Unbounded concurrency. A traffic spike can spawn 1,000 Lambda instances simultaneously, each opening a database connection. Your database goes from 10 connections to 1,000 in seconds — far beyond
max_connections. - No shared pool. Each Lambda instance has its own pool of 1 connection. There is no cross-instance connection sharing. Application-level pooling is useless when every instance is isolated.
| Solution | How It Works | Latency Added | Complexity | Best For |
|---|---|---|---|---|
| RDS Proxy | AWS-managed connection pooler that sits between Lambda and RDS/Aurora. Maintains a warm connection pool, multiplexes Lambda connections. | ~1-5 ms per query | Low (managed service) | AWS Lambda + RDS/Aurora — the default recommendation |
| PgBouncer on ECS/Fargate | Run PgBouncer as a sidecar or standalone container in your VPC. Lambda connects to PgBouncer, which maintains a fixed pool to PostgreSQL. | ~1-2 ms per query | Medium (you manage PgBouncer) | Non-AWS environments, or when RDS Proxy pricing is prohibitive |
| HTTP-based database proxy | Services like Neon, PlanetScale, or Supabase that expose a database over HTTP/WebSocket, eliminating TCP connection overhead entirely. | Variable (~5-20 ms) | Low (managed service) | Edge functions (Cloudflare Workers), prototyping, low-throughput serverless |
| Aurora Serverless v2 Data API | AWS HTTP-based API to Aurora that manages connections internally. No VPC, no connection pooling needed. | ~10-30 ms per query | Lowest | Simple serverless apps where latency tolerance is higher |
DatabaseConnectionsCurrentlySessionPinned CloudWatch metric — if it is consistently high, your application is defeating the proxy. Common fixes: avoid session-level SET commands (use connection-level defaults), use parameterized queries instead of prepared statements, and avoid temporary tables in Lambda functions. For detailed RDS Proxy configuration and cost analysis, see the Cloud Service Patterns chapter.Your Lambda function connects to PostgreSQL. Under load, you see 'too many connections' errors. Diagnose and fix.
Your Lambda function connects to PostgreSQL. Under load, you see 'too many connections' errors. Diagnose and fix.
max_connections * 0.9) and multiplexes all Lambda connections through that pool. Lambda connects to the RDS Proxy endpoint instead of the database directly. This is the AWS-recommended pattern and requires only changing the connection string.Configuration: Set RDS Proxy’s MaxConnectionsPercent to 90% of the database’s max_connections. Set ConnectionBorrowTimeout to 120 seconds — this is how long a Lambda invocation waits for a connection from the pool before failing. This is better than failing immediately because most Lambda functions hold a connection for <100 ms, so the queue drains quickly.Application-level optimization: Reuse connections across invocations within the same Lambda container. Initialize the database client outside the handler function — Lambda reuses the container for subsequent invocations, and the existing connection is reused. This reduces connection churn from “one connection per invocation” to “one connection per container lifecycle.”too many connections errors cascade into full outages. The fix was RDS Proxy plus moving the DB client initialization outside the handler — net result: from 2,000 connections down to ~50 multiplexed through the proxy, with latency overhead of 1-2ms per query. The “connection reuse” code change was four lines.DatabaseConnectionsCurrentlySessionPinned; high values mean your application is unknowingly defeating pooling.shared_buffers. The HikariCP wiki has the classic benchmark: on a 4-core server, a pool of 10 outperforms a pool of 100 because the 10 connections execute at full CPU speed while the 100 thrash. Little’s Law says pool_size = throughput × latency — anything more is wasted.Q: If RDS Proxy adds 1-2ms of latency, when is it NOT worth it?
A: When your Lambda workload has very low concurrency (single-digit invocations) and you have raised max_connections enough to accommodate it. The 1-2ms is the cost of pooling; the benefit is unbounded scaling. At low scale, connect directly; at high scale, always proxy. The crossover is usually around 20-50 concurrent Lambdas.Q: Your team considers moving off RDS to DynamoDB to avoid connection management entirely. Reasonable?
A: DynamoDB has no connections — every request is a stateless HTTP call — so the connection problem vanishes. But you trade it for access pattern rigidity: no JOINs, no ad-hoc queries, single-table design. The architectural question becomes “does our data model fit DynamoDB?” If yes and you are already struggling with RDS scaling, it is a rational migration. If no (relational data with many JOINs), you are trading one problem for a worse one. RDS Proxy solves the connection problem without the migration; pick it first.- AWS Docs — Amazon RDS Proxy
- HikariCP Wiki — About Pool Sizing
- PlanetScale Blog — “Database connections in a serverless world”
16.10 SQL vs NoSQL — Decision Framework
| Criteria | SQL (PostgreSQL, MySQL) | NoSQL (MongoDB, DynamoDB, Cassandra) |
|---|---|---|
| Data model | Relational (tables, rows, joins) | Document, key-value, wide-column, graph |
| Schema | Fixed schema, enforced by DB | Flexible/schemaless (schema-on-read) |
| Transactions | Full ACID, multi-table | Limited (single-document in MongoDB, none in Cassandra) |
| Joins | Native, efficient | Application-side or denormalize |
| Write scaling | Vertical first, sharding is hard | Horizontal built-in (DynamoDB, Cassandra) |
| Query flexibility | SQL — ad-hoc queries, aggregations | Limited (design for your access pattern upfront) |
| Consistency | Strong by default | Eventual by default (tunable in some) |
| Best for | Complex queries, transactions, relational data, reporting | High write throughput, flexible schema, key-based access, massive scale |
16.11 Redis vs Memcached
| Criteria | Redis | Memcached |
|---|---|---|
| Data structures | Strings, hashes, lists, sets, sorted sets, streams, HyperLogLog | Strings only (key-value) |
| Persistence | RDB snapshots + AOF | None (pure cache) |
| Pub/Sub | Built-in | No |
| Clustering | Redis Cluster (auto-sharding) | Client-side sharding |
| Lua scripting | Yes (atomic operations) | No |
| Multi-threading | Single-threaded main loop (I/O threads since 6.0) | Multi-threaded |
| Best for | Caching + data structures + pub/sub + queues | Pure caching, max memory efficiency |
16.12 Specialized Data Stores
Document databases (MongoDB, CouchDB): Flexible schema, good for content management, catalogs. Key-value stores (Redis, DynamoDB): Ultra-fast lookups by key, session stores, caching. Time-series databases (InfluxDB, TimescaleDB): Optimized for time-stamped data, IoT, metrics. Search engines (Elasticsearch, OpenSearch): Full-text search, log analytics. Vector databases (Pinecone, Weaviate, pgvector): Similarity search for ML embeddings. Graph databases (Neo4j): Relationship-heavy queries (social networks, fraud detection).Database Selection Guide
Use this table when choosing a database for a new project or feature. Start from your use case, not from a technology. This guide gives you the “what to pick” — for the “how it actually works under the hood” for each database, see the Database Deep Dives chapter.| Use Case | Recommended DB Type | Specific Options | Reasoning | Deep Dive |
|---|---|---|---|---|
| General web app (users, orders, products) | Relational (SQL) | PostgreSQL, MySQL | ACID transactions, JOINs, flexible queries, mature tooling | PG internals: MVCC, VACUUM, query planner |
| High-write event stream (clickstream, IoT telemetry) | Time-series or wide-column | TimescaleDB, InfluxDB, Cassandra | Append-optimized writes, time-range queries, horizontal scale | — |
| Session store / caching layer | Key-value (in-memory) | Redis, Memcached | Sub-millisecond reads, TTL expiry, simple key-based access | Redis architecture: eviction, persistence, clustering |
| Product catalog with variable attributes | Document store | MongoDB, DynamoDB | Flexible schema (each product has different attributes), nested documents | DynamoDB single-table design, partition strategies |
| Full-text search (site search, log analytics) | Search engine | Elasticsearch, OpenSearch | Inverted indexes, relevance scoring, faceted search | — |
| Social network (friends, followers, recommendations) | Graph database | Neo4j, Amazon Neptune | Relationship traversal (friends-of-friends) is O(1) per hop vs expensive JOINs | — |
| ML feature store / semantic search | Vector database | Pinecone, Weaviate, pgvector | Approximate nearest neighbor (ANN) search on high-dimensional embeddings | — |
| Financial ledger (strict consistency, audit trail) | Relational (SQL) | PostgreSQL, CockroachDB | Serializable transactions, multi-row ACID, immutable audit log | PG transaction internals, SSI |
| Global multi-region low-latency reads | Distributed NoSQL | DynamoDB Global Tables, CockroachDB, Cassandra | Multi-region replication, tunable consistency, automatic failover | DynamoDB Global Tables patterns |
| Queue / task broker | Stream or message broker | Redis Streams, Kafka, SQS | Ordered delivery, consumer groups, at-least-once semantics | Redis Streams internals |
Operational Economics — What Databases Actually Cost
Most teams track database performance but not database cost. Senior engineers think about both, because the cheapest query is the one you do not run, and the most expensive infrastructure is the infrastructure you over-provisioned. Real-world database cost breakdown (AWS, typical SaaS):| Component | Cost Driver | Monthly Range | Optimization Lever |
|---|---|---|---|
| RDS instance | Instance type, Multi-AZ | $200-5,000/mo | Right-size (most are over-provisioned by 2-4x). Use Reserved Instances for 30-60% savings. |
| Storage (gp3/io2) | GB provisioned + IOPS | $50-2,000/mo | gp3 with provisioned IOPS is 3-5x cheaper than io2 for most workloads. Monitor actual IOPS vs provisioned. |
| Read replicas | Same as primary instance cost per replica | $200-5,000/mo each | One replica is usually enough. Two replicas “just in case” doubles cost with no benefit if the second is idle. |
| Backup storage | GB beyond free tier + PITR retention | $20-500/mo | Default 7-day retention is fine. 35-day retention is 5x the storage cost — only if compliance requires it. |
| Data transfer | Cross-AZ and cross-region replication | $50-1,000/mo | Cross-region replicas cost 60/mo just in replication transfer. |
| RDS Proxy | Per vCPU-hour | $50-500/mo | Only needed for serverless. Remove if you migrated away from Lambda. |
| DynamoDB | RCU/WCU (provisioned) or per-request (on-demand) | $50-50,000/mo | On-demand is 5-7x more expensive per request than provisioned at steady state. Switch to provisioned with auto-scaling after traffic stabilizes. |
| ElastiCache (Redis) | Node type, cluster size | $100-3,000/mo | cache.t3.medium (450/mo) when they do not need the memory. |
- DynamoDB: A scan of a 10GB table with 50M items costs ~0.000125. Design your access patterns to avoid scans.
- Aurora: Aurora charges per I/O operation (0.20. The same query with an index hitting 10 pages costs $0.000002. Indexes are not just about performance — they are about cost.
- Cross-AZ queries: Reading from a replica in a different AZ costs 3/day or ~$90/month. Route reads to same-AZ replicas when possible.
You have a PostgreSQL table with 500M rows. Queries are slow despite indexes. Walk me through your optimization strategy.
You have a PostgreSQL table with 500M rows. Queries are slow despite indexes. Walk me through your optimization strategy.
- Run
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)on the slow queries. This is non-negotiable — you cannot optimize what you have not measured. Look at the actual execution plan, not the estimated one. - Check if the planner is using your indexes at all. If you see
Seq Scanon a 500M-row table, either the index is missing for that query pattern, the statistics are stale, or the planner decided a sequential scan was cheaper (which can happen if the query returns a large fraction of the table).
Seq Scan means PostgreSQL is reading every single row — all 500M of them. At roughly 10 MB/s sustained disk throughput on a typical cloud instance, that is approximately 4-5 minutes of wall-clock time, which matches the 287893 ms (~4.8 minutes) execution time. (2) Rows Removed by Filter: 499,995,207 means it examined 500M rows to find 4,793 matches — a selectivity of 0.001%. This is an index’s dream scenario. (3) Buffers: shared read=8,472,981 means it read ~8.5 million 8 KB pages from disk — about 65 GB of I/O for a query that should touch a few MB. (4) The cost=0.00..12847623.00 is the planner’s estimated cost in arbitrary units — higher means more expensive, and 12.8 million is astronomical.Now add a composite index:EXPLAIN ANALYZE:Index Scan touches exactly the rows it needs (4,821 buffer hits vs 8.5 million disk reads). At 500M rows, a B-tree index is about 4-5 levels deep, so the lookup requires roughly 5 random I/O operations to find the first matching leaf page, then a sequential scan through adjacent leaf pages for the remaining matches. This is why EXPLAIN ANALYZE is non-negotiable — without it, you are guessing. Every database interview answer about query optimization that does not start with “I would run EXPLAIN ANALYZE” is incomplete.Phase 2: Statistics and maintenance.- Run
ANALYZEon the table. PostgreSQL’s query planner uses statistics about data distribution to choose execution plans. On a 500M-row table, the default statistics target (100) may be too low — increase it withALTER TABLE SET STATISTICS 1000on high-cardinality columns used in WHERE clauses. - Check for bloat. On a heavily-updated table, dead tuples accumulate (MVCC). Run
SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'your_table'. If dead tuples are a significant fraction of live tuples,VACUUM FULLorpg_repack(non-locking) to reclaim space and reduce I/O. - Check index bloat. B-tree indexes on frequently-updated columns become bloated over time.
REINDEX CONCURRENTLYrebuilds without locking.
- Verify composite index column order matches query patterns (leftmost prefix rule). An index on
(status, created_at)helpsWHERE status = 'active' AND created_at > '2025-01-01'but NOTWHERE created_at > '2025-01-01'alone. - Consider partial indexes if queries consistently filter a subset:
CREATE INDEX idx_active ON orders(created_at) WHERE status = 'active'— if only 5% of rows are active, this index is 20x smaller and 20x faster to scan. - Consider covering indexes (
INCLUDE) to enable index-only scans — if the query only needsidandamount,CREATE INDEX idx ON orders(status, created_at) INCLUDE (id, amount)avoids touching the heap entirely.
- Partitioning. If queries consistently filter by a time range (and they usually do), partition by month or quarter using PostgreSQL declarative partitioning. A query for “last 30 days” now scans one partition (maybe 15M rows) instead of 500M. Partition pruning is automatic. The numbers are dramatic: scanning 15M rows takes ~2-3 seconds vs ~8-10 minutes for 500M. With an index on the partition, point lookups drop to under 1 ms because the index is 30x smaller and fits entirely in memory.
- Read replicas. Route analytics and reporting queries to a read replica. Keep the primary for OLTP.
- Materialized views. For expensive aggregation queries that do not need real-time data, create a materialized view refreshed every hour. A dashboard that aggregates 500M rows every page load is a bug, not a feature.
- Add pagination. No query should return 500M rows. Cursor-based pagination with
WHERE id > last_seen_id ORDER BY id LIMIT 100is O(1) regardless of offset. At 500M rows, anOFFSET 10000000 LIMIT 20query still scans and discards 10 million rows — cursor-based pagination avoids this entirely. - Cache hot queries in Redis with a TTL. If the same aggregation is requested 1000 times per minute, compute it once. A Redis GET takes ~0.1 ms vs a complex PostgreSQL aggregation at 50-500 ms — that is a 500-5000x improvement for repeat queries. See the Caching chapter for cache invalidation strategies (write-through, write-behind, cache-aside) and how to avoid the thundering herd problem on cache expiry.
- Review the ORM. ORMs often generate suboptimal queries — N+1 problems, unnecessary SELECT *, missing WHERE clauses.
EXPLAIN ANALYZEthe actual SQL the ORM produces, not what you think it produces. A classic N+1 on a 500M-row table turns a single 5 ms indexed query into 100 separate queries taking 500 ms total — plus the connection overhead.
CREATE INDEX idx ON orders(customer_id, status) INCLUDE (total, created_at). Query time dropped to 3ms — a 15,000x improvement from one line of DDL. The lesson: indexes are not “optimization”; they are the difference between a working system and a broken one at this scale.REINDEX CONCURRENTLY or pg_repack rebuilds it without a lock. Monitor via pg_stat_user_indexes and the pgstattuple extension.INCLUDE clause), so PostgreSQL answers the query entirely from the index without touching the heap. Index-only scans are 10-100x faster than regular index scans because they skip the heap fetch. Name it when discussing optimization beyond basic indexing.ANALYZE. (2) The index does not match the query’s leftmost prefix rule. (3) Selectivity is so low that a sequential scan is actually cheaper (>15% of the table). (4) random_page_cost is at the HDD default of 4.0 — on SSDs, set it to 1.1. (5) The query uses UPPER(col) or another function that prevents index usage — use a functional index. EXPLAIN tells you which reason; do not guess.Q: You have 500M rows and a query that filters by a range of timestamps. Partitioning by month — helps or hurts?
A: It helps only if the query includes the timestamp predicate. If the query filters by user_id without a date range, partitioning makes it worse — now the planner must scan every partition. The precondition for partitioning is: all hot-path queries include the partition key in the WHERE clause. Violate it and you trade one big index scan for N small ones — net worse.Q: Your p99 is 5 seconds but p50 is 20ms. Where is the tail coming from?
A: Three common culprits at scale: (1) cold cache — the first query to touch a partition/index that is not in shared_buffers pays the disk read tax. (2) Autovacuum I/O contention — a VACUUM on a large table saturates disk for minutes. (3) Checkpoint flushes — every checkpoint_timeout interval (default 5 minutes), dirty pages flood the disk. The tail tells you which one: regular spikes every 5 min = checkpoints; sporadic spikes on specific queries = cold cache; long plateaus = autovacuum. auto_explain catches the slow plans; pg_stat_bgwriter confirms checkpoints.- Markus Winand — Use The Index, Luke — the definitive free guide to SQL indexing
- PostgreSQL Docs — Using EXPLAIN
- pganalyze — “Advanced EXPLAIN ANALYZE: How to read PostgreSQL query plans”
- Caching — The fastest database query is the one you never make. For read-heavy workloads, a caching layer (Redis, Memcached) in front of the database can reduce read load by 90%+. See the Caching chapter for strategies on cache invalidation, TTL design, and avoiding the thundering herd problem when a hot cache key expires and 1,000 requests simultaneously hit the database.
- Performance and Reliability — Query tuning is one dimension of performance, but connection pooling (PgBouncer, ProxySQL), circuit breakers around database calls, and timeout configuration are equally critical. A single slow query without a timeout can hold a connection for minutes, starving the pool. See the Performance chapter for connection pool sizing, timeout strategies, and circuit breaker patterns around data stores.
- Messaging and Async Processing — For write-heavy workloads, consider decoupling writes through a message queue. Instead of synchronously inserting into the database on every API request, publish to Kafka or SQS and let a consumer batch-insert. This absorbs traffic spikes (Black Friday: 10x normal write volume) without overwhelming the database. See the Messaging chapter for the outbox pattern, which ensures reliable publishing of database changes to a message broker without distributed transactions.
- Database Deep Dives — This chapter covers database concepts (indexing, transactions, replication) across all databases. The Database Deep Dives chapter goes engine-specific: PostgreSQL MVCC internals and why VACUUM matters, DynamoDB partition key design and hot-partition avoidance, MongoDB aggregation pipeline optimization, and Redis memory management and eviction policies. If you are asked “how does PostgreSQL actually implement MVCC?” or “why is my DynamoDB table throttling?”, that is where to go.
- Cloud Service Patterns — For AWS-specific database operations: RDS instance sizing and storage types (gp3 vs io2), Aurora’s distributed storage architecture and why failover is faster than RDS, DynamoDB on-demand vs provisioned capacity cost modeling, ElastiCache deployment patterns (cluster mode vs non-cluster), and RDS Proxy configuration for serverless workloads.
API Design Checklist
Use this checklist before shipping any API — internal or external. The items are ordered roughly by “pain caused if you skip this.” Every item you defer is technical debt that compounds with every new consumer.1. Versioning Strategy
- Choose a versioning scheme before writing the first endpoint (
/v1/URL path for public APIs, header-based for internal) - Define what constitutes a breaking change for your API (removing fields, renaming fields, changing types, changing status codes)
- Establish a deprecation policy with a timeline (minimum 6 months notice for public APIs, 2 weeks for internal)
- Pin each consumer to a version — new consumers get the latest, existing consumers keep theirs until they opt in
- Include version in every response header (
X-API-Version: 2024-01-15) so consumers can verify
2. Authentication and Authorization
- Choose auth mechanism: API keys (simple, per-consumer), OAuth 2.0 (delegated access, scopes), JWT (stateless, short-lived)
- Require HTTPS everywhere — no exceptions, even for internal APIs (zero-trust networking)
- Implement API key rotation without downtime (support two active keys per consumer during rotation)
- Return
401 Unauthorized(identity unknown) vs403 Forbidden(identity known, insufficient permissions) — these are different - Log every authentication failure with IP, key prefix (never the full key), and timestamp for security auditing
3. Rate Limiting
- Define rate limits per consumer tier (free: 100 req/min, paid: 10,000 req/min, enterprise: custom)
- Return
429 Too Many RequestswithRetry-Afterheader (in seconds) andX-RateLimit-Remaining,X-RateLimit-Resetheaders - Choose algorithm: token bucket for APIs allowing bursts, sliding window for strict rate enforcement
- Implement at the API gateway level (not per-application-instance) for accurate distributed counting
- Document rate limits prominently — developers should not discover limits by hitting them
4. Pagination
- Choose cursor-based pagination for all list endpoints (not offset-based — offset degrades at scale)
- Return pagination metadata in every list response:
{ "data": [...], "has_more": true, "next_cursor": "abc123" } - Set a maximum
page_size(e.g., 100) and a sensible default (e.g., 20) — never return unbounded results - Ensure cursors are opaque to clients (base64-encoded internal identifiers, not raw database IDs)
- Document the sort order — cursors are meaningless without a defined, stable ordering
5. Error Format
- Define a single, consistent error schema used by every endpoint:
- Include a machine-readable
code(for programmatic handling) AND a human-readablemessage(for debugging) - Include
request_idin every response (success and error) — this is the lifeline for support debugging - Use appropriate HTTP status codes:
400(bad input),401(not authenticated),403(not authorized),404(not found),409(conflict),422(validation failed),429(rate limited),500(server error) - Never leak internal details (stack traces, SQL errors, internal service names) in production error responses
6. Idempotency
- Require an
Idempotency-Keyheader on all non-idempotent mutations (POST, PATCH) - Store idempotency keys with the response — retries return the stored response, not a new execution
- Scope keys to the API consumer (two different consumers can use the same key string)
- Set a TTL on idempotency records (24 hours is standard — Stripe’s model)
- Return an error if a retried request has a different body than the original (the key is bound to the parameters)
- For critical operations (payments, order creation), wrap the idempotency check and business logic in a single database transaction
7. Documentation
- Write an OpenAPI (Swagger) spec before writing code — the spec is the contract
- Include a runnable example for every endpoint (curl command or code snippet in 2-3 languages)
- Document every error code with cause, example, and suggested fix
- Provide a “Getting Started” guide that gets a developer from zero to first successful API call in under 5 minutes
- Include a changelog that documents every change with the date, affected endpoints, and migration path
- Host interactive API docs (Swagger UI, Redoc, or Stoplight) — developers should be able to try endpoints without writing code
8. Additional Production Concerns
- CORS — Configure allowed origins, methods, and headers for browser-based consumers
- Request/response logging — Log every request with method, path, status code, latency, and consumer ID (redact sensitive fields)
- Health check endpoint —
GET /healthreturns200with dependency status (database connected, cache reachable) for load balancer probes - Compression — Support
Accept-Encoding: gzip— JSON compresses 5-10x, which matters for bandwidth-constrained mobile clients - Timeouts — Set and document server-side request timeouts (e.g., 30 seconds) — a client should never wait indefinitely
- Webhooks (if applicable) — Support event notifications with retry logic, signature verification (
HMAC-SHA256), and a test mode
Interview Deep-Dive Questions
These questions simulate what a senior or staff-level interviewer would ask in a 45-60 minute technical interview covering APIs and databases. They are ordered from foundational to staff-level, and each includes follow-up chains that mirror how a real interview escalates. The answers are written as a strong candidate would actually speak — structured, specific, grounded in production experience.Q1: You are joining a team that has a REST API returning deeply nested JSON responses averaging 50KB per call. Mobile clients are complaining about latency. Walk me through how you would diagnose and fix this.
Q1: You are joining a team that has a REST API returning deeply nested JSON responses averaging 50KB per call. Mobile clients are complaining about latency. Walk me through how you would diagnose and fix this.
- Measure the breakdown. Instrument the request lifecycle: DNS resolution, TCP/TLS handshake, time-to-first-byte (TTFB), and content transfer time. On a 50KB JSON response over a typical 4G connection (~10 Mbps), the transfer itself is only about 40ms. If the total latency is 2 seconds, the transfer is not the bottleneck — the server processing time is. But on a 3G connection in a developing market (~1 Mbps), that 50KB takes 400ms just to transfer, and now payload size actually matters.
- Enable gzip/brotli compression. JSON compresses extremely well — typically 5-10x. A 50KB response compresses to roughly 5-8KB. This is the single highest-leverage, lowest-risk change. Configure it at the API gateway or reverse proxy (nginx, CloudFront). This alone might cut perceived latency by 30-50% for bandwidth-constrained clients.
-
Analyze the payload. Are mobile clients actually using all 50KB of that nested data? In my experience, the answer is almost always no. A mobile feed view might need 5 fields from a response that includes 40. This is the classic REST over-fetching problem. Solutions: (a) Add a
?fields=id,name,thumbnailsparse fieldset parameter so clients request only what they need. (b) Create a dedicated mobile-optimized endpoint or a Backend-for-Frontend (BFF) that returns a slim payload. (c) If the problem is widespread across many endpoints, this is where GraphQL earns its keep — the client specifies exactly the shape it needs. -
Check the query layer. Run
EXPLAIN ANALYZEon the queries backing the endpoint. A deeply nested JSON response often means the server is doing multiple JOINs or N+1 queries to assemble the object graph. If the endpoint returns an order with line items, each with a product, each with a category — that could be 4 sequential queries or a massive JOIN. Use eager loading (include the joins upfront) or DataLoader-style batching if the backend is GraphQL. -
Cache aggressively. If this endpoint serves data that does not change every second (product catalog, user profiles), add HTTP cache headers (
Cache-Control: max-age=300, stale-while-revalidate=60) and consider a CDN (CloudFront, Fastly) in front of the API. For authenticated endpoints where CDN caching is not appropriate, use server-side caching with Redis — cache the serialized response for 30-60 seconds.
Follow-up: The team proposes moving to GraphQL to solve the over-fetching. What concerns would you raise?
GraphQL solves over-fetching elegantly, but it introduces problems that REST does not have:-
HTTP caching breaks. Every GraphQL request is a POST with a unique body. You cannot use a CDN to cache responses without persisted queries (pre-registered query strings with an ID, so
GET /graphql?id=abc123becomes cacheable). That is additional infrastructure. -
Query cost is unpredictable. A single GraphQL query can be
{ user(id: 1) { name } }(trivial) or{ users { orders { items { product { reviews { author { orders { ... } } } } } } } }(a denial-of-service attack disguised as a query). You need query depth limits, complexity analysis, and cost-based rate limiting from day one. This is non-trivial to implement correctly. -
N+1 problems move to the server. In REST, the server controls what data is fetched. In GraphQL, the client’s query determines the resolver call graph. Without DataLoader (or equivalent batching), resolving
users { orders { ... } }for 100 users triggers 100 separategetOrderscalls. This is the most common GraphQL performance trap. - Organizational cost. Every frontend developer now needs to understand the schema, write queries, and manage a client-side cache (Apollo Client’s normalized cache is powerful but has a steep learning curve). The backend team needs to maintain the schema, handle schema evolution without breaking existing queries, and monitor query performance.
Follow-up: How would you implement sparse fieldsets in a REST API without it becoming a maintenance nightmare?
The cleanest approach is a generic?fields= query parameter that the framework interprets:- The client sends
GET /orders/123?fields=id,total,customer.name,items.product_name - The server parses the fields parameter into a projection tree
- The ORM generates a SELECT with only the requested columns (and the necessary JOINs for nested fields)
- The serializer filters the response to include only the requested fields
DynamicFieldsMixin. In Express/Node, you can write a response transformer that prunes the JSON based on the fields parameter.The pitfall is nested fields. customer.name requires joining the customer table even though the top-level orders query might not need it. You need the fields parser to understand the object graph and translate field requests into appropriate database includes. This is where it can become complex — and if it gets too complex, you have reinvented half of GraphQL’s resolver engine. That is the signal to just adopt GraphQL.Real-World Example. Shopify’s mobile team documented a case where their REST product detail API returned 120KB payloads with 60+ nested fields. After enabling gzip (6KB compressed), adding sparse fieldsets (?fields=id,title,thumbnail,price), and adding a mobile-optimized BFF endpoint, mobile p95 dropped from 2.8s to 340ms on 4G networks. The most impactful change was the BFF — it cut over-fetching by 85%, far beyond what gzip alone achieved.<link rel=preload> hints, which let the client decide what to preload — much better fit. Do not recommend server push in 2026.Q: Your BFF adds a service hop. Is the latency trade worth it?
A: If the BFF aggregates 5 downstream calls into 1 client-facing response, you trade one round-trip (BFF to client) plus 5 internal round-trips (BFF to services) for 5 round-trips from the client directly. Internal round-trips are sub-millisecond; client round-trips are 50-200ms. The BFF wins massively on mobile networks. Only when the BFF becomes a bottleneck itself (adding 100ms of processing) does the trade-off invert.Q: How do you cache BFF responses when they aggregate per-user data?
A: Two-tier caching. L1: cache each downstream call with its own TTL and key (user-agnostic where possible). L2: cache the BFF’s composed response with a short TTL (30-60s) keyed by user + endpoint. Even if user-specific, 60s of caching eliminates 95% of redundant work during a traffic spike. Use stale-while-revalidate so clients never wait on a cache refresh.- Sam Newman — Pattern: Backends For Frontends
- Shopify Engineering — “How Shopify Reduced Mobile API Payload Size by 80%”
- Cloudflare Blog — “HTTP/2 Server Push: Lessons Learned”
Q2: Explain idempotency keys as if you are teaching a junior engineer, then explain what breaks in a distributed system without them.
Q2: Explain idempotency keys as if you are teaching a junior engineer, then explain what breaks in a distributed system without them.
Idempotency-Key header. Server flow: (1) Check if the key exists in the store. (2) If it exists and the operation completed — return the stored response verbatim (same status code, same body). (3) If it exists and the operation is in-progress — return 409 Conflict to prevent concurrent duplicate processing. (4) If it does not exist — record the key with status “processing,” execute the operation, store the result, mark “completed.”What breaks without idempotency in a distributed system:- Network retries create duplicates. The client sends a POST to create an order. The server processes it, inserts the row, but the response is lost (TCP connection reset, load balancer timeout, anything). The client retries. Without idempotency, a second order is created. At Stripe’s scale — processing billions of dollars — even a 0.01% duplicate rate means millions in double charges.
- Message queue redelivery. Most message brokers provide at-least-once delivery, not exactly-once. If a consumer processes a message, crashes before acknowledging it, the broker redelivers it. The consumer must be idempotent or it processes the same event twice — double-sending an email, double-crediting an account, double-triggering a webhook.
- Load balancer failover. The client sends a request to Server A. Server A processes it but crashes before responding. The load balancer retries to Server B. Without a shared idempotency store, Server B has no idea Server A already processed it. Both servers execute the operation. This is why the idempotency store must be shared (Redis or the primary database), not in-process memory.
- Distributed transactions get messy. In a saga pattern where creating an order involves charging a payment, reserving inventory, and sending a notification — if the payment step succeeds but inventory reservation fails and you need to compensate, the compensation (refund) must also be idempotent. Otherwise, retrying a compensation can issue multiple refunds.
Follow-up: How would you handle idempotency in a system where the primary database and Redis (your idempotency store) might have different states after a crash?
This is the critical implementation detail that separates a blog-post-level understanding from a production-level one. The core problem: if you check the idempotency key in Redis, then write to the database, a crash between those two steps leaves them inconsistent.The solution is to use the database itself as the idempotency store — not Redis. Wrap the idempotency check and the business operation in a single database transaction:Going Deeper: What happens when the idempotency key store itself becomes a scaling bottleneck?
At very high scale (millions of writes per second), the idempotency table becomes a hot write path. Every mutating API call writes to it. Strategies:- TTL and cleanup. Set keys to expire after 24 hours (Stripe’s approach). Run a periodic job to delete expired keys. This keeps the table size bounded. At 1 million API calls per hour with 24-hour TTL, you have roughly 24 million active keys — manageable for PostgreSQL with proper indexing.
-
Partition the idempotency table. Partition by
created_at(daily or hourly). Dropping old partitions is O(1) — far faster than DELETE on millions of rows. This also keeps the index small and fast. -
Use a hash as the primary key. Idempotency keys are UUIDs (128 bits). Store them as
uuidtype, nottext— this saves 20+ bytes per row and makes the B-tree index denser (more keys per page, fewer levels, faster lookups). - Separate the hot path. For extremely high-throughput systems, use a dedicated Redis cluster just for idempotency with a TTL. Accept that a Redis failure means you lose deduplication for a brief window — for most operations this is an acceptable trade-off. For payment operations where duplicates are catastrophic, keep the database as the authoritative store and use Redis as a cache layer in front of it.
DROP TABLE idempotency_keys_2026_04_18 is instant and reclaims storage; DELETE FROM ... WHERE created_at < ... is slow and creates bloat.- Stripe Blog — “Designing robust and predictable APIs with idempotency”
- Brandur Leach — “Implementing Stripe-like idempotency keys in Postgres”
- Martin Fowler — Idempotent Receiver pattern
Q3: Compare token bucket and sliding window rate limiting. When would you pick one over the other? How would you implement distributed rate limiting across multiple API gateway instances?
Q3: Compare token bucket and sliding window rate limiting. When would you pick one over the other? How would you implement distributed rate limiting across multiple API gateway instances?
(previous window count * overlap%) + current window count. If a client made 60 requests in the last minute and the limit is 100/minute, they have 40 remaining.When to pick each:- Token bucket when bursts are natural and acceptable. A mobile app that syncs 50 items on launch, then makes occasional requests. An API where clients batch operations. Stripe and AWS use token bucket because their clients have bursty access patterns and penalizing bursts would hurt developer experience. The burst size is the bucket capacity — it is a tunable safety valve.
- Sliding window when you need strictly even distribution. Payment processing where 1,000 simultaneous requests could overwhelm a downstream bank API. Rate limiting writes to a database where sustained throughput matters more than peak. Sliding window prevents the “double burst at window boundary” problem that fixed windows have, without allowing the bursty behavior of token bucket.
- Centralized counter in Redis (most common). Each gateway instance increments a counter in Redis on every request. For token bucket, store
{tokens_remaining, last_refill_timestamp}and use a Lua script to atomically check-and-decrement:
-
Local counters with periodic sync. Each gateway maintains a local counter and syncs to Redis every N seconds. Faster (no per-request Redis call) but less accurate — during the sync interval, the total across all instances can exceed the limit by up to
instances * local_threshold. Acceptable for high-throughput internal APIs where approximate limiting is fine. -
Redis Cell module. A native Redis module that implements token bucket with a single
CL.THROTTLEcommand. Most operationally simple — one command, atomic, no Lua scripts. But requires loading a Redis module, which some managed Redis services do not support.
Follow-up: What happens when your Redis rate limiting cluster goes down? Do you fail open (allow all traffic) or fail closed (reject all traffic)?
This is one of those decisions that should be made deliberately, not discovered during an incident.Fail open (allow traffic): Safest for user experience. If Redis is down for 30 seconds, you accept potentially unlimited traffic. For most APIs, this is acceptable because (a) Redis outages are rare and short (seconds to minutes), (b) your backend can handle some burst without rate limiting, and (c) rejecting all traffic during a Redis blip is worse than allowing some extra traffic.Fail closed (reject traffic): Necessary when the rate limit protects a critical downstream resource — a payment processor with hard limits, a third-party API that will ban you for exceeding their rate. In this case, a Redis failure should return 503 Service Unavailable with a Retry-After header rather than allowing uncontrolled traffic to hit the downstream system.The hybrid approach (what I would recommend): Fail open with a local fallback. Each gateway instance maintains a rough in-memory rate limit (even just “allow 100 requests per second per instance”) that activates when Redis is unreachable. This provides a coarse safety net without completely dropping rate limiting. When Redis recovers, switch back to centralized counting.Follow-up: A client complains they are being rate-limited unfairly — they say they are well under the limit. How do you debug this?
First, check the basics: is their key being shared across multiple integrations (a common mistake where a team uses the same API key for staging and production)? Check the rate limit headers in their responses —X-RateLimit-Remaining and X-RateLimit-Reset — to confirm what the system thinks their usage is.Then check for clock skew. If your rate limiter uses sliding windows based on timestamps, and the Redis cluster’s clock is slightly off from the gateway’s clock, window boundaries shift and counts can be attributed to the wrong window. Use Redis server time, not gateway local time.Check if they are being rate-limited per-IP instead of per-key. If they are behind a corporate proxy or NAT, hundreds of users share one IP. Per-IP rate limiting will throttle the whole organization unfairly.Finally, check for “shadow traffic” — health check probes, monitoring synthetic requests, or integration test suites running against production. These all count against the rate limit but the client might not consider them “real” requests.Real-World Example. Stripe’s rate limiter is a four-tier architecture documented in their “Scaling your API with rate limiters” post: (1) request rate limiter (how many per second), (2) concurrent request limiter (how many in flight simultaneously), (3) fleet usage load shedder (protect overall API from overload), (4) worker utilization load shedder (protect individual workers). All are implemented in Redis with Lua scripts — sub-millisecond enforcement on every API call. The tiered design is what lets them allow bursts (tier 1 token bucket) while preventing a slow client from holding 1000 concurrent requests (tier 2 concurrency limiter).INCR + EXPIRE not atomic in Redis?
A: Because INCR and EXPIRE are two separate commands. If the server crashes between them, you have a counter with no TTL — it lives forever, and the rate limit becomes permanent for that key. The fix is a Lua script that executes INCR and EXPIRE atomically, or using Redis Cell module’s CL.THROTTLE which is atomic by design.Q: How do you handle rate limit counter drift across Redis replicas during a failover?
A: During a Redis primary failure, the replica may be a few milliseconds behind. When it promotes, a client might briefly be allowed extra requests (replica did not see the most recent INCRs). For most rate limiters, this is acceptable — a few extra requests during a failover window is not a security or stability issue. For critical enforcement, use Redis with WAIT command (block until N replicas ack the write) at the cost of latency.Q: Should rate limiting happen at the API gateway or inside each service?
A: Gateway, almost always. The gateway sees all traffic, can enforce per-consumer limits globally, and sheds load before requests consume backend capacity. Service-level rate limiting only catches requests that already made it through the gateway — too late to protect from abuse. The exception: internal service-to-service rate limiting (bulkheading) where you want to cap how many requests Service A sends to Service B regardless of overall traffic.- Stripe Blog — “Scaling your API with rate limiters”
- Cloudflare — “How we built rate limiting capable of scaling to millions of domains”
- Redis Docs — Patterns: Rate Limiting (see rate limiting section)
Q4: You have a PostgreSQL database with synchronous replication to one standby. A developer complains that write latency jumped from 5ms to 150ms after enabling sync replication. Explain why and propose solutions.
Q4: You have a PostgreSQL database with synchronous replication to one standby. A developer complains that write latency jumped from 5ms to 150ms after enabling sync replication. Explain why and propose solutions.
-
Semi-synchronous replication. PostgreSQL supports
synchronous_standby_names = 'ANY 1 (standby1, standby2)'. With two standbys, the primary waits for only one to acknowledge. If one standby is fast (same AZ) and one is slow (cross-region), the fast one usually responds first. Write latency drops to the fastest standby’s round-trip. You still get durability — at least one replica has the data. -
Classify writes by criticality. Not every write needs synchronous replication. For financial transactions, payment records, and audit logs — yes, synchronous is warranted. For session updates, analytics events, and activity logs — asynchronous is fine. PostgreSQL allows setting
synchronous_commitat the transaction level:
- Move the standby closer. If the standby is cross-region (which explains the 150ms), consider whether you actually need cross-region replication for your synchronous standby. Put the synchronous standby in the same region (different AZ for fault tolerance — typically 1-3ms round-trip). Add an asynchronous replica cross-region for disaster recovery. Same durability benefit, much lower latency.
- Batch writes. If the application makes many small writes, batch them into fewer transactions. The sync replication overhead is per-commit, not per-row. Inserting 100 rows in a single transaction pays the round-trip penalty once. Inserting 100 rows in 100 separate transactions pays it 100 times.
Follow-up: The standby goes down. What happens to writes on the primary?
With synchronous replication, writes block. The primary waits for standby acknowledgment that will never come. After the configured timeout (wal_sender_timeout, default 60 seconds), the connection to the standby is dropped, and the behavior depends on your configuration:-
If
synchronous_standby_nameslists only one standby and it is down, the primary has two choices: (a) block all writes indefinitely (safest for data, catastrophic for availability), or (b) fall back to asynchronous mode (maintain availability, accept that new writes are not replicated until the standby recovers). -
With
ANY 1 (standby1, standby2)and two standbys, losing one standby is fine — the primary continues with the other. This is why production setups should always have at least two synchronous standbys configured in anANY 1group.
pg_stat_replication.sync_state — if your synchronous standby shows as disconnected, you are either blocking writes or running with degraded durability.Going Deeper: How does PostgreSQL’s WAL-based replication actually work at the byte level?
Every change to the database (INSERT, UPDATE, DELETE, even index modifications) is first written as a WAL record — a sequential, append-only log entry that describes the change. The WAL is divided into 16MB segment files (configurable). The primary streams these WAL bytes to standbys over a persistent TCP connection.The standby’swalreceiver process receives the bytes and writes them to its local WAL. The startup process then replays the WAL records against the standby’s data files, bringing it up to date. In “hot standby” mode, the standby can serve read queries while simultaneously replaying WAL — this is how read replicas work in PostgreSQL.For synchronous replication, the critical flag is remote_apply vs remote_write vs on:synchronous_commit = on(default): primary waits until the standby has flushed WAL to disk. Durable against standby crash, but the standby has not yet replayed the records — queries on the standby might not see the data yet.synchronous_commit = remote_apply: primary waits until the standby has replayed the WAL records. Queries on the standby immediately see the committed data. This is strongest consistency but adds replay latency on top of network latency.synchronous_commit = remote_write: primary waits until the standby has received and written WAL to OS cache (not necessarily flushed to disk). Faster but vulnerable to standby OS crash.
on (flush to standby disk) is the right default. remote_apply is used when read-after-write consistency on replicas is required (the “read your own writes” scenario).Real-World Example. Crunchy Data’s PostgreSQL blog documents a customer whose same-region sync replication was fine (3ms) until a cross-region replica was added and incorrectly listed in synchronous_standby_names as FIRST 1 (cross_region, same_az). The primary waited for the 180ms cross-region replica on every write. Fix: change to ANY 1 (same_az, cross_region) — whichever acks first unblocks the commit. Write latency returned to 3-5ms with cross-region durability retained.ANY 1, one slow/failed replica does not block commits. Name it as the practical production default — full sync across all replicas is too fragile.off (async, risk data loss), local (primary fsync only), remote_write (standby has received WAL in OS cache), on (standby has flushed WAL to disk — default), remote_apply (standby has replayed — strongest but slowest). Mention that you tune this per-transaction, not globally.wal_sender_timeout fires (default 60s), the primary keeps waiting. Writes are blocked for up to 60 seconds — catastrophic for availability. The fix is to configure synchronous_standby_names = 'ANY 1 (standby1, standby2)' with multiple standbys, so losing one does not block commits. Or lower wal_sender_timeout at the cost of flapping during transient network blips.Q: Semi-sync gives durability for the standby that acked. What about the ones that did not?
A: Asynchronous catch-up. They receive WAL in the background with a few seconds of lag. Durability is at least one replica has the data at commit time; additional replicas converge eventually. This is fine for most workloads. For scenarios needing multiple AZ durability (like financial), use ANY 2 with three standbys across three AZs.Q: When is remote_apply (strictest mode) genuinely worth the latency cost?
A: When your read replicas must show the freshly-committed data immediately after a write (read-your-own-writes across nodes, not just same-session). Example: user submits a form, backend redirects to a “confirmation” page that reads from the replica. Without remote_apply, the replica may not have applied the write yet. Alternatives: session-sticky routing to primary post-write (cheaper), or causal consistency tokens.- PostgreSQL Docs — synchronous_commit
- Crunchy Data Blog — “PostgreSQL Synchronous Replication: A Deep Dive”
- 2ndQuadrant — “Synchronous replication: the gritty details”
Q5: A product manager asks you to add a 'soft delete' feature — records should not be permanently deleted but marked as inactive. What are the database design implications, and what would you push back on?
Q5: A product manager asks you to add a 'soft delete' feature — records should not be permanently deleted but marked as inactive. What are the database design implications, and what would you push back on?
deleted_at TIMESTAMPTZ column, change DELETE to UPDATE, filter WHERE deleted_at IS NULL everywhere. But the real-world implications are significant, and I would want to discuss them before implementing.The implementation is straightforward. The consequences are not.-
Every query in the application must filter by
deleted_at IS NULL. This is the most dangerous part. Miss one query, and deleted records appear in search results, reports, or API responses. In a codebase with 200 queries touching theuserstable, every single one needs the filter. Global query scopes (Railsdefault_scope, DjangoManager, Hibernate@Where) help but are fragile — they can be accidentally bypassed, they do not apply to raw SQL, and they create confusion when you actually need to see deleted records (admin views, audit, analytics). -
Unique constraints break. If you have
UNIQUE(email)and user A with emailalice@example.comsoft-deletes their account, user B cannot sign up with that same email because the soft-deleted row still occupies the unique constraint. The fix is a partial unique index:CREATE UNIQUE INDEX idx_unique_email ON users(email) WHERE deleted_at IS NULL. This works but is PostgreSQL-specific and needs to be applied to every unique constraint on every soft-delete table. -
Foreign keys become complicated. If a user is soft-deleted but their orders still reference
user_id, what happens? Cascade soft-delete all orders? Leave them orphaned pointing to a “deleted” user? Neither is clean. In my experience, most teams end up with ad-hoc “if the parent is soft-deleted, treat the child as…” logic scattered throughout the codebase. - Table bloat. Soft-deleted records never leave the table. A table that accumulates 10 million “deleted” records but only has 500K active records is 95% dead weight. Indexes include the dead rows (unless you use partial indexes for everything). Full table scans are 20x slower than they need to be. You end up needing a periodic “hard delete” job that archives old soft-deleted records to a separate table or data warehouse — which is basically implementing what you tried to avoid.
-
Performance. A partial index
WHERE deleted_at IS NULLhelps for queries that filter active records, but the table itself grows unbounded. VACUUM still has to process the full table. Statistics calculations include soft-deleted rows. The query planner might make suboptimal decisions because the row count statistics do not reflect the actual working set.
-
“Users want an undo/recycle bin” — Implement a time-limited undo (30 days), then hard delete. Store the deleted record in a separate
deleted_userstable or an archive. This keeps the main table clean. - “Legal/compliance requires data retention” — Use an audit log or event sourcing pattern. Every state change is an immutable event. You can hard-delete from the operational database while maintaining a complete history in the audit log. This separates operational data from compliance data.
-
“We need to preserve foreign key relationships” — Consider using a status field (
status: active | suspended | deactivated) instead of delete semantics. This is more honest about what is happening and avoids the mental model mismatch of “deleted but still there.” - “We are afraid of accidental deletes” — Use database backups and point-in-time recovery. That is what they are for.
Follow-up: The team decides to go with soft delete. Six months later, the table has 50M soft-deleted rows out of 52M total. The API is slow. What do you do?
This is exactly the scenario I warned about. The table is 96% dead weight. Here is the recovery plan:-
Immediate relief: Create partial indexes for every query that filters active records:
CREATE INDEX CONCURRENTLY idx_users_active ON users(id) WHERE deleted_at IS NULL. This gives the query planner a small, efficient index to work with instead of scanning the full 52M-row B-tree. For covering queries, include the needed columns:CREATE INDEX CONCURRENTLY idx_users_active_email ON users(email) WHERE deleted_at IS NULL INCLUDE (name, created_at). -
Archive the dead rows. Create a
users_archivetable with identical schema. Move soft-deleted rows older than 90 days in batches:
-
Table maintenance. After bulk deletion, run
VACUUM FULL users(locks the table — do during maintenance window) orpg_repack(no lock, requires extension) to reclaim disk space and rebuild indexes compactly. - Prevent recurrence. Implement the archival job as a periodic cron that runs daily, moving records older than the retention period.
Follow-up: How does the event sourcing approach compare to soft delete for audit compliance?
In event sourcing, you never mutate or delete. Every state change is an immutable event:UserCreated, UserEmailUpdated, UserDeactivated, UserDeleted. The current state is derived by replaying events.For compliance, this is superior to soft delete because: (a) you have a complete, immutable audit trail — not just “is deleted” but “who deleted it, when, why, and what the state was before deletion.” (b) You can hard-delete from the operational database (projection/read model) while retaining the event log — GDPR right-to-erasure is handled by crypto-shredding the events (encrypting per-user, deleting the key). (c) The operational database only contains current state, so it stays lean.The trade-off is complexity. Event sourcing requires an event store, event replay logic, projections for read models, and a team that understands the pattern. For most applications, an append-only audit_log table that records every change (table name, row ID, old value, new value, timestamp, actor) gives 80% of the compliance benefit at 20% of the complexity. That is what I would recommend for most teams.Real-World Example. GitHub’s engineering team famously migrated away from soft-delete for their repositories table after it grew to ~70% dead rows. The migration: move soft-deleted rows older than 90 days to a repositories_archive table, drop the deleted_at filter from all hot-path queries, and keep partial unique indexes for uniqueness. Query latency on the repositories endpoint dropped 40% with no functional change. The lesson: soft-delete at scale is a tax you pay on every query, forever.CREATE UNIQUE INDEX idx ON users(email) WHERE deleted_at IS NULL — enforces uniqueness only on active rows, letting soft-deleted rows have duplicate emails. Essential for any soft-delete schema. Without it, user B cannot reuse the email of a soft-deleted user A.UserCreated, UserEmailUpdated, UserDeleted), and current state is a derived projection. Immutable audit trail for free. Name it as the “right” solution for regulatory compliance — but caveat the complexity cost.archive table. After archival, pg_repack to reclaim disk space. Do not run VACUUM FULL on a 52M-row table — it locks for hours. Total recovery time: days of batched work, not an outage.Q: GDPR right-to-erasure requires hard delete. How do you combine that with soft-delete patterns?
A: Two-tier: soft-delete for short-term undo (30 days), hard-delete for regulatory erasure. The scheduled archival job that moves soft-deleted rows past the undo window either (a) hard-deletes them if GDPR-scope data, or (b) moves to archive if retention is required. Event sourcing handles this with “crypto-shredding” — encrypt events per-user, delete the key, events become unreadable.Q: Should a deleted_at filter be in the ORM or the database?
A: Both. ORM default scopes (Rails default_scope, Django Manager) prevent accidental inclusion, but they leak: raw SQL bypasses them, reporting queries forget, new developers unaware. Belt-and-suspenders: enforce at the database with Row-Level Security (CREATE POLICY active_only ON users USING (deleted_at IS NULL)). The database becomes the final authority — even a forgotten WHERE clause cannot see deleted rows.- Martin Fowler — Event Sourcing
- 2ndQuadrant — “Soft deletes considered harmful”
- GitHub Engineering — “How we handle deleted repositories at scale”
Q6: Explain the CAP theorem, then explain why saying 'we chose an AP database' is an oversimplification. Give concrete examples.
Q6: Explain the CAP theorem, then explain why saying 'we chose an AP database' is an oversimplification. Give concrete examples.
- Partition tolerance is not optional. Network partitions happen in any distributed system — a cable gets cut, a switch fails, a cloud AZ has connectivity issues. You do not get to “pick” partition tolerance. The real choice is: when a partition occurs, do you sacrifice consistency (return potentially stale data) or availability (return an error)?
-
The choice is not binary or permanent. Real databases make different trade-offs for different operations, different configurations, and even different queries. DynamoDB with
ConsistentRead: falseis AP (eventual consistency, always available). DynamoDB withConsistentRead: trueis CP (strong consistency, may reject reads during a partition if the leader is unreachable). Same database, different behavior per request. - CAP says nothing about the common case. Most of the time, there is no partition. During normal operation, the real trade-off is latency vs consistency — which is what PACELC captures. Cassandra is fast because it does not wait for all replicas to agree (low latency, weaker consistency). CockroachDB is slower because it uses distributed consensus on every write (higher latency, strong consistency). The daily engineering experience is dominated by this latency/consistency trade-off, not by partition behavior.
consistency level = QUORUM), it provides strong consistency — every read sees the most recent write as long as a majority of replicas are available. With consistency level = ONE, it is highly available but eventually consistent. The same Cassandra cluster is CP or AP depending on the consistency level you choose per query.MongoDB with writeConcern: { w: "majority" } and readConcern: "linearizable" provides strong consistency (CP behavior). With writeConcern: { w: 1 } and readConcern: "local", it is faster but may return stale data during a failover (AP behavior). MongoDB also has a concept of causal consistency sessions — where a client is guaranteed to see its own writes in order, without requiring full linearizability.PostgreSQL with synchronous replication is CP — it sacrifices availability (writes block) rather than return inconsistent data. But PostgreSQL with asynchronous replication has a window where the standby is behind, and reads from the standby are stale — which is an AP-like behavior for read-heavy workloads. Most production PostgreSQL setups are a hybrid: writes are always strongly consistent (go to primary), reads can be “eventually consistent” (from async replicas).The better framework is PACELC: During a Partition, choose Availability or Consistency. Else (normal operation), choose Latency or Consistency. This captures the daily trade-off that CAP ignores. Cassandra is PA/EL (available during partitions, low latency normally — at the cost of consistency in both cases). CockroachDB is PC/EC (consistent always, at the cost of availability during partitions and latency normally).Follow-up: A colleague says “we need Serializable consistency for our entire application.” How do you respond?
I would push back, respectfully. Serializable isolation is the strongest guarantee — it ensures that concurrent transactions behave as if they executed one at a time. But it comes with real costs:- Performance penalty. PostgreSQL’s Serializable Snapshot Isolation (SSI) tracks read and write dependencies across all concurrent transactions. When it detects a potential serialization anomaly, it aborts one of the transactions with a serialization failure. Your application must catch these failures and retry. Under high concurrency, the abort rate can be significant — I have seen 5-15% of transactions aborted in write-heavy workloads. Each abort wastes all the work done by that transaction.
- Throughput reduction. Because SSI sometimes aborts transactions speculatively (it errs on the side of caution), total throughput drops. Benchmarks typically show 20-40% lower throughput compared to Read Committed for write-heavy workloads.
- Most operations do not need it. Displaying a user’s dashboard (Read Committed is fine — slight staleness is invisible). Updating a user’s email (Read Committed with a unique constraint is sufficient). Generating a report (Repeatable Read to get a consistent snapshot). The operations that truly need Serializable are specific: double-booking prevention, inventory reservation where two buyers compete for the last item, financial transfers with multi-account balance constraints (write skew).
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, not as a database-wide default.Follow-up: How does CockroachDB achieve strong consistency across regions with acceptable latency?
CockroachDB uses the Raft consensus protocol — every write goes through a Raft group where a majority of replicas must agree before the write is committed. In a 3-region setup (us-east, us-west, europe), every write requires at least 2 of 3 replicas to acknowledge.The latency trick is “leaseholder locality.” CockroachDB assigns a “leaseholder” for each range (chunk of data). The leaseholder serves all reads and coordinates writes. If you configure the leaseholder to be in the same region as most readers and writers for that data (using zone configurations and locality-aware placement), reads are local (fast) and writes only need one cross-region round-trip for Raft consensus (the leaseholder + one nearby follower form a majority).For truly global applications, CockroachDB supports “stale reads” (AS OF SYSTEM TIME) that read from the local replica without a round-trip to the leaseholder. This trades consistency for latency on a per-query basis — similar to reading from an async replica in PostgreSQL, but with a bounded staleness guarantee.Google’s Spanner (which inspired CockroachDB) goes further with TrueTime — GPS and atomic clock synchronized timestamps that allow globally consistent reads without coordination. CockroachDB approximates this with hybrid-logical clocks (HLC), which are less precise but do not require specialized hardware.Q7: Design the database schema and indexing strategy for a multi-tenant SaaS application where each tenant has between 100 and 50 million rows. How do you prevent a large tenant from degrading performance for small tenants?
Q7: Design the database schema and indexing strategy for a multi-tenant SaaS application where each tenant has between 100 and 50 million rows. How do you prevent a large tenant from degrading performance for small tenants?
tenant_id column (pool model).
All tenants share the same tables. Every table has a tenant_id column. Every query includes WHERE tenant_id = ?. Every index is prefixed with tenant_id.tenant_id part of the primary key and every index means queries for one tenant always hit a contiguous range of the B-tree. PostgreSQL’s query planner can use index scans that skip directly to the tenant’s section.Option 2: Schema-per-tenant (silo model). Each tenant gets their own PostgreSQL schema (CREATE SCHEMA tenant_abc). Tables within the schema have no tenant_id column. Application sets search_path to the tenant’s schema on each request.Option 3: Database-per-tenant. Maximum isolation. Each tenant is a separate database instance (or a separate database within a PostgreSQL cluster). Used by Slack (one MySQL DB per workspace) and some enterprise SaaS products.My recommendation for most SaaS products: shared tables with tenant_id (Option 1) for small/medium tenants, with the ability to shard large tenants to dedicated infrastructure.Preventing noisy neighbors (the core of the question):- Row-Level Security (RLS) in PostgreSQL. Enforce tenant isolation at the database level, not just the application level:
app.current_tenant at the start of each request. Even if application code forgets the WHERE clause, the database enforces isolation. This prevents the most dangerous failure mode: one tenant seeing another tenant’s data.-
Query-level resource limits. PostgreSQL’s
statement_timeoutkills queries that run too long. Set it per-role: large enterprise tenants get 30 seconds, small tenants get 5 seconds. This prevents a poorly-written report from a large tenant from holding connections and locks for minutes. - Connection pool isolation. Dedicate a portion of the connection pool to each tenant tier. Large enterprise tenants get 20 dedicated connections. Small tenants share a pool of 50 connections. This prevents a single tenant from exhausting all connections during a traffic spike.
- Partitioning by tenant_id for the largest tenants. When a tenant grows past a threshold (say 10M rows), create a dedicated partition for them:
- Monitoring per-tenant query patterns. Track P99 query latency per tenant. If a tenant’s queries degrade, investigate before it affects others. pg_stat_statements with a tenant_id label (via the application, logged as a comment in the query) gives you per-tenant query performance visibility.
Follow-up: A tenant with 50M rows wants to run heavy analytics queries. This is destroying performance for other tenants. What do you do?
This is the classic OLTP vs OLAP collision. Never run analytics on the operational database.- Immediate mitigation: Route the tenant’s analytics queries to a read replica. This takes the load off the primary immediately. The replica might have 1-10 seconds of replication lag, which is acceptable for analytics.
- Medium-term solution: Set up a dedicated analytics pipeline. Replicate the tenant’s data to a columnar data warehouse (BigQuery, Redshift, Snowflake, or even DuckDB for smaller datasets). Analytics queries run there, completely isolated from the OLTP database. The cost of running Redshift for one tenant’s analytics is far less than the cost of degraded performance for all tenants.
- Product-level solution: Offer “advanced analytics” as a premium tier feature. Enterprise tenants who need heavy analytics get a dedicated read replica or data warehouse access. This aligns the cost (infrastructure) with the value (revenue).
Going Deeper: How does DynamoDB handle multi-tenancy differently from PostgreSQL, and when would you prefer it?
DynamoDB’s multi-tenancy is built into its architecture. Each table is automatically partitioned by the partition key. Iftenant_id is your partition key (or partition key prefix), each tenant’s data is physically isolated on different partitions, handled by different storage nodes. A hot tenant (high traffic) triggers automatic partition splits — DynamoDB adds more partitions for that tenant without affecting others. There is no shared B-tree, no shared connection pool, no noisy neighbor problem at the storage level.The trade-off: DynamoDB forces you to design your access patterns upfront. With PostgreSQL, you can add an index later when a new query pattern emerges. With DynamoDB, if you did not design the partition key and sort key for a particular access pattern, adding it later means creating a Global Secondary Index (which is eventually consistent) or restructuring the table. Single-table design in DynamoDB is powerful but requires deep upfront design work.I would prefer DynamoDB for multi-tenant SaaS when: (a) access patterns are well-defined and key-based (get tenant’s orders, get tenant’s users), (b) tenants vary wildly in size and you want automatic scaling without manual partitioning, (c) you want zero operational overhead for scaling. I would prefer PostgreSQL when: (a) access patterns involve complex queries, JOINs, or ad-hoc analytics, (b) you need strong multi-row transactions, (c) the team has deep SQL expertise and limited DynamoDB experience.Q8: Walk me through what happens at the database level when you run CREATE INDEX CONCURRENTLY on a table with 500 million rows in PostgreSQL. How is it different from a regular CREATE INDEX?
Q8: Walk me through what happens at the database level when you run CREATE INDEX CONCURRENTLY on a table with 500 million rows in PostgreSQL. How is it different from a regular CREATE INDEX?
CREATE INDEX:
PostgreSQL acquires a SHARE lock on the table. This blocks all writes (INSERT, UPDATE, DELETE) for the entire duration of the index build. Reads continue. On a 500M-row table, building a B-tree index takes 10-60 minutes depending on disk speed, row width, and available maintenance_work_mem. During those 10-60 minutes, every write to that table queues and waits. For a production system handling 1,000 writes per second, that means 600,000 to 3.6 million queued writes. Connection pools exhaust. Application timeouts fire. Cascading failures ensue. This is why you never run a bare CREATE INDEX on a large production table.CREATE INDEX CONCURRENTLY:
PostgreSQL builds the index in three phases without blocking writes:-
Phase 1 — Initial scan. PostgreSQL scans the entire table and builds the index from the current snapshot. During this scan, new writes continue normally — they modify the table but the in-progress index does not see them yet. This phase takes roughly the same time as a regular index build (10-60 minutes on 500M rows) but the table is fully writable. PostgreSQL acquires only a
ShareUpdateExclusiveLock, which blocks other DDL (you cannot run two concurrent index builds simultaneously) but allows all DML (INSERT/UPDATE/DELETE). - Phase 2 — Second scan. PostgreSQL performs a second pass to pick up any rows that were inserted, updated, or deleted during Phase 1. This catches the “delta” between the snapshot used in Phase 1 and the current state. For a table with moderate write volume, this phase is fast — it only processes the changes, not the full table.
- Phase 3 — Validation. PostgreSQL validates that the index is consistent with the table. For unique indexes, it checks uniqueness constraints against the final state. Once validated, the index is marked as valid and the query planner starts using it.
- Takes longer overall. Because it does two table scans instead of one,
CREATE INDEX CONCURRENTLYtakes roughly 1.5-2x as long as a regularCREATE INDEX. On a 500M-row table, expect 15-90 minutes. - Uses more I/O. The double scan means double the disk reads. On an I/O-constrained system, this can impact query performance during the build. Monitor disk utilization and consider running it during off-peak hours.
- Cannot run in a transaction block.
CREATE INDEX CONCURRENTLYcannot be wrapped in aBEGIN ... COMMITblock. If it fails partway through (out of disk space, unique violation), it leaves an invalid index. You need to checkpg_index.indisvalidand clean up withDROP INDEX CONCURRENTLYif it failed. - maintenance_work_mem matters. The sort phase of index building uses
maintenance_work_mem(default 64MB). For a 500M-row index build, increasing it to 1-2GB dramatically reduces sort time because more of the sort can happen in memory instead of spilling to disk.
pg_stat_progress_create_index (PostgreSQL 12+) to track progress in real time — it tells you which phase you are in, how many tuples have been processed, and the total tuple count.Follow-up: The concurrent index build fails halfway through because of a unique constraint violation. What state is the database in? How do you recover?
The database has an “invalid” index. It exists in the catalog (pg_index) with indisvalid = false. The query planner will not use it, but it still consumes disk space and still slows down writes (because PostgreSQL maintains it on inserts/updates even though it is invalid).Recovery: DROP INDEX CONCURRENTLY idx_orders_customer_status; — this removes the invalid index without blocking writes. Then fix the data issue (find and resolve the duplicate values), and retry the CREATE INDEX CONCURRENTLY.The dangerous mistake: not cleaning up invalid indexes. I have seen production databases with multiple invalid indexes from failed builds — each one adding write overhead without providing any read benefit. Periodically audit for invalid indexes:Follow-up: How does MySQL handle this differently? Can you create indexes without blocking writes in MySQL?
MySQL InnoDB handles this differently. Since MySQL 5.6, mostALTER TABLE ... ADD INDEX operations are “online DDL” — they allow concurrent reads and writes during the index build. MySQL uses a mechanism similar to PostgreSQL’s concurrent build: it scans the table, builds the index, captures the delta of writes during the build, then applies the delta.The key difference: MySQL’s online DDL acquires brief metadata locks at the start and end of the operation (which can still cause brief stalls if there are long-running transactions holding locks), but the bulk of the index build allows concurrent DML. PostgreSQL’s CREATE INDEX CONCURRENTLY is more explicitly designed for zero-write-blocking, but MySQL’s online DDL achieves a similar practical result.For very large MySQL tables (billions of rows), teams often use gh-ost or pt-online-schema-change instead of native online DDL, because these tools provide better throttling (they pause when replication lag exceeds a threshold), more predictable impact, and the ability to abort without leaving the table in an intermediate state.Q9: Your team is debating between REST and gRPC for communication between two internal microservices. One service handles 15,000 RPS to the other. What is your recommendation and why?
Q9: Your team is debating between REST and gRPC for communication between two internal microservices. One service handles 15,000 RPS to the other. What is your recommendation and why?
- Wire efficiency. Protocol Buffers are 3-10x smaller than JSON. At 15,000 RPS, if the average JSON payload is 2KB, that is 30MB/second of serialization overhead. Protobuf reduces that to 3-6MB/second. More importantly, serialization/deserialization CPU cost drops significantly — protobuf serialization is 5-20x faster than JSON parsing because it is a simple binary format with no parsing ambiguity (no string escaping, no whitespace handling, no number type inference).
- HTTP/2 multiplexing. REST over HTTP/1.1 uses one TCP connection per concurrent request (or connection pooling with head-of-line blocking). At 15,000 RPS with 10ms average latency, you need ~150 concurrent connections. gRPC runs over HTTP/2, which multiplexes all 15,000 RPS over a handful of TCP connections (typically 1-4 per client instance). This reduces TCP overhead, reduces connection management complexity, and plays better with load balancers.
-
Code-generated clients and servers. The
.protofile is the contract. Both teams generate strongly-typed client and server stubs from it. No hand-written HTTP clients, no JSON parsing bugs, no argument about field naming conventions. At 15,000 RPS, even a 0.01% error rate from a parsing bug is 1.5 errors per second — enough to fire alerts. - Streaming capability. If the use case evolves to need server-side streaming (one service pushing updates to the other), gRPC supports it natively. Adding WebSocket or SSE to a REST API is a fundamentally different protocol.
- Operational complexity. gRPC requires proto file management, code generation pipelines, and proto file versioning. If the team has no gRPC experience, the ramp-up cost is real — 2-4 weeks before the team is productive. REST with JSON is universally understood.
-
Debugging difficulty. You cannot
curla gRPC endpoint. You needgrpcurlor Postman with gRPC support. Reading a protobuf response in a network trace requires decoding tools. When something goes wrong at 3 AM, the ability tocurlan endpoint and read the JSON response is genuinely valuable. - Load balancer compatibility. gRPC requires HTTP/2-aware load balancing. Traditional L4 load balancers work, but L7 load balancing (routing based on gRPC method, header-based routing) requires specific support (Envoy, AWS ALB with gRPC support, nginx with grpc_pass). If your infrastructure is not already HTTP/2 ready, this is a prerequisite.
- Is 15,000 RPS actually stressing REST? A well-implemented REST API with connection pooling, HTTP/2 (yes, REST can use HTTP/2 too), and efficient JSON serialization (simdjson, orjson) can handle 15,000 RPS without breaking a sweat. The serialization overhead at 15K RPS is probably 2-5% of total CPU. If the service is doing meaningful work (database queries, business logic), the JSON vs protobuf difference is noise.
Follow-up: The team picks gRPC. Six months later, a new frontend team wants to call this service from the browser. What is your approach?
gRPC does not work natively in browsers — browsers cannot make HTTP/2 requests with the gRPC framing that gRPC requires. There are three options:- gRPC-Web. A protocol adaptation that works over HTTP/1.1. The browser sends gRPC-Web requests to an Envoy proxy (or any gRPC-Web compatible proxy), which translates them to standard gRPC calls to the backend. The frontend team uses a generated gRPC-Web client. This preserves strong typing and the proto contract but requires the proxy layer.
- REST gateway. Add a thin REST API that translates HTTP/JSON to gRPC calls. Tools like grpc-gateway (Go) or Envoy’s gRPC-JSON transcoding can auto-generate REST endpoints from proto definitions. The frontend team uses a standard REST client. This is the lowest friction for the frontend team but adds a translation layer.
- Connect protocol (Buf). A newer protocol from the Buf team (creators of protobuf tooling) that is wire-compatible with gRPC but also supports standard HTTP/1.1 with JSON. A single service implementation serves both gRPC clients (binary, HTTP/2) and browser clients (JSON, HTTP/1.1). This is increasingly the recommended approach for new projects that need both internal gRPC and browser access.
Going Deeper: How does protobuf backward compatibility work, and what are the rules for evolving a proto file without breaking clients?
Protobuf’s backward compatibility rules are critical for service evolution:Safe changes (never break existing clients):- Adding new fields (with a new field number). Old clients ignore unknown fields. New clients use default values for missing fields.
- Adding new enum values. Old clients treat unknown enum values as the default (0 in proto3).
- Adding new RPC methods to a service. Old clients never call them.
- Changing a field’s type (int32 to string).
- Changing a field number (the field number is the wire identifier, not the name).
- Removing a field (old clients that set this field will silently lose data; old clients that read it will get the default value, which may violate assumptions).
- Renaming a field (safe on the wire since protobuf uses numbers, but breaks generated code that references the old name).
reserved 7; so no one accidentally reuses it later. A new field with the same number but a different type would silently corrupt data.In practice, I manage this with a CI check: buf breaking --against .git#branch=main (from the Buf tool) compares the current proto file against the main branch and fails the build if any breaking change is detected. This is the proto equivalent of contract testing for REST APIs.Q10: Explain the difference between optimistic and pessimistic locking. Then describe a real scenario where choosing the wrong one caused a production incident.
Q10: Explain the difference between optimistic and pessimistic locking. Then describe a real scenario where choosing the wrong one caused a production incident.
SELECT ... FOR UPDATE in SQL). Other transactions attempting to modify the same resource block until you release the lock (on COMMIT or ROLLBACK). Think of it like locking a bathroom door — only one person at a time, everyone else waits in line.Optimistic locking assumes conflicts are rare. You read the resource with a version number, do your work, and at write time, check if the version has changed (UPDATE ... SET version = version + 1 WHERE id = ? AND version = ?). If the update affects 0 rows, someone else modified it first — you retry or fail. Think of it like a collaborative document where you edit your copy and check at save time whether anyone else changed it.When to use each:- Pessimistic: High contention — multiple users frequently modifying the same resource. Example: seat reservation for a popular concert (thousands of people trying to grab the same seats simultaneously). Inventory decrements for a flash sale. Account balance updates in a banking system. The cost of contention is high, and you would rather wait briefly than retry repeatedly.
- Optimistic: Low contention — conflicts are rare. Example: editing a user profile (how often do two people edit the same profile simultaneously?). Updating configuration settings. Modifying a shopping cart. Most of the time, there is no conflict, and the version check is just a cheap WHERE clause addition.
WHERE version = 1.The first buyer succeeded: inventory went to 99, version to 2. The remaining 4,999 requests all had version 1, so their updates affected 0 rows. They retried: read inventory (version 2), retry payment, decrement with WHERE version = 2. But only one of those 4,999 retries succeeded. The remaining 4,998 retried again. This created an avalanche — each “round” sold one unit but required every remaining buyer to retry. By the time 100 units were sold, the system had processed roughly 100 * 5,000/2 = 250,000 retry cycles. The database was overwhelmed with read-modify-write cycles. Payment timeouts cascaded. Some users were charged but did not get their item because the inventory update failed after payment succeeded.The fix: Switch to pessimistic locking for flash sale inventory. SELECT quantity FROM inventory WHERE product_id = ? FOR UPDATE SKIP LOCKED acquires the lock or immediately skips (returns nothing) if already locked. The SKIP LOCKED is key — it means concurrent buyers do not queue up waiting for each other. They either get the lock and proceed, or immediately get “sold out.” The database handles this efficiently because it is a single-row lock.Even better: use a queue-based approach. Funnel all purchase requests into a queue. A single consumer processes them sequentially — no locking needed because there is no concurrency. The queue absorbs the burst, the consumer processes at a steady rate, and buyers get a “your order is being processed” response immediately.Follow-up: How does SELECT ... FOR UPDATE SKIP LOCKED actually work, and what is the difference from NOWAIT?
FOR UPDATE acquires a row-level exclusive lock. Three flavors:-
FOR UPDATE(plain): Blocks until the lock is available. If another transaction holds it, you wait. Simplest, but under high contention, you get long wait queues and potential deadlocks. -
FOR UPDATE NOWAIT: Tries to acquire the lock. If it is already held, immediately throws an error (ERROR: could not obtain lock on row). Useful when you want to fail fast rather than wait — the application catches the error and retries or returns an error to the user. -
FOR UPDATE SKIP LOCKED: Tries to acquire the lock. If the row is already locked, skips that row entirely (as if it did not exist in the result set). This is designed for “grab the next available item” patterns — worker queue processing, ticket reservation, any scenario where you want to grab any available row, not a specific row.
SKIP LOCKED:pgboss, Oban (Elixir), and Que (Ruby).Going Deeper: How does MVCC relate to locking in PostgreSQL? Do readers block writers?
In PostgreSQL, readers never block writers and writers never block readers. This is the fundamental benefit of MVCC (Multi-Version Concurrency Control).When a transaction reads a row, it sees the version of the row that was committed before the transaction’s snapshot timestamp. It does not acquire any lock that would block a concurrent writer. When a transaction writes a row, it creates a new version of the row — the old version remains visible to transactions that started before the write.The exception:SELECT ... FOR UPDATE explicitly requests a lock. This is an advisory statement saying “I intend to modify this row, so block other writers.” Without FOR UPDATE, a concurrent reader and writer operate completely independently.This is fundamentally different from MySQL InnoDB in REPEATABLE READ mode, which uses gap locks and next-key locks that can cause readers to block writers in some scenarios (particularly range queries). PostgreSQL’s snapshot isolation is purely version-based — it never uses lock-based read isolation (except at SERIALIZABLE, where it uses predicate locks for SSI detection, but even those do not cause blocking — they only cause transaction aborts if an anomaly is detected).The cost of MVCC: dead tuples. Every update creates a new row version and marks the old one as dead. VACUUM must periodically clean up dead tuples to reclaim space and prevent index bloat. On a write-heavy table with 50,000 updates per second, VACUUM must keep pace or the table grows unboundedly. This is the single most important PostgreSQL operational concern — autovacuum tuning.Q11: You need to migrate a production API from v1 to v2 with breaking changes. 200 third-party integrations depend on v1. How do you execute this without breaking anyone?
Q11: You need to migrate a production API from v1 to v2 with breaking changes. 200 third-party integrations depend on v1. How do you execute this without breaking anyone?
-
Deploy v2 alongside v1. Both versions run simultaneously. The v2 endpoints live at
/v2/resources. The v1 endpoints remain untouched at/v1/resources. This is the “expand” phase of expand-and-contract. - Internally, both versions share the same backend logic. Do not maintain two separate codebases. Implement a version transformation layer — v1 requests are transformed to the internal canonical format, processed, and the response is transformed back to v1 format. v2 requests use the canonical format directly (or a different transformation). Stripe does exactly this — they maintain hundreds of API versions through a chain of version-specific request/response transforms, all running against a single backend.
- Use a version router. The API gateway or a middleware layer inspects the version (URL path, header, or the consumer’s pinned version) and applies the appropriate transformations. This means the business logic does not need to know about versions at all.
- Publish a migration guide for every breaking change. Not just “field X was renamed to Y” — but working code examples showing the v1 call, the equivalent v2 call, and how to handle the migration in common languages (Python, Node, Go, Java). Stripe’s migration guides are the gold standard here.
- Add deprecation headers to v1 responses:
- Email all 200 integrators with a timeline: v2 available now, v1 deprecated on date X, v1 sunset on date Y. Give at least 12 months between deprecation and sunset for public APIs.
- Provide a migration testing sandbox. Let integrators test their updated code against v2 in a staging environment before committing. Offer to review their migration code if they are a large partner.
- Track per-consumer version usage. Dashboard showing which consumers are still on v1, their traffic volume, and last request timestamp. Reach out to high-volume v1 consumers individually.
- Default new consumers to v2. New API keys get v2 by default. New documentation defaults to v2. Move the v1 docs to a “legacy” section but keep them accessible.
- Progressive rate limit tightening on v1. Optional and aggressive, but effective: reduce v1 rate limits by 10% per quarter after the deprecation date. This creates gentle pressure without breaking anyone. Communicate this in advance.
-
Sunset v1. After the announced date, v1 endpoints return
410 Gonewith a response body containing the migration guide URL. Do not return 404 — 410 explicitly means “this existed but has been intentionally removed.” - Keep the v1 transformation code for at least 6 more months after sunset. If a critical partner missed the migration, you can temporarily re-enable it.
Follow-up: How do you handle a breaking change in the database schema that affects both v1 and v2 responses?
The expand-and-contract pattern applies at the database level too. Suppose v1 returns{ "name": "Alice Smith" } and v2 splits this into { "first_name": "Alice", "last_name": "Smith" }.-
Expand the schema: Add
first_nameandlast_namecolumns alongside the existingnamecolumn. Backfill by parsing the existingnamefield. New writes populate all three columns. -
Version transforms: v1 response continues to read from
name(or concatenatefirst_name + last_nameifnamehas been deprecated). v2 response reads fromfirst_nameandlast_name. Both versions work against the same database. -
Contract: After v1 is sunset and all code paths reading
nameare removed, drop thenamecolumn. Do this with a non-locking migration (on large tables, useALTER TABLE ... DROP COLUMNwhich in PostgreSQL is just a catalog update — the column is hidden, not physically removed, so it is instant).
Follow-up: A partner refuses to migrate and threatens to leave the platform. What do you do?
This is a business decision, not a technical one. The answer depends on the partner’s revenue contribution.If the partner represents significant revenue: offer to maintain their v1 integration as a “legacy support” tier — possibly with a dedicated support fee. The version transformation layer makes this cheap technically (it is just maintaining one more transform in the chain). Set a clear end-of-life date (24 months instead of 12) and assign a dedicated solutions engineer to help them migrate.If the partner is low-revenue: empathize, provide migration assistance, offer a reasonable extension (3-6 months), but hold the sunset date. Maintaining a legacy API version indefinitely for one consumer creates a maintenance burden that compounds over time.The meta-lesson: API versioning is a product management problem. The version transformation architecture makes it technically feasible to support many versions simultaneously, but the product decision is how long to support each version and at what cost. This is why Stripe treats their API version lifecycle as a first-class product concern, not just an engineering task.Q12: A junior developer asks you: 'Why can't I just use SELECT * and add all the columns I might ever need to every table? Storage is cheap.' How do you explain why this thinking is wrong?
Q12: A junior developer asks you: 'Why can't I just use SELECT * and add all the columns I might ever need to every table? Storage is cheap.' How do you explain why this thinking is wrong?
- Narrow table: ~6,700 pages, ~52 MB of I/O
- Wide table: ~250,000 pages, ~1.95 GB of I/O
SELECT * on the wide table reads all 2KB per row even if you only need the 50 bytes of id, name, email. SELECT id, name, email on a table with a covering index never touches the table at all — it reads the answer directly from the index, which is compact and almost certainly cached in memory.2. Network cost is not free.The database sends results over the network to the application server. At 10,000 queries per second returning 20 rows each:SELECT id, name, emailat ~100 bytes per row: 20 MB/secondSELECT *with a 2KB row: 400 MB/second
SELECT id, name FROM users WHERE email = 'alice@example.com' with an index on (email) INCLUDE (id, name) is an index-only scan: one B-tree lookup, ~0.1ms. But SELECT * always touches the heap because no index contains every column. Even if you have the perfect index, SELECT * defeats it.5. Schema evolution becomes dangerous.Someone adds a profile_image BYTEA column to the users table (a 500KB blob). Every SELECT * in the codebase now returns 500KB per row, silently. No code changed, no deployment happened, but API response sizes 100x overnight. With explicit column lists, the new column is invisible to existing queries until someone deliberately adds it.How I would explain it to the junior developer:“Storage is cheap, but reading is not free. Every SELECT * forces the database to read, serialize, and transmit every column over the network to your application, which then allocates memory for every field. The cost is not the disk — it is the I/O, the network bandwidth, the application memory, and the fact that adding a large column to a table silently breaks every SELECT * in the codebase. Always select the columns you need. It is like ordering the entire menu at a restaurant because food is cheap — you still have to carry it all to the table, and most of it goes to waste.”Follow-up: When IS it acceptable to use SELECT *?
There are a few legitimate cases:-
Ad-hoc exploration. When debugging in a psql terminal,
SELECT * FROM users WHERE id = 42 LIMIT 1is fine. You are looking at data, not running production queries. -
EXISTSsubqueries.SELECT 1 FROM ... WHERE EXISTS (SELECT * FROM orders WHERE ...)— theSELECT *inside EXISTS is never materialized. The database just checks for row existence. Practically identical toSELECT 1. - Very early prototyping. When you are sketching a feature and the schema is changing hourly. But this should be replaced with explicit column lists before merging to main.
-
When you genuinely need every column and the table is narrow. A 5-column config table where every query needs all columns —
SELECT *is fine. But even here, explicit columns protect you against future schema changes.
SELECT * is for humans in terminals. Production code should always specify columns.Advanced Interview Scenarios
These questions go beyond foundational knowledge into the territory where real production experience separates candidates. Each scenario is designed so that the “textbook” answer is incomplete or wrong. The questions reward engineers who have been on-call, debugged 3 AM incidents, and shipped migrations on tables with billions of rows. If you can answer these with specific tool names, real metrics, and trade-offs you have personally lived through — you will stand out.Q13: Your team's API adds an optional field to a response. No fields were removed, no types changed. A partner's integration breaks in production. How is this possible, and whose fault is it?
Q13: Your team's API adds an optional field to a response. No fields were removed, no types changed. A partner's integration breaks in production. How is this possible, and whose fault is it?
The trap: “Additive changes are always safe” is the textbook answer. It is also wrong in practice.
What weak candidates say: “Additive changes are non-breaking by definition. The partner must have a bug in their code. We should tell them to fix their deserialization.”What strong candidates say:This is one of the most common real-world API incidents, and it catches teams who believe the “only removals are breaking changes” rule without understanding how client code actually works.How an additive change breaks clients — five real mechanisms:-
Strict deserialization. Many clients use statically typed languages with strict JSON parsers. Java’s Jackson with
FAIL_ON_UNKNOWN_PROPERTIES = true(the default in some frameworks) throws an exception when it encounters a field not defined in the model class. .NET’sSystem.Text.Jsonbehaves similarly by default. The partner integrated six months ago, generated their DTOs, and never updated them. Your new field is “unknown” to their parser, and their code throws. I have personally seen this break a $2M/year partner integration at a fintech company — a single newmetadatafield on an invoice response crashed their nightly reconciliation pipeline for 14 hours. -
Response size increase triggering limits. If the new field is a nested object or array, the response payload grows. Partners with a 64KB response body limit in their HTTP client configuration, or a Lambda function with a 6MB response payload limit, suddenly start failing. At one company, adding a
line_itemsarray to an order response pushed the payload for large orders from 45KB to 180KB, exceeding a partner’s nginxproxy_buffer_sizeof 128KB. The response was silently truncated, producing invalid JSON. - Database/cache key changes. If the partner hashes the response body for caching or deduplication, a new field changes the hash. Their cache hit rate drops to zero overnight. Their “is this a duplicate webhook?” check starts treating every callback as new because the signature of the payload changed.
- Snapshot testing and CI pipelines. The partner’s integration test suite does exact-match comparison against a stored response fixture. The new field causes every test to fail, blocking their deployments. They cannot ship any code until they update all fixtures — which for a large integration could be hundreds of test files.
-
Field name collisions. The new field name collides with a field the partner’s application uses internally. If they flatten the API response into a flat namespace (common in ETL pipelines),
statusfrom your API collides with their internalstatusfield.
2025-01-15 never sees a field added in 2025-06-01 until they explicitly upgrade their version. This is the gold standard — but it requires the version transformation layer described in Q11.War Story: At a payments platform processing 40M API calls/day, we added an enhanced_risk_score field to the transaction response. The field was optional, nullable, well-documented in the changelog. Within 4 hours, three partners reported failures. Partner A had strict deserialization in their Java service. Partner B’s webhook signature verification failed because they computed HMAC over the raw response body (new field changed the hash). Partner C’s Snowflake ingestion pipeline broke because their COPY command expected a fixed number of JSON keys per row. We ended up rolling back the field, adding it only to v3 of the API, and sending a “breaking change notice” even though, by our own documentation, it was not breaking. The lesson: the theoretical contract (“ignore unknown fields”) and the practical contract (what client code actually does) are different things.Follow-up: How do you prevent this class of incident going forward?
Build a consumer compatibility testing pipeline. Before deploying any API change — even additive ones — replay the last 24 hours of each partner’s actual request/response pairs through the new code. If the response shape changes in a way that would break strict deserialization (new fields appear), flag it for review. Stripe, Twilio, and Plaid all have internal “compatibility gates” that catch this. For smaller teams: maintain a set of “canonical partner response fixtures” in your test suite that are validated on every deploy. Any fixture diff requires explicit approval.Follow-up: A partner demands you never add fields to existing endpoints. How do you respond?
This is an unreasonable demand that would freeze your API’s evolution. But the underlying concern is valid — they want stability. Offer them two things: (1) pin them to a versioned API where response shapes are frozen per version, and (2) provide a schema registry (OpenAPI spec published per version) so their code generation tools can stay in sync. If they refuse both and insist on a frozen response format forever, that is a contract negotiation, not an engineering problem. Involve product and legal.Q14: Production alert fires at 2 AM: your PostgreSQL database has 15 deadlocks per minute. Walk me through exactly how you diagnose and resolve this without restarting anything.
Q14: Production alert fires at 2 AM: your PostgreSQL database has 15 deadlocks per minute. Walk me through exactly how you diagnose and resolve this without restarting anything.
This is a debugging war story question. Candidates who have never been on-call will flounder.
What weak candidates say: “I would increase the lock timeout or add retry logic.” — This treats the symptom, not the cause. You will still have 15 deadlocks per minute, just with more retries consuming more resources.What strong candidates say:Deadlocks at 15/min means two or more transactions are consistently acquiring locks in conflicting orders on the same rows. This is not random — it is a deterministic code pattern. Here is exactly how I would diagnose it, step by step, using real PostgreSQL tooling.Step 1: Confirm the deadlocks and capture the details (2 minutes).PostgreSQL logs deadlock details automatically whenlog_lock_waits = on and deadlock_timeout is set (default 1 second). Check the PostgreSQL log:log_directory.Also query pg_stat_activity for currently blocked sessions:INSERT INTO order_items has no idea they are locking a row in the orders table.Pattern C: Index contention on HOT updates. Less common, but on tables with many indexes, an UPDATE that changes indexed columns can acquire locks on multiple index entries. Two concurrent updates on adjacent rows can deadlock on the same index page.Step 3: Fix it (15-30 minutes for a hotfix).For Pattern A (most likely): identify the application code path doing multi-row updates in inconsistent order. Add explicit ordering:SET CONSTRAINTS ... DEFERRED within the transaction.Step 4: Prevent recurrence.Add a monitoring alert on pg_stat_database.deadlocks (a cumulative counter). If deadlocks/minute exceeds 1, alert. Zero deadlocks is the goal — any deadlock in production indicates a code pattern that should be fixed, not tolerated.Add SET lock_timeout = '5s' to application transactions as a safety net. This causes a transaction to fail fast rather than wait indefinitely for a lock, reducing the deadlock detection window.War Story: At an e-commerce company processing 3,000 orders/minute during Black Friday, we hit 200+ deadlocks/minute at peak. The root cause: a batch fulfillment service was updating order statuses in the order they appeared in the warehouse pick list (sorted by bin location), while a customer-facing service was updating the same orders in the order customers placed them (sorted by timestamp). Same orders, different lock ordering. The fix was a one-line change — ORDER BY order_id in both code paths. Deadlocks dropped to zero within 60 seconds of deployment. Total time from alert to fix: 23 minutes. Total revenue at risk during those 23 minutes: approximately $150K based on our average order value and error rate.Follow-up: How do deadlocks differ between PostgreSQL and MySQL InnoDB?
PostgreSQL detects deadlocks by running a deadlock detector everydeadlock_timeout seconds (default 1s). When detected, it kills the youngest transaction (least work done) and lets the other proceed. The detection is a waits-for graph cycle check.MySQL InnoDB checks for deadlocks on every lock wait (not periodically). It uses a waits-for graph and immediately rolls back the transaction with the fewest rows modified. However, MySQL’s gap locks in REPEATABLE READ mode create deadlocks that would not happen in PostgreSQL. A range query like SELECT * FROM orders WHERE amount BETWEEN 100 AND 200 FOR UPDATE locks the index gap, which can block inserts that PostgreSQL would allow. This is why MySQL deadlocks are more frequent and more surprising than PostgreSQL deadlocks in equivalent workloads.Follow-up: When are deadlocks actually acceptable?
When the retry cost is low and the alternative (global lock ordering) is impractical. In a system where thousands of transactions operate on overlapping sets of rows and the sets are not known in advance (think: a graph database updating connected nodes), enforcing a global lock order is architecturally infeasible. In this case, design for deadlock tolerance: wrap every transaction in a retry loop (3 retries with jitter), monitor the deadlock rate, and alert if it exceeds a threshold (e.g., 0.1% of transactions). PostgreSQL’s deadlock detection is fast enough that the overhead of occasional deadlock retries is negligible compared to the cost of global lock ordering in complex workloads.Q15: You inherit a service where every API endpoint does a direct database query. There are no caches. The database is at 90% CPU. A senior engineer says 'just add Redis in front of everything.' What is wrong with that advice, and what do you actually do?
Q15: You inherit a service where every API endpoint does a direct database query. There are no caches. The database is at 90% CPU. A senior engineer says 'just add Redis in front of everything.' What is wrong with that advice, and what do you actually do?
The “obvious answer is wrong” question. Caching everything is a common anti-pattern.
What weak candidates say: “Sounds right. We should add Redis caching to every endpoint with a 5-minute TTL. That will reduce database load by 80%.”What strong candidates say:“Just add Redis in front of everything” is the database equivalent of “just throw more hardware at it.” It might reduce load, but it creates an entirely new class of problems and may not even fix the root cause. Here is why I would push back and what I would do instead.First, diagnose why the database is at 90% CPU.90% CPU does not tell you the story. Is it 90% from thousands of efficient queries at high volume? Or 90% from three terrible queries doing sequential scans on 500M-row tables? The fix is radically different.- Cache invalidation becomes your new primary problem. Every write now needs to invalidate the right cache keys. Miss one, and users see stale data. With 50 endpoints, you need 50 invalidation paths. The joke “there are only two hard things in computer science: cache invalidation and naming things” exists because cache invalidation in a complex system is genuinely harder than the database queries it replaces.
- Thundering herd on cache expiry. If a hot cache key expires and 1,000 requests simultaneously hit the database, you get a load spike worse than having no cache. With TTL-based caching on every endpoint, you get mini thundering herds constantly as different keys expire. Solutions exist (cache stampede locks, stale-while-revalidate), but they add complexity to every endpoint.
- Memory cost is non-trivial. A Redis instance with 50GB of cached API responses is expensive and operationally complex (Redis cluster, failover, persistence, monitoring). If most of those responses are accessed once per TTL window, you are paying for memory to store data nobody will read before it expires.
- You now have two data stores to keep consistent. Every bug investigation now requires checking both the database and the cache. “Is the user seeing stale data because of a cache issue or a replication lag issue?” Debugging complexity increases quadratically with the number of data stores.
pg_stat_statements to find the worst queries. Add missing indexes. Rewrite N+1 patterns. This alone typically reduces CPU by 40-60%. Cost: zero infrastructure, just SQL.Step 2: Add connection pooling if missing. If each request opens a new connection, the overhead is enormous. PgBouncer in transaction mode with a pool of 20 connections serving 200 application instances eliminates connection management overhead and can reduce CPU by 10-20%.Step 3: Add targeted caching for specific hot paths. Not “cache everything” — cache the 5 endpoints that account for 80% of traffic AND whose data tolerates staleness. A product catalog page that updates once per hour? Cache it for 5 minutes. A user’s account balance? Never cache it — stale financial data is a support ticket.Step 4: Read replicas for read-heavy paths. If 85% of queries are reads, route them to a read replica. This halves the primary’s CPU immediately. Combine with caching: cache in Redis for the hottest data, read replica for the long tail.Step 5: THEN consider broader caching. Only after steps 1-4 are exhausted. And use cache-aside (lazy loading) with explicit TTLs per endpoint, not a blanket “cache everything for 5 minutes.”War Story: At a B2B SaaS company, the database hit 95% CPU. The team’s instinct was to add Redis. I spent 30 minutes with pg_stat_statements and found three things: (1) A dashboard endpoint was running SELECT COUNT(*) FROM events WHERE tenant_id = ? on a 200M-row table with no index on tenant_id — 45 seconds per call, called every time the dashboard loaded. Adding a composite index reduced it to 3ms. (2) An ORM was generating N+1 queries for a list endpoint — 1 query for the list, then 1 query per item for a related entity. 50 items = 51 queries. Adding .include(:related_entity) collapsed it to 2 queries. (3) A background job was running ANALYZE on the entire database every 5 minutes (someone copy-pasted a Stack Overflow answer). Changing it to rely on autovacuum freed 15% CPU. After these three fixes, CPU dropped from 95% to 35%. We never added Redis. Total cost: zero dollars. Time: one afternoon.Follow-up: When IS “add Redis everywhere” the right answer?
When the query layer is already optimized, and the bottleneck is genuinely read volume at a scale that a single database (even with replicas) cannot handle. If you have 500,000 reads/second of a product catalog that changes once per hour, caching it in Redis with a 60-second TTL is correct. The pattern: high read volume + low write frequency + tolerance for staleness = cache aggressively. But that describes maybe 10-20% of endpoints in a typical application, not “everything.”Follow-up: How do you handle cache invalidation for data that changes unpredictably?
Use cache-aside with short TTLs for unpredictable data (30-60 seconds). The cache absorbs burst reads, and the TTL guarantees eventual freshness. For data that MUST be fresh on write (user updates their name, should see it immediately), use write-through caching: on every write, update the database AND the cache in the same operation. If the cache update fails, delete the cache key (so the next read fetches from the database). Never let a stale cache key persist after a successful write — this is the most common caching bug in production.Q16: You need to shard a PostgreSQL database that has grown to 2TB with 3 billion rows. The CEO wants it done in two weeks. Walk me through your response.
Q16: You need to shard a PostgreSQL database that has grown to 2TB with 3 billion rows. The CEO wants it done in two weeks. Walk me through your response.
This question tests whether you recognize when NOT to do something, and how you manage up when a technical decision is being driven by unrealistic timelines.
What weak candidates say: “We can use hash-based sharding on the primary key. Set up two PostgreSQL instances, split the data by hash, and update the application to route queries.” — This is technically not wrong, but the timeline confidence and the lack of risk awareness is the red flag.What strong candidates say:The first thing I would do is have an honest conversation with the CEO about what “done in two weeks” actually means and what the risks are. Sharding a production 2TB PostgreSQL database is a 3-6 month project for an experienced team. Rushing it is how you get a week-long outage that costs more than the original performance problem. Let me explain why, and then what I would actually do in two weeks.Why sharding is a 3-6 month project, not 2 weeks:- Shard key selection is an irreversible architectural decision. Choose the wrong shard key and you get hot shards, cross-shard queries for common operations, and eventual data migration that is harder than the original sharding. You need to analyze every query pattern, every JOIN, every foreign key relationship. For a 3B-row database that has accumulated years of access patterns, this analysis alone takes weeks.
-
Cross-shard queries. Any query that does not include the shard key becomes a scatter-gather across all shards. If you shard orders by
customer_id, the querySELECT * FROM orders WHERE created_at > '2025-01-01'must hit every shard. If 40% of your queries are date-range queries, sharding by customer_id makes 40% of your queries slower, not faster. You need to audit every query path before choosing the shard key. - Application changes are pervasive. Every database query in the application needs to be shard-aware — either the application routes to the correct shard, or a proxy (like Vitess or Citus) handles routing. If you use application-level sharding, every ORM query, every raw SQL call, every migration script, every background job needs to know about shards. This is months of engineering.
- Data migration without downtime. Moving 2TB of data from one database to multiple shards while the application continues to serve traffic requires dual-writing, backfilling, verification, and cutover. At 3B rows and typical migration speeds (50K-100K rows/second with backfill), the migration itself takes 8-16 hours. During that window, any write to the old database must be replicated to the new shards.
- Operational complexity permanently increases. Instead of one database to monitor, back up, upgrade, and troubleshoot, you now have N databases. Schema migrations run across all shards (Vitess handles this; DIY sharding does not). Backup and restore procedures multiply. On-call complexity increases.
- Run
pg_stat_statementsanalysis. Find and fix the top 10 worst queries (missing indexes, sequential scans, N+1 patterns). On a 3B-row database, a single missing index can account for 30-40% of CPU. - Add read replicas. Route all reporting, analytics, and read-heavy API endpoints to replicas. This can cut primary load by 60-70%.
- Implement connection pooling with PgBouncer if not already in place.
- Partition the largest tables by date range (declarative partitioning). If the 3B rows are time-series-ish (orders, events, logs), partitioning by month means queries for “last 30 days” scan 1 partition instead of 36 months of data. This is a massive win with far less risk than sharding.
- Add Redis caching for the top 5 read-heavy endpoints.
- Begin the shard key analysis: catalog all query patterns, JOIN paths, and access patterns. This is the prerequisite for sharding, and starting it now means the eventual sharding project starts from a position of knowledge rather than guessing.
- Present a realistic sharding timeline (3-6 months) with the query analysis as evidence for why it cannot be rushed.
SELECT create_distributed_table('orders', 'customer_id') and it handles shard routing, cross-shard joins, and distributed transactions transparently. The application changes are minimal compared to DIY sharding. Migration is still a multi-week project, but the operational overhead is dramatically lower.War Story: A fintech company I worked with had a 1.5TB PostgreSQL database doing 8K QPS. Leadership wanted to shard within a month. Instead, we spent two weeks on: (1) adding a composite index to the transactions table that eliminated a sequential scan on 800M rows — this alone dropped P99 query latency from 12 seconds to 40ms. (2) Partitioning the transactions table by quarter — queries for recent data stopped scanning historical partitions. (3) Adding two read replicas and routing 70% of traffic to them. CPU on the primary dropped from 92% to 28%. The database ran fine for another 18 months before sharding became genuinely necessary — and by then, the team had done a proper shard key analysis and used Citus for a controlled 3-month migration with zero downtime. The two-week “emergency sharding” would have been a disaster.Follow-up: How do you choose a shard key? What is the worst shard key you have seen?
The shard key should satisfy three criteria: (1) high cardinality — enough distinct values to distribute evenly across shards. (2) Included in most queries — so the router can send each query to a single shard rather than all shards. (3) Even distribution — no single value should have disproportionate data.The worst shard key I have encountered:country_code. A SaaS product sharded by country, thinking it was “natural isolation.” The US shard had 70% of all data. The Japan shard had 15%. The remaining 40 shards shared 15%. The US shard was larger than the original unsharded database within a year. They had to re-shard by tenant_id — a 4-month project.Second worst: created_date. All current traffic hits the latest shard. Historical shards are idle. You have not distributed load; you have concentrated it on one shard with extra infrastructure complexity.Follow-up: What is the difference between sharding and partitioning?
Partitioning splits a table into multiple physical segments within the same database instance. PostgreSQL declarative partitioning keeps all partitions on the same server, managed by the same PostgreSQL process. The query planner does partition pruning (only scans relevant partitions). There is zero application change — queries work identically. It solves query performance (smaller indexes, fewer rows scanned) but not write throughput or storage limits — you are still on one server.Sharding splits data across multiple database instances on different servers. Each shard is an independent database. It solves write throughput (each shard handles its own writes), storage (each shard stores a fraction of the data), and read throughput (queries route to one shard). But it introduces cross-shard query complexity, distributed transaction challenges, and operational overhead.Rule of thumb: partition first, shard only when a single server cannot handle the write throughput or storage requirements. Most teams shard too early. PostgreSQL with partitioning, proper indexing, and read replicas can handle multi-TB databases and tens of thousands of QPS on a single server.Q17: You are reviewing a pull request that adds a new API endpoint: POST /users/:id/send-notification. What is wrong with this design, and how would you coach the developer?
Q17: You are reviewing a pull request that adds a new API endpoint: POST /users/:id/send-notification. What is wrong with this design, and how would you coach the developer?
This tests API design instinct — the ability to spot violations of REST conventions and think through the consequences.
What weak candidates say: “Looks fine. It sends a notification to a user.” — Missing the core design problem entirely.What strong candidates say:There are multiple design issues here, ranging from convention violations to real operational risks. Let me walk through them in order of severity.Issue 1: Verb in the URL. REST resources are nouns, not verbs.send-notification is a command, not a resource. This matters because it makes the API unpredictable. If send-notification is a verb, what about update-profile, reset-password, calculate-invoice? You end up with an API that is a collection of RPC-style commands disguised as REST endpoints, with no consistency in naming, no standard HTTP method semantics, and no way for a developer to guess the URL structure.The fix: Model the notification as a resource.GET /users/:id/notifications to list them. GET /users/:id/notifications/ntf_abc123 to check status. DELETE /users/:id/notifications/ntf_abc123 to cancel a queued one. The entire CRUD lifecycle emerges naturally from treating notifications as resources.Issue 2: POST implies creation, but should this be synchronous? Sending a notification is inherently async — it involves email delivery, push notification services, SMS gateways, any of which can take seconds or fail. A synchronous POST that blocks until the notification is delivered will timeout on slow channels. The 202 Accepted pattern is correct: accept the request, return immediately with a resource ID, process asynchronously. The client polls the status or receives a webhook callback.Issue 3: No idempotency. What if the client retries? Without an idempotency key, retrying the POST sends the notification twice. A user getting two identical “Your password was reset” emails is confusing. A user getting two identical “Your payment of $5,000 was processed” SMS messages causes panic and support tickets.Issue 4: Missing rate limiting considerations. This endpoint triggers an external side effect (sending a notification). If a bug or a malicious actor calls it in a loop, you spam the user with thousands of notifications. Add a per-user rate limit (e.g., max 10 notifications per user per hour) separate from the API-level rate limit. Also add a global rate limit per notification type to protect downstream delivery services.Issue 5: The :id in the URL should be validated. Can user A send a notification to user B? If this is an admin endpoint, that might be fine. If it is a user-facing endpoint, the :id must match the authenticated user. This is an authorization check that is easy to miss in review but is a critical security vulnerability (IDOR — Insecure Direct Object Reference).How I would coach the developer:I would not just request changes on the PR. I would schedule a 15-minute pairing session and walk through the design together, using Stripe’s API or GitHub’s API as references. The goal is to build the developer’s API design intuition, not just fix this one endpoint. I would frame it as: “Every endpoint you add becomes a contract. Once a client depends on POST /users/:id/send-notification, renaming it is a breaking change. Let us design it as a resource from the start so the API grows naturally.”War Story: At a startup, we shipped an endpoint called POST /orders/:id/refund. Six months later, we needed partial refunds, refund reasons, refund-to-store-credit, and refund status tracking. The verb-based endpoint could not accommodate any of this without becoming POST /orders/:id/refund?amount=50&reason=damaged&type=credit. It was an unreadable mess. We redesigned it as POST /orders/:id/refunds (resource), which naturally supported GET /orders/:id/refunds (list), GET /orders/:id/refunds/:refund_id (detail), and a clean body with all the parameters. The migration from the old endpoint to the new one took 3 months because 80 partners had integrated with the verb-based URL.Follow-up: When IS it acceptable to use verbs in REST URLs?
Two scenarios. First, genuinely non-resource operations that do not create, read, update, or delete anything:POST /translate (stateless transformation), POST /validate (check without side effects), POST /search (complex query that does not fit in a GET query string). Even here, you can often model them as resources (POST /translations, POST /validations), but the verb form is pragmatically acceptable for stateless operations.Second, when your API is not actually REST and you have decided that is fine. An internal RPC-style API between two services you own, where neither discoverability nor resource modeling matters, can use verb-based endpoints. Just be honest about it — do not call it REST.Follow-up: How would you design the notification endpoint to support multiple channels (email, SMS, push)?
Add achannels array to the request body:GET /users/:id/notifications/ntf_abc123 for updated per-channel statuses. This is cleaner than separate endpoints per channel (/send-email, /send-sms) because the business logic (which channels to use, deduplication, user preferences) stays on the server.Q18: A Kafka consumer processes order events and writes to a database. You discover that some orders were processed twice, resulting in duplicate charges. The team says 'Kafka guarantees exactly-once delivery.' What is actually happening, and how do you fix it?
Q18: A Kafka consumer processes order events and writes to a database. You discover that some orders were processed twice, resulting in duplicate charges. The team says 'Kafka guarantees exactly-once delivery.' What is actually happening, and how do you fix it?
This is a “common misconception with real money on the line” question. The obvious answer — “Kafka has exactly-once” — is dangerously incomplete.
What weak candidates say: “Enable Kafka’s exactly-once semantics withenable.idempotence=true and transactional producers. Problem solved.”What strong candidates say:The team’s understanding is incomplete in a way that cost real money. Let me untangle what “exactly-once” actually means in Kafka and why it did not prevent duplicate charges.What Kafka’s exactly-once actually guarantees:Kafka’s exactly-once semantics (EOS), introduced in 0.11, guarantees that a producer-to-Kafka-to-consumer pipeline processes each message exactly once within Kafka. Specifically:- Idempotent producer (
enable.idempotence=true): the producer can retry sends without creating duplicate messages in the topic. Kafka deduplicates by producer ID and sequence number. - Transactional producer/consumer: a consume-transform-produce pipeline can atomically read from topic A, process, and write to topic B, with consumer offsets committed in the same transaction. If the process crashes mid-transaction, Kafka rolls back both the output message and the offset commit. No message is double-processed within Kafka.
- Consumer reads order event from Kafka.
- Consumer calls the payment API — payment succeeds, customer is charged $50.
- Consumer attempts to commit the Kafka offset.
- Consumer crashes (or the network blips) before the offset commit succeeds.
- Kafka redelivers the message (offset was not committed).
- Consumer processes it again — calls payment API again — customer is charged another $50.
(topic, partition, offset) as a unique identifier, or better, use a business-level idempotency key (order ID + event type).enable.idempotence=true and the team believed they were protected. The root cause: the consumer committed offsets asynchronously (enable.auto.commit=true with a 5-second interval). During a rolling deployment, containers were killed before the auto-commit fired. On restart, messages since the last commit were redelivered and reprocessed. The fix was threefold: (1) switch to manual offset commits, (2) add an idempotency table keyed on order_id + event_type with the processing wrapped in a database transaction, (3) add a daily reconciliation job that detects and flags duplicate charges for automatic refund. The idempotency table added 0.3ms of overhead per event (one indexed lookup). The duplicate rate dropped from 0.03% to zero. The reconciliation job remained as a safety net.Follow-up: Why not just use manual offset commits after processing?
Manual commits reduce but do not eliminate duplicates. The failure window is between “processing completes” and “offset commit completes.” If the consumer crashes in that window (or the broker rejects the commit due to a rebalance), the message is redelivered. The window is small (milliseconds), but at 50K messages/hour, even a 0.001% failure rate produces 18 duplicates per day. For payments, zero is the only acceptable number.Follow-up: How does the idempotency table perform at high throughput?
At 50K events/hour (~14 events/second), the overhead is negligible. At 1M events/hour (~278 events/second), you need to think about table management. The idempotency table grows continuously. Partition it by day, drop partitions older than 7 days (your Kafka retention period). Index on the idempotency key for O(log n) lookups. At 278 inserts/second with 7-day retention, the table holds ~170M rows — PostgreSQL handles this comfortably with a proper index. If throughput exceeds 10K events/second, consider using Redis as a fast-path check with the database as a fallback (check Redis first, if miss then check DB, write to both).Q19: Your DynamoDB table is experiencing throttling on a single partition despite having plenty of provisioned capacity. The table has 10,000 WCU provisioned but only one partition is hot. Explain the problem and fix it.
Q19: Your DynamoDB table is experiencing throttling on a single partition despite having plenty of provisioned capacity. The table has 10,000 WCU provisioned but only one partition is hot. Explain the problem and fix it.
This tests distributed database internals knowledge. Candidates who only know SQL will struggle. Candidates who have operated DynamoDB in production will nail it.
What weak candidates say: “Increase the provisioned capacity. If 10,000 WCU is not enough, set it to 50,000.” — This fundamentally misunderstands how DynamoDB distributes capacity.What strong candidates say:This is the classic DynamoDB hot partition problem, and it is one of the most misunderstood aspects of DynamoDB’s architecture. Increasing provisioned capacity will not help because the problem is not total capacity — it is how capacity is distributed across partitions.How DynamoDB partitioning works:DynamoDB divides your table into partitions based on the partition key hash. Each partition handles a maximum of 3,000 RCU and 1,000 WCU (these are hard limits per partition, not configurable). When you provision 10,000 WCU, DynamoDB creates at least 10 partitions (10,000 / 1,000 = 10) and distributes the capacity evenly: 1,000 WCU per partition.If your access pattern sends 80% of writes to a single partition key value (e.g.,tenant_id = "big_enterprise_customer"), that one partition receives 8,000 WCU of demand but can only handle 1,000 WCU. The other 9 partitions sit idle with 9,000 WCU unused. You are throttled at 1,000 WCU effective capacity despite provisioning 10,000.DynamoDB adaptive capacity (introduced 2018) partially mitigates this — it can boost a hot partition’s throughput by borrowing from underutilized partitions. But adaptive capacity has limits: it takes minutes to kick in, cannot exceed the table-level throughput, and does not help if the hot partition is at the physical node’s I/O limit.Root cause analysis for hot partitions:Run a query against CloudWatch ConsumedWriteCapacityUnits with the TableName and partition key dimension. Or use DynamoDB Contributor Insights (must be enabled per table) — it shows the top 20 most-accessed partition keys over the last 24 hours. If one key dominates, you have found your hot partition.Common causes:- Bad partition key design. Using
dateas the partition key means all of today’s writes hit one partition. Usingstatus(with 3 possible values) means 3 partitions max, with the most common status being hot. - One big tenant in a multi-tenant table. If
tenant_idis the partition key and one tenant has 100x the traffic of others, their partition is permanently hot. - Time-based access patterns. An IoT system where
device_idis the partition key, but all devices report at the top of every minute. The write pattern is bursty, concentrating on a few partitions simultaneously.
- Write sharding (adding randomness). Append a random suffix to the partition key: instead of
tenant_id = "big_customer", usetenant_id = "big_customer#3"where the suffix is a random number from 0-9. This spreads the big customer’s writes across 10 partitions. Reads require a scatter-gather across all 10 suffixed keys, but writes are distributed. This is DynamoDB’s official recommended pattern for hot partitions.
- Switch to on-demand capacity mode. On-demand mode provisions capacity automatically based on traffic and does not have per-partition limits as strict as provisioned mode (it still has limits, but they are higher and auto-adjust). For bursty workloads, on-demand eliminates the capacity planning problem entirely. The trade-off: on-demand is 5-7x more expensive per request than provisioned at steady-state traffic. But if you are throttling, the lost requests cost more than the price premium.
-
Redesign the partition key. If the hot partition is caused by a fundamental design flaw (using
dateorstatusas the partition key), no amount of capacity will fix it. Redesign with a high-cardinality key. For time-series data, usedevice_id(high cardinality) as the partition key andtimestampas the sort key — not the reverse. - Use a GSI with a different partition key. If the hot path is a specific query pattern, create a Global Secondary Index with a partition key that distributes more evenly. The GSI has its own partitions and throughput.
device_type as the partition key (thinking it was a logical grouping). There were 12 device types. 60% of devices were type “sensor_v2.” That one partition handled 60% of all writes — 180K WCU of demand on a partition that could handle 3K WCU. Contributor Insights showed the hot key immediately. We redesigned the table to use device_id as the partition key (100K distinct values = excellent distribution). The migration required dual-writing to old and new tables for 48 hours, then cutting over reads. Throttling dropped to zero. The table went from needing 300K provisioned WCU (to brute-force the hot partition) to 50K WCU (actual demand, evenly distributed). Cost savings: $18K/month.Follow-up: How does DynamoDB’s on-demand mode handle sudden traffic spikes?
On-demand mode maintains a “previous peak” high-water mark. It can instantly serve up to double the previous peak traffic. If traffic exceeds double the previous peak, DynamoDB throttles while it provisions additional capacity (which takes minutes). For truly unpredictable spikes (0 to 50K WCU in seconds), on-demand may still throttle briefly. The mitigation: “pre-warm” by running a load test that establishes the high-water mark before the expected spike. Or use provisioned mode with auto-scaling and set the max capacity high enough.Follow-up: When would you use DynamoDB over PostgreSQL?
When access patterns are known and key-based (get by ID, query by partition key + sort key range), write throughput needs to be massive (>100K writes/second), you want zero-operational-overhead scaling, and you do not need ad-hoc queries or multi-table joins. DynamoDB excels at: session stores, user profiles, shopping carts, IoT telemetry, gaming leaderboards. PostgreSQL excels at: anything requiring JOINs, complex queries, transactions across multiple tables, or access patterns that evolve unpredictably. The wrong choice: using DynamoDB when you need to runSELECT orders.*, customers.name FROM orders JOIN customers ON ... — you will end up doing the JOIN in application code, which is slower and more error-prone.Q20: Your API has a 99.9% success rate, which sounds great until you do the math. A product manager asks why customers are still complaining. Explain the math and propose a solution.
Q20: Your API has a 99.9% success rate, which sounds great until you do the math. A product manager asks why customers are still complaining. Explain the math and propose a solution.
This is a “numbers that sound good but are not” question. It tests whether candidates think in absolute numbers, not just percentages.
What weak candidates say: “99.9% is three nines of reliability. That is industry standard. Customers should not be complaining.” — This shows the candidate has never operated a high-traffic API.What strong candidates say:99.9% sounds great as a percentage. But percentages without volume are meaningless. Let me do the math.The math that changes the narrative:If the API serves 10 million requests per day:- 99.9% success = 0.1% failure = 10,000 failed requests per day
- That is 416 failures per hour, or 7 per minute
- 0.01% failure = 1,000 failed requests per day
- 42 per hour, less than 1 per minute
- 10,000 failures per day — you are right back where you started
-
Categorize the failures. Not all 10,000 daily failures have the same cause. Pull the error logs and categorize:
- Timeouts to downstream services (often 40-50% of failures): add circuit breakers, increase timeout budgets for critical paths, add fallback responses for non-critical dependencies.
- Database connection exhaustion (often 15-25%): add PgBouncer, tune pool sizes, add connection timeout with retry.
- Rate limiting / load shedding (often 10-20%): if you are rate-limiting legitimate traffic, your limits are too low or your capacity is too small.
- Application bugs (often 10-15%): null pointer exceptions, unhandled edge cases in business logic. Each one is a unique fix.
- Infrastructure failures (often 5-10%): node crashes, deployment rollouts, cloud provider transient errors.
- Implement retries with backoff for transient failures. Many of the timeouts and connection errors are transient — the downstream service was busy for 200ms but would succeed on retry. A single retry with 100ms backoff converts a transient failure into a success, reducing the failure count by 30-50%. But retries must be idempotent (see Q2) and must have a retry budget (max 1 retry per request, max 10% of traffic as retries) to avoid amplification.
- Add graceful degradation. If the recommendation engine is down, show popular products instead of a 500 error. If the analytics service is slow, skip analytics tracking and serve the page. Not every dependency failure needs to be a user-visible error. This alone can improve the user-perceived success rate from 99.9% to 99.95%.
- Reduce blast radius. A single bad deployment that crashes 1 of 10 servers causes a 10% error spike for the duration of the rollout. Canary deployments (route 5% of traffic to the new version, monitor for 10 minutes, then proceed) limit the blast radius. A bad deploy now affects 5% of traffic for 10 minutes instead of 100% for 5 minutes.
Follow-up: How do you define and measure SLOs for an API?
Define SLOs using the SLI/SLO/error budget framework (Google SRE book):- SLI (Service Level Indicator): The metric you measure. For an API:
(successful responses / total responses) * 100where “successful” means HTTP 2xx/3xx within the latency threshold (e.g., <500ms). Exclude 4xx client errors (those are the client’s fault, not yours) unless they indicate a server-side issue (429s that should not happen). - SLO (Service Level Objective): The target for the SLI. “99.95% of requests succeed within 500ms over a 30-day rolling window.”
- Error budget:
100% - SLO = error budget. At 99.95% SLO with 50M requests/day, your error budget is 25,000 failures/day. If you burn through the budget, freeze deployments and focus on reliability.
Follow-up: What is the difference between SLO and SLA?
SLO is an internal engineering target. Missing it triggers an engineering response (freeze deploys, fix reliability). SLA is an external contractual commitment, usually with financial penalties (service credits) for violations. Your SLO should always be stricter than your SLA. If your SLA is 99.9%, your SLO should be 99.95% — the gap is your safety margin. You should never be in a situation where engineering only finds out about a reliability problem because a customer claims an SLA credit.Q21: A developer proposes using a single PostgreSQL database as both the application database AND the job queue (using SKIP LOCKED for dequeuing). A staff engineer says 'use a proper queue like SQS.' Who is right?
Q21: A developer proposes using a single PostgreSQL database as both the application database AND the job queue (using SKIP LOCKED for dequeuing). A staff engineer says 'use a proper queue like SQS.' Who is right?
This is a “both sides have merit and the context determines the answer” question. It tests architectural judgment, not religious preference.
What weak candidates say: Either extreme — “Never use PostgreSQL as a queue, always use SQS” or “PostgreSQL does everything, why add another service?” Both are dogmatic rather than analytical.What strong candidates say:Both are right in different contexts, and the decision depends on scale, operational complexity budget, and the team’s existing infrastructure. Let me argue both sides, then give my recommendation.The case for PostgreSQL as a job queue:It genuinely works well at moderate scale, and the operational simplicity is significant.- Transactional enqueue. When an API request creates an order AND enqueues a notification job, both happen in the same database transaction. If the transaction fails, neither persists. With SQS, you have the dual-write problem: the order is committed to the database, but the SQS send fails (or vice versa). You need the outbox pattern to make this reliable, which adds complexity.
- No additional infrastructure. No SQS configuration, no IAM policies, no dead-letter queue configuration, no CloudWatch alarms for queue depth. The job table is just another table in the database you already monitor, back up, and maintain.
-
SKIP LOCKED is remarkably efficient. PostgreSQL’s
SELECT ... FOR UPDATE SKIP LOCKEDis purpose-built for this pattern. Multiple workers dequeue concurrently without blocking each other. At 100-500 jobs/second, a well-indexed job table with regular cleanup of completed jobs handles this without breaking a sweat. Libraries likepgboss(Node.js),Oban(Elixir),GoodJob(Ruby), andProcrastinate(Python) are production-tested implementations. -
Rich querying. You can
SELECT COUNT(*) FROM jobs WHERE status = 'failed' AND created_at > now() - interval '1 hour'— try that with SQS. You can add indexes, run analytics on job performance, and JOIN job data with business data. SQS is an opaque pipe.
- Scale. PostgreSQL as a queue works at 100-500 jobs/second. At 10,000+ jobs/second, the job table becomes a write hotspot. Every enqueue is an INSERT, every dequeue is a SELECT + UPDATE, every completion is an UPDATE or DELETE. That is 3 write operations per job on a single table. At 10K jobs/second, that is 30K write operations/second on the jobs table alone — enough to saturate a mid-tier PostgreSQL instance. SQS handles 3,000 messages/second per queue by default and scales transparently to nearly unlimited throughput.
- Isolation. A surge in job processing should not affect API response times. If job dequeuing competes with API queries for database connections, CPU, and I/O, a job backlog can degrade API performance. SQS is completely isolated — a million queued messages does not affect your database.
- Retention and replay. SQS retains messages for up to 14 days. Kafka retains indefinitely. A PostgreSQL job table fills up and needs cleanup. If you need to replay failed jobs from last week, SQS/Kafka has them. PostgreSQL requires you to not delete completed/failed jobs (which bloats the table).
- Dead-letter queues and visibility timeouts. SQS has built-in DLQ, visibility timeout (message becomes invisible to other consumers for N seconds while being processed), and automatic re-delivery after timeout. Implementing all of this in PostgreSQL is possible but is 500+ lines of code that you have to maintain and test.
| Factor | Use PostgreSQL Queue | Use SQS/Dedicated Queue |
|---|---|---|
| Job throughput | <500/second | >1,000/second |
| Team already operates SQS/Kafka | No | Yes |
| Need transactional enqueue (job + business data atomically) | Yes (this is the killer feature) | Only with outbox pattern |
| Job processing affects API latency | Acceptable | Unacceptable |
| Number of services | Monolith or <5 services | Microservices (>5 services) |
| Need exactly-once processing | Easier (DB transaction) | Harder (need idempotency) |
Oban, pgboss, or GoodJob — do not build your own.For a team at scale with >1,000 jobs/second or microservices architecture: use SQS. The isolation and scaling benefits outweigh the operational complexity.War Story: A team I advised started with SQS for job processing on day one because “that is what you are supposed to do.” Six months in, they had: 3 separate SQS queues, a DLQ for each, CloudWatch alarms for queue depth, a Lambda consumer for each queue, and the outbox pattern with a Debezium CDC pipeline to reliably publish database events to SQS. Total infrastructure: 15 AWS resources for what was 200 jobs/hour. They were spending more engineering time maintaining the queuing infrastructure than the actual application. I helped them rip it all out and replace it with pgboss — one PostgreSQL table, one worker process, zero additional infrastructure. Deployment complexity dropped dramatically. The system ran without incident for two years before they genuinely needed SQS at 5,000 jobs/second. The lesson: match the tool to the actual scale, not the imagined future scale.Follow-up: What about using Redis (Bull/BullMQ) as the queue instead of PostgreSQL or SQS?
Redis with BullMQ is the middle ground. Faster than PostgreSQL for high-throughput dequeuing (Redis operations are sub-millisecond), simpler than SQS (no IAM, no DLQ configuration), and has a beautiful dashboard (Bull Board) for monitoring. The trade-off: Redis is not durable by default. If Redis crashes and AOF persistence is not configured (or has a gap), queued jobs are lost. For jobs where loss is acceptable (send analytics event, update search index), Redis is excellent. For jobs where loss is unacceptable (process payment, send invoice), use PostgreSQL or SQS with their durability guarantees.Follow-up: How does the outbox pattern solve the dual-write problem with SQS?
Instead of writing to the database AND sending to SQS (two separate operations that can partially fail), you write the business data AND the message to the database in the same transaction:Q22: You run EXPLAIN ANALYZE on a slow query and see that PostgreSQL chose a sequential scan on a 200M-row table despite a perfectly matching B-tree index existing. Why would the optimizer ignore your index, and what do you do?
Q22: You run EXPLAIN ANALYZE on a slow query and see that PostgreSQL chose a sequential scan on a 200M-row table despite a perfectly matching B-tree index existing. Why would the optimizer ignore your index, and what do you do?
This tests deep query planner knowledge. The “obvious” answer — “the index is missing” — is wrong. The index exists. The planner chose to ignore it.
What weak candidates say: “The index must be corrupt. Rebuild it.” or “Force an index scan withSET enable_seqscan = off.” — The first is extremely unlikely. The second is a brute-force hack that masks the real problem and can make other queries worse.What strong candidates say:The PostgreSQL query planner is cost-based. It estimates the cost of every possible execution plan (sequential scan, index scan, bitmap index scan, etc.) and picks the cheapest one. If it chose a sequential scan over an index scan on a 200M-row table, it is telling you something. Either the planner’s cost estimate is wrong, or the planner is actually right and the sequential scan really is cheaper for this specific query. Let me walk through the common reasons.Reason 1: The query returns too many rows (low selectivity).If the query matches 30%+ of the table, a sequential scan is genuinely faster than an index scan. Why? An index scan requires: (1) traverse the B-tree to find matching keys, (2) for each key, follow a pointer to the heap (table) to fetch the full row. At 30% selectivity on 200M rows, that is 60M random I/O operations to the heap. A sequential scan reads the entire table in one linear pass — sequential I/O is 10-100x faster than random I/O on both HDDs and SSDs.The threshold where the planner switches from index scan to sequential scan is typically around 5-15% of the table, depending on random_page_cost and seq_page_cost settings.Diagnosis: Check EXPLAIN ANALYZE for the estimated row count. If the planner estimates returning 50M+ rows from a 200M-row table, the sequential scan is correct. The fix is not to force an index — it is to add more selective filters to the query or rethink the access pattern.Reason 2: Stale statistics.PostgreSQL’s planner uses table statistics (row counts, value distribution, most common values, histogram of column values) to estimate selectivity. If the statistics are stale — the table had 10M rows when ANALYZE last ran but now has 200M rows — the planner’s estimates are wrong.Diagnosis: Compare EXPLAIN (estimated) with EXPLAIN ANALYZE (actual). If estimated rows = 50,000 but actual rows = 500, the statistics are wrong. The planner overestimated the result set and chose a sequential scan based on the wrong estimate.Fix:WHERE status = 'active' AND region = 'us-east') and the planner assumes independence (5% of rows are active, 20% are us-east, so 1% match both), but in reality 90% of active rows are in us-east (correlation), the planner dramatically underestimates the result set. With the wrong estimate, it might choose an index scan expecting 2M rows but actually get 36M rows, or it might choose a sequential scan expecting 20M rows when only 2M match.Fix: PostgreSQL 10+ supports multivariate statistics:status and region, leading to more accurate row estimates.Reason 4: Cost parameters are misconfigured.PostgreSQL’s default random_page_cost = 4.0 and seq_page_cost = 1.0 assumes that random I/O is 4x more expensive than sequential I/O. This was accurate for spinning disks. On SSDs (which most production databases use), random I/O is only 1.1-1.5x more expensive than sequential I/O. With the default settings, the planner overestimates the cost of index scans (which involve random I/O) and favors sequential scans.Fix:random_page_cost closer to seq_page_cost makes the planner more willing to use index scans, which is correct for SSD-backed databases.Reason 5: The index is on the wrong column order.A composite index on (status, created_at) helps WHERE status = 'active' AND created_at > '2025-01-01' but the planner cannot use it for WHERE created_at > '2025-01-01' alone (leftmost prefix rule). The index exists, it matches the columns, but the column order does not match the query pattern.What I would NEVER do:SET enable_seqscan = off — this forces the planner to avoid sequential scans globally. It might “fix” your one query but degrades every other query where a sequential scan was genuinely optimal. It is a debugging tool, not a production setting. Use it only to verify “would an index scan be faster?” then fix the root cause.War Story: At a healthcare SaaS company, a report query on a 150M-row patient records table was doing a sequential scan despite a perfect composite index on (provider_id, visit_date). EXPLAIN showed estimated rows = 12M, actual rows = 3,200. The statistics were wildly off because autovacuum had not analyzed the table in 3 weeks — the table had grown from 80M to 150M rows in that period due to a data migration, and autovacuum’s threshold (autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * n_live_tup) had not triggered because the scale factor was set to 10% (the default). At 80M rows, 10% means autovacuum waits for 8M new rows before re-analyzing. We ran ANALYZE manually — query time dropped from 4 minutes (sequential scan) to 80ms (index scan). Then we tuned autovacuum for that table: ALTER TABLE patient_records SET (autovacuum_analyze_scale_factor = 0.01) — now autovacuum re-analyzes after every 1.5M new rows instead of every 8M.Follow-up: How do you investigate whether the planner’s estimate is wrong without running the full query?
UseEXPLAIN (without ANALYZE) to see the planner’s estimates without executing the query. Compare the estimated rows to a quick approximation:pg_stats shows status = 'active' has frequency 0.05 (5%) and the EXPLAIN estimates 10M rows for a 200M-row table, the estimate is correct. If it estimates 100M rows, the statistics are stale.Follow-up: What is a bitmap index scan and when does PostgreSQL use it instead of a regular index scan?
A bitmap index scan is the middle ground between a sequential scan and an index scan. PostgreSQL first scans the index and builds a bitmap of all matching heap pages (not individual rows). Then it scans the heap pages in physical order (sequential I/O, not random). This combines the selectivity of an index with the I/O efficiency of sequential access.The planner uses a bitmap scan when selectivity is moderate (1-15% of the table). Too selective -> regular index scan (few random reads). Not selective enough -> sequential scan (read everything). In the middle -> bitmap scan. It is also used when combining multiple indexes:WHERE status = 'active' AND region = 'us-east' with separate indexes on status and region -> bitmap AND of the two indexes, then heap scan. This is why PostgreSQL can sometimes use two single-column indexes instead of a composite index, though the composite index is typically faster.Q23: Your team needs to run a schema migration on a 2-billion-row MySQL table in production -- adding an index and splitting a column into two. You cannot have downtime. Walk me through the operational plan, including rollback, monitoring, and cost.
Q23: Your team needs to run a schema migration on a 2-billion-row MySQL table in production -- adding an index and splitting a column into two. You cannot have downtime. Walk me through the operational plan, including rollback, monitoring, and cost.
This question separates people who have run migrations on toy databases from people who have sweated through a 12-hour online DDL on a production table.
What weak candidates say: “UseALTER TABLE ADD INDEX and ALTER TABLE ADD COLUMN. MySQL supports online DDL.” — Technically true for simple cases, dangerously naive at 2B rows.What strong candidates say:At 2 billion rows, every migration operation is a multi-hour event with production-safety implications. Here is the operational plan I would run, broken into phases with explicit go/no-go checkpoints.Phase 0: Assessment (1-2 days before migration).Measure the table: row count, average row size, total data size, number of existing indexes, current write rate (inserts/updates/deletes per second), peak hours, and replication lag baseline. On a 2B-row InnoDB table with an average row size of 200 bytes, the data is roughly 400GB. Adding a B-tree index will create an additional ~100-200GB of index data. Ensure the disk has 1.5x the table size as free space (600GB) — the migration tool creates a copy.Phase 1: Add the index using gh-ost (6-18 hours).Do NOT use native ALTER TABLE ADD INDEX at this scale. Even with MySQL 8’s online DDL, the metadata lock acquisition at the start and end of the operation can stall if there are long-running transactions. gh-ost avoids this entirely.--max-lag-millis=1500: gh-ost pauses if replica lag exceeds 1.5 seconds. This prevents the migration from causing lag that affects read traffic on replicas.--critical-lag=10000: If lag exceeds 10 seconds, gh-ost aborts (something is seriously wrong).--chunk-size=1000: Copy 1,000 rows per chunk. Smaller chunks = less lock contention but slower migration. At 2B rows with 1,000 rows/chunk, that is 2M chunks. At 50ms per chunk, the migration takes ~28 hours. Increase to 5,000 rows/chunk to bring it to ~6 hours, but monitor write latency.--throttle-query: Custom query that gh-ost evaluates periodically. If it returns a value exceeding a threshold, gh-ost pauses. Use this to pause during peak hours.
- Watch
gh-ostprogress output (rows copied, ETA, chunks per second). - Watch replica lag on all replicas (
pt-heartbeatis more reliable thanSHOW SLAVE STATUS). - Watch primary write latency (P95 should not increase by more than 20% during the migration).
- Watch disk I/O utilization — if it hits 80%+, reduce chunk size.
- Watch connection count — gh-ost uses 2-3 connections, but the increased I/O can cause other queries to queue.
name into first_name and last_name is a multi-step operation:Deploy 1 — Expand: Add first_name and last_name columns (NULLable, no default). This is instant in MySQL 8 with ALGORITHM=INSTANT for column additions at the end. Deploy application code that writes to all three columns (name, first_name, last_name) on every INSERT and UPDATE. Old and new application instances coexist safely during rolling deploy because the old code writes only name (new columns are NULLable).Deploy 2 — Backfill: Run a batch job to populate first_name/last_name from name for existing rows. At 2B rows with 50,000 rows per batch and 100ms per batch, the backfill takes ~1.1 hours. Run during off-peak. Verify with SELECT COUNT(*) FROM events WHERE first_name IS NULL AND name IS NOT NULL — should be zero.Deploy 3 — Contract (weeks later): Switch application reads from name to first_name/last_name. After all instances are on the new code and you have verified zero references to the old column (grep the codebase, check CDC consumers, check analytics pipelines), drop the name column. In MySQL, dropping a column on a 2B-row table requires an online DDL or gh-ost — it is NOT instant.Rollback plan:- Phase 1 (index): gh-ost creates the ghost table alongside the original. If the migration fails or causes problems, it can be cancelled without affecting the original table. The ghost table is dropped. No data is lost, no schema is changed. This is the safest possible rollback.
- Phase 2 (column split): If Deploy 1 causes issues, the new columns are NULLable and unused by old code — they can be dropped without impact. If Deploy 2 (backfill) has bugs, rerun it (the backfill should be idempotent). Deploy 3 is the point of no return — once the old column is dropped, you cannot roll back without restoring from backup.
Follow-up: The gh-ost migration is 80% complete after 14 hours. A P0 incident requires a deployment. Can you deploy, or does the migration block it?
gh-ost does not block deployments. The original table is fully writable throughout the migration. You can deploy application changes normally. However: if the deployment includes a schema migration on the same table, that will conflict — two concurrent schema changes on one table is a recipe for disaster. Coordinate with the team: no DDL on the migrating table until gh-ost completes. For application code changes, deploy freely.If the P0 requires killing the migration (freeing I/O for emergency queries), gh-ost supports graceful pause:echo throttle | nc -w 1 127.0.0.1 5678 (if using the socket interface). Resume when the incident is resolved. The migration picks up where it left off.Follow-up: How do you validate that the migration did not corrupt data?
After gh-ost completes the table swap:-
Row count verification.
SELECT COUNT(*) FROM eventsshould match the pre-migration count (plus any rows inserted during the migration — gh-ost reports this delta). On a 2B-row table,COUNT(*)is expensive. UseSELECT reltuples FROM pg_classfor PostgreSQL orSELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLESfor MySQL (approximate but instant). -
Checksum verification.
pt-table-checksum(Percona) computes CRC32 checksums across primary and replicas. Run it on a sample of rows (1% = 20M rows) to verify data consistency between primary and replicas after the migration. - Application-level verification. Run your integration test suite against the migrated table. Spot-check specific rows that are known to have edge cases (NULLs, unicode, maximum-length values, boundary dates).
Q24: Three teams each own a microservice. Team A changes their API response format, breaking Team B's service. Team B blames Team A. Team C is caught in the middle because they consume both. Who is at fault, and how do you prevent this organizationally and technically?
Q24: Three teams each own a microservice. Team A changes their API response format, breaking Team B's service. Team B blames Team A. Team C is caught in the middle because they consume both. Who is at fault, and how do you prevent this organizationally and technically?
This is an organizational architecture question disguised as a technical one. It tests whether you understand that API contract ownership is a people problem with technical solutions.
What weak candidates say: “Team A should not have changed the API without telling Team B.” — Correct but insufficient. Communication alone does not prevent the next incident.What strong candidates say:Everyone is at fault, and the root cause is not a code change — it is the absence of a contract governance system. Let me break down what went wrong and how to fix it permanently.The failure cascade:- Team A changed their response format. They probably considered it a minor change (adding a field, changing a null to an empty array, renaming an internal field that “nobody uses”). They deployed without checking consumer compatibility because there was no mechanism to do so.
- Team B’s deserialization broke because they either had strict parsing, cached the old format, or depended on an undocumented field/behavior. They had no early warning because there was no contract test, no schema validation, and no pre-deploy compatibility check.
- Team C’s service degraded because it depended on both A and B. A’s change did not break C directly, but B’s failure cascaded. Team C had no circuit breaker or fallback for B’s outage.
openapi-diff, buf breaking, or a custom schema comparator on every PR that modifies a schema. Breaking changes fail the build. The team must either make the change backward-compatible or explicitly bump the version with a migration guide.Step 3: Consumer-driven contract tests for critical relationships. Team B writes a Pact contract: “I expect field X to be a string, field Y to be an integer, field Z to be present.” Team A runs these contracts in their CI pipeline. If a change violates any consumer contract, the build fails. This does not scale to 50 consumers, but for the 3-5 most critical relationships, it is the strongest protection available.Step 4: Separate schema changes from behavior changes. The schema registry enforces: additive changes (new fields, new endpoints) are always safe and can be released without consumer notification. Breaking changes require a version bump and a deprecation timeline. This is the Stripe model — schema evolution rules baked into tooling, not tribal knowledge.Step 5: Service owners, not consumers, are accountable for backward compatibility. The producing team is responsible for not breaking consumers. This creates a natural incentive to invest in versioning, compatibility testing, and deprecation programs. If the incentive is reversed (“consumers should handle any change”), producers will break things constantly because the cost falls on someone else.The technical fix — defensive consuming:Team B’s service should survive a provider’s format change without failing:- Lenient deserialization. Configure JSON/protobuf parsers to ignore unknown fields (
DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES = falsein Jackson). Never fail on extra data. - Null-safe field access. Use optional types, null-coalescing operators, or explicit presence checks for every field from an external service. Assume any field might be absent, null, or a different type.
- Circuit breaker on upstream calls. If Team A’s service returns unexpected responses, Team B’s circuit breaker opens after N failures and returns a fallback (cached data, default values, or a graceful error) instead of cascading the failure.
- Contract monitoring in production. Deploy a lightweight response validator that checks actual production responses against the expected schema. Alert when a field disappears, changes type, or a new unknown field appears — before it breaks business logic.
Follow-up: Team A says “we have 30 consumers, we cannot run 30 contract test suites in our CI.” How do you scale this?
You do not need 30 Pact contracts. The 80/20 rule applies: identify the 3-5 highest-traffic consumers that would cause the most damage if broken. Run their contracts. For the remaining 25, use schema-level compatibility checks (detect any structural breaking change) rather than consumer-specific contracts. A schema compatibility check is one tool running once, not 30 test suites.Also: move from “consumer writes the contract” to “schema registry enforces compatibility rules.” The registry rejects any schema change that removes a required field, changes a type, or renames a field — regardless of which consumer cares. This is the difference between “test against specific consumers” (does not scale) and “enforce universal compatibility rules” (scales to any number of consumers).Follow-up: How do you handle the politics when Team A says the change was documented in their changelog and Team B “should have read it”?
Changelogs are not contracts. If compliance depends on humans reading a document, it will fail at scale. The fix is mechanical: if Team A’s CI cannot merge a breaking change without consumers acknowledging it (via a contract test update, a Pact verification, or an explicit version bump), the “I did not read the changelog” failure mode is eliminated. Shift the enforcement from process (people reading docs) to tooling (CI preventing merges). The political discussion ends when both teams agree on the tooling — the tool becomes the neutral arbiter instead of Slack threads and postmortems.Q25: You are running a multi-tenant SaaS platform. A single enterprise tenant is consuming 60% of your database IOPS. Other tenants are experiencing degraded performance. How do you handle this -- technically, commercially, and without losing the customer?
Q25: You are running a multi-tenant SaaS platform. A single enterprise tenant is consuming 60% of your database IOPS. Other tenants are experiencing degraded performance. How do you handle this -- technically, commercially, and without losing the customer?
This tests the intersection of technical architecture, business judgment, and operational decision-making under pressure. It is a staff-level question.
What weak candidates say: “Rate-limit the tenant” or “Tell them to optimize their queries.” — Both are tone-deaf to the business reality.What strong candidates say:This is a noisy-neighbor problem that requires simultaneous action on three fronts: immediate technical mitigation, medium-term architectural isolation, and a commercial conversation that aligns infrastructure cost with revenue.Immediate mitigation (hours, not days):-
Identify the hot queries. Use
pg_stat_statementsfiltered by the tenant’s application user or connection pool to find the top 10 queries by total I/O. In my experience, 1-3 queries account for 80%+ of the I/O. A single missing index or a report query doing a sequential scan on a 50M-row partition is often the entire problem. -
Optimize the obvious. If the tenant’s top query is
SELECT * FROM events WHERE tenant_id = 'big-corp' AND created_at > '2025-01-01'doing a sequential scan, addingCREATE INDEX CONCURRENTLY idx_events_tenant_date ON events(tenant_id, created_at)is a 10-minute fix that might reduce their I/O by 90%. This is the highest-leverage action. - Route the tenant’s read traffic to a dedicated replica. If they have heavy analytics or reporting queries, route those to a read replica that other tenants do not share. This isolates the I/O impact immediately. Cost: one additional replica ($200-2,000/month depending on instance size).
-
Apply a per-tenant statement timeout.
SET statement_timeout = '10s'for this tenant’s database role. Their long-running queries get killed instead of consuming I/O for minutes. This is a blunt instrument but prevents the worst runaway queries from affecting others.
-
Partition the tenant’s data. If they have 50M rows in a shared table, create a dedicated partition:
ALTER TABLE events ATTACH PARTITION events_bigcorp FOR VALUES IN ('big-corp-uuid'). Their queries now hit a smaller, isolated B-tree. PostgreSQL’s partition pruning ensures other tenants’ queries never touch the big tenant’s partition. - Per-tenant connection pool limits. Allocate a fixed slice of the connection pool (e.g., 20 out of 100 connections) to the enterprise tenant. Even if they saturate their slice, the other 80 connections remain available for other tenants. PgBouncer supports this via pool configuration per database/user.
-
Per-tenant I/O quotas. PostgreSQL does not natively support I/O quotas, but you can approximate them with
pg_qualstats(to identify expensive query patterns),auto_explain(to log queries exceeding a threshold), and application-level query throttling (a middleware that limits the tenant to N concurrent queries).
| Resource | Shared Tier Cost (allocated) | Dedicated Tier Cost | Difference |
|---|---|---|---|
| Database IOPS (60% of shared) | ~$600/mo (their share of the shared instance) | $1,200/mo (dedicated db.r6g.xlarge) | +$600/mo |
| Read replica for analytics | $0 (shared) | $800/mo (dedicated replica) | +$800/mo |
| Dedicated monitoring/support | $0 | ~$500/mo (eng time) | +$500/mo |
| Total | ~$600/mo | ~$2,500/mo | +$1,900/mo |
Follow-up: The tenant refuses to pay for dedicated infrastructure. They say “the platform should handle it.” What do you do?
This is a product decision, not an engineering decision. The platform team defines what is included in each tier. If the shared tier includes “shared infrastructure with fair-use limits,” enforce those limits. Implement per-tenant I/O monitoring with a soft limit (alert + dashboard showing their usage relative to their tier) and a hard limit (throttle queries exceeding 2x their tier’s allocated IOPS). The soft limit creates awareness; the hard limit prevents impact to others.Present the data: “Your usage is 12x what your tier includes. Here is a graph showing your I/O vs other tenants in the same tier. We can either move you to a tier that matches your usage, or apply fair-use limits that bring your usage in line with the tier you are paying for.”Most enterprise customers will pay for dedicated resources when shown the data. The small percentage that refuse are candidates for churn that improves margins — not all revenue is good revenue.Follow-up: How would you measure and attribute per-tenant database cost in a shared environment?
This is the “FinOps for databases” problem, and most teams do not do it well.Measurement approach:- Tag queries with tenant ID. Use PostgreSQL’s
application_nameor a query comment (/* tenant_id=abc123 */) to attribute queries to tenants.pg_stat_statementswithpgauditextension can capture per-query stats with tenant attribution. - Calculate per-tenant IOPS. Sum
shared_blks_read + shared_blks_hitper tenant frompg_stat_statements. Convert to IOPS:total_blocks / time_window. - Calculate per-tenant storage.
SELECT pg_total_relation_size('table') * (SELECT COUNT(*) FROM table WHERE tenant_id = ?) / (SELECT COUNT(*) FROM table)gives an approximation. For exact figures, partition by tenant and measure partition sizes. - Allocate shared costs proportionally. The database instance cost, replica cost, backup cost, and operational cost are shared. Allocate proportionally to IOPS or query volume, not row count (a tenant with 1M rows but 100K queries/day costs more than a tenant with 10M rows and 100 queries/day).
Q26: You deploy a new version of your API. Ten minutes later, you discover that the database migration in the deploy added a NOT NULL column without a default, and the old version of the service (still draining on some instances) is failing on every INSERT. Walk me through the incident response and what you change to prevent this class of failure.
Q26: You deploy a new version of your API. Ten minutes later, you discover that the database migration in the deploy added a NOT NULL column without a default, and the old version of the service (still draining on some instances) is failing on every INSERT. Walk me through the incident response and what you change to prevent this class of failure.
This is a “rolling deploy + schema migration mismatch” incident question — one of the most common causes of API outages at companies that have moved past the basics.
What weak candidates say: “Roll back the migration.” — At what cost? Rolling back a migration on a live table that already has the new column, with two versions of the app running, is not straightforward.What strong candidates say:This is a textbook API/database mismatch during a rolling deploy, and it is entirely preventable. Let me walk through the incident response, the root cause, and the systemic fix.Incident response (minute by minute):Minute 0-2: Detect. Alerts fire on elevated 5xx rate. Error logs showINSERT INTO orders (...) VALUES (...) failing with ERROR: null value in column "priority" violates not-null constraint. The new column priority was added as NOT NULL without a default. Old application instances (still running on 7 of 10 pods) do not include priority in their INSERT statements.Minute 2-5: Mitigate. Two options, pick the fastest:Option A: Fix forward. Add a DEFAULT to the column immediately:priority). New instances explicitly set priority. This is the fastest fix — under 60 seconds to execute.Option B: Roll back the deploy. Stop the rollout and roll back to the previous application version. All instances now run old code. The new column exists but is unused (old code ignores it, and the DEFAULT set in Option A means it does not block inserts). This is slower (5-10 minutes for a full rollback) and only necessary if the new code has additional issues.Minute 5-10: Verify. Error rate should drop to zero within seconds of Option A. Confirm by watching pg_stat_activity — no more failed INSERT statements. Check that the default value is semantically correct (does 'normal' priority make sense for orders created during the incident window?). If not, plan a backfill to correct the values later.Root cause analysis:The migration and the application deploy were coupled in a way that assumed atomic deployment — all instances switch from old to new simultaneously. In a rolling deploy, this assumption is false. For minutes (or hours in large fleets), old and new code run against the same database. The migration must be compatible with both versions.The systemic fix — decouple migrations from deploys:Rule 1: Schema migrations must be backward-compatible with the currently running application version. The migration runs first (or as part of the deploy), but old application instances must continue to work against the new schema. This means:- Adding a column: must be NULLable or have a DEFAULT.
- Removing a column: only after all instances have stopped referencing it (a separate deploy).
- Renaming a column: never in one step. Add the new name, dual-write, migrate reads, drop the old name.
SELECT COUNT(*) FROM api_logs WHERE status = 500 AND path LIKE '/orders%' AND timestamp BETWEEN 'incident_start' AND 'incident_end'. If the rolling deploy was at 30% (3 of 10 instances on old code), approximately 30% of write requests failed for the duration of the incident (10 minutes). At 100 RPS to the orders endpoint, that is ~30 * 600 = 18,000 failed requests. If each failed request is an order submission, the business impact is 18,000 lost or delayed orders — this is the number the post-mortem report must include.Cost of the incident: 18,000 failed orders * average order value * conversion rate loss from retry friction. If AOV is 90,000 in lost revenue from a missing DEFAULT clause. This is why migration safety is not a “nice to have.”Follow-up: How do you handle a migration that is NOT backward-compatible — a column type change, for example?
A type change (e.g.,VARCHAR(50) to TEXT, or INTEGER to BIGINT) requires the expand-and-contract approach spread across multiple deploys:Deploy 1 (schema only): Add a new column with the target type: ALTER TABLE orders ADD COLUMN amount_v2 NUMERIC(15,2). No application changes.Deploy 2 (dual-write): Application writes to both amount (old) and amount_v2 (new) on every INSERT and UPDATE. Reads still use amount. Both old and new application versions work because amount is still the source of truth.Deploy 3 (backfill): Batch job copies amount to amount_v2 for all existing rows: UPDATE orders SET amount_v2 = amount::NUMERIC(15,2) WHERE amount_v2 IS NULL in batches of 50,000.Deploy 4 (switch reads): Application reads from amount_v2. Writes still go to both columns (for rollback safety).Deploy 5 (drop old column): Remove the old column after verifying zero references. This is the point of no return.Minimum timeline: 5 deploys over 1-3 weeks. There is no shortcut that is safe under rolling deploys. Each deploy is independently rollbackable until Deploy 5.Follow-up: What about security? A NOT NULL constraint was missing for 10 minutes. Could an attacker have exploited this?
Yes. If thepriority column is used in authorization logic (e.g., priority = 'urgent' routes to a faster processing queue, or priority = 'admin' grants elevated access), an attacker who discovers the missing constraint during the incident window could INSERT rows with NULL priority and observe how the system handles them. If the application code treats NULL as a special case (or fails open), this is an exploit vector.More realistically: the security risk is in the data integrity violation, not direct exploitation. 18,000 orders with NULL priority might bypass business rules that depend on priority being set. A post-incident data audit should verify: are any rows in an invalid state? Do any NULL-priority orders need to be corrected?The deeper security lesson: migration validation should include a security review for any constraint change. Adding or removing a NOT NULL, CHECK, or FOREIGN KEY constraint changes what data the database will accept. This is a security boundary, not just a schema change.Interview: You run a public API with 100+ third-party consumers and 6 years of history. Your v1 and v2 schemes have diverged into 14 minor versions. A product launch requires a breaking change to the payments endpoint. How do you ship it without a v3 big-bang?
Interview: You run a public API with 100+ third-party consumers and 6 years of history. Your v1 and v2 schemes have diverged into 14 minor versions. A product launch requires a breaking change to the payments endpoint. How do you ship it without a v3 big-bang?
2026-04-21) layered on top of v2 rather than a new major. Each consumer is pinned to the version they integrated on, and new behavior is opt-in per endpoint, not per API.Step 2 - Build a request/response transformation chain: Internally, the service implements only the newest shape. Every dated version installs an upgrade transformer (old request to new) and a downgrade transformer (new response to old). A request from a consumer pinned to 2023-01-15 runs through the chain of transformers between 2023-01-15 and HEAD before hitting the resolver, and the response walks back down. This keeps the core code free of version branching and localizes every change to a single, reviewable file per version.Step 3 - Drive migration with telemetry, not email blasts: Every response carries X-Stripe-Version-style headers and a Sunset header with a concrete date. Dashboards show request volume per pinned version per consumer. Deprecation starts when a version hits under 1 percent of volume, and account managers are given a named list of laggards. Only versions on which no consumer depends get removed. Nothing is retired on a calendar; it is retired on a usage threshold.Real-World Example:
Stripe has shipped over 100 dated API versions since 2011 without a v3. Their engineering blog post “APIs as infrastructure” describes the transformer chain approach. GitHub, by contrast, issued a hard v3-to-v4 cutover for GraphQL and spent years running both in parallel because they could not unilaterally move enterprise customers.Senior Follow-up Questions:- “How do you stop the transformer chain from becoming a 100-step performance disaster?” - Strong answer: Cache a compiled transformer per pinned version at boot so per-request cost is one function call, not a loop, and snapshot the chain every 2 years by rebasing old versions onto a new baseline.
- “A consumer says they cannot upgrade because their auditors froze the integration. What do you do?” - Strong answer: Offer a paid long-term-support tier with a contractual sunset 24 months out, which puts a price on the operational cost of keeping their version alive and makes their procurement team the one asking for the upgrade.
- “How do you detect a breaking change was accidentally shipped inside a minor version?” - Strong answer: Run consumer-driven contract tests in CI using recorded production traffic samples per pinned version, so any resolver change that produces a different response shape for a recorded request fails the build before merge.
- “Just cut a v3 and give consumers 6 months.” - You do not control the consumer’s release cycle; enterprise integrators run on yearly planning cycles and regulated industries on multi-year ones. A 6-month window forces a hard break.
- “Use feature flags per consumer.” - Feature flags are for progressive rollout of behavior, not for maintaining a stable contract. They quickly degenerate into thousands of per-consumer branches with no audit trail.
- “API Versioning at Stripe” - Brandur Leach (engineering blog, stripe.com/blog)
- RFC 8594 - The Sunset HTTP Header Field
- Related chapter: API Gateways and Service Mesh
Interview: A read-heavy product surface (think user profile page) is doing 14 joins per request and the DBA wants you to denormalize. Your peer argues denormalization will create a write-fanout nightmare. Who is right, and how do you decide?
Interview: A read-heavy product surface (think user profile page) is doing 14 joins per request and the DBA wants you to denormalize. Your peer argues denormalization will create a write-fanout nightmare. Who is right, and how do you decide?
- “Your async denormalization lags by 30 seconds during peak. A user updates their name and it stays old on the profile page. How do you fix without going synchronous?” - Strong answer: Read-your-own-writes via a session-scoped cache that overlays the projection for the writing user only, leaving other viewers on the eventually consistent path.
- “How do you denormalize when the source table has row-level security policies?” - Strong answer: You do not denormalize the raw fields; you denormalize the already-authorized projection output, so every reader still goes through the authorization boundary, just against the denormalized store.
- “What metric tells you the denormalization was a mistake?” - Strong answer: Write latency P99 climbing faster than read latency P99 is dropping, meaning you have moved the pain rather than removed it, plus a rising reconciliation drift rate.
- “Always denormalize for performance.” - Ignores write amplification, cache invalidation cost, and the consistency budget. At sufficient write volume, synchronous fanout exceeds the latency of the original join.
- “Add more indexes first and do not denormalize.” - Indexes do not help when the query plan is join-bound rather than scan-bound. At some point a 14-way join cannot be index-rescued; the planner still has to do the join work.
- “TAO: Facebook’s Distributed Data Store for the Social Graph” - USENIX ATC 2013
- “Designing Data-Intensive Applications” chapter 3 - Martin Kleppmann
- Related chapter: Database Deep Dives
Interview: Your checkout flow writes to Postgres (orders), DynamoDB (inventory), and Stripe (payment). A partial failure leaves an order with charged money but no inventory reservation. Walk me through how you design cross-database consistency without 2PC.
Interview: Your checkout flow writes to Postgres (orders), DynamoDB (inventory), and Stripe (payment). A partial failure leaves an order with charged money but no inventory reservation. Walk me through how you design cross-database consistency without 2PC.
saga_state table) so that after a crash the orchestrator can resume from the last committed step rather than replaying from the start. The orchestrator, not any of the participant databases, owns atomicity.Step 3 - Make every external call idempotent with a deterministic key: Use the saga ID as the idempotency key on Stripe charges, the DynamoDB conditional write condition, and the Postgres insert’s unique constraint. That way a network timeout that you cannot tell apart from a failure becomes safe to retry. Without idempotency keys, your compensator can charge twice or release inventory you never reserved.Real-World Example:
Uber’s Cadence (now Temporal) was built explicitly for this class of problem after their payments flow suffered exactly this failure mode in 2016. Amazon’s 2018 re:Invent talk on Step Functions for order orchestration describes their version of the same pattern for retail checkout.Senior Follow-up Questions:- “The compensator for ‘release inventory’ itself fails. Now what?” - Strong answer: The saga enters a stuck state that pages a human; there is no compensator-for-the-compensator, so you build a DLQ of stuck sagas with tooling to let an operator force-resolve with full audit trail.
- “How is this different from eventual consistency via CDC?” - Strong answer: CDC propagates state between your own stores where you control both sides; sagas coordinate mutations across systems where one side is external or cannot participate in a replication stream, and the business requires an explicit compensation semantic, not just read-side convergence.
- “Stripe already charged the card, but the order insert is failing because Postgres is in read-only mode during failover. What do you do?” - Strong answer: The saga pauses and retries with backoff rather than compensating, because voiding a successful authorization is worse customer experience than a 30 second delay; compensation is for logical failures, not infrastructure blips.
- “Use distributed transactions with XA.” - DynamoDB does not speak XA, Stripe does not participate in any transaction protocol, and even Postgres XA has well-documented recovery bugs under coordinator failure.
- “Just retry until everything succeeds.” - Without compensators, a permanent failure in step 3 leaves steps 1 and 2 committed forever, which is exactly the charged-money-no-inventory bug you are trying to avoid.
- “Sagas” - Garcia-Molina and Salem, 1987 (the original paper)
- Temporal documentation on saga pattern
- Related chapter: Cloud Service Patterns