
Part VIII — APIs and Integration

Real-World Story: Stripe’s API Design Philosophy. When Stripe launched in 2011, the payments landscape was dominated by clunky SOAP APIs with 70-page integration guides. Stripe made a radical bet: treat the API itself as the product. Every endpoint followed consistent conventions — predictable resource naming, uniform error formats, idempotency keys on every mutating operation, and versioning baked in from day one. The result? Developers could go from zero to processing payments in under an hour. Stripe pins each merchant to the API version they integrated with, so deploying a new version never breaks existing integrations. Their API changelog is a masterclass in backward compatibility. Today, Stripe’s API is widely regarded as the gold standard for public API design — not because it uses exotic technology, but because every design decision was made from the developer-consumer’s perspective. The lesson: API design is UX design for developers.
Real-World Story: Airbnb’s Migration from Monolithic MySQL to Service-Oriented Data Architecture. By 2015, Airbnb’s entire platform ran on a single monolithic MySQL database. Every team — search, payments, messaging, trust — shared the same database, the same connection pool, and the same on-call rotation when things broke. A slow query from the messaging team could starve the payments team of connections, and schema migrations required cross-team coordination that could take weeks. Airbnb began decomposing their monolith into domain-specific services, each owning its own data store. Search moved to Elasticsearch. The messaging team got its own database. Payment data was isolated behind a service boundary with strict access controls. The key insight was not “use microservices” — it was “data ownership must follow team ownership.” When one team’s schema change can break another team’s queries, you do not have independent services, you have a distributed monolith with extra network hops. Airbnb’s migration took years and required building internal tooling for data replication between services (they open-sourced SpinalTap, a MySQL changelog stream). The payoff: teams could evolve their schemas independently, scale their data stores to match their specific access patterns, and on-call pages dropped dramatically because failures were isolated.

Chapter 15: API Design

The 3 Rules of Great API Design. Every well-designed API — Stripe, Twilio, GitHub — follows the same three principles whether they know it or not:
  1. Predictable — Consistent naming conventions everywhere. If one endpoint uses created_at, every endpoint uses created_at (not createdAt in some and creation_date in others). If GET /users returns a list, GET /orders returns a list in the same envelope. A developer who has used one endpoint can guess the shape of any other endpoint.
  2. Forgiving — When something goes wrong, the API tells you exactly what happened and how to fix it. Not 400 Bad Request with no body — but { "error": { "code": "invalid_currency", "message": "Currency 'usd' is not valid. Did you mean 'USD'? Currencies must be uppercase ISO 4217 codes.", "param": "currency" } }. Good error messages are the best API documentation.
  3. Evolvable — Versioning is baked in from day one, not bolted on when the first breaking change happens. Every response includes only additive fields. Removing or renaming a field goes through a deprecation cycle. The API you ship on launch day should be designed assuming it will still be running in five years with clients you have never heard of.
These are not nice-to-haves. If your API is unpredictable, developers waste hours guessing conventions. If it is unforgiving, every integration turns into a support ticket. If it is not evolvable, your first breaking change becomes a crisis.
Over-Designing APIs. Building a perfectly RESTful API with HATEOAS, content negotiation, and every status code used correctly — for an internal service called by one frontend. Match the API design investment to the consumer count. Internal APIs between two services you own: keep it simple, use a few conventions, move fast. Public APIs consumed by hundreds of third-party developers: invest heavily in design, documentation, versioning, and backward compatibility.
Strong answer:
  1. Authentication mechanism — API keys for simplicity, OAuth 2.0 if delegated access is needed.
  2. Versioning strategy — URL-based (/v1/) for discoverability, with a deprecation policy from day one.
  3. Pagination — cursor-based (not offset) for consistency at scale; every list endpoint returns pagination metadata.
  4. Error format — consistent error response schema across all endpoints (error code, message, details, request_id for support).
  5. Rate limiting — per API key, with 429 Too Many Requests and Retry-After header.
Also: use consistent naming conventions (snake_case or camelCase, not a mix), return request_id in every response for debugging, and write OpenAPI spec before writing code.

15.1 REST API Design

Resources as nouns (/users), HTTP methods for actions, plural nouns, hierarchy for relationships (/users/123/orders). Status codes: 200 OK, 201 Created, 204 No Content, 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 409 Conflict, 422 Unprocessable Entity, 429 Too Many Requests, 500 Internal Server Error. Pagination: Offset-based (?page=2&size=20) is simple but slow on large datasets. Cursor-based (?cursor=abc&size=20) is consistent regardless of size. Always return pagination metadata.

15.2 gRPC

Binary protocol using Protocol Buffers over HTTP/2. Created by Google to replace their internal RPC framework (Stubby) — designed for high-throughput, low-latency service-to-service communication. How it works: Define your service in a .proto file (service methods + message types). The protoc compiler generates client and server code in your language. The client calls methods on a generated stub that looks like a local function call — the gRPC framework handles serialization, HTTP/2 framing, and network transport. Why it is faster: Protocol Buffers are 3-10x smaller than JSON (binary, no field names in payload). HTTP/2 multiplexes multiple RPCs over a single TCP connection (no connection-per-request overhead). Generated code eliminates serialization bugs and reduces boilerplate. Streaming: Server-side streaming (server sends a stream of responses to one request — live updates, large result sets). Client-side streaming (client sends a stream of messages — file upload, batch ingestion). Bidirectional streaming (both stream simultaneously — chat, real-time collaboration). Trade-offs: Not browser-friendly (needs gRPC-Web or Envoy proxy). Not human-readable (cannot curl and read the response — need grpcurl or Postman). No HTTP caching (all requests are POST). Steeper learning curve (proto files, code generation, streaming complexity). For public APIs consumed by third-party developers, REST is still the standard.
Protocol Buffers. A language-neutral, platform-neutral, extensible mechanism for serializing structured data. You define your data structures in .proto files, and the compiler generates code in your language. Much smaller and faster to serialize/deserialize than JSON.
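A minimal `.proto` sketch shows the pieces described above — a service definition, message types, and a streaming method. The service and field names here are hypothetical, chosen only for illustration:

```proto
syntax = "proto3";

package users.v1;

// Hypothetical user-lookup service; names are illustrative.
service UserService {
  rpc GetUser (GetUserRequest) returns (User);
  // Server-side streaming: one request, a stream of responses.
  rpc WatchUsers (WatchUsersRequest) returns (stream User);
}

message GetUserRequest {
  int64 id = 1;
}

message WatchUsersRequest {
  string team = 1;
}

message User {
  int64 id = 1;           // field numbers, not names, go on the wire —
  string email = 2;       // this is why payloads are so much smaller than JSON
  string display_name = 3;
}
```

Running `protoc` over this file generates the client stub and server skeleton; the numeric field tags are also what makes schema evolution safe, since adding a new field with a new number does not break old clients.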

15.3 Idempotency

An idempotent operation produces the same result regardless of how many times it is called. GET /user/123 is naturally idempotent (reading does not change state). POST /orders is NOT naturally idempotent (calling it twice creates two orders). Making non-idempotent operations safe for retries is one of the most important distributed systems skills. Implementation: Client generates a unique idempotency key (UUID) per request. Server checks if it has processed that key. If yes, returns the stored response. If no, processes and stores. Essential for payments, order creation, any operation where duplicates have real consequences.

Idempotency Key Patterns in Practice

Idempotency keys are more nuanced than “store a UUID.” Here is how production systems handle them. The basic pattern:
  1. Client generates a UUID v4 and sends it in the Idempotency-Key header.
  2. Server checks a lookup table (Redis or database) for the key.
  3. If key exists and the request completed: return the stored response (same status code, same body).
  4. If key exists and the request is in-progress: return 409 Conflict (prevents concurrent retries from double-processing).
  5. If key does not exist: create the record with status processing, execute the operation, store the response, update status to completed.
How payment systems handle this (Stripe’s model):
  • Idempotency keys are scoped to the API key (two different merchants can use the same key string without collision).
  • Keys expire after 24 hours — retrying after that creates a new operation.
  • If a retried request has a different body than the original, the server returns an error (the key is bound to the original parameters).
  • The stored response includes the exact HTTP status code and body — a retried POST /charges that originally returned 402 (Payment Required) returns 402 again, not 200.
Database schema for idempotency:
CREATE TABLE idempotency_keys (
  key         TEXT PRIMARY KEY,
  api_key_id  BIGINT NOT NULL,
  request_path TEXT NOT NULL,
  request_hash TEXT NOT NULL,  -- hash of the request body
  status      TEXT NOT NULL DEFAULT 'processing',  -- processing | completed | failed
  response_code INT,
  response_body JSONB,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  expires_at  TIMESTAMPTZ NOT NULL DEFAULT now() + INTERVAL '24 hours'
);
For critical operations like payments, wrap the idempotency check and the business operation in a single database transaction. This prevents a crash between “recorded the key” and “completed the operation” from leaving the system in a state where the key exists but the operation never finished. Use the status field to detect and recover from this — a processing key older than a timeout threshold should be retried.
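The full flow — lookup, replay, conflict detection, body-hash check, and stale-`processing` recovery — can be sketched as follows. The in-memory dict stands in for the `idempotency_keys` table above; a real system would run the check and the operation inside a database transaction, as just described. All names here are illustrative:

```python
import hashlib
import json
import time

# In-memory stand-in for the idempotency_keys table; a real system would use
# the database (inside a transaction) or Redis.
_keys: dict[str, dict] = {}
PROCESSING_TIMEOUT = 60  # seconds before a stuck 'processing' key may be retried

class ConflictError(Exception):
    """Concurrent retry while the original request is still in flight (map to 409)."""

def handle_idempotent(key: str, body: dict, operation) -> tuple[int, dict]:
    request_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    record = _keys.get(key)
    if record:
        if record["request_hash"] != request_hash:
            # Key reused with different parameters — reject, per Stripe's model.
            return 422, {"error": "idempotency key reused with a different body"}
        if record["status"] == "completed":
            # Replay the stored response: same status code, same body.
            return record["response_code"], record["response_body"]
        if time.time() - record["created_at"] < PROCESSING_TIMEOUT:
            raise ConflictError(key)
        # Stale 'processing' record (crash mid-operation) — fall through and retry.
    _keys[key] = {"status": "processing", "request_hash": request_hash,
                  "created_at": time.time()}
    code, response = operation(body)  # the business operation (e.g. create the charge)
    _keys[key].update(status="completed", response_code=code, response_body=response)
    return code, response
```

A retried request with the same key and body returns the stored response without re-executing the operation, which is exactly what makes client-side retries safe.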

15.4 API Versioning and Backward Compatibility

Safe changes: Adding optional fields, new endpoints, new optional parameters. Breaking changes: Removing fields, renaming fields, changing types, making optional fields required.

Versioning Strategies Compared

| Strategy | Example | Pros | Cons | Used By |
|---|---|---|---|---|
| URL path | /v1/users, /v2/users | Most discoverable, easy to route, easy to cache | URL pollution, harder to sunset | Stripe, GitHub, Twitter |
| Custom header | X-API-Version: 2 | Clean URLs, flexible | Hidden from browser, easy to forget | Azure, Jira |
| Accept header (content negotiation) | Accept: application/vnd.myapi.v2+json | HTTP-standards compliant, fine-grained | Complex, hard to test in browser | GitHub (also supports URL) |
| Query parameter | /users?version=2 | Easy to add, discoverable | Breaks caching, messy | Rarely recommended |
Recommendation: Use URL path versioning (/v1/) for public APIs — it is the most discoverable and requires no special client configuration. Reserve header-based versioning for internal APIs where you control all clients. Content negotiation is standards-purist but creates friction for third-party developers.
The Expand and Contract Pattern for APIs. To rename a field: (1) add the new field alongside the old one, (2) update clients to use the new field, (3) remove the old field. Never do step 1 and 3 together.
Real-world deprecation policy example (modeled after Stripe):
  1. Announce deprecation in API changelog and response headers (Deprecation: true, Sunset: 2026-12-01).
  2. Give at least 12 months notice for major version sunsetting.
  3. New accounts default to the latest version; existing accounts are pinned to their current version.
  4. Provide a migration guide for every breaking change.

15.5 GraphQL

A query language for APIs where the client specifies exactly what data it needs. A single endpoint (/graphql) serves all queries. The client sends a query describing the shape of the response. Benefits: No over-fetching (client gets only requested fields). No under-fetching (client gets all needed data in one request, unlike REST which may require multiple endpoints). Strongly typed schema acts as documentation. Great for mobile clients where bandwidth matters. Trade-offs: Caching is harder (every request is a POST with a unique body — HTTP caching does not work). Rate limiting is harder (one query can be trivial or extremely expensive). N+1 problems move to the server (resolving nested fields can trigger many database calls without careful dataloader implementation). Security: clients can craft expensive nested queries — implement query depth limits and complexity analysis. When to use: Multiple client types needing different data shapes (web shows full details, mobile shows summary). Rapidly evolving frontend requirements. When the relationship between data is graph-like. When NOT to use: Simple CRUD APIs, file uploads, real-time streaming (use subscriptions or SSE), when your team is small and REST is working fine.
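The “client specifies exactly what it needs” idea is easiest to see in a query. This example is against a hypothetical schema — the type and field names are illustrative, not from any real API:

```graphql
# Hypothetical schema — the client asks for exactly these fields, nothing more.
query OrderSummary($id: ID!) {
  order(id: $id) {
    id
    status
    customer {
      displayName      # a mobile client might request only this
    }
    items {
      productName
      quantity
    }
  }
}
```

A web client could add fields to the same query, and a mobile client could trim it down, all against the same single `/graphql` endpoint — no new REST endpoints required.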
Tools: Apollo (client and server, most popular GraphQL ecosystem). Hasura (instant GraphQL over PostgreSQL). graphql-codegen (TypeScript type generation from schema).
Connection: Idempotency (Part VIII) connects to retries (Part V — retry a failed request safely only if the operation is idempotent), duplicate messages (Part XV — at-least-once delivery means consumers must be idempotent), optimistic locking (Part IX — a retry that conflicts is safely rejected), and the outbox pattern (Part VII — an idempotent relay process can safely re-publish).

15.6 Rate Limiting

Control how many requests a client can make in a given time window. Protects against abuse, ensures fair resource distribution, and prevents cascade failures. Algorithms: Fixed window (simple, allows bursts at window boundaries). Sliding window (smoother, more complex). Token bucket (allows controlled bursts up to a limit, then steady rate). Leaky bucket (smoothest output rate, constant processing).

Rate Limiting Algorithms — Deep Comparison

| Algorithm | How It Works | Burst Handling | Memory | Precision | Best For |
|---|---|---|---|---|---|
| Fixed window | Counter resets every N seconds | Allows 2x burst at window boundary (end of one + start of next) | Very low (1 counter per key) | Low | Simple internal APIs where boundary bursts are acceptable |
| Sliding window log | Stores timestamp of each request, counts within rolling window | No boundary burst | High (stores every timestamp) | Exact | Low-volume APIs where precision matters more than memory |
| Sliding window counter | Weighted average of current and previous window | Minimal boundary burst | Low (2 counters per key) | Approximate | Most production APIs — good balance of accuracy and memory |
| Token bucket | Bucket fills at steady rate, each request takes a token, bucket has max capacity | Allows controlled bursts up to bucket size, then steady rate | Low (token count + last refill time) | Good | APIs where short bursts are acceptable (Stripe, AWS) |
| Leaky bucket | Requests enter a queue processed at constant rate | No bursts — excess requests are queued or dropped | Moderate (queue) | Smoothest output | Traffic shaping, when downstream systems need constant rate |
Token bucket vs sliding window — the practical choice:
  • Token bucket is best when you want to allow short bursts (a mobile app that syncs 20 items at once, then goes idle). Set bucket size = max burst, refill rate = sustained limit.
  • Sliding window counter is best when you want strictly even distribution and bursts are harmful (payment processing API).
Implementation: API gateway level (Kong, AWS API Gateway — easiest). Application middleware level (express-rate-limit, ASP.NET rate limiting). Distributed rate limiting with Redis (INCR with TTL, or Redis Cell module for token bucket). Return 429 Too Many Requests with Retry-After header.
Connection: Rate limiting at the API gateway is a form of backpressure (Part IV, Ch 7.5) that protects your circuit breakers (Part V, Ch 9.2) from opening unnecessarily. If your rate limiter is well-tuned, your circuit breakers should rarely trip.
Per what? Per IP (simplest, breaks for users behind shared IPs). Per API key (standard for authenticated APIs). Per user (fairest for logged-in users). Per tenant (essential in multi-tenant SaaS). Combine multiple: per-user within per-tenant limits.
Strong answer: The core problem is a throughput mismatch — our system produces events 100x faster than the partner can consume them. Sending callbacks at our rate would overwhelm their system, causing dropped events and cascading retries that make the backlog worse. Design approach:
  1. Decouple with an outbound queue. Instead of calling the partner synchronously as events occur, publish events to a message queue (SQS, Kafka, or Redis Streams) dedicated to this partner.
  2. Rate-limited consumer. A dedicated worker consumes from the queue at the partner’s rate — 100 RPS. Use a token bucket or leaky bucket rate limiter on the consumer side. The queue absorbs the burst; the consumer drains it steadily.
  3. Batching. If the partner supports batch callbacks, aggregate events into batches of 50-100 per request. This delivers 5,000-10,000 events per second within their 100 RPS limit. This is the highest-leverage optimization.
  4. Backpressure and overflow. Set a queue depth limit. If the partner falls behind (their system is down), the queue grows. Add dead-letter queue (DLQ) for events that fail after N retries. Alert when queue depth exceeds a threshold — this means the partner cannot keep up even at steady state.
  5. Webhook with retry and exponential backoff. For the actual callback, use exponential backoff with jitter (1s, 2s, 4s, …, max 5 minutes). Include an Idempotency-Key header so the partner can safely deduplicate retries.
  6. Pull alternative. Offer the partner a poll-based API (GET /events?since=cursor) as an alternative to push. This lets them control their own consumption rate. Many partners prefer this because it puts them in control of backpressure.
Key trade-offs: Push (webhook) is real-time but requires careful rate management. Pull (polling) is simpler but introduces latency up to the polling interval. Offer both; let the partner choose. Batching is the single biggest win — it is always worth proposing.
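The exponential-backoff-with-jitter schedule from step 5 is worth pinning down, because naive backoff without jitter synchronizes retries into waves. A minimal sketch of the “full jitter” variant (parameters are illustrative defaults matching the 1s-to-5-minutes schedule above):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """'Full jitter' exponential backoff: sleep a random amount in
    [0, min(cap, base * 2**attempt)]. Randomizing the whole interval spreads
    retries out, so a burst of failures does not produce synchronized
    retry waves against the recovering partner."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Attempt 0 waits up to 1s, attempt 1 up to 2s, attempt 2 up to 4s, and so on, capped at 5 minutes; paired with the Idempotency-Key header, the retries are both polite and safe.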
Strong answer: A single POST with a 5GB body is a non-starter — any network interruption means restarting from zero, most proxies and load balancers enforce body size limits (often 1-10MB), and you cannot report meaningful progress.
Design: Chunked Resumable Upload Protocol.
Step 1: Initiate the upload.
POST /uploads
Content-Type: application/json
{ "filename": "dataset.parquet", "size_bytes": 5368709120, "content_type": "application/octet-stream" }

Response 201:
{ "upload_id": "upl_abc123", "chunk_size": 5242880, "total_chunks": 1024, "upload_url": "/uploads/upl_abc123" }
The server calculates chunks (5MB each for a 5GB file = 1024 chunks) and returns an upload_id that the client uses for all subsequent requests.
Step 2: Upload chunks.
PUT /uploads/upl_abc123/chunks/0
Content-Type: application/octet-stream
Content-Range: bytes 0-5242879/5368709120
[binary data]

Response 200: { "chunk_index": 0, "status": "received", "checksum": "sha256:abc..." }
Each chunk includes its index and byte range. The server stores each chunk and returns a checksum for client-side verification.
Step 3: Resume after failure. If the connection drops, the client queries which chunks were received:
GET /uploads/upl_abc123/status
Response: { "upload_id": "upl_abc123", "received_chunks": [0, 1, 2, 3, 4], "missing_chunks": [5, 6, ..., 1023], "progress_pct": 0.49 }
The client resumes from chunk 5. No re-uploading completed chunks.
Step 4: Finalize.
POST /uploads/upl_abc123/complete
Response 200: { "file_id": "file_xyz789", "size_bytes": 5368709120, "checksum": "sha256:final...", "url": "/files/file_xyz789" }
The server assembles the chunks, verifies the full-file checksum, and returns the final file resource.
Key design decisions:
  • Chunk size: 5MB is a good default — small enough that a failed chunk is cheap to retry, large enough that 1024 chunks is manageable. Let the server decide (clients with slow connections could request smaller chunks).
  • Idempotency: Uploading the same chunk index twice is idempotent (server overwrites or ignores if checksum matches). This makes retries safe.
  • Progress: The client knows received_chunks / total_chunks. For real-time progress, the client can calculate bytes uploaded per second from the chunk upload times.
  • Expiry: Incomplete uploads expire after 24 hours. The client can extend with a PATCH /uploads/{id} to refresh the TTL.
  • Parallel uploads: Chunks are independent, so the client can upload 3-5 chunks in parallel for better throughput on high-bandwidth connections.
This is essentially the protocol used by Google’s resumable upload API, tus.io (an open protocol for resumable uploads), and AWS S3 multipart uploads.
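The resume logic from Step 3 reduces to a small client-side loop. `fetch_status` and `upload_chunk` are hypothetical transport callbacks (an HTTP client would implement them against `GET /uploads/{id}/status` and `PUT /uploads/{id}/chunks/{i}` above); injecting them keeps the sketch self-contained:

```python
# Client-side resume loop for the chunked-upload protocol sketched above.
# `fetch_status` and `upload_chunk` are hypothetical transport callbacks.

def resume_upload(upload_id: str, chunks: list[bytes], fetch_status, upload_chunk) -> int:
    """Upload only the chunks the server has not yet received.
    Returns how many chunks were (re)sent."""
    status = fetch_status(upload_id)            # e.g. {"received_chunks": [0, 1, 2]}
    received = set(status["received_chunks"])
    sent = 0
    for index, data in enumerate(chunks):
        if index in received:
            continue                            # never re-upload completed chunks
        upload_chunk(upload_id, index, data)    # idempotent per chunk: safe to retry
        sent += 1
    return sent
```

Because each chunk upload is idempotent, this loop can itself be retried after any failure; the worst case is re-sending a chunk the server already has.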

15.7 REST vs gRPC vs GraphQL — When to Use Each

| Criteria | REST | gRPC | GraphQL |
|---|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/2 (always) | HTTP (typically POST) |
| Format | JSON (text) | Protocol Buffers (binary) | JSON (text) |
| Type safety | OpenAPI/Swagger (optional) | Strong (code-generated from .proto) | Strong (schema-defined) |
| Streaming | SSE or WebSocket (separate) | Native bidirectional streaming | Subscriptions (over WebSocket) |
| Browser support | Native | Requires gRPC-Web proxy | Native |
| Caching | HTTP caching works natively | No HTTP caching (binary POST) | Hard (every request is a unique POST) |
| Payload size | Larger (JSON text + field names) | 3-10x smaller (binary, no field names) | Variable (client picks fields) |
| Best for | Public APIs, browser clients | Internal service-to-service, high-throughput | Multiple client types needing different data |
| Worst for | Mobile (over-fetching) | Browser clients, public APIs | Simple CRUD, real-time streaming |
| Versioning | URL-based (/v1/), easy | Proto file evolution, careful | Schema evolution, deprecation directives |

Decision Matrix with Concrete Scenarios

| Scenario | Choose | Why |
|---|---|---|
| Public API for third-party developers | REST | Universal tooling, discoverable, HTTP caching, curl-friendly |
| Mobile app + web app consuming same backend, different data needs | GraphQL | Each client fetches exactly the fields it needs, one endpoint |
| Microservice A calls microservice B 10,000 times/second | gRPC | Binary serialization, HTTP/2 multiplexing, code-generated stubs, streaming |
| Internal service with complex data relationships (social graph) | GraphQL | Natural fit for graph-like data traversal |
| Real-time bidirectional communication (chat, collab editing) | gRPC streaming | Native bidi streaming over HTTP/2, no WebSocket needed |
| Simple CRUD API, small team, getting started | REST | Lowest complexity, best documentation, largest talent pool |
| Backend-for-frontend (BFF) aggregating multiple services | GraphQL | Single query resolves data from multiple sources |
| IoT device sending telemetry at very high volume | gRPC | Minimal payload size, efficient binary serialization |
Decision rule: Public API -> REST. Internal service-to-service with high throughput -> gRPC. Multiple client types with different data needs -> GraphQL. When in doubt -> REST (simplest, most understood, best tooling).
Analogy: Restaurant vs Buffet. REST is like ordering from a restaurant menu — you pick from fixed options, and the kitchen decides what comes on the plate (maybe you get fries you did not want). GraphQL is like a buffet — you walk up and pick exactly what you want, nothing more, nothing less. The restaurant is simpler to run (fixed menu, predictable load), but the buffet is better when every diner has different dietary needs. gRPC is like a professional catering service — incredibly efficient, highly optimized, but you need a contract (proto file) agreed upon in advance, and it is not something you set up for a casual lunch.
You do not have to pick just one. Many production systems use REST for public APIs, gRPC between internal microservices, and GraphQL for the frontend gateway. Netflix, Airbnb, and Shopify all run hybrid architectures.

15.8 Contract Testing

Consumer defines expectations, provider verifies it satisfies them. Catches breaking changes before deployment.
Tools: Pact for consumer-driven contract testing. OpenAPI/Swagger for API documentation and code generation. Postman for API testing and collaboration.
Cross-Chapter Connections for API Design.
  • Caching — API responses are prime caching candidates. Use HTTP cache headers (Cache-Control, ETag, Last-Modified) for GET endpoints, and application-level caching (Redis) for expensive computations behind endpoints. A well-cached API can serve 100x the traffic without scaling the backend. See the Caching chapter for cache invalidation patterns and HTTP caching strategies.
  • Performance and Reliability — Rate limiting (covered above) is only one layer of API protection. Circuit breakers protect your API from slow downstream dependencies. Timeouts prevent a single slow request from consuming a connection indefinitely. See the Performance chapter for the full resilience stack: timeouts, retries, circuit breakers, and bulkheads.
  • Messaging — For APIs that trigger long-running operations (video processing, report generation, bulk imports), return 202 Accepted immediately and publish the work to a message queue. The client polls a status endpoint or receives a webhook callback when complete. This pattern prevents HTTP timeouts and lets you scale processing independently from request handling. See the Messaging chapter for async job patterns and webhook delivery guarantees.
Further reading:
  • RESTful Web APIs by Leonard Richardson and Mike Amundsen — deep dive into REST API design principles.
  • API Design Patterns by JJ Geewax — reusable patterns for designing consistent, developer-friendly APIs.
  • Stripe API Reference — the gold standard for public API design; study the consistency of naming, error formats, pagination, and versioning across every resource.
REST best practices:
  • Microsoft REST API Guidelines — comprehensive, opinionated guidelines from Microsoft covering naming, versioning, pagination, error handling, and long-running operations; the most thorough public API style guide available.
  • Zalando RESTful API Guidelines — battle-tested guidelines from Zalando’s engineering team covering URL design, JSON conventions, HTTP headers, deprecation, and event-driven APIs; an excellent complement to Microsoft’s guidelines with a European engineering perspective.
gRPC:
  • gRPC Official Documentation — the canonical reference for gRPC concepts, language-specific quickstarts, authentication, error handling, and performance best practices; start with the “What is gRPC?” overview and the language tutorial for your stack.
  • Google’s Introduction to gRPC — Google’s comparison of gRPC, OpenAPI, and REST with guidance on when each is appropriate; valuable for understanding Google’s own reasoning since they created gRPC.
GraphQL:
  • GraphQL Official Learning Guide — the best starting point for GraphQL fundamentals; covers queries, mutations, schemas, validation, and execution in a clear, progressive tutorial written by the GraphQL Foundation.
  • Apollo GraphQL Documentation — comprehensive docs for the most popular GraphQL ecosystem; covers client-side caching, server-side schema design, federation for microservices, and performance monitoring; particularly valuable for understanding how GraphQL works in production.
  • GraphQL in Action by Samer Buna — practical guide to building GraphQL APIs from schema design through production deployment.
Rate limiting:
  • Cloudflare Blog: What is Rate Limiting? — clear explanation of rate limiting algorithms (fixed window, sliding window, token bucket) with visual diagrams; Cloudflare processes millions of requests per second, so their perspective on rate limiting at scale is authoritative.
  • Stripe’s Rate Limiting Approach — how Stripe implements multi-tier rate limiting (request rate limiter, concurrent request limiter, fleet usage load shedder, worker utilization load shedder); the most practical production-grade rate limiting write-up available.
Idempotency:
  • Stripe’s Idempotent Requests Documentation — Stripe’s official documentation on idempotency keys, including key scoping, expiration, and error handling; the industry reference implementation for idempotent API design.
  • Brandur Leach’s blog on idempotency keys — deep, production-tested writing on implementing idempotency keys with database transactions, rocket rides (the canonical example), and designing APIs that survive distributed failures; the most thorough technical deep dive on the subject.
API versioning:
  • Stripe’s API Versioning Approach — how Stripe versions their API by pinning each integration to a version, running version-specific request/response transforms, and maintaining backward compatibility across hundreds of versions; essential reading for anyone building a long-lived public API.

Part IX — Databases and Data Systems

The database is where most systems spend the most time, have the most bugs, and face the hardest scaling challenges. The reason databases are hard is not the SQL — it is the trade-offs: consistency vs performance (isolation levels), read speed vs write speed (indexes), flexibility vs integrity (schema), and single-node simplicity vs distributed resilience (replication/sharding). Every database decision is a trade-off decision.
Real-World Story: Slack’s Migration from MySQL to Vitess. By 2017, Slack’s MySQL infrastructure was hitting a wall. Their architecture used one MySQL database per workspace (customer), which worked well at first — it provided natural data isolation and meant a problem in one workspace could not affect another. But as Slack grew to millions of workspaces, they were managing tens of thousands of MySQL instances. Schema migrations had to be rolled out across all of them. Connection pooling was a nightmare. Large enterprise workspaces were outgrowing single-node MySQL, but their per-workspace-database model did not support sharding within a workspace. Slack adopted Vitess — a MySQL-compatible horizontal sharding layer originally built at YouTube. Vitess sits between the application and MySQL, routing queries to the correct shard transparently. It handles connection pooling (solving the thousands-of-connections problem), online schema migrations (applying DDL changes across shards without downtime), and horizontal resharding (splitting a hot shard into smaller ones). The migration was not a rewrite — Slack’s application still spoke MySQL. Vitess acted as an intelligent proxy that made a cluster of MySQL instances behave like a single, horizontally scalable database. The lesson: you do not always need to abandon your database when you hit scale limits. Sometimes the answer is a layer that makes your existing database distributed.
Real-World Story: GitHub’s Zero-Downtime Schema Migrations with gh-ost. GitHub runs one of the world’s largest MySQL deployments — over 300 million repositories backed by MySQL. When you need to add a column or change an index on a table with billions of rows, traditional ALTER TABLE locks the table for hours or days — completely unacceptable for a service that developers depend on 24/7. GitHub built gh-ost (GitHub Online Schema Transmogrifier), an open-source tool that performs schema migrations without locking the original table. How it works: gh-ost creates a “ghost” table with the new schema, then uses MySQL’s binary log (binlog) stream to copy rows from the original table to the ghost table while simultaneously capturing and replaying any new writes. When the ghost table is caught up, it performs an atomic table swap. The entire process runs with minimal impact on production traffic because it is throttle-aware — it monitors replication lag and slows down or pauses the migration if the database is under stress. gh-ost replaced the previous approach (pt-online-schema-change from Percona) because it does not use triggers, which were a source of unpredictable load spikes. GitHub now runs thousands of schema migrations per year across their fleet with zero downtime. The lesson: at scale, schema migration is not a DBA task — it is a distributed systems problem that requires dedicated tooling.

Chapter 16: Relational Foundations

16.1 Normalization and Denormalization

Normalization: Eliminate redundancy — each fact stored exactly once. A customer’s address is in the customers table, not duplicated in every orders row. Benefits: no update anomalies (change address once, reflected everywhere), smaller storage, data integrity. Cost: queries need JOINs to reassemble related data. Denormalization: Introduce controlled redundancy for read performance. Store the customer’s name directly in the orders table so displaying an order does not require a JOIN. Benefits: faster reads (no JOINs), simpler queries. Cost: update anomalies (customer changes name -> must update it in every order row), more storage, risk of inconsistency. When to denormalize: When reads vastly outnumber writes (product catalog read 1000x per write). When JOIN performance is unacceptable (joining 5 tables for a dashboard query). When you are building a read-optimized model (CQRS read side, search index, analytics). When NOT to: When write consistency is critical (financial data). When the data changes frequently (denormalized copies go stale). As a first resort (optimize queries and indexes first).
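The update anomaly is easy to demonstrate. A minimal sqlite3 sketch (table and column names are illustrative): the customer's name is copied into orders for JOIN-free reads, and a rename leaves those copies stale until each one is repaired.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    customer_name TEXT  -- denormalized copy, so reads skip the JOIN
);
INSERT INTO customers VALUES (1, 'Alice Smith');
INSERT INTO orders VALUES (101, 1, 'Alice Smith'), (102, 1, 'Alice Smith');
""")

# The customer changes her name: the normalized fact updates in one place...
conn.execute("UPDATE customers SET name = 'Alice Jones' WHERE id = 1")

# ...but the denormalized copies are now stale (the update anomaly).
stale = conn.execute(
    "SELECT COUNT(*) FROM orders o JOIN customers c ON c.id = o.customer_id "
    "WHERE o.customer_name <> c.name"
).fetchone()[0]
print(stale)  # 2: both order rows still carry the old name

# The price of denormalization: every copy must be repaired explicitly.
conn.execute("""
UPDATE orders SET customer_name =
    (SELECT name FROM customers WHERE id = orders.customer_id)
""")
```

This is exactly the trade the section describes: reads avoid a JOIN, and in exchange every name change becomes a multi-row write that the application must not forget.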

16.2 Indexing In Depth

Analogy: The Table of Contents. A database index is like a book’s table of contents — without one, you would have to read every page to find what you need (a full table scan). The table of contents tells you “Chapter 7 is on page 142,” so you jump straight there. A composite index is like a table of contents organized by topic AND subtopic — you can quickly find “Networking > TCP Handshake” without scanning every networking entry. A partial index is like an index that only covers chapters 5-8, because those are the ones readers look up most often. And just like a real book, adding too many indexes (imagine a 50-page table of contents for a 200-page book) makes the book harder to update and physically larger — that is write amplification.

Index Types — When Each Shines

| Index Type | Data Structure | Supports | Best For | Avoid When |
|---|---|---|---|---|
| B-tree | Balanced tree | Equality, range (>, <, BETWEEN), sorting, prefix LIKE 'foo%' | Default choice for almost everything — primary keys, foreign keys, most WHERE clauses | You only ever do exact-match lookups at extreme scale (hash might edge it out) |
| Hash | Hash table | Equality only (=) | Exact-match lookups in theory (O(1) vs B-tree O(log n)) | Range queries, sorting, prefix matching. In PostgreSQL, hash indexes were not crash-safe before v10 and remain uncommon; B-tree handles equality efficiently enough that hash rarely justifies losing range support |
| Composite | B-tree (multi-column) | Queries matching the leftmost prefix of columns | Multi-column WHERE clauses, covering queries | Column order does not match query patterns (leftmost prefix rule) |
| Partial | B-tree (filtered) | Queries matching the WHERE condition of the index | Indexing a subset — e.g., WHERE status = 'active' when 95% of rows are inactive | The filtered subset is most of the table (no space savings) |
| Covering | B-tree with INCLUDE columns | Index-only scans (database never touches the heap) | Read-heavy queries where all selected columns fit in the index | Wide rows or frequently updated INCLUDE columns |
| GIN (Generalized Inverted Index) | Inverted index | Array containment, full-text search, JSONB key/value queries | PostgreSQL full-text search (to_tsvector), JSONB @> operators, array && overlap | Simple scalar equality — GIN has higher write cost |
| GiST (Generalized Search Tree) | Balanced tree (extensible) | Range overlap, containment, nearest-neighbor | PostGIS geospatial queries (ST_Within, ST_DWithin), range types, ltree hierarchies | Standard equality/range on scalar types (B-tree is faster) |
Write amplification: Each index adds a write operation to every INSERT, UPDATE, and DELETE. A table with 8 indexes means a single row insert causes ~9 disk writes (1 table + 8 indexes). In concrete terms: a single-index INSERT on PostgreSQL typically takes 0.1-0.5 ms. With 8 indexes, that same INSERT takes 0.8-4 ms — not a problem for individual writes, but at 10,000 inserts/second that is the difference between 5 and 40 seconds of cumulative write work per wall-clock second, which can saturate disk I/O. For write-heavy tables, every index must justify its existence. This is why write-optimized databases (LSM-tree based: Cassandra, RocksDB, LevelDB) use append-only writes and defer index maintenance — they trade read latency for write throughput. Common mistakes: Over-indexing (slows writes due to write amplification). Under-indexing foreign keys. Functions on indexed columns preventing index usage (WHERE UPPER(name) = 'JOHN' does not use an index on name — use a functional index instead).
How to decide: Start with B-tree for every new index. Switch to GIN only if you are searching inside JSONB, arrays, or full-text. Switch to GiST only for geospatial or range-type queries. Use partial indexes to shrink index size when you only query a subset of rows. Use composite indexes when your queries consistently filter on 2-3 columns together — put the highest-cardinality column first for equality checks, and the range column last.
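The leftmost prefix rule can be verified directly with EXPLAIN. A sqlite3 sketch (the same rule applies in PostgreSQL and MySQL, though the plan output format differs; table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    status TEXT,
    created_at TEXT,
    amount REAL  -- extra column so SELECT * cannot be served index-only
);
CREATE INDEX idx_status_created ON orders(status, created_at);
""")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the access path.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filter includes the leftmost column (status): the composite index is usable.
p1 = plan("SELECT * FROM orders "
          "WHERE status = 'active' AND created_at > '2025-01-01'")
print(p1)  # plan mentions USING INDEX idx_status_created

# Filter on created_at alone skips the leftmost prefix: full table scan.
p2 = plan("SELECT * FROM orders WHERE created_at > '2025-01-01'")
print(p2)  # plan shows a SCAN, no index
```

The same experiment on PostgreSQL (EXPLAIN instead of EXPLAIN QUERY PLAN) shows an Index Scan for the first query and a Seq Scan for the second.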

16.3 Transactions and Isolation Levels

ACID: Atomicity, Consistency, Isolation, Durability.

Isolation Levels, Anomalies, and When to Use Each

| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew | Performance | Default In |
|---|---|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Possible | Fastest | Almost never used |
| Read Committed | Prevented | Possible | Possible | Possible | Fast | PostgreSQL, Oracle, SQL Server |
| Repeatable Read | Prevented | Prevented | Possible (SQL standard) / Prevented (PostgreSQL) | Possible | Moderate | MySQL InnoDB |
| Serializable | Prevented | Prevented | Prevented | Prevented | Slowest | None by default |
Anomalies explained:
  • Dirty read: Transaction A reads data written by Transaction B before B commits. If B rolls back, A used data that never existed. Prevented at Read Committed and above.
  • Non-repeatable read: Transaction A reads a row, Transaction B updates and commits that row, Transaction A reads again and gets a different value. Within a single transaction, the same query returns different results. Prevented at Repeatable Read and above.
  • Phantom read: Transaction A queries rows matching a condition (WHERE price < 100), Transaction B inserts a new row matching that condition and commits, Transaction A re-queries and gets an extra row that was not there before. Prevented at Serializable (and at PostgreSQL’s Repeatable Read due to snapshot isolation).
  • Write skew: Two transactions each read a value, make a decision based on it, and write — but their combined writes violate a constraint that neither individual write violates. Example: two doctors both check that at least one doctor is on-call, each decides to go off-call, both commit — now zero doctors are on-call. Only Serializable prevents this.
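The on-call scenario can be simulated in a few lines of plain Python, with two snapshot reads standing in for two snapshot-isolated transactions (no real database involved):

```python
# Simulating write skew: each "transaction" reads a private snapshot,
# validates the constraint against that snapshot, then writes.
# Invariant to protect: at least one doctor must remain on call.
on_call = {"dr_alice", "dr_bob"}

def decide_to_leave(snapshot, doctor):
    # The check runs against the transaction's snapshot, not live state.
    if len(snapshot) >= 2:  # "someone else is still on call, I can go"
        return doctor
    return None

# Both transactions snapshot the same state before either commits.
snapshot_a, snapshot_b = set(on_call), set(on_call)
leaver_a = decide_to_leave(snapshot_a, "dr_alice")
leaver_b = decide_to_leave(snapshot_b, "dr_bob")

# The writes touch DIFFERENT rows, so neither conflicts with the other
# and both commit: first-committer-wins conflict detection never fires.
for doctor in (leaver_a, leaver_b):
    if doctor:
        on_call.discard(doctor)

print(on_call)  # set(): nobody is on call; only Serializable catches this
```

Because each write modified a row the other transaction never touched, no row-level conflict check can save you; the violated constraint only exists across both rows.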
Practical guidance:
  • Read Committed: Use for typical web requests — showing a product page, listing orders, updating a user profile. Slight staleness within a transaction is fine because each statement sees the latest committed data.
  • Repeatable Read: Use for reports that run multiple queries and need consistent numbers — “total revenue this month” should not change between the SELECT for orders and the SELECT for refunds within the same report transaction.
  • Serializable: Use for financial operations where correctness is critical — transferring money between two accounts (both balances must be read and written atomically), inventory reservation (two users trying to buy the last item must not both succeed).
PostgreSQL’s REPEATABLE READ is actually snapshot isolation (MVCC), which prevents phantom reads. MySQL’s REPEATABLE READ uses gap locks and behaves differently. Know your database’s specific implementation.

16.4 Locking

Pessimistic locking: SELECT ... FOR UPDATE acquires a lock on the row. Other transactions wait. Use when contention is high and conflicts are expected. Optimistic locking: Include a version column. On update, check WHERE version = expected_version. If no row is updated, someone else changed it — retry or fail. Use when contention is low. Advisory locking: Application-level locks using the database as a coordination point. PostgreSQL provides pg_advisory_lock(key). Use for distributed locking when you want to coordinate across application instances — e.g., ensuring only one instance runs a scheduled job.
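A minimal optimistic-locking sketch using sqlite3 (schema and helper names are illustrative): the version check rides along in the WHERE clause, and a zero rowcount signals a lost race.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO accounts VALUES (1, 100, 1)")

def update_balance(conn, account_id, new_balance, expected_version):
    """Optimistic write: succeeds only if nobody bumped the version meanwhile."""
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    return cur.rowcount == 1  # 0 rows updated => concurrent modification

# Two clients read the same row (both see version 1)...
ok_first = update_balance(conn, 1, 150, expected_version=1)   # wins
ok_second = update_balance(conn, 1, 175, expected_version=1)  # stale: loses
print(ok_first, ok_second)  # True False; the loser must re-read and retry
```

No row is ever locked: the conflict is detected at write time, which is exactly why this approach shines when conflicts are rare and would be wasteful to lock against.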
Advisory Lock. Unlike row locks that protect data, advisory locks protect application logic. They do not lock any table or row — they are purely cooperative. Your code must explicitly acquire and release them. Useful for job scheduling, migration coordination, and singleton processing.
Deadlocks: Transaction A holds Row 1, waits for Row 2. Transaction B holds Row 2, waits for Row 1. Prevention: consistent lock ordering, short transactions, lock timeouts.
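Consistent lock ordering is the simplest of these preventions to demonstrate. A Python threading sketch, with thread locks standing in for database row locks:

```python
import threading

locks_by_id = {1: threading.Lock(), 2: threading.Lock()}
completed = []

def transfer(from_id, to_id):
    # Lock rows in ascending id order regardless of transfer direction.
    # With a single global lock order, the wait-for graph cannot contain
    # a cycle, so these two transactions cannot deadlock each other.
    first, second = sorted((from_id, to_id))
    with locks_by_id[first]:
        with locks_by_id[second]:
            completed.append((from_id, to_id))  # ...balance updates here...

# A transfers 1 -> 2 while B transfers 2 -> 1. With naive "lock source,
# then destination" ordering this pair is the classic deadlock; with
# sorted ordering both threads contend on row 1 first and proceed safely.
a = threading.Thread(target=transfer, args=(1, 2))
b = threading.Thread(target=transfer, args=(2, 1))
a.start(); b.start(); a.join(); b.join()
print(len(completed))  # 2: both finished
```

The same discipline applies inside SQL transactions: always touch rows (or tables) in a canonical order, e.g. by ascending primary key.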
Gap Locks in MySQL. MySQL InnoDB’s REPEATABLE READ uses “gap locks” that lock ranges of index records, not just specific rows. This can cause unexpected blocking and deadlocks in high-concurrency write scenarios. PostgreSQL’s MVCC approach avoids this by using snapshot isolation instead.

16.5 CAP Theorem and Distributed Data

CAP theorem: Consistency, Availability, Partition Tolerance. The common “pick two” framing is misleading — network partitions are not optional, they will happen. The real choice is: during a partition, do you sacrifice consistency or availability? In normal operation (no partition), you can have both.

PACELC — The More Practical Model

The PACELC model (Daniel Abadi) extends CAP to capture what matters during normal operation: during a Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency. This captures the fact that even without partitions, there is a trade-off between response speed and consistency (synchronous replication is consistent but slower).
| System | During Partition (PAC) | Normal Operation (ELC) | Classification |
|---|---|---|---|
| PostgreSQL (sync replication) | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
| Cassandra | Sacrifice Consistency (AP) | Sacrifice Consistency for Latency (EL) | PA/EL |
| DynamoDB (eventual) | Sacrifice Consistency (AP) | Sacrifice Consistency for Latency (EL) | PA/EL |
| DynamoDB (strong reads) | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
| MongoDB (majority writes) | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
| CockroachDB | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
PACELC is more useful than CAP for real engineering decisions because most of the time your system is NOT partitioned. The daily trade-off you actually make is latency vs consistency — and PACELC makes that explicit. When someone says “we chose Cassandra for availability,” the more honest statement is “we chose Cassandra because we value low latency over strong consistency in normal operation, AND we prefer availability over consistency during partitions.”
CP systems: ZooKeeper, etcd, traditional RDBMS with synchronous replication. Refuse to respond rather than return stale data. AP systems: Cassandra, DynamoDB (eventual consistency mode), CouchDB. Return potentially stale data rather than refusing. Practical nuance: Real systems are not purely CP or AP. They make different trade-offs per operation. Write CP, read AP is a common hybrid approach.
Eventual Consistency. After an update, replicas may return stale data for a period, but will eventually converge to the latest value. “Eventually” could be milliseconds or seconds. The question is whether your application can tolerate the window of inconsistency.

16.6 Database Replication

Synchronous vs Asynchronous vs Semi-Synchronous

| Replication Mode | How It Works | Data Loss Risk | Write Latency | Availability Impact |
|---|---|---|---|---|
| Synchronous | Write confirmed only after replica acknowledges | Zero (for acknowledged replicas) | High (network round-trip to replica) | Replica outage blocks writes |
| Asynchronous | Write confirmed after primary writes locally; replica catches up later | Possible (primary fails before replica receives write) | Lowest | Replica outage has no impact on writes |
| Semi-synchronous | Write confirmed after at least one replica acknowledges; remaining replicas are async | Minimal (at least one replica has the data) | Moderate (one round-trip, not all) | Tolerates all-but-one replica failures |
Synchronous replication: Write is confirmed only after at least one replica (or a configurable quorum) acknowledges. Guarantees no data loss if the primary fails. Adds latency to writes. Waiting for all replicas is possible but rarely done — it makes writes as slow as the slowest replica and means a single replica outage blocks all writes. Asynchronous replication: Write is confirmed after primary acknowledges. Replicas catch up later. Risk of data loss if primary fails before replication completes. Semi-synchronous replication (the practical middle ground): Used in MySQL and PostgreSQL (via synchronous_standby_names with ANY 1). The primary waits for at least one replica to confirm, then acknowledges the write. Remaining replicas replicate asynchronously. This gives you durability (one replica always has the latest data) without the fragility of full synchronous replication (one slow replica does not block all writes). This is the most common production configuration for databases that need strong durability without sacrificing availability. Read replicas: Direct read traffic to replicas, reducing load on the primary. Reads may be slightly stale (replication lag).
Replication Lag. A user creates a record, is redirected to a page that reads it, and the read hits a replica that has not received the write yet. Solutions: “read your own writes” consistency (route reads to primary for the writing user for a short period), or use the primary for reads immediately after writes.
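A sketch of the "read your own writes" routing described above, in Python (the 5-second stickiness window and the class name are illustrative choices, not a standard):

```python
import time

class ReadRouter:
    """Route a user's reads to the primary for a short window after they
    write, so they always see their own writes despite replication lag."""

    def __init__(self, stickiness_seconds=5.0):
        self.stickiness = stickiness_seconds
        self.last_write_at = {}  # user_id -> monotonic timestamp of last write

    def record_write(self, user_id):
        # Call this on every successful write for the user.
        self.last_write_at[user_id] = time.monotonic()

    def choose(self, user_id):
        wrote_at = self.last_write_at.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.stickiness:
            return "primary"   # replica may not have this user's write yet
        return "replica"       # safe to serve from a read replica

router = ReadRouter(stickiness_seconds=5.0)
router.record_write("user-42")
print(router.choose("user-42"))   # primary (wrote moments ago)
print(router.choose("user-99"))   # replica (no recent write)
```

In a multi-server deployment the last-write timestamps would live in a shared store (a session cookie or Redis), since the next request may land on a different application server.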
Strong answer: First, verify that reads are the bottleneck (not write contention or slow queries — adding replicas does not help those). Check pg_stat_activity — if you see 80%+ of active connections running SELECT queries and the primary CPU is pegged at 90%+, read replicas will help. If you see lock waits and slow writes, replicas will not. Then set up asynchronous replication to one read replica. On a typical cloud instance (e.g., RDS db.r6g.xlarge with 4 vCPUs), a single PostgreSQL node handles roughly 5,000-10,000 simple read queries per second. Adding one read replica doubles that read capacity to ~10,000-20,000 QPS. Route read-only queries from the application to the replica — this can be done at the ORM level (Rails connected_to, Django using), at the connection pool level, or via a query-routing proxy such as Pgpool-II or PgCat (note that PgBouncer alone only pools connections; it does not split reads from writes). Monitor replication lag — in normal operation within the same availability zone, expect 1-10 ms lag; cross-region replicas may see 50-200 ms lag depending on distance. Keep all writes on the primary.
Classic replication lag issue. The user wrote to the primary, then the feed page read from the replica, but the replica had not received the write yet. Fixes: (1) Route the writing user’s reads to the primary for the next 5 seconds after a write (track this in the session — “read your own writes” consistency). (2) After creating a post, return the new post in the API response and have the frontend optimistically insert it into the feed without waiting for a server round-trip. (3) For the specific “just wrote” scenario, include a ?after_write=true parameter that forces a primary read. Option 2 is best UX. Option 1 is the safest server-side fix.

16.7 Backup and Recovery

The principle: Backups are worthless. Restores are priceless. If you have never tested restoring from your backups, you do not have backups — you have hope. Backup types: Full backup (complete copy of all data — slow, large, but simple to restore). Incremental backup (only changes since last backup — fast, small, but restore requires the full backup plus all increments). Differential backup (changes since last full backup — middle ground). Point-in-Time Recovery (PITR): Restore the database to any specific moment. PostgreSQL achieves this through WAL (Write-Ahead Log) archiving — the base backup plus archived WAL segments let you replay transactions up to any timestamp. MySQL uses binary logs (binlogs) similarly. Cloud-managed databases (RDS, Cloud SQL) provide automated PITR with configurable retention (typically 1-35 days). Backup testing: Schedule regular restore drills (monthly minimum). Measure restore time — does it meet your RTO? Verify data integrity after restore. Automate the verification (restore to a test instance, run validation queries, compare row counts). If you only discover your backup is corrupted during a real disaster, you have failed. Cloud-managed backup: RDS automated backups (daily snapshots + continuous WAL archiving). Cloud SQL automated backups. Azure SQL geo-redundant backups. These are good defaults but understand their RPO (usually 5 minutes for PITR) and retention limits.
Logical backups (pg_dump, mysqldump) are portable but slow for large databases. Physical backups (filesystem-level snapshots, pg_basebackup) are fast but tied to the same database version. For databases over 100 GB, always use physical backups for disaster recovery.
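An automated restore drill can be sketched in a few lines; here sqlite3 stands in for the real database and a file copy stands in for pulling from backup storage (names and validation queries are illustrative):

```python
import os
import shutil
import sqlite3
import tempfile

# "Production" database with known contents.
prod_path = os.path.join(tempfile.mkdtemp(), "prod.db")
prod = sqlite3.connect(prod_path)
prod.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
prod.executemany("INSERT INTO orders (amount) VALUES (?)",
                 [(i,) for i in range(100)])
prod.commit()
prod.close()

# "Backup": a physical file copy; a real drill restores from backup storage.
backup_path = prod_path + ".bak"
shutil.copyfile(prod_path, backup_path)

def verify_restore(restored_path, expected_counts):
    """Restore drill: open the restored copy, check integrity, validate counts."""
    db = sqlite3.connect(restored_path)
    try:
        ok = db.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        for table, expected in expected_counts.items():
            ok &= db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0] == expected
        return ok
    finally:
        db.close()

print(verify_restore(backup_path, {"orders": 100}))  # True
```

The production equivalent restores to a throwaway instance, runs integrity checks and row-count/checksum comparisons, records how long the restore took (your measured RTO), and alerts if any step fails.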

16.8 Schema Versioning and Migrations

Use the expand and contract pattern: add new column, deploy app writing to both, backfill, deploy app reading from new column, remove old column. Never rename a column in one step. Always test migrations on production-sized data. Include rollback scripts.
Tools: Flyway, Liquibase (JVM). Alembic (Python). Knex migrations (Node.js). Entity Framework migrations (.NET). golang-migrate (Go). dbmate (language-agnostic).
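The expand and contract steps can be walked through concretely. A sqlite3 sketch (column names are illustrative; in production the backfill runs in batches and the contract step ships as a separate, later migration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Step 1 (expand): add the new column alongside the old one (non-breaking).
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2: deploy the app version that writes BOTH columns.
conn.execute(
    "INSERT INTO users (fullname, display_name) VALUES ('Alan Turing', 'Alan Turing')"
)

# Step 3: backfill old rows (batched in production; one statement for brevity).
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Step 4: deploy the app version that reads only display_name; verify first.
missing = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
print(missing)  # 0: every row has the new column populated

# Step 5 (contract): a separate, later migration drops the old column.
# conn.execute("ALTER TABLE users DROP COLUMN fullname")  # SQLite >= 3.35
```

At every step the running application and the schema remain compatible in both directions, which is what makes each deploy individually safe to roll back.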

16.9 SQL vs NoSQL — Decision Framework

| Criteria | SQL (PostgreSQL, MySQL) | NoSQL (MongoDB, DynamoDB, Cassandra) |
|---|---|---|
| Data model | Relational (tables, rows, joins) | Document, key-value, wide-column, graph |
| Schema | Fixed schema, enforced by DB | Flexible/schemaless (schema-on-read) |
| Transactions | Full ACID, multi-table | Limited (single-document by default in MongoDB — multi-document transactions exist since 4.0 but carry a performance cost; none in Cassandra) |
| Joins | Native, efficient | Application-side or denormalize |
| Write scaling | Vertical first, sharding is hard | Horizontal built-in (DynamoDB, Cassandra) |
| Query flexibility | SQL — ad-hoc queries, aggregations | Limited (design for your access pattern upfront) |
| Consistency | Strong by default | Eventual by default (tunable in some) |
| Best for | Complex queries, transactions, relational data, reporting | High write throughput, flexible schema, key-based access, massive scale |
Decision rule: If you need transactions across multiple entities -> SQL. If you need flexible schema with key-based access at massive scale -> NoSQL. If unsure -> SQL (PostgreSQL). Adding a NoSQL store later is easy. Discovering you need transactions after choosing NoSQL is painful.

16.10 Redis vs Memcached

| Criteria | Redis | Memcached |
|---|---|---|
| Data structures | Strings, hashes, lists, sets, sorted sets, streams, HyperLogLog | Strings only (key-value) |
| Persistence | RDB snapshots + AOF | None (pure cache) |
| Pub/Sub | Built-in | No |
| Clustering | Redis Cluster (auto-sharding) | Client-side sharding |
| Lua scripting | Yes (atomic operations) | No |
| Multi-threading | Single-threaded main loop (I/O threads since 6.0) | Multi-threaded |
| Best for | Caching + data structures + pub/sub + queues | Pure caching, max memory efficiency |
Decision rule: Need just caching -> either works. Need data structures, pub/sub, or persistence -> Redis. Need maximum memory efficiency for simple key-value at massive scale -> Memcached. Most teams choose Redis.

16.11 Specialized Data Stores

Document databases (MongoDB, CouchDB): Flexible schema, good for content management, catalogs. Key-value stores (Redis, DynamoDB): Ultra-fast lookups by key, session stores, caching. Time-series databases (InfluxDB, TimescaleDB): Optimized for time-stamped data, IoT, metrics. Search engines (Elasticsearch, OpenSearch): Full-text search, log analytics. Vector databases (Pinecone, Weaviate, pgvector): Similarity search for ML embeddings. Graph databases (Neo4j): Relationship-heavy queries (social networks, fraud detection).

Database Selection Guide

Use this table when choosing a database for a new project or feature. Start from your use case, not from a technology.
| Use Case | Recommended DB Type | Specific Options | Reasoning |
|---|---|---|---|
| General web app (users, orders, products) | Relational (SQL) | PostgreSQL, MySQL | ACID transactions, JOINs, flexible queries, mature tooling |
| High-write event stream (clickstream, IoT telemetry) | Time-series or wide-column | TimescaleDB, InfluxDB, Cassandra | Append-optimized writes, time-range queries, horizontal scale |
| Session store / caching layer | Key-value (in-memory) | Redis, Memcached | Sub-millisecond reads, TTL expiry, simple key-based access |
| Product catalog with variable attributes | Document store | MongoDB, DynamoDB | Flexible schema (each product has different attributes), nested documents |
| Full-text search (site search, log analytics) | Search engine | Elasticsearch, OpenSearch | Inverted indexes, relevance scoring, faceted search |
| Social network (friends, followers, recommendations) | Graph database | Neo4j, Amazon Neptune | Relationship traversal (friends-of-friends) is O(1) per hop vs expensive JOINs |
| ML feature store / semantic search | Vector database | Pinecone, Weaviate, pgvector | Approximate nearest neighbor (ANN) search on high-dimensional embeddings |
| Financial ledger (strict consistency, audit trail) | Relational (SQL) | PostgreSQL, CockroachDB | Serializable transactions, multi-row ACID, immutable audit log |
| Global multi-region low-latency reads | Distributed NoSQL | DynamoDB Global Tables, CockroachDB, Cassandra | Multi-region replication, tunable consistency, automatic failover |
| Queue / task broker | Stream or message broker | Redis Streams, Kafka, SQS | Ordered delivery, consumer groups, at-least-once semantics |
The polyglot persistence principle: Most real systems use multiple databases, each for what it does best. A typical SaaS app might use PostgreSQL for transactional data, Redis for caching and sessions, Elasticsearch for search, and S3 for blob storage. Do not force one database to do everything. But also do not add a new database for every feature — each database is an operational burden (backups, monitoring, upgrades, expertise). Start with PostgreSQL for everything, then split out when you have a proven need.
Database Red Flags in Interviews — Things That Immediately Signal You Are Junior.
  1. Using SELECT * in production queries. It fetches every column including BLOBs and text fields you do not need, defeats covering indexes (which can serve the query entirely from the index without touching the table), and breaks when someone adds a column. Always specify the columns you need.
  2. Not mentioning indexes. If someone asks you to optimize a slow query and you do not say “index” in your first sentence, the interviewer has already mentally downgraded you. Indexes are the single highest-leverage optimization in all of database engineering.
  3. Ignoring connection pooling. Every database connection costs ~5-10 MB of memory on the server side. A web app with 100 application servers each opening 20 connections means 2,000 connections — PostgreSQL defaults to max_connections = 100. Without a connection pooler (PgBouncer, PgCat), your database falls over before your queries even execute.
  4. Not considering read replicas for read-heavy loads. If your answer to “the database is slow” is always “add an index” or “throw more hardware at the primary,” you are missing the architecture-level solution. Most web applications are 80-95% reads. Routing reads to replicas is table-stakes knowledge.
  5. Saying “just use NoSQL” without understanding the trade-offs. NoSQL is not faster. It makes different trade-offs. If you cannot articulate what you are giving up (transactions, JOINs, ad-hoc query flexibility) and what you are gaining (horizontal write scaling, flexible schema, key-based access patterns), you do not understand the choice.
  6. Never mentioning EXPLAIN ANALYZE. Optimizing a query without looking at the execution plan is like debugging code without reading the error message. If your optimization strategy is “add indexes and hope,” you are guessing.
  7. Ignoring data volume in your estimates. When someone says “we have 500 million rows” and your answer does not change compared to “we have 50,000 rows,” you are not thinking about scale. A full table scan on 500M rows takes minutes. On 50K rows, it takes milliseconds. The strategy is completely different.
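The essence of a connection pooler fits in a few lines: a fixed set of reusable connections behind a bounded queue, so the database never sees more than the pool size. A Python sketch (sizes and names are illustrative; production poolers like PgBouncer add transaction-level pooling, health checks, and much more):

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal fixed-size pool: callers borrow from a bounded queue instead
    of opening new connections, capping what the database ever sees."""

    def __init__(self, size, db_path):
        self._free = queue.Queue(maxsize=size)
        for _ in range(size):
            self._free.put(sqlite3.connect(db_path, check_same_thread=False))

    def acquire(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        self._free.put(conn)

pool = ConnectionPool(size=2, db_path=":memory:")
c1, c2 = pool.acquire(), pool.acquire()
try:
    pool.acquire(timeout=0.1)  # pool exhausted: caller waits, DB is protected
    exhausted = False
except queue.Empty:
    exhausted = True
pool.release(c1)
pool.release(c2)
print(exhausted)  # True
```

The key property: when demand exceeds the pool size, excess callers queue at the application tier instead of piling new connections onto the database.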
Strong answer: “Indexes exist” does not mean “indexes are working.” This is a systematic diagnosis, not a guessing game. Let me put scale in perspective first: at 500M rows with an average row size of 200 bytes, we are talking about ~100 GB of raw data. A full table scan at typical cloud SSD throughput (~200 MB/s sequential) takes roughly 8-10 minutes. With a well-designed B-tree index, a point lookup on the same table takes 2-5 ms. That is a 100,000x difference — and it is the difference between “the page loads” and “the page times out.”
Phase 1: Understand the queries.
  • Run EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) on the slow queries. This is non-negotiable — you cannot optimize what you have not measured. Look at the actual execution plan, not the estimated one.
  • Check if the planner is using your indexes at all. If you see Seq Scan on a 500M-row table, either the index is missing for that query pattern, the statistics are stale, or the planner decided a sequential scan was cheaper (which can happen if the query returns a large fraction of the table).
Reading a real query plan — concrete example. Suppose you run:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';
And get back:
Seq Scan on orders  (cost=0.00..12847623.00 rows=4821 width=196)
                    (actual time=287401.22..287892.55 rows=4793 loops=1)
  Filter: ((customer_id = 42) AND (status = 'shipped'::text))
  Rows Removed by Filter: 499995207
  Buffers: shared hit=1024 read=8472981
Planning Time: 0.12 ms
Execution Time: 287893.01 ms
This plan is screaming at you. Here is how to read it: (1) Seq Scan means PostgreSQL is reading every single row — all 500M of them. At roughly 250 MB/s sustained disk throughput on a typical cloud instance, reading ~65 GB takes approximately 4-5 minutes of wall-clock time, which matches the 287893 ms (~4.8 minutes) execution time. (2) Rows Removed by Filter: 499,995,207 means it examined 500M rows to find 4,793 matches — a selectivity of 0.001%. This is an index’s dream scenario. (3) Buffers: shared read=8,472,981 means it read ~8.5 million 8 KB pages from disk — about 65 GB of I/O for a query that should touch a few MB. (4) The cost=0.00..12847623.00 is the planner’s estimated cost in arbitrary units — higher means more expensive, and 12.8 million is astronomical.
Now add a composite index:
CREATE INDEX idx_orders_customer_status ON orders(customer_id, status);
Re-run EXPLAIN ANALYZE:
Index Scan using idx_orders_customer_status on orders
    (cost=0.57..5241.33 rows=4821 width=196)
    (actual time=0.03..4.87 rows=4793 loops=1)
  Index Cond: ((customer_id = 42) AND (status = 'shipped'::text))
  Buffers: shared hit=4821
Planning Time: 0.15 ms
Execution Time: 5.12 ms
Same query, same data: from 287 seconds to 5 milliseconds — a 57,000x improvement. The Index Scan touches exactly the rows it needs (4,821 buffer hits vs 8.5 million disk reads). At 500M rows, a B-tree index is about 4-5 levels deep, so the lookup requires roughly 5 random I/O operations to find the first matching leaf page, then a sequential scan through adjacent leaf pages for the remaining matches. This is why EXPLAIN ANALYZE is non-negotiable — without it, you are guessing. Every database interview answer about query optimization that does not start with “I would run EXPLAIN ANALYZE” is incomplete.
How to read any EXPLAIN ANALYZE output in 30 seconds: (1) Scan for Seq Scan on large tables — that is your red flag. (2) Check actual time — the second number is total time in milliseconds. (3) Compare rows (estimated) to actual rows — a big mismatch means stale statistics (ANALYZE the table). (4) Look at Buffers: shared read — high numbers mean cold cache and heavy disk I/O. (5) Check for Sort or Hash Join with high row counts — those eat memory and spill to disk.
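Parts of that 30-second checklist can be applied mechanically. A Python sketch that scans the Seq Scan plan shown earlier for the sequential-scan, filter-selectivity, and disk-I/O red flags (the thresholds are illustrative, not standards):

```python
import re

# The Seq Scan plan from the example above.
PLAN = """\
Seq Scan on orders  (cost=0.00..12847623.00 rows=4821 width=196)
                    (actual time=287401.22..287892.55 rows=4793 loops=1)
  Filter: ((customer_id = 42) AND (status = 'shipped'::text))
  Rows Removed by Filter: 499995207
  Buffers: shared hit=1024 read=8472981
"""

def plan_red_flags(plan_text):
    """Mechanically apply parts of the 30-second checklist to plan text."""
    flags = []
    if "Seq Scan" in plan_text:
        flags.append("sequential scan")
    removed = re.search(r"Rows Removed by Filter: (\d+)", plan_text)
    if removed and int(removed.group(1)) > 1_000_000:
        flags.append("filter discards millions of rows (index candidate)")
    read = re.search(r"read=(\d+)", plan_text)
    # Pages are 8 KB in PostgreSQL; flag more than 1 GiB read from disk.
    if read and int(read.group(1)) * 8 * 1024 > 1 << 30:
        flags.append("heavy disk I/O (cold cache or missing index)")
    return flags

flags = plan_red_flags(PLAN)
print(flags)  # all three flags fire for this plan
```

Tools like explain.depesz.com and pgMustard do a far more sophisticated version of this, but the habit of scanning for these markers first is what matters.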
Phase 2: Statistics and maintenance.
  • Run ANALYZE on the table. PostgreSQL’s query planner uses statistics about data distribution to choose execution plans. On a 500M-row table, the default statistics target (100) may be too low — increase it with ALTER TABLE SET STATISTICS 1000 on high-cardinality columns used in WHERE clauses.
  • Check for bloat. On a heavily-updated table, dead tuples accumulate (MVCC). Run SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'your_table'. If dead tuples are a significant fraction of live tuples, VACUUM FULL or pg_repack (non-locking) to reclaim space and reduce I/O.
  • Check index bloat. B-tree indexes on frequently-updated columns become bloated over time. REINDEX CONCURRENTLY rebuilds without locking.
Phase 3: Index optimization.
  • Verify composite index column order matches query patterns (leftmost prefix rule). An index on (status, created_at) helps WHERE status = 'active' AND created_at > '2025-01-01' but NOT WHERE created_at > '2025-01-01' alone.
  • Consider partial indexes if queries consistently filter a subset: CREATE INDEX idx_active ON orders(created_at) WHERE status = 'active' — if only 5% of rows are active, this index is 20x smaller and 20x faster to scan.
  • Consider covering indexes (INCLUDE) to enable index-only scans — if the query only needs id and amount, CREATE INDEX idx ON orders(status, created_at) INCLUDE (id, amount) avoids touching the heap entirely.
Phase 4: Table-level strategies for 500M+ rows.
  • Partitioning. If queries consistently filter by a time range (and they usually do), partition by month or quarter using PostgreSQL declarative partitioning. A query for “last 30 days” now scans one partition (maybe 15M rows) instead of 500M. Partition pruning is automatic. The numbers are dramatic: scanning 15M rows (~3 GB) takes on the order of 15 seconds vs ~8-10 minutes for 500M at the same throughput. With an index on the partition, point lookups drop to under 1 ms because the index is 30x smaller and fits entirely in memory.
  • Read replicas. Route analytics and reporting queries to a read replica. Keep the primary for OLTP.
  • Materialized views. For expensive aggregation queries that do not need real-time data, create a materialized view refreshed every hour. A dashboard that aggregates 500M rows every page load is a bug, not a feature.
Phase 5: Application-level changes.
  • Add pagination. No query should return 500M rows. Cursor-based pagination with WHERE id > last_seen_id ORDER BY id LIMIT 100 is O(1) regardless of offset. At 500M rows, an OFFSET 10000000 LIMIT 20 query still scans and discards 10 million rows — cursor-based pagination avoids this entirely.
  • Cache hot queries in Redis with a TTL. If the same aggregation is requested 1000 times per minute, compute it once. A Redis GET takes ~0.1 ms vs a complex PostgreSQL aggregation at 50-500 ms — that is a 500-5000x improvement for repeat queries. See the Caching chapter for cache invalidation strategies (write-through, write-behind, cache-aside) and how to avoid the thundering herd problem on cache expiry.
  • Review the ORM. ORMs often generate suboptimal queries — N+1 problems, unnecessary SELECT *, missing WHERE clauses. EXPLAIN ANALYZE the actual SQL the ORM produces, not what you think it produces. A classic N+1 on a 500M-row table turns a single 5 ms indexed query into 100 separate queries taking 500 ms total — plus the connection overhead.
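The cursor-based pagination from the list above, sketched with sqlite3 (table name and page size are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (title) VALUES (?)",
    [(f"post {i}",) for i in range(1, 1001)],
)

def fetch_page(conn, last_seen_id, page_size=100):
    # Keyset pagination: seek past the cursor via the primary-key index.
    # Work per page stays O(page_size) no matter how deep you paginate,
    # unlike OFFSET, which scans and discards every skipped row.
    return conn.execute(
        "SELECT id, title FROM posts WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()

page1 = fetch_page(conn, last_seen_id=0)
page2 = fetch_page(conn, last_seen_id=page1[-1][0])  # cursor = last id seen
print(page1[0][0], page2[0][0])  # 1 101
```

The trade-off: keyset pagination cannot jump to an arbitrary page number, only to the next page, which is exactly what infinite-scroll feeds need.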
What NOT to do: Do not throw hardware at it before understanding the query plans. Do not add random indexes hoping one helps — every index slows writes (on a table with 8 indexes, each INSERT pays ~9 disk writes). Do not jump to sharding before exhausting single-node optimizations — PostgreSQL can handle billions of rows on a single node with proper indexing and partitioning.
Cross-Chapter Connections for Database Optimization.
  • Caching — The fastest database query is the one you never make. For read-heavy workloads, a caching layer (Redis, Memcached) in front of the database can reduce read load by 90%+. See the Caching chapter for strategies on cache invalidation, TTL design, and avoiding the thundering herd problem when a hot cache key expires and 1,000 requests simultaneously hit the database.
  • Performance and Reliability — Query tuning is one dimension of performance, but connection pooling (PgBouncer, ProxySQL), circuit breakers around database calls, and timeout configuration are equally critical. A single slow query without a timeout can hold a connection for minutes, starving the pool. See the Performance chapter for connection pool sizing, timeout strategies, and circuit breaker patterns around data stores.
  • Messaging and Async Processing — For write-heavy workloads, consider decoupling writes through a message queue. Instead of synchronously inserting into the database on every API request, publish to Kafka or SQS and let a consumer batch-insert. This absorbs traffic spikes (Black Friday: 10x normal write volume) without overwhelming the database. See the Messaging chapter for the outbox pattern, which ensures reliable publishing of database changes to a message broker without distributed transactions.
Further reading:
  • Designing Data-Intensive Applications by Martin Kleppmann — chapters on replication, partitioning, and transactions are exceptional; Chapters 3 (Storage and Retrieval), 5 (Replication), 6 (Partitioning), and 7 (Transactions) are the most directly relevant to this section.
  • Percona’s Database Performance Blog — deep technical posts on MySQL and PostgreSQL performance tuning, replication, and operational best practices from the team behind pt-online-schema-change and Percona Server.
  • GitHub Engineering Blog on Database Infrastructure — real-world case studies on zero-downtime migrations with gh-ost, MySQL fleet management at scale, and the operational challenges of running databases for 100M+ developers.
  • Uber Engineering: Schemaless — Uber’s custom database layer built on top of MySQL that provides schemaless storage, append-only writes, and horizontal sharding; a fascinating case study in building application-level database abstractions.
  • Google’s Spanner Paper — the foundational paper on globally distributed databases with external consistency; introduces TrueTime and explains how Google achieves strong consistency across data centers worldwide.
Database indexing:
  • Use The Index, Luke — SQL Indexing and Tuning e-Book — free, comprehensive guide to database indexing by Markus Winand, and the best free resource on the subject; covers B-tree internals, composite index design, partial indexes, covering indexes, and how to read execution plans across PostgreSQL, MySQL, Oracle, and SQL Server; essential for anyone writing SQL in production.
PostgreSQL:
  • PostgreSQL Wiki: Performance Optimization — community-maintained collection of PostgreSQL performance tuning techniques covering configuration (shared_buffers, work_mem, effective_cache_size), query optimization, and connection management.
  • pganalyze Blog — in-depth articles on PostgreSQL performance including EXPLAIN plan analysis, index advisor strategies, vacuum tuning, bloat management, and connection pooling; written by the team behind the pganalyze monitoring tool.
Transactions and ACID:
  • PostgreSQL Documentation: Transaction Isolation — the authoritative reference on how PostgreSQL implements Read Committed, Repeatable Read (snapshot isolation), and Serializable isolation levels; essential for understanding MVCC behavior, write skew prevention, and serialization failure handling.
  • Jepsen.io: Consistency Analyses — Kyle Kingsbury’s rigorous distributed systems correctness testing; tests real databases (PostgreSQL, MySQL, MongoDB, CockroachDB, Cassandra, and more) for consistency violations under network partitions and failures; the most honest source of truth about whether a database actually delivers the consistency guarantees it claims.
CAP theorem and PACELC:
  • Martin Kleppmann: Please Stop Calling Databases CP or AP — Kleppmann’s widely-cited critique explaining why the CAP theorem is frequently misunderstood, why real systems do not fit neatly into CP or AP categories, and why you should reason about specific consistency models (linearizability, causal consistency, eventual consistency) rather than using CAP as a classification tool.
Replication:
  • MySQL Documentation: Replication — the canonical reference for MySQL replication covering binary log replication, GTID-based replication, semi-synchronous replication, group replication, and troubleshooting replication lag.
  • PostgreSQL Documentation: Streaming Replication — official guide to PostgreSQL’s WAL-based streaming replication, covering synchronous and asynchronous modes, replication slots, hot standby configuration, and monitoring replication lag.
Database migrations:
  • Flyway Documentation — the most popular SQL-based migration tool for JVM projects; covers versioned migrations, repeatable migrations, undo migrations, and baseline; well-suited for teams that prefer writing raw SQL migration scripts.
  • Liquibase Documentation — database-agnostic migration tool supporting SQL, XML, YAML, and JSON changelog formats; particularly strong for teams managing multiple database vendors or needing rollback tracking.
  • gh-ost: GitHub’s Online Schema Migration Tool — GitHub’s triggerless online schema migration tool for MySQL; performs non-blocking ALTER TABLE on tables with billions of rows by using the binary log stream; essential reading for anyone running MySQL at scale.
Connection pooling:
  • PgBouncer Documentation — the standard connection pooler for PostgreSQL; covers session, transaction, and statement pooling modes, configuration tuning, and monitoring; critical for any PostgreSQL deployment serving more than a handful of application instances.
  • HikariCP Wiki — documentation for the fastest JVM connection pool; the “About Pool Sizing” article is particularly valuable — it explains why a small pool (often 10 connections) outperforms a large pool (100+) and provides the formula for calculating optimal pool size based on core count and disk characteristics.

API Design Checklist

Use this checklist before shipping any API — internal or external. The items are ordered roughly by “pain caused if you skip this.” Every item you defer is technical debt that compounds with every new consumer.

1. Versioning Strategy

  • Choose a versioning scheme before writing the first endpoint (/v1/ URL path for public APIs, header-based for internal)
  • Define what constitutes a breaking change for your API (removing fields, renaming fields, changing types, changing status codes)
  • Establish a deprecation policy with a timeline (minimum 6 months notice for public APIs, 2 weeks for internal)
  • Pin each consumer to a version — new consumers get the latest, existing consumers keep theirs until they opt in
  • Include version in every response header (X-API-Version: 2024-01-15) so consumers can verify

2. Authentication and Authorization

  • Choose auth mechanism: API keys (simple, per-consumer), OAuth 2.0 (delegated access, scopes), JWT (stateless, short-lived)
  • Require HTTPS everywhere — no exceptions, even for internal APIs (zero-trust networking)
  • Implement API key rotation without downtime (support two active keys per consumer during rotation)
  • Return 401 Unauthorized (identity unknown) vs 403 Forbidden (identity known, insufficient permissions) — these are different
  • Log every authentication failure with IP, key prefix (never the full key), and timestamp for security auditing
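The 401-versus-403 distinction above can be reduced to a small decision function. A minimal sketch — the key lookup table and scope names are hypothetical, not a real auth framework:

```python
from typing import Optional

def auth_status(api_key: Optional[str],
                known_keys: dict,       # api_key -> set of granted scopes (assumed model)
                required_scope: str) -> int:
    """Return the HTTP status for the auth check, distinguishing 401 from 403."""
    if api_key is None or api_key not in known_keys:
        return 401  # identity unknown: we cannot tell who is calling
    if required_scope not in known_keys[api_key]:
        return 403  # identity known, but this key lacks the required permission
    return 200

keys = {"sk_live_abc": {"charges:read"}}
print(auth_status(None, keys, "charges:read"))           # 401 — no credentials
print(auth_status("sk_live_abc", keys, "charges:write")) # 403 — valid key, wrong scope
print(auth_status("sk_live_abc", keys, "charges:read"))  # 200
```

When logging the failures, remember the rule above: record only a key prefix such as sk_live_ab..., never the full key.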

3. Rate Limiting

  • Define rate limits per consumer tier (free: 100 req/min, paid: 10,000 req/min, enterprise: custom)
  • Return 429 Too Many Requests with Retry-After header (in seconds) and X-RateLimit-Remaining, X-RateLimit-Reset headers
  • Choose algorithm: token bucket for APIs allowing bursts, sliding window for strict rate enforcement
  • Implement at the API gateway level (not per-application-instance) for accurate distributed counting
  • Document rate limits prominently — developers should not discover limits by hitting them
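The token bucket algorithm recommended above for burst-tolerant APIs can be sketched as follows. The injected fake clock makes the refill behavior deterministic; a real deployment would hold per-consumer buckets in shared state at the gateway, as noted above:

```python
import time

class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens/second."""
    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity   # start full so a new consumer can burst immediately
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond 429 with a Retry-After header

# Deterministic demo with a fake clock.
t = [0.0]
bucket = TokenBucket(capacity=5, rate=1.0, clock=lambda: t[0])
burst = [bucket.allow() for _ in range(6)]
print(burst)  # [True, True, True, True, True, False] — burst of 5 absorbed, sixth limited
t[0] += 2.0   # two seconds later, two tokens have refilled
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
```

A sliding-window counter would instead reject the burst, which is why the checklist matches the algorithm to the traffic shape you want to permit.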

4. Pagination

  • Choose cursor-based pagination for all list endpoints (not offset-based — offset degrades at scale)
  • Return pagination metadata in every list response: { "data": [...], "has_more": true, "next_cursor": "abc123" }
  • Set a maximum page_size (e.g., 100) and a sensible default (e.g., 20) — never return unbounded results
  • Ensure cursors are opaque to clients (base64-encoded internal identifiers, not raw database IDs)
  • Document the sort order — cursors are meaningless without a defined, stable ordering
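The opaque-cursor and metadata items above can be sketched together. The cursor payload format and the fetch-one-extra-row trick are illustrative choices, not a prescribed wire format:

```python
import base64
import json

def encode_cursor(last_id: int, sort_key: str) -> str:
    """Encode pagination state as an opaque, URL-safe token — clients never see raw IDs."""
    raw = json.dumps({"id": last_id, "sort": sort_key}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))

def paginate(rows: list, page_size: int = 20, max_size: int = 100) -> dict:
    """Build a list response; `rows` is assumed already sorted by the stable `id` order.
    Fetching size+1 rows from the database lets us detect `has_more` cheaply."""
    size = min(page_size, max_size)  # clamp to the documented maximum
    page = rows[:size]
    has_more = len(rows) > size
    return {
        "data": page,
        "has_more": has_more,
        "next_cursor": encode_cursor(page[-1]["id"], "id") if has_more else None,
    }

page = paginate([{"id": i} for i in range(30)], page_size=20)
print(page["has_more"], decode_cursor(page["next_cursor"]))  # True {'id': 19, 'sort': 'id'}
```

Because the cursor is opaque, the server is free to change its internal encoding later without breaking clients.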

5. Error Format

  • Define a single, consistent error schema used by every endpoint:
{
  "error": {
    "code": "invalid_parameter",
    "message": "The 'currency' field must be an uppercase ISO 4217 code (e.g., 'USD'). Received: 'usd'.",
    "param": "currency",
    "type": "validation_error",
    "request_id": "req_abc123"
  }
}
  • Include a machine-readable code (for programmatic handling) AND a human-readable message (for debugging)
  • Include request_id in every response (success and error) — this is the lifeline for support debugging
  • Use appropriate HTTP status codes: 400 (bad input), 401 (not authenticated), 403 (not authorized), 404 (not found), 409 (conflict), 422 (validation failed), 429 (rate limited), 500 (server error)
  • Never leak internal details (stack traces, SQL errors, internal service names) in production error responses
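A helper that produces the error envelope above might look like this. The code-to-status mapping and the type taxonomy are assumptions for illustration — adapt them to your own error catalog:

```python
import uuid

# Machine-readable error codes mapped to HTTP status codes (assumed taxonomy).
STATUS_FOR_CODE = {
    "invalid_parameter": 422,
    "not_authenticated": 401,
    "not_authorized": 403,
    "not_found": 404,
    "rate_limited": 429,
    "internal_error": 500,
}

def error_response(code: str, message: str, param=None, request_id=None):
    """Build the single error envelope every endpoint returns, plus its status code."""
    body = {"error": {
        "code": code,
        "message": message,  # human-readable; must never contain internal details
        "type": "validation_error" if code == "invalid_parameter" else "api_error",
        "request_id": request_id or f"req_{uuid.uuid4().hex[:12]}",
    }}
    if param is not None:
        body["error"]["param"] = param
    return STATUS_FOR_CODE.get(code, 500), body

status, body = error_response(
    "invalid_parameter",
    "The 'currency' field must be an uppercase ISO 4217 code (e.g., 'USD').",
    param="currency")
print(status)  # 422
```

Centralizing this in one function is what keeps the schema consistent across every endpoint — the checklist's first requirement in this section.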

6. Idempotency

  • Require an Idempotency-Key header on all non-idempotent mutations (POST, PATCH)
  • Store idempotency keys with the response — retries return the stored response, not a new execution
  • Scope keys to the API consumer (two different consumers can use the same key string)
  • Set a TTL on idempotency records (24 hours is standard — Stripe’s model)
  • Return an error if a retried request has a different body than the original (the key is bound to the parameters)
  • For critical operations (payments, order creation), wrap the idempotency check and business logic in a single database transaction
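The key-storage and body-binding rules above can be sketched with an in-memory store. This is illustrative only — production would persist records in a database table inside the same transaction as the business logic, and expire them via the TTL noted above:

```python
import hashlib
import json

class IdempotencyStore:
    """In-memory sketch of server-side idempotency; keys are scoped per consumer."""
    def __init__(self):
        self._records = {}  # (consumer_id, key) -> (body_hash, stored_response)

    def execute(self, consumer_id: str, key: str, body: dict, handler):
        body_hash = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        record = self._records.get((consumer_id, key))
        if record is not None:
            stored_hash, stored_response = record
            if stored_hash != body_hash:
                # Same key, different parameters: reject rather than guess.
                return 409, {"error": "idempotency key reused with a different body"}
            return stored_response  # replay the stored response; do not re-execute
        response = handler(body)
        self._records[(consumer_id, key)] = (body_hash, response)
        return response

store = IdempotencyStore()
charges = []
def create_charge(body):
    charges.append(body)                    # the side effect we must not repeat
    return 201, {"id": f"ch_{len(charges)}"}

first = store.execute("acct_1", "key-42", {"amount": 500}, create_charge)
retry = store.execute("acct_1", "key-42", {"amount": 500}, create_charge)
print(first == retry, len(charges))  # True 1 — the retry created no second charge
conflict = store.execute("acct_1", "key-42", {"amount": 999}, create_charge)
print(conflict[0])  # 409
```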

7. Documentation

  • Write an OpenAPI (Swagger) spec before writing code — the spec is the contract
  • Include a runnable example for every endpoint (curl command or code snippet in 2-3 languages)
  • Document every error code with cause, example, and suggested fix
  • Provide a “Getting Started” guide that gets a developer from zero to first successful API call in under 5 minutes
  • Include a changelog that documents every change with the date, affected endpoints, and migration path
  • Host interactive API docs (Swagger UI, Redoc, or Stoplight) — developers should be able to try endpoints without writing code

8. Additional Production Concerns

  • CORS — Configure allowed origins, methods, and headers for browser-based consumers
  • Request/response logging — Log every request with method, path, status code, latency, and consumer ID (redact sensitive fields)
  • Health check endpoint — GET /health returns 200 with dependency status (database connected, cache reachable) for load balancer probes
  • Compression — Support Accept-Encoding: gzip — JSON compresses 5-10x, which matters for bandwidth-constrained mobile clients
  • Timeouts — Set and document server-side request timeouts (e.g., 30 seconds) — a client should never wait indefinitely
  • Webhooks (if applicable) — Support event notifications with retry logic, signature verification (HMAC-SHA256), and a test mode
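The webhook signature verification mentioned above can be sketched with the standard library. The header name and secret format are illustrative; the essential points are signing the raw payload bytes and comparing in constant time:

```python
import hashlib
import hmac

def sign_webhook(secret: bytes, payload: bytes) -> str:
    """Provider side: HMAC-SHA256 over the raw body, sent in a header
    such as X-Webhook-Signature (header name assumed)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, payload: bytes, signature: str) -> bool:
    """Consumer side: recompute and compare with compare_digest, which runs in
    constant time to prevent timing attacks."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

secret = b"whsec_example_secret"
payload = b'{"event": "payment.succeeded", "id": "evt_1"}'
sig = sign_webhook(secret, payload)
print(verify_webhook(secret, payload, sig))             # True
print(verify_webhook(secret, b'{"tampered": 1}', sig))  # False
```

Verify against the raw request bytes, not a re-serialized JSON object — re-serialization can reorder keys and change whitespace, which breaks the signature.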