Part VIII — APIs and Integration
Chapter 15: API Design
- Predictable — Consistent naming conventions everywhere. If one endpoint uses `created_at`, every endpoint uses `created_at` (not `createdAt` in some and `creation_date` in others). If `GET /users` returns a list, `GET /orders` returns a list in the same envelope. A developer who has used one endpoint can guess the shape of any other endpoint.
- Forgiving — When something goes wrong, the API tells you exactly what happened and how to fix it. Not `400 Bad Request` with no body — but `{ "error": { "code": "invalid_currency", "message": "Currency 'usd' is not valid. Did you mean 'USD'? Currencies must be uppercase ISO 4217 codes.", "param": "currency" } }`. Good error messages are the best API documentation.
- Evolvable — Versioning is baked in from day one, not bolted on when the first breaking change happens. Every response includes only additive fields. Removing or renaming a field goes through a deprecation cycle. The API you ship on launch day should be designed assuming it will still be running in five years with clients you have never heard of.
You are designing a public API for a SaaS product. What are the top 5 design decisions you make upfront?
- Authentication mechanism — API keys for simplicity, OAuth 2.0 if delegated access is needed.
- Versioning strategy — URL-based (`/v1/`) for discoverability, with a deprecation policy from day one.
- Pagination — cursor-based (not offset) for consistency at scale; every list endpoint returns pagination metadata.
- Error format — consistent error response schema across all endpoints (error code, message, details, request_id for support).
- Rate limiting — per API key, with `429 Too Many Requests` and a `Retry-After` header.

Also: include a request_id in every response for debugging, and write the OpenAPI spec before writing code.

15.1 REST API Design
Resources as nouns (/users), HTTP methods for actions, plural nouns, hierarchy for relationships (/users/123/orders).
Status codes: 200 OK, 201 Created, 204 No Content, 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 409 Conflict, 422 Unprocessable Entity, 429 Too Many Requests, 500 Internal Server Error.
Pagination: Offset-based (?page=2&size=20) is simple but slow on large datasets (deep offsets scan every skipped row, and rows shift between pages as data changes). Cursor-based (?cursor=abc&size=20) stays fast and consistent at any depth. Always return pagination metadata.
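A minimal sketch of cursor-based pagination. This uses an in-memory Python list as a stand-in for a database table; names like `list_users` and the cursor encoding are illustrative, not a prescribed format:

```python
import base64
import json
from typing import Optional

# Hypothetical in-memory "users table", ordered by id (stands in for an indexed column).
ROWS = [{"id": i, "name": f"user{i}"} for i in range(1, 101)]

def encode_cursor(last_id: int) -> str:
    # Cursors are opaque to clients: encode them, never expose raw ids or offsets.
    return base64.urlsafe_b64encode(json.dumps({"last_id": last_id}).encode()).decode()

def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor))["last_id"]

def list_users(cursor: Optional[str] = None, size: int = 20) -> dict:
    last_id = decode_cursor(cursor) if cursor else 0
    # Equivalent SQL: WHERE id > :last_id ORDER BY id LIMIT :size
    # (fast at any depth, unlike OFFSET, which scans all skipped rows)
    page = [r for r in ROWS if r["id"] > last_id][:size]
    next_cursor = encode_cursor(page[-1]["id"]) if len(page) == size else None
    return {"data": page, "pagination": {"next_cursor": next_cursor, "size": size}}
```

Clients simply follow `next_cursor` until it is null; the pagination metadata envelope is the same on every list endpoint.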
15.2 gRPC
Binary protocol using Protocol Buffers over HTTP/2. Created by Google to replace their internal RPC framework (Stubby) — designed for high-throughput, low-latency service-to-service communication. How it works: Define your service in a `.proto` file (service methods + message types). The protoc compiler generates client and server code in your language. The client calls methods on a generated stub that looks like a local function call — the gRPC framework handles serialization, HTTP/2 framing, and network transport.
Why it is faster: Protocol Buffers are 3-10x smaller than JSON (binary, no field names in payload). HTTP/2 multiplexes multiple RPCs over a single TCP connection (no connection-per-request overhead). Generated code eliminates serialization bugs and reduces boilerplate.
Streaming: Server-side streaming (server sends a stream of responses to one request — live updates, large result sets). Client-side streaming (client sends a stream of messages — file upload, batch ingestion). Bidirectional streaming (both stream simultaneously — chat, real-time collaboration).
Trade-offs: Not browser-friendly (needs gRPC-Web or Envoy proxy). Not human-readable (cannot curl and read the response — need grpcurl or Postman). No HTTP caching (all requests are POST). Steeper learning curve (proto files, code generation, streaming complexity). For public APIs consumed by third-party developers, REST is still the standard.
In short: services are defined in `.proto` files, and the compiler generates code in your language; payloads are much smaller and faster to serialize/deserialize than JSON.

15.3 Idempotency
An idempotent operation produces the same result regardless of how many times it is called. `GET /user/123` is naturally idempotent (reading does not change state). `POST /orders` is NOT naturally idempotent (calling it twice creates two orders). Making non-idempotent operations safe for retries is one of the most important distributed systems skills.
Implementation: Client generates a unique idempotency key (UUID) per request. Server checks if it has processed that key. If yes, returns the stored response. If no, processes and stores. Essential for payments, order creation, any operation where duplicates have real consequences.
Idempotency Key Patterns in Practice
Idempotency keys are more nuanced than "store a UUID." Here is how production systems handle them. The basic pattern:
- Client generates a UUID v4 and sends it in the `Idempotency-Key` header.
- Server checks a lookup table (Redis or database) for the key.
- If key exists and the request completed: return the stored response (same status code, same body).
- If key exists and the request is in-progress: return `409 Conflict` (prevents concurrent retries from double-processing).
- If key does not exist: create the record with status `processing`, execute the operation, store the response, update status to `completed`.
- Idempotency keys are scoped to the API key (two different merchants can use the same key string without collision).
- Keys expire after 24 hours — retrying after that creates a new operation.
- If a retried request has a different body than the original, the server returns an error (the key is bound to the original parameters).
- The stored response includes the exact HTTP status code and body — a retried `POST /charges` that originally failed with `402` returns the same `402` again, not `200`.
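The stored-key flow above can be sketched in a few lines. This is an in-memory Python sketch; a production store would be Redis or a database table with a 24-hour TTL and a request-body hash bound to each key:

```python
import threading

# In-memory idempotency store; production would use Redis or a DB row with TTL.
_store = {}
_lock = threading.Lock()

class ConflictError(Exception):
    """Concurrent retry while the original request is still in flight (maps to 409)."""

def idempotent(key: str, operation):
    with _lock:
        record = _store.get(key)
        if record is None:
            _store[key] = {"status": "processing", "response": None}
        elif record["status"] == "processing":
            raise ConflictError(key)
        else:
            # Replay the stored response: same status code, same body as the original.
            return record["response"]
    try:
        response = operation()          # runs exactly once per key
    except Exception:
        with _lock:
            del _store[key]             # allow a clean retry after a failure
        raise
    with _lock:
        _store[key] = {"status": "completed", "response": response}
    return response
```

A retried call with the same key returns the stored response without re-executing the operation, which is exactly what makes client retries safe.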
15.4 API Versioning and Backward Compatibility
Safe changes: Adding optional fields, new endpoints, new optional parameters. Breaking changes: Removing fields, renaming fields, changing types, making optional fields required.

Versioning Strategies Compared
| Strategy | Example | Pros | Cons | Used By |
|---|---|---|---|---|
| URL path | /v1/users, /v2/users | Most discoverable, easy to route, easy to cache | URL pollution, harder to sunset | Stripe, GitHub, Twitter |
| Custom header | X-API-Version: 2 | Clean URLs, flexible | Hidden from browser, easy to forget | Azure, Jira |
| Accept header (content negotiation) | Accept: application/vnd.myapi.v2+json | HTTP-standards compliant, fine-grained | Complex, hard to test in browser | GitHub (also supports URL) |
| Query parameter | /users?version=2 | Easy to add, discoverable | Breaks caching, messy | Rarely recommended |
Recommendation: use URL-path versioning (`/v1/`) for public APIs — it is the most discoverable and requires no special client configuration. Reserve header-based versioning for internal APIs where you control all clients. Content negotiation is standards-purist but creates friction for third-party developers.
- Announce deprecation in the API changelog and response headers (`Deprecation: true`, `Sunset: 2026-12-01`).
- Give at least 12 months notice for major version sunsetting.
- New accounts default to the latest version; existing accounts are pinned to their current version.
- Provide a migration guide for every breaking change.
15.5 GraphQL
A query language for APIs where the client specifies exactly what data it needs. A single endpoint (/graphql) serves all queries. The client sends a query describing the shape of the response.
Benefits: No over-fetching (client gets only requested fields). No under-fetching (client gets all needed data in one request, unlike REST which may require multiple endpoints). Strongly typed schema acts as documentation. Great for mobile clients where bandwidth matters.
Trade-offs: Caching is harder (every request is a POST with a unique body — HTTP caching does not work). Rate limiting is harder (one query can be trivial or extremely expensive). N+1 problems move to the server (resolving nested fields can trigger many database calls without careful dataloader implementation). Security: clients can craft expensive nested queries — implement query depth limits and complexity analysis.
When to use: Multiple client types needing different data shapes (web shows full details, mobile shows summary). Rapidly evolving frontend requirements. When the relationship between data is graph-like. When NOT to use: Simple CRUD APIs, file uploads, real-time streaming (use subscriptions or SSE), when your team is small and REST is working fine.
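One of the defenses mentioned above is a query depth limit. A toy sketch, assuming the query has already been parsed into nested dicts of field selections (a real server would walk the parsed AST, e.g. via a validation rule in a GraphQL library):

```python
MAX_DEPTH = 5  # illustrative limit; tune per schema

def query_depth(selection: dict, depth: int = 1) -> int:
    """Depth of a selection set modeled as {field: sub-selection-dict}."""
    children = [v for v in selection.values() if isinstance(v, dict) and v]
    if not children:
        return depth
    return max(query_depth(child, depth + 1) for child in children)

def validate_depth(selection: dict) -> None:
    depth = query_depth(selection)
    if depth > MAX_DEPTH:
        # Reject before executing any resolvers: depth bombs then cost nothing.
        raise ValueError(f"query depth {depth} exceeds limit {MAX_DEPTH}")
```

Complexity analysis works the same way but weights each field (a list field multiplied by its page size) instead of counting levels.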
15.6 Rate Limiting
Control how many requests a client can make in a given time window. Protects against abuse, ensures fair resource distribution, and prevents cascade failures. Algorithms: Fixed window (simple, allows bursts at window boundaries). Sliding window (smoother, more complex). Token bucket (allows controlled bursts up to a limit, then steady rate). Leaky bucket (smoothest output rate, constant processing).

Rate Limiting Algorithms — Deep Comparison
| Algorithm | How It Works | Burst Handling | Memory | Precision | Best For |
|---|---|---|---|---|---|
| Fixed window | Counter resets every N seconds | Allows 2x burst at window boundary (end of one + start of next) | Very low (1 counter per key) | Low | Simple internal APIs where boundary bursts are acceptable |
| Sliding window log | Stores timestamp of each request, counts within rolling window | No boundary burst | High (stores every timestamp) | Exact | Low-volume APIs where precision matters more than memory |
| Sliding window counter | Weighted average of current and previous window | Minimal boundary burst | Low (2 counters per key) | Approximate | Most production APIs — good balance of accuracy and memory |
| Token bucket | Bucket fills at steady rate, each request takes a token, bucket has max capacity | Allows controlled bursts up to bucket size, then steady rate | Low (token count + last refill time) | Good | APIs where short bursts are acceptable (Stripe, AWS) |
| Leaky bucket | Requests enter a queue processed at constant rate | No bursts — excess requests are queued or dropped | Moderate (queue) | Smoothest output | Traffic shaping, when downstream systems need constant rate |
- Token bucket is best when you want to allow short bursts (a mobile app that syncs 20 items at once, then goes idle). Set bucket size = max burst, refill rate = sustained limit.
- Sliding window counter is best when you want strictly even distribution and bursts are harmful (payment processing API).
When a client exceeds its limit, return `429 Too Many Requests` with a `Retry-After` header.
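A minimal token bucket, the burst-friendly algorithm from the comparison table above. Parameters are illustrative; a production limiter would keep one bucket per API key in shared storage:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, then a sustained `rate` requests/second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate                  # refill rate: tokens per second
        self.capacity = capacity          # max burst size
        self.tokens = capacity            # start full: a burst is allowed immediately
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazy refill: add tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # caller responds 429 with Retry-After
```

Set capacity to the max acceptable burst and rate to the sustained limit; when `allow()` returns False, the time until the next token is roughly `(1 - tokens) / rate`, a reasonable `Retry-After` value.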
Your API serves 10K RPS. A partner wants to integrate but their system can only handle 100 RPS callbacks. Design the integration.
- Decouple with an outbound queue. Instead of calling the partner synchronously as events occur, publish events to a message queue (SQS, Kafka, or Redis Streams) dedicated to this partner.
- Rate-limited consumer. A dedicated worker consumes from the queue at the partner’s rate — 100 RPS. Use a token bucket or leaky bucket rate limiter on the consumer side. The queue absorbs the burst; the consumer drains it steadily.
- Batching. If the partner supports batch callbacks, aggregate events into batches of 50-100 per request. This delivers 5,000-10,000 events per second within their 100 RPS limit. This is the highest-leverage optimization.
- Backpressure and overflow. Set a queue depth limit. If the partner falls behind (their system is down), the queue grows. Add dead-letter queue (DLQ) for events that fail after N retries. Alert when queue depth exceeds a threshold — this means the partner cannot keep up even at steady state.
- Webhook with retry and exponential backoff. For the actual callback, use exponential backoff with jitter (1s, 2s, 4s, …, max 5 minutes). Include an `Idempotency-Key` header so the partner can safely deduplicate retries.
- Pull alternative. Offer the partner a poll-based API (`GET /events?since=cursor`) as an alternative to push. This lets them control their own consumption rate. Many partners prefer this because it puts them in control of backpressure.
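The queue-plus-batching idea can be sketched as a paced consumer. This is illustrative Python: `send_batch` stands in for the partner's HTTP callback, and a real worker would pace with a token bucket rather than fixed sleeps:

```python
import time
from collections import deque

def drain(queue: deque, send_batch, batch_size: int = 50, rps: int = 100) -> None:
    """Deliver queued events in batches, never exceeding `rps` outbound requests/sec."""
    interval = 1.0 / rps
    while queue:
        # One partner callback carries up to `batch_size` events.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        send_batch(batch)
        time.sleep(interval)  # crude pacing; production: token bucket + retries + DLQ
```

At 100 RPS and 50 events per batch this sustains 5,000 events/second to the partner, which is the batching math from the answer above.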
Design the API for a file upload service that handles files up to 5GB. Consider resume, progress, and failure.
A single POST with a 5GB body is a non-starter — any network interruption means restarting from zero, most proxies and load balancers have body size limits (often 1-10MB), and you cannot report meaningful progress.

Design: Chunked Resumable Upload Protocol

Step 1: Initiate the upload. The server returns an upload_id that the client uses for all subsequent requests.

Step 2: Upload chunks.
- Chunk size: 5MB is a good default — small enough that a failed chunk is cheap to retry, large enough that 1024 chunks is manageable. Let the server decide (clients with slow connections could request smaller chunks).
- Idempotency: Uploading the same chunk index twice is idempotent (server overwrites or ignores if checksum matches). This makes retries safe.
- Progress: The client knows `received_chunks / total_chunks`. For real-time progress, the client can calculate bytes uploaded per second from the chunk upload times.
- Expiry: Incomplete uploads expire after 24 hours. The client can extend with a `PATCH /uploads/{id}` to refresh the TTL.
- Parallel uploads: Chunks are independent, so the client can upload 3-5 chunks in parallel for better throughput on high-bandwidth connections.
15.7 REST vs gRPC vs GraphQL — When to Use Each
| Criteria | REST | gRPC | GraphQL |
|---|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/2 (always) | HTTP (typically POST) |
| Format | JSON (text) | Protocol Buffers (binary) | JSON (text) |
| Type safety | OpenAPI/Swagger (optional) | Strong (code-generated from .proto) | Strong (schema-defined) |
| Streaming | SSE or WebSocket (separate) | Native bidirectional streaming | Subscriptions (over WebSocket) |
| Browser support | Native | Requires gRPC-Web proxy | Native |
| Caching | HTTP caching works natively | No HTTP caching (binary POST) | Hard (every request is unique POST) |
| Payload size | Larger (JSON text + field names) | 3-10x smaller (binary, no field names) | Variable (client picks fields) |
| Best for | Public APIs, browser clients | Internal service-to-service, high-throughput | Multiple client types needing different data |
| Worst for | Mobile (over-fetching) | Browser clients, public APIs | Simple CRUD, real-time streaming |
| Versioning | URL-based (/v1/), easy | Proto file evolution, careful | Schema evolution, deprecation directives |
Decision Matrix with Concrete Scenarios
| Scenario | Choose | Why |
|---|---|---|
| Public API for third-party developers | REST | Universal tooling, discoverable, HTTP caching, curl-friendly |
| Mobile app + web app consuming same backend, different data needs | GraphQL | Each client fetches exactly the fields it needs, one endpoint |
| Microservice A calls microservice B 10,000 times/second | gRPC | Binary serialization, HTTP/2 multiplexing, code-generated stubs, streaming |
| Internal service with complex data relationships (social graph) | GraphQL | Natural fit for graph-like data traversal |
| Real-time bidirectional communication (chat, collab editing) | gRPC streaming | Native bidi streaming over HTTP/2, no WebSocket needed |
| Simple CRUD API, small team, getting started | REST | Lowest complexity, best documentation, largest talent pool |
| Backend-for-frontend (BFF) aggregating multiple services | GraphQL | Single query resolves data from multiple sources |
| IoT device sending telemetry at very high volume | gRPC | Minimal payload size, efficient binary serialization |
15.8 Contract Testing
Consumer defines expectations, provider verifies it satisfies them. Catches breaking changes before deployment.

- Caching — API responses are prime caching candidates. Use HTTP cache headers (`Cache-Control`, `ETag`, `Last-Modified`) for GET endpoints, and application-level caching (Redis) for expensive computations behind endpoints. A well-cached API can serve 100x the traffic without scaling the backend. See the Caching chapter for cache invalidation patterns and HTTP caching strategies.
- Performance and Reliability — Rate limiting (covered above) is only one layer of API protection. Circuit breakers protect your API from slow downstream dependencies. Timeouts prevent a single slow request from consuming a connection indefinitely. See the Performance chapter for the full resilience stack: timeouts, retries, circuit breakers, and bulkheads.
- Messaging — For APIs that trigger long-running operations (video processing, report generation, bulk imports), return `202 Accepted` immediately and publish the work to a message queue. The client polls a status endpoint or receives a webhook callback when complete. This pattern prevents HTTP timeouts and lets you scale processing independently from request handling. See the Messaging chapter for async job patterns and webhook delivery guarantees.
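The 202 Accepted pattern in miniature. In-memory Python: `submit` and `job_status` are hypothetical handlers standing in for real endpoints, a real queue, and a real jobs table:

```python
import uuid

JOBS = {}  # in-memory stand-in for a jobs table that queue workers update

def submit(payload) -> tuple:
    """Hypothetical POST handler: accept the work, enqueue it, return 202."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "pending", "result": None}
    # Production: publish `payload` to SQS/Kafka here; a worker sets status later.
    return 202, {"job_id": job_id, "status_url": f"/jobs/{job_id}"}

def job_status(job_id: str) -> tuple:
    """Hypothetical GET /jobs/{id} handler that the client polls."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"error": {"code": "not_found"}}
    return 200, job
```

The status URL in the 202 body is what makes the pattern self-describing: the client never guesses where to poll.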
Part IX — Databases and Data Systems
The database is where most systems spend the most time, have the most bugs, and face the hardest scaling challenges. The reason databases are hard is not the SQL — it is the trade-offs: consistency vs performance (isolation levels), read speed vs write speed (indexes), flexibility vs integrity (schema), and single-node simplicity vs distributed resilience (replication/sharding). Every database decision is a trade-off decision.

At GitHub's scale, an ALTER TABLE locks the table for hours or days — completely unacceptable for a service that developers depend on 24/7. GitHub built gh-ost (GitHub Online Schema Transmogrifier), an open-source tool that performs schema migrations without locking the original table. How it works: gh-ost creates a "ghost" table with the new schema, then uses MySQL's binary log (binlog) stream to copy rows from the original table to the ghost table while simultaneously capturing and replaying any new writes. When the ghost table is caught up, it performs an atomic table swap. The entire process runs with minimal impact on production traffic because it is throttle-aware — it monitors replication lag and slows down or pauses the migration if the database is under stress. gh-ost replaced the previous approach (pt-online-schema-change from Percona) because it does not use triggers, which were a source of unpredictable load spikes. GitHub now runs thousands of schema migrations per year across their fleet with zero downtime. The lesson: at scale, schema migration is not a DBA task — it is a distributed systems problem that requires dedicated tooling.

Chapter 16: Relational Foundations
16.1 Normalization and Denormalization
Normalization: Eliminate redundancy — each fact stored exactly once. A customer's address is in the `customers` table, not duplicated in every `orders` row. Benefits: no update anomalies (change address once, reflected everywhere), smaller storage, data integrity. Cost: queries need JOINs to reassemble related data.
Denormalization: Introduce controlled redundancy for read performance. Store the customer’s name directly in the orders table so displaying an order does not require a JOIN. Benefits: faster reads (no JOINs), simpler queries. Cost: update anomalies (customer changes name -> must update it in every order row), more storage, risk of inconsistency.
When to denormalize: When reads vastly outnumber writes (product catalog read 1000x per write). When JOIN performance is unacceptable (joining 5 tables for a dashboard query). When you are building a read-optimized model (CQRS read side, search index, analytics). When NOT to: When write consistency is critical (financial data). When the data changes frequently (denormalized copies go stale). As a first resort (optimize queries and indexes first).
16.2 Indexing In Depth
Index Types — When Each Shines
| Index Type | Data Structure | Supports | Best For | Avoid When |
|---|---|---|---|---|
| B-tree | Balanced tree | Equality, range (>, <, BETWEEN), sorting, prefix LIKE 'foo%' | Default choice for almost everything — primary keys, foreign keys, most WHERE clauses | You only ever do exact-match lookups at extreme scale (hash might edge it out) |
| Hash | Hash table | Equality only (=) | Exact-match lookups in theory (O(1) vs B-tree O(log n)) | Range queries, sorting, prefix matching. In PostgreSQL, hash indexes were not crash-safe before v10 and remain uncommon. B-tree handles equality efficiently enough that hash rarely justifies losing range support |
| Composite | B-tree (multi-column) | Queries matching the leftmost prefix of columns | Multi-column WHERE clauses, covering queries | Column order does not match query patterns (leftmost prefix rule) |
| Partial | B-tree (filtered) | Queries matching the WHERE condition of the index | Indexing a subset — e.g., WHERE status = 'active' when 95% of rows are inactive | The filtered subset is most of the table (no space savings) |
| Covering | B-tree with INCLUDE columns | Index-only scans (database never touches the heap) | Read-heavy queries where all selected columns fit in the index | Wide rows or frequently updated INCLUDE columns |
| GIN (Generalized Inverted Index) | Inverted index | Array containment, full-text search, JSONB key/value queries | PostgreSQL full-text search (to_tsvector), JSONB @> operators, array && overlap | Simple scalar equality — GIN has higher write cost |
| GiST (Generalized Search Tree) | Balanced tree (extensible) | Range overlap, containment, nearest-neighbor | PostGIS geospatial queries (ST_Within, ST_DWithin), range types, ltree hierarchies | Standard equality/range on scalar types (B-tree is faster) |
Gotcha: wrapping an indexed column in a function defeats the index (`WHERE UPPER(name) = 'JOHN'` does not use an index on name — use a functional index instead).
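The functional-index gotcha is easy to demonstrate with SQLite, which, like PostgreSQL, supports indexes on expressions; table and index names here are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO users (name) VALUES (?)", [("john",), ("jane",)])

# An index on `name` alone cannot serve WHERE UPPER(name) = ...; indexing the
# expression itself matches the query's shape exactly.
con.execute("CREATE INDEX idx_users_upper_name ON users (UPPER(name))")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE UPPER(name) = 'JOHN'"
).fetchall()
print(plan)  # the plan should reference idx_users_upper_name
```

In PostgreSQL the equivalent is `CREATE INDEX ON users (UPPER(name))`, verified the same way with `EXPLAIN`.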
16.3 Transactions and Isolation Levels
ACID: Atomicity, Consistency, Isolation, Durability.

Isolation Levels, Anomalies, and When to Use Each
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew | Performance | Default In |
|---|---|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Possible | Fastest | Almost never used |
| Read Committed | Prevented | Possible | Possible | Possible | Fast | PostgreSQL, Oracle, SQL Server |
| Repeatable Read | Prevented | Prevented | Possible (SQL standard) / Prevented (PostgreSQL) | Possible | Moderate | MySQL InnoDB |
| Serializable | Prevented | Prevented | Prevented | Prevented | Slowest | None by default |
- Dirty read: Transaction A reads data written by Transaction B before B commits. If B rolls back, A used data that never existed. Prevented at Read Committed and above.
- Non-repeatable read: Transaction A reads a row, Transaction B updates and commits that row, Transaction A reads again and gets a different value. Within a single transaction, the same query returns different results. Prevented at Repeatable Read and above.
- Phantom read: Transaction A queries rows matching a condition (`WHERE price < 100`), Transaction B inserts a new row matching that condition and commits, Transaction A re-queries and gets an extra row that was not there before. Prevented at Serializable (and at PostgreSQL's Repeatable Read due to snapshot isolation).
- Write skew: Two transactions each read a value, make a decision based on it, and write — but their combined writes violate a constraint that neither individual write violates. Example: two doctors both check that at least one doctor is on-call, each decides to go off-call, both commit — now zero doctors are on-call. Only Serializable prevents this.
- Read Committed: Use for typical web requests — showing a product page, listing orders, updating a user profile. Slight staleness within a transaction is fine because each statement sees the latest committed data.
- Repeatable Read: Use for reports that run multiple queries and need consistent numbers — “total revenue this month” should not change between the SELECT for orders and the SELECT for refunds within the same report transaction.
- Serializable: Use for financial operations where correctness is critical — transferring money between two accounts (both balances must be read and written atomically), inventory reservation (two users trying to buy the last item must not both succeed).
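The write-skew anomaly from the doctors example can be simulated in plain Python: each "transaction" checks the invariant against its own snapshot, so both writes succeed even though the combined result breaks the rule:

```python
# Each "transaction" works from its own consistent snapshot (snapshot isolation),
# checks the invariant ("at least one doctor on call"), then writes independently.
on_call = {"alice": True, "bob": True}

def go_off_call(doctor: str, snapshot: dict) -> None:
    if sum(snapshot.values()) >= 2:   # check against the SNAPSHOT, not live state
        on_call[doctor] = False       # no write conflict: different rows touched

snap_a = dict(on_call)  # both transactions begin before either commits
snap_b = dict(on_call)
go_off_call("alice", snap_a)
go_off_call("bob", snap_b)
print(sum(on_call.values()))  # 0 -- the invariant is broken without any conflict
```

Because the two transactions write different rows, no lock or first-committer-wins check fires; only Serializable isolation (or an explicit lock on a shared row) catches this.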
16.4 Locking
Pessimistic locking: `SELECT ... FOR UPDATE` acquires a lock on the row. Other transactions wait. Use when contention is high and conflicts are expected.
Optimistic locking: Include a version column. On update, check WHERE version = expected_version. If no row is updated, someone else changed it — retry or fail. Use when contention is low.
Advisory locking: Application-level locks using the database as a coordination point. PostgreSQL provides pg_advisory_lock(key). Use for distributed locking when you want to coordinate across application instances — e.g., ensuring only one instance runs a scheduled job.
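The optimistic-locking pattern with a version column, sketched against SQLite (the `accounts` table and function name are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
con.execute("INSERT INTO accounts VALUES (1, 100, 0)")

def update_balance(con, account_id: int, new_balance: int, expected_version: int) -> bool:
    cur = con.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    return cur.rowcount == 1  # 0 rows touched means another writer got there first

assert update_balance(con, 1, 150, expected_version=0)      # first writer wins
assert not update_balance(con, 1, 200, expected_version=0)  # stale: retry or fail
```

The losing writer re-reads the row (getting the new version) and retries, or surfaces a conflict error; no lock is ever held across the read-modify-write gap.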
16.5 CAP Theorem and Distributed Data
CAP theorem: Consistency, Availability, Partition Tolerance. The common "pick two" framing is misleading — network partitions are not optional, they will happen. The real choice is: during a partition, do you sacrifice consistency or availability? In normal operation (no partition), you can have both.

PACELC — The More Practical Model

The PACELC model (Daniel Abadi) extends CAP to capture what matters during normal operation: during a Partition, choose Availability or Consistency; Else (normal operation), choose Latency or Consistency. This captures the fact that even without partitions, there is a trade-off between response speed and consistency (synchronous replication is consistent but slower).

| System | During Partition (PAC) | Normal Operation (ELC) | Classification |
|---|---|---|---|
| PostgreSQL (sync replication) | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
| Cassandra | Sacrifice Consistency (AP) | Sacrifice Consistency for Latency (EL) | PA/EL |
| DynamoDB (eventual) | Sacrifice Consistency (AP) | Sacrifice Consistency for Latency (EL) | PA/EL |
| DynamoDB (strong reads) | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
| MongoDB (majority writes) | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
| CockroachDB | Sacrifice Availability (CP) | Sacrifice Latency for Consistency (EC) | PC/EC |
16.6 Database Replication
Synchronous vs Asynchronous vs Semi-Synchronous
| Replication Mode | How It Works | Data Loss Risk | Write Latency | Availability Impact |
|---|---|---|---|---|
| Synchronous | Write confirmed only after replica acknowledges | Zero (for acknowledged replicas) | High (network round-trip to replica) | Replica outage blocks writes |
| Asynchronous | Write confirmed after primary writes locally; replica catches up later | Possible (primary fails before replica receives write) | Lowest | Replica outage has no impact on writes |
| Semi-synchronous | Write confirmed after at least one replica acknowledges; remaining replicas are async | Minimal (at least one replica has the data) | Moderate (one round-trip, not all) | Tolerates all-but-one replica failures |
Your PostgreSQL database is getting slow under read load. Walk me through your approach to adding read replicas.

First, confirm the bottleneck really is reads. Check pg_stat_activity — if you see 80%+ of active connections running SELECT queries and the primary CPU is pegged at 90%+, read replicas will help. If you see lock waits and slow writes, replicas will not.

Then set up asynchronous replication to one read replica. On a typical cloud instance (e.g., RDS db.r6g.xlarge with 4 vCPUs), a single PostgreSQL node handles roughly 5,000-10,000 simple read queries per second. Adding one read replica doubles that read capacity to ~10,000-20,000 QPS. Route read-only queries from the application to the replica — this can be done at the ORM level (Rails connected_to, Django using), at the connection pool level, or via a proxy like PgBouncer with read/write splitting. Monitor replication lag — in normal operation in the same availability zone, expect 1-10 ms lag; cross-region replicas may see 50-200 ms lag depending on distance. Keep all writes on the primary.
Follow-up: A user creates a post and immediately sees their feed, but the post is not there. Users are complaining. What happened?
This is replication lag: the post was written to the primary, but the feed was read from an asynchronous replica that had not yet applied the write. Mitigations: (1) read-your-writes routing, where the server sends a user's reads to the primary for a short window after they write; (2) optimistic UI, where the client renders the new post immediately from its own state instead of waiting for the feed query; (3) an explicit ?after_write=true parameter that forces a primary read. Option 2 is best UX. Option 1 is the safest server-side fix.

16.7 Backup and Recovery
The principle: Backups are worthless. Restores are priceless. If you have never tested restoring from your backups, you do not have backups — you have hope.

Backup types: Full backup (complete copy of all data — slow, large, but simple to restore). Incremental backup (only changes since last backup — fast, small, but restore requires the full backup plus all increments). Differential backup (changes since last full backup — middle ground).

Point-in-Time Recovery (PITR): Restore the database to any specific moment. PostgreSQL achieves this through WAL (Write-Ahead Log) archiving — the base backup plus archived WAL segments let you replay transactions up to any timestamp. MySQL uses binary logs (binlogs) similarly. Cloud-managed databases (RDS, Cloud SQL) provide automated PITR with configurable retention (typically 1-35 days).

Backup testing: Schedule regular restore drills (monthly minimum). Measure restore time — does it meet your RTO? Verify data integrity after restore. Automate the verification (restore to a test instance, run validation queries, compare row counts). If you only discover your backup is corrupted during a real disaster, you have failed.

Cloud-managed backup: RDS automated backups (daily snapshots + continuous WAL archiving). Cloud SQL automated backups. Azure SQL geo-redundant backups. These are good defaults but understand their RPO (usually 5 minutes for PITR) and retention limits.

16.8 Schema Versioning and Migrations
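As a concrete sketch of the expand and contract pattern covered in this section (add the new column, dual-write, backfill, switch readers, drop the old column only in a later deploy), using SQLite with illustrative table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
con.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column. Old app code keeps working; newly deployed code
# writes to both fullname and display_name.
con.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Backfill (production would batch this to bound lock time and replication lag).
con.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Contract: only after every reader uses display_name, in a later deployment:
#   ALTER TABLE users DROP COLUMN fullname
print(con.execute("SELECT display_name FROM users WHERE id = 1").fetchone()[0])
```

Each step is independently deployable and independently reversible, which is the whole point: no single deploy both changes the schema and changes what the application reads.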
Use the expand and contract pattern: add new column, deploy app writing to both, backfill, deploy app reading from new column, remove old column. Never rename a column in one step. Always test migrations on production-sized data. Include rollback scripts.

16.9 SQL vs NoSQL — Decision Framework
| Criteria | SQL (PostgreSQL, MySQL) | NoSQL (MongoDB, DynamoDB, Cassandra) |
|---|---|---|
| Data model | Relational (tables, rows, joins) | Document, key-value, wide-column, graph |
| Schema | Fixed schema, enforced by DB | Flexible/schemaless (schema-on-read) |
| Transactions | Full ACID, multi-table | Limited (single-document in MongoDB, none in Cassandra) |
| Joins | Native, efficient | Application-side or denormalize |
| Write scaling | Vertical first, sharding is hard | Horizontal built-in (DynamoDB, Cassandra) |
| Query flexibility | SQL — ad-hoc queries, aggregations | Limited (design for your access pattern upfront) |
| Consistency | Strong by default | Eventual by default (tunable in some) |
| Best for | Complex queries, transactions, relational data, reporting | High write throughput, flexible schema, key-based access, massive scale |
16.10 Redis vs Memcached
| Criteria | Redis | Memcached |
|---|---|---|
| Data structures | Strings, hashes, lists, sets, sorted sets, streams, HyperLogLog | Strings only (key-value) |
| Persistence | RDB snapshots + AOF | None (pure cache) |
| Pub/Sub | Built-in | No |
| Clustering | Redis Cluster (auto-sharding) | Client-side sharding |
| Lua scripting | Yes (atomic operations) | No |
| Multi-threading | Single-threaded main loop (I/O threads since 6.0) | Multi-threaded |
| Best for | Caching + data structures + pub/sub + queues | Pure caching, max memory efficiency |
16.11 Specialized Data Stores
Document databases (MongoDB, CouchDB): Flexible schema, good for content management, catalogs. Key-value stores (Redis, DynamoDB): Ultra-fast lookups by key, session stores, caching. Time-series databases (InfluxDB, TimescaleDB): Optimized for time-stamped data, IoT, metrics. Search engines (Elasticsearch, OpenSearch): Full-text search, log analytics. Vector databases (Pinecone, Weaviate, pgvector): Similarity search for ML embeddings. Graph databases (Neo4j): Relationship-heavy queries (social networks, fraud detection).

Database Selection Guide
Use this table when choosing a database for a new project or feature. Start from your use case, not from a technology.

| Use Case | Recommended DB Type | Specific Options | Reasoning |
|---|---|---|---|
| General web app (users, orders, products) | Relational (SQL) | PostgreSQL, MySQL | ACID transactions, JOINs, flexible queries, mature tooling |
| High-write event stream (clickstream, IoT telemetry) | Time-series or wide-column | TimescaleDB, InfluxDB, Cassandra | Append-optimized writes, time-range queries, horizontal scale |
| Session store / caching layer | Key-value (in-memory) | Redis, Memcached | Sub-millisecond reads, TTL expiry, simple key-based access |
| Product catalog with variable attributes | Document store | MongoDB, DynamoDB | Flexible schema (each product has different attributes), nested documents |
| Full-text search (site search, log analytics) | Search engine | Elasticsearch, OpenSearch | Inverted indexes, relevance scoring, faceted search |
| Social network (friends, followers, recommendations) | Graph database | Neo4j, Amazon Neptune | Relationship traversal (friends-of-friends) is O(1) per hop vs expensive JOINs |
| ML feature store / semantic search | Vector database | Pinecone, Weaviate, pgvector | Approximate nearest neighbor (ANN) search on high-dimensional embeddings |
| Financial ledger (strict consistency, audit trail) | Relational (SQL) | PostgreSQL, CockroachDB | Serializable transactions, multi-row ACID, immutable audit log |
| Global multi-region low-latency reads | Distributed NoSQL | DynamoDB Global Tables, CockroachDB, Cassandra | Multi-region replication, tunable consistency, automatic failover |
| Queue / task broker | Stream or message broker | Redis Streams, Kafka, SQS | Ordered delivery, consumer groups, at-least-once semantics |
You have a PostgreSQL table with 500M rows. Queries are slow despite indexes. Walk me through your optimization strategy.
- Run `EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)` on the slow queries. This is non-negotiable — you cannot optimize what you have not measured. Look at the actual execution plan, not the estimated one.
- Check if the planner is using your indexes at all. If you see `Seq Scan` on a 500M-row table, either the index is missing for that query pattern, the statistics are stale, or the planner decided a sequential scan was cheaper (which can happen if the query returns a large fraction of the table).
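The analysis below is easiest to follow with a plan in front of you. Here is a reconstruction of the kind of `EXPLAIN (ANALYZE, BUFFERS)` output being described, built from the figures quoted in the text — the table and column names (`orders`, `status`, `created_at`) and the exact plan formatting are illustrative assumptions, not real output:

```sql
-- Hypothetical slow query (table and column names assumed for illustration)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE status = 'active' AND created_at > '2025-01-01';

-- Approximate plan before indexing, reconstructed from the numbers
-- discussed below:
--   Seq Scan on orders  (cost=0.00..12847623.00 rows=4793 width=120)
--                       (actual time=2.1..287893.4 rows=4793 loops=1)
--     Filter: ((status = 'active') AND (created_at > '2025-01-01'))
--     Rows Removed by Filter: 499995207
--     Buffers: shared read=8472981
--   Execution Time: 287893 ms

-- The fix: a composite index matching the filter. CONCURRENTLY avoids
-- blocking writes while the index builds on a 500M-row table.
CREATE INDEX CONCURRENTLY idx_orders_status_created
    ON orders (status, created_at);
```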
`Seq Scan` means PostgreSQL is reading every single row — all 500M of them. Four things in the plan tell the story: (1) Reading ~65 GB at roughly 230 MB/s sustained disk throughput on a typical cloud instance takes approximately 4-5 minutes of wall-clock time, which matches the 287893 ms (~4.8 minutes) execution time. (2) `Rows Removed by Filter: 499,995,207` means it examined 500M rows to find 4,793 matches — a selectivity of 0.001%. This is an index's dream scenario. (3) `Buffers: shared read=8,472,981` means it read ~8.5 million 8 KB pages from disk — about 65 GB of I/O for a query that should touch a few MB. (4) The `cost=0.00..12847623.00` is the planner's estimated cost in arbitrary units — higher means more expensive, and 12.8 million is astronomical.

Now add a composite index and re-run `EXPLAIN ANALYZE`: the `Index Scan` touches exactly the rows it needs (4,821 buffer hits vs 8.5 million disk reads). At 500M rows, a B-tree index is about 4-5 levels deep, so the lookup requires roughly 5 random I/O operations to find the first matching leaf page, then a sequential scan through adjacent leaf pages for the remaining matches. This is why `EXPLAIN ANALYZE` is non-negotiable — without it, you are guessing. Every database interview answer about query optimization that does not start with "I would run EXPLAIN ANALYZE" is incomplete.

Phase 2: Statistics and maintenance.

- Run `ANALYZE` on the table. PostgreSQL's query planner uses statistics about data distribution to choose execution plans. On a 500M-row table, the default statistics target (100) may be too low — increase it with `ALTER TABLE ... ALTER COLUMN ... SET STATISTICS 1000` on high-cardinality columns used in WHERE clauses.
- Check for bloat. On a heavily-updated table, dead tuples accumulate (MVCC). Run `SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'your_table'`. If dead tuples are a significant fraction of live tuples, use `VACUUM FULL` (takes an exclusive lock) or `pg_repack` (non-locking) to reclaim space and reduce I/O.
- Check index bloat. B-tree indexes on frequently-updated columns become bloated over time. `REINDEX INDEX CONCURRENTLY` (PostgreSQL 12+) rebuilds without locking out writes.
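The maintenance steps above, sketched as runnable SQL — the table, column, and index names are placeholders:

```sql
-- Refresh planner statistics; raise the sampling target on a hot column
ANALYZE orders;
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;  -- re-run so the new target takes effect

-- Check table bloat: dead vs live tuples
SELECT n_dead_tup, n_live_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Rebuild a bloated index without blocking writes (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_status_created;
```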
Phase 3: Index strategy.

- Verify composite index column order matches query patterns (leftmost prefix rule). An index on `(status, created_at)` helps `WHERE status = 'active' AND created_at > '2025-01-01'` but NOT `WHERE created_at > '2025-01-01'` alone.
- Consider partial indexes if queries consistently filter a subset: `CREATE INDEX idx_active ON orders(created_at) WHERE status = 'active'` — if only 5% of rows are active, this index is 20x smaller and roughly 20x faster to scan.
- Consider covering indexes (`INCLUDE`) to enable index-only scans — if the query only needs `id` and `amount`, `CREATE INDEX idx ON orders(status, created_at) INCLUDE (id, amount)` avoids touching the heap entirely.
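One way to check whether the workload actually uses the indexes you pay write overhead for — a sketch against the standard `pg_stat_user_indexes` view; the table name is a placeholder:

```sql
-- idx_scan counts how often each index has been used since statistics
-- were last reset; an index with idx_scan = 0 on a busy table is pure
-- write overhead and a candidate for removal
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan ASC;
```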
Phase 4: Architectural changes.

- Partitioning. If queries consistently filter by a time range (and they usually do), partition by month or quarter using PostgreSQL declarative partitioning. A query for "last 30 days" now scans one partition (maybe 15M rows) instead of 500M. Partition pruning is automatic. The numbers are dramatic: scanning 15M rows takes ~2-3 seconds vs several minutes for 500M. With an index on the partition, point lookups drop to under 1 ms because the index is ~30x smaller and fits entirely in memory.
- Read replicas. Route analytics and reporting queries to a read replica. Keep the primary for OLTP.
- Materialized views. For expensive aggregation queries that do not need real-time data, create a materialized view refreshed every hour. A dashboard that aggregates 500M rows every page load is a bug, not a feature.
- Add pagination. No query should return 500M rows. Cursor-based pagination with `WHERE id > last_seen_id ORDER BY id LIMIT 100` costs the same no matter how deep the client pages — one index seek plus 100 rows. At 500M rows, an `OFFSET 10000000 LIMIT 20` query still scans and discards 10 million rows — cursor-based pagination avoids this entirely.
- Cache hot queries in Redis with a TTL. If the same aggregation is requested 1000 times per minute, compute it once. A Redis GET takes ~0.1 ms vs a complex PostgreSQL aggregation at 50-500 ms — that is a 500-5000x improvement for repeat queries. See the Caching chapter for cache invalidation strategies (write-through, write-behind, cache-aside) and how to avoid the thundering herd problem on cache expiry.
- Review the ORM. ORMs often generate suboptimal queries — N+1 problems, unnecessary `SELECT *`, missing WHERE clauses. `EXPLAIN ANALYZE` the actual SQL the ORM produces, not what you think it produces. A classic N+1 on a 500M-row table turns a single 5 ms indexed query into 100 separate queries taking 500 ms total — plus the connection overhead.
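The partitioning and materialized-view steps above can be sketched as SQL. A minimal sketch — the table layout, names, and monthly range are assumptions for illustration:

```sql
-- Range-partition by month; queries filtering on created_at prune to
-- the matching partitions automatically
CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY,
    status      text NOT NULL,
    amount      numeric NOT NULL,
    created_at  timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2025_01 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Periodically refreshed aggregate for dashboards; CONCURRENTLY needs
-- a unique index on the view and avoids blocking readers
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT created_at::date AS day, sum(amount) AS revenue
FROM orders
GROUP BY 1;

CREATE UNIQUE INDEX ON daily_revenue (day);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;
```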
- Caching — The fastest database query is the one you never make. For read-heavy workloads, a caching layer (Redis, Memcached) in front of the database can reduce read load by 90%+. See the Caching chapter for strategies on cache invalidation, TTL design, and avoiding the thundering herd problem when a hot cache key expires and 1,000 requests simultaneously hit the database.
- Performance and Reliability — Query tuning is one dimension of performance, but connection pooling (PgBouncer, ProxySQL), circuit breakers around database calls, and timeout configuration are equally critical. A single slow query without a timeout can hold a connection for minutes, starving the pool. See the Performance chapter for connection pool sizing, timeout strategies, and circuit breaker patterns around data stores.
- Messaging and Async Processing — For write-heavy workloads, consider decoupling writes through a message queue. Instead of synchronously inserting into the database on every API request, publish to Kafka or SQS and let a consumer batch-insert. This absorbs traffic spikes (Black Friday: 10x normal write volume) without overwhelming the database. See the Messaging chapter for the outbox pattern, which ensures reliable publishing of database changes to a message broker without distributed transactions.
API Design Checklist
Use this checklist before shipping any API — internal or external. The items are ordered roughly by "pain caused if you skip this." Every item you defer is technical debt that compounds with every new consumer.

1. Versioning Strategy
- Choose a versioning scheme before writing the first endpoint (`/v1/` URL path for public APIs, header-based for internal)
- Define what constitutes a breaking change for your API (removing fields, renaming fields, changing types, changing status codes)
- Establish a deprecation policy with a timeline (minimum 6 months notice for public APIs, 2 weeks for internal)
- Pin each consumer to a version — new consumers get the latest, existing consumers keep theirs until they opt in
- Include the version in every response header (`X-API-Version: 2024-01-15`) so consumers can verify
2. Authentication and Authorization
- Choose auth mechanism: API keys (simple, per-consumer), OAuth 2.0 (delegated access, scopes), JWT (stateless, short-lived)
- Require HTTPS everywhere — no exceptions, even for internal APIs (zero-trust networking)
- Implement API key rotation without downtime (support two active keys per consumer during rotation)
- Return `401 Unauthorized` (identity unknown) vs `403 Forbidden` (identity known, insufficient permissions) — these are different
- Log every authentication failure with IP, key prefix (never the full key), and timestamp for security auditing
3. Rate Limiting
- Define rate limits per consumer tier (free: 100 req/min, paid: 10,000 req/min, enterprise: custom)
- Return `429 Too Many Requests` with a `Retry-After` header (in seconds) and `X-RateLimit-Remaining`, `X-RateLimit-Reset` headers
- Choose algorithm: token bucket for APIs allowing bursts, sliding window for strict rate enforcement
- Implement at the API gateway level (not per-application-instance) for accurate distributed counting
- Document rate limits prominently — developers should not discover limits by hitting them
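On the wire, a rate-limited response following the conventions above might look like this — the header values and error body are illustrative:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1735689600
Content-Type: application/json

{ "error": { "code": "rate_limited", "message": "Rate limit exceeded. Retry after 30 seconds." } }
```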
4. Pagination
- Choose cursor-based pagination for all list endpoints (not offset-based — offset degrades at scale)
- Return pagination metadata in every list response: `{ "data": [...], "has_more": true, "next_cursor": "abc123" }`
- Set a maximum `page_size` (e.g., 100) and a sensible default (e.g., 20) — never return unbounded results
- Ensure cursors are opaque to clients (base64-encoded internal identifiers, not raw database IDs)
- Document the sort order — cursors are meaningless without a defined, stable ordering
5. Error Format
- Define a single, consistent error schema used by every endpoint:
- Include a machine-readable `code` (for programmatic handling) AND a human-readable `message` (for debugging)
- Include `request_id` in every response (success and error) — this is the lifeline for support debugging
- Use appropriate HTTP status codes: `400` (bad input), `401` (not authenticated), `403` (not authorized), `404` (not found), `409` (conflict), `422` (validation failed), `429` (rate limited), `500` (server error)
- Never leak internal details (stack traces, SQL errors, internal service names) in production error responses
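The first item above calls for a single schema; a sketch consistent with the error envelope used earlier in this chapter (the exact field set beyond `code`, `message`, `param`, and `request_id` is an assumption):

```json
{
  "error": {
    "code": "invalid_currency",
    "message": "Currency 'usd' is not valid. Did you mean 'USD'? Currencies must be uppercase ISO 4217 codes.",
    "param": "currency",
    "request_id": "req_abc123"
  }
}
```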
6. Idempotency
- Require an `Idempotency-Key` header on all non-idempotent mutations (POST, PATCH)
- Store idempotency keys with the response — retries return the stored response, not a new execution
- Scope keys to the API consumer (two different consumers can use the same key string)
- Set a TTL on idempotency records (24 hours is standard — Stripe’s model)
- Return an error if a retried request has a different body than the original (the key is bound to the parameters)
- For critical operations (payments, order creation), wrap the idempotency check and business logic in a single database transaction
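One way to back the checklist above with storage — a SQL sketch of the idempotency table and the transactional check; table, column, and value names are assumptions:

```sql
CREATE TABLE idempotency_keys (
    consumer_id   text NOT NULL,
    key           text NOT NULL,
    request_hash  text NOT NULL,       -- detect retries with a different body
    response_body jsonb,               -- stored response returned on retry
    created_at    timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (consumer_id, key)     -- keys are scoped per consumer
);

-- Wrap the check and the business logic in one transaction:
BEGIN;

INSERT INTO idempotency_keys (consumer_id, key, request_hash)
VALUES ('acct_123', 'idem_456', 'sha256-of-request-body')
ON CONFLICT (consumer_id, key) DO NOTHING;
-- If the insert affected 0 rows, this is a retry: fetch the stored
-- response_body and return it (erroring if request_hash differs).
-- Otherwise run the business logic, then persist the response:

UPDATE idempotency_keys
SET response_body = '{"status": "created"}'::jsonb
WHERE consumer_id = 'acct_123' AND key = 'idem_456';

COMMIT;
```

A background job can delete rows older than the TTL (e.g., 24 hours) to keep the table bounded.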
7. Documentation
- Write an OpenAPI (Swagger) spec before writing code — the spec is the contract
- Include a runnable example for every endpoint (curl command or code snippet in 2-3 languages)
- Document every error code with cause, example, and suggested fix
- Provide a “Getting Started” guide that gets a developer from zero to first successful API call in under 5 minutes
- Include a changelog that documents every change with the date, affected endpoints, and migration path
- Host interactive API docs (Swagger UI, Redoc, or Stoplight) — developers should be able to try endpoints without writing code
8. Additional Production Concerns
- CORS — Configure allowed origins, methods, and headers for browser-based consumers
- Request/response logging — Log every request with method, path, status code, latency, and consumer ID (redact sensitive fields)
- Health check endpoint — `GET /health` returns `200` with dependency status (database connected, cache reachable) for load balancer probes
- Compression — Support `Accept-Encoding: gzip` — JSON compresses 5-10x, which matters for bandwidth-constrained mobile clients
- Timeouts — Set and document server-side request timeouts (e.g., 30 seconds) — a client should never wait indefinitely
- Webhooks (if applicable) — Support event notifications with retry logic, signature verification (`HMAC-SHA256`), and a test mode