Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Database Deep Dives

Knowing SQL syntax is table stakes. Knowing why your query plan changed after a VACUUM, why your DynamoDB table has a hot partition at 3 AM, why MongoDB’s $lookup is killing your aggregation pipeline, or why Redis just evicted your session keys — that is what separates engineers who operate databases from engineers who understand them. This chapter goes beneath the surface of four databases you will encounter in production, with the depth that interviews at serious companies demand.
Real-World Story: Figma’s PostgreSQL Scaling Journey. Figma, the collaborative design tool valued at $20 billion, ran on a single PostgreSQL instance far longer than most engineers would expect. By 2020, that single instance was handling millions of collaborative editing sessions, billions of rows, and thousands of queries per second. Rather than immediately jumping to a distributed database, Figma’s team invested in understanding PostgreSQL internals — aggressive index tuning, partitioning large tables, optimizing VACUUM settings, and using PgBouncer to manage connection pooling. When they eventually needed to scale further, they horizontally partitioned by workspace using application-level routing, keeping PostgreSQL as the storage engine. Their CTO has publicly stated that deep PostgreSQL knowledge bought them years of runway on a single database — years they spent building product instead of managing distributed systems complexity. The lesson: understanding your database’s internals is often worth more than replacing it.
Real-World Story: Coinbase and DynamoDB at Financial Scale. When Coinbase needed a database to handle cryptocurrency transaction ledgers — millions of writes per second during market volatility, with zero tolerance for data loss — they chose DynamoDB. But the decision was not “DynamoDB is the best database.” It was “DynamoDB’s single-digit millisecond latency at any scale, combined with its durability guarantees, matches our specific access pattern: high-velocity writes with simple key-based lookups.” Coinbase’s engineering team designed their DynamoDB tables around access patterns first (not entities), used single-table design for related data, and built a sophisticated partition key strategy to avoid hot partitions during Bitcoin price spikes (when everyone buys or sells simultaneously). They have publicly discussed how getting the partition key wrong in early iterations caused throttling during peak trading — a mistake that directly cost money. The takeaway: choosing the right database is necessary but not sufficient. Designing for that database’s strengths is where the real engineering happens.

Section 1: PostgreSQL Internals

PostgreSQL is not just “a relational database.” It is a sophisticated system with its own approach to concurrency, crash recovery, query optimization, and extensibility. Understanding these internals transforms you from someone who writes SQL into someone who can diagnose why a query that ran in 2 ms yesterday takes 30 seconds today.

1.1 MVCC (Multi-Version Concurrency Control)

Most databases face a fundamental tension: readers and writers want to access the same data simultaneously. The brute-force solution is locking — make readers wait while writers write, and vice versa. PostgreSQL chose a different path: readers never block writers, and writers never block readers. PostgreSQL achieves this through MVCC. Instead of modifying a row in place, every UPDATE creates a new version of the row. The old version is not immediately deleted — it sticks around so that any transaction that started before the update can still see the original value. Each row version has two hidden system columns: xmin (the transaction ID that created this version) and xmax (the transaction ID that deleted or replaced this version, or 0 if it is still current). When a transaction runs a query, PostgreSQL checks each row version’s xmin and xmax against the transaction’s snapshot — a record of which transactions were committed at the moment this transaction began. If xmin is from a committed transaction that started before the snapshot and xmax is either 0 or from a transaction not yet visible, the row version is visible. Otherwise, it is skipped. What this means in practice: Ten concurrent transactions can all read the orders table while an eleventh transaction is updating 50,000 rows. The readers see a consistent snapshot of the data as it existed when their transaction started. No locks. No waiting. No dirty reads.
The cost of MVCC: dead tuples. Every UPDATE creates a new row version and marks the old one as “dead.” A table with 1 million rows that gets updated 10 times has 10 million dead tuples consuming disk space and slowing down sequential scans. This is why VACUUM exists — and why understanding VACUUM is non-negotiable for PostgreSQL operations.
Cross-chapter connection: MVCC is the mechanism behind PostgreSQL’s isolation levels discussed in APIs and Databases — Transactions and Isolation Levels. PostgreSQL’s REPEATABLE READ is actually snapshot isolation powered by MVCC — each transaction sees a frozen snapshot of the database, which is why it prevents phantom reads (unlike the SQL standard’s definition of REPEATABLE READ).

1.2 WAL (Write-Ahead Log)

The Write-Ahead Log is PostgreSQL’s crash recovery mechanism and the foundation of its replication system. The principle is simple: before any change is written to the actual data files on disk, a record of that change is first written to the WAL. Why this matters: Imagine PostgreSQL is halfway through writing an UPDATE to a data page when the server loses power. Without WAL, that page is now corrupted — half old data, half new data. With WAL, on restart PostgreSQL reads the WAL from the last checkpoint, replays any committed transactions that were not yet flushed to the data files, and rolls back any uncommitted transactions. The data files might be in an inconsistent state, but the WAL has everything needed to reconstruct consistency. How it works step by step:
  1. Transaction begins. Client issues UPDATE accounts SET balance = 500 WHERE id = 42.
  2. PostgreSQL modifies the row in the shared buffer pool (in-memory copy of the data page).
  3. A WAL record describing this change is written to the WAL buffer (in memory).
  4. When the transaction commits, the WAL buffer is flushed to disk (pg_wal/ directory). This is the fsync call — the moment the change is durable.
  5. The actual data page in the buffer pool is not immediately written to disk. It will be flushed later by the background writer or checkpointer.
  6. On crash, PostgreSQL replays WAL records from the last checkpoint to bring data files up to date.
WAL and replication: Streaming replication works by sending WAL records from the primary to replicas in real-time. The replica applies the same WAL records to its own data files, keeping an exact copy. This is why PostgreSQL replication is so reliable — replicas are literally replaying the same sequence of changes.
pg_wal directory. WAL segments are stored in pg_wal/ (called pg_xlog/ before PostgreSQL 10). Each segment is 16 MB by default. If this directory fills up (because archiving is failing or replication slots are retaining too many segments), PostgreSQL will refuse to accept new writes until space is freed. Monitor pg_wal disk usage in production.
Cross-chapter connection: WAL and the OS. The fsync call in step 4 is where PostgreSQL meets the operating system. fsync forces the OS to flush its page cache to physical disk — without it, a power failure could lose data that the application believes is committed. This is the same fsync mechanism discussed in OS Fundamentals — File Systems and I/O, where we cover how the page cache, write-back buffers, and mmap() interact with database engines. Understanding why fsync is expensive (it bypasses the OS page cache’s lazy write-back) explains why PostgreSQL batches WAL writes and why synchronous_commit = off is sometimes a valid trade-off. For how AWS manages this durability layer for you, see Cloud Service Patterns — RDS vs Aurora, where Aurora’s distributed storage architecture eliminates per-instance WAL management entirely.

1.3 VACUUM — Dead Tuple Cleanup

VACUUM is the garbage collector for MVCC. It reclaims space occupied by dead tuples (old row versions that are no longer visible to any active transaction), updates the visibility map (which pages are fully visible to all transactions — enabling index-only scans), and updates the free space map (which pages have room for new rows). What happens without VACUUM:
  • Table bloat: a 1 GB table grows to 10 GB as dead tuples accumulate.
  • Index bloat: indexes still point to dead tuples, making them larger and slower.
  • Transaction ID wraparound: PostgreSQL uses 32-bit transaction IDs. After ~2 billion transactions, IDs wrap around and old transactions appear to be in the “future.” If this happens, PostgreSQL shuts down to prevent data corruption. VACUUM marks old transaction IDs as “frozen” to prevent wraparound.
Autovacuum is the background process that runs VACUUM automatically. It monitors each table and triggers when the number of dead tuples exceeds a threshold:
threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * table_size
Default: 50 + 0.2 * row_count. For a 10-million-row table, that means autovacuum triggers after 2,000,050 dead tuples — often too late. Autovacuum tuning for large tables:
ALTER TABLE large_events SET (
  autovacuum_vacuum_scale_factor = 0.01,    -- trigger at 1% dead tuples instead of 20%
  autovacuum_vacuum_threshold = 1000,       -- lower base threshold
  autovacuum_analyze_scale_factor = 0.005,  -- update statistics more frequently
  autovacuum_vacuum_cost_delay = 2          -- reduce throttling (default 20ms)
);
VACUUM does not reclaim disk space to the OS. Regular VACUUM marks dead space as reusable within the table but does not shrink the file. Only VACUUM FULL (which rewrites the entire table with an exclusive lock) returns space to the OS. For a 500 GB table, VACUUM FULL requires 500 GB of temporary disk space and locks the table for the entire duration — potentially hours. In production, use pg_repack instead, which performs an online table rewrite without an exclusive lock.

1.4 The Query Planner

When you run a query, PostgreSQL does not just execute it. It first generates multiple possible execution plans and estimates the cost of each, then picks the cheapest one. Understanding the planner means understanding why PostgreSQL sometimes ignores your carefully crafted index. Scan types the planner chooses between:
Scan TypeWhen ChosenExample Scenario
Sequential ScanReading most of the table, or table is smallSELECT * FROM users (no WHERE), or WHERE status = 'active' when 90% of rows are active
Index ScanFetching a small fraction of rows via indexWHERE id = 42 on a B-tree primary key index
Bitmap Index ScanFetching a moderate fraction (1-20%) of rowsWHERE created_at > '2024-01-01' returning 10% of rows
Index Only ScanAll required columns exist in the index (covering index) and visibility map is currentSELECT id, email FROM users WHERE email = 'x' with an index on (email) INCLUDE (id)
Why the planner sometimes “ignores” your index: The planner uses table statistics (row count, value distribution, correlation) to estimate how many rows each plan will return. If statistics are stale (because ANALYZE has not run recently), the planner makes bad estimates. If the planner estimates a query will return 50% of the table, a sequential scan is genuinely faster than an index scan — random I/O (jumping around the heap via index pointers) is slower than sequential I/O (reading pages in order) when you are reading most of the table anyway.
Force a statistics update if the planner is making bad choices: ANALYZE tablename; This updates the statistics that the planner relies on. In autovacuum configurations, ANALYZE runs automatically, but after bulk loads or major data distribution changes, a manual ANALYZE is often needed immediately.
Cross-chapter connection: Indexing for API queries. The query planner’s index selection directly impacts your API response times. If your REST endpoint does GET /users?status=active&sort=created_at, the planner needs a compound index on (status, created_at) to avoid a sequential scan. This is the database layer of the optimization chain discussed in Performance and Scalability — Query Optimization. When the planner falls back to a sequential scan on a 50M-row table, your API p99 goes from 5 ms to 3 seconds — and no amount of application-level caching hides a fundamentally missing index. See also APIs and Databases for how index design connects to API query parameter design.

1.5 EXPLAIN ANALYZE Deep Dive

EXPLAIN shows the plan. EXPLAIN ANALYZE runs the query and shows actual vs estimated numbers. This is the single most important diagnostic tool for PostgreSQL performance.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.name, COUNT(o.id) as order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2024-01-01'
GROUP BY u.name
ORDER BY order_count DESC
LIMIT 10;
Reading the output — what to look for:
  1. Actual rows vs estimated rows. If the planner estimated 100 rows but got 50,000, the plan is based on wrong statistics. Run ANALYZE on the table.
  2. Loops. A Nested Loop with loops=10000 means the inner operation ran 10,000 times. Multiply the inner node’s time by the loop count for the true cost.
  3. Buffers: shared hit vs shared read. shared hit = read from buffer cache (fast). shared read = read from disk (slow). A high shared read count means the working set does not fit in shared_buffers.
  4. Sort Method: external merge disk. The sort spilled to disk because work_mem was too small. Increase work_mem for this session or globally.
  5. Seq Scan with Filter: rows removed by filter. If a sequential scan filtered out 99% of rows, an index would be far more efficient.
pg_stat_statements. The single most valuable extension for production PostgreSQL. It tracks every distinct query, its total/mean/max execution time, the number of calls, rows returned, and buffer usage. Finding your slowest queries is as simple as:
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
This tells you which queries consume the most cumulative time — the highest-ROI optimization targets.

1.6 Connection Management

Each PostgreSQL connection is a full OS process (not a thread). On Linux, each backend process consumes approximately 5-10 MB of RAM. A server with max_connections = 500 reserves 2.5-5 GB just for connection overhead — before any query execution. The problem: Modern web applications using serverless functions or microservices can easily open thousands of connections. Each Lambda invocation, each Kubernetes pod, each application instance opens its own connections. Without connection pooling, you exhaust PostgreSQL’s connection limit quickly. PgBouncer is the standard solution — a lightweight connection pooler that sits between your application and PostgreSQL. It maintains a small pool of actual PostgreSQL connections (say, 50) and multiplexes hundreds or thousands of client connections onto them. Pooling modes:
  • Transaction pooling (most common): A server connection is assigned to a client for the duration of one transaction, then returned to the pool. Best for web workloads with short transactions.
  • Session pooling: A server connection is assigned for the entire client session. Needed for features that require session state (prepared statements, temp tables, SET commands).
  • Statement pooling: A connection is assigned per statement. Most aggressive but breaks multi-statement transactions.
Transaction pooling breaks prepared statements unless you use server-side prepared statement support in PgBouncer 1.21+. If your ORM uses prepared statements by default (most do), you will get errors like prepared statement does not exist. Either switch to session pooling (loses most of the benefit), disable prepared statements in the ORM, or upgrade PgBouncer.
Cross-chapter connection: Serverless and connection pooling. The connection exhaustion problem becomes acute with AWS Lambda, where each concurrent invocation opens its own connection. A Lambda function with 1,000 concurrent executions means 1,000 PostgreSQL backend processes — almost certainly exceeding max_connections. AWS RDS Proxy is the managed solution: it pools connections on the AWS side, letting thousands of Lambda invocations share a small pool of actual database connections. See Cloud Service Patterns — Lambda and RDS for RDS Proxy configuration and the Lambda connection lifecycle. For the mathematical framework behind sizing your connection pool (using Little’s Law), see the Connection Pool Sizing Formula later in this chapter.

1.7 Partitioning

Partitioning splits a single logical table into multiple physical tables. PostgreSQL supports three strategies:
  • Range partitioning: Divide by value ranges. Most common for time-series data: one partition per month, per week, or per day.
  • List partitioning: Divide by discrete values. Partition by region (us-east, eu-west) or status (active, archived).
  • Hash partitioning: Distribute evenly across N partitions based on a hash of the partition key. Good for even distribution when there is no natural range or list.
When to partition:
  • Tables over 100 million rows where queries consistently filter by the partition key.
  • Time-series data where old partitions can be cheaply dropped (DROP TABLE is instant; DELETE FROM ... WHERE date < X on a 500M row table takes hours and creates massive dead tuple bloat).
  • When you need different storage strategies for hot vs cold data (recent partitions on SSD, old partitions on cheaper storage).
When NOT to partition: When your queries do not filter by the partition key (every query hits all partitions — worse than a single table). When the table has fewer than 10 million rows (partitioning overhead exceeds benefit). When you just need better indexes.

1.8 Extensions

PostgreSQL’s extension ecosystem is one of its strongest competitive advantages:
  • PostGIS: Geospatial data types and queries. ST_DWithin(point, location, 5000) finds everything within 5 km. Powers location features at companies from Instacart to OpenStreetMap.
  • pg_trgm: Trigram-based fuzzy text search. Enables LIKE '%search%' queries to use an index (normally impossible with leading wildcards). Create a GIN trigram index and suddenly fuzzy search is fast.
  • TimescaleDB: Turns PostgreSQL into a time-series database with automatic partitioning (hypertables), compression (90%+ space reduction on old data), and continuous aggregates. If you are choosing between PostgreSQL + TimescaleDB and InfluxDB, keep it in the PostgreSQL ecosystem unless you have a strong reason not to.
  • pgvector: Vector similarity search for AI/ML embeddings. Store embeddings alongside relational data and query both in a single SQL statement. Rapidly becoming the default for RAG (Retrieval-Augmented Generation) applications that do not need billion-vector scale.
Cross-chapter connection: pgvector is increasingly relevant for AI engineering interviews. If you are building RAG applications, being able to discuss how pgvector compares to dedicated vector databases (Pinecone, Weaviate, Qdrant) demonstrates cross-domain depth. See the AI Engineer Interview Questions chapter for LLM-specific database considerations.

Section 2: MongoDB Patterns

MongoDB is a document database — it stores data as JSON-like documents (BSON internally) rather than rows in tables. It excels at rapid iteration with flexible schemas, hierarchical data that maps naturally to nested documents, and read-heavy workloads with denormalized data. It struggles with complex relationships, multi-document transactions at scale, and workloads that would benefit from JOINs.

2.1 The Document Model

A MongoDB document is a JSON object. Unlike a relational row, it can contain nested objects, arrays, and mixed types. This flexibility is both its greatest strength and its most common source of production pain. When the document model shines:
  • Hierarchical data: A product with variants, each variant with sizes, each size with inventory by warehouse. In PostgreSQL, that is 4 tables and 3 JOINs. In MongoDB, it is one document.
  • Flexible schema: An e-commerce platform where shoes have size and color, electronics have voltage and wattage, and books have isbn and author. In relational, you either use sparse columns (wasteful) or an EAV pattern (painful). In MongoDB, each document has exactly the fields it needs.
  • Read-heavy with denormalization: When 90% of your operations are “fetch this entire thing and display it,” embedding all the data in one document means one read, no JOINs.
When the document model hurts:
  • Many-to-many relationships: Students and courses. If a student document embeds course references and a course document embeds student references, updating a course name means updating every student document that references it. This is the relational model’s strength and the document model’s weakness.
  • Data that is frequently updated in nested arrays: Updating the 500th element of a 1000-element embedded array is expensive. MongoDB must rewrite the entire document.
  • Complex analytical queries with JOINs across collections: MongoDB’s $lookup (the equivalent of JOIN) is significantly slower than relational JOINs and does not support the same optimization strategies.

2.2 Schema Design Patterns

MongoDB schema design is not “throw everything in one document.” Production schemas use well-established patterns: Extended Reference Pattern: Instead of embedding the entire related document, embed only the fields you need for display. An order document contains { customer_name: "Jane", customer_email: "jane@co.com" } rather than the entire customer object. Reduces document size and read latency while accepting controlled denormalization. Subset Pattern: When a document has a large array but queries only need the most recent entries, store only the last N items in the main document and archive older items to a separate collection. A product’s reviews document has the last 10 reviews embedded; the full review history lives in a reviews collection. Bucket Pattern: Instead of one document per event (millions of tiny documents), group events into time-based buckets. A sensor reading every second becomes one document per minute with an array of 60 readings. Reduces document count by 60x, improves query efficiency, and reduces index overhead. Computed Pattern: Pre-compute aggregations and store the results. Instead of calculating total_revenue by summing all order documents every time, maintain a running total in a statistics document updated on each new order. Trades write complexity for read speed. Outlier Pattern: When 99% of documents have small arrays but 1% have massive arrays (a celebrity with 10 million followers vs a normal user with 200), handle the outliers differently. Normal users embed their follower list. Celebrity users store a flag has_overflow: true and put excess followers in a separate collection.

2.3 Write Concerns and Read Concerns

Write concern controls how many replicas must acknowledge a write before the driver returns success. Read concern controls the consistency guarantee of read operations.
Write ConcernDurabilityLatencyUse Case
w: 0Fire and forget (no acknowledgment)LowestMetrics, logs where occasional loss is acceptable
w: 1Primary acknowledgedLowDefault — good for most operations
w: "majority"Majority of replica set acknowledgedHigherFinancial data, anything you cannot afford to lose on primary failure
Read ConcernGuaranteeTrade-off
localReturns the most recent data on the queried node (may be rolled back)Fastest, default
majorityReturns data acknowledged by majority (durable, will not be rolled back)Slightly slower
linearizableReturns data that is both majority-committed and reflects all successful writes before the read startedSlowest, strongest
The default w: 1 can lose data. If the primary acknowledges a write and then fails before replicating to a secondary, that write is lost when a secondary is promoted to primary. For any data where loss is unacceptable, use w: "majority". The latency cost is typically 2-5 ms extra.

2.4 Aggregation Pipeline

MongoDB’s aggregation pipeline is a sequence of stages that transform documents. Think of it as a Unix pipeline for data: each stage takes documents in, processes them, and passes the results to the next stage. Key stages: $match (filter — always put first to reduce documents early), $group (aggregate), $project (reshape), $sort, $limit, $unwind (flatten arrays), $lookup (left outer join to another collection). The $lookup limitation: $lookup performs a left outer join, but it is fundamentally different from a relational JOIN. It executes a separate query for each input document (unless the joined collection is small enough to fit in memory). On a pipeline processing 100,000 documents, a $lookup can trigger 100,000 separate queries against the joined collection. Always ensure the joined collection has an index on the foreign field. Even then, $lookup at scale is where MongoDB’s document model starts to feel its limitations. Optimization tip: $match and $sort stages at the beginning of the pipeline can use indexes. Once a $group or $project reshapes the documents, subsequent stages cannot use indexes. Structure your pipeline to filter as much as possible before transforming.

2.5 Sharding

Sharding distributes data across multiple servers (shards). The shard key is the most critical MongoDB decision you will make — it determines how data is distributed and cannot be easily changed after the fact. Shard key selection criteria:
  1. High cardinality: A boolean field (is_active) is a terrible shard key — all data lands on two chunks. A user ID or order ID has millions of distinct values.
  2. Even distribution: Monotonically increasing keys (timestamps, ObjectId) cause all new writes to hit the same shard (the one owning the highest range). This creates a “hot shard” bottleneck.
  3. Query isolation: The shard key should appear in most queries. If your queries filter by user_id, shard on user_id so queries hit a single shard. If you shard on order_id but query by user_id, every query must scatter to all shards and gather results — a scatter-gather query that does not scale.
Jumbo chunks: When a chunk grows beyond the maximum chunk size (default 128 MB) but cannot be split because all documents in the chunk have the same shard key value, it becomes a “jumbo chunk.” Jumbo chunks cannot be migrated between shards, causing data imbalance. This happens with low-cardinality shard keys. Prevention: choose high-cardinality shard keys or use compound shard keys.

2.6 Change Streams

Change Streams let you subscribe to real-time changes on a collection, database, or entire cluster. Under the hood, they tail MongoDB’s oplog (operation log — similar to PostgreSQL’s WAL). Use cases: Real-time event processing (new order -> trigger fulfillment), cache invalidation (document updated -> invalidate Redis cache), sync to search index (change in MongoDB -> update Elasticsearch), audit logging. Key limitation: Change Streams require a replica set (not available on standalone instances). They are “at-least-once” delivery — your consumer must be idempotent because it may receive the same change event more than once after a resume.

2.7 Common Gotchas

  • Unbounded arrays: Embedding an ever-growing array (all user activity, all comments) into a single document eventually hits MongoDB’s 16 MB document size limit. Even before hitting the limit, updates to large arrays are slow because MongoDB rewrites the entire document. Use the Bucket or Outlier pattern instead.
  • Missing compound indexes: If you query by { user_id, status, created_at }, an index on just user_id forces MongoDB to scan all of that user’s documents to filter by status and created_at. Create a compound index matching your query pattern.
  • WiredTiger cache sizing: WiredTiger (MongoDB’s storage engine) defaults to using 50% of RAM minus 1 GB for its internal cache. On a 64 GB server, that is ~31 GB. If your working set (frequently accessed data + indexes) exceeds this, performance degrades dramatically as WiredTiger constantly reads from and evicts to disk. Monitor wiredTiger.cache.bytes currently in the cache and ensure it does not stay near the maximum.

Section 3: DynamoDB Strategies

DynamoDB is AWS’s fully managed NoSQL database. It is not a general-purpose database — it is purpose-built for applications that need predictable single-digit millisecond performance at any scale. The trade-off: you give up query flexibility (no JOINs, no ad-hoc queries) in exchange for virtually unlimited throughput and rock-solid latency guarantees.
Cross-chapter connection: DynamoDB in the AWS ecosystem. DynamoDB does not exist in isolation — it is the core data layer of AWS serverless architectures. The patterns for Lambda + DynamoDB (event triggers, Streams as CDC, single-table design for API backends) and DynamoDB vs Aurora decision criteria are covered in Cloud Service Patterns — Database Patterns on AWS. Understanding when to use DynamoDB vs Aurora vs RDS is a common interview topic that spans both chapters.

3.1 Access Patterns First

The most important DynamoDB design principle: design your table around your queries, not your entities. In relational databases, you normalize data into entities (users, orders, products) and rely on JOINs to answer any query. In DynamoDB, there are no JOINs. Every access pattern must be answerable by a single query against a single table (or index). Before writing a single line of schema, list every access pattern your application needs:
  1. Get user by ID
  2. Get all orders for a user, sorted by date
  3. Get all comments on a post
  4. Get a user’s most recent 10 posts
  5. Get all users in an organization
Then design your partition key (PK), sort key (SK), and Global Secondary Indexes (GSIs) so that each access pattern maps to a single GetItem, Query, or BatchGetItem operation.

3.2 Single-Table Design

Single-table design stores multiple entity types in one DynamoDB table. This is counterintuitive for engineers from a relational background, but it is the canonical DynamoDB pattern. How it works (adjacency list pattern):
PKSKData
USER#123PROFILE{ name: "Jane", email: "jane@co.com" }
USER#123ORDER#2024-001{ total: 59.99, status: "shipped" }
USER#123ORDER#2024-002{ total: 129.00, status: "pending" }
ORDER#2024-001ITEM#A{ product: "Widget", qty: 2 }
ORDER#2024-001ITEM#B{ product: "Gadget", qty: 1 }
ORG#acmeMEMBER#123{ role: "admin", joined: "2024-01-15" }
ORG#acmeMEMBER#456{ role: "viewer", joined: "2024-03-20" }
Access patterns served:
  • Get user profile: GetItem(PK=USER#123, SK=PROFILE)
  • Get all orders for user: Query(PK=USER#123, SK begins_with ORDER#)
  • Get all items in an order: Query(PK=ORDER#2024-001, SK begins_with ITEM#)
  • Get all members of an org: Query(PK=ORG#acme, SK begins_with MEMBER#)
One table. No JOINs. Every query hits a single partition.
When NOT to use single-table design: When your access patterns are not well-defined or change frequently (early-stage startups exploring product-market fit). When your team is not DynamoDB-experienced (the learning curve is steep and mistakes are costly). When ad-hoc queries are important (use DynamoDB for the transactional workload and export to Athena/Redshift for analytics). Single-table design optimizes for read performance and cost at the expense of developer ergonomics and flexibility.

3.3 Partition Key Design

DynamoDB distributes data across partitions based on a hash of the partition key. If many items share the same partition key, that partition becomes “hot” — receiving a disproportionate share of read/write traffic and potentially getting throttled. Avoiding hot partitions:
  • High-cardinality keys: Use user IDs, order IDs, or UUIDs — not statuses, dates, or categories.
  • Write sharding: If you must use a low-cardinality key (e.g., tracking page views per URL), append a random suffix: URL#homepage#3 (where 3 is random between 0-9). This distributes writes across 10 partitions. Reads require querying all 10 suffixes and aggregating — a trade-off that is worth it for high-write scenarios.
  • Composite keys: Combine fields to increase cardinality. TENANT#acme#USER#123 is better than TENANT#acme alone.
The hot partition problem is real and expensive. During a flash sale, if all orders have PK=PRODUCT#popular-item, that single partition receives all write traffic. DynamoDB’s adaptive capacity can help by isolating frequently accessed items, but it is not instantaneous and will not save a fundamentally bad key design. A hot partition manifests as ProvisionedThroughputExceededException errors even when your overall table capacity is far from exhausted.

3.4 Global Secondary Indexes (GSI)

A GSI is a projection of your table with a different partition key and sort key. It lets you query data by attributes other than the base table’s keys. Example: Your base table has PK=USER#123, SK=ORDER#2024-001. You need to query “all orders with status=pending.” Create a GSI with PK=STATUS, SK=created_at. Now Query(PK=pending, SK between '2024-01-01' and '2024-12-31') returns all pending orders sorted by date. Key characteristics:
  • GSIs are eventually consistent — there is no strongly consistent read option for GSIs. If you write an item and immediately query the GSI, you might not see it yet (typically milliseconds, but during high load it can be longer).
  • GSIs have their own provisioned throughput. If the GSI is throttled, writes to the base table are also throttled (backpressure).
  • You can project a subset of attributes to the GSI (reducing cost and improving performance) or project all attributes.
  • Maximum 20 GSIs per table.

3.5 Local Secondary Indexes (LSI)

An LSI uses the same partition key as the base table but a different sort key. It must be created at table creation time (cannot be added later). LSIs support strongly consistent reads. When to use LSI: When you need an alternative sort order within the same partition AND you need strongly consistent reads on that sort order. Example: base table sorts orders by order_id, but you also need orders sorted by created_at within the same user partition. Limitations: LSIs share the partition’s 10 GB limit. If a single partition key’s items (across the base table and all LSIs) exceed 10 GB, writes fail. This is the main reason many DynamoDB practitioners avoid LSIs entirely and use GSIs instead.

3.6 DynamoDB Streams

DynamoDB Streams captures a time-ordered sequence of item-level modifications (inserts, updates, deletes) to a DynamoDB table. Think of it as DynamoDB’s version of a change data capture (CDC) log. Use cases:
  • Event-driven architectures: New order inserted -> Stream triggers Lambda -> Lambda sends confirmation email.
  • Cross-region replication: DynamoDB Global Tables use Streams under the hood.
  • Materialized views: Stream changes to build denormalized views in another table or in Elasticsearch for full-text search.
  • Audit trail: Stream all changes to S3 for compliance and forensics.
Streams guarantee exactly-once delivery to the stream (each change appears exactly once in the stream), but Lambda consumers process with at-least-once semantics — your Lambda handler must be idempotent.

3.7 DAX (DynamoDB Accelerator)

DAX is a fully managed, in-memory cache for DynamoDB. It sits between your application and DynamoDB, caching frequently read items and query results. When DAX makes sense: Read-heavy workloads where the same items are read repeatedly (product catalog, configuration data). When you need microsecond latency (DAX serves from memory in ~200 microseconds vs DynamoDB’s ~5 ms). When DAX does NOT make sense: Write-heavy workloads (DAX caches reads, not writes). When reads are highly varied with little repetition (low cache hit rate). When eventual consistency of the cache is problematic (DAX is eventually consistent by default with a 5-minute TTL).

3.8 Capacity Modes

On-demand: Pay per request. No capacity planning. Instantly handles traffic spikes. Costs roughly 6.5x more per request than provisioned at steady state. Best for: unpredictable workloads, new applications where traffic patterns are unknown, spiky traffic. Provisioned: You specify read/write capacity units. Cheaper at steady state. Risk of throttling if you underestimate. Use with Auto Scaling to adjust capacity based on utilization. Best for: predictable, steady workloads. Burst capacity: DynamoDB reserves unused capacity for up to 5 minutes to handle short spikes. This is not infinite — sustained traffic above provisioned capacity will be throttled even with burst capacity available.

3.9 Cost Optimization

  • Avoid Scan operations. A Scan reads every item in the table and charges for every read capacity unit consumed. Always use Query or GetItem with a key condition.
  • Use sparse indexes. A GSI only contains items that have the GSI’s key attributes. If only 5% of items have a premium_tier attribute, a GSI on premium_tier is 95% smaller than the base table — dramatically cheaper to read.
  • Compress large items. DynamoDB charges per KB of read/write capacity consumed. Compressing a 10 KB JSON attribute to 2 KB saves 80% on throughput costs for that attribute.
  • Use ProjectionExpression. Only retrieve the attributes you need, not the entire item. Reduces data transfer and read capacity consumption.
  • Time-to-Live (TTL). Automatically delete expired items at no cost (DynamoDB handles deletion in the background without consuming write capacity).

Section 4: Redis Architecture

Redis is an in-memory data structure server. It is often described as “a cache,” but that undersells it dramatically. Redis is a Swiss Army knife for problems that need sub-millisecond latency: caching, session storage, rate limiting, leaderboards, pub/sub messaging, distributed locks, and real-time analytics.
Cross-chapter connection: Redis as cache vs Redis as data store. This chapter covers Redis as a database — its data structures, persistence, clustering, and architecture. But in many production systems, Redis serves double duty as both a primary data store (session storage, leaderboards, rate limiting) and an application cache (caching PostgreSQL query results, API responses). The caching patterns — cache-aside, write-through, write-behind, cache stampede prevention — are covered in depth in Caching and Observability. The key design decision: when Redis is your cache, losing data means a slower response (cache miss falls through to the database). When Redis is your data store, losing data means losing data. This distinction drives every persistence and eviction decision below. For managed Redis in production, see ElastiCache and MemoryDB patterns in Cloud Service Patterns.

4.1 Data Structures Beyond Key-Value

Redis’s power comes from its rich data structure support. Each data structure has O(1) or O(log N) operations that would be expensive to build on top of a plain key-value store:
  • Strings: Basic key-value. But also atomic counters (INCR), bit operations, and values up to 512 MB.
  • Hashes: A hash map within a single key. Perfect for objects: HSET user:123 name "Jane" email "jane@co.com". Get individual fields without deserializing the whole object.
  • Lists: Doubly-linked lists. LPUSH / RPOP gives you a queue. LRANGE gives you the last N items. Used for activity feeds, recent items.
  • Sets: Unordered unique collections. SADD, SINTER (intersection), SUNION. Used for tags, mutual friends, unique visitors.
  • Sorted Sets: Sets with a score for each member. ZADD leaderboard 1500 "player1". ZRANGE leaderboard 0 9 REV gives the top 10. O(log N) insert and rank query. The data structure behind leaderboards, priority queues, and rate limiters.
  • HyperLogLog: Probabilistic data structure for cardinality estimation. Count unique visitors with 0.81% standard error using only 12 KB of memory, regardless of the actual count. PFADD visitors "user1" "user2". PFCOUNT visitors returns the approximate unique count.
  • Streams: An append-only log data structure (added in Redis 5.0). Think Kafka-lite. Supports consumer groups, acknowledgment, and message replay. Used for event sourcing, activity streams, and lightweight message queues.
  • Bitmaps: Bit arrays. Set and get individual bits. SETBIT active:2024-04-09 user_id 1. Count bits with BITCOUNT. Used for tracking daily active users (1 bit per user, 1 million users = 125 KB per day).

4.2 Persistence

Redis is in-memory, but it can persist data to disk via two mechanisms: RDB (Redis Database) snapshots: Point-in-time snapshots of the entire dataset written to a .rdb file. Redis forks the process and the child writes the snapshot while the parent continues serving requests (copy-on-write). Fast recovery (load the entire dataset from one file). Risk: data written between the last snapshot and a crash is lost. Configure with save 900 1 (snapshot if at least 1 key changed in 900 seconds). AOF (Append Only File): Logs every write operation. Three fsync policies:
  • appendfsync always: fsync after every write. Most durable, slowest.
  • appendfsync everysec: fsync every second. Loses at most 1 second of data on crash. The recommended default.
  • appendfsync no: Let the OS decide when to flush. Fastest, least durable.
AOF files grow over time (they contain every write ever). AOF rewriting compacts the file by reading the current dataset and writing the shortest sequence of commands to reconstruct it. RDB + AOF (recommended for production): Enable both. Redis uses AOF for recovery (more durable) and RDB for faster restarts and backups. Since Redis 7.0, the AOF uses a multi-part format with a base RDB snapshot plus incremental AOF, combining the best of both approaches.

4.3 Eviction Policies

When Redis reaches maxmemory, it must decide what to evict:
PolicyBehaviorUse Case
noevictionReturn errors on writes when memory is fullWhen data loss is unacceptable (session store, job queue)
allkeys-lruEvict least recently used keys across all keysGeneral-purpose caching
volatile-lruEvict least recently used keys among keys with a TTL setWhen some keys must persist (config) and others can be evicted (cache)
allkeys-lfuEvict least frequently used keysWhen access frequency matters more than recency (popular items should stay)
volatile-lfuEvict least frequently used keys among those with a TTLMixed workload with frequency-based eviction
allkeys-randomEvict random keysWhen all keys have equal value
volatile-ttlEvict keys with the shortest remaining TTLWhen near-expiry keys should go first
For most caching workloads, allkeys-lfu is the best modern choice (available since Redis 4.0). It keeps frequently accessed items in memory even if they have not been accessed in the last few seconds (which LRU would evict). A product catalog page accessed 10,000 times a day but not in the last 5 minutes should not be evicted in favor of a page accessed once 2 minutes ago. LFU handles this correctly; LRU does not.

4.4 Redis Cluster

Redis Cluster provides horizontal scaling by distributing data across multiple nodes. The keyspace is divided into 16,384 hash slots. Each node is responsible for a subset of slots. The hash slot for a key is CRC16(key) mod 16384. How resharding works: To add a node, you migrate hash slots from existing nodes to the new node. Redis handles this online — keys in migrating slots are redirected with ASK and MOVED responses that clients follow automatically. Multi-key operations: Operations that span multiple keys (MGET, SUNION, transactions) work only if all keys map to the same hash slot. Use hash tags to force related keys to the same slot: {user:123}:profile and {user:123}:orders both hash on user:123 and land on the same slot. Client-side vs proxy-based:
  • Client-side (smart clients): The client library (like redis-py-cluster or ioredis) knows the slot-to-node mapping and routes requests directly to the correct node. Lower latency (no proxy hop), but the client must handle cluster topology changes.
  • Proxy-based (Twemproxy, Redis Cluster Proxy): A proxy sits between clients and the cluster, handling routing transparently. Simpler client code, but adds a network hop and a single point of failure.

4.5 Redis Sentinel

Redis Sentinel provides high availability for non-clustered Redis (single master with replicas). Sentinel monitors the master, detects failures, and automatically promotes a replica to master. How failover works:
  1. Multiple Sentinel instances (minimum 3 for quorum) monitor the master.
  2. When a Sentinel detects the master is unreachable, it asks other Sentinels to confirm (to avoid false positives from network issues).
  3. If a quorum of Sentinels agree the master is down, one Sentinel is elected to perform the failover.
  4. The elected Sentinel promotes the best replica (most up-to-date data) to master and reconfigures the other replicas to follow the new master.
  5. Clients using Sentinel-aware drivers are notified of the new master address.
Sentinel vs Cluster: Use Sentinel when your dataset fits on a single node and you just need automatic failover. Use Cluster when you need to shard data across multiple nodes for capacity beyond what one node can handle.

4.6 Pub/Sub vs Streams

Pub/Sub: Fire-and-forget messaging. Publishers send messages to channels; subscribers receive messages in real-time. Messages are NOT persisted — if a subscriber is disconnected, it misses messages. No consumer groups. No acknowledgment. Best for: real-time notifications, chat, live updates where message loss is acceptable. Streams: Persistent, append-only log with consumer groups. Messages are stored and can be replayed. Consumer groups allow multiple consumers to cooperatively process messages with acknowledgment and pending message tracking. Best for: reliable event processing, task queues, audit logs, anything where message loss is not acceptable. Decision rule: If you would be upset that a message was lost, use Streams. If the message is only valuable in real-time and losing it is fine (like a live scoreboard update), Pub/Sub is simpler and faster.

4.7 Lua Scripting

Redis executes Lua scripts atomically — no other command runs between the script’s start and finish. This eliminates race conditions in multi-step operations. Example: Atomic rate limiter
-- KEYS[1] = rate limit key, ARGV[1] = max requests, ARGV[2] = window in seconds
local current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
  return 0  -- rate limited
end
return 1  -- allowed
Without Lua, the INCR and EXPIRE would be separate commands with a gap between them — if the process crashes after INCR but before EXPIRE, the key never expires and the rate limit becomes permanent. Lua makes this atomic.

4.8 Common Patterns

Distributed locks (Redlock): Acquire locks across multiple independent Redis instances. A lock is considered acquired only if the majority of instances grant it within a time limit. More robust than a single-instance lock but controversial — Martin Kleppmann published a detailed critique arguing that Redlock does not provide the guarantees it claims in the presence of clock drift and process pauses. Use it for efficiency (avoiding duplicate work) but not for correctness (where a race condition would corrupt data). Rate limiting: Use Sorted Sets with timestamps as scores. ZADD rate:user:123 <timestamp> <request_id>. ZREMRANGEBYSCORE to remove entries outside the window. ZCARD to count entries in the window. This gives a precise sliding window rate limiter. Leaderboards: Sorted Sets are purpose-built for this. ZADD leaderboard <score> <player>. ZREVRANK leaderboard <player> gives rank in O(log N). ZREVRANGE leaderboard 0 9 WITHSCORES gives the top 10. Used by gaming companies, Stack Overflow, and fitness apps. Session storage: Store session data as Redis Hashes with a TTL. HSET session:<id> user_id 123 role admin. EXPIRE session:<id> 3600. All app instances share the same session store. Stateless application servers.

4.9 Memory Optimization

  • Key naming conventions: Use user:123:profile not the_profile_data_for_user_id_one_hundred_twenty_three. Short, consistent key names save significant memory at scale. At 100 million keys, saving 20 bytes per key name saves 2 GB.
  • Ziplist encoding: Redis automatically uses a compact encoding (ziplist) for small Hashes, Lists, and Sorted Sets. A Hash with fewer than hash-max-ziplist-entries (default 128) entries and values smaller than hash-max-ziplist-value (default 64 bytes) is stored as a ziplist instead of a hash table — using 5-10x less memory. Keep small objects small to benefit from this.
  • TTL strategies: Set TTLs aggressively. Every key without a TTL is a potential memory leak. Use SCAN periodically to find keys without TTLs that should have them.
  • Compression: For large values (JSON blobs > 1 KB), compress with gzip or LZ4 at the application level before storing in Redis. Redis does not compress values internally.

Section 5: When to Choose What

Decision Matrix

CriteriaPostgreSQLMongoDBDynamoDBRedis
Data modelRelational (tables, rows, JOINs)Documents (JSON, nested, flexible)Key-value / wide columnIn-memory data structures
Query flexibilityHighest (SQL, ad-hoc queries, JOINs)High (flexible queries, aggregation pipeline)Lowest (must design for access patterns)Low (key-based access, Lua for complex ops)
Scale modelVertical (read replicas, partitioning, Citus for horizontal)Horizontal (native sharding)Horizontal (fully managed, unlimited)Vertical (Cluster for horizontal)
ConsistencyStrong (ACID, serializable isolation)Tunable (w:majority for strong writes)Tunable (strong reads on base table)Eventual (replication), strong (single node)
Latency1-100 ms (query dependent)1-50 ms (query dependent)<10 ms (predictable at any scale)<1 ms (in-memory)
Operational burdenMedium (VACUUM, connection mgmt, replication)Medium (sharding, WiredTiger tuning)Low (fully managed by AWS)Medium (memory management, persistence)
Best forComplex queries, transactions, data integrityFlexible schema, hierarchical data, rapid iterationPredictable latency at scale, serverless backendsCaching, real-time, sessions, leaderboards
Worst forUnlimited horizontal scale without extensionsComplex relationships, heavy JOINsAd-hoc queries, complex transactionsPrimary data store (data > memory), complex queries

Concrete Scenarios

ScenarioChooseWhy
E-commerce platform with orders, inventory, paymentsPostgreSQLTransactions, data integrity, complex queries across entities
Content management system with varied content typesMongoDBFlexible schema (articles, videos, podcasts all have different fields)
Mobile app backend with millions of users, simple reads/writesDynamoDBPredictable latency, scales with users, serverless-friendly
Real-time leaderboard, session store, rate limiterRedisSub-millisecond latency, purpose-built data structures
IoT sensor data (millions of events/second, time-series queries)PostgreSQL + TimescaleDB or DynamoDBTimescaleDB for SQL analytics on time-series; DynamoDB for pure ingestion throughput
Social network (users, posts, comments, likes, follows)PostgreSQL + Redis (cache)Relational model for relationships; Redis for feeds, counts, sessions
AI/ML embeddings search + metadataPostgreSQL + pgvectorVector search + relational queries in one database

Polyglot Persistence

Most production systems use multiple databases, each for what it does best:
  • PostgreSQL for the core transactional data (users, orders, billing).
  • Redis for caching, sessions, rate limiting, and real-time features.
  • Elasticsearch (or MongoDB with Atlas Search) for full-text search.
  • DynamoDB for high-throughput, low-latency specific access patterns (API keys, feature flags).
  • S3 for blob storage (images, files, backups).
The cost of polyglot persistence is operational complexity — each database has its own failure modes, backup procedures, monitoring, and expertise requirements. The benefit is that each workload runs on the engine optimized for it. The rule of thumb: start with one database (PostgreSQL is the safest default), and add specialized databases only when a measured limitation of the primary database cannot be solved by tuning or restructuring.
Cross-chapter connection: Polyglot persistence connects to the CQRS pattern discussed in Design Patterns and Architecture. In a CQRS architecture, the write side might use PostgreSQL for transactional integrity, while the read side uses Elasticsearch for flexible search or DynamoDB for low-latency reads. The event bus between them keeps the read models in sync with the source of truth.

Migration Strategies

Migrating between databases is one of the highest-risk operations in production engineering. The general approach:
  1. Dual-write phase: Write to both old and new databases. Read from old. Verify data consistency between them.
  2. Shadow read phase: Read from both databases. Return old database results to users. Compare results in the background and log discrepancies.
  3. Cutover: Switch reads to the new database. Keep the old database running as a fallback.
  4. Cleanup: After a soak period (weeks, not days), decommission the old database.
Never do a “big bang” migration. The company that turns off MySQL on Friday and turns on MongoDB on Monday is the company that has a very bad Monday. Incremental migration with dual-writes and shadow reads is slower but dramatically safer. Stripe famously took over a year to migrate a core table between databases, with zero downtime and zero data loss.

Section 6: Database Scaling Decision Tree

When your database is struggling, engineers often jump straight to the most dramatic solution (sharding, new database). In practice, scaling follows a predictable escalation path — each step is cheaper and less risky than the next. Exhaust cheaper options before reaching for expensive ones.

The Escalation Ladder

Step 1: Optimize Queries and Add Indexes (Cost: hours. Risk: near-zero.) Before adding any infrastructure, check whether the database is slow because of bad queries, not insufficient hardware. Run pg_stat_statements (PostgreSQL), .explain('executionStats') (MongoDB), or check CloudWatch metrics (DynamoDB) to find your slowest operations.
  • Add missing indexes for common query patterns.
  • Rewrite N+1 query patterns into batch queries or JOINs.
  • Add ANALYZE / update statistics so the query planner makes better choices.
  • For MongoDB: ensure compound indexes match your query predicates and sort order.
  • For DynamoDB: verify GSIs cover your access patterns (a missing GSI means a Scan instead of a Query — orders of magnitude slower).
When to stop here: Query optimization alone can often deliver 10-100x improvements. If your p99 drops from 2 seconds to 20 ms after adding one compound index, you do not need read replicas.
Cross-chapter connection: The query optimization techniques here are the database-specific implementation of the general performance methodology in Performance and Scalability. Start with the numbers (EXPLAIN ANALYZE, pg_stat_statements), identify the bottleneck, fix the cheapest thing first.
Step 2: Add Read Replicas (Cost: one more instance + connection routing. Risk: low.) If your database is read-heavy (most web applications are 80-95% reads) and query optimization is not enough, distribute reads across replicas.
  • PostgreSQL: streaming replication to one or more replicas. Route read queries to replicas, writes to primary. Watch for replication lag on time-sensitive reads.
  • MongoDB: read preference secondaryPreferred or secondary. Same caveat about eventual consistency.
  • DynamoDB: eventually consistent reads (default) are already distributed. Use DAX for caching hot reads.
When to stop here: If your bottleneck is read throughput and you can tolerate milliseconds of replication lag for read queries. Most web applications can. When this is NOT enough: If writes are the bottleneck. Replicas help reads, not writes. All writes still go to the primary.
Cross-chapter connection: Managed read replicas on AWS are covered in Cloud Service Patterns. Aurora read replicas share the same storage layer (no replication lag for storage, only for buffer cache), while RDS read replicas use streaming replication with measurable lag.
Step 3: Add a Caching Layer (Cost: one Redis/Memcached instance + cache invalidation logic. Risk: moderate.) If the same data is read repeatedly and can tolerate being slightly stale, cache it in Redis or Memcached.
  • Cache database query results with a TTL matching your staleness tolerance.
  • Use cache-aside pattern: check cache first, fall through to database on miss, populate cache on read.
  • For write-heavy data that is read often: consider write-through caching.
When to stop here: When your read traffic follows a power-law distribution (a small fraction of data accounts for most reads). Caching the top 1% of data can eliminate 80% of database reads. When this is NOT enough: When the working set is too large to cache (billions of unique keys), when data changes too frequently for caching to help (TTL must be so short the hit rate is negligible), or when writes — not reads — are the bottleneck.
Cross-chapter connection: Cache invalidation strategies (cache-aside, write-through, write-behind), stampede prevention, and TTL design are covered in depth in Caching and Observability. The Redis data structures and eviction policies that underpin your caching layer are covered in Section 4 of this chapter.
Step 4: Partition (Table Partitioning) (Cost: schema change + query routing awareness. Risk: moderate.) When individual tables grow too large for efficient querying and vacuuming but the overall data fits on one server, partition the table.
  • PostgreSQL: range partitioning by date (most common), list partitioning by tenant/region, hash partitioning for even distribution.
  • MongoDB: already uses sharding for this (see Step 5). Internal WiredTiger data partitioning is automatic.
  • DynamoDB: already partitioned internally by partition key. The analog here is redesigning your partition key for better distribution.
When to stop here: When your problem is “this 500M-row table is slow to query and vacuum” rather than “this server cannot handle the total throughput.” When this is NOT enough: When a single server’s CPU, memory, or I/O is saturated even with partitioned tables. Now you need to distribute across machines. Step 5: Shard (Distribute Across Machines) (Cost: significant engineering effort + operational complexity. Risk: high.) Sharding splits data across multiple independent database instances. Each shard holds a subset of the data.
  • PostgreSQL: application-level sharding (route queries based on a shard key in your app code), or use Citus for distributed PostgreSQL.
  • MongoDB: native sharding with a shard key. The shard key choice is irreversible and determines everything — choose carefully (see Section 2.5).
  • DynamoDB: sharding is automatic and invisible (DynamoDB handles partition management). Your role is to choose a partition key that distributes evenly.
When to stop here: When horizontal scaling of your current database can handle your throughput and storage needs for the foreseeable future. When this is NOT enough: When your data model or access patterns fundamentally do not fit the database engine you are using. Step 6: Choose a Different Database (Cost: massive engineering effort. Risk: very high.) This is the last resort, not the first instinct. You should only be here if:
  • Your access patterns are fundamentally mismatched with your database’s strengths (e.g., graph traversals in a relational database, ad-hoc analytics in DynamoDB).
  • You have exhausted all tuning, replication, caching, and partitioning options.
  • The cost of maintaining the workarounds exceeds the cost of migration.
Examples of valid migrations:
  • Relational database struggling with time-series write throughput -> TimescaleDB or InfluxDB.
  • MongoDB struggling with complex cross-collection JOINs -> PostgreSQL.
  • PostgreSQL struggling with sub-millisecond key-value lookups at massive scale -> DynamoDB or Redis.
  • Any SQL database struggling with graph traversals (friends-of-friends queries) -> Neo4j or Neptune.
The decision tree is sequential for a reason. Engineers who jump to Step 5 or 6 without trying Steps 1-4 waste months of engineering effort on problems that a single compound index could have solved. Every step up the ladder is approximately 10x more expensive and 10x riskier than the previous one. The most senior move is often doing less: “We do not need to shard — we need to add one index and adjust autovacuum.”

Quick Reference: Scaling Decision Tree

Your database is slow. What is the bottleneck?
|
+-- Read latency (individual queries are slow)
|   |
|   +-- Missing index? --> Add index (Step 1)
|   +-- Stale statistics? --> ANALYZE (Step 1)
|   +-- Bad query plan? --> Rewrite query (Step 1)
|   +-- Table too large for efficient scans? --> Partition (Step 4)
|
+-- Read throughput (too many concurrent reads)
|   |
|   +-- Same data read repeatedly? --> Add cache (Step 3)
|   +-- Diverse reads, high volume? --> Add read replicas (Step 2)
|   +-- Single hot key/item? --> Cache that specific key (Step 3)
|
+-- Write throughput (too many concurrent writes)
|   |
|   +-- Single hot table/partition? --> Partition or reshard (Step 4/5)
|   +-- Overall write volume exceeds one server? --> Shard (Step 5)
|   +-- Write pattern mismatch? --> Different DB (Step 6)
|
+-- Storage (data exceeds single server capacity)
|   |
|   +-- Cold data can be archived? --> Partition + drop old partitions (Step 4)
|   +-- All data is hot? --> Shard (Step 5)
|
+-- Connection exhaustion
    |
    +-- Too many clients? --> Connection pooler (PgBouncer, RDS Proxy)
    +-- Serverless/Lambda? --> RDS Proxy or switch to DynamoDB (Step 6)

Section 7: Connection Pool Sizing Formula

Connection pool sizing is one of those problems that seems like it should be simple but causes production outages when done wrong. Too small and requests queue up waiting for a connection. Too large and you overwhelm the database with concurrent queries that compete for CPU, memory, and I/O.

Little’s Law Applied to Database Connections

The mathematical foundation for connection pool sizing is Little’s Law, a fundamental result from queueing theory:
L = lambda * W
Where:
  • L = average number of items in the system (connections in use)
  • lambda = arrival rate (queries per second)
  • W = average time each item spends in the system (average query duration)
Applied to database connection pools:
pool_size = throughput * average_query_latency
Or more precisely:
minimum_pool_size = (queries_per_second) * (average_query_duration_in_seconds)

Concrete Examples

Example 1: A typical web application.
  • Your app handles 500 requests/second.
  • Each request makes 1 database query averaging 10 ms (0.01 seconds).
  • Throughput (queries/second) = 500.
  • pool_size = 500 * 0.01 = 5 connections.
Five connections. Not 50, not 100 — five. This surprises most engineers, but the math is clear: if each query only holds a connection for 10 ms, one connection can handle 100 queries/second. Five connections handle 500. Example 2: A heavier workload with slow queries.
  • 2,000 requests/second, each making 2 database queries.
  • Average query latency: 25 ms (0.025 seconds).
  • Throughput = 4,000 queries/second.
  • pool_size = 4,000 * 0.025 = 100 connections.
This is a realistic pool size for a high-traffic application. Note how query latency is the dominant factor — if you could cut average latency from 25 ms to 5 ms, the pool size drops to 20. Example 3: The p99 problem.
  • Same 2,000 requests/second, 2 queries each.
  • Average query latency: 5 ms, but p99 latency: 200 ms.
  • Using average: pool_size = 4,000 * 0.005 = 20 connections.
  • But during p99 spikes: burst_pool_size = 4,000 * 0.200 = 800 connections.
If 1% of queries take 200 ms, they hold connections 40x longer than average. This is why you need headroom above the theoretical minimum.

The Practical Formula

In practice, add a multiplier for safety margin, burstiness, and overhead:
recommended_pool_size = (queries_per_second * average_latency_seconds) * safety_multiplier

where safety_multiplier = 1.5 to 3.0 depending on:
  - Traffic burstiness (steady = 1.5, bursty = 2.5-3.0)
  - Query latency variance (low variance = 1.5, high p99/avg ratio = 2.5+)
  - Tolerance for queuing (zero tolerance = higher multiplier)
For Example 1: 5 * 2.0 = 10 connections. A pool of 10 with a max of 15 handles normal traffic with room for bursts. For Example 2: 100 * 2.0 = 200 connections. But check: can PostgreSQL handle 200 concurrent connections? At ~10 MB per connection, that is 2 GB of overhead. If the server has 16 GB of RAM and you need 10 GB for shared_buffers and working memory, 200 connections might be too many. This is where PgBouncer’s transaction pooling becomes essential — 200 application connections multiplex onto 50-100 actual PostgreSQL connections.

The PostgreSQL-Specific Constraint

PostgreSQL has a hard constraint that most connection pool formulas ignore: more concurrent connections does not mean more throughput. Beyond a certain point (roughly 2-4x the number of CPU cores), adding connections decreases throughput due to context switching, lock contention, and buffer pool pressure. The empirical guideline from PostgreSQL experts:
max_useful_connections = (2 * cpu_cores) + number_of_disks
For a typical cloud instance with 8 vCPUs and 1 SSD: 2 * 8 + 1 = 17. This is the maximum number of active connections that can do useful work simultaneously. PgBouncer multiplexes hundreds of application connections onto this small pool.
The most common production mistake is setting max_connections = 500 and letting every microservice grab as many connections as it wants. The database drowns in context switches and lock contention. A better approach: set PostgreSQL max_connections to a moderate number (100-200), run PgBouncer in transaction mode with a pool of 20-50 server connections, and let PgBouncer handle the queuing. Application-level pools should be sized per-instance using Little’s Law, and the total across all instances should not exceed PgBouncer’s server-side pool.
Cross-chapter connection: Little’s Law applies far beyond database connections. It governs thread pool sizing, HTTP client connection pools, message consumer concurrency, and Kubernetes pod scaling. The general form — concurrency = throughput x latency — is one of the most useful formulas in systems engineering. See Performance and Scalability for how this principle applies to request processing pipelines, and Reliability Principles for how connection pool exhaustion cascades into full-system failures.

Section 8: Common Production Issues by Database

Every database has its own set of production failure modes. These are not theoretical — they are the issues that page engineers at 3 AM. Knowing them in advance separates engineers who operate databases from engineers who are surprised by databases.

PostgreSQL: Long-Running Transactions Blocking VACUUM

The problem: A developer opens a database connection, runs BEGIN, executes a query, then… does something else. Maybe they are debugging, maybe the application has a bug that never commits or rolls back. That open transaction holds a snapshot — and as long as that snapshot exists, VACUUM cannot clean up any dead tuples created after the transaction started. How it escalates:
  1. An idle transaction holds a snapshot for 6 hours.
  2. During those 6 hours, autovacuum runs but cannot remove any dead tuples from the last 6 hours of updates.
  3. Dead tuples accumulate. Table bloat grows. Sequential scans slow down because they must skip millions of dead tuples.
  4. Index bloat grows because indexes still point to dead tuples.
  5. In extreme cases: transaction ID wraparound approaches. PostgreSQL enters “safety shutdown” mode and refuses all writes until a manual VACUUM FREEZE completes.
Diagnosis:
-- Find long-running transactions
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 10;

-- Check dead tuple accumulation
SELECT schemaname, relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
Prevention:
  • Set idle_in_transaction_session_timeout = '5min' to automatically kill idle transactions.
  • Configure statement_timeout as a safety net for runaway queries.
  • Monitor pg_stat_activity for transactions older than your acceptable threshold.
  • In PgBouncer: use transaction pooling mode, which returns connections to the pool after each transaction (making long-lived idle transactions impossible from pooled clients).
The senior insight: This issue is especially insidious because the cause (an idle transaction) and the symptom (slow queries, table bloat) appear unrelated. A junior engineer looks at slow queries and thinks “we need more indexes.” A senior engineer checks pg_stat_activity for idle transactions first.

MongoDB: Unbounded Array Growth

The problem: A developer embeds an ever-growing array inside a document. Activity logs per user, all comments on a popular post, IoT readings per device — any pattern where the array grows without bound. How it escalates:
  1. Documents start small (a few KB) and queries are fast.
  2. Over weeks or months, arrays grow to thousands of elements. Documents bloat to several MB.
  3. Every update rewrites the entire document. Appending one element to a 5 MB document means writing 5 MB. WiredTiger’s document-level concurrency control means all updates to that document are serialized.
  4. If a document hits the 16 MB BSON limit, all further updates fail. Your application starts throwing errors.
  5. Even before hitting the limit, large documents cause WiredTiger cache pressure. The cache stores uncompressed documents — a 5 MB document on disk might be 20 MB in the WiredTiger cache.
Diagnosis:
// Find oversized documents in a collection
db.users.aggregate([
  { $project: { docSize: { $bsonSize: "$$ROOT" }, _id: 1 } },
  { $sort: { docSize: -1 } },
  { $limit: 10 }
]);

// Find documents with large arrays
db.users.aggregate([
  { $project: { arraySize: { $size: "$activityLog" }, _id: 1 } },
  { $sort: { arraySize: -1 } },
  { $limit: 10 }
]);
Prevention:
  • Use the Bucket Pattern: group entries into time-based buckets (one document per day/hour) instead of one growing array per entity.
  • Use the Subset Pattern: keep only the last N items embedded; archive older items to a separate collection.
  • Use the Outlier Pattern: flag documents that exceed a threshold and store overflow data externally.
  • Set a hard application-level limit: { $push: { activityLog: { $each: [newEntry], $slice: -100 } } } keeps only the last 100 entries.

DynamoDB: Hot Partitions

The problem: All traffic concentrates on a small number of DynamoDB partitions because the partition key has low cardinality or a skewed distribution. How it escalates:
  1. DynamoDB distributes provisioned throughput evenly across partitions. With 10,000 WCU and 10 partitions, each partition gets ~1,000 WCU.
  2. A flash sale drives all writes to PK=PRODUCT#hot-item. That single partition receives 5,000 WCU of traffic.
  3. The partition throttles. ProvisionedThroughputExceededException errors spike. The application retries with exponential backoff, increasing latency.
  4. Adaptive capacity (DynamoDB’s automatic rebalancing) kicks in, but it takes minutes — an eternity during a flash sale.
  5. If you are on provisioned capacity, increasing the table’s WCU does NOT help because the new capacity is still distributed evenly. You are paying for 50,000 WCU but only one partition is hot.
Diagnosis:
  • Enable DynamoDB Contributor Insights — it shows the most accessed partition keys over time.
  • CloudWatch: compare ConsumedWriteCapacityUnits (table-level) with ThrottledRequests. If throttling occurs while consumed capacity is well below provisioned, you have a hot partition.
  • Check your partition key’s cardinality distribution. A partition key with 90% of items under one value is a time bomb.
Prevention:
  • Design partition keys for high cardinality from the start (user IDs, order IDs, UUIDs — not statuses or dates).
  • Use write sharding: append a random suffix (PK=PRODUCT#hot-item#7) to distribute across N partitions. Reads require querying all N suffixes and merging.
  • For genuinely hot items (a single viral product page), put DAX in front for reads and use SQS to buffer and smooth writes.
  • Switch to on-demand capacity mode for unpredictable or spiky workloads — it handles partition-level throttling more gracefully.
Cross-chapter connection: The hot partition problem is a specific instance of the general load distribution problem discussed in Reliability Principles — Load Shedding and Backpressure. The same principle applies: uneven load distribution causes failures at a fraction of theoretical capacity. Whether it is database partitions, server instances, or message queue consumers, the fix is always better distribution or dedicated capacity for hot paths.

Redis: Memory Fragmentation

The problem: Redis’s used_memory_rss (actual memory consumed as reported by the OS) is significantly higher than used_memory (memory Redis believes it is using). The ratio of these two values is the fragmentation ratio. A ratio of 1.5 means 50% of Redis’s memory is wasted on fragmentation. How it escalates:
  1. Your application writes and deletes keys of varying sizes over time. Small keys, large keys, keys that grow and shrink.
  2. The memory allocator (jemalloc by default) allocates memory in fixed-size classes. A 50-byte value gets a 64-byte allocation. A 70-byte value gets a 128-byte allocation. The wasted space in each allocation is internal fragmentation.
  3. Over time, keys are deleted and new keys of different sizes are created. The freed memory may not match the sizes needed for new allocations. Memory becomes a Swiss cheese of small free gaps that cannot be coalesced. This is external fragmentation.
  4. Redis reports used_memory = 8 GB but the OS reports used_memory_rss = 14 GB. You are paying for 14 GB of RAM but only 8 GB holds useful data.
  5. If maxmemory is set to 10 GB, Redis thinks it has 2 GB of headroom. But the OS is already at 14 GB. If the server only has 16 GB of RAM, the OOM killer may terminate Redis.
Diagnosis:
> INFO memory
used_memory:8589934592
used_memory_rss:14495514624
mem_fragmentation_ratio:1.69
mem_allocator:jemalloc-5.2.1

# Fragmentation ratio > 1.5 is problematic
# Fragmentation ratio < 1.0 means Redis is using swap (even worse)
Prevention and remediation:
  • Redis 4.0+ active defragmentation: Enable with activedefrag yes. Redis reorganizes memory in the background, moving allocations to reduce fragmentation. Configure active-defrag-threshold-lower 10 (start defragmentation when fragmentation exceeds 10%) and active-defrag-cycle-min 5 (use at least 5% of CPU for defragmentation).
  • Avoid frequent large-to-small value changes. If a key alternates between a 10 KB value and a 100-byte value, the allocator wastes space on the larger allocation class.
  • Prefer fixed-size data patterns. Hashes with consistent field counts, Sorted Sets with consistent element sizes, and Strings with consistent value lengths fragment less.
  • Schedule periodic restarts for Redis instances used as caches (not data stores). A restart loads the RDB snapshot into fresh, defragmented memory. This is a blunt instrument but effective.
  • Monitor mem_fragmentation_ratio in your alerting. Alert at 1.5, page at 2.0.
A fragmentation ratio below 1.0 is worse than a high ratio. A ratio below 1.0 means used_memory_rss is less than used_memory, which means Redis is using swap space. Swap makes Redis orders of magnitude slower because every memory access potentially hits disk. If you see a sub-1.0 ratio, Redis is actively swapping and needs more physical RAM immediately. See OS Fundamentals — Virtual Memory and Swap for why swap devastates in-memory systems.

Quick Reference: Production Issue Checklist

DatabaseIssueKey MetricAlert Threshold
PostgreSQLLong idle transactionspg_stat_activity.xact_startTransactions older than 5 minutes
PostgreSQLDead tuple accumulationpg_stat_user_tables.n_dead_tup> 10% of n_live_tup
PostgreSQLTransaction ID wraparoundage(datfrozenxid)> 500 million (critical at 1 billion)
PostgreSQLConnection exhaustionpg_stat_activity count vs max_connections> 80% of max_connections
MongoDBDocument bloat$bsonSize in aggregationDocuments > 1 MB (critical at 8 MB)
MongoDBWiredTiger cache pressurewiredTiger.cache.bytes currently in cache> 90% of configured cache size
MongoDBReplication lagrs.printSlaveReplicationInfo()> 10 seconds
DynamoDBHot partitionsThrottledRequests while consumed < provisionedAny throttling at < 80% overall capacity
DynamoDBGSI backpressureGSI ThrottledWriteRequestsAny sustained throttling
RedisMemory fragmentationmem_fragmentation_ratio> 1.5 (critical > 2.0)
RedisSwap usagemem_fragmentation_ratio < 1.0Any ratio < 1.0
RedisSlow commandsSLOWLOG GET 10Commands > 10 ms
RedisKey evictionevicted_keysIncreasing when it should not be

Interview Questions

What they are really testing: Do you understand PostgreSQL’s MVCC internals and have operational experience with large tables?Strong answer:First, I would diagnose why VACUUM is slow. Check pg_stat_user_tables for n_dead_tup and last_autovacuum. If dead tuples are in the hundreds of millions, autovacuum fell behind.Immediate actions:
  1. Tune autovacuum for this specific table. Reduce autovacuum_vacuum_scale_factor to 0.01 (trigger at 1% dead tuples instead of 20%) and increase autovacuum_vacuum_cost_limit to let autovacuum work faster.
  2. Increase maintenance_work_mem for the VACUUM session (e.g., SET maintenance_work_mem = '2GB'). VACUUM uses this to store dead tuple IDs — if it cannot fit them all, it makes multiple passes over the table.
  3. Check for long-running transactions. Any open transaction prevents VACUUM from cleaning up tuples created after that transaction started. A forgotten idle transaction from 3 days ago can block all vacuuming. Check pg_stat_activity for old transactions and terminate them.
  4. Run VACUUM manually with VERBOSE to see progress and identify where it is spending time.
Long-term fix: Partition the table. If this is time-series data, partition by month or week. VACUUM runs per-partition, so a 500M-row table becomes 12 partitions of ~42M rows each. Dropping old partitions is instant (DROP TABLE) instead of DELETEing and VACUUMing millions of rows. Also consider pg_repack if the table is heavily bloated — it performs an online table rewrite without the exclusive lock that VACUUM FULL requires.Words that impress: dead tuples, xmin/xmax visibility, autovacuum scale factor, cost-based vacuum delay, maintenance_work_mem, transaction ID wraparound, pg_repack.
Senior vs Staff distinction. A senior engineer identifies the immediate diagnosis steps (check pg_stat_user_tables, tune autovacuum, check for long-running transactions) and fixes the table. A staff/principal engineer goes further: they ask why autovacuum fell behind in the first place (missing monitoring? default settings never tuned? no table-level overrides on high-write tables?), sets up per-table autovacuum policies as organizational standards, adds pg_stat_user_tables.n_dead_tup to the Tier 1 dashboard, and establishes a runbook so the next engineer who encounters this does not start from scratch. Staff-level thinking treats the incident as a system design problem, not a one-off fix.
Follow-up chain:
  • Failure mode: What happens if VACUUM FREEZE cannot complete before transaction ID wraparound? PostgreSQL enters read-only safety mode for the entire cluster — all writes stop, not just the affected table. Recovery requires manual VACUUM FREEZE which can take hours on large tables.
  • Rollout: If you partition the table, how do you migrate 500M rows into partitions without downtime? Use pg_partman for declarative partitioning or create partitions, backfill with INSERT INTO ... SELECT, set up a trigger to dual-write during migration, then swap.
  • Rollback: If the new autovacuum settings cause too much I/O during peak hours, revert the per-table settings with ALTER TABLE ... RESET (autovacuum_vacuum_scale_factor) and schedule manual VACUUM during off-peak windows.
  • Measurement: Track n_dead_tup / n_live_tup ratio per table, last_autovacuum freshness, and age(relfrozenxid) as Tier 1 metrics. Alert at 10% dead tuple ratio.
  • Cost: pg_repack requires temporary disk space equal to the table size. For a 500GB table on RDS, ensure your storage can handle the spike. VACUUM FULL is cheaper on disk but takes an exclusive lock — quantify the downtime cost to the business.
  • Security/governance: Long-running transactions often come from developer SSH sessions with open psql connections. Enforce idle_in_transaction_session_timeout and audit who has direct database access. In regulated environments, the wraparound risk itself may be a compliance concern.
Work-sample prompt: “Here is a pg_stat_user_tables output showing a table with 450M live tuples and 380M dead tuples. The last autovacuum was 18 hours ago. age(relfrozenxid) is 1.2 billion. Write the exact SQL commands you would run in the next 10 minutes, in order, explaining why each one matters.”
Structured Answer Template.
  1. Diagnose — What is the current state? Check pg_stat_user_tables, n_dead_tup, last_autovacuum, age(relfrozenxid).
  2. Find blockers — Any long-running transactions holding back the visibility horizon? Terminate them.
  3. Tune for this table — Per-table autovacuum overrides (scale factor, cost limit, maintenance_work_mem).
  4. Recover spacepg_repack for online rewrite, never VACUUM FULL on a live hot table.
  5. Prevent recurrence — Partition the table, add monitoring on n_dead_tup / n_live_tup and age(relfrozenxid).
Real-World Example. GitLab famously shared a post-mortem where a single 3-day-old idle-in-transaction session held back VACUUM on a 500 GB ci_builds table. age(relfrozenxid) climbed past 1.5 billion before monitoring caught it. They now enforce a 5-minute idle_in_transaction_session_timeout cluster-wide and page on-call at 500M xid age — a pattern any team running PostgreSQL at scale should copy.
Big Word Alert — MVCC. Multi-Version Concurrency Control means PostgreSQL keeps multiple versions of a row so concurrent transactions see consistent snapshots without locking readers. Say “MVCC dead tuples” when explaining why VACUUM is necessary — it shows you understand the why, not just the what.
Big Word Alert — index bloat. When rows are updated/deleted under MVCC, index entries still point to now-dead tuples until VACUUM cleans them. Bloated indexes are larger and slower. Use the phrase when recommending REINDEX CONCURRENTLY or pg_repack.
Follow-up Q&A Chain:Q: Why can’t you just run VACUUM FULL during a maintenance window on a 500GB table? A: VACUUM FULL takes an ACCESS EXCLUSIVE lock — no reads, no writes — for the entire duration, which on 500GB is typically 2-6 hours. It also needs ~500GB of free disk to rewrite the table. pg_repack does the same work online using triggers and a shadow table, swapping at the end with a brief lock.Q: If autovacuum is configured correctly, why does it still fall behind? A: Three common reasons: (1) cost-based throttling is too aggressive (autovacuum_vacuum_cost_delay default of 20ms on fast SSDs is conservative), (2) not enough workers (autovacuum_max_workers default 3 can’t cover dozens of hot tables simultaneously), (3) long-running transactions pin the visibility horizon so VACUUM has nothing to reclaim.Q: How do you validate that pg_repack completed cleanly without data loss? A: Compare row counts and a checksum of a stable column (e.g., MD5(string_agg(id::text, ',')) over ordered rows) before and after. pg_repack ships with internal consistency checks, but a post-run validation query against a replica snapshot is the belt-and-suspenders check I always run.
Further Reading:
  • PostgreSQL Docs — Routine Vacuuming
  • Crunchy Data Blog — “VACUUM Strategy for Busy PostgreSQL Tables” (search: Crunchy Data vacuum strategy)
  • “pg_repack — Reorganize tables in PostgreSQL databases with minimal locks” — official project README on GitHub
What they are really testing: Can you think in access patterns rather than entities? Do you understand single-table design?Strong answer:Start by listing access patterns:
  1. Get user profile by user ID
  2. Get all posts by a user (most recent first)
  3. Get a specific post
  4. Get all comments on a post (most recent first)
  5. Get a user’s feed (posts from users they follow)
  6. Get all followers of a user
Single-table design:
PKSKGSI1-PKGSI1-SKType
USER#janePROFILEUser profile
USER#janePOST#2024-04-09T10:00:00#uuidPOST#uuidPOSTPost by Jane
POST#uuidCOMMENT#2024-04-09T10:05:00#uuid2USER#bobCOMMENTComment by Bob on post
USER#janeFOLLOWS#bobUSER#bobFOLLOWER#janeJane follows Bob
Access patterns mapped:
  • User profile: GetItem(PK=USER#jane, SK=PROFILE)
  • User’s posts: Query(PK=USER#jane, SK begins_with POST#) — sorted by timestamp in SK
  • Post comments: Query(PK=POST#uuid, SK begins_with COMMENT#) — sorted by time
  • Bob’s followers: Query(GSI1-PK=USER#bob, GSI1-SK begins_with FOLLOWER#)
The feed problem: The feed (posts from followed users) is the hardest access pattern. Two approaches: (1) Fan-out on write — when Jane posts, write the post ID to each follower’s feed (a separate item type). Fast reads, expensive writes for popular users. (2) Fan-out on read — at read time, query each followed user’s recent posts and merge. Simpler writes, potentially slow reads. A hybrid approach (fan-out on write for users with <10K followers, fan-out on read for celebrities) is what Twitter and Instagram use.Key design decisions I would explain: Sort keys include timestamps for natural ordering. UUIDs are appended to timestamps to guarantee uniqueness. GSI1 enables the inverse query (given a post, find the user; given a user, find followers). The feed strategy depends on the ratio of followers to follow frequency.
Structured Answer Template.
  1. List access patterns first — enumerate every read/write the app will perform. Do not start with entities.
  2. Design PK/SK — PK for distribution, SK for sort/filter within a partition.
  3. Add GSIs for inverse lookups — each GSI answers one specific access pattern.
  4. Tackle the hard pattern last — feeds, search, counts. These usually need write sharding, fan-out, or materialized aggregates.
  5. Validate — walk through every access pattern and confirm it is a GetItem or Query, never a Scan.
Real-World Example. Notion’s engineering team described migrating their block storage from a purely relational model to a hybrid where hot read paths (document tree traversal) were served by DynamoDB-style key-based lookups. The winning lesson was exactly this access-patterns-first approach: they modeled “get all children of block X sorted by position” as a single Query call with PK=block:parent_id, SK begins_with position#, avoiding recursive joins entirely.
Big Word Alert — single-table design. Storing multiple entity types (users, posts, comments) in one DynamoDB table by encoding type prefixes into the partition and sort keys. Use when explaining why you are not creating users_table, posts_table, comments_table separately — the phrase signals you understand DynamoDB’s access-pattern-first philosophy.
Big Word Alert — fan-out on write. Writing a denormalized copy of new data into many target locations (e.g., each follower’s feed) at write time so reads are cheap. The opposite is fan-out on read, where you aggregate at query time. Name both trade-offs explicitly when discussing feed architectures.
Follow-up Q&A Chain:Q: What happens when a user with 50 million followers posts something? Fan-out on write breaks down. A: You switch that user to fan-out on read (they are a “celebrity”). The system maintains two code paths: normal users’ posts are fanned out to followers at write time; celebrity posts stay on the celebrity’s timeline, and followers’ feeds are merged at read time from the normal fan-out plus the celebrities they follow. Instagram and Twitter both use this hybrid.Q: How would you handle deleting a comment? It sounds trivial but has subtle issues. A: A DeleteItem on the comment is trivial, but the comment count on the post (if you denormalized it) needs to decrement atomically. Use a TransactWriteItems that deletes the comment and decrements the counter together. If the counter lives on the parent post item, this is a single transactional operation — no race with concurrent additions.Q: Your GSI1 gets hot because many comments are on one viral post. How do you handle it? A: Write-shard the GSI partition key. Instead of GSI1-PK = POST#uuid, use GSI1-PK = POST#uuid#shard-{0-9} where the shard is chosen by hashing the comment ID. Reads fan out across all 10 shards and merge — more complex but eliminates the hot partition.
Further Reading:
What they are really testing: Systematic debugging methodology, understanding of Redis memory internals.Strong answer:Step 1: Understand the current state.
  • INFO memory — check used_memory, used_memory_rss (actual OS allocation, which includes fragmentation), mem_fragmentation_ratio. If fragmentation ratio is > 1.5, Redis is wasting significant memory on fragmentation. If used_memory is close to maxmemory, eviction is being triggered.
  • INFO keyspace — how many keys in each database? Is the key count growing unexpectedly?
Step 2: Find what is consuming memory.
  • MEMORY USAGE <key> on suspected large keys.
  • redis-cli --bigkeys — scans the keyspace and reports the largest keys of each type. This often reveals a single Sorted Set or Hash that has grown to several GB.
  • For a deeper analysis: redis-cli --memkeys or use Redis’s MEMORY DOCTOR command.
Step 3: Common culprits and fixes.
  • Missing TTLs: Keys that should expire but never do. Use OBJECT IDLETIME <key> to find keys that have not been accessed in weeks but have no TTL. Set TTLs aggressively.
  • Unbounded data structures: A List or Sorted Set that grows without limit (e.g., appending every event without trimming). Fix with LTRIM (keep last N elements) or ZREMRANGEBYSCORE (remove entries older than X).
  • Serialization bloat: Large JSON strings stored as Redis Strings. Compress at the application level or switch to Redis Hashes for structured data (hashes use ziplist encoding for small objects, which is much more memory-efficient).
  • Key fragmentation: Many small keys with verbose names. Shorten key names or group related data into Hashes.
Step 4: If the data legitimately needs more memory.
  • Scale vertically (larger instance).
  • Implement Redis Cluster to shard across multiple nodes.
  • Move cold or less latency-sensitive data to disk-based storage and keep only hot data in Redis.
The meta-point: I would approach this the same way I approach any resource exhaustion issue — measure first, identify the largest consumers, then decide whether to reduce consumption or increase capacity.
Structured Answer Template.
  1. MeasureINFO memory, INFO keyspace, fragmentation ratio. Know the current state precisely.
  2. Identify top consumersredis-cli --bigkeys, MEMORY USAGE on suspects.
  3. Categorize the cause — missing TTLs, unbounded structures, serialization bloat, or fragmentation.
  4. Apply targeted fixes — TTLs, LTRIM/ZREMRANGEBYSCORE, Hash encoding, MEMORY PURGE.
  5. Scale only if genuinely needed — vertical first, then Cluster. Never reach for more RAM before fixing the leak.
Real-World Example. Shopify ran into Redis memory pressure on their session store when a bug caused sessions to be written without TTLs for a three-week period. The instance grew from 8 GB to 48 GB and started evicting real sessions. Their playbook now requires every SET-family command in production code to pass through a helper that enforces a TTL unless explicitly marked persistent, caught by a linter in CI.
Big Word Alert — fragmentation ratio. mem_fragmentation_ratio = used_memory_rss / used_memory. A ratio above 1.5 means the OS is holding ~50% more memory than Redis actually uses — memory allocator overhead from churn. Ratios under 1.0 mean Redis is swapping, which is a production emergency. Name the specific threshold when diagnosing.
Big Word Alert — ziplist/listpack encoding. Redis stores small Hashes, Lists, and Sorted Sets in a compact serialized format (ziplist in Redis <7, listpack in 7+) until they exceed size thresholds. Grouping small fields into a single Hash instead of many top-level keys can cut memory 5-10x. Mention this when recommending “use a Hash instead.”
Follow-up Q&A Chain:Q: mem_fragmentation_ratio is 2.1. Do you restart Redis or try to fix it online? A: Neither first — try MEMORY PURGE and CONFIG SET activedefrag yes. Active defragmentation runs in the background and slowly rewrites fragmented keys. Restarting Redis wipes the cache (massive stampede on the backing store) and is almost never the right first move.Q: You find a 4 GB Sorted Set that has no TTL and is growing. The team says “it’s our event log.” What do you recommend? A: Redis is the wrong home for unbounded logs. Migrate to Redis Streams (has built-in MAXLEN trimming) or, better, to Kafka/Kinesis if the data is truly append-only. Interim fix: add a scheduled ZREMRANGEBYSCORE event_log -inf (now - 7d) cron to cap it at 7 days while the migration is planned.Q: After all fixes, you are still near maxmemory. What is your last-resort before vertical scaling? A: Tune the eviction policy and audit the key distribution. If the workload is a cache, allkeys-lru (or allkeys-lfu for skewed access) lets Redis evict cold keys automatically. If the workload mixes cache and source-of-truth data, split them into separate instances — eviction should never touch data you cannot re-derive.
Further Reading:
What they are really testing: Nuanced understanding of database trade-offs, not tribal loyalty.Strong answer:I would choose MongoDB over PostgreSQL in specific scenarios:
  1. Highly variable schema with rapid iteration. If I am building a product where the data model changes weekly — different content types with different fields, A/B testing different data structures — MongoDB’s flexible schema means I can iterate without migration scripts. PostgreSQL’s JSONB can handle this too, but MongoDB’s query language is designed for it.
  2. Hierarchical document data. If my primary entity is a deeply nested document (a product catalog with variants, each variant with options, each option with regional pricing), MongoDB stores and retrieves this as a single document read. PostgreSQL would require multiple JOINs or JSONB with less ergonomic querying.
  3. Horizontal scaling as a primary requirement. If I know from day one that I need to shard across many nodes (multi-region, multi-tenant at massive scale), MongoDB’s native sharding is more mature than PostgreSQL’s horizontal scaling options (which require Citus or application-level sharding).
  4. When the team has deep MongoDB expertise and the use case does not require multi-document transactions, complex JOINs, or strong schema enforcement.
When I would NOT choose MongoDB:
  • When I need complex transactions across multiple entities (financial systems).
  • When ad-hoc analytical queries with JOINs are a core requirement.
  • When data integrity constraints (foreign keys, check constraints, unique constraints across related entities) are critical.
  • When the team does not have MongoDB experience — PostgreSQL is a safer default because it handles a wider range of use cases well.
The honest take: PostgreSQL is my default choice for most applications. MongoDB is a specialist tool that excels in its niche. The worst outcome is choosing MongoDB because “NoSQL scales better” (it does not, inherently) and then fighting against the document model for every relational query you need.
Structured Answer Template.
  1. State your default — “PostgreSQL is my default.” This signals you are not a bandwagon picker.
  2. Enumerate scenarios where MongoDB wins — variable schema, deeply nested docs, native sharding needs.
  3. Enumerate scenarios where it loses — multi-entity transactions, heavy JOINs, strict integrity.
  4. Name the trap — “I would avoid choosing MongoDB for relational data because NoSQL sounds modern.”
  5. Close with a decision rule — concrete enough to apply on Monday.
Real-World Example. Figma kept their primary data (files, users, teams — relational) in PostgreSQL while using a purpose-built document model only where variable schemas genuinely matter. GitHub made the opposite tactical choice for Gists (document-like) but stayed on MySQL for everything relational. The pattern repeats: successful companies pick the database that fits the shape of the data, not the shape of the hype cycle.
Big Word Alert — schema-on-read vs schema-on-write. Schema-on-write (PostgreSQL) validates structure at insert time; schema-on-read (MongoDB) stores whatever you give it and parses at query time. Use these phrases when explaining flexibility trade-offs — they are the precise technical framing.
Big Word Alert — polyglot persistence. Using multiple databases in one system, each for what it does best (e.g., PostgreSQL + Redis + Elasticsearch). Name-drop it when discussing hybrid architectures, but caveat it: each additional database adds operational cost.
Follow-up Q&A Chain:Q: What about PostgreSQL’s JSONB? Doesn’t that close the gap? A: Yes, for about 80% of “we need flexible schema” use cases. JSONB supports GIN indexes, partial indexes on specific paths, and ACID transactions mixing relational and semi-structured data. If you are already on PostgreSQL, JSONB is almost always a better answer than adding MongoDB.Q: A team says “we need MongoDB because we’ll scale horizontally someday.” How do you respond? A: “Someday” is the problem. PostgreSQL scales vertically comfortably to 2-4 TB and millions of QPS — most companies never outgrow it. When they do, Citus, Vitess-for-Postgres, or application-level sharding are proven paths. Picking MongoDB preemptively pays a real daily cost (JOINs, transactions, tooling maturity) for a hypothetical future pain that usually never materializes.Q: Are there workloads where MongoDB is genuinely faster than PostgreSQL? A: For single-document reads/writes where the document is already the natural unit of access (e.g., a user profile with deeply nested preferences), MongoDB’s storage format avoids JOINs and can beat PostgreSQL. But once you need aggregations across documents, PostgreSQL’s planner usually wins.
Further Reading:
  • “PostgreSQL JSONB — Use Cases and Limitations” — pganalyze blog
  • MongoDB Docs — Data Modeling Introduction
  • Martin Fowler — “Polyglot Persistence” (martinfowler.com/bliki/PolyglotPersistence.html)
What they are really testing: Depth of DynamoDB operational experience.Strong answer:A hot partition occurs when a disproportionate amount of traffic targets items that share the same partition key prefix and therefore land on the same physical partition. Example: sharding by date (PK=2024-04-09) means today’s partition receives all writes while yesterday’s is idle.A hot key is a single item that receives an extreme volume of reads or writes. Example: a configuration item (PK=CONFIG, SK=FEATURE_FLAGS) that every Lambda invocation reads on startup.Diagnosis:
  • CloudWatch ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits metrics — look for uneven distribution across partitions.
  • Enable DynamoDB Contributor Insights — it shows the most accessed partition keys and sort keys.
  • ThrottlingEvents metric increasing while the table’s overall consumed capacity is well below provisioned capacity is the classic hot partition signal.
Fixes for hot partitions: Redesign the partition key for higher cardinality. Use write sharding (append random suffixes). Use composite keys that distribute more evenly.Fixes for hot keys: Use DAX (caching layer) for read-hot keys. For write-hot keys, implement application-level write sharding (scatter writes across multiple items, aggregate on read). DynamoDB’s adaptive capacity can automatically isolate hot items onto dedicated partitions, but this is reactive and has limits.
Structured Answer Template.
  1. Define both terms precisely — hot partition (key prefix collision) vs hot key (single item).
  2. Diagnose — CloudWatch metrics, Contributor Insights, throttling despite low overall consumption.
  3. Hot partition fix — increase partition key cardinality, write sharding.
  4. Hot key fix — DAX for reads, scatter-gather writes for writes.
  5. Mention adaptive capacity — what it does and why you cannot rely on it alone.
Real-World Example. Coinbase engineers have publicly discussed how their early DynamoDB design used partition keys like PK = CURRENCY#BTC, which became a hot partition during Bitcoin price spikes — every trade hit one partition. The fix was PK = CURRENCY#BTC#{random_suffix_0-99}, distributing writes across 100 virtual partitions. Reads aggregate across all suffixes, but writes scale linearly.
Big Word Alert — sharding (write sharding specifically). Appending a random or hash-based suffix to a partition key to spread writes across multiple physical partitions. Say “write sharding with N-way suffix” — the number tells the interviewer you have sized it for a specific throughput.
Big Word Alert — adaptive capacity. DynamoDB’s automatic redistribution of provisioned throughput from cold partitions to hot ones, enabled by default since 2018. Name it, but always caveat: it’s reactive (takes minutes) and bounded (cannot exceed the partition’s hardware ceiling of 1000 WCU / 3000 RCU).
Follow-up Q&A Chain:Q: If you write-shard by appending a random suffix, how do you read all the writes back? A: Scatter-gather: issue N parallel Query calls, one per suffix, and merge results in the application. For strict ordering, sort after merging. The trade-off is N* the RCU cost per read, which is why you only write-shard items that are write-hot and read-infrequently (or read-aggregatable), like counters, time-series buckets, or leaderboards with periodic rollups.Q: You see throttling only on a GSI, not the base table. What does that mean? A: GSIs have their own provisioned capacity, and writes to the base table replicate asynchronously to GSIs. A hot GSI partition throttles GSI writes, and back-pressure can slow base table writes too (the base table will throttle if the GSI cannot keep up). Diagnose with ConsumedWriteCapacity per-GSI CloudWatch metric and redesign the GSI partition key with the same high-cardinality principles.Q: Why not just switch to on-demand mode and let DynamoDB figure it out? A: On-demand handles overall capacity elastically but does nothing for partition-level hot spots — the partition hardware limits (1000 WCU, 3000 RCU per partition) still apply. A hot partition on on-demand throttles just like on provisioned. On-demand also costs ~6.5x more per request at steady state. It’s a scaling convenience, not a hot-partition fix.
Further Reading:
What they are really testing: Can you read a query plan and reason about join strategies?Strong answer:The Nested Loop join is executing the inner Index Scan 50,000 times — once for each row in the outer relation. Each Index Scan might take 0.5 ms (reasonable for a single lookup), but 50,000 * 0.5 ms = 25 seconds. This is the classic “N+1 at the database level.”Why the planner chose this: The planner estimated the outer relation would return a small number of rows (perhaps 100), making a Nested Loop with 100 index lookups efficient. But the actual row count is 50,000 — a bad estimate, likely due to stale statistics or a correlation the planner does not model well.Fixes in order of preference:
  1. Run ANALYZE on both tables to update statistics. Rerun the query — the planner might switch to a Hash Join or Merge Join with accurate estimates.
  2. Check for a missing index on the join column of the inner table. If the inner Index Scan is on a poorly selective index, adding a better one helps.
  3. Increase work_mem for this query (SET work_mem = '256MB'). Hash Joins need memory to build the hash table — if work_mem is too small, the planner avoids them. With more memory, the planner might choose a Hash Join (build a hash table from the smaller table, probe it once for each row of the larger table — O(N+M) instead of O(N*M)).
  4. Force a different join strategy as a last resort: SET enable_nestloop = off and rerun to see if a Hash Join or Merge Join is faster. This is a diagnostic tool, not a production fix — the real fix is making the planner’s estimates accurate.
The key insight in an interview: Always multiply the inner node’s time by the loop count. An inner operation that looks fast (0.5 ms) becomes devastating when multiplied by 50,000.
Structured Answer Template.
  1. Compute the real cost — inner node time × loops = actual join cost.
  2. Explain why the planner chose it — stale stats usually, sometimes correlated columns.
  3. Fix in orderANALYZE first (free), then indexes, then work_mem, then forcing the join type.
  4. Name the alternative — Hash Join (O(N+M)) vs Nested Loop (O(N*M)).
  5. Never end with “force the plan” — it’s a diagnostic step, not a production fix.
Real-World Example. PlanetScale’s engineering blog described a customer whose query regressed from 30ms to 30 seconds overnight after a bulk backfill skewed the data distribution. The plan had flipped from Hash Join to Nested Loop because the planner’s row estimate for the outer relation was 100x off. A single ANALYZE restored the original plan — a two-second fix after two days of debugging the application layer.
Big Word Alert — N+1 (at the database level). The anti-pattern where one initial query is followed by N follow-up queries (usually from a loop). A Nested Loop join with large outer rows is essentially N+1 materialized in the query planner. Name the pattern — it signals you see the same shape at the app and DB layers.
Big Word Alert — selectivity and cardinality. Selectivity = fraction of rows matching a predicate (0.01 means 1%); cardinality = estimated or actual row count. The planner’s job is to estimate these correctly. When stats are stale, selectivity estimates are wrong, cardinality estimates are wrong, and the join strategy is wrong.
Follow-up Q&A Chain:Q: Why would ANALYZE fix the plan but not prevent regression next week? A: Autovacuum’s analyze_scale_factor default of 0.1 means stats only refresh after 10% of the table changes. On a large slow-growing table, that could be weeks. Lower it per-table (ALTER TABLE ... SET (autovacuum_analyze_scale_factor = 0.01)) for tables where the planner’s accuracy matters most.Q: EXPLAIN ANALYZE shows the plan, but the query was previously fast. How do you find the old plan to compare? A: auto_explain with log_min_duration = 1s logs full plans for all slow queries — when deployed ahead of time, you have a historical record. Without it, pg_stat_statements at least shows mean/min/max execution time so you can confirm when the regression started, correlating with recent data-volume changes.Q: When is a Nested Loop actually the right choice? A: When the outer relation is truly small (say, under 100 rows) and the inner has a selective index. Then N index lookups beat building a hash table. The planner is not wrong to pick Nested Loops — it is only wrong when its estimate of the outer relation size is.
Further Reading:
  • PostgreSQL Docs — Using EXPLAIN
  • Use The Index, Luke — “Joining Tables” chapter (use-the-index-luke.com/sql/join)
  • pganalyze Blog — “Understanding PostgreSQL Query Plans”
What they are really testing: Understanding of MongoDB’s aggregation framework internals and optimization strategies.Strong answer:Step 1: Use .explain('executionStats') to see the query plan and understand which stages are slow.Step 2: Apply optimizations:
  1. Move $match to the beginning. $match (and $sort on indexed fields) at the start of the pipeline can use indexes. After a $group or $project, no indexes are available.
  2. Add $limit early if you only need a subset of results. $match -> $sort -> $limit -> $group processes far fewer documents than $match -> $group -> $sort -> $limit.
  3. Check $lookup performance. If you are doing a $lookup on a large collection, ensure the joined collection has an index on the foreignField. Without it, every lookup does a collection scan.
  4. Avoid $unwind on large arrays before $group if you can use array operators ($sum, $avg on array fields) instead. $unwind explodes document count.
  5. Use allowDiskUse: true if the pipeline exceeds 100 MB of RAM (the default limit for in-memory aggregation). But also investigate why it needs that much memory — it may indicate the pipeline is processing too many documents.
  6. Consider a materialized view for aggregations that run frequently on the same data. Pre-compute the result and store it, refreshing periodically.
Step 3: If $lookup is the bottleneck, consider whether the data model should be restructured. A slow $lookup in a hot path often means the data should be denormalized (embedded in the parent document) rather than joined at query time.
Structured Answer Template.
  1. Measure.explain('executionStats'), look at totalDocsExamined vs nReturned.
  2. Push filters early$match and $sort on indexed fields at the top of the pipeline.
  3. Cut the working set$limit before $group, projection to drop unused fields.
  4. Attack joins — ensure $lookup foreign fields are indexed, or denormalize.
  5. If still slow — materialize. The fastest aggregation is the one computed in the background.
Real-World Example. MongoDB’s own engineering blog describes optimizing a dashboard aggregation that took 45 seconds on a 200M document collection. The fix was reordering from $lookup → $match → $group to $match → $lookup → $group, which reduced the input to $lookup by 99.5% and dropped the runtime to 600ms. The code change was five lines; the insight was understanding pipeline ordering.
**Big Word Alert — lookup.MongoDBsequivalentofaSQLLEFTOUTERJOIN,doneataggregationtime.WithoutanindexontheforeignField,itperformsacollectionscanperinputdocumenteffectivelyanN+1atthedatabaselevel.Alwayspairlookup.** MongoDB's equivalent of a SQL LEFT OUTER JOIN, done at aggregation time. Without an index on the `foreignField`, it performs a collection scan per input document — effectively an N+1 at the database level. Always pair `lookup` with “and the foreign field must be indexed” when mentioning it.
Big Word Alert — materialized view. A precomputed, persisted result of a query/aggregation, refreshed on a schedule. MongoDB supports this via $merge or $out as the last stage of an aggregation. Name it when recommending moving expensive pipelines off the read path.
Follow-up Q&A Chain:Q: Your $match stage is at the top but the plan still shows a collection scan. Why? A: The $match references fields that are not indexed, OR references a field computed earlier in the pipeline (MongoDB cannot use an index after a $project or $addFields), OR the query uses operators like $regex with a leading wildcard that can’t use the index. Check the winningPlan.stage — if it’s COLLSCAN not IXSCAN, no index is being used.Q: When does allowDiskUse: true become a red flag rather than a feature? A: If it’s set on a hot, real-time endpoint serving user requests. Disk-spilled aggregations are 10-100x slower than in-memory ones. allowDiskUse is appropriate for offline/batch jobs (reports, exports) but signals a fundamental design problem if production queries need it. Either reduce the working set with earlier filtering or precompute the aggregation.Q: How do you avoid pipeline regressions when the data grows 10x? A: Add a performance regression test to CI using a scaled copy of production data. Run the pipeline with explain("executionStats") and assert totalDocsExamined stays below a threshold. This catches plan changes before they hit production — far cheaper than finding out at 3 AM that a query that was fine at 10M documents is catastrophic at 100M.
Further Reading:
What they are really testing: Understanding of distributed systems constraints in Redis Cluster.Strong answer:Redis Cluster does NOT support cross-slot transactions. If keys are on different hash slots (and therefore potentially on different nodes), MULTI/EXEC and Lua scripts will fail because Redis cannot guarantee atomicity across nodes.Solutions:
  1. Hash tags (preferred). Force related keys to the same slot using {tag} in the key name. {order:123}:items and {order:123}:total both hash on order:123 and land on the same slot. Design your key naming convention so that keys that need transactional access share a hash tag.
  2. Application-level coordination. If the keys genuinely belong on different nodes, handle the “transaction” in application code with idempotent operations and compensating actions (saga pattern). This is complex and should be avoided if possible.
  3. Reconsider the architecture. If you frequently need transactions across unrelated keys, Redis Cluster might not be the right tool. Consider whether the operation can be redesigned as independent operations, or whether a database with distributed transactions (PostgreSQL, CockroachDB) should handle this part of the workload.
The senior insight: The constraint is not a Redis bug — it is a fundamental distributed systems trade-off. Distributed transactions across nodes require coordination (2PC or similar), which kills the sub-millisecond latency that makes Redis valuable. Redis Cluster chose performance over distributed transactions. Understanding why the limitation exists matters more than knowing a workaround.
Structured Answer Template.
  1. State the constraint — cross-slot operations are not atomic in Redis Cluster.
  2. Explain why — distributed consensus kills latency; Redis Cluster chose speed.
  3. Give the idiomatic fix — hash tags ({...}) to colocate related keys.
  4. Give the escape hatch — app-layer sagas with compensating actions.
  5. Challenge the premise — if you need cross-key transactions often, Redis Cluster may be the wrong tool.
Real-World Example. Discord uses Redis Cluster for ephemeral state like typing indicators and voice presence. They rigorously colocate related keys — e.g., {guild:123}:typing and {guild:123}:presence — using hash tags so per-guild operations can be atomic. When they needed cross-guild aggregates, they moved that workload to a separate system (ScyllaDB) rather than fighting Redis Cluster’s constraints.
Big Word Alert — hash tag. The {...} substring in a Redis key that forces all keys sharing the same tag into the same hash slot (and therefore the same node). Essential vocabulary for any Redis Cluster discussion — it is how you opt in to atomic multi-key operations.
Big Word Alert — saga pattern. A sequence of local transactions with compensating actions for rollback, used when distributed ACID is impossible. Name it when discussing app-level coordination across cluster slots — it shows you know the proper vocabulary for distributed workflow patterns.
Follow-up Q&A Chain:Q: Hash tags sound great until one tag gets hot. What do you do? A: You have created a hot slot by design — everything for {user:42} goes to one node. If a user’s data grows huge or access becomes disproportionate, you lose the load-balancing benefits of Cluster. The fix is to shard within the tag ({user:42}:shard-0..N) or to accept that certain “whale” keys need dedicated infrastructure.Q: If I use a Lua script on Redis Cluster, what are the rules? A: All keys the script touches must hash to the same slot — declared upfront in the KEYS array. Redis Cluster checks at script-load time. If your script needs to touch keys across slots, it fails. This is why hash tags exist: they let you group keys so a script can operate on them atomically.Q: Could I implement two-phase commit manually across Redis nodes? A: You could, but you would be reinventing a distributed transaction coordinator on top of a system designed to avoid one. The added latency typically exceeds what Redis buys you vs a database that natively supports distributed transactions (CockroachDB, Spanner). If you need 2PC semantics, pick a database designed for them.
Further Reading:
What they are really testing: Understanding of DynamoDB’s partition-level throughput allocation.Strong answer:This is the classic hot partition problem. DynamoDB distributes provisioned throughput evenly across partitions. If you have 10,000 WCU provisioned and the table has 10 partitions, each partition gets approximately 1,000 WCU. If one partition receives 2,000 WCU of traffic (because the partition key is uneven), that partition throttles — even though the table overall is at 20% utilization.Diagnosis: Enable DynamoDB Contributor Insights to see which partition keys are getting the most traffic. Check the ThrottledRequests CloudWatch metric grouped by operation type.Fixes:
  1. Short-term: Switch to on-demand capacity mode, which handles this better (DynamoDB automatically allocates capacity per-partition based on actual traffic).
  2. Medium-term: DynamoDB’s adaptive capacity feature (enabled by default) can redistribute unused capacity from cold partitions to hot ones, but it has limits and is not instantaneous.
  3. Long-term: Redesign the partition key. This is the real fix. If you are writing to PK=STATUS#pending and most orders are pending, that partition is always hot. Change to PK=USER#<userId> so writes distribute across users.
Words that signal depth: partition-level throughput, adaptive capacity, burst capacity (5-minute rolling window of unused capacity), Contributor Insights, hot partition isolation.
Structured Answer Template.
  1. Name the root cause immediately — partition-level throughput, not table-level.
  2. Explain the math — 10,000 WCU ÷ N partitions = per-partition ceiling.
  3. Diagnose — Contributor Insights and ThrottledRequests per operation type.
  4. Short/medium/long-term fixes — on-demand, adaptive capacity, partition key redesign.
  5. End with the real fix — the partition key is the architecture; everything else is patchwork.
Real-World Example. Stripe’s engineering blog once described a DynamoDB workload that throttled at 20% table utilization because of a partition key like PK = EVENT_TYPE#checkout_succeeded. With most events being one type, one partition took all the writes. Switching to PK = MERCHANT_ID#EVENT_TYPE distributed writes across millions of merchants and resolved the throttling without changing the provisioned capacity.
Big Word Alert — partition-level throughput. DynamoDB provisions capacity at the table level but enforces at the partition level — each physical partition has a hard ceiling of 1000 WCU / 3000 RCU. Use the phrase when explaining why a table at 20% utilization can still throttle.
Big Word Alert — burst capacity. DynamoDB stores 5 minutes of unused read/write capacity per partition, allowing short spikes above the provisioned rate. Once the burst bucket is drained, you throttle. Mention it when explaining why throttling sometimes appears suddenly after a sustained workload.
Follow-up Q&A Chain:Q: How do you figure out how many partitions your table actually has? A: DynamoDB doesn’t expose partition count directly, but you can estimate: each partition holds ~10 GB and supports 1000 WCU / 3000 RCU. Take max(storage_size / 10 GB, WCU / 1000, RCU / 3000) — whichever is largest is your partition count. CloudWatch PartitionCount metric is now available on some regions/accounts too.Q: You redesign the partition key but have 2 TB of existing data. How do you migrate? A: You cannot alter a partition key in place. Create a new table with the new schema, backfill via Scan with ParallelScan workers and BatchWriteItem into the new table, enable DynamoDB Streams on the old table to capture live changes during backfill, and cut over with a dual-write period. For multi-TB migrations, AWS Data Pipeline or Glue can parallelize the backfill.Q: A colleague suggests just increasing provisioned capacity 10x. What do you say? A: Spending 10x on capacity to keep a broken partition key working is expensive and temporary — the hot partition still has its per-partition ceiling. You might spend 50K/monthoncapacitythat50K/month on capacity that 0 of partition key redesign would make unnecessary. It is the classic “treating the symptom” trap.
Further Reading:
What they are really testing: Whether you understand both tools deeply enough to make a nuanced comparison, rather than defaulting to tribal preferences.Strong answer:PostgreSQL JSONB advantages:
  • ACID transactions that span both relational and JSONB columns. You can JOIN a JSONB column with a relational table in a single transactional query.
  • GIN indexes on JSONB paths for fast queries (CREATE INDEX ON products USING GIN (attributes jsonb_path_ops)).
  • No separate database to operate. If your primary data is already in PostgreSQL, adding JSONB columns avoids the operational cost of a second database.
MongoDB advantages:
  • First-class document query language. Querying nested documents and arrays in MongoDB is more ergonomic than PostgreSQL’s ->, ->>, and @> operators.
  • Native horizontal sharding. If you need to scale semi-structured data across many nodes, MongoDB’s sharding is more mature.
  • Better developer experience for document-centric workloads. The MongoDB shell, Compass, and aggregation pipeline are purpose-built for working with documents.
Decision rule: If you already run PostgreSQL and the semi-structured data is a feature of a larger relational model (like product attributes in an e-commerce system with relational orders and users), use JSONB. If the entire application is document-centric and you need horizontal sharding, MongoDB is purpose-built for that. The worst choice is adding MongoDB alongside PostgreSQL just for JSONB-like functionality — that is two databases to operate for a marginal developer experience improvement.
Structured Answer Template.
  1. Refuse the false dichotomy — both tools support semi-structured data; the question is which already fits your stack.
  2. List PostgreSQL JSONB strengths — ACID mixing relational + JSON, GIN indexes, one database to operate.
  3. List MongoDB strengths — ergonomic nested querying, native sharding, document-native mental model.
  4. State your decision rule — one-line heuristic.
  5. Name the anti-pattern — adding MongoDB next to PostgreSQL just for JSONB-like features.
Real-World Example. Shopify stores product attributes (highly variable per category) as JSONB in PostgreSQL alongside strictly relational orders and inventory. One database, GIN indexes on the attribute paths, and ACID transactions across both relational and semi-structured data. Adding MongoDB would have meant dual-writing every product mutation and losing the transactional guarantees that checkout flows depend on.
Big Word Alert — GIN index (Generalized Inverted Index). PostgreSQL’s index type for composite data (JSONB, arrays, full-text). CREATE INDEX ON products USING GIN (attributes) lets queries like WHERE attributes @> '{"color":"red"}' use an index instead of scanning every row. Mention it whenever discussing JSONB performance.
Big Word Alert — schema-on-read. Storing data without enforcing structure at write time, interpreting it at query time. Both MongoDB and PostgreSQL JSONB support this, but PostgreSQL adds the option of schema-on-write for the relational portion of the same record — a superpower people often overlook.
Follow-up Q&A Chain:Q: JSONB is stored how internally, and why does that matter for performance? A: PostgreSQL stores JSONB as a decomposed binary format, not text. Field lookups are O(log N) in the number of keys at each level (not a full reparse), and it deduplicates keys internally. This is why JSONB is ~2-3x faster than JSON (text) for queries while being only slightly slower for ingestion. You cannot preserve key order or duplicate keys, which is fine for 99% of use cases.Q: When does a GIN index on JSONB not help? A: Queries using -> or ->> operators for specific paths don’t use GIN — they use B-tree expression indexes instead (CREATE INDEX ON products ((attributes->>'color'))). GIN helps for containment (@>), existence (?), and path existence (@?) operators. Know which operator your query uses before picking the index type.Q: You need horizontal scaling for semi-structured data. Does that tip you to MongoDB? A: Not automatically. PostgreSQL with Citus or Vitess-for-Postgres handles horizontal scaling of relational + JSONB data. Only if the entire data model is document-centric and you need sharding from day one does MongoDB’s native sharding meaningfully win. For everyone else, single-node PostgreSQL or Citus is simpler.
Further Reading:
  • PostgreSQL Docs — JSON Types and JSON Functions and Operators
  • “When to Use JSONB vs Normalized Tables” — pganalyze blog
  • “PostgreSQL anti-patterns: Unnecessary JSON” — 2ndQuadrant blog
What they are really testing: Practical understanding of both tools’ strengths and the ability to reason about trade-offs.Strong answer:Redis implementation:
  • Use a Sorted Set with timestamp scores. ZADD rate:user:123 <timestamp> <request_id>, ZREMRANGEBYSCORE to remove old entries, ZCARD to count.
  • Or use Lua scripting for an atomic sliding window counter (INCR + EXPIRE in a single atomic operation).
  • Latency: <1 ms. Perfect for inline rate limiting in the request path.
  • Risk: if Redis goes down, rate limiting stops. Need Sentinel or Cluster for HA.
DynamoDB implementation:
  • Use a table with PK=user_id, SK=timestamp. TTL automatically removes old entries. Count items with a Query to check the rate.
  • Latency: 5-10 ms. Still fast, but 10x slower than Redis.
  • Advantage: fully managed, no operational burden, survives without additional HA setup.
My recommendation: Redis, almost always. Rate limiting is latency-sensitive (it runs in the hot path of every request). The 1 ms vs 5-10 ms difference matters when you are adding it to every API call. Redis’s atomic operations (INCR, Lua scripts) make the implementation simpler and race-condition-free. DynamoDB is a reasonable fallback if you already have it in your stack and do not want to operate Redis, or if your rate limits are generous enough (per-minute, not per-second) that the extra latency does not matter.The nuance a senior would add: For a distributed rate limiter across multiple regions, DynamoDB Global Tables give you a globally consistent view with less effort than setting up Redis Cluster across regions. It depends on whether the rate limit needs to be enforced globally or per-region.
Structured Answer Template.
  1. State your default — “Redis for the hot path.” Do not hedge until you have characterized the workload.
  2. Quantify — concrete latency numbers (Redis 1ms vs DynamoDB 5-10ms) so the trade-off is numerical, not vibes.
  3. Map each tool to a scenario — Redis for inline rate limiting; DynamoDB for globally distributed or already-in-stack cases.
  4. Name the atomicity mechanism — Lua scripts for Redis, UpdateExpression with ADD for DynamoDB. Both avoid race conditions.
  5. Add a distributed-systems nuance — DynamoDB Global Tables vs Redis Cluster cross-region. This signals seniority.
Real-World Example. Stripe’s engineering blog “Scaling your API with rate limiters” describes a four-tier rate-limiting architecture built on Redis: a request rate limiter, a concurrent request limiter, a fleet usage load shedder, and a worker utilization load shedder — all Lua-backed, all sub-millisecond. The entire tiered system protects their payment API; they picked Redis precisely for the atomicity of Lua scripts plus the latency budget required to sit inline on every charge request.
Big Word Alert — token bucket. An algorithm where a bucket of N tokens refills at a steady rate; each request consumes one token. Allows bursts up to the bucket size, then throttles to the refill rate. Named when discussing rate limiters that should “allow short bursts but enforce a sustained rate” — the phrase signals you know the algorithm, not just the effect.
Big Word Alert — sliding window counter. A rate-limiting algorithm that weights the previous window’s count with the current window’s count to approximate a continuous window without storing individual request timestamps. More memory-efficient than a sliding window log. Use when explaining why Redis INCR with a rolling TTL approximates a sliding window.
Follow-up Q&A Chain:Q: Your Redis is down. Should your rate limiter fail open (allow all traffic) or fail closed (reject all)? A: Fail open for most APIs — a brief Redis outage allowing some extra traffic is better than rejecting all traffic. Fail closed only when the rate limit protects a hard downstream ceiling (third-party API that will ban you, payment processor with strict TPS). The hybrid answer: fail open with an in-memory fallback that applies a coarse per-instance limit so you are not completely unprotected.Q: A Redis Lua rate limiter uses server time. If Redis clocks drift, what breaks? A: Sliding window calculations use timestamp arithmetic; clock drift shifts window boundaries and attributes requests to the wrong window, producing incorrect counts. Use TIME command (server time) instead of client-side timestamps, and for multi-node Redis Cluster, accept that cross-node clock drift makes global sliding windows approximate. Token bucket is more robust to clock drift because it only needs elapsed time, not absolute time.Q: How do you rate-limit per-tenant in a multi-tenant API without one tenant starving another? A: Hierarchical rate limits: a global token bucket caps total API throughput, and per-tenant buckets cap each tenant’s share. A request must pass both buckets to be allowed. This prevents one tenant from exhausting the global bucket while still enforcing fair per-tenant slicing. Stripe and AWS implement this exact hierarchy.
Further Reading:
  • Stripe Blog — “Scaling your API with rate limiters”
  • Redis Docs — Rate Limiting pattern
  • Cloudflare Blog — “How we built rate limiting capable of scaling to millions of domains”
What they are really testing: Deep understanding of PostgreSQL’s write path, MVCC, and WAL.Strong answer:
  1. Client sends UPDATE users SET name = 'Jane' WHERE id = 42 to the PostgreSQL backend process over the connection.
  2. Parser converts the SQL text into a parse tree. Rewriter applies any rules (updatable views, etc.). Planner generates an execution plan (likely an Index Scan on the primary key to find row 42).
  3. Executor locates the target row. PostgreSQL does NOT update the row in place. Instead, it:
    • Marks the old row version as “dead” by setting its xmax to the current transaction ID.
    • Inserts a new row version at a new location on the heap with the updated values. The new version’s xmin is set to the current transaction ID.
  4. Index entries that point to the old row must also be updated. If the new row version is on the same page and it is a HOT (Heap-Only Tuple) update, the index does not need updating (a pointer chain in the heap connects old to new). If it is on a different page, new index entries are created.
  5. WAL record describing both the old row’s xmax change and the new row’s insertion is written to the WAL buffer.
  6. On COMMIT: The WAL buffer is flushed to disk (fsync to pg_wal/). This is the durability guarantee. The actual heap pages are NOT flushed yet — they remain “dirty” in the shared buffer pool.
  7. Background writer and checkpointer eventually flush dirty pages from the buffer pool to the data files on disk.
  8. VACUUM (later) will reclaim the space occupied by the dead row version, once no active transaction can see it.
This is why PostgreSQL UPDATEs are actually “INSERT new version + mark old version dead” — a direct consequence of MVCC. It is also why update-heavy workloads create table bloat without aggressive vacuuming.
Structured Answer Template.
  1. Start at the wire — parser, rewriter, planner, executor. Signals you know the pipeline, not just the storage.
  2. MVCC mechanics — UPDATE is “insert new tuple, mark old as dead via xmax.” This is the insight that separates candidates.
  3. HOT update optimization — same-page update skips index maintenance. Worth naming.
  4. WAL commit path — WAL flush is the durability boundary, not the heap write.
  5. Cleanup — dead tuples accumulate until VACUUM. Connect back to why update-heavy workloads need aggressive autovacuum.
Real-World Example. PlanetScale’s engineering blog describes a customer whose update-heavy workload on PostgreSQL hit catastrophic bloat because the table was updated ~100x per row per day but autovacuum ran at the default 20% scale factor. Their migration was not to a new database but to tuning autovacuum_vacuum_scale_factor = 0.02 and adding fillfactor = 70 to leave room for HOT updates. Write throughput tripled without any schema change.
Big Word Alert — HOT update (Heap-Only Tuple). When a row is updated and the new version fits on the same page AND no indexed column changed, PostgreSQL chains old-to-new via a page-local pointer and skips index maintenance entirely. Name it when explaining why fillfactor < 100 is a write-performance optimization — leaving free space per page enables HOT updates.
Big Word Alert — WAL (Write-Ahead Log). The append-only log of every change, flushed to disk before the change is applied to data files. WAL is the durability boundary — when fsync on WAL completes, the transaction is durable even if the heap pages are still in memory. Mention WAL when explaining why fsync is the real commit cost, not the heap write.
Follow-up Q&A Chain:Q: Why does PostgreSQL not just overwrite the row in place? A: MVCC requires multiple row versions so concurrent readers can see a consistent snapshot without locking. If UPDATE overwrote in place, readers would either block (like Oracle’s undo segments) or see half-modified rows. PostgreSQL’s choice — always create a new version — trades disk space (dead tuples) for reader/writer concurrency.Q: Why is the heap page not immediately flushed to disk on commit? A: Performance. Flushing every dirty page on every commit would destroy write throughput. Instead, PostgreSQL relies on the WAL for durability (the commit is safe as soon as WAL is fsynced) and lets the background writer and checkpointer flush heap pages opportunistically. On crash recovery, WAL replay reconstructs any heap pages that were not yet flushed. This is the write-ahead log protocol — the foundation of every production RDBMS.Q: How does this interact with replication? A: Streaming replication sends WAL records to replicas in real time. The replica’s walreceiver writes WAL to local disk, then startup replays WAL records against the replica’s heap. This is why replication in PostgreSQL is bit-for-bit identical to the primary — replicas are literally replaying the same WAL. It is also why logical replication (Debezium CDC) decodes from the same WAL stream.
Further Reading:
  • PostgreSQL Docs — MVCC and WAL Internals
  • “The Internals of PostgreSQL” by Hironobu Suzuki — Chapter 5 (Concurrency Control) and Chapter 9 (WAL)
  • pganalyze Blog — “How PostgreSQL’s UPDATE works internally”

Interview Deep-Dive Questions

These questions go beyond surface-level knowledge. Each one is designed to simulate a real senior-level interview chain — an initial question followed by the kind of probing follow-ups an experienced interviewer uses to separate candidates who know about databases from candidates who have operated them in production.

How would you debug a PostgreSQL query that was fast yesterday but is slow today, with no schema or code changes?

Difficulty: Senior What the interviewer is really testing: Systematic debugging methodology, understanding of statistics-driven query planning, and awareness that the database is a living system where performance changes without code changes. Strong Answer: The key insight is that query performance depends on the plan, and the plan depends on statistics — not just the query text. No code change does not mean nothing changed. Step 1: Compare query plans. Run EXPLAIN (ANALYZE, BUFFERS) on the slow query now. If you have the old plan (from pg_stat_statements or auto_explain logs), compare them side by side. Look for a plan shape change — did it switch from an Index Scan to a Sequential Scan? Did the join strategy change from Hash Join to Nested Loop? Step 2: Check statistics freshness. The most common cause is stale statistics after a data distribution change. Maybe a bulk import ran last night and doubled the table size, or a migration changed the data distribution (e.g., a status column went from 50% active to 95% active). The planner now estimates selectivity wrong.
  • Run ANALYZE on the involved tables and re-execute. If it fixes the problem, the root cause was stale statistics. Set more aggressive autovacuum_analyze_scale_factor on these tables.
Step 3: Check for bloat. If a large UPDATE or DELETE ran overnight, millions of dead tuples now sit in the table. Sequential scans are slower because they read dead tuples. The visibility map may be invalidated, breaking index-only scans.
  • Check pg_stat_user_tables for n_dead_tup. If it is high, VACUUM has not kept up.
Step 4: Check for lock contention or resource pressure. A new batch job may be running during business hours, consuming shared_buffers and pushing the query’s working set out of the buffer cache. Check BUFFERS output in EXPLAIN — if shared read (disk) is now much higher than shared hit (cache), the buffer pool is under pressure. Step 5: Check pg_stat_activity for long-running transactions. An idle-in-transaction session can hold back the visibility horizon, forcing PostgreSQL to examine more row versions. The meta-principle: in PostgreSQL, “nothing changed” almost never means nothing changed. Data volume changed, data distribution changed, vacuum state changed, or concurrent workload changed. The query planner is a function of statistics, and statistics are a function of data.

Follow-up: What is correlation in PostgreSQL statistics and how can it cause a plan change?

Answer: Correlation measures how closely the physical ordering of rows on disk matches the logical ordering of values in an index. It ranges from -1 to 1. A correlation of 1.0 means rows with sequential index values are stored on sequential disk pages (the table is physically sorted by that column). A correlation near 0 means the values are randomly scattered across pages. This matters because it determines whether an Index Scan is cheap or expensive. With high correlation, fetching 1,000 rows via an index reads a small number of sequential pages (the rows are physically adjacent). With low correlation, those 1,000 rows might be on 1,000 different pages — 1,000 random I/O operations. After a bulk load or major table reorganization, correlation can change dramatically. The planner sees this in pg_statistic and may switch from an Index Scan (which was efficient when rows were physically ordered) to a Bitmap Index Scan or Sequential Scan (which handle random physical ordering better). This causes a “slow query with no code change” because the data layout changed. You can check correlation with:
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'your_table';
If correlation dropped and you need it back, CLUSTER tablename USING index_name physically reorders the table to match the index. But CLUSTER takes an exclusive lock on the table — in production, use pg_repack instead for an online reorder.

Follow-up: How would you set up proactive monitoring so this kind of regression does not page you at 3 AM?

Answer: Three layers of defense:
  1. Query performance baselines with pg_stat_statements. Track mean_exec_time and calls for your top 50 queries. Alert when any query’s mean execution time exceeds 3x its rolling 7-day average. This catches plan regressions within minutes.
  2. auto_explain extension. Configure auto_explain.log_min_duration = '1s' to automatically log the full EXPLAIN plan for any query exceeding 1 second. When a regression happens, you already have the bad plan in the logs — no need to reproduce it.
  3. Statistics freshness monitoring. Track last_autoanalyze in pg_stat_user_tables. Alert if any high-traffic table has not been analyzed in over 24 hours, or if n_dead_tup / n_live_tup exceeds 10%. This catches the root cause (stale stats, vacuum falling behind) before it causes a plan regression.
The principle is the same as any observability strategy: do not wait for symptoms (slow queries) — monitor the leading indicators (stale statistics, dead tuple accumulation, plan changes).

Going Deeper: You have auto_explain enabled and you can see that the planner’s row estimate was off by 1000x. Statistics are fresh. What else could cause such a massive estimation error?

Answer: Fresh statistics can still produce wildly wrong estimates in several scenarios:
  • Correlated columns. PostgreSQL assumes column values are independent. If you query WHERE country = 'Japan' AND language = 'Japanese', PostgreSQL multiplies the selectivity of each predicate independently. It might estimate 2% of rows match country = 'Japan' and 1% match language = 'Japanese', giving 0.02% combined. But in reality, 95% of Japan rows have Japanese — the actual selectivity is close to 2%, not 0.02%. That is a 100x error. PostgreSQL 14+ has extended statistics (CREATE STATISTICS ... (dependencies, ndistinct, mcv) ON country, language FROM users) that model multi-column correlations.
  • Parameterized queries with generic plans. When using prepared statements, PostgreSQL starts with custom plans (per-parameter) but after 5 executions switches to a generic plan (parameter-agnostic). If the generic plan is optimal for typical values but terrible for an outlier value (e.g., status = 'deleted' matches 0.01% of rows but the generic plan is optimized for status = 'active' which matches 90%), you get a bad plan for the outlier. Setting plan_cache_mode = force_custom_plan forces per-parameter planning at the cost of planning overhead.
  • Statistics target too low. The default default_statistics_target is 100, meaning PostgreSQL samples 300 * 100 = 30,000 rows for histogram construction. For a table with 500M rows and a highly skewed distribution, 30,000 samples may miss important value frequencies. Increase the target: ALTER TABLE t ALTER COLUMN c SET STATISTICS 1000.

Walk me through how you would design a DynamoDB table for a multi-tenant SaaS platform where each tenant can have vastly different data volumes.

Difficulty: Staff-Level What the interviewer is really testing: Real-world DynamoDB modeling at scale, understanding of the noisy-neighbor problem, and the ability to design for both small and large tenants in the same table. Strong Answer: This is one of the hardest DynamoDB design problems because of the noisy-neighbor effect — a single large tenant can consume a disproportionate share of partition throughput, causing throttling for other tenants. Key design decisions: 1. Partition key strategy. Use TENANT#<tenantId> as the partition key prefix. For small tenants (under 10 GB of data), all their data lives under one partition key prefix and queries are simple Query(PK = TENANT#small-co). For large tenants (hundreds of GBs, thousands of RPS), a single partition key creates a hot partition. Apply write sharding: TENANT#big-co#SHARD-0 through TENANT#big-co#SHARD-15. The application hashes or round-robins writes across shards. Reads require scatter-gather across all shards. Store the shard count in a metadata item so the application knows how many shards to query. 2. Sort key design. Use composite sort keys with entity type prefixes: RESOURCE#<type>#<timestamp>#<id>. This lets you query all resources of a type sorted by time, or get a specific resource by its full key. 3. Tenant tier metadata table. Maintain a lightweight table (or a DynamoDB item) that stores each tenant’s tier: small, medium, large. This drives the application logic for shard count and capacity allocation. When a tenant grows past a threshold, a migration job reshards their data. 4. Cost isolation. On provisioned mode, you cannot isolate capacity per-tenant within a single table. Options:
  • Use on-demand mode so hot tenants pay per request and do not affect cold tenants’ throughput.
  • For enterprise tenants with strict SLAs, provision a separate table (table-per-tenant for your top 5% tenants). This gives hard capacity isolation at the cost of operational complexity.
  • Use DynamoDB resource-based policies and AWS Organizations for billing isolation.
5. GSI design. Design GSIs so that they do not create secondary hot partitions. A GSI with PK = STATUS puts all active items on one partition — bad for multi-tenant. Instead use PK = TENANT#<tenantId>#STATUS#active so each tenant’s active items land on their own GSI partition. The honest trade-off: single-table multi-tenant DynamoDB is operationally simpler but creates coupling between tenants. Table-per-tenant gives isolation but multiplies operational burden. Most SaaS platforms start with single-table and graduate their largest tenants to dedicated tables as they grow.

Follow-up: A large tenant does a bulk import of 10 million records. How do you handle this without impacting other tenants?

Answer: Bulk imports are dangerous in shared DynamoDB tables because they can spike write capacity and trigger throttling across all partitions. Approach:
  1. Use BatchWriteItem with rate limiting. Do not just fire-hose writes. Implement client-side rate limiting (e.g., 500 WCU/sec) with exponential backoff on UnprocessedItems. This keeps writes below the partition-level throughput ceiling.
  2. Spread writes across shards. If the tenant has 16 shards, distribute bulk writes evenly across all 16 partition keys. This gives 16x the per-partition throughput ceiling.
  3. Temporarily scale up. If using provisioned mode, increase WCU before the import and scale back down after. If using on-demand, DynamoDB auto-scales but has a warm-up period — if the table has been seeing 1,000 WCU and you suddenly hit 50,000, you will be throttled until adaptive scaling kicks in (minutes). Pre-warm by gradually ramping up write traffic.
  4. Consider an offline path. For truly massive imports, write to S3 and use the DynamoDB Import from S3 feature (launched 2023). It provisions temporary capacity separate from the online table, avoiding any impact on live traffic. The data appears in the table once the import completes.
  5. If isolation is critical, import into a staging table first. Validate the data, then use DynamoDB Streams to replicate the items into the production table at a controlled rate.

Follow-up: How would you implement per-tenant rate limiting at the DynamoDB level?

Answer: DynamoDB itself does not have per-tenant rate limiting built in — capacity is shared across the table. So you implement it at the application layer: Option 1: Token bucket in Redis. Before every DynamoDB operation, the application checks a per-tenant token bucket in Redis. Each tenant gets a bucket sized to their tier (free tier: 100 WCU, enterprise: 10,000 WCU). If the bucket is empty, the request is queued or rejected. This is the most common pattern — fast, centralized, and independent of DynamoDB. Option 2: API Gateway throttling per API key. If tenants are identified by API keys, configure per-key rate limits in API Gateway or your load balancer. This limits the request rate before it reaches DynamoDB. Less granular (limits requests, not capacity units) but simpler. Option 3: Application-level middleware. Track consumed capacity per-tenant using DynamoDB’s ReturnConsumedCapacity response field. Maintain running counters per tenant and throttle when a tenant exceeds their allocation. The key insight is that DynamoDB’s capacity model is table-level, not tenant-level. Multi-tenancy rate limiting is always an application concern. This is a fundamental architectural limitation you accept when choosing a shared-table approach.

Your company runs a critical MongoDB replica set. The primary just went down. Walk me through exactly what happens next and what could go wrong.

Difficulty: Senior What the interviewer is really testing: Understanding of MongoDB’s consensus mechanism, the realities of failover, and awareness that “automatic failover” is not as magical as it sounds. Strong Answer: When the primary becomes unreachable, here is the sequence: 1. Detection phase (seconds 0-10). The remaining secondaries notice they have not received a heartbeat from the primary within the electionTimeoutMillis window (default: 10 seconds). This is not instant — there is a built-in delay to avoid false positives from transient network blips. 2. Election phase (seconds 10-12). An eligible secondary calls an election. MongoDB uses a Raft-like consensus protocol. To win, a candidate needs votes from a majority of the replica set members. With a 3-node set (1 down), the remaining 2 nodes form a majority and can elect a new primary. With a 2-node set plus an arbiter, the secondary plus arbiter form the majority. 3. The new primary accepts writes (seconds 12-15). The winning secondary becomes the new primary and starts accepting writes. Total failover time: typically 10-15 seconds, though it can be longer under heavy load. What could go wrong:
  • No majority. If you have a 3-node set and 2 nodes go down, the remaining node cannot form a majority and will not become primary. The replica set is read-only until a majority of nodes recover. This is why the minimum production deployment is 3 nodes across at least 2 data centers (or 2 nodes + 1 arbiter).
  • Writes lost during failover. If the old primary accepted writes with w: 1 (the default) and went down before replicating those writes to a secondary, those writes are gone. When the old primary comes back, it has writes that the new primary does not have — these become rollback writes and are placed in a rollback directory. This is why w: "majority" matters for critical data.
  • Driver connection handling. Client drivers need to discover the new primary. Smart drivers (using the replica set connection string) handle this automatically — they detect the topology change and redirect writes. But there is a brief window where the application throws connection errors. If your application does not retry writes, user requests fail during failover. If it does retry writes, they must be idempotent or you risk duplicate operations.
  • Election ties and priority. If two secondaries have the same optime (are equally up-to-date), priority settings determine who wins. If priority is misconfigured, a secondary in a remote data center might become primary, increasing latency for all write operations. Set priority: 0 on nodes that should never become primary (analytics replicas, cross-region replicas used only for reads).

Follow-up: After failover, the old primary comes back online with writes that the new primary does not have. What happens to those writes?

Answer: Those writes enter a rollback. The old primary detects that its oplog has diverged from the current primary’s oplog. It identifies the point of divergence and undoes all operations after that point, saving the rolled-back data to a special rollback/ directory in the data directory. The rollback data is stored as BSON files that you can inspect and manually re-apply if needed. But in practice, most teams treat rollbacks as data loss — the data was never acknowledged as durable (since w: 1 only guaranteed the primary, which failed). The maximum rollback size is 300 MB by default. If more than 300 MB of data needs to be rolled back, MongoDB refuses to roll back automatically and the node shuts down, requiring manual intervention. This is the strongest argument for w: "majority" on any write you cannot afford to lose. With majority write concern, a write is not acknowledged until it is on at least 2 of 3 nodes. If the primary dies, the surviving majority already has the data — no rollback needed.

Follow-up: How would you design your application to handle the 10-15 second failover window gracefully?

Answer: Three strategies layered together: 1. Retries with backoff. Configure your MongoDB driver with retryWrites: true (default since MongoDB 4.2 drivers). The driver automatically retries certain write operations once on a new primary. For reads, use retryReads: true. These handle transient errors during failover transparently. 2. Circuit breaker at the application layer. If the database is unreachable for more than a few seconds, trip a circuit breaker that returns degraded responses (cached data, a “try again shortly” message) instead of queuing up requests that will all timeout. Without this, you get a thundering herd of retries the moment the new primary is available. 3. Write idempotency. Since retries may re-execute a write that actually succeeded (the primary crashed after persisting but before acknowledging), every write operation must be idempotent. Use unique identifiers in your writes (updateOne with upsert: true and a unique constraint) rather than operations that are not idempotent like $inc without a transaction. The honest truth: 10-15 seconds of write unavailability is the cost of automatic failover. If your business cannot tolerate this, you need to look at architectures that decouple the write path from the database — buffer writes in a message queue (SQS, Kafka) and drain them to MongoDB asynchronously.

Explain the Redis single-threaded model. How can it be fast when it only uses one core?

Difficulty: Intermediate What the interviewer is really testing: Understanding of event-driven architecture, I/O multiplexing, and the difference between CPU-bound and I/O-bound workloads. Strong Answer: The way I think about this is: Redis is fast because it is single-threaded, not despite it. Redis uses an event loop with I/O multiplexing (epoll on Linux, kqueue on macOS). A single thread monitors all client connections simultaneously. When data arrives on any connection, the event loop processes the command, performs the in-memory operation, and queues the response — all without context switches. Why this works:
  1. All data is in memory. The bottleneck for most databases is disk I/O. Redis eliminates this entirely. An in-memory hash lookup takes ~100 nanoseconds. At that speed, a single core can process hundreds of thousands of operations per second because the CPU never waits for anything.
  2. No context switching overhead. A multi-threaded design with locks would spend more time on lock acquisition, cache invalidation, and context switching than on actual data operations. For sub-microsecond operations, the coordination cost of multithreading exceeds the operation cost itself.
  3. No lock contention. Since one thread processes all commands sequentially, there is no need for mutexes, spinlocks, or atomic operations on data structures. Every command sees a consistent state. This is what makes Lua scripts and MULTI/EXEC transactions atomic for free — nothing else can interleave.
  4. Pipeline-friendly. Clients can send multiple commands without waiting for individual responses. The event loop processes them in sequence and sends all responses back in one batch. Pipelining amortizes the network round-trip cost, and the single thread chews through pipelined commands at memory speed.
Where single-threaded Redis hits its limit: CPU-bound operations. A KEYS * scan on 10 million keys, a SORT on a large list, or a Lua script doing heavy computation blocks the event loop for other clients. One slow command blocks everything. The modern nuance (Redis 6+): Redis is no longer purely single-threaded. The main event loop is still single-threaded, but I/O operations (reading from and writing to sockets) are now offloaded to I/O threads. This helps when the bottleneck is network I/O serialization/deserialization rather than command execution. On high-bandwidth workloads (large values), I/O threading can improve throughput by 2x. But the command execution is still single-threaded — that is the correctness guarantee.

Follow-up: If Redis is single-threaded, how does it handle persistence (RDB snapshots, AOF rewrites) without blocking clients?

Answer: Redis uses fork(). When it is time to create an RDB snapshot or rewrite the AOF, Redis forks the process. The child process gets a copy-on-write snapshot of the entire memory space. The child writes the snapshot to disk while the parent continues serving requests. Copy-on-write means the child initially shares all memory pages with the parent. Only when the parent modifies a page (due to a write command) does the OS create a separate copy of that page. For a read-heavy workload, almost no extra memory is needed. For a write-heavy workload, the memory usage during a snapshot can temporarily double in the worst case (every page gets modified). This is why you should never run Redis on a machine with swap enabled — if the fork causes memory to temporarily exceed physical RAM, the OS starts swapping, and a single swapped page turns a 100-nanosecond operation into a 10-millisecond operation. The latency spike is catastrophic. The fork() call itself can also cause a latency spike. On a Redis instance with 25 GB of data, fork() takes 10-20 ms on Linux (copying page table entries). During that time, the event loop is blocked. This is the latest_fork_usec metric in INFO stats — monitor it.

Follow-up: A developer proposes running multiple Redis instances on the same server to utilize all CPU cores. What are the trade-offs?

Answer: This is actually a valid and common pattern — it is how Redis was designed to scale vertically before Redis Cluster existed. Benefits:
  • Uses all cores. An 8-core server runs 8 Redis instances, each handling its own keyspace.
  • Each instance has its own event loop, so a slow command on one instance does not affect others.
Trade-offs:
  • Memory overhead. Each instance has its own memory space. You cannot share data between instances without network calls. If you have 64 GB of RAM and run 8 instances, each gets ~8 GB (minus OS overhead). No benefit of shared caching.
  • Operational complexity. 8 instances means 8 ports, 8 configs, 8 monitoring endpoints, 8 persistence schedules. Your deployment and monitoring tooling must handle this.
  • fork() contention. If multiple instances trigger RDB snapshots simultaneously, the server faces multiple fork() calls, potentially doubling memory usage across all instances at once. Stagger snapshot schedules.
  • Application routing. The application must know which instance holds which data. This is essentially manual sharding, with all the routing complexity that entails.
In modern deployments, Redis Cluster is usually the better answer — it handles the sharding, routing, and rebalancing automatically. Running multiple standalone instances is still valid for simple caching use cases where you want isolated keyspaces (e.g., separate instances for sessions vs. cache vs. rate limiting) with different eviction policies and persistence settings.

You are migrating a PostgreSQL table with 2 billion rows from one database to another with zero downtime. How?

Difficulty: Staff-Level What the interviewer is really testing: Production migration experience, understanding of data consistency guarantees, and risk management under real-world constraints. Strong Answer: A 2-billion-row table with zero downtime rules out any strategy that requires locking the source table or stopping writes. This is a CDC (Change Data Capture) based migration. Phase 1: Initial bulk copy.
  • Set up logical replication from the source table using PostgreSQL’s logical replication or a tool like pglogical. Alternatively, use pg_dump --no-synchronized-snapshots in parallel segments to speed up the initial copy.
  • For a 2B row table, the initial copy might take hours or days depending on row size and network bandwidth. The key is to start the CDC capture before the bulk copy begins, so that any writes during the copy are recorded.
Phase 2: CDC catch-up.
  • While the bulk copy runs, the source continues receiving writes. Logical replication automatically streams these changes (inserts, updates, deletes) to the target.
  • After the initial copy completes, the replication stream catches up. Monitor the replication lag until it drops to sub-second.
Phase 3: Shadow reads (validation).
  • Deploy application code that reads from both databases for a percentage of traffic. Compare results. Log discrepancies. This catches issues like missing rows, data type mismatches, or timezone handling differences.
  • Run consistency checks: SELECT COUNT(*), checksum comparisons on key columns, and spot-check critical business queries.
Phase 4: Cutover.
  • When confident in data consistency, perform the cutover in a maintenance window as small as possible:
    1. Stop writes to the source table (or set it to read-only).
    2. Wait for the last CDC events to replicate (seconds).
    3. Verify final consistency.
    4. Switch the application’s write connection to the new database.
    5. Switch reads to the new database.
  • Total write downtime: seconds, not minutes. If you need truly zero write downtime, use a dual-write approach: the application writes to both databases simultaneously during cutover, and you switch reads once you verify consistency.
Phase 5: Soak and rollback plan.
  • Keep the old database running for at least 2 weeks. If anything goes wrong, the rollback is: switch connections back to the old database. If you used dual-write during cutover, both databases have the latest data and rollback is instant.
What can go wrong:
  • Sequences. If the new table uses a sequence for primary keys, ensure the sequence value is set higher than the maximum migrated ID. Otherwise, new inserts collide with migrated rows.
  • Triggers and constraints. Logical replication replays raw DML. If the target table has triggers that fire on INSERT (audit logging, denormalization), they will fire for every replicated row — potentially billions of trigger executions. Disable non-essential triggers during the bulk copy phase.
  • Large objects and TOAST. Rows with large text or JSONB columns are stored in TOAST tables. Logical replication handles this, but it increases replication bandwidth significantly. Monitor network throughput.
  • Schema differences. If the target schema has additional NOT NULL columns or stricter constraints, replicated rows may violate them. Validate schema compatibility before starting.

Follow-up: How would you handle the migration if you also need to transform the data during migration — for example, splitting one table into two?

Answer: Now you cannot use simple logical replication because the target schema differs from the source. You need a transformation layer in between. Option 1: CDC with a stream processor. Use Debezium to capture changes from the source PostgreSQL WAL and publish them to Kafka. A stream processing consumer (Kafka Streams, Flink, or even a simple application) reads the CDC events, applies the transformation (splitting columns into two tables, reformatting data), and writes to the target tables. This gives you a decoupled, replayable transformation pipeline. Option 2: Dual-write at the application layer. Modify the application to write to both the old single table and the new split tables. Backfill the new tables from the old table using a batch job. Once the backfill completes and the dual-write has been running long enough for consistency, cut over reads to the new tables and remove the dual-write. Option 3: Materialized view or trigger-based replication. Create triggers on the source table that, on every INSERT/UPDATE/DELETE, write the transformed data to the target tables (which can be in the same or a different database via dblink or FDW). Backfill the historical data with a batch job. This is the simplest approach for same-server migrations but does not scale well for cross-server migrations. The choice depends on latency requirements, operational maturity, and whether the team already runs Kafka/Debezium. For most teams, Option 2 (dual-write with backfill) is the simplest to implement and reason about, even if it requires a temporary code path in the application.

Going Deeper: The dual-write approach has a well-known consistency problem. What is it and how do you mitigate it?

Answer: The problem: partial write failures. If the application writes to Database A successfully but the write to Database B fails (network error, timeout, constraint violation), the two databases are now inconsistent. Retrying the failed write might succeed, but if the application crashes between the first write and the retry, the data is permanently divergent. This is fundamentally the two-phase commit problem across independent databases — and there is no general solution that preserves both consistency and availability. Pragmatic mitigations:
  1. Write to the source of truth first, then asynchronously replicate. Make Database A the authoritative source. After writing to A, publish an event (to a transactional outbox table in A, then to Kafka). A consumer applies the write to B. If B fails, the event stays in the queue and retries. This is the transactional outbox pattern — it guarantees at-least-once delivery to B at the cost of temporary inconsistency.
  2. Reconciliation job. Run a periodic job that compares the two databases and fixes discrepancies. This is your safety net — the dual-write handles the happy path, the reconciliation catches edge cases.
  3. Accept temporary inconsistency. If the application can tolerate a few seconds of divergence (and for most applications it can), the combination of “write to both + reconciliation” is sufficient. Perfect consistency across two independent databases requires distributed transactions, which most teams rightly avoid.

Your team uses Redis for session storage. An engineer proposes switching to DynamoDB to reduce operational burden. Evaluate this proposal.

Difficulty: Senior What the interviewer is really testing: Ability to evaluate a real architectural trade-off with nuance, not just pick a side. Strong Answer: This is a reasonable proposal. Let me evaluate it honestly by weighing both sides. Arguments for switching to DynamoDB:
  • Operational simplicity. DynamoDB is fully managed — no patching, no failover configuration, no memory sizing, no eviction policy tuning. Redis (even managed ElastiCache) requires capacity planning, persistence configuration, and monitoring for fragmentation and memory pressure.
  • Durability by default. DynamoDB writes are durable to disk. If a node fails, no data is lost. Redis, even with AOF, can lose the last second of data. For session data in a regulated environment (healthcare, finance), durability matters.
  • Scaling is automatic. DynamoDB on-demand mode scales to any traffic level without intervention. Redis requires manual scaling (resizing the instance or adding cluster nodes).
  • TTL built-in. DynamoDB’s TTL feature automatically deletes expired sessions at zero cost. In Redis, expired keys are cleaned up lazily (on access) or by the active expiry cycle, which can lag under memory pressure.
Arguments against switching:
  • Latency. Redis serves sessions in 0.1-0.5 ms. DynamoDB serves them in 3-10 ms. That is a 10-50x latency increase on every authenticated request. If your API’s p99 target is 50 ms and session lookup adds 10 ms, you have used 20% of your latency budget on one operation.
  • Cost at scale. At 100,000 session reads per second (a large web app), Redis on a single r6g.xlarge costs roughly 300/month.DynamoDBondemandat100KRCU/seccostsroughly300/month. DynamoDB on-demand at 100K RCU/sec costs roughly 5,000/month. Provisioned mode reduces this but requires capacity planning — negating the “less operational burden” benefit.
  • Data structure richness. If sessions contain complex data (nested objects, arrays, counters that increment), Redis Hashes with HINCRBY and HSET are more expressive than DynamoDB’s update expressions. If sessions are simple key-value blobs, this does not matter.
My recommendation: If the team’s primary pain is operating Redis (frequent paging for memory issues, fragmentation, failover problems), switching to DynamoDB is justified — but first consider switching to AWS MemoryDB instead. MemoryDB is Redis-compatible, fully managed, durable (writes to a distributed WAL), and eliminates most Redis operational burden while keeping Redis’s latency and data structure advantages. It is the best of both worlds for this specific use case. If MemoryDB is not an option (cost, region availability), then DynamoDB is a solid choice if the latency increase is acceptable and the cost at your traffic volume makes sense. Prototype it, measure the p99 impact, and make a data-driven decision.

Follow-up: The engineer says “latency does not matter because we will put DAX in front of DynamoDB and get the same latency as Redis.” Is this correct?

Answer: Partially correct, but with important caveats. DAX can serve cached reads in ~200 microseconds — comparable to Redis. So for session reads (which dominate), the latency would be similar. But:
  1. Session writes still hit DynamoDB directly. DAX is a read-through/write-through cache. Writes go to DynamoDB (3-10 ms) and the DAX cache is updated asynchronously. If a user logs in (write) and immediately makes a request (read), the read might hit DAX before the write has propagated — returning a stale “no session” result. This is a consistency window of typically <10 ms but it can cause authentication flicker.
  2. DAX adds another managed service. You have DynamoDB + DAX. The “reduce operational burden” argument weakens because now you are operating a cache in front of a database… which is what you were doing with Redis in the first place, just with different branding.
  3. DAX cost. DAX runs on dedicated nodes (dax.r5.large ~200/monthpernode,minimum3nodesforproduction).Adding200/month per node, minimum 3 nodes for production). Adding 600+/month for DAX on top of DynamoDB costs approaches or exceeds managed Redis pricing.
The honest assessment: DAX + DynamoDB for sessions is a Rube Goldberg version of what MemoryDB does natively. If sub-millisecond latency matters for sessions, use a purpose-built in-memory store.

What are the real-world differences between PostgreSQL’s SERIALIZABLE isolation and DynamoDB’s transactions?

Difficulty: Senior What the interviewer is really testing: Deep understanding of transaction models across different database paradigms, and whether the candidate understands that “transactions” means very different things in different databases. Strong Answer: These are fundamentally different transaction models solving different problems. PostgreSQL SERIALIZABLE:
  • Provides true serializability — the gold standard of isolation. The result is equivalent to some serial ordering of all concurrent transactions, even though they actually executed concurrently.
  • Implemented via SSI (Serializable Snapshot Isolation). PostgreSQL tracks read and write dependencies between concurrent transactions. If it detects a dependency cycle that could produce a non-serializable result, it aborts one transaction with a serialization failure. The application must retry.
  • Scope: any number of tables, any number of rows, any query pattern. You can read 50 tables and write 30 in a single serializable transaction. The database guarantees correctness.
  • Cost: higher abort rates under contention. The more concurrent transactions touch overlapping data, the more serialization failures you get. At extreme contention, throughput collapses.
DynamoDB Transactions (TransactWriteItems / TransactGetItems):
  • Provides atomicity and isolation for a batch of up to 100 items across one or more tables. Either all operations succeed or none do.
  • Isolation is “serializable” at the item level — concurrent transactions on the same items are serialized. But there is no dependency tracking across arbitrary queries. You specify exactly which items participate in the transaction upfront by their primary keys.
  • No read-then-write logic inside the transaction itself. You can do conditional writes (ConditionExpression) to check a value before writing, but you cannot do a query, make a decision, and then write — that must be done in application code across two API calls.
  • Cost: transactions consume 2x the normal WCU/RCU. A transaction that writes 5 items costs 10 WCU. DynamoDB also has a 25-item limit per TransactWriteItems call (though you can include up to 100 items across reads and writes in a mixed transaction).
The key differences in practice:
AspectPostgreSQL SERIALIZABLEDynamoDB Transactions
ScopeArbitrary queries, any number of rowsUp to 100 explicitly named items
LogicFull SQL with subqueries, CTEs, JOINsConditional expressions on individual items
Conflict detectionAutomatic dependency trackingOptimistic concurrency on named items
Failure modeSerialization failure (retry needed)Transaction cancelled (retry needed)
Cost modelSame as normal queries (slightly more CPU)2x capacity consumption
ScaleSingle database instanceDistributed, any scale
When this matters: If you are building an order placement system where you need to check inventory, reserve stock, create an order, and charge a payment — all atomically — PostgreSQL SERIALIZABLE handles this in a single transaction with arbitrary SQL logic. In DynamoDB, you would need to pre-determine the exact items (inventory item, order item, payment item), express the conditions as ConditionExpression on each, and submit them as a TransactWriteItems batch. The DynamoDB approach works but requires more upfront design and cannot handle dynamic logic (like “find the cheapest available item and reserve it” — because you do not know the item key until you query).

Follow-up: When would you recommend DynamoDB transactions over PostgreSQL SERIALIZABLE?

Answer: When you need transactional guarantees at a scale where a single PostgreSQL instance cannot keep up, and your transaction scope is bounded and predictable. Concrete example: a gaming platform where each match result updates the stats of 2-10 players atomically. You know exactly which player items you are updating (by their player IDs). You need this to work at 50,000 transactions per second across millions of players. PostgreSQL’s SERIALIZABLE cannot handle this throughput on a single instance, and distributed PostgreSQL (Citus, CockroachDB) adds complexity. DynamoDB handles it natively because each transaction touches a small, well-defined set of items that distribute across partitions. The rule of thumb: if your transactions are bounded, key-addressable, and high-throughput, DynamoDB transactions are an excellent fit. If your transactions are ad-hoc, query-driven, or involve complex business logic, PostgreSQL is the right tool.

A Redis Cluster node goes down and some commands start returning MOVED and ASK errors. Explain what is happening and how the client should handle it.

Difficulty: Intermediate What the interviewer is really testing: Understanding of Redis Cluster’s slot-based routing and the client-server protocol during topology changes. Strong Answer: MOVED errors mean the client sent a command to the wrong node for that key’s hash slot. The error includes the correct node address: MOVED 3999 127.0.0.1:6381. This happens when:
  • The client’s slot-to-node mapping is stale (a node went down and slots were reassigned).
  • The client has not yet discovered the cluster topology.
The correct client behavior: update the internal slot map by calling CLUSTER SLOTS or CLUSTER NODES, then re-send the command to the correct node. Smart Redis clients (like ioredis, redis-py-cluster) handle this transparently. They cache the slot map and refresh it on MOVED errors. ASK errors mean a slot is in the process of being migrated from one node to another. The slot is partially on the old node and partially on the new node. The error says: “this specific key has already been migrated to the new node, try there.” The error format is ASK 3999 127.0.0.1:6382. The correct client behavior for ASK is different from MOVED:
  1. Send ASKING to the target node (this tells the target to accept a command for a slot it does not yet fully own).
  2. Re-send the original command to the target node.
  3. Do NOT update the slot map. Unlike MOVED, ASK is temporary. The slot is still being migrated. Future commands for other keys in this slot should still go to the old node (if those keys have not migrated yet).
The node-down scenario: When a node goes down, its slots become temporarily unavailable. If the node was a primary, its replica takes over via failover (automatic if cluster-node-timeout is configured). During the failover window (typically 15-30 seconds), commands for that node’s slots fail with CLUSTERDOWN errors. After failover, the client receives MOVED redirections to the new primary. What most people miss: During failover, only the slots owned by the failed node are affected. Commands for slots on healthy nodes continue working normally. This is partial availability — better than total unavailability, but your application must handle errors for the affected key range.

Follow-up: What happens to in-flight commands when a Redis Cluster node crashes mid-pipeline?

Answer: If a client has sent a pipeline of 100 commands and the node crashes after processing 50:
  • The first 50 commands were executed. If they were writes, those writes are in the node’s memory and AOF (if enabled). Whether they survive depends on the persistence configuration and whether the replica received them before the crash.
  • The remaining 50 commands get a connection error. The client receives errors for these commands.
  • The client cannot know exactly which commands succeeded and which did not without checking the data. This is the pipeline atomicity problem — pipelines are NOT transactions. Each command in a pipeline is independent.
This is why critical multi-step operations should use MULTI/EXEC transactions (which are atomic on a single node) or Lua scripts, not pipelines. Pipelines are for performance optimization of independent commands, not for ensuring atomicity.

Your PostgreSQL database is approaching transaction ID wraparound. Explain the urgency and your remediation plan.

Difficulty: Senior What the interviewer is really testing: Deep understanding of one of PostgreSQL’s most dangerous operational scenarios, and whether the candidate has the judgment to handle a high-pressure situation. Strong Answer: This is a five-alarm fire. Transaction ID wraparound is the one PostgreSQL failure mode that can cause the database to shut down and refuse all writes — and it is completely preventable. Why it happens: PostgreSQL uses 32-bit transaction IDs (roughly 4 billion values, but effectively 2 billion in each “direction” due to modular arithmetic). Every transaction consumes one ID. VACUUM FREEZE is responsible for marking old transactions as “frozen” — removing their dependency on the transaction ID counter. If VACUUM FREEZE cannot keep up (because of long-running transactions, disabled autovacuum, or massive tables), the database runs out of usable transaction IDs. The warning signs (check immediately):
-- Check how close each database is to wraparound
SELECT datname, age(datfrozenxid) AS xid_age,
       2000000000 - age(datfrozenxid) AS remaining
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

-- Check per-table freeze status
SELECT schemaname, relname, age(relfrozenxid) AS xid_age
FROM pg_stat_user_tables
ORDER BY age(relfrozenxid) DESC
LIMIT 20;
Severity levels:
  • age(datfrozenxid) > 200 million: investigate why autovacuum is not freezing. Tune it.
  • age(datfrozenxid) > 1 billion: urgent. Autovacuum should be running aggressively. Check for blockers.
  • age(datfrozenxid) > 2 billion - 1 million: critical. PostgreSQL will start emitting warnings in the logs and will eventually enter read-only “safety mode.” Immediate intervention required.
Remediation plan: Immediate (within minutes):
  1. Check pg_stat_activity for long-running transactions and terminate them. Any open transaction holds back the freeze horizon.
  2. Check if autovacuum is running on the offending tables. If not, start a manual VACUUM FREEZE on the table with the oldest relfrozenxid.
  3. Increase autovacuum_freeze_max_age awareness — this is the threshold where autovacuum becomes aggressive about freezing. Default is 200 million. If your tables are well past this, autovacuum should already be trying, but something is blocking it.
Short-term (hours):
  1. Run VACUUM FREEZE manually on the worst tables. On a huge table, this can take hours. Use VERBOSE to monitor progress.
  2. Increase autovacuum_max_workers and reduce autovacuum_vacuum_cost_delay to let autovacuum work faster across all tables.
  3. If the database is in read-only mode: the only option is running VACUUM FREEZE. No writes are possible until the freeze completes. This is why prevention is critical.
Long-term prevention:
  • Monitor age(datfrozenxid) as a Tier 1 metric. Alert at 500 million, page at 1 billion.
  • Tune autovacuum aggressively on high-write tables (lower autovacuum_vacuum_scale_factor, higher autovacuum_vacuum_cost_limit).
  • Set idle_in_transaction_session_timeout to prevent long-running transactions from blocking VACUUM.
  • For very large tables (hundreds of millions of rows), partition them. VACUUM FREEZE runs per-partition, making it more manageable.

Follow-up: Why can long-running transactions prevent VACUUM FREEZE from doing its job?

Answer: VACUUM can only freeze a transaction ID if no active transaction could possibly need to see the row version that transaction created. PostgreSQL determines this using the oldest active transaction’s snapshot. If a transaction started 3 days ago and is still open (even if idle), PostgreSQL must preserve all row versions created since that transaction started. VACUUM cannot freeze any transaction ID newer than that transaction’s ID. This means 3 days worth of transaction IDs remain unfrozen, accumulating toward the wraparound limit. A single idle-in-transaction session can hold back the entire freeze horizon. On a busy system processing 10,000 transactions per second, a 3-day idle transaction accumulates 2.6 billion unfrozen transaction IDs — very close to the wraparound limit on its own. This is why idle_in_transaction_session_timeout is one of the most important PostgreSQL production settings. The default is 0 (disabled). Set it to something aggressive like 5 minutes. Developers who leave transactions open for debugging should use a separate connection with a higher timeout, not the production pool.

Compare how PostgreSQL, MongoDB, and DynamoDB handle a scenario where you need to count the total number of items matching a condition across your entire dataset.

Difficulty: Intermediate What the interviewer is really testing: Understanding of how each database’s architecture affects a seemingly simple operation, and why “just do a COUNT” is not the same across databases. Strong Answer: This is a great question because it exposes fundamental architectural differences through a deceptively simple operation. PostgreSQL: SELECT COUNT(*) FROM orders WHERE status = 'shipped' PostgreSQL must actually scan every matching row to count it — there is no pre-computed count stored anywhere. With an index on status, it does an Index Scan (or Bitmap Index Scan for moderate selectivity). Without an index, it does a full Sequential Scan. On a 500M row table, this can take minutes. Why no stored count? Because of MVCC. Different transactions see different row versions. A row that is visible to Transaction A might be invisible to Transaction B (it was inserted after B’s snapshot). So there is no single “correct” count — it depends on the transaction’s snapshot. PostgreSQL must check visibility for each row. Optimization: Maintain a materialized count in a separate table, updated by triggers or periodic refresh. Or use pg_stat_user_tables.n_live_tup for an approximate count (fast but imprecise, based on the last ANALYZE). MongoDB: db.orders.countDocuments({ status: "shipped" }) Similar to PostgreSQL — MongoDB scans the matching documents using an index (if available) or a collection scan. The countDocuments() method actually runs an aggregation pipeline ($match + $group with $sum). It is accurate but can be slow on large collections. MongoDB offers estimatedDocumentCount() which reads the collection metadata for an instant but approximate total count (it does not support filters). DynamoDB: There is no efficient COUNT. DynamoDB does not have a COUNT operation in the traditional sense. You have two bad options:
  1. Query with Select: 'COUNT' — this still reads every matching item (consuming read capacity for each item), it just does not return the data. On millions of items, this is expensive and slow.
  2. Scan with a FilterExpression — even worse. Reads every item in the table, applies the filter, and counts matches. For a billion-item table, this costs thousands of dollars in read capacity.
The DynamoDB way: Maintain a counter item. PK=COUNTER, SK=shipped_orders, updated atomically with ADD on every order shipment. Reads are instant (GetItem). The trade-off is that the counter must be maintained by the application — if the update fails or the application has a bug, the count drifts. Reconcile periodically with a full scan job (run during off-peak, accept the cost). The architectural lesson: PostgreSQL and MongoDB trade write simplicity for read flexibility — you can count anything, anytime, but it costs a scan. DynamoDB trades read flexibility for write and read performance at scale — you must pre-compute any aggregation you need.

Follow-up: What if you need real-time counts displayed on a dashboard — say, “orders shipped today” updating every second?

Answer: This changes the problem from “how to count” to “how to maintain a running count efficiently.” PostgreSQL approach: Use a materialized counter table with a trigger. On every INSERT INTO orders with status = 'shipped', a trigger increments the counter in a daily_stats table. The dashboard queries the counter table (instant). The risk: trigger contention under high write volume — all inserts compete for a lock on the same counter row. Mitigate with multiple counter rows (shard the counter by, say, order_id % 10, sum the 10 rows on read). Redis approach (any source database): Maintain a Redis counter with INCR shipped_today. Reset it at midnight with a TTL or a cron job. The application increments the counter after each successful order shipment write. The dashboard reads from Redis (sub-millisecond). This decouples the counting from the database and works regardless of which database stores the orders. DynamoDB approach: The counter item (PK=COUNTER#2024-04-09, SK=shipped) updated with UpdateExpression: 'ADD shipped_count :one'. DynamoDB’s ADD is atomic, so concurrent updates do not lose counts. The dashboard reads this single item. For all three, the real-time dashboard reads a pre-computed counter, not a live aggregate. This is the Computed Pattern — trade write complexity for read speed. No production system refreshes a COUNT(*) query every second against a large table.

Going Deeper: The counter approach can drift if the application crashes between writing the order and incrementing the counter. How do you handle this?

Answer: This is the classic dual-write consistency problem. Three approaches:
  1. Transactional outbox. In PostgreSQL, update both the orders table and the counter table in the same transaction. They are either both committed or both rolled back. Atomic, no drift. This is the simplest and most correct approach when both writes are in the same database.
  2. Event-driven counter. Do not increment the counter in the application. Instead, use CDC (DynamoDB Streams, PostgreSQL logical replication, MongoDB Change Streams) to capture the order insert event and have a consumer increment the counter. The consumer processes events with at-least-once semantics, so make the counter update idempotent (e.g., track the last processed event ID and skip duplicates).
  3. Periodic reconciliation. Accept that the counter might drift and run a reconciliation job hourly that does the full COUNT(*) scan and corrects the counter. For a dashboard that says “~15,234 orders shipped today,” being off by a handful is acceptable. For a financial report, it is not.
The choice depends on how accurate the count needs to be and how much complexity the team can absorb. For dashboards, option 3 is usually sufficient. For billing or compliance, option 1 or 2 is necessary.

Advanced Interview Scenarios

These questions are designed to break candidates out of textbook thinking. Several have answers where the obvious approach is wrong. They test the kind of judgment that only comes from operating databases in production — where the clean theory meets the messy reality of traffic spikes, human error, cost constraints, and 3 AM pages.

Your application’s p99 latency spikes every day at 2 AM. Nothing in the application code runs at that time. The database is PostgreSQL. What is happening?

Difficulty: Senior What the interviewer is really testing: Whether the candidate understands that databases have background maintenance processes with real performance impact, and whether they can trace a symptom back to a non-obvious root cause. What weak candidates say: “Maybe there is a cron job running. I would check the cron tab.” They stay at the application layer and do not consider the database’s own internal processes. Some suggest “maybe it is backups” without explaining how a backup causes latency spikes or how to verify. What strong candidates say: The way I think about this is: what happens at 2 AM that does not involve application code? Three PostgreSQL-internal culprits:
  • Autovacuum anti-wraparound. PostgreSQL has a hard-coded escalation: when a table’s relfrozenxid age exceeds autovacuum_freeze_max_age (default 200 million), autovacuum ignores its normal cost-based throttling and runs at full speed to prevent transaction ID wraparound. This “emergency autovacuum” consumes massive I/O. On a busy table, this can saturate the disk for 30-60 minutes. Check pg_stat_user_tables for last_autovacuum timestamps clustering around 2 AM, and look at pg_stat_progress_vacuum during the spike.
  • Checkpoint storms. PostgreSQL’s checkpointer writes all dirty pages from shared buffers to disk periodically (controlled by checkpoint_timeout, default 5 minutes, and max_wal_size). If a large batch of dirty pages accumulated during daytime activity and the checkpoint fires during a low-traffic window, the sudden burst of disk writes spikes latency for concurrent queries. Check pg_stat_bgwriter for checkpoints_req (forced checkpoints, which are the disruptive kind) vs checkpoints_timed. Tune checkpoint_completion_target to 0.9 to spread checkpoint writes over a longer window.
  • RDS or cloud-provider maintenance windows. On AWS RDS, minor engine patches, OS patching, and multi-AZ failover tests often run in the default maintenance window (which many teams never change from the 2-4 AM default). A multi-AZ failover causes a 15-30 second connection drop. Check the AWS RDS Events console.
War Story: At a fintech company running PostgreSQL 13 on RDS, we saw p99 spikes every night at 2:17 AM. The team spent two weeks blaming a Lambda function that ran at 2 AM. It turned out to be autovacuum anti-wraparound on the transactions table — 800 million rows, autovacuum_freeze_max_age at the default 200 million. The emergency vacuum was ignoring cost delays and saturating the EBS volume at 3,000 IOPS for 45 minutes. We fixed it by reducing autovacuum_vacuum_scale_factor to 0.01 on that table so regular autovacuum ran more frequently, preventing the age from ever reaching the emergency threshold. The p99 spikes disappeared completely. The monitoring tool that caught it was pg_stat_progress_vacuum combined with CloudWatch VolumeWriteOps — the correlation was unmistakable.

Follow-up: How would you distinguish between a checkpoint storm and an autovacuum storm using only pg_stat views?

Answer: Check pg_stat_bgwriter. The checkpoints_req counter increments when a forced checkpoint occurs (triggered by max_wal_size being reached). If that counter is climbing during the spike, it is a checkpoint. For autovacuum, check pg_stat_progress_vacuum — it shows which table is being vacuumed, the phase (scanning heap, vacuuming indexes, etc.), and progress. Also look at pg_stat_activity for backend processes with backend_type = 'autovacuum worker' and wait_event_type = 'IO'. If autovacuum workers are in I/O wait states, they are the ones hammering the disk. You can also correlate with pg_stat_user_tables.autovacuum_count and last_autovacuum to see which tables were vacuumed during the window.

Follow-up: The team wants to “just disable autovacuum on that table to stop the spikes.” Why is this dangerous?

Answer: Disabling autovacuum on a high-write table is like disabling the fire alarm because the beeping annoys you. Without autovacuum: dead tuples accumulate indefinitely, bloating the table (a 50 GB table can grow to 500 GB). Index bloat makes every query slower. Most critically, the relfrozenxid age keeps climbing toward the 2-billion hard limit. When it gets within 10 million of the limit, PostgreSQL enters emergency mode and starts logging warnings. When it hits the limit, PostgreSQL shuts down all writes to the entire database cluster — not just that table, every table — and refuses to process any transaction until someone manually runs VACUUM FREEZE. I have seen this take a production database offline for 6 hours on a 1 TB table. The correct fix is never to disable autovacuum — it is to tune it so it runs frequently in small increments instead of rarely in one massive burst.

You inherit a MongoDB cluster that is sharded on { created_at: 1 }. The team says queries are slow. What is your diagnosis before you look at a single query?

Difficulty: Senior What the interviewer is really testing: Whether the candidate recognizes a fundamentally flawed shard key on sight, and can explain the cascading consequences without needing to investigate further. What weak candidates say: “I would look at the slow query log and see which queries are slow, then add indexes.” They treat it as a query optimization problem and miss that the architecture itself is broken. What strong candidates say: I already know the primary problem before looking at a single query: the shard key is monotonically increasing. Every new document has a created_at value greater than all previous documents. That means every insert goes to the same shard — the one that owns the highest chunk range. You have N shards, but only one of them is receiving writes. The other N-1 shards are idle for writes and only serve reads for historical data. This causes three cascading problems:
  1. Write hot shard. One shard absorbs 100% of write traffic. It becomes the bottleneck. Adding more shards does not help — the writes still go to the newest chunk range. The moveChunk balancer constantly tries to migrate chunks from the hot shard to cold shards, but it cannot keep up with the ingest rate. You see balancerStatus showing continuous chunk migrations and moveChunk operations consuming I/O on the hot shard.
  2. Scatter-gather reads. Any query that filters by something other than created_at (which is almost every query — “find me user X’s documents,” “find documents with status Y”) must scatter to all shards because the shard key tells MongoDB nothing about where user X’s data lives. These scatter-gather queries get linearly slower as you add shards. The team adds shards for capacity but queries get slower — the opposite of what they expected.
  3. Jumbo chunks. If many documents are created within the same second (high-throughput systems), they share the same created_at value and land in the same chunk. That chunk cannot be split because all its documents have the same shard key value. It becomes a jumbo chunk that cannot be migrated, further imbalancing the cluster.
War Story: An IoT platform I consulted for had sharded their sensor readings collection on { timestamp: 1 }. 200 million inserts per day, all hitting one shard. They had 12 shards but write throughput was identical to a single node. The balancer was consuming 40% of the hot shard’s I/O budget just migrating chunks. The fix was a reshard to { device_id: 1, timestamp: 1 } — a compound shard key where device_id provides cardinality and distribution, while timestamp in the sort key gives chronological ordering within each device’s data. Queries like “get device X’s readings for the last hour” went from scatter-gather across 12 shards to a single targeted query. Write throughput increased 9x because writes distributed across all shards. The migration took 3 weeks of dual-writing to a new collection, but the team said they wished they had done it six months earlier.

Follow-up: How do you actually migrate to a new shard key on a live MongoDB cluster without downtime?

Answer: Since MongoDB 5.0, there is reshardCollection which performs an online reshard. Under the hood, it creates a new sharded collection, clones existing data using a background process, catches up with ongoing writes via change streams, and then atomically swaps the collection namespace. During the cutover (typically sub-second), writes are briefly paused. In practice, I have seen the background cloning phase take days for multi-terabyte collections, and it doubles your storage temporarily. You need to monitor disk usage carefully. For MongoDB versions before 5.0, or when you need to restructure the schema during migration, the pattern is: create a new collection with the correct shard key, set up a change stream on the old collection to capture live writes, backfill historical data from old to new using a batch job, then switch the application to the new collection once the change stream consumer has caught up. The challenge is the cutover — you need to drain the change stream buffer to zero lag, switch the application atomically (typically a feature flag), and have a rollback plan.

Follow-up: The team pushes back — “we chose created_at because all our queries filter by time range.” Is their reasoning valid?

Answer: Their reasoning is partially valid but still wrong. Yes, filtering by time range is important. But a shard key serves two purposes: query routing and write distribution. A good shard key optimizes both. { created_at: 1 } optimizes routing for time-range queries but catastrophically fails at write distribution. The correct approach is a compound shard key where the first element provides distribution and the second provides range-query efficiency. For example, { tenant_id: 1, created_at: 1 } distributes writes across tenants (good cardinality) while still allowing time-range queries within a tenant to hit a single shard. If there are no natural high-cardinality prefixes, use a hashed compound key: { hashed_created_at: "hashed", created_at: 1 } — but this loses time-range routing, so you are back to scatter-gather for range queries. There is no perfect shard key; there are only trade-offs that match your access patterns. The “all queries filter by time” argument ignores the write side of the equation, which is the classic mistake.

A Redis Lua script that implements a distributed lock is intermittently failing to release the lock, causing downstream services to hang for 30 seconds (the lock TTL). How do you debug this?

Difficulty: Staff-Level What the interviewer is really testing: Practical debugging of distributed locking, understanding of Redis atomicity guarantees, and awareness of the subtle failure modes that make distributed locks treacherous. What weak candidates say: “I would check if the Redis server is overloaded. Maybe increase the TTL.” They do not understand the actual failure modes of distributed locks and suggest making the problem worse (longer TTL means longer hangs on failure). What strong candidates say: This is a classic distributed lock lifecycle bug. Let me walk through the failure modes in order of likelihood: 1. Lock ownership mismatch on release. The most common bug. Service A acquires lock with token “abc123” and TTL 30s. Service A takes 35 seconds to complete the work (GC pause, slow downstream call, CPU contention). The lock expires at 30s. Service B acquires the same lock with token “xyz789”. Service A finishes at 35s and releases… Service B’s lock. Now there is no lock and a third service enters the critical section. The “intermittent failure to release” is actually Service A trying to release a lock it no longer owns and the DEL either deletes B’s lock (if no ownership check) or fails silently (if the Lua script checks ownership but the token does not match). The fix: the release Lua script must compare the lock value (a unique token per acquisition) before deleting:
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
else
  return 0
end
And the TTL must be longer than the maximum realistic operation time, including tail latency. If p99 of the protected operation is 20s, the lock TTL should be at least 60s, not 30s. Better yet, implement a lock renewal (watchdog) thread that extends the TTL while the operation is in progress — this is what Redisson does with its watchdog pattern. 2. Clock drift in Redlock. If using the Redlock algorithm across multiple Redis instances, clock drift between the client and Redis servers can cause the client to believe it still holds the lock when the lock has already expired from Redis’s perspective. Martin Kleppmann’s critique specifically targets this: a process pause (GC, page fault) between acquiring the lock and using it can exceed the TTL. 3. Network partition during release. The release command is sent but the TCP connection to Redis is broken. The client does not receive the confirmation. Depending on the client library’s retry behavior, the release may never be retried (the library treats the command as failed and does not retry writes by default). The lock sits in Redis until TTL expiry. War Story: At a payments company, we had a distributed lock protecting idempotency key deduplication. The lock TTL was 10 seconds. About once every 2,000 requests, the downstream payment processor took 12-15 seconds to respond (their p99.9 tail latency). Our service would finish processing, try to release the lock, but the lock had already expired and been re-acquired by a retry. The release Lua script returned 0 (token mismatch), we logged it as a warning but did not act on it, and the retry created a duplicate charge. The fix was three-fold: (1) implement a lock renewal thread that extended the TTL every 5 seconds while the operation was in progress, (2) add a circuit breaker on the downstream call so we would not wait 15 seconds, and (3) add a secondary idempotency check at the database level. The lock went from being the sole correctness mechanism to one layer of defense.

Follow-up: The team says “just use a PostgreSQL advisory lock instead of Redis — it is simpler and has real transactions.” When is this the right call?

Answer: PostgreSQL advisory locks are underrated and often the correct choice. SELECT pg_advisory_lock(hash_of_key) acquires a lock that is automatically released on transaction end or session end. No TTL management, no ownership tokens, no Lua scripts. They are transactional: if the transaction rolls back, the lock releases. Use PostgreSQL advisory locks when: (1) all lock participants already connect to the same PostgreSQL instance, (2) the lock contention is moderate (not thousands of lock acquisitions per second), and (3) you need the lock’s lifecycle tied to a database transaction. Use Redis locks when: (1) lock participants are different services that do not share a database, (2) you need sub-millisecond lock acquisition (PostgreSQL round-trip is 1-5 ms vs Redis 0.1-0.5 ms), (3) lock throughput is very high, or (4) the protected operation spans multiple systems and cannot be wrapped in a database transaction. The honest truth is that most teams using Redis for distributed locks would be better served by PostgreSQL advisory locks, because most lock scenarios involve protecting a database operation that is already running inside a PostgreSQL transaction. Adding Redis as a separate lock coordinator introduces a new failure mode (Redis unavailability) and the dual-system consistency problem.

Follow-up: How would you monitor distributed lock health in production?

Answer: Four metrics I track:
  1. Lock acquisition latency (p50, p99). If this starts climbing, it means contention is increasing or the Redis cluster is under pressure.
  2. Lock hold duration vs TTL ratio. If the median hold time is 8 seconds and TTL is 10 seconds, you are dangerously close to TTL expiry during normal operations. Alert when the p99 hold duration exceeds 50% of TTL.
  3. Lock release failure rate. The percentage of release operations where the ownership token does not match (the lock expired and was re-acquired). This is the leading indicator of correctness bugs. Any non-zero rate should trigger investigation.
  4. Lock contention rate. How often an acquisition attempt is blocked (lock already held). High contention means either the critical section is too slow or too many processes are competing. Consider sharding the lock (per-user instead of global) or redesigning the critical section to be shorter.
We built a Grafana dashboard at my previous company with these four metrics per lock name. The most actionable alert was “lock release failure rate > 0.1% over 5 minutes” — it caught every TTL-related bug before it became a customer-facing issue.

Your DynamoDB bill tripled last month. No one deployed new features. Find the root cause.

Difficulty: Senior What the interviewer is really testing: Cost awareness in production database operations, ability to use AWS tooling for forensics, and understanding of DynamoDB’s pricing model beyond the basics. What weak candidates say: “Switch to on-demand mode to save money.” This shows they do not understand DynamoDB pricing — on-demand mode is 6.5x more expensive per request at steady state. Or they say “look at the AWS bill” without knowing which specific DynamoDB metrics to investigate. What strong candidates say: DynamoDB has five cost axes, and the bill increase must come from one or more of them. I would investigate in this order: 1. Read/Write capacity consumption. Check CloudWatch ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits for the table and all GSIs. Compare with the previous month. If consumed capacity increased but no new features deployed, look for: (a) a change in traffic patterns (seasonal growth, a marketing campaign, a bot), (b) a batch job that was accidentally left running or running more frequently than before, (c) a retry storm caused by throttling (application retries create a positive feedback loop — throttling causes retries, retries increase consumption, which causes more throttling). 2. GSI cost. This is the most common surprise cost. Every write to the base table is replicated to every GSI that projects the modified attribute. If someone added a GSI with ProjectionType: ALL on a high-write table, that GSI now doubles the write cost for that table. Check ConsumedWriteCapacityUnits per GSI. Also check for GSI throttling backpressure — if the GSI is throttled, it slows down base table writes, increasing request latency, which can increase concurrent request count and overall capacity consumption. 3. DynamoDB Streams. Streams are free to enable, but the Lambda (or other consumer) invocations triggered by streams cost money. If stream processing Lambda functions are failing and retrying (DynamoDB Streams retries failed batches until they succeed or expire), you could have millions of extra Lambda invocations. Check the Lambda concurrent executions and error rates for stream consumers. 4. Data storage. DynamoDB charges 0.25/GB/monthforstorage.Ifstoragegrewfrom100GBto1TB(maybeTTLwasaccidentallydisabledandolddataisnotbeingcleanedup),thatisa0.25/GB/month for storage. If storage grew from 100 GB to 1 TB (maybe TTL was accidentally disabled and old data is not being cleaned up), that is a 225/month increase. Check the table’s ItemCount and TableSizeBytes in the DynamoDB console. Also check GSI storage — each GSI stores its own copy of projected attributes. 5. Data transfer. DynamoDB charges for cross-region data transfer (Global Tables replication), VPC endpoint data processing, and data returned to the client. If ProjectionExpression is not used and items are large, read costs include both capacity units and data transfer. A code change that removed ProjectionExpression from a high-traffic query could significantly increase costs. War Story: A logistics company I worked with saw their DynamoDB bill go from 4,200/monthto4,200/month to 14,800/month in one month. No new features. The root cause was a seemingly innocent change: a developer added a new GSI to enable a back-office search feature. The GSI had ProjectionType: ALL on their shipments table, which had 50,000 writes per second. That single GSI doubled the write capacity consumption for the entire table. The GSI itself was only queried 10 times per day — a 10,000/monthGSIforafeatureusedby3people.ThefixwaschangingtheGSIprojectiontoKEYSONLY(thesearchonlyneededtheshipmentID)andpayingforindividualGetItemcallsonthebasetablefortheoccasionaldetailfetch.Thebilldroppedby10,000/month GSI for a feature used by 3 people. The fix was changing the GSI projection to `KEYS_ONLY` (the search only needed the shipment ID) and paying for individual `GetItem` calls on the base table for the occasional detail fetch. The bill dropped by 9,000/month.

Follow-up: How would you set up cost alerting to catch this before the monthly bill arrives?

Answer: Three layers:
  1. AWS Budgets with a DynamoDB-specific budget. Set a monthly budget for DynamoDB and configure alerts at 50%, 80%, and 100% thresholds with SNS notifications to Slack or PagerDuty. This catches overall cost growth.
  2. CloudWatch per-table alarms. Create alarms on ConsumedWriteCapacityUnits and ConsumedReadCapacityUnits at the table level and GSI level. Alert when any metric exceeds 120% of the 7-day rolling average. This catches per-table anomalies before they aggregate into a bill surprise.
  3. AWS Cost Explorer with daily granularity. Set up a daily report filtered to DynamoDB. Enable “group by resource” to see cost per table. The bill tripling happened over 30 days — with daily cost monitoring, you would have caught it on day 2 when the daily cost jumped from 140to140 to 420.
The meta-lesson: database cost monitoring should be as granular as performance monitoring. A cost anomaly is often a symptom of an operational issue (retry storms, missing TTL, over-projected GSIs).

Follow-up: The team is on provisioned capacity and is considering switching to on-demand to “never think about capacity again.” Walk them through the trade-offs.

Answer: On-demand mode pricing is approximately 6.5x more expensive per request than provisioned capacity at steady state. For a table consuming 10,000 WCU steady-state, provisioned costs roughly 4,700/month.Ondemandforthesamethroughputcostsroughly4,700/month. On-demand for the same throughput costs roughly 30,500/month. On-demand makes financial sense in three scenarios: (1) truly unpredictable traffic with extreme spikes (0 to 50,000 WCU in minutes), (2) development and staging environments with minimal traffic, and (3) new tables where you do not yet know the access pattern. For production tables with reasonably predictable traffic, provisioned capacity with auto-scaling is almost always cheaper. Set the target utilization to 70%, minimum capacity to your off-peak baseline, and maximum capacity to your peak with headroom. Auto-scaling adjusts within 1-2 minutes, which handles most traffic variations. The hybrid approach: use provisioned with auto-scaling for steady-state tables, and on-demand for event-driven or burst-heavy tables. DynamoDB lets you switch between modes, but there is a 24-hour cooldown after switching from on-demand back to provisioned.

You are designing a system that needs to keep PostgreSQL, Redis, and Elasticsearch in sync. When a user updates their profile, the relational data, the cache, and the search index all need to reflect the change. How do you ensure consistency?

Difficulty: Staff-Level What the interviewer is really testing: Understanding of the fundamental impossibility of consistent dual-writes across independent systems, and practical patterns for achieving eventual consistency with acceptable trade-offs. What weak candidates say: “I would write to all three in the same API request handler. If any write fails, roll back the others.” This is the distributed transaction trap — there is no rollback mechanism across PostgreSQL, Redis, and Elasticsearch. If the PostgreSQL write succeeds, the Redis write succeeds, and the Elasticsearch write fails, you cannot roll back Redis (the DEL might itself fail). They also do not address the case where the application process crashes between write 2 and write 3. What strong candidates say: There is no way to make three independent systems atomically consistent without a distributed transaction coordinator, and implementing 2PC across PostgreSQL, Redis, and Elasticsearch is not practical. The real question is: what consistency guarantees does the business actually need, and what is the cheapest pattern that satisfies them? Pattern 1: Transactional Outbox (my default recommendation). Write the profile update AND an “outbox” event to PostgreSQL in the same transaction. A separate process (Debezium polling the outbox table, or a PostgreSQL LISTEN/NOTIFY consumer) reads the outbox events and fans them out to Redis and Elasticsearch. This guarantees that if the profile update committed, the event was also committed. The downstream updates are eventually consistent — typically within 100-500ms. Implementation: the profile_updates outbox table has columns id, aggregate_id, event_type, payload, processed_at. A worker polls for processed_at IS NULL, updates Redis (invalidate or update the cached profile), updates Elasticsearch (index the new profile), and marks the event as processed. Idempotent processing is essential because the worker might crash after updating Redis but before updating Elasticsearch. Pattern 2: CDC with Debezium. Skip the outbox table. Use Debezium to capture WAL changes directly from PostgreSQL and publish them to Kafka. Separate consumers read from Kafka: one updates Redis, another updates the Elasticsearch index. Debezium guarantees exactly-once publication from WAL to Kafka. The consumers handle idempotent processing. This is cleaner (no outbox table, no polling) but adds Kafka and Debezium to the operational surface area. For a team already running Kafka, this is the preferred approach. For a team that does not, the outbox pattern is simpler to start with. Pattern 3: Eventual invalidation (simplest for caching). For Redis specifically: do not try to keep the cache in sync. Just invalidate the cache on write (DEL user:profile:123). The next read misses the cache, fetches from PostgreSQL, and repopulates. This is the cache-aside pattern. It is eventually consistent (the stale cache entry is gone immediately, the fresh entry appears on the next read) and requires zero event infrastructure. For Elasticsearch: invalidation is not practical because search needs the data indexed. Use the outbox or CDC pattern for Elasticsearch, and simple invalidation for Redis. War Story: An e-commerce platform I worked on tried Pattern 1 — writing to PostgreSQL, Redis, and Elasticsearch sequentially in the request handler. One day, the Elasticsearch cluster went into a 45-second recovery (a node restarted due to OOM). During those 45 seconds, every profile update API call hung for 30 seconds (the Elasticsearch client timeout), then returned a 500 error to the user. But the PostgreSQL write had already committed — so the profile was updated in the database but the user saw an error. The Redis cache had the old data. Search had the old data. The user retried, PostgreSQL returned “nothing changed” (idempotent), Redis and Elasticsearch eventually updated, but the user was confused and support tickets piled up. We migrated to the outbox pattern within the week. The API now writes to PostgreSQL only (latency: 5ms), returns success immediately, and the async worker handles Redis and Elasticsearch. Elasticsearch downtime no longer affects user-facing latency.

Follow-up: The outbox table is growing fast because events are being produced faster than the worker can process them. How do you scale the worker?

Answer: Three approaches:
  1. Partition the outbox by aggregate ID. Run multiple worker instances, each responsible for a range of aggregate IDs (consistent hashing). This parallelizes processing while guaranteeing ordering within each aggregate (a user’s profile updates are processed in order by one worker). This is essentially the Kafka consumer group pattern implemented at the outbox level.
  2. Batch processing. Instead of processing one event at a time, the worker reads 100 events, batches the Redis updates (PIPELINE), batches the Elasticsearch updates (_bulk API), and marks all 100 as processed in one UPDATE statement. Batching can improve throughput 10-50x by amortizing network round-trips.
  3. Switch to CDC. If the outbox polling pattern hits its scaling limit, migrate to Debezium + Kafka. Kafka’s partitioned consumer groups provide natural parallelism, backpressure handling, and replay capability. This is the “graduated” version of the outbox pattern.
The key constraint: within a single aggregate (one user’s profile), events must be processed in order. You cannot process event 5 (set name to “Jane”) before event 4 (set name to “John”) — you would end up with stale data. Parallelism is safe across aggregates but dangerous within one.

Follow-up: How do you detect when PostgreSQL, Redis, and Elasticsearch have drifted out of sync?

Answer: A reconciliation job that runs hourly (or daily, depending on the acceptable drift window):
  1. Sample-based comparison. Query a random sample of 1,000 user IDs from PostgreSQL. For each, fetch the Redis cached version and the Elasticsearch indexed version. Compare key fields (name, email, updated_at). Log discrepancies.
  2. Checksum comparison. For each user, compute a hash of the canonical fields in PostgreSQL. Store the hash as a field in the Redis entry and the Elasticsearch document. The reconciliation job queries Elasticsearch for documents where the stored hash does not match the current PostgreSQL hash. This is more efficient than fetching and comparing full records.
  3. Event lag monitoring. Track the outbox’s oldest unprocessed event. If it is more than N minutes old, alert. This is the leading indicator — if the worker is behind, drift is accumulating. Monitor processed_at - created_at as a latency metric in your dashboard.
At one company, we ran the sample comparison as a nightly job and found a drift rate of 0.02% — caused by the idempotent processing incorrectly deduplicating events when two rapid updates had the same updated_at timestamp (sub-second precision). We switched to using the outbox event ID for deduplication instead of timestamps, and the drift dropped to zero.

Your PostgreSQL table has 200 million rows. An engineer adds range partitioning by month to speed things up. Queries are now SLOWER. What went wrong?

Difficulty: Senior What the interviewer is really testing: Whether the candidate understands that partitioning is not a universal performance improvement — it has preconditions, and violating them makes things worse. This is a question where the obvious answer (“partitioning should help”) is wrong. What weak candidates say: “Maybe the partitions are too small. Try partitioning by week instead.” They assume the partitioning strategy is correct and just needs tuning. They do not question whether partitioning was appropriate in the first place. What strong candidates say: Partitioning makes queries faster only when the query planner can eliminate partitions — skip partitions that cannot contain matching rows. If partition pruning does not happen, every query hits every partition, and the overhead of planning across N partitions makes things slower, not faster. The most likely causes: 1. Queries do not filter by the partition key. If the table is partitioned by created_at (month), but the hot queries filter by user_id or status without a created_at predicate, the planner cannot prune partitions. A query that previously did one Index Scan on a single table now does N Index Scans (one per partition) and merges results. With 60 monthly partitions, that is 60x the planning overhead and 60 index lookups instead of 1. Check EXPLAIN ANALYZE for Append nodes with all partitions listed. 2. The partition key is not in the WHERE clause but is needed for the index. Before partitioning, a single index on (user_id, created_at) served all queries efficiently. After partitioning by month, each partition has its own index. A query WHERE user_id = 42 ORDER BY created_at must query each partition’s index, get results, and merge-sort them. The planner might even fall back to sequential scans on each partition if it estimates the merge is cheaper. 3. Excessive partition count. Each partition is a separate table, and the planner considers each one during planning. With 120 monthly partitions (10 years of data), the planning time itself can exceed the query execution time for simple lookups. PostgreSQL 14+ has improved partition pruning performance, but at 500+ partitions, planning overhead is measurable. 4. Index-only scans broken. If the pre-partitioned table had a well-maintained visibility map (enabling index-only scans), partitioning resets this. Each new partition starts with an empty visibility map. Until VACUUM runs on each partition and marks pages as all-visible, index-only scans fall back to regular index scans (requiring heap fetches). On a freshly partitioned table with 200M rows, this transition period can last hours. War Story: A SaaS platform partitioned their events table (200M rows) by month to “speed up” a reporting dashboard. The dashboard queries filtered by tenant_id and event_type — never by date. After partitioning, the dashboard p99 went from 200ms to 4 seconds. The EXPLAIN ANALYZE showed an Append node with 36 child Index Scan nodes (36 monthly partitions), each scanning its own (tenant_id, event_type) index. The fix was not to remove partitioning — it was to add AND created_at >= NOW() - INTERVAL '90 days' to the dashboard queries (which was the actual business requirement anyway; they only needed recent data). With that predicate, the planner pruned 33 of 36 partitions and the p99 dropped to 150ms — even faster than pre-partitioning because each partition’s index was smaller and fit entirely in shared_buffers.

Follow-up: When is partitioning genuinely the right call, and what prerequisites must be true?

Answer: Three prerequisites for partitioning to improve performance:
  1. All hot-path queries include the partition key in their WHERE clause. This is non-negotiable. If even one critical query does not filter by the partition key, that query gets worse.
  2. The table is large enough that partition-level indexes are faster. Rule of thumb: partitioning helps when the full table’s index does not fit in shared_buffers but individual partition indexes do. A 200M row table with a 20 GB index that only has 8 GB of shared_buffers benefits from partitioning — each partition’s 500 MB index fits comfortably. A 10M row table with a 1 GB index that fits in memory does not benefit.
  3. You need lifecycle management. This is the strongest argument for partitioning: dropping old data. DROP TABLE partition_2023_01 is instant and creates zero dead tuples. DELETE FROM events WHERE created_at < '2023-02-01' on a 200M row table takes hours, creates millions of dead tuples, and triggers an aggressive autovacuum. If you have a retention policy, partitioning is almost always worth it for the lifecycle benefit alone, even if query performance stays the same.

Follow-up: Can you have a global unique index across all partitions in PostgreSQL?

Answer: No, and this is one of the most frustrating limitations. In PostgreSQL, unique indexes on partitioned tables must include the partition key. If the table is partitioned by created_at, you cannot create a unique index on just email — it must be (email, created_at). This means two rows with the same email in different partitions can both exist, violating the uniqueness you wanted. Workarounds: (1) Enforce uniqueness at the application level (check-then-insert with a retry loop). (2) Use a separate, non-partitioned “uniqueness” table with just the unique column and a foreign key to the partitioned table. (3) Redesign: if global uniqueness is critical, partition on the unique column instead (hash partitioning on email). PostgreSQL 15+ improved this somewhat with better unique constraint support, but the partition-key-in-unique-index requirement remains. This is a legitimate reason to reconsider whether partitioning is appropriate for your schema.

You run a Redis-backed leaderboard for a mobile game with 10 million players. An engineer suggests switching the Sorted Set to a PostgreSQL table with an index on score because “we need persistent rankings and Redis might lose data.” Evaluate this.

Difficulty: Senior What the interviewer is really testing: Whether the candidate can evaluate a proposal by quantifying the actual trade-offs, not just reasoning abstractly. This is another “obvious answer is wrong” question — the engineer’s concern about durability is valid but their solution destroys the feature. What weak candidates say: “Yes, durability is important, so PostgreSQL is the safer choice.” They agree with the proposal’s framing without examining the performance implications. Or they say “Redis is always better for leaderboards” without addressing the durability concern. What strong candidates say: Let me quantify this because the devil is in the numbers. Redis Sorted Set performance at 10M members:
  • ZADD (update score): O(log N) = ~23 operations deep in a skip list. In practice, 0.01-0.05 ms.
  • ZREVRANK (get player’s rank): O(log N). 0.01-0.05 ms.
  • ZREVRANGE 0 99 (top 100 leaderboard): O(log N + 100). 0.05 ms.
  • Memory: ~10M members * (~50 bytes per member + score) = ~700 MB. Fits on one instance.
PostgreSQL with btree index on score:
  • Update score: UPDATE players SET score = X WHERE id = Y. Requires an index update, WAL write, potential page split. 1-5 ms.
  • Get player rank: SELECT COUNT(*) FROM players WHERE score > (SELECT score FROM players WHERE id = Y). This is a full index scan of all rows with higher scores. For a median player (rank 5M), this scans 5 million index entries. 2-10 seconds.
  • Top 100: SELECT * FROM players ORDER BY score DESC LIMIT 100. Uses the index. 1-5 ms. This one is comparable.
The rank query is the killer. There is no efficient way in PostgreSQL to answer “what is player X’s rank among 10 million players?” without scanning. PostgreSQL does not store positional information in btree indexes — it knows the ordering but not the count of entries before a given value. You could maintain a materialized rank column, but updating it on every score change means updating millions of rows (every player below the changed score shifts by one). My recommendation: Keep the leaderboard in Redis and address the durability concern directly:
  1. Use Redis with AOF (appendfsync everysec). Maximum data loss is 1 second. For a game leaderboard, this is acceptable. If a player’s score update from 0.8 seconds ago is lost due to a crash, the impact is negligible.
  2. Use AWS MemoryDB instead of ElastiCache Redis. MemoryDB writes to a distributed, durable transaction log (similar to DynamoDB’s durability model). It is Redis-API compatible with multi-AZ durability. Zero data loss on node failure. This addresses the engineer’s concern without sacrificing Redis’s data structure advantages.
  3. If regulatory or compliance requirements demand PostgreSQL as the system of record, write scores to both PostgreSQL (source of truth) and Redis (leaderboard). PostgreSQL stores the canonical score and handles disputes/audits. Redis serves the real-time leaderboard. Reconcile periodically. This is polyglot persistence — each database doing what it does best.
War Story: A gaming studio I advised tried the “just use PostgreSQL” approach for a leaderboard with 5 million players. The rank query took 6 seconds. They tried adding a rank column and updating it with triggers, but each score change triggered an UPDATE ... SET rank = rank + 1 WHERE score < X that modified hundreds of thousands of rows. Autovacuum could not keep up. They reverted to Redis within a week. Their final architecture was Redis for the real-time leaderboard (sub-millisecond for all operations) with nightly snapshots to PostgreSQL for analytics, dispute resolution, and compliance.

Follow-up: How would you handle leaderboard sharding if the game grows to 500 million players?

Answer: 500 million members in a single Redis Sorted Set would consume roughly 35 GB of memory. A single Redis instance with 64 GB can handle it, but ZREVRANK on 500M members is O(log 500M) which is still under 1 ms. The issue is not performance but availability — a single instance is a single point of failure. Sharding approaches:
  1. Regional leaderboards. Most games have natural sharding by region (NA, EU, Asia). Each region has its own Redis instance with its own Sorted Set. Players see their regional rank. A global leaderboard is computed periodically by merging the top N from each region. This works because most players care about their regional rank; only the top 1% care about global rank.
  2. Tiered leaderboard. A top-10,000 leaderboard in one Sorted Set (updated in real-time, tiny data). Below that, bucket players into score ranges (e.g., players with scores 1000-2000). Each bucket is a separate Sorted Set on a different shard. A player’s approximate rank is: (number of players in higher buckets) + (rank within their bucket). This gives O(1) for the bucket lookup plus O(log N) within the bucket.
  3. Redis Cluster with hash tags. Shard the Sorted Set across multiple nodes using hash tags. But this defeats the purpose — Sorted Set operations are per-key, and a Sorted Set split across nodes loses its ordering guarantees. Approach 1 or 2 is preferred.

Follow-up: What happens to the leaderboard during a Redis failover? Players will see stale or missing ranks.

Answer: During failover (15-30 seconds for Sentinel, longer for manual intervention), the leaderboard is unavailable for writes. Reads may work if you read from replicas. Graceful degradation strategies:
  1. Read from replicas during failover. Configure the application to fall back to a replica for leaderboard reads if the primary is unreachable. The rank will be slightly stale (by the replication lag, typically <1 second) but available. Writes queue in the application and flush when the new primary is ready.
  2. Cache the last known leaderboard. The top-100 leaderboard changes slowly (once every few seconds at most). Cache the rendered leaderboard in the application or a CDN with a 5-second TTL. During failover, the CDN serves the slightly stale leaderboard. Players do not notice.
  3. Accept temporary unavailability for rank lookups. Individual rank queries (“my rank”) are hard to cache. During failover, return “rank unavailable, try again shortly” in the UI. Most games already handle this gracefully with a loading spinner.
The key insight: leaderboards are one of the few features where temporary staleness is completely acceptable. No player is harmed by seeing a rank that is 30 seconds old. Design the feature to degrade gracefully rather than trying to make it perfectly available.

A developer inserts 10 million documents into MongoDB in a loop, one at a time, and it takes 3 hours. They expected it to take 10 minutes. What is wrong and how do you fix it?

Difficulty: Intermediate What the interviewer is really testing: Understanding of write concern defaults, network round-trip overhead, batching, and WiredTiger write amplification. This is a common real-world scenario where the fix is simple but the understanding of why matters. What weak candidates say: “MongoDB is just slow for writes, they should use DynamoDB.” This is wrong — MongoDB can handle hundreds of thousands of writes per second on a single node. Or they say “increase the server resources” without diagnosing the actual bottleneck. What strong candidates say: Three hours for 10 million inserts means roughly 920 inserts per second. A single MongoDB node should handle 30,000-100,000 inserts per second depending on document size and hardware. The throughput is 30-100x below what the hardware can deliver. The bottleneck is not MongoDB; it is the insertion pattern. Root cause 1: One-at-a-time inserts with network round-trips. Each insertOne() is a separate network round-trip: client sends the document, waits for the server to acknowledge, then sends the next. If the network latency between the application and MongoDB is 1 ms, the theoretical maximum is 1,000 inserts/second — close to the observed 920/sec. The fix is insertMany() with batches of 1,000-5,000 documents. Each insertMany() sends the batch in one network call, amortizing the round-trip. This alone can improve throughput 50-100x. Root cause 2: Write concern overhead. The default w: 1 waits for the primary to acknowledge each write. If they changed it to w: "majority" (perhaps following a best-practices guide without understanding the cost), each write waits for majority replication acknowledgment: primary writes to journal, secondary receives and writes to journal, acknowledgment propagates back. With a 3-node replica set, this adds 2-5 ms per write. At w: "majority", even with batching, throughput caps at around 5,000-10,000 inserts/second because of the majority-acknowledgment latency. For a bulk import, w: 1 is usually acceptable (the data can be re-imported if the primary fails). Root cause 3: Index maintenance. If the collection has 5 indexes, each insert updates all 5 indexes. With 10 million documents, that is 50 million index updates. Each index update involves a btree page split and WiredTiger cache management. On a collection with many indexes, write throughput is directly proportional to the number of indexes. The fix for bulk imports: drop non-essential indexes before the import, then rebuild them afterward. Index building on 10 million documents in one pass is far faster than incrementally maintaining indexes on 10 million individual inserts. The combined fix that drops it to under 5 minutes:
// 1. Drop non-essential indexes (keep only _id and shard key)
// 2. Batch inserts
const BATCH_SIZE = 5000;
for (let i = 0; i < documents.length; i += BATCH_SIZE) {
  const batch = documents.slice(i, i + BATCH_SIZE);
  await db.collection.insertMany(batch, {
    ordered: false,  // unordered is faster (parallel server-side processing)
    writeConcern: { w: 1 }
  });
}
// 3. Rebuild indexes
await db.collection.createIndex({ userId: 1, createdAt: -1 });
The ordered: false flag is important — it tells MongoDB it can process the inserts in any order and in parallel. With ordered: true (default), MongoDB processes inserts sequentially and stops on the first error. Unordered inserts can be 2-3x faster because the server can batch internal operations. War Story: A data pipeline at a media company was inserting 50 million events per day into MongoDB using Python’s insert_one() in a for loop. It took 14 hours. The developer had been adding time.sleep(0.001) between inserts “to not overload the database.” We removed the sleep (the database was at 3% CPU utilization), switched to insert_many() with batches of 5,000, used ordered: false, and temporarily dropped 3 indexes that were not needed for the import. The same 50 million inserts completed in 22 minutes. The developer’s instinct to “be gentle with the database” was well-intentioned but cost them 13.5 hours per day.

Follow-up: What is the difference between ordered and unordered bulk writes, and when would you prefer ordered?

Answer: Ordered bulk writes process documents sequentially. If document 5,000 fails (duplicate key error, validation error), documents 5,001 through 10,000 are not attempted. The client receives an error with the count of successful inserts. This is safe when document order matters and you want to stop on the first error. Unordered bulk writes process all documents, potentially in parallel on the server. If document 5,000 fails, the server continues processing the remaining documents. The client receives a response listing all errors and all successes. This is faster because: (1) the server can group inserts by shard in a sharded cluster, (2) the server can process independent groups concurrently, and (3) a single failure does not stop the entire batch. Use ordered when: inserts must be sequential (event logs where later events depend on earlier ones being present), or when you need to stop on the first error to investigate (data quality is critical). Use unordered for bulk imports, migrations, and any scenario where individual failures should not stop the batch. Handle errors after the batch completes by inspecting the writeErrors array.

Follow-up: The developer asks “should I use mongoimport instead of writing application code for bulk loads?”

Answer: Yes, for flat-file imports (JSON, CSV, TSV), mongoimport is almost always faster than application code for a few reasons: (1) it uses optimal batch sizes internally, (2) it is compiled in Go and avoids driver serialization overhead, (3) it supports --numInsertionWorkers for parallel inserts, and (4) it handles connection pooling and retry logic optimally for bulk operations. mongoimport --uri "mongodb://..." --collection events --file events.json --jsonArray --numInsertionWorkers 4 --writeConcern "{w:1}" With 4 insertion workers on a sharded cluster, mongoimport can saturate the write throughput of all shards. For 50 million 1 KB documents, expect 15-30 minutes on a 3-node replica set with SSDs. Use application code instead when: you need to transform documents during import (enrich, validate, compute derived fields), when you need to coordinate with other systems (update a progress bar, send notifications at milestones), or when the source data is not a flat file (reading from another database, an API, or a message queue).

You are on-call and get paged: “Redis latency spiked from 0.2ms to 50ms.” What are the first five things you check, in order?

Difficulty: Senior What the interviewer is really testing: Systematic incident response for Redis, prioritized by likelihood and impact. Not just “things that can go wrong with Redis” but the specific diagnostic order an experienced engineer follows under time pressure. What weak candidates say: “I would check if the server is overloaded and maybe restart Redis.” Restarting a stateful data store as a first response is dangerous — you might lose data and you definitely lose diagnostic information. They also do not have a prioritized checklist. What strong candidates say: When Redis latency jumps from sub-millisecond to 50ms, something fundamental changed. Here is my diagnostic order, optimized for fastest time-to-diagnosis: Check 1: redis-cli --latency and SLOWLOG GET 10 (30 seconds). The SLOWLOG is the single most valuable diagnostic for Redis latency spikes. It logs every command that exceeded slowlog-log-slower-than (default 10ms). If the slowlog shows a specific command pattern (a KEYS *, a SORT on a million elements, a Lua script doing heavy computation), you have found the smoking gun. A single O(N) command on a large dataset blocks the event loop for all other clients. The most common culprit I have seen in production: an engineer added KEYS pattern:* in a health check that runs every 5 seconds. On 10 million keys, KEYS * takes 2-5 seconds during which every other command is blocked. Check 2: INFO memory — check mem_fragmentation_ratio and swap usage (15 seconds). If mem_fragmentation_ratio < 1.0, Redis is swapping. Swap turns every memory access into a potential disk operation, causing 1000x latency increase. If used_memory is near maxmemory, evictions are happening on every write. Each eviction is additional work before the actual command executes. Check 3: INFO clients — check connected_clients and blocked_clients (10 seconds). If connected_clients jumped from 500 to 50,000, the connection handling overhead itself causes latency. Each new connection consumes memory and kernel resources. A thundering herd of new connections (from a deployment, an autoscaler spinning up pods, or a retry storm) can spike latency. blocked_clients > 0 means clients are waiting on blocking commands (BLPOP, BRPOP) — check if a consumer died and a queue is backed up. Check 4: INFO persistence — check rdb_last_bgsave_status and aof_rewrite_in_progress (10 seconds). If rdb_bgsave_in_progress: 1 or aof_rewrite_in_progress: 1, Redis is forking for a background save. The fork() call itself can cause a latency spike of 10-50ms on a large dataset (Redis must copy the page table). During the save, copy-on-write amplifies memory usage, potentially causing the OOM killer to start killing processes. Check latest_fork_usec — if it is 50,000 (50ms), that is your latency spike right there. Check 5: Network and host-level metrics (60 seconds). If none of the above explain it, the problem is outside Redis. Check: (a) network latency between the application and Redis (redis-cli --latency from the app server vs from the Redis host itself — if the difference is large, it is the network), (b) CPU steal on the Redis host (if on a shared VM and the hypervisor is throttling you, top shows %st > 5%), (c) noisy neighbor on the same host (another container or process consuming I/O bandwidth), (d) kernel-level issues (dmesg for OOM events, vmstat for swap activity). War Story: At a ride-hailing company, Redis latency spiked to 200ms every hour on the hour. The slowlog was empty — no single command was slow. INFO persistence showed rdb_bgsave_in_progress: 1 at every spike. The RDB snapshot was configured to run every hour (save 3600 1). The Redis instance had 30 GB of data. The fork() call took 45ms (latest_fork_usec: 45000), and during the snapshot, the write-heavy workload triggered massive copy-on-write, causing memory pressure. The fix was disabling RDB snapshots entirely (they already had AOF enabled with appendfsync everysec) and using AOF rewrite instead, which is less disruptive because it is optimized for the current dataset rather than doing a full dump. The hourly latency spikes disappeared. We also switched the monitoring cron from checking KEYS stats:* to SCAN with a cursor, which eliminated a secondary 800ms spike that happened during monitoring.

Follow-up: How do you prevent the KEYS command from ever being used in production?

Answer: Three layers of defense:
  1. Rename the command. In redis.conf: rename-command KEYS "" (disables it entirely) or rename-command KEYS KEYS_DANGER_DO_NOT_USE. This prevents accidental use. Any code trying to use KEYS gets an error.
  2. Code review enforcement. Add a linter rule or a grep check in CI that flags any use of KEYS in application code. The alternative is always SCAN with a cursor, which returns results incrementally without blocking the event loop.
  3. Redis ACLs (Redis 6+). Create application-specific users that do not have permission for dangerous commands: ACL SETUSER appuser ~* +@all -@dangerous. The @dangerous category includes KEYS, FLUSHALL, FLUSHDB, DEBUG, and other commands that should never run in production.
At a previous company, we went further: we ran a nightly audit that used MONITOR for 10 seconds on each Redis instance and flagged any KEYS usage. This caught a third-party library that was using KEYS internally for cache namespace management. We replaced the library.

Follow-up: The latency spike lasts exactly 200ms and happens once per minute. SLOWLOG is empty. What specific Redis internal event takes exactly ~200ms?

Answer: If it is exactly periodic and the SLOWLOG is empty (meaning no individual command is slow), the two most likely causes are:
  1. Active expiry cycle. Redis runs an active key expiration cycle 10 times per second (every 100ms). In each cycle, it samples 20 keys with TTLs and deletes expired ones. If more than 25% of sampled keys are expired, it loops again (up to a time limit of 25ms by default, or 250ms if hz is tuned up). On a Redis instance with millions of keys all expiring around the same time (a common pattern when TTLs are set to round numbers like 60s, 300s, or 3600s), the expiry cycle can consume its full time budget. The solution: add jitter to TTLs (TTL = base_ttl + random(0, 30)). This spreads expirations over time instead of creating thundering herd expiry.
  2. Lazy freeing of large keys (Redis 4.0+). When a large key is deleted or expires, Redis 4.0+ can free the memory asynchronously using lazyfree-lazy-expire yes. But if this is not enabled, deleting a Hash with 1 million fields blocks the event loop for the duration of the memory deallocation (100-300ms for a 500 MB key). The SLOWLOG would not show this because the DEL command itself reports as fast — the time is spent in memory management after the command completes. Enable lazyfree-lazy-expire yes, lazyfree-lazy-server-del yes, and lazyfree-lazy-user-del yes to move large key deletion to a background thread.

Curated Resources

PostgreSQL:
  • PostgreSQL Official Documentation — Internals Chapter — the authoritative source for understanding MVCC, WAL, query planning, and the executor. Dense but unmatched in accuracy.
  • The Internals of PostgreSQL by Hironobu Suzuki — a free, deeply visual guide to PostgreSQL internals covering buffer management, WAL, VACUUM, query processing, and replication. The best free resource for learning PostgreSQL internals.
  • pganalyze Blog — production-focused articles on EXPLAIN ANALYZE, index tuning, connection management, and pg_stat_statements. Practical, not academic.
  • Crunchy Data Blog — excellent PostgreSQL operational guides covering partitioning, extensions, and real-world tuning scenarios.
MongoDB:
  • MongoDB University — free courses covering schema design, aggregation, sharding, and operations. The M320 Data Modeling course is particularly valuable.
  • MongoDB Schema Design Patterns — the official summary of the Bucket, Computed, Extended Reference, Outlier, Subset, and other patterns. Bookmark this.
  • MongoDB Developer Center — practical guides, code examples, and architecture patterns from MongoDB’s developer advocacy team.
DynamoDB:
  • The DynamoDB Book by Alex DeBrie — the definitive guide to DynamoDB data modeling. If you are serious about DynamoDB, this is required reading. It covers single-table design, GSI patterns, and real-world modeling examples in depth.
  • AWS DynamoDB Developer Guide — official documentation covering every feature, capacity mode, and best practice. Start with the “Best Practices” section.
  • Rick Houlihan’s DynamoDB Deep Dive (AWS re:Invent) — the legendary re:Invent talk that popularized single-table design. Dense, opinionated, and essential viewing.
Redis:
  • Redis University — free courses covering data structures, persistence, clustering, and Lua scripting.
  • Redis Documentation — comprehensive coverage of every command, data structure, and configuration option. The “Data types” section is the best starting point.
  • Martin Kleppmann’s Analysis of Redlock — the critical analysis of Redis’s distributed lock algorithm. Essential reading for anyone implementing distributed locks with Redis, regardless of whether you agree with Kleppmann’s conclusions.
  • Antirez’s Response to Kleppmann — Salvatore Sanfilippo’s (Redis creator) rebuttal. Read both sides to form your own opinion.
Cross-Database:
  • Designing Data-Intensive Applications by Martin Kleppmann — the single best book on databases, distributed systems, and data infrastructure. Covers replication, partitioning, consistency, encoding, batch/stream processing with depth that no other single resource matches. If you read one book on this list, make it this one.
Cross-chapter connections: