Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Database Deep Dives
Knowing SQL syntax is table stakes. Knowing why your query plan changed after aVACUUM, why your DynamoDB table has a hot partition at 3 AM, why MongoDB’s $lookup is killing your aggregation pipeline, or why Redis just evicted your session keys — that is what separates engineers who operate databases from engineers who understand them. This chapter goes beneath the surface of four databases you will encounter in production, with the depth that interviews at serious companies demand.
Section 1: PostgreSQL Internals
PostgreSQL is not just “a relational database.” It is a sophisticated system with its own approach to concurrency, crash recovery, query optimization, and extensibility. Understanding these internals transforms you from someone who writes SQL into someone who can diagnose why a query that ran in 2 ms yesterday takes 30 seconds today.1.1 MVCC (Multi-Version Concurrency Control)
Most databases face a fundamental tension: readers and writers want to access the same data simultaneously. The brute-force solution is locking — make readers wait while writers write, and vice versa. PostgreSQL chose a different path: readers never block writers, and writers never block readers. PostgreSQL achieves this through MVCC. Instead of modifying a row in place, every UPDATE creates a new version of the row. The old version is not immediately deleted — it sticks around so that any transaction that started before the update can still see the original value. Each row version has two hidden system columns:xmin (the transaction ID that created this version) and xmax (the transaction ID that deleted or replaced this version, or 0 if it is still current).
When a transaction runs a query, PostgreSQL checks each row version’s xmin and xmax against the transaction’s snapshot — a record of which transactions were committed at the moment this transaction began. If xmin is from a committed transaction that started before the snapshot and xmax is either 0 or from a transaction not yet visible, the row version is visible. Otherwise, it is skipped.
What this means in practice: Ten concurrent transactions can all read the orders table while an eleventh transaction is updating 50,000 rows. The readers see a consistent snapshot of the data as it existed when their transaction started. No locks. No waiting. No dirty reads.
1.2 WAL (Write-Ahead Log)
The Write-Ahead Log is PostgreSQL’s crash recovery mechanism and the foundation of its replication system. The principle is simple: before any change is written to the actual data files on disk, a record of that change is first written to the WAL. Why this matters: Imagine PostgreSQL is halfway through writing an UPDATE to a data page when the server loses power. Without WAL, that page is now corrupted — half old data, half new data. With WAL, on restart PostgreSQL reads the WAL from the last checkpoint, replays any committed transactions that were not yet flushed to the data files, and rolls back any uncommitted transactions. The data files might be in an inconsistent state, but the WAL has everything needed to reconstruct consistency. How it works step by step:- Transaction begins. Client issues
UPDATE accounts SET balance = 500 WHERE id = 42. - PostgreSQL modifies the row in the shared buffer pool (in-memory copy of the data page).
- A WAL record describing this change is written to the WAL buffer (in memory).
- When the transaction commits, the WAL buffer is flushed to disk (
pg_wal/directory). This is thefsynccall — the moment the change is durable. - The actual data page in the buffer pool is not immediately written to disk. It will be flushed later by the background writer or checkpointer.
- On crash, PostgreSQL replays WAL records from the last checkpoint to bring data files up to date.
pg_wal/ (called pg_xlog/ before PostgreSQL 10). Each segment is 16 MB by default. If this directory fills up (because archiving is failing or replication slots are retaining too many segments), PostgreSQL will refuse to accept new writes until space is freed. Monitor pg_wal disk usage in production.1.3 VACUUM — Dead Tuple Cleanup
VACUUM is the garbage collector for MVCC. It reclaims space occupied by dead tuples (old row versions that are no longer visible to any active transaction), updates the visibility map (which pages are fully visible to all transactions — enabling index-only scans), and updates the free space map (which pages have room for new rows). What happens without VACUUM:- Table bloat: a 1 GB table grows to 10 GB as dead tuples accumulate.
- Index bloat: indexes still point to dead tuples, making them larger and slower.
- Transaction ID wraparound: PostgreSQL uses 32-bit transaction IDs. After ~2 billion transactions, IDs wrap around and old transactions appear to be in the “future.” If this happens, PostgreSQL shuts down to prevent data corruption. VACUUM marks old transaction IDs as “frozen” to prevent wraparound.
50 + 0.2 * row_count. For a 10-million-row table, that means autovacuum triggers after 2,000,050 dead tuples — often too late.
Autovacuum tuning for large tables:
1.4 The Query Planner
When you run a query, PostgreSQL does not just execute it. It first generates multiple possible execution plans and estimates the cost of each, then picks the cheapest one. Understanding the planner means understanding why PostgreSQL sometimes ignores your carefully crafted index. Scan types the planner chooses between:| Scan Type | When Chosen | Example Scenario |
|---|---|---|
| Sequential Scan | Reading most of the table, or table is small | SELECT * FROM users (no WHERE), or WHERE status = 'active' when 90% of rows are active |
| Index Scan | Fetching a small fraction of rows via index | WHERE id = 42 on a B-tree primary key index |
| Bitmap Index Scan | Fetching a moderate fraction (1-20%) of rows | WHERE created_at > '2024-01-01' returning 10% of rows |
| Index Only Scan | All required columns exist in the index (covering index) and visibility map is current | SELECT id, email FROM users WHERE email = 'x' with an index on (email) INCLUDE (id) |
ANALYZE has not run recently), the planner makes bad estimates. If the planner estimates a query will return 50% of the table, a sequential scan is genuinely faster than an index scan — random I/O (jumping around the heap via index pointers) is slower than sequential I/O (reading pages in order) when you are reading most of the table anyway.
1.5 EXPLAIN ANALYZE Deep Dive
EXPLAIN shows the plan. EXPLAIN ANALYZE runs the query and shows actual vs estimated numbers. This is the single most important diagnostic tool for PostgreSQL performance.
- Actual rows vs estimated rows. If the planner estimated 100 rows but got 50,000, the plan is based on wrong statistics. Run
ANALYZEon the table. - Loops. A Nested Loop with
loops=10000means the inner operation ran 10,000 times. Multiply the inner node’s time by the loop count for the true cost. - Buffers: shared hit vs shared read.
shared hit= read from buffer cache (fast).shared read= read from disk (slow). A highshared readcount means the working set does not fit inshared_buffers. - Sort Method: external merge disk. The sort spilled to disk because
work_memwas too small. Increasework_memfor this session or globally. - Seq Scan with Filter: rows removed by filter. If a sequential scan filtered out 99% of rows, an index would be far more efficient.
1.6 Connection Management
Each PostgreSQL connection is a full OS process (not a thread). On Linux, each backend process consumes approximately 5-10 MB of RAM. A server withmax_connections = 500 reserves 2.5-5 GB just for connection overhead — before any query execution.
The problem: Modern web applications using serverless functions or microservices can easily open thousands of connections. Each Lambda invocation, each Kubernetes pod, each application instance opens its own connections. Without connection pooling, you exhaust PostgreSQL’s connection limit quickly.
PgBouncer is the standard solution — a lightweight connection pooler that sits between your application and PostgreSQL. It maintains a small pool of actual PostgreSQL connections (say, 50) and multiplexes hundreds or thousands of client connections onto them.
Pooling modes:
- Transaction pooling (most common): A server connection is assigned to a client for the duration of one transaction, then returned to the pool. Best for web workloads with short transactions.
- Session pooling: A server connection is assigned for the entire client session. Needed for features that require session state (prepared statements, temp tables,
SETcommands). - Statement pooling: A connection is assigned per statement. Most aggressive but breaks multi-statement transactions.
1.7 Partitioning
Partitioning splits a single logical table into multiple physical tables. PostgreSQL supports three strategies:- Range partitioning: Divide by value ranges. Most common for time-series data: one partition per month, per week, or per day.
- List partitioning: Divide by discrete values. Partition by region (
us-east,eu-west) or status (active,archived). - Hash partitioning: Distribute evenly across N partitions based on a hash of the partition key. Good for even distribution when there is no natural range or list.
- Tables over 100 million rows where queries consistently filter by the partition key.
- Time-series data where old partitions can be cheaply dropped (
DROP TABLEis instant;DELETE FROM ... WHERE date < Xon a 500M row table takes hours and creates massive dead tuple bloat). - When you need different storage strategies for hot vs cold data (recent partitions on SSD, old partitions on cheaper storage).
1.8 Extensions
PostgreSQL’s extension ecosystem is one of its strongest competitive advantages:- PostGIS: Geospatial data types and queries.
ST_DWithin(point, location, 5000)finds everything within 5 km. Powers location features at companies from Instacart to OpenStreetMap. - pg_trgm: Trigram-based fuzzy text search. Enables
LIKE '%search%'queries to use an index (normally impossible with leading wildcards). Create a GIN trigram index and suddenly fuzzy search is fast. - TimescaleDB: Turns PostgreSQL into a time-series database with automatic partitioning (hypertables), compression (90%+ space reduction on old data), and continuous aggregates. If you are choosing between PostgreSQL + TimescaleDB and InfluxDB, keep it in the PostgreSQL ecosystem unless you have a strong reason not to.
- pgvector: Vector similarity search for AI/ML embeddings. Store embeddings alongside relational data and query both in a single SQL statement. Rapidly becoming the default for RAG (Retrieval-Augmented Generation) applications that do not need billion-vector scale.
Section 2: MongoDB Patterns
MongoDB is a document database — it stores data as JSON-like documents (BSON internally) rather than rows in tables. It excels at rapid iteration with flexible schemas, hierarchical data that maps naturally to nested documents, and read-heavy workloads with denormalized data. It struggles with complex relationships, multi-document transactions at scale, and workloads that would benefit from JOINs.2.1 The Document Model
A MongoDB document is a JSON object. Unlike a relational row, it can contain nested objects, arrays, and mixed types. This flexibility is both its greatest strength and its most common source of production pain. When the document model shines:- Hierarchical data: A product with variants, each variant with sizes, each size with inventory by warehouse. In PostgreSQL, that is 4 tables and 3 JOINs. In MongoDB, it is one document.
- Flexible schema: An e-commerce platform where shoes have
sizeandcolor, electronics havevoltageandwattage, and books haveisbnandauthor. In relational, you either use sparse columns (wasteful) or an EAV pattern (painful). In MongoDB, each document has exactly the fields it needs. - Read-heavy with denormalization: When 90% of your operations are “fetch this entire thing and display it,” embedding all the data in one document means one read, no JOINs.
- Many-to-many relationships: Students and courses. If a student document embeds course references and a course document embeds student references, updating a course name means updating every student document that references it. This is the relational model’s strength and the document model’s weakness.
- Data that is frequently updated in nested arrays: Updating the 500th element of a 1000-element embedded array is expensive. MongoDB must rewrite the entire document.
- Complex analytical queries with JOINs across collections: MongoDB’s
$lookup(the equivalent of JOIN) is significantly slower than relational JOINs and does not support the same optimization strategies.
2.2 Schema Design Patterns
MongoDB schema design is not “throw everything in one document.” Production schemas use well-established patterns: Extended Reference Pattern: Instead of embedding the entire related document, embed only the fields you need for display. An order document contains{ customer_name: "Jane", customer_email: "jane@co.com" } rather than the entire customer object. Reduces document size and read latency while accepting controlled denormalization.
Subset Pattern: When a document has a large array but queries only need the most recent entries, store only the last N items in the main document and archive older items to a separate collection. A product’s reviews document has the last 10 reviews embedded; the full review history lives in a reviews collection.
Bucket Pattern: Instead of one document per event (millions of tiny documents), group events into time-based buckets. A sensor reading every second becomes one document per minute with an array of 60 readings. Reduces document count by 60x, improves query efficiency, and reduces index overhead.
Computed Pattern: Pre-compute aggregations and store the results. Instead of calculating total_revenue by summing all order documents every time, maintain a running total in a statistics document updated on each new order. Trades write complexity for read speed.
Outlier Pattern: When 99% of documents have small arrays but 1% have massive arrays (a celebrity with 10 million followers vs a normal user with 200), handle the outliers differently. Normal users embed their follower list. Celebrity users store a flag has_overflow: true and put excess followers in a separate collection.
2.3 Write Concerns and Read Concerns
Write concern controls how many replicas must acknowledge a write before the driver returns success. Read concern controls the consistency guarantee of read operations.| Write Concern | Durability | Latency | Use Case |
|---|---|---|---|
w: 0 | Fire and forget (no acknowledgment) | Lowest | Metrics, logs where occasional loss is acceptable |
w: 1 | Primary acknowledged | Low | Default — good for most operations |
w: "majority" | Majority of replica set acknowledged | Higher | Financial data, anything you cannot afford to lose on primary failure |
| Read Concern | Guarantee | Trade-off |
|---|---|---|
local | Returns the most recent data on the queried node (may be rolled back) | Fastest, default |
majority | Returns data acknowledged by majority (durable, will not be rolled back) | Slightly slower |
linearizable | Returns data that is both majority-committed and reflects all successful writes before the read started | Slowest, strongest |
2.4 Aggregation Pipeline
MongoDB’s aggregation pipeline is a sequence of stages that transform documents. Think of it as a Unix pipeline for data: each stage takes documents in, processes them, and passes the results to the next stage. Key stages:$match (filter — always put first to reduce documents early), $group (aggregate), $project (reshape), $sort, $limit, $unwind (flatten arrays), $lookup (left outer join to another collection).
The $lookup limitation: $lookup performs a left outer join, but it is fundamentally different from a relational JOIN. It executes a separate query for each input document (unless the joined collection is small enough to fit in memory). On a pipeline processing 100,000 documents, a $lookup can trigger 100,000 separate queries against the joined collection. Always ensure the joined collection has an index on the foreign field. Even then, $lookup at scale is where MongoDB’s document model starts to feel its limitations.
Optimization tip: $match and $sort stages at the beginning of the pipeline can use indexes. Once a $group or $project reshapes the documents, subsequent stages cannot use indexes. Structure your pipeline to filter as much as possible before transforming.
2.5 Sharding
Sharding distributes data across multiple servers (shards). The shard key is the most critical MongoDB decision you will make — it determines how data is distributed and cannot be easily changed after the fact. Shard key selection criteria:- High cardinality: A boolean field (
is_active) is a terrible shard key — all data lands on two chunks. A user ID or order ID has millions of distinct values. - Even distribution: Monotonically increasing keys (timestamps, ObjectId) cause all new writes to hit the same shard (the one owning the highest range). This creates a “hot shard” bottleneck.
- Query isolation: The shard key should appear in most queries. If your queries filter by
user_id, shard onuser_idso queries hit a single shard. If you shard onorder_idbut query byuser_id, every query must scatter to all shards and gather results — a scatter-gather query that does not scale.
2.6 Change Streams
Change Streams let you subscribe to real-time changes on a collection, database, or entire cluster. Under the hood, they tail MongoDB’s oplog (operation log — similar to PostgreSQL’s WAL). Use cases: Real-time event processing (new order -> trigger fulfillment), cache invalidation (document updated -> invalidate Redis cache), sync to search index (change in MongoDB -> update Elasticsearch), audit logging. Key limitation: Change Streams require a replica set (not available on standalone instances). They are “at-least-once” delivery — your consumer must be idempotent because it may receive the same change event more than once after a resume.2.7 Common Gotchas
- Unbounded arrays: Embedding an ever-growing array (all user activity, all comments) into a single document eventually hits MongoDB’s 16 MB document size limit. Even before hitting the limit, updates to large arrays are slow because MongoDB rewrites the entire document. Use the Bucket or Outlier pattern instead.
- Missing compound indexes: If you query by
{ user_id, status, created_at }, an index on justuser_idforces MongoDB to scan all of that user’s documents to filter bystatusandcreated_at. Create a compound index matching your query pattern. - WiredTiger cache sizing: WiredTiger (MongoDB’s storage engine) defaults to using 50% of RAM minus 1 GB for its internal cache. On a 64 GB server, that is ~31 GB. If your working set (frequently accessed data + indexes) exceeds this, performance degrades dramatically as WiredTiger constantly reads from and evicts to disk. Monitor
wiredTiger.cache.bytes currently in the cacheand ensure it does not stay near the maximum.
Section 3: DynamoDB Strategies
DynamoDB is AWS’s fully managed NoSQL database. It is not a general-purpose database — it is purpose-built for applications that need predictable single-digit millisecond performance at any scale. The trade-off: you give up query flexibility (no JOINs, no ad-hoc queries) in exchange for virtually unlimited throughput and rock-solid latency guarantees.3.1 Access Patterns First
The most important DynamoDB design principle: design your table around your queries, not your entities. In relational databases, you normalize data into entities (users, orders, products) and rely on JOINs to answer any query. In DynamoDB, there are no JOINs. Every access pattern must be answerable by a single query against a single table (or index). Before writing a single line of schema, list every access pattern your application needs:- Get user by ID
- Get all orders for a user, sorted by date
- Get all comments on a post
- Get a user’s most recent 10 posts
- Get all users in an organization
GetItem, Query, or BatchGetItem operation.
3.2 Single-Table Design
Single-table design stores multiple entity types in one DynamoDB table. This is counterintuitive for engineers from a relational background, but it is the canonical DynamoDB pattern. How it works (adjacency list pattern):| PK | SK | Data |
|---|---|---|
USER#123 | PROFILE | { name: "Jane", email: "jane@co.com" } |
USER#123 | ORDER#2024-001 | { total: 59.99, status: "shipped" } |
USER#123 | ORDER#2024-002 | { total: 129.00, status: "pending" } |
ORDER#2024-001 | ITEM#A | { product: "Widget", qty: 2 } |
ORDER#2024-001 | ITEM#B | { product: "Gadget", qty: 1 } |
ORG#acme | MEMBER#123 | { role: "admin", joined: "2024-01-15" } |
ORG#acme | MEMBER#456 | { role: "viewer", joined: "2024-03-20" } |
- Get user profile:
GetItem(PK=USER#123, SK=PROFILE) - Get all orders for user:
Query(PK=USER#123, SK begins_with ORDER#) - Get all items in an order:
Query(PK=ORDER#2024-001, SK begins_with ITEM#) - Get all members of an org:
Query(PK=ORG#acme, SK begins_with MEMBER#)
3.3 Partition Key Design
DynamoDB distributes data across partitions based on a hash of the partition key. If many items share the same partition key, that partition becomes “hot” — receiving a disproportionate share of read/write traffic and potentially getting throttled. Avoiding hot partitions:- High-cardinality keys: Use user IDs, order IDs, or UUIDs — not statuses, dates, or categories.
- Write sharding: If you must use a low-cardinality key (e.g., tracking page views per URL), append a random suffix:
URL#homepage#3(where 3 is random between 0-9). This distributes writes across 10 partitions. Reads require querying all 10 suffixes and aggregating — a trade-off that is worth it for high-write scenarios. - Composite keys: Combine fields to increase cardinality.
TENANT#acme#USER#123is better thanTENANT#acmealone.
3.4 Global Secondary Indexes (GSI)
A GSI is a projection of your table with a different partition key and sort key. It lets you query data by attributes other than the base table’s keys. Example: Your base table hasPK=USER#123, SK=ORDER#2024-001. You need to query “all orders with status=pending.” Create a GSI with PK=STATUS, SK=created_at. Now Query(PK=pending, SK between '2024-01-01' and '2024-12-31') returns all pending orders sorted by date.
Key characteristics:
- GSIs are eventually consistent — there is no strongly consistent read option for GSIs. If you write an item and immediately query the GSI, you might not see it yet (typically milliseconds, but during high load it can be longer).
- GSIs have their own provisioned throughput. If the GSI is throttled, writes to the base table are also throttled (backpressure).
- You can project a subset of attributes to the GSI (reducing cost and improving performance) or project all attributes.
- Maximum 20 GSIs per table.
3.5 Local Secondary Indexes (LSI)
An LSI uses the same partition key as the base table but a different sort key. It must be created at table creation time (cannot be added later). LSIs support strongly consistent reads. When to use LSI: When you need an alternative sort order within the same partition AND you need strongly consistent reads on that sort order. Example: base table sorts orders byorder_id, but you also need orders sorted by created_at within the same user partition.
Limitations: LSIs share the partition’s 10 GB limit. If a single partition key’s items (across the base table and all LSIs) exceed 10 GB, writes fail. This is the main reason many DynamoDB practitioners avoid LSIs entirely and use GSIs instead.
3.6 DynamoDB Streams
DynamoDB Streams captures a time-ordered sequence of item-level modifications (inserts, updates, deletes) to a DynamoDB table. Think of it as DynamoDB’s version of a change data capture (CDC) log. Use cases:- Event-driven architectures: New order inserted -> Stream triggers Lambda -> Lambda sends confirmation email.
- Cross-region replication: DynamoDB Global Tables use Streams under the hood.
- Materialized views: Stream changes to build denormalized views in another table or in Elasticsearch for full-text search.
- Audit trail: Stream all changes to S3 for compliance and forensics.
3.7 DAX (DynamoDB Accelerator)
DAX is a fully managed, in-memory cache for DynamoDB. It sits between your application and DynamoDB, caching frequently read items and query results. When DAX makes sense: Read-heavy workloads where the same items are read repeatedly (product catalog, configuration data). When you need microsecond latency (DAX serves from memory in ~200 microseconds vs DynamoDB’s ~5 ms). When DAX does NOT make sense: Write-heavy workloads (DAX caches reads, not writes). When reads are highly varied with little repetition (low cache hit rate). When eventual consistency of the cache is problematic (DAX is eventually consistent by default with a 5-minute TTL).3.8 Capacity Modes
On-demand: Pay per request. No capacity planning. Instantly handles traffic spikes. Costs roughly 6.5x more per request than provisioned at steady state. Best for: unpredictable workloads, new applications where traffic patterns are unknown, spiky traffic. Provisioned: You specify read/write capacity units. Cheaper at steady state. Risk of throttling if you underestimate. Use with Auto Scaling to adjust capacity based on utilization. Best for: predictable, steady workloads. Burst capacity: DynamoDB reserves unused capacity for up to 5 minutes to handle short spikes. This is not infinite — sustained traffic above provisioned capacity will be throttled even with burst capacity available.3.9 Cost Optimization
- Avoid
Scanoperations. A Scan reads every item in the table and charges for every read capacity unit consumed. Always useQueryorGetItemwith a key condition. - Use sparse indexes. A GSI only contains items that have the GSI’s key attributes. If only 5% of items have a
premium_tierattribute, a GSI onpremium_tieris 95% smaller than the base table — dramatically cheaper to read. - Compress large items. DynamoDB charges per KB of read/write capacity consumed. Compressing a 10 KB JSON attribute to 2 KB saves 80% on throughput costs for that attribute.
- Use
ProjectionExpression. Only retrieve the attributes you need, not the entire item. Reduces data transfer and read capacity consumption. - Time-to-Live (TTL). Automatically delete expired items at no cost (DynamoDB handles deletion in the background without consuming write capacity).
Section 4: Redis Architecture
Redis is an in-memory data structure server. It is often described as “a cache,” but that undersells it dramatically. Redis is a Swiss Army knife for problems that need sub-millisecond latency: caching, session storage, rate limiting, leaderboards, pub/sub messaging, distributed locks, and real-time analytics.4.1 Data Structures Beyond Key-Value
Redis’s power comes from its rich data structure support. Each data structure has O(1) or O(log N) operations that would be expensive to build on top of a plain key-value store:- Strings: Basic key-value. But also atomic counters (
INCR), bit operations, and values up to 512 MB. - Hashes: A hash map within a single key. Perfect for objects:
HSET user:123 name "Jane" email "jane@co.com". Get individual fields without deserializing the whole object. - Lists: Doubly-linked lists.
LPUSH/RPOPgives you a queue.LRANGEgives you the last N items. Used for activity feeds, recent items. - Sets: Unordered unique collections.
SADD,SINTER(intersection),SUNION. Used for tags, mutual friends, unique visitors. - Sorted Sets: Sets with a score for each member.
ZADD leaderboard 1500 "player1".ZRANGE leaderboard 0 9 REVgives the top 10. O(log N) insert and rank query. The data structure behind leaderboards, priority queues, and rate limiters. - HyperLogLog: Probabilistic data structure for cardinality estimation. Count unique visitors with 0.81% standard error using only 12 KB of memory, regardless of the actual count.
PFADD visitors "user1" "user2".PFCOUNT visitorsreturns the approximate unique count. - Streams: An append-only log data structure (added in Redis 5.0). Think Kafka-lite. Supports consumer groups, acknowledgment, and message replay. Used for event sourcing, activity streams, and lightweight message queues.
- Bitmaps: Bit arrays. Set and get individual bits.
SETBIT active:2024-04-09 user_id 1. Count bits withBITCOUNT. Used for tracking daily active users (1 bit per user, 1 million users = 125 KB per day).
4.2 Persistence
Redis is in-memory, but it can persist data to disk via two mechanisms: RDB (Redis Database) snapshots: Point-in-time snapshots of the entire dataset written to a.rdb file. Redis forks the process and the child writes the snapshot while the parent continues serving requests (copy-on-write). Fast recovery (load the entire dataset from one file). Risk: data written between the last snapshot and a crash is lost. Configure with save 900 1 (snapshot if at least 1 key changed in 900 seconds).
AOF (Append Only File): Logs every write operation. Three fsync policies:
appendfsync always: fsync after every write. Most durable, slowest.appendfsync everysec: fsync every second. Loses at most 1 second of data on crash. The recommended default.appendfsync no: Let the OS decide when to flush. Fastest, least durable.
4.3 Eviction Policies
When Redis reachesmaxmemory, it must decide what to evict:
| Policy | Behavior | Use Case |
|---|---|---|
noeviction | Return errors on writes when memory is full | When data loss is unacceptable (session store, job queue) |
allkeys-lru | Evict least recently used keys across all keys | General-purpose caching |
volatile-lru | Evict least recently used keys among keys with a TTL set | When some keys must persist (config) and others can be evicted (cache) |
allkeys-lfu | Evict least frequently used keys | When access frequency matters more than recency (popular items should stay) |
volatile-lfu | Evict least frequently used keys among those with a TTL | Mixed workload with frequency-based eviction |
allkeys-random | Evict random keys | When all keys have equal value |
volatile-ttl | Evict keys with the shortest remaining TTL | When near-expiry keys should go first |
4.4 Redis Cluster
Redis Cluster provides horizontal scaling by distributing data across multiple nodes. The keyspace is divided into 16,384 hash slots. Each node is responsible for a subset of slots. The hash slot for a key isCRC16(key) mod 16384.
How resharding works: To add a node, you migrate hash slots from existing nodes to the new node. Redis handles this online — keys in migrating slots are redirected with ASK and MOVED responses that clients follow automatically.
Multi-key operations: Operations that span multiple keys (MGET, SUNION, transactions) work only if all keys map to the same hash slot. Use hash tags to force related keys to the same slot: {user:123}:profile and {user:123}:orders both hash on user:123 and land on the same slot.
Client-side vs proxy-based:
- Client-side (smart clients): The client library (like
redis-py-clusterorioredis) knows the slot-to-node mapping and routes requests directly to the correct node. Lower latency (no proxy hop), but the client must handle cluster topology changes. - Proxy-based (Twemproxy, Redis Cluster Proxy): A proxy sits between clients and the cluster, handling routing transparently. Simpler client code, but adds a network hop and a single point of failure.
4.5 Redis Sentinel
Redis Sentinel provides high availability for non-clustered Redis (single master with replicas). Sentinel monitors the master, detects failures, and automatically promotes a replica to master. How failover works:- Multiple Sentinel instances (minimum 3 for quorum) monitor the master.
- When a Sentinel detects the master is unreachable, it asks other Sentinels to confirm (to avoid false positives from network issues).
- If a quorum of Sentinels agree the master is down, one Sentinel is elected to perform the failover.
- The elected Sentinel promotes the best replica (most up-to-date data) to master and reconfigures the other replicas to follow the new master.
- Clients using Sentinel-aware drivers are notified of the new master address.
4.6 Pub/Sub vs Streams
Pub/Sub: Fire-and-forget messaging. Publishers send messages to channels; subscribers receive messages in real-time. Messages are NOT persisted — if a subscriber is disconnected, it misses messages. No consumer groups. No acknowledgment. Best for: real-time notifications, chat, live updates where message loss is acceptable. Streams: Persistent, append-only log with consumer groups. Messages are stored and can be replayed. Consumer groups allow multiple consumers to cooperatively process messages with acknowledgment and pending message tracking. Best for: reliable event processing, task queues, audit logs, anything where message loss is not acceptable. Decision rule: If you would be upset that a message was lost, use Streams. If the message is only valuable in real-time and losing it is fine (like a live scoreboard update), Pub/Sub is simpler and faster.4.7 Lua Scripting
Redis executes Lua scripts atomically — no other command runs between the script’s start and finish. This eliminates race conditions in multi-step operations. Example: Atomic rate limiterINCR and EXPIRE would be separate commands with a gap between them — if the process crashes after INCR but before EXPIRE, the key never expires and the rate limit becomes permanent. Lua makes this atomic.
4.8 Common Patterns
Distributed locks (Redlock): Acquire locks across multiple independent Redis instances. A lock is considered acquired only if the majority of instances grant it within a time limit. More robust than a single-instance lock but controversial — Martin Kleppmann published a detailed critique arguing that Redlock does not provide the guarantees it claims in the presence of clock drift and process pauses. Use it for efficiency (avoiding duplicate work) but not for correctness (where a race condition would corrupt data). Rate limiting: Use Sorted Sets with timestamps as scores.ZADD rate:user:123 <timestamp> <request_id>. ZREMRANGEBYSCORE to remove entries outside the window. ZCARD to count entries in the window. This gives a precise sliding window rate limiter.
Leaderboards: Sorted Sets are purpose-built for this. ZADD leaderboard <score> <player>. ZREVRANK leaderboard <player> gives rank in O(log N). ZREVRANGE leaderboard 0 9 WITHSCORES gives the top 10. Used by gaming companies, Stack Overflow, and fitness apps.
Session storage: Store session data as Redis Hashes with a TTL. HSET session:<id> user_id 123 role admin. EXPIRE session:<id> 3600. All app instances share the same session store. Stateless application servers.
4.9 Memory Optimization
- Key naming conventions: Use
user:123:profilenotthe_profile_data_for_user_id_one_hundred_twenty_three. Short, consistent key names save significant memory at scale. At 100 million keys, saving 20 bytes per key name saves 2 GB. - Ziplist encoding: Redis automatically uses a compact encoding (ziplist) for small Hashes, Lists, and Sorted Sets. A Hash with fewer than
hash-max-ziplist-entries(default 128) entries and values smaller thanhash-max-ziplist-value(default 64 bytes) is stored as a ziplist instead of a hash table — using 5-10x less memory. Keep small objects small to benefit from this. - TTL strategies: Set TTLs aggressively. Every key without a TTL is a potential memory leak. Use
SCANperiodically to find keys without TTLs that should have them. - Compression: For large values (JSON blobs > 1 KB), compress with gzip or LZ4 at the application level before storing in Redis. Redis does not compress values internally.
Section 5: When to Choose What
Decision Matrix
| Criteria | PostgreSQL | MongoDB | DynamoDB | Redis |
|---|---|---|---|---|
| Data model | Relational (tables, rows, JOINs) | Documents (JSON, nested, flexible) | Key-value / wide column | In-memory data structures |
| Query flexibility | Highest (SQL, ad-hoc queries, JOINs) | High (flexible queries, aggregation pipeline) | Lowest (must design for access patterns) | Low (key-based access, Lua for complex ops) |
| Scale model | Vertical (read replicas, partitioning, Citus for horizontal) | Horizontal (native sharding) | Horizontal (fully managed, unlimited) | Vertical (Cluster for horizontal) |
| Consistency | Strong (ACID, serializable isolation) | Tunable (w:majority for strong writes) | Tunable (strong reads on base table) | Eventual (replication), strong (single node) |
| Latency | 1-100 ms (query dependent) | 1-50 ms (query dependent) | <10 ms (predictable at any scale) | <1 ms (in-memory) |
| Operational burden | Medium (VACUUM, connection mgmt, replication) | Medium (sharding, WiredTiger tuning) | Low (fully managed by AWS) | Medium (memory management, persistence) |
| Best for | Complex queries, transactions, data integrity | Flexible schema, hierarchical data, rapid iteration | Predictable latency at scale, serverless backends | Caching, real-time, sessions, leaderboards |
| Worst for | Unlimited horizontal scale without extensions | Complex relationships, heavy JOINs | Ad-hoc queries, complex transactions | Primary data store (data > memory), complex queries |
Concrete Scenarios
| Scenario | Choose | Why |
|---|---|---|
| E-commerce platform with orders, inventory, payments | PostgreSQL | Transactions, data integrity, complex queries across entities |
| Content management system with varied content types | MongoDB | Flexible schema (articles, videos, podcasts all have different fields) |
| Mobile app backend with millions of users, simple reads/writes | DynamoDB | Predictable latency, scales with users, serverless-friendly |
| Real-time leaderboard, session store, rate limiter | Redis | Sub-millisecond latency, purpose-built data structures |
| IoT sensor data (millions of events/second, time-series queries) | PostgreSQL + TimescaleDB or DynamoDB | TimescaleDB for SQL analytics on time-series; DynamoDB for pure ingestion throughput |
| Social network (users, posts, comments, likes, follows) | PostgreSQL + Redis (cache) | Relational model for relationships; Redis for feeds, counts, sessions |
| AI/ML embeddings search + metadata | PostgreSQL + pgvector | Vector search + relational queries in one database |
Polyglot Persistence
Most production systems use multiple databases, each for what it does best:- PostgreSQL for the core transactional data (users, orders, billing).
- Redis for caching, sessions, rate limiting, and real-time features.
- Elasticsearch (or MongoDB with Atlas Search) for full-text search.
- DynamoDB for high-throughput, low-latency specific access patterns (API keys, feature flags).
- S3 for blob storage (images, files, backups).
Migration Strategies
Migrating between databases is one of the highest-risk operations in production engineering. The general approach:- Dual-write phase: Write to both old and new databases. Read from old. Verify data consistency between them.
- Shadow read phase: Read from both databases. Return old database results to users. Compare results in the background and log discrepancies.
- Cutover: Switch reads to the new database. Keep the old database running as a fallback.
- Cleanup: After a soak period (weeks, not days), decommission the old database.
Section 6: Database Scaling Decision Tree
When your database is struggling, engineers often jump straight to the most dramatic solution (sharding, new database). In practice, scaling follows a predictable escalation path — each step is cheaper and less risky than the next. Exhaust cheaper options before reaching for expensive ones.The Escalation Ladder
Step 1: Optimize Queries and Add Indexes (Cost: hours. Risk: near-zero.) Before adding any infrastructure, check whether the database is slow because of bad queries, not insufficient hardware. Runpg_stat_statements (PostgreSQL), .explain('executionStats') (MongoDB), or check CloudWatch metrics (DynamoDB) to find your slowest operations.
- Add missing indexes for common query patterns.
- Rewrite N+1 query patterns into batch queries or JOINs.
- Add
ANALYZE/ update statistics so the query planner makes better choices. - For MongoDB: ensure compound indexes match your query predicates and sort order.
- For DynamoDB: verify GSIs cover your access patterns (a missing GSI means a Scan instead of a Query — orders of magnitude slower).
- PostgreSQL: streaming replication to one or more replicas. Route read queries to replicas, writes to primary. Watch for replication lag on time-sensitive reads.
- MongoDB: read preference
secondaryPreferredorsecondary. Same caveat about eventual consistency. - DynamoDB: eventually consistent reads (default) are already distributed. Use DAX for caching hot reads.
- Cache database query results with a TTL matching your staleness tolerance.
- Use cache-aside pattern: check cache first, fall through to database on miss, populate cache on read.
- For write-heavy data that is read often: consider write-through caching.
- PostgreSQL: range partitioning by date (most common), list partitioning by tenant/region, hash partitioning for even distribution.
- MongoDB: already uses sharding for this (see Step 5). Internal WiredTiger data partitioning is automatic.
- DynamoDB: already partitioned internally by partition key. The analog here is redesigning your partition key for better distribution.
- PostgreSQL: application-level sharding (route queries based on a shard key in your app code), or use Citus for distributed PostgreSQL.
- MongoDB: native sharding with a shard key. The shard key choice is irreversible and determines everything — choose carefully (see Section 2.5).
- DynamoDB: sharding is automatic and invisible (DynamoDB handles partition management). Your role is to choose a partition key that distributes evenly.
- Your access patterns are fundamentally mismatched with your database’s strengths (e.g., graph traversals in a relational database, ad-hoc analytics in DynamoDB).
- You have exhausted all tuning, replication, caching, and partitioning options.
- The cost of maintaining the workarounds exceeds the cost of migration.
- Relational database struggling with time-series write throughput -> TimescaleDB or InfluxDB.
- MongoDB struggling with complex cross-collection JOINs -> PostgreSQL.
- PostgreSQL struggling with sub-millisecond key-value lookups at massive scale -> DynamoDB or Redis.
- Any SQL database struggling with graph traversals (friends-of-friends queries) -> Neo4j or Neptune.
Quick Reference: Scaling Decision Tree
Section 7: Connection Pool Sizing Formula
Connection pool sizing is one of those problems that seems like it should be simple but causes production outages when done wrong. Too small and requests queue up waiting for a connection. Too large and you overwhelm the database with concurrent queries that compete for CPU, memory, and I/O.Little’s Law Applied to Database Connections
The mathematical foundation for connection pool sizing is Little’s Law, a fundamental result from queueing theory:- L = average number of items in the system (connections in use)
- lambda = arrival rate (queries per second)
- W = average time each item spends in the system (average query duration)
Concrete Examples
Example 1: A typical web application.- Your app handles 500 requests/second.
- Each request makes 1 database query averaging 10 ms (0.01 seconds).
- Throughput (queries/second) = 500.
pool_size = 500 * 0.01 = 5 connections.
- 2,000 requests/second, each making 2 database queries.
- Average query latency: 25 ms (0.025 seconds).
- Throughput = 4,000 queries/second.
pool_size = 4,000 * 0.025 = 100 connections.
- Same 2,000 requests/second, 2 queries each.
- Average query latency: 5 ms, but p99 latency: 200 ms.
- Using average:
pool_size = 4,000 * 0.005 = 20 connections. - But during p99 spikes:
burst_pool_size = 4,000 * 0.200 = 800 connections.
The Practical Formula
In practice, add a multiplier for safety margin, burstiness, and overhead:5 * 2.0 = 10 connections. A pool of 10 with a max of 15 handles normal traffic with room for bursts.
For Example 2: 100 * 2.0 = 200 connections. But check: can PostgreSQL handle 200 concurrent connections? At ~10 MB per connection, that is 2 GB of overhead. If the server has 16 GB of RAM and you need 10 GB for shared_buffers and working memory, 200 connections might be too many. This is where PgBouncer’s transaction pooling becomes essential — 200 application connections multiplex onto 50-100 actual PostgreSQL connections.
The PostgreSQL-Specific Constraint
PostgreSQL has a hard constraint that most connection pool formulas ignore: more concurrent connections does not mean more throughput. Beyond a certain point (roughly 2-4x the number of CPU cores), adding connections decreases throughput due to context switching, lock contention, and buffer pool pressure. The empirical guideline from PostgreSQL experts:2 * 8 + 1 = 17. This is the maximum number of active connections that can do useful work simultaneously. PgBouncer multiplexes hundreds of application connections onto this small pool.
Section 8: Common Production Issues by Database
Every database has its own set of production failure modes. These are not theoretical — they are the issues that page engineers at 3 AM. Knowing them in advance separates engineers who operate databases from engineers who are surprised by databases.PostgreSQL: Long-Running Transactions Blocking VACUUM
The problem: A developer opens a database connection, runsBEGIN, executes a query, then… does something else. Maybe they are debugging, maybe the application has a bug that never commits or rolls back. That open transaction holds a snapshot — and as long as that snapshot exists, VACUUM cannot clean up any dead tuples created after the transaction started.
How it escalates:
- An idle transaction holds a snapshot for 6 hours.
- During those 6 hours, autovacuum runs but cannot remove any dead tuples from the last 6 hours of updates.
- Dead tuples accumulate. Table bloat grows. Sequential scans slow down because they must skip millions of dead tuples.
- Index bloat grows because indexes still point to dead tuples.
- In extreme cases: transaction ID wraparound approaches. PostgreSQL enters “safety shutdown” mode and refuses all writes until a manual VACUUM FREEZE completes.
- Set
idle_in_transaction_session_timeout = '5min'to automatically kill idle transactions. - Configure
statement_timeoutas a safety net for runaway queries. - Monitor
pg_stat_activityfor transactions older than your acceptable threshold. - In PgBouncer: use transaction pooling mode, which returns connections to the pool after each transaction (making long-lived idle transactions impossible from pooled clients).
pg_stat_activity for idle transactions first.
MongoDB: Unbounded Array Growth
The problem: A developer embeds an ever-growing array inside a document. Activity logs per user, all comments on a popular post, IoT readings per device — any pattern where the array grows without bound. How it escalates:- Documents start small (a few KB) and queries are fast.
- Over weeks or months, arrays grow to thousands of elements. Documents bloat to several MB.
- Every update rewrites the entire document. Appending one element to a 5 MB document means writing 5 MB. WiredTiger’s document-level concurrency control means all updates to that document are serialized.
- If a document hits the 16 MB BSON limit, all further updates fail. Your application starts throwing errors.
- Even before hitting the limit, large documents cause WiredTiger cache pressure. The cache stores uncompressed documents — a 5 MB document on disk might be 20 MB in the WiredTiger cache.
- Use the Bucket Pattern: group entries into time-based buckets (one document per day/hour) instead of one growing array per entity.
- Use the Subset Pattern: keep only the last N items embedded; archive older items to a separate collection.
- Use the Outlier Pattern: flag documents that exceed a threshold and store overflow data externally.
- Set a hard application-level limit:
{ $push: { activityLog: { $each: [newEntry], $slice: -100 } } }keeps only the last 100 entries.
DynamoDB: Hot Partitions
The problem: All traffic concentrates on a small number of DynamoDB partitions because the partition key has low cardinality or a skewed distribution. How it escalates:- DynamoDB distributes provisioned throughput evenly across partitions. With 10,000 WCU and 10 partitions, each partition gets ~1,000 WCU.
- A flash sale drives all writes to
PK=PRODUCT#hot-item. That single partition receives 5,000 WCU of traffic. - The partition throttles.
ProvisionedThroughputExceededExceptionerrors spike. The application retries with exponential backoff, increasing latency. - Adaptive capacity (DynamoDB’s automatic rebalancing) kicks in, but it takes minutes — an eternity during a flash sale.
- If you are on provisioned capacity, increasing the table’s WCU does NOT help because the new capacity is still distributed evenly. You are paying for 50,000 WCU but only one partition is hot.
- Enable DynamoDB Contributor Insights — it shows the most accessed partition keys over time.
- CloudWatch: compare
ConsumedWriteCapacityUnits(table-level) withThrottledRequests. If throttling occurs while consumed capacity is well below provisioned, you have a hot partition. - Check your partition key’s cardinality distribution. A partition key with 90% of items under one value is a time bomb.
- Design partition keys for high cardinality from the start (user IDs, order IDs, UUIDs — not statuses or dates).
- Use write sharding: append a random suffix (
PK=PRODUCT#hot-item#7) to distribute across N partitions. Reads require querying all N suffixes and merging. - For genuinely hot items (a single viral product page), put DAX in front for reads and use SQS to buffer and smooth writes.
- Switch to on-demand capacity mode for unpredictable or spiky workloads — it handles partition-level throttling more gracefully.
Redis: Memory Fragmentation
The problem: Redis’sused_memory_rss (actual memory consumed as reported by the OS) is significantly higher than used_memory (memory Redis believes it is using). The ratio of these two values is the fragmentation ratio. A ratio of 1.5 means 50% of Redis’s memory is wasted on fragmentation.
How it escalates:
- Your application writes and deletes keys of varying sizes over time. Small keys, large keys, keys that grow and shrink.
- The memory allocator (jemalloc by default) allocates memory in fixed-size classes. A 50-byte value gets a 64-byte allocation. A 70-byte value gets a 128-byte allocation. The wasted space in each allocation is internal fragmentation.
- Over time, keys are deleted and new keys of different sizes are created. The freed memory may not match the sizes needed for new allocations. Memory becomes a Swiss cheese of small free gaps that cannot be coalesced. This is external fragmentation.
- Redis reports
used_memory = 8 GBbut the OS reportsused_memory_rss = 14 GB. You are paying for 14 GB of RAM but only 8 GB holds useful data. - If
maxmemoryis set to 10 GB, Redis thinks it has 2 GB of headroom. But the OS is already at 14 GB. If the server only has 16 GB of RAM, the OOM killer may terminate Redis.
- Redis 4.0+ active defragmentation: Enable with
activedefrag yes. Redis reorganizes memory in the background, moving allocations to reduce fragmentation. Configureactive-defrag-threshold-lower 10(start defragmentation when fragmentation exceeds 10%) andactive-defrag-cycle-min 5(use at least 5% of CPU for defragmentation). - Avoid frequent large-to-small value changes. If a key alternates between a 10 KB value and a 100-byte value, the allocator wastes space on the larger allocation class.
- Prefer fixed-size data patterns. Hashes with consistent field counts, Sorted Sets with consistent element sizes, and Strings with consistent value lengths fragment less.
- Schedule periodic restarts for Redis instances used as caches (not data stores). A restart loads the RDB snapshot into fresh, defragmented memory. This is a blunt instrument but effective.
- Monitor
mem_fragmentation_ratioin your alerting. Alert at 1.5, page at 2.0.
Quick Reference: Production Issue Checklist
| Database | Issue | Key Metric | Alert Threshold |
|---|---|---|---|
| PostgreSQL | Long idle transactions | pg_stat_activity.xact_start | Transactions older than 5 minutes |
| PostgreSQL | Dead tuple accumulation | pg_stat_user_tables.n_dead_tup | > 10% of n_live_tup |
| PostgreSQL | Transaction ID wraparound | age(datfrozenxid) | > 500 million (critical at 1 billion) |
| PostgreSQL | Connection exhaustion | pg_stat_activity count vs max_connections | > 80% of max_connections |
| MongoDB | Document bloat | $bsonSize in aggregation | Documents > 1 MB (critical at 8 MB) |
| MongoDB | WiredTiger cache pressure | wiredTiger.cache.bytes currently in cache | > 90% of configured cache size |
| MongoDB | Replication lag | rs.printSlaveReplicationInfo() | > 10 seconds |
| DynamoDB | Hot partitions | ThrottledRequests while consumed < provisioned | Any throttling at < 80% overall capacity |
| DynamoDB | GSI backpressure | GSI ThrottledWriteRequests | Any sustained throttling |
| Redis | Memory fragmentation | mem_fragmentation_ratio | > 1.5 (critical > 2.0) |
| Redis | Swap usage | mem_fragmentation_ratio < 1.0 | Any ratio < 1.0 |
| Redis | Slow commands | SLOWLOG GET 10 | Commands > 10 ms |
| Redis | Key eviction | evicted_keys | Increasing when it should not be |
Interview Questions
Your PostgreSQL table has 500M rows and VACUUM is taking too long. What do you do?
Your PostgreSQL table has 500M rows and VACUUM is taking too long. What do you do?
pg_stat_user_tables for n_dead_tup and last_autovacuum. If dead tuples are in the hundreds of millions, autovacuum fell behind.Immediate actions:- Tune autovacuum for this specific table. Reduce
autovacuum_vacuum_scale_factorto 0.01 (trigger at 1% dead tuples instead of 20%) and increaseautovacuum_vacuum_cost_limitto let autovacuum work faster. - Increase
maintenance_work_memfor the VACUUM session (e.g.,SET maintenance_work_mem = '2GB'). VACUUM uses this to store dead tuple IDs — if it cannot fit them all, it makes multiple passes over the table. - Check for long-running transactions. Any open transaction prevents VACUUM from cleaning up tuples created after that transaction started. A forgotten idle transaction from 3 days ago can block all vacuuming. Check
pg_stat_activityfor old transactions and terminate them. - Run VACUUM manually with
VERBOSEto see progress and identify where it is spending time.
DROP TABLE) instead of DELETEing and VACUUMing millions of rows. Also consider pg_repack if the table is heavily bloated — it performs an online table rewrite without the exclusive lock that VACUUM FULL requires.Words that impress: dead tuples, xmin/xmax visibility, autovacuum scale factor, cost-based vacuum delay, maintenance_work_mem, transaction ID wraparound, pg_repack.pg_stat_user_tables, tune autovacuum, check for long-running transactions) and fixes the table. A staff/principal engineer goes further: they ask why autovacuum fell behind in the first place (missing monitoring? default settings never tuned? no table-level overrides on high-write tables?), sets up per-table autovacuum policies as organizational standards, adds pg_stat_user_tables.n_dead_tup to the Tier 1 dashboard, and establishes a runbook so the next engineer who encounters this does not start from scratch. Staff-level thinking treats the incident as a system design problem, not a one-off fix.- Failure mode: What happens if VACUUM FREEZE cannot complete before transaction ID wraparound? PostgreSQL enters read-only safety mode for the entire cluster — all writes stop, not just the affected table. Recovery requires manual
VACUUM FREEZEwhich can take hours on large tables. - Rollout: If you partition the table, how do you migrate 500M rows into partitions without downtime? Use
pg_partmanfor declarative partitioning or create partitions, backfill withINSERT INTO ... SELECT, set up a trigger to dual-write during migration, then swap. - Rollback: If the new autovacuum settings cause too much I/O during peak hours, revert the per-table settings with
ALTER TABLE ... RESET (autovacuum_vacuum_scale_factor)and schedule manual VACUUM during off-peak windows. - Measurement: Track
n_dead_tup / n_live_tupratio per table,last_autovacuumfreshness, andage(relfrozenxid)as Tier 1 metrics. Alert at 10% dead tuple ratio. - Cost:
pg_repackrequires temporary disk space equal to the table size. For a 500GB table on RDS, ensure your storage can handle the spike. VACUUM FULL is cheaper on disk but takes an exclusive lock — quantify the downtime cost to the business. - Security/governance: Long-running transactions often come from developer SSH sessions with open
psqlconnections. Enforceidle_in_transaction_session_timeoutand audit who has direct database access. In regulated environments, the wraparound risk itself may be a compliance concern.
pg_stat_user_tables output showing a table with 450M live tuples and 380M dead tuples. The last autovacuum was 18 hours ago. age(relfrozenxid) is 1.2 billion. Write the exact SQL commands you would run in the next 10 minutes, in order, explaining why each one matters.”Real-World Example. GitLab famously shared a post-mortem where a single 3-day-old idle-in-transaction session held back VACUUM on a 500 GB ci_builds table. age(relfrozenxid) climbed past 1.5 billion before monitoring caught it. They now enforce a 5-minute idle_in_transaction_session_timeout cluster-wide and page on-call at 500M xid age — a pattern any team running PostgreSQL at scale should copy.REINDEX CONCURRENTLY or pg_repack.VACUUM FULL during a maintenance window on a 500GB table?
A: VACUUM FULL takes an ACCESS EXCLUSIVE lock — no reads, no writes — for the entire duration, which on 500GB is typically 2-6 hours. It also needs ~500GB of free disk to rewrite the table. pg_repack does the same work online using triggers and a shadow table, swapping at the end with a brief lock.Q: If autovacuum is configured correctly, why does it still fall behind?
A: Three common reasons: (1) cost-based throttling is too aggressive (autovacuum_vacuum_cost_delay default of 20ms on fast SSDs is conservative), (2) not enough workers (autovacuum_max_workers default 3 can’t cover dozens of hot tables simultaneously), (3) long-running transactions pin the visibility horizon so VACUUM has nothing to reclaim.Q: How do you validate that pg_repack completed cleanly without data loss?
A: Compare row counts and a checksum of a stable column (e.g., MD5(string_agg(id::text, ',')) over ordered rows) before and after. pg_repack ships with internal consistency checks, but a post-run validation query against a replica snapshot is the belt-and-suspenders check I always run.- PostgreSQL Docs — Routine Vacuuming
- Crunchy Data Blog — “VACUUM Strategy for Busy PostgreSQL Tables” (search: Crunchy Data vacuum strategy)
- “pg_repack — Reorganize tables in PostgreSQL databases with minimal locks” — official project README on GitHub
Design a DynamoDB table for a social media app with users, posts, and comments.
Design a DynamoDB table for a social media app with users, posts, and comments.
Redis is running out of memory. Walk me through your investigation.
Redis is running out of memory. Walk me through your investigation.
INFO memory— checkused_memory,used_memory_rss(actual OS allocation, which includes fragmentation),mem_fragmentation_ratio. If fragmentation ratio is > 1.5, Redis is wasting significant memory on fragmentation. Ifused_memoryis close tomaxmemory, eviction is being triggered.INFO keyspace— how many keys in each database? Is the key count growing unexpectedly?
MEMORY USAGE <key>on suspected large keys.redis-cli --bigkeys— scans the keyspace and reports the largest keys of each type. This often reveals a single Sorted Set or Hash that has grown to several GB.- For a deeper analysis:
redis-cli --memkeysor use Redis’sMEMORY DOCTORcommand.
- Missing TTLs: Keys that should expire but never do. Use
OBJECT IDLETIME <key>to find keys that have not been accessed in weeks but have no TTL. Set TTLs aggressively. - Unbounded data structures: A List or Sorted Set that grows without limit (e.g., appending every event without trimming). Fix with
LTRIM(keep last N elements) orZREMRANGEBYSCORE(remove entries older than X). - Serialization bloat: Large JSON strings stored as Redis Strings. Compress at the application level or switch to Redis Hashes for structured data (hashes use ziplist encoding for small objects, which is much more memory-efficient).
- Key fragmentation: Many small keys with verbose names. Shorten key names or group related data into Hashes.
- Scale vertically (larger instance).
- Implement Redis Cluster to shard across multiple nodes.
- Move cold or less latency-sensitive data to disk-based storage and keep only hot data in Redis.
SET-family command in production code to pass through a helper that enforces a TTL unless explicitly marked persistent, caught by a linter in CI.mem_fragmentation_ratio = used_memory_rss / used_memory. A ratio above 1.5 means the OS is holding ~50% more memory than Redis actually uses — memory allocator overhead from churn. Ratios under 1.0 mean Redis is swapping, which is a production emergency. Name the specific threshold when diagnosing.mem_fragmentation_ratio is 2.1. Do you restart Redis or try to fix it online?
A: Neither first — try MEMORY PURGE and CONFIG SET activedefrag yes. Active defragmentation runs in the background and slowly rewrites fragmented keys. Restarting Redis wipes the cache (massive stampede on the backing store) and is almost never the right first move.Q: You find a 4 GB Sorted Set that has no TTL and is growing. The team says “it’s our event log.” What do you recommend?
A: Redis is the wrong home for unbounded logs. Migrate to Redis Streams (has built-in MAXLEN trimming) or, better, to Kafka/Kinesis if the data is truly append-only. Interim fix: add a scheduled ZREMRANGEBYSCORE event_log -inf (now - 7d) cron to cap it at 7 days while the migration is planned.Q: After all fixes, you are still near maxmemory. What is your last-resort before vertical scaling?
A: Tune the eviction policy and audit the key distribution. If the workload is a cache, allkeys-lru (or allkeys-lfu for skewed access) lets Redis evict cold keys automatically. If the workload mixes cache and source-of-truth data, split them into separate instances — eviction should never touch data you cannot re-derive.- Redis Docs — Memory optimization
- Redis Docs — Key eviction
- “Redis Persistence Deep Dive” — Redis Labs engineering blog
When would you choose MongoDB over PostgreSQL?
When would you choose MongoDB over PostgreSQL?
- Highly variable schema with rapid iteration. If I am building a product where the data model changes weekly — different content types with different fields, A/B testing different data structures — MongoDB’s flexible schema means I can iterate without migration scripts. PostgreSQL’s JSONB can handle this too, but MongoDB’s query language is designed for it.
- Hierarchical document data. If my primary entity is a deeply nested document (a product catalog with variants, each variant with options, each option with regional pricing), MongoDB stores and retrieves this as a single document read. PostgreSQL would require multiple JOINs or JSONB with less ergonomic querying.
- Horizontal scaling as a primary requirement. If I know from day one that I need to shard across many nodes (multi-region, multi-tenant at massive scale), MongoDB’s native sharding is more mature than PostgreSQL’s horizontal scaling options (which require Citus or application-level sharding).
- When the team has deep MongoDB expertise and the use case does not require multi-document transactions, complex JOINs, or strong schema enforcement.
- When I need complex transactions across multiple entities (financial systems).
- When ad-hoc analytical queries with JOINs are a core requirement.
- When data integrity constraints (foreign keys, check constraints, unique constraints across related entities) are critical.
- When the team does not have MongoDB experience — PostgreSQL is a safer default because it handles a wider range of use cases well.
- “PostgreSQL JSONB — Use Cases and Limitations” — pganalyze blog
- MongoDB Docs — Data Modeling Introduction
- Martin Fowler — “Polyglot Persistence” (martinfowler.com/bliki/PolyglotPersistence.html)
Explain the difference between a hot partition and a hot key in DynamoDB. How do you diagnose and fix each?
Explain the difference between a hot partition and a hot key in DynamoDB. How do you diagnose and fix each?
PK=2024-04-09) means today’s partition receives all writes while yesterday’s is idle.A hot key is a single item that receives an extreme volume of reads or writes. Example: a configuration item (PK=CONFIG, SK=FEATURE_FLAGS) that every Lambda invocation reads on startup.Diagnosis:- CloudWatch
ConsumedReadCapacityUnitsandConsumedWriteCapacityUnitsmetrics — look for uneven distribution across partitions. - Enable DynamoDB Contributor Insights — it shows the most accessed partition keys and sort keys.
ThrottlingEventsmetric increasing while the table’s overall consumed capacity is well below provisioned capacity is the classic hot partition signal.
PK = CURRENCY#BTC, which became a hot partition during Bitcoin price spikes — every trade hit one partition. The fix was PK = CURRENCY#BTC#{random_suffix_0-99}, distributing writes across 100 virtual partitions. Reads aggregate across all suffixes, but writes scale linearly.Query calls, one per suffix, and merge results in the application. For strict ordering, sort after merging. The trade-off is N* the RCU cost per read, which is why you only write-shard items that are write-hot and read-infrequently (or read-aggregatable), like counters, time-series buckets, or leaderboards with periodic rollups.Q: You see throttling only on a GSI, not the base table. What does that mean?
A: GSIs have their own provisioned capacity, and writes to the base table replicate asynchronously to GSIs. A hot GSI partition throttles GSI writes, and back-pressure can slow base table writes too (the base table will throttle if the GSI cannot keep up). Diagnose with ConsumedWriteCapacity per-GSI CloudWatch metric and redesign the GSI partition key with the same high-cardinality principles.Q: Why not just switch to on-demand mode and let DynamoDB figure it out?
A: On-demand handles overall capacity elastically but does nothing for partition-level hot spots — the partition hardware limits (1000 WCU, 3000 RCU per partition) still apply. A hot partition on on-demand throttles just like on provisioned. On-demand also costs ~6.5x more per request at steady state. It’s a scaling convenience, not a hot-partition fix.- AWS Docs — Designing partition keys to distribute your workload
- AWS Blog — “Choosing the Right DynamoDB Partition Key”
- “The DynamoDB Book” by Alex DeBrie — chapter on partition key design
You run EXPLAIN ANALYZE on a PostgreSQL query and see a Nested Loop with loops=50000 and each iteration doing an Index Scan. The query takes 30 seconds. What is happening and how do you fix it?
You run EXPLAIN ANALYZE on a PostgreSQL query and see a Nested Loop with loops=50000 and each iteration doing an Index Scan. The query takes 30 seconds. What is happening and how do you fix it?
- Run
ANALYZEon both tables to update statistics. Rerun the query — the planner might switch to a Hash Join or Merge Join with accurate estimates. - Check for a missing index on the join column of the inner table. If the inner Index Scan is on a poorly selective index, adding a better one helps.
- Increase
work_memfor this query (SET work_mem = '256MB'). Hash Joins need memory to build the hash table — ifwork_memis too small, the planner avoids them. With more memory, the planner might choose a Hash Join (build a hash table from the smaller table, probe it once for each row of the larger table — O(N+M) instead of O(N*M)). - Force a different join strategy as a last resort:
SET enable_nestloop = offand rerun to see if a Hash Join or Merge Join is faster. This is a diagnostic tool, not a production fix — the real fix is making the planner’s estimates accurate.
ANALYZE restored the original plan — a two-second fix after two days of debugging the application layer.ANALYZE fix the plan but not prevent regression next week?
A: Autovacuum’s analyze_scale_factor default of 0.1 means stats only refresh after 10% of the table changes. On a large slow-growing table, that could be weeks. Lower it per-table (ALTER TABLE ... SET (autovacuum_analyze_scale_factor = 0.01)) for tables where the planner’s accuracy matters most.Q: EXPLAIN ANALYZE shows the plan, but the query was previously fast. How do you find the old plan to compare?
A: auto_explain with log_min_duration = 1s logs full plans for all slow queries — when deployed ahead of time, you have a historical record. Without it, pg_stat_statements at least shows mean/min/max execution time so you can confirm when the regression started, correlating with recent data-volume changes.Q: When is a Nested Loop actually the right choice?
A: When the outer relation is truly small (say, under 100 rows) and the inner has a selective index. Then N index lookups beat building a hash table. The planner is not wrong to pick Nested Loops — it is only wrong when its estimate of the outer relation size is.- PostgreSQL Docs — Using EXPLAIN
- Use The Index, Luke — “Joining Tables” chapter (use-the-index-luke.com/sql/join)
- pganalyze Blog — “Understanding PostgreSQL Query Plans”
Your MongoDB aggregation pipeline is slow. How do you optimize it?
Your MongoDB aggregation pipeline is slow. How do you optimize it?
.explain('executionStats') to see the query plan and understand which stages are slow.Step 2: Apply optimizations:- Move
$matchto the beginning.$match(and$sorton indexed fields) at the start of the pipeline can use indexes. After a$groupor$project, no indexes are available. - Add
$limitearly if you only need a subset of results.$match -> $sort -> $limit -> $groupprocesses far fewer documents than$match -> $group -> $sort -> $limit. - Check
$lookupperformance. If you are doing a$lookupon a large collection, ensure the joined collection has an index on theforeignField. Without it, every lookup does a collection scan. - Avoid
$unwindon large arrays before$groupif you can use array operators ($sum,$avgon array fields) instead.$unwindexplodes document count. - Use
allowDiskUse: trueif the pipeline exceeds 100 MB of RAM (the default limit for in-memory aggregation). But also investigate why it needs that much memory — it may indicate the pipeline is processing too many documents. - Consider a materialized view for aggregations that run frequently on the same data. Pre-compute the result and store it, refreshing periodically.
$lookup is the bottleneck, consider whether the data model should be restructured. A slow $lookup in a hot path often means the data should be denormalized (embedded in the parent document) rather than joined at query time.Real-World Example. MongoDB’s own engineering blog describes optimizing a dashboard aggregation that took 45 seconds on a 200M document collection. The fix was reordering from $lookup → $match → $group to $match → $lookup → $group, which reduced the input to $lookup by 99.5% and dropped the runtime to 600ms. The code change was five lines; the insight was understanding pipeline ordering.$merge or $out as the last stage of an aggregation. Name it when recommending moving expensive pipelines off the read path.$match stage is at the top but the plan still shows a collection scan. Why?
A: The $match references fields that are not indexed, OR references a field computed earlier in the pipeline (MongoDB cannot use an index after a $project or $addFields), OR the query uses operators like $regex with a leading wildcard that can’t use the index. Check the winningPlan.stage — if it’s COLLSCAN not IXSCAN, no index is being used.Q: When does allowDiskUse: true become a red flag rather than a feature?
A: If it’s set on a hot, real-time endpoint serving user requests. Disk-spilled aggregations are 10-100x slower than in-memory ones. allowDiskUse is appropriate for offline/batch jobs (reports, exports) but signals a fundamental design problem if production queries need it. Either reduce the working set with earlier filtering or precompute the aggregation.Q: How do you avoid pipeline regressions when the data grows 10x?
A: Add a performance regression test to CI using a scaled copy of production data. Run the pipeline with explain("executionStats") and assert totalDocsExamined stays below a threshold. This catches plan changes before they hit production — far cheaper than finding out at 3 AM that a query that was fine at 10M documents is catastrophic at 100M.- MongoDB Docs — Aggregation Pipeline Optimization
- MongoDB Docs — $lookup with foreignField indexes
- MongoDB Engineering Blog — “Optimizing MongoDB Aggregation Pipelines”
How does Redis Cluster handle a situation where you need to run a transaction across keys that are on different nodes?
How does Redis Cluster handle a situation where you need to run a transaction across keys that are on different nodes?
MULTI/EXEC and Lua scripts will fail because Redis cannot guarantee atomicity across nodes.Solutions:-
Hash tags (preferred). Force related keys to the same slot using
{tag}in the key name.{order:123}:itemsand{order:123}:totalboth hash onorder:123and land on the same slot. Design your key naming convention so that keys that need transactional access share a hash tag. - Application-level coordination. If the keys genuinely belong on different nodes, handle the “transaction” in application code with idempotent operations and compensating actions (saga pattern). This is complex and should be avoided if possible.
- Reconsider the architecture. If you frequently need transactions across unrelated keys, Redis Cluster might not be the right tool. Consider whether the operation can be redesigned as independent operations, or whether a database with distributed transactions (PostgreSQL, CockroachDB) should handle this part of the workload.
{guild:123}:typing and {guild:123}:presence — using hash tags so per-guild operations can be atomic. When they needed cross-guild aggregates, they moved that workload to a separate system (ScyllaDB) rather than fighting Redis Cluster’s constraints.{...} substring in a Redis key that forces all keys sharing the same tag into the same hash slot (and therefore the same node). Essential vocabulary for any Redis Cluster discussion — it is how you opt in to atomic multi-key operations.{user:42} goes to one node. If a user’s data grows huge or access becomes disproportionate, you lose the load-balancing benefits of Cluster. The fix is to shard within the tag ({user:42}:shard-0..N) or to accept that certain “whale” keys need dedicated infrastructure.Q: If I use a Lua script on Redis Cluster, what are the rules?
A: All keys the script touches must hash to the same slot — declared upfront in the KEYS array. Redis Cluster checks at script-load time. If your script needs to touch keys across slots, it fails. This is why hash tags exist: they let you group keys so a script can operate on them atomically.Q: Could I implement two-phase commit manually across Redis nodes?
A: You could, but you would be reinventing a distributed transaction coordinator on top of a system designed to avoid one. The added latency typically exceeds what Redis buys you vs a database that natively supports distributed transactions (CockroachDB, Spanner). If you need 2PC semantics, pick a database designed for them.- Redis Docs — Redis Cluster tutorial
- Redis Docs — Keys hashing tags
- Martin Kleppmann — “Designing Data-Intensive Applications,” Chapter 9 on distributed transactions
You notice that your DynamoDB table's consumed write capacity is only 20% of provisioned, but you are still seeing throttling. What is happening?
You notice that your DynamoDB table's consumed write capacity is only 20% of provisioned, but you are still seeing throttling. What is happening?
ThrottledRequests CloudWatch metric grouped by operation type.Fixes:- Short-term: Switch to on-demand capacity mode, which handles this better (DynamoDB automatically allocates capacity per-partition based on actual traffic).
- Medium-term: DynamoDB’s adaptive capacity feature (enabled by default) can redistribute unused capacity from cold partitions to hot ones, but it has limits and is not instantaneous.
- Long-term: Redesign the partition key. This is the real fix. If you are writing to
PK=STATUS#pendingand most orders are pending, that partition is always hot. Change toPK=USER#<userId>so writes distribute across users.
PK = EVENT_TYPE#checkout_succeeded. With most events being one type, one partition took all the writes. Switching to PK = MERCHANT_ID#EVENT_TYPE distributed writes across millions of merchants and resolved the throttling without changing the provisioned capacity.max(storage_size / 10 GB, WCU / 1000, RCU / 3000) — whichever is largest is your partition count. CloudWatch PartitionCount metric is now available on some regions/accounts too.Q: You redesign the partition key but have 2 TB of existing data. How do you migrate?
A: You cannot alter a partition key in place. Create a new table with the new schema, backfill via Scan with ParallelScan workers and BatchWriteItem into the new table, enable DynamoDB Streams on the old table to capture live changes during backfill, and cut over with a dual-write period. For multi-TB migrations, AWS Data Pipeline or Glue can parallelize the backfill.Q: A colleague suggests just increasing provisioned capacity 10x. What do you say?
A: Spending 10x on capacity to keep a broken partition key working is expensive and temporary — the hot partition still has its per-partition ceiling. You might spend 0 of partition key redesign would make unnecessary. It is the classic “treating the symptom” trap.- AWS Docs — DynamoDB adaptive capacity
- AWS Docs — Isolate frequently accessed items
- AWS Blog — “Choosing the Right DynamoDB Partition Key”
Compare PostgreSQL's JSONB with MongoDB for storing semi-structured data. When would you use each?
Compare PostgreSQL's JSONB with MongoDB for storing semi-structured data. When would you use each?
- ACID transactions that span both relational and JSONB columns. You can JOIN a JSONB column with a relational table in a single transactional query.
- GIN indexes on JSONB paths for fast queries (
CREATE INDEX ON products USING GIN (attributes jsonb_path_ops)). - No separate database to operate. If your primary data is already in PostgreSQL, adding JSONB columns avoids the operational cost of a second database.
- First-class document query language. Querying nested documents and arrays in MongoDB is more ergonomic than PostgreSQL’s
->,->>, and@>operators. - Native horizontal sharding. If you need to scale semi-structured data across many nodes, MongoDB’s sharding is more mature.
- Better developer experience for document-centric workloads. The MongoDB shell, Compass, and aggregation pipeline are purpose-built for working with documents.
CREATE INDEX ON products USING GIN (attributes) lets queries like WHERE attributes @> '{"color":"red"}' use an index instead of scanning every row. Mention it whenever discussing JSONB performance.-> or ->> operators for specific paths don’t use GIN — they use B-tree expression indexes instead (CREATE INDEX ON products ((attributes->>'color'))). GIN helps for containment (@>), existence (?), and path existence (@?) operators. Know which operator your query uses before picking the index type.Q: You need horizontal scaling for semi-structured data. Does that tip you to MongoDB?
A: Not automatically. PostgreSQL with Citus or Vitess-for-Postgres handles horizontal scaling of relational + JSONB data. Only if the entire data model is document-centric and you need sharding from day one does MongoDB’s native sharding meaningfully win. For everyone else, single-node PostgreSQL or Citus is simpler.- PostgreSQL Docs — JSON Types and JSON Functions and Operators
- “When to Use JSONB vs Normalized Tables” — pganalyze blog
- “PostgreSQL anti-patterns: Unnecessary JSON” — 2ndQuadrant blog
Your team is building a rate limiter. Compare implementing it in Redis vs DynamoDB. Which would you choose and why?
Your team is building a rate limiter. Compare implementing it in Redis vs DynamoDB. Which would you choose and why?
- Use a Sorted Set with timestamp scores.
ZADD rate:user:123 <timestamp> <request_id>,ZREMRANGEBYSCOREto remove old entries,ZCARDto count. - Or use Lua scripting for an atomic sliding window counter (INCR + EXPIRE in a single atomic operation).
- Latency: <1 ms. Perfect for inline rate limiting in the request path.
- Risk: if Redis goes down, rate limiting stops. Need Sentinel or Cluster for HA.
- Use a table with
PK=user_id, SK=timestamp. TTL automatically removes old entries. Count items with aQueryto check the rate. - Latency: 5-10 ms. Still fast, but 10x slower than Redis.
- Advantage: fully managed, no operational burden, survives without additional HA setup.
INCR with a rolling TTL approximates a sliding window.TIME command (server time) instead of client-side timestamps, and for multi-node Redis Cluster, accept that cross-node clock drift makes global sliding windows approximate. Token bucket is more robust to clock drift because it only needs elapsed time, not absolute time.Q: How do you rate-limit per-tenant in a multi-tenant API without one tenant starving another?
A: Hierarchical rate limits: a global token bucket caps total API throughput, and per-tenant buckets cap each tenant’s share. A request must pass both buckets to be allowed. This prevents one tenant from exhausting the global bucket while still enforcing fair per-tenant slicing. Stripe and AWS implement this exact hierarchy.- Stripe Blog — “Scaling your API with rate limiters”
- Redis Docs — Rate Limiting pattern
- Cloudflare Blog — “How we built rate limiting capable of scaling to millions of domains”
Explain what happens internally when you run an UPDATE on a row in PostgreSQL. Trace the full lifecycle from client to disk.
Explain what happens internally when you run an UPDATE on a row in PostgreSQL. Trace the full lifecycle from client to disk.
- Client sends
UPDATE users SET name = 'Jane' WHERE id = 42to the PostgreSQL backend process over the connection. - Parser converts the SQL text into a parse tree. Rewriter applies any rules (updatable views, etc.). Planner generates an execution plan (likely an Index Scan on the primary key to find row 42).
- Executor locates the target row. PostgreSQL does NOT update the row in place. Instead, it:
- Marks the old row version as “dead” by setting its
xmaxto the current transaction ID. - Inserts a new row version at a new location on the heap with the updated values. The new version’s
xminis set to the current transaction ID.
- Marks the old row version as “dead” by setting its
- Index entries that point to the old row must also be updated. If the new row version is on the same page and it is a HOT (Heap-Only Tuple) update, the index does not need updating (a pointer chain in the heap connects old to new). If it is on a different page, new index entries are created.
- WAL record describing both the old row’s xmax change and the new row’s insertion is written to the WAL buffer.
- On COMMIT: The WAL buffer is flushed to disk (
fsynctopg_wal/). This is the durability guarantee. The actual heap pages are NOT flushed yet — they remain “dirty” in the shared buffer pool. - Background writer and checkpointer eventually flush dirty pages from the buffer pool to the data files on disk.
- VACUUM (later) will reclaim the space occupied by the dead row version, once no active transaction can see it.
autovacuum_vacuum_scale_factor = 0.02 and adding fillfactor = 70 to leave room for HOT updates. Write throughput tripled without any schema change.fillfactor < 100 is a write-performance optimization — leaving free space per page enables HOT updates.fsync on WAL completes, the transaction is durable even if the heap pages are still in memory. Mention WAL when explaining why fsync is the real commit cost, not the heap write.fsynced) and lets the background writer and checkpointer flush heap pages opportunistically. On crash recovery, WAL replay reconstructs any heap pages that were not yet flushed. This is the write-ahead log protocol — the foundation of every production RDBMS.Q: How does this interact with replication?
A: Streaming replication sends WAL records to replicas in real time. The replica’s walreceiver writes WAL to local disk, then startup replays WAL records against the replica’s heap. This is why replication in PostgreSQL is bit-for-bit identical to the primary — replicas are literally replaying the same WAL. It is also why logical replication (Debezium CDC) decodes from the same WAL stream.- PostgreSQL Docs — MVCC and WAL Internals
- “The Internals of PostgreSQL” by Hironobu Suzuki — Chapter 5 (Concurrency Control) and Chapter 9 (WAL)
- pganalyze Blog — “How PostgreSQL’s UPDATE works internally”
Interview Deep-Dive Questions
These questions go beyond surface-level knowledge. Each one is designed to simulate a real senior-level interview chain — an initial question followed by the kind of probing follow-ups an experienced interviewer uses to separate candidates who know about databases from candidates who have operated them in production.How would you debug a PostgreSQL query that was fast yesterday but is slow today, with no schema or code changes?
Difficulty: Senior What the interviewer is really testing: Systematic debugging methodology, understanding of statistics-driven query planning, and awareness that the database is a living system where performance changes without code changes. Strong Answer: The key insight is that query performance depends on the plan, and the plan depends on statistics — not just the query text. No code change does not mean nothing changed. Step 1: Compare query plans. RunEXPLAIN (ANALYZE, BUFFERS) on the slow query now. If you have the old plan (from pg_stat_statements or auto_explain logs), compare them side by side. Look for a plan shape change — did it switch from an Index Scan to a Sequential Scan? Did the join strategy change from Hash Join to Nested Loop?
Step 2: Check statistics freshness. The most common cause is stale statistics after a data distribution change. Maybe a bulk import ran last night and doubled the table size, or a migration changed the data distribution (e.g., a status column went from 50% active to 95% active). The planner now estimates selectivity wrong.
- Run
ANALYZEon the involved tables and re-execute. If it fixes the problem, the root cause was stale statistics. Set more aggressiveautovacuum_analyze_scale_factoron these tables.
- Check
pg_stat_user_tablesforn_dead_tup. If it is high, VACUUM has not kept up.
shared_buffers and pushing the query’s working set out of the buffer cache. Check BUFFERS output in EXPLAIN — if shared read (disk) is now much higher than shared hit (cache), the buffer pool is under pressure.
Step 5: Check pg_stat_activity for long-running transactions. An idle-in-transaction session can hold back the visibility horizon, forcing PostgreSQL to examine more row versions.
The meta-principle: in PostgreSQL, “nothing changed” almost never means nothing changed. Data volume changed, data distribution changed, vacuum state changed, or concurrent workload changed. The query planner is a function of statistics, and statistics are a function of data.
Follow-up: What is correlation in PostgreSQL statistics and how can it cause a plan change?
Answer: Correlation measures how closely the physical ordering of rows on disk matches the logical ordering of values in an index. It ranges from -1 to 1. A correlation of 1.0 means rows with sequential index values are stored on sequential disk pages (the table is physically sorted by that column). A correlation near 0 means the values are randomly scattered across pages. This matters because it determines whether an Index Scan is cheap or expensive. With high correlation, fetching 1,000 rows via an index reads a small number of sequential pages (the rows are physically adjacent). With low correlation, those 1,000 rows might be on 1,000 different pages — 1,000 random I/O operations. After a bulk load or major table reorganization, correlation can change dramatically. The planner sees this inpg_statistic and may switch from an Index Scan (which was efficient when rows were physically ordered) to a Bitmap Index Scan or Sequential Scan (which handle random physical ordering better). This causes a “slow query with no code change” because the data layout changed.
You can check correlation with:
CLUSTER tablename USING index_name physically reorders the table to match the index. But CLUSTER takes an exclusive lock on the table — in production, use pg_repack instead for an online reorder.
Follow-up: How would you set up proactive monitoring so this kind of regression does not page you at 3 AM?
Answer: Three layers of defense:-
Query performance baselines with
pg_stat_statements. Trackmean_exec_timeandcallsfor your top 50 queries. Alert when any query’s mean execution time exceeds 3x its rolling 7-day average. This catches plan regressions within minutes. -
auto_explain extension. Configure
auto_explain.log_min_duration = '1s'to automatically log the full EXPLAIN plan for any query exceeding 1 second. When a regression happens, you already have the bad plan in the logs — no need to reproduce it. -
Statistics freshness monitoring. Track
last_autoanalyzeinpg_stat_user_tables. Alert if any high-traffic table has not been analyzed in over 24 hours, or ifn_dead_tup / n_live_tupexceeds 10%. This catches the root cause (stale stats, vacuum falling behind) before it causes a plan regression.
Going Deeper: You have auto_explain enabled and you can see that the planner’s row estimate was off by 1000x. Statistics are fresh. What else could cause such a massive estimation error?
Answer: Fresh statistics can still produce wildly wrong estimates in several scenarios:-
Correlated columns. PostgreSQL assumes column values are independent. If you query
WHERE country = 'Japan' AND language = 'Japanese', PostgreSQL multiplies the selectivity of each predicate independently. It might estimate 2% of rows matchcountry = 'Japan'and 1% matchlanguage = 'Japanese', giving 0.02% combined. But in reality, 95% of Japan rows have Japanese — the actual selectivity is close to 2%, not 0.02%. That is a 100x error. PostgreSQL 14+ has extended statistics (CREATE STATISTICS ... (dependencies, ndistinct, mcv) ON country, language FROM users) that model multi-column correlations. -
Parameterized queries with generic plans. When using prepared statements, PostgreSQL starts with custom plans (per-parameter) but after 5 executions switches to a generic plan (parameter-agnostic). If the generic plan is optimal for typical values but terrible for an outlier value (e.g.,
status = 'deleted'matches 0.01% of rows but the generic plan is optimized forstatus = 'active'which matches 90%), you get a bad plan for the outlier. Settingplan_cache_mode = force_custom_planforces per-parameter planning at the cost of planning overhead. -
Statistics target too low. The default
default_statistics_targetis 100, meaning PostgreSQL samples 300 * 100 = 30,000 rows for histogram construction. For a table with 500M rows and a highly skewed distribution, 30,000 samples may miss important value frequencies. Increase the target:ALTER TABLE t ALTER COLUMN c SET STATISTICS 1000.
Walk me through how you would design a DynamoDB table for a multi-tenant SaaS platform where each tenant can have vastly different data volumes.
Difficulty: Staff-Level What the interviewer is really testing: Real-world DynamoDB modeling at scale, understanding of the noisy-neighbor problem, and the ability to design for both small and large tenants in the same table. Strong Answer: This is one of the hardest DynamoDB design problems because of the noisy-neighbor effect — a single large tenant can consume a disproportionate share of partition throughput, causing throttling for other tenants. Key design decisions: 1. Partition key strategy. UseTENANT#<tenantId> as the partition key prefix. For small tenants (under 10 GB of data), all their data lives under one partition key prefix and queries are simple Query(PK = TENANT#small-co).
For large tenants (hundreds of GBs, thousands of RPS), a single partition key creates a hot partition. Apply write sharding: TENANT#big-co#SHARD-0 through TENANT#big-co#SHARD-15. The application hashes or round-robins writes across shards. Reads require scatter-gather across all shards. Store the shard count in a metadata item so the application knows how many shards to query.
2. Sort key design. Use composite sort keys with entity type prefixes: RESOURCE#<type>#<timestamp>#<id>. This lets you query all resources of a type sorted by time, or get a specific resource by its full key.
3. Tenant tier metadata table. Maintain a lightweight table (or a DynamoDB item) that stores each tenant’s tier: small, medium, large. This drives the application logic for shard count and capacity allocation. When a tenant grows past a threshold, a migration job reshards their data.
4. Cost isolation. On provisioned mode, you cannot isolate capacity per-tenant within a single table. Options:
- Use on-demand mode so hot tenants pay per request and do not affect cold tenants’ throughput.
- For enterprise tenants with strict SLAs, provision a separate table (table-per-tenant for your top 5% tenants). This gives hard capacity isolation at the cost of operational complexity.
- Use DynamoDB resource-based policies and AWS Organizations for billing isolation.
PK = STATUS puts all active items on one partition — bad for multi-tenant. Instead use PK = TENANT#<tenantId>#STATUS#active so each tenant’s active items land on their own GSI partition.
The honest trade-off: single-table multi-tenant DynamoDB is operationally simpler but creates coupling between tenants. Table-per-tenant gives isolation but multiplies operational burden. Most SaaS platforms start with single-table and graduate their largest tenants to dedicated tables as they grow.
Follow-up: A large tenant does a bulk import of 10 million records. How do you handle this without impacting other tenants?
Answer: Bulk imports are dangerous in shared DynamoDB tables because they can spike write capacity and trigger throttling across all partitions. Approach:-
Use
BatchWriteItemwith rate limiting. Do not just fire-hose writes. Implement client-side rate limiting (e.g., 500 WCU/sec) with exponential backoff onUnprocessedItems. This keeps writes below the partition-level throughput ceiling. - Spread writes across shards. If the tenant has 16 shards, distribute bulk writes evenly across all 16 partition keys. This gives 16x the per-partition throughput ceiling.
- Temporarily scale up. If using provisioned mode, increase WCU before the import and scale back down after. If using on-demand, DynamoDB auto-scales but has a warm-up period — if the table has been seeing 1,000 WCU and you suddenly hit 50,000, you will be throttled until adaptive scaling kicks in (minutes). Pre-warm by gradually ramping up write traffic.
- Consider an offline path. For truly massive imports, write to S3 and use the DynamoDB Import from S3 feature (launched 2023). It provisions temporary capacity separate from the online table, avoiding any impact on live traffic. The data appears in the table once the import completes.
- If isolation is critical, import into a staging table first. Validate the data, then use DynamoDB Streams to replicate the items into the production table at a controlled rate.
Follow-up: How would you implement per-tenant rate limiting at the DynamoDB level?
Answer: DynamoDB itself does not have per-tenant rate limiting built in — capacity is shared across the table. So you implement it at the application layer: Option 1: Token bucket in Redis. Before every DynamoDB operation, the application checks a per-tenant token bucket in Redis. Each tenant gets a bucket sized to their tier (free tier: 100 WCU, enterprise: 10,000 WCU). If the bucket is empty, the request is queued or rejected. This is the most common pattern — fast, centralized, and independent of DynamoDB. Option 2: API Gateway throttling per API key. If tenants are identified by API keys, configure per-key rate limits in API Gateway or your load balancer. This limits the request rate before it reaches DynamoDB. Less granular (limits requests, not capacity units) but simpler. Option 3: Application-level middleware. Track consumed capacity per-tenant using DynamoDB’sReturnConsumedCapacity response field. Maintain running counters per tenant and throttle when a tenant exceeds their allocation.
The key insight is that DynamoDB’s capacity model is table-level, not tenant-level. Multi-tenancy rate limiting is always an application concern. This is a fundamental architectural limitation you accept when choosing a shared-table approach.
Your company runs a critical MongoDB replica set. The primary just went down. Walk me through exactly what happens next and what could go wrong.
Difficulty: Senior What the interviewer is really testing: Understanding of MongoDB’s consensus mechanism, the realities of failover, and awareness that “automatic failover” is not as magical as it sounds. Strong Answer: When the primary becomes unreachable, here is the sequence: 1. Detection phase (seconds 0-10). The remaining secondaries notice they have not received a heartbeat from the primary within theelectionTimeoutMillis window (default: 10 seconds). This is not instant — there is a built-in delay to avoid false positives from transient network blips.
2. Election phase (seconds 10-12). An eligible secondary calls an election. MongoDB uses a Raft-like consensus protocol. To win, a candidate needs votes from a majority of the replica set members. With a 3-node set (1 down), the remaining 2 nodes form a majority and can elect a new primary. With a 2-node set plus an arbiter, the secondary plus arbiter form the majority.
3. The new primary accepts writes (seconds 12-15). The winning secondary becomes the new primary and starts accepting writes. Total failover time: typically 10-15 seconds, though it can be longer under heavy load.
What could go wrong:
- No majority. If you have a 3-node set and 2 nodes go down, the remaining node cannot form a majority and will not become primary. The replica set is read-only until a majority of nodes recover. This is why the minimum production deployment is 3 nodes across at least 2 data centers (or 2 nodes + 1 arbiter).
-
Writes lost during failover. If the old primary accepted writes with
w: 1(the default) and went down before replicating those writes to a secondary, those writes are gone. When the old primary comes back, it has writes that the new primary does not have — these become rollback writes and are placed in a rollback directory. This is whyw: "majority"matters for critical data. - Driver connection handling. Client drivers need to discover the new primary. Smart drivers (using the replica set connection string) handle this automatically — they detect the topology change and redirect writes. But there is a brief window where the application throws connection errors. If your application does not retry writes, user requests fail during failover. If it does retry writes, they must be idempotent or you risk duplicate operations.
-
Election ties and priority. If two secondaries have the same optime (are equally up-to-date), priority settings determine who wins. If priority is misconfigured, a secondary in a remote data center might become primary, increasing latency for all write operations. Set
priority: 0on nodes that should never become primary (analytics replicas, cross-region replicas used only for reads).
Follow-up: After failover, the old primary comes back online with writes that the new primary does not have. What happens to those writes?
Answer: Those writes enter a rollback. The old primary detects that its oplog has diverged from the current primary’s oplog. It identifies the point of divergence and undoes all operations after that point, saving the rolled-back data to a specialrollback/ directory in the data directory.
The rollback data is stored as BSON files that you can inspect and manually re-apply if needed. But in practice, most teams treat rollbacks as data loss — the data was never acknowledged as durable (since w: 1 only guaranteed the primary, which failed).
The maximum rollback size is 300 MB by default. If more than 300 MB of data needs to be rolled back, MongoDB refuses to roll back automatically and the node shuts down, requiring manual intervention.
This is the strongest argument for w: "majority" on any write you cannot afford to lose. With majority write concern, a write is not acknowledged until it is on at least 2 of 3 nodes. If the primary dies, the surviving majority already has the data — no rollback needed.
Follow-up: How would you design your application to handle the 10-15 second failover window gracefully?
Answer: Three strategies layered together: 1. Retries with backoff. Configure your MongoDB driver withretryWrites: true (default since MongoDB 4.2 drivers). The driver automatically retries certain write operations once on a new primary. For reads, use retryReads: true. These handle transient errors during failover transparently.
2. Circuit breaker at the application layer. If the database is unreachable for more than a few seconds, trip a circuit breaker that returns degraded responses (cached data, a “try again shortly” message) instead of queuing up requests that will all timeout. Without this, you get a thundering herd of retries the moment the new primary is available.
3. Write idempotency. Since retries may re-execute a write that actually succeeded (the primary crashed after persisting but before acknowledging), every write operation must be idempotent. Use unique identifiers in your writes (updateOne with upsert: true and a unique constraint) rather than operations that are not idempotent like $inc without a transaction.
The honest truth: 10-15 seconds of write unavailability is the cost of automatic failover. If your business cannot tolerate this, you need to look at architectures that decouple the write path from the database — buffer writes in a message queue (SQS, Kafka) and drain them to MongoDB asynchronously.
Explain the Redis single-threaded model. How can it be fast when it only uses one core?
Difficulty: Intermediate What the interviewer is really testing: Understanding of event-driven architecture, I/O multiplexing, and the difference between CPU-bound and I/O-bound workloads. Strong Answer: The way I think about this is: Redis is fast because it is single-threaded, not despite it. Redis uses an event loop with I/O multiplexing (epoll on Linux, kqueue on macOS). A single thread monitors all client connections simultaneously. When data arrives on any connection, the event loop processes the command, performs the in-memory operation, and queues the response — all without context switches.
Why this works:
- All data is in memory. The bottleneck for most databases is disk I/O. Redis eliminates this entirely. An in-memory hash lookup takes ~100 nanoseconds. At that speed, a single core can process hundreds of thousands of operations per second because the CPU never waits for anything.
- No context switching overhead. A multi-threaded design with locks would spend more time on lock acquisition, cache invalidation, and context switching than on actual data operations. For sub-microsecond operations, the coordination cost of multithreading exceeds the operation cost itself.
-
No lock contention. Since one thread processes all commands sequentially, there is no need for mutexes, spinlocks, or atomic operations on data structures. Every command sees a consistent state. This is what makes Lua scripts and
MULTI/EXECtransactions atomic for free — nothing else can interleave. - Pipeline-friendly. Clients can send multiple commands without waiting for individual responses. The event loop processes them in sequence and sends all responses back in one batch. Pipelining amortizes the network round-trip cost, and the single thread chews through pipelined commands at memory speed.
KEYS * scan on 10 million keys, a SORT on a large list, or a Lua script doing heavy computation blocks the event loop for other clients. One slow command blocks everything.
The modern nuance (Redis 6+): Redis is no longer purely single-threaded. The main event loop is still single-threaded, but I/O operations (reading from and writing to sockets) are now offloaded to I/O threads. This helps when the bottleneck is network I/O serialization/deserialization rather than command execution. On high-bandwidth workloads (large values), I/O threading can improve throughput by 2x. But the command execution is still single-threaded — that is the correctness guarantee.
Follow-up: If Redis is single-threaded, how does it handle persistence (RDB snapshots, AOF rewrites) without blocking clients?
Answer: Redis usesfork(). When it is time to create an RDB snapshot or rewrite the AOF, Redis forks the process. The child process gets a copy-on-write snapshot of the entire memory space. The child writes the snapshot to disk while the parent continues serving requests.
Copy-on-write means the child initially shares all memory pages with the parent. Only when the parent modifies a page (due to a write command) does the OS create a separate copy of that page. For a read-heavy workload, almost no extra memory is needed. For a write-heavy workload, the memory usage during a snapshot can temporarily double in the worst case (every page gets modified).
This is why you should never run Redis on a machine with swap enabled — if the fork causes memory to temporarily exceed physical RAM, the OS starts swapping, and a single swapped page turns a 100-nanosecond operation into a 10-millisecond operation. The latency spike is catastrophic.
The fork() call itself can also cause a latency spike. On a Redis instance with 25 GB of data, fork() takes 10-20 ms on Linux (copying page table entries). During that time, the event loop is blocked. This is the latest_fork_usec metric in INFO stats — monitor it.
Follow-up: A developer proposes running multiple Redis instances on the same server to utilize all CPU cores. What are the trade-offs?
Answer: This is actually a valid and common pattern — it is how Redis was designed to scale vertically before Redis Cluster existed. Benefits:- Uses all cores. An 8-core server runs 8 Redis instances, each handling its own keyspace.
- Each instance has its own event loop, so a slow command on one instance does not affect others.
- Memory overhead. Each instance has its own memory space. You cannot share data between instances without network calls. If you have 64 GB of RAM and run 8 instances, each gets ~8 GB (minus OS overhead). No benefit of shared caching.
- Operational complexity. 8 instances means 8 ports, 8 configs, 8 monitoring endpoints, 8 persistence schedules. Your deployment and monitoring tooling must handle this.
fork()contention. If multiple instances trigger RDB snapshots simultaneously, the server faces multiplefork()calls, potentially doubling memory usage across all instances at once. Stagger snapshot schedules.- Application routing. The application must know which instance holds which data. This is essentially manual sharding, with all the routing complexity that entails.
You are migrating a PostgreSQL table with 2 billion rows from one database to another with zero downtime. How?
Difficulty: Staff-Level What the interviewer is really testing: Production migration experience, understanding of data consistency guarantees, and risk management under real-world constraints. Strong Answer: A 2-billion-row table with zero downtime rules out any strategy that requires locking the source table or stopping writes. This is a CDC (Change Data Capture) based migration. Phase 1: Initial bulk copy.- Set up logical replication from the source table using PostgreSQL’s logical replication or a tool like
pglogical. Alternatively, usepg_dump --no-synchronized-snapshotsin parallel segments to speed up the initial copy. - For a 2B row table, the initial copy might take hours or days depending on row size and network bandwidth. The key is to start the CDC capture before the bulk copy begins, so that any writes during the copy are recorded.
- While the bulk copy runs, the source continues receiving writes. Logical replication automatically streams these changes (inserts, updates, deletes) to the target.
- After the initial copy completes, the replication stream catches up. Monitor the replication lag until it drops to sub-second.
- Deploy application code that reads from both databases for a percentage of traffic. Compare results. Log discrepancies. This catches issues like missing rows, data type mismatches, or timezone handling differences.
- Run consistency checks:
SELECT COUNT(*), checksum comparisons on key columns, and spot-check critical business queries.
- When confident in data consistency, perform the cutover in a maintenance window as small as possible:
- Stop writes to the source table (or set it to read-only).
- Wait for the last CDC events to replicate (seconds).
- Verify final consistency.
- Switch the application’s write connection to the new database.
- Switch reads to the new database.
- Total write downtime: seconds, not minutes. If you need truly zero write downtime, use a dual-write approach: the application writes to both databases simultaneously during cutover, and you switch reads once you verify consistency.
- Keep the old database running for at least 2 weeks. If anything goes wrong, the rollback is: switch connections back to the old database. If you used dual-write during cutover, both databases have the latest data and rollback is instant.
- Sequences. If the new table uses a sequence for primary keys, ensure the sequence value is set higher than the maximum migrated ID. Otherwise, new inserts collide with migrated rows.
- Triggers and constraints. Logical replication replays raw DML. If the target table has triggers that fire on INSERT (audit logging, denormalization), they will fire for every replicated row — potentially billions of trigger executions. Disable non-essential triggers during the bulk copy phase.
- Large objects and TOAST. Rows with large text or JSONB columns are stored in TOAST tables. Logical replication handles this, but it increases replication bandwidth significantly. Monitor network throughput.
- Schema differences. If the target schema has additional NOT NULL columns or stricter constraints, replicated rows may violate them. Validate schema compatibility before starting.
Follow-up: How would you handle the migration if you also need to transform the data during migration — for example, splitting one table into two?
Answer: Now you cannot use simple logical replication because the target schema differs from the source. You need a transformation layer in between. Option 1: CDC with a stream processor. Use Debezium to capture changes from the source PostgreSQL WAL and publish them to Kafka. A stream processing consumer (Kafka Streams, Flink, or even a simple application) reads the CDC events, applies the transformation (splitting columns into two tables, reformatting data), and writes to the target tables. This gives you a decoupled, replayable transformation pipeline. Option 2: Dual-write at the application layer. Modify the application to write to both the old single table and the new split tables. Backfill the new tables from the old table using a batch job. Once the backfill completes and the dual-write has been running long enough for consistency, cut over reads to the new tables and remove the dual-write. Option 3: Materialized view or trigger-based replication. Create triggers on the source table that, on every INSERT/UPDATE/DELETE, write the transformed data to the target tables (which can be in the same or a different database via dblink or FDW). Backfill the historical data with a batch job. This is the simplest approach for same-server migrations but does not scale well for cross-server migrations. The choice depends on latency requirements, operational maturity, and whether the team already runs Kafka/Debezium. For most teams, Option 2 (dual-write with backfill) is the simplest to implement and reason about, even if it requires a temporary code path in the application.Going Deeper: The dual-write approach has a well-known consistency problem. What is it and how do you mitigate it?
Answer: The problem: partial write failures. If the application writes to Database A successfully but the write to Database B fails (network error, timeout, constraint violation), the two databases are now inconsistent. Retrying the failed write might succeed, but if the application crashes between the first write and the retry, the data is permanently divergent. This is fundamentally the two-phase commit problem across independent databases — and there is no general solution that preserves both consistency and availability. Pragmatic mitigations:- Write to the source of truth first, then asynchronously replicate. Make Database A the authoritative source. After writing to A, publish an event (to a transactional outbox table in A, then to Kafka). A consumer applies the write to B. If B fails, the event stays in the queue and retries. This is the transactional outbox pattern — it guarantees at-least-once delivery to B at the cost of temporary inconsistency.
- Reconciliation job. Run a periodic job that compares the two databases and fixes discrepancies. This is your safety net — the dual-write handles the happy path, the reconciliation catches edge cases.
- Accept temporary inconsistency. If the application can tolerate a few seconds of divergence (and for most applications it can), the combination of “write to both + reconciliation” is sufficient. Perfect consistency across two independent databases requires distributed transactions, which most teams rightly avoid.
Your team uses Redis for session storage. An engineer proposes switching to DynamoDB to reduce operational burden. Evaluate this proposal.
Difficulty: Senior What the interviewer is really testing: Ability to evaluate a real architectural trade-off with nuance, not just pick a side. Strong Answer: This is a reasonable proposal. Let me evaluate it honestly by weighing both sides. Arguments for switching to DynamoDB:- Operational simplicity. DynamoDB is fully managed — no patching, no failover configuration, no memory sizing, no eviction policy tuning. Redis (even managed ElastiCache) requires capacity planning, persistence configuration, and monitoring for fragmentation and memory pressure.
- Durability by default. DynamoDB writes are durable to disk. If a node fails, no data is lost. Redis, even with AOF, can lose the last second of data. For session data in a regulated environment (healthcare, finance), durability matters.
- Scaling is automatic. DynamoDB on-demand mode scales to any traffic level without intervention. Redis requires manual scaling (resizing the instance or adding cluster nodes).
- TTL built-in. DynamoDB’s TTL feature automatically deletes expired sessions at zero cost. In Redis, expired keys are cleaned up lazily (on access) or by the active expiry cycle, which can lag under memory pressure.
- Latency. Redis serves sessions in 0.1-0.5 ms. DynamoDB serves them in 3-10 ms. That is a 10-50x latency increase on every authenticated request. If your API’s p99 target is 50 ms and session lookup adds 10 ms, you have used 20% of your latency budget on one operation.
- Cost at scale. At 100,000 session reads per second (a large web app), Redis on a single
r6g.xlargecosts roughly 5,000/month. Provisioned mode reduces this but requires capacity planning — negating the “less operational burden” benefit. - Data structure richness. If sessions contain complex data (nested objects, arrays, counters that increment), Redis Hashes with
HINCRBYandHSETare more expressive than DynamoDB’s update expressions. If sessions are simple key-value blobs, this does not matter.
Follow-up: The engineer says “latency does not matter because we will put DAX in front of DynamoDB and get the same latency as Redis.” Is this correct?
Answer: Partially correct, but with important caveats. DAX can serve cached reads in ~200 microseconds — comparable to Redis. So for session reads (which dominate), the latency would be similar. But:- Session writes still hit DynamoDB directly. DAX is a read-through/write-through cache. Writes go to DynamoDB (3-10 ms) and the DAX cache is updated asynchronously. If a user logs in (write) and immediately makes a request (read), the read might hit DAX before the write has propagated — returning a stale “no session” result. This is a consistency window of typically <10 ms but it can cause authentication flicker.
- DAX adds another managed service. You have DynamoDB + DAX. The “reduce operational burden” argument weakens because now you are operating a cache in front of a database… which is what you were doing with Redis in the first place, just with different branding.
-
DAX cost. DAX runs on dedicated nodes (
dax.r5.large~600+/month for DAX on top of DynamoDB costs approaches or exceeds managed Redis pricing.
What are the real-world differences between PostgreSQL’s SERIALIZABLE isolation and DynamoDB’s transactions?
Difficulty: Senior What the interviewer is really testing: Deep understanding of transaction models across different database paradigms, and whether the candidate understands that “transactions” means very different things in different databases. Strong Answer: These are fundamentally different transaction models solving different problems. PostgreSQL SERIALIZABLE:- Provides true serializability — the gold standard of isolation. The result is equivalent to some serial ordering of all concurrent transactions, even though they actually executed concurrently.
- Implemented via SSI (Serializable Snapshot Isolation). PostgreSQL tracks read and write dependencies between concurrent transactions. If it detects a dependency cycle that could produce a non-serializable result, it aborts one transaction with a serialization failure. The application must retry.
- Scope: any number of tables, any number of rows, any query pattern. You can read 50 tables and write 30 in a single serializable transaction. The database guarantees correctness.
- Cost: higher abort rates under contention. The more concurrent transactions touch overlapping data, the more serialization failures you get. At extreme contention, throughput collapses.
TransactWriteItems / TransactGetItems):
- Provides atomicity and isolation for a batch of up to 100 items across one or more tables. Either all operations succeed or none do.
- Isolation is “serializable” at the item level — concurrent transactions on the same items are serialized. But there is no dependency tracking across arbitrary queries. You specify exactly which items participate in the transaction upfront by their primary keys.
- No read-then-write logic inside the transaction itself. You can do conditional writes (
ConditionExpression) to check a value before writing, but you cannot do a query, make a decision, and then write — that must be done in application code across two API calls. - Cost: transactions consume 2x the normal WCU/RCU. A transaction that writes 5 items costs 10 WCU. DynamoDB also has a 25-item limit per
TransactWriteItemscall (though you can include up to 100 items across reads and writes in a mixed transaction).
| Aspect | PostgreSQL SERIALIZABLE | DynamoDB Transactions |
|---|---|---|
| Scope | Arbitrary queries, any number of rows | Up to 100 explicitly named items |
| Logic | Full SQL with subqueries, CTEs, JOINs | Conditional expressions on individual items |
| Conflict detection | Automatic dependency tracking | Optimistic concurrency on named items |
| Failure mode | Serialization failure (retry needed) | Transaction cancelled (retry needed) |
| Cost model | Same as normal queries (slightly more CPU) | 2x capacity consumption |
| Scale | Single database instance | Distributed, any scale |
ConditionExpression on each, and submit them as a TransactWriteItems batch. The DynamoDB approach works but requires more upfront design and cannot handle dynamic logic (like “find the cheapest available item and reserve it” — because you do not know the item key until you query).
Follow-up: When would you recommend DynamoDB transactions over PostgreSQL SERIALIZABLE?
Answer: When you need transactional guarantees at a scale where a single PostgreSQL instance cannot keep up, and your transaction scope is bounded and predictable. Concrete example: a gaming platform where each match result updates the stats of 2-10 players atomically. You know exactly which player items you are updating (by their player IDs). You need this to work at 50,000 transactions per second across millions of players. PostgreSQL’s SERIALIZABLE cannot handle this throughput on a single instance, and distributed PostgreSQL (Citus, CockroachDB) adds complexity. DynamoDB handles it natively because each transaction touches a small, well-defined set of items that distribute across partitions. The rule of thumb: if your transactions are bounded, key-addressable, and high-throughput, DynamoDB transactions are an excellent fit. If your transactions are ad-hoc, query-driven, or involve complex business logic, PostgreSQL is the right tool.A Redis Cluster node goes down and some commands start returning MOVED and ASK errors. Explain what is happening and how the client should handle it.
Difficulty: Intermediate What the interviewer is really testing: Understanding of Redis Cluster’s slot-based routing and the client-server protocol during topology changes. Strong Answer: MOVED errors mean the client sent a command to the wrong node for that key’s hash slot. The error includes the correct node address:MOVED 3999 127.0.0.1:6381. This happens when:
- The client’s slot-to-node mapping is stale (a node went down and slots were reassigned).
- The client has not yet discovered the cluster topology.
CLUSTER SLOTS or CLUSTER NODES, then re-send the command to the correct node. Smart Redis clients (like ioredis, redis-py-cluster) handle this transparently. They cache the slot map and refresh it on MOVED errors.
ASK errors mean a slot is in the process of being migrated from one node to another. The slot is partially on the old node and partially on the new node. The error says: “this specific key has already been migrated to the new node, try there.” The error format is ASK 3999 127.0.0.1:6382.
The correct client behavior for ASK is different from MOVED:
- Send
ASKINGto the target node (this tells the target to accept a command for a slot it does not yet fully own). - Re-send the original command to the target node.
- Do NOT update the slot map. Unlike MOVED, ASK is temporary. The slot is still being migrated. Future commands for other keys in this slot should still go to the old node (if those keys have not migrated yet).
cluster-node-timeout is configured). During the failover window (typically 15-30 seconds), commands for that node’s slots fail with CLUSTERDOWN errors. After failover, the client receives MOVED redirections to the new primary.
What most people miss: During failover, only the slots owned by the failed node are affected. Commands for slots on healthy nodes continue working normally. This is partial availability — better than total unavailability, but your application must handle errors for the affected key range.
Follow-up: What happens to in-flight commands when a Redis Cluster node crashes mid-pipeline?
Answer: If a client has sent a pipeline of 100 commands and the node crashes after processing 50:- The first 50 commands were executed. If they were writes, those writes are in the node’s memory and AOF (if enabled). Whether they survive depends on the persistence configuration and whether the replica received them before the crash.
- The remaining 50 commands get a connection error. The client receives errors for these commands.
- The client cannot know exactly which commands succeeded and which did not without checking the data. This is the pipeline atomicity problem — pipelines are NOT transactions. Each command in a pipeline is independent.
MULTI/EXEC transactions (which are atomic on a single node) or Lua scripts, not pipelines. Pipelines are for performance optimization of independent commands, not for ensuring atomicity.
Your PostgreSQL database is approaching transaction ID wraparound. Explain the urgency and your remediation plan.
Difficulty: Senior What the interviewer is really testing: Deep understanding of one of PostgreSQL’s most dangerous operational scenarios, and whether the candidate has the judgment to handle a high-pressure situation. Strong Answer: This is a five-alarm fire. Transaction ID wraparound is the one PostgreSQL failure mode that can cause the database to shut down and refuse all writes — and it is completely preventable. Why it happens: PostgreSQL uses 32-bit transaction IDs (roughly 4 billion values, but effectively 2 billion in each “direction” due to modular arithmetic). Every transaction consumes one ID. VACUUM FREEZE is responsible for marking old transactions as “frozen” — removing their dependency on the transaction ID counter. If VACUUM FREEZE cannot keep up (because of long-running transactions, disabled autovacuum, or massive tables), the database runs out of usable transaction IDs. The warning signs (check immediately):age(datfrozenxid)> 200 million: investigate why autovacuum is not freezing. Tune it.age(datfrozenxid)> 1 billion: urgent. Autovacuum should be running aggressively. Check for blockers.age(datfrozenxid)> 2 billion - 1 million: critical. PostgreSQL will start emitting warnings in the logs and will eventually enter read-only “safety mode.” Immediate intervention required.
- Check
pg_stat_activityfor long-running transactions and terminate them. Any open transaction holds back the freeze horizon. - Check if autovacuum is running on the offending tables. If not, start a manual VACUUM FREEZE on the table with the oldest
relfrozenxid. - Increase
autovacuum_freeze_max_ageawareness — this is the threshold where autovacuum becomes aggressive about freezing. Default is 200 million. If your tables are well past this, autovacuum should already be trying, but something is blocking it.
- Run
VACUUM FREEZEmanually on the worst tables. On a huge table, this can take hours. UseVERBOSEto monitor progress. - Increase
autovacuum_max_workersand reduceautovacuum_vacuum_cost_delayto let autovacuum work faster across all tables. - If the database is in read-only mode: the only option is running VACUUM FREEZE. No writes are possible until the freeze completes. This is why prevention is critical.
- Monitor
age(datfrozenxid)as a Tier 1 metric. Alert at 500 million, page at 1 billion. - Tune autovacuum aggressively on high-write tables (lower
autovacuum_vacuum_scale_factor, higherautovacuum_vacuum_cost_limit). - Set
idle_in_transaction_session_timeoutto prevent long-running transactions from blocking VACUUM. - For very large tables (hundreds of millions of rows), partition them. VACUUM FREEZE runs per-partition, making it more manageable.
Follow-up: Why can long-running transactions prevent VACUUM FREEZE from doing its job?
Answer: VACUUM can only freeze a transaction ID if no active transaction could possibly need to see the row version that transaction created. PostgreSQL determines this using the oldest active transaction’s snapshot. If a transaction started 3 days ago and is still open (even if idle), PostgreSQL must preserve all row versions created since that transaction started. VACUUM cannot freeze any transaction ID newer than that transaction’s ID. This means 3 days worth of transaction IDs remain unfrozen, accumulating toward the wraparound limit. A single idle-in-transaction session can hold back the entire freeze horizon. On a busy system processing 10,000 transactions per second, a 3-day idle transaction accumulates 2.6 billion unfrozen transaction IDs — very close to the wraparound limit on its own. This is whyidle_in_transaction_session_timeout is one of the most important PostgreSQL production settings. The default is 0 (disabled). Set it to something aggressive like 5 minutes. Developers who leave transactions open for debugging should use a separate connection with a higher timeout, not the production pool.
Compare how PostgreSQL, MongoDB, and DynamoDB handle a scenario where you need to count the total number of items matching a condition across your entire dataset.
Difficulty: Intermediate What the interviewer is really testing: Understanding of how each database’s architecture affects a seemingly simple operation, and why “just do a COUNT” is not the same across databases. Strong Answer: This is a great question because it exposes fundamental architectural differences through a deceptively simple operation. PostgreSQL:SELECT COUNT(*) FROM orders WHERE status = 'shipped'
PostgreSQL must actually scan every matching row to count it — there is no pre-computed count stored anywhere. With an index on status, it does an Index Scan (or Bitmap Index Scan for moderate selectivity). Without an index, it does a full Sequential Scan. On a 500M row table, this can take minutes.
Why no stored count? Because of MVCC. Different transactions see different row versions. A row that is visible to Transaction A might be invisible to Transaction B (it was inserted after B’s snapshot). So there is no single “correct” count — it depends on the transaction’s snapshot. PostgreSQL must check visibility for each row.
Optimization: Maintain a materialized count in a separate table, updated by triggers or periodic refresh. Or use pg_stat_user_tables.n_live_tup for an approximate count (fast but imprecise, based on the last ANALYZE).
MongoDB: db.orders.countDocuments({ status: "shipped" })
Similar to PostgreSQL — MongoDB scans the matching documents using an index (if available) or a collection scan. The countDocuments() method actually runs an aggregation pipeline ($match + $group with $sum). It is accurate but can be slow on large collections.
MongoDB offers estimatedDocumentCount() which reads the collection metadata for an instant but approximate total count (it does not support filters).
DynamoDB: There is no efficient COUNT.
DynamoDB does not have a COUNT operation in the traditional sense. You have two bad options:
QuerywithSelect: 'COUNT'— this still reads every matching item (consuming read capacity for each item), it just does not return the data. On millions of items, this is expensive and slow.Scanwith aFilterExpression— even worse. Reads every item in the table, applies the filter, and counts matches. For a billion-item table, this costs thousands of dollars in read capacity.
PK=COUNTER, SK=shipped_orders, updated atomically with ADD on every order shipment. Reads are instant (GetItem). The trade-off is that the counter must be maintained by the application — if the update fails or the application has a bug, the count drifts. Reconcile periodically with a full scan job (run during off-peak, accept the cost).
The architectural lesson: PostgreSQL and MongoDB trade write simplicity for read flexibility — you can count anything, anytime, but it costs a scan. DynamoDB trades read flexibility for write and read performance at scale — you must pre-compute any aggregation you need.
Follow-up: What if you need real-time counts displayed on a dashboard — say, “orders shipped today” updating every second?
Answer: This changes the problem from “how to count” to “how to maintain a running count efficiently.” PostgreSQL approach: Use a materialized counter table with a trigger. On everyINSERT INTO orders with status = 'shipped', a trigger increments the counter in a daily_stats table. The dashboard queries the counter table (instant). The risk: trigger contention under high write volume — all inserts compete for a lock on the same counter row. Mitigate with multiple counter rows (shard the counter by, say, order_id % 10, sum the 10 rows on read).
Redis approach (any source database): Maintain a Redis counter with INCR shipped_today. Reset it at midnight with a TTL or a cron job. The application increments the counter after each successful order shipment write. The dashboard reads from Redis (sub-millisecond). This decouples the counting from the database and works regardless of which database stores the orders.
DynamoDB approach: The counter item (PK=COUNTER#2024-04-09, SK=shipped) updated with UpdateExpression: 'ADD shipped_count :one'. DynamoDB’s ADD is atomic, so concurrent updates do not lose counts. The dashboard reads this single item.
For all three, the real-time dashboard reads a pre-computed counter, not a live aggregate. This is the Computed Pattern — trade write complexity for read speed. No production system refreshes a COUNT(*) query every second against a large table.
Going Deeper: The counter approach can drift if the application crashes between writing the order and incrementing the counter. How do you handle this?
Answer: This is the classic dual-write consistency problem. Three approaches:- Transactional outbox. In PostgreSQL, update both the orders table and the counter table in the same transaction. They are either both committed or both rolled back. Atomic, no drift. This is the simplest and most correct approach when both writes are in the same database.
- Event-driven counter. Do not increment the counter in the application. Instead, use CDC (DynamoDB Streams, PostgreSQL logical replication, MongoDB Change Streams) to capture the order insert event and have a consumer increment the counter. The consumer processes events with at-least-once semantics, so make the counter update idempotent (e.g., track the last processed event ID and skip duplicates).
-
Periodic reconciliation. Accept that the counter might drift and run a reconciliation job hourly that does the full
COUNT(*)scan and corrects the counter. For a dashboard that says “~15,234 orders shipped today,” being off by a handful is acceptable. For a financial report, it is not.
Advanced Interview Scenarios
These questions are designed to break candidates out of textbook thinking. Several have answers where the obvious approach is wrong. They test the kind of judgment that only comes from operating databases in production — where the clean theory meets the messy reality of traffic spikes, human error, cost constraints, and 3 AM pages.Your application’s p99 latency spikes every day at 2 AM. Nothing in the application code runs at that time. The database is PostgreSQL. What is happening?
Difficulty: Senior What the interviewer is really testing: Whether the candidate understands that databases have background maintenance processes with real performance impact, and whether they can trace a symptom back to a non-obvious root cause. What weak candidates say: “Maybe there is a cron job running. I would check the cron tab.” They stay at the application layer and do not consider the database’s own internal processes. Some suggest “maybe it is backups” without explaining how a backup causes latency spikes or how to verify. What strong candidates say: The way I think about this is: what happens at 2 AM that does not involve application code? Three PostgreSQL-internal culprits:-
Autovacuum anti-wraparound. PostgreSQL has a hard-coded escalation: when a table’s
relfrozenxidage exceedsautovacuum_freeze_max_age(default 200 million), autovacuum ignores its normal cost-based throttling and runs at full speed to prevent transaction ID wraparound. This “emergency autovacuum” consumes massive I/O. On a busy table, this can saturate the disk for 30-60 minutes. Checkpg_stat_user_tablesforlast_autovacuumtimestamps clustering around 2 AM, and look atpg_stat_progress_vacuumduring the spike. -
Checkpoint storms. PostgreSQL’s checkpointer writes all dirty pages from shared buffers to disk periodically (controlled by
checkpoint_timeout, default 5 minutes, andmax_wal_size). If a large batch of dirty pages accumulated during daytime activity and the checkpoint fires during a low-traffic window, the sudden burst of disk writes spikes latency for concurrent queries. Checkpg_stat_bgwriterforcheckpoints_req(forced checkpoints, which are the disruptive kind) vscheckpoints_timed. Tunecheckpoint_completion_targetto 0.9 to spread checkpoint writes over a longer window. - RDS or cloud-provider maintenance windows. On AWS RDS, minor engine patches, OS patching, and multi-AZ failover tests often run in the default maintenance window (which many teams never change from the 2-4 AM default). A multi-AZ failover causes a 15-30 second connection drop. Check the AWS RDS Events console.
transactions table — 800 million rows, autovacuum_freeze_max_age at the default 200 million. The emergency vacuum was ignoring cost delays and saturating the EBS volume at 3,000 IOPS for 45 minutes. We fixed it by reducing autovacuum_vacuum_scale_factor to 0.01 on that table so regular autovacuum ran more frequently, preventing the age from ever reaching the emergency threshold. The p99 spikes disappeared completely. The monitoring tool that caught it was pg_stat_progress_vacuum combined with CloudWatch VolumeWriteOps — the correlation was unmistakable.
Follow-up: How would you distinguish between a checkpoint storm and an autovacuum storm using only pg_stat views?
Answer: Checkpg_stat_bgwriter. The checkpoints_req counter increments when a forced checkpoint occurs (triggered by max_wal_size being reached). If that counter is climbing during the spike, it is a checkpoint. For autovacuum, check pg_stat_progress_vacuum — it shows which table is being vacuumed, the phase (scanning heap, vacuuming indexes, etc.), and progress. Also look at pg_stat_activity for backend processes with backend_type = 'autovacuum worker' and wait_event_type = 'IO'. If autovacuum workers are in I/O wait states, they are the ones hammering the disk. You can also correlate with pg_stat_user_tables.autovacuum_count and last_autovacuum to see which tables were vacuumed during the window.
Follow-up: The team wants to “just disable autovacuum on that table to stop the spikes.” Why is this dangerous?
Answer: Disabling autovacuum on a high-write table is like disabling the fire alarm because the beeping annoys you. Without autovacuum: dead tuples accumulate indefinitely, bloating the table (a 50 GB table can grow to 500 GB). Index bloat makes every query slower. Most critically, therelfrozenxid age keeps climbing toward the 2-billion hard limit. When it gets within 10 million of the limit, PostgreSQL enters emergency mode and starts logging warnings. When it hits the limit, PostgreSQL shuts down all writes to the entire database cluster — not just that table, every table — and refuses to process any transaction until someone manually runs VACUUM FREEZE. I have seen this take a production database offline for 6 hours on a 1 TB table. The correct fix is never to disable autovacuum — it is to tune it so it runs frequently in small increments instead of rarely in one massive burst.
You inherit a MongoDB cluster that is sharded on { created_at: 1 }. The team says queries are slow. What is your diagnosis before you look at a single query?
Difficulty: Senior
What the interviewer is really testing: Whether the candidate recognizes a fundamentally flawed shard key on sight, and can explain the cascading consequences without needing to investigate further.
What weak candidates say:
“I would look at the slow query log and see which queries are slow, then add indexes.” They treat it as a query optimization problem and miss that the architecture itself is broken.
What strong candidates say:
I already know the primary problem before looking at a single query: the shard key is monotonically increasing. Every new document has a created_at value greater than all previous documents. That means every insert goes to the same shard — the one that owns the highest chunk range. You have N shards, but only one of them is receiving writes. The other N-1 shards are idle for writes and only serve reads for historical data.
This causes three cascading problems:
-
Write hot shard. One shard absorbs 100% of write traffic. It becomes the bottleneck. Adding more shards does not help — the writes still go to the newest chunk range. The
moveChunkbalancer constantly tries to migrate chunks from the hot shard to cold shards, but it cannot keep up with the ingest rate. You seebalancerStatusshowing continuous chunk migrations andmoveChunkoperations consuming I/O on the hot shard. -
Scatter-gather reads. Any query that filters by something other than
created_at(which is almost every query — “find me user X’s documents,” “find documents with status Y”) must scatter to all shards because the shard key tells MongoDB nothing about where user X’s data lives. These scatter-gather queries get linearly slower as you add shards. The team adds shards for capacity but queries get slower — the opposite of what they expected. -
Jumbo chunks. If many documents are created within the same second (high-throughput systems), they share the same
created_atvalue and land in the same chunk. That chunk cannot be split because all its documents have the same shard key value. It becomes a jumbo chunk that cannot be migrated, further imbalancing the cluster.
{ timestamp: 1 }. 200 million inserts per day, all hitting one shard. They had 12 shards but write throughput was identical to a single node. The balancer was consuming 40% of the hot shard’s I/O budget just migrating chunks. The fix was a reshard to { device_id: 1, timestamp: 1 } — a compound shard key where device_id provides cardinality and distribution, while timestamp in the sort key gives chronological ordering within each device’s data. Queries like “get device X’s readings for the last hour” went from scatter-gather across 12 shards to a single targeted query. Write throughput increased 9x because writes distributed across all shards. The migration took 3 weeks of dual-writing to a new collection, but the team said they wished they had done it six months earlier.
Follow-up: How do you actually migrate to a new shard key on a live MongoDB cluster without downtime?
Answer: Since MongoDB 5.0, there isreshardCollection which performs an online reshard. Under the hood, it creates a new sharded collection, clones existing data using a background process, catches up with ongoing writes via change streams, and then atomically swaps the collection namespace. During the cutover (typically sub-second), writes are briefly paused. In practice, I have seen the background cloning phase take days for multi-terabyte collections, and it doubles your storage temporarily. You need to monitor disk usage carefully.
For MongoDB versions before 5.0, or when you need to restructure the schema during migration, the pattern is: create a new collection with the correct shard key, set up a change stream on the old collection to capture live writes, backfill historical data from old to new using a batch job, then switch the application to the new collection once the change stream consumer has caught up. The challenge is the cutover — you need to drain the change stream buffer to zero lag, switch the application atomically (typically a feature flag), and have a rollback plan.
Follow-up: The team pushes back — “we chose created_at because all our queries filter by time range.” Is their reasoning valid?
Answer: Their reasoning is partially valid but still wrong. Yes, filtering by time range is important. But a shard key serves two purposes: query routing and write distribution. A good shard key optimizes both.{ created_at: 1 } optimizes routing for time-range queries but catastrophically fails at write distribution.
The correct approach is a compound shard key where the first element provides distribution and the second provides range-query efficiency. For example, { tenant_id: 1, created_at: 1 } distributes writes across tenants (good cardinality) while still allowing time-range queries within a tenant to hit a single shard. If there are no natural high-cardinality prefixes, use a hashed compound key: { hashed_created_at: "hashed", created_at: 1 } — but this loses time-range routing, so you are back to scatter-gather for range queries. There is no perfect shard key; there are only trade-offs that match your access patterns. The “all queries filter by time” argument ignores the write side of the equation, which is the classic mistake.
A Redis Lua script that implements a distributed lock is intermittently failing to release the lock, causing downstream services to hang for 30 seconds (the lock TTL). How do you debug this?
Difficulty: Staff-Level What the interviewer is really testing: Practical debugging of distributed locking, understanding of Redis atomicity guarantees, and awareness of the subtle failure modes that make distributed locks treacherous. What weak candidates say: “I would check if the Redis server is overloaded. Maybe increase the TTL.” They do not understand the actual failure modes of distributed locks and suggest making the problem worse (longer TTL means longer hangs on failure). What strong candidates say: This is a classic distributed lock lifecycle bug. Let me walk through the failure modes in order of likelihood: 1. Lock ownership mismatch on release. The most common bug. Service A acquires lock with token “abc123” and TTL 30s. Service A takes 35 seconds to complete the work (GC pause, slow downstream call, CPU contention). The lock expires at 30s. Service B acquires the same lock with token “xyz789”. Service A finishes at 35s and releases… Service B’s lock. Now there is no lock and a third service enters the critical section. The “intermittent failure to release” is actually Service A trying to release a lock it no longer owns and theDEL either deletes B’s lock (if no ownership check) or fails silently (if the Lua script checks ownership but the token does not match).
The fix: the release Lua script must compare the lock value (a unique token per acquisition) before deleting:
Follow-up: The team says “just use a PostgreSQL advisory lock instead of Redis — it is simpler and has real transactions.” When is this the right call?
Answer: PostgreSQL advisory locks are underrated and often the correct choice.SELECT pg_advisory_lock(hash_of_key) acquires a lock that is automatically released on transaction end or session end. No TTL management, no ownership tokens, no Lua scripts. They are transactional: if the transaction rolls back, the lock releases.
Use PostgreSQL advisory locks when: (1) all lock participants already connect to the same PostgreSQL instance, (2) the lock contention is moderate (not thousands of lock acquisitions per second), and (3) you need the lock’s lifecycle tied to a database transaction.
Use Redis locks when: (1) lock participants are different services that do not share a database, (2) you need sub-millisecond lock acquisition (PostgreSQL round-trip is 1-5 ms vs Redis 0.1-0.5 ms), (3) lock throughput is very high, or (4) the protected operation spans multiple systems and cannot be wrapped in a database transaction.
The honest truth is that most teams using Redis for distributed locks would be better served by PostgreSQL advisory locks, because most lock scenarios involve protecting a database operation that is already running inside a PostgreSQL transaction. Adding Redis as a separate lock coordinator introduces a new failure mode (Redis unavailability) and the dual-system consistency problem.
Follow-up: How would you monitor distributed lock health in production?
Answer: Four metrics I track:- Lock acquisition latency (p50, p99). If this starts climbing, it means contention is increasing or the Redis cluster is under pressure.
- Lock hold duration vs TTL ratio. If the median hold time is 8 seconds and TTL is 10 seconds, you are dangerously close to TTL expiry during normal operations. Alert when the p99 hold duration exceeds 50% of TTL.
- Lock release failure rate. The percentage of release operations where the ownership token does not match (the lock expired and was re-acquired). This is the leading indicator of correctness bugs. Any non-zero rate should trigger investigation.
- Lock contention rate. How often an acquisition attempt is blocked (lock already held). High contention means either the critical section is too slow or too many processes are competing. Consider sharding the lock (per-user instead of global) or redesigning the critical section to be shorter.
Your DynamoDB bill tripled last month. No one deployed new features. Find the root cause.
Difficulty: Senior What the interviewer is really testing: Cost awareness in production database operations, ability to use AWS tooling for forensics, and understanding of DynamoDB’s pricing model beyond the basics. What weak candidates say: “Switch to on-demand mode to save money.” This shows they do not understand DynamoDB pricing — on-demand mode is 6.5x more expensive per request at steady state. Or they say “look at the AWS bill” without knowing which specific DynamoDB metrics to investigate. What strong candidates say: DynamoDB has five cost axes, and the bill increase must come from one or more of them. I would investigate in this order: 1. Read/Write capacity consumption. Check CloudWatchConsumedReadCapacityUnits and ConsumedWriteCapacityUnits for the table and all GSIs. Compare with the previous month. If consumed capacity increased but no new features deployed, look for: (a) a change in traffic patterns (seasonal growth, a marketing campaign, a bot), (b) a batch job that was accidentally left running or running more frequently than before, (c) a retry storm caused by throttling (application retries create a positive feedback loop — throttling causes retries, retries increase consumption, which causes more throttling).
2. GSI cost. This is the most common surprise cost. Every write to the base table is replicated to every GSI that projects the modified attribute. If someone added a GSI with ProjectionType: ALL on a high-write table, that GSI now doubles the write cost for that table. Check ConsumedWriteCapacityUnits per GSI. Also check for GSI throttling backpressure — if the GSI is throttled, it slows down base table writes, increasing request latency, which can increase concurrent request count and overall capacity consumption.
3. DynamoDB Streams. Streams are free to enable, but the Lambda (or other consumer) invocations triggered by streams cost money. If stream processing Lambda functions are failing and retrying (DynamoDB Streams retries failed batches until they succeed or expire), you could have millions of extra Lambda invocations. Check the Lambda concurrent executions and error rates for stream consumers.
4. Data storage. DynamoDB charges 225/month increase. Check the table’s ItemCount and TableSizeBytes in the DynamoDB console. Also check GSI storage — each GSI stores its own copy of projected attributes.
5. Data transfer. DynamoDB charges for cross-region data transfer (Global Tables replication), VPC endpoint data processing, and data returned to the client. If ProjectionExpression is not used and items are large, read costs include both capacity units and data transfer. A code change that removed ProjectionExpression from a high-traffic query could significantly increase costs.
War Story: A logistics company I worked with saw their DynamoDB bill go from 14,800/month in one month. No new features. The root cause was a seemingly innocent change: a developer added a new GSI to enable a back-office search feature. The GSI had ProjectionType: ALL on their shipments table, which had 50,000 writes per second. That single GSI doubled the write capacity consumption for the entire table. The GSI itself was only queried 10 times per day — a 9,000/month.
Follow-up: How would you set up cost alerting to catch this before the monthly bill arrives?
Answer: Three layers:- AWS Budgets with a DynamoDB-specific budget. Set a monthly budget for DynamoDB and configure alerts at 50%, 80%, and 100% thresholds with SNS notifications to Slack or PagerDuty. This catches overall cost growth.
-
CloudWatch per-table alarms. Create alarms on
ConsumedWriteCapacityUnitsandConsumedReadCapacityUnitsat the table level and GSI level. Alert when any metric exceeds 120% of the 7-day rolling average. This catches per-table anomalies before they aggregate into a bill surprise. - AWS Cost Explorer with daily granularity. Set up a daily report filtered to DynamoDB. Enable “group by resource” to see cost per table. The bill tripling happened over 30 days — with daily cost monitoring, you would have caught it on day 2 when the daily cost jumped from 420.
Follow-up: The team is on provisioned capacity and is considering switching to on-demand to “never think about capacity again.” Walk them through the trade-offs.
Answer: On-demand mode pricing is approximately 6.5x more expensive per request than provisioned capacity at steady state. For a table consuming 10,000 WCU steady-state, provisioned costs roughly 30,500/month. On-demand makes financial sense in three scenarios: (1) truly unpredictable traffic with extreme spikes (0 to 50,000 WCU in minutes), (2) development and staging environments with minimal traffic, and (3) new tables where you do not yet know the access pattern. For production tables with reasonably predictable traffic, provisioned capacity with auto-scaling is almost always cheaper. Set the target utilization to 70%, minimum capacity to your off-peak baseline, and maximum capacity to your peak with headroom. Auto-scaling adjusts within 1-2 minutes, which handles most traffic variations. The hybrid approach: use provisioned with auto-scaling for steady-state tables, and on-demand for event-driven or burst-heavy tables. DynamoDB lets you switch between modes, but there is a 24-hour cooldown after switching from on-demand back to provisioned.You are designing a system that needs to keep PostgreSQL, Redis, and Elasticsearch in sync. When a user updates their profile, the relational data, the cache, and the search index all need to reflect the change. How do you ensure consistency?
Difficulty: Staff-Level What the interviewer is really testing: Understanding of the fundamental impossibility of consistent dual-writes across independent systems, and practical patterns for achieving eventual consistency with acceptable trade-offs. What weak candidates say: “I would write to all three in the same API request handler. If any write fails, roll back the others.” This is the distributed transaction trap — there is no rollback mechanism across PostgreSQL, Redis, and Elasticsearch. If the PostgreSQL write succeeds, the Redis write succeeds, and the Elasticsearch write fails, you cannot roll back Redis (theDEL might itself fail). They also do not address the case where the application process crashes between write 2 and write 3.
What strong candidates say:
There is no way to make three independent systems atomically consistent without a distributed transaction coordinator, and implementing 2PC across PostgreSQL, Redis, and Elasticsearch is not practical. The real question is: what consistency guarantees does the business actually need, and what is the cheapest pattern that satisfies them?
Pattern 1: Transactional Outbox (my default recommendation).
Write the profile update AND an “outbox” event to PostgreSQL in the same transaction. A separate process (Debezium polling the outbox table, or a PostgreSQL LISTEN/NOTIFY consumer) reads the outbox events and fans them out to Redis and Elasticsearch. This guarantees that if the profile update committed, the event was also committed. The downstream updates are eventually consistent — typically within 100-500ms.
Implementation: the profile_updates outbox table has columns id, aggregate_id, event_type, payload, processed_at. A worker polls for processed_at IS NULL, updates Redis (invalidate or update the cached profile), updates Elasticsearch (index the new profile), and marks the event as processed. Idempotent processing is essential because the worker might crash after updating Redis but before updating Elasticsearch.
Pattern 2: CDC with Debezium.
Skip the outbox table. Use Debezium to capture WAL changes directly from PostgreSQL and publish them to Kafka. Separate consumers read from Kafka: one updates Redis, another updates the Elasticsearch index. Debezium guarantees exactly-once publication from WAL to Kafka. The consumers handle idempotent processing.
This is cleaner (no outbox table, no polling) but adds Kafka and Debezium to the operational surface area. For a team already running Kafka, this is the preferred approach. For a team that does not, the outbox pattern is simpler to start with.
Pattern 3: Eventual invalidation (simplest for caching).
For Redis specifically: do not try to keep the cache in sync. Just invalidate the cache on write (DEL user:profile:123). The next read misses the cache, fetches from PostgreSQL, and repopulates. This is the cache-aside pattern. It is eventually consistent (the stale cache entry is gone immediately, the fresh entry appears on the next read) and requires zero event infrastructure.
For Elasticsearch: invalidation is not practical because search needs the data indexed. Use the outbox or CDC pattern for Elasticsearch, and simple invalidation for Redis.
War Story: An e-commerce platform I worked on tried Pattern 1 — writing to PostgreSQL, Redis, and Elasticsearch sequentially in the request handler. One day, the Elasticsearch cluster went into a 45-second recovery (a node restarted due to OOM). During those 45 seconds, every profile update API call hung for 30 seconds (the Elasticsearch client timeout), then returned a 500 error to the user. But the PostgreSQL write had already committed — so the profile was updated in the database but the user saw an error. The Redis cache had the old data. Search had the old data. The user retried, PostgreSQL returned “nothing changed” (idempotent), Redis and Elasticsearch eventually updated, but the user was confused and support tickets piled up. We migrated to the outbox pattern within the week. The API now writes to PostgreSQL only (latency: 5ms), returns success immediately, and the async worker handles Redis and Elasticsearch. Elasticsearch downtime no longer affects user-facing latency.
Follow-up: The outbox table is growing fast because events are being produced faster than the worker can process them. How do you scale the worker?
Answer: Three approaches:- Partition the outbox by aggregate ID. Run multiple worker instances, each responsible for a range of aggregate IDs (consistent hashing). This parallelizes processing while guaranteeing ordering within each aggregate (a user’s profile updates are processed in order by one worker). This is essentially the Kafka consumer group pattern implemented at the outbox level.
-
Batch processing. Instead of processing one event at a time, the worker reads 100 events, batches the Redis updates (
PIPELINE), batches the Elasticsearch updates (_bulkAPI), and marks all 100 as processed in oneUPDATEstatement. Batching can improve throughput 10-50x by amortizing network round-trips. - Switch to CDC. If the outbox polling pattern hits its scaling limit, migrate to Debezium + Kafka. Kafka’s partitioned consumer groups provide natural parallelism, backpressure handling, and replay capability. This is the “graduated” version of the outbox pattern.
Follow-up: How do you detect when PostgreSQL, Redis, and Elasticsearch have drifted out of sync?
Answer: A reconciliation job that runs hourly (or daily, depending on the acceptable drift window):- Sample-based comparison. Query a random sample of 1,000 user IDs from PostgreSQL. For each, fetch the Redis cached version and the Elasticsearch indexed version. Compare key fields (name, email, updated_at). Log discrepancies.
- Checksum comparison. For each user, compute a hash of the canonical fields in PostgreSQL. Store the hash as a field in the Redis entry and the Elasticsearch document. The reconciliation job queries Elasticsearch for documents where the stored hash does not match the current PostgreSQL hash. This is more efficient than fetching and comparing full records.
-
Event lag monitoring. Track the outbox’s oldest unprocessed event. If it is more than N minutes old, alert. This is the leading indicator — if the worker is behind, drift is accumulating. Monitor
processed_at - created_atas a latency metric in your dashboard.
updated_at timestamp (sub-second precision). We switched to using the outbox event ID for deduplication instead of timestamps, and the drift dropped to zero.
Your PostgreSQL table has 200 million rows. An engineer adds range partitioning by month to speed things up. Queries are now SLOWER. What went wrong?
Difficulty: Senior What the interviewer is really testing: Whether the candidate understands that partitioning is not a universal performance improvement — it has preconditions, and violating them makes things worse. This is a question where the obvious answer (“partitioning should help”) is wrong. What weak candidates say: “Maybe the partitions are too small. Try partitioning by week instead.” They assume the partitioning strategy is correct and just needs tuning. They do not question whether partitioning was appropriate in the first place. What strong candidates say: Partitioning makes queries faster only when the query planner can eliminate partitions — skip partitions that cannot contain matching rows. If partition pruning does not happen, every query hits every partition, and the overhead of planning across N partitions makes things slower, not faster. The most likely causes: 1. Queries do not filter by the partition key. If the table is partitioned bycreated_at (month), but the hot queries filter by user_id or status without a created_at predicate, the planner cannot prune partitions. A query that previously did one Index Scan on a single table now does N Index Scans (one per partition) and merges results. With 60 monthly partitions, that is 60x the planning overhead and 60 index lookups instead of 1. Check EXPLAIN ANALYZE for Append nodes with all partitions listed.
2. The partition key is not in the WHERE clause but is needed for the index. Before partitioning, a single index on (user_id, created_at) served all queries efficiently. After partitioning by month, each partition has its own index. A query WHERE user_id = 42 ORDER BY created_at must query each partition’s index, get results, and merge-sort them. The planner might even fall back to sequential scans on each partition if it estimates the merge is cheaper.
3. Excessive partition count. Each partition is a separate table, and the planner considers each one during planning. With 120 monthly partitions (10 years of data), the planning time itself can exceed the query execution time for simple lookups. PostgreSQL 14+ has improved partition pruning performance, but at 500+ partitions, planning overhead is measurable.
4. Index-only scans broken. If the pre-partitioned table had a well-maintained visibility map (enabling index-only scans), partitioning resets this. Each new partition starts with an empty visibility map. Until VACUUM runs on each partition and marks pages as all-visible, index-only scans fall back to regular index scans (requiring heap fetches). On a freshly partitioned table with 200M rows, this transition period can last hours.
War Story: A SaaS platform partitioned their events table (200M rows) by month to “speed up” a reporting dashboard. The dashboard queries filtered by tenant_id and event_type — never by date. After partitioning, the dashboard p99 went from 200ms to 4 seconds. The EXPLAIN ANALYZE showed an Append node with 36 child Index Scan nodes (36 monthly partitions), each scanning its own (tenant_id, event_type) index. The fix was not to remove partitioning — it was to add AND created_at >= NOW() - INTERVAL '90 days' to the dashboard queries (which was the actual business requirement anyway; they only needed recent data). With that predicate, the planner pruned 33 of 36 partitions and the p99 dropped to 150ms — even faster than pre-partitioning because each partition’s index was smaller and fit entirely in shared_buffers.
Follow-up: When is partitioning genuinely the right call, and what prerequisites must be true?
Answer: Three prerequisites for partitioning to improve performance:- All hot-path queries include the partition key in their WHERE clause. This is non-negotiable. If even one critical query does not filter by the partition key, that query gets worse.
-
The table is large enough that partition-level indexes are faster. Rule of thumb: partitioning helps when the full table’s index does not fit in
shared_buffersbut individual partition indexes do. A 200M row table with a 20 GB index that only has 8 GB of shared_buffers benefits from partitioning — each partition’s 500 MB index fits comfortably. A 10M row table with a 1 GB index that fits in memory does not benefit. -
You need lifecycle management. This is the strongest argument for partitioning: dropping old data.
DROP TABLE partition_2023_01is instant and creates zero dead tuples.DELETE FROM events WHERE created_at < '2023-02-01'on a 200M row table takes hours, creates millions of dead tuples, and triggers an aggressive autovacuum. If you have a retention policy, partitioning is almost always worth it for the lifecycle benefit alone, even if query performance stays the same.
Follow-up: Can you have a global unique index across all partitions in PostgreSQL?
Answer: No, and this is one of the most frustrating limitations. In PostgreSQL, unique indexes on partitioned tables must include the partition key. If the table is partitioned bycreated_at, you cannot create a unique index on just email — it must be (email, created_at). This means two rows with the same email in different partitions can both exist, violating the uniqueness you wanted.
Workarounds: (1) Enforce uniqueness at the application level (check-then-insert with a retry loop). (2) Use a separate, non-partitioned “uniqueness” table with just the unique column and a foreign key to the partitioned table. (3) Redesign: if global uniqueness is critical, partition on the unique column instead (hash partitioning on email). PostgreSQL 15+ improved this somewhat with better unique constraint support, but the partition-key-in-unique-index requirement remains. This is a legitimate reason to reconsider whether partitioning is appropriate for your schema.
You run a Redis-backed leaderboard for a mobile game with 10 million players. An engineer suggests switching the Sorted Set to a PostgreSQL table with an index on score because “we need persistent rankings and Redis might lose data.” Evaluate this.
Difficulty: Senior What the interviewer is really testing: Whether the candidate can evaluate a proposal by quantifying the actual trade-offs, not just reasoning abstractly. This is another “obvious answer is wrong” question — the engineer’s concern about durability is valid but their solution destroys the feature. What weak candidates say: “Yes, durability is important, so PostgreSQL is the safer choice.” They agree with the proposal’s framing without examining the performance implications. Or they say “Redis is always better for leaderboards” without addressing the durability concern. What strong candidates say: Let me quantify this because the devil is in the numbers. Redis Sorted Set performance at 10M members:ZADD(update score): O(log N) = ~23 operations deep in a skip list. In practice, 0.01-0.05 ms.ZREVRANK(get player’s rank): O(log N). 0.01-0.05 ms.ZREVRANGE 0 99(top 100 leaderboard): O(log N + 100). 0.05 ms.- Memory: ~10M members * (~50 bytes per member + score) = ~700 MB. Fits on one instance.
- Update score:
UPDATE players SET score = X WHERE id = Y. Requires an index update, WAL write, potential page split. 1-5 ms. - Get player rank:
SELECT COUNT(*) FROM players WHERE score > (SELECT score FROM players WHERE id = Y). This is a full index scan of all rows with higher scores. For a median player (rank 5M), this scans 5 million index entries. 2-10 seconds. - Top 100:
SELECT * FROM players ORDER BY score DESC LIMIT 100. Uses the index. 1-5 ms. This one is comparable.
-
Use Redis with AOF (
appendfsync everysec). Maximum data loss is 1 second. For a game leaderboard, this is acceptable. If a player’s score update from 0.8 seconds ago is lost due to a crash, the impact is negligible. - Use AWS MemoryDB instead of ElastiCache Redis. MemoryDB writes to a distributed, durable transaction log (similar to DynamoDB’s durability model). It is Redis-API compatible with multi-AZ durability. Zero data loss on node failure. This addresses the engineer’s concern without sacrificing Redis’s data structure advantages.
- If regulatory or compliance requirements demand PostgreSQL as the system of record, write scores to both PostgreSQL (source of truth) and Redis (leaderboard). PostgreSQL stores the canonical score and handles disputes/audits. Redis serves the real-time leaderboard. Reconcile periodically. This is polyglot persistence — each database doing what it does best.
UPDATE ... SET rank = rank + 1 WHERE score < X that modified hundreds of thousands of rows. Autovacuum could not keep up. They reverted to Redis within a week. Their final architecture was Redis for the real-time leaderboard (sub-millisecond for all operations) with nightly snapshots to PostgreSQL for analytics, dispute resolution, and compliance.
Follow-up: How would you handle leaderboard sharding if the game grows to 500 million players?
Answer: 500 million members in a single Redis Sorted Set would consume roughly 35 GB of memory. A single Redis instance with 64 GB can handle it, butZREVRANK on 500M members is O(log 500M) which is still under 1 ms. The issue is not performance but availability — a single instance is a single point of failure.
Sharding approaches:
- Regional leaderboards. Most games have natural sharding by region (NA, EU, Asia). Each region has its own Redis instance with its own Sorted Set. Players see their regional rank. A global leaderboard is computed periodically by merging the top N from each region. This works because most players care about their regional rank; only the top 1% care about global rank.
- Tiered leaderboard. A top-10,000 leaderboard in one Sorted Set (updated in real-time, tiny data). Below that, bucket players into score ranges (e.g., players with scores 1000-2000). Each bucket is a separate Sorted Set on a different shard. A player’s approximate rank is: (number of players in higher buckets) + (rank within their bucket). This gives O(1) for the bucket lookup plus O(log N) within the bucket.
- Redis Cluster with hash tags. Shard the Sorted Set across multiple nodes using hash tags. But this defeats the purpose — Sorted Set operations are per-key, and a Sorted Set split across nodes loses its ordering guarantees. Approach 1 or 2 is preferred.
Follow-up: What happens to the leaderboard during a Redis failover? Players will see stale or missing ranks.
Answer: During failover (15-30 seconds for Sentinel, longer for manual intervention), the leaderboard is unavailable for writes. Reads may work if you read from replicas. Graceful degradation strategies:- Read from replicas during failover. Configure the application to fall back to a replica for leaderboard reads if the primary is unreachable. The rank will be slightly stale (by the replication lag, typically <1 second) but available. Writes queue in the application and flush when the new primary is ready.
- Cache the last known leaderboard. The top-100 leaderboard changes slowly (once every few seconds at most). Cache the rendered leaderboard in the application or a CDN with a 5-second TTL. During failover, the CDN serves the slightly stale leaderboard. Players do not notice.
- Accept temporary unavailability for rank lookups. Individual rank queries (“my rank”) are hard to cache. During failover, return “rank unavailable, try again shortly” in the UI. Most games already handle this gracefully with a loading spinner.
A developer inserts 10 million documents into MongoDB in a loop, one at a time, and it takes 3 hours. They expected it to take 10 minutes. What is wrong and how do you fix it?
Difficulty: Intermediate What the interviewer is really testing: Understanding of write concern defaults, network round-trip overhead, batching, and WiredTiger write amplification. This is a common real-world scenario where the fix is simple but the understanding of why matters. What weak candidates say: “MongoDB is just slow for writes, they should use DynamoDB.” This is wrong — MongoDB can handle hundreds of thousands of writes per second on a single node. Or they say “increase the server resources” without diagnosing the actual bottleneck. What strong candidates say: Three hours for 10 million inserts means roughly 920 inserts per second. A single MongoDB node should handle 30,000-100,000 inserts per second depending on document size and hardware. The throughput is 30-100x below what the hardware can deliver. The bottleneck is not MongoDB; it is the insertion pattern. Root cause 1: One-at-a-time inserts with network round-trips. EachinsertOne() is a separate network round-trip: client sends the document, waits for the server to acknowledge, then sends the next. If the network latency between the application and MongoDB is 1 ms, the theoretical maximum is 1,000 inserts/second — close to the observed 920/sec. The fix is insertMany() with batches of 1,000-5,000 documents. Each insertMany() sends the batch in one network call, amortizing the round-trip. This alone can improve throughput 50-100x.
Root cause 2: Write concern overhead. The default w: 1 waits for the primary to acknowledge each write. If they changed it to w: "majority" (perhaps following a best-practices guide without understanding the cost), each write waits for majority replication acknowledgment: primary writes to journal, secondary receives and writes to journal, acknowledgment propagates back. With a 3-node replica set, this adds 2-5 ms per write. At w: "majority", even with batching, throughput caps at around 5,000-10,000 inserts/second because of the majority-acknowledgment latency. For a bulk import, w: 1 is usually acceptable (the data can be re-imported if the primary fails).
Root cause 3: Index maintenance. If the collection has 5 indexes, each insert updates all 5 indexes. With 10 million documents, that is 50 million index updates. Each index update involves a btree page split and WiredTiger cache management. On a collection with many indexes, write throughput is directly proportional to the number of indexes. The fix for bulk imports: drop non-essential indexes before the import, then rebuild them afterward. Index building on 10 million documents in one pass is far faster than incrementally maintaining indexes on 10 million individual inserts.
The combined fix that drops it to under 5 minutes:
ordered: false flag is important — it tells MongoDB it can process the inserts in any order and in parallel. With ordered: true (default), MongoDB processes inserts sequentially and stops on the first error. Unordered inserts can be 2-3x faster because the server can batch internal operations.
War Story: A data pipeline at a media company was inserting 50 million events per day into MongoDB using Python’s insert_one() in a for loop. It took 14 hours. The developer had been adding time.sleep(0.001) between inserts “to not overload the database.” We removed the sleep (the database was at 3% CPU utilization), switched to insert_many() with batches of 5,000, used ordered: false, and temporarily dropped 3 indexes that were not needed for the import. The same 50 million inserts completed in 22 minutes. The developer’s instinct to “be gentle with the database” was well-intentioned but cost them 13.5 hours per day.
Follow-up: What is the difference between ordered and unordered bulk writes, and when would you prefer ordered?
Answer: Ordered bulk writes process documents sequentially. If document 5,000 fails (duplicate key error, validation error), documents 5,001 through 10,000 are not attempted. The client receives an error with the count of successful inserts. This is safe when document order matters and you want to stop on the first error. Unordered bulk writes process all documents, potentially in parallel on the server. If document 5,000 fails, the server continues processing the remaining documents. The client receives a response listing all errors and all successes. This is faster because: (1) the server can group inserts by shard in a sharded cluster, (2) the server can process independent groups concurrently, and (3) a single failure does not stop the entire batch. Use ordered when: inserts must be sequential (event logs where later events depend on earlier ones being present), or when you need to stop on the first error to investigate (data quality is critical). Use unordered for bulk imports, migrations, and any scenario where individual failures should not stop the batch. Handle errors after the batch completes by inspecting thewriteErrors array.
Follow-up: The developer asks “should I use mongoimport instead of writing application code for bulk loads?”
Answer: Yes, for flat-file imports (JSON, CSV, TSV),mongoimport is almost always faster than application code for a few reasons: (1) it uses optimal batch sizes internally, (2) it is compiled in Go and avoids driver serialization overhead, (3) it supports --numInsertionWorkers for parallel inserts, and (4) it handles connection pooling and retry logic optimally for bulk operations.
mongoimport --uri "mongodb://..." --collection events --file events.json --jsonArray --numInsertionWorkers 4 --writeConcern "{w:1}"
With 4 insertion workers on a sharded cluster, mongoimport can saturate the write throughput of all shards. For 50 million 1 KB documents, expect 15-30 minutes on a 3-node replica set with SSDs.
Use application code instead when: you need to transform documents during import (enrich, validate, compute derived fields), when you need to coordinate with other systems (update a progress bar, send notifications at milestones), or when the source data is not a flat file (reading from another database, an API, or a message queue).
You are on-call and get paged: “Redis latency spiked from 0.2ms to 50ms.” What are the first five things you check, in order?
Difficulty: Senior What the interviewer is really testing: Systematic incident response for Redis, prioritized by likelihood and impact. Not just “things that can go wrong with Redis” but the specific diagnostic order an experienced engineer follows under time pressure. What weak candidates say: “I would check if the server is overloaded and maybe restart Redis.” Restarting a stateful data store as a first response is dangerous — you might lose data and you definitely lose diagnostic information. They also do not have a prioritized checklist. What strong candidates say: When Redis latency jumps from sub-millisecond to 50ms, something fundamental changed. Here is my diagnostic order, optimized for fastest time-to-diagnosis: Check 1:redis-cli --latency and SLOWLOG GET 10 (30 seconds).
The SLOWLOG is the single most valuable diagnostic for Redis latency spikes. It logs every command that exceeded slowlog-log-slower-than (default 10ms). If the slowlog shows a specific command pattern (a KEYS *, a SORT on a million elements, a Lua script doing heavy computation), you have found the smoking gun. A single O(N) command on a large dataset blocks the event loop for all other clients.
The most common culprit I have seen in production: an engineer added KEYS pattern:* in a health check that runs every 5 seconds. On 10 million keys, KEYS * takes 2-5 seconds during which every other command is blocked.
Check 2: INFO memory — check mem_fragmentation_ratio and swap usage (15 seconds).
If mem_fragmentation_ratio < 1.0, Redis is swapping. Swap turns every memory access into a potential disk operation, causing 1000x latency increase. If used_memory is near maxmemory, evictions are happening on every write. Each eviction is additional work before the actual command executes.
Check 3: INFO clients — check connected_clients and blocked_clients (10 seconds).
If connected_clients jumped from 500 to 50,000, the connection handling overhead itself causes latency. Each new connection consumes memory and kernel resources. A thundering herd of new connections (from a deployment, an autoscaler spinning up pods, or a retry storm) can spike latency. blocked_clients > 0 means clients are waiting on blocking commands (BLPOP, BRPOP) — check if a consumer died and a queue is backed up.
Check 4: INFO persistence — check rdb_last_bgsave_status and aof_rewrite_in_progress (10 seconds).
If rdb_bgsave_in_progress: 1 or aof_rewrite_in_progress: 1, Redis is forking for a background save. The fork() call itself can cause a latency spike of 10-50ms on a large dataset (Redis must copy the page table). During the save, copy-on-write amplifies memory usage, potentially causing the OOM killer to start killing processes. Check latest_fork_usec — if it is 50,000 (50ms), that is your latency spike right there.
Check 5: Network and host-level metrics (60 seconds).
If none of the above explain it, the problem is outside Redis. Check: (a) network latency between the application and Redis (redis-cli --latency from the app server vs from the Redis host itself — if the difference is large, it is the network), (b) CPU steal on the Redis host (if on a shared VM and the hypervisor is throttling you, top shows %st > 5%), (c) noisy neighbor on the same host (another container or process consuming I/O bandwidth), (d) kernel-level issues (dmesg for OOM events, vmstat for swap activity).
War Story: At a ride-hailing company, Redis latency spiked to 200ms every hour on the hour. The slowlog was empty — no single command was slow. INFO persistence showed rdb_bgsave_in_progress: 1 at every spike. The RDB snapshot was configured to run every hour (save 3600 1). The Redis instance had 30 GB of data. The fork() call took 45ms (latest_fork_usec: 45000), and during the snapshot, the write-heavy workload triggered massive copy-on-write, causing memory pressure. The fix was disabling RDB snapshots entirely (they already had AOF enabled with appendfsync everysec) and using AOF rewrite instead, which is less disruptive because it is optimized for the current dataset rather than doing a full dump. The hourly latency spikes disappeared. We also switched the monitoring cron from checking KEYS stats:* to SCAN with a cursor, which eliminated a secondary 800ms spike that happened during monitoring.
Follow-up: How do you prevent the KEYS command from ever being used in production?
Answer: Three layers of defense:-
Rename the command. In
redis.conf:rename-command KEYS ""(disables it entirely) orrename-command KEYS KEYS_DANGER_DO_NOT_USE. This prevents accidental use. Any code trying to useKEYSgets an error. -
Code review enforcement. Add a linter rule or a grep check in CI that flags any use of
KEYSin application code. The alternative is alwaysSCANwith a cursor, which returns results incrementally without blocking the event loop. -
Redis ACLs (Redis 6+). Create application-specific users that do not have permission for dangerous commands:
ACL SETUSER appuser ~* +@all -@dangerous. The@dangerouscategory includesKEYS,FLUSHALL,FLUSHDB,DEBUG, and other commands that should never run in production.
MONITOR for 10 seconds on each Redis instance and flagged any KEYS usage. This caught a third-party library that was using KEYS internally for cache namespace management. We replaced the library.
Follow-up: The latency spike lasts exactly 200ms and happens once per minute. SLOWLOG is empty. What specific Redis internal event takes exactly ~200ms?
Answer: If it is exactly periodic and the SLOWLOG is empty (meaning no individual command is slow), the two most likely causes are:-
Active expiry cycle. Redis runs an active key expiration cycle 10 times per second (every 100ms). In each cycle, it samples 20 keys with TTLs and deletes expired ones. If more than 25% of sampled keys are expired, it loops again (up to a time limit of 25ms by default, or 250ms if
hzis tuned up). On a Redis instance with millions of keys all expiring around the same time (a common pattern when TTLs are set to round numbers like 60s, 300s, or 3600s), the expiry cycle can consume its full time budget. The solution: add jitter to TTLs (TTL = base_ttl + random(0, 30)). This spreads expirations over time instead of creating thundering herd expiry. -
Lazy freeing of large keys (Redis 4.0+). When a large key is deleted or expires, Redis 4.0+ can free the memory asynchronously using
lazyfree-lazy-expire yes. But if this is not enabled, deleting a Hash with 1 million fields blocks the event loop for the duration of the memory deallocation (100-300ms for a 500 MB key). The SLOWLOG would not show this because theDELcommand itself reports as fast — the time is spent in memory management after the command completes. Enablelazyfree-lazy-expire yes,lazyfree-lazy-server-del yes, andlazyfree-lazy-user-del yesto move large key deletion to a background thread.
Curated Resources
- PostgreSQL Official Documentation — Internals Chapter — the authoritative source for understanding MVCC, WAL, query planning, and the executor. Dense but unmatched in accuracy.
- The Internals of PostgreSQL by Hironobu Suzuki — a free, deeply visual guide to PostgreSQL internals covering buffer management, WAL, VACUUM, query processing, and replication. The best free resource for learning PostgreSQL internals.
- pganalyze Blog — production-focused articles on EXPLAIN ANALYZE, index tuning, connection management, and pg_stat_statements. Practical, not academic.
- Crunchy Data Blog — excellent PostgreSQL operational guides covering partitioning, extensions, and real-world tuning scenarios.
- MongoDB University — free courses covering schema design, aggregation, sharding, and operations. The M320 Data Modeling course is particularly valuable.
- MongoDB Schema Design Patterns — the official summary of the Bucket, Computed, Extended Reference, Outlier, Subset, and other patterns. Bookmark this.
- MongoDB Developer Center — practical guides, code examples, and architecture patterns from MongoDB’s developer advocacy team.
- The DynamoDB Book by Alex DeBrie — the definitive guide to DynamoDB data modeling. If you are serious about DynamoDB, this is required reading. It covers single-table design, GSI patterns, and real-world modeling examples in depth.
- AWS DynamoDB Developer Guide — official documentation covering every feature, capacity mode, and best practice. Start with the “Best Practices” section.
- Rick Houlihan’s DynamoDB Deep Dive (AWS re:Invent) — the legendary re:Invent talk that popularized single-table design. Dense, opinionated, and essential viewing.
- Redis University — free courses covering data structures, persistence, clustering, and Lua scripting.
- Redis Documentation — comprehensive coverage of every command, data structure, and configuration option. The “Data types” section is the best starting point.
- Martin Kleppmann’s Analysis of Redlock — the critical analysis of Redis’s distributed lock algorithm. Essential reading for anyone implementing distributed locks with Redis, regardless of whether you agree with Kleppmann’s conclusions.
- Antirez’s Response to Kleppmann — Salvatore Sanfilippo’s (Redis creator) rebuttal. Read both sides to form your own opinion.
- Designing Data-Intensive Applications by Martin Kleppmann — the single best book on databases, distributed systems, and data infrastructure. Covers replication, partitioning, consistency, encoding, batch/stream processing with depth that no other single resource matches. If you read one book on this list, make it this one.