This is one of the most common and most misunderstood decisions in system design. The real answer is almost never “one or the other” — most production systems at scale use both, each for what it does best. Think of SQL databases as spreadsheets with strict rules: every row follows the same schema, relationships between tables are enforced, and complex queries across multiple tables are a first-class feature. NoSQL databases are more like filing cabinets: each drawer (document/key) can hold whatever you put in it, you can add new drawers trivially, and looking up a specific drawer by its label is extremely fast — but correlating information across drawers requires more manual effort.
Distribute data across multiple databases for horizontal scaling. Sharding is the database equivalent of splitting a library’s catalog across multiple buildings by subject: each building handles a fraction of the total collection, so no single building becomes overcrowded. The catch is that if someone wants to search across all subjects at once, they now need to visit every building and combine the results.Sharding is a last resort, not a first choice. It introduces significant operational complexity: cross-shard queries become expensive, transactions spanning multiple shards require distributed coordination (like 2PC or Saga patterns), and rebalancing data when you add or remove shards is painful. Exhaust vertical scaling, read replicas, and caching before you reach for sharding.
┌─────────────────────────────────────────────────────────────┐│ User ID Range Sharding │├─────────────────────────────────────────────────────────────┤│ ││ User IDs 1-1M User IDs 1M-2M User IDs 2M-3M ││ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ││ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ ││ └─────────────┘ └─────────────┘ └─────────────┘ ││ │└─────────────────────────────────────────────────────────────┘Pros: Simple, range queries workCons: Hotspots (new users all hit last shard)
┌───────────────────┐ ╱ ╲ │ │ Node A ●───────────● Node B ● ● │ │ │ Hash Ring │ │ │ ● ● Node D ●───────────● Node C │ │ ╲ ╱ └───────────────────┘• Keys hash to positions on ring• Assigned to next node clockwise• Adding/removing nodes only affects neighbors
A query that took 5ms on a single DB may take 50-200ms when scatter-gathering across 10 shards
Transactions
2PC, Saga pattern, design to avoid
Distributed transactions are 10-100x slower than local transactions; design your shard key to keep related data together
Resharding
Consistent hashing, virtual shards
Without virtual shards, adding one node can require migrating 1/N of all data; with virtual shards, you only move specific virtual partitions
Hotspots
Better shard key, splitting hot shards
Instagram famously had to re-shard because celebrity accounts created hot partitions; the shard key choice is the most consequential sharding decision
Interview Pattern: When discussing sharding, always mention the shard key selection first. The best shard key matches your most common access pattern and distributes load evenly. A strong answer sounds like: “I would shard by user_id because our primary access pattern is per-user queries. This keeps all of a user’s data co-located on one shard, avoiding cross-shard joins for the hot path. The risk is celebrity users creating hot shards, which I would mitigate with a salt suffix for high-traffic accounts.”
-- Single Column IndexCREATE INDEX idx_email ON users(email);-- Composite Index (order matters!)CREATE INDEX idx_name ON users(last_name, first_name);-- Helps: WHERE last_name = 'Smith'-- Helps: WHERE last_name = 'Smith' AND first_name = 'John'-- Does NOT help: WHERE first_name = 'John'-- Covering Index (includes columns)CREATE INDEX idx_covering ON users(email) INCLUDE (name, status);-- Query reads only from index, no table lookup-- Partial IndexCREATE INDEX idx_active ON users(email) WHERE status = 'active';-- Smaller index, only for specific queries-- Full-Text IndexCREATE INDEX idx_content ON articles USING GIN(to_tsvector('english', content));-- For text search
Storage space (indexes can be 10-30% of table size)
Faster joins
Memory usage (hot indexes should fit in RAM)
—
Index maintenance (fragmentation, vacuuming)
Practical Sizing Rule: A well-indexed query on PostgreSQL can serve 1,000-5,000 QPS on a single server for simple lookups. Without the right index, the same query might only manage 10-50 QPS because it triggers a full table scan. In interviews, when you mention a database, always specify what you would index and why — it shows operational maturity. The composite index column order follows the “leftmost prefix” rule: an index on (A, B, C) supports queries on A, (A, B), and (A, B, C), but not B alone or (B, C).
EXPLAIN ANALYZESELECT * FROM orders WHERE user_id = 123AND created_at > '2024-01-01';-- Output:-- Index Scan using idx_orders_user_date on orders-- Index Cond: ((user_id = 123) AND (created_at > '2024-01-01'))-- Rows Returned: 45-- Actual Time: 0.023..0.089 ms
-- [Bad] Function on column prevents index useSELECT * FROM users WHERE YEAR(created_at) = 2024;-- [Good] Range query uses indexSELECT * FROM users WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';-- [Bad] SELECT * fetches unnecessary dataSELECT * FROM users WHERE id = 123;-- [Good] Select only needed columnsSELECT id, name, email FROM users WHERE id = 123;-- [Bad] N+1 queriesfor user in users: orders = query("SELECT * FROM orders WHERE user_id = ?", user.id)-- [Good] Single query with JOINSELECT u.*, o.* FROM users u LEFT JOIN orders o ON u.id = o.user_idWHERE u.status = 'active';
Design Tip: Start with PostgreSQL for most applications. It handles JSON, full-text search, and scales well. Add specialized databases only when needed.
When would you choose a NoSQL database over a relational database, and what are the real trade-offs you've seen in production?
Strong answer:
The way I think about this is: SQL is the default, and you need a compelling reason to reach for NoSQL. The compelling reasons are (a) your data is genuinely unstructured or polymorphic, like user-generated content where every document has different fields, (b) you need horizontal write scalability beyond what a single Postgres node can handle — typically north of 50K writes/sec sustained, or (c) your access pattern is almost entirely key-value lookups with no cross-entity queries.
In practice, most teams I have seen pick MongoDB because they want “flexible schema,” but what they really mean is they do not want to write migrations. That is a terrible reason — you just move schema enforcement from the database to every single application query, which is strictly worse. Schemaless does not mean schema-free; it means the schema lives in your application code instead of the database, and every reader has to handle every historical shape of the data.
The real trade-off is operational: with SQL, you get transactions, foreign keys, and mature tooling for backup/restore, point-in-time recovery, and EXPLAIN plans. With NoSQL, you trade those guarantees for horizontal scalability and flexible data models. At a company processing 500M events/day for analytics, Cassandra was the right call because the write volume was enormous and we never needed cross-partition joins. But for the core user/account/billing system, Postgres was non-negotiable because we needed ACID.
What most people miss is that modern Postgres handles JSON columns (jsonb) extremely well — you can have a relational schema with flexible sub-documents, getting 80% of the NoSQL flexibility with full ACID guarantees.
Red flag answer: “NoSQL is better because it scales horizontally and SQL doesn’t scale.” This ignores that most applications never outgrow a single well-tuned Postgres instance, and that horizontal scaling introduces massive complexity around transactions and consistency.Follow-ups:
You mentioned Cassandra for high write volume. How does its write path differ from Postgres internally, and why does that make it faster for writes?
If a team is already on MongoDB and experiencing data consistency bugs from the lack of schema enforcement, what migration strategy would you propose to move the critical path to Postgres without downtime?
Explain the leftmost prefix rule in composite indexes. Why does column order matter, and how would you decide the order?
Strong answer:
A composite index on (A, B, C) is essentially a sorted tree where entries are first sorted by A, then by B within each A value, then by C within each (A, B) pair. This means the index can efficiently serve queries that filter on A, or (A, B), or (A, B, C), but not B alone or (B, C) — because without knowing A first, the index entries for a given B value are scattered across the entire tree.
The way I decide column order is: put the column with the highest selectivity that appears in equality conditions first. If your query is WHERE status = 'active' AND created_at > '2024-01-01', and status has 5 distinct values while created_at has millions, you still put status first if it is an equality filter. Equality columns always come before range columns in the index definition, because once you hit a range condition, the index cannot use subsequent columns efficiently.
A real-world example: on an e-commerce orders table, an index on (user_id, status, created_at) serves WHERE user_id = 123 AND status = 'shipped' AND created_at > '2024-01-01' perfectly. But if you put created_at first, the query planner would scan a huge date range and then filter in-memory for user_id and status — orders of magnitude slower.
The gotcha here is covering indexes: if your query only needs columns that are all in the index (including INCLUDE columns), the database never touches the heap table at all. That is an index-only scan and it is dramatically faster — I have seen queries go from 50ms to 0.5ms just by adding an INCLUDE clause.
Red flag answer: “You should just index every column that appears in a WHERE clause.” This shows no understanding of how multi-column indexes work and would bloat write performance and storage without actually helping query performance.Follow-ups:
You have a query that filters on WHERE B = 5 AND C = 10 but your only composite index is (A, B, C). What happens at the query plan level, and what would you do about it?
How would you decide between creating a new composite index versus adding INCLUDE columns to an existing one? What is the write amplification trade-off?
Walk me through how you would choose a shard key for a multi-tenant SaaS application.
Strong answer:
The shard key decision is the single most consequential choice in a sharded architecture because it is nearly impossible to change later without a full data migration. The key principles are: (1) the shard key should match your dominant access pattern so that most queries hit exactly one shard, (2) it should distribute data evenly to avoid hotspots, and (3) it should be immutable — you never want a row to need to move between shards.
For a multi-tenant SaaS app, tenant_id is the natural first choice. Most queries are scoped to a single tenant — “show me tenant X’s dashboards, users, settings.” This keeps all of a tenant’s data co-located on one shard, so the hot path never crosses shards.
The risk is tenant size skew. If you have 10,000 small tenants and 3 enterprise tenants that each have 100x the data, those 3 tenants become hot shards. The mitigation is a tiered approach: small tenants share shards (multi-tenant per shard), while large tenants get dedicated shards. Some teams use a shard mapping table — tenant_id -> shard_id — so you can move a growing tenant to its own shard without changing the application logic.
What most people miss is cross-tenant queries. Your admin dashboard that shows “total users across all tenants” now requires a scatter-gather across every shard. The solution is to maintain a separate analytics datastore (like ClickHouse or BigQuery) that aggregates cross-tenant data asynchronously, so your operational shards stay fast.
Red flag answer: “I would shard by auto-increment ID using modulo.” This ignores the access pattern entirely and guarantees that every tenant-scoped query becomes a scatter-gather across all shards.Follow-ups:
A tenant that was on a shared shard has grown 50x and is now causing latency for the other tenants on that shard. How do you migrate them to a dedicated shard with zero downtime?
How would your shard key decision change if the application also needed to support queries like “find all tenants in the healthcare industry” or global search across tenants?
What is the difference between synchronous and asynchronous replication, and when would you accept the risk of data loss with async replication?
Strong answer:
With synchronous replication, the primary waits for at least one replica to acknowledge the write before returning success to the client. This guarantees that if the primary dies, no committed data is lost — but it adds network round-trip latency to every write. With async replication, the primary returns success immediately and the replica catches up in the background, which is faster but means the replica can be seconds behind — if the primary dies, those un-replicated writes are gone.
In practice, I would use synchronous replication for anything involving money, legal records, or data that is expensive or impossible to reconstruct. Financial transactions at a payments company, for example — losing even one committed transaction is unacceptable.
I would accept async replication for data that is high-volume but reconstructable or where slight staleness is tolerable. Examples: analytics events (worst case, you lose a few seconds of events and your dashboard is marginally off), session data (user logs in again), or cached materialized views that get rebuilt periodically.
The nuanced middle ground is semi-synchronous replication, which Postgres supports as “synchronous commit to at least N replicas.” You can set synchronous_standby_names to require one of two replicas to ACK, so if one replica is slow, the other picks up the slack. This gets you durability without the worst-case latency of waiting for the slowest replica. In MySQL, rpl_semi_sync_master_wait_for_slave_count does the same thing. The trade-off is that if both replicas are down, writes either block or you have to manually failover to async mode.
Red flag answer: “Always use synchronous replication because data loss is never acceptable.” This shows no understanding of real-world trade-offs — synchronous replication across data centers can add 50-100ms per write, which would destroy throughput for high-volume systems.Follow-ups:
You are running async replication and the primary fails with 3 seconds of un-replicated writes. How do you detect what was lost and what is your recovery procedure?
How does replication lag affect your read/write splitting strategy, and what mechanisms would you put in place to handle it?
You run EXPLAIN ANALYZE on a query and see a sequential scan on a 50-million-row table despite having an index. What are the possible causes and how do you fix it?
Strong answer:
The most common reason is that the query planner estimates the query will return a large fraction of the table. Postgres’s cost-based optimizer decides that a sequential scan reading the entire table in order is cheaper than an index scan that requires random I/O for each row. This typically happens when the WHERE clause matches more than roughly 10-20% of the table. The fix depends on whether the planner’s estimate is correct.
If the estimate is wrong — for example, stale statistics — run ANALYZE on the table to refresh the planner’s statistics. I have seen cases where a table grew from 1M to 50M rows but ANALYZE had not run, so the planner thought the table was still small. In Postgres, autovacuum handles this, but on tables with very high write rates, autovacuum can fall behind.
If the query genuinely matches a large percentage of rows, an index will not help — it is actually faster to do a sequential scan. The fix is to redesign the query: add more selective filters, use a partial index that only covers the subset you care about (e.g., CREATE INDEX idx_active ON orders(created_at) WHERE status = 'active'), or use a covering index to enable an index-only scan.
Other causes: the function wrapping the indexed column prevents index usage (e.g., WHERE YEAR(created_at) = 2024 cannot use an index on created_at — rewrite as a range condition), type mismatch between the column and the parameter (e.g., comparing a varchar column to an integer), or the index was created with a different collation or operator class. A sneaky one in Postgres: if the column is nullable and the query uses IS NOT NULL, the B-tree index does include NULLs by default, but the planner may still prefer a seq scan if most rows are non-null.
Red flag answer: “The index must be corrupted, I would drop and recreate it.” Jumping to index corruption without checking EXPLAIN output, statistics, or query structure shows no systematic debugging approach.Follow-ups:
How would you distinguish between “planner chose seq scan because it is genuinely faster” versus “planner chose seq scan because of stale statistics”? What specific numbers in the EXPLAIN output would you look at?
You have a query with WHERE status IN ('pending', 'processing') AND created_at > NOW() - INTERVAL '7 days' on a 200M-row table. Design the optimal index and explain why.
Explain consistent hashing and why it matters for sharding. What happens when you add or remove a node?
Strong answer:
Consistent hashing maps both data keys and server nodes onto a circular hash space (a “ring” from 0 to 2^32). Each key is assigned to the first node encountered when walking clockwise from the key’s position on the ring. The critical advantage over simple hash(key) % N is what happens when N changes.
With hash(key) % N, adding or removing a single server changes the modulo result for nearly every key, meaning you need to migrate almost all data. With consistent hashing, only keys between the removed node and its predecessor (going counter-clockwise) need to move — roughly 1/N of all keys. This is the difference between migrating 100% of data versus migrating 7% when going from 15 to 16 nodes.
In practice, naive consistent hashing creates uneven distribution because nodes are placed at arbitrary points on the ring. The fix is virtual nodes (vnodes) — each physical server gets 100-200 positions on the ring instead of one. This smooths out the distribution and also makes rebalancing more granular. When a node goes down, its load is distributed across many other nodes instead of just the next one on the ring. Cassandra, DynamoDB, and Riak all use this approach.
The operational gotcha is that consistent hashing only helps with data placement — it does not solve the replication or data transfer problem. When you add a node, the data that maps to it still needs to be physically copied from the nodes that currently own it. In Cassandra, this is the “streaming” process and it can saturate the network for hours on large clusters. You need to throttle it to avoid impacting live traffic.
Red flag answer: “Consistent hashing just means using a hash function to pick a server.” This misses the entire point — it is about minimizing data movement during topology changes, not just about hashing.Follow-ups:
How do virtual nodes interact with replication? If a key’s primary is vnode V on physical node A, how do you ensure replicas land on different physical nodes rather than other vnodes on the same machine?
You are using consistent hashing and one node is consistently hotter than others despite vnodes. What debugging steps would you take, and what might be the root cause?
Describe the N+1 query problem, how you detect it in production, and what strategies exist beyond just adding a JOIN.
Strong answer:
The N+1 problem occurs when code loads a list of N entities and then, for each one, fires a separate query to load a related entity. So you get 1 query for the list plus N queries for the relations — hence N+1. On a page showing 50 orders with their products, that is 51 database round-trips instead of 1 or 2. Each round-trip adds latency (typically 0.5-2ms per query on a local network), so N+1 can silently turn a 5ms page load into a 100ms one.
Detection in production: the best tool is query logging with aggregation. In Postgres, pg_stat_statements groups identical query patterns and shows total execution count and time — if you see SELECT * FROM products WHERE id = $1 called 10,000 times per minute with identical patterns, that is N+1. APM tools like Datadog or New Relic can also show “queries per request” as a metric, and any request with more than ~10 queries is suspicious. Django has django-debug-toolbar and nplusone library; Rails has bullet gem.
Solutions beyond JOIN: (1) Batch loading — collect all the IDs and do WHERE id IN (...) in one query. ORMs call this “eager loading” (select_related/prefetch_related in Django, .includes() in Rails). (2) DataLoader pattern (from GraphQL) — batch and deduplicate database calls within a single request lifecycle. (3) Denormalization — embed the related data in the parent record if it is read-heavy and rarely changes. (4) Caching the related entities — if products rarely change, a cache hit avoids the query entirely.
The trade-off with JOINs at scale is that large JOINs can be expensive too — a JOIN across two 100M-row tables with poor indexes can be worse than N+1 with good caching. The real answer is always: measure first, optimize the specific bottleneck.
Red flag answer: “Just add eager loading everywhere to prevent it.” Blindly eager-loading all relations leads to loading massive amounts of unused data, wasting memory and bandwidth. You should only eager-load relations you actually use.Follow-ups:
You are working with a GraphQL API that has deeply nested resolvers. How does the DataLoader pattern solve N+1 at the resolver level, and what are its limitations?
You have identified an N+1 problem in a critical path, but the ORM’s eager loading generates a JOIN that is even slower due to Cartesian product explosion with multiple has-many relations. What do you do?
How does connection pooling work, and what happens when your pool is exhausted? Walk me through sizing a connection pool.
Strong answer:
A connection pool maintains a set of pre-established database connections that application threads borrow and return, avoiding the overhead of creating a new TCP connection and completing the authentication handshake for every query. Creating a Postgres connection takes 5-20ms (TCP handshake + TLS + auth), so a pool eliminates that latency for the steady state.
When the pool is exhausted, the behavior depends on configuration: most pools will queue the request and wait up to a timeout (e.g., 30 seconds) for a connection to be returned. If the timeout expires, the application gets an exception. In production, pool exhaustion is usually a symptom, not the root cause — the real problem is slow queries holding connections too long, a missing index causing queries to take 5 seconds instead of 5 milliseconds, or a transaction left open without committing.
For sizing, I use this formula as a starting point: pool_size = (number_of_cores * 2) + effective_spindle_count. For a 4-core machine with SSD storage, that is roughly 9-10 connections. The key insight is that more connections is not better — Postgres performance degrades beyond a certain point because each connection consumes about 5-10MB of RAM and the OS context-switches between them. I have seen cases where reducing the pool from 200 to 30 connections actually improved throughput because the database stopped thrashing.
In a microservices architecture, pool sizing gets harder because you have N services each with their own pool. If you have 10 services with 20 connections each, that is 200 connections to one Postgres instance. The solution is PgBouncer or a similar connection pooler in front of the database, which multiplexes hundreds of application-side connections into a smaller number of actual database connections (say, 50). PgBouncer in transaction-mode pooling is the standard recommendation.
Red flag answer: “Set the pool size to something large like 500 so you never run out.” This shows no understanding that databases have finite connection capacity, and that each connection consumes memory and CPU on the database server.Follow-ups:
What is the difference between PgBouncer’s session-mode, transaction-mode, and statement-mode pooling? When would you use each, and what breaks in transaction mode?
You see ERROR: remaining connection slots are reserved for superuser connections in production. Walk me through your immediate triage and the longer-term fix.