> ## Documentation Index > Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt > Use this file to discover all available pages before exploring further. # Database Design > SQL vs NoSQL, Sharding, Replication, and Indexing strategies Database Sharding Strategies

## SQL vs NoSQL This is one of the most common and most misunderstood decisions in system design. The real answer is almost never "one or the other" -- most production systems at scale use both, each for what it does best. Think of SQL databases as spreadsheets with strict rules: every row follows the same schema, relationships between tables are enforced, and complex queries across multiple tables are a first-class feature. NoSQL databases are more like filing cabinets: each drawer (document/key) can hold whatever you put in it, you can add new drawers trivially, and looking up a specific drawer by its label is extremely fast -- but correlating information across drawers requires more manual effort. ### When to Use SQL (Relational) Financial transactions, inventory management Reporting, analytics with JOINs Fixed schema, clear relationships Foreign keys, constraints matter ### When to Use NoSQL Evolving data models, semi-structured data Horizontal scaling, massive data volume Key-value lookups, caching layer Time series, event logging ### NoSQL Types ``` ┌─────────────────────────────────────────────────────────────────┐ │ NoSQL Types │ ├─────────────────┬─────────────────┬─────────────────┬───────────┤ │ Key-Value │ Document │ Column-Family │ Graph │ ├─────────────────┼─────────────────┼─────────────────┼───────────┤ │ Redis │ MongoDB │ Cassandra │ Neo4j │ │ Memcached │ CouchDB │ HBase │ Amazon │ │ DynamoDB │ DynamoDB │ BigTable │ Neptune │ ├─────────────────┼─────────────────┼─────────────────┼───────────┤ │ Simple lookups │ Flexible docs │ Wide columns │ Relations │ │ Caching │ Content mgmt │ Time series │ Social │ │ Session store │ User profiles │ Analytics │ Recommend │ └─────────────────┴─────────────────┴─────────────────┴───────────┘ ``` ## Database Sharding Distribute data across multiple databases for horizontal scaling. Sharding is the database equivalent of splitting a library's catalog across multiple buildings by subject: each building handles a fraction of the total collection, so no single building becomes overcrowded. The catch is that if someone wants to search across all subjects at once, they now need to visit every building and combine the results. Sharding is a last resort, not a first choice. It introduces significant operational complexity: cross-shard queries become expensive, transactions spanning multiple shards require distributed coordination (like 2PC or Saga patterns), and rebalancing data when you add or remove shards is painful. Exhaust vertical scaling, read replicas, and caching before you reach for sharding. ### Sharding Strategies #### 1. Range-Based Sharding ``` ┌─────────────────────────────────────────────────────────────┐ │ User ID Range Sharding │ ├─────────────────────────────────────────────────────────────┤ │ │ │ User IDs 1-1M User IDs 1M-2M User IDs 2M-3M │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ │ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ │ └─────────────────────────────────────────────────────────────┘ Pros: Simple, range queries work Cons: Hotspots (new users all hit last shard) ``` #### 2. Hash-Based Sharding ```python theme={null} def get_shard(user_id, num_shards): return hash(user_id) % num_shards # User 12345 → Shard (12345 % 4) = Shard 1 # User 67890 → Shard (67890 % 4) = Shard 2 ``` ``` ┌─────────────────────────────────────────────────────────────┐ │ Hash-Based Sharding │ ├─────────────────────────────────────────────────────────────┤ │ │ │ hash(id) % 4 │ │ ┌────────────────────────────────────────┐ │ │ │ Shard Key │ │ │ └────────────────────────────────────────┘ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ │ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │ │Shard 0 │ │Shard 1 │ │Shard 2 │ │Shard 3 │ │ │ └────────┘ └────────┘ └────────┘ └────────┘ │ │ │ └─────────────────────────────────────────────────────────────┘ Pros: Even distribution Cons: Range queries need all shards, resharding is hard ``` #### 3. Consistent Hashing ``` ┌───────────────────┐ ╱ ╲ │ │ Node A ●───────────● Node B ● ● │ │ │ Hash Ring │ │ │ ● ● Node D ●───────────● Node C │ │ ╲ ╱ └───────────────────┘ • Keys hash to positions on ring • Assigned to next node clockwise • Adding/removing nodes only affects neighbors ``` ### Sharding Challenges | Challenge | Solution | Real-World Impact | | ----------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | | **Cross-shard queries** | Application-level joins, denormalization | A query that took 5ms on a single DB may take 50-200ms when scatter-gathering across 10 shards | | **Transactions** | 2PC, Saga pattern, design to avoid | Distributed transactions are 10-100x slower than local transactions; design your shard key to keep related data together | | **Resharding** | Consistent hashing, virtual shards | Without virtual shards, adding one node can require migrating 1/N of all data; with virtual shards, you only move specific virtual partitions | | **Hotspots** | Better shard key, splitting hot shards | Instagram famously had to re-shard because celebrity accounts created hot partitions; the shard key choice is the most consequential sharding decision | **Interview Pattern**: When discussing sharding, always mention the shard key selection first. The best shard key matches your most common access pattern and distributes load evenly. A strong answer sounds like: "I would shard by user\_id because our primary access pattern is per-user queries. This keeps all of a user's data co-located on one shard, avoiding cross-shard joins for the hot path. The risk is celebrity users creating hot shards, which I would mitigate with a salt suffix for high-traffic accounts." ## Database Replication ### Synchronous vs Asynchronous ``` Synchronous Replication Asynchronous Replication Client ──► Master ──► Replica Client ──► Master ─ ─► Replica │ │ ▼ ▼ Wait for ACK Return immediately │ │ ▼ ▼ Return to client Replica catches up later Pros: Strong consistency Pros: Lower latency Cons: Higher latency Cons: Possible data loss ``` ### Replication Topologies ``` Single Leader Multi-Leader ┌────────┐ ┌────────┐ ┌────────┐ │ Leader │ │Leader A│◄─►│Leader B│ └────┬───┘ └────┬───┘ └────┬───┘ │ │ │ ┌────┴────┐ ┌────┴───┐ ┌────┴───┐ │ │ │Follower│ │Follower│ ┌──┴──┐ ┌──┴──┐ └────────┘ └────────┘ │Foll1│ │Foll2│ └─────┘ └─────┘ Use: Multi-datacenter Challenge: Conflict resolution Leaderless (Dynamo-style) ┌──────┐ ┌──────┐ ┌──────┐ │Node A│◄─►│Node B│◄─►│Node C│ └──────┘ └──────┘ └──────┘ ▲ ▲ └─────────────────────┘ Write to W nodes, Read from R nodes W + R > N ensures consistency ``` ## Indexing ### B-Tree Index (Default) ``` ┌───────────────────┐ │ Root Node │ │ [10, 20, 30] │ └─────────┬─────────┘ │ ┌─────────────────────┼─────────────────────┐ │ │ │ ┌───────▼───────┐ ┌─────────▼─────────┐ ┌───────▼───────┐ │ [5, 8, 9] │ │ [12, 15, 18] │ │ [25, 28, 35] │ └───────┬───────┘ └─────────┬─────────┘ └───────┬───────┘ │ │ │ ▼ ▼ ▼ [Leaf nodes with actual row pointers] • Balanced tree structure • O(log n) lookups • Good for range queries • Default for most databases ``` ### Index Types ```sql theme={null} -- Single Column Index CREATE INDEX idx_email ON users(email); -- Composite Index (order matters!) CREATE INDEX idx_name ON users(last_name, first_name); -- Helps: WHERE last_name = 'Smith' -- Helps: WHERE last_name = 'Smith' AND first_name = 'John' -- Does NOT help: WHERE first_name = 'John' -- Covering Index (includes columns) CREATE INDEX idx_covering ON users(email) INCLUDE (name, status); -- Query reads only from index, no table lookup -- Partial Index CREATE INDEX idx_active ON users(email) WHERE status = 'active'; -- Smaller index, only for specific queries -- Full-Text Index CREATE INDEX idx_content ON articles USING GIN(to_tsvector('english', content)); -- For text search ``` ### Index Trade-offs | Benefit | Cost | | -------------- | --------------------------------------------------- | | Faster reads | Slower writes (each write must update the index) | | Faster sorting | Storage space (indexes can be 10-30% of table size) | | Faster joins | Memory usage (hot indexes should fit in RAM) | | -- | Index maintenance (fragmentation, vacuuming) | **Practical Sizing Rule**: A well-indexed query on PostgreSQL can serve 1,000-5,000 QPS on a single server for simple lookups. Without the right index, the same query might only manage 10-50 QPS because it triggers a full table scan. In interviews, when you mention a database, always specify what you would index and why -- it shows operational maturity. The composite index column order follows the "leftmost prefix" rule: an index on (A, B, C) supports queries on A, (A, B), and (A, B, C), but not B alone or (B, C). ## Practical Database Patterns ### Connection Pooling Managing database connections efficiently is critical for performance. ```python theme={null} from sqlalchemy import create_engine from sqlalchemy.orm import sessionmaker, scoped_session from contextlib import contextmanager class DatabasePool: """Production-ready database connection pool""" def __init__(self, database_url: str): self.engine = create_engine( database_url, pool_size=20, # Persistent connections max_overflow=30, # Extra connections under load pool_timeout=30, # Wait time for connection pool_recycle=3600, # Recycle connections hourly pool_pre_ping=True, # Verify connection before use echo=False # Set True for SQL logging ) self._session_factory = sessionmaker(bind=self.engine) self.Session = scoped_session(self._session_factory) @contextmanager def get_session(self): """Context manager for database sessions""" session = self.Session() try: yield session session.commit() except Exception as e: session.rollback() raise e finally: session.close() def execute_with_retry(self, query, params=None, max_retries=3): """Execute query with automatic retry on transient failures""" import time for attempt in range(max_retries): try: with self.get_session() as session: result = session.execute(query, params or {}) return result.fetchall() except Exception as e: if "connection" in str(e).lower() and attempt < max_retries - 1: time.sleep(2 ** attempt) # Exponential backoff continue raise # Usage db = DatabasePool("postgresql://user:pass@localhost/mydb") with db.get_session() as session: users = session.query(User).filter(User.active == True).all() ``` ```javascript theme={null} const { Pool } = require('pg'); class DatabasePool { constructor(config) { this.pool = new Pool({ host: config.host, port: config.port, database: config.database, user: config.user, password: config.password, // Connection pool settings max: 20, // Max connections min: 5, // Min connections idleTimeoutMillis: 30000, // Close idle after 30s connectionTimeoutMillis: 5000, // Timeout on connect // Health checks allowExitOnIdle: false }); // Log pool events for monitoring this.pool.on('error', (err) => { console.error('Unexpected pool error:', err); }); this.pool.on('connect', (client) => { console.log('New client connected'); }); } async query(text, params) { const start = Date.now(); const result = await this.pool.query(text, params); const duration = Date.now() - start; console.log('Query executed', { text: text.substring(0, 100), duration, rows: result.rowCount }); return result; } async withTransaction(callback) { const client = await this.pool.connect(); try { await client.query('BEGIN'); const result = await callback(client); await client.query('COMMIT'); return result; } catch (error) { await client.query('ROLLBACK'); throw error; } finally { client.release(); } } async executeWithRetry(text, params, maxRetries = 3) { for (let attempt = 0; attempt < maxRetries; attempt++) { try { return await this.query(text, params); } catch (error) { if (this.isTransientError(error) && attempt < maxRetries - 1) { await this.sleep(Math.pow(2, attempt) * 1000); continue; } throw error; } } } isTransientError(error) { const transientCodes = ['ECONNRESET', 'ETIMEDOUT', '57P01']; return transientCodes.some(code => error.code === code || error.message.includes(code) ); } sleep(ms) { return new Promise(resolve => setTimeout(resolve, ms)); } } // Usage const db = new DatabasePool({ host: 'localhost', database: 'mydb', user: 'postgres', password: 'secret' }); // Simple query const users = await db.query( 'SELECT * FROM users WHERE active = $1', [true] ); // With transaction const result = await db.withTransaction(async (client) => { await client.query('UPDATE accounts SET balance = balance - $1 WHERE id = $2', [100, 'acc1']); await client.query('UPDATE accounts SET balance = balance + $1 WHERE id = $2', [100, 'acc2']); return { success: true }; }); ``` ### Read/Write Splitting Route reads to replicas and writes to primary for better scalability. ```python theme={null} import random from datetime import datetime, timedelta from typing import Optional, List, Any from enum import Enum class QueryType(Enum): READ = "read" WRITE = "write" class ReadWriteRouter: """ Routes queries to primary or replicas based on query type and session consistency requirements. """ def __init__(self, primary_pool, replica_pools: list): self.primary = primary_pool self.replicas = replica_pools self.user_writes = {} # Track recent writes per user self.replication_lag_threshold = 5 # seconds def get_connection(self, query_type: QueryType, user_id: Optional[str] = None): """ Get appropriate connection based on query type and user session. Ensures read-your-writes consistency. """ if query_type == QueryType.WRITE: # All writes go to primary self._record_write(user_id) return self.primary # For reads, check if user recently wrote if user_id and self._should_read_from_primary(user_id): return self.primary # Otherwise, load balance across replicas return self._get_healthy_replica() def _record_write(self, user_id: Optional[str]): if user_id: self.user_writes[user_id] = datetime.utcnow() def _should_read_from_primary(self, user_id: str) -> bool: """Read from primary if user wrote recently""" last_write = self.user_writes.get(user_id) if not last_write: return False time_since_write = (datetime.utcnow() - last_write).total_seconds() return time_since_write < self.replication_lag_threshold def _get_healthy_replica(self): """Get a healthy replica with lowest lag""" healthy = [r for r in self.replicas if self._is_healthy(r)] if not healthy: # Fallback to primary if no healthy replicas return self.primary # Random selection (could be weighted by lag) return random.choice(healthy) def _is_healthy(self, replica) -> bool: try: # Check replication lag result = replica.execute("SELECT pg_last_xact_replay_timestamp()") lag = (datetime.utcnow() - result).total_seconds() return lag < 30 # Healthy if lag < 30 seconds except: return False # ORM Integration Example class UserRepository: def __init__(self, router: ReadWriteRouter): self.router = router def find_by_id(self, user_id: str, session_user: str = None) -> dict: """Read operation - goes to replica""" conn = self.router.get_connection(QueryType.READ, session_user) return conn.execute( "SELECT * FROM users WHERE id = %s", (user_id,) ).fetchone() def update(self, user_id: str, data: dict, session_user: str) -> bool: """Write operation - goes to primary""" conn = self.router.get_connection(QueryType.WRITE, session_user) conn.execute( "UPDATE users SET name = %s WHERE id = %s", (data['name'], user_id) ) return True # Usage router = ReadWriteRouter(primary_pool, [replica1, replica2]) repo = UserRepository(router) # This write goes to primary repo.update("user_123", {"name": "Alice"}, session_user="user_123") # This read goes to primary (recent write by same user) user = repo.find_by_id("user_123", session_user="user_123") # This read goes to replica (different user) user = repo.find_by_id("user_123", session_user="user_456") ``` ```javascript theme={null} class ReadWriteRouter { constructor(primaryPool, replicaPools) { this.primary = primaryPool; this.replicas = replicaPools; this.userWrites = new Map(); // userId -> timestamp this.replicationLagThreshold = 5000; // 5 seconds this.currentReplicaIndex = 0; } getConnection(queryType, userId = null) { if (queryType === 'write') { this.recordWrite(userId); return this.primary; } // Read-your-writes consistency if (userId && this.shouldReadFromPrimary(userId)) { return this.primary; } return this.getHealthyReplica(); } recordWrite(userId) { if (userId) { this.userWrites.set(userId, Date.now()); } } shouldReadFromPrimary(userId) { const lastWrite = this.userWrites.get(userId); if (!lastWrite) return false; return (Date.now() - lastWrite) < this.replicationLagThreshold; } getHealthyReplica() { // Round-robin with health check const startIndex = this.currentReplicaIndex; do { const replica = this.replicas[this.currentReplicaIndex]; this.currentReplicaIndex = (this.currentReplicaIndex + 1) % this.replicas.length; if (this.isHealthy(replica)) { return replica; } } while (this.currentReplicaIndex !== startIndex); // Fallback to primary console.warn('No healthy replicas, falling back to primary'); return this.primary; } async isHealthy(replica) { try { const result = await replica.query( "SELECT pg_last_xact_replay_timestamp() as lag_time" ); const lagTime = new Date(result.rows[0].lag_time); const lagSeconds = (Date.now() - lagTime) / 1000; return lagSeconds < 30; } catch { return false; } } } // Express.js middleware for automatic routing function dbMiddleware(router) { return (req, res, next) => { const userId = req.user?.id; // Determine query type from HTTP method const queryType = ['GET', 'HEAD', 'OPTIONS'].includes(req.method) ? 'read' : 'write'; req.db = router.getConnection(queryType, userId); next(); }; } // Usage with Express const router = new ReadWriteRouter(primaryPool, [replica1, replica2]); app.use(dbMiddleware(router)); app.get('/api/users/:id', async (req, res) => { // Automatically routes to replica const result = await req.db.query( 'SELECT * FROM users WHERE id = $1', [req.params.id] ); res.json(result.rows[0]); }); app.put('/api/users/:id', async (req, res) => { // Automatically routes to primary await req.db.query( 'UPDATE users SET name = $1 WHERE id = $2', [req.body.name, req.params.id] ); res.json({ success: true }); }); ``` ### Sharding Implementation Sharding Strategies

```python theme={null} import hashlib from typing import Dict, Any, List class ShardRouter: """ Routes queries to appropriate shard based on shard key. Uses consistent hashing for even distribution. """ def __init__(self, shards: Dict[int, Any]): self.shards = shards # shard_id -> connection pool self.num_shards = len(shards) def get_shard(self, shard_key: str) -> Any: """Get shard connection for a given key""" shard_id = self._compute_shard_id(shard_key) return self.shards[shard_id] def _compute_shard_id(self, key: str) -> int: """Consistent hash to determine shard""" hash_value = int(hashlib.md5(key.encode()).hexdigest(), 16) return hash_value % self.num_shards def execute_on_shard(self, shard_key: str, query: str, params: tuple): """Execute query on appropriate shard""" shard = self.get_shard(shard_key) return shard.execute(query, params) def scatter_gather(self, query: str, params: tuple = None) -> List[Any]: """Execute query on ALL shards and combine results""" results = [] for shard_id, shard in self.shards.items(): try: result = shard.execute(query, params or ()) results.extend(result.fetchall()) except Exception as e: print(f"Shard {shard_id} failed: {e}") return results class ShardedUserRepository: """User repository with sharding by user_id""" def __init__(self, shard_router: ShardRouter): self.router = shard_router def create_user(self, user_id: str, data: dict) -> bool: """Create user on appropriate shard""" query = """ INSERT INTO users (id, name, email, created_at) VALUES (%s, %s, %s, NOW()) """ self.router.execute_on_shard( shard_key=user_id, query=query, params=(user_id, data['name'], data['email']) ) return True def find_by_id(self, user_id: str) -> dict: """Find user - knows exactly which shard""" query = "SELECT * FROM users WHERE id = %s" result = self.router.execute_on_shard(user_id, query, (user_id,)) return result.fetchone() def find_by_email(self, email: str) -> dict: """Find by email - must scatter-gather all shards""" query = "SELECT * FROM users WHERE email = %s" results = self.router.scatter_gather(query, (email,)) return results[0] if results else None def get_total_count(self) -> int: """Count across all shards""" query = "SELECT COUNT(*) as count FROM users" results = self.router.scatter_gather(query) return sum(r['count'] for r in results) # Setup shards = { 0: create_pool("postgresql://localhost:5432/users_shard_0"), 1: create_pool("postgresql://localhost:5433/users_shard_1"), 2: create_pool("postgresql://localhost:5434/users_shard_2"), 3: create_pool("postgresql://localhost:5435/users_shard_3"), } router = ShardRouter(shards) repo = ShardedUserRepository(router) # Usage repo.create_user("user_123", {"name": "Alice", "email": "alice@example.com"}) user = repo.find_by_id("user_123") # Goes to one shard user = repo.find_by_email("alice@example.com") # Queries ALL shards ``` ```javascript theme={null} const crypto = require('crypto'); class ShardRouter { constructor(shards) { this.shards = shards; // Map: shardId -> pool this.numShards = Object.keys(shards).length; } getShard(shardKey) { const shardId = this.computeShardId(shardKey); return this.shards[shardId]; } computeShardId(key) { const hash = crypto.createHash('md5').update(key).digest('hex'); const hashValue = parseInt(hash.substring(0, 8), 16); return hashValue % this.numShards; } async executeOnShard(shardKey, query, params = []) { const shard = this.getShard(shardKey); return await shard.query(query, params); } async scatterGather(query, params = []) { const promises = Object.entries(this.shards).map( async ([shardId, shard]) => { try { const result = await shard.query(query, params); return result.rows; } catch (error) { console.error(`Shard ${shardId} failed:`, error); return []; } } ); const results = await Promise.all(promises); return results.flat(); } } class ShardedUserRepository { constructor(shardRouter) { this.router = shardRouter; } async createUser(userId, data) { const query = ` INSERT INTO users (id, name, email, created_at) VALUES ($1, $2, $3, NOW()) RETURNING * `; const result = await this.router.executeOnShard( userId, query, [userId, data.name, data.email] ); return result.rows[0]; } async findById(userId) { const query = 'SELECT * FROM users WHERE id = $1'; const result = await this.router.executeOnShard(userId, query, [userId]); return result.rows[0]; } async findByEmail(email) { // Must query all shards (email is not shard key) const query = 'SELECT * FROM users WHERE email = $1'; const results = await this.router.scatterGather(query, [email]); return results[0]; } async searchByName(namePattern) { // Full scatter-gather search const query = `SELECT * FROM users WHERE name ILIKE $1 LIMIT 100`; return await this.router.scatterGather(query, [`%${namePattern}%`]); } async getTotalCount() { const query = 'SELECT COUNT(*) as count FROM users'; const results = await this.router.scatterGather(query); return results.reduce((sum, r) => sum + parseInt(r.count), 0); } } // Setup with 4 shards const shards = { 0: new Pool({ connectionString: 'postgres://localhost:5432/users_0' }), 1: new Pool({ connectionString: 'postgres://localhost:5433/users_1' }), 2: new Pool({ connectionString: 'postgres://localhost:5434/users_2' }), 3: new Pool({ connectionString: 'postgres://localhost:5435/users_3' }), }; const router = new ShardRouter(shards); const repo = new ShardedUserRepository(router); // Express routes app.post('/api/users', async (req, res) => { const userId = generateId(); const user = await repo.createUser(userId, req.body); res.status(201).json(user); }); app.get('/api/users/:id', async (req, res) => { const user = await repo.findById(req.params.id); if (!user) return res.status(404).json({ error: 'Not found' }); res.json(user); }); ``` ## Query Optimization ### EXPLAIN ANALYZE ```sql theme={null} EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123 AND created_at > '2024-01-01'; -- Output: -- Index Scan using idx_orders_user_date on orders -- Index Cond: ((user_id = 123) AND (created_at > '2024-01-01')) -- Rows Returned: 45 -- Actual Time: 0.023..0.089 ms ``` ### Common Optimizations ```sql theme={null} -- [Bad] Function on column prevents index use SELECT * FROM users WHERE YEAR(created_at) = 2024; -- [Good] Range query uses index SELECT * FROM users WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'; -- [Bad] SELECT * fetches unnecessary data SELECT * FROM users WHERE id = 123; -- [Good] Select only needed columns SELECT id, name, email FROM users WHERE id = 123; -- [Bad] N+1 queries for user in users: orders = query("SELECT * FROM orders WHERE user_id = ?", user.id) -- [Good] Single query with JOIN SELECT u.*, o.* FROM users u LEFT JOIN orders o ON u.id = o.user_id WHERE u.status = 'active'; ``` ## Database Selection Guide | Requirement | Recommended | | -------------------- | --------------------- | | Complex transactions | PostgreSQL, MySQL | | Simple key-value | Redis, DynamoDB | | Document storage | MongoDB | | Time series | InfluxDB, TimescaleDB | | Full-text search | Elasticsearch | | Graph relations | Neo4j | | Analytics | ClickHouse, BigQuery | | Global scale | CockroachDB, Spanner | **Design Tip**: Start with PostgreSQL for most applications. It handles JSON, full-text search, and scales well. Add specialized databases only when needed. ## Interview Questions **Strong answer:** * The way I think about this is: SQL is the default, and you need a compelling reason to reach for NoSQL. The compelling reasons are (a) your data is genuinely unstructured or polymorphic, like user-generated content where every document has different fields, (b) you need horizontal write scalability beyond what a single Postgres node can handle -- typically north of 50K writes/sec sustained, or (c) your access pattern is almost entirely key-value lookups with no cross-entity queries. * In practice, most teams I have seen pick MongoDB because they want "flexible schema," but what they really mean is they do not want to write migrations. That is a terrible reason -- you just move schema enforcement from the database to every single application query, which is strictly worse. Schemaless does not mean schema-free; it means the schema lives in your application code instead of the database, and every reader has to handle every historical shape of the data. * The real trade-off is operational: with SQL, you get transactions, foreign keys, and mature tooling for backup/restore, point-in-time recovery, and EXPLAIN plans. With NoSQL, you trade those guarantees for horizontal scalability and flexible data models. At a company processing 500M events/day for analytics, Cassandra was the right call because the write volume was enormous and we never needed cross-partition joins. But for the core user/account/billing system, Postgres was non-negotiable because we needed ACID. * What most people miss is that modern Postgres handles JSON columns (`jsonb`) extremely well -- you can have a relational schema with flexible sub-documents, getting 80% of the NoSQL flexibility with full ACID guarantees. **Red flag answer:** "NoSQL is better because it scales horizontally and SQL doesn't scale." This ignores that most applications never outgrow a single well-tuned Postgres instance, and that horizontal scaling introduces massive complexity around transactions and consistency. **Follow-ups:** 1. You mentioned Cassandra for high write volume. How does its write path differ from Postgres internally, and why does that make it faster for writes? 2. If a team is already on MongoDB and experiencing data consistency bugs from the lack of schema enforcement, what migration strategy would you propose to move the critical path to Postgres without downtime? **Strong answer:** * A composite index on `(A, B, C)` is essentially a sorted tree where entries are first sorted by A, then by B within each A value, then by C within each (A, B) pair. This means the index can efficiently serve queries that filter on A, or (A, B), or (A, B, C), but not B alone or (B, C) -- because without knowing A first, the index entries for a given B value are scattered across the entire tree. * The way I decide column order is: put the column with the highest selectivity that appears in equality conditions first. If your query is `WHERE status = 'active' AND created_at > '2024-01-01'`, and status has 5 distinct values while created\_at has millions, you still put status first if it is an equality filter. Equality columns always come before range columns in the index definition, because once you hit a range condition, the index cannot use subsequent columns efficiently. * A real-world example: on an e-commerce orders table, an index on `(user_id, status, created_at)` serves `WHERE user_id = 123 AND status = 'shipped' AND created_at > '2024-01-01'` perfectly. But if you put `created_at` first, the query planner would scan a huge date range and then filter in-memory for user\_id and status -- orders of magnitude slower. * The gotcha here is covering indexes: if your query only needs columns that are all in the index (including `INCLUDE` columns), the database never touches the heap table at all. That is an index-only scan and it is dramatically faster -- I have seen queries go from 50ms to 0.5ms just by adding an `INCLUDE` clause. **Red flag answer:** "You should just index every column that appears in a WHERE clause." This shows no understanding of how multi-column indexes work and would bloat write performance and storage without actually helping query performance. **Follow-ups:** 1. You have a query that filters on `WHERE B = 5 AND C = 10` but your only composite index is `(A, B, C)`. What happens at the query plan level, and what would you do about it? 2. How would you decide between creating a new composite index versus adding `INCLUDE` columns to an existing one? What is the write amplification trade-off? **Strong answer:** * The shard key decision is the single most consequential choice in a sharded architecture because it is nearly impossible to change later without a full data migration. The key principles are: (1) the shard key should match your dominant access pattern so that most queries hit exactly one shard, (2) it should distribute data evenly to avoid hotspots, and (3) it should be immutable -- you never want a row to need to move between shards. * For a multi-tenant SaaS app, `tenant_id` is the natural first choice. Most queries are scoped to a single tenant -- "show me tenant X's dashboards, users, settings." This keeps all of a tenant's data co-located on one shard, so the hot path never crosses shards. * The risk is tenant size skew. If you have 10,000 small tenants and 3 enterprise tenants that each have 100x the data, those 3 tenants become hot shards. The mitigation is a tiered approach: small tenants share shards (multi-tenant per shard), while large tenants get dedicated shards. Some teams use a shard mapping table -- `tenant_id -> shard_id` -- so you can move a growing tenant to its own shard without changing the application logic. * What most people miss is cross-tenant queries. Your admin dashboard that shows "total users across all tenants" now requires a scatter-gather across every shard. The solution is to maintain a separate analytics datastore (like ClickHouse or BigQuery) that aggregates cross-tenant data asynchronously, so your operational shards stay fast. **Red flag answer:** "I would shard by auto-increment ID using modulo." This ignores the access pattern entirely and guarantees that every tenant-scoped query becomes a scatter-gather across all shards. **Follow-ups:** 1. A tenant that was on a shared shard has grown 50x and is now causing latency for the other tenants on that shard. How do you migrate them to a dedicated shard with zero downtime? 2. How would your shard key decision change if the application also needed to support queries like "find all tenants in the healthcare industry" or global search across tenants? **Strong answer:** * With synchronous replication, the primary waits for at least one replica to acknowledge the write before returning success to the client. This guarantees that if the primary dies, no committed data is lost -- but it adds network round-trip latency to every write. With async replication, the primary returns success immediately and the replica catches up in the background, which is faster but means the replica can be seconds behind -- if the primary dies, those un-replicated writes are gone. * In practice, I would use synchronous replication for anything involving money, legal records, or data that is expensive or impossible to reconstruct. Financial transactions at a payments company, for example -- losing even one committed transaction is unacceptable. * I would accept async replication for data that is high-volume but reconstructable or where slight staleness is tolerable. Examples: analytics events (worst case, you lose a few seconds of events and your dashboard is marginally off), session data (user logs in again), or cached materialized views that get rebuilt periodically. * The nuanced middle ground is semi-synchronous replication, which Postgres supports as "synchronous commit to at least N replicas." You can set `synchronous_standby_names` to require one of two replicas to ACK, so if one replica is slow, the other picks up the slack. This gets you durability without the worst-case latency of waiting for the slowest replica. In MySQL, `rpl_semi_sync_master_wait_for_slave_count` does the same thing. The trade-off is that if both replicas are down, writes either block or you have to manually failover to async mode. **Red flag answer:** "Always use synchronous replication because data loss is never acceptable." This shows no understanding of real-world trade-offs -- synchronous replication across data centers can add 50-100ms per write, which would destroy throughput for high-volume systems. **Follow-ups:** 1. You are running async replication and the primary fails with 3 seconds of un-replicated writes. How do you detect what was lost and what is your recovery procedure? 2. How does replication lag affect your read/write splitting strategy, and what mechanisms would you put in place to handle it? **Strong answer:** * The most common reason is that the query planner estimates the query will return a large fraction of the table. Postgres's cost-based optimizer decides that a sequential scan reading the entire table in order is cheaper than an index scan that requires random I/O for each row. This typically happens when the WHERE clause matches more than roughly 10-20% of the table. The fix depends on whether the planner's estimate is correct. * If the estimate is wrong -- for example, stale statistics -- run `ANALYZE` on the table to refresh the planner's statistics. I have seen cases where a table grew from 1M to 50M rows but `ANALYZE` had not run, so the planner thought the table was still small. In Postgres, `autovacuum` handles this, but on tables with very high write rates, autovacuum can fall behind. * If the query genuinely matches a large percentage of rows, an index will not help -- it is actually faster to do a sequential scan. The fix is to redesign the query: add more selective filters, use a partial index that only covers the subset you care about (e.g., `CREATE INDEX idx_active ON orders(created_at) WHERE status = 'active'`), or use a covering index to enable an index-only scan. * Other causes: the function wrapping the indexed column prevents index usage (e.g., `WHERE YEAR(created_at) = 2024` cannot use an index on `created_at` -- rewrite as a range condition), type mismatch between the column and the parameter (e.g., comparing a varchar column to an integer), or the index was created with a different collation or operator class. A sneaky one in Postgres: if the column is nullable and the query uses `IS NOT NULL`, the B-tree index does include NULLs by default, but the planner may still prefer a seq scan if most rows are non-null. **Red flag answer:** "The index must be corrupted, I would drop and recreate it." Jumping to index corruption without checking EXPLAIN output, statistics, or query structure shows no systematic debugging approach. **Follow-ups:** 1. How would you distinguish between "planner chose seq scan because it is genuinely faster" versus "planner chose seq scan because of stale statistics"? What specific numbers in the EXPLAIN output would you look at? 2. You have a query with `WHERE status IN ('pending', 'processing') AND created_at > NOW() - INTERVAL '7 days'` on a 200M-row table. Design the optimal index and explain why. **Strong answer:** * Consistent hashing maps both data keys and server nodes onto a circular hash space (a "ring" from 0 to 2^32). Each key is assigned to the first node encountered when walking clockwise from the key's position on the ring. The critical advantage over simple `hash(key) % N` is what happens when N changes. * With `hash(key) % N`, adding or removing a single server changes the modulo result for nearly every key, meaning you need to migrate almost all data. With consistent hashing, only keys between the removed node and its predecessor (going counter-clockwise) need to move -- roughly `1/N` of all keys. This is the difference between migrating 100% of data versus migrating 7% when going from 15 to 16 nodes. * In practice, naive consistent hashing creates uneven distribution because nodes are placed at arbitrary points on the ring. The fix is virtual nodes (vnodes) -- each physical server gets 100-200 positions on the ring instead of one. This smooths out the distribution and also makes rebalancing more granular. When a node goes down, its load is distributed across many other nodes instead of just the next one on the ring. Cassandra, DynamoDB, and Riak all use this approach. * The operational gotcha is that consistent hashing only helps with data placement -- it does not solve the replication or data transfer problem. When you add a node, the data that maps to it still needs to be physically copied from the nodes that currently own it. In Cassandra, this is the "streaming" process and it can saturate the network for hours on large clusters. You need to throttle it to avoid impacting live traffic. **Red flag answer:** "Consistent hashing just means using a hash function to pick a server." This misses the entire point -- it is about minimizing data movement during topology changes, not just about hashing. **Follow-ups:** 1. How do virtual nodes interact with replication? If a key's primary is vnode V on physical node A, how do you ensure replicas land on different physical nodes rather than other vnodes on the same machine? 2. You are using consistent hashing and one node is consistently hotter than others despite vnodes. What debugging steps would you take, and what might be the root cause? **Strong answer:** * The N+1 problem occurs when code loads a list of N entities and then, for each one, fires a separate query to load a related entity. So you get 1 query for the list plus N queries for the relations -- hence N+1. On a page showing 50 orders with their products, that is 51 database round-trips instead of 1 or 2. Each round-trip adds latency (typically 0.5-2ms per query on a local network), so N+1 can silently turn a 5ms page load into a 100ms one. * Detection in production: the best tool is query logging with aggregation. In Postgres, `pg_stat_statements` groups identical query patterns and shows total execution count and time -- if you see `SELECT * FROM products WHERE id = $1` called 10,000 times per minute with identical patterns, that is N+1. APM tools like Datadog or New Relic can also show "queries per request" as a metric, and any request with more than \~10 queries is suspicious. Django has `django-debug-toolbar` and `nplusone` library; Rails has `bullet` gem. * Solutions beyond JOIN: (1) Batch loading -- collect all the IDs and do `WHERE id IN (...)` in one query. ORMs call this "eager loading" (`select_related`/`prefetch_related` in Django, `.includes()` in Rails). (2) DataLoader pattern (from GraphQL) -- batch and deduplicate database calls within a single request lifecycle. (3) Denormalization -- embed the related data in the parent record if it is read-heavy and rarely changes. (4) Caching the related entities -- if products rarely change, a cache hit avoids the query entirely. * The trade-off with JOINs at scale is that large JOINs can be expensive too -- a JOIN across two 100M-row tables with poor indexes can be worse than N+1 with good caching. The real answer is always: measure first, optimize the specific bottleneck. **Red flag answer:** "Just add eager loading everywhere to prevent it." Blindly eager-loading all relations leads to loading massive amounts of unused data, wasting memory and bandwidth. You should only eager-load relations you actually use. **Follow-ups:** 1. You are working with a GraphQL API that has deeply nested resolvers. How does the DataLoader pattern solve N+1 at the resolver level, and what are its limitations? 2. You have identified an N+1 problem in a critical path, but the ORM's eager loading generates a JOIN that is even slower due to Cartesian product explosion with multiple has-many relations. What do you do? **Strong answer:** * A connection pool maintains a set of pre-established database connections that application threads borrow and return, avoiding the overhead of creating a new TCP connection and completing the authentication handshake for every query. Creating a Postgres connection takes 5-20ms (TCP handshake + TLS + auth), so a pool eliminates that latency for the steady state. * When the pool is exhausted, the behavior depends on configuration: most pools will queue the request and wait up to a timeout (e.g., 30 seconds) for a connection to be returned. If the timeout expires, the application gets an exception. In production, pool exhaustion is usually a symptom, not the root cause -- the real problem is slow queries holding connections too long, a missing index causing queries to take 5 seconds instead of 5 milliseconds, or a transaction left open without committing. * For sizing, I use this formula as a starting point: `pool_size = (number_of_cores * 2) + effective_spindle_count`. For a 4-core machine with SSD storage, that is roughly 9-10 connections. The key insight is that more connections is not better -- Postgres performance degrades beyond a certain point because each connection consumes about 5-10MB of RAM and the OS context-switches between them. I have seen cases where reducing the pool from 200 to 30 connections actually improved throughput because the database stopped thrashing. * In a microservices architecture, pool sizing gets harder because you have N services each with their own pool. If you have 10 services with 20 connections each, that is 200 connections to one Postgres instance. The solution is PgBouncer or a similar connection pooler in front of the database, which multiplexes hundreds of application-side connections into a smaller number of actual database connections (say, 50). PgBouncer in transaction-mode pooling is the standard recommendation. **Red flag answer:** "Set the pool size to something large like 500 so you never run out." This shows no understanding that databases have finite connection capacity, and that each connection consumes memory and CPU on the database server. **Follow-ups:** 1. What is the difference between PgBouncer's session-mode, transaction-mode, and statement-mode pooling? When would you use each, and what breaks in transaction mode? 2. You see `ERROR: remaining connection slots are reserved for superuser connections` in production. Walk me through your immediate triage and the longer-term fix.