Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Database Interview Questions (50+ Detailed Q&A)
EXPLAIN ANALYZE. Understands MVCC, locking strategies, and replication lag. Chooses between SQL and NoSQL for a given use case with clear reasoning.Staff Engineer: Owns the database strategy across the organization. Writes migration runbooks for billion-row tables. Designs the data platform architecture — connection pooling topology, read replica routing, sharding strategy, backup/PITR policy. Makes build-vs-buy decisions on NewSQL, evaluates vendor lock-in, and sets org-wide standards for schema evolution, data lifecycle, and compliance (GDPR deletion, audit trails). Mentors teams on database internals and reviews critical schema changes.1. Fundamentals & SQL
1. ACID Properties
1. ACID Properties
- Atomicity: All or Nothing. A transaction either fully commits or fully rolls back — there is no partial state. Under the hood, PostgreSQL implements this via the Write-Ahead Log (WAL): every change is written to the WAL before being applied to data pages. On crash, the WAL is replayed to complete or roll back in-flight transactions. MySQL’s InnoDB uses a similar redo/undo log system. The practical cost: every write hits the WAL first (sequential disk I/O), which is why
fsyncsettings directly impact write throughput. - Consistency: The database enforces all constraints (foreign keys, unique, CHECK, NOT NULL) before and after every transaction. Critical nuance: “consistency” in ACID is different from “consistency” in CAP theorem. ACID consistency means application-defined invariants are maintained. CAP consistency means all nodes see the same data simultaneously. Conflating these two is a common interview mistake.
- Isolation: Concurrent transactions do not see each other’s uncommitted changes — but the level of isolation is tunable. PostgreSQL defaults to Read Committed (you can see other transactions’ committed changes mid-query). MySQL InnoDB defaults to Repeatable Read (snapshot at transaction start). Serializable is the strongest level but costs 30-50% throughput due to lock contention and serialization failures requiring retries.
- Durability: Once committed, data survives crashes. The WAL must be flushed to disk (
fsync) before the commit is acknowledged. You can relax this withsynchronous_commit=offin PostgreSQL for 2-5x write throughput improvement — at the risk of losing the last ~600ms of commits on a crash. Analytics pipelines and logging systems often make this trade-off intentionally.
- “Walk me through what happens internally when PostgreSQL commits a transaction — from the COMMIT statement to the data being durable on disk.” (Tests whether they understand WAL flush, checkpoint process, and the role of
fsync.) - “You have a financial system that needs Serializable isolation for balance transfers but also handles 50K TPS in reads. How do you architect this?” (Tests CQRS thinking — separate read/write paths with different isolation guarantees.)
- “When would you intentionally turn off synchronous_commit, and what is the blast radius if the server crashes?” (Tests production judgment — acceptable for analytics writes, never for payment processing.)
SERIALIZABLE isolation for every balance-affecting transaction and couples that with synchronous_commit = on plus a synchronous replica — they accept 3-5 ms of added latency in exchange for never losing a committed charge. Their analytics warehouse, in contrast, runs on asynchronous replication with synchronous_commit = off because a few seconds of log loss is acceptable for dashboards.- Q: Can you have durability without Atomicity? A: Technically yes — a log-only store with no transactions. But most systems bundle them because a durable-but-partial write is worse than no write.
- Q: Why is ACID-C considered the “weakest” property by some database theorists? A: Because it’s the only one the database doesn’t fully own — it depends on the constraints the developer declares. Miss a foreign key, and the DB can’t enforce what you didn’t tell it about.
- postgresql.org/docs — “Write-Ahead Logging (WAL)” chapter
- martin.kleppmann.com — Designing Data-Intensive Applications Chapter 7, “Transactions”
2. Isolation Levels (Read Phenomena)
2. Isolation Levels (Read Phenomena)
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Default In | Performance |
|---|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Almost never used | Fastest |
| Read Committed | Prevented | Possible | Possible | PostgreSQL | Fast |
| Repeatable Read | Prevented | Prevented | Possible (MySQL) / Prevented (PG) | MySQL InnoDB | Moderate |
| Serializable | Prevented | Prevented | Prevented | Manual selection | Slowest (30-50% TPS drop) |
- Dirty Read: Transaction A updates a salary to 100K. Transaction A rolls back. B acted on data that never existed. This is why Read Uncommitted is almost never used in production.
- Non-Repeatable Read: Transaction A reads a row (
salary = $80K). Transaction B updates and commits it to$100K. Transaction A reads the same row again and gets$100K. The same query returned different results within one transaction. - Phantom Read: Transaction A runs
SELECT COUNT(*) FROM employees WHERE dept='eng'and gets 50. Transaction B inserts a new engineer and commits. Transaction A runs the same query and gets 51. New rows “appeared” in the result set.
SELECT ... FOR UPDATE (pessimistic locking) on the balance row, which serialized access to that specific row without upgrading the entire transaction to Serializable.Red flag answer: Listing the levels without explaining the anomalies, or not knowing which database defaults to which level. Also, saying “just use Serializable for everything” without acknowledging the throughput cost.Follow-up questions:- “How does MVCC actually work in PostgreSQL? What are xmin and xmax?” (Tests internals knowledge — every row has a creation transaction ID (xmin) and deletion transaction ID (xmax). Each transaction sees a snapshot based on these IDs.)
- “A developer reports that two concurrent API requests are creating duplicate records despite a unique constraint. What isolation level are you probably running, and how do you fix it?” (Tests debugging skills — likely Read Committed with a check-then-insert pattern. Fix: use INSERT … ON CONFLICT, or use advisory locks, or upgrade to Serializable for that specific transaction.)
- “What is the performance difference between Serializable and Read Committed in a high-throughput OLTP system?” (Tests production awareness — Serializable can reduce throughput by 30-50% due to serialization failures and retries, and requires application-level retry logic.)
READ COMMITTED to SERIALIZABLE on a specific stored procedure during flash sales, they saw a 35% drop in transactions-per-second on that code path — but double-sells dropped to zero. They localized the change to one function rather than the whole transaction, which is the pattern most senior engineers reach for.SELECT ... WHERE clause returns a different number of rows within one transaction because another transaction inserted (or deleted) matching rows in between. Not prevented by plain REPEATABLE READ in the SQL standard — only by SERIALIZABLE.REPEATABLE READ, pushing that level closer to SERIALIZABLE than the standard requires.- Q: Why doesn’t everyone just run at Serializable all the time? A: Because Serializable aborts transactions under contention and requires you to retry them. On a hot row, your throughput can collapse to the serial rate — often 10-30% of what Read Committed achieves.
- Q: If I use
SELECT ... FOR UPDATE, do I still need a higher isolation level? A: Often no — pessimistic row locks give you the correctness you want without paying the global cost of Serializable. Use the narrowest tool that solves the specific race.
- postgresql.org/docs — “Transaction Isolation” chapter
- “A Critique of ANSI SQL Isolation Levels” by Berenson et al. — the canonical paper explaining why the SQL standard’s levels are ambiguous
3. Indexing: Clustered vs Non-Clustered
3. Indexing: Clustered vs Non-Clustered
- Clustered Index (Primary Key in InnoDB): The data rows themselves ARE stored inside the index leaf nodes, sorted by the index key. This means the table data is physically ordered on disk by the clustered index. You can only have one per table because data can only be physically sorted one way. In MySQL InnoDB, the primary key IS the clustered index — if you do not define one, InnoDB creates a hidden 6-byte row ID as the clustered key. Range scans on the clustered key are extremely fast because sequential rows are stored on adjacent disk pages (spatial locality).
- Non-Clustered Index (Secondary Index): Leaf nodes contain the indexed columns plus a pointer back to the actual data row. In InnoDB, that “pointer” is the primary key value (not a physical row address), which means every non-clustered index lookup requires a second lookup through the clustered index to find the actual row — this is called a bookmark lookup or key lookup. In PostgreSQL (which uses heap tables, not clustered indexes), the pointer is a physical tuple ID (ctid).
- “Why does PostgreSQL not have clustered indexes in the same way MySQL does?” (PostgreSQL uses heap tables where rows are stored in insertion order. You can run CLUSTER to physically reorder once, but it is not maintained on subsequent writes.)
- “A query on a secondary index is slow despite the index existing. What could cause this?” (The index is not covering — it requires bookmark lookups for each row. If the query returns many rows, the random I/O from bookmark lookups is slower than a sequential scan. The optimizer may choose a seq scan instead.)
- “How would you design the primary key for a time-series table with billions of rows that needs both fast inserts and fast range queries?” (Tests practical judgment — auto-increment for fast sequential inserts with a secondary index on timestamp, or a composite key with timestamp prefix for clustered range scans, trading insert performance for query speed.)
ctid). Contrast with InnoDB where the clustered index is the table.- Q: Can I have a “clustered index” in Postgres?
A: Sort of.
CLUSTER table USING indexphysically reorders the heap once, but inserts and updates aren’t kept in order afterward. For true clustered-like behavior, useBRINon a naturally-ordered column (timestamp) or partition the table. - Q: Why does InnoDB make the PK the clustered index rather than letting you pick any column? A: Because clustering on a non-unique column would require all secondary indexes to carry a disambiguation key. Making the PK both unique and the cluster key keeps the engine simple.
- use-the-index-luke.com — “The Clustered Index” chapter by Markus Winand
- postgresql.org/docs — “Index Types” and “Storage” chapters
4. B-Tree vs B+Tree
4. B-Tree vs B+Tree
- B-Tree: Data pointers are stored in both internal nodes and leaf nodes. This means a search can terminate at an internal node if the key is found there — fewer hops in the best case. However, internal nodes are larger (they carry data payloads), so fewer keys fit per node, which increases tree height and disk I/O.
- B+Tree: Data pointers are stored ONLY in leaf nodes. Internal nodes contain only keys and child pointers, which means each internal node can hold far more keys — typically 100-500 keys per 8KB page. This keeps the tree extremely shallow (3-4 levels for billions of rows). Leaf nodes are linked into a doubly-linked list, which is the key innovation: once you find the start of a range, you walk the linked list sequentially instead of traversing back up the tree.
- Predictable performance: Every lookup traverses the same number of levels (root to leaf). B-Tree lookups vary depending on where in the tree the key sits. Predictability matters for query planning.
- Range scan efficiency:
SELECT * FROM orders WHERE created_at BETWEEN '2025-01-01' AND '2025-01-31'finds the first key via tree traversal (3-4 random I/Os), then walks the leaf linked list sequentially — sequential I/O is 100x faster than random I/O on spinning disk and still 5-10x faster on SSD. - Higher fanout: With data only at leaves, internal nodes are smaller, so more keys fit per page, reducing tree height. A 3-level B+Tree with 500 keys per internal node can index 125 million rows. The same depth B-Tree with 50 keys per node (because internal nodes carry data) only indexes 125,000 rows.
nbtree) uses 8KB pages. Each internal page holds roughly 300-500 index entries (depending on key size). For a table with 100 million rows, the index is typically 3 levels deep — meaning any point lookup requires exactly 3 page reads (often 1, since root and first-level pages are cached in shared_buffers).Red flag answer: “B+Tree stores data only in leaves” without explaining why that matters (fanout, range scans, predictable depth). Also, saying “B-Tree and B+Tree are basically the same” — the leaf linked list is a fundamental architectural difference.Follow-up questions:- “What happens to a B+Tree when you insert a key into a full leaf page?” (Page split: the leaf splits into two, and the middle key is promoted to the parent. If the parent is also full, it splits too — cascading up. This is why random inserts into a B+Tree index are slower than sequential inserts — random inserts cause more page splits.)
- “Why are UUIDs as primary keys problematic for B+Tree indexes?” (UUIDs are random, so every insert targets a random leaf page — causing constant page splits and poor cache utilization. Auto-increment IDs always insert at the rightmost leaf, which is hot in cache. This is why UUIDv7 (time-sorted) was created.)
- “How does a B+Tree index interact with the buffer pool / shared_buffers?” (Frequently accessed upper tree levels stay pinned in memory. For a well-cached index, a point lookup may only require 1 actual disk read — the leaf page — because root and internal pages are already in the buffer pool.)
UPDATE into 10+ physical writes. Postgres HOT (Heap-Only Tuple) updates avoid index updates when no indexed column changed and the new version fits on the same page — which is why fillfactor tuning matters.- Senior: Can explain B+Tree structure, page splits, and why UUIDv4 primary keys hurt.
- Staff: Thinks about write amplification at the storage layer — LSM-tree vs B+Tree tradeoffs (RocksDB, Cassandra use LSM for write-heavy workloads), fillfactor tuning for HOT updates, WAL compression settings, and how
checkpoint_timeout+full_page_writesinteract to create IO spikes. Also connects this to cost: high write amplification -> more IOPS -> more expensive storage tier.
4b. Write Amplification and LSM-Tree vs B+Tree
4b. Write Amplification and LSM-Tree vs B+Tree
- WAL writes (durability): every change written twice — once to WAL, once to data pages.
- Full-page writes: after a checkpoint, the first modification to a page writes the entire 8KB page to WAL (torn-page protection).
- Index updates: each secondary index adds its own writes.
- MVCC tuple versioning: updates create new versions; old versions stay until VACUUM.
- Page splits: inserting into a full page splits it in two, rewriting both.
- Writes go to an in-memory memtable + append-only WAL. O(1) write latency.
- Memtables flush to immutable SSTables on disk (Sorted String Tables).
- Background compaction merges SSTables, rewriting data multiple times as it moves through levels (L0, L1, … L6).
- Write amplification is high (10-30x) but writes are sequential, which flash and HDD love. Read amplification is also high because a read may hit multiple SSTables before finding the value — Bloom filters per SSTable mitigate this.
| Workload | Pick |
|---|---|
| OLTP with balanced read/write, rich queries | B+Tree (Postgres, MySQL) |
| Write-heavy time-series, logs, event sourcing | LSM (Cassandra, RocksDB, ClickHouse) |
| Read-heavy with large point lookups | B+Tree with good caching |
| Need to sustain >1M writes/sec | LSM |
- Lower
fillfactor(e.g., 80%) to leave room for HOT updates -> fewer index updates. - Batch
UPDATEs into transactions to amortize WAL overhead. - Use
UNLOGGEDtables for ephemeral data (no WAL, no crash safety). - Tune
checkpoint_timeoutup (15-30 min) to reduce full-page-write frequency, at the cost of longer recovery time.
- Senior: Measures write amplification via
pg_stat_wal(wal_bytes), tunes fillfactor, knows LSM exists. - Staff: Designs the storage tier for the company — B+Tree for OLTP, LSM-backed (ClickHouse/Cassandra) for analytics/time-series, and models the economic tradeoff: “this workload writes 50K ops/sec on Postgres and costs $8K/month in IOPS; moving to Cassandra cuts IOPS cost 70% but requires rewriting app queries.” Also makes the call on when to move stateful systems to local NVMe vs EBS vs object-storage-backed tiers.
- “Your write-heavy Postgres table has 10 indexes and write throughput plateaus at 5K TPS. How do you diagnose and fix?” — (a) Check
pg_stat_statementsfor write-heavy statements. (b) Countheap_blks_hitvsheap_blks_readon the table — cold cache hurts. (c) Drop indexes you do not use (pg_stat_user_indexes.idx_scan=0). (d) Consider partitioning by time so older partitions have fewer writes. (e) Last resort: evaluate moving to an LSM-backed system. - “Why does Cassandra handle time-series writes better than Postgres?” — LSM appends sequentially; Cassandra partitions by partition key distributing writes across the cluster; tombstones + compaction handle deletes without VACUUM; no global transaction coordinator means writes do not serialize. Postgres writes go through WAL + random B+Tree page updates, and a single hot partition serializes.
- “How does checkpoint tuning affect write amplification in Postgres?” — Longer
checkpoint_timeout-> fewer full-page writes -> lower write amplification, but longer recovery after crash.max_wal_sizecontrols how much WAL accumulates before a forced checkpoint. The tradeoff is steady-state IO vs RPO/RTO.
- Postgres with TimescaleDB: works up to ~200K/s on solid hardware; hypertables partition automatically; SQL is familiar. But 500K/s pushes the B+Tree engine.
- ClickHouse: LSM-like MergeTree, columnar storage, handles 500K/s per node easily. Trade-off: eventually-consistent, not ACID, SQL is MySQL-flavored.
- Cassandra/ScyllaDB: handles 500K/s across nodes; excellent for time-partitioned data; requires defining queries up front (partition key design).
- Recommendation: ClickHouse if analytics-style range scans dominate; Cassandra if you need sub-10ms point lookups by partition key.
5. Normalization (1NF, 2NF, 3NF)
5. Normalization (1NF, 2NF, 3NF)
- 1NF (First Normal Form): Every column holds atomic (indivisible) values — no arrays, no comma-separated lists, no nested structures in a single field. Example violation: storing
tags = "python,java,go"in a single column. Fix: create a separateuser_tagsjunction table. PostgreSQL’sARRAYandJSONBtypes technically violate 1NF but are pragmatically useful — this nuance shows production maturity. - 2NF (Second Normal Form): 1NF + no partial dependencies on a composite primary key. Only relevant for tables with composite keys. Example violation: table
(student_id, course_id, student_name, grade)—student_namedepends only onstudent_id, not the full composite key. Fix: separate intostudents(student_id, student_name)andenrollments(student_id, course_id, grade). - 3NF (Third Normal Form): 2NF + no transitive dependencies. No non-key column should depend on another non-key column. Example violation:
orders(order_id, customer_id, customer_city, customer_zip)—customer_citydepends oncustomer_zip, not onorder_id. Fix: move city/zip to acustomerstable. - Denormalization: Intentionally adding redundancy for read performance. Star schemas in data warehouses are the classic example: a fact table (
sales) with denormalized dimension attributes (customer_name,product_category) avoids expensive JOINs at query time. Amazon’s product pages denormalize heavily — the product detail is not assembled from 15 JOINed tables; it is pre-computed and stored as a single document.
- “You have a reporting dashboard that JOINs 8 tables and takes 12 seconds to load. How do you decide whether to denormalize vs. add indexes vs. use materialized views?” (Tests practical judgment: start with
EXPLAIN ANALYZEto find the bottleneck. If it is JOIN fan-out, a materialized view refreshed every 5 minutes is often the right middle ground. Full denormalization requires maintaining data consistency in application code.) - “What are update anomalies, and can you give a real example of one causing a production bug?” (If
customer_nameis stored in bothcustomersandorderstables, updating the customer’s name requires updating both. If theordersupdate fails or is forgotten, you have inconsistent data — old name on historical orders, new name on the customer record.) - “How does MongoDB’s document model relate to normalization? Is embedding the same as denormalization?” (Yes — embedding related data in a single document is denormalization. It avoids the equivalent of JOINs but means updates to shared data must be applied to every document containing it.)
sales) references dimension tables (customer, product, date). Analysts love it because most queries are simple fact-table aggregations with few joins.- Q: Does JSONB in Postgres violate 1NF? A: Technically yes — it stores composite/nested values in one column. Practically, it’s fine for attributes you never filter or aggregate on. Index the specific paths you query with GIN or expression indexes.
- Q: When is fully denormalized the right starting point? A: Analytical workloads and write-once-read-many systems (event logs, activity feeds, product catalogs served at edge). OLTP with lots of updates should almost always start normalized.
- martin.kleppmann.com — DDIA Chapter 3 on storage and retrieval
- “Star Schema: The Complete Reference” by Christopher Adamson (for warehouse-style denormalization)
6. SQL Joins
6. SQL Joins
- Inner Join: Returns only rows with matches in both tables. The most common join — used for lookups like “get all orders with their customer names.”
- Left (Outer) Join: All rows from the left table, matched rows from the right table, NULLs where no match exists. Use case: “find all customers, including those with zero orders.” The gotcha: filtering on a LEFT JOIN’s right table in the WHERE clause implicitly converts it to an INNER JOIN.
WHERE right_table.col = 'X'eliminates NULLs — useANDin theONclause instead. - Right (Outer) Join: Mirror of LEFT JOIN. Rarely used in practice — rewriting as a LEFT JOIN with swapped table order is more readable.
- Full Outer Join: All rows from both tables, NULLs on both sides where no match exists. Use case: data reconciliation — “show me all records from system A and system B, highlighting mismatches.”
- Cross Join: Cartesian product (N x M rows). Rarely intentional. Use case: generating combinations (e.g., all product-color combinations for a catalog). Accidentally omitting a JOIN condition produces a cross join — on two million-row tables, that is 10^12 rows and will kill your database.
- Nested Loop Join: For each row in the outer table, scan the inner table. O(N*M) worst case. The planner chooses this when one table is very small or an index exists on the inner table’s join column — then it becomes O(N * log M).
- Hash Join: Build an in-memory hash table from the smaller table, then probe it for each row in the larger table. O(N + M) but requires memory for the hash table. If the hash table exceeds
work_mem, it spills to disk and performance degrades dramatically. - Merge Join: Both inputs must be sorted on the join key. Walks through both sorted lists simultaneously — O(N log N + M log M) for the sorts, then O(N + M) for the merge. Efficient when both inputs are already sorted (e.g., from an index scan).
- “You have a query JOINing two tables with 50M rows each. EXPLAIN shows a nested loop join and the query takes 3 minutes. What happened and how do you fix it?” (The planner chose nested loop because of bad statistics or a missing index. With bad row estimates, it might think the inner table returns 1 row when it returns 50K. Fix: run ANALYZE, check for index on the join column, or increase
work_memto allow a hash join.) - “What is the difference between
ONandWHEREfor filtering in a LEFT JOIN?” (Conditions inONare applied during the join — rows that do not match still appear with NULLs. Conditions inWHEREare applied after the join — they filter out NULL rows, converting the LEFT JOIN to an INNER JOIN.) - “When would you prefer a subquery over a JOIN, and vice versa?” (Correlated subqueries can cause N+1-like behavior. JOINs allow the planner to choose the optimal strategy. However,
EXISTSsubqueries can be faster than JOINs when you only need to check existence, not retrieve columns — the planner can short-circuit on the first match.)
ANALYZE took 30 seconds and the next run of the query picked a hash join, dropping latency from 18 seconds to 40 ms. They now auto-run ANALYZE on critical tables after any DDL.work_mem — watch for Batches > 1 in EXPLAIN.- Q: Why do experienced engineers distrust
SELECT *in a JOIN? A: Because it forces the planner to materialize every column of every row, kills the chance of an index-only scan, and couples application code to the exact schema — schema changes silently break things. - Q: Is it ever faster to do two single-table queries and join in the application? A: Occasionally yes — when the join selectivity is terrible and one side is already cached. But you’re fighting the planner, and the result is fragile. Measure before reaching for it.
- use-the-index-luke.com — “Joining Tables” chapter on how indexes shape join strategies
- postgresql.org/docs — “Explicit JOIN Clauses” and “Performance Tips”
7. View vs Materialized View
7. View vs Materialized View
-
View: A saved SQL query — a virtual table that re-executes the underlying query every time you
SELECTfrom it. Zero storage overhead, always returns fresh data. Think of it as a named alias for a complex query. Use case: simplifying access control (expose a view that hides sensitive columns), standardizing complex JOINs that multiple services use, or providing a stable API over changing table structures. -
Materialized View (MatView): A physical copy of the query result stored on disk. Fast reads because it is just a table scan — no re-execution of the underlying query. Trade-off: data is stale until you run
REFRESH MATERIALIZED VIEW. In PostgreSQL, you can create indexes on materialized views, which makes them extremely powerful for dashboard-style queries.
REFRESH MATERIALIZED VIEW: Takes anACCESS EXCLUSIVElock — reads are blocked during refresh. On a 10M-row materialized view, refresh might take 30 seconds, meaning 30 seconds of downtime for any query hitting that view.REFRESH MATERIALIZED VIEW CONCURRENTLY: Allows reads during refresh (no exclusive lock), but requires a unique index on the materialized view and is 2-3x slower because it performs a diff between old and new data. This is what you use in production.- Scheduled refresh: Most teams use
pg_cronor an application-level scheduler to refresh every 5-15 minutes. The refresh interval is a business decision: “How stale can this data be?” - Trigger-based refresh: Refresh on every write to the source tables. This defeats the purpose for write-heavy workloads but works for slowly-changing data.
orders, customers, and products — a 15-second query on raw tables. A materialized view with a REFRESH CONCURRENTLY every 10 minutes reduces this to 200ms. The business accepts 10-minute staleness for a 75x speedup.Red flag answer: “Materialized views are just cached views” without mentioning the locking behavior of refresh, or not knowing about CONCURRENTLY. Also, suggesting materialized views for real-time data requirements.Follow-up questions:- “A materialized view refresh is taking 45 seconds and blocking all reads. How do you fix this without changing the refresh frequency?” (Switch to
REFRESH CONCURRENTLY— requires adding a unique index to the materialized view. If refresh is inherently slow, consider incremental refresh viapg_ivmextension or switching to an application-level caching strategy like Redis.) - “How do materialized views compare to application-level caching with Redis?” (MatViews live inside the database — they are queryable with SQL, support indexes, and are transactionally consistent at refresh time. Redis is faster for point lookups but cannot handle complex SQL queries. MatViews are better for analytical aggregations; Redis is better for hot key-value caching.)
- “Can you use a materialized view as a target for foreign keys or INSERT statements?” (No — materialized views are read-only. You cannot INSERT, UPDATE, or reference them in foreign key constraints. They are purely a query acceleration mechanism.)
pg_cron. Their engineers measured the alternative (re-running the underlying 8-table join each dashboard load) at 12+ seconds and decided 5-minute staleness was acceptable for internal-facing tools. External customer dashboards still hit the live data.REFRESH.pg_ivm provide this; native support is on the roadmap.- Q: Can I put an index on a materialized view?
A: Yes, and you usually should — MatViews behave like tables for indexing. A unique index is required if you want to use
REFRESH CONCURRENTLY. - Q: What happens to a materialized view if the underlying tables are dropped?
A: Postgres prevents the drop via dependency — you’d need
DROP TABLE ... CASCADE, which also drops the matview. A good protection against accidents.
- postgresql.org/docs — “Materialized Views” chapter
- use-the-index-luke.com — query performance techniques that complement materialized views
8. Stored Procedures vs Functions
8. Stored Procedures vs Functions
- Function: Returns a value (scalar, table, or set). Can be called inside
SELECT,WHERE, and other SQL expressions. In PostgreSQL, functions run within the caller’s transaction — they cannot independentlyCOMMITorROLLBACK. SupportsIMMUTABLE,STABLE,VOLATILEvolatility categories that affect query planner optimization and caching. - Stored Procedure (PostgreSQL 11+, MySQL, SQL Server): Can manage its own transaction control —
COMMITandROLLBACKinside the body. Cannot be called from within aSELECTstatement — invoked viaCALL. Designed for multi-step workflows that need intermediate commits (e.g., batch processing 1M rows in chunks with commits every 10K to avoid long-running transaction bloat).
- Good use cases: Data validation that must be enforced regardless of which application connects (constraint triggers), audit logging (triggers on INSERT/UPDATE/DELETE), complex data transformations during ETL that benefit from being close to the data (avoiding network round-trips for millions of rows), and row-level security policies.
- Bad use cases: Business logic that changes frequently (stored procedures are harder to version, test, and deploy than application code), anything that requires external API calls or complex branching logic, and any logic that your development team cannot effectively debug (if most engineers do not know PL/pgSQL, stored procedures become a black box).
- “You have a nightly batch job that needs to process and archive 50M rows. Would you write this as a stored procedure or application code? Why?” (Strong case for a stored procedure: processing 50M rows in application code means 50M network round-trips. A stored procedure processes data locally. Use intermediate COMMITs every 10K-50K rows to avoid transaction ID wraparound and long-running transaction problems.)
- “How do you unit test a stored procedure?” (This is the Achilles’ heel. pgTAP is the PostgreSQL testing framework for SQL/PL/pgSQL. You can also test via integration tests that call the procedure and assert on database state. But testing is harder than application code — this is the main argument against heavy stored procedure usage.)
- “What is the performance difference between a PL/pgSQL function and a SQL function in PostgreSQL?” (SQL functions can be “inlined” by the planner — the function body is substituted into the outer query and optimized as a whole. PL/pgSQL functions are opaque to the planner — each SQL statement inside is planned separately. For simple logic, SQL functions can be 5-10x faster.)
IMMUTABLE or STABLE — PL/pgSQL functions are always opaque to the planner.- Q: Can I version-control stored procedures like application code? A: Yes — use migration tooling (Flyway, Liquibase, Sqitch) and treat procedure bodies as migration files. Most teams that get burned by stored procs skipped this step.
- Q: Why do DBAs love them and DevOps teams hate them? A: DBAs see them as close-to-the-data performance wins and central policy enforcement. DevOps teams see them as out-of-band code that bypasses the normal CI/CD, testing, and observability pipeline.
- postgresql.org/docs — “Server Programming” chapters on functions and procedures
- “The Art of PostgreSQL” by Dimitri Fontaine for the case in favor of database-centric design
9. Constraint Types
9. Constraint Types
- Primary Key: Unique + NOT NULL. Implicitly creates a unique index. In InnoDB, the primary key IS the clustered index — choosing the right PK affects every query’s physical I/O pattern. In PostgreSQL, the PK creates a B-tree index on a heap table.
- Foreign Key: Enforces referential integrity between tables. On INSERT/UPDATE, the database checks the referenced row exists. On DELETE of the parent, the configured action fires:
CASCADE(delete children),SET NULL,SET DEFAULT, orRESTRICT(block). Performance cost: every INSERT into the child table requires an index lookup on the parent’s PK. On bulk loads, temporarily disabling FK checks (SET session_replication_role = 'replica';in PostgreSQL) can speed things up 5-10x — but you must re-validate after. - Unique: Prevents duplicate values. Creates a unique index. Partial unique constraints in PostgreSQL (
CREATE UNIQUE INDEX ... WHERE active = true) are extremely powerful — “only one active subscription per user.” - NOT NULL: Prevents NULL values. Seems simple but adding
NOT NULLto an existing column on a large table requires a full table scan to verify no NULLs exist (takes an ACCESS EXCLUSIVE lock). In PostgreSQL 12+, useALTER TABLE ... ADD CONSTRAINT ... NOT NULL NOT VALID; VALIDATE CONSTRAINT ...;to avoid long locks. - Check: Arbitrary boolean expressions (
CHECK (age > 0 AND age < 200),CHECK (status IN ('active', 'pending', 'closed'))). Cannot reference other tables. For cross-table validation, use triggers or application logic. - Exclusion Constraints (PostgreSQL-specific): Prevent overlapping values using GiST indexes. Classic use case:
EXCLUDE USING gist (room WITH =, during WITH &&)— no two bookings for the same room can have overlapping time ranges. This is incredibly powerful and underused.
- “You have a multi-tenant SaaS app. How do you ensure that a user in tenant A can never accidentally reference a record belonging to tenant B through a foreign key?” (Composite FKs that include
tenant_id, or row-level security policies.) - “A bulk import of 10M rows is taking 4 hours because of foreign key checks. How do you speed it up safely?” (Disable FK triggers, load data, re-enable and validate. Or use
COPYwith deferred constraints.) - “When would you intentionally NOT use a foreign key constraint?” (Cross-service boundaries in microservices, very high-write throughput tables where FK check overhead is unacceptable, or when referencing data in a different database/shard.)
workspace_id on every referential relationship. Early on they hit a bug where a shared integration could accidentally reference blocks from a different workspace — a composite FK (block_id, workspace_id) made that relationship impossible to express, turning a class of bugs into a compile-time DB error. The cost is ~10 bytes per FK, well worth it for a multi-tenant system.CREATE UNIQUE INDEX ... WHERE active = true). Uniqueness is enforced only on the matching subset of rows — perfect for “one active subscription per user” or soft-delete scenarios where deleted rows should not block re-use of an email.- Q: Why are FKs controversial in microservices?
A: They create implicit coupling across service boundaries. If
orders.customer_idreferencescustomers.idbut customers is owned by a different service, you’ve leaked schema ownership. Most microservice shops drop FKs at service boundaries and enforce integrity at the application layer. - Q: Does adding a CHECK constraint lock the table?
A: In Postgres, by default yes — it scans the full table to validate. Use
ADD CONSTRAINT ... NOT VALIDfirst (instant, no scan), thenVALIDATE CONSTRAINTseparately (weaker lock). On a billion-row table this is the difference between minutes of downtime and zero downtime.
- postgresql.org/docs — “Data Definition” chapter on constraints
- use-the-index-luke.com — how unique constraints and indexes interact
10. Cursor
10. Cursor
- Network roundtrips: Each
FETCHis a separate network call. Iterating 100K rows = 100K roundtrips. A singleSELECTreturning 100K rows is one roundtrip. - Lock holding: Server-side cursors hold resources (memory, locks) for the duration of iteration. On a busy OLTP system, this blocks other transactions.
- Set-based alternative: SQL is designed for set operations. Almost everything a cursor does can be done with
UPDATE ... FROM,INSERT ... SELECT, window functions, CTEs, orLATERALjoins — all of which the query planner can optimize. Cursors bypass the planner’s ability to optimize across operations.
- Batch processing with intermediate commits: Processing 50M rows in chunks — open a cursor, fetch 10K rows, process, COMMIT, fetch next 10K. This avoids holding a single transaction open for hours (which prevents VACUUM in PostgreSQL).
- Streaming large result sets: When the result is too large to fit in application memory. PostgreSQL’s
DECLARE CURSORwithFETCH 1000lets you stream without loading everything. - Stored procedures with complex control flow: PL/pgSQL
FOR record IN SELECT ...is syntactic sugar over a cursor and is the idiomatic way to iterate in server-side code.
- “A developer wrote a stored procedure that uses a cursor to update 2M rows one at a time. It takes 45 minutes. How do you rewrite it?” (Single
UPDATE ... SET ... FROMwith a JOIN, or batch updates withWHERE id BETWEEN X AND Y.) - “What is the difference between a server-side cursor and a client-side cursor?” (Server-side: result stays on the DB server, client fetches in chunks. Client-side: entire result sent to client, iteration happens in app memory. Server-side saves client memory but holds server resources.)
- “How do PostgreSQL’s scrollable cursors differ from forward-only cursors, and when would you use each?”
UPDATE files SET ... FROM stale_index WHERE files.id = stale_index.id with a batching WHERE id BETWEEN X AND Y wrapper. The rewrite took 90 seconds for the same 2 million rows — a 30x improvement. The lesson: cursors look like “proper iteration” to procedural programmers but are almost always the wrong tool in SQL.UPDATE ... WHERE). The opposite of procedural row-by-row iteration — lets the planner choose the optimal strategy and avoids per-row network overhead.- Q: When is a cursor genuinely necessary? A: Two cases: (1) batch processing where you need intermediate commits to avoid a multi-hour transaction blocking VACUUM, and (2) streaming a result set that doesn’t fit in app memory.
- Q: Can’t I just use
LIMIT/OFFSETinstead of a cursor? A: For small offsets, yes. ButOFFSET 1000000forces the DB to scan and discard the first million rows every time — O(N²). Keyset pagination (WHERE id > last_id LIMIT 1000) is the correct scalable pattern.
- use-the-index-luke.com — “Pagination” chapter on keyset vs offset
- postgresql.org/docs — “Cursors” and “Declare Cursor” documentation
2. Internals & Optimization
11. WAL (Write Ahead Log)
11. WAL (Write Ahead Log)
- Sequential vs Random I/O: Writing to the WAL is sequential I/O (appending to end of file), which is 100-1000x faster than the random I/O needed to update data pages scattered across disk. On a typical SSD, sequential writes can hit 500MB/s while random 4KB writes might only achieve 50MB/s.
- Crash Recovery: If the server crashes after writing to the WAL but before updating data pages, the database replays the WAL on startup to bring data pages up to date. If the crash happens before the WAL write completes, the transaction is considered uncommitted and is rolled back. This is how Atomicity and Durability work together.
- Application sends
INSERTstatement - PostgreSQL writes the change to WAL buffers in shared memory
- On
COMMIT, WAL buffers are flushed to disk (fsyncto WAL segment file) - The commit is acknowledged to the client — data is now durable
- Later, the background writer and checkpointer lazily flush dirty data pages to disk (this is NOT on the critical path)
synchronous_commit=off skips the fsync on step 3, buffering WAL writes in the OS page cache. This gives 2-5x write throughput but risks losing up to ~600ms of commits on crash. Analytics and logging workloads often accept this trade-off.Red flag answer: “WAL is just a log file for debugging” or not understanding why writes go to WAL before data pages. Also, not knowing that WAL is what enables streaming replication in PostgreSQL (replicas consume the WAL stream).Follow-up questions:- “How does WAL relate to streaming replication?” (Replicas receive and replay the WAL stream from the primary. This is why replication lag is measured in WAL bytes behind, not time.)
- “What happens to WAL files over time? Do they grow forever?” (No — after a checkpoint, old WAL segments are recycled or archived.
wal_keep_sizecontrols retention. WAL archiving to S3 enables Point-In-Time Recovery.) - “A PostgreSQL instance is experiencing high write latency. WAL disk utilization is at 100%. What is happening?” (The WAL disk is the bottleneck. WAL writes are synchronous on commit. Solutions: move WAL to faster disk (NVMe), increase
wal_buffers, usesynchronous_commit=offif data loss is acceptable, or check iffull_page_writesis generating excessive WAL volume.)
- Q: What’s the difference between
synchronous_commit=onandsynchronous_commit=remote_apply? A:onwaits for local WAL fsync.remote_applywaits for the replica to fsync AND apply the WAL. The latter gives you read-your-writes consistency on the replica but increases write latency by the replica’s round trip. - Q: Can WAL fill up the disk?
A: Yes — if an orphaned replication slot prevents cleanup, or if
archive_commandfails silently, WAL accumulates until the disk fills and Postgres shuts down. Monitorpg_replication_slots.restart_lsnandarchive_failurecount.
- postgresql.org/docs — “Write-Ahead Logging (WAL)” and “Reliability” chapters
- martin.kleppmann.com — DDIA Chapter 3 on storage engines
12. MVCC (Multi-Version Concurrency Control)
12. MVCC (Multi-Version Concurrency Control)
- Every row (tuple) has two hidden system columns:
xmin(the transaction ID that created this version) andxmax(the transaction ID that deleted/updated this version, or 0 if still live). - When a row is updated, PostgreSQL does not modify the existing tuple. Instead, it creates a new version of the tuple with a new
xmin, and sets thexmaxon the old version to the updating transaction’s ID. Both versions exist on disk simultaneously. - When a transaction reads, it checks its own snapshot (list of active transaction IDs at the time it started) against each tuple’s
xmin/xmaxto determine visibility. Ifxminwas committed before the snapshot andxmaxis either 0 or not yet committed, the tuple is visible. - This means a long-running
SELECTquery sees a frozen-in-time view of the database, even if other transactions are actively modifying the same rows.
- Because updates create new tuple versions, the old versions (dead tuples) accumulate. These consume disk space and bloat indexes.
- VACUUM reclaims this space by marking dead tuples as reusable. VACUUM FULL rewrites the entire table (but requires an exclusive lock — avoid in production).
- Autovacuum runs in the background and is critical to keep running. A common production disaster: a team disables autovacuum “because it’s using CPU” and weeks later the table is 10x its actual size, queries slow to a crawl, and worst case, the database hits transaction ID wraparound — a catastrophic event where PostgreSQL refuses ALL writes to prevent data corruption.
- “A PostgreSQL table has 1 million live rows but is taking up 10GB on disk. What happened and how do you fix it?” (Table bloat from dead tuples. Autovacuum may be misconfigured or a long-running transaction prevented it. Run
VACUUM FULLduring maintenance window, or usepg_repackfor online table rewrites without locking.) - “What is transaction ID wraparound and why is it PostgreSQL’s most dangerous failure mode?” (PostgreSQL uses 32-bit transaction IDs. After ~2 billion transactions, IDs wrap around and old transactions become “in the future,” making their data invisible. VACUUM must freeze old transaction IDs before this happens. If it cannot — because autovacuum is blocked or disabled — PostgreSQL enters a protective shutdown mode.)
- “How do you handle a situation where a long-running analytics query is blocking VACUUM on a production OLTP database?” (This is a common real-world problem. Solutions: set
statement_timeoutfor analytics queries, use a read replica for analytics, or configurehot_standby_feedbackto control snapshot retention on replicas.)
pg_terminate_backend() and an aggressive VACUUM ANALYZE — but they now alert on any transaction idle for more than 30 minutes.xmin is the transaction ID that created the row version; xmax is the transaction ID that deleted or updated it (or 0 if still live). MVCC visibility is decided by comparing these against your transaction’s snapshot.- Q: Why does
SELECT count(*)need to scan every row in Postgres but is instant in MySQL InnoDB? A: MVCC visibility is per-row in Postgres — you must check every row’s visibility against your snapshot. InnoDB caches an approximate row count that’s good enough for the default statement. Postgres trades that speed for snapshot isolation correctness. - Q: What does
VACUUM FULLdo differently fromVACUUM? A:VACUUMmarks dead tuples as reusable (space returned to the free space map).VACUUM FULLrewrites the entire table, returning space to the OS — but requires an ACCESS EXCLUSIVE lock. Usepg_repackfor the online alternative.
- postgresql.org/docs — “Concurrency Control” chapter on MVCC internals
- martin.kleppmann.com — DDIA Chapter 7 on snapshot isolation
13. Query Execution Plan
13. Query Execution Plan
EXPLAIN ANALYZE output like a diagnostic report, identify the bottleneck nodes, understand why the planner made its choices, and take corrective action. This is the single most important operational skill for any engineer working with relational databases.Answer:The query lifecycle: Parser (syntax check) -> Rewriter (view/rule expansion) -> Planner/Optimizer (generates candidate plans, estimates cost, picks cheapest) -> Executor (runs the plan, returns rows).Reading EXPLAIN ANALYZE — what matters:- Actual vs. estimated rows: If the planner estimates 100 rows but the actual is 500K, statistics are stale or misleading. This is the #1 cause of bad plans. Fix:
ANALYZE table_nameor increasedefault_statistics_target. - Node types to watch for:
Seq Scanon a large table with a selective WHERE clause — missing index or stale stats.Nested Loopwith a large inner table and no index — O(N*M) disaster. The planner chose this because it underestimated the outer row count.Hash JoinwithBatches: 8— the hash table spilled to disk because it exceededwork_mem. Increasework_mem(per-session, not globally).Sort Method: external merge— sort spilled to disk. Same fix: increasework_mem.Bitmap Heap ScanwithLossy: true— bitmap exceededwork_memand lost precision. Acceptable for moderate result sets but watch for performance cliffs.
- Buffers (with
BUFFERSoption):shared hit= reads from cache,shared read= reads from disk. Highreadcounts on a warm system indicate your working set does not fit inshared_buffers. - Time breakdown: The
actual timeshows two numbers: startup time and total time. A node with high startup time but low total time is doing expensive setup (e.g., building a hash table). Compare parent node time to sum of children — the difference is the node’s own processing overhead.
- “EXPLAIN ANALYZE shows accurate row estimates and an Index Scan, but the query is still slow. What else could be wrong?” (I/O wait — check
Buffers: shared readcount. Lock contention — checkpg_stat_activity. Network latency. Or the index scan returns many rows requiring heap fetches with poor spatial locality.) - “How do you read a plan with parallel query nodes? What does
Workers Planned: 4, Workers Launched: 2tell you?” (Only 2 of 4 planned workers were available — others are busy. TheGathernode’s time includes waiting for workers. Checkmax_parallel_workers_per_gather.) - “A query plan changes from fast (Index Scan) to slow (Seq Scan) after a weekend batch job. What happened?” (The batch job changed the table’s data distribution. Autovacuum/ANALYZE hasn’t run yet, so statistics are stale. Or the batch job increased the table size past the planner’s threshold for switching to Seq Scan.)
EXPLAIN ANALYZE showed Hash Join with Batches: 64 — meaning the hash table had spilled to disk 64 times. Raising work_mem from 4 MB to 64 MB (session-local, not globally) dropped the query to 180 ms consistently. Lesson: work_mem is session-scoped for a reason — tune it per-query, not globally.EXPLAIN ANALYZE: Runs the query and reports both the planner’s estimates and the actual execution metrics. The difference between estimated and actual rows is the first thing experienced DBAs look at — it reveals stale statistics, which is the #1 cause of bad plans.work_mem: The amount of memory a single query operation (sort, hash join, hash aggregate) can use before spilling to disk. Disk spill is 10-100x slower. Set session-local for expensive queries rather than raising the global default, which multiplies across all concurrent operations.- Q: What’s the difference between
EXPLAINandEXPLAIN ANALYZE? A:EXPLAINshows only the estimated plan.EXPLAIN ANALYZEactually executes the query and shows real timings and row counts. Use ANALYZE when you can afford the query to actually run; use plain EXPLAIN forUPDATE/DELETEyou don’t want to execute. - Q: Why does the planner sometimes ignore an obviously good index?
A: Usually stale statistics. The planner’s cost model compares index scan vs seq scan cost; if it thinks the query returns 80% of the table, seq scan wins. Run
ANALYZEand re-check.
- use-the-index-luke.com — “Execution Plans” chapter explaining every major plan node
- postgresql.org/docs — “Using EXPLAIN” chapter
14. Index Scan / Seek
14. Index Scan / Seek
- Index Seek (Point Lookup): Traverse the B+Tree from root to the specific leaf node matching your predicate. O(log N) — typically 3-4 page reads for millions of rows. Used for equality (
=) and range start-point lookups. This is the fast path. - Index Scan (Range Scan): Seek to the starting leaf node, then walk the leaf-level linked list sequentially. Used for range queries (
BETWEEN,>,<). Cost = O(log N) for the seek + O(K) for reading K matching leaf nodes. This is efficient because leaf pages are linked — sequential I/O. - Full Index Scan: Read ALL leaf nodes of the index, ignoring the tree structure. Used when the query can be satisfied entirely from the index (covering index) but has no selective predicate. Cheaper than a full table scan because the index is smaller than the table.
- Bitmap Index Scan (PostgreSQL): Scan the index to build a bitmap of matching heap pages, then read those pages in physical order. This converts random I/O (many individual index lookups) into sequential I/O (reading pages in order). Used when the index scan would return too many rows for individual lookups but too few for a Seq Scan. The planner’s sweet spot: roughly 5-20% of the table.
- “At what selectivity percentage does PostgreSQL typically switch from Index Scan to Seq Scan?” (Roughly 10-20% of the table, depending on
random_page_costandeffective_cache_size. On SSD, you can lowerrandom_page_costto 1.1 (from default 4.0) to encourage index usage at higher selectivity.) - “What is an Index-Only Scan and when does it fail to be ‘index only’?” (When all columns are in the index AND the visibility map confirms all heap pages are all-visible. If the visibility map is not up to date — because VACUUM hasn’t run — the scan must still check the heap for tuple visibility, degrading to a regular index scan.)
- “Why would two identical queries on the same table choose different scan strategies?” (Different parameter values — one value is highly selective (index scan), the other matches 80% of rows (seq scan). This is the plan caching problem that
plan_cache_mode = autoaddresses.)
gp2 (spinning-era defaults) to gp3 SSDs on AWS, they forgot to lower random_page_cost from the default 4.0 to ~1.1. The planner kept picking seq scans for queries that should have used indexes, because it thought random I/O was 4x more expensive than sequential. Updating one parameter dropped p99 on their hottest dashboard query from 2.3 seconds to 40 ms. This is one of the highest-leverage Postgres tweaks on modern SSDs.- Q: Why does
WHERE col > 'X'sometimes use the index and sometimes not? A: Selectivity. If “greater than X” matches 5% of rows, the index wins. If it matches 50%, the seq scan wins. Histograms inpg_statstell the planner which it is — stale histograms = wrong choice. - Q: How do I force an index in Postgres?
A: You can’t — Postgres doesn’t have hints by design. Diagnostic trick:
SET enable_seqscan = offto see what plan the planner would pick with the index, then fix the underlying cost estimates instead of forcing behavior.
- use-the-index-luke.com — “Anatomy of an Index” and “Where Clause” chapters
- postgresql.org/docs — “Query Planning” and “Index-Only Scans” sections
15. Covering Index
15. Covering Index
SELECT name, email FROM users WHERE age = 25:- Index on
(age)— the index finds matching rows, but then must do a heap fetch for each row to getnameandemail. If 10K rows match, that is 10K random I/O heap lookups. - Index on
(age, name, email)— the index contains everything. Zero heap fetches. PostgreSQL shows this asIndex Only Scanin the query plan.
CREATE INDEX idx_users_age ON users (age) INCLUDE (name, email);
The INCLUDE columns are stored in the leaf nodes but are NOT part of the B-tree sort order. This means: (1) the tree is sorted only by age (efficient for WHERE age = 25), (2) name and email are available without heap access, (3) the index is smaller than (age, name, email) because internal nodes only carry age.Trade-off: Covering indexes are wider (more columns = more storage, slower writes). Every INSERT and UPDATE must maintain the index. On a high-write table, adding columns to an index for read optimization can degrade write throughput by 10-30%. The rule of thumb: cover your top 3-5 most frequent queries, not every possible query.What weak candidates say: “Just add all columns to the index.” This creates massive indexes that slow writes.What strong candidates say: Use INCLUDE for payload columns, keep the sorted key columns minimal, and balance read speedup against write overhead. Mention the visibility map requirement for Index Only Scans in PostgreSQL.Follow-up chain:- “Your covering index is in place but EXPLAIN still shows heap fetches. Why?” (The visibility map is not up to date. Recently modified pages are marked as not-all-visible. VACUUM updates the visibility map. Run
VACUUMon the table.) - “How do covering indexes interact with HOT (Heap Only Tuple) updates in PostgreSQL?” (If the covering index includes a column that is updated, HOT updates are disabled for that row — every update must also update the index, even if the indexed column value did not change. This is why you should only INCLUDE columns that are rarely updated.)
- “Covering index vs. materialized view for a dashboard query that joins 3 tables — when do you use each?”
INCLUDE are stored in the leaf pages but are NOT part of the B-tree sort order — so they don’t bloat internal nodes. Perfect for payload-only columns.- Q: Why not just add every column to every index? A: Each additional column in an index adds write cost (every INSERT/UPDATE must maintain the index) and storage. Instagram reported that one team’s “add every column to indexes” spree caused a 25% drop in write throughput until they audited and removed them.
- Q: Can I have a covering index in MySQL InnoDB? A: Yes — any secondary index whose columns fully satisfy the query avoids the bookmark lookup into the clustered index. MySQL documentation calls this “using index” in EXPLAIN output.
- use-the-index-luke.com — chapter on covering indexes (“index-only scans”)
- postgresql.org/docs — “Index-Only Scans” and
CREATE INDEX ... INCLUDEsyntax
16. Hash Index
16. Hash Index
- Equality only: Hash indexes support only
=operator. No range queries (>,<,BETWEEN), noORDER BY, noLIKE 'prefix%'. B-tree handles all of these. - No covering index: Hash indexes cannot use INCLUDE columns (PostgreSQL). B-tree can.
- Historical reliability issues: Before PostgreSQL 10, hash indexes were not WAL-logged — they would corrupt on crash and had to be rebuilt. This history made the community wary. Since PG 10, they ARE crash-safe, but the stigma persists.
- Marginal speed advantage: For a B-tree with 100M rows, a point lookup is 3-4 page reads. For a hash index, it is 1-2 page reads. The real-world difference is microseconds — rarely the bottleneck. The flexibility loss is not worth the marginal gain.
- Very large tables where every microsecond matters AND queries are exclusively equality lookups (e.g., session token lookup, cache key lookup).
- The hash index is smaller than the equivalent B-tree for long text keys because it stores fixed-size hashes, not the full key value.
- “Compare B-tree, hash, GIN, and GiST indexes. When would you choose each?” (See the dedicated B-tree vs Hash vs GiST question below.)
- “How does a hash index handle collisions, and what happens to performance under high collision rates?” (Overflow pages. If the hash function produces many collisions for the data distribution, performance degrades from O(1) to O(N) for the collision chain.)
- “In-memory databases like Redis use hash tables as the primary data structure. Why does the calculus change for in-memory vs. disk-based systems?” (No I/O cost for memory access, so the constant-factor advantage of O(1) hash lookup over O(log N) tree traversal is significant. On disk, the I/O cost dwarfs the comparison cost.)
Object.hashCode()) can cluster badly.- Q: If hash is O(1), why isn’t it the default for primary keys? A: Because primary keys are used for more than point lookups — range scans, sort-order traversal, foreign-key joins. B-tree supports all of those in O(log N). Hash only wins for one narrow use case.
- Q: Do any databases use hash indexes by default? A: Yes — key-value stores (Redis, DynamoDB partition keys) use hash indexing because their access pattern is exclusively point lookups. The calculus changes when range queries aren’t in the API.
- postgresql.org/docs — “Index Types” chapter covering all 6 native index types
- use-the-index-luke.com — discussion of why B-tree dominates despite hash’s theoretical edge
17. Database Sharding Logic
17. Database Sharding Logic
- Hash-based (
shard = hash(tenant_id) % N): Even distribution, no hotspots if the key has good cardinality. Drawback: adding a shard requires rehashing and migrating data. Consistent hashing minimizes migration (only ~1/N of keys move). - Range-based (
shard 1: IDs 1-1M, shard 2: 1M-2M): Simple, supports range queries within a shard. Drawback: hotspots if recent data is accessed more (all traffic hits the last shard). Time-based ranges for logs have this exact problem. - Geo-based (EU data on EU shard, US data on US shard): Reduces latency and satisfies data residency requirements (GDPR). Drawback: users who travel or multi-region organizations complicate routing.
- Directory/lookup-based: A mapping table tracks which shard holds each entity. Most flexible but the lookup service becomes a single point of failure and a latency tax on every query.
- Cross-shard JOINs: Essentially impossible efficiently. You must denormalize or accept application-level joins. This is why the shard key choice is critical — data that is frequently joined must live on the same shard (co-location).
- Cross-shard transactions: 2PC is slow and fragile. Most teams use eventual consistency with Saga patterns.
- Rebalancing: Moving data between shards without downtime. Tools like Vitess (MySQL) or Citus (PostgreSQL) automate this, but it is never painless.
- Schema migrations: Must be applied to ALL shards atomically — or at least with backward compatibility. A failed migration on shard 3 of 16 is a nightmare.
- “Your shard key is
user_idbut you need to query ‘all orders for product X’ across all users. How do you handle this?” (Secondary index shard — scatter-gather query to all shards and merge results. Or maintain a denormalized reverse-lookup table. Or use a search index like Elasticsearch for cross-shard queries.) - “You have 4 shards and need to add 2 more. Walk me through the data migration without downtime.” (Consistent hashing to minimize movement. Dual-write during migration window. Shadow reads to verify. Cutover shard by shard. Vitess/Citus handle this operationally.)
- “When would you choose Vitess over Citus for a sharding solution?” (Vitess: MySQL ecosystem, battle-tested at YouTube/Slack scale, good for MySQL-native teams. Citus: PostgreSQL extension, transparent distribution, better for teams already on PostgreSQL who want to keep SQL semantics.)
team_id because 99% of their query patterns scoped to one team. The migration took 9 months and was documented in their “How Figma’s databases team lived to tell the scale” post. Key lesson they shared: they initially considered workspace_id but discovered ~2% of queries genuinely spanned workspaces — that 2% would have forced scatter-gather on the hot path. team_id had zero cross-shard queries in the hot path.- Q: How do you handle a migration (ADD COLUMN) across 64 shards? A: You don’t run them simultaneously — one shard at a time, with automated rollback. Deploy dual-compatible code first (works with old and new schema), then run the migration on each shard, then deploy code that uses the new schema. Stripe documented this “expand-migrate-contract” pattern.
- Q: What’s the alternative to sharding if my writes exceed one node? A: A distributed SQL database that shards transparently: CockroachDB, Spanner, Yugabyte. Or a write-optimized NoSQL engine like Cassandra/Scylla. Both skip the pain of application-level shard-key management — at the cost of learning a new operational model.
- highscalability.com — Instagram’s and Figma’s sharding case studies
- martin.kleppmann.com — DDIA Chapter 6 “Partitioning”
18. Partitioning (Table)
18. Partitioning (Table)
- Range partitioning: Most common. Partition by date ranges —
audit_logs_2025_01,audit_logs_2025_02. Perfect for time-series data where queries always include a date range. Partition pruning turns a 2-billion-row scan into a single-partition scan. - List partitioning: Partition by discrete values —
orders_us,orders_eu,orders_apac. Good for multi-tenant or geographic splits. - Hash partitioning: Partition by hash of a column. Distributes data evenly but does not support range queries efficiently. Use when you need even distribution without a natural range key.
- Partition creation: Must create new partitions BEFORE data arrives. Automate monthly partition creation with
pg_cronor a migration script. If a query inserts into a month with no partition and no DEFAULT partition, it fails with an error. - Partition dropping:
DROP TABLE audit_logs_2023_01is instant and reclaims space immediately. Far faster thanDELETE FROM audit_logs WHERE created_at < '2023-02-01'which generates dead tuples and requires VACUUM. - Index management: Each partition has its own indexes. A partitioned table with 48 partitions and 3 indexes = 144 individual indexes to maintain.
- Too many partitions: More than a few hundred partitions degrades planning time. The planner must evaluate each partition. For sub-second OLTP queries, keep partition count under 100-200.
- “Your partitioned table has 500 monthly partitions. Query planning time has gone from 2ms to 200ms. What do you do?” (Consolidate old partitions — merge years-old monthly partitions into yearly ones. Or use TimescaleDB which handles hypertable chunk management automatically.)
- “Compare PostgreSQL declarative partitioning with MySQL partitioning. What can PostgreSQL do that MySQL cannot?” (PostgreSQL: partition-level indexes, partition-level constraints, attaching/detaching partitions dynamically, partitioned indexes. MySQL: more limited — no partition-wise joins, no sub-partitioning in recent versions.)
- “When would you partition vs. shard vs. use a time-series database like TimescaleDB?” (Partition: single server, query optimization. Shard: capacity beyond single server. TimescaleDB: time-series specific with automated chunking, compression, continuous aggregates.)
DROP TABLE events_2024_02 per month — instant, reclaims space immediately, avoids bloating autovacuum with billions of dead tuples. The alternative (DELETE FROM events WHERE created_at < ...) would have generated a mountain of dead tuples and required aggressive vacuuming. Monthly partitioning turned retention from an ongoing operational headache into a one-line scheduled job.WHERE created_at BETWEEN '2025-01-01' AND '2025-02-01' into a scan of one partition (~60M rows) instead of all 2B.- Q: Why does partition pruning sometimes fail silently?
A: Data-type mismatches (query passes
timestamp, partition bound istimestamptz), function-wrapped columns (WHERE date_trunc('month', created_at) = ...), or parameterized queries with older Postgres versions that don’t support runtime pruning. - Q: Can I partition a table after it’s already huge?
A: Yes, but it requires rewriting the table. The pattern: create a new partitioned table, copy data in batches, swap names, drop the old. Tools like
pg_partmanhelp automate this.
- postgresql.org/docs — “Table Partitioning” chapter
- TimescaleDB docs — hypertables and continuous aggregates
19. Vacuum (Postgres)
19. Vacuum (Postgres)
- Marks dead tuples as reusable — the space is added to the Free Space Map (FSM). New INSERTs can reuse this space. Note: regular VACUUM does NOT return space to the OS — the file stays the same size.
- Updates the visibility map — marks pages where all tuples are visible to all transactions. This enables Index-Only Scans.
- Freezes old transaction IDs — prevents transaction ID wraparound (see below).
VACUUM(standard): Runs concurrently with reads/writes. Does not reclaim space to OS. Fast for routine maintenance.VACUUM FULL: Rewrites the entire table, reclaiming space to OS. Requires ACCESS EXCLUSIVE lock — blocks ALL operations. On a 100GB table, this can take hours. Avoid in production; usepg_repackinstead.VACUUM FREEZE: Aggressively freezes all tuple transaction IDs. Used when approaching wraparound danger.
autovacuum_vacuum_threshold(default 50) +autovacuum_vacuum_scale_factor(default 0.2) = autovacuum triggers when dead tuples exceed50 + 20% of table size. For a 100M-row table, that is 20M dead tuples before autovacuum kicks in. For hot tables, setscale_factorto 0.01 per-table.autovacuum_vacuum_cost_delay(default 2ms in PG 12+): How long autovacuum sleeps between I/O operations. Lower = more aggressive vacuum, higher I/O impact. Increase aggressiveness for critical tables.autovacuum_max_workers(default 3): Maximum concurrent autovacuum processes. On a system with hundreds of tables, 3 workers may not be enough.
pg_stat_user_tables.n_dead_tup, age(datfrozenxid)).Follow-up chain:- “Autovacuum is running every 5 minutes on a table but dead tuple count keeps rising. What is blocking it?” (A long-running transaction or idle-in-transaction session is holding the xmin horizon. VACUUM can run but cannot remove tuples newer than the oldest active transaction. Also check for orphaned replication slots.)
- “How do you monitor for transaction ID wraparound risk?” (
SELECT datname, age(datfrozenxid) FROM pg_database;— alert when age exceeds 500M. Also monitorpg_stat_user_tables.n_dead_tupandlast_autovacuumtimestamp.) - “Your 500GB table needs VACUUM FULL but you cannot take downtime. What are your options?” (
pg_repackrebuilds the table online using triggers to capture concurrent changes.pg_squeezeis a similar tool. Both require temporary disk space equal to the table size.)
20. Connection Pooling
20. Connection Pooling
SET leakage, pool vs DB max_connections mismatch)?Answer:
Opening a new database connection is expensive: TCP handshake, TLS negotiation, authentication, and memory allocation for the connection’s backend process (Postgres spawns a new OS process per connection, ~10MB RSS each). At scale, this overhead becomes a bottleneck — a Node.js app with 50 serverless instances each opening 10 connections can exhaust the default max_connections=100 instantly and leave DB memory starved for buffer cache.Solution: A connection pool (PgBouncer for Postgres, ProxySQL for MySQL, HikariCP for Java, pgx for Go) maintains a set of pre-established connections. Application code borrows a connection, uses it, and returns it to the pool. The pool manages lifecycle, health checking, and recycling.PgBouncer modes:- Session pooling: Connection held for the entire client session. Safest (all session features work) but no multiplexing benefit — pool size must equal concurrent clients.
- Transaction pooling: Connection returned after each transaction. Sweet spot for most apps — high multiplexing, but breaks prepared statements (server-side),
SETstatements,LISTEN/NOTIFY, and temp tables. - Statement pooling: Connection returned after each statement. Most restrictive; does not support multi-statement transactions.
- Start from the DB’s
max_connections. Leave headroom for superuser/replication (Postgres reserves ~3). - For Postgres, a common formula:
pool_size = num_cores * 2 + effective_spindle_count. For SSD + 8 cores: 16-20 connections per DB. More is usually counterproductive (context-switching kills throughput). - With a pooler in front: DB
max_connections=100, poolerpool_size=20,default_pool_sizeper database=20, client-side pool can be much larger (say 200) because the pooler multiplexes them down. - The #1 sizing trap: your ORM’s pool (Hibernate
maximumPoolSize: 50) x N replicas of your app = total DB connections. Multiply it out before deploying.
- Prepared statement explosion: In transaction pooling, each client’s prepared statements may land on different server connections, causing re-prepares. Disable server-side prepares (JDBC:
prepareThreshold=0) or use session pooling for that service. - Idle-in-transaction: An app takes a connection, runs a query, then goes idle holding the connection for 30s. With 100 app instances, you are wasting 100 slots. Use
idle_in_transaction_session_timeoutin Postgres to kill these. - Pool exhaustion deadlock: Service A takes a connection, calls Service B (same pool), which tries to take another connection — pool is empty, infinite wait. Fix: always bound pool waits with a timeout; never nest connection checkouts.
- Senior: Sets up PgBouncer in transaction mode, sizes pool from cores, adds idle timeouts, and debugs pool exhaustion with
pg_stat_activity. - Staff: Designs the connection topology for the whole platform — PgBouncer per read replica, separate pools for OLTP vs analytics, sidecar vs centralized pooler tradeoff, connection budget per service as an SLO, and a dashboard that alerts before
max_connectionsis hit. Also thinks about serverless: why connection pooling is a nightmare for Lambda and why you need RDS Proxy or a warm pool layer.
- “Your Postgres hits
FATAL: too many connections for role 'app'. Walk through your triage.” — (a)SELECT count(*), state FROM pg_stat_activity GROUP BY state;— are connections idle, active, or idle-in-transaction? (b) If many idle-in-transaction, an app is not closing transactions — check the app’s ORM for missing commits. (c) If many idle, the pool is over-sized. (d) Short-term:SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND state_change < now() - interval '1 hour';(e) Long-term: add PgBouncer, right-size the pool. - “Why is connection pooling uniquely painful for AWS Lambda?” — Each Lambda invocation is a separate process. Cold starts create fresh connections; scale-out to 1000 concurrent Lambdas creates 1000 connections. Fixes: RDS Proxy (managed pooler), keep connections warm across invocations with container reuse, or switch to HTTP-based APIs over JDBC (Data API for Aurora).
- “What is the difference between PgBouncer in transaction mode and a serverless Postgres like Neon or Aurora Serverless v2?” — PgBouncer is a pooler; you still pay for the backend. Aurora Serverless v2 scales the DB compute itself (vCPU/RAM) based on load, but does not help with connection count — you still need a pooler in front for connection-heavy workloads. Neon has built-in pooling.
- “You see
client_idle_timeoutexpiring connections mid-transaction. What is happening?” — Timeout is shorter than the longest transaction. Raiseclient_idle_timeoutor lower transaction duration. Also check for network middleboxes (NAT gateways) with their own idle timeouts silently killing TCP sockets.
pg_stat_activity shows 95 active connections out of 100 max. Walk through your fix.- Immediate: add PgBouncer in transaction mode, set app pool to 40 and pooler
default_pool_sizeto 20. This multiplexes 200+ app connections down to 20 DB connections. - Measure: check if latency drops. If still high, the bottleneck is not connections but query time — move to
pg_stat_statementsto find slow queries. - Validate: run a load test at 2x peak traffic, confirm
max_connectionsnever goes above 80%. - Guardrails: alert on
pg_stat_database.numbackends > 80, alert onidle_in_transaction > 60s.
max_connections.” — Postgres memory use scales linearly with connections. Going from 100 to 1000 can 10x your RSS and push the DB into swap.What strong candidates say: “Connection pooling is about matching two very different population sizes — many short-lived app requests vs few long-lived DB backends. The pooler is a multiplexer. I size from DB capacity backwards: cores, memory per connection, then the pool, then the app pool on top. And I always add idle-in-transaction and statement timeouts — those two settings have saved more production databases than any tuning parameter.”3. NoSQL (Mongo/Cassandra/Redis)
21. Document vs Column-Family vs Key-Value
21. Document vs Column-Family vs Key-Value
- Document (MongoDB, CouchDB): Stores JSON/BSON documents. Schema-flexible — each document can have different fields. Ideal when: your data is naturally hierarchical (user profiles with nested addresses, preferences), your schema evolves frequently, or you need to read/write an entire entity in one operation. Trade-off: no JOINs (use
$lookupwhich is slow), denormalization means update anomalies. - Key-Value (Redis, DynamoDB): Simplest model — opaque value stored under a key. O(1) lookups. Ideal for: caching, session storage, rate limiting, leaderboards. Trade-off: no querying by value — you must know the key. DynamoDB adds secondary indexes but at heart it is still key-value optimized.
- Wide Column (Cassandra, HBase): Rows with dynamic columns, organized by partition key and clustering key. Ideal for: time-series data, IoT sensor data, activity feeds — any workload that writes heavily and reads by partition. A single partition can hold millions of columns. Trade-off: queries MUST include the partition key. Ad-hoc queries across partitions are expensive (full cluster scan).
- Graph (Neo4j, Amazon Neptune): Nodes and edges with properties. Traversal is O(1) per hop (index-free adjacency). Ideal for: social networks (friends-of-friends), fraud detection (circular money transfers), recommendation engines, knowledge graphs. Trade-off: poor at bulk aggregations, no standard query language (Cypher vs. Gremlin vs. SPARQL).
- “Your team wants to replace PostgreSQL with MongoDB ‘because it’s faster.’ How do you evaluate this decision?” (Faster for what? MongoDB has no JOINs, no transactions across documents (until v4.0 multi-doc transactions, which are slower than PG). If the workload is single-document reads/writes, MongoDB can be faster. If you need complex queries, aggregations, or cross-entity transactions, PostgreSQL wins.)
- “When would you choose DynamoDB over Redis for a key-value use case?” (DynamoDB: durable, scalable, serverless, supports secondary indexes. Redis: in-memory, sub-millisecond, rich data structures, but requires managing persistence and failover.)
- “How would you migrate from MongoDB to PostgreSQL — or vice versa — for a 50M-document collection?”
22. MongoDB Replication (Replica Set)
22. MongoDB Replication (Replica Set)
- The Oplog (operations log) is a capped collection on the primary that records every write operation. Secondaries tail the Oplog and replay operations in order.
- Election: If the primary becomes unreachable (heartbeat timeout, default 10s), secondaries hold an election using the Raft-like protocol. The secondary with the most up-to-date Oplog wins. Election takes 5-12 seconds — during which the replica set is unavailable for writes.
- Read preference: By default, all reads go to the primary (
primary). You can route reads to secondaries (secondaryPreferred) for read scaling, but you may read stale data (seconds behind primary).nearestroutes to the lowest-latency node regardless of role.
- Reading from secondaries with
readPreference: secondaryPreferredmeans you might not see your own writes. A user updates their profile, then immediately reads it from a secondary — they see the old data. Fix: usereadConcern: majoritywithreadPreference: primaryPreferredfor read-your-writes semantics. - Write concern
w: majorityensures the write is acknowledged by a majority of nodes before returning. Combined withreadConcern: majority, you get causal consistency.
- “A secondary is falling behind the primary. The Oplog on the primary has rolled over past the secondary’s last-read entry. What happens?” (The secondary enters an unrecoverable state and must perform a full resync — copying the entire dataset from the primary. This can take hours for large datasets. Prevention: size the Oplog appropriately for your write volume.)
- “How does MongoDB’s replication model differ from PostgreSQL’s streaming replication?” (MongoDB: logical replication via Oplog (operations). PostgreSQL: physical replication via WAL (byte-level changes). MongoDB can replicate across different storage engines; PostgreSQL replicas must have identical physical layout.)
- “You need zero-downtime MongoDB version upgrades. How does the replica set election mechanism help?” (Rolling upgrade: upgrade secondaries one by one, then step down the primary, upgrade it, let it rejoin. The election promotes an upgraded secondary.)
23. Cassandra Architecture
23. Cassandra Architecture
- Masterless ring: Every node is equal (no primary/secondary). Data is distributed using consistent hashing — each node owns a range of the hash ring (tokens). No single point of failure.
- VNodes (Virtual Nodes): Each physical node owns multiple token ranges (default 256 vnodes). This ensures even distribution even with heterogeneous hardware and simplifies rebalancing when adding/removing nodes.
- Replication factor (RF): Data is copied to RF nodes. With RF=3, each piece of data lives on 3 nodes. The replication strategy determines which nodes:
SimpleStrategyfor single-DC,NetworkTopologyStrategyfor multi-DC (specify RF per datacenter).
- Write consistency
CL:ONE(ack from 1 replica — fastest, least safe),QUORUM(ack from majority — balanced),ALL(ack from all replicas — slowest, strongest). Formula for strong consistency:R + W > RFwhere R = read CL, W = write CL. With RF=3,QUORUMreads +QUORUMwrites = 2+2 > 3 = strong consistency. - Hinted handoff: If a target replica is down during a write, the coordinator stores a “hint” and forwards it when the node recovers. Prevents write failures during temporary outages.
- Read repair: On read, if replicas disagree, the coordinator sends the most recent version to out-of-date replicas. Happens in the background (
read_repair_chance). - Anti-entropy repair (
nodetool repair): Full data reconciliation across replicas. Must be run periodically (at least everygc_grace_seconds, default 10 days) to prevent zombie data resurrection.
- “You need to query Cassandra for ‘all orders by user X in the last 30 days, sorted by date.’ Design the table.” (Partition key:
user_id, clustering key:order_date DESC. Primary key:((user_id), order_date). This ensures all orders for a user are in one partition, sorted by date. One read, one partition, no scatter-gather.) - “What are tombstones in Cassandra and why can they cause performance problems?” (Deletes create tombstone markers instead of removing data. During reads, Cassandra must scan through tombstones. Too many tombstones (from frequent deletes or TTL expirations) degrade read performance dramatically.
gc_grace_secondscontrols when tombstones are purged.) - “Compare Cassandra vs. DynamoDB for a high-write IoT workload.” (Cassandra: self-managed, more tuning knobs, multi-DC natively, no vendor lock-in. DynamoDB: fully managed, auto-scaling, pay-per-request pricing, but AWS-only and less flexible data modeling.)
24. Redis Persistence
24. Redis Persistence
- RDB (Redis Database Backup): Point-in-time snapshots at configurable intervals. Redis forks the process (copy-on-write), serializes the dataset to disk. Compact file, fast restart. Trade-off: you lose all data written since the last snapshot. With
save 60 1000(snapshot every 60s if 1000+ keys changed), worst case is 60 seconds of data loss. - AOF (Append Only File): Logs every write command to a file. Three
fsyncpolicies:always(every command — slowest, safest),everysec(batch fsync every second — good default, lose ~1s on crash),no(let OS decide — fastest, most risk). AOF files are larger and restart is slower (must replay all commands). Redis periodically rewrites the AOF to compact it (BGREWRITEAOF). - RDB + AOF (recommended for durability): Use both. On restart, Redis loads from AOF (more complete) if it exists, falls back to RDB. This gives you fast snapshots for backup plus near-zero data loss from AOF.
fork(). On a Redis instance with 25GB of data, the fork can take 100-500ms (during which Redis blocks all commands) and requires up to 25GB of additional memory if many pages are modified (copy-on-write). This is why production Redis instances should not exceed ~25GB per instance, and you should monitor latest_fork_usec.What weak candidates say: “Redis is just a cache, persistence doesn’t matter.” For many use cases (session storage, rate limiting, queues), data loss IS acceptable. But for Redis as a primary store (leaderboards, real-time analytics), persistence is critical.What strong candidates say: Explain both mechanisms with their specific data loss windows, discuss the fork overhead on large instances, and mention the combined RDB+AOF strategy for production. Understand that Redis Cluster + persistence still does not replace a proper database for durable transactional data.Follow-up chain:- “Your Redis instance with 30GB of data is experiencing 400ms latency spikes every 5 minutes. What is happening?” (RDB snapshots or AOF rewrites triggering
fork(). The fork blocks the main thread while copying page tables for 30GB of memory. Fix: reduce snapshot frequency, use a replica for persistence, or reduce instance size.) - “When would you choose Redis with persistence over PostgreSQL for a primary data store?” (Almost never for relational data. But for leaderboards, real-time counters, session storage where sub-millisecond latency matters and the data model fits Redis’s structures — sorted sets, hashes, streams.)
- “How does Redis Sentinel vs. Redis Cluster handle failover, and how does persistence interact with each?”
25. Bloom Filter in NoSQL
25. Bloom Filter in NoSQL
-N * ln(p) / (ln(2))^2 where p is the desired false positive rate.What weak candidates say: “It is a filter that checks if something exists.” Cannot explain the false positive trade-off or why it matters for storage engines.What strong candidates say: Explain the I/O savings in the context of LSM-tree reads, mention the false positive rate tuning, and connect it to the broader SSTable compaction strategy.Follow-up chain:- “Can you delete a key from a Bloom filter? What would you use instead if you need deletions?” (No — unsetting bits could remove other keys that share those bit positions. Use a Counting Bloom filter (each position is a counter, not a bit) for deletions, at the cost of 3-4x more memory.)
- “How does compaction in Cassandra interact with Bloom filters?” (When SSTables are compacted into a new SSTable, a new Bloom filter is generated. Fewer SSTables = fewer Bloom filter checks = faster reads. This is why compaction strategy — Size-Tiered vs. Leveled — affects read performance.)
- “Where else are Bloom filters used in distributed systems?” (CDN cache checking, spell checkers, network routers for packet filtering, PostgreSQL’s
bloomindex extension for multi-column equality searches.)
26. CAP Theorem in Practice
26. CAP Theorem in Practice
- MongoDB (CP): When the primary is unreachable, a new election takes 5-12 seconds. During that window, writes are rejected (sacrificing availability). Once a new primary is elected, consistency is maintained because there is always one authoritative writer.
- Cassandra (AP): During a partition, all nodes continue accepting writes. When the partition heals, conflicting writes are resolved by “last write wins” (timestamp-based). This means you might read stale data during and briefly after a partition, but the system never stops serving requests.
- PostgreSQL streaming replication (CP): With
synchronous_commit = remote_apply, writes block until the replica confirms. If the replica is unreachable, writes stall (sacrificing availability for consistency). With async replication, it leans AP — writes succeed but replicas may lag.
- “A microservice writes to both PostgreSQL and Redis. The Redis update fails. You now have inconsistent state. Is this a CAP problem?” (No — this is a distributed transaction problem. CAP applies to a single replicated data store. This requires Saga patterns, outbox pattern, or 2PC. Many people misapply CAP to application-level consistency.)
- “How does CockroachDB (NewSQL) claim to provide both C and A?” (It provides CP during partitions — unavailable nodes stop serving. Its claim to ‘availability’ is about high availability through consensus-based replication, not CAP-A during partitions. No system violates CAP.)
- “You are designing a shopping cart service. Would you choose CP or AP, and why?” (AP — a shopping cart should always be available, even if a user temporarily sees a slightly stale cart. Lost availability means lost revenue. Inconsistency can be resolved on checkout.)
27. Geo-Spatial Index
27. Geo-Spatial Index
- R-tree / GiST (PostgreSQL PostGIS): Hierarchical bounding-box tree. Supports complex operations: containment, overlap, nearest-neighbor, intersection of polygons. PostGIS
ST_DWithin,ST_Contains,ST_Intersectsuse GiST indexes. The gold standard for geospatial queries in relational databases. - Geohash: Encodes lat/lng into a string. Nearby points share a common prefix. A B-tree index on geohash supports proximity searches via
LIKE 'prefix%'. Simple but has edge cases at geohash cell boundaries — points that are physically close may have different prefixes. - QuadTree: Recursively divides 2D space into four quadrants. Used by MongoDB’s
2dindex and some in-memory spatial indices. Good for point data, less efficient for complex polygon queries. - S2 (Google) / H3 (Uber): Modern hierarchical spatial indexing systems. S2 uses a sphere-projected Hilbert curve. H3 uses hexagonal tiling. Both solve the geohash boundary problem. DynamoDB and BigQuery use S2 internally.
2dsphereindex: For spherical (Earth) geometry. Supports$near,$geoWithin,$geoIntersects.2dindex: For flat-plane geometry. Legacy, rarely used.
- “You need to find the 10 nearest drivers to a rider. Compare the geohash approach vs. PostGIS ST_DWithin.” (Geohash: fast for approximate proximity but misses edge cases at cell boundaries. PostGIS: accurate but requires GiST index and is more computationally expensive. For ride-sharing, geohash with expanding radius is often sufficient because you do not need meter-level precision for initial matching.)
- “How does Uber’s H3 indexing system improve on geohash?” (Hexagonal cells have uniform adjacency — every cell has 6 neighbors at equal distance, unlike geohash’s rectangular cells. This makes proximity queries more uniform and avoids the boundary problem.)
- “Your geo-spatial query works in testing but times out in production with 100M points. How do you optimize?”
28. Write Concern (MongoDB)
28. Write Concern (MongoDB)
w: 1(default): Acknowledged by the primary only. Fastest. Risk: if the primary crashes before replicating, the write is lost. Data loss window = replication lag (~100ms typically).w: "majority": Acknowledged by a majority of voting members (2 of 3 in a typical replica set). The write is on multiple nodes before acknowledgment. Survives primary failure. 2-5ms overhead per write for the cross-node acknowledgment.w: 0(fire-and-forget): No acknowledgment. The driver does not wait for any response. Maximum throughput, maximum risk. Use only for logging or metrics where data loss is acceptable.j: true(journaling): Write is flushed to the on-disk journal before acknowledgment. Withoutj: true, the acknowledged write may still be in the WiredTiger memory buffer and lost on crash.w: "majority"impliesj: truein MongoDB 5.0+.wtimeout: Maximum time to wait for write concern satisfaction. Without a timeout, a network partition can cause the application to hang indefinitely waiting for majority acknowledgment.
w: "majority", j: true, wtimeout: 5000 is the safe default for critical data (orders, payments, user accounts). w: 1 is acceptable for high-throughput, lower-criticality data (analytics events, activity logs). w: 0 is for fire-hose data where throughput matters more than any individual write.What weak candidates say: “Just use w: majority for everything.” Does not consider the latency impact or know about wtimeout.What strong candidates say: Match write concern to the business criticality of the data. Explain the interaction between write concern and read concern for causal consistency. Mention wtimeout to prevent indefinite hangs during partitions.Follow-up chain:- “You set
w: majoritybut the application is now 3x slower on writes. How do you investigate?” (Check replica lag — if a secondary is far behind, the majority acknowledgment waits for it. Check network latency between nodes. Consider that majority requires waiting for the slowest majority member.) - “What is the difference between
w: majorityandw: 2in a 3-node replica set?” (Both require 2 acknowledgments. Butw: majorityis topology-aware — if you add a 4th node, majority becomes 3.w: 2is fixed.w: majorityis preferred because it scales with topology changes.) - “How does write concern interact with retryable writes in MongoDB 4.2+?”
29. Redis Single Threaded?
29. Redis Single Threaded?
- No locks, no context switches: Every operation is atomic by default.
INCR,LPUSH,SETNX— all guaranteed atomic without needing mutexes. This simplifies the codebase and eliminates entire classes of concurrency bugs. - CPU is not the bottleneck: Redis operations are in-memory and take microseconds. The bottleneck is network I/O and memory bandwidth, not CPU computation. A single core can process 100K-200K operations per second.
- Simplicity: The entire data structure library is lock-free because only one thread mutates state. This is why Redis can offer complex atomic operations (MULTI/EXEC, Lua scripts) without the complexity of multi-threaded data structures.
- I/O threading: Redis 6 introduced multi-threaded I/O — separate threads handle reading from and writing to client sockets, while the main thread still processes commands. This eliminates the network I/O bottleneck and increases throughput by 2x on multi-core machines. The command execution is still single-threaded.
- Threaded lazy-free: Background threads handle memory deallocation for large keys (e.g., deleting a 10M-element sorted set). Without this,
DELon a large key blocks the main thread for seconds.
- Redis Cluster: Shard data across multiple Redis instances (each single-threaded), distributed across cores or machines. Each shard handles a subset of the keyspace (16384 hash slots).
- Multiple instances per server: Run 4-8 Redis instances on an 8-core machine, each pinned to a core. Simple but requires client-side routing.
- “You have a Redis instance processing 150K ops/sec and need to get to 500K. What are your options?” (Enable I/O threads (
io-threads 4). If still insufficient, shard with Redis Cluster. Also check if pipelining is being used — pipelining batches commands, reducing network roundtrips and increasing throughput 5-10x.) - “A
DELcommand on a key containing a 5M-element sorted set causes a 2-second latency spike. Why and how do you fix it?” (DELfrees memory synchronously on the main thread. UseUNLINKinstead — it schedules memory deallocation on a background thread and returns immediately.) - “How does Redis’s single-threaded model compare to Memcached’s multi-threaded model? When would you choose Memcached?” (Memcached is multi-threaded and scales better on a single instance for simple get/set operations. Choose Memcached for pure caching with simple key-value needs. Choose Redis for data structures, persistence, pub/sub, Lua scripting.)
30. Graph DB Use Cases
30. Graph DB Use Cases
- Fraud detection: Find circular money transfers (A -> B -> C -> A) in banking transactions. In SQL, this requires recursive CTEs with self-JOINs that grow exponentially. In Neo4j:
MATCH (a)-[:TRANSFER*3..5]->(a) RETURN a— follows the chain directly. - Social networks: “Friends of friends who also like X” — 2-3 hop traversals. SQL JOINs for this pattern are O(N^k) where k is the depth. Graph traversal is O(D^k) where D is the average degree (much smaller).
- Recommendation engines: “Users who bought X also bought Y” — collaborative filtering via graph patterns.
- Knowledge graphs: Entity-relationship modeling for AI/ML features, semantic search, ontology management.
- Network topology: Route planning, dependency analysis, impact analysis (if server X goes down, what services are affected?).
- Bulk aggregations (SELECT COUNT, SUM, GROUP BY) — graph DBs are terrible at this.
- Simple CRUD with no relationship-heavy queries — overhead without benefit.
- When your data fits neatly in tables and your queries are predictable JOINs.
- Write-heavy workloads — graph databases typically have lower write throughput than relational DBs.
- “Can you model a social network in PostgreSQL with recursive CTEs and achieve similar performance to Neo4j?” (For 1-2 hops, PostgreSQL with proper indexes is competitive. At 3+ hops, the JOIN explosion makes SQL impractical. Neo4j maintains constant per-hop cost regardless of depth.)
- “Compare Neo4j’s Cypher with Apache TinkerPop’s Gremlin. When would you choose each?” (Cypher: declarative, SQL-like, easier to learn, Neo4j-specific. Gremlin: imperative/traversal-based, runs on multiple backends (Neptune, JanusGraph, CosmosDB), better for complex programmatic traversals.)
- “How would you handle a graph that grows to billions of edges? Can Neo4j handle this or do you need a different approach?” (Neo4j Enterprise can handle billions of edges on a single machine with sufficient RAM. For truly massive graphs, consider JanusGraph on Cassandra/HBase for distributed storage, or Amazon Neptune for managed service.)
4. Advanced SQL Scenarios
31. Window Functions (`OVER`)
31. Window Functions (`OVER`)
OVER clause defines the window — which rows to include and how to order them.The three components of a window function:- PARTITION BY: Divides rows into groups. Each partition is processed independently. Like GROUP BY but without collapsing rows.
- ORDER BY: Defines the sort order within each partition. Required for ranking and running aggregate functions.
- Frame clause: Defines which rows relative to the current row are included.
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(default for running totals),ROWS BETWEEN 6 PRECEDING AND CURRENT ROW(7-day moving average).
ROW_NUMBER(): Sequential numbering within partitions. Use case: deduplication —WHERE row_num = 1keeps only the latest record per group.RANK()/DENSE_RANK(): Ranking with ties. RANK skips positions after ties (1, 2, 2, 4). DENSE_RANK does not (1, 2, 2, 3).LAG(col, N)/LEAD(col, N): Access previous/next row’s value. Use case: calculate day-over-day revenue change:revenue - LAG(revenue, 1) OVER (ORDER BY date).SUM() OVER (ORDER BY date): Running total. Accumulating revenue over time.NTILE(4): Divide rows into N equal groups (quartiles, percentiles).
work_mem is insufficient. If you frequently window by the same columns, an index on (partition_col, order_col) eliminates the sort step.What weak candidates say: “Window functions are like GROUP BY.” They are fundamentally different — GROUP BY collapses rows, window functions preserve them. Cannot write a window function from memory.What strong candidates say: Write examples fluently, explain the frame clause, mention performance implications of the sort, and know when a window function replaces a correlated subquery.Follow-up chain:- “Write a query that returns the top 3 highest-paid employees per department, handling ties.” (
WITH ranked AS (SELECT *, DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) as rnk FROM employees) SELECT * FROM ranked WHERE rnk <= 3;) - “What is the difference between
ROWS BETWEENandRANGE BETWEENin the frame clause?” (ROWS operates on physical row positions. RANGE operates on logical values — all rows with the same ORDER BY value are treated as peers. RANGE can produce unexpected results if ORDER BY values are not unique.) - “You have a query with 3 different window functions, each with a different PARTITION BY. What is the performance impact?” (Each distinct window specification requires a separate sort pass. Three different PARTITION BY clauses = three sorts. Define a common window with
WINDOW w AS (...)if partitions overlap, but the planner still needs separate sorts for different partition schemes.)
32. CTE (Common Table Expression)
32. CTE (Common Table Expression)
WITH clause) defines a named temporary result set that exists for the duration of a single query. Think of it as a named subquery that you can reference multiple times.Basic CTE:- PostgreSQL 12+: CTEs are inlined by default (not materialized) — the optimizer can push predicates into the CTE and optimize the whole query as one plan. You can force materialization with
AS MATERIALIZEDor prevent it withAS NOT MATERIALIZED. - PostgreSQL < 12: CTEs are ALWAYS materialized — they act as an optimization fence. This means a CTE that returns 10M rows materializes all of them even if the outer query only needs 10. This is a common performance trap.
- MySQL 8.0+: CTEs are supported and generally inlined by the optimizer.
WHERE depth < 20) or use CYCLE detection (PostgreSQL 14+) to prevent infinite recursion from circular references.What weak candidates say: “CTEs are just named subqueries for readability.” Missing the materialization trap and recursive capability.What strong candidates say: Discuss materialization behavior across PostgreSQL versions, use recursive CTEs fluently, and mention the optimization fence issue in older PostgreSQL versions.Follow-up chain:- “A CTE-based query runs in 200ms in PostgreSQL 12 but 15 seconds in PostgreSQL 11. Same data, same indexes. Why?” (In PG 11, the CTE is materialized as an optimization fence. The outer query’s WHERE clause cannot be pushed into the CTE, so it materializes the full result set then filters. In PG 12, the CTE is inlined and the planner optimizes the whole query.)
- “When would you intentionally use
AS MATERIALIZEDon a CTE?” (When the CTE is referenced multiple times and the result set is small. Without materialization, the subquery executes once per reference. With materialization, it executes once and the result is reused. Also useful to force a specific plan shape.) - “Can recursive CTEs replace a graph database for relationship traversals?” (For simple hierarchies — yes. For complex graph patterns, multi-hop variable-depth traversals, and pattern matching — no. Recursive CTEs have exponential cost growth at depth, while graph DBs maintain constant per-hop cost.)
33. Self Join
33. Self Join
- Employee-Manager hierarchy:
SELECT e.name AS employee, m.name AS manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.id; - Find duplicates:
SELECT a.* FROM users a JOIN users b ON a.email = b.email AND a.id > b.id;— finds rows with the same email but different IDs. - Compare consecutive rows:
SELECT a.date, a.revenue, b.revenue AS prev_revenue FROM daily_sales a JOIN daily_sales b ON a.date = b.date + INTERVAL '1 day';(ThoughLAG()window function is cleaner for this.) - Find pairs:
SELECT a.name, b.name FROM employees a JOIN employees b ON a.department = b.department AND a.id < b.id;— all pairs of employees in the same department (without duplicates).
manager_id, email, department) is essential.What weak candidates say: “You cannot join a table to itself.” Or can only explain the employee-manager example.What strong candidates say: Give multiple use cases, use proper aliasing, and mention that window functions (LAG, LEAD) have replaced many traditional self-join patterns for sequential comparisons.Follow-up chain:- “Rewrite the employee-manager self-join to show the full management chain (CEO -> VP -> Director -> Manager -> Employee) using a recursive CTE.”
- “A self-join on a 10M-row table is running slowly despite an index on the join column. What could be wrong?” (The join is producing a Cartesian product within groups — if 1000 employees share a department, the self-join produces 1M pairs for that department alone. Use window functions or add more restrictive join conditions.)
- “When would you use a self-join vs. a window function for comparing rows within the same table?”
34. N+1 Problem in SQL
34. N+1 Problem in SQL
- Eager loading / JOIN:
SELECT orders.*, customers.name FROM orders JOIN customers ON ...— one query, all data. - Batch loading / IN clause:
SELECT * FROM customers WHERE id IN (1, 2, 3, ..., 100)— two queries total. This is what Facebook’s DataLoader pattern uses. - ORM-specific: Django
select_related()(JOIN) andprefetch_related()(batch IN). SQLAlchemyjoinedload()andsubqueryload(). Railsincludes(). - GraphQL DataLoader: Batches and deduplicates database requests within a single request cycle. Essential for GraphQL APIs where N+1 is structural due to field-level resolution.
pg_stat_statements shows query frequency. Django Debug Toolbar and Rails Bullet gem catch N+1 at development time.Red flag answer: “Just add an index” — indexes do not fix N+1, they make each of the N queries slightly faster but the fundamental problem (too many roundtrips) remains. Also, not being able to explain how an ORM causes this problem.Follow-up questions:- “You are using GraphQL and every query triggers N+1 on nested fields. How does DataLoader solve this?” (DataLoader collects all IDs requested within a single tick of the event loop, then issues one batched query. It also deduplicates — if the same customer ID is requested 5 times, it fetches it once.)
- “Can N+1 happen with NoSQL databases?” (Yes — fetching a list of documents from MongoDB, then fetching referenced documents one by one is the same pattern. MongoDB
$lookupis the aggregation pipeline equivalent of a JOIN.) - “Your team’s Django API has 200ms average response time. You enable query logging and see 47 queries per request. Walk me through your debugging process.” (Identify N+1 patterns in the logs, add
select_relatedandprefetch_relatedto the ORM queries, verify with Django Debug Toolbar, and set up query count assertions in tests to prevent regression.)
35. Optimistic vs Pessimistic Locking
35. Optimistic vs Pessimistic Locking
-
Pessimistic Locking: Assume conflicts are frequent. Lock the resource before modifying it.
SELECT ... FOR UPDATEacquires a row-level exclusive lock — no other transaction can read (with FOR UPDATE) or modify that row until the lock is released on COMMIT/ROLLBACK.- When to use: High contention scenarios like bank account balance updates, inventory decrements during flash sales, or seat reservation systems. Any time two concurrent requests frequently target the same row.
- Danger: Deadlocks. Transaction A locks row 1 then tries to lock row 2. Transaction B locks row 2 then tries to lock row 1. Both wait forever. PostgreSQL detects this and kills one transaction, but your application needs retry logic.
- Performance cost: Lock contention serializes access, reducing throughput on hot rows.
-
Optimistic Locking: Assume conflicts are rare. Read the row with its version number, do your work, then update with a WHERE clause that checks the version has not changed:
UPDATE accounts SET balance=X, version=version+1 WHERE id=1 AND version=5. If another transaction modified the row (version is now 6), the UPDATE affects 0 rows and your application retries.- When to use: Low contention scenarios like editing a blog post, updating user profile settings, or any CRUD operation where concurrent modification of the same record is uncommon.
- Danger: Under high contention, retry storms. If 100 concurrent requests all try to update the same row optimistically, 99 will fail on each round, retry, and 98 will fail again. This amplifies load instead of serializing it.
- Implementation: A
versioninteger column (increment on every update) or aupdated_attimestamp column (less precise, can have collisions within the same millisecond).
- “You have a flash sale where 10,000 users try to buy the last 100 items simultaneously. Which locking strategy do you use and why?” (Pessimistic with
SELECT ... FOR UPDATE SKIP LOCKED— this lets transactions grab the next available item without waiting for locked ones, dramatically reducing contention. This is how ticket reservation systems work.) - “How do ORMs implement optimistic locking?” (Django:
F()expressions for atomic updates, or django-concurrency package. SQLAlchemy:version_id_colmapper option. Rails: built-inlock_versioncolumn with ActiveRecord.) - “What is the difference between
SELECT ... FOR UPDATEandSELECT ... FOR SHARE?” (FOR UPDATE takes an exclusive lock — no other transaction can lock or update the row. FOR SHARE takes a shared lock — other transactions can also take FOR SHARE but cannot FOR UPDATE or modify. Use FOR SHARE when you need to ensure data is not modified while you read it but do not plan to modify it yourself.)
36. Indexing a JSON column
36. Indexing a JSON column
CREATE INDEX idx_data ON events USING gin (data);— indexes ALL keys and values in the JSONB document. Supports@>(containment),?(key existence),?|(any key exists),?&(all keys exist).- Example:
SELECT * FROM events WHERE data @> '{"type": "purchase", "amount": 100}';— the GIN index finds matching documents without a full table scan. - Trade-off: GIN indexes are large (they index every key-value pair) and slow to update on writes. On a table with 10M rows and complex JSONB documents, the GIN index can be 3-5x the size of a B-tree index on a scalar column.
CREATE INDEX idx_type ON events ((data->>'type'));— indexes a specific extracted value. Faster and smaller than GIN for known query patterns.- Supports
=,<,>,BETWEENon the extracted value. Cannot support arbitrary key lookups.
jsonb_path_ops:CREATE INDEX idx_data ON events USING gin (data jsonb_path_ops);— optimized for@>containment queries only. 2-3x smaller than default GIN operator class. Faster for containment but does not support key existence queries (?,?|).
jsonb_path_ops, or when to use expression indexes.What strong candidates say: Choose the index type based on query patterns. Use expression indexes for known access patterns, GIN for flexible querying, and acknowledge the write overhead of GIN indexes.Follow-up chain:- “Your JSONB column stores nested objects 5 levels deep. Can GIN index nested keys, and what are the performance implications?” (GIN indexes nested keys for containment queries. But deeply nested containment checks can be slow because the index stores flattened key paths.)
- “Compare PostgreSQL JSONB indexing with MongoDB’s document indexing. Which is more mature?” (MongoDB: native document model, more natural index types for documents, multi-key indexes on arrays. PostgreSQL: JSONB is a bolt-on to relational model, powerful but GIN has higher write overhead. MongoDB wins for document-primary workloads.)
- “When would you use JSONB vs. a separate document store like MongoDB alongside PostgreSQL?”
37. Full Text Search
37. Full Text Search
LIKE '%term%' can do.PostgreSQL FTS:tsvector: A sorted list of normalized lexemes (word stems) with position information.to_tsvector('english', 'The running foxes jumped')produces'fox':3 'jump':4 'run':2.tsquery: The search query.to_tsquery('english', 'running & fox')matches documents containing both “run” (stemmed) and “fox.”- GIN index:
CREATE INDEX idx_fts ON articles USING gin (to_tsvector('english', body));— pre-computes the tsvector and indexes all lexemes. Lookup is O(1) per term. - Ranking:
ts_rank(tsvector, tsquery)scores relevance based on term frequency and position. - Phrase search:
to_tsquery('english', 'machine <-> learning')matches “machine learning” as a phrase (adjacent words).
- Moderate corpus size (< 50M documents), simple search requirements (keyword matching, boolean operators), and you want to avoid managing a separate search cluster.
- Simpler ops — no Elasticsearch cluster to maintain, data stays in one place, transactionally consistent with writes.
- Complex relevance tuning (BM25, custom scoring, boosting), fuzzy matching, auto-complete/suggest, faceted search, distributed search across billions of documents, or real-time index updates with sub-second latency.
- Elasticsearch’s inverted index is more sophisticated — it supports n-gram tokenizers, synonym dictionaries, and custom analyzers that PostgreSQL’s FTS cannot match.
LIKE '%search%' with an index.” LIKE with a leading wildcard cannot use a B-tree index — it performs a full table scan.What strong candidates say: Explain the tsvector/tsquery mechanism, know the GIN index trade-off, and can articulate the decision boundary between PostgreSQL FTS and Elasticsearch. Mention pg_trgm extension for trigram-based fuzzy matching as a middle ground.Follow-up chain:- “Your search returns results but the ranking is poor — irrelevant documents rank higher than relevant ones. How do you improve it?” (Weighted tsvector:
setweight(to_tsvector(title), 'A') || setweight(to_tsvector(body), 'B')— title matches rank higher. Or addts_rank_cdfor cover density ranking.) - “You need auto-complete (search-as-you-type) on a 10M-row product table. Can PostgreSQL handle this?” (Use
pg_trgmwith a GIN index for trigram similarity matching:SELECT * FROM products WHERE name % 'lapt' ORDER BY similarity(name, 'lapt') DESC LIMIT 10;. For production auto-complete with sub-50ms latency, Elasticsearch with edge-ngram tokenizer is more appropriate.) - “How do you keep an Elasticsearch index in sync with PostgreSQL?” (CDC via Debezium reading the WAL and pushing to Elasticsearch via Kafka. Or application-level dual-write with eventual consistency. Or
pg_es_fdwfor read-through. Dual-write is simplest but risks inconsistency; CDC is robust but adds infrastructure.)
38. Schema Migration Strategies
38. Schema Migration Strategies
- Add column with default (PostgreSQL 11+):
ALTER TABLE t ADD COLUMN c INT DEFAULT 0;— metadata-only, instant. The default is stored in system catalogs, not written to every row. - Add nullable column: Always instant (no default needed, no rewrite).
- Create index concurrently:
CREATE INDEX CONCURRENTLY idx ON t(col);— does not block writes. Takes longer (2 passes) but safe for production. - Add CHECK constraint NOT VALID:
ALTER TABLE t ADD CONSTRAINT chk CHECK (col > 0) NOT VALID;— does not scan existing rows. ThenALTER TABLE t VALIDATE CONSTRAINT chk;separately with a weaker lock.
- Remove column: First deploy code that no longer references the column. Then
ALTER TABLE t DROP COLUMN c;— instant in PostgreSQL (marks column as dropped, does not rewrite). But if old code is still running during rolling deploy, it will break. The expand-contract pattern: (1) stop writing to column, (2) deploy code that ignores column, (3) drop column. - Rename column: Effectively a remove + add from the application’s perspective. Use the expand-contract pattern: add new column, dual-write, migrate reads, drop old column.
- Change column type: Often requires a full table rewrite (ACCESS EXCLUSIVE lock). For large tables, use the shadow table approach: create new table with correct type, sync via triggers, atomic swap.
- Add NOT NULL to existing column: Requires a full table scan to verify (ACCESS EXCLUSIVE lock). In PostgreSQL 12+, use
SET NOT NULL NOT VALID+ separateVALIDATEto reduce lock duration.
- Flyway / Liquibase: Version control for migrations. Execute SQL scripts in order, track which migrations have been applied. Good for small-medium teams.
- gh-ost (MySQL): Online schema change via binlog replication to a shadow table. Essential for MySQL because
ALTER TABLErewrites the table. - pgroll (PostgreSQL): Schema versioning with automatic expand-contract pattern. Multiple schema versions coexist during migration.
- Sqitch: Dependency-based migration system with revert capabilities.
lock_timeout before any DDL on production tables.Follow-up chain:- “You need to split a
full_namecolumn intofirst_nameandlast_nameon a 500M-row table with zero downtime. Walk me through every step.” (Add two nullable columns (instant). Backfill in batches withUPDATE ... WHERE id BETWEEN X AND Y. Add trigger to dual-write for new inserts. Deploy new code to read from new columns. Verify no reads/writes to old column. Drop old column. This takes days to execute safely.) - “What happens if a migration fails halfway through? How do you handle rollback?” (Depends on the tool. Flyway: wrap in a transaction (DDL is transactional in PostgreSQL). MySQL: DDL is not transactional — partial state requires manual cleanup. Always test migrations on a production-size staging database first.)
- “How do you handle migrations in a microservices architecture where multiple services share a database?” (Each service should own its tables. Shared schemas are a coupling anti-pattern. If unavoidable, use versioned views — each service reads from its own view, and migrations update the underlying tables + views together.)
39. SQL Injection
39. SQL Injection
SELECT * FROM users WHERE username = ' + input + '. If input is ' OR '1'='1' --, the query becomes SELECT * FROM users WHERE username = '' OR '1'='1' --' — returns all users, bypassing authentication.Defenses (defense in depth):- Parameterized queries / prepared statements: The #1 defense.
SELECT * FROM users WHERE username = $1with the input passed as a parameter. The database treats the parameter as data, never as SQL syntax. Every modern language/ORM supports this. - ORM usage: ORMs like SQLAlchemy, Django ORM, and ActiveRecord parameterize queries by default. But raw SQL within an ORM (
Model.objects.raw(),execute()) can still be vulnerable if you concatenate strings. - Input validation / allowlisting: Validate that inputs conform to expected patterns. An
order_byparameter should only accept known column names — do not pass user input directly intoORDER BY(which cannot be parameterized in most databases). - Least privilege: The database user for the application should have only SELECT/INSERT/UPDATE/DELETE on specific tables. No
DROP TABLE, noCREATE, no access topg_catalog. Even if injection succeeds, the damage is limited. - WAF (Web Application Firewall): A secondary defense layer that detects common injection patterns. Not a primary defense — too many bypass techniques exist.
admin'-- is later used in a report query that concatenates the stored value into SQL.What weak candidates say: “Use prepared statements.” Correct but incomplete — cannot explain second-order injection, ORDER BY injection, or the importance of least privilege.What strong candidates say: Layered defense starting with prepared statements, discuss ORM edge cases, mention second-order injection, and emphasize that input validation alone is insufficient — it must be combined with parameterization.Follow-up chain:- “An ORM-heavy application is ‘safe from injection.’ Do you agree?” (No. Raw SQL, dynamic ORDER BY, LIKE patterns with unescaped user input, and bulk operations often bypass the ORM’s protection. Also, NoSQL injection exists — MongoDB
$gtoperator injection is the document-store equivalent.) - “How do you handle SQL injection in dynamic query builders where column names and table names come from user input?” (Allowlist valid column/table names in application code. Never pass user input as identifiers. Use an enum or lookup table to map user-facing names to actual column names.)
- “Your security audit found a stored procedure that concatenates input into dynamic SQL. The team says ‘it is server-side so it is safe.’ Is it?” (Absolutely not. SQL injection inside a stored procedure is equally dangerous. Dynamic SQL in stored procedures must use parameterized execution:
EXECUTE ... USINGin PostgreSQL,sp_executesqlin SQL Server.)
40. Soft Delete
40. Soft Delete
deleted_at TIMESTAMP NULL) instead of physically removing it. The record remains in the database for recovery, auditing, or undo functionality.Pros:- Recovery: Accidentally deleted data can be restored instantly (
UPDATE SET deleted_at = NULL). - Audit trail: Historical data is preserved for compliance and debugging.
- Referential integrity: Foreign key references to soft-deleted records do not break. Hard deletes cascade or leave dangling references.
- Undo functionality: User-facing “recycle bin” or “trash” features.
- Every query needs
WHERE deleted_at IS NULL: If a developer forgets this filter, deleted records appear in results. Use a default scope in the ORM (Django’sSoftDeleteManager, Rails’default_scope) or a PostgreSQL view that filters automatically. - Index bloat: A table with 100M rows where 80M are soft-deleted means indexes contain 80M useless entries. Fix: partial indexes:
CREATE INDEX idx_active ON users (email) WHERE deleted_at IS NULL;— index only contains live records, 5x smaller. - Unique constraint conflicts: If a user soft-deletes their account (email
foo@bar.com), they cannot re-register with the same email. Fix: partial unique index:CREATE UNIQUE INDEX idx_email ON users (email) WHERE deleted_at IS NULL;— uniqueness only enforced among active records. - GDPR/data retention: “Right to be forgotten” requires actual deletion, not soft delete. You may need a separate hard-delete process that runs after the audit retention period.
- Foreign key complexity: Cascading soft deletes across related tables requires application logic — the database does not natively support cascading soft deletes.
deleted_at column and filter on it.” Cannot explain partial indexes, unique constraint issues, or GDPR implications.What strong candidates say: Discuss partial indexes for performance, the unique constraint problem with solutions, mention GDPR hard-delete requirements, and recommend a tiered approach (soft delete for user-facing recovery, hard delete after retention period).Follow-up chain:- “Your soft-deleted users table has 200M total rows but only 5M active. Queries are slow despite having an index on the lookup column. Why?” (The index includes all 200M rows. The planner may choose a seq scan because the index is large. Fix: partial index
WHERE deleted_at IS NULLthat only indexes the 5M active rows.) - “Compare soft delete vs. event sourcing for maintaining history. When would you choose each?” (Soft delete: simple, good for single-entity recovery. Event sourcing: full history of all state changes, enables replaying events and time-travel queries, but much more complex infrastructure. Choose event sourcing for financial transactions, audit-critical systems, or when you need to reconstruct state at any point in time.)
- “How do you implement cascading soft deletes across related tables?” (Application-level: when soft-deleting a parent, also soft-delete children. Or use a trigger. Or use a message queue — publish a “user deleted” event and let each service handle its own soft-delete. There is no native database support for cascading soft deletes.)
5. Operations & Scaling
41. Backup Types
41. Backup Types
- Full: Whole DB.
- Differential: Changes since last Full.
- Transaction Log: Stream of ops (Point in Time Recovery).
42. CDC (Change Data Capture)
42. CDC (Change Data Capture)
43. Read Replicas
43. Read Replicas
44. Deadlocks
44. Deadlocks
45. Vertical Partitioning (Column split)
45. Vertical Partitioning (Column split)
46. Database Federation
46. Database Federation
47. Time Series DB (Influx/Timescale)
47. Time Series DB (Influx/Timescale)
48. Ledger Database (QLDB)
48. Ledger Database (QLDB)
49. Row-Level Security (RLS)
49. Row-Level Security (RLS)
50. Two-Phase Commit (2PC)
50. Two-Phase Commit (2PC)
- Prepare: All nodes vote “Yes/No” (Lock resources).
- Commit: If all Yes, persist. Slow. Blocked if Coordinator dies.
Advanced Scenario-Based Questions
S1. Deadlock Investigation: 3 AM Page, Orders Stuck
S1. Deadlock Investigation: 3 AM Page, Orders Stuck
ERROR: deadlock detected from Postgres. The checkout flow is completely broken. Revenue is bleeding at ~$12K/minute. Walk me through your investigation and resolution.What weak candidates say:- “I’d restart the database.”
- “Deadlocks happen when two transactions lock each other. I’d just add retry logic.”
- Can’t explain how to actually find the participating queries or the lock graph.
- Immediate triage (first 2 minutes): Check
pg_stat_activityforwait_event_type = 'Lock'and cross-reference withpg_locksjoined topg_classto identify the exact tables and rows involved. RunSELECT * FROM pg_locks WHERE NOT granted;to see who’s waiting, then trace back to the blocking PID. - Root cause identification: Pull the actual queries from
log_lock_waits = on(should already be enabled in production). In my experience the classic pattern is: Transaction A updatesorderstheninventory, Transaction B updatesinventorythenorders. The fix isn’t retry logic alone — it’s ensuring consistent lock acquisition order across all code paths. - Immediate fix: Terminate the victim transaction with
pg_terminate_backend(pid)targeting the least-progressed transaction. Then deploy a hotfix that reorders the lock acquisition. In one incident I worked, we found a background reconciliation job had been deployed that week with reversed table access order — it only manifested under load because the probability of interleaving was low. - Long-term: Set
deadlock_timeoutappropriately (default 1s is fine for detection, but you wantlog_lock_waitsandlock_timeoutper-session for aggressive callers). Add a Grafana panel onpg_stat_database.deadlockscounter. Consider advisory locks for known hot-path contention.
- What’s the difference between a deadlock and simple lock contention, and how does Postgres’s deadlock detector actually work internally (hint: wait-for graph cycle detection)?
- You mentioned advisory locks — when would you prefer
pg_advisory_xact_lockoverSELECT ... FOR UPDATE SKIP LOCKED, and what are the trade-offs? - How would your approach differ on MySQL InnoDB where the deadlock detection algorithm and gap locking behavior are fundamentally different?
S2. Query Planner Ignoring Your Perfect Index
S2. Query Planner Ignoring Your Perfect Index
(tenant_id, created_at, status) on a 200M-row events table. Your query filters on all three columns. But EXPLAIN ANALYZE shows a sequential scan costing 4.2 seconds instead of the expected sub-100ms index lookup. The index definitely exists. What’s going on?What weak candidates say:- “Maybe the index wasn’t created properly. I’d drop and recreate it.”
- “Run
ANALYZEon the table.” (Correct but incomplete — they can’t explain why the planner chose seq scan.) - “Force the index with a hint.” (Postgres doesn’t have traditional index hints.)
- Statistics are stale or misleading: First check
pg_statistic/pg_statsfor the columns involved. IfANALYZEhasn’t run recently (orautovacuumis behind), the planner’s row estimates will be wildly off. RunEXPLAIN(without ANALYZE) and compare therows=estimate to reality. If planner thinks the query returns 80% of the table, it rationally chooses seq scan because random I/O for 160M rows is worse than a single sequential pass. - Column correlation issue: If
tenant_idhas extreme skew — say tenant #42 owns 70% of the rows — Postgres’s default single-column statistics won’t capture the joint selectivity of(tenant_id, created_at, status). Fix: create extended statistics withCREATE STATISTICSon the correlated columns, or increasedefault_statistics_targetfor those columns. - Data type mismatch / implicit cast: One I’ve hit in production: the query was passing
statusas an integer but the column wasvarchar. The implicit cast made the index unusable.EXPLAINshowed aFilternode instead ofIndex Cond. - Index column order vs. query pattern: If the query uses a range on
created_at(e.g.,created_at > now() - interval '7 days') but an equality check onstatus, the index(tenant_id, created_at, status)is suboptimal — the range on the second column means the third column can’t be used as a seek condition. Reorder to(tenant_id, status, created_at). - Actual check:
SET enable_seqscan = off;temporarily to see what plan the optimizer would use with the index, and compare costs. This isn’t a production fix — it’s a diagnostic.
- Walk me through exactly how Postgres decides between an index scan, bitmap index scan, and sequential scan. What cost parameters (
random_page_cost,seq_page_cost,effective_cache_size) influence this decision? - You have a query with
WHERE tenant_id = $1 AND created_at BETWEEN $2 AND $3 AND status IN ('active', 'pending', 'review'). Design the optimal index and explain why. - How do partial indexes and index-only scans change your analysis here?
S3. Replication Lag Spike Taking Down Read Replicas
S3. Replication Lag Spike Taking Down Read Replicas
- “Scale up the replica hardware.”
- “Switch to synchronous replication.” (Would tank write throughput and make the problem worse.)
- Cannot distinguish between WAL shipping lag and replay lag.
- Distinguish the two lag components: Check
pg_stat_replicationon the primary — comparesent_lsn,write_lsn,flush_lsn,replay_lsn. Ifsent_lsnis far ahead ofreplay_lsnbutwrite_lsnis close tosent_lsn, the bottleneck is WAL replay on the replica, not network transfer. This is the most common case. - Long-running queries on replica: In Postgres, if a query on the replica conflicts with WAL replay (e.g., replay needs to vacuum a page the query is reading), replay pauses. Check
max_standby_streaming_delay— if it’s set to 30s (default), a single long analytics query can cause 30s of replay lag. I’ve seen this exact pattern when a BI tool ran a 2-minute dashboard refresh query on the replica. Fix: sethot_standby_feedback = on(but beware table bloat on primary), or route analytics queries to a dedicated replica with highermax_standby_streaming_delay, or cancel long queries withstatement_timeouton the replica. - Heavy write burst: A bulk data load or large
DELETEon the primary generates a WAL spike. The replica’s single-threaded WAL replay can’t keep up. In Postgres 15+ you can tunerecovery_prefetchto help, but ultimately single-threaded replay is a known bottleneck. Consider logical replication for parallelism, or table-level partitioning so deletes become fastDROP PARTITIONinstead of row-by-row. - I/O saturation on replica: Check
iostat/pg_stat_io. If the replica’s disk is saturated (common when replica also handles heavy read queries), WAL replay competes with query I/O. Move to faster storage or separate WAL directory onto dedicated disk.
- How does Postgres streaming replication differ from logical replication, and in what scenarios would you choose each?
- What happens to in-flight read queries on the replica when
max_standby_streaming_delayis exceeded? - You need to guarantee that a user who just wrote data will read their own write from a replica. How do you architect this without sending all reads to the primary?
S4. Connection Pool Exhaustion at Peak Traffic
S4. Connection Pool Exhaustion at Peak Traffic
FATAL: too many connections errors. PgBouncer’s SHOW POOLS shows all 100 connections active, 2,400 clients waiting. Response times have gone from 15ms to 12 seconds. Median query time is 8ms. How is this possible and what do you do?What weak candidates say:- “Increase
max_connectionsto 1000.” (Postgres falls over with context switching at high connection counts.) - “Add more PgBouncer instances.” (Doesn’t help if the bottleneck is server-side connections.)
- Don’t understand the difference between PgBouncer’s
transactionvssessionpooling modes.
- The math doesn’t add up — and that’s the clue: 100 connections, 8ms median query = theoretical throughput of ~12,500 queries/sec. If we’re queuing 2,400 clients, either we’re exceeding that throughput or something is holding connections longer than 8ms. The culprit is usually session-mode pooling where connections are held for the entire session, not released between transactions. First check:
SHOW POOLS;— ifpool_mode = session, switch totransactionmode. This alone can 10x effective capacity. - Idle-in-transaction connections: Run
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;. If you see 60+ connections inidle in transaction, someone’s code is opening a transaction, doing app-level work (API calls, computation), then committing. Each of those is holding a server connection hostage. Fix: setidle_in_transaction_session_timeout = '5s'in Postgres to kill offenders, and fix the application code to keep transactions short. - Prepared statements with transaction pooling: If you switch to transaction mode but the app uses prepared statements, PgBouncer can’t route statements across connections. Use
DEALLOCATE ALLat transaction end, or switch to PgBouncer 1.21+ withprepared_statement_cache_size. - Short-term triage: During the incident, identify and
pg_terminate_backend()the idle-in-transaction sessions. Set PgBouncerserver_idle_timeoutandquery_wait_timeoutto fail fast rather than queue indefinitely. 12-second queue wait is worse than a fast error + client retry. - Architecture fix: For this scale (flash sale spikes), consider a connection pooler per service, with per-service connection limits. One runaway service shouldn’t be able to starve others. We implemented this after a similar incident — the marketing analytics service had a connection leak that ate 40 of our 100 connections.
- Explain the internal difference between PgBouncer’s session, transaction, and statement pooling modes. When is statement mode dangerous?
- Why does Postgres performance degrade severely above ~300 connections, even if the queries are fast? (Hint:
ProcArrayLock, snapshot management.) - Compare PgBouncer vs. Pgpool-II vs. application-level pooling (HikariCP). When would you choose each?
S5. Partition Pruning Not Working — Full Table Scans on Partitioned Table
S5. Partition Pruning Not Working — Full Table Scans on Partitioned Table
audit_logs table by month using Postgres declarative partitioning (PARTITION BY RANGE (created_at)). Queries like SELECT * FROM audit_logs WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01' should hit only one partition, but EXPLAIN shows it scanning all 48 partitions. What went wrong?What weak candidates say:- “Partitioning doesn’t work well in Postgres.”
- “The data types probably don’t match.” (Partially right but can’t elaborate.)
- Don’t know what partition pruning is or how to verify it in
EXPLAINoutput.
- Check for runtime vs. plan-time pruning: In the
EXPLAINoutput, look forAppendwith a list of child scans. If all 48 partitions appear, pruning failed entirely. If using a parameterized query ($1instead of literal), Postgres can only do runtime pruning (available since PG 11), and you needenable_partition_pruning = on(default, but verify). - Data type mismatch is the #1 cause: If
created_atistimestamptzbut the query literal is adateortimestamp without time zone, the implicit cast can prevent the planner from matching the value to the partition boundary. I’ve seen this exact issue — partition bounds were defined astimestamptzbut the ORM was sendingtimestampvalues. Fix: explicit cast (WHERE created_at >= '2025-01-01'::timestamptz) or ensure the ORM type-maps correctly. - Expression mismatch: If the partition key is
created_atbut the query applies a function —WHERE date_trunc('month', created_at) = '2025-01-01'— the planner can’t prune because the function transforms the column. TheWHEREclause must directly reference the partition key column with simple comparison operators. - Constraint exclusion vs. partition pruning: In older PG (pre-11), partition pruning relied on
constraint_exclusion, which only worked at plan time and not with prepared statements. Verify PG version and that you’re using native declarative partitioning, not the old inheritance-based partitioning. - Default partition trap: If you have a
DEFAULTpartition, Postgres must scan it because any value could exist there. This is correct behavior but often surprises people. If the default partition has millions of rows, it adds significant overhead. - Verification: Run
EXPLAIN (COSTS OFF)and look forSubplans Removed: 47— that confirms 47 of 48 partitions were pruned. If you seeSubplans Removed: 0, pruning is failing.
- You need to add a new monthly partition for next month. How do you automate this, and what happens if queries arrive for a month with no partition and no default partition?
- What’s the performance impact of having 500+ partitions vs. 48? At what point does partition management overhead outweigh pruning benefits?
- Compare Postgres declarative partitioning with MySQL’s partitioning and with TimescaleDB hypertables for time-series data.
S6. MVCC Bloat: Table Grew 10x, Queries Slowing Down
S6. MVCC Bloat: Table Grew 10x, Queries Slowing Down
transactions table holds 50M active rows but pg_total_relation_size reports 380 GB — roughly 10x what you’d expect for 50M rows. Queries that were 20ms are now 200ms. You suspect bloat. Autovacuum is enabled and shows last_autovacuum was 2 hours ago. Explain what happened and how you fix it.What weak candidates say:- “Just run
VACUUM FULLon the table.” (Correct but takes an exclusive lock, potentially hours of downtime on 380 GB.) - “Delete old rows.” (The rows are already deleted — that’s the problem.)
- Can’t explain why autovacuum didn’t prevent this.
- Why autovacuum failed to prevent this: Autovacuum was running, but it wasn’t effective. The most common culprit is a long-running transaction or an abandoned replication slot holding back the
xminhorizon. Check:SELECT xmin, pid, state, query_start FROM pg_stat_activity ORDER BY xmin ASC NULLS LAST LIMIT 5;. If there’s a transaction from 6 hours ago inidle in transactionstate, no dead tuples created after that transaction started can be vacuumed — autovacuum runs but removes zero tuples. I once traced a 400 GB bloat incident to a single forgottenBEGINin a developer’s pgAdmin session that had been open for 3 days. - Replication slot check:
SELECT slot_name, active, restart_lsn, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots;. An inactive replication slot (maybe from a decommissioned replica) prevents WAL and tuple cleanup. Drop it if it’s orphaned. - Quantify the bloat: Use
pgstattupleextension:SELECT * FROM pgstattuple('transactions');gives youdead_tuple_percent. Or use thepg_bloat_checkquery from the Postgres wiki. If dead tuple ratio is above 50%, you’re in bad shape. - Fix without downtime:
VACUUM FULLrequiresACCESS EXCLUSIVElock — not an option on a production table. Instead usepg_repack(orpg_squeeze): these rebuild the table online by creating a new copy, replaying changes via triggers, then swapping. I’ve usedpg_repackon a 500 GB table — it took 4 hours but had zero downtime. Setstatement_timeout = 0for the repack session. - Prevention: Tune autovacuum per-table for hot tables:
ALTER TABLE transactions SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 2);. Defaultscale_factorof 0.2 means autovacuum waits until 20% of the table is dead — on 50M rows that’s 10M dead tuples before it triggers. Monitorn_dead_tupinpg_stat_user_tablesand alert when it exceeds a threshold. Setidle_in_transaction_session_timeoutto prevent the root cause.
- Explain how MVCC’s
xmin/xmaxtuple headers work and why the visibility map matters for index-only scans on a bloated table. - What is transaction ID wraparound, why does Postgres need aggressive autovacuum to prevent it, and what happens if you hit the 2-billion transaction limit?
- How does MySQL InnoDB handle MVCC cleanup differently (purge thread, undo logs) and does it suffer the same bloat patterns?
S7. Backup Restoration Under Pressure: Production Data Corruption
S7. Backup Restoration Under Pressure: Production Data Corruption
UPDATE without a WHERE clause on the accounts table in production at 14:32 UTC. All 2.3M accounts now have balance = 0. The team is panicking. You have daily base backups (taken at 02:00 UTC) and continuous WAL archiving. You need to recover the accounts table to its state at 14:31 UTC without losing any other data committed between 02:00 and now. Walk me through the recovery.What weak candidates say:- “Restore the full backup from 02:00.” (This loses 12 hours of data across ALL tables.)
- “Use
pg_dumpfrom the backup.” (Pg_dump is logical backup — they likely have physical backups withpg_basebackup.) - Don’t know what PITR is or how WAL replay works.
- Do NOT restore over production. Spin up a separate recovery instance. Use
pg_basebackuprestore + WAL replay to a parallel server. This is critical — you never risk making things worse on the live database. - Point-in-Time Recovery (PITR): Restore the 02:00 base backup to the recovery instance. In
recovery.conf(orpostgresql.confin PG 12+), setrecovery_target_time = '2026-04-10 14:31:00 UTC'andrecovery_target_action = 'pause'. Start the recovery instance — it replays WAL from the archive up to 14:31 UTC and pauses. Verify theaccountstable looks correct. - Extract just the table: From the recovery instance, dump only the
accountstable:pg_dump -t accounts --data-only --column-inserts recovery_db > accounts_recovery.sql. Or for speed, useCOPY accounts TO '/tmp/accounts.csv'on the recovery instance. - Restore into production: On the live production database, within a transaction:
BEGIN; DELETE FROM accounts; COPY accounts FROM '/tmp/accounts.csv'; COMMIT;or usepg_restorewith--data-only --table=accounts. Verify row counts and spot-check balances. - Verify and communicate: Compare row count and checksum (
SELECT count(*), sum(balance) FROM accounts) between recovery instance and production after restore. Write an incident report. Then set uppg_auditor a pre-deploy SQL review gate so unqualifiedUPDATEstatements get caught. - Time budget: In practice, restoring a base backup and replaying 12 hours of WAL for a moderately busy database takes 30-90 minutes depending on disk speed and WAL volume. The
pg_dump/COPYof a single table is fast — maybe 2-3 minutes for 2.3M rows. Total time to recovery: ~1-2 hours. Communicate this timeline to stakeholders immediately so they stop asking “is it fixed yet” every 5 minutes. - Prevention: Enable
pg_stat_statementsand alert on DML without WHERE clauses in staging gate. UseSET statement_timeoutfor deployment scripts. Consider aBEFORE UPDATEtrigger that rejects mass updates above a row threshold (a “blast radius limiter”).
- What’s the difference between physical backups (
pg_basebackup) and logical backups (pg_dump), and why can’t you do PITR withpg_dumpalone? - The WAL archive has a gap — one segment file is missing or corrupted. What happens during replay, and how do you handle it?
- Your database is 4 TB and the incident just happened. PITR will take 3 hours. The business wants data back in 30 minutes. What are your options? (Hint: think about logical replication, flashback queries with
pg_dirtyread, or selective WAL replay.)
S8. Zero-Downtime Schema Migration on a Billion-Row Table
S8. Zero-Downtime Schema Migration on a Billion-Row Table
region_code VARCHAR(5) NOT NULL DEFAULT 'US' to a user_events table with 1.2 billion rows in Postgres 14. The table receives 5K inserts/second. Your SLA says zero downtime. How do you do it?What weak candidates say:- “Just run
ALTER TABLE ADD COLUMN.” (Not wrong in PG 11+ for defaults, but can’t explain why it’s fast now vs. before, and doesn’t consider theNOT NULLconstraint behavior.) - “Create a new table, migrate data, swap.” (Massively over-engineered for this specific case.)
- Don’t know about Postgres’s metadata-only ADD COLUMN optimization.
- Postgres 11+ fast-path ADD COLUMN: Since PG 11,
ALTER TABLE ADD COLUMN ... DEFAULT 'US'with a non-volatile default is a metadata-only operation — it does NOT rewrite the table. The default is stored inpg_attrdefand returned for rows that don’t physically have the column yet. This takes milliseconds regardless of table size. This is one of the most important Postgres optimizations to know. - The NOT NULL nuance: Adding
NOT NULLwith a default together in a singleALTER TABLEstatement works instantly in PG 11+ because every row logically has the default value. But if you try to add the column nullable first, backfill, then addNOT NULLconstraint separately — theSET NOT NULLrequires a full table scan to verify no nulls exist. On 1.2 billion rows, that scan takes anACCESS EXCLUSIVElock for potentially 30+ minutes. So the correct approach is: single statementALTER TABLE user_events ADD COLUMN region_code VARCHAR(5) NOT NULL DEFAULT 'US';— instant, zero downtime. - When you CAN’T use the fast path: If the default is volatile (e.g.,
DEFAULT now()orDEFAULT uuid_generate_v4()), Postgres must rewrite every row. In that case, you need the multi-step approach: (1) Add column as nullable with no default. (2) Backfill in batches of 10K-50K rows withUPDATE ... WHERE id BETWEEN X AND Yandpg_sleep(0.1)between batches to avoid overwhelming replication and I/O. (3) AddNOT NULLconstraint usingNOT VALIDfirst, thenVALIDATE CONSTRAINTseparately —VALIDATEonly takes aSHARE UPDATE EXCLUSIVElock, notACCESS EXCLUSIVE, so reads and writes continue. - Lock timeout safety: Always set
SET lock_timeout = '3s';before any DDL on a hot table. If theALTER TABLEcan’t acquire its lock within 3 seconds (because a long query is holding a conflicting lock), it’s better to fail and retry than to queue behind the lock and block all subsequent queries. I’ve seen a 2-secondALTER TABLEcause a 10-minute outage because it queued behind a lock, and 500 other queries queued behind it. - Tooling: For complex multi-step migrations, use
gh-ost(GitHub’s tool for MySQL) orpg-osc/pgrollfor Postgres. These create shadow tables, sync via triggers, and do an atomic swap. Flyway or Liquibase handle migration ordering but don’t solve the lock problem — you still need to write lock-safe SQL.
- You need to change the column type from
VARCHAR(5)to anENUMtype on the same billion-row table. This does require a rewrite. Walk me through the zero-downtime approach. - How does MySQL’s
ALTER TABLEbehavior differ? Why does MySQL often require tools likegh-ostorpt-online-schema-changefor operations that are instant in Postgres? - You ran
ALTER TABLE ... ADD COLUMNand it’s been hanging for 30 seconds on a table that should be instant. What’s happening? (Hint: lock queue,pg_locks,pg_stat_activity, conflictingACCESS SHAREfrom long queries.)