Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Database Interview Questions (50+ Detailed Q&A)

Senior vs Staff expectations for database interviews:Senior Engineer: Designs schemas and writes queries for production workloads. Can reason about index selection, isolation levels, and query plans. Debugs slow queries with EXPLAIN ANALYZE. Understands MVCC, locking strategies, and replication lag. Chooses between SQL and NoSQL for a given use case with clear reasoning.Staff Engineer: Owns the database strategy across the organization. Writes migration runbooks for billion-row tables. Designs the data platform architecture — connection pooling topology, read replica routing, sharding strategy, backup/PITR policy. Makes build-vs-buy decisions on NewSQL, evaluates vendor lock-in, and sets org-wide standards for schema evolution, data lifecycle, and compliance (GDPR deletion, audit trails). Mentors teams on database internals and reviews critical schema changes.

1. Fundamentals & SQL

What interviewers are really testing: Whether you understand ACID beyond the acronym — specifically, how each property is implemented under the hood, the performance cost of each guarantee, and when you would intentionally relax them. Senior candidates connect ACID to WAL mechanics, isolation levels, and real production trade-offs.Answer:
  • Atomicity: All or Nothing. A transaction either fully commits or fully rolls back — there is no partial state. Under the hood, PostgreSQL implements this via the Write-Ahead Log (WAL): every change is written to the WAL before being applied to data pages. On crash, the WAL is replayed to complete or roll back in-flight transactions. MySQL’s InnoDB uses a similar redo/undo log system. The practical cost: every write hits the WAL first (sequential disk I/O), which is why fsync settings directly impact write throughput.
  • Consistency: The database enforces all constraints (foreign keys, unique, CHECK, NOT NULL) before and after every transaction. Critical nuance: “consistency” in ACID is different from “consistency” in CAP theorem. ACID consistency means application-defined invariants are maintained. CAP consistency means all nodes see the same data simultaneously. Conflating these two is a common interview mistake.
  • Isolation: Concurrent transactions do not see each other’s uncommitted changes — but the level of isolation is tunable. PostgreSQL defaults to Read Committed (you can see other transactions’ committed changes mid-query). MySQL InnoDB defaults to Repeatable Read (snapshot at transaction start). Serializable is the strongest level but costs 30-50% throughput due to lock contention and serialization failures requiring retries.
  • Durability: Once committed, data survives crashes. The WAL must be flushed to disk (fsync) before the commit is acknowledged. You can relax this with synchronous_commit=off in PostgreSQL for 2-5x write throughput improvement — at the risk of losing the last ~600ms of commits on a crash. Analytics pipelines and logging systems often make this trade-off intentionally.
Red flag answer: Reciting the four words without mentioning isolation levels, WAL mechanics, or any scenario where you would relax a guarantee. Also, confusing ACID consistency with CAP consistency — this is a surprisingly common mistake even among experienced candidates.Follow-up questions:
  • “Walk me through what happens internally when PostgreSQL commits a transaction — from the COMMIT statement to the data being durable on disk.” (Tests whether they understand WAL flush, checkpoint process, and the role of fsync.)
  • “You have a financial system that needs Serializable isolation for balance transfers but also handles 50K TPS in reads. How do you architect this?” (Tests CQRS thinking — separate read/write paths with different isolation guarantees.)
  • “When would you intentionally turn off synchronous_commit, and what is the blast radius if the server crashes?” (Tests production judgment — acceptable for analytics writes, never for payment processing.)
Structured Answer Template — ACID questions:
  1. Name each property and pair it with its implementation mechanism (A = WAL, C = constraints, I = MVCC/locks, D = fsync).
  2. Call out that ACID-C and CAP-C are different — interviewers love this nuance.
  3. Explain one production knob per property (isolation level, synchronous_commit).
  4. Describe one scenario where you’d relax a guarantee and why.
  5. Close with: “Every guarantee has a measurable throughput cost — engineers pick levels, not defaults.”
Real-World Example: Stripe’s ledger runs Postgres at SERIALIZABLE isolation for every balance-affecting transaction and couples that with synchronous_commit = on plus a synchronous replica — they accept 3-5 ms of added latency in exchange for never losing a committed charge. Their analytics warehouse, in contrast, runs on asynchronous replication with synchronous_commit = off because a few seconds of log loss is acceptable for dashboards.
Big Word Alert — Write-Ahead Log (WAL): An append-only log where changes are written before the data pages are modified. On crash, the DB replays the WAL to rebuild the committed state. Enables both Atomicity and Durability cheaply.
Big Word Alert — MVCC (Multi-Version Concurrency Control): Readers see a consistent snapshot of the database as of their transaction’s start time — writers create new row versions instead of overwriting, so readers never block writers and vice versa.
Follow-up Q&A Chain:
  • Q: Can you have durability without Atomicity? A: Technically yes — a log-only store with no transactions. But most systems bundle them because a durable-but-partial write is worse than no write.
  • Q: Why is ACID-C considered the “weakest” property by some database theorists? A: Because it’s the only one the database doesn’t fully own — it depends on the constraints the developer declares. Miss a foreign key, and the DB can’t enforce what you didn’t tell it about.
Further Reading:
  • postgresql.org/docs — “Write-Ahead Logging (WAL)” chapter
  • martin.kleppmann.com — Designing Data-Intensive Applications Chapter 7, “Transactions”
What interviewers are really testing: Whether you understand the concrete anomalies each level prevents, can explain WHY the defaults differ between PostgreSQL and MySQL, and have dealt with isolation-related bugs in production. This is where theory meets real debugging.Answer:
LevelDirty ReadNon-Repeatable ReadPhantom ReadDefault InPerformance
Read UncommittedPossiblePossiblePossibleAlmost never usedFastest
Read CommittedPreventedPossiblePossiblePostgreSQLFast
Repeatable ReadPreventedPreventedPossible (MySQL) / Prevented (PG)MySQL InnoDBModerate
SerializablePreventedPreventedPreventedManual selectionSlowest (30-50% TPS drop)
The anomalies explained with concrete examples:
  • Dirty Read: Transaction A updates a salary to 100Kbuthasnotcommitted.TransactionBreads100K but has not committed. Transaction B reads 100K. Transaction A rolls back. B acted on data that never existed. This is why Read Uncommitted is almost never used in production.
  • Non-Repeatable Read: Transaction A reads a row (salary = $80K). Transaction B updates and commits it to $100K. Transaction A reads the same row again and gets $100K. The same query returned different results within one transaction.
  • Phantom Read: Transaction A runs SELECT COUNT(*) FROM employees WHERE dept='eng' and gets 50. Transaction B inserts a new engineer and commits. Transaction A runs the same query and gets 51. New rows “appeared” in the result set.
Why PostgreSQL and MySQL chose different defaults: PostgreSQL defaults to Read Committed because its MVCC implementation is mature enough that most applications work correctly at this level, and it avoids the overhead of snapshot maintenance. MySQL InnoDB defaults to Repeatable Read partly for historical reasons (binary log-based replication required it) and because InnoDB’s gap locking mechanism can prevent phantoms even at this level — making MySQL’s Repeatable Read closer to Serializable than the SQL standard requires.Production war story: At a fintech company, a race condition in the payment service caused double-charges. Two concurrent transactions both read a balance of 100,bothapproveda100, both approved a 80 charge, and both committed — overdrawing the account. The root cause was Read Committed isolation on the balance check. The fix was SELECT ... FOR UPDATE (pessimistic locking) on the balance row, which serialized access to that specific row without upgrading the entire transaction to Serializable.Red flag answer: Listing the levels without explaining the anomalies, or not knowing which database defaults to which level. Also, saying “just use Serializable for everything” without acknowledging the throughput cost.Follow-up questions:
  • “How does MVCC actually work in PostgreSQL? What are xmin and xmax?” (Tests internals knowledge — every row has a creation transaction ID (xmin) and deletion transaction ID (xmax). Each transaction sees a snapshot based on these IDs.)
  • “A developer reports that two concurrent API requests are creating duplicate records despite a unique constraint. What isolation level are you probably running, and how do you fix it?” (Tests debugging skills — likely Read Committed with a check-then-insert pattern. Fix: use INSERT … ON CONFLICT, or use advisory locks, or upgrade to Serializable for that specific transaction.)
  • “What is the performance difference between Serializable and Read Committed in a high-throughput OLTP system?” (Tests production awareness — Serializable can reduce throughput by 30-50% due to serialization failures and retries, and requires application-level retry logic.)
Structured Answer Template — Isolation level questions:
  1. Name the three phenomena (dirty, non-repeatable, phantom) and which level prevents each.
  2. State the defaults: Postgres = Read Committed, MySQL InnoDB = Repeatable Read — and why.
  3. Give one real race condition you’ve seen at Read Committed (e.g., check-then-insert double-submit).
  4. Propose the targeted fix: SELECT ... FOR UPDATE, advisory lock, or upgrade isolation only for that transaction.
  5. Quantify: “Serializable costs 30-50% throughput; don’t apply it globally.”
Real-World Example: When Shopify migrated their inventory counter from READ COMMITTED to SERIALIZABLE on a specific stored procedure during flash sales, they saw a 35% drop in transactions-per-second on that code path — but double-sells dropped to zero. They localized the change to one function rather than the whole transaction, which is the pattern most senior engineers reach for.
Big Word Alert — Phantom read: When the same SELECT ... WHERE clause returns a different number of rows within one transaction because another transaction inserted (or deleted) matching rows in between. Not prevented by plain REPEATABLE READ in the SQL standard — only by SERIALIZABLE.
Big Word Alert — Gap locking: MySQL InnoDB’s trick of locking the “gap” between index values so that inserts into that gap are blocked. It’s how InnoDB prevents phantoms at REPEATABLE READ, pushing that level closer to SERIALIZABLE than the standard requires.
Follow-up Q&A Chain:
  • Q: Why doesn’t everyone just run at Serializable all the time? A: Because Serializable aborts transactions under contention and requires you to retry them. On a hot row, your throughput can collapse to the serial rate — often 10-30% of what Read Committed achieves.
  • Q: If I use SELECT ... FOR UPDATE, do I still need a higher isolation level? A: Often no — pessimistic row locks give you the correctness you want without paying the global cost of Serializable. Use the narrowest tool that solves the specific race.
Further Reading:
  • postgresql.org/docs — “Transaction Isolation” chapter
  • “A Critique of ANSI SQL Isolation Levels” by Berenson et al. — the canonical paper explaining why the SQL standard’s levels are ambiguous
What interviewers are really testing: Whether you understand the physical storage implications of indexing — not just “indexes make queries faster” but HOW they change data layout on disk, why you can only have one clustered index, and the performance implications of each type for different query patterns.Answer:
  • Clustered Index (Primary Key in InnoDB): The data rows themselves ARE stored inside the index leaf nodes, sorted by the index key. This means the table data is physically ordered on disk by the clustered index. You can only have one per table because data can only be physically sorted one way. In MySQL InnoDB, the primary key IS the clustered index — if you do not define one, InnoDB creates a hidden 6-byte row ID as the clustered key. Range scans on the clustered key are extremely fast because sequential rows are stored on adjacent disk pages (spatial locality).
  • Non-Clustered Index (Secondary Index): Leaf nodes contain the indexed columns plus a pointer back to the actual data row. In InnoDB, that “pointer” is the primary key value (not a physical row address), which means every non-clustered index lookup requires a second lookup through the clustered index to find the actual row — this is called a bookmark lookup or key lookup. In PostgreSQL (which uses heap tables, not clustered indexes), the pointer is a physical tuple ID (ctid).
The performance implication most candidates miss: In InnoDB, a wide primary key (e.g., a UUID) bloats every secondary index because every secondary index entry stores a copy of the primary key. A table with 5 secondary indexes and a 36-byte UUID primary key wastes 180 bytes per row across the secondary indexes compared to using a 4-byte integer. This is why auto-increment integers are still preferred as primary keys in high-volume MySQL systems, even when UUIDs are used as business identifiers.Red flag answer: “Clustered means the index is on the primary key” — this conflates the concept with the MySQL-specific implementation. In SQL Server, you can create a clustered index on any column. Also, not knowing that secondary index lookups in InnoDB require a double lookup.Follow-up questions:
  • “Why does PostgreSQL not have clustered indexes in the same way MySQL does?” (PostgreSQL uses heap tables where rows are stored in insertion order. You can run CLUSTER to physically reorder once, but it is not maintained on subsequent writes.)
  • “A query on a secondary index is slow despite the index existing. What could cause this?” (The index is not covering — it requires bookmark lookups for each row. If the query returns many rows, the random I/O from bookmark lookups is slower than a sequential scan. The optimizer may choose a seq scan instead.)
  • “How would you design the primary key for a time-series table with billions of rows that needs both fast inserts and fast range queries?” (Tests practical judgment — auto-increment for fast sequential inserts with a secondary index on timestamp, or a composite key with timestamp prefix for clustered range scans, trading insert performance for query speed.)
Structured Answer Template — Clustered vs Non-Clustered:
  1. State the physical-layout implication first (“clustered = the table IS the index leaves”).
  2. Note the per-engine reality: InnoDB has exactly one clustered index (the PK); Postgres has none.
  3. Explain the secondary-index cost: InnoDB bookmark lookup costs a second B+Tree traversal.
  4. Connect to PK choice: wide PK (UUID) inflates every secondary index.
  5. End with a design rule: “For high-write tables, keep the PK narrow and time-sortable.”
Real-World Example: Notion’s block table uses a 16-byte UUIDv7 as the primary key. They deliberately chose UUIDv7 over UUIDv4 because the time-ordered prefix means new rows always insert at the rightmost leaf of the clustered index — avoiding the page-split storm that pure-random UUIDv4 would cause on a 50-billion-row table. The 16-byte PK still propagates into every secondary index, but they measured the extra bytes costing less than the write amplification of random inserts.
Big Word Alert — Bookmark lookup (key lookup): When a secondary-index scan finds the matching index entry and then has to do a second lookup — via the clustered index in InnoDB or the heap tuple ID in Postgres — to fetch the row’s actual columns. Each bookmark lookup is another random I/O.
Big Word Alert — Heap table: Postgres’s default table storage where rows sit in insertion order (the “heap”), and all indexes — including the primary key — are separate structures pointing into the heap via tuple IDs (ctid). Contrast with InnoDB where the clustered index is the table.
Follow-up Q&A Chain:
  • Q: Can I have a “clustered index” in Postgres? A: Sort of. CLUSTER table USING index physically reorders the heap once, but inserts and updates aren’t kept in order afterward. For true clustered-like behavior, use BRIN on a naturally-ordered column (timestamp) or partition the table.
  • Q: Why does InnoDB make the PK the clustered index rather than letting you pick any column? A: Because clustering on a non-unique column would require all secondary indexes to carry a disambiguation key. Making the PK both unique and the cluster key keeps the engine simple.
Further Reading:
  • use-the-index-luke.com — “The Clustered Index” chapter by Markus Winand
  • postgresql.org/docs — “Index Types” and “Storage” chapters
What interviewers are really testing: Whether you understand the data structure that underpins virtually every relational index, why B+Tree won over B-Tree for database storage, and how the leaf-level linked list enables the range scans that power most real queries. Candidates who only say “B+Tree is better” without explaining the I/O implications have only surface knowledge.Answer:
  • B-Tree: Data pointers are stored in both internal nodes and leaf nodes. This means a search can terminate at an internal node if the key is found there — fewer hops in the best case. However, internal nodes are larger (they carry data payloads), so fewer keys fit per node, which increases tree height and disk I/O.
  • B+Tree: Data pointers are stored ONLY in leaf nodes. Internal nodes contain only keys and child pointers, which means each internal node can hold far more keys — typically 100-500 keys per 8KB page. This keeps the tree extremely shallow (3-4 levels for billions of rows). Leaf nodes are linked into a doubly-linked list, which is the key innovation: once you find the start of a range, you walk the linked list sequentially instead of traversing back up the tree.
Why databases chose B+Tree:
  • Predictable performance: Every lookup traverses the same number of levels (root to leaf). B-Tree lookups vary depending on where in the tree the key sits. Predictability matters for query planning.
  • Range scan efficiency: SELECT * FROM orders WHERE created_at BETWEEN '2025-01-01' AND '2025-01-31' finds the first key via tree traversal (3-4 random I/Os), then walks the leaf linked list sequentially — sequential I/O is 100x faster than random I/O on spinning disk and still 5-10x faster on SSD.
  • Higher fanout: With data only at leaves, internal nodes are smaller, so more keys fit per page, reducing tree height. A 3-level B+Tree with 500 keys per internal node can index 125 million rows. The same depth B-Tree with 50 keys per node (because internal nodes carry data) only indexes 125,000 rows.
Concrete example: PostgreSQL’s B-Tree index implementation (nbtree) uses 8KB pages. Each internal page holds roughly 300-500 index entries (depending on key size). For a table with 100 million rows, the index is typically 3 levels deep — meaning any point lookup requires exactly 3 page reads (often 1, since root and first-level pages are cached in shared_buffers).Red flag answer: “B+Tree stores data only in leaves” without explaining why that matters (fanout, range scans, predictable depth). Also, saying “B-Tree and B+Tree are basically the same” — the leaf linked list is a fundamental architectural difference.Follow-up questions:
  • “What happens to a B+Tree when you insert a key into a full leaf page?” (Page split: the leaf splits into two, and the middle key is promoted to the parent. If the parent is also full, it splits too — cascading up. This is why random inserts into a B+Tree index are slower than sequential inserts — random inserts cause more page splits.)
  • “Why are UUIDs as primary keys problematic for B+Tree indexes?” (UUIDs are random, so every insert targets a random leaf page — causing constant page splits and poor cache utilization. Auto-increment IDs always insert at the rightmost leaf, which is hot in cache. This is why UUIDv7 (time-sorted) was created.)
  • “How does a B+Tree index interact with the buffer pool / shared_buffers?” (Frequently accessed upper tree levels stay pinned in memory. For a well-cached index, a point lookup may only require 1 actual disk read — the leaf page — because root and internal pages are already in the buffer pool.)
Write amplification deep-dive: A single logical row update can trigger many physical writes: (1) the heap tuple gets a new version (MVCC), (2) every secondary index pointing to that row gets updated, (3) WAL records each change, (4) full-page images may be written to WAL on the first modification after a checkpoint. A row with 5 indexes can turn one UPDATE into 10+ physical writes. Postgres HOT (Heap-Only Tuple) updates avoid index updates when no indexed column changed and the new version fits on the same page — which is why fillfactor tuning matters.
Senior vs Staff perspective
  • Senior: Can explain B+Tree structure, page splits, and why UUIDv4 primary keys hurt.
  • Staff: Thinks about write amplification at the storage layer — LSM-tree vs B+Tree tradeoffs (RocksDB, Cassandra use LSM for write-heavy workloads), fillfactor tuning for HOT updates, WAL compression settings, and how checkpoint_timeout + full_page_writes interact to create IO spikes. Also connects this to cost: high write amplification -> more IOPS -> more expensive storage tier.
What interviewers are really testing: Can you reason about the storage engine beneath the query language, and pick the right engine for a workload’s read:write ratio?Answer: Write amplification = (bytes actually written to storage) / (bytes of user data changed). A ratio of 20 means writing 1KB of user data caused 20KB of physical I/O.Sources of write amplification in B+Tree engines (Postgres, InnoDB):
  • WAL writes (durability): every change written twice — once to WAL, once to data pages.
  • Full-page writes: after a checkpoint, the first modification to a page writes the entire 8KB page to WAL (torn-page protection).
  • Index updates: each secondary index adds its own writes.
  • MVCC tuple versioning: updates create new versions; old versions stay until VACUUM.
  • Page splits: inserting into a full page splits it in two, rewriting both.
LSM-Tree engines (RocksDB, Cassandra, ScyllaDB, HBase):
  • Writes go to an in-memory memtable + append-only WAL. O(1) write latency.
  • Memtables flush to immutable SSTables on disk (Sorted String Tables).
  • Background compaction merges SSTables, rewriting data multiple times as it moves through levels (L0, L1, … L6).
  • Write amplification is high (10-30x) but writes are sequential, which flash and HDD love. Read amplification is also high because a read may hit multiple SSTables before finding the value — Bloom filters per SSTable mitigate this.
When to pick each:
WorkloadPick
OLTP with balanced read/write, rich queriesB+Tree (Postgres, MySQL)
Write-heavy time-series, logs, event sourcingLSM (Cassandra, RocksDB, ClickHouse)
Read-heavy with large point lookupsB+Tree with good caching
Need to sustain >1M writes/secLSM
Tuning B+Tree write amplification in Postgres:
  • Lower fillfactor (e.g., 80%) to leave room for HOT updates -> fewer index updates.
  • Batch UPDATEs into transactions to amortize WAL overhead.
  • Use UNLOGGED tables for ephemeral data (no WAL, no crash safety).
  • Tune checkpoint_timeout up (15-30 min) to reduce full-page-write frequency, at the cost of longer recovery time.
Senior vs Staff perspective
  • Senior: Measures write amplification via pg_stat_wal (wal_bytes), tunes fillfactor, knows LSM exists.
  • Staff: Designs the storage tier for the company — B+Tree for OLTP, LSM-backed (ClickHouse/Cassandra) for analytics/time-series, and models the economic tradeoff: “this workload writes 50K ops/sec on Postgres and costs $8K/month in IOPS; moving to Cassandra cuts IOPS cost 70% but requires rewriting app queries.” Also makes the call on when to move stateful systems to local NVMe vs EBS vs object-storage-backed tiers.
Follow-up chain:
  1. “Your write-heavy Postgres table has 10 indexes and write throughput plateaus at 5K TPS. How do you diagnose and fix?” — (a) Check pg_stat_statements for write-heavy statements. (b) Count heap_blks_hit vs heap_blks_read on the table — cold cache hurts. (c) Drop indexes you do not use (pg_stat_user_indexes.idx_scan=0). (d) Consider partitioning by time so older partitions have fewer writes. (e) Last resort: evaluate moving to an LSM-backed system.
  2. “Why does Cassandra handle time-series writes better than Postgres?” — LSM appends sequentially; Cassandra partitions by partition key distributing writes across the cluster; tombstones + compaction handle deletes without VACUUM; no global transaction coordinator means writes do not serialize. Postgres writes go through WAL + random B+Tree page updates, and a single hot partition serializes.
  3. “How does checkpoint tuning affect write amplification in Postgres?” — Longer checkpoint_timeout -> fewer full-page writes -> lower write amplification, but longer recovery after crash. max_wal_size controls how much WAL accumulates before a forced checkpoint. The tradeoff is steady-state IO vs RPO/RTO.
Work-sample scenario: You are picking a database for a new service that ingests 500K events/second and needs to query them by time range. Walk through your decision.
  • Postgres with TimescaleDB: works up to ~200K/s on solid hardware; hypertables partition automatically; SQL is familiar. But 500K/s pushes the B+Tree engine.
  • ClickHouse: LSM-like MergeTree, columnar storage, handles 500K/s per node easily. Trade-off: eventually-consistent, not ACID, SQL is MySQL-flavored.
  • Cassandra/ScyllaDB: handles 500K/s across nodes; excellent for time-partitioned data; requires defining queries up front (partition key design).
  • Recommendation: ClickHouse if analytics-style range scans dominate; Cassandra if you need sub-10ms point lookups by partition key.
What weak candidates say: “All databases do roughly the same thing at this level.” — Storage engine choice shapes everything above it.What strong candidates say: “Write amplification is the hidden cost of durability. B+Tree engines give me rich queries and ACID but cost 5-10x write amplification. LSM engines give me 20-30x amplification but with sequential-only IO, which is why they scale better on commodity hardware. I pick based on the read:write ratio and query shape: OLTP = B+Tree, append-heavy = LSM.”
What interviewers are really testing: Whether you can explain normalization with concrete examples (not abstract definitions), understand the real-world trade-offs between normalized and denormalized schemas, and know when to intentionally break normalization rules. Textbook recitation without practical judgment is a red flag.Answer:
  • 1NF (First Normal Form): Every column holds atomic (indivisible) values — no arrays, no comma-separated lists, no nested structures in a single field. Example violation: storing tags = "python,java,go" in a single column. Fix: create a separate user_tags junction table. PostgreSQL’s ARRAY and JSONB types technically violate 1NF but are pragmatically useful — this nuance shows production maturity.
  • 2NF (Second Normal Form): 1NF + no partial dependencies on a composite primary key. Only relevant for tables with composite keys. Example violation: table (student_id, course_id, student_name, grade)student_name depends only on student_id, not the full composite key. Fix: separate into students(student_id, student_name) and enrollments(student_id, course_id, grade).
  • 3NF (Third Normal Form): 2NF + no transitive dependencies. No non-key column should depend on another non-key column. Example violation: orders(order_id, customer_id, customer_city, customer_zip)customer_city depends on customer_zip, not on order_id. Fix: move city/zip to a customers table.
  • Denormalization: Intentionally adding redundancy for read performance. Star schemas in data warehouses are the classic example: a fact table (sales) with denormalized dimension attributes (customer_name, product_category) avoids expensive JOINs at query time. Amazon’s product pages denormalize heavily — the product detail is not assembled from 15 JOINed tables; it is pre-computed and stored as a single document.
The production trade-off: Normalization optimizes for write correctness (single source of truth, no update anomalies). Denormalization optimizes for read speed (fewer JOINs, simpler queries). Most production OLTP systems target 3NF for core entities and selectively denormalize hot read paths. OLAP/data warehouse systems are typically denormalized (star/snowflake schema) from the start.Red flag answer: Reciting “1NF is atomic, 2NF is no partial dependency, 3NF is no transitive dependency” without concrete examples or without mentioning when denormalization is appropriate. Also, dogmatically insisting “always normalize” or “always denormalize” — it depends on the read/write ratio and access patterns.Follow-up questions:
  • “You have a reporting dashboard that JOINs 8 tables and takes 12 seconds to load. How do you decide whether to denormalize vs. add indexes vs. use materialized views?” (Tests practical judgment: start with EXPLAIN ANALYZE to find the bottleneck. If it is JOIN fan-out, a materialized view refreshed every 5 minutes is often the right middle ground. Full denormalization requires maintaining data consistency in application code.)
  • “What are update anomalies, and can you give a real example of one causing a production bug?” (If customer_name is stored in both customers and orders tables, updating the customer’s name requires updating both. If the orders update fails or is forgotten, you have inconsistent data — old name on historical orders, new name on the customer record.)
  • “How does MongoDB’s document model relate to normalization? Is embedding the same as denormalization?” (Yes — embedding related data in a single document is denormalization. It avoids the equivalent of JOINs but means updates to shared data must be applied to every document containing it.)
Structured Answer Template — Normalization questions:
  1. Define 1NF/2NF/3NF in one sentence each, with a concrete violation example.
  2. State the trade-off: normalization optimizes writes, denormalization optimizes reads.
  3. Identify the workload: OLTP → 3NF; OLAP/warehouse → star schema (denormalized).
  4. Name the modern middle ground: materialized views or JSONB for controlled denormalization.
  5. Acknowledge: “3NF is the starting point; selectively denormalize hot read paths once measured.”
Real-World Example: Airbnb’s listing pages used to assemble data from 15+ normalized tables and took ~800 ms to render. They introduced a denormalized “listing_view” materialized table refreshed every 15 minutes with pre-joined data for the listing page. Page load dropped to ~80 ms — a 10x improvement. They kept the normalized tables as the source of truth and accepted that the read side could be up to 15 minutes stale, which matches the freshness needs of the listing page.
Big Word Alert — Star schema: A denormalized warehouse pattern where a central “fact” table (like sales) references dimension tables (customer, product, date). Analysts love it because most queries are simple fact-table aggregations with few joins.
Big Word Alert — Update anomaly: When the same logical fact is stored in multiple places and an update to one copy fails or is forgotten, leaving the database inconsistent. Normalization prevents these; denormalization requires application-level discipline to avoid them.
Follow-up Q&A Chain:
  • Q: Does JSONB in Postgres violate 1NF? A: Technically yes — it stores composite/nested values in one column. Practically, it’s fine for attributes you never filter or aggregate on. Index the specific paths you query with GIN or expression indexes.
  • Q: When is fully denormalized the right starting point? A: Analytical workloads and write-once-read-many systems (event logs, activity feeds, product catalogs served at edge). OLTP with lots of updates should almost always start normalized.
Further Reading:
  • martin.kleppmann.com — DDIA Chapter 3 on storage and retrieval
  • “Star Schema: The Complete Reference” by Christopher Adamson (for warehouse-style denormalization)
What interviewers are really testing: Not whether you can recite JOIN types (everyone can), but whether you understand what the query planner actually does under the hood — hash joins vs. nested loops vs. merge joins, their performance characteristics, and when a JOIN choice causes a production query to go from 50ms to 50 seconds.Answer:
  • Inner Join: Returns only rows with matches in both tables. The most common join — used for lookups like “get all orders with their customer names.”
  • Left (Outer) Join: All rows from the left table, matched rows from the right table, NULLs where no match exists. Use case: “find all customers, including those with zero orders.” The gotcha: filtering on a LEFT JOIN’s right table in the WHERE clause implicitly converts it to an INNER JOIN. WHERE right_table.col = 'X' eliminates NULLs — use AND in the ON clause instead.
  • Right (Outer) Join: Mirror of LEFT JOIN. Rarely used in practice — rewriting as a LEFT JOIN with swapped table order is more readable.
  • Full Outer Join: All rows from both tables, NULLs on both sides where no match exists. Use case: data reconciliation — “show me all records from system A and system B, highlighting mismatches.”
  • Cross Join: Cartesian product (N x M rows). Rarely intentional. Use case: generating combinations (e.g., all product-color combinations for a catalog). Accidentally omitting a JOIN condition produces a cross join — on two million-row tables, that is 10^12 rows and will kill your database.
How the planner executes JOINs (the part most candidates miss):
  • Nested Loop Join: For each row in the outer table, scan the inner table. O(N*M) worst case. The planner chooses this when one table is very small or an index exists on the inner table’s join column — then it becomes O(N * log M).
  • Hash Join: Build an in-memory hash table from the smaller table, then probe it for each row in the larger table. O(N + M) but requires memory for the hash table. If the hash table exceeds work_mem, it spills to disk and performance degrades dramatically.
  • Merge Join: Both inputs must be sorted on the join key. Walks through both sorted lists simultaneously — O(N log N + M log M) for the sorts, then O(N + M) for the merge. Efficient when both inputs are already sorted (e.g., from an index scan).
Red flag answer: Only listing JOIN types without discussing planner strategies, or not knowing the LEFT JOIN WHERE clause gotcha. Also, saying “I avoid JOINs for performance” — this indicates NoSQL cargo-culting rather than understanding when JOINs are efficient.Follow-up questions:
  • “You have a query JOINing two tables with 50M rows each. EXPLAIN shows a nested loop join and the query takes 3 minutes. What happened and how do you fix it?” (The planner chose nested loop because of bad statistics or a missing index. With bad row estimates, it might think the inner table returns 1 row when it returns 50K. Fix: run ANALYZE, check for index on the join column, or increase work_mem to allow a hash join.)
  • “What is the difference between ON and WHERE for filtering in a LEFT JOIN?” (Conditions in ON are applied during the join — rows that do not match still appear with NULLs. Conditions in WHERE are applied after the join — they filter out NULL rows, converting the LEFT JOIN to an INNER JOIN.)
  • “When would you prefer a subquery over a JOIN, and vice versa?” (Correlated subqueries can cause N+1-like behavior. JOINs allow the planner to choose the optimal strategy. However, EXISTS subqueries can be faster than JOINs when you only need to check existence, not retrieve columns — the planner can short-circuit on the first match.)
Structured Answer Template — JOIN questions:
  1. Name the JOIN type and what the planner likely picks (nested loop / hash / merge).
  2. Identify the driving factor: table sizes, selectivity, index availability, work_mem.
  3. Call out one classic trap: LEFT JOIN filter-in-WHERE converting to inner join.
  4. Propose a diagnostic: EXPLAIN (ANALYZE, BUFFERS) to see actual vs estimated rows.
  5. Close with a fix: index the join column, raise work_mem for hash joins, or rewrite to EXISTS.
Real-World Example: At Meta, engineers debugging a slow News Feed query found the planner had chosen a nested loop join because the statistics hadn’t been refreshed after a schema migration. Actual rows: 2 million. Estimated rows: 1. Running ANALYZE took 30 seconds and the next run of the query picked a hash join, dropping latency from 18 seconds to 40 ms. They now auto-run ANALYZE on critical tables after any DDL.
Big Word Alert — Hash join: The planner builds an in-memory hash table from the smaller (build) input, then streams the larger (probe) input through it. O(N + M). Fails over to disk if the hash doesn’t fit in work_mem — watch for Batches > 1 in EXPLAIN.
Big Word Alert — Merge join: Both inputs must arrive sorted on the join key. The planner walks them together. Efficient if both sides are already sorted (e.g., by index), otherwise it pays a sort cost first.
Follow-up Q&A Chain:
  • Q: Why do experienced engineers distrust SELECT * in a JOIN? A: Because it forces the planner to materialize every column of every row, kills the chance of an index-only scan, and couples application code to the exact schema — schema changes silently break things.
  • Q: Is it ever faster to do two single-table queries and join in the application? A: Occasionally yes — when the join selectivity is terrible and one side is already cached. But you’re fighting the planner, and the result is fragile. Measure before reaching for it.
Further Reading:
  • use-the-index-luke.com — “Joining Tables” chapter on how indexes shape join strategies
  • postgresql.org/docs — “Explicit JOIN Clauses” and “Performance Tips”
What interviewers are really testing: Whether you understand the caching vs. freshness trade-off at the database level, can articulate when to use each, and know the operational complexity of materialized views in production (refresh strategies, storage cost, index support).Answer:
  • View: A saved SQL query — a virtual table that re-executes the underlying query every time you SELECT from it. Zero storage overhead, always returns fresh data. Think of it as a named alias for a complex query. Use case: simplifying access control (expose a view that hides sensitive columns), standardizing complex JOINs that multiple services use, or providing a stable API over changing table structures.
  • Materialized View (MatView): A physical copy of the query result stored on disk. Fast reads because it is just a table scan — no re-execution of the underlying query. Trade-off: data is stale until you run REFRESH MATERIALIZED VIEW. In PostgreSQL, you can create indexes on materialized views, which makes them extremely powerful for dashboard-style queries.
Refresh strategies and their trade-offs:
  • REFRESH MATERIALIZED VIEW: Takes an ACCESS EXCLUSIVE lock — reads are blocked during refresh. On a 10M-row materialized view, refresh might take 30 seconds, meaning 30 seconds of downtime for any query hitting that view.
  • REFRESH MATERIALIZED VIEW CONCURRENTLY: Allows reads during refresh (no exclusive lock), but requires a unique index on the materialized view and is 2-3x slower because it performs a diff between old and new data. This is what you use in production.
  • Scheduled refresh: Most teams use pg_cron or an application-level scheduler to refresh every 5-15 minutes. The refresh interval is a business decision: “How stale can this data be?”
  • Trigger-based refresh: Refresh on every write to the source tables. This defeats the purpose for write-heavy workloads but works for slowly-changing data.
Real-world example: A reporting dashboard showing “revenue by region for the last 30 days” JOINs orders, customers, and products — a 15-second query on raw tables. A materialized view with a REFRESH CONCURRENTLY every 10 minutes reduces this to 200ms. The business accepts 10-minute staleness for a 75x speedup.Red flag answer: “Materialized views are just cached views” without mentioning the locking behavior of refresh, or not knowing about CONCURRENTLY. Also, suggesting materialized views for real-time data requirements.Follow-up questions:
  • “A materialized view refresh is taking 45 seconds and blocking all reads. How do you fix this without changing the refresh frequency?” (Switch to REFRESH CONCURRENTLY — requires adding a unique index to the materialized view. If refresh is inherently slow, consider incremental refresh via pg_ivm extension or switching to an application-level caching strategy like Redis.)
  • “How do materialized views compare to application-level caching with Redis?” (MatViews live inside the database — they are queryable with SQL, support indexes, and are transactionally consistent at refresh time. Redis is faster for point lookups but cannot handle complex SQL queries. MatViews are better for analytical aggregations; Redis is better for hot key-value caching.)
  • “Can you use a materialized view as a target for foreign keys or INSERT statements?” (No — materialized views are read-only. You cannot INSERT, UPDATE, or reference them in foreign key constraints. They are purely a query acceleration mechanism.)
Structured Answer Template — View vs MatView:
  1. State the key difference: view = re-executes query; matview = cached result on disk.
  2. Name the freshness trade-off: MatView is only as fresh as its last refresh.
  3. Require REFRESH CONCURRENTLY in production — naive refresh blocks reads.
  4. Position it in the caching spectrum: faster than a raw query, slower than Redis, SQL-queryable.
  5. Mention pg_ivm or incremental refresh for advanced use cases.
Real-World Example: Stripe uses Postgres materialized views for their internal analytics dashboards — the “merchant monthly revenue” view refreshes every 5 minutes via pg_cron. Their engineers measured the alternative (re-running the underlying 8-table join each dashboard load) at 12+ seconds and decided 5-minute staleness was acceptable for internal-facing tools. External customer dashboards still hit the live data.
Big Word Alert — Materialized view: A saved query whose result is physically stored as a table. Reads are as fast as any other table scan; the cost is that the data is only as current as the last REFRESH.
Big Word Alert — Incremental view maintenance (IVM): Updating a materialized view by applying only the delta of changes since the last refresh, instead of re-running the whole query. Postgres extensions like pg_ivm provide this; native support is on the roadmap.
Follow-up Q&A Chain:
  • Q: Can I put an index on a materialized view? A: Yes, and you usually should — MatViews behave like tables for indexing. A unique index is required if you want to use REFRESH CONCURRENTLY.
  • Q: What happens to a materialized view if the underlying tables are dropped? A: Postgres prevents the drop via dependency — you’d need DROP TABLE ... CASCADE, which also drops the matview. A good protection against accidents.
Further Reading:
  • postgresql.org/docs — “Materialized Views” chapter
  • use-the-index-luke.com — query performance techniques that complement materialized views
What interviewers are really testing: Whether you understand the database-side vs. application-side logic debate, the practical differences beyond syntax, and when putting logic in the database is a good idea vs. a maintenance nightmare. This also tests your opinion on where business logic should live — a strong signal of architectural maturity.Answer:
  • Function: Returns a value (scalar, table, or set). Can be called inside SELECT, WHERE, and other SQL expressions. In PostgreSQL, functions run within the caller’s transaction — they cannot independently COMMIT or ROLLBACK. Supports IMMUTABLE, STABLE, VOLATILE volatility categories that affect query planner optimization and caching.
  • Stored Procedure (PostgreSQL 11+, MySQL, SQL Server): Can manage its own transaction control — COMMIT and ROLLBACK inside the body. Cannot be called from within a SELECT statement — invoked via CALL. Designed for multi-step workflows that need intermediate commits (e.g., batch processing 1M rows in chunks with commits every 10K to avoid long-running transaction bloat).
When to use database-side logic (and when not to):
  • Good use cases: Data validation that must be enforced regardless of which application connects (constraint triggers), audit logging (triggers on INSERT/UPDATE/DELETE), complex data transformations during ETL that benefit from being close to the data (avoiding network round-trips for millions of rows), and row-level security policies.
  • Bad use cases: Business logic that changes frequently (stored procedures are harder to version, test, and deploy than application code), anything that requires external API calls or complex branching logic, and any logic that your development team cannot effectively debug (if most engineers do not know PL/pgSQL, stored procedures become a black box).
The cultural divide: Companies with a DBA tradition (banking, telecom) tend to put more logic in stored procedures. Companies with a DevOps/microservices culture (startups, SaaS) tend to keep the database “dumb” and put logic in application code. Neither is universally correct — the right answer depends on your team’s skills and your deployment pipeline.Red flag answer: “Never use stored procedures, keep the database dumb” without acknowledging legitimate use cases, or conversely “put all logic in stored procedures for performance” without considering maintainability. Both extremes show lack of nuance.Follow-up questions:
  • “You have a nightly batch job that needs to process and archive 50M rows. Would you write this as a stored procedure or application code? Why?” (Strong case for a stored procedure: processing 50M rows in application code means 50M network round-trips. A stored procedure processes data locally. Use intermediate COMMITs every 10K-50K rows to avoid transaction ID wraparound and long-running transaction problems.)
  • “How do you unit test a stored procedure?” (This is the Achilles’ heel. pgTAP is the PostgreSQL testing framework for SQL/PL/pgSQL. You can also test via integration tests that call the procedure and assert on database state. But testing is harder than application code — this is the main argument against heavy stored procedure usage.)
  • “What is the performance difference between a PL/pgSQL function and a SQL function in PostgreSQL?” (SQL functions can be “inlined” by the planner — the function body is substituted into the outer query and optimized as a whole. PL/pgSQL functions are opaque to the planner — each SQL statement inside is planned separately. For simple logic, SQL functions can be 5-10x faster.)
Structured Answer Template — Stored Procedure vs Function:
  1. State the behavioral difference: functions return values and run inside the caller’s transaction; procedures own their own transactions.
  2. Identify the use case: functions for expressions; procedures for multi-step batch work with intermediate commits.
  3. Surface the debate: where should business logic live?
  4. List the anti-pattern: rapidly-changing business logic in stored procs is hard to version.
  5. Offer a pragmatic rule: “Data-local validation/transformation = DB; rapidly-evolving business rules = application code.”
Real-World Example: Dropbox keeps their product business logic in application code (Go/Python services) but uses Postgres stored procedures for one specific task: the nightly storage-quota recomputation across billions of rows. Running it in-database saved ~2 hours per run because it avoided the round-trip latency of streaming 10 billion rows to the application layer. Everything else — permission checks, file sharing logic — lives in versioned Go services.
Big Word Alert — PL/pgSQL: Postgres’s procedural SQL language — adds loops, conditionals, local variables, and exception handling on top of SQL. Similar to Oracle’s PL/SQL. Runs inside the database process.
Big Word Alert — SQL function inlining: When the planner substitutes a simple SQL function’s body into the calling query and optimizes them together. Only works for pure SQL functions marked IMMUTABLE or STABLE — PL/pgSQL functions are always opaque to the planner.
Follow-up Q&A Chain:
  • Q: Can I version-control stored procedures like application code? A: Yes — use migration tooling (Flyway, Liquibase, Sqitch) and treat procedure bodies as migration files. Most teams that get burned by stored procs skipped this step.
  • Q: Why do DBAs love them and DevOps teams hate them? A: DBAs see them as close-to-the-data performance wins and central policy enforcement. DevOps teams see them as out-of-band code that bypasses the normal CI/CD, testing, and observability pipeline.
Further Reading:
  • postgresql.org/docs — “Server Programming” chapters on functions and procedures
  • “The Art of PostgreSQL” by Dimitri Fontaine for the case in favor of database-centric design
What interviewers are really testing: Whether you understand constraints as the database’s last line of defense for data integrity, can explain the performance implications of each, and know when to rely on database constraints vs. application-level validation.Answer:
  • Primary Key: Unique + NOT NULL. Implicitly creates a unique index. In InnoDB, the primary key IS the clustered index — choosing the right PK affects every query’s physical I/O pattern. In PostgreSQL, the PK creates a B-tree index on a heap table.
  • Foreign Key: Enforces referential integrity between tables. On INSERT/UPDATE, the database checks the referenced row exists. On DELETE of the parent, the configured action fires: CASCADE (delete children), SET NULL, SET DEFAULT, or RESTRICT (block). Performance cost: every INSERT into the child table requires an index lookup on the parent’s PK. On bulk loads, temporarily disabling FK checks (SET session_replication_role = 'replica'; in PostgreSQL) can speed things up 5-10x — but you must re-validate after.
  • Unique: Prevents duplicate values. Creates a unique index. Partial unique constraints in PostgreSQL (CREATE UNIQUE INDEX ... WHERE active = true) are extremely powerful — “only one active subscription per user.”
  • NOT NULL: Prevents NULL values. Seems simple but adding NOT NULL to an existing column on a large table requires a full table scan to verify no NULLs exist (takes an ACCESS EXCLUSIVE lock). In PostgreSQL 12+, use ALTER TABLE ... ADD CONSTRAINT ... NOT NULL NOT VALID; VALIDATE CONSTRAINT ...; to avoid long locks.
  • Check: Arbitrary boolean expressions (CHECK (age > 0 AND age < 200), CHECK (status IN ('active', 'pending', 'closed'))). Cannot reference other tables. For cross-table validation, use triggers or application logic.
  • Exclusion Constraints (PostgreSQL-specific): Prevent overlapping values using GiST indexes. Classic use case: EXCLUDE USING gist (room WITH =, during WITH &&) — no two bookings for the same room can have overlapping time ranges. This is incredibly powerful and underused.
What weak candidates say: List the five basic types without mentioning performance implications or real use cases. Cannot explain CASCADE behavior or when to use partial unique constraints.What strong candidates say: Discuss constraints as a data integrity strategy, explain the locking implications of adding constraints to existing tables, mention exclusion constraints for range overlaps, and know when to disable FKs for bulk operations.Follow-up chain:
  1. “You have a multi-tenant SaaS app. How do you ensure that a user in tenant A can never accidentally reference a record belonging to tenant B through a foreign key?” (Composite FKs that include tenant_id, or row-level security policies.)
  2. “A bulk import of 10M rows is taking 4 hours because of foreign key checks. How do you speed it up safely?” (Disable FK triggers, load data, re-enable and validate. Or use COPY with deferred constraints.)
  3. “When would you intentionally NOT use a foreign key constraint?” (Cross-service boundaries in microservices, very high-write throughput tables where FK check overhead is unacceptable, or when referencing data in a different database/shard.)
Structured Answer Template — Constraint design:
  1. Map each constraint type to what it prevents (PK = uniqueness + not-null; FK = referential integrity; UNIQUE = dup-prevention; CHECK = domain rules).
  2. Note the performance cost — especially FK index lookups on writes.
  3. Call out the partial-unique trick for multi-tenant and soft-delete scenarios.
  4. Highlight the locking cost of ADD CONSTRAINT on a live table and the NOT VALID → VALIDATE pattern.
  5. Close with: “Constraints are your cheapest bug-prevention layer — use them, but know their cost.”
Real-World Example: Notion uses composite foreign keys that include workspace_id on every referential relationship. Early on they hit a bug where a shared integration could accidentally reference blocks from a different workspace — a composite FK (block_id, workspace_id) made that relationship impossible to express, turning a class of bugs into a compile-time DB error. The cost is ~10 bytes per FK, well worth it for a multi-tenant system.
Big Word Alert — Partial unique index: A unique index with a WHERE clause (CREATE UNIQUE INDEX ... WHERE active = true). Uniqueness is enforced only on the matching subset of rows — perfect for “one active subscription per user” or soft-delete scenarios where deleted rows should not block re-use of an email.
Big Word Alert — Exclusion constraint: A Postgres-specific constraint that uses a GiST index to prevent overlapping values. The canonical use: “no two bookings for the same room can have overlapping time ranges” — implemented in a single declarative statement.
Follow-up Q&A Chain:
  • Q: Why are FKs controversial in microservices? A: They create implicit coupling across service boundaries. If orders.customer_id references customers.id but customers is owned by a different service, you’ve leaked schema ownership. Most microservice shops drop FKs at service boundaries and enforce integrity at the application layer.
  • Q: Does adding a CHECK constraint lock the table? A: In Postgres, by default yes — it scans the full table to validate. Use ADD CONSTRAINT ... NOT VALID first (instant, no scan), then VALIDATE CONSTRAINT separately (weaker lock). On a billion-row table this is the difference between minutes of downtime and zero downtime.
Further Reading:
  • postgresql.org/docs — “Data Definition” chapter on constraints
  • use-the-index-luke.com — how unique constraints and indexes interact
What interviewers are really testing: Whether you understand why cursors exist, why they are almost always the wrong tool, and the rare cases where they are genuinely necessary. This question separates set-based thinkers from procedural thinkers.Answer:A cursor is a database object that allows row-by-row iteration over a result set. Think of it as a pointer (or iterator) that moves through query results one row at a time.Why cursors are almost always wrong:
  • Network roundtrips: Each FETCH is a separate network call. Iterating 100K rows = 100K roundtrips. A single SELECT returning 100K rows is one roundtrip.
  • Lock holding: Server-side cursors hold resources (memory, locks) for the duration of iteration. On a busy OLTP system, this blocks other transactions.
  • Set-based alternative: SQL is designed for set operations. Almost everything a cursor does can be done with UPDATE ... FROM, INSERT ... SELECT, window functions, CTEs, or LATERAL joins — all of which the query planner can optimize. Cursors bypass the planner’s ability to optimize across operations.
When cursors ARE appropriate:
  • Batch processing with intermediate commits: Processing 50M rows in chunks — open a cursor, fetch 10K rows, process, COMMIT, fetch next 10K. This avoids holding a single transaction open for hours (which prevents VACUUM in PostgreSQL).
  • Streaming large result sets: When the result is too large to fit in application memory. PostgreSQL’s DECLARE CURSOR with FETCH 1000 lets you stream without loading everything.
  • Stored procedures with complex control flow: PL/pgSQL FOR record IN SELECT ... is syntactic sugar over a cursor and is the idiomatic way to iterate in server-side code.
What weak candidates say: “Cursors let you loop through rows” without mentioning why this is usually bad, or suggesting cursors for problems that have set-based solutions.What strong candidates say: Immediately explain the performance anti-pattern, then acknowledge the legitimate use cases (batch processing, memory-constrained streaming), and suggest set-based alternatives for the common cases.Follow-up chain:
  1. “A developer wrote a stored procedure that uses a cursor to update 2M rows one at a time. It takes 45 minutes. How do you rewrite it?” (Single UPDATE ... SET ... FROM with a JOIN, or batch updates with WHERE id BETWEEN X AND Y.)
  2. “What is the difference between a server-side cursor and a client-side cursor?” (Server-side: result stays on the DB server, client fetches in chunks. Client-side: entire result sent to client, iteration happens in app memory. Server-side saves client memory but holds server resources.)
  3. “How do PostgreSQL’s scrollable cursors differ from forward-only cursors, and when would you use each?”
Structured Answer Template — Cursor questions:
  1. Define it: row-by-row iteration over a result set.
  2. Signal the anti-pattern first: most cursor code should be a set-based UPDATE/INSERT.
  3. Identify the legit cases: batch processing with intermediate commits, streaming results too large for memory.
  4. Contrast server-side (resources on DB) vs client-side (resources on app) cursors.
  5. Close with a rewrite pattern: UPDATE ... WHERE id BETWEEN X AND Y in a loop beats a cursor every time.
Real-World Example: A Dropbox engineer rewrote a metadata-cleanup job from a PL/pgSQL cursor that processed one file at a time (45 minutes for 2 million rows) into a single UPDATE files SET ... FROM stale_index WHERE files.id = stale_index.id with a batching WHERE id BETWEEN X AND Y wrapper. The rewrite took 90 seconds for the same 2 million rows — a 30x improvement. The lesson: cursors look like “proper iteration” to procedural programmers but are almost always the wrong tool in SQL.
Big Word Alert — Server-side cursor: The query result lives on the DB server and the client fetches pages on demand. Saves client memory but holds database resources (locks, snapshot visibility) open for as long as the cursor is active.
Big Word Alert — Set-based operation: A SQL operation that processes an entire set of rows as a single unit (e.g., UPDATE ... WHERE). The opposite of procedural row-by-row iteration — lets the planner choose the optimal strategy and avoids per-row network overhead.
Follow-up Q&A Chain:
  • Q: When is a cursor genuinely necessary? A: Two cases: (1) batch processing where you need intermediate commits to avoid a multi-hour transaction blocking VACUUM, and (2) streaming a result set that doesn’t fit in app memory.
  • Q: Can’t I just use LIMIT/OFFSET instead of a cursor? A: For small offsets, yes. But OFFSET 1000000 forces the DB to scan and discard the first million rows every time — O(N²). Keyset pagination (WHERE id > last_id LIMIT 1000) is the correct scalable pattern.
Further Reading:
  • use-the-index-luke.com — “Pagination” chapter on keyset vs offset
  • postgresql.org/docs — “Cursors” and “Declare Cursor” documentation

2. Internals & Optimization

What interviewers are really testing: Whether you understand the fundamental I/O optimization that makes databases durable without being slow, and can trace the lifecycle of a write from application to disk. This concept underpins Atomicity and Durability in every production database.Answer:The WAL is the single most important concept in database durability. Changes are written to an append-only log file before they are applied to the actual data pages. This seems redundant but is critical for two reasons:
  1. Sequential vs Random I/O: Writing to the WAL is sequential I/O (appending to end of file), which is 100-1000x faster than the random I/O needed to update data pages scattered across disk. On a typical SSD, sequential writes can hit 500MB/s while random 4KB writes might only achieve 50MB/s.
  2. Crash Recovery: If the server crashes after writing to the WAL but before updating data pages, the database replays the WAL on startup to bring data pages up to date. If the crash happens before the WAL write completes, the transaction is considered uncommitted and is rolled back. This is how Atomicity and Durability work together.
The write lifecycle in PostgreSQL:
  1. Application sends INSERT statement
  2. PostgreSQL writes the change to WAL buffers in shared memory
  3. On COMMIT, WAL buffers are flushed to disk (fsync to WAL segment file)
  4. The commit is acknowledged to the client — data is now durable
  5. Later, the background writer and checkpointer lazily flush dirty data pages to disk (this is NOT on the critical path)
Key insight: The data files can be “behind” the WAL. PostgreSQL’s checkpointer periodically writes all dirty pages to disk and records the WAL position. On crash recovery, replay starts from the last checkpoint, not from the beginning of the WAL.Tuning trade-off: synchronous_commit=off skips the fsync on step 3, buffering WAL writes in the OS page cache. This gives 2-5x write throughput but risks losing up to ~600ms of commits on crash. Analytics and logging workloads often accept this trade-off.Red flag answer: “WAL is just a log file for debugging” or not understanding why writes go to WAL before data pages. Also, not knowing that WAL is what enables streaming replication in PostgreSQL (replicas consume the WAL stream).Follow-up questions:
  • “How does WAL relate to streaming replication?” (Replicas receive and replay the WAL stream from the primary. This is why replication lag is measured in WAL bytes behind, not time.)
  • “What happens to WAL files over time? Do they grow forever?” (No — after a checkpoint, old WAL segments are recycled or archived. wal_keep_size controls retention. WAL archiving to S3 enables Point-In-Time Recovery.)
  • “A PostgreSQL instance is experiencing high write latency. WAL disk utilization is at 100%. What is happening?” (The WAL disk is the bottleneck. WAL writes are synchronous on commit. Solutions: move WAL to faster disk (NVMe), increase wal_buffers, use synchronous_commit=off if data loss is acceptable, or check if full_page_writes is generating excessive WAL volume.)
Structured Answer Template — WAL questions:
  1. State why it exists: sequential log writes are 100x faster than random data-page writes.
  2. Walk the commit lifecycle: SQL → WAL buffer → WAL fsync → ACK → later checkpointed to data pages.
  3. Connect to Atomicity and Durability: WAL replay rebuilds state on crash.
  4. Name one tuning knob: synchronous_commit, wal_level, or max_wal_size.
  5. Mention replication: “Replicas consume the WAL stream — lag is measured in WAL bytes.”
Real-World Example: Stripe put Postgres’s WAL directory on dedicated NVMe disks separate from the data directory. During a traffic spike, they discovered the data disk was saturated but the WAL disk (the commit-path bottleneck) was fine — meaning commits still completed in <5 ms even while analytical queries on the data disk crawled. Separating WAL from data is a standard production pattern for write-heavy Postgres.
Big Word Alert — Checkpoint: A Postgres operation that flushes all dirty data pages from memory to disk and records the WAL position. Crash recovery only needs to replay WAL from the last checkpoint — not from the beginning of time.
Big Word Alert — Point-in-Time Recovery (PITR): Restoring the database to any specific moment by applying a base backup plus the WAL stream up to the target timestamp. Requires WAL archiving (typically to S3) and is the gold standard for “oops we ran UPDATE without a WHERE” recovery.
Follow-up Q&A Chain:
  • Q: What’s the difference between synchronous_commit=on and synchronous_commit=remote_apply? A: on waits for local WAL fsync. remote_apply waits for the replica to fsync AND apply the WAL. The latter gives you read-your-writes consistency on the replica but increases write latency by the replica’s round trip.
  • Q: Can WAL fill up the disk? A: Yes — if an orphaned replication slot prevents cleanup, or if archive_command fails silently, WAL accumulates until the disk fills and Postgres shuts down. Monitor pg_replication_slots.restart_lsn and archive_failure count.
Further Reading:
  • postgresql.org/docs — “Write-Ahead Logging (WAL)” and “Reliability” chapters
  • martin.kleppmann.com — DDIA Chapter 3 on storage engines
What interviewers are really testing: Whether you understand the concurrency model that makes modern databases fast, can explain the mechanism (not just the effect), and know the operational cost (VACUUM). This separates engineers who use databases from engineers who understand them.Answer:MVCC is the mechanism that allows databases to handle concurrent reads and writes without locking. The core principle: readers never block writers, and writers never block readers. Each transaction sees a consistent snapshot of the database as it existed at a specific point in time.How PostgreSQL implements MVCC (tuple versioning):
  • Every row (tuple) has two hidden system columns: xmin (the transaction ID that created this version) and xmax (the transaction ID that deleted/updated this version, or 0 if still live).
  • When a row is updated, PostgreSQL does not modify the existing tuple. Instead, it creates a new version of the tuple with a new xmin, and sets the xmax on the old version to the updating transaction’s ID. Both versions exist on disk simultaneously.
  • When a transaction reads, it checks its own snapshot (list of active transaction IDs at the time it started) against each tuple’s xmin/xmax to determine visibility. If xmin was committed before the snapshot and xmax is either 0 or not yet committed, the tuple is visible.
  • This means a long-running SELECT query sees a frozen-in-time view of the database, even if other transactions are actively modifying the same rows.
The cost — dead tuples and VACUUM:
  • Because updates create new tuple versions, the old versions (dead tuples) accumulate. These consume disk space and bloat indexes.
  • VACUUM reclaims this space by marking dead tuples as reusable. VACUUM FULL rewrites the entire table (but requires an exclusive lock — avoid in production).
  • Autovacuum runs in the background and is critical to keep running. A common production disaster: a team disables autovacuum “because it’s using CPU” and weeks later the table is 10x its actual size, queries slow to a crawl, and worst case, the database hits transaction ID wraparound — a catastrophic event where PostgreSQL refuses ALL writes to prevent data corruption.
How MySQL InnoDB differs: InnoDB uses a rollback segment (undo log) instead of storing multiple tuple versions inline. The current version lives in the data page; old versions are stored in the undo tablespace. This means InnoDB does not need VACUUM, but long-running transactions prevent undo log purging, causing the undo tablespace to grow.Red flag answer: “MVCC means the database uses locks” — it is literally the opposite. Also, not mentioning VACUUM or the consequences of dead tuple buildup shows you have never operated PostgreSQL in production.Follow-up questions:
  • “A PostgreSQL table has 1 million live rows but is taking up 10GB on disk. What happened and how do you fix it?” (Table bloat from dead tuples. Autovacuum may be misconfigured or a long-running transaction prevented it. Run VACUUM FULL during maintenance window, or use pg_repack for online table rewrites without locking.)
  • “What is transaction ID wraparound and why is it PostgreSQL’s most dangerous failure mode?” (PostgreSQL uses 32-bit transaction IDs. After ~2 billion transactions, IDs wrap around and old transactions become “in the future,” making their data invisible. VACUUM must freeze old transaction IDs before this happens. If it cannot — because autovacuum is blocked or disabled — PostgreSQL enters a protective shutdown mode.)
  • “How do you handle a situation where a long-running analytics query is blocking VACUUM on a production OLTP database?” (This is a common real-world problem. Solutions: set statement_timeout for analytics queries, use a read replica for analytics, or configure hot_standby_feedback to control snapshot retention on replicas.)
Structured Answer Template — MVCC questions:
  1. Lead with the goal: readers never block writers; writers never block readers.
  2. Explain the mechanism: each row has xmin/xmax; transactions see a snapshot filtered by visibility rules.
  3. Name the cost: dead tuples accumulate — VACUUM must run.
  4. Compare implementations: Postgres (inline versions, needs VACUUM) vs MySQL (undo log, no VACUUM but undo tablespace grows).
  5. Close with the danger: “Disable autovacuum and you’re on a countdown to transaction ID wraparound.”
Real-World Example: Discord once discovered a “forgotten” psql session from an on-call engineer had been idle-in-transaction for 72 hours. It held back VACUUM cleanup on their message-metadata table, which ballooned to 3x its normal size. Their write latency had crept up over that window. The fix was one pg_terminate_backend() and an aggressive VACUUM ANALYZE — but they now alert on any transaction idle for more than 30 minutes.
Big Word Alert — xmin / xmax: Hidden system columns on every Postgres row. xmin is the transaction ID that created the row version; xmax is the transaction ID that deleted or updated it (or 0 if still live). MVCC visibility is decided by comparing these against your transaction’s snapshot.
Big Word Alert — Transaction ID wraparound: Postgres uses 32-bit transaction IDs; after ~2 billion commits they’d wrap around and old data would appear to be in the future. VACUUM prevents this by freezing old rows. If it fails to keep up, Postgres stops accepting writes to protect your data — a nightmare scenario requiring single-user-mode recovery.
Follow-up Q&A Chain:
  • Q: Why does SELECT count(*) need to scan every row in Postgres but is instant in MySQL InnoDB? A: MVCC visibility is per-row in Postgres — you must check every row’s visibility against your snapshot. InnoDB caches an approximate row count that’s good enough for the default statement. Postgres trades that speed for snapshot isolation correctness.
  • Q: What does VACUUM FULL do differently from VACUUM? A: VACUUM marks dead tuples as reusable (space returned to the free space map). VACUUM FULL rewrites the entire table, returning space to the OS — but requires an ACCESS EXCLUSIVE lock. Use pg_repack for the online alternative.
Further Reading:
  • postgresql.org/docs — “Concurrency Control” chapter on MVCC internals
  • martin.kleppmann.com — DDIA Chapter 7 on snapshot isolation
What interviewers are really testing: Whether you can read an EXPLAIN ANALYZE output like a diagnostic report, identify the bottleneck nodes, understand why the planner made its choices, and take corrective action. This is the single most important operational skill for any engineer working with relational databases.Answer:The query lifecycle: Parser (syntax check) -> Rewriter (view/rule expansion) -> Planner/Optimizer (generates candidate plans, estimates cost, picks cheapest) -> Executor (runs the plan, returns rows).Reading EXPLAIN ANALYZE — what matters:
  • Actual vs. estimated rows: If the planner estimates 100 rows but the actual is 500K, statistics are stale or misleading. This is the #1 cause of bad plans. Fix: ANALYZE table_name or increase default_statistics_target.
  • Node types to watch for:
    • Seq Scan on a large table with a selective WHERE clause — missing index or stale stats.
    • Nested Loop with a large inner table and no index — O(N*M) disaster. The planner chose this because it underestimated the outer row count.
    • Hash Join with Batches: 8 — the hash table spilled to disk because it exceeded work_mem. Increase work_mem (per-session, not globally).
    • Sort Method: external merge — sort spilled to disk. Same fix: increase work_mem.
    • Bitmap Heap Scan with Lossy: true — bitmap exceeded work_mem and lost precision. Acceptable for moderate result sets but watch for performance cliffs.
  • Buffers (with BUFFERS option): shared hit = reads from cache, shared read = reads from disk. High read counts on a warm system indicate your working set does not fit in shared_buffers.
  • Time breakdown: The actual time shows two numbers: startup time and total time. A node with high startup time but low total time is doing expensive setup (e.g., building a hash table). Compare parent node time to sum of children — the difference is the node’s own processing overhead.
What weak candidates say: “Look for Seq Scan and fix it with an index.” This is cargo-cult optimization. Sometimes a Seq Scan IS the optimal plan (small table, or query returns 80% of rows).What strong candidates say: Walk through the plan tree bottom-up, compare estimated vs. actual rows at each node, identify the cost-dominant node, and reason about whether the planner’s choice was rational given its (possibly wrong) estimates.Follow-up chain:
  1. “EXPLAIN ANALYZE shows accurate row estimates and an Index Scan, but the query is still slow. What else could be wrong?” (I/O wait — check Buffers: shared read count. Lock contention — check pg_stat_activity. Network latency. Or the index scan returns many rows requiring heap fetches with poor spatial locality.)
  2. “How do you read a plan with parallel query nodes? What does Workers Planned: 4, Workers Launched: 2 tell you?” (Only 2 of 4 planned workers were available — others are busy. The Gather node’s time includes waiting for workers. Check max_parallel_workers_per_gather.)
  3. “A query plan changes from fast (Index Scan) to slow (Seq Scan) after a weekend batch job. What happened?” (The batch job changed the table’s data distribution. Autovacuum/ANALYZE hasn’t run yet, so statistics are stale. Or the batch job increased the table size past the planner’s threshold for switching to Seq Scan.)
Structured Answer Template — Reading a query plan:
  1. Read bottom-up — child nodes execute first and feed the parent.
  2. Compare estimated vs actual rows at each node; a big gap means stale stats.
  3. Identify the cost-dominant node (highest total time).
  4. Check Buffers: shared hit vs shared read — lots of reads means cold cache.
  5. Recommend one concrete fix: ANALYZE, add index, raise work_mem, or rewrite.
Real-World Example: A Meta engineer shared a story where a dashboard query that normally took 200 ms occasionally spiked to 12 seconds. EXPLAIN ANALYZE showed Hash Join with Batches: 64 — meaning the hash table had spilled to disk 64 times. Raising work_mem from 4 MB to 64 MB (session-local, not globally) dropped the query to 180 ms consistently. Lesson: work_mem is session-scoped for a reason — tune it per-query, not globally.
Big Word Alert — EXPLAIN ANALYZE: Runs the query and reports both the planner’s estimates and the actual execution metrics. The difference between estimated and actual rows is the first thing experienced DBAs look at — it reveals stale statistics, which is the #1 cause of bad plans.
Big Word Alert — work_mem: The amount of memory a single query operation (sort, hash join, hash aggregate) can use before spilling to disk. Disk spill is 10-100x slower. Set session-local for expensive queries rather than raising the global default, which multiplies across all concurrent operations.
Follow-up Q&A Chain:
  • Q: What’s the difference between EXPLAIN and EXPLAIN ANALYZE? A: EXPLAIN shows only the estimated plan. EXPLAIN ANALYZE actually executes the query and shows real timings and row counts. Use ANALYZE when you can afford the query to actually run; use plain EXPLAIN for UPDATE/DELETE you don’t want to execute.
  • Q: Why does the planner sometimes ignore an obviously good index? A: Usually stale statistics. The planner’s cost model compares index scan vs seq scan cost; if it thinks the query returns 80% of the table, seq scan wins. Run ANALYZE and re-check.
Further Reading:
  • use-the-index-luke.com — “Execution Plans” chapter explaining every major plan node
  • postgresql.org/docs — “Using EXPLAIN” chapter
What interviewers are really testing: Whether you understand the different ways a database engine traverses an index and can explain when each strategy is chosen. This is fundamental to reading query plans.Answer:
  • Index Seek (Point Lookup): Traverse the B+Tree from root to the specific leaf node matching your predicate. O(log N) — typically 3-4 page reads for millions of rows. Used for equality (=) and range start-point lookups. This is the fast path.
  • Index Scan (Range Scan): Seek to the starting leaf node, then walk the leaf-level linked list sequentially. Used for range queries (BETWEEN, >, &lt;). Cost = O(log N) for the seek + O(K) for reading K matching leaf nodes. This is efficient because leaf pages are linked — sequential I/O.
  • Full Index Scan: Read ALL leaf nodes of the index, ignoring the tree structure. Used when the query can be satisfied entirely from the index (covering index) but has no selective predicate. Cheaper than a full table scan because the index is smaller than the table.
  • Bitmap Index Scan (PostgreSQL): Scan the index to build a bitmap of matching heap pages, then read those pages in physical order. This converts random I/O (many individual index lookups) into sequential I/O (reading pages in order). Used when the index scan would return too many rows for individual lookups but too few for a Seq Scan. The planner’s sweet spot: roughly 5-20% of the table.
What weak candidates say: “Index Scan means it uses the index, which is fast.” Cannot distinguish between the different traversal strategies.What strong candidates say: Explain the seek-then-scan pattern, mention bitmap scans as PostgreSQL’s way of handling the “too many rows for index scan, too few for seq scan” gap, and know the approximate selectivity thresholds where the planner switches strategies.Follow-up chain:
  1. “At what selectivity percentage does PostgreSQL typically switch from Index Scan to Seq Scan?” (Roughly 10-20% of the table, depending on random_page_cost and effective_cache_size. On SSD, you can lower random_page_cost to 1.1 (from default 4.0) to encourage index usage at higher selectivity.)
  2. “What is an Index-Only Scan and when does it fail to be ‘index only’?” (When all columns are in the index AND the visibility map confirms all heap pages are all-visible. If the visibility map is not up to date — because VACUUM hasn’t run — the scan must still check the heap for tuple visibility, degrading to a regular index scan.)
  3. “Why would two identical queries on the same table choose different scan strategies?” (Different parameter values — one value is highly selective (index scan), the other matches 80% of rows (seq scan). This is the plan caching problem that plan_cache_mode = auto addresses.)
Structured Answer Template — Scan strategy questions:
  1. Name the four strategies: Index Seek, Index Scan (range), Full Index Scan, Bitmap Index Scan.
  2. Tie each to a selectivity range: <1% = seek, 1-10% = range, 10-20% = bitmap, >20% = seq.
  3. Explain the bitmap trick: aggregates random I/O into sequential heap reads.
  4. Mention random_page_cost as the tuning knob — SSDs want 1.1, not the default 4.
  5. Close with: “The planner picks based on cost estimates; fix the estimates and it picks right.”
Real-World Example: When Notion migrated from gp2 (spinning-era defaults) to gp3 SSDs on AWS, they forgot to lower random_page_cost from the default 4.0 to ~1.1. The planner kept picking seq scans for queries that should have used indexes, because it thought random I/O was 4x more expensive than sequential. Updating one parameter dropped p99 on their hottest dashboard query from 2.3 seconds to 40 ms. This is one of the highest-leverage Postgres tweaks on modern SSDs.
Big Word Alert — Bitmap Index Scan: A two-phase scan. Phase 1 walks the index and builds a bitmap of matching heap pages. Phase 2 reads those heap pages in physical order, turning many random I/Os into a single sequential sweep. Postgres chooses this when selectivity is too high for a pure index scan but too low for a seq scan.
Big Word Alert — Index-Only Scan: When every column the query needs is in the index and the visibility map confirms the data is visible to all transactions — so the heap is never touched. Requires VACUUM to keep the visibility map fresh.
Follow-up Q&A Chain:
  • Q: Why does WHERE col &gt; 'X' sometimes use the index and sometimes not? A: Selectivity. If “greater than X” matches 5% of rows, the index wins. If it matches 50%, the seq scan wins. Histograms in pg_stats tell the planner which it is — stale histograms = wrong choice.
  • Q: How do I force an index in Postgres? A: You can’t — Postgres doesn’t have hints by design. Diagnostic trick: SET enable_seqscan = off to see what plan the planner would pick with the index, then fix the underlying cost estimates instead of forcing behavior.
Further Reading:
  • use-the-index-luke.com — “Anatomy of an Index” and “Where Clause” chapters
  • postgresql.org/docs — “Query Planning” and “Index-Only Scans” sections
What interviewers are really testing: Whether you understand the performance difference between an index lookup that satisfies the query entirely vs. one that requires a heap fetch, and can design indexes strategically for your query patterns.Answer:A covering index contains ALL columns needed by a query — both the WHERE clause columns and the SELECT columns. The database never needs to visit the heap (actual table data), making it dramatically faster.Example: For SELECT name, email FROM users WHERE age = 25:
  • Index on (age) — the index finds matching rows, but then must do a heap fetch for each row to get name and email. If 10K rows match, that is 10K random I/O heap lookups.
  • Index on (age, name, email) — the index contains everything. Zero heap fetches. PostgreSQL shows this as Index Only Scan in the query plan.
PostgreSQL’s INCLUDE syntax (PG 11+): CREATE INDEX idx_users_age ON users (age) INCLUDE (name, email); The INCLUDE columns are stored in the leaf nodes but are NOT part of the B-tree sort order. This means: (1) the tree is sorted only by age (efficient for WHERE age = 25), (2) name and email are available without heap access, (3) the index is smaller than (age, name, email) because internal nodes only carry age.Trade-off: Covering indexes are wider (more columns = more storage, slower writes). Every INSERT and UPDATE must maintain the index. On a high-write table, adding columns to an index for read optimization can degrade write throughput by 10-30%. The rule of thumb: cover your top 3-5 most frequent queries, not every possible query.What weak candidates say: “Just add all columns to the index.” This creates massive indexes that slow writes.What strong candidates say: Use INCLUDE for payload columns, keep the sorted key columns minimal, and balance read speedup against write overhead. Mention the visibility map requirement for Index Only Scans in PostgreSQL.Follow-up chain:
  1. “Your covering index is in place but EXPLAIN still shows heap fetches. Why?” (The visibility map is not up to date. Recently modified pages are marked as not-all-visible. VACUUM updates the visibility map. Run VACUUM on the table.)
  2. “How do covering indexes interact with HOT (Heap Only Tuple) updates in PostgreSQL?” (If the covering index includes a column that is updated, HOT updates are disabled for that row — every update must also update the index, even if the indexed column value did not change. This is why you should only INCLUDE columns that are rarely updated.)
  3. “Covering index vs. materialized view for a dashboard query that joins 3 tables — when do you use each?”
Structured Answer Template — Covering index design:
  1. State the goal: eliminate heap fetches by putting all needed columns in the index.
  2. Distinguish key columns (for sorting/filtering) vs INCLUDE columns (for payload only).
  3. Warn about the trade-off: wider index → slower writes → higher storage.
  4. Call out the HOT-update caveat in Postgres (indexed column updates kill HOT).
  5. Rule of thumb: “Cover your top 3-5 hottest queries, not every possible column combination.”
Real-World Example: Dropbox’s file-metadata lookup (~1M QPS at peak) uses a covering index with INCLUDE columns for the 4 columns the hot read path needs. Before adding INCLUDE, each lookup was: index seek + random heap fetch. After: index-only scan with zero heap fetches. They measured a 40% reduction in IOPS on the underlying disk — which directly reduced their provisioned storage tier and saved ~$200K/year across the fleet.
Big Word Alert — INCLUDE columns: Postgres 11+ syntax for a covering index. Columns after INCLUDE are stored in the leaf pages but are NOT part of the B-tree sort order — so they don’t bloat internal nodes. Perfect for payload-only columns.
Big Word Alert — HOT update (Heap-Only Tuple): A Postgres optimization where an UPDATE that doesn’t change any indexed column avoids updating any index — the new row version lives on the same page and is linked from the old version. Fillfactor tuning (e.g., 80%) leaves room on pages for HOT updates to happen.
Follow-up Q&A Chain:
  • Q: Why not just add every column to every index? A: Each additional column in an index adds write cost (every INSERT/UPDATE must maintain the index) and storage. Instagram reported that one team’s “add every column to indexes” spree caused a 25% drop in write throughput until they audited and removed them.
  • Q: Can I have a covering index in MySQL InnoDB? A: Yes — any secondary index whose columns fully satisfy the query avoids the bookmark lookup into the clustered index. MySQL documentation calls this “using index” in EXPLAIN output.
Further Reading:
  • use-the-index-luke.com — chapter on covering indexes (“index-only scans”)
  • postgresql.org/docs — “Index-Only Scans” and CREATE INDEX ... INCLUDE syntax
What interviewers are really testing: Whether you understand the trade-offs between different index types and can articulate when a hash index beats a B-tree — and why, despite the theoretical advantage, hash indexes are rarely used in practice.Answer:A hash index stores a hash of the indexed column and maps it to the heap tuple location. Lookup is O(1) for exact equality matches — theoretically faster than B-tree’s O(log N).Why they are rarely used over B-tree:
  • Equality only: Hash indexes support only = operator. No range queries (>, &lt;, BETWEEN), no ORDER BY, no LIKE 'prefix%'. B-tree handles all of these.
  • No covering index: Hash indexes cannot use INCLUDE columns (PostgreSQL). B-tree can.
  • Historical reliability issues: Before PostgreSQL 10, hash indexes were not WAL-logged — they would corrupt on crash and had to be rebuilt. This history made the community wary. Since PG 10, they ARE crash-safe, but the stigma persists.
  • Marginal speed advantage: For a B-tree with 100M rows, a point lookup is 3-4 page reads. For a hash index, it is 1-2 page reads. The real-world difference is microseconds — rarely the bottleneck. The flexibility loss is not worth the marginal gain.
When hash indexes make sense:
  • Very large tables where every microsecond matters AND queries are exclusively equality lookups (e.g., session token lookup, cache key lookup).
  • The hash index is smaller than the equivalent B-tree for long text keys because it stores fixed-size hashes, not the full key value.
What weak candidates say: “Hash is O(1) so it is always faster than B-tree.” Ignores the practical limitations and the fact that B-tree’s O(log N) with N=100M is only ~27 comparisons.What strong candidates say: Acknowledge the theoretical advantage but explain why B-tree’s versatility (range scans, ordering, covering) makes it the default choice. Mention the historical crash-safety issue and the niche use cases where hash indexes win.Follow-up chain:
  1. “Compare B-tree, hash, GIN, and GiST indexes. When would you choose each?” (See the dedicated B-tree vs Hash vs GiST question below.)
  2. “How does a hash index handle collisions, and what happens to performance under high collision rates?” (Overflow pages. If the hash function produces many collisions for the data distribution, performance degrades from O(1) to O(N) for the collision chain.)
  3. “In-memory databases like Redis use hash tables as the primary data structure. Why does the calculus change for in-memory vs. disk-based systems?” (No I/O cost for memory access, so the constant-factor advantage of O(1) hash lookup over O(log N) tree traversal is significant. On disk, the I/O cost dwarfs the comparison cost.)
Structured Answer Template — Hash index questions:
  1. State the theoretical win: O(1) vs O(log N) for equality lookups.
  2. Immediately flag the practical loss: no range scans, no ordering, no LIKE 'prefix%'.
  3. Note Postgres history: not WAL-logged before PG 10 — that memory still shapes adoption.
  4. Identify the niche use case: long text keys with equality-only lookups (session tokens, cache keys).
  5. Close with: “The 27 comparisons of a B-tree on 100M rows is rarely the bottleneck.”
Real-World Example: A Meta engineer evaluated switching a session-token lookup table from a B-tree to a hash index. The sessions table had 500M rows, token-lookup was the only query pattern, and tokens averaged 64 bytes. The hash index was 2x smaller on disk (stores fixed-size hashes, not full keys) and ~30% faster per lookup. But because they already had sub-millisecond p99, the win wasn’t user-visible. They kept the B-tree for operational simplicity (range scans during incident investigation). Lesson: hash indexes are valid but rarely worth the inflexibility.
Big Word Alert — Hash collision: When two different keys hash to the same bucket. Hash indexes handle this via overflow pages, but heavy collisions degrade lookup from O(1) to O(chain-length). Crypto-grade hashes (MD5, SHA-1) have ~0 collision risk; weak hashes (Object.hashCode()) can cluster badly.
Big Word Alert — GIN index: Generalized Inverted Index — used for “contains” queries on arrays, JSONB, and full-text search. Stores each element/token with a list of rows that contain it. Think of it as a search-engine-style inverted index inside the database.
Follow-up Q&A Chain:
  • Q: If hash is O(1), why isn’t it the default for primary keys? A: Because primary keys are used for more than point lookups — range scans, sort-order traversal, foreign-key joins. B-tree supports all of those in O(log N). Hash only wins for one narrow use case.
  • Q: Do any databases use hash indexes by default? A: Yes — key-value stores (Redis, DynamoDB partition keys) use hash indexing because their access pattern is exclusively point lookups. The calculus changes when range queries aren’t in the API.
Further Reading:
  • postgresql.org/docs — “Index Types” chapter covering all 6 native index types
  • use-the-index-luke.com — discussion of why B-tree dominates despite hash’s theoretical edge
What interviewers are really testing: Whether you understand sharding as a last-resort scaling strategy (not a first choice), can design a shard key that avoids hotspots, and know the operational nightmare of cross-shard operations. Staff-level candidates articulate when sharding is necessary vs. when vertical scaling or read replicas suffice.Answer:Sharding distributes rows across multiple independent database servers, each holding a subset of the data. Unlike partitioning (single server), sharding scales horizontally across machines.Sharding strategies:
  • Hash-based (shard = hash(tenant_id) % N): Even distribution, no hotspots if the key has good cardinality. Drawback: adding a shard requires rehashing and migrating data. Consistent hashing minimizes migration (only ~1/N of keys move).
  • Range-based (shard 1: IDs 1-1M, shard 2: 1M-2M): Simple, supports range queries within a shard. Drawback: hotspots if recent data is accessed more (all traffic hits the last shard). Time-based ranges for logs have this exact problem.
  • Geo-based (EU data on EU shard, US data on US shard): Reduces latency and satisfies data residency requirements (GDPR). Drawback: users who travel or multi-region organizations complicate routing.
  • Directory/lookup-based: A mapping table tracks which shard holds each entity. Most flexible but the lookup service becomes a single point of failure and a latency tax on every query.
The hard problems:
  • Cross-shard JOINs: Essentially impossible efficiently. You must denormalize or accept application-level joins. This is why the shard key choice is critical — data that is frequently joined must live on the same shard (co-location).
  • Cross-shard transactions: 2PC is slow and fragile. Most teams use eventual consistency with Saga patterns.
  • Rebalancing: Moving data between shards without downtime. Tools like Vitess (MySQL) or Citus (PostgreSQL) automate this, but it is never painless.
  • Schema migrations: Must be applied to ALL shards atomically — or at least with backward compatibility. A failed migration on shard 3 of 16 is a nightmare.
What weak candidates say: “Shard by user ID using modulo.” Cannot explain what happens when you add a shard or how cross-shard queries work.What strong candidates say: Start with “do you actually need sharding?” — explore read replicas, caching, and vertical scaling first. Then discuss shard key selection based on query patterns, co-location strategy, and operational complexity. Mention specific tools (Vitess, Citus, ProxySQL).Follow-up chain:
  1. “Your shard key is user_id but you need to query ‘all orders for product X’ across all users. How do you handle this?” (Secondary index shard — scatter-gather query to all shards and merge results. Or maintain a denormalized reverse-lookup table. Or use a search index like Elasticsearch for cross-shard queries.)
  2. “You have 4 shards and need to add 2 more. Walk me through the data migration without downtime.” (Consistent hashing to minimize movement. Dual-write during migration window. Shadow reads to verify. Cutover shard by shard. Vitess/Citus handle this operationally.)
  3. “When would you choose Vitess over Citus for a sharding solution?” (Vitess: MySQL ecosystem, battle-tested at YouTube/Slack scale, good for MySQL-native teams. Citus: PostgreSQL extension, transparent distribution, better for teams already on PostgreSQL who want to keep SQL semantics.)
Structured Answer Template — Sharding design:
  1. Justify the need: one-node write throughput or storage is genuinely exhausted.
  2. Pick a shard key aligned with the dominant query pattern (tenant_id, user_id, etc).
  3. Show how each top query routes: single-shard (fast), scatter-gather (slow), or co-located join (careful).
  4. Plan for rebalance: consistent hashing minimizes the keys that move.
  5. Close with: “The shard key is a one-way door — get it wrong and migration is measured in months.”
Real-World Example: Figma’s Postgres cluster was exhausted at ~20 TB and ~40K QPS. They chose to shard by team_id because 99% of their query patterns scoped to one team. The migration took 9 months and was documented in their “How Figma’s databases team lived to tell the scale” post. Key lesson they shared: they initially considered workspace_id but discovered ~2% of queries genuinely spanned workspaces — that 2% would have forced scatter-gather on the hot path. team_id had zero cross-shard queries in the hot path.
Big Word Alert — Shard key: The column that decides which physical shard a row lives on. The dominant query’s WHERE clause should reference it, or every query becomes a scatter-gather across all shards.
Big Word Alert — Scatter-gather query: A query that must hit every shard, collect partial results, and merge them in the coordinator or application. Fine for occasional analytics; catastrophic on the hot path because the query’s latency = slowest shard’s latency.
Follow-up Q&A Chain:
  • Q: How do you handle a migration (ADD COLUMN) across 64 shards? A: You don’t run them simultaneously — one shard at a time, with automated rollback. Deploy dual-compatible code first (works with old and new schema), then run the migration on each shard, then deploy code that uses the new schema. Stripe documented this “expand-migrate-contract” pattern.
  • Q: What’s the alternative to sharding if my writes exceed one node? A: A distributed SQL database that shards transparently: CockroachDB, Spanner, Yugabyte. Or a write-optimized NoSQL engine like Cassandra/Scylla. Both skip the pain of application-level shard-key management — at the cost of learning a new operational model.
Further Reading:
  • highscalability.com — Instagram’s and Figma’s sharding case studies
  • martin.kleppmann.com — DDIA Chapter 6 “Partitioning”
What interviewers are really testing: Whether you understand the difference between partitioning (single server, query optimization) and sharding (multiple servers, capacity scaling), can choose the right partition strategy, and know the operational gotchas (partition pruning failures, partition management).Answer:Partitioning splits a single logical table into multiple physical sub-tables on the SAME server. The query planner eliminates irrelevant partitions at query time (partition pruning), dramatically reducing I/O.Partition strategies:
  • Range partitioning: Most common. Partition by date ranges — audit_logs_2025_01, audit_logs_2025_02. Perfect for time-series data where queries always include a date range. Partition pruning turns a 2-billion-row scan into a single-partition scan.
  • List partitioning: Partition by discrete values — orders_us, orders_eu, orders_apac. Good for multi-tenant or geographic splits.
  • Hash partitioning: Partition by hash of a column. Distributes data evenly but does not support range queries efficiently. Use when you need even distribution without a natural range key.
Operational considerations:
  • Partition creation: Must create new partitions BEFORE data arrives. Automate monthly partition creation with pg_cron or a migration script. If a query inserts into a month with no partition and no DEFAULT partition, it fails with an error.
  • Partition dropping: DROP TABLE audit_logs_2023_01 is instant and reclaims space immediately. Far faster than DELETE FROM audit_logs WHERE created_at &lt; '2023-02-01' which generates dead tuples and requires VACUUM.
  • Index management: Each partition has its own indexes. A partitioned table with 48 partitions and 3 indexes = 144 individual indexes to maintain.
  • Too many partitions: More than a few hundred partitions degrades planning time. The planner must evaluate each partition. For sub-second OLTP queries, keep partition count under 100-200.
What weak candidates say: “Partitioning is like sharding but on one server.” Cannot explain partition pruning or when partitioning hurts.What strong candidates say: Explain the pruning mechanism, discuss the partition count sweet spot, mention the DROP vs. DELETE advantage for data retention, and know the common pruning failures (type mismatches, function-wrapped columns).Follow-up chain:
  1. “Your partitioned table has 500 monthly partitions. Query planning time has gone from 2ms to 200ms. What do you do?” (Consolidate old partitions — merge years-old monthly partitions into yearly ones. Or use TimescaleDB which handles hypertable chunk management automatically.)
  2. “Compare PostgreSQL declarative partitioning with MySQL partitioning. What can PostgreSQL do that MySQL cannot?” (PostgreSQL: partition-level indexes, partition-level constraints, attaching/detaching partitions dynamically, partitioned indexes. MySQL: more limited — no partition-wise joins, no sub-partitioning in recent versions.)
  3. “When would you partition vs. shard vs. use a time-series database like TimescaleDB?” (Partition: single server, query optimization. Shard: capacity beyond single server. TimescaleDB: time-series specific with automated chunking, compression, continuous aggregates.)
Structured Answer Template — Partitioning questions:
  1. Distinguish partitioning (one server) from sharding (multiple servers).
  2. Pick a strategy: range for time-series, hash for even distribution, list for geo/tenant.
  3. Lead with the benefit: partition pruning eliminates I/O on irrelevant partitions.
  4. Call out the DROP vs DELETE advantage for retention.
  5. Warn about partition count — >200 degrades planning time.
Real-World Example: Stripe partitions their events table by month. Their 30-day retention policy translates to one DROP TABLE events_2024_02 per month — instant, reclaims space immediately, avoids bloating autovacuum with billions of dead tuples. The alternative (DELETE FROM events WHERE created_at &lt; ...) would have generated a mountain of dead tuples and required aggressive vacuuming. Monthly partitioning turned retention from an ongoing operational headache into a one-line scheduled job.
Big Word Alert — Partition pruning: The planner eliminates partitions that can’t match the query’s WHERE clause. A 2-billion-row table partitioned by month turns WHERE created_at BETWEEN '2025-01-01' AND '2025-02-01' into a scan of one partition (~60M rows) instead of all 2B.
Big Word Alert — Hypertable (TimescaleDB): An automated partitioning layer on top of Postgres that creates time-based partitions (“chunks”) transparently as data arrives. Eliminates the manual partition-creation cron job and handles retention/compression in one config.
Follow-up Q&A Chain:
  • Q: Why does partition pruning sometimes fail silently? A: Data-type mismatches (query passes timestamp, partition bound is timestamptz), function-wrapped columns (WHERE date_trunc('month', created_at) = ...), or parameterized queries with older Postgres versions that don’t support runtime pruning.
  • Q: Can I partition a table after it’s already huge? A: Yes, but it requires rewriting the table. The pattern: create a new partitioned table, copy data in batches, swap names, drop the old. Tools like pg_partman help automate this.
Further Reading:
  • postgresql.org/docs — “Table Partitioning” chapter
  • TimescaleDB docs — hypertables and continuous aggregates
What interviewers are really testing: Whether you understand PostgreSQL’s MVCC cleanup mechanism, can diagnose vacuum failures, and know the catastrophic consequence of vacuum not running — transaction ID wraparound.Answer:VACUUM reclaims disk space from dead tuples (old row versions left behind by UPDATE and DELETE operations due to MVCC). Without VACUUM, tables grow indefinitely even if the logical row count stays constant.What VACUUM does:
  1. Marks dead tuples as reusable — the space is added to the Free Space Map (FSM). New INSERTs can reuse this space. Note: regular VACUUM does NOT return space to the OS — the file stays the same size.
  2. Updates the visibility map — marks pages where all tuples are visible to all transactions. This enables Index-Only Scans.
  3. Freezes old transaction IDs — prevents transaction ID wraparound (see below).
VACUUM variants:
  • VACUUM (standard): Runs concurrently with reads/writes. Does not reclaim space to OS. Fast for routine maintenance.
  • VACUUM FULL: Rewrites the entire table, reclaiming space to OS. Requires ACCESS EXCLUSIVE lock — blocks ALL operations. On a 100GB table, this can take hours. Avoid in production; use pg_repack instead.
  • VACUUM FREEZE: Aggressively freezes all tuple transaction IDs. Used when approaching wraparound danger.
Autovacuum configuration that matters:
  • autovacuum_vacuum_threshold (default 50) + autovacuum_vacuum_scale_factor (default 0.2) = autovacuum triggers when dead tuples exceed 50 + 20% of table size. For a 100M-row table, that is 20M dead tuples before autovacuum kicks in. For hot tables, set scale_factor to 0.01 per-table.
  • autovacuum_vacuum_cost_delay (default 2ms in PG 12+): How long autovacuum sleeps between I/O operations. Lower = more aggressive vacuum, higher I/O impact. Increase aggressiveness for critical tables.
  • autovacuum_max_workers (default 3): Maximum concurrent autovacuum processes. On a system with hundreds of tables, 3 workers may not be enough.
Transaction ID Wraparound — the database time bomb: PostgreSQL uses 32-bit transaction IDs (2^31 ~ 2.1 billion usable). VACUUM freezes old transaction IDs so they are no longer “live.” If vacuum cannot keep up — because it is disabled, blocked by long-running transactions, or orphaned replication slots — the database eventually hits the limit and SHUTS DOWN ALL WRITES to prevent data corruption. Recovery requires running vacuum in single-user mode. This is PostgreSQL’s most dangerous failure mode.What weak candidates say: “VACUUM cleans up deleted rows.” Cannot explain autovacuum tuning, wraparound, or when to use VACUUM FULL vs. pg_repack.What strong candidates say: Explain the MVCC relationship, discuss autovacuum tuning for high-write tables, describe the wraparound scenario from first-hand experience or deep understanding, and mention monitoring (pg_stat_user_tables.n_dead_tup, age(datfrozenxid)).Follow-up chain:
  1. “Autovacuum is running every 5 minutes on a table but dead tuple count keeps rising. What is blocking it?” (A long-running transaction or idle-in-transaction session is holding the xmin horizon. VACUUM can run but cannot remove tuples newer than the oldest active transaction. Also check for orphaned replication slots.)
  2. “How do you monitor for transaction ID wraparound risk?” (SELECT datname, age(datfrozenxid) FROM pg_database; — alert when age exceeds 500M. Also monitor pg_stat_user_tables.n_dead_tup and last_autovacuum timestamp.)
  3. “Your 500GB table needs VACUUM FULL but you cannot take downtime. What are your options?” (pg_repack rebuilds the table online using triggers to capture concurrent changes. pg_squeeze is a similar tool. Both require temporary disk space equal to the table size.)
What interviewers are really testing: Do you understand why connection pools exist at the kernel/OS level, can you pick the right pool size math, and do you know the production traps (prepared statements, SET leakage, pool vs DB max_connections mismatch)?Answer: Opening a new database connection is expensive: TCP handshake, TLS negotiation, authentication, and memory allocation for the connection’s backend process (Postgres spawns a new OS process per connection, ~10MB RSS each). At scale, this overhead becomes a bottleneck — a Node.js app with 50 serverless instances each opening 10 connections can exhaust the default max_connections=100 instantly and leave DB memory starved for buffer cache.Solution: A connection pool (PgBouncer for Postgres, ProxySQL for MySQL, HikariCP for Java, pgx for Go) maintains a set of pre-established connections. Application code borrows a connection, uses it, and returns it to the pool. The pool manages lifecycle, health checking, and recycling.PgBouncer modes:
  • Session pooling: Connection held for the entire client session. Safest (all session features work) but no multiplexing benefit — pool size must equal concurrent clients.
  • Transaction pooling: Connection returned after each transaction. Sweet spot for most apps — high multiplexing, but breaks prepared statements (server-side), SET statements, LISTEN/NOTIFY, and temp tables.
  • Statement pooling: Connection returned after each statement. Most restrictive; does not support multi-statement transactions.
Sizing math (the part most people get wrong):
  • Start from the DB’s max_connections. Leave headroom for superuser/replication (Postgres reserves ~3).
  • For Postgres, a common formula: pool_size = num_cores * 2 + effective_spindle_count. For SSD + 8 cores: 16-20 connections per DB. More is usually counterproductive (context-switching kills throughput).
  • With a pooler in front: DB max_connections=100, pooler pool_size=20, default_pool_size per database=20, client-side pool can be much larger (say 200) because the pooler multiplexes them down.
  • The #1 sizing trap: your ORM’s pool (Hibernate maximumPoolSize: 50) x N replicas of your app = total DB connections. Multiply it out before deploying.
Production gotchas:
  • Prepared statement explosion: In transaction pooling, each client’s prepared statements may land on different server connections, causing re-prepares. Disable server-side prepares (JDBC: prepareThreshold=0) or use session pooling for that service.
  • Idle-in-transaction: An app takes a connection, runs a query, then goes idle holding the connection for 30s. With 100 app instances, you are wasting 100 slots. Use idle_in_transaction_session_timeout in Postgres to kill these.
  • Pool exhaustion deadlock: Service A takes a connection, calls Service B (same pool), which tries to take another connection — pool is empty, infinite wait. Fix: always bound pool waits with a timeout; never nest connection checkouts.
Senior vs Staff perspective
  • Senior: Sets up PgBouncer in transaction mode, sizes pool from cores, adds idle timeouts, and debugs pool exhaustion with pg_stat_activity.
  • Staff: Designs the connection topology for the whole platform — PgBouncer per read replica, separate pools for OLTP vs analytics, sidecar vs centralized pooler tradeoff, connection budget per service as an SLO, and a dashboard that alerts before max_connections is hit. Also thinks about serverless: why connection pooling is a nightmare for Lambda and why you need RDS Proxy or a warm pool layer.
Follow-up chain:
  1. “Your Postgres hits FATAL: too many connections for role 'app'. Walk through your triage.” — (a) SELECT count(*), state FROM pg_stat_activity GROUP BY state; — are connections idle, active, or idle-in-transaction? (b) If many idle-in-transaction, an app is not closing transactions — check the app’s ORM for missing commits. (c) If many idle, the pool is over-sized. (d) Short-term: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND state_change < now() - interval '1 hour'; (e) Long-term: add PgBouncer, right-size the pool.
  2. “Why is connection pooling uniquely painful for AWS Lambda?” — Each Lambda invocation is a separate process. Cold starts create fresh connections; scale-out to 1000 concurrent Lambdas creates 1000 connections. Fixes: RDS Proxy (managed pooler), keep connections warm across invocations with container reuse, or switch to HTTP-based APIs over JDBC (Data API for Aurora).
  3. “What is the difference between PgBouncer in transaction mode and a serverless Postgres like Neon or Aurora Serverless v2?” — PgBouncer is a pooler; you still pay for the backend. Aurora Serverless v2 scales the DB compute itself (vCPU/RAM) based on load, but does not help with connection count — you still need a pooler in front for connection-heavy workloads. Neon has built-in pooling.
  4. “You see client_idle_timeout expiring connections mid-transaction. What is happening?” — Timeout is shorter than the longest transaction. Raise client_idle_timeout or lower transaction duration. Also check for network middleboxes (NAT gateways) with their own idle timeouts silently killing TCP sockets.
Work-sample scenario: Your API service’s p99 latency doubles during traffic spikes. pg_stat_activity shows 95 active connections out of 100 max. Walk through your fix.
  • Immediate: add PgBouncer in transaction mode, set app pool to 40 and pooler default_pool_size to 20. This multiplexes 200+ app connections down to 20 DB connections.
  • Measure: check if latency drops. If still high, the bottleneck is not connections but query time — move to pg_stat_statements to find slow queries.
  • Validate: run a load test at 2x peak traffic, confirm max_connections never goes above 80%.
  • Guardrails: alert on pg_stat_database.numbackends > 80, alert on idle_in_transaction > 60s.
What weak candidates say: “Just increase max_connections.” — Postgres memory use scales linearly with connections. Going from 100 to 1000 can 10x your RSS and push the DB into swap.What strong candidates say: “Connection pooling is about matching two very different population sizes — many short-lived app requests vs few long-lived DB backends. The pooler is a multiplexer. I size from DB capacity backwards: cores, memory per connection, then the pool, then the app pool on top. And I always add idle-in-transaction and statement timeouts — those two settings have saved more production databases than any tuning parameter.”

3. NoSQL (Mongo/Cassandra/Redis)

What interviewers are really testing: Whether you can match data model to access pattern — not just list NoSQL categories, but explain WHY each model excels at its use case and what it sacrifices.Answer:
  • Document (MongoDB, CouchDB): Stores JSON/BSON documents. Schema-flexible — each document can have different fields. Ideal when: your data is naturally hierarchical (user profiles with nested addresses, preferences), your schema evolves frequently, or you need to read/write an entire entity in one operation. Trade-off: no JOINs (use $lookup which is slow), denormalization means update anomalies.
  • Key-Value (Redis, DynamoDB): Simplest model — opaque value stored under a key. O(1) lookups. Ideal for: caching, session storage, rate limiting, leaderboards. Trade-off: no querying by value — you must know the key. DynamoDB adds secondary indexes but at heart it is still key-value optimized.
  • Wide Column (Cassandra, HBase): Rows with dynamic columns, organized by partition key and clustering key. Ideal for: time-series data, IoT sensor data, activity feeds — any workload that writes heavily and reads by partition. A single partition can hold millions of columns. Trade-off: queries MUST include the partition key. Ad-hoc queries across partitions are expensive (full cluster scan).
  • Graph (Neo4j, Amazon Neptune): Nodes and edges with properties. Traversal is O(1) per hop (index-free adjacency). Ideal for: social networks (friends-of-friends), fraud detection (circular money transfers), recommendation engines, knowledge graphs. Trade-off: poor at bulk aggregations, no standard query language (Cypher vs. Gremlin vs. SPARQL).
Decision framework: Start with your access patterns, not your data structure. If you need flexible queries and transactions, use SQL. If you need fast reads/writes on a known key, use key-value. If you need hierarchical documents with embedded data, use document. If you need high-throughput append and partition-key reads, use wide-column. If you need relationship traversal, use graph.What weak candidates say: “MongoDB is for unstructured data.” MongoDB has schemas (via validation rules) and indexes — it is for document-shaped data, not “unstructured.”What strong candidates say: Frame the choice in terms of access patterns and trade-offs, not just data shape. Mention that many teams use polyglot persistence — PostgreSQL for core OLTP, Redis for caching, Elasticsearch for search, Cassandra for time-series.Follow-up chain:
  1. “Your team wants to replace PostgreSQL with MongoDB ‘because it’s faster.’ How do you evaluate this decision?” (Faster for what? MongoDB has no JOINs, no transactions across documents (until v4.0 multi-doc transactions, which are slower than PG). If the workload is single-document reads/writes, MongoDB can be faster. If you need complex queries, aggregations, or cross-entity transactions, PostgreSQL wins.)
  2. “When would you choose DynamoDB over Redis for a key-value use case?” (DynamoDB: durable, scalable, serverless, supports secondary indexes. Redis: in-memory, sub-millisecond, rich data structures, but requires managing persistence and failover.)
  3. “How would you migrate from MongoDB to PostgreSQL — or vice versa — for a 50M-document collection?”
What interviewers are really testing: Whether you understand MongoDB’s high-availability mechanism, the Oplog-based replication model, the election process, and the consistency implications of reading from secondaries.Answer:A MongoDB replica set consists of a Primary (handles all writes) and one or more Secondaries (replicate from the primary asynchronously via the Oplog).How it works:
  • The Oplog (operations log) is a capped collection on the primary that records every write operation. Secondaries tail the Oplog and replay operations in order.
  • Election: If the primary becomes unreachable (heartbeat timeout, default 10s), secondaries hold an election using the Raft-like protocol. The secondary with the most up-to-date Oplog wins. Election takes 5-12 seconds — during which the replica set is unavailable for writes.
  • Read preference: By default, all reads go to the primary (primary). You can route reads to secondaries (secondaryPreferred) for read scaling, but you may read stale data (seconds behind primary). nearest routes to the lowest-latency node regardless of role.
Consistency implications:
  • Reading from secondaries with readPreference: secondaryPreferred means you might not see your own writes. A user updates their profile, then immediately reads it from a secondary — they see the old data. Fix: use readConcern: majority with readPreference: primaryPreferred for read-your-writes semantics.
  • Write concern w: majority ensures the write is acknowledged by a majority of nodes before returning. Combined with readConcern: majority, you get causal consistency.
What weak candidates say: “Primary handles writes, secondaries handle reads, and failover is automatic.” Cannot explain Oplog mechanics, election timing, or consistency trade-offs of reading from secondaries.What strong candidates say: Discuss Oplog tailing, election timing (the 5-12 second write unavailability window), the risk of stale reads from secondaries, and how write concern + read concern together determine the consistency level.Follow-up chain:
  1. “A secondary is falling behind the primary. The Oplog on the primary has rolled over past the secondary’s last-read entry. What happens?” (The secondary enters an unrecoverable state and must perform a full resync — copying the entire dataset from the primary. This can take hours for large datasets. Prevention: size the Oplog appropriately for your write volume.)
  2. “How does MongoDB’s replication model differ from PostgreSQL’s streaming replication?” (MongoDB: logical replication via Oplog (operations). PostgreSQL: physical replication via WAL (byte-level changes). MongoDB can replicate across different storage engines; PostgreSQL replicas must have identical physical layout.)
  3. “You need zero-downtime MongoDB version upgrades. How does the replica set election mechanism help?” (Rolling upgrade: upgrade secondaries one by one, then step down the primary, upgrade it, let it rejoin. The election promotes an upgraded secondary.)
What interviewers are really testing: Whether you understand Cassandra’s masterless architecture, tunable consistency model, and why it excels at write-heavy workloads. Also whether you can explain the data model and why query patterns must be designed before the schema.Answer:Architecture:
  • Masterless ring: Every node is equal (no primary/secondary). Data is distributed using consistent hashing — each node owns a range of the hash ring (tokens). No single point of failure.
  • VNodes (Virtual Nodes): Each physical node owns multiple token ranges (default 256 vnodes). This ensures even distribution even with heterogeneous hardware and simplifies rebalancing when adding/removing nodes.
  • Replication factor (RF): Data is copied to RF nodes. With RF=3, each piece of data lives on 3 nodes. The replication strategy determines which nodes: SimpleStrategy for single-DC, NetworkTopologyStrategy for multi-DC (specify RF per datacenter).
Tunable consistency:
  • Write consistency CL: ONE (ack from 1 replica — fastest, least safe), QUORUM (ack from majority — balanced), ALL (ack from all replicas — slowest, strongest). Formula for strong consistency: R + W > RF where R = read CL, W = write CL. With RF=3, QUORUM reads + QUORUM writes = 2+2 > 3 = strong consistency.
  • Hinted handoff: If a target replica is down during a write, the coordinator stores a “hint” and forwards it when the node recovers. Prevents write failures during temporary outages.
  • Read repair: On read, if replicas disagree, the coordinator sends the most recent version to out-of-date replicas. Happens in the background (read_repair_chance).
  • Anti-entropy repair (nodetool repair): Full data reconciliation across replicas. Must be run periodically (at least every gc_grace_seconds, default 10 days) to prevent zombie data resurrection.
Write path (why writes are fast): Write -> Commit Log (sequential, for durability) -> Memtable (in-memory sorted structure) -> Flush to SSTable on disk (immutable). No random I/O on writes. This is why Cassandra achieves 10-100x write throughput compared to PostgreSQL for append-heavy workloads.What weak candidates say: “Cassandra is a NoSQL database that scales well.” Cannot explain the consistency model, write path, or why the data model must be query-driven.What strong candidates say: Explain the write path (commit log -> memtable -> SSTable), tunable consistency with the R+W>RF formula, and emphasize that Cassandra requires you to design your tables around your queries, not your entity relationships.Follow-up chain:
  1. “You need to query Cassandra for ‘all orders by user X in the last 30 days, sorted by date.’ Design the table.” (Partition key: user_id, clustering key: order_date DESC. Primary key: ((user_id), order_date). This ensures all orders for a user are in one partition, sorted by date. One read, one partition, no scatter-gather.)
  2. “What are tombstones in Cassandra and why can they cause performance problems?” (Deletes create tombstone markers instead of removing data. During reads, Cassandra must scan through tombstones. Too many tombstones (from frequent deletes or TTL expirations) degrade read performance dramatically. gc_grace_seconds controls when tombstones are purged.)
  3. “Compare Cassandra vs. DynamoDB for a high-write IoT workload.” (Cassandra: self-managed, more tuning knobs, multi-DC natively, no vendor lock-in. DynamoDB: fully managed, auto-scaling, pay-per-request pricing, but AWS-only and less flexible data modeling.)
What interviewers are really testing: Whether you understand Redis’s durability options, the trade-offs between speed and data safety, and when Redis should NOT be your primary data store.Answer:Redis is primarily an in-memory store. Persistence is optional and comes in two flavors:
  • RDB (Redis Database Backup): Point-in-time snapshots at configurable intervals. Redis forks the process (copy-on-write), serializes the dataset to disk. Compact file, fast restart. Trade-off: you lose all data written since the last snapshot. With save 60 1000 (snapshot every 60s if 1000+ keys changed), worst case is 60 seconds of data loss.
  • AOF (Append Only File): Logs every write command to a file. Three fsync policies: always (every command — slowest, safest), everysec (batch fsync every second — good default, lose ~1s on crash), no (let OS decide — fastest, most risk). AOF files are larger and restart is slower (must replay all commands). Redis periodically rewrites the AOF to compact it (BGREWRITEAOF).
  • RDB + AOF (recommended for durability): Use both. On restart, Redis loads from AOF (more complete) if it exists, falls back to RDB. This gives you fast snapshots for backup plus near-zero data loss from AOF.
The fork problem on large datasets: RDB snapshot and AOF rewrite both use fork(). On a Redis instance with 25GB of data, the fork can take 100-500ms (during which Redis blocks all commands) and requires up to 25GB of additional memory if many pages are modified (copy-on-write). This is why production Redis instances should not exceed ~25GB per instance, and you should monitor latest_fork_usec.What weak candidates say: “Redis is just a cache, persistence doesn’t matter.” For many use cases (session storage, rate limiting, queues), data loss IS acceptable. But for Redis as a primary store (leaderboards, real-time analytics), persistence is critical.What strong candidates say: Explain both mechanisms with their specific data loss windows, discuss the fork overhead on large instances, and mention the combined RDB+AOF strategy for production. Understand that Redis Cluster + persistence still does not replace a proper database for durable transactional data.Follow-up chain:
  1. “Your Redis instance with 30GB of data is experiencing 400ms latency spikes every 5 minutes. What is happening?” (RDB snapshots or AOF rewrites triggering fork(). The fork blocks the main thread while copying page tables for 30GB of memory. Fix: reduce snapshot frequency, use a replica for persistence, or reduce instance size.)
  2. “When would you choose Redis with persistence over PostgreSQL for a primary data store?” (Almost never for relational data. But for leaderboards, real-time counters, session storage where sub-millisecond latency matters and the data model fits Redis’s structures — sorted sets, hashes, streams.)
  3. “How does Redis Sentinel vs. Redis Cluster handle failover, and how does persistence interact with each?”
What interviewers are really testing: Whether you understand probabilistic data structures and their role in database I/O optimization. This is a systems-level question that reveals whether you think about storage engine internals.Answer:A Bloom filter is a space-efficient probabilistic data structure that tells you either “definitely not in the set” or “possibly in the set.” It has zero false negatives but a tunable false positive rate (typically 1%).How it is used in Cassandra/HBase/LevelDB/RocksDB: The LSM-tree storage engine writes immutable SSTables to disk. When reading a key, the engine must check multiple SSTables to find the most recent version. Without Bloom filters, every read would require checking every SSTable on disk — potentially dozens of disk reads.With Bloom filters: each SSTable has an associated Bloom filter loaded in memory. Before reading the SSTable from disk, the engine checks the Bloom filter. If the filter says “no,” the SSTable is skipped entirely (zero I/O). If it says “maybe,” the SSTable is read. With a 1% false positive rate and 10 SSTables, the average number of unnecessary disk reads drops from 9 (all SSTables that do not contain the key) to 0.09.How it works internally: The filter is a bit array of size M. K different hash functions each map a key to a position in the array. To add a key: set all K positions to 1. To query: check all K positions. If any is 0, the key is definitely absent. If all are 1, the key is probably present (but hash collisions can cause false positives).Tuning: Increasing the bit array size reduces false positives but uses more memory. The formula: for N keys, optimal array size is -N * ln(p) / (ln(2))^2 where p is the desired false positive rate.What weak candidates say: “It is a filter that checks if something exists.” Cannot explain the false positive trade-off or why it matters for storage engines.What strong candidates say: Explain the I/O savings in the context of LSM-tree reads, mention the false positive rate tuning, and connect it to the broader SSTable compaction strategy.Follow-up chain:
  1. “Can you delete a key from a Bloom filter? What would you use instead if you need deletions?” (No — unsetting bits could remove other keys that share those bit positions. Use a Counting Bloom filter (each position is a counter, not a bit) for deletions, at the cost of 3-4x more memory.)
  2. “How does compaction in Cassandra interact with Bloom filters?” (When SSTables are compacted into a new SSTable, a new Bloom filter is generated. Fewer SSTables = fewer Bloom filter checks = faster reads. This is why compaction strategy — Size-Tiered vs. Leveled — affects read performance.)
  3. “Where else are Bloom filters used in distributed systems?” (CDN cache checking, spell checkers, network routers for packet filtering, PostgreSQL’s bloom index extension for multi-column equality searches.)
What interviewers are really testing: Whether you understand CAP beyond the oversimplified “pick two” framing, can explain what it actually means for real systems, and know that the choice is not global but per-operation.Answer:The CAP theorem states that during a network partition (P), a distributed system must choose between Consistency (C) and Availability (A). You always need partition tolerance in a distributed system (networks fail), so the real choice is: during a partition, do you reject requests (choose C) or serve potentially stale data (choose A)?The nuance most candidates miss: CAP is about behavior DURING a partition, not during normal operation. When the network is healthy, you can have both consistency and availability. The question is: what happens when nodes cannot communicate?Real systems:
  • MongoDB (CP): When the primary is unreachable, a new election takes 5-12 seconds. During that window, writes are rejected (sacrificing availability). Once a new primary is elected, consistency is maintained because there is always one authoritative writer.
  • Cassandra (AP): During a partition, all nodes continue accepting writes. When the partition heals, conflicting writes are resolved by “last write wins” (timestamp-based). This means you might read stale data during and briefly after a partition, but the system never stops serving requests.
  • PostgreSQL streaming replication (CP): With synchronous_commit = remote_apply, writes block until the replica confirms. If the replica is unreachable, writes stall (sacrificing availability for consistency). With async replication, it leans AP — writes succeed but replicas may lag.
Beyond CAP — PACELC: Eric Brewer (CAP’s author) later clarified that CAP is incomplete. PACELC adds: even when there is no Partition, do you prioritize Latency or Consistency? Cassandra: PA/EL (available during partition, low latency normally). PostgreSQL synchronous: PC/EC (consistent during partition, consistent normally but higher latency).What weak candidates say: “CAP means you pick two out of three.” This is the most common oversimplification. You always need P, so it is really C vs. A during partitions.What strong candidates say: Explain that CAP applies during partition events, discuss PACELC for normal-operation trade-offs, and give concrete examples of how real databases behave during network failures.Follow-up chain:
  1. “A microservice writes to both PostgreSQL and Redis. The Redis update fails. You now have inconsistent state. Is this a CAP problem?” (No — this is a distributed transaction problem. CAP applies to a single replicated data store. This requires Saga patterns, outbox pattern, or 2PC. Many people misapply CAP to application-level consistency.)
  2. “How does CockroachDB (NewSQL) claim to provide both C and A?” (It provides CP during partitions — unavailable nodes stop serving. Its claim to ‘availability’ is about high availability through consensus-based replication, not CAP-A during partitions. No system violates CAP.)
  3. “You are designing a shopping cart service. Would you choose CP or AP, and why?” (AP — a shopping cart should always be available, even if a user temporarily sees a slightly stale cart. Lost availability means lost revenue. Inconsistency can be resolved on checkout.)
What interviewers are really testing: Whether you understand how location-based queries work under the hood and can choose the right spatial indexing strategy for a given workload (proximity search vs. containment queries vs. route planning).Answer:Geo-spatial indexes enable efficient queries on geographic data: “find all restaurants within 2km,” “does this point fall inside this polygon,” “what is the nearest warehouse to this delivery address.”Index types:
  • R-tree / GiST (PostgreSQL PostGIS): Hierarchical bounding-box tree. Supports complex operations: containment, overlap, nearest-neighbor, intersection of polygons. PostGIS ST_DWithin, ST_Contains, ST_Intersects use GiST indexes. The gold standard for geospatial queries in relational databases.
  • Geohash: Encodes lat/lng into a string. Nearby points share a common prefix. A B-tree index on geohash supports proximity searches via LIKE 'prefix%'. Simple but has edge cases at geohash cell boundaries — points that are physically close may have different prefixes.
  • QuadTree: Recursively divides 2D space into four quadrants. Used by MongoDB’s 2d index and some in-memory spatial indices. Good for point data, less efficient for complex polygon queries.
  • S2 (Google) / H3 (Uber): Modern hierarchical spatial indexing systems. S2 uses a sphere-projected Hilbert curve. H3 uses hexagonal tiling. Both solve the geohash boundary problem. DynamoDB and BigQuery use S2 internally.
MongoDB geo-spatial:
  • 2dsphere index: For spherical (Earth) geometry. Supports $near, $geoWithin, $geoIntersects.
  • 2d index: For flat-plane geometry. Legacy, rarely used.
What weak candidates say: “Use latitude and longitude columns with a WHERE clause.” This performs a full table scan — O(N) for every query.What strong candidates say: Explain the difference between point-based proximity queries (geohash/S2 sufficient) and polygon operations (need R-tree/GiST). Mention PostGIS for complex geospatial workloads and the geohash boundary problem.Follow-up chain:
  1. “You need to find the 10 nearest drivers to a rider. Compare the geohash approach vs. PostGIS ST_DWithin.” (Geohash: fast for approximate proximity but misses edge cases at cell boundaries. PostGIS: accurate but requires GiST index and is more computationally expensive. For ride-sharing, geohash with expanding radius is often sufficient because you do not need meter-level precision for initial matching.)
  2. “How does Uber’s H3 indexing system improve on geohash?” (Hexagonal cells have uniform adjacency — every cell has 6 neighbors at equal distance, unlike geohash’s rectangular cells. This makes proximity queries more uniform and avoids the boundary problem.)
  3. “Your geo-spatial query works in testing but times out in production with 100M points. How do you optimize?”
What interviewers are really testing: Whether you understand MongoDB’s durability and consistency guarantees, can configure them for different risk tolerances, and know the performance impact of each level.Answer:Write concern controls how many replica set members must acknowledge a write before the operation returns to the client. Combined with read concern, it determines your effective consistency and durability guarantees.Levels:
  • w: 1 (default): Acknowledged by the primary only. Fastest. Risk: if the primary crashes before replicating, the write is lost. Data loss window = replication lag (~100ms typically).
  • w: "majority": Acknowledged by a majority of voting members (2 of 3 in a typical replica set). The write is on multiple nodes before acknowledgment. Survives primary failure. 2-5ms overhead per write for the cross-node acknowledgment.
  • w: 0 (fire-and-forget): No acknowledgment. The driver does not wait for any response. Maximum throughput, maximum risk. Use only for logging or metrics where data loss is acceptable.
  • j: true (journaling): Write is flushed to the on-disk journal before acknowledgment. Without j: true, the acknowledged write may still be in the WiredTiger memory buffer and lost on crash. w: "majority" implies j: true in MongoDB 5.0+.
  • wtimeout: Maximum time to wait for write concern satisfaction. Without a timeout, a network partition can cause the application to hang indefinitely waiting for majority acknowledgment.
The production decision: w: "majority", j: true, wtimeout: 5000 is the safe default for critical data (orders, payments, user accounts). w: 1 is acceptable for high-throughput, lower-criticality data (analytics events, activity logs). w: 0 is for fire-hose data where throughput matters more than any individual write.What weak candidates say: “Just use w: majority for everything.” Does not consider the latency impact or know about wtimeout.What strong candidates say: Match write concern to the business criticality of the data. Explain the interaction between write concern and read concern for causal consistency. Mention wtimeout to prevent indefinite hangs during partitions.Follow-up chain:
  1. “You set w: majority but the application is now 3x slower on writes. How do you investigate?” (Check replica lag — if a secondary is far behind, the majority acknowledgment waits for it. Check network latency between nodes. Consider that majority requires waiting for the slowest majority member.)
  2. “What is the difference between w: majority and w: 2 in a 3-node replica set?” (Both require 2 acknowledgments. But w: majority is topology-aware — if you add a 4th node, majority becomes 3. w: 2 is fixed. w: majority is preferred because it scales with topology changes.)
  3. “How does write concern interact with retryable writes in MongoDB 4.2+?”
What interviewers are really testing: Whether you understand Redis’s threading model, why single-threaded design works for an in-memory store, and the limitations this creates at scale.Answer:Redis uses a single thread for command execution (the event loop that processes client commands). This is a deliberate design choice:Why single-threaded works:
  • No locks, no context switches: Every operation is atomic by default. INCR, LPUSH, SETNX — all guaranteed atomic without needing mutexes. This simplifies the codebase and eliminates entire classes of concurrency bugs.
  • CPU is not the bottleneck: Redis operations are in-memory and take microseconds. The bottleneck is network I/O and memory bandwidth, not CPU computation. A single core can process 100K-200K operations per second.
  • Simplicity: The entire data structure library is lock-free because only one thread mutates state. This is why Redis can offer complex atomic operations (MULTI/EXEC, Lua scripts) without the complexity of multi-threaded data structures.
What changed in Redis 6+:
  • I/O threading: Redis 6 introduced multi-threaded I/O — separate threads handle reading from and writing to client sockets, while the main thread still processes commands. This eliminates the network I/O bottleneck and increases throughput by 2x on multi-core machines. The command execution is still single-threaded.
  • Threaded lazy-free: Background threads handle memory deallocation for large keys (e.g., deleting a 10M-element sorted set). Without this, DEL on a large key blocks the main thread for seconds.
Scaling beyond one core:
  • Redis Cluster: Shard data across multiple Redis instances (each single-threaded), distributed across cores or machines. Each shard handles a subset of the keyspace (16384 hash slots).
  • Multiple instances per server: Run 4-8 Redis instances on an 8-core machine, each pinned to a core. Simple but requires client-side routing.
What weak candidates say: “Redis is single-threaded so it is slow for concurrent users.” Confuses single-threaded with serial processing — the event loop handles thousands of concurrent connections efficiently.What strong candidates say: Explain why single-threading is a feature (atomicity, simplicity), discuss the I/O threading improvement in v6+, and know the scaling patterns (Cluster, multi-instance) for CPU-bound workloads.Follow-up chain:
  1. “You have a Redis instance processing 150K ops/sec and need to get to 500K. What are your options?” (Enable I/O threads (io-threads 4). If still insufficient, shard with Redis Cluster. Also check if pipelining is being used — pipelining batches commands, reducing network roundtrips and increasing throughput 5-10x.)
  2. “A DEL command on a key containing a 5M-element sorted set causes a 2-second latency spike. Why and how do you fix it?” (DEL frees memory synchronously on the main thread. Use UNLINK instead — it schedules memory deallocation on a background thread and returns immediately.)
  3. “How does Redis’s single-threaded model compare to Memcached’s multi-threaded model? When would you choose Memcached?” (Memcached is multi-threaded and scales better on a single instance for simple get/set operations. Choose Memcached for pure caching with simple key-value needs. Choose Redis for data structures, persistence, pub/sub, Lua scripting.)
What interviewers are really testing: Whether you understand when a graph database genuinely outperforms a relational database and can articulate the performance difference — not just “graphs are for relationships.”Answer:Graph databases store data as nodes (entities) and edges (relationships), both with properties. The key innovation is index-free adjacency — each node physically stores pointers to its neighbors. Traversal from one node to the next is O(1), regardless of graph size.Use cases where graph databases genuinely shine:
  • Fraud detection: Find circular money transfers (A -> B -> C -> A) in banking transactions. In SQL, this requires recursive CTEs with self-JOINs that grow exponentially. In Neo4j: MATCH (a)-[:TRANSFER*3..5]->(a) RETURN a — follows the chain directly.
  • Social networks: “Friends of friends who also like X” — 2-3 hop traversals. SQL JOINs for this pattern are O(N^k) where k is the depth. Graph traversal is O(D^k) where D is the average degree (much smaller).
  • Recommendation engines: “Users who bought X also bought Y” — collaborative filtering via graph patterns.
  • Knowledge graphs: Entity-relationship modeling for AI/ML features, semantic search, ontology management.
  • Network topology: Route planning, dependency analysis, impact analysis (if server X goes down, what services are affected?).
When NOT to use a graph database:
  • Bulk aggregations (SELECT COUNT, SUM, GROUP BY) — graph DBs are terrible at this.
  • Simple CRUD with no relationship-heavy queries — overhead without benefit.
  • When your data fits neatly in tables and your queries are predictable JOINs.
  • Write-heavy workloads — graph databases typically have lower write throughput than relational DBs.
What weak candidates say: “Graph databases are for any data with relationships.” ALL data has relationships — that is what JOINs are for. The question is whether the relationship traversal pattern makes graph superior.What strong candidates say: Articulate the specific access pattern (multi-hop traversals, variable-depth paths, pattern matching) that makes graphs win. Provide performance comparisons — “a 4-hop friend-of-friend query on 100M users takes 2ms in Neo4j but 30 seconds in PostgreSQL.”Follow-up chain:
  1. “Can you model a social network in PostgreSQL with recursive CTEs and achieve similar performance to Neo4j?” (For 1-2 hops, PostgreSQL with proper indexes is competitive. At 3+ hops, the JOIN explosion makes SQL impractical. Neo4j maintains constant per-hop cost regardless of depth.)
  2. “Compare Neo4j’s Cypher with Apache TinkerPop’s Gremlin. When would you choose each?” (Cypher: declarative, SQL-like, easier to learn, Neo4j-specific. Gremlin: imperative/traversal-based, runs on multiple backends (Neptune, JanusGraph, CosmosDB), better for complex programmatic traversals.)
  3. “How would you handle a graph that grows to billions of edges? Can Neo4j handle this or do you need a different approach?” (Neo4j Enterprise can handle billions of edges on a single machine with sufficient RAM. For truly massive graphs, consider JanusGraph on Cassandra/HBase for distributed storage, or Amazon Neptune for managed service.)

4. Advanced SQL Scenarios

What interviewers are really testing: Whether you can solve analytical problems without resorting to subqueries or application code, understand partitioning vs. ordering in window frames, and know the performance implications.Answer:Window functions perform calculations across a set of rows related to the current row WITHOUT collapsing the result set (unlike GROUP BY). The OVER clause defines the window — which rows to include and how to order them.The three components of a window function:
  1. PARTITION BY: Divides rows into groups. Each partition is processed independently. Like GROUP BY but without collapsing rows.
  2. ORDER BY: Defines the sort order within each partition. Required for ranking and running aggregate functions.
  3. Frame clause: Defines which rows relative to the current row are included. ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (default for running totals), ROWS BETWEEN 6 PRECEDING AND CURRENT ROW (7-day moving average).
Common functions and real-world use:
  • ROW_NUMBER(): Sequential numbering within partitions. Use case: deduplication — WHERE row_num = 1 keeps only the latest record per group.
  • RANK() / DENSE_RANK(): Ranking with ties. RANK skips positions after ties (1, 2, 2, 4). DENSE_RANK does not (1, 2, 2, 3).
  • LAG(col, N) / LEAD(col, N): Access previous/next row’s value. Use case: calculate day-over-day revenue change: revenue - LAG(revenue, 1) OVER (ORDER BY date).
  • SUM() OVER (ORDER BY date): Running total. Accumulating revenue over time.
  • NTILE(4): Divide rows into N equal groups (quartiles, percentiles).
Performance consideration: Window functions require sorting the data by the PARTITION BY + ORDER BY columns. On large tables, this sort can spill to disk if work_mem is insufficient. If you frequently window by the same columns, an index on (partition_col, order_col) eliminates the sort step.What weak candidates say: “Window functions are like GROUP BY.” They are fundamentally different — GROUP BY collapses rows, window functions preserve them. Cannot write a window function from memory.What strong candidates say: Write examples fluently, explain the frame clause, mention performance implications of the sort, and know when a window function replaces a correlated subquery.Follow-up chain:
  1. “Write a query that returns the top 3 highest-paid employees per department, handling ties.” (WITH ranked AS (SELECT *, DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) as rnk FROM employees) SELECT * FROM ranked WHERE rnk &lt;= 3;)
  2. “What is the difference between ROWS BETWEEN and RANGE BETWEEN in the frame clause?” (ROWS operates on physical row positions. RANGE operates on logical values — all rows with the same ORDER BY value are treated as peers. RANGE can produce unexpected results if ORDER BY values are not unique.)
  3. “You have a query with 3 different window functions, each with a different PARTITION BY. What is the performance impact?” (Each distinct window specification requires a separate sort pass. Three different PARTITION BY clauses = three sorts. Define a common window with WINDOW w AS (...) if partitions overlap, but the planner still needs separate sorts for different partition schemes.)
What interviewers are really testing: Whether you understand CTEs as a readability tool vs. a performance tool, know the materialization behavior, and can use recursive CTEs for hierarchical data.Answer:A CTE (WITH clause) defines a named temporary result set that exists for the duration of a single query. Think of it as a named subquery that you can reference multiple times.Basic CTE:
WITH active_users AS (
  SELECT * FROM users WHERE status = 'active'
)
SELECT department, COUNT(*) FROM active_users GROUP BY department;
Materialization behavior (critical to understand):
  • PostgreSQL 12+: CTEs are inlined by default (not materialized) — the optimizer can push predicates into the CTE and optimize the whole query as one plan. You can force materialization with AS MATERIALIZED or prevent it with AS NOT MATERIALIZED.
  • PostgreSQL < 12: CTEs are ALWAYS materialized — they act as an optimization fence. This means a CTE that returns 10M rows materializes all of them even if the outer query only needs 10. This is a common performance trap.
  • MySQL 8.0+: CTEs are supported and generally inlined by the optimizer.
Recursive CTE (the powerful use case): Used for hierarchical/tree data: org charts, category trees, bill of materials.
WITH RECURSIVE org_tree AS (
  SELECT id, name, manager_id, 1 as depth
  FROM employees WHERE manager_id IS NULL  -- root (CEO)
  UNION ALL
  SELECT e.id, e.name, e.manager_id, ot.depth + 1
  FROM employees e JOIN org_tree ot ON e.manager_id = ot.id
)
SELECT * FROM org_tree;
Safety: Always add a depth limit (WHERE depth &lt; 20) or use CYCLE detection (PostgreSQL 14+) to prevent infinite recursion from circular references.What weak candidates say: “CTEs are just named subqueries for readability.” Missing the materialization trap and recursive capability.What strong candidates say: Discuss materialization behavior across PostgreSQL versions, use recursive CTEs fluently, and mention the optimization fence issue in older PostgreSQL versions.Follow-up chain:
  1. “A CTE-based query runs in 200ms in PostgreSQL 12 but 15 seconds in PostgreSQL 11. Same data, same indexes. Why?” (In PG 11, the CTE is materialized as an optimization fence. The outer query’s WHERE clause cannot be pushed into the CTE, so it materializes the full result set then filters. In PG 12, the CTE is inlined and the planner optimizes the whole query.)
  2. “When would you intentionally use AS MATERIALIZED on a CTE?” (When the CTE is referenced multiple times and the result set is small. Without materialization, the subquery executes once per reference. With materialization, it executes once and the result is reused. Also useful to force a specific plan shape.)
  3. “Can recursive CTEs replace a graph database for relationship traversals?” (For simple hierarchies — yes. For complex graph patterns, multi-hop variable-depth traversals, and pattern matching — no. Recursive CTEs have exponential cost growth at depth, while graph DBs maintain constant per-hop cost.)
What interviewers are really testing: Whether you understand that a table can be joined to itself and can apply this pattern to hierarchical data, comparison queries, and deduplication.Answer:A self-join joins a table to itself using different aliases. It treats the same table as if it were two separate tables. This is the foundation for hierarchical queries in relational databases.Classic examples:
  • Employee-Manager hierarchy: SELECT e.name AS employee, m.name AS manager FROM employees e LEFT JOIN employees m ON e.manager_id = m.id;
  • Find duplicates: SELECT a.* FROM users a JOIN users b ON a.email = b.email AND a.id > b.id; — finds rows with the same email but different IDs.
  • Compare consecutive rows: SELECT a.date, a.revenue, b.revenue AS prev_revenue FROM daily_sales a JOIN daily_sales b ON a.date = b.date + INTERVAL '1 day'; (Though LAG() window function is cleaner for this.)
  • Find pairs: SELECT a.name, b.name FROM employees a JOIN employees b ON a.department = b.department AND a.id &lt; b.id; — all pairs of employees in the same department (without duplicates).
Performance consideration: Self-joins can be expensive if the table is large and the join condition is not indexed. An index on the join column (manager_id, email, department) is essential.What weak candidates say: “You cannot join a table to itself.” Or can only explain the employee-manager example.What strong candidates say: Give multiple use cases, use proper aliasing, and mention that window functions (LAG, LEAD) have replaced many traditional self-join patterns for sequential comparisons.Follow-up chain:
  1. “Rewrite the employee-manager self-join to show the full management chain (CEO -> VP -> Director -> Manager -> Employee) using a recursive CTE.”
  2. “A self-join on a 10M-row table is running slowly despite an index on the join column. What could be wrong?” (The join is producing a Cartesian product within groups — if 1000 employees share a department, the self-join produces 1M pairs for that department alone. Use window functions or add more restrictive join conditions.)
  3. “When would you use a self-join vs. a window function for comparing rows within the same table?”
What interviewers are really testing: Whether you have debugged real performance issues in ORMs and understand the difference between lazy loading traps and efficient data fetching. This is one of the most common production performance problems and a strong signal of hands-on experience.Answer:The N+1 problem occurs when code executes 1 query to fetch a list of N parent records, then executes N additional queries to fetch related data for each parent — resulting in N+1 total queries instead of 1-2.Concrete example (the ORM trap):
-- Query 1: Fetch all orders (returns 100 orders)
SELECT * FROM orders WHERE status = 'pending';

-- Query 2-101: For EACH order, fetch the customer (N=100 extra queries)
SELECT * FROM customers WHERE id = 1;
SELECT * FROM customers WHERE id = 2;
-- ... 98 more queries
Why it kills performance: Each query has network roundtrip overhead (~1-5ms on localhost, 10-50ms across availability zones). 100 extra queries add 100ms-5s of pure latency, even if each individual query is fast. At 10,000 rows, this becomes catastrophic.Fixes (from simplest to most sophisticated):
  1. Eager loading / JOIN: SELECT orders.*, customers.name FROM orders JOIN customers ON ... — one query, all data.
  2. Batch loading / IN clause: SELECT * FROM customers WHERE id IN (1, 2, 3, ..., 100) — two queries total. This is what Facebook’s DataLoader pattern uses.
  3. ORM-specific: Django select_related() (JOIN) and prefetch_related() (batch IN). SQLAlchemy joinedload() and subqueryload(). Rails includes().
  4. GraphQL DataLoader: Batches and deduplicates database requests within a single request cycle. Essential for GraphQL APIs where N+1 is structural due to field-level resolution.
Detection: Enable query logging and look for repeated identical query patterns. PostgreSQL pg_stat_statements shows query frequency. Django Debug Toolbar and Rails Bullet gem catch N+1 at development time.Red flag answer: “Just add an index” — indexes do not fix N+1, they make each of the N queries slightly faster but the fundamental problem (too many roundtrips) remains. Also, not being able to explain how an ORM causes this problem.Follow-up questions:
  • “You are using GraphQL and every query triggers N+1 on nested fields. How does DataLoader solve this?” (DataLoader collects all IDs requested within a single tick of the event loop, then issues one batched query. It also deduplicates — if the same customer ID is requested 5 times, it fetches it once.)
  • “Can N+1 happen with NoSQL databases?” (Yes — fetching a list of documents from MongoDB, then fetching referenced documents one by one is the same pattern. MongoDB $lookup is the aggregation pipeline equivalent of a JOIN.)
  • “Your team’s Django API has 200ms average response time. You enable query logging and see 47 queries per request. Walk me through your debugging process.” (Identify N+1 patterns in the logs, add select_related and prefetch_related to the ORM queries, verify with Django Debug Toolbar, and set up query count assertions in tests to prevent regression.)
What interviewers are really testing: Whether you can choose the right concurrency strategy for a given workload, understand the failure modes of each, and have dealt with race conditions in production. This is a senior-level question that separates theoretical knowledge from real system design experience.Answer:
  • Pessimistic Locking: Assume conflicts are frequent. Lock the resource before modifying it. SELECT ... FOR UPDATE acquires a row-level exclusive lock — no other transaction can read (with FOR UPDATE) or modify that row until the lock is released on COMMIT/ROLLBACK.
    • When to use: High contention scenarios like bank account balance updates, inventory decrements during flash sales, or seat reservation systems. Any time two concurrent requests frequently target the same row.
    • Danger: Deadlocks. Transaction A locks row 1 then tries to lock row 2. Transaction B locks row 2 then tries to lock row 1. Both wait forever. PostgreSQL detects this and kills one transaction, but your application needs retry logic.
    • Performance cost: Lock contention serializes access, reducing throughput on hot rows.
  • Optimistic Locking: Assume conflicts are rare. Read the row with its version number, do your work, then update with a WHERE clause that checks the version has not changed: UPDATE accounts SET balance=X, version=version+1 WHERE id=1 AND version=5. If another transaction modified the row (version is now 6), the UPDATE affects 0 rows and your application retries.
    • When to use: Low contention scenarios like editing a blog post, updating user profile settings, or any CRUD operation where concurrent modification of the same record is uncommon.
    • Danger: Under high contention, retry storms. If 100 concurrent requests all try to update the same row optimistically, 99 will fail on each round, retry, and 98 will fail again. This amplifies load instead of serializing it.
    • Implementation: A version integer column (increment on every update) or a updated_at timestamp column (less precise, can have collisions within the same millisecond).
The production decision framework: Use pessimistic locking when the cost of a retry is high (payment processing) or contention is guaranteed (hot counter). Use optimistic locking when contention is rare and retries are cheap (content editing, settings updates). Some systems combine both: optimistic for reads, pessimistic for the final commit step.Red flag answer: “Always use pessimistic locking because it is safer” — this shows no understanding of the throughput implications. Also, implementing optimistic locking with a timestamp instead of a version counter and not mentioning the precision problem.Follow-up questions:
  • “You have a flash sale where 10,000 users try to buy the last 100 items simultaneously. Which locking strategy do you use and why?” (Pessimistic with SELECT ... FOR UPDATE SKIP LOCKED — this lets transactions grab the next available item without waiting for locked ones, dramatically reducing contention. This is how ticket reservation systems work.)
  • “How do ORMs implement optimistic locking?” (Django: F() expressions for atomic updates, or django-concurrency package. SQLAlchemy: version_id_col mapper option. Rails: built-in lock_version column with ActiveRecord.)
  • “What is the difference between SELECT ... FOR UPDATE and SELECT ... FOR SHARE?” (FOR UPDATE takes an exclusive lock — no other transaction can lock or update the row. FOR SHARE takes a shared lock — other transactions can also take FOR SHARE but cannot FOR UPDATE or modify. Use FOR SHARE when you need to ensure data is not modified while you read it but do not plan to modify it yourself.)
What interviewers are really testing: Whether you understand the trade-offs of semi-structured data in a relational database, can choose the right index type for JSON queries, and know when JSONB is appropriate vs. normalized columns.Answer:PostgreSQL’s JSONB stores binary-parsed JSON that is indexable and queryable. Two index strategies:GIN (Generalized Inverted Index):
  • CREATE INDEX idx_data ON events USING gin (data); — indexes ALL keys and values in the JSONB document. Supports @> (containment), ? (key existence), ?| (any key exists), ?& (all keys exist).
  • Example: SELECT * FROM events WHERE data @> '{"type": "purchase", "amount": 100}'; — the GIN index finds matching documents without a full table scan.
  • Trade-off: GIN indexes are large (they index every key-value pair) and slow to update on writes. On a table with 10M rows and complex JSONB documents, the GIN index can be 3-5x the size of a B-tree index on a scalar column.
B-tree on expression:
  • CREATE INDEX idx_type ON events ((data->>'type')); — indexes a specific extracted value. Faster and smaller than GIN for known query patterns.
  • Supports =, &lt;, >, BETWEEN on the extracted value. Cannot support arbitrary key lookups.
GIN with jsonb_path_ops:
  • CREATE INDEX idx_data ON events USING gin (data jsonb_path_ops); — optimized for @> containment queries only. 2-3x smaller than default GIN operator class. Faster for containment but does not support key existence queries (?, ?|).
When JSONB vs. normalized columns: Use JSONB for: user preferences, feature flags, flexible metadata where the schema varies per record, and rapid prototyping. Use normalized columns for: data you filter, sort, or aggregate frequently — B-tree on a regular column is always faster and smaller than GIN on JSONB.What weak candidates say: “Put a GIN index on it.” Cannot explain the difference between default GIN and jsonb_path_ops, or when to use expression indexes.What strong candidates say: Choose the index type based on query patterns. Use expression indexes for known access patterns, GIN for flexible querying, and acknowledge the write overhead of GIN indexes.Follow-up chain:
  1. “Your JSONB column stores nested objects 5 levels deep. Can GIN index nested keys, and what are the performance implications?” (GIN indexes nested keys for containment queries. But deeply nested containment checks can be slow because the index stores flattened key paths.)
  2. “Compare PostgreSQL JSONB indexing with MongoDB’s document indexing. Which is more mature?” (MongoDB: native document model, more natural index types for documents, multi-key indexes on arrays. PostgreSQL: JSONB is a bolt-on to relational model, powerful but GIN has higher write overhead. MongoDB wins for document-primary workloads.)
  3. “When would you use JSONB vs. a separate document store like MongoDB alongside PostgreSQL?”
What interviewers are really testing: Whether you can evolve a database schema on a live production system without downtime, understand the locking implications of DDL operations, and have a strategy for backward compatibility during rolling deployments.Answer:Schema migrations in production require understanding two constraints: (1) DDL locking behavior — some operations block reads/writes, and (2) backward compatibility — during a rolling deploy, old and new application code run simultaneously against the same database.Safe operations (fast, minimal locking):
  • Add column with default (PostgreSQL 11+): ALTER TABLE t ADD COLUMN c INT DEFAULT 0; — metadata-only, instant. The default is stored in system catalogs, not written to every row.
  • Add nullable column: Always instant (no default needed, no rewrite).
  • Create index concurrently: CREATE INDEX CONCURRENTLY idx ON t(col); — does not block writes. Takes longer (2 passes) but safe for production.
  • Add CHECK constraint NOT VALID: ALTER TABLE t ADD CONSTRAINT chk CHECK (col > 0) NOT VALID; — does not scan existing rows. Then ALTER TABLE t VALIDATE CONSTRAINT chk; separately with a weaker lock.
Dangerous operations (require strategy):
  • Remove column: First deploy code that no longer references the column. Then ALTER TABLE t DROP COLUMN c; — instant in PostgreSQL (marks column as dropped, does not rewrite). But if old code is still running during rolling deploy, it will break. The expand-contract pattern: (1) stop writing to column, (2) deploy code that ignores column, (3) drop column.
  • Rename column: Effectively a remove + add from the application’s perspective. Use the expand-contract pattern: add new column, dual-write, migrate reads, drop old column.
  • Change column type: Often requires a full table rewrite (ACCESS EXCLUSIVE lock). For large tables, use the shadow table approach: create new table with correct type, sync via triggers, atomic swap.
  • Add NOT NULL to existing column: Requires a full table scan to verify (ACCESS EXCLUSIVE lock). In PostgreSQL 12+, use SET NOT NULL NOT VALID + separate VALIDATE to reduce lock duration.
Tooling:
  • Flyway / Liquibase: Version control for migrations. Execute SQL scripts in order, track which migrations have been applied. Good for small-medium teams.
  • gh-ost (MySQL): Online schema change via binlog replication to a shadow table. Essential for MySQL because ALTER TABLE rewrites the table.
  • pgroll (PostgreSQL): Schema versioning with automatic expand-contract pattern. Multiple schema versions coexist during migration.
  • Sqitch: Dependency-based migration system with revert capabilities.
What weak candidates say: “Just run ALTER TABLE.” Cannot explain locking implications or backward compatibility during rolling deployments.What strong candidates say: Walk through the expand-contract pattern, explain which DDL operations are safe vs. dangerous in their database of choice, and set lock_timeout before any DDL on production tables.Follow-up chain:
  1. “You need to split a full_name column into first_name and last_name on a 500M-row table with zero downtime. Walk me through every step.” (Add two nullable columns (instant). Backfill in batches with UPDATE ... WHERE id BETWEEN X AND Y. Add trigger to dual-write for new inserts. Deploy new code to read from new columns. Verify no reads/writes to old column. Drop old column. This takes days to execute safely.)
  2. “What happens if a migration fails halfway through? How do you handle rollback?” (Depends on the tool. Flyway: wrap in a transaction (DDL is transactional in PostgreSQL). MySQL: DDL is not transactional — partial state requires manual cleanup. Always test migrations on a production-size staging database first.)
  3. “How do you handle migrations in a microservices architecture where multiple services share a database?” (Each service should own its tables. Shared schemas are a coupling anti-pattern. If unavoidable, use versioned views — each service reads from its own view, and migrations update the underlying tables + views together.)
What interviewers are really testing: Whether you understand SQL injection beyond the textbook example, know the defense-in-depth approach, and can explain second-order injection and ORM-level risks.Answer:SQL injection occurs when untrusted input is concatenated into a SQL query string, allowing an attacker to modify the query’s logic.Classic example: SELECT * FROM users WHERE username = ' + input + '. If input is ' OR '1'='1' --, the query becomes SELECT * FROM users WHERE username = '' OR '1'='1' --' — returns all users, bypassing authentication.Defenses (defense in depth):
  1. Parameterized queries / prepared statements: The #1 defense. SELECT * FROM users WHERE username = $1 with the input passed as a parameter. The database treats the parameter as data, never as SQL syntax. Every modern language/ORM supports this.
  2. ORM usage: ORMs like SQLAlchemy, Django ORM, and ActiveRecord parameterize queries by default. But raw SQL within an ORM (Model.objects.raw(), execute()) can still be vulnerable if you concatenate strings.
  3. Input validation / allowlisting: Validate that inputs conform to expected patterns. An order_by parameter should only accept known column names — do not pass user input directly into ORDER BY (which cannot be parameterized in most databases).
  4. Least privilege: The database user for the application should have only SELECT/INSERT/UPDATE/DELETE on specific tables. No DROP TABLE, no CREATE, no access to pg_catalog. Even if injection succeeds, the damage is limited.
  5. WAF (Web Application Firewall): A secondary defense layer that detects common injection patterns. Not a primary defense — too many bypass techniques exist.
Second-order injection (the subtle attack): Malicious input is stored in the database (properly escaped during insertion), then later used in a query without parameterization. Example: a username stored as admin'-- is later used in a report query that concatenates the stored value into SQL.What weak candidates say: “Use prepared statements.” Correct but incomplete — cannot explain second-order injection, ORDER BY injection, or the importance of least privilege.What strong candidates say: Layered defense starting with prepared statements, discuss ORM edge cases, mention second-order injection, and emphasize that input validation alone is insufficient — it must be combined with parameterization.Follow-up chain:
  1. “An ORM-heavy application is ‘safe from injection.’ Do you agree?” (No. Raw SQL, dynamic ORDER BY, LIKE patterns with unescaped user input, and bulk operations often bypass the ORM’s protection. Also, NoSQL injection exists — MongoDB $gt operator injection is the document-store equivalent.)
  2. “How do you handle SQL injection in dynamic query builders where column names and table names come from user input?” (Allowlist valid column/table names in application code. Never pass user input as identifiers. Use an enum or lookup table to map user-facing names to actual column names.)
  3. “Your security audit found a stored procedure that concatenates input into dynamic SQL. The team says ‘it is server-side so it is safe.’ Is it?” (Absolutely not. SQL injection inside a stored procedure is equally dangerous. Dynamic SQL in stored procedures must use parameterized execution: EXECUTE ... USING in PostgreSQL, sp_executesql in SQL Server.)
What interviewers are really testing: Whether you understand the full operational impact of soft deletes — not just the concept, but the index bloat, query performance degradation, and GDPR compliance implications.Answer:Soft delete marks a record as deleted (typically deleted_at TIMESTAMP NULL) instead of physically removing it. The record remains in the database for recovery, auditing, or undo functionality.Pros:
  • Recovery: Accidentally deleted data can be restored instantly (UPDATE SET deleted_at = NULL).
  • Audit trail: Historical data is preserved for compliance and debugging.
  • Referential integrity: Foreign key references to soft-deleted records do not break. Hard deletes cascade or leave dangling references.
  • Undo functionality: User-facing “recycle bin” or “trash” features.
Cons (the part most candidates underestimate):
  • Every query needs WHERE deleted_at IS NULL: If a developer forgets this filter, deleted records appear in results. Use a default scope in the ORM (Django’s SoftDeleteManager, Rails’ default_scope) or a PostgreSQL view that filters automatically.
  • Index bloat: A table with 100M rows where 80M are soft-deleted means indexes contain 80M useless entries. Fix: partial indexes: CREATE INDEX idx_active ON users (email) WHERE deleted_at IS NULL; — index only contains live records, 5x smaller.
  • Unique constraint conflicts: If a user soft-deletes their account (email foo@bar.com), they cannot re-register with the same email. Fix: partial unique index: CREATE UNIQUE INDEX idx_email ON users (email) WHERE deleted_at IS NULL; — uniqueness only enforced among active records.
  • GDPR/data retention: “Right to be forgotten” requires actual deletion, not soft delete. You may need a separate hard-delete process that runs after the audit retention period.
  • Foreign key complexity: Cascading soft deletes across related tables requires application logic — the database does not natively support cascading soft deletes.
What weak candidates say: “Add a deleted_at column and filter on it.” Cannot explain partial indexes, unique constraint issues, or GDPR implications.What strong candidates say: Discuss partial indexes for performance, the unique constraint problem with solutions, mention GDPR hard-delete requirements, and recommend a tiered approach (soft delete for user-facing recovery, hard delete after retention period).Follow-up chain:
  1. “Your soft-deleted users table has 200M total rows but only 5M active. Queries are slow despite having an index on the lookup column. Why?” (The index includes all 200M rows. The planner may choose a seq scan because the index is large. Fix: partial index WHERE deleted_at IS NULL that only indexes the 5M active rows.)
  2. “Compare soft delete vs. event sourcing for maintaining history. When would you choose each?” (Soft delete: simple, good for single-entity recovery. Event sourcing: full history of all state changes, enables replaying events and time-travel queries, but much more complex infrastructure. Choose event sourcing for financial transactions, audit-critical systems, or when you need to reconstruct state at any point in time.)
  3. “How do you implement cascading soft deletes across related tables?” (Application-level: when soft-deleting a parent, also soft-delete children. Or use a trigger. Or use a message queue — publish a “user deleted” event and let each service handle its own soft-delete. There is no native database support for cascading soft deletes.)

5. Operations & Scaling

Answer:
  • Full: Whole DB.
  • Differential: Changes since last Full.
  • Transaction Log: Stream of ops (Point in Time Recovery).
Answer: Capture writes from Transaction Log (Debezium reading Postgres WAL) -> Stream to Kafka -> DW. Zero impact on Query performance.
Answer: Offload Read traffic. Async replication. Reporting/Analytics queries.
Answer: Cycle of locks. DB detects and kills one transaction. App must Retry. Prevention: Acquire locks in same order.
Answer: Moving BLOB columns (Image, Text) to separate table. Keeps main table rows small (more fit in RAM page).
Answer: Treating multiple DBs as one. Postgres Foreign Data Wrapper (FDW). Query Mongo from SQL.
Answer: Optimized for append-only timestamped data. Compression (Delta-Delta), Downsampling (Rollups), Retention policies.
Answer: Immutable, cryptographically verifiable log changes. Financial/Supply Chain audit.
Answer: DB enforces “User can only see their own rows”. Policy defined in SQL.
Answer: Distributed Transaction.
  1. Prepare: All nodes vote “Yes/No” (Lock resources).
  2. Commit: If all Yes, persist. Slow. Blocked if Coordinator dies.

Advanced Scenario-Based Questions

Scenario: You’re on-call and get paged at 3 AM. The alerting dashboard shows order processing has stalled. Application logs are flooded with ERROR: deadlock detected from Postgres. The checkout flow is completely broken. Revenue is bleeding at ~$12K/minute. Walk me through your investigation and resolution.What weak candidates say:
  • “I’d restart the database.”
  • “Deadlocks happen when two transactions lock each other. I’d just add retry logic.”
  • Can’t explain how to actually find the participating queries or the lock graph.
What strong candidates say:
  • Immediate triage (first 2 minutes): Check pg_stat_activity for wait_event_type = 'Lock' and cross-reference with pg_locks joined to pg_class to identify the exact tables and rows involved. Run SELECT * FROM pg_locks WHERE NOT granted; to see who’s waiting, then trace back to the blocking PID.
  • Root cause identification: Pull the actual queries from log_lock_waits = on (should already be enabled in production). In my experience the classic pattern is: Transaction A updates orders then inventory, Transaction B updates inventory then orders. The fix isn’t retry logic alone — it’s ensuring consistent lock acquisition order across all code paths.
  • Immediate fix: Terminate the victim transaction with pg_terminate_backend(pid) targeting the least-progressed transaction. Then deploy a hotfix that reorders the lock acquisition. In one incident I worked, we found a background reconciliation job had been deployed that week with reversed table access order — it only manifested under load because the probability of interleaving was low.
  • Long-term: Set deadlock_timeout appropriately (default 1s is fine for detection, but you want log_lock_waits and lock_timeout per-session for aggressive callers). Add a Grafana panel on pg_stat_database.deadlocks counter. Consider advisory locks for known hot-path contention.
Follow-up:
  1. What’s the difference between a deadlock and simple lock contention, and how does Postgres’s deadlock detector actually work internally (hint: wait-for graph cycle detection)?
  2. You mentioned advisory locks — when would you prefer pg_advisory_xact_lock over SELECT ... FOR UPDATE SKIP LOCKED, and what are the trade-offs?
  3. How would your approach differ on MySQL InnoDB where the deadlock detection algorithm and gap locking behavior are fundamentally different?
Scenario: You created a composite index (tenant_id, created_at, status) on a 200M-row events table. Your query filters on all three columns. But EXPLAIN ANALYZE shows a sequential scan costing 4.2 seconds instead of the expected sub-100ms index lookup. The index definitely exists. What’s going on?What weak candidates say:
  • “Maybe the index wasn’t created properly. I’d drop and recreate it.”
  • “Run ANALYZE on the table.” (Correct but incomplete — they can’t explain why the planner chose seq scan.)
  • “Force the index with a hint.” (Postgres doesn’t have traditional index hints.)
What strong candidates say:
  • Statistics are stale or misleading: First check pg_statistic / pg_stats for the columns involved. If ANALYZE hasn’t run recently (or autovacuum is behind), the planner’s row estimates will be wildly off. Run EXPLAIN (without ANALYZE) and compare the rows= estimate to reality. If planner thinks the query returns 80% of the table, it rationally chooses seq scan because random I/O for 160M rows is worse than a single sequential pass.
  • Column correlation issue: If tenant_id has extreme skew — say tenant #42 owns 70% of the rows — Postgres’s default single-column statistics won’t capture the joint selectivity of (tenant_id, created_at, status). Fix: create extended statistics with CREATE STATISTICS on the correlated columns, or increase default_statistics_target for those columns.
  • Data type mismatch / implicit cast: One I’ve hit in production: the query was passing status as an integer but the column was varchar. The implicit cast made the index unusable. EXPLAIN showed a Filter node instead of Index Cond.
  • Index column order vs. query pattern: If the query uses a range on created_at (e.g., created_at > now() - interval '7 days') but an equality check on status, the index (tenant_id, created_at, status) is suboptimal — the range on the second column means the third column can’t be used as a seek condition. Reorder to (tenant_id, status, created_at).
  • Actual check: SET enable_seqscan = off; temporarily to see what plan the optimizer would use with the index, and compare costs. This isn’t a production fix — it’s a diagnostic.
Follow-up:
  1. Walk me through exactly how Postgres decides between an index scan, bitmap index scan, and sequential scan. What cost parameters (random_page_cost, seq_page_cost, effective_cache_size) influence this decision?
  2. You have a query with WHERE tenant_id = $1 AND created_at BETWEEN $2 AND $3 AND status IN ('active', 'pending', 'review'). Design the optimal index and explain why.
  3. How do partial indexes and index-only scans change your analysis here?
Scenario: Your Postgres primary handles ~8K writes/second. You have 3 read replicas serving the application’s read traffic. Monitoring shows replication lag spiking from the normal 50ms to 45 seconds, and users are seeing stale data — some are seeing order statuses from minutes ago. The replica isn’t crashing, just falling behind. Diagnose and fix.What weak candidates say:
  • “Scale up the replica hardware.”
  • “Switch to synchronous replication.” (Would tank write throughput and make the problem worse.)
  • Cannot distinguish between WAL shipping lag and replay lag.
What strong candidates say:
  • Distinguish the two lag components: Check pg_stat_replication on the primary — compare sent_lsn, write_lsn, flush_lsn, replay_lsn. If sent_lsn is far ahead of replay_lsn but write_lsn is close to sent_lsn, the bottleneck is WAL replay on the replica, not network transfer. This is the most common case.
  • Long-running queries on replica: In Postgres, if a query on the replica conflicts with WAL replay (e.g., replay needs to vacuum a page the query is reading), replay pauses. Check max_standby_streaming_delay — if it’s set to 30s (default), a single long analytics query can cause 30s of replay lag. I’ve seen this exact pattern when a BI tool ran a 2-minute dashboard refresh query on the replica. Fix: set hot_standby_feedback = on (but beware table bloat on primary), or route analytics queries to a dedicated replica with higher max_standby_streaming_delay, or cancel long queries with statement_timeout on the replica.
  • Heavy write burst: A bulk data load or large DELETE on the primary generates a WAL spike. The replica’s single-threaded WAL replay can’t keep up. In Postgres 15+ you can tune recovery_prefetch to help, but ultimately single-threaded replay is a known bottleneck. Consider logical replication for parallelism, or table-level partitioning so deletes become fast DROP PARTITION instead of row-by-row.
  • I/O saturation on replica: Check iostat / pg_stat_io. If the replica’s disk is saturated (common when replica also handles heavy read queries), WAL replay competes with query I/O. Move to faster storage or separate WAL directory onto dedicated disk.
Follow-up:
  1. How does Postgres streaming replication differ from logical replication, and in what scenarios would you choose each?
  2. What happens to in-flight read queries on the replica when max_standby_streaming_delay is exceeded?
  3. You need to guarantee that a user who just wrote data will read their own write from a replica. How do you architect this without sending all reads to the primary?
Scenario: Your application uses PgBouncer in front of Postgres (max 100 server connections). During a flash sale, you see the app throwing FATAL: too many connections errors. PgBouncer’s SHOW POOLS shows all 100 connections active, 2,400 clients waiting. Response times have gone from 15ms to 12 seconds. Median query time is 8ms. How is this possible and what do you do?What weak candidates say:
  • “Increase max_connections to 1000.” (Postgres falls over with context switching at high connection counts.)
  • “Add more PgBouncer instances.” (Doesn’t help if the bottleneck is server-side connections.)
  • Don’t understand the difference between PgBouncer’s transaction vs session pooling modes.
What strong candidates say:
  • The math doesn’t add up — and that’s the clue: 100 connections, 8ms median query = theoretical throughput of ~12,500 queries/sec. If we’re queuing 2,400 clients, either we’re exceeding that throughput or something is holding connections longer than 8ms. The culprit is usually session-mode pooling where connections are held for the entire session, not released between transactions. First check: SHOW POOLS; — if pool_mode = session, switch to transaction mode. This alone can 10x effective capacity.
  • Idle-in-transaction connections: Run SELECT state, count(*) FROM pg_stat_activity GROUP BY state;. If you see 60+ connections in idle in transaction, someone’s code is opening a transaction, doing app-level work (API calls, computation), then committing. Each of those is holding a server connection hostage. Fix: set idle_in_transaction_session_timeout = '5s' in Postgres to kill offenders, and fix the application code to keep transactions short.
  • Prepared statements with transaction pooling: If you switch to transaction mode but the app uses prepared statements, PgBouncer can’t route statements across connections. Use DEALLOCATE ALL at transaction end, or switch to PgBouncer 1.21+ with prepared_statement_cache_size.
  • Short-term triage: During the incident, identify and pg_terminate_backend() the idle-in-transaction sessions. Set PgBouncer server_idle_timeout and query_wait_timeout to fail fast rather than queue indefinitely. 12-second queue wait is worse than a fast error + client retry.
  • Architecture fix: For this scale (flash sale spikes), consider a connection pooler per service, with per-service connection limits. One runaway service shouldn’t be able to starve others. We implemented this after a similar incident — the marketing analytics service had a connection leak that ate 40 of our 100 connections.
Follow-up:
  1. Explain the internal difference between PgBouncer’s session, transaction, and statement pooling modes. When is statement mode dangerous?
  2. Why does Postgres performance degrade severely above ~300 connections, even if the queries are fast? (Hint: ProcArrayLock, snapshot management.)
  3. Compare PgBouncer vs. Pgpool-II vs. application-level pooling (HikariCP). When would you choose each?
Scenario: You partitioned a 2-billion-row audit_logs table by month using Postgres declarative partitioning (PARTITION BY RANGE (created_at)). Queries like SELECT * FROM audit_logs WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01' should hit only one partition, but EXPLAIN shows it scanning all 48 partitions. What went wrong?What weak candidates say:
  • “Partitioning doesn’t work well in Postgres.”
  • “The data types probably don’t match.” (Partially right but can’t elaborate.)
  • Don’t know what partition pruning is or how to verify it in EXPLAIN output.
What strong candidates say:
  • Check for runtime vs. plan-time pruning: In the EXPLAIN output, look for Append with a list of child scans. If all 48 partitions appear, pruning failed entirely. If using a parameterized query ($1 instead of literal), Postgres can only do runtime pruning (available since PG 11), and you need enable_partition_pruning = on (default, but verify).
  • Data type mismatch is the #1 cause: If created_at is timestamptz but the query literal is a date or timestamp without time zone, the implicit cast can prevent the planner from matching the value to the partition boundary. I’ve seen this exact issue — partition bounds were defined as timestamptz but the ORM was sending timestamp values. Fix: explicit cast (WHERE created_at >= '2025-01-01'::timestamptz) or ensure the ORM type-maps correctly.
  • Expression mismatch: If the partition key is created_at but the query applies a function — WHERE date_trunc('month', created_at) = '2025-01-01' — the planner can’t prune because the function transforms the column. The WHERE clause must directly reference the partition key column with simple comparison operators.
  • Constraint exclusion vs. partition pruning: In older PG (pre-11), partition pruning relied on constraint_exclusion, which only worked at plan time and not with prepared statements. Verify PG version and that you’re using native declarative partitioning, not the old inheritance-based partitioning.
  • Default partition trap: If you have a DEFAULT partition, Postgres must scan it because any value could exist there. This is correct behavior but often surprises people. If the default partition has millions of rows, it adds significant overhead.
  • Verification: Run EXPLAIN (COSTS OFF) and look for Subplans Removed: 47 — that confirms 47 of 48 partitions were pruned. If you see Subplans Removed: 0, pruning is failing.
Follow-up:
  1. You need to add a new monthly partition for next month. How do you automate this, and what happens if queries arrive for a month with no partition and no default partition?
  2. What’s the performance impact of having 500+ partitions vs. 48? At what point does partition management overhead outweigh pruning benefits?
  3. Compare Postgres declarative partitioning with MySQL’s partitioning and with TimescaleDB hypertables for time-series data.
Scenario: Your transactions table holds 50M active rows but pg_total_relation_size reports 380 GB — roughly 10x what you’d expect for 50M rows. Queries that were 20ms are now 200ms. You suspect bloat. Autovacuum is enabled and shows last_autovacuum was 2 hours ago. Explain what happened and how you fix it.What weak candidates say:
  • “Just run VACUUM FULL on the table.” (Correct but takes an exclusive lock, potentially hours of downtime on 380 GB.)
  • “Delete old rows.” (The rows are already deleted — that’s the problem.)
  • Can’t explain why autovacuum didn’t prevent this.
What strong candidates say:
  • Why autovacuum failed to prevent this: Autovacuum was running, but it wasn’t effective. The most common culprit is a long-running transaction or an abandoned replication slot holding back the xmin horizon. Check: SELECT xmin, pid, state, query_start FROM pg_stat_activity ORDER BY xmin ASC NULLS LAST LIMIT 5;. If there’s a transaction from 6 hours ago in idle in transaction state, no dead tuples created after that transaction started can be vacuumed — autovacuum runs but removes zero tuples. I once traced a 400 GB bloat incident to a single forgotten BEGIN in a developer’s pgAdmin session that had been open for 3 days.
  • Replication slot check: SELECT slot_name, active, restart_lsn, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes FROM pg_replication_slots;. An inactive replication slot (maybe from a decommissioned replica) prevents WAL and tuple cleanup. Drop it if it’s orphaned.
  • Quantify the bloat: Use pgstattuple extension: SELECT * FROM pgstattuple('transactions'); gives you dead_tuple_percent. Or use the pg_bloat_check query from the Postgres wiki. If dead tuple ratio is above 50%, you’re in bad shape.
  • Fix without downtime: VACUUM FULL requires ACCESS EXCLUSIVE lock — not an option on a production table. Instead use pg_repack (or pg_squeeze): these rebuild the table online by creating a new copy, replaying changes via triggers, then swapping. I’ve used pg_repack on a 500 GB table — it took 4 hours but had zero downtime. Set statement_timeout = 0 for the repack session.
  • Prevention: Tune autovacuum per-table for hot tables: ALTER TABLE transactions SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 2);. Default scale_factor of 0.2 means autovacuum waits until 20% of the table is dead — on 50M rows that’s 10M dead tuples before it triggers. Monitor n_dead_tup in pg_stat_user_tables and alert when it exceeds a threshold. Set idle_in_transaction_session_timeout to prevent the root cause.
Follow-up:
  1. Explain how MVCC’s xmin/xmax tuple headers work and why the visibility map matters for index-only scans on a bloated table.
  2. What is transaction ID wraparound, why does Postgres need aggressive autovacuum to prevent it, and what happens if you hit the 2-billion transaction limit?
  3. How does MySQL InnoDB handle MVCC cleanup differently (purge thread, undo logs) and does it suffer the same bloat patterns?
Scenario: A bad deployment ran an UPDATE without a WHERE clause on the accounts table in production at 14:32 UTC. All 2.3M accounts now have balance = 0. The team is panicking. You have daily base backups (taken at 02:00 UTC) and continuous WAL archiving. You need to recover the accounts table to its state at 14:31 UTC without losing any other data committed between 02:00 and now. Walk me through the recovery.What weak candidates say:
  • “Restore the full backup from 02:00.” (This loses 12 hours of data across ALL tables.)
  • “Use pg_dump from the backup.” (Pg_dump is logical backup — they likely have physical backups with pg_basebackup.)
  • Don’t know what PITR is or how WAL replay works.
What strong candidates say:
  • Do NOT restore over production. Spin up a separate recovery instance. Use pg_basebackup restore + WAL replay to a parallel server. This is critical — you never risk making things worse on the live database.
  • Point-in-Time Recovery (PITR): Restore the 02:00 base backup to the recovery instance. In recovery.conf (or postgresql.conf in PG 12+), set recovery_target_time = '2026-04-10 14:31:00 UTC' and recovery_target_action = 'pause'. Start the recovery instance — it replays WAL from the archive up to 14:31 UTC and pauses. Verify the accounts table looks correct.
  • Extract just the table: From the recovery instance, dump only the accounts table: pg_dump -t accounts --data-only --column-inserts recovery_db > accounts_recovery.sql. Or for speed, use COPY accounts TO '/tmp/accounts.csv' on the recovery instance.
  • Restore into production: On the live production database, within a transaction: BEGIN; DELETE FROM accounts; COPY accounts FROM '/tmp/accounts.csv'; COMMIT; or use pg_restore with --data-only --table=accounts. Verify row counts and spot-check balances.
  • Verify and communicate: Compare row count and checksum (SELECT count(*), sum(balance) FROM accounts) between recovery instance and production after restore. Write an incident report. Then set up pg_audit or a pre-deploy SQL review gate so unqualified UPDATE statements get caught.
  • Time budget: In practice, restoring a base backup and replaying 12 hours of WAL for a moderately busy database takes 30-90 minutes depending on disk speed and WAL volume. The pg_dump/COPY of a single table is fast — maybe 2-3 minutes for 2.3M rows. Total time to recovery: ~1-2 hours. Communicate this timeline to stakeholders immediately so they stop asking “is it fixed yet” every 5 minutes.
  • Prevention: Enable pg_stat_statements and alert on DML without WHERE clauses in staging gate. Use SET statement_timeout for deployment scripts. Consider a BEFORE UPDATE trigger that rejects mass updates above a row threshold (a “blast radius limiter”).
Follow-up:
  1. What’s the difference between physical backups (pg_basebackup) and logical backups (pg_dump), and why can’t you do PITR with pg_dump alone?
  2. The WAL archive has a gap — one segment file is missing or corrupted. What happens during replay, and how do you handle it?
  3. Your database is 4 TB and the incident just happened. PITR will take 3 hours. The business wants data back in 30 minutes. What are your options? (Hint: think about logical replication, flashback queries with pg_dirtyread, or selective WAL replay.)
Scenario: You need to add a non-nullable column region_code VARCHAR(5) NOT NULL DEFAULT 'US' to a user_events table with 1.2 billion rows in Postgres 14. The table receives 5K inserts/second. Your SLA says zero downtime. How do you do it?What weak candidates say:
  • “Just run ALTER TABLE ADD COLUMN.” (Not wrong in PG 11+ for defaults, but can’t explain why it’s fast now vs. before, and doesn’t consider the NOT NULL constraint behavior.)
  • “Create a new table, migrate data, swap.” (Massively over-engineered for this specific case.)
  • Don’t know about Postgres’s metadata-only ADD COLUMN optimization.
What strong candidates say:
  • Postgres 11+ fast-path ADD COLUMN: Since PG 11, ALTER TABLE ADD COLUMN ... DEFAULT 'US' with a non-volatile default is a metadata-only operation — it does NOT rewrite the table. The default is stored in pg_attrdef and returned for rows that don’t physically have the column yet. This takes milliseconds regardless of table size. This is one of the most important Postgres optimizations to know.
  • The NOT NULL nuance: Adding NOT NULL with a default together in a single ALTER TABLE statement works instantly in PG 11+ because every row logically has the default value. But if you try to add the column nullable first, backfill, then add NOT NULL constraint separately — the SET NOT NULL requires a full table scan to verify no nulls exist. On 1.2 billion rows, that scan takes an ACCESS EXCLUSIVE lock for potentially 30+ minutes. So the correct approach is: single statement ALTER TABLE user_events ADD COLUMN region_code VARCHAR(5) NOT NULL DEFAULT 'US'; — instant, zero downtime.
  • When you CAN’T use the fast path: If the default is volatile (e.g., DEFAULT now() or DEFAULT uuid_generate_v4()), Postgres must rewrite every row. In that case, you need the multi-step approach: (1) Add column as nullable with no default. (2) Backfill in batches of 10K-50K rows with UPDATE ... WHERE id BETWEEN X AND Y and pg_sleep(0.1) between batches to avoid overwhelming replication and I/O. (3) Add NOT NULL constraint using NOT VALID first, then VALIDATE CONSTRAINT separately — VALIDATE only takes a SHARE UPDATE EXCLUSIVE lock, not ACCESS EXCLUSIVE, so reads and writes continue.
  • Lock timeout safety: Always set SET lock_timeout = '3s'; before any DDL on a hot table. If the ALTER TABLE can’t acquire its lock within 3 seconds (because a long query is holding a conflicting lock), it’s better to fail and retry than to queue behind the lock and block all subsequent queries. I’ve seen a 2-second ALTER TABLE cause a 10-minute outage because it queued behind a lock, and 500 other queries queued behind it.
  • Tooling: For complex multi-step migrations, use gh-ost (GitHub’s tool for MySQL) or pg-osc / pgroll for Postgres. These create shadow tables, sync via triggers, and do an atomic swap. Flyway or Liquibase handle migration ordering but don’t solve the lock problem — you still need to write lock-safe SQL.
Follow-up:
  1. You need to change the column type from VARCHAR(5) to an ENUM type on the same billion-row table. This does require a rewrite. Walk me through the zero-downtime approach.
  2. How does MySQL’s ALTER TABLE behavior differ? Why does MySQL often require tools like gh-ost or pt-online-schema-change for operations that are instant in Postgres?
  3. You ran ALTER TABLE ... ADD COLUMN and it’s been hanging for 30 seconds on a table that should be instant. What’s happening? (Hint: lock queue, pg_locks, pg_stat_activity, conflicting ACCESS SHARE from long queries.)