Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Module 15: Senior Interview Mastery

This module prepares you for database-focused questions in senior, staff, and principal engineering interviews. You’ll learn to explain complex concepts, discuss trade-offs, and demonstrate deep expertise.
Estimated Time: 12-14 hours
Interview Level: Senior/Staff/Principal
Format: Deep dives + System design + Behavioral
Goal: Demonstrate mastery, not just knowledge

15.1 Interview Types for Databases

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DATABASE INTERVIEW CATEGORIES                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   1. TECHNICAL DEEP DIVE (30-45 min)                                        │
│      "Explain how MVCC works"                                               │
│      "Walk me through query execution"                                       │
│      "Describe B-Tree internals"                                            │
│                                                                              │
│   2. SYSTEM DESIGN - DATA LAYER (45-60 min)                                │
│      "Design a database for Twitter"                                        │
│      "How would you scale a multi-tenant SaaS?"                             │
│      "Design an event sourcing system"                                      │
│                                                                              │
│   3. SQL LIVE CODING (30-45 min)                                            │
│      Write complex queries under pressure                                   │
│      Optimize given slow queries                                            │
│      Design schema for requirements                                         │
│                                                                              │
│   4. TROUBLESHOOTING (30-45 min)                                            │
│      "Production is slow, walk me through diagnosis"                        │
│      "We're seeing lock contention, what do you check?"                     │
│      "Replication lag is increasing, how do you fix it?"                    │
│                                                                              │
│   5. BEHAVIORAL + TECHNICAL (30-45 min)                                     │
│      "Tell me about a database issue you solved"                            │
│      "How did you handle a production incident?"                            │
│      "Describe a decision between SQL and NoSQL"                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

15.2 Deep Dive Questions

Deep dive questions test whether you truly understand the machinery behind the abstractions. A senior engineer does not just say “PostgreSQL uses MVCC” — they can trace how a specific UPDATE creates a new tuple version, explain why the old version persists, describe the visibility rules that determine which transaction sees which version, and articulate the VACUUM implications. The framework for answering any deep dive: What is it, Why does it exist (what problem does it solve), How does it work mechanically, and What are the trade-offs.

MVCC Explanation

Strong Answer Structure:
  1. What it is: Multi-Version Concurrency Control - readers don’t block writers and vice versa by keeping multiple versions of data.
  2. How it works in PostgreSQL:
    • Each row has hidden columns: xmin (creating transaction), xmax (deleting transaction)
    • New version created on UPDATE (not in-place modification)
    • Visibility rules based on transaction’s snapshot
    • Dead tuples cleaned up by VACUUM
  3. Trade-offs:
    • ✅ Better concurrency than locking
    • ✅ Consistent snapshots for long queries
    • ❌ Storage overhead (dead tuples)
    • ❌ Needs VACUUM maintenance
  4. Comparison: Oracle uses undo logs (rollback segments), PostgreSQL uses in-place versioning.
Bonus Points: Mention HOT updates, visibility map, wraparound vacuum.
Strong Answer (follow the pipeline):
  1. Parse: SQL text → tokens → parse tree
    • Validates syntax
    • Identifies keywords, identifiers, literals
  2. Analyze: Parse tree → Query tree
    • Resolves table/column names to OIDs
    • Type checking and coercion
    • Function resolution
  3. Rewrite: Query tree → Rewritten tree(s)
    • View expansion
    • Rule processing
  4. Plan: Query tree → Plan tree
    • Generate paths (seq scan, index scans, joins)
    • Estimate costs using statistics
    • Choose cheapest path
  5. Execute: Plan tree → Results
    • Initialize plan nodes (iterator model)
    • Fetch tuples through pull-based execution
    • Apply filters, projections
Senior-level additions: Mention prepared statement caching, JIT compilation, parallel query.
Strong Answer:
  1. Structure:
    • Balanced tree: all leaf nodes at same depth
    • Internal nodes contain keys and child pointers
    • Leaf nodes contain keys and TIDs (row pointers)
    • Nodes are 8KB pages
  2. Lookup process:
    Root → Compare key, descend to child
    Internal nodes → Binary search, follow pointer
    Leaf node → Binary search for key(s)
    Heap → Fetch tuple using TID
    
  3. Range scans: Leaf nodes linked for efficient range traversal
  4. PostgreSQL specifics:
    • Lehman-Yao algorithm (concurrent-safe)
    • High key in each page for sibling navigation
    • Right-link for concurrent splits
Bonus: Discuss index-only scans (visibility map), partial indexes, covering indexes.
Strong Answer:
  1. Isolation Levels:
    • Read Committed: Each statement sees committed data as of statement start
    • Repeatable Read: Entire transaction sees snapshot from start
    • Serializable: Full isolation, SSI (Serializable Snapshot Isolation)
  2. Locking hierarchy:
    • Row-level locks (shared, exclusive)
    • Table-level locks (8 modes from ACCESS SHARE to ACCESS EXCLUSIVE)
    • Predicate locks for serializable
  3. Deadlock handling:
    • Detection via wait-for graph
    • One transaction aborted as victim
    • Application should retry
  4. Concurrency patterns:
    • Optimistic: Read, compute, try update, retry on conflict
    • Pessimistic: Lock early, update, release
    • SELECT FOR UPDATE for row locking
Demonstrate depth: Discuss SSI vs 2PL, write skew anomaly, SKIP LOCKED.

15.3 System Design Questions

Design a Multi-Tenant SaaS Database

┌─────────────────────────────────────────────────────────────────────────────┐
│          MULTI-TENANT SAAS DATABASE DESIGN                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Approach 1: Shared Schema with tenant_id                                  │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                        DATABASE                                     │   │
│   │  users (id, tenant_id, name, ...)                                  │   │
│   │  orders (id, tenant_id, ...)                                       │   │
│   │                                                                     │   │
│   │  + Row Level Security:                                             │   │
│   │  CREATE POLICY tenant_isolation ON users                           │   │
│   │    USING (tenant_id = current_setting('app.tenant_id'));           │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│   ✅ Simple, efficient                                                      │
│   ✅ Easy cross-tenant analytics                                            │
│   ❌ Noisy neighbor risk                                                    │
│   ❌ Accidental data leaks if RLS misconfigured                            │
│                                                                              │
│   Approach 2: Schema per Tenant                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                        DATABASE                                     │   │
│   │  tenant_001.users   tenant_002.users   tenant_003.users            │   │
│   │  tenant_001.orders  tenant_002.orders  tenant_003.orders           │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│   ✅ Good isolation                                                         │
│   ✅ Per-tenant backup/restore                                              │
│   ❌ Schema migrations more complex                                         │
│   ❌ Connection management (SET search_path)                               │
│                                                                              │
│   Approach 3: Database per Tenant                                           │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                        │
│   │ tenant_001  │  │ tenant_002  │  │ tenant_003  │                        │
│   │    users    │  │    users    │  │    users    │                        │
│   │   orders    │  │   orders    │  │   orders    │                        │
│   └─────────────┘  └─────────────┘  └─────────────┘                        │
│   ✅ Complete isolation                                                     │
│   ✅ Easy compliance (GDPR deletion)                                       │
│   ❌ Operational overhead                                                   │
│   ❌ Connection pooling complexity                                         │
│                                                                              │
│   RECOMMENDATION:                                                           │
│   • < 100 tenants: Schema per tenant                                       │
│   • 100-10K tenants: Shared schema with RLS                                │
│   • Enterprise needs: Database per tenant                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Design Twitter’s Timeline Database

Requirements Clarification:
  • Users follow other users
  • Home timeline shows posts from followed users
  • Profile shows user’s own posts
  • Scale: 500M users, 200M daily active, 400M tweets/day
Approach 1: Pull Model (Fan-in on Read)
-- Schema
CREATE TABLE users (id BIGINT PRIMARY KEY, ...);
CREATE TABLE follows (follower_id BIGINT, following_id BIGINT);
CREATE TABLE tweets (id BIGINT, author_id BIGINT, content TEXT, created_at TIMESTAMPTZ);

-- Home timeline query (SLOW!)
SELECT t.* FROM tweets t
JOIN follows f ON t.author_id = f.following_id
WHERE f.follower_id = $current_user
ORDER BY t.created_at DESC
LIMIT 20;
❌ O(n) per request where n = # followed ❌ Slow for users following many accountsApproach 2: Push Model (Fan-out on Write)
-- Materialized timeline per user
CREATE TABLE timelines (
    user_id BIGINT,
    tweet_id BIGINT,
    created_at TIMESTAMPTZ,
    PRIMARY KEY (user_id, created_at DESC, tweet_id)
);

-- On new tweet: fan out to all followers
INSERT INTO timelines (user_id, tweet_id, created_at)
SELECT follower_id, $tweet_id, NOW()
FROM follows WHERE following_id = $author_id;

-- Home timeline query (FAST!)
SELECT t.* FROM tweets t
JOIN timelines tl ON t.id = tl.tweet_id
WHERE tl.user_id = $current_user
ORDER BY tl.created_at DESC
LIMIT 20;
✅ O(1) read ❌ Celebrity problem: millions of followers = slow writesApproach 3: Hybrid (Real Production Architecture)
Normal users → Fan-out on write to followers' timelines
Celebrities (>10K followers) → Fan-in on read, merged at read time

Timeline = merge(pre-materialized + real-time celebrity fetch)
Database Choices:
  • User/follows: PostgreSQL (relational, ACID)
  • Tweets: PostgreSQL with partitioning by time
  • Timeline cache: Redis sorted sets
  • Full-text search: Elasticsearch
Scaling Strategy:
  • Shard by user_id (follows, timelines)
  • Shard by date (tweets - time-series)
  • Read replicas for timeline reads
  • Connection pooling: PgBouncer

15.4 SQL Live Coding

Common Patterns to Master

-- 1. Running totals / Cumulative sums
-- Why window functions matter: they compute aggregates without collapsing rows.
-- The ORDER BY inside OVER() defines the "window frame" -- rows are accumulated
-- in date order. Without a ROWS/RANGE clause, the default frame is RANGE BETWEEN
-- UNBOUNDED PRECEDING AND CURRENT ROW.
SELECT 
    date,
    amount,
    SUM(amount) OVER (ORDER BY date) AS running_total
FROM transactions;

-- 2. Top N per group
-- Why ROW_NUMBER over RANK: ROW_NUMBER guarantees exactly N results per group
-- even with ties. RANK would return more than N if there are ties at the boundary.
-- Performance pitfall: this pattern requires a sort per partition. If the table has
-- millions of rows and hundreds of categories, ensure there is an index on
-- (category, sales DESC) to avoid a full sort.
SELECT * FROM (
    SELECT 
        category,
        product,
        sales,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS rn
    FROM products
) t WHERE rn <= 3;

-- 3. Gap and island problems
-- Why this works: for consecutive dates, (date - row_number) yields the same value.
-- When there is a gap, the row_number keeps incrementing but the date jumps,
-- creating a new group value. This is a classic technique interviewers love because
-- it tests both window function fluency and mathematical reasoning.
WITH numbered AS (
    SELECT 
        id, 
        date,
        date - (ROW_NUMBER() OVER (ORDER BY date))::int AS grp
    FROM events
)
SELECT MIN(date) AS start_date, MAX(date) AS end_date, COUNT(*) AS length
FROM numbered
GROUP BY grp;

-- 4. Hierarchical queries (recursive CTE)
-- Why the UNION ALL matters: UNION would deduplicate, which both wastes CPU and
-- can mask infinite loops in cyclic graphs. Always add a depth limit or cycle
-- detection (PG 14+ has CYCLE clause) when the hierarchy might have loops.
-- Performance pitfall: recursive CTEs materialize intermediate results. For deep
-- hierarchies (>100 levels) or wide graphs, consider ltree extension instead.
WITH RECURSIVE org_tree AS (
    SELECT id, name, manager_id, 1 AS level
    FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, e.manager_id, t.level + 1
    FROM employees e
    JOIN org_tree t ON e.manager_id = t.id
)
SELECT * FROM org_tree;

-- 5. Pivot / Cross-tab
-- Why CASE inside SUM: PostgreSQL has no native PIVOT syntax (unlike SQL Server).
-- The CASE expression returns NULL for non-matching months, and SUM ignores NULLs.
-- For dynamic pivots (unknown number of columns), use the tablefunc extension's
-- crosstab() function or generate the SQL dynamically in application code.
SELECT 
    product,
    SUM(CASE WHEN month = 1 THEN sales END) AS jan,
    SUM(CASE WHEN month = 2 THEN sales END) AS feb,
    SUM(CASE WHEN month = 3 THEN sales END) AS mar
FROM sales_data
GROUP BY product;

-- 6. Deduplication (keep latest)
-- Why DELETE USING instead of a subquery: the USING clause lets PostgreSQL do a
-- single hash join instead of a correlated subquery (which would be an N*M scan).
-- Performance pitfall: on large tables, this can generate massive WAL and hold
-- locks for a long time. For production, batch the deletes:
--   DELETE FROM events a USING events b
--   WHERE a.id < b.id AND ... LIMIT 10000;
-- and loop until no rows are affected.
DELETE FROM events a
USING events b
WHERE a.id < b.id 
  AND a.user_id = b.user_id 
  AND a.event_type = b.event_type;

15.5 Troubleshooting Scenarios

Troubleshooting questions are where interviewers separate “I read the docs” from “I have been paged at 3 AM.” The framework that impresses: Observe (what are the symptoms?), Hypothesize (what could cause this?), Measure (gather data to confirm or eliminate), Act (fix the immediate problem), Prevent (ensure it does not recur). Always start by asking clarifying questions: “Is this sudden or gradual? Did anything change recently? What is the error rate vs. latency?”

Scenario: Production Database is Slow

-- 1. Check active queries -- sort by query_start to find long-running offenders.
-- A query running for hours is often the root cause of lock contention AND
-- VACUUM starvation simultaneously.
SELECT pid, state, query_start, query 
FROM pg_stat_activity 
WHERE state != 'idle'
ORDER BY query_start;

-- 2. Check for blocking -- this query finds which backend is blocking which.
-- In production, the "blocking_query" is often an idle-in-transaction session
-- holding a lock it forgot to release.
SELECT 
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid
JOIN pg_locks l ON l.locktype = bl.locktype 
    AND l.relation = bl.relation AND l.pid != bl.pid
JOIN pg_stat_activity blocking ON l.pid = blocking.pid
WHERE NOT bl.granted;

-- 3. Check wait events -- the wait_event_type tells you WHERE the bottleneck is:
-- 'LWLock' = internal lock contention (often buffer mapping or WAL insert)
-- 'Lock' = heavyweight lock (usually row or table locks from other transactions)
-- 'IO' = waiting for disk (check your storage throughput)
-- 'Client' = waiting for the client to consume results (network or app issue)
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;
-- Buffer cache hit ratio (should be > 99%)
-- Why 99%? If shared_buffers is properly sized (25% of RAM), the OS file cache
-- handles the rest. A ratio below 95% usually means shared_buffers is too small
-- OR you have a working set larger than RAM (consider adding more memory).
SELECT 
    sum(blks_hit) / nullif(sum(blks_hit + blks_read), 0) AS ratio
FROM pg_stat_database;

-- Slowest queries (need pg_stat_statements)
-- Why total_exec_time, not mean_exec_time? A query that averages 10ms but runs
-- 1 million times consumes 10,000 seconds of CPU -- far more impactful than a
-- 5-second query that runs once. Optimize by total time first.
SELECT 
    substring(query, 1, 80) AS query,
    calls,
    mean_exec_time,
    total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Table bloat -- dead_ratio > 0.2 means more than 20% of the table is dead tuples.
-- Performance pitfall: high dead_ratio does not just waste disk space -- it also
-- slows sequential scans because PostgreSQL must read and skip dead tuples.
-- Index bloat is even worse: dead index entries inflate the B-Tree depth.
SELECT 
    schemaname, tablename,
    pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size,
    n_dead_tup, n_live_tup,
    round(n_dead_tup::numeric / nullif(n_live_tup, 0), 2) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
-- Kill long-running query -- pg_terminate_backend sends SIGTERM (graceful exit).
-- Use pg_cancel_backend first (cancels query but keeps connection alive).
-- Only escalate to terminate if cancel does not work.
SELECT pg_terminate_backend(pid);

-- Force vacuum on bloated table -- VERBOSE shows progress and reclaimed pages.
-- ANALYZE updates statistics so the planner makes better choices afterward.
VACUUM (VERBOSE, ANALYZE) big_table;

-- Tune autovacuum for this specific table -- the defaults (20% scale factor)
-- are too lazy for large tables. A 100M-row table would accumulate 20M dead
-- tuples before autovacuum fires. Set to 1% for high-write tables.
ALTER TABLE big_table SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 1000
);

-- Add missing index -- CONCURRENTLY is critical in production because it does
-- not hold an exclusive lock on the table. Without CONCURRENTLY, CREATE INDEX
-- blocks ALL writes until the index is fully built.
-- Performance pitfall: CONCURRENTLY requires two full table scans and cannot
-- run inside a transaction block. If it fails mid-way, you get an INVALID
-- index that must be dropped and recreated.
CREATE INDEX CONCURRENTLY idx_orders_customer 
ON orders(customer_id);

-- Tune work_mem for complex query -- use SET LOCAL so it only affects the
-- current transaction, not the entire session. Global work_mem changes affect
-- every sort/hash in every query on every connection.
SET work_mem = '256MB';

15.6 Behavioral Questions

Framework: STAR-T (Situation, Task, Action, Result, Technical)

Situation: At [Company], our e-commerce checkout was timing out during Black Friday, with 90th percentile response times exceeding 10 seconds.Task: As the senior backend engineer, I needed to identify and fix the bottleneck before peak traffic.Action:
  1. Enabled pg_stat_statements and identified a slow query in the inventory check
  2. Analyzed with EXPLAIN ANALYZE - seq scan on 50M row products table
  3. Found the issue: implicit cast from VARCHAR to INT in WHERE clause, bypassing index
  4. Created proper index, fixed application code to use correct types
  5. Added query performance alerts to prevent regression
Result:
  • Query time dropped from 3s to 5ms
  • Checkout success rate increased from 85% to 99.5%
  • No issues during peak traffic (10x normal load)
Technical Depth (add if they want more):
  • “The implicit cast was from the ORM using string interpolation…”
  • “We also added connection pooling with PgBouncer to handle the connection surge…“

15.7 Questions to Ask Interviewers

About the Stack

  • “What databases do you use and why those choices?”
  • “How do you handle schema migrations?”
  • “What’s your read/write ratio?”

About Scale

  • “What’s your data volume and growth rate?”
  • “How do you handle peak load?”
  • “Have you encountered scaling challenges?”

About Operations

  • “How do you handle database incidents?”
  • “What’s your backup and recovery strategy?”
  • “How is database work prioritized?”

About Team

  • “Is there a dedicated DBA team or embedded?”
  • “How do engineers learn about database topics?”
  • “What’s the biggest database project coming up?“

15.8 Mock Interview Practice

Practice Schedule

WeekFocusTime
1Deep dive explanations (record yourself)2 hrs
2SQL live coding (LeetCode, HackerRank)3 hrs
3System design with database focus3 hrs
4Troubleshooting scenarios2 hrs
5Mock interviews with peers3 hrs
6Full mock (all types)4 hrs

Self-Evaluation Rubric

CriterionJuniorSeniorStaff
CorrectnessGets basics rightHandles edge casesAnticipates issues
DepthSurface levelExplains whyDiscusses trade-offs
BreadthSingle approachMultiple optionsCross-system thinking
CommunicationNeeds promptingClear structureTeaches effectively
ExperienceTheoreticalReal examplesWar stories

15.9 Cheat Sheet for Interview Day

┌─────────────────────────────────────────────────────────────────────────────┐
│                    INTERVIEW DAY QUICK REFERENCE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   BEFORE ANSWERING:                                                          │
│   1. Clarify requirements (don't assume)                                    │
│   2. State assumptions explicitly                                           │
│   3. Outline approach before diving in                                      │
│                                                                              │
│   FOR DEEP DIVES:                                                            │
│   • What → Why → How → Trade-offs                                           │
│   • "The reason PostgreSQL does X is..."                                    │
│   • "The alternative would be Y, but that has..."                           │
│                                                                              │
│   FOR SYSTEM DESIGN:                                                         │
│   • Requirements → Estimation → Schema → Queries → Scaling                  │
│   • Always discuss indexing strategy                                        │
│   • Mention consistency vs availability trade-offs                          │
│                                                                              │
│   FOR TROUBLESHOOTING:                                                       │
│   • Start with symptoms → gather data → hypothesize → verify                │
│   • Mention tools: EXPLAIN, pg_stat_*, logs                                 │
│   • Prioritize: quick wins first, root cause after stabilizing             │
│                                                                              │
│   POWER PHRASES:                                                             │
│   • "In my experience at [Company]..."                                      │
│   • "The trade-off here is..."                                              │
│   • "One approach is X, but we could also Y depending on..."                │
│   • "I'd want to measure before optimizing..."                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Next Module

Module 16: System Design Case Studies

Real-world database design problems