Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Module 16: System Design Case Studies

Apply everything you’ve learned to real-world database design challenges. Each case study covers requirements, schema design, queries, scaling, and trade-offs. Think of these as the difference between knowing how to play individual chess pieces and actually playing a full game — this is where isolated skills become integrated engineering judgment.
Estimated Time: 10-12 hours
Format: 6 comprehensive case studies
Interview Use: Practice for 45-60 min design rounds
Skill Level: Senior/Staff
How to get the most from these case studies: Before reading each solution, spend 20-30 minutes designing your own schema and queries from the requirements alone. Compare your decisions against the provided solution. The gaps you find are exactly what you need to study. Senior engineers are not people who memorize schemas — they are people who can reason about trade-offs under constraints.

Case Study 1: E-Commerce Platform

Requirements

  • Product catalog with categories and attributes
  • User accounts with addresses
  • Shopping cart and checkout
  • Order management with status tracking
  • Inventory management with stock levels
  • Review and rating system

Schema Design

Why BIGSERIAL and not SERIAL? An e-commerce platform will grow beyond 2.1 billion rows (SERIAL’s INT limit) faster than you think. Orders, order items, and cart events accumulate fast. Changing a primary key type on a table with billions of rows and foreign key dependencies is one of the most painful migrations in production database engineering. Start with BIGSERIAL and never worry about it again — the storage overhead (8 bytes vs 4 bytes per row) is negligible compared to the operational cost of a retroactive migration.
-- Core entities
-- Users table is deliberately minimal: authentication concerns
-- belong in a separate auth service in most production systems
CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()  -- TIMESTAMPTZ, not TIMESTAMP: always store with timezone
);

CREATE TABLE addresses (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT REFERENCES users(id),
    type VARCHAR(20), -- 'shipping', 'billing'
    street TEXT NOT NULL,
    city VARCHAR(100),
    country VARCHAR(2),  -- ISO 3166-1 alpha-2: enforce at app level or CHECK constraint
    postal_code VARCHAR(20),
    is_default BOOLEAN DEFAULT false
    -- Pitfall: is_default has no DB-level uniqueness per user+type.
    -- Enforce "only one default per user per type" in application logic
    -- or use a partial unique index:
    -- CREATE UNIQUE INDEX ON addresses(user_id, type) WHERE is_default = true;
);

-- Product catalog
-- Categories use ltree for efficient ancestor/descendant queries.
-- Think of ltree like a file path: 'Electronics.Phones.Smartphones'
-- This avoids expensive recursive CTEs for "find all products in
-- Electronics and subcategories" queries.
CREATE TABLE categories (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    parent_id INT REFERENCES categories(id),
    path ltree  -- requires: CREATE EXTENSION ltree;
    -- Example path: 'electronics.phones.smartphones'
    -- Query all children: WHERE path <@ 'electronics'
);

CREATE TABLE products (
    id BIGSERIAL PRIMARY KEY,
    sku VARCHAR(50) UNIQUE NOT NULL,  -- SKU = Stock Keeping Unit, the merchant's identifier
    name VARCHAR(255) NOT NULL,
    description TEXT,
    category_id INT REFERENCES categories(id),
    base_price DECIMAL(10,2) NOT NULL,  -- DECIMAL, not FLOAT: never use floating point for money
    attributes JSONB DEFAULT '{}',  -- Flexible attributes (color options, material, etc.)
    -- Why JSONB here? Product attributes vary wildly across categories.
    -- A T-shirt has "size" and "color"; a laptop has "RAM" and "screen size".
    -- A fully normalized EAV (Entity-Attribute-Value) model is a performance nightmare.
    -- JSONB gives you flexible schema with GIN index support for fast lookups.
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE product_variants (
    id BIGSERIAL PRIMARY KEY,
    product_id BIGINT REFERENCES products(id),
    sku VARCHAR(50) UNIQUE NOT NULL,
    name VARCHAR(100),  -- "Red, Large"
    price_modifier DECIMAL(10,2) DEFAULT 0,
    attributes JSONB NOT NULL  -- {"color": "red", "size": "L"}
);

-- Inventory
-- This is a separate table from product_variants on purpose.
-- Inventory changes MUCH more frequently than product data.
-- Separating them means inventory updates don't cause lock
-- contention on the heavily-read product_variants table.
-- This is the "hot table / cold table" separation pattern.
CREATE TABLE inventory (
    variant_id BIGINT PRIMARY KEY REFERENCES product_variants(id),
    quantity INT NOT NULL DEFAULT 0,    -- Total physical stock
    reserved INT NOT NULL DEFAULT 0,    -- Reserved by in-progress checkouts
    -- Available = quantity - reserved
    CHECK (quantity >= 0),
    CHECK (reserved >= 0),
    CHECK (reserved <= quantity)  -- Cannot reserve more than exists
);

-- Orders
CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT REFERENCES users(id),
    status VARCHAR(20) DEFAULT 'pending',
    total DECIMAL(12,2) NOT NULL,
    shipping_address_id BIGINT REFERENCES addresses(id),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE order_items (
    id BIGSERIAL PRIMARY KEY,
    order_id BIGINT REFERENCES orders(id),
    variant_id BIGINT REFERENCES product_variants(id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    CHECK (quantity > 0)
);

-- Reviews
CREATE TABLE reviews (
    id BIGSERIAL PRIMARY KEY,
    product_id BIGINT REFERENCES products(id),
    user_id BIGINT REFERENCES users(id),
    rating INT NOT NULL CHECK (rating BETWEEN 1 AND 5),
    title VARCHAR(200),
    body TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(product_id, user_id)
);

Key Queries

-- Product listing with category tree
WITH RECURSIVE cat_tree AS (
    SELECT id, name, parent_id, ARRAY[name] AS path
    FROM categories WHERE id = $category_id
    UNION ALL
    SELECT c.id, c.name, c.parent_id, ct.path || c.name
    FROM categories c
    JOIN cat_tree ct ON c.parent_id = ct.id
)
SELECT p.*, array_agg(pv.id) AS variant_ids
FROM products p
JOIN cat_tree ct ON p.category_id = ct.id
LEFT JOIN product_variants pv ON pv.product_id = p.id
GROUP BY p.id;

-- Checkout with inventory reservation
-- This is the critical section: two customers must not both "win"
-- the last item. FOR UPDATE acquires a row-level lock, which is like
-- putting a "hold" sign on the shelf while you walk to the register.
BEGIN;
SELECT * FROM inventory 
WHERE variant_id = $variant_id 
FOR UPDATE;  -- Row-level lock: blocks other transactions on THIS row only

UPDATE inventory 
SET reserved = reserved + $quantity
WHERE variant_id = $variant_id
  AND (quantity - reserved) >= $quantity;

-- Check affected rows, if 0 then out of stock
COMMIT;

-- Product with aggregated reviews
SELECT 
    p.*,
    COUNT(r.id) AS review_count,
    AVG(r.rating) AS avg_rating,
    jsonb_build_object(
        '5', COUNT(*) FILTER (WHERE r.rating = 5),
        '4', COUNT(*) FILTER (WHERE r.rating = 4),
        '3', COUNT(*) FILTER (WHERE r.rating = 3),
        '2', COUNT(*) FILTER (WHERE r.rating = 2),
        '1', COUNT(*) FILTER (WHERE r.rating = 1)
    ) AS rating_distribution
FROM products p
LEFT JOIN reviews r ON r.product_id = p.id
WHERE p.id = $product_id
GROUP BY p.id;

Scaling Strategy

┌─────────────────────────────────────────────────────────────────────────────┐
│                    E-COMMERCE SCALING                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Read Path (90% of traffic):                                               │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐                  │
│   │   CDN       │────▶│   Redis     │────▶│   Replica   │                  │
│   │ (static)    │     │ (products)  │     │  (queries)  │                  │
│   └─────────────┘     └─────────────┘     └─────────────┘                  │
│                                                                              │
│   Write Path (10% of traffic):                                              │
│   ┌─────────────┐     ┌─────────────┐                                       │
│   │    App      │────▶│   Primary   │                                       │
│   │  (orders)   │     │  (writes)   │                                       │
│   └─────────────┘     └─────────────┘                                       │
│                                                                              │
│   Indexing:                                                                  │
│   • products(category_id, created_at DESC) - listings                       │
│   • products USING GIN(attributes) - attribute search                       │
│   • inventory(variant_id) - stock checks                                    │
│   • orders(user_id, created_at DESC) - order history                        │
│                                                                              │
│   Partitioning:                                                              │
│   • orders by created_at (monthly) - historical data                        │
│   • reviews by product_id hash - even distribution                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Case Study 2: Real-Time Analytics

Requirements

  • Ingest millions of events per day
  • Sub-second dashboard queries
  • Time-series aggregations (hourly, daily, weekly)
  • Retention: raw data 30 days, aggregates forever
The fundamental tension in analytics: You want to ingest data fast (append-only, minimal indexing) AND query it fast (heavy indexing, pre-aggregation). These goals directly conflict. The solution is a two-tier architecture: raw event storage optimized for writes, and pre-aggregated summary tables optimized for reads. Think of it like a library: you file new books quickly into the “incoming” pile, but periodically organize them onto shelves (aggregation) so readers can find what they need instantly.

Schema Design

-- Time-series events with partitioning
-- Partitioning is non-negotiable for time-series at scale.
-- Without it, deleting 30-day-old data requires a massive DELETE
-- that generates WAL, bloats the table, and tanks performance.
-- With partitioning, you just DROP the old partition: instant, zero I/O.
CREATE TABLE events (
    id BIGSERIAL,
    event_time TIMESTAMPTZ NOT NULL,
    event_type VARCHAR(50) NOT NULL,
    user_id BIGINT,
    properties JSONB,
    -- Note: event_time is FIRST in the primary key. This is required
    -- because partition keys must be part of the PK, and putting it
    -- first gives us efficient range queries on time.
    PRIMARY KEY (event_time, id)
) PARTITION BY RANGE (event_time);

-- Create monthly partitions
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
-- ... more partitions

-- Pre-aggregated summaries
CREATE TABLE event_stats_hourly (
    hour TIMESTAMPTZ NOT NULL,
    event_type VARCHAR(50) NOT NULL,
    count BIGINT DEFAULT 0,
    unique_users BIGINT DEFAULT 0,
    PRIMARY KEY (hour, event_type)
);

CREATE TABLE event_stats_daily (
    date DATE NOT NULL,
    event_type VARCHAR(50) NOT NULL,
    count BIGINT DEFAULT 0,
    unique_users BIGINT DEFAULT 0,
    properties_agg JSONB,  -- aggregated metrics
    PRIMARY KEY (date, event_type)
);

-- HyperLogLog for unique counts (requires extension)
-- HLL is a probabilistic data structure that counts distinct values
-- using ~1.2KB of memory regardless of cardinality. The trick:
-- you CAN'T merge COUNT(DISTINCT) across time buckets (counting
-- 100 unique users per hour doesn't tell you how many unique users
-- per day, because users overlap). HLL sketches CAN be merged.
-- Error rate: ~2%. For dashboards, this is perfectly acceptable.
CREATE EXTENSION IF NOT EXISTS hll;

ALTER TABLE event_stats_hourly 
ADD COLUMN user_hll hll DEFAULT hll_empty();

Ingestion Pattern

-- Batch insert with conflict handling
INSERT INTO events (event_time, event_type, user_id, properties)
VALUES 
    ($1, $2, $3, $4),
    ($5, $6, $7, $8),
    -- ... batch of 1000
ON CONFLICT DO NOTHING;

-- Continuous aggregation (run every hour)
INSERT INTO event_stats_hourly (hour, event_type, count, unique_users, user_hll)
SELECT 
    date_trunc('hour', event_time) AS hour,
    event_type,
    COUNT(*) AS count,
    COUNT(DISTINCT user_id) AS unique_users,
    hll_add_agg(hll_hash_bigint(user_id)) AS user_hll
FROM events
WHERE event_time >= $last_processed_hour
  AND event_time < $current_hour
GROUP BY 1, 2
ON CONFLICT (hour, event_type) DO UPDATE SET
    count = event_stats_hourly.count + EXCLUDED.count,
    unique_users = EXCLUDED.unique_users,
    user_hll = hll_union(event_stats_hourly.user_hll, EXCLUDED.user_hll);

Query Patterns

-- Dashboard query (uses pre-aggregates)
SELECT 
    date_trunc('day', hour) AS date,
    event_type,
    SUM(count) AS total_events,
    hll_cardinality(hll_union_agg(user_hll)) AS unique_users
FROM event_stats_hourly
WHERE hour >= NOW() - INTERVAL '7 days'
GROUP BY 1, 2
ORDER BY 1, 2;

-- Drill-down to raw events
SELECT 
    event_time,
    event_type,
    properties
FROM events
WHERE event_time >= $start AND event_time < $end
  AND event_type = $type
ORDER BY event_time DESC
LIMIT 100;

Data Lifecycle

-- Drop old partitions (automated)
-- This is the magic of partitioning for time-series: deleting
-- 30 days of data is O(1) -- just drop the table. No vacuum needed.
-- Compare this to DELETE FROM events WHERE event_time < '2023-02-01'
-- which would scan millions of rows, generate massive WAL, and leave
-- behind dead tuples that autovacuum would struggle to clean.
DROP TABLE events_2023_01;  -- Instant, no locking

-- Compress old data with TimescaleDB (optional)
-- Or move to columnar storage (Citus columnar)
-- Columnar storage can achieve 10-20x compression for analytics data
-- because columns within a block tend to have similar values.
SELECT alter_table_set_access_method('events_2023_12', 'columnar');
Partition management pitfall: If you forget to create future partitions, inserts will fail with a routing error. Always automate partition creation using pg_cron or pg_partman. A common pattern is to create partitions 3 months ahead and drop partitions older than your retention policy in the same cron job.

Case Study 3: Social Media Feed

Requirements

  • Users follow other users
  • Home feed shows posts from followed users
  • Support for 100M users, 10M DAU
  • Feed must be under 100ms
The Fan-Out Problem: This is the single most interesting database design challenge in social media. When a user with 10 million followers posts, do you write that post into 10 million timelines immediately (fan-out-on-write), or do you assemble each user’s feed at read time (fan-out-on-read)? The answer, as Twitter/X famously discovered, is “both” — a hybrid approach. Regular users get fan-out-on-write for instant feed delivery. Celebrity posts get fan-out-on-read to avoid writing 10 million rows for a single post. This is the kind of trade-off that separates senior from staff-level thinking.

Schema Design

CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE follows (
    follower_id BIGINT NOT NULL REFERENCES users(id),
    following_id BIGINT NOT NULL REFERENCES users(id),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (follower_id, following_id)
);

CREATE TABLE posts (
    id BIGSERIAL PRIMARY KEY,
    author_id BIGINT NOT NULL REFERENCES users(id),
    content TEXT NOT NULL,
    media_urls TEXT[],
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Fan-out timeline (for users with < 10K followers)
-- This is the "inbox" model: when someone you follow posts,
-- a copy is placed in YOUR timeline. Like a mail carrier delivering
-- a letter to every subscriber's mailbox.
-- Trade-off: writes are expensive (one row per follower), but reads
-- are a single index scan on (user_id, created_at) -- extremely fast.
CREATE TABLE timelines (
    user_id BIGINT NOT NULL,
    post_id BIGINT NOT NULL REFERENCES posts(id),
    author_id BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    -- Composite PK: user_id first for "my feed" queries,
    -- created_at DESC for chronological ordering,
    -- post_id for uniqueness tie-breaking
    PRIMARY KEY (user_id, created_at DESC, post_id)
);

-- Celebrity posts (for users with > 10K followers)
-- NOT fanned out -- merged at read time instead.
-- If a celebrity with 10M followers posts, we do NOT write 10M timeline rows.
-- Instead, at feed-read time, we fetch their latest posts and merge them
-- with the pre-materialized timeline. This is the "pull" model for celebrities.
CREATE INDEX idx_posts_author_time ON posts(author_id, created_at DESC);

Fan-Out Strategy

-- On new post from regular user (< 10K followers)
INSERT INTO timelines (user_id, post_id, author_id, created_at)
SELECT 
    follower_id,
    $post_id,
    $author_id,
    $created_at
FROM follows
WHERE following_id = $author_id;

-- Async worker for large fan-outs
-- Use message queue (RabbitMQ, SQS) for batching

Feed Query (Hybrid)

-- Get user's celebrity followings
WITH celebrities AS (
    SELECT following_id 
    FROM follows f
    JOIN user_stats us ON us.user_id = f.following_id
    WHERE f.follower_id = $user_id 
      AND us.follower_count > 10000
),
-- Get pre-materialized timeline
materialized AS (
    SELECT post_id, author_id, created_at
    FROM timelines
    WHERE user_id = $user_id
      AND created_at > NOW() - INTERVAL '7 days'
    ORDER BY created_at DESC
    LIMIT 100
),
-- Get celebrity posts
celebrity_posts AS (
    SELECT id AS post_id, author_id, created_at
    FROM posts
    WHERE author_id IN (SELECT following_id FROM celebrities)
      AND created_at > NOW() - INTERVAL '7 days'
    ORDER BY created_at DESC
    LIMIT 100
)
-- Merge and return
SELECT p.*
FROM posts p
JOIN (
    SELECT * FROM materialized
    UNION ALL
    SELECT * FROM celebrity_posts
) feed ON feed.post_id = p.id
ORDER BY feed.created_at DESC
LIMIT 20;

Case Study 4: Multi-Tenant SaaS

Requirements

  • Thousands of tenants (companies)
  • Complete data isolation
  • Some tenants are very large (1M+ rows)
  • Compliance requirements (GDPR)
The three multi-tenancy models: (1) Separate databases per tenant — maximum isolation, operational nightmare at scale. (2) Separate schemas per tenant — good isolation, schema migrations become N migrations. (3) Shared tables with row-level security — minimal operational overhead, relies on PostgreSQL RLS for isolation. We use model 3 here because it scales to thousands of tenants without multiplying operational complexity. The trade-off: you trust PostgreSQL’s RLS to never leak data. This is a reasonable trust — RLS is enforced at the query planner level, not the application level — but it requires careful testing and policy design.

Schema Design (Shared with RLS)

-- Enable RLS
CREATE TABLE tenants (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    plan VARCHAR(20) DEFAULT 'free',
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(id),
    email VARCHAR(255) NOT NULL,
    role VARCHAR(20) DEFAULT 'member',
    UNIQUE(tenant_id, email)
);

CREATE TABLE projects (
    id BIGSERIAL PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(id),
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE tasks (
    id BIGSERIAL PRIMARY KEY,
    tenant_id UUID NOT NULL REFERENCES tenants(id),
    project_id BIGINT REFERENCES projects(id),
    title VARCHAR(255) NOT NULL,
    status VARCHAR(20) DEFAULT 'open',
    assignee_id BIGINT REFERENCES users(id)
);

-- Row Level Security
ALTER TABLE projects ENABLE ROW LEVEL SECURITY;
ALTER TABLE tasks ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation_projects ON projects
    USING (tenant_id = current_setting('app.tenant_id')::UUID);

CREATE POLICY tenant_isolation_tasks ON tasks
    USING (tenant_id = current_setting('app.tenant_id')::UUID);

-- Set tenant context (in application)
-- SET app.tenant_id = 'uuid-here';

Connection Management

# Application middleware
# CRITICAL: This is the security boundary. If you forget to set the
# tenant context, RLS policies default to blocking all rows (safe).
# But if you set the WRONG tenant ID, that user sees another tenant's data.
# Always derive tenant_id from the authenticated session, never from
# user-supplied input like query parameters or request headers.
def set_tenant_context(tenant_id):
    """Set tenant context for RLS"""
    with get_db_connection() as conn:
        # Use parameterized query to prevent SQL injection in the SET command
        conn.execute("SET app.tenant_id = %s", (str(tenant_id),))
        yield conn
        conn.execute("RESET app.tenant_id")  # Clean up: never leak context to next request

# PgBouncer with tenant routing (advanced)
# Route large tenants to dedicated pools
# Pitfall: PgBouncer in transaction mode reuses connections across tenants.
# You MUST set tenant context at the start of every transaction, not just
# at connection acquisition, because the connection may have been used
# by a different tenant in the previous transaction.

Tenant Isolation Testing

-- Test: Ensure tenant A can't see tenant B's data
SET app.tenant_id = 'tenant-a-uuid';

SELECT * FROM projects;  -- Should only see tenant A

-- Try to access tenant B (should fail silently - return 0 rows)
SELECT * FROM projects WHERE tenant_id = 'tenant-b-uuid';
-- Returns 0 rows due to RLS

-- Bypass for admin operations
SET ROLE admin_role;  -- Must have BYPASSRLS
SELECT * FROM projects;  -- Sees all tenants

Case Study 5: Event Sourcing System

Requirements

  • All state changes stored as events (audit trail)
  • Ability to rebuild state at any point in time
  • Support for projections (read models)
  • Event replay for debugging
Why event sourcing? Traditional CRUD overwrites state — when you update an order status from “pending” to “shipped,” the “pending” state is gone forever. Event sourcing instead stores every state transition as an immutable fact: “OrderCreated,” “ItemAdded,” “OrderShipped.” The current state is derived by replaying these events. Think of it like accounting: you never erase a ledger entry. You add correcting entries. This gives you a complete audit trail, the ability to debug by replaying history, and the power to build new views of your data retroactively. The cost: read queries require materializing projections, and the event store grows indefinitely. This trade-off is worth it for domains where auditability and temporal queries matter — finance, healthcare, logistics, and compliance-heavy applications.

Schema Design

-- Event store (append-only -- NEVER update or delete events)
CREATE TABLE events (
    id BIGSERIAL PRIMARY KEY,  -- Global ordering: critical for subscription processing
    stream_id UUID NOT NULL,   -- Groups related events (e.g., all events for order-123)
    stream_type VARCHAR(100) NOT NULL,  -- 'Order', 'User', etc.
    event_type VARCHAR(100) NOT NULL,   -- 'OrderCreated', 'ItemAdded'
    event_data JSONB NOT NULL,          -- The event payload
    metadata JSONB DEFAULT '{}',        -- Correlation IDs, causation IDs, user who triggered it
    version INT NOT NULL,               -- Per-stream version number
    created_at TIMESTAMPTZ DEFAULT NOW(),
    -- This UNIQUE constraint is your optimistic concurrency control.
    -- Two concurrent transactions trying to append version 4 to the
    -- same stream: one wins, one gets a unique violation. The loser
    -- must reload state, re-apply business logic, and retry.
    -- This is how you get consistency WITHOUT pessimistic locking.
    UNIQUE(stream_id, version)
);

-- Example events
INSERT INTO events (stream_id, stream_type, event_type, event_data, version)
VALUES 
    ('order-123', 'Order', 'OrderCreated', 
     '{"customer_id": 456, "items": []}', 1),
    ('order-123', 'Order', 'ItemAdded', 
     '{"product_id": 789, "quantity": 2}', 2),
    ('order-123', 'Order', 'OrderSubmitted', 
     '{"submitted_at": "2024-01-15T10:30:00Z"}', 3);

-- Projection: Materialized current state
CREATE TABLE order_projections (
    order_id UUID PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    items JSONB NOT NULL DEFAULT '[]',
    status VARCHAR(20) NOT NULL,
    total DECIMAL(12,2),
    last_event_id BIGINT NOT NULL,  -- For replay
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Subscription checkpoints
CREATE TABLE subscription_checkpoints (
    subscription_id VARCHAR(100) PRIMARY KEY,
    last_processed_id BIGINT NOT NULL,
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

Event Replay

-- Rebuild projection from events
CREATE OR REPLACE FUNCTION rebuild_order_projection(p_order_id UUID)
RETURNS void AS $$
DECLARE
    evt RECORD;
    current_state JSONB := '{}';
BEGIN
    -- Delete existing projection
    DELETE FROM order_projections WHERE order_id = p_order_id;
    
    -- Replay all events
    FOR evt IN 
        SELECT * FROM events 
        WHERE stream_id = p_order_id 
        ORDER BY version
    LOOP
        current_state := apply_event(current_state, evt.event_type, evt.event_data);
    END LOOP;
    
    -- Insert rebuilt projection
    INSERT INTO order_projections (order_id, customer_id, items, status, last_event_id)
    VALUES (
        p_order_id,
        (current_state->>'customer_id')::BIGINT,
        current_state->'items',
        current_state->>'status',
        evt.id
    );
END;
$$ LANGUAGE plpgsql;

Subscription Processing

# Python async event processor
async def process_events():
    last_id = get_checkpoint('order_projection')
    
    while True:
        events = await db.fetch("""
            SELECT * FROM events 
            WHERE id > $1 
            ORDER BY id 
            LIMIT 100
        """, last_id)
        
        for event in events:
            await update_projection(event)
            last_id = event['id']
        
        await save_checkpoint('order_projection', last_id)
        await asyncio.sleep(0.1)

Case Study 6: Global Application

Requirements

  • Users in US, EU, Asia
  • Data residency (EU data stays in EU)
  • Low latency reads globally
  • Strong consistency for writes
Data residency is a legal constraint, not a technical preference. Under GDPR, EU user data must be stored and processed within the EU (or in countries with an adequacy decision). Violating this is not a performance issue — it is a legal liability that can result in fines up to 4% of global annual revenue. This means your database architecture must enforce geographic data boundaries at the infrastructure level, not rely on application-level routing alone. A misconfigured replica that syncs EU user data to a US datacenter is a compliance violation, even if no one ever queries it there.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    GLOBAL DATABASE ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   US-EAST (Primary for US users)                                            │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  ┌─────────────┐     ┌─────────────┐                               │   │
│   │  │   Primary   │────▶│   Replica   │ (local reads)                 │   │
│   │  │  (writes)   │     │             │                               │   │
│   │  └──────┬──────┘     └─────────────┘                               │   │
│   └─────────┼───────────────────────────────────────────────────────────┘   │
│             │ async replication (reference data only)                       │
│   ┌─────────┼───────────────────────────────────────────────────────────┐   │
│   │         ▼           EU-WEST (Primary for EU users)                 │   │
│   │  ┌─────────────┐     ┌─────────────┐                               │   │
│   │  │   Primary   │────▶│   Replica   │                               │   │
│   │  │  (writes)   │     │             │                               │   │
│   │  └─────────────┘     └─────────────┘                               │   │
│   │                                                                     │   │
│   │  EU user data ONLY stored here (GDPR)                              │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   Data Routing:                                                             │
│   • User data: Route to home region based on user.region                   │
│   • Reference data: Replicated globally (products, configs)                │
│   • Cross-region queries: Federation layer or application-level           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation

-- Users partitioned by region
CREATE TABLE users (
    id BIGSERIAL,
    region VARCHAR(10) NOT NULL,  -- 'us', 'eu', 'asia'
    email VARCHAR(255) NOT NULL,
    data JSONB,
    PRIMARY KEY (region, id)
) PARTITION BY LIST (region);

CREATE TABLE users_us PARTITION OF users FOR VALUES IN ('us');
CREATE TABLE users_eu PARTITION OF users FOR VALUES IN ('eu');
CREATE TABLE users_asia PARTITION OF users FOR VALUES IN ('asia');

-- Reference data (replicated everywhere)
CREATE TABLE products (
    id BIGSERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    -- No region-specific data
);

-- Foreign data wrapper for cross-region queries
CREATE EXTENSION postgres_fdw;

CREATE SERVER eu_server
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'eu-db.example.com', port '5432', dbname 'app');

CREATE USER MAPPING FOR app_user
SERVER eu_server
OPTIONS (user 'app_user', password 'secret');

CREATE FOREIGN TABLE users_eu_remote (
    id BIGINT,
    region VARCHAR(10),
    email VARCHAR(255)
) SERVER eu_server OPTIONS (table_name 'users_eu');

How to Use These Case Studies in Interviews

System design interviews for database engineering roles typically follow a 45-60 minute format. Here’s how to leverage these case studies for maximum interview success:

Interview Structure and Timing

Phase 1: Requirements Clarification (5-10 minutes) This is where most candidates fail before they even start. Jumping into schema design without understanding constraints is like writing code without reading the requirements. The interviewer is evaluating your engineering judgment, not your typing speed.
  • Ask about scale: How many users? Queries per second? Data volume?
  • Identify read vs write ratio (often 90:10 or higher)
  • Clarify consistency requirements: Strong vs eventual consistency?
  • Understand geographic distribution and latency requirements
  • Confirm data retention and compliance needs
Phase 2: High-Level Design (10-15 minutes)
  • Start with a simple single-server design
  • Identify the main entities and their relationships
  • Draw a basic ER diagram or schema outline
  • Explain your normalization choices
  • Discuss primary access patterns and query shapes
Phase 3: Deep Dive on Specific Components (15-20 minutes)
  • Schema Design: Walk through 2-3 core tables in detail
    • Show primary keys, foreign keys, indexes
    • Explain constraint choices
    • Justify data types and column sizes
  • Query Patterns: Write 2-3 critical queries
    • Show how indexes support these queries
    • Discuss query plan expectations
    • Mention potential optimization techniques
  • Scaling Bottlenecks: Identify the first scaling constraint
    • Connection limits? I/O? CPU?
    • Propose specific solutions (not just “add caching”)
Phase 4: Scaling Discussion (10-15 minutes)
  • Progression from single server to scaled architecture
  • Read scaling: Replicas, caching layers, CDN
  • Write scaling: Partitioning, sharding strategies
  • Handle trade-offs explicitly:
    • “If we denormalize here, we gain X but lose Y”
    • “Caching improves reads but introduces consistency challenges”
Phase 5: Trade-offs and Extensions (5-10 minutes)
  • Discuss monitoring and observability needs
  • Mention backup and disaster recovery
  • Consider operational complexity
  • Address interviewer’s specific concerns

What Interviewers Look For

✅ Strong Signals:
  • Clear, structured thinking with explicit phases
  • Asking clarifying questions before jumping to solutions
  • Justifying design decisions with concrete trade-offs
  • Concrete SQL examples, not just hand-waving
  • Understanding when to denormalize and why
  • Awareness of operational concerns (monitoring, backups)
  • Scaling incrementally (not over-engineering initially)
❌ Red Flags:
  • Immediately jumping to complex distributed systems
  • Generic answers (“we’ll use microservices” without specifics)
  • Not considering query patterns when designing schema
  • Ignoring constraints and data integrity
  • Proposing technologies without understanding their trade-offs
  • Unable to write actual SQL queries
  • No discussion of indexes or query optimization

Practice Strategies

1. Time Yourself: Set a 45-minute timer and work through a full case study 2. Write Actual SQL: Don’t just describe—write CREATE TABLE statements 3. Draw Diagrams: Practice drawing ER diagrams and architecture diagrams quickly 4. Explain Out Loud: Verbalize your thinking process as if explaining to an interviewer 5. Compare Your Solution: After completing, review the provided solution and note differences 6. Focus on Trade-offs: For every design choice, practice stating the pros and cons

Common Interview Questions to Prepare

  • “How would you handle hot partition problems in sharding?”
  • “What’s your strategy for zero-downtime migrations?”
  • “How do you ensure data consistency across microservices?”
  • “Explain your approach to managing database schema changes in production.”
  • “How would you optimize this query?” (given EXPLAIN output)
  • “What monitoring metrics would you track for this system?”

Case Study to Interview Mapping

  • E-Commerce: Tests inventory management, transactions, and consistency
  • Analytics Platform: Tests time-series optimization and aggregation strategies
  • Social Feed: Tests fan-out approaches and caching patterns
  • Multi-Tenant SaaS: Tests isolation strategies and connection management
  • Event Sourcing: Tests immutable data patterns and projections
  • Global Application: Tests geo-distribution and data residency
Use these case studies to build a mental library of patterns you can adapt to any interview scenario.

Design Template: Build Your Own Case Study

Use this template to practice system design for any application domain:

1. Requirements Analysis

Functional Requirements:
  • List 5-7 core features the system must support
  • Identify the main entities and their relationships
  • Define critical user workflows (e.g., checkout flow, post creation)
Non-Functional Requirements:
  • Scale: How many users? Requests/second? Data volume?
  • Read:Write Ratio: Typically 90:10, 95:5, or 50:50?
  • Latency: P95/P99 targets (e.g., less than 100ms for reads, less than 500ms for writes)
  • Consistency: Strong, eventual, or per-operation?
  • Availability: Target uptime (e.g., 99.9%, 99.99%)?

2. Data Model Design

Core Entities Template:
-- Entity 1: [Name]
CREATE TABLE [entity_name] (
    id BIGSERIAL PRIMARY KEY,
    -- Add relevant columns with appropriate types
    -- Include timestamps
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Entity 2: [Name]
-- ...

-- Relationships (junction tables if needed)
-- ...
Schema Design Checklist:
  • All tables have explicit primary keys
  • Foreign keys defined with appropriate ON DELETE behavior
  • Unique constraints on natural keys (email, username, etc.)
  • Check constraints for data validation
  • Appropriate indexes for query patterns (list them explicitly)
  • Consider partitioning strategy if time-series or very large tables
  • JSONB columns only where schema flexibility is truly needed

3. Access Patterns and Queries

Identify Top 5 Queries by Frequency:
  1. [Query description] - Expected QPS: ___
    -- Write the SQL query
    -- Note: This query uses index on [column]
    
  2. [Query description] - Expected QPS: ___
  3. [Query description] - Expected QPS: ___
  4. [Query description] - Expected QPS: ___
  5. [Query description] - Expected QPS: ___
Write Patterns:
  • Transactional writes (e.g., orders, payments)
  • Bulk imports (e.g., data ingestion)
  • Background jobs (e.g., aggregation, cleanup)

4. Indexing Strategy

-- List all indexes with justification
CREATE INDEX idx_[table]_[columns] ON [table]([columns]) 
WHERE [condition if partial];
-- Justification: Supports [specific query pattern]

-- Example:
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
-- Justification: Supports "recent orders by user" query (ORDER BY created_at DESC)

CREATE INDEX idx_products_category_active ON products(category_id) 
WHERE status = 'active';
-- Justification: Partial index for active product listings by category

5. Scaling Plan

Phase 1: Single Server (0-10K users)
  • Single PostgreSQL instance
  • Application handles connection pooling
  • Basic monitoring and backups
Phase 2: Read Scaling (10K-100K users)
  • Add 2-3 read replicas
  • Implement read/write splitting in application
  • Add Redis for session and hot data caching
  • CDN for static assets
Phase 3: Write Scaling (100K-1M users)
  • Vertical scaling of primary (larger instance)
  • Implement connection pooling (PgBouncer)
  • Consider partitioning large tables by date/hash
  • Asynchronous processing for non-critical writes
Phase 4: Distributed Architecture (1M+ users)
  • Sharding strategy: [by user_id, by geographic region, etc.]
  • Cross-shard query handling approach
  • Multi-region deployment with data residency
  • Event-driven architecture for eventual consistency

6. Trade-offs and Considerations

Consistency vs Performance:
  • Where can you use eventual consistency?
  • Which operations require strong consistency?
  • How do you handle consistency across shards?
Normalization vs Denormalization:
  • Which tables/columns will you denormalize? Why?
  • What’s your strategy for keeping denormalized data in sync?
Complexity vs Maintainability:
  • Is sharding premature optimization for your scale?
  • Can simpler solutions (caching, indexing) solve the problem?

7. Operational Considerations

Monitoring:
  • Key metrics to track (latency, throughput, error rate, saturation)
  • Query performance tracking (pg_stat_statements)
  • Replication lag monitoring
  • Connection pool utilization
Disaster Recovery:
  • Backup strategy and RPO (Recovery Point Objective)
  • Restore and RTO (Recovery Time Objective)
  • WAL archiving and PITR (Point-in-Time Recovery)
Schema Evolution:
  • How will you handle schema migrations with zero downtime?
  • Versioning strategy for breaking changes

Self-Evaluation Rubric

Use this rubric to assess your case study solutions and identify areas for improvement:

Requirements Understanding (Score: /10)

Excellent (9-10):
  • Asked clarifying questions about scale, consistency, and latency requirements
  • Identified all implicit requirements (e.g., multi-tenancy, geographic distribution)
  • Made reasonable assumptions where information was missing
  • Documented assumptions explicitly
Good (7-8):
  • Asked some clarifying questions
  • Covered most critical requirements
  • Made some assumptions but didn’t document all
Needs Improvement (score below 7):
  • Jumped directly to solution without clarifying requirements
  • Missed critical non-functional requirements
  • Made unrealistic assumptions

Schema Design (Score: /20)

Excellent (17-20):
  • All entities properly normalized (or intentionally denormalized with justification)
  • Appropriate data types and constraints
  • Foreign keys with correct ON DELETE behavior
  • Indexes match query patterns precisely
  • Considered partitioning where appropriate
  • JSONB used judiciously
Good (13-16):
  • Mostly normalized schema
  • Some indexes missing or unnecessary indexes included
  • Minor data type issues
  • Constraints mostly appropriate
Needs Improvement (under 13):
  • Normalization violations without justification
  • Missing foreign keys or constraints
  • No indexes or all indexes wrong
  • Inappropriate use of JSONB for structured data
  • Missing primary keys

Query Design (Score: /15)

Excellent (13-15):
  • All queries are efficient and use indexes
  • Proper use of JOINs, CTEs, and window functions
  • Transactions used appropriately for consistency
  • Handles edge cases (NULL values, empty results)
Good (10-12):
  • Queries are functional but not optimally efficient
  • Some missed opportunities for index usage
  • Minor correctness issues
Needs Improvement (under 10):
  • Queries don’t use available indexes
  • N+1 query patterns
  • Missing WHERE clauses or overly broad queries
  • Incorrect JOIN types or conditions

Scaling Strategy (Score: /20)

Excellent (17-20):
  • Clear progression from simple to complex
  • Specific solutions for read and write scaling
  • Justified when to introduce each component
  • Considered operational complexity
  • Addressed specific bottlenecks (connection limits, I/O, CPU)
Good (13-16):
  • Reasonable scaling approach
  • Some components mentioned without detail
  • Missing specific thresholds for scaling triggers
Needs Improvement (under 13):
  • Generic “use microservices” without specifics
  • Over-engineered from the start
  • No consideration of bottlenecks
  • Proposed solutions don’t match requirements

Trade-off Analysis (Score: /15)

Excellent (13-15):
  • Every major decision includes explicit trade-offs
  • Quantified trade-offs where possible (e.g., “10% write overhead for 50% read improvement”)
  • Considered alternative approaches
  • Addressed operational implications
Good (10-12):
  • Some trade-offs discussed
  • Mostly qualitative analysis
  • Missing some alternative approaches
Needs Improvement (under 10):
  • No trade-off discussion
  • Claims solutions are “best” without justification
  • Doesn’t consider downsides

Operational Awareness (Score: /10)

Excellent (9-10):
  • Detailed monitoring plan
  • Backup and disaster recovery strategy
  • Schema migration approach
  • Considered operational complexity and team skills
Good (7-8):
  • Mentioned monitoring and backups
  • Missing some operational details
Needs Improvement (score under 7):
  • No discussion of operations
  • Proposes complex system without considering maintenance

Communication Clarity (Score: /10)

Excellent (9-10):
  • Clear, structured presentation
  • Used diagrams effectively
  • Explained rationale for decisions
  • Easy to follow progression
Good (7-8):
  • Mostly clear
  • Some organization issues
  • Could use more diagrams
Needs Improvement (score under 7):
  • Difficult to follow
  • Jumps around randomly
  • No visual aids

Total Score: ___/100

90-100: Excellent - Ready for staff+ interviews 75-89: Good - Solid senior-level understanding 60-74: Fair - Has foundational knowledge but needs more practice Below 60: Needs Improvement - Review fundamentals and practice more

Action Items Based on Your Score

If you scored below 70:
  1. Review the fundamentals in earlier modules (especially Database Design and Performance)
  2. Practice writing actual SQL, not just describing solutions
  3. Focus on one case study at a time and compare your solution to the provided one
If you scored 70-85:
  1. Deepdive into areas where you scored lowest
  2. Practice explaining trade-offs out loud
  3. Time yourself on case studies to improve pacing
  4. Write more detailed schemas and queries
If you scored above 85:
  1. Practice more complex variations
  2. Focus on communication and presentation
  3. Review edge cases and failure scenarios
  4. Consider mentoring others to solidify your knowledge

Capstone Project: Build a Complete Database-Backed System

Put everything together by building a production-ready system from scratch:

Project Overview

Design, implement, and document a complete database-backed application that demonstrates mastery of all course concepts. Duration: 2-4 weeks
Outcome: Portfolio piece suitable for interviews
Format: Working application + comprehensive documentation

Project Requirements

Choose ONE of these domains:
  1. Multi-Tenant SaaS Analytics Platform: Track events, generate reports, support multiple customers
  2. Real-Time Collaborative Tool: Like Google Docs or Figma, with conflict resolution
  3. E-Learning Platform: Courses, enrollments, progress tracking, assessments
  4. Inventory Management System: Multi-warehouse, real-time stock, order fulfillment
  5. Social Media Feed Engine: Posts, follows, timeline generation, notifications
  6. OR propose your own (must be approved based on complexity)

Implementation Checklist

Phase 1: Design and Planning (Week 1)

  • Write a requirements document (functional and non-functional)
  • Create detailed ER diagram with cardinality
  • Design complete schema with all constraints
  • Identify top 10 query patterns
  • Plan indexing strategy
  • Document normalization decisions and intentional denormalizations

Phase 2: Core Implementation (Week 1-2)

  • Set up PostgreSQL database (local or cloud)
  • Implement complete schema with migrations
  • Seed database with realistic test data (use faker libraries)
  • Implement all core queries with EXPLAIN ANALYZE results
  • Add appropriate indexes and verify performance
  • Write SQL-level tests for critical queries

Phase 3: Application Layer (Week 2-3)

  • Build simple API or web app that uses the database
  • Implement connection pooling (PgBouncer or application-level)
  • Add basic authentication and authorization
  • Implement at least one complex transaction (e.g., checkout, enrollment)
  • Handle concurrent access scenarios
  • Add error handling and validation

Phase 4: Performance and Scaling (Week 3)

  • Set up pg_stat_statements for query monitoring
  • Profile top queries and optimize
  • Implement caching layer (Redis or similar) for hot data
  • Add read replica (if using cloud provider)
  • Load test with realistic traffic patterns
  • Document performance baselines

Phase 5: Production Readiness (Week 4)

  • Set up automated backups
  • Implement monitoring dashboards (Grafana or similar)
  • Add alerts for critical metrics
  • Create runbook for common operational tasks
  • Implement graceful degradation for failures
  • Document disaster recovery procedures

Deliverables

1. Code Repository (GitHub/GitLab)
  • Schema migrations (versioned)
  • Seed data scripts
  • Application code
  • Test suite
  • README with setup instructions
2. Design Document (Markdown/PDF)
  • Requirements and assumptions
  • ER diagrams and schema design
  • Query patterns and access patterns
  • Indexing strategy with justifications
  • Scaling plan (current + future)
  • Trade-off analysis
3. Performance Report
  • Query performance benchmarks (EXPLAIN ANALYZE outputs)
  • Load testing results (throughput, latency percentiles)
  • Bottleneck analysis
  • Optimization decisions and their impact
4. Operational Playbook
  • Monitoring setup and key metrics
  • Backup and restore procedures
  • Common troubleshooting scenarios
  • Schema migration workflow
  • Disaster recovery plan

Evaluation Criteria

Technical Depth (40%)
  • Schema design quality and normalization
  • Query optimization and index usage
  • Transaction handling and concurrency
  • Performance under load
Production Readiness (30%)
  • Monitoring and observability
  • Error handling and resilience
  • Backup and recovery capability
  • Security considerations
Documentation (20%)
  • Clear architecture and design decisions
  • Trade-off analysis
  • Operational procedures
  • Code comments and README
Completeness (10%)
  • All phases completed
  • Realistic data and scenarios
  • Working end-to-end

Success Examples

Excellent Projects Include:
  • Thoughtful denormalization with documented trade-offs
  • Creative use of PostgreSQL features (JSONB, CTEs, window functions)
  • Real performance optimization with before/after metrics
  • Production-quality error handling and logging
  • Comprehensive monitoring setup
Red Flags to Avoid:
  • Using MongoDB or MySQL when PostgreSQL was required
  • No indexes or all wrong indexes
  • Hand-waving instead of actual implementation
  • No performance testing
  • Generic solutions without domain-specific optimization

Bonus Challenges (Optional)

For extra depth, consider adding:
  • Multi-region deployment with data residency
  • Sharding implementation for a specific table
  • Custom PostgreSQL extension
  • Advanced full-text search with ranking
  • Event sourcing pattern for audit trail
  • GraphQL API with optimized data loader
  • Real-time subscriptions using PostgreSQL LISTEN/NOTIFY

Getting Help

  • Post questions in course Discord
  • Schedule 1:1 code reviews (if available)
  • Share your design doc for feedback before full implementation
  • Present your project to peers for practice
This capstone project is your opportunity to demonstrate comprehensive mastery of database engineering principles in a realistic, portfolio-worthy implementation.

Interview Deep-Dive

Strong Answer:Start with requirements clarification: we need riders, drivers, ride requests, active rides, ride history, and payments. The core tension is between the real-time matching workload (low-latency writes for driver location updates, sub-second ride matching) and the analytics workload (trip history, revenue reports, driver earnings).The schema separates hot and cold paths. The driver_locations table stores current GPS coordinates and is updated every 5-10 seconds per active driver — this is a high-write, high-read table that benefits from being kept small (only active drivers, with completed rides archived). I would use a PostGIS geography column for location data, enabling spatial index queries like “find all available drivers within 2km of this pickup point” using ST_DWithin.The rides table tracks the lifecycle: requested, matched, in_progress, completed, cancelled. A key design decision is whether to store the full route as a JSONB array of coordinates or in a separate ride_waypoints table. For most queries (fare calculation, ETA), the start and end points plus the polyline are sufficient — store the full waypoint trace in a separate table or object storage for replay/dispute resolution.For the top 3 queries: (1) “Find available drivers near pickup” — spatial query on driver_locations with a GiST index, filtered by status = 'available'. (2) “Get ride history for user” — index on (rider_id, created_at DESC) with pagination using keyset/cursor, not OFFSET. (3) “Calculate driver earnings for pay period” — pre-aggregated in a driver_earnings_daily materialized view, summing completed ride fares by driver and day.For scaling from 1 to 50 cities: start with a single database, partitioned by city or region. As you grow, shard by geographic region — rides in Tokyo never join with rides in London, so cross-shard queries are extremely rare. Driver locations can be handled by a Redis-backed geospatial index for real-time matching, with PostgreSQL as the system of record for completed rides. The key insight is that ride-sharing has strong geographic locality, making it one of the cleanest sharding candidates in system design.Follow-up: How do you handle the scenario where a driver is near a city boundary and could serve riders in either city?Geographic boundaries should use overlapping zones, not hard borders. When matching drivers near a boundary, query both city shards (or partition zones). Since this affects a small percentage of rides (drivers near boundaries), the cross-shard cost is acceptable. In practice, companies like Uber use hexagonal grid systems (H3) that can overlap administrative boundaries. The shard key is the hex cell, not the city name, and boundary cells are replicated to both regions. This adds write amplification for a small number of cells but eliminates the boundary problem entirely.
Strong Answer:This is the “noisy neighbor” problem, and it is one of the most common real-world multi-tenant challenges. The naive approach — shared tables with a tenant_id column — breaks down when one tenant dominates because their queries scan enormous partitions while smaller tenants pay the latency cost of shared infrastructure.First, schema design: I would use a hybrid isolation model. The large tenant gets their own dedicated schema (or even a dedicated database instance on a larger machine), while the remaining tenants share tables with Row-Level Security. This is the “pooled small, dedicated large” pattern. The threshold for promotion from shared to dedicated should be based on measurable impact: when a tenant’s queries consistently exceed P95 latency targets or their data volume causes autovacuum to run for hours on shared tables.For partitioning, the shared tenant tables should be partitioned by tenant_id using hash partitioning, with the large tenant explicitly routed to their own partition. This gives you partition pruning (queries for tenant X only scan tenant X’s partition) and independent vacuum per partition. The large tenant’s dedicated partition can be further sub-partitioned by time if they have time-series data.Connection management is critical. Without connection pooling, the large tenant could exhaust all PostgreSQL connections during peak analytics queries. Use PgBouncer with per-tenant pool limits: the large tenant gets a dedicated pool of, say, 50 connections, while the shared pool for all other tenants gets 100 connections. This prevents the large tenant from starving everyone else. Additionally, set per-tenant statement_timeout and work_mem limits — the large tenant may need higher work_mem for their complex aggregations, but you do not want their queries consuming all available memory.The key trade-off: operational complexity increases with hybrid isolation. You now have two deployment models, two monitoring configurations, and two migration paths. This is worth it when the alternative is all tenants suffering degraded performance because of one whale. In my experience, the breakpoint is around 10x data disparity — if one tenant has more than 10x the data of the median tenant, hybrid isolation pays for itself.Follow-up: The large tenant now wants real-time dashboards while other tenants are fine with 15-minute-old data. How does this change the architecture?This is a read-path bifurcation problem. For tenants accepting stale data, serve their dashboards from read replicas with a controlled replication lag — even add a materialized view refresh cycle of 15 minutes. For the large tenant demanding real-time, their queries must hit the primary or a synchronous replica. But you cannot route all their dashboard traffic to the primary without impacting write throughput. The solution: use logical replication to stream only the large tenant’s tables to a dedicated analytics replica. This replica gets real-time data for that tenant without carrying the full database load. You are trading replication infrastructure complexity for workload isolation — a reasonable trade at this scale.
Strong Answer:Schema evolution is the hardest long-term challenge in event sourcing, and most teams underestimate it. Events are immutable — you cannot retroactively change an OrderCreated event from v1 (which had price as an integer in cents) to v2 (which uses a money object with currency). Old events with the old shape will exist forever.The standard approach is event upcasting: store a schema_version field on every event, and when reading events, apply an upcasting pipeline that transforms older versions to the latest shape. For example, an upcaster for OrderCreated v1-to-v2 would transform {"price": 1999} into {"money": {"amount": 1999, "currency": "USD"}}, with USD as a default because v1 was single-currency. These upcasters are pure functions that run at read time, meaning you never modify stored events. The projection builder applies the upcaster chain before processing. This is similar to how database migrations work, except the source data (events) is never altered.For the event store performance: the fundamental scaling challenge is that the event store is append-only and grows forever. At financial services scale (millions of events per day), the table will reach billions of rows within a year. Partition by time (monthly or weekly), and index on (stream_id, version) for stream replay and (id) for subscription processing (global ordering). The id column should be a BIGSERIAL, not a UUID, because subscription processors need sequential ordering, and UUID ordering is meaningless.Projection rebuilds are the hidden performance risk. If a projection bug requires replaying all events for a stream with 500,000 events, that is a long-running read query. Introduce snapshot checkpoints: every N events (e.g., every 1,000), serialize the current aggregate state as a snapshot. Rebuilding then starts from the latest snapshot and only replays events since the snapshot. This turns an O(N) rebuild into O(1000) in the worst case.For write throughput, the append-only nature of the event store is actually an advantage — no updates means no row-level locking contention. The bottleneck is usually the unique constraint check on (stream_id, version) for optimistic concurrency. At very high throughput, this B-tree index insertion can become a hotspot. The mitigation is to shard the event store by stream_id, so concurrent writes to different streams hit different index pages. In PostgreSQL, hash partitioning on stream_id achieves this naturally.Follow-up: A regulator requires you to delete all data for a specific customer (GDPR right to erasure). But events are immutable. How do you handle this?This is the most philosophically challenging intersection of event sourcing and privacy law. You have three options, each with trade-offs. First, crypto-shredding: encrypt each customer’s event data with a customer-specific key. To “delete” a customer, destroy the key. The events remain but are unreadable. This is the most elegant solution and is widely accepted by GDPR auditors because the data is effectively destroyed. Second, tombstone events: append a CustomerDataErased event and modify projections to ignore that customer’s prior events. The raw events still exist, which some regulators may not accept. Third, event rewriting: create a new version of the event store with the customer’s events redacted, then swap. This is the nuclear option — expensive, risky, but unambiguous. I recommend crypto-shredding as the default strategy, decided at system design time, not after the first deletion request arrives.
Strong Answer:This decision depends on four factors: query complexity, latency requirements, data freshness requirements, and operational budget.PostgreSQL full-text search (tsvector/tsquery with GIN indexes) is the right choice when search requirements are straightforward (keyword matching, basic ranking), the product catalog is under 10 million items, and you want to minimize infrastructure complexity. The advantages are transactional consistency (search results reflect the latest data immediately), no synchronization pipeline to maintain, and one less service to monitor at 3 AM. The disadvantages: PostgreSQL’s ranking algorithm is basic compared to Elasticsearch’s BM25 with custom scoring, faceted search (filter by brand, price range, color simultaneously) requires manual implementation with CTEs and conditional aggregation, and at scale the GIN index updates become a write amplification concern. A GIN index on a tsvector column can be 2-3x the size of the indexed data.Elasticsearch (or OpenSearch) is the right choice when search is a core differentiator (autocomplete, typo tolerance, relevance tuning, faceted navigation), the catalog exceeds 10 million items, and you can tolerate sub-second staleness. Elasticsearch excels at multi-faceted filtering (show me red shirts in size L under $50 with 4+ star ratings, sorted by relevance), fuzzy matching (“iphone” matches “iPhone”), and custom relevance scoring. The trade-off is operational: you now maintain a separate cluster, a synchronization pipeline (CDC or dual-write), and you must handle the eventual consistency window where a product is in PostgreSQL but not yet in the search index. Dual-writes are a consistency trap — use CDC (Change Data Capture) via Debezium or logical replication to feed Elasticsearch.Denormalized search tables are the right choice when you need specific, well-defined search patterns and want to avoid both full-text search complexity and a separate service. For example, a product_search materialized view that joins products, categories, brands, and pre-computed average ratings into a flat table with indexes matching your exact filter patterns. This gives you sub-millisecond exact-match queries without maintaining a search engine. The trade-off: no fuzzy matching, no relevance scoring, and the materialized view refresh adds latency. This is the “good enough” solution that many successful e-commerce sites use for their first few years.My recommendation: start with PostgreSQL full-text search (lowest operational cost), monitor search quality and latency metrics, and migrate to Elasticsearch only when users complain about search quality or performance degrades past your SLO. Premature introduction of Elasticsearch is one of the most common over-engineering decisions in e-commerce startups.Follow-up: You chose PostgreSQL full-text search initially. Six months later, product managers want autocomplete suggestions as users type. Does this change your recommendation?Autocomplete (prefix matching as the user types) is specifically where PostgreSQL full-text search struggles. tsquery operates on complete tokens, not prefixes. You can use LIKE 'iph%' with a text_pattern_ops index, or pg_trgm with a GIN trigram index for fuzzy prefix matching, but the performance and relevance quality degrade as the catalog grows. Autocomplete also requires extremely low latency (under 50ms) because it fires on every keystroke. At this point, I would introduce a lightweight in-memory solution like Redis with sorted sets containing product name prefixes, rather than jumping to Elasticsearch. The autocomplete index is small (product names only, not full documents), refreshes infrequently, and Redis delivers sub-millisecond responses. Reserve Elasticsearch for when you also need faceted search, synonyms, or complex relevance tuning — do not adopt it solely for autocomplete.
Strong Answer:Zero-downtime migration of a 5TB database is one of the hardest operational challenges in database engineering. The key principle: never do a big-bang cutover. Every step must be incremental and reversible.Phase 1 — Dual-write preparation (2-4 weeks): Introduce a sharding-aware data access layer in the application that can route reads and writes to either the old monolith or the new sharded databases. Initially, all traffic goes to the monolith. The shard key selection is the most consequential decision: choose the key that minimizes cross-shard queries. For most applications, this is tenant_id or user_id. Validate your shard key choice by analyzing the top 50 queries in pg_stat_statements — if more than 5% would require cross-shard joins, reconsider your shard key.Phase 2 — Data copy (1-2 weeks): Use pg_dump with --jobs for the initial bulk copy to each shard, filtering by shard key. For a 5TB database, this can take 12-24 hours even with parallel jobs. During this time, the monolith continues serving all traffic. After the initial copy, set up logical replication from the monolith to each shard, filtering by the shard key range. This keeps the shards up-to-date as new writes arrive on the monolith.Phase 3 — Shadow reads (1-2 weeks): Route a percentage of read traffic to the shards while continuing to serve from the monolith. Compare results for consistency. Log any discrepancies. This is your validation phase — if you find mismatches, fix the replication pipeline before proceeding. Shadow reads should be async (fire-and-forget comparisons) so they do not add latency to user requests.Phase 4 — Cutover writes (the critical step): Enable dual-writes: every write goes to both the monolith and the appropriate shard. The monolith remains the source of truth. Reads can now shift to the shards. Once you are confident the shards are consistent (zero discrepancies for 48-72 hours), flip the source of truth: writes go to shards first, and are replicated back to the monolith. The monolith is now the follower.Phase 5 — Decommission monolith: Once shards have been the source of truth for 2-4 weeks with no issues, stop replicating to the monolith. Keep it around (read-only) for another month as a safety net, then decommission.Rollback strategy: at every phase, the monolith is still receiving all data (either directly or via reverse replication). To roll back, simply point all traffic back to the monolith. The only phase where rollback is complex is Phase 4 after flipping the source of truth — at that point, any writes that went only to the shards must be replicated back to the monolith before rollback. This is why the dual-write period before the source-of-truth flip is critical.The biggest risk: schema migrations during the migration window. Freeze schema changes (or coordinate them extremely carefully) during phases 2-4. A schema migration that changes the shard key column is catastrophic during migration.Follow-up: The migration is 80% complete, but you discover that 3% of queries require cross-shard joins that you did not anticipate. What do you do?Do not proceed with the original plan. 3% cross-shard joins is high enough to cause significant latency regressions for real users. You have three options: (1) Denormalize the data needed for those joins into each shard, accepting the write amplification and consistency lag. (2) Introduce a query routing layer that detects cross-shard queries and fans them out to multiple shards, merging results — this adds latency and complexity but preserves the sharding model. (3) Re-evaluate the shard key. If the cross-shard queries are all driven by a secondary access pattern (e.g., organization-level reports while sharding by user_id), consider a secondary index shard or a read-replica that aggregates across shards for those specific query patterns. Option 3 is often the right answer — you do not need to solve cross-shard queries generically. You need to solve the specific 5-10 query patterns that cross shards.

Summary

You’ve now seen how to apply database design principles to real-world systems:

E-Commerce

Inventory, orders, flexible attributes

Analytics

Time-series, pre-aggregation, partitioning

Social Feed

Fan-out, hybrid approach, caching

Multi-Tenant

RLS, isolation, connection management

Event Sourcing

Append-only, projections, replay

Global

Geo-routing, data residency, federation

Course Completion 🎉

Congratulations! You’ve completed the Database Engineering Mastery course.
You’re now prepared for:
  • Senior/Staff engineering interviews
  • PostgreSQL open-source contributions
  • Production database architecture and operations

What’s Next?

Join the Community

Connect with fellow database engineers

Start Contributing

Make your first PostgreSQL contribution