When your application grows beyond what a single server can handle, you need scaling strategies. This module covers horizontal scaling, sharding, and partitioning.The most important scaling principle: Do not shard until you absolutely must. Sharding adds irreversible complexity — cross-shard joins, distributed transactions, and operational overhead that will slow your team down for years. Most PostgreSQL instances can handle far more load than teams realize, especially with proper indexing, connection pooling, and read replicas. The order of operations should be: optimize queries, add indexes, use connection pooling, add read replicas, partition tables, and only then consider sharding.
Estimated Time: 12-14 hours Hands-On: Implement table partitioning Key Skill: Choosing the right scaling strategy
-- Create partitioned tableCREATE TABLE orders ( id BIGINT, customer_id INT, order_date DATE, total DECIMAL(10,2), status VARCHAR(20)) PARTITION BY RANGE (order_date);-- Create partitionsCREATE TABLE orders_2023 PARTITION OF orders FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');CREATE TABLE orders_2024 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');CREATE TABLE orders_2025 PARTITION OF orders FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');-- PostgreSQL automatically routes insertsINSERT INTO orders VALUES (1, 100, '2024-06-15', 99.99, 'completed');-- Goes to orders_2024-- Query only scans relevant partitionsEXPLAIN SELECT * FROM orders WHERE order_date >= '2024-06-01';
Best for: Discrete categories, regions
CREATE TABLE customers ( id BIGINT, name TEXT, email TEXT, region VARCHAR(10)) PARTITION BY LIST (region);CREATE TABLE customers_na PARTITION OF customers FOR VALUES IN ('US', 'CA', 'MX');CREATE TABLE customers_eu PARTITION OF customers FOR VALUES IN ('UK', 'DE', 'FR', 'ES', 'IT');CREATE TABLE customers_asia PARTITION OF customers FOR VALUES IN ('JP', 'CN', 'IN', 'SG');-- Default partition for unknown regionsCREATE TABLE customers_other PARTITION OF customers DEFAULT;
Best for: Even distribution, no natural partition key
CREATE TABLE events ( id BIGINT, user_id INT, event_type TEXT, data JSONB) PARTITION BY HASH (user_id);-- Create 4 partitionsCREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);-- user_id % 4 determines partition
-- Add new partition for next yearCREATE TABLE orders_2026 PARTITION OF orders FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');-- Detach old partition (keeps data, removes from parent)ALTER TABLE orders DETACH PARTITION orders_2020;-- Then archive or drop-- Option 1: Archive to cold storage\copy orders_2020 TO 'orders_2020.csv' CSV HEADERDROP TABLE orders_2020;-- Option 2: Drop immediatelyDROP TABLE orders_2020;-- Automatic partition creation with pg_partmanCREATE EXTENSION pg_partman;SELECT partman.create_parent( p_parent_table := 'public.orders', p_control := 'order_date', p_type := 'native', p_interval := 'monthly');
# Application determines sharddef get_shard(user_id): return user_id % NUM_SHARDS# All user data for user 12345shard = get_shard(12345) # e.g., shard 1conn = get_connection(shard)
Pros: Even distribution, simple
Cons: Resharding is expensive (data migration)
Range-Based Sharding
# User IDs 0-1M on shard 0, 1M-2M on shard 1, etc.def get_shard(user_id): if user_id < 1_000_000: return 0 elif user_id < 2_000_000: return 1 else: return 2
Pros: Range queries possible, easy to add shards
Cons: Hotspots if ranges accessed unevenly
This table is the reason experienced engineers avoid sharding as long as possible. Each row represents a problem that does not exist in a single-node database and has no perfect solution in a sharded architecture. Read this table as a cost-benefit analysis, not a feature list.
Challenge
Problem
Solution
Hidden Cost
Joins
Can’t join across shards
Denormalize, or application-level joins
Data duplication, stale reads, application complexity
Transactions
No cross-shard ACID
Two-phase commit, or eventual consistency
2PC is slow and fragile; eventual consistency means reasoning about conflicts
Aggregations
SUM/COUNT across shards
Scatter-gather queries
Latency is bounded by the slowest shard; fan-out amplifies load
Sequences
IDs must be unique globally
UUID, or dedicated ID service
UUIDs are 16 bytes (vs 8 for bigint) and fragment B-Tree indexes; ID services are a new SPOF
Rebalancing
Shards become uneven over time
Live migration between shards
Requires careful coordination; risk of data loss or downtime during migration
Real-world analogy: Database caching works like a series of increasingly distant warehouses. Your application checks the closest warehouse (Redis) first, then the regional warehouse (PostgreSQL buffer pool), then the national warehouse (OS page cache), and finally the factory (disk). Each layer is larger but slower. The art of cache engineering is keeping the data most likely to be needed in the closest warehouse.
“There are only two hard things in Computer Science: cache invalidation and naming things.” — Phil Karlton. Cache invalidation is genuinely the hardest part of caching. The patterns below represent different tradeoffs between staleness tolerance and implementation complexity. Choose the simplest one that meets your staleness requirements.
-- PostgreSQL LISTEN/NOTIFY for cache invalidation.-- This approach pushes invalidation events to subscribers in real-time,-- eliminating the polling overhead of TTL-based approaches.CREATE OR REPLACE FUNCTION notify_cache_invalidation()RETURNS TRIGGER AS $$BEGIN PERFORM pg_notify('cache_invalidation', json_build_object( 'table', TG_TABLE_NAME, 'id', NEW.id, 'operation', TG_OP )::text ); RETURN NEW;END;$$ LANGUAGE plpgsql;CREATE TRIGGER user_cache_triggerAFTER INSERT OR UPDATE OR DELETE ON usersFOR EACH ROW EXECUTE FUNCTION notify_cache_invalidation();
The golden rule of PostgreSQL connections: Fewer connections usually means higher throughput. This is counterintuitive but well-established. Beyond approximately 2 * CPU_cores + disk_spindles active connections, performance degrades due to context switching, lock contention in ProcArray, and cache thrashing. A server with 16 cores will often perform better with 40 active connections than with 400.
-- postgresql.conf# Connection limitsmax_connections = 200 -- Keep low! Use a pooler for client-facing connections.superuser_reserved_connections = 3 -- Always keep a few reserved for admin access during emergencies.# Memory per connectionwork_mem = 64MB -- DANGER: this is per-operation, not per-connection. -- A query with 3 sorts = 3 x 64MB = 192MB for that one query. -- With 200 connections: worst case = 200 x 192MB = 37GB.maintenance_work_mem = 256MB -- Used by VACUUM, CREATE INDEX, ALTER TABLE ADD FOREIGN KEY.# Connection overhead# Each connection uses:# - ~10MB for backend process (fixed cost)# - work_mem per sort/hash operation (variable cost)# - Session state (catalogs, prepared statements)## Rule of thumb:# (max_connections * average_work_mem_per_connection) < 50% of available RAM# Leave the other 50% for shared_buffers and OS page cache.
Practical context: Geo-sharding is often driven by data residency regulations (GDPR, CCPA, data sovereignty laws) as much as by performance requirements. If your EU users’ data must stay in EU data centers, geo-sharding is not optional — it is a compliance requirement.
# Route users to their nearest shard based on regionREGION_SHARDS = { 'us': 'postgres://shard-us.db.example.com/app', 'eu': 'postgres://shard-eu.db.example.com/app', 'asia': 'postgres://shard-asia.db.example.com/app',}def get_db_connection(user): region = determine_user_region(user) return connect(REGION_SHARDS[region])# User data stays in their region (data residency compliance).# Cross-region queries require a federation layer.## Key design decision: what happens when a US user interacts# with an EU user? You need either (a) a reference table replicated# to all regions, or (b) an application-level fan-out that queries# multiple shards and merges results. Neither is free.
┌─────────────────────────────────────────────────────────────────────────────┐│ SCALING DECISION FRAMEWORK │├─────────────────────────────────────────────────────────────────────────────┤│ ││ 1. Is your database actually the bottleneck? ││ ├── No → Fix application code first ││ └── Yes ↓ ││ ││ 2. Have you optimized queries and indexes? ││ ├── No → Do that first (cheapest fix) ││ └── Yes ↓ ││ ││ 3. Is it a read-heavy workload? ││ ├── Yes → Add read replicas + caching ││ └── No (write-heavy) ↓ ││ ││ 4. Can you scale vertically? ││ ├── Yes → Upgrade hardware (simpler) ││ └── No (maxed out) ↓ ││ ││ 5. Can you partition within one database? ││ ├── Yes → Table partitioning ││ └── No (too big) ↓ ││ ││ 6. Can you shard by a natural key? ││ ├── Yes → Implement sharding ││ └── No → Consider different database architecture ││ │└─────────────────────────────────────────────────────────────────────────────┘
Create a time-series orders table with monthly partitions and automated maintenance.
Complete Solution
-- 1. Create partitioned orders tableCREATE TABLE orders ( id BIGINT GENERATED ALWAYS AS IDENTITY, customer_id INT NOT NULL, order_date TIMESTAMPTZ NOT NULL DEFAULT NOW(), total DECIMAL(12,2) NOT NULL, status VARCHAR(20) DEFAULT 'pending', items JSONB, PRIMARY KEY (id, order_date)) PARTITION BY RANGE (order_date);-- 2. Create monthly partitions for 2024DO $$DECLARE start_date DATE := '2024-01-01'; end_date DATE; partition_name TEXT;BEGIN FOR i IN 0..11 LOOP start_date := DATE '2024-01-01' + (i || ' months')::INTERVAL; end_date := start_date + '1 month'::INTERVAL; partition_name := 'orders_' || TO_CHAR(start_date, 'YYYY_MM'); EXECUTE format( 'CREATE TABLE %I PARTITION OF orders FOR VALUES FROM (%L) TO (%L)', partition_name, start_date, end_date ); END LOOP;END $$;-- 3. Create indexes on partitions (automatically applied to all)CREATE INDEX idx_orders_customer ON orders(customer_id);CREATE INDEX idx_orders_status ON orders(status);-- 4. Insert test dataINSERT INTO orders (customer_id, order_date, total, status, items)SELECT (random() * 10000)::INT, '2024-01-01'::TIMESTAMPTZ + (random() * 365 * INTERVAL '1 day'), (random() * 1000)::DECIMAL(12,2), (ARRAY['pending', 'processing', 'shipped', 'delivered'])[1 + (random()*3)::INT], jsonb_build_object('item_count', (random() * 10)::INT)FROM generate_series(1, 1000000);-- 5. Verify partition pruningEXPLAIN (ANALYZE, COSTS OFF) SELECT * FROM orders WHERE order_date >= '2024-06-01' AND order_date < '2024-07-01';-- Should only scan orders_2024_06-- 6. Maintenance: Drop old partitionALTER TABLE orders DETACH PARTITION orders_2024_01;DROP TABLE orders_2024_01;-- 7. Add new partition for next periodCREATE TABLE orders_2025_01 PARTITION OF orders FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
Your 2TB PostgreSQL database is slowing down. Walk through your scaling strategy before reaching for sharding.
Strong Answer:
The scaling ladder in order: (1) Query optimization via pg_stat_statements — find top 10 queries by total_exec_time, fix missing indexes and N+1 patterns (often buys 2-5x headroom with zero infrastructure changes). (2) Configuration tuning — verify shared_buffers at 25% RAM, random_page_cost at 1.1 for SSD. (3) PgBouncer in transaction mode to reduce max_connections from 500 to 100. (4) Read replicas for read-heavy workloads. (5) Table partitioning for the largest tables. (6) Vertical scaling (modern instances offer 96+ cores, 768GB RAM). (7) Sharding only after exhausting all of the above.
Follow-up: When do you know sharding is truly necessary?The signal is write throughput saturation on a single node. If WAL generation rate is saturating disk I/O after all optimizations, you have outgrown single-node. Read scaling never requires sharding — replicas handle that. It is always write scaling that forces the sharding decision.
Compare partitioning and sharding. What mistakes do teams make when implementing partitioning?
Strong Answer:
Partitioning splits a table within one instance — all SQL features work normally. Sharding splits across multiple instances — requires routing layer, breaks cross-shard JOINs and transactions. Partition when single-node hardware is sufficient but table size causes maintenance issues. Shard when write throughput exceeds single-node capacity.
Common mistakes: (1) Too many partitions — daily partitions on 10-year data creates 3650 partitions, causing planner overhead and catalog bloat. Use monthly or quarterly. (2) Queries missing the partition key in WHERE, causing full partition scans. (3) Not automating future partition creation — inserts fail when data falls outside existing ranges. Use pg_partman.
Follow-up: Can you partition an existing large table without downtime?Yes. Create a new partitioned table, use logical replication or trigger-based sync to replicate data from old to new, then perform a brief lock to swap table names. pg_partman and pgloader assist. Alternatively, ATTACH existing tables as partitions if data already conforms to partition boundaries.
Design a caching strategy for a PostgreSQL-backed application. When do you use Redis versus PostgreSQL's buffer pool?
Strong Answer:
PostgreSQL’s buffer pool is already a cache. If the working set fits in RAM, PostgreSQL serves data at sub-millisecond latency. Adding Redis only helps when: (1) the same small result is read thousands of times/sec (hot key — Redis handles 100K+ reads/sec per key), (2) session/ephemeral data without durability needs, (3) expensive computed results that tolerate bounded staleness, or (4) rate limiting and atomic counters.
Cache invalidation: use cache-aside with TTL for simplicity. For stronger consistency, use PostgreSQL LISTEN/NOTIFY to push invalidation events. Never use write-behind for critical data.
The biggest mistake: caching individual row lookups when the real bottleneck is an expensive aggregation query. Cache the aggregation result, not individual rows.
Follow-up: How does pg_prewarm fit in?pg_prewarm loads tables/indexes into shared_buffers proactively. After a restart, the buffer pool is cold. Combined with autoprewarm (PG 11+), which saves buffer contents on shutdown and reloads on startup, you maintain warm cache across restarts — reducing the need for external caching to handle cold-start.