Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Replication & High Availability Concept

Module 7: Replication & High Availability

Production databases must survive hardware failures, network issues, and data center outages. This module teaches you how to build highly available PostgreSQL deployments. Real-world analogy: Replication is like having a live backup singer who knows every word to every song. If the lead vocalist (primary) loses their voice mid-concert (server crash), the backup (standby) can take over immediately without the audience (users) noticing more than a brief pause. The key engineering challenge is keeping the backup singer perfectly in sync — and deciding whether you are willing to slow down the lead singer to guarantee synchronization.
Estimated Time: 10-12 hours
Hands-On: Set up streaming replication
Key Skill: Designing for 99.99% uptime
PostgreSQL replication and high availability topologies

7.1 High Availability Concepts

The Need for Replication

┌─────────────────────────────────────────────────────────────────────────────┐
│                    WHY YOU NEED REPLICATION                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Without Replication:                                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                                                                     │    │
│  │    ┌────────────┐         ┌────────────────┐                       │    │
│  │    │    App     │────────▶│   Database     │                       │    │
│  │    │   Server   │         │   (Primary)    │                       │    │
│  │    └────────────┘         └───────┬────────┘                       │    │
│  │                                   │                                 │    │
│  │                                   ▼                                 │    │
│  │                              FAILURE                                │    │
│  │                                                                     │    │
│  │    Result: Complete outage until server is repaired                │    │
│  │            Risk of data loss if disk fails                         │    │
│  │            Recovery time: hours to days                            │    │
│  │                                                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  With Replication:                                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                                                                     │    │
│  │    ┌────────────┐         ┌────────────────┐                       │    │
│  │    │    App     │────────▶│   Primary DB   │                       │    │
│  │    │   Server   │         └───────┬────────┘                       │    │
│  │    └─────┬──────┘                 │ WAL Stream                     │    │
│  │          │                        ▼                                 │    │
│  │          │                ┌────────────────┐                       │    │
│  │          └───(failover)──▶│   Standby DB   │                       │    │
│  │                           └────────────────┘                       │    │
│  │                                                                     │    │
│  │    Result: Automatic failover in seconds                           │    │
│  │            Zero data loss (with sync replication)                  │    │
│  │            Read queries can use standby                            │    │
│  │                                                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Replication Types

Replicates WAL bytes - exact binary copy of primary.
Primary ─────WAL─────▶ Standby
         (byte stream)
Pros:
  • Complete replica (all databases, schemas)
  • Simple setup
  • Low overhead on primary
Cons:
  • Same PostgreSQL major version required
  • Cannot select specific tables
  • Standby is read-only

7.2 Streaming Replication Setup

Primary Server Configuration

-- postgresql.conf on primary
wal_level = replica                    -- Enable replication
max_wal_senders = 5                    -- Max replication connections
max_replication_slots = 5              -- Prevent WAL removal before standby catches up
synchronous_commit = on                -- Wait for WAL flush

-- For synchronous replication (zero data loss)
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'

-- pg_hba.conf - allow replication connections
-- host  replication  replicator  192.168.1.0/24  scram-sha-256
-- Create replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';

-- Create replication slot (prevents WAL cleanup before standby catches up)
SELECT pg_create_physical_replication_slot('standby1_slot');

-- View slots
SELECT * FROM pg_replication_slots;

Standby Server Setup

# Take base backup from primary
pg_basebackup -h primary_host -U replicator -D /var/lib/postgresql/data \
  -P -Xs -R -S standby1_slot

# The -R flag creates standby.signal and configures primary_conninfo
-- postgresql.auto.conf on standby (created by pg_basebackup -R)
primary_conninfo = 'host=primary_host port=5432 user=replicator password=xxx'
primary_slot_name = 'standby1_slot'

-- postgresql.conf on standby
hot_standby = on                       -- Allow read queries on standby

Monitoring Replication

Practical tip: Replication lag is not one number — it is four numbers, each measuring a different stage. Understanding the difference between sent_lsn, write_lsn, flush_lsn, and replay_lsn is essential for diagnosing replication problems. If sent is far ahead of write, you have a network bottleneck. If write is ahead of flush, the standby’s disk is slow. If flush is ahead of replay, the standby’s CPU cannot keep up with applying changes.
-- On primary: View connected standbys
SELECT 
    client_addr,
    state,
    sent_lsn,      -- WAL sent over the network to the standby
    write_lsn,     -- WAL written to standby's disk (but not fsynced)
    flush_lsn,     -- WAL fsynced to standby's disk (durable)
    replay_lsn,    -- WAL actually applied (queries on standby see this data)
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;

-- On standby: View replication status
SELECT 
    status,
    received_lsn,
    latest_end_lsn,
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn(),
    pg_last_xact_replay_timestamp()
FROM pg_stat_wal_receiver;

-- Calculate lag in seconds
SELECT 
    EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS lag_seconds
FROM pg_stat_wal_receiver;

Failover Process Deep Dive

Database Failover Automatic vs Manual Failover: Automatic (PostgreSQL with Patroni/Stolon):
  • Faster recovery (seconds to minutes)
  • Risk of split-brain if network partition
  • Requires robust failure detection
Manual (Traditional):
  • Slower (minutes to hours)
  • Human verification prevents mistakes
  • Better for planned maintenance
Common pitfall — split-brain: Split-brain occurs when both the primary and the standby believe they are the leader, accepting writes independently. This creates two divergent copies of your data that are extremely difficult to reconcile. It is the single most dangerous failure mode in database replication. Patroni avoids this by using a distributed consensus store (etcd/ZooKeeper) as the source of truth for leadership — a node is only the leader if the consensus store says so. Failover Checklist:
  1. ✓ Verify leader is truly dead (not just slow)
  2. ✓ Choose most up-to-date follower
  3. ✓ Promote follower to leader
  4. ✓ Update DNS/load balancer
  5. ✓ Reconfigure other followers
  6. ✓ Monitor for split-brain
  7. ✓ Investigate root cause
Testing Failover: Use chaos engineering tools like Chaos Monkey to regularly test your failover process in production.

7.3 Synchronous vs Asynchronous

┌─────────────────────────────────────────────────────────────────────────────┐
│            SYNCHRONOUS vs ASYNCHRONOUS REPLICATION                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ASYNCHRONOUS (default):                                                     │
│  ────────────────────────                                                    │
│                                                                              │
│    Client       Primary                  Standby                            │
│       │                                                                      │
│       │──COMMIT──▶│                                                         │
│       │           │──WAL FLUSH──▶DISK                                       │
│       │◀──OK──────│                                                         │
│       │           │           │                                              │
│       │           │──WAL──────▶│  (async, may lag)                          │
│       │           │           │──REPLAY──▶                                   │
│                                                                              │
│    Pros: Lowest latency, primary doesn't wait                               │
│    Cons: Standby can be behind, data loss risk on failover                 │
│                                                                              │
│  SYNCHRONOUS:                                                                │
│  ─────────────                                                               │
│                                                                              │
│    Client       Primary                  Standby                            │
│       │                                                                      │
│       │──COMMIT──▶│                                                         │
│       │           │──WAL FLUSH──▶DISK                                       │
│       │           │──WAL──────────────────▶│                                │
│       │           │                        │──FLUSH──▶DISK                  │
│       │           │◀───────────OK──────────│                                │
│       │◀──OK──────│                                                         │
│                                                                              │
│    Pros: Zero data loss (WAL on both disks before commit returns)          │
│    Cons: Higher latency (network round trip per commit)                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Configuration Options

-- Asynchronous (default)
synchronous_commit = on
synchronous_standby_names = ''

-- Synchronous with named standbys
synchronous_standby_names = 'standby1'  -- Wait for standby1

-- First N of list (any 1 from list)
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'

-- Any N of list (quorum)
synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)'

-- Per-transaction control (this is a powerful production technique)
SET synchronous_commit = off;  -- This transaction doesn't wait for standby
-- Use this for non-critical writes like logging, analytics, or notifications.
-- Keep synchronous_commit = on for financial transactions and order processing.
-- This gives you the best of both worlds: strong durability where it matters,
-- high throughput where eventual consistency is acceptable.

7.4 Failover and Switchover

Planned Switchover

# Step 1: Stop writes to primary
# (Application should connect to new primary after switch)

# Step 2: Ensure standby is caught up
psql -h primary -c "SELECT pg_current_wal_lsn();"
psql -h standby -c "SELECT pg_last_wal_replay_lsn();"
# Both should match

# Step 3: Promote standby
pg_ctl promote -D /var/lib/postgresql/data
# Or
psql -c "SELECT pg_promote();"

# Step 4: Old primary becomes standby (optional)
# Configure recovery settings and restart

Automatic Failover with Patroni

# patroni.yml
scope: postgres-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.1.1:8008

etcd:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.1.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: secret

7.5 Logical Replication

Publisher Setup

-- On publisher
CREATE PUBLICATION my_publication 
FOR TABLE users, orders, products;

-- Or all tables
CREATE PUBLICATION all_tables FOR ALL TABLES;

-- View publications
SELECT * FROM pg_publication;
SELECT * FROM pg_publication_tables;

Subscriber Setup

-- On subscriber (tables must exist with same schema)
CREATE SUBSCRIPTION my_subscription
CONNECTION 'host=publisher port=5432 dbname=mydb user=replicator'
PUBLICATION my_publication;

-- View subscriptions
SELECT * FROM pg_subscription;
SELECT * FROM pg_stat_subscription;

-- Pause/resume
ALTER SUBSCRIPTION my_subscription DISABLE;
ALTER SUBSCRIPTION my_subscription ENABLE;

Use Cases for Logical Replication

Cross-Version Migration

Replicate from PG 14 to PG 16, then switch over

Selective Replication

Only sync specific tables to analytics database

Multi-Master

Write to different tables on different servers

Data Consolidation

Combine data from multiple sources

7.6 Connection Pooling

The Problem

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CONNECTION OVERHEAD                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Each PostgreSQL connection:                                                 │
│  • Spawns a backend process (~5-10 MB RAM)                                  │
│  • Has connection establishment overhead (~100ms)                           │
│  • Maintains session state                                                   │
│                                                                              │
│  100 app servers × 20 connections each = 2000 connections!                  │
│                                                                              │
│  Without Pooler:                                                             │
│  ┌────────┐                                                                 │
│  │ App 1  │──conn1──┐                                                       │
│  │        │──conn2──┤                                                       │
│  └────────┘         │                                                       │
│  ┌────────┐         ├──▶ PostgreSQL (2000 backends = 20GB RAM!)            │
│  │ App 2  │──conn3──┤                                                       │
│  │        │──conn4──┤                                                       │
│  └────────┘         │                                                       │
│  ...                 │                                                       │
│                                                                              │
│  With Connection Pooler:                                                     │
│  ┌────────┐                                                                 │
│  │ App 1  │──┐         ┌──────────┐                                        │
│  │        │──┤         │          │                                        │
│  └────────┘  ├────────▶│ PgBouncer│───(20 conns)──▶ PostgreSQL            │
│  ┌────────┐  │         │          │        (200MB RAM)                     │
│  │ App 2  │──┤         └──────────┘                                        │
│  │        │──┘                                                              │
│  └────────┘                                                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

PgBouncer Setup

# pgbouncer.ini
[databases]
mydb = host=localhost port=5432 dbname=mydb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

# Pool mode
pool_mode = transaction    # Recommended for most apps

# Pool size
default_pool_size = 20     # Connections to PostgreSQL per db/user
max_client_conn = 1000     # Total client connections
reserve_pool_size = 5      # Extra for burst

# Timeouts
server_idle_timeout = 600
client_idle_timeout = 0
query_timeout = 0

Pool Modes

Common pitfall: Choosing the wrong pool mode is the number one source of PgBouncer problems. In transaction mode, session-level state (SET commands, prepared statements, temp tables, LISTEN/NOTIFY) is lost between transactions because each transaction may land on a different backend connection. If your application uses Django or Rails with prepared statements, you may need session mode — but that significantly reduces pooling efficiency.
ModeDescriptionBest ForGotcha
sessionConnection held for entire sessionSession state needed (SET, temp tables)Minimal pooling benefit — one backend per client
transactionConnection held for transactionMost applications (stateless queries)Breaks prepared statements, SET, LISTEN, temp tables
statementConnection released after each querySimple queries, max sharingCompletely breaks multi-statement transactions
-- Monitor PgBouncer
psql -h localhost -p 6432 pgbouncer

SHOW POOLS;
SHOW SERVERS;
SHOW CLIENTS;
SHOW STATS;

7.7 Read Replicas and Load Balancing

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│               READ REPLICA LOAD BALANCING                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                        ┌─────────────┐                                       │
│                        │   HAProxy/  │                                       │
│                        │   PgPool    │                                       │
│                        └──────┬──────┘                                       │
│                               │                                              │
│              ┌────────────────┼────────────────┐                            │
│              │                │                │                            │
│              ▼                ▼                ▼                            │
│      ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                     │
│      │   Primary   │  │  Replica 1  │  │  Replica 2  │                     │
│      │ (Read/Write)│  │ (Read Only) │  │ (Read Only) │                     │
│      └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                     │
│             │                │                │                             │
│             │───WAL─────────▶│                │                             │
│             │───────────WAL─────────────────▶│                             │
│                                                                              │
│   Routing Rules:                                                            │
│   • All writes → Primary                                                    │
│   • Reads → Round-robin across replicas                                     │
│   • Heavy reports → Dedicated replica                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Application-Level Routing

# Django example
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'HOST': 'primary.db.example.com',
        'NAME': 'mydb',
    },
    'replica': {
        'ENGINE': 'django.db.backends.postgresql',
        'HOST': 'replica.db.example.com',
        'NAME': 'mydb',
    }
}

# Database router
class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        return 'replica'
    
    def db_for_write(self, model, **hints):
        return 'default'
    
    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return db == 'default'

7.8 Disaster Recovery

Backup Strategies

# Full database backup
pg_dump -Fc mydb > backup.dump

# Specific tables
pg_dump -Fc -t users -t orders mydb > partial.dump

# Restore
pg_restore -d mydb backup.dump
Pros: Portable, selective, human-readable (with -Fp) Cons: Slow for large databases, point-in-time only
# Full physical backup with WAL
pg_basebackup -h primary -U replicator -D /backup/base \
    -Fp -Xs -P -R

# Compressed
pg_basebackup -h primary -U replicator -D - -Ft -z > backup.tar.gz
Pros: Fast, consistent, can be used for new replicas Cons: Same PG version only, full backup only
-- postgresql.conf
archive_mode = on
archive_command = 'rsync -a %p backup:/wal_archive/%f'
# Recovery to specific point in time
# recovery.conf / postgresql.conf
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '2024-01-15 14:30:00'
Enables: Recover to any point in time

3-2-1 Backup Rule

┌─────────────────────────────────────────────────────────────────────────────┐
│                          3-2-1 BACKUP RULE                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   3 copies of your data:                                                    │
│   ┌────────────┐  ┌────────────┐  ┌────────────┐                           │
│   │  Primary   │  │  Replica   │  │   Backup   │                           │
│   │   (Live)   │  │   (Live)   │  │  (Archive) │                           │
│   └────────────┘  └────────────┘  └────────────┘                           │
│                                                                              │
│   2 different storage types:                                                │
│   ┌────────────────────┐  ┌────────────────────┐                           │
│   │    Local SSD       │  │    Object Storage  │                           │
│   │  (Primary + Replica)│  │    (S3, GCS, etc.) │                           │
│   └────────────────────┘  └────────────────────┘                           │
│                                                                              │
│   1 offsite location:                                                       │
│   ┌────────────────────┐                                                    │
│   │  Different Region  │                                                    │
│   │  or Cloud Provider │                                                    │
│   └────────────────────┘                                                    │
│                                                                              │
│   Test your backups regularly!                                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

7.9 Practice: Set Up Replication

Lab Exercise

  1. Set up two PostgreSQL instances (use Docker)
  2. Configure streaming replication
  3. Test failover
  4. Set up PgBouncer in front
version: '3.8'

services:
  primary:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: primary_pass
      POSTGRES_USER: postgres
    volumes:
      - ./primary-data:/var/lib/postgresql/data
      - ./primary-init:/docker-entrypoint-initdb.d
    ports:
      - "5432:5432"
    command: >
      postgres 
      -c wal_level=replica 
      -c max_wal_senders=5 
      -c max_replication_slots=5

  standby:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: standby_pass
      PGUSER: replicator
      PGPASSWORD: repl_pass
    volumes:
      - ./standby-data:/var/lib/postgresql/data
    ports:
      - "5433:5432"
    depends_on:
      - primary

  pgbouncer:
    image: edoburu/pgbouncer
    environment:
      DATABASE_URL: "postgres://postgres:primary_pass@primary:5432/postgres"
      POOL_MODE: transaction
      MAX_CLIENT_CONN: 100
    ports:
      - "6432:5432"
    depends_on:
      - primary

Next Module

Module 8: Scaling Strategies

Scale from thousands to millions of users

Interview Deep-Dive

Strong Answer:
  • 99.99% means max 52 minutes downtime per year, covering planned maintenance and unplanned failures. Architecture: primary + 2 synchronous standbys managed by Patroni with etcd consensus. Use synchronous_standby_names = 'ANY 1 (standby1, standby2)' so the primary needs only one standby ACK to commit. PgBouncer in front absorbs connection storms during failover. HAProxy for read replica load balancing.
  • Zero-downtime maintenance: pg_upgrade via logical replication for major versions, CREATE INDEX CONCURRENTLY for schema changes, never VACUUM FULL during business hours. Budget: 30 min for 2 unplanned failovers/year, 22 min for planned maintenance.
Follow-up: How do you prevent split-brain during failover?Patroni uses etcd as the source of truth for leadership. A node is only the leader if it holds the etcd leader key (requires quorum). If the old primary is partitioned, its lease expires and Patroni demotes it. The new primary can only be promoted after acquiring the leader key. This prevents both nodes from accepting writes simultaneously.
Strong Answer:
  • Cross-version upgrades: Physical requires identical major versions. Logical works across versions (PG 14 to PG 16), enabling zero-downtime upgrades.
  • Selective replication: Physical copies everything. Logical can publish specific tables to specific subscribers.
  • Writable subscribers: Physical standbys are read-only. Logical subscribers can have their own tables, indexes, and materialized views.
  • Tradeoffs: Logical has higher overhead (WAL decoding), does not replicate DDL, and minimal conflict resolution (INSERT conflicts stop replication until manually resolved). For pure HA failover, physical is simpler and faster.
Follow-up: What happens if the subscriber is down for a day?It catches up automatically via replication slots, which prevent the publisher from recycling unconsumed WAL. The danger: a long-disconnected subscriber causes unbounded WAL retention on the publisher. Monitor pg_replication_slots for active = false and set max_slot_wal_keep_size to cap retention.
Strong Answer:
  • Check pg_stat_replication on the primary and compare the four LSN values to identify the bottleneck stage:
  • sent_lsn >> write_lsn: Network bottleneck. Fix: faster link, enable wal_compression (30-50% WAL reduction), or co-locate replica.
  • write_lsn >> flush_lsn: Replica disk cannot keep up with fsync. Fix: faster storage (NVMe).
  • flush_lsn >> replay_lsn: Replica CPU cannot apply WAL fast enough. Common when primary generates WAL from many parallel backends but replay is single-threaded. Fix: reduce indexes on replica, or leverage PG 15+ parallel recovery improvements.
  • Bonus: Long-running queries on replica holding back replay due to high max_standby_streaming_delay. Fix: lower the delay (accept query cancellation) or route analytics to a dedicated replica.
Follow-up: Can you achieve zero lag while allowing reads on the replica?Yes. synchronous_commit = remote_apply ensures the standby has fully replayed WAL before the primary acknowledges the commit. Reads on the standby are always consistent with committed primary data. The cost is increased commit latency on the primary.