
Lock Manager Deep Dive

This module provides production-grade understanding of PostgreSQL’s locking subsystem. Essential knowledge for debugging contention issues, understanding deadlocks, and contributing to concurrency-related code.
Target Audience: Database infrastructure engineers, performance specialists
Prerequisites: Transactions module, basic OS concurrency concepts
Source Directories: src/backend/storage/lmgr/, src/include/storage/
Interview Relevance: Staff+ distributed systems roles

Part 1: Lock Hierarchy Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    POSTGRESQL LOCK HIERARCHY                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Level 1: SPINLOCKS (Lowest level)                                         │
│   ────────────────────────────────────────────────────────────────────────   │
│   • CPU-level atomic operations                                             │
│   • Busy-wait (no context switch)                                           │
│   • Hold for nanoseconds                                                    │
│   • Used to protect LWLock internals                                        │
│                                                                              │
│                              ▼                                               │
│                                                                              │
│   Level 2: LIGHTWEIGHT LOCKS (LWLocks)                                      │
│   ────────────────────────────────────────────────────────────────────────   │
│   • Shared/Exclusive modes                                                  │
│   • Can sleep (no busy-wait)                                                │
│   • Hold for microseconds to milliseconds                                   │
│   • Protect shared memory structures                                        │
│   • Examples: BufferContent, WALInsert, ProcArray                           │
│                                                                              │
│                              ▼                                               │
│                                                                              │
│   Level 3: HEAVYWEIGHT LOCKS (Regular Locks)                                │
│   ────────────────────────────────────────────────────────────────────────   │
│   • Eight lock modes (Access Share → Access Exclusive)                      │
│   • Deadlock detection                                                      │
│   • Hold for transaction duration                                           │
│   • Protect database objects (tables, rows, pages)                          │
│                                                                              │
│                              ▼                                               │
│                                                                              │
│   Level 4: PREDICATE LOCKS (Serializable Isolation)                         │
│   ────────────────────────────────────────────────────────────────────────   │
│   • Track read dependencies for SSI                                         │
│   • Don't block; detect conflicts                                           │
│   • Used only at SERIALIZABLE isolation level                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Part 2: Spinlocks

2.1 Spinlock Implementation

/* src/include/storage/s_lock.h */

/*
 * Spinlocks are the most primitive synchronization mechanism.
 * They use CPU atomic instructions and busy-wait.
 */

/* x86-64 implementation using XCHG */
static __inline__ int
tas(volatile slock_t *lock)
{
    register slock_t _res = 1;
    
    __asm__ __volatile__(
        "lock\n"
        "xchgb %0,%1\n"
        : "+q"(_res), "+m"(*lock)
        :
        : "memory", "cc");
    
    return (int) _res;
}

/* ARM implementation using LDREX/STREX */
static __inline__ int
tas(volatile slock_t *lock)
{
    int ret;                    /* previous lock value (0 = lock was free) */
    int tmp;                    /* strex success flag (0 = store succeeded) */
    int val = 1;
    
    __asm__ __volatile__(
        "1: ldrex %0, [%2]\n"
        "   strex %1, %3, [%2]\n"
        "   cmp %1, #0\n"
        "   bne 1b"
        : "=&r"(ret), "=&r"(tmp)
        : "r"(lock), "r"(val)
        : "cc", "memory");
    
    return ret;
}

/* Simplified spinlock acquire with backoff; the real s_lock() in
 * s_lock.c uses randomized, increasing delays rather than a fixed sleep. */
void
SpinLockAcquire(volatile slock_t *lock)
{
    int spins = 0;
    
    while (tas(lock))
    {
        /* Too many spins? (NUM_SPINLOCK_DELAYS is an illustrative limit) */
        if (++spins > NUM_SPINLOCK_DELAYS)
        {
            /*
             * Exceeded spin limit - yield CPU.
             * This is a safety valve; spinlocks should 
             * never be held this long.
             */
            pg_usleep(1);
            spins = 0;
        }
        else
        {
            /* CPU pause instruction - reduces power, helps hyperthreading */
            SPIN_DELAY();
        }
    }
}

2.2 Spinlock Usage Rules

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SPINLOCK RULES (CRITICAL!)                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   ✓ DO:                                                                     │
│   • Hold for absolute minimum time (nanoseconds)                            │
│   • Use only to protect LWLock wait-queue manipulation                      │
│   • Release in same function that acquired                                  │
│                                                                              │
│   ✗ DON'T:                                                                  │
│   • Call any function that might block                                      │
│   • Access disk or network                                                  │
│   • Allocate memory                                                         │
│   • Acquire another spinlock (no nesting!)                                  │
│   • Call elog/ereport (might allocate)                                      │
│                                                                              │
│   Why so strict?                                                             │
│   • Spinlocks busy-wait - wastes CPU while waiting                          │
│   • No deadlock detection - must guarantee no deadlocks                     │
│   • Priority inversion possible - low-priority holder blocks everyone       │
│                                                                              │
│   Example of safe spinlock usage:                                           │
│   ┌────────────────────────────────────────────────────────────────────┐    │
│   │  SpinLockAcquire(&lock);                                           │    │
│   │  /* Only simple memory operations here */                          │    │
│   │  value = shared_data->counter;                                     │    │
│   │  shared_data->counter = value + 1;                                 │    │
│   │  SpinLockRelease(&lock);                                           │    │
│   └────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Part 3: Lightweight Locks (LWLocks)

3.1 LWLock Architecture

/* src/include/storage/lwlock.h */

typedef struct LWLock
{
    uint16      tranche;        /* Tranche ID (identifies lock type) */
    pg_atomic_uint32 state;     /* Lock state - packed bits */
    proclist_head waiters;      /* List of waiting PGPROCs */
#ifdef LOCK_DEBUG
    pg_atomic_uint32 nwaiters;  /* Number of waiters */
    struct PGPROC *owner;       /* Current exclusive owner */
#endif
} LWLock;

/*
 * Lock state bits (packed into a 32-bit atomic) -- simplified layout;
 * exact bit positions differ across PostgreSQL versions:
 *
 * Bits 0-23:  Number of shared lockers (up to 16M concurrent readers)
 * Bit 24:     Exclusive lock held
 * Bit 25:     Lock has waiters
 * Bit 26:     Release-OK flag
 * Bits 27-31: Reserved
 */
#define LW_FLAG_HAS_WAITERS     ((uint32) 1 << 25)
#define LW_FLAG_RELEASE_OK      ((uint32) 1 << 26)
#define LW_VAL_EXCLUSIVE        ((uint32) 1 << 24)
#define LW_VAL_SHARED           1
#define LW_SHARED_MASK          ((uint32) ((1 << 24) - 1))

/* LWLock modes */
typedef enum LWLockMode
{
    LW_EXCLUSIVE,   /* Write lock - blocks all others */
    LW_SHARED,      /* Read lock - allows other readers */
    LW_WAIT_UNTIL_FREE  /* Special: wait until free, then return without acquiring */
} LWLockMode;

3.2 LWLock Acquisition

/* Simplified acquisition flow from lwlock.c */

bool
LWLockAcquire(LWLock *lock, LWLockMode mode)
{
    PGPROC *proc = MyProc;
    uint32 old_state;
    
    /* Fast path: try atomic acquire */
    if (mode == LW_EXCLUSIVE)
    {
        /* Try to set exclusive bit if currently unlocked */
        old_state = pg_atomic_fetch_or_u32(&lock->state, LW_VAL_EXCLUSIVE);
        if ((old_state & LW_VAL_EXCLUSIVE) == 0 &&
            (old_state & LW_SHARED_MASK) == 0)
        {
            /* Success! We got the exclusive lock */
            return true;
        }
        /*
         * Failed.  If we set the bit while readers held the lock, undo it.
         * (The real code sidesteps this race with a compare-and-swap loop.)
         */
        if ((old_state & LW_VAL_EXCLUSIVE) == 0)
            pg_atomic_fetch_and_u32(&lock->state, ~LW_VAL_EXCLUSIVE);
    }
    else /* LW_SHARED */
    {
        /* Increment shared count if no exclusive holder/waiter */
        old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
        if ((old_state & LW_VAL_EXCLUSIVE) == 0)
        {
            /* Success! We got a shared lock */
            return true;
        }
        /* Failed - undo the increment */
        pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
    }
    
    /* Slow path: must queue and wait */
    return LWLockAcquireCommon(lock, mode);
}

static bool
LWLockAcquireCommon(LWLock *lock, LWLockMode mode)
{
    PGPROC *proc = MyProc;
    
    for (;;)
    {
        /*
         * Add ourselves to the wait queue.  For clarity this sketch uses a
         * per-lock spinlock ("mutex"); current PostgreSQL instead embeds a
         * wait-list lock bit in the atomic state word.
         */
        SpinLockAcquire(&lock->mutex);
        
        /* Try to acquire again while holding spinlock */
        if (LWLockAttemptLock(lock, mode))
        {
            SpinLockRelease(&lock->mutex);
            return true;
        }
        
        /* Queue ourselves */
        proc->lwWaiting = true;
        proc->lwWaitMode = mode;
        proclist_push_tail(&lock->waiters, proc->pgprocno, lwWaitLink);
        
        /* Set has-waiters flag */
        pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
        
        SpinLockRelease(&lock->mutex);
        
        /* Sleep until woken */
        for (;;)
        {
            PGSemaphoreLock(proc->sem);
            
            if (!proc->lwWaiting)
                break;  /* We were granted the lock */
                
            /* Check for interrupts (cancel, die) */
            CHECK_FOR_INTERRUPTS();
        }
        
        /* Lock was granted */
        return true;
    }
}

3.3 Common LWLock Tranches

┌─────────────────────────────────────────────────────────────────────────────┐
│                    IMPORTANT LWLOCK TRANCHES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Tranche               Purpose                          Contention Risk    │
│   ──────────────────────────────────────────────────────────────────────    │
│   BufferContent         Protect buffer page contents     High (hot pages)   │
│   BufferMapping         Buffer hash table                Medium             │
│   WALInsert             WAL insertion                    Very High          │
│   WALWrite              WAL file writes                  High               │
│   ProcArray             Active transaction list          Medium             │
│   XidGen                Transaction ID generation        Low                │
│   CLogControl           Commit log buffers               Medium             │
│   SubtransControl       Subtransaction tracking          Low                │
│   SInvalRead            Shared invalidation queue        Medium             │
│   AutovacuumSchedule    Autovacuum coordination          Low                │
│   RelCacheInit          Relation cache init              Low (startup)      │
│   CheckpointStart       Checkpoint coordination          Low                │
│                                                                              │
│   Monitoring LWLock waits:                                                  │
│   ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│   SELECT wait_event_type, wait_event, count(*)                              │
│   FROM pg_stat_activity                                                     │
│   WHERE wait_event_type = 'LWLock'                                          │
│   GROUP BY 1, 2                                                             │
│   ORDER BY 3 DESC;                                                          │
│                                                                              │
│   -- See all current LWLock waits                                           │
│   SELECT pid, wait_event, state, query                                      │
│   FROM pg_stat_activity                                                     │
│   WHERE wait_event_type = 'LWLock';                                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

3.4 WALInsertLock Contention (Hot Topic)

/*
 * WALInsertLock is often the hottest lock in write-heavy workloads.
 * PostgreSQL uses multiple WAL insert locks to reduce contention.
 */

/* Number of WAL insertion locks (compile-time constant) */
#define NUM_XLOGINSERT_LOCKS  8

/*
 * WAL insertion flow:
 *
 * 1. Reserve space in WAL (need exclusive insert lock)
 *    - Acquire one of NUM_XLOGINSERT_LOCKS locks
 *    - Advance reserved position
 *    - Hold lock while copying data
 *
 * 2. Copy data to WAL buffer
 *    - Still holding insert lock
 *    - May need to wait for buffer space
 *
 * 3. Release insert lock
 *    - Others can now insert
 *    - Our data not yet durable!
 *
 * 4. Wait for flush (if synchronous_commit)
 *    - WALWriteLock for actual writes
 *    - fsync for durability
 */

/* Reducing WAL insert contention: */
/*
 * 1. Batch inserts into larger transactions
 * 2. Use synchronous_commit = off (if acceptable)
 * 3. Use unlogged tables for temporary data
 * 4. Increase wal_buffers
 * 5. Use faster storage (NVMe)
 * 6. Consider commit_delay for group commit
 */
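
To confirm WAL insertion is actually the bottleneck, check which backends are waiting on the WAL-related LWLock tranches (these wait-event names come straight from pg_stat_activity):

-- Backends currently waiting on WAL-related LWLocks
SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event IN ('WALInsert', 'WALWrite', 'WALBufMapping')
GROUP BY wait_event
ORDER BY count(*) DESC;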

Part 4: Heavyweight Locks

4.1 Lock Modes Matrix

┌─────────────────────────────────────────────────────────────────────────────┐
│                    HEAVYWEIGHT LOCK MODES                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Mode                    Conflicts With              Acquired By            │
│   ─────────────────────────────────────────────────────────────────────────  │
│   AccessShare (1)         AccessExclusive             SELECT                 │
│   RowShare (2)            Exclusive, AccessExclusive  SELECT FOR UPDATE/SHARE│
│   RowExclusive (3)        Share, ShareRowExcl,        INSERT, UPDATE, DELETE │
│                           Exclusive, AccessExclusive                         │
│   ShareUpdateExclusive(4) ShareUpdateExcl, Share,     VACUUM, ANALYZE,       │
│                           ShareRowExcl, Excl, AccExcl CREATE INDEX CONCUR    │
│   Share (5)               RowExcl, ShareUpdateExcl,   CREATE INDEX           │
│                           ShareRowExcl, Excl, AccExcl                        │
│   ShareRowExclusive (6)   RowExcl, ShareUpdateExcl,   CREATE TRIGGER, some   │
│                           Share, ShareRowExcl,         ALTER TABLE            │
│                           Excl, AccessExcl                                    │
│   Exclusive (7)           All except AccessShare      REFRESH MATVIEW CONCUR │
│   AccessExclusive (8)     ALL modes                   DROP, TRUNCATE,        │
│                                                        ALTER TABLE, VACUUM FULL│
│                                                                              │
│   Conflict Matrix (X = conflicts):                                          │
│   ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│          │ AS   RS   RE   SUE  S    SRE  E    AE                            │
│   ───────┼───────────────────────────────────────                           │
│   AS     │                                    X                             │
│   RS     │                               X    X                             │
│   RE     │                     X    X    X    X                             │
│   SUE    │                X    X    X    X    X                             │
│   S      │           X    X         X    X    X                             │
│   SRE    │           X    X    X    X    X    X                             │
│   E      │      X    X    X    X    X    X    X                             │
│   AE     │ X    X    X    X    X    X    X    X                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
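
A quick two-session sketch of the matrix in action (table name and values are illustrative): CREATE INDEX takes a Share lock, which conflicts with the RowExclusive lock every UPDATE needs.

-- Session 1: take a Share lock explicitly (as CREATE INDEX would)
BEGIN;
LOCK TABLE accounts IN SHARE MODE;

-- Session 2: blocks until session 1 commits (RowExclusive conflicts with Share)
UPDATE accounts SET balance = balance + 1 WHERE id = 1;

-- Session 1: releasing the lock lets session 2 proceed
COMMIT;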

4.2 Lock Manager Data Structures

/* src/backend/storage/lmgr/lock.c */

/*
 * Lock hash table structure
 */

/* Lock tag identifies what is being locked */
typedef struct LOCKTAG
{
    uint32      locktag_field1; /* OID of database */
    uint32      locktag_field2; /* OID of relation */
    uint32      locktag_field3; /* Block number or other */
    uint32      locktag_field4; /* Offset or other */
    uint8       locktag_type;   /* Lock type */
    uint8       locktag_lockmethodid;  /* Lock method */
} LOCKTAG;

/* Lock types */
typedef enum LockTagType
{
    LOCKTAG_RELATION,           /* Whole relation */
    LOCKTAG_RELATION_EXTEND,    /* Extending a relation */
    LOCKTAG_DATABASE_FROZEN_IDS, /* Database frozen XID */
    LOCKTAG_PAGE,               /* Single page */
    LOCKTAG_TUPLE,              /* Single tuple */
    LOCKTAG_TRANSACTION,        /* Transaction ID (for waiting) */
    LOCKTAG_VIRTUALTRANSACTION, /* Virtual transaction ID */
    LOCKTAG_SPECULATIVE_TOKEN,  /* Speculative insertion token */
    LOCKTAG_OBJECT,             /* Non-relation database object */
    LOCKTAG_USERLOCK,           /* Reserved for legacy contrib/userlock */
    LOCKTAG_ADVISORY            /* Advisory lock */
} LockTagType;

/* The main lock structure in shared memory */
typedef struct LOCK
{
    LOCKTAG     tag;            /* What is locked */
    LOCKMASK    grantMask;      /* Bitmask of granted lock modes */
    LOCKMASK    waitMask;       /* Bitmask of awaited lock modes */
    SHM_QUEUE   procLocks;      /* List of PROCLOCK objects */
    PROC_QUEUE  waitProcs;      /* Queue of waiting PGPROCs */
    int         requested[MAX_LOCKMODES];  /* Count per mode */
    int         nRequested;     /* Total requested */
    int         granted[MAX_LOCKMODES];    /* Count per mode */
    int         nGranted;       /* Total granted */
} LOCK;

/* Per-backend lock holding info */
typedef struct PROCLOCK
{
    PROCLOCKTAG tag;            /* (LOCK*, PGPROC*) identifier */
    PGPROC     *groupLeader;    /* Proc's lock group leader */
    LOCKMASK    holdMask;       /* Locks this proc holds */
    LOCKMASK    releaseMask;    /* Locks to release on commit */
    SHM_QUEUE   lockLink;       /* Links in LOCK's procLocks list */
    SHM_QUEUE   procLink;       /* Links in PGPROC's procLocks list */
} PROCLOCK;
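
These structures surface almost directly in the pg_locks view, whose columns mirror the LOCKTAG fields:

-- pg_locks columns map onto LOCKTAG fields
SELECT locktype,                      -- LockTagType
       database, relation::regclass,  -- locktag_field1/2
       page, tuple,                   -- locktag_field3/4 (page/tuple locks)
       transactionid, virtualxid,     -- XID-wait locks
       mode, granted, fastpath
FROM pg_locks;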

4.3 Lock Acquisition Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LOCK ACQUISITION FLOW                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   LockAcquire(LOCKTAG, lockmode)                                            │
│   │                                                                          │
│   ├─► 1. Check local lock table (per-backend cache)                         │
│   │      └─► If we already hold this lock: just increment its count         │
│   │                                                                          │
│   ├─► 2. Hash LOCKTAG to find partition                                     │
│   │      └─► Lock table is partitioned for concurrency                      │
│   │                                                                          │
│   ├─► 3. Acquire partition LWLock (LWLockAcquire)                           │
│   │                                                                          │
│   ├─► 4. Look up or create LOCK entry                                       │
│   │      └─► HashTableLookup/Insert in shared memory                        │
│   │                                                                          │
│   ├─► 5. Look up or create PROCLOCK entry                                   │
│   │      └─► Links PGPROC to LOCK                                           │
│   │                                                                          │
│   ├─► 6. Try to grant lock                                                  │
│   │      │                                                                   │
│   │      ├─► Check conflict matrix against grantMask + waitMask             │
│   │      │                                                                   │
│   │      ├─► If no conflict:                                                │
│   │      │   └─► Grant lock, update masks, return SUCCESS                   │
│   │      │                                                                   │
│   │      └─► If conflict:                                                   │
│   │          └─► Continue to step 7                                         │
│   │                                                                          │
│   ├─► 7. Check for deadlock (if waiting would be needed)                    │
│   │      └─► Run deadlock detection before sleeping                         │
│   │                                                                          │
│   ├─► 8. Add to wait queue                                                  │
│   │      └─► Queue ordered by mode + FIFO                                   │
│   │                                                                          │
│   ├─► 9. Release partition LWLock                                           │
│   │                                                                          │
│   └─► 10. Sleep on semaphore until granted or timeout                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
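
One detail the flow above glosses over: for weak relation locks (AccessShare through RowExclusive) PostgreSQL can bypass the shared lock table entirely via a per-backend fast path. Such locks appear in pg_locks with fastpath = true:

-- Fast-path locks held by the current backend
SELECT locktype, relation::regclass, mode, fastpath
FROM pg_locks
WHERE pid = pg_backend_pid();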

4.4 Viewing Locks

-- Lock requests currently waiting (not yet granted)
SELECT 
    locktype,
    relation::regclass,
    mode,
    granted,
    pid,
    pg_blocking_pids(pid) as blocked_by
FROM pg_locks
WHERE NOT granted
ORDER BY pid;

-- Detailed lock view with queries
SELECT 
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocked.query AS blocked_query,
    blocking.query AS blocking_query,
    blocked.wait_event_type,
    blocked.wait_event
FROM pg_stat_activity AS blocked
JOIN pg_locks AS blocked_locks ON blocked.pid = blocked_locks.pid
JOIN pg_locks AS blocking_locks ON 
    blocked_locks.locktype = blocking_locks.locktype AND
    blocked_locks.relation = blocking_locks.relation AND
    blocked_locks.pid != blocking_locks.pid AND
    blocking_locks.granted
JOIN pg_stat_activity AS blocking ON blocking_locks.pid = blocking.pid
WHERE NOT blocked_locks.granted;

-- Lock tree visualization
WITH RECURSIVE lock_tree AS (
    SELECT pid, pg_blocking_pids(pid) as blockers, 0 as depth
    FROM pg_stat_activity 
    WHERE cardinality(pg_blocking_pids(pid)) = 0 
      AND pid != pg_backend_pid()
    
    UNION ALL
    
    SELECT s.pid, pg_blocking_pids(s.pid), lt.depth + 1
    FROM pg_stat_activity s
    JOIN lock_tree lt ON lt.pid = ANY(pg_blocking_pids(s.pid))
)
SELECT repeat('  ', depth) || pid::text as tree, 
       (SELECT query FROM pg_stat_activity WHERE pid = lock_tree.pid)
FROM lock_tree
ORDER BY depth, pid;

Part 5: Deadlock Detection

5.1 Deadlock Detection Algorithm

/*
 * PostgreSQL uses a wait-for graph to detect deadlocks.
 * Detection runs when a process has waited deadlock_timeout
 * (default 1 second) without getting the lock.
 */

/*
 * Wait-For Graph:
 * 
 * Nodes: Transactions (PGPROCs)
 * Edges: A → B means A is waiting for B
 * 
 * Deadlock = cycle in the graph
 * 
 *     T1 ──waits for──► T2
 *      ▲                 │
 *      │                 │
 *      └──waits for──────┘
 *           DEADLOCK!
 */

/* Simplified from deadlock.c -- cycle detection (helper names are illustrative) */
static bool
FindLockCycle(PGPROC *checkProc, EDGE *softEdges, int *nSoftEdges)
{
    int         nVisitedProcs = 0;
    int         i;
    
    /* Start DFS from checkProc */
    visitedProcs[nVisitedProcs++] = checkProc;
    
    /* Explore all outgoing edges */
    for (i = 0; i < nVisitedProcs; i++)
    {
        PGPROC     *proc = visitedProcs[i];
        LOCK       *lock = proc->waitLock;
        PROCLOCK   *proclock;
        
        if (lock == NULL)
            continue;
        
        /* Check each holder of the lock we're waiting for */
        proclock = (PROCLOCK *) SHMQueueNext(&lock->procLocks,
                                             &lock->procLocks,
                                             offsetof(PROCLOCK, lockLink));
        
        while (proclock)
        {
            PGPROC *holder = proclock->tag.myProc;
            
            /* Skip if this holder wouldn't block us */
            if (!DoLockModesConflict(proc->waitLockMode, 
                                     proclock->holdMask))
            {
                proclock = NextProcLock(proclock);
                continue;
            }
            
            /* Check if this completes a cycle */
            if (holder == checkProc)
            {
                /* DEADLOCK DETECTED! */
                return true;
            }
            
            /* Add to exploration list if not visited */
            if (!IsVisited(holder))
            {
                visitedProcs[nVisitedProcs++] = holder;
            }
            
            proclock = NextProcLock(proclock);
        }
    }
    
    return false;  /* No cycle found */
}

5.2 Deadlock Resolution

/*
 * When deadlock detected, PostgreSQL chooses a victim to abort.
 * 
 * Victim selection heuristic:
 * 1. Prefer to abort the process that just started waiting
 *    (the one that triggered detection)
 * 2. Abort the transaction that has done the least work
 * 
 * The victim's transaction is rolled back with:
 * ERROR: deadlock detected
 * DETAIL: Process X waits for ShareLock on relation Y; 
 *         blocked by process Z.
 *         Process Z waits for ExclusiveLock on relation W; 
 *         blocked by process X.
 * HINT: See server log for query details.
 */

/* Client sees: */
/*
 * ERROR:  deadlock detected
 * DETAIL:  Process 12345 waits for ShareLock on tuple (0,5) of 
 *          relation 16384 of database 12345; blocked by process 12346.
 *          Process 12346 waits for ShareLock on tuple (0,3) of 
 *          relation 16384 of database 12345; blocked by process 12345.
 * HINT:  See server log for query details.
 * CONTEXT:  while locking tuple (0,5) in relation "accounts"
 */
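
You can reproduce this error with two sessions updating the same pair of rows in opposite orders (table and values are illustrative):

-- Session 1                          Session 2
-- BEGIN;                             BEGIN;
-- UPDATE accounts SET balance = 0    UPDATE accounts SET balance = 0
--   WHERE id = 1;                      WHERE id = 2;
-- UPDATE accounts SET balance = 0    UPDATE accounts SET balance = 0
--   WHERE id = 2;  -- blocks           WHERE id = 1;  -- deadlock detected
--                                                     -- after deadlock_timeout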

5.3 Deadlock Prevention Strategies

-- Strategy 1: Consistent lock ordering
-- Always access tables/rows in the same order

-- BAD: Transaction 1        Transaction 2
--      UPDATE accounts      UPDATE orders
--      WHERE id = 1         WHERE id = 1
--      UPDATE orders        UPDATE accounts
--      WHERE id = 1         WHERE id = 1
--      ↑ DEADLOCK RISK!

-- GOOD: Both in same order
--      UPDATE accounts      UPDATE accounts
--      UPDATE orders        UPDATE orders

-- Strategy 2: Lock all resources upfront
BEGIN;
SELECT * FROM accounts WHERE id IN (1, 2, 3) FOR UPDATE;
-- Now do work...
COMMIT;

-- Strategy 3: Use NOWAIT or SKIP LOCKED
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE NOWAIT;
-- ERROR immediately if locked, no waiting = no deadlock
COMMIT;

-- Strategy 4: Reduce lock scope
-- Use row-level locks instead of table-level
-- Keep transactions short

-- Strategy 5: Use advisory locks for application-level ordering
SELECT pg_advisory_lock(hashtext('resource_' || resource_id::text));
-- Do work
SELECT pg_advisory_unlock(hashtext('resource_' || resource_id::text));

Part 6: Row-Level Locking Details

6.1 Tuple Lock Implementation

/*
 * Row-level locks are NOT stored in the lock manager!
 * They're encoded in the tuple header (xmax field).
 * 
 * This is critical for scalability - millions of row locks
 * don't consume lock manager memory.
 */

/* Row-lock strengths (LockTupleMode) */
typedef enum LockTupleMode
{
    LockTupleKeyShare,      /* SELECT FOR KEY SHARE */
    LockTupleShare,         /* SELECT FOR SHARE */
    LockTupleNoKeyUpdate,   /* SELECT FOR NO KEY UPDATE */
    LockTupleExclusive      /* SELECT FOR UPDATE */
} LockTupleMode;
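
Each strength maps to a SQL locking clause, from weakest to strongest:

SELECT * FROM accounts WHERE id = 1 FOR KEY SHARE;      -- LockTupleKeyShare
SELECT * FROM accounts WHERE id = 1 FOR SHARE;          -- LockTupleShare
SELECT * FROM accounts WHERE id = 1 FOR NO KEY UPDATE;  -- LockTupleNoKeyUpdate
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;         -- LockTupleExclusive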

/*
 * Tuple header bits used for row locking:
 *
 * HEAP_XMAX_LOCK_ONLY   - xmax is a locker, not deleter
 * HEAP_XMAX_KEYSHR_LOCK - Key share lock held
 * HEAP_XMAX_SHR_LOCK    - Share lock held  
 * HEAP_XMAX_EXCL_LOCK   - Exclusive lock held
 *
 * When multiple transactions lock same row:
 * - xmax points to a MultiXact ID
 * - MultiXact contains list of (xid, lock mode) pairs
 * - Stored in pg_multixact/members and pg_multixact/offsets
 */
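
With the pageinspect extension (superuser required) you can inspect these header fields directly; "accounts" and page 0 are illustrative:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Raw xmax and infomask bits for each tuple on page 0
SELECT lp, t_xmax, t_infomask, t_infomask2
FROM heap_page_items(get_raw_page('accounts', 0));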

/*
 * Lock a tuple for update -- heavily simplified from heapam.c.
 * DoesMultiXactConflict, NeedMultiXact, and SetTupleLockBits are
 * illustrative stand-ins for the real infomask logic.
 */
TM_Result
heap_lock_tuple(Relation relation, HeapTuple tuple,
                CommandId cid, LockTupleMode mode,
                LockWaitPolicy wait_policy,
                bool follow_updates,
                Buffer *buffer, TM_FailureData *tmfd)
{
    TransactionId xmax;
    TransactionId newxmax;
    uint16      infomask;
    
    /* Read current tuple state */
    *buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&tuple->t_self));
recheck:
    LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
    
    /* Check visibility and current locks */
    xmax = HeapTupleHeaderGetRawXmax(tuple->t_data);
    infomask = tuple->t_data->t_infomask;
    
    if (TransactionIdIsValid(xmax))
    {
        /* Someone has xmax set - check if conflict */
        if (DoesMultiXactConflict(xmax, mode))
        {
            /* Must wait for the conflicting locker */
            if (wait_policy == LockWaitBlock)
            {
                /* Release buffer lock, wait for xmax transaction */
                LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
                XactLockTableWait(xmax, relation, &tuple->t_self,
                                  XLTW_Lock);
                /* Re-acquire and retry */
                goto recheck;
            }
            else if (wait_policy == LockWaitSkip)
                return TM_WouldBlock;
            else /* LockWaitError */
                return TM_WouldBlock;  /* Caller raises error */
        }
    }
    
    /* Can acquire the lock - update tuple header */
    if (NeedMultiXact(xmax, mode))
    {
        /* Create/extend MultiXact */
        newxmax = MultiXactIdCreate(xmax, mode);
    }
    else
    {
        newxmax = GetCurrentTransactionId();
    }
    
    HeapTupleHeaderSetXmax(tuple->t_data, newxmax);
    /* Set appropriate lock bits */
    SetTupleLockBits(tuple->t_data, mode);
    
    return TM_Ok;
}

6.2 MultiXact System

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MULTIXACT FOR CONCURRENT ROW LOCKS                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Scenario: Multiple transactions lock same row for share                   │
│                                                                              │
│   T1: SELECT * FROM accounts WHERE id = 1 FOR SHARE;                        │
│   T2: SELECT * FROM accounts WHERE id = 1 FOR SHARE;                        │
│   T3: SELECT * FROM accounts WHERE id = 1 FOR SHARE;                        │
│                                                                              │
│   Tuple Header State:                                                        │
│   ┌───────────────────────────────────────────────────────────────────┐     │
│   │ xmax = MultiXactId 12345                                          │     │
│   │ t_infomask = HEAP_XMAX_IS_MULTI | HEAP_XMAX_LOCK_ONLY |       │     │
│   │             HEAP_XMAX_SHR_LOCK                                │     │
│   └───────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│   pg_multixact/offsets:                                                     │
│   ┌───────────────────────────────────────────────────────────────────┐     │
│   │ MultiXactId 12345 → offset 1000, nmembers 3                       │     │
│   └───────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│   pg_multixact/members (at offset 1000):                                    │
│   ┌───────────────────────────────────────────────────────────────────┐     │
│   │ [0]: xid=T1, status=ForShare                                      │     │
│   │ [1]: xid=T2, status=ForShare                                      │     │
│   │ [2]: xid=T3, status=ForShare                                      │     │
│   └───────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│   When T4 tries FOR UPDATE:                                                 │
│   1. Check MultiXact members                                                │
│   2. FOR UPDATE conflicts with FOR SHARE                                    │
│   3. T4 must wait for all of T1, T2, T3 to finish                          │
│                                                                              │
│   Cleanup:                                                                   │
│   • VACUUM removes old MultiXact entries                                    │
│   • Must track oldest active MultiXact                                      │
│   • Can cause bloat if long transactions hold row locks                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
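
To observe this from another session: the row's xmax system column exposes the locker, and pg_get_multixact_members (an undocumented system function) lists the members of a MultiXact; the ID 12345 below is illustrative.

-- xmax of the locked row; with several FOR SHARE holders it is a MultiXactId
SELECT id, xmax FROM accounts WHERE id = 1;

-- List the (xid, lock mode) members of a MultiXact
SELECT * FROM pg_get_multixact_members('12345');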

Part 7: Advisory Locks

7.1 Advisory Lock Types

-- Session-level locks (held until explicit unlock or session end)
SELECT pg_advisory_lock(key);           -- Exclusive
SELECT pg_advisory_lock_shared(key);    -- Shared
SELECT pg_advisory_unlock(key);         -- Release
SELECT pg_advisory_unlock_all();        -- Release all

-- Transaction-level locks (released at transaction end)
SELECT pg_advisory_xact_lock(key);          -- Exclusive
SELECT pg_advisory_xact_lock_shared(key);   -- Shared
-- No explicit unlock - released at COMMIT/ROLLBACK

-- Non-blocking versions (return true/false)
SELECT pg_try_advisory_lock(key);           -- Try exclusive
SELECT pg_try_advisory_lock_shared(key);    -- Try shared

-- Two-key versions (for finer granularity)
SELECT pg_advisory_lock(classid, objid);    -- Two int4 keys

7.2 Advisory Lock Use Cases

-- Use Case 1: Singleton job processing
CREATE OR REPLACE FUNCTION process_daily_report()
RETURNS void AS $$
BEGIN
    -- Only one instance should run
    IF NOT pg_try_advisory_lock(hashtext('daily_report')) THEN
        RAISE NOTICE 'Another instance is running';
        RETURN;
    END IF;
    
    -- Do the actual work
    PERFORM generate_report();
    
    PERFORM pg_advisory_unlock(hashtext('daily_report'));
END;
$$ LANGUAGE plpgsql;

-- Use Case 2: Prevent duplicate order processing
CREATE OR REPLACE FUNCTION process_order(order_id integer)
RETURNS boolean AS $$
BEGIN
    -- Lock this specific order
    IF NOT pg_try_advisory_xact_lock('orders'::regclass::oid::int, order_id) THEN
        RETURN false;  -- Another process handling this order
    END IF;
    
    -- Process order (lock held for transaction)
    UPDATE orders SET status = 'processing' WHERE id = order_id;
    -- ... more processing ...
    UPDATE orders SET status = 'completed' WHERE id = order_id;
    
    RETURN true;
END;
$$ LANGUAGE plpgsql;

-- Use Case 3: Rate limiting
CREATE OR REPLACE FUNCTION check_rate_limit(user_id integer)
RETURNS boolean AS $$
DECLARE
    lock_acquired boolean;
BEGIN
    -- Try to acquire a session-level lock for this user; each user
    -- can have only one concurrent request.  NOTE: session-level locks
    -- survive COMMIT -- the caller must call pg_advisory_unlock() when
    -- the request finishes.
    SELECT pg_try_advisory_lock(hashtext('rate:' || user_id::text))
    INTO lock_acquired;
    
    RETURN lock_acquired;
END;
$$ LANGUAGE plpgsql;

-- Viewing advisory locks
SELECT 
    classid,
    objid,
    mode,
    pid,
    granted
FROM pg_locks 
WHERE locktype = 'advisory';

Part 8: Lock Contention in Production

8.1 Common Lock Contention Patterns

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LOCK CONTENTION PATTERNS                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Pattern 1: Hot Row / Counter Table                                         │
│   ─────────────────────────────────────────────────────────────────────────  │
│   Problem: Many transactions update the same row (e.g., global counter).     │
│   Symptoms: High tuple lock waits, serialization errors.                     │
│   Solutions:                                                                 │
│   • Use SEQUENCES instead of tables for IDs.                                 │
│   • Partition hot rows (e.g., 10 rows for 1 counter, SUM at query time).     │
│   • Use SKIP LOCKED for queue patterns.                                      │
│                                                                              │
│   Pattern 2: Catalog Lock Contention                                         │
│   ─────────────────────────────────────────────────────────────────────────  │
│   Problem: Frequent DDL (CREATE TEMP TABLE, etc.) in high-concurrency.       │
│   Symptoms: RowExclusiveLock on pg_attribute or pg_class.                    │
│   Solutions:                                                                 │
│   • Reduce DDL frequency; reuse temp tables if possible.                     │
│   • Avoid excessive dynamic table creation.                                  │
│                                                                              │
│   Pattern 3: Foreign Key Contention                                          │
│   ─────────────────────────────────────────────────────────────────────────  │
│   Problem: FK checks acquire locks on parent table rows.                     │
│   Symptoms: Parent table rows locked frequently, blocking other updates.      │
│   Solutions:                                                                 │
│   • INDEX foreign key columns (prevents full scans on parent table).         │
│   • Batch child inserts to reduce the number of lock acquisitions.           │
│                                                                              │
│   Pattern 4: Long-Running Transactions                                       │
│   ─────────────────────────────────────────────────────────────────────────  │
│   Problem: Transaction holds locks while doing slow I/O or external calls.   │
│   Symptoms: Growing lock waits, deadlocks, connection pile-up.               │
│   Solutions:                                                                 │
│   • Keep transactions as short as possible.                                  │
│   • Don't hold locks during external API calls (Stripe, Twilio).             │
│   • Implement idle_in_transaction_session_timeout.                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
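
The timeout guardrails mentioned above can be set per session, per role, or globally (the values below are illustrative, not recommendations):

-- Fail fast instead of queueing behind a long lock holder
SET lock_timeout = '2s';
-- Evict sessions idling inside a transaction while holding locks
SET idle_in_transaction_session_timeout = '60s';
-- Cap total statement runtime
SET statement_timeout = '30s';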

8.2 Distributed Locking: Database vs. External Tools

For distributed applications, you often face the choice: lock in the database or use a distributed lock manager (DLM) like Redis (Redlock) or etcd/Consul.

   Feature       Database (PostgreSQL)                 External DLM (Redis/etcd)
   ──────────────────────────────────────────────────────────────────────────────
   Integrity     ACID compliant; lock released         Independent of DB state;
                 on rollback.                          risky if the DB tx fails.
   Performance   High overhead at massive scale.       Ultra-low latency (in-memory).
   Complexity    Low (use SELECT FOR UPDATE).          High (separate infrastructure).
   Use Case      Inventory, financial balances.        Global rate limiting,
                                                       leader election.

The “Fencing Token” Pattern

When using a distributed lock, always use a Fencing Token to prevent the “stale locker” problem (where a process loses its lock but doesn’t know it).
  1. Lock service issues an incrementing token with every lock grant.
  2. Database verifies that the token in the request is still valid/latest before committing.
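
A minimal sketch of the database-side check, assuming a hypothetical resources table with a last_token column ($1-$3 are bind parameters):

-- Reject writes carrying a stale fencing token
UPDATE resources
SET    payload = $1,
       last_token = $2
WHERE  id = $3
  AND  last_token < $2;   -- 0 rows updated => the caller lost the lock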

8.3 Advanced Deadlock Troubleshooting

If you encounter frequent deadlocks:
  1. Lower deadlock_timeout: To detect and fail faster (but increases CPU overhead).
  2. Log Lock Waits: Set log_lock_waits = on to identify who is holding what for how long.
  3. Trace the Graph: Use pg_blocking_pids() recursively to find the root blocker.
-- Root blocker query
WITH RECURSIVE blockers AS (
  SELECT pid, pg_blocking_pids(pid) AS pids FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0
  UNION ALL
  SELECT s.pid, pg_blocking_pids(s.pid) FROM pg_stat_activity s JOIN blockers b ON s.pid = ANY(b.pids)
)
SELECT * FROM pg_stat_activity
WHERE pid IN (SELECT unnest(pids) FROM blockers)
  AND cardinality(pg_blocking_pids(pid)) = 0;  -- roots: block others, not blocked themselves
Additional contention patterns:
  • Lock escalation: operations such as VACUUM FULL, ALTER TABLE, and
    REFRESH MATERIALIZED VIEW take table-level locks; symptoms are blocked
    queries and connection pile-up. Use the CONCURRENTLY variants where
    available, schedule during low-traffic windows, and set
    statement_timeout to avoid indefinite waits.
  • For hot rows, consider optimistic locking with a retry loop.
  • For foreign-key contention, consider deferrable constraints.

8.4 Lock Monitoring Queries

-- Real-time lock contention
SELECT 
    now() - query_start as duration,
    pid,
    usename,
    wait_event_type,
    wait_event,
    state,
    left(query, 60) as query
FROM pg_stat_activity
WHERE wait_event_type IN ('Lock', 'LWLock')
ORDER BY duration DESC;

-- Lock statistics over time (requires pg_stat_statements;
-- on PostgreSQL 12 and older, use mean_time instead of mean_exec_time)
SELECT 
    queryid,
    calls,
    mean_exec_time,
    blk_read_time + blk_write_time as io_time,
    left(query, 80) as query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

-- Lock wait events histogram
SELECT 
    wait_event,
    count(*) as occurrences
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND state = 'active'
GROUP BY wait_event
ORDER BY occurrences DESC;

-- Find lock-holding sessions
WITH lock_holders AS (
    SELECT 
        pid,
        granted,
        mode,
        locktype,
        relation::regclass
    FROM pg_locks
    WHERE relation IS NOT NULL
)
SELECT 
    h.pid as holder_pid,
    h.mode as holder_mode,
    h.relation,
    w.pid as waiter_pid,
    w.mode as waiter_mode,
    a.query as holder_query
FROM lock_holders h
JOIN lock_holders w ON h.relation = w.relation 
                   AND h.pid != w.pid 
                   AND h.granted 
                   AND NOT w.granted
JOIN pg_stat_activity a ON h.pid = a.pid;

Part 9: Interview Deep Dives

Question 1: What is the difference between LWLocks and heavyweight locks?

Answer:

   Aspect               LWLock                              Heavyweight Lock
   ──────────────────────────────────────────────────────────────────────────────
   Purpose              Protect shared memory structures    Protect database objects
   Modes                Shared/Exclusive only               8 modes (AccessShare to
                                                            AccessExclusive)
   Deadlock detection   None (must avoid by design)         Yes, after deadlock_timeout
   Duration             Microseconds to milliseconds        Can span entire transaction
   Visibility           Not in pg_locks                     Visible in pg_locks
   Wait tracking        wait_event_type = 'LWLock'          wait_event_type = 'Lock'
   Memory               Fixed in shared memory              Can grow (lock table)

Key insight: LWLocks are for internal synchronization; heavyweight locks are for user-visible database object protection.

Question 2: How does PostgreSQL detect and resolve deadlocks?

Answer:
  1. Detection Trigger:
    • Process waits longer than deadlock_timeout (default 1s)
    • Detection is expensive, so we delay it
  2. Algorithm:
    • Build wait-for graph from lock manager data
    • Run DFS to find cycles
    • Cycle = deadlock
  3. Resolution:
    • Choose victim (usually the detecting process)
    • Abort victim’s transaction
    • Error raised to client
    • Other transactions proceed
  4. Why not prevent deadlocks?
    • Prevention requires knowing all locks upfront
    • Would require either:
      • Lock ordering (not practical for SQL)
      • Wait-die/wound-wait (complex, reduces concurrency)
    • Detection is simpler and deadlocks are rare

Question 3: How would you debug queries stuck waiting on locks in production?

Answer:
-- Step 1: Identify waiting sessions
SELECT pid, wait_event, query, now() - query_start as duration
FROM pg_stat_activity 
WHERE wait_event_type = 'Lock';

-- Step 2: Find what's blocking them
SELECT 
    blocked.pid AS victim,
    blocking.pid AS blocker,
    blocked.query AS waiting_query,
    blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking 
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));

-- Step 3: Analyze lock details (substitute the actual pids)
SELECT * FROM pg_locks WHERE pid IN (blocker_pid, victim_pid);

-- Step 4: Check transaction age
SELECT pid, xact_start, now() - xact_start as xact_duration
FROM pg_stat_activity WHERE pid = blocker_pid;
Resolution options:
  • Terminate blocker: SELECT pg_terminate_backend(blocker_pid);
  • Cancel query: SELECT pg_cancel_backend(blocker_pid);
  • Wait for natural completion
Prevention:
  • Set lock_timeout and statement_timeout
  • Use idle_in_transaction_session_timeout
  • Review application lock ordering

Next Steps