Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Deadlocks

A deadlock is a situation where two or more threads are waiting indefinitely for resources held by each other. Think of two cars meeting on a narrow one-lane bridge from opposite sides — neither can move forward until the other backs up, but neither is willing to back up. In software, “backing up” means releasing a lock you already hold, which most programs are not designed to do. Understanding deadlocks is essential for senior engineers — they can bring entire systems down, and they are notoriously difficult to reproduce because they depend on precise timing between threads.
Caveat 1: Lock ordering inversion is the #1 cause of production deadlocks. Every multi-lock deadlock you will ever see in real code reduces to “thread A grabbed L1 then L2, thread B grabbed L2 then L1.” Senior engineers know this and aggressively enforce a canonical lock order (e.g., always lower memory address first; always parent inode before child inode). The trap: third-party libraries that internally acquire locks. Calling library.foo() while holding your own lock is the silent ordering violation — your code looks correct, but the library may take a global lock (the C runtime allocator does this!) that some other thread is waiting on while it holds yours. Document every lock acquisition order in a comment, and never call into unknown library code while holding a lock.
Pattern: enforce lock order with annotations. Linux uses lockdep (covered below) which infers ordering from observed acquisitions and yells if any code path violates it. For C++, use Clang’s Thread Safety Analysis (-Wthread-safety) with GUARDED_BY and ACQUIRED_BEFORE attributes — compile-time enforcement. For Java, the JCStress harness exercises lock orderings under stress. The cheapest mechanical rule: in functions that take multiple locks as parameters, sort by std::less on the address before locking.
Caveat 2: Priority inversion in real-time systems will kill you. A high-priority thread waiting on a lock held by a low-priority thread is fine — IF the low-priority thread is running. But if a medium-priority thread preempts the low-priority holder, the high-priority thread is blocked indirectly by the medium thread. This is exactly what stranded the Mars Pathfinder rover in 1997. The Linux/POSIX fix is priority inheritance: when a high-priority thread blocks on a lock, the holder temporarily gets boosted to the waiter’s priority so it cannot be preempted by lower-priority work.
Pattern: enable priority inheritance explicitly. It is OFF by default in pthreads. Use pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT) for any mutex shared across priority levels. For real-time threads, also pin them to dedicated cores (pthread_setaffinity_np) and use SCHED_FIFO — inheritance only works if the scheduler can actually run the boosted holder. The alternative, priority ceiling, is faster (no dynamic priority changes) but requires you to know all threads that will use each lock at design time.
Caveat 3: Diagnose deadlocks with the right tools, not by reading code. Production deadlocks are rarely “obvious from the code” — if they were, they would have been caught in review. Use lockdep (kernel; enable with CONFIG_PROVE_LOCKING=y), ThreadSanitizer (TSan; user-space, -fsanitize=thread), or Helgrind (Valgrind tool). For a wedged production process, attach with gdb and run thread apply all bt — a deadlock shows N threads all in pthread_mutex_lock waiting on each other’s locks. For Java, jstack produces “Found one Java-level deadlock” output automatically.
Pattern: keep diagnostics one command away. Bake gdb -batch -ex "thread apply all bt" into your container’s preStop hook so a stuck pod dumps stack traces before being killed. For lockdep, your CI should run debug kernels even if production runs prod kernels — lockdep catches deadlocks that have not happened yet but are reachable. Linus rejects kernel patches that break lockdep clean runs; treat your project the same way.
Caveat 4: Watchdog timers are a last resort, not a design. “Add a 5-second timeout and abort the request if it expires” feels like robust engineering. It is not — it is hiding a bug. Watchdogs are appropriate for external failures you cannot prevent (a third-party API hangs). They are inappropriate for internal deadlocks you should fix at the design level. A timeout that fires once per million requests usually means a real deadlock, not a slow network — and burying it under retries lets it accumulate and bite you when load spikes.
Pattern: instrument timeouts so they are visible. Every timeout should emit a metric (counter + last-seen timestamp) and a structured log including the stack trace at the moment of timeout. If the timeout count is non-zero in production, treat it as a bug. PostgreSQL does this with log_lock_waits = on and deadlock_timeout — when a backend waits longer than the threshold, it dumps the wait graph to the log so DBAs can see exactly which transactions are tangled.
Interview Frequency: High
Key Topics: Four conditions, Banker’s algorithm, prevention strategies
Time to Master: 6-8 hours

Deadlock Fundamentals

Classic Deadlock Scenario

Deadlock Example

The Four Necessary Conditions

Coffman Conditions: ALL four must hold for deadlock to occur. Breaking ANY one prevents deadlock.

1. Mutual Exclusion

Resources can only be held by one thread at a time.Example: A mutex can only be held by one thread.

2. Hold and Wait

Thread holds resources while waiting for others.Example: Thread holds mutex1 while waiting for mutex2.

3. No Preemption

Resources can only be released voluntarily.Example: Can’t force a thread to release its mutex.

4. Circular Wait

Circular chain of threads waiting for each other.Example: A waits for B, B waits for A.

Resource Allocation Graph

Visualize resource allocation to detect deadlocks: Resource Allocation Graph Detection rule:
  • Single instance per resource: Cycle = Deadlock
  • Multiple instances: Cycle is necessary but not sufficient

Deadlock Prevention

Break one of the four conditions:

1. Break Mutual Exclusion

// Strategy: Replace mutual exclusion with lock-free atomics.
// If no thread ever "holds" a lock, deadlock is structurally impossible.

#include <stdatomic.h>
atomic_int counter = 0;

void increment() {
    // atomic_fetch_add uses a hardware CAS loop internally --
    // no mutex, so no possibility of hold-and-wait
    atomic_fetch_add(&counter, 1);
}
Limitation: Not all resources can be shared. Atomics work for counters and flags, but you cannot atomically update a complex data structure spanning multiple fields without a lock (or a carefully designed lock-free algorithm).

2. Break Hold and Wait

// Strategy: Acquire ALL locks at once, or none.
// A thread never "holds" one lock while "waiting" for another,
// so the hold-and-wait condition is broken by construction.

typedef struct {
    pthread_mutex_t mutex;
    int id;
} resource_t;

bool acquire_all(resource_t *resources[], int count) {
    for (int i = 0; i < count; i++) {
        if (pthread_mutex_trylock(&resources[i]->mutex) != 0) {
            // Failed on lock i -- release everything we already hold
            // to avoid holding some locks while waiting for others
            for (int j = 0; j < i; j++) {
                pthread_mutex_unlock(&resources[j]->mutex);
            }
            return false;  // Caller should back off and retry
        }
    }
    return true;  // Successfully acquired all locks atomically
}
Limitation: Low resource utilization (you hold all locks even if you only need one at a time), and repeated acquire/release cycles can cause livelock if multiple threads keep colliding and backing off in sync. Add randomized backoff to mitigate.

3. Break No Preemption

// Strategy: If you cannot get the next resource, voluntarily release
// what you already hold. This is "self-preemption" -- the thread
// gives up its own resources rather than waiting forever.

bool acquire_with_preemption(pthread_mutex_t *m1, pthread_mutex_t *m2) {
    while (1) {
        pthread_mutex_lock(m1);
        
        if (pthread_mutex_trylock(m2) == 0) {
            return true;  // Got both locks
        }
        
        // Cannot get m2 -- release m1 to break the "no preemption" condition.
        // Without this release, we would hold m1 and block on m2 = classic deadlock setup.
        pthread_mutex_unlock(m1);
        
        // Randomized backoff prevents livelock: without it, two threads
        // can release and retry in perfect lockstep forever
        usleep(random() % 1000);
    }
}
Limitation: Only works for restartable resources. If the thread has already sent half a network packet or written partial data while holding m1, releasing m1 means that partial work is visible to other threads. This pattern is best for pure lock acquisition, not for locks that guard in-progress mutations.

4. Break Circular Wait (Lock Ordering)

// Strategy: Always acquire locks in the SAME global order.
// If every thread acquires Lock A before Lock B, no thread can ever
// hold B while waiting for A -- the circular wait condition is impossible.

// Approach 1: Order by memory address (works when locks have no natural ordering)
void safe_lock_two(pthread_mutex_t *a, pthread_mutex_t *b) {
    // Lower address first -- deterministic, no extra metadata needed
    if (a < b) {
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    } else {
        pthread_mutex_lock(b);
        pthread_mutex_lock(a);
    }
}

// Approach 2: Explicit priority levels (better for complex systems)
#define LOCK_PRIORITY_DB 1      // Acquired first
#define LOCK_PRIORITY_CACHE 2   // Acquired second
#define LOCK_PRIORITY_LOG 3     // Acquired last

// Rule: Always lock in ascending priority order: DB -> Cache -> Log
// Violating this order is a bug, and tools like lockdep will catch it.
This is the most practical prevention method. The Linux kernel uses this extensively — it defines a strict lock ordering hierarchy and uses the lockdep runtime validator to detect ordering violations during development. If you only remember one deadlock prevention technique, make it this one. Practical tip: Document your lock ordering in a comment at the top of each subsystem. Future maintainers will thank you. The Linux kernel documents its lock ordering in files like Documentation/locking/lockdep-design.rst.

Deadlock Avoidance

Prevention is static — you design the rules upfront. Avoidance is dynamic: the system evaluates each resource request at runtime and denies it if granting it could lead to deadlock in the future. Think of it like a bank that checks your credit before approving a loan — even if it has the money right now, it will not lend if doing so could leave it unable to cover other obligations.

Safe State

A state is safe if there exists a sequence where all processes can complete: Safe vs Unsafe State

Banker’s Algorithm

Named after the way a banker decides whether to approve a loan: “If I give this customer what they ask for, can I still guarantee that every other customer can eventually get their maximum loan?” If the answer is no, the request is denied even though the bank has the cash right now.
// System state -- the "bank's ledger"
int n_processes, n_resources;
int available[n_resources];          // Cash on hand
int max[n_processes][n_resources];   // Maximum each process will ever need
int allocation[n_processes][n_resources];  // Currently loaned out
int need[n_processes][n_resources];  // max - allocation = remaining demand

bool is_safe_state() {
    bool finish[n_processes] = {false};
    int work[n_resources];
    memcpy(work, available, sizeof(available));
    
    int completed = 0;
    while (completed < n_processes) {
        bool found = false;
        
        for (int i = 0; i < n_processes; i++) {
            if (finish[i]) continue;
            
            // Check if process i can finish with available resources
            bool can_finish = true;
            for (int j = 0; j < n_resources; j++) {
                if (need[i][j] > work[j]) {
                    can_finish = false;
                    break;
                }
            }
            
            if (can_finish) {
                // Simulate process completing
                for (int j = 0; j < n_resources; j++) {
                    work[j] += allocation[i][j];
                }
                finish[i] = true;
                completed++;
                found = true;
            }
        }
        
        if (!found) return false;  // No process can finish
    }
    
    return true;  // Found safe sequence
}

bool request_resources(int process_id, int request[]) {
    // Check if request exceeds need
    for (int j = 0; j < n_resources; j++) {
        if (request[j] > need[process_id][j]) {
            return false;  // Error: exceeds maximum claim
        }
        if (request[j] > available[j]) {
            return false;  // Must wait
        }
    }
    
    // Pretend to allocate
    for (int j = 0; j < n_resources; j++) {
        available[j] -= request[j];
        allocation[process_id][j] += request[j];
        need[process_id][j] -= request[j];
    }
    
    // Check if resulting state is safe
    if (is_safe_state()) {
        return true;  // Grant request
    }
    
    // Not safe, rollback
    for (int j = 0; j < n_resources; j++) {
        available[j] += request[j];
        allocation[process_id][j] -= request[j];
        need[process_id][j] += request[j];
    }
    
    return false;  // Deny request
}
Complexity: O(n^2 x m) where n = processes, m = resource types. This makes Banker’s algorithm impractical for systems with thousands of processes and resource types — it is mostly of academic interest and interview value. Real production systems use prevention (lock ordering) or detection (wait-for graphs) instead.

Deadlock Detection

Instead of preventing or avoiding deadlocks upfront, let them happen and then detect and recover. This is the approach taken by most databases (PostgreSQL, MySQL/InnoDB) because it offers the best resource utilization — you do not waste resources on conservative avoidance, and deadlocks are rare enough that the occasional recovery cost is acceptable.

Detection Algorithm

// Similar to Banker's safety check
bool detect_deadlock() {
    bool finish[n_processes];
    int work[n_resources];
    
    memcpy(work, available, sizeof(available));
    
    // Initially, only processes with no allocation are "finished"
    for (int i = 0; i < n_processes; i++) {
        bool has_allocation = false;
        for (int j = 0; j < n_resources; j++) {
            if (allocation[i][j] > 0) {
                has_allocation = true;
                break;
            }
        }
        finish[i] = !has_allocation;
    }
    
    // Find processes that can complete
    bool changed;
    do {
        changed = false;
        for (int i = 0; i < n_processes; i++) {
            if (finish[i]) continue;
            
            bool can_finish = true;
            for (int j = 0; j < n_resources; j++) {
                if (request[i][j] > work[j]) {
                    can_finish = false;
                    break;
                }
            }
            
            if (can_finish) {
                finish[i] = true;
                for (int j = 0; j < n_resources; j++) {
                    work[j] += allocation[i][j];
                }
                changed = true;
            }
        }
    } while (changed);
    
    // Any unfinished process is deadlocked
    for (int i = 0; i < n_processes; i++) {
        if (!finish[i]) return true;  // Deadlock detected
    }
    return false;
}

When to Run Detection

StrategyFrequencyTrade-off
Every requestImmediate detectionHigh overhead
Periodic (every N seconds)BalanceMay delay detection
When CPU utilization dropsTriggeredResource dependent

Deadlock Recovery

Once detected, how to break the deadlock:

1. Process Termination

void recover_by_termination() {
    // Option A: Kill all deadlocked processes (drastic)
    for (int i = 0; i < n_processes; i++) {
        if (is_deadlocked(i)) {
            kill_process(i);
        }
    }
    
    // Option B: Kill one at a time until deadlock resolved
    while (detect_deadlock()) {
        int victim = select_victim();  // Based on priority, resources, etc.
        kill_process(victim);
    }
}

int select_victim() {
    // Criteria for victim selection:
    // 1. Priority (kill lowest priority)
    // 2. Resources held (kill one with most)
    // 3. Progress (kill one with least work done)
    // 4. Resources needed (kill one needing most)
    // 5. Interactive vs batch (prefer killing batch)
}

2. Resource Preemption

void recover_by_preemption() {
    // Take resources from some processes, give to others
    
    while (detect_deadlock()) {
        int victim = select_victim();
        
        // Roll back victim to safe state
        rollback_process(victim);
        
        // Reallocate its resources
        for (int j = 0; j < n_resources; j++) {
            available[j] += allocation[victim][j];
            allocation[victim][j] = 0;
        }
    }
}
Challenges:
  • Rolling back may lose work
  • Same process may be selected repeatedly (starvation)
  • Need checkpointing for rollback

Locking Design Playbook

This section is a practical checklist for designing locking in a new subsystem so you avoid deadlocks up front.

Step 1: Identify Resources and Contention Points

  • List all shared data structures and external resources (sockets, files, devices) that multiple threads touch.
  • For each, decide:
    • Is it mostly read or write heavy?
    • How large is the critical section?
    • Is ordering important (e.g., parent → child, global → local)?

Step 2: Define a Global Lock Ordering

  • Assign each lock a total order (e.g., GLOBAL < DB < CACHE < LOG).
  • Enforce the rule: always acquire locks in increasing order.
  • Document this in comments and design docs so future contributors follow it.

Step 3: Choose the Right Primitive

  • Coarse mutex for simple systems with low contention.
  • Fine-grained locks where contention hotspots are identified.
  • Reader-writer locks for read-mostly data.
  • Lock-free / RCU only when the performance benefit justifies the complexity.

Step 4: Add Timeouts and Diagnostics

  • Prefer trylock + backoff in non-critical paths where waiting forever is unacceptable.
  • Add lock acquisition logging at debug level for tricky subsystems.
  • In the kernel, enable tools like lockdep; in user space, integrate watch-dog threads.

Step 5: Simulate and Test

  • Write stress tests that:
    • Start many threads acquiring locks in randomized orders (still following your documented rules).
    • Intentionally break the ordering in a test-only build to ensure your detection mechanisms fire.
  • Use tools like helgrind, TSan, or kernel lock validators where applicable.
A good locking design is boring: clearly ordered, well documented, and easy to reason about. Use this playbook whenever you add a new subsystem or refactor an old one.

Livelock

Livelock is the polite version of deadlock: two people meet in a hallway and keep stepping aside in the same direction, forever. The threads are not blocked — they are actively executing — but they make no progress.
// LIVELOCK EXAMPLE
void thread_a() {
    while (1) {
        lock(mutex_a);
        if (trylock(mutex_b) == 0) {
            // Got both
            do_work();
            unlock(mutex_b);
            unlock(mutex_a);
            return;
        }
        unlock(mutex_a);  // Release and retry
        // Both threads keep doing this forever!
    }
}

// SOLUTION: Add randomized backoff
void thread_a_fixed() {
    while (1) {
        lock(mutex_a);
        if (trylock(mutex_b) == 0) {
            do_work();
            unlock(mutex_b);
            unlock(mutex_a);
            return;
        }
        unlock(mutex_a);
        usleep(random() % 1000);  // Random backoff
    }
}

Starvation

A thread never gets the resources it needs, even though the system is not deadlocked. Think of a shy person at a buffet who keeps stepping back whenever someone more aggressive reaches for food — there is plenty of food, but they never eat.
// STARVATION EXAMPLE
// Reader-writer lock with reader preference
// Writers may starve if readers keep coming

// SOLUTION: Fair scheduling
// - Limit how long readers can hold lock
// - Give waiting writers priority after threshold
// - Use fair lock (FIFO ordering)

Priority Inversion

Low-priority thread blocks high-priority thread. This is the classic “intern blocks the CEO” problem — the intern holds the only conference room key, but a medium-priority manager keeps preempting the intern before they can unlock the door, so the CEO (who needs the room) waits indefinitely.
High priority (H):   Waiting for lock held by L
Medium priority (M): Running (doesn't need lock)
Low priority (L):    Holds lock, preempted by M

H is blocked by M (indirectly)!
Solutions:
  • Priority inheritance: L temporarily gets H’s priority
  • Priority ceiling: Lock has priority, holder gets it

Interview Deep Dive Questions

Answer:For database transactions:
-- 1. Lock ordering by table name/ID
-- Always lock accounts before orders
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;
SELECT * FROM orders WHERE account_id = 1 FOR UPDATE;
-- Work...
COMMIT;

-- 2. Lock timeout
SET lock_timeout = '5s';
-- Transaction aborts if can't get lock in 5s

-- 3. Deadlock detection (PostgreSQL does this)
-- Detect cycles, abort youngest transaction
For application locks:
// 1. Lock ordering
// Define consistent order based on resource ID
void transfer(Account *from, Account *to, int amount) {
    Account *first = from->id < to->id ? from : to;
    Account *second = from->id < to->id ? to : from;
    
    lock(&first->mutex);
    lock(&second->mutex);
    // Transfer...
    unlock(&second->mutex);
    unlock(&first->mutex);
}

// 2. Try-lock with timeout
bool acquire_or_abort(mutex_t *locks[], int count) {
    for (int i = 0; i < count; i++) {
        if (!try_lock_timeout(&locks[i], 100)) {
            // Timeout, release all and abort
            for (int j = 0; j < i; j++) {
                unlock(locks[j]);
            }
            return false;
        }
    }
    return true;
}
Answer:
Resources: A=10, B=5, C=7

Process  Allocation   Max      Need     
         A  B  C    A  B  C   A  B  C   
P0       0  1  0    7  5  3   7  4  3   
P1       2  0  0    3  2  2   1  2  2   
P2       3  0  2    9  0  2   6  0  0   
P3       2  1  1    2  2  2   0  1  1   
P4       0  0  2    4  3  3   4  3  1   

Available: A=3, B=3, C=2

Finding safe sequence:

1. Check P0: Need (7,4,3) > Available (3,3,2)? YES → Skip
2. Check P1: Need (1,2,2) ≤ Available (3,3,2)? YES
   → P1 can complete
   → Work = (3,3,2) + (2,0,0) = (5,3,2)

3. Check P0: (7,4,3) > (5,3,2)? YES → Skip
4. Check P2: (6,0,0) > (5,3,2)? YES → Skip  
5. Check P3: (0,1,1) ≤ (5,3,2)? YES
   → P3 can complete
   → Work = (5,3,2) + (2,1,1) = (7,4,3)

6. Check P0: (7,4,3) ≤ (7,4,3)? YES
   → P0 can complete
   → Work = (7,4,3) + (0,1,0) = (7,5,3)

7. Check P2: (6,0,0) ≤ (7,5,3)? YES
   → Work = (7,5,3) + (3,0,2) = (10,5,5)

8. Check P4: (4,3,1) ≤ (10,5,5)? YES
   → P4 can complete

Safe sequence: <P1, P3, P0, P2, P4>
System is in safe state!
Request evaluation:
P1 requests (1, 0, 2):

1. Check: Request ≤ Need? (1,0,2) ≤ (1,2,2)? YES
2. Check: Request ≤ Available? (1,0,2) ≤ (3,3,2)? YES
3. Pretend to allocate:
   Allocation[P1] = (2,0,0) + (1,0,2) = (3,0,2)
   Available = (3,3,2) - (1,0,2) = (2,3,0)
   Need[P1] = (1,2,2) - (1,0,2) = (0,2,0)

4. Check if safe: Run safety algorithm...
   → If safe, grant request
   → If not safe, deny and rollback
Answer:PostgreSQL approach:
  1. Wait-for graph: Track which transaction waits for which
  2. Deadlock detection: Periodically (every deadlock_timeout, default 1s) check for cycles
  3. Victim selection: Abort transaction that:
    • Has done least work (youngest)
    • Holds fewest locks
    • Is easiest to rollback
-- Example deadlock
-- Session 1                    -- Session 2
BEGIN;                          BEGIN;
UPDATE accounts SET             UPDATE accounts SET
  balance = 100                   balance = 200
  WHERE id = 1;                   WHERE id = 2;

-- Session 1 waits              -- Session 2 waits
UPDATE accounts SET             UPDATE accounts SET
  balance = 150                   balance = 250
  WHERE id = 2;                   WHERE id = 1;

-- DEADLOCK! PostgreSQL detects and aborts one:
-- ERROR: deadlock detected
-- DETAIL: Process 1234 waits for ShareLock on transaction 5678;
--         Process 5678 waits for ShareLock on transaction 1234.
-- HINT: See server log for query details.
Prevention strategies:
-- 1. Lock ordering
-- Always update rows in consistent order (by ID)
SELECT * FROM accounts WHERE id IN (1, 2) 
ORDER BY id FOR UPDATE;

-- 2. Lock timeout
SET lock_timeout = '5s';

-- 3. Use NOWAIT
SELECT * FROM accounts WHERE id = 1 FOR UPDATE NOWAIT;
-- Fails immediately if lock not available

-- 4. Advisory locks for complex operations
SELECT pg_advisory_lock(12345);
-- ... operations ...
SELECT pg_advisory_unlock(12345);
Answer:Mars Pathfinder (1997):
System had three threads:
- High priority: Communication task
- Medium priority: Data gathering
- Low priority: Meteorological task (held shared bus lock)

What happened:
1. Low-priority task acquires bus lock
2. High-priority task preempts, needs bus lock, blocks
3. Medium-priority task preempts low-priority
4. Medium runs indefinitely
5. Low can't release lock (preempted)
6. High starves, watchdog timer triggers reset

Result: Spacecraft kept resetting!
Solutions implemented:
  1. Priority Inheritance:
void lock_with_inheritance(mutex_t *m) {
    disable_preemption();
    
    if (m->holder != NULL) {
        // Boost holder's priority to ours
        if (current->priority > m->holder->priority) {
            m->holder->effective_priority = current->priority;
        }
    }
    
    // Now wait for lock
    while (!try_lock(m)) {
        sleep_on_mutex(m);
    }
    
    enable_preemption();
}

void unlock_with_inheritance(mutex_t *m) {
    // Restore original priority
    current->effective_priority = current->base_priority;
    m->holder = NULL;
    wake_waiters(m);
}
  1. Priority Ceiling Protocol:
// Each mutex has ceiling = highest priority of any thread that uses it

void lock_with_ceiling(mutex_t *m) {
    // Immediately boost to ceiling priority
    saved_priority = current->priority;
    current->priority = m->ceiling;
    
    acquire_lock(m);
}

void unlock_with_ceiling(mutex_t *m) {
    release_lock(m);
    current->priority = saved_priority;
}
Mars Pathfinder fix: Enabled priority inheritance on VxWorks mutexes via remote patch upload!
Answer:
// Lock wrapper with deadlock detection

#include <pthread.h>
#include <stdbool.h>

#define MAX_THREADS 100
#define MAX_LOCKS 100

typedef struct {
    pthread_mutex_t internal_mutex;
    int id;
    int holder_tid;  // Thread holding this lock
} tracked_mutex_t;

// Adjacency matrix for wait-for graph
// waits_for[i][j] = true if thread i waits for thread j
bool waits_for[MAX_THREADS][MAX_THREADS];
pthread_mutex_t graph_lock = PTHREAD_MUTEX_INITIALIZER;

bool has_cycle_dfs(int node, bool visited[], bool in_stack[]) {
    visited[node] = true;
    in_stack[node] = true;
    
    for (int i = 0; i < MAX_THREADS; i++) {
        if (waits_for[node][i]) {
            if (!visited[i]) {
                if (has_cycle_dfs(i, visited, in_stack)) {
                    return true;
                }
            } else if (in_stack[i]) {
                return true;  // Back edge = cycle!
            }
        }
    }
    
    in_stack[node] = false;
    return false;
}

bool detect_deadlock() {
    bool visited[MAX_THREADS] = {false};
    bool in_stack[MAX_THREADS] = {false};
    
    for (int i = 0; i < MAX_THREADS; i++) {
        if (!visited[i]) {
            if (has_cycle_dfs(i, visited, in_stack)) {
                return true;
            }
        }
    }
    return false;
}

int tracked_lock(tracked_mutex_t *m) {
    int my_tid = get_thread_id();
    
    // Try to acquire
    if (pthread_mutex_trylock(&m->internal_mutex) == 0) {
        m->holder_tid = my_tid;
        return 0;
    }
    
    // Update wait-for graph
    pthread_mutex_lock(&graph_lock);
    waits_for[my_tid][m->holder_tid] = true;
    
    if (detect_deadlock()) {
        waits_for[my_tid][m->holder_tid] = false;
        pthread_mutex_unlock(&graph_lock);
        return EDEADLK;  // Would deadlock!
    }
    pthread_mutex_unlock(&graph_lock);
    
    // Actually wait
    pthread_mutex_lock(&m->internal_mutex);
    
    // Got lock, update graph
    pthread_mutex_lock(&graph_lock);
    waits_for[my_tid][m->holder_tid] = false;
    m->holder_tid = my_tid;
    pthread_mutex_unlock(&graph_lock);
    
    return 0;
}

Practice Exercises

1

Banker's Simulator

Implement Banker’s algorithm. Test with various request sequences.
2

Deadlock Detector

Build a system that wraps pthread_mutex and detects potential deadlocks.
3

Dining Philosophers

Solve with different strategies: lock ordering, waiter, chandy-misra.
4

Priority Inheritance

Implement mutex with priority inheritance protocol.

Key Takeaways

Four Conditions

Mutual exclusion, hold-and-wait, no preemption, circular wait. Break any one.

Lock Ordering Works

Most practical prevention. Define consistent order, always follow it.

Detection + Recovery

For complex systems. Run detection periodically, abort victim.

Databases Handle It

Built-in deadlock detection. Abort youngest transaction on cycle.

Next: Inter-Process Communication

Interview Deep-Dive

Strong Answer Framework:
  1. Define detection vs. diagnosis. Detection answers “is something stuck?”; diagnosis answers “why?” Both are needed but the tooling differs.
  2. Detection signals in order of usefulness:
    • Latency p99 cliff — requests that normally take 50ms suddenly take 30 seconds. Easy to alert on, often the first signal.
    • Thread pool saturation — monitor “active threads / max threads” per pool. When it pegs at 100% and stays there while RPS drops, threads are blocked, not busy.
    • Lock wait time — expose pthread_mutex wait time as a metric (Linux: perf lock contention; PostgreSQL: pg_stat_activity.wait_event). A growing wait queue with stable acquisition rate means contention; a frozen wait queue means deadlock.
    • CPU drop with steady RPS — a deadlocked process holds threads but does no work. You see CPU collapse while request rate is stable — a paradoxical signature.
  3. Diagnosis steps once detected:
    • Capture stacks from every thread (gdb thread apply all bt, jstack, py-spy dump).
    • Look for the cycle: thread A in pthread_mutex_lock for mutex M1, where M1 is held by thread B who is in pthread_mutex_lock for M2.
    • For Java, JVM auto-detects and prints “Found one Java-level deadlock”; for Linux, lockdep does this in debug kernels.
  4. Automate the response: a watchdog that detects “no requests completed in 60s but threads remain” should dump stacks and either restart or page on-call. Netflix’s Hystrix / resilience4j uses thread pool saturation as a circuit breaker trigger for exactly this reason.
Real-World Example: GitLab’s 2017 database outage included a deadlock signature on the secondary replica before the primary failed. They diagnosed it from pg_stat_activity showing two backends each waiting on the other’s transaction. After this, they added a Prometheus exporter for pg_locks and alert when wait time exceeds 1 second. PostgreSQL has built-in deadlock detection (default deadlock_timeout = 1s) that aborts one transaction with ERROR: deadlock detected — in databases, you usually do not need custom detection; you need to alert on the count of these errors.
Senior Follow-up 1: How do you distinguish a deadlock from a slow query or a stalled disk? Look at the wait event, not just the wait time. PostgreSQL distinguishes Lock wait events (deadlock candidate) from IO waits (slow disk) from Client waits (idle in transaction). A pure deadlock has zero CPU usage and no I/O activity on the blocked thread. A slow query is on-CPU or doing I/O.
Senior Follow-up 2: What is a “lockdep splat” and what does it tell you? Lockdep emits a multi-line warning to dmesg showing the lock acquisition chain that forms the cycle, the call stacks where each acquisition happened, and the thread context. The format is precise: “Existing dependency chain (in reverse order)” followed by the lock classes. You read it bottom-up to see the order of acquisitions that conflict.
Senior Follow-up 3: Can you detect deadlocks across processes (not just threads)? Within one host, yes — the kernel knows all flock/fcntl/futex waiters and can build the wait-for graph. PostgreSQL detects deadlocks across backend processes because the lock manager is in shared memory. Across hosts (distributed locks via Redis/Zookeeper), you need an external coordinator with timeouts; there is no kernel-level visibility. Most distributed systems use timeouts + lease-based locks rather than true detection.
Common Wrong Answers:
  • “Just set a timeout on every operation” — treats the symptom; the deadlock still happens, you just get faster errors instead of fixing the root cause.
  • “Restart the service if requests are slow” — masks the bug and conflates deadlock with general slowness.
  • “Use a deadlock detection algorithm in the application” — usually overkill; the OS / DB already does this. Reach for it only when you have your own custom resource manager.
Further Reading:
  • PostgreSQL docs, “Lock Management” section, especially deadlock_timeout and pg_locks — the model implementation.
  • Linux kernel Documentation/locking/lockdep-design.rst — lockdep internals, free.
  • Brendan Gregg, “Systems Performance” Chapter 5 — how to diagnose lock contention with perf.
Strong Answer Framework:
  1. State the problem precisely: 5 philosophers, 5 forks shared between adjacent pairs, each needs both adjacent forks to eat. Naive “pick up left then right” deadlocks if all five act simultaneously.
  2. Solution 1: Resource ordering (lock by global order). Number forks 0-4. Each philosopher acquires min(left, right) first, max(left, right) second. Breaks circular wait. Scales O(1) per attempt, no central coordination.
  3. Solution 2: Asymmetric breaking. Philosopher 4 picks up right then left; everyone else picks up left then right. One asymmetric actor is enough to break the cycle. Scales identically to ordering, sometimes simpler to retrofit.
  4. Solution 3: Waiter (global mutex / arbiter). A central waiter grants permission to pick up forks. Breaks hold-and-wait. Does NOT scale — the waiter serializes all eating, throughput drops to 1 philosopher at a time.
  5. Solution 4: Chandy-Misra (token-based). Distributed protocol with dirty/clean fork states. No central coordinator, scales to thousands of philosophers across machines. Best for distributed systems but high implementation complexity.
  6. Solution 5: Try-lock with backoff. Each philosopher tries non-blocking acquire of both forks; if either fails, release and back off. Avoids deadlock but creates livelock risk without randomized backoff.
  7. Recommend resource ordering for production: simplest, scales, and matches what real systems (Linux kernel lock hierarchy, database lock managers) actually do.
Real-World Example: The Linux kernel’s VFS lock hierarchy is dining philosophers at scale. Inodes have parent-child relationships, and any operation that locks two inodes must lock parent before child. This is enforced at the lockdep level — if any code path violates the ordering, lockdep reports it during testing. The kernel’s Documentation/filesystems/directory-locking.rst documents the exact ordering rules: parent-before-child for create/delete, ancestor-first for renames, and a global rename-lock for cross-directory moves to prevent ordering cycles.
Senior Follow-up 1: Why does the waiter solution not scale? The waiter is a single point of serialization — every fork acquisition goes through one mutex, so throughput is bounded by 1 / (waiter critical section time). With 5 philosophers it might be fine; with 1000 it is a 1000x slowdown vs. independent ordering. This is the same reason “one big lock” loses to per-shard locks at scale.
Senior Follow-up 2: How does Chandy-Misra avoid the deadlock without ordering? Each fork has a “clean” or “dirty” state. A philosopher only sends a fork to a requesting neighbor if it is dirty. After eating, forks become dirty. A philosopher who wants to eat requests the missing fork; the holder gives it up if dirty, keeps it if clean. The asymmetry of clean/dirty + request tokens guarantees the wait-for graph is acyclic.
Senior Follow-up 3: How does this map to per-row database locking with thousands of locks? You cannot pre-order locks across arbitrary user queries — users decide which rows to update, in what order. So databases use detection + recovery rather than prevention: PostgreSQL builds a wait-for graph in shared memory and runs cycle detection every deadlock_timeout (default 1s). When it finds a cycle, it aborts the transaction with the youngest XID. The user sees ERROR: deadlock detected and is expected to retry.
Common Wrong Answers:
  • “Just use the waiter, it is the simplest” — correct that it is simple, wrong that it scales. Ignores the throughput cost.
  • “Try-lock without backoff fixes it” — creates livelock; threads keep retrying in lockstep.
  • “Acquire both forks atomically” — not actually possible in standard pthreads without a higher-level coordinator (which is the waiter solution in disguise).
Further Reading:
  • Edsger Dijkstra, “Hierarchical Ordering of Sequential Processes” (1971) — where the dining philosophers were first stated.
  • Chandy & Misra, “The Drinking Philosophers Problem” (1984) — generalization with the token algorithm.
  • Linux kernel Documentation/filesystems/directory-locking.rst — production-scale lock ordering.
Strong Answer Framework:
  1. First, characterize the wait. Userspace mutex shows up as __lll_lock_wait or futex_wait in the stack. Kernel semaphore shows up with the syscall in the upper frames — the thread is in kernel space, blocked on down() or down_interruptible().
  2. Identify the resource each is waiting on.
    • For the userspace mutex: in gdb, print *((pthread_mutex_t*)0xADDRESS) to see the owner field. That tells you which thread holds it.
    • For the kernel semaphore: this is harder from userspace. Use cat /proc/PID/stack to see the in-kernel call chain, and cat /proc/PID/wchan to see what kernel function the thread is sleeping in.
  3. Build the wait-for graph manually. Thread A waits on userspace mutex M held by Thread B. Thread B is in ioctl waiting on kernel semaphore S. What does S protect? Often a device driver’s internal state. If the holder of S is also waiting on M (e.g., a driver callback that calls back into userspace via a signal handler that takes M), you have a deadlock spanning userspace and kernel space.
  4. Use kernel tools:
    • echo l > /proc/sysrq-trigger dumps all CPU stack traces.
    • echo t > /proc/sysrq-trigger dumps all task states.
    • cat /proc/lockdep_chains if lockdep is on — shows kernel lock dependency graph.
    • bpftrace to observe semaphore acquisitions in real time: trace down and up calls.
  5. Common root causes:
    • Driver bug: ioctl handler holds a kernel lock and triggers a signal that calls back into userspace, where the signal handler takes a userspace lock that is held by another thread waiting on the ioctl. Classic kernel-userspace inversion.
    • mmap-backed I/O: writing to mmap’d memory triggers a page fault, which takes mmap_sem, which conflicts with another thread’s mmap syscall. This is the famous mmap_sem contention problem that drove the kernel’s switch to mmap_lock (rwsem) in 2020.
  6. Fix patterns: never call signal handlers or callbacks while holding a kernel lock you might re-enter; never hold a userspace lock while making syscalls that could block on the same resource the lock protects.
Real-World Example: The Linux kernel’s mmap_sem was the source of dozens of deadlock bugs over a decade. A classic case: a thread calls mmap, which takes mmap_sem as writer. The new VMA triggers a page fault on first access; the fault handler also wants mmap_sem as reader. Recursive acquisition deadlock. The fix landed in 5.8 (2020): mmap_sem was renamed to mmap_lock and given proper read/write semantics with annotations enforced by lockdep. Greg Kroah-Hartman cited this as one of the longest-running multi-kernel-version refactors.
Senior Follow-up 1: Why is debugging across the user/kernel boundary so hard? Userspace tools (gdb, strace) cannot see kernel stack frames except via /proc/PID/stack. Kernel tools (ftrace, bpftrace) cannot easily decode userspace symbols without debug info. You usually need both: gdb for userspace state, cat /proc/PID/stack plus crash (kernel core dump tool) for the kernel side. perf can bridge them with --call-graph dwarf but the merged stacks are still imperfect.
Senior Follow-up 2: What is D state and how does it relate to deadlock? D state (TASK_UNINTERRUPTIBLE) means the thread is sleeping in the kernel and cannot be killed — not even by SIGKILL. It is used for short, non-interruptible waits (e.g., disk I/O) that the kernel guarantees will complete. A thread stuck in D state for minutes usually indicates a driver bug or storage hang — effectively a kernel deadlock. You see it in top as the D state column.
Senior Follow-up 3: How do you reproduce a userspace+kernel deadlock for testing? Hard. Use stress-ng for kernel pressure (--vm, --io, --futex), inject usleep in userspace handlers to widen the race window, and run on a slow VM where timing differences are amplified. For real reproducibility, use rr (Mozilla’s record-and-replay debugger) which captures the exact thread interleaving so you can replay the deadlock deterministically.
Common Wrong Answers:
  • “Just kill -9 the process” — D state ignores SIGKILL; you cannot kill a thread blocked in the kernel.
  • “Add a timeout to the ioctl” — often impossible (drivers may not honor timeouts) and treats the symptom.
  • “Restart the host” — works but loses the diagnostic state. Capture stacks first.
Further Reading:
  • Brendan Gregg, “BPF Performance Tools” — has a chapter on tracing kernel locks with bpftrace.
  • Linux kernel Documentation/admin-guide/sysrq.rst — the SysRq interface for emergency diagnostics.
  • crash utility documentation — post-mortem kernel debugger; essential for analyzing kernel deadlocks from a vmcore.
Strong Answer:
  • This is a distributed deadlock, and it is much harder to detect than a local deadlock because no single node has a complete picture of the wait-for graph. In a local deadlock, the OS or database can inspect all threads and locks. In a distributed deadlock, the “locks” span multiple services, databases, and network connections.
  • The wait cycle here is: Service A holds Lock-X, calls Service B over the network (blocking), Service B holds Lock-Y, calls Service A over the network (blocking). If A’s handler for B’s request also needs Lock-X, we have a classic circular wait distributed across two machines.
  • Detection approaches: (1) Timeouts — the most practical approach. Every RPC call should have a deadline. If Service A’s call to B times out, it releases Lock-X and retries or fails. (2) Distributed deadlock detection using a centralized coordinator that collects wait-for edges from all services and runs cycle detection. (3) Wound-wait or wait-die schemes where the younger transaction is always aborted to break cycles.
  • The fix is architectural: avoid holding locks across network calls. Use a saga pattern where each service does its local work and commits, then sends a message to the next service. If a later step fails, compensating transactions undo earlier steps.
Follow-up: What is the difference between a deadlock and a livelock in a distributed system, and which is harder to debug?A livelock occurs when services keep retrying failed operations without making progress. For example, if both A and B detect a timeout, release their locks, and immediately retry, they might re-acquire their respective locks and deadlock again, in an infinite loop. Each individual attempt looks healthy — no stuck threads, no timeout breaches — but throughput goes to zero. Livelocks are harder to debug because monitoring tools show active, running services with normal CPU usage. The symptom is high retry rates and zero successful completions. The fix is exponential backoff with jitter so the services eventually desynchronize and one succeeds.
Strong Answer:
  • lockdep tracks the order in which locks are acquired across all code paths and builds a global lock ordering graph at runtime. Every time a thread acquires a lock, lockdep records the pair (previously held lock, newly acquired lock) as a directed edge. If it ever detects a cycle in this graph, it reports a potential deadlock — even if the actual deadlock has not occurred yet.
  • The key insight is that lockdep operates on lock classes, not lock instances. All instances of the same lock type (e.g., all inode mutexes) are treated as one class. If code path 1 acquires class A then B, and code path 2 acquires B then A, lockdep reports a violation even if the actual instances are different. This catches deadlocks that might only manifest under rare timing.
  • lockdep also tracks locks across interrupt context boundaries. If you acquire a lock in process context and the same lock class is also acquired in interrupt context, lockdep warns you because an interrupt could fire while the process-context code holds the lock.
  • The trade-off is performance: lockdep adds overhead to every lock acquisition, so it is typically enabled in debug kernels and disabled in production. It has found hundreds of real locking bugs in the Linux kernel.
Follow-up: Why does lockdep use lock classes rather than individual lock instances?If lockdep tracked individual instances, it would miss the vast majority of potential deadlocks because the specific pair of instances that deadlocks might never be acquired together in testing. By using classes, lockdep reasons about code structure rather than runtime behavior. The downside is false positives: there are cases where ordering within a class is intentionally different (e.g., locking a parent directory inode then a child). The kernel uses annotations (lock_set_subclass, lock_nested) to inform lockdep about these intentional ordering variations.
Strong Answer:
  • The problem: five philosophers sit around a table, each needs two forks to eat. If each picks up their left fork simultaneously, all hold one fork and wait for the other — deadlock.
  • Solution 1 — Resource ordering: Number the forks 0-4. Each philosopher always picks up the lower-numbered fork first. This breaks the circular wait condition. This is the most practical production solution: simple, zero overhead, and never wastes CPU.
  • Solution 2 — Global arbitrator: A single mutex protects the “pick up forks” operation. Breaks hold-and-wait but creates a bottleneck — only one philosopher can attempt to eat at a time.
  • Solution 3 — Chandy-Misra: A distributed protocol using dirty/clean fork states. Fully distributed, no central coordinator, designed for systems where global ordering is impractical. But complex to implement correctly.
  • In production, I use resource ordering every time unless the system is truly distributed. It is easy to reason about, verifiable by lockdep, zero runtime overhead, and scales perfectly.
Follow-up: How does lock ordering work when you have thousands of lock instances, like per-row locks in a database?You use a natural ordering based on resource identity: lock by table name first, then by primary key value. When a transaction needs to update rows in tables A and B, it acquires all locks on A’s rows first (in primary key order), then B’s rows. Alternatively, the database engine allows arbitrary lock ordering but runs periodic deadlock detection on the wait-for graph, aborting the youngest transaction when a cycle is found. This is what PostgreSQL does — it is impossible to enforce ordering across arbitrary user queries, so detection and recovery is the only practical approach.