> ## Documentation Index > Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt > Use this file to discover all available pages before exploring further. # Deadlocks > Deadlock conditions, prevention, avoidance, detection, and recovery # Deadlocks A **deadlock** is a situation where two or more threads are waiting indefinitely for resources held by each other. Think of two cars meeting on a narrow one-lane bridge from opposite sides -- neither can move forward until the other backs up, but neither is willing to back up. In software, "backing up" means releasing a lock you already hold, which most programs are not designed to do. Understanding deadlocks is essential for senior engineers -- they can bring entire systems down, and they are notoriously difficult to reproduce because they depend on precise timing between threads. **Caveat 1: Lock ordering inversion is the #1 cause of production deadlocks.** Every multi-lock deadlock you will ever see in real code reduces to "thread A grabbed L1 then L2, thread B grabbed L2 then L1." Senior engineers know this and aggressively enforce a **canonical lock order** (e.g., always lower memory address first; always parent inode before child inode). The trap: third-party libraries that internally acquire locks. Calling `library.foo()` while holding your own lock is the silent ordering violation -- your code looks correct, but the library may take a global lock (the C runtime allocator does this!) that some other thread is waiting on while it holds yours. Document every lock acquisition order in a comment, and never call into unknown library code while holding a lock. **Pattern: enforce lock order with annotations.** Linux uses lockdep (covered below) which infers ordering from observed acquisitions and yells if any code path violates it. For C++, use Clang's Thread Safety Analysis (`-Wthread-safety`) with `GUARDED_BY` and `ACQUIRED_BEFORE` attributes -- compile-time enforcement. For Java, the `JCStress` harness exercises lock orderings under stress. The cheapest mechanical rule: in functions that take multiple locks as parameters, sort by `std::less` on the address before locking. **Caveat 2: Priority inversion in real-time systems will kill you.** A high-priority thread waiting on a lock held by a low-priority thread is fine -- IF the low-priority thread is running. But if a *medium*-priority thread preempts the low-priority holder, the high-priority thread is blocked indirectly by the medium thread. This is exactly what stranded the Mars Pathfinder rover in 1997. The Linux/POSIX fix is **priority inheritance**: when a high-priority thread blocks on a lock, the holder temporarily gets boosted to the waiter's priority so it cannot be preempted by lower-priority work. **Pattern: enable priority inheritance explicitly.** It is OFF by default in pthreads. Use `pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT)` for any mutex shared across priority levels. For real-time threads, also pin them to dedicated cores (`pthread_setaffinity_np`) and use `SCHED_FIFO` -- inheritance only works if the scheduler can actually run the boosted holder. The alternative, **priority ceiling**, is faster (no dynamic priority changes) but requires you to know all threads that will use each lock at design time. **Caveat 3: Diagnose deadlocks with the right tools, not by reading code.** Production deadlocks are rarely "obvious from the code" -- if they were, they would have been caught in review. Use **lockdep** (kernel; enable with `CONFIG_PROVE_LOCKING=y`), **ThreadSanitizer** (TSan; user-space, `-fsanitize=thread`), or **Helgrind** (Valgrind tool). For a wedged production process, attach with gdb and run `thread apply all bt` -- a deadlock shows N threads all in `pthread_mutex_lock` waiting on each other's locks. For Java, `jstack` produces "Found one Java-level deadlock" output automatically. **Pattern: keep diagnostics one command away.** Bake `gdb -batch -ex "thread apply all bt"` into your container's preStop hook so a stuck pod dumps stack traces before being killed. For lockdep, your CI should run debug kernels even if production runs prod kernels -- lockdep catches deadlocks that have not happened *yet* but are reachable. Linus rejects kernel patches that break lockdep clean runs; treat your project the same way. **Caveat 4: Watchdog timers are a last resort, not a design.** "Add a 5-second timeout and abort the request if it expires" feels like robust engineering. It is not -- it is hiding a bug. Watchdogs are appropriate for *external* failures you cannot prevent (a third-party API hangs). They are inappropriate for *internal* deadlocks you should fix at the design level. A timeout that fires once per million requests usually means a real deadlock, not a slow network -- and burying it under retries lets it accumulate and bite you when load spikes. **Pattern: instrument timeouts so they are visible.** Every timeout should emit a metric (counter + last-seen timestamp) and a structured log including the stack trace at the moment of timeout. If the timeout count is non-zero in production, treat it as a bug. PostgreSQL does this with `log_lock_waits = on` and `deadlock_timeout` -- when a backend waits longer than the threshold, it dumps the wait graph to the log so DBAs can see exactly which transactions are tangled. **Interview Frequency**: High\ **Key Topics**: Four conditions, Banker's algorithm, prevention strategies\ **Time to Master**: 6-8 hours *** ## Deadlock Fundamentals ### Classic Deadlock Scenario Deadlock Example

### The Four Necessary Conditions **Coffman Conditions**: ALL four must hold for deadlock to occur. Breaking ANY one prevents deadlock. Resources can only be held by one thread at a time. *Example: A mutex can only be held by one thread.* Thread holds resources while waiting for others. *Example: Thread holds mutex1 while waiting for mutex2.* Resources can only be released voluntarily. *Example: Can't force a thread to release its mutex.* Circular chain of threads waiting for each other. *Example: A waits for B, B waits for A.* *** ## Resource Allocation Graph Visualize resource allocation to detect deadlocks:

**Detection rule**: * Single instance per resource: Cycle = Deadlock * Multiple instances: Cycle is necessary but not sufficient *** ## Deadlock Prevention Break one of the four conditions: ### 1. Break Mutual Exclusion ```c theme={null} // Strategy: Replace mutual exclusion with lock-free atomics. // If no thread ever "holds" a lock, deadlock is structurally impossible. #include atomic_int counter = 0; void increment() { // atomic_fetch_add uses a hardware CAS loop internally -- // no mutex, so no possibility of hold-and-wait atomic_fetch_add(&counter, 1); } ``` **Limitation**: Not all resources can be shared. Atomics work for counters and flags, but you cannot atomically update a complex data structure spanning multiple fields without a lock (or a carefully designed lock-free algorithm). ### 2. Break Hold and Wait ```c theme={null} // Strategy: Acquire ALL locks at once, or none. // A thread never "holds" one lock while "waiting" for another, // so the hold-and-wait condition is broken by construction. typedef struct { pthread_mutex_t mutex; int id; } resource_t; bool acquire_all(resource_t *resources[], int count) { for (int i = 0; i < count; i++) { if (pthread_mutex_trylock(&resources[i]->mutex) != 0) { // Failed on lock i -- release everything we already hold // to avoid holding some locks while waiting for others for (int j = 0; j < i; j++) { pthread_mutex_unlock(&resources[j]->mutex); } return false; // Caller should back off and retry } } return true; // Successfully acquired all locks atomically } ``` **Limitation**: Low resource utilization (you hold all locks even if you only need one at a time), and repeated acquire/release cycles can cause **livelock** if multiple threads keep colliding and backing off in sync. Add randomized backoff to mitigate. ### 3. Break No Preemption ```c theme={null} // Strategy: If you cannot get the next resource, voluntarily release // what you already hold. This is "self-preemption" -- the thread // gives up its own resources rather than waiting forever. bool acquire_with_preemption(pthread_mutex_t *m1, pthread_mutex_t *m2) { while (1) { pthread_mutex_lock(m1); if (pthread_mutex_trylock(m2) == 0) { return true; // Got both locks } // Cannot get m2 -- release m1 to break the "no preemption" condition. // Without this release, we would hold m1 and block on m2 = classic deadlock setup. pthread_mutex_unlock(m1); // Randomized backoff prevents livelock: without it, two threads // can release and retry in perfect lockstep forever usleep(random() % 1000); } } ``` **Limitation**: Only works for restartable resources. If the thread has already sent half a network packet or written partial data while holding m1, releasing m1 means that partial work is visible to other threads. This pattern is best for pure lock acquisition, not for locks that guard in-progress mutations. ### 4. Break Circular Wait (Lock Ordering) ```c theme={null} // Strategy: Always acquire locks in the SAME global order. // If every thread acquires Lock A before Lock B, no thread can ever // hold B while waiting for A -- the circular wait condition is impossible. // Approach 1: Order by memory address (works when locks have no natural ordering) void safe_lock_two(pthread_mutex_t *a, pthread_mutex_t *b) { // Lower address first -- deterministic, no extra metadata needed if (a < b) { pthread_mutex_lock(a); pthread_mutex_lock(b); } else { pthread_mutex_lock(b); pthread_mutex_lock(a); } } // Approach 2: Explicit priority levels (better for complex systems) #define LOCK_PRIORITY_DB 1 // Acquired first #define LOCK_PRIORITY_CACHE 2 // Acquired second #define LOCK_PRIORITY_LOG 3 // Acquired last // Rule: Always lock in ascending priority order: DB -> Cache -> Log // Violating this order is a bug, and tools like lockdep will catch it. ``` **This is the most practical prevention method.** The Linux kernel uses this extensively -- it defines a strict lock ordering hierarchy and uses the `lockdep` runtime validator to detect ordering violations during development. If you only remember one deadlock prevention technique, make it this one. **Practical tip**: Document your lock ordering in a comment at the top of each subsystem. Future maintainers will thank you. The Linux kernel documents its lock ordering in files like `Documentation/locking/lockdep-design.rst`. *** ## Deadlock Avoidance Prevention is static -- you design the rules upfront. **Avoidance** is dynamic: the system evaluates each resource request at runtime and denies it if granting it *could* lead to deadlock in the future. Think of it like a bank that checks your credit before approving a loan -- even if it has the money right now, it will not lend if doing so could leave it unable to cover other obligations. ### Safe State A state is **safe** if there exists a sequence where all processes can complete: Safe vs Unsafe State

### Banker's Algorithm Named after the way a banker decides whether to approve a loan: "If I give this customer what they ask for, can I still guarantee that every other customer can eventually get their maximum loan?" If the answer is no, the request is denied even though the bank has the cash right now. ```c theme={null} // System state -- the "bank's ledger" int n_processes, n_resources; int available[n_resources]; // Cash on hand int max[n_processes][n_resources]; // Maximum each process will ever need int allocation[n_processes][n_resources]; // Currently loaned out int need[n_processes][n_resources]; // max - allocation = remaining demand bool is_safe_state() { bool finish[n_processes] = {false}; int work[n_resources]; memcpy(work, available, sizeof(available)); int completed = 0; while (completed < n_processes) { bool found = false; for (int i = 0; i < n_processes; i++) { if (finish[i]) continue; // Check if process i can finish with available resources bool can_finish = true; for (int j = 0; j < n_resources; j++) { if (need[i][j] > work[j]) { can_finish = false; break; } } if (can_finish) { // Simulate process completing for (int j = 0; j < n_resources; j++) { work[j] += allocation[i][j]; } finish[i] = true; completed++; found = true; } } if (!found) return false; // No process can finish } return true; // Found safe sequence } bool request_resources(int process_id, int request[]) { // Check if request exceeds need for (int j = 0; j < n_resources; j++) { if (request[j] > need[process_id][j]) { return false; // Error: exceeds maximum claim } if (request[j] > available[j]) { return false; // Must wait } } // Pretend to allocate for (int j = 0; j < n_resources; j++) { available[j] -= request[j]; allocation[process_id][j] += request[j]; need[process_id][j] -= request[j]; } // Check if resulting state is safe if (is_safe_state()) { return true; // Grant request } // Not safe, rollback for (int j = 0; j < n_resources; j++) { available[j] += request[j]; allocation[process_id][j] -= request[j]; need[process_id][j] += request[j]; } return false; // Deny request } ``` **Complexity**: O(n^2 x m) where n = processes, m = resource types. This makes Banker's algorithm impractical for systems with thousands of processes and resource types -- it is mostly of academic interest and interview value. Real production systems use prevention (lock ordering) or detection (wait-for graphs) instead. *** ## Deadlock Detection Instead of preventing or avoiding deadlocks upfront, let them happen and then detect and recover. This is the approach taken by most databases (PostgreSQL, MySQL/InnoDB) because it offers the best resource utilization -- you do not waste resources on conservative avoidance, and deadlocks are rare enough that the occasional recovery cost is acceptable. ### Detection Algorithm ```c theme={null} // Similar to Banker's safety check bool detect_deadlock() { bool finish[n_processes]; int work[n_resources]; memcpy(work, available, sizeof(available)); // Initially, only processes with no allocation are "finished" for (int i = 0; i < n_processes; i++) { bool has_allocation = false; for (int j = 0; j < n_resources; j++) { if (allocation[i][j] > 0) { has_allocation = true; break; } } finish[i] = !has_allocation; } // Find processes that can complete bool changed; do { changed = false; for (int i = 0; i < n_processes; i++) { if (finish[i]) continue; bool can_finish = true; for (int j = 0; j < n_resources; j++) { if (request[i][j] > work[j]) { can_finish = false; break; } } if (can_finish) { finish[i] = true; for (int j = 0; j < n_resources; j++) { work[j] += allocation[i][j]; } changed = true; } } } while (changed); // Any unfinished process is deadlocked for (int i = 0; i < n_processes; i++) { if (!finish[i]) return true; // Deadlock detected } return false; } ``` ### When to Run Detection | Strategy | Frequency | Trade-off | | -------------------------- | ------------------- | ------------------- | | Every request | Immediate detection | High overhead | | Periodic (every N seconds) | Balance | May delay detection | | When CPU utilization drops | Triggered | Resource dependent | *** ## Deadlock Recovery Once detected, how to break the deadlock: ### 1. Process Termination ```c theme={null} void recover_by_termination() { // Option A: Kill all deadlocked processes (drastic) for (int i = 0; i < n_processes; i++) { if (is_deadlocked(i)) { kill_process(i); } } // Option B: Kill one at a time until deadlock resolved while (detect_deadlock()) { int victim = select_victim(); // Based on priority, resources, etc. kill_process(victim); } } int select_victim() { // Criteria for victim selection: // 1. Priority (kill lowest priority) // 2. Resources held (kill one with most) // 3. Progress (kill one with least work done) // 4. Resources needed (kill one needing most) // 5. Interactive vs batch (prefer killing batch) } ``` ### 2. Resource Preemption ```c theme={null} void recover_by_preemption() { // Take resources from some processes, give to others while (detect_deadlock()) { int victim = select_victim(); // Roll back victim to safe state rollback_process(victim); // Reallocate its resources for (int j = 0; j < n_resources; j++) { available[j] += allocation[victim][j]; allocation[victim][j] = 0; } } } ``` **Challenges**: * Rolling back may lose work * Same process may be selected repeatedly (starvation) * Need checkpointing for rollback *** ## Locking Design Playbook This section is a **practical checklist** for designing locking in a new subsystem so you avoid deadlocks up front. ### Step 1: Identify Resources and Contention Points * List all shared data structures and external resources (sockets, files, devices) that multiple threads touch. * For each, decide: * Is it mostly **read** or **write** heavy? * How large is the critical section? * Is ordering important (e.g., parent → child, global → local)? ### Step 2: Define a Global Lock Ordering * Assign each lock a **total order** (e.g., `GLOBAL` \< `DB` \< `CACHE` \< `LOG`). * Enforce the rule: **always acquire locks in increasing order**. * Document this in comments and design docs so future contributors follow it. ### Step 3: Choose the Right Primitive * **Coarse mutex** for simple systems with low contention. * **Fine-grained locks** where contention hotspots are identified. * **Reader-writer locks** for read-mostly data. * **Lock-free / RCU** only when the performance benefit justifies the complexity. ### Step 4: Add Timeouts and Diagnostics * Prefer `trylock` + backoff in non-critical paths where waiting forever is unacceptable. * Add **lock acquisition logging** at debug level for tricky subsystems. * In the kernel, enable tools like **lockdep**; in user space, integrate watch-dog threads. ### Step 5: Simulate and Test * Write **stress tests** that: * Start many threads acquiring locks in randomized orders (still following your documented rules). * Intentionally break the ordering in a test-only build to ensure your detection mechanisms fire. * Use tools like `helgrind`, `TSan`, or kernel lock validators where applicable. A good locking design is boring: clearly ordered, well documented, and easy to reason about. Use this playbook whenever you add a new subsystem or refactor an old one. *** ## Related Problems ### Livelock Livelock is the polite version of deadlock: two people meet in a hallway and keep stepping aside in the same direction, forever. The threads are not blocked -- they are actively executing -- but they make no progress. ```c theme={null} // LIVELOCK EXAMPLE void thread_a() { while (1) { lock(mutex_a); if (trylock(mutex_b) == 0) { // Got both do_work(); unlock(mutex_b); unlock(mutex_a); return; } unlock(mutex_a); // Release and retry // Both threads keep doing this forever! } } // SOLUTION: Add randomized backoff void thread_a_fixed() { while (1) { lock(mutex_a); if (trylock(mutex_b) == 0) { do_work(); unlock(mutex_b); unlock(mutex_a); return; } unlock(mutex_a); usleep(random() % 1000); // Random backoff } } ``` ### Starvation A thread never gets the resources it needs, even though the system is not deadlocked. Think of a shy person at a buffet who keeps stepping back whenever someone more aggressive reaches for food -- there is plenty of food, but they never eat. ```c theme={null} // STARVATION EXAMPLE // Reader-writer lock with reader preference // Writers may starve if readers keep coming // SOLUTION: Fair scheduling // - Limit how long readers can hold lock // - Give waiting writers priority after threshold // - Use fair lock (FIFO ordering) ``` ### Priority Inversion Low-priority thread blocks high-priority thread. This is the classic "intern blocks the CEO" problem -- the intern holds the only conference room key, but a medium-priority manager keeps preempting the intern before they can unlock the door, so the CEO (who needs the room) waits indefinitely. ``` High priority (H): Waiting for lock held by L Medium priority (M): Running (doesn't need lock) Low priority (L): Holds lock, preempted by M H is blocked by M (indirectly)! ``` **Solutions**: * **Priority inheritance**: L temporarily gets H's priority * **Priority ceiling**: Lock has priority, holder gets it *** ## Interview Deep Dive Questions **Answer:** **For database transactions**: ```sql theme={null} -- 1. Lock ordering by table name/ID -- Always lock accounts before orders BEGIN; SELECT * FROM accounts WHERE id = 1 FOR UPDATE; SELECT * FROM orders WHERE account_id = 1 FOR UPDATE; -- Work... COMMIT; -- 2. Lock timeout SET lock_timeout = '5s'; -- Transaction aborts if can't get lock in 5s -- 3. Deadlock detection (PostgreSQL does this) -- Detect cycles, abort youngest transaction ``` **For application locks**: ```c theme={null} // 1. Lock ordering // Define consistent order based on resource ID void transfer(Account *from, Account *to, int amount) { Account *first = from->id < to->id ? from : to; Account *second = from->id < to->id ? to : from; lock(&first->mutex); lock(&second->mutex); // Transfer... unlock(&second->mutex); unlock(&first->mutex); } // 2. Try-lock with timeout bool acquire_or_abort(mutex_t *locks[], int count) { for (int i = 0; i < count; i++) { if (!try_lock_timeout(&locks[i], 100)) { // Timeout, release all and abort for (int j = 0; j < i; j++) { unlock(locks[j]); } return false; } } return true; } ``` **Answer:** ``` Resources: A=10, B=5, C=7 Process Allocation Max Need A B C A B C A B C P0 0 1 0 7 5 3 7 4 3 P1 2 0 0 3 2 2 1 2 2 P2 3 0 2 9 0 2 6 0 0 P3 2 1 1 2 2 2 0 1 1 P4 0 0 2 4 3 3 4 3 1 Available: A=3, B=3, C=2 Finding safe sequence: 1. Check P0: Need (7,4,3) > Available (3,3,2)? YES → Skip 2. Check P1: Need (1,2,2) ≤ Available (3,3,2)? YES → P1 can complete → Work = (3,3,2) + (2,0,0) = (5,3,2) 3. Check P0: (7,4,3) > (5,3,2)? YES → Skip 4. Check P2: (6,0,0) > (5,3,2)? YES → Skip 5. Check P3: (0,1,1) ≤ (5,3,2)? YES → P3 can complete → Work = (5,3,2) + (2,1,1) = (7,4,3) 6. Check P0: (7,4,3) ≤ (7,4,3)? YES → P0 can complete → Work = (7,4,3) + (0,1,0) = (7,5,3) 7. Check P2: (6,0,0) ≤ (7,5,3)? YES → Work = (7,5,3) + (3,0,2) = (10,5,5) 8. Check P4: (4,3,1) ≤ (10,5,5)? YES → P4 can complete Safe sequence: System is in safe state! ``` **Request evaluation**: ``` P1 requests (1, 0, 2): 1. Check: Request ≤ Need? (1,0,2) ≤ (1,2,2)? YES 2. Check: Request ≤ Available? (1,0,2) ≤ (3,3,2)? YES 3. Pretend to allocate: Allocation[P1] = (2,0,0) + (1,0,2) = (3,0,2) Available = (3,3,2) - (1,0,2) = (2,3,0) Need[P1] = (1,2,2) - (1,0,2) = (0,2,0) 4. Check if safe: Run safety algorithm... → If safe, grant request → If not safe, deny and rollback ``` **Answer:** **PostgreSQL approach**: 1. **Wait-for graph**: Track which transaction waits for which 2. **Deadlock detection**: Periodically (every `deadlock_timeout`, default 1s) check for cycles 3. **Victim selection**: Abort transaction that: * Has done least work (youngest) * Holds fewest locks * Is easiest to rollback ```sql theme={null} -- Example deadlock -- Session 1 -- Session 2 BEGIN; BEGIN; UPDATE accounts SET UPDATE accounts SET balance = 100 balance = 200 WHERE id = 1; WHERE id = 2; -- Session 1 waits -- Session 2 waits UPDATE accounts SET UPDATE accounts SET balance = 150 balance = 250 WHERE id = 2; WHERE id = 1; -- DEADLOCK! PostgreSQL detects and aborts one: -- ERROR: deadlock detected -- DETAIL: Process 1234 waits for ShareLock on transaction 5678; -- Process 5678 waits for ShareLock on transaction 1234. -- HINT: See server log for query details. ``` **Prevention strategies**: ```sql theme={null} -- 1. Lock ordering -- Always update rows in consistent order (by ID) SELECT * FROM accounts WHERE id IN (1, 2) ORDER BY id FOR UPDATE; -- 2. Lock timeout SET lock_timeout = '5s'; -- 3. Use NOWAIT SELECT * FROM accounts WHERE id = 1 FOR UPDATE NOWAIT; -- Fails immediately if lock not available -- 4. Advisory locks for complex operations SELECT pg_advisory_lock(12345); -- ... operations ... SELECT pg_advisory_unlock(12345); ``` **Answer:** **Mars Pathfinder (1997)**: ``` System had three threads: - High priority: Communication task - Medium priority: Data gathering - Low priority: Meteorological task (held shared bus lock) What happened: 1. Low-priority task acquires bus lock 2. High-priority task preempts, needs bus lock, blocks 3. Medium-priority task preempts low-priority 4. Medium runs indefinitely 5. Low can't release lock (preempted) 6. High starves, watchdog timer triggers reset Result: Spacecraft kept resetting! ``` **Solutions implemented**: 1. **Priority Inheritance**: ```c theme={null} void lock_with_inheritance(mutex_t *m) { disable_preemption(); if (m->holder != NULL) { // Boost holder's priority to ours if (current->priority > m->holder->priority) { m->holder->effective_priority = current->priority; } } // Now wait for lock while (!try_lock(m)) { sleep_on_mutex(m); } enable_preemption(); } void unlock_with_inheritance(mutex_t *m) { // Restore original priority current->effective_priority = current->base_priority; m->holder = NULL; wake_waiters(m); } ``` 2. **Priority Ceiling Protocol**: ```c theme={null} // Each mutex has ceiling = highest priority of any thread that uses it void lock_with_ceiling(mutex_t *m) { // Immediately boost to ceiling priority saved_priority = current->priority; current->priority = m->ceiling; acquire_lock(m); } void unlock_with_ceiling(mutex_t *m) { release_lock(m); current->priority = saved_priority; } ``` **Mars Pathfinder fix**: Enabled priority inheritance on VxWorks mutexes via remote patch upload! **Answer:** ```c theme={null} // Lock wrapper with deadlock detection #include #include #define MAX_THREADS 100 #define MAX_LOCKS 100 typedef struct { pthread_mutex_t internal_mutex; int id; int holder_tid; // Thread holding this lock } tracked_mutex_t; // Adjacency matrix for wait-for graph // waits_for[i][j] = true if thread i waits for thread j bool waits_for[MAX_THREADS][MAX_THREADS]; pthread_mutex_t graph_lock = PTHREAD_MUTEX_INITIALIZER; bool has_cycle_dfs(int node, bool visited[], bool in_stack[]) { visited[node] = true; in_stack[node] = true; for (int i = 0; i < MAX_THREADS; i++) { if (waits_for[node][i]) { if (!visited[i]) { if (has_cycle_dfs(i, visited, in_stack)) { return true; } } else if (in_stack[i]) { return true; // Back edge = cycle! } } } in_stack[node] = false; return false; } bool detect_deadlock() { bool visited[MAX_THREADS] = {false}; bool in_stack[MAX_THREADS] = {false}; for (int i = 0; i < MAX_THREADS; i++) { if (!visited[i]) { if (has_cycle_dfs(i, visited, in_stack)) { return true; } } } return false; } int tracked_lock(tracked_mutex_t *m) { int my_tid = get_thread_id(); // Try to acquire if (pthread_mutex_trylock(&m->internal_mutex) == 0) { m->holder_tid = my_tid; return 0; } // Update wait-for graph pthread_mutex_lock(&graph_lock); waits_for[my_tid][m->holder_tid] = true; if (detect_deadlock()) { waits_for[my_tid][m->holder_tid] = false; pthread_mutex_unlock(&graph_lock); return EDEADLK; // Would deadlock! } pthread_mutex_unlock(&graph_lock); // Actually wait pthread_mutex_lock(&m->internal_mutex); // Got lock, update graph pthread_mutex_lock(&graph_lock); waits_for[my_tid][m->holder_tid] = false; m->holder_tid = my_tid; pthread_mutex_unlock(&graph_lock); return 0; } ``` *** ## Practice Exercises Implement Banker's algorithm. Test with various request sequences. Build a system that wraps pthread\_mutex and detects potential deadlocks. Solve with different strategies: lock ordering, waiter, chandy-misra. Implement mutex with priority inheritance protocol. *** ## Key Takeaways Mutual exclusion, hold-and-wait, no preemption, circular wait. Break any one. Most practical prevention. Define consistent order, always follow it. For complex systems. Run detection periodically, abort victim. Built-in deadlock detection. Abort youngest transaction on cycle. *** Next: [Inter-Process Communication](/operating-systems/ipc) → *** ## Interview Deep-Dive **Strong Answer Framework:** 1. **Define detection vs. diagnosis.** Detection answers "is something stuck?"; diagnosis answers "why?" Both are needed but the tooling differs. 2. **Detection signals in order of usefulness:** * **Latency p99 cliff** -- requests that normally take 50ms suddenly take 30 seconds. Easy to alert on, often the first signal. * **Thread pool saturation** -- monitor "active threads / max threads" per pool. When it pegs at 100% and stays there while RPS drops, threads are blocked, not busy. * **Lock wait time** -- expose `pthread_mutex` wait time as a metric (Linux: `perf lock contention`; PostgreSQL: `pg_stat_activity.wait_event`). A growing wait queue with stable acquisition rate means contention; a frozen wait queue means deadlock. * **CPU drop with steady RPS** -- a deadlocked process holds threads but does no work. You see CPU collapse while request rate is stable -- a paradoxical signature. 3. **Diagnosis steps once detected:** * Capture stacks from every thread (`gdb thread apply all bt`, `jstack`, `py-spy dump`). * Look for the cycle: thread A in `pthread_mutex_lock` for mutex M1, where M1 is held by thread B who is in `pthread_mutex_lock` for M2. * For Java, JVM auto-detects and prints "Found one Java-level deadlock"; for Linux, lockdep does this in debug kernels. 4. **Automate the response**: a watchdog that detects "no requests completed in 60s but threads remain" should dump stacks and either restart or page on-call. Netflix's Hystrix / resilience4j uses thread pool saturation as a circuit breaker trigger for exactly this reason. **Real-World Example:** GitLab's 2017 database outage included a deadlock signature on the secondary replica before the primary failed. They diagnosed it from `pg_stat_activity` showing two backends each waiting on the other's transaction. After this, they added a Prometheus exporter for `pg_locks` and alert when wait time exceeds 1 second. PostgreSQL has built-in deadlock detection (default `deadlock_timeout = 1s`) that aborts one transaction with `ERROR: deadlock detected` -- in databases, you usually do not need custom detection; you need to alert on the count of these errors. **Senior Follow-up 1: How do you distinguish a deadlock from a slow query or a stalled disk?** Look at the wait event, not just the wait time. PostgreSQL distinguishes `Lock` wait events (deadlock candidate) from `IO` waits (slow disk) from `Client` waits (idle in transaction). A pure deadlock has zero CPU usage and no I/O activity on the blocked thread. A slow query is on-CPU or doing I/O. **Senior Follow-up 2: What is a "lockdep splat" and what does it tell you?** Lockdep emits a multi-line warning to dmesg showing the lock acquisition chain that forms the cycle, the call stacks where each acquisition happened, and the thread context. The format is precise: "Existing dependency chain (in reverse order)" followed by the lock classes. You read it bottom-up to see the order of acquisitions that conflict. **Senior Follow-up 3: Can you detect deadlocks across processes (not just threads)?** Within one host, yes -- the kernel knows all `flock`/`fcntl`/`futex` waiters and can build the wait-for graph. PostgreSQL detects deadlocks across backend processes because the lock manager is in shared memory. Across hosts (distributed locks via Redis/Zookeeper), you need an external coordinator with timeouts; there is no kernel-level visibility. Most distributed systems use timeouts + lease-based locks rather than true detection. **Common Wrong Answers:** * "Just set a timeout on every operation" -- treats the symptom; the deadlock still happens, you just get faster errors instead of fixing the root cause. * "Restart the service if requests are slow" -- masks the bug and conflates deadlock with general slowness. * "Use a deadlock detection algorithm in the application" -- usually overkill; the OS / DB already does this. Reach for it only when you have your own custom resource manager. **Further Reading:** * PostgreSQL docs, "Lock Management" section, especially `deadlock_timeout` and `pg_locks` -- the model implementation. * Linux kernel `Documentation/locking/lockdep-design.rst` -- lockdep internals, free. * Brendan Gregg, "Systems Performance" Chapter 5 -- how to diagnose lock contention with `perf`. **Strong Answer Framework:** 1. **State the problem precisely**: 5 philosophers, 5 forks shared between adjacent pairs, each needs both adjacent forks to eat. Naive "pick up left then right" deadlocks if all five act simultaneously. 2. **Solution 1: Resource ordering (lock by global order).** Number forks 0-4. Each philosopher acquires `min(left, right)` first, `max(left, right)` second. Breaks circular wait. **Scales O(1) per attempt, no central coordination.** 3. **Solution 2: Asymmetric breaking.** Philosopher 4 picks up right then left; everyone else picks up left then right. One asymmetric actor is enough to break the cycle. **Scales identically to ordering, sometimes simpler to retrofit.** 4. **Solution 3: Waiter (global mutex / arbiter).** A central waiter grants permission to pick up forks. Breaks hold-and-wait. **Does NOT scale -- the waiter serializes all eating, throughput drops to 1 philosopher at a time.** 5. **Solution 4: Chandy-Misra (token-based).** Distributed protocol with dirty/clean fork states. No central coordinator, scales to thousands of philosophers across machines. **Best for distributed systems but high implementation complexity.** 6. **Solution 5: Try-lock with backoff.** Each philosopher tries non-blocking acquire of both forks; if either fails, release and back off. Avoids deadlock but creates **livelock risk** without randomized backoff. 7. **Recommend resource ordering for production**: simplest, scales, and matches what real systems (Linux kernel lock hierarchy, database lock managers) actually do. **Real-World Example:** The Linux kernel's VFS lock hierarchy is dining philosophers at scale. Inodes have parent-child relationships, and any operation that locks two inodes must lock parent before child. This is enforced at the lockdep level -- if any code path violates the ordering, lockdep reports it during testing. The kernel's `Documentation/filesystems/directory-locking.rst` documents the exact ordering rules: parent-before-child for create/delete, ancestor-first for renames, and a global rename-lock for cross-directory moves to prevent ordering cycles. **Senior Follow-up 1: Why does the waiter solution not scale?** The waiter is a single point of serialization -- every fork acquisition goes through one mutex, so throughput is bounded by 1 / (waiter critical section time). With 5 philosophers it might be fine; with 1000 it is a 1000x slowdown vs. independent ordering. This is the same reason "one big lock" loses to per-shard locks at scale. **Senior Follow-up 2: How does Chandy-Misra avoid the deadlock without ordering?** Each fork has a "clean" or "dirty" state. A philosopher only sends a fork to a requesting neighbor if it is dirty. After eating, forks become dirty. A philosopher who wants to eat requests the missing fork; the holder gives it up if dirty, keeps it if clean. The asymmetry of clean/dirty + request tokens guarantees the wait-for graph is acyclic. **Senior Follow-up 3: How does this map to per-row database locking with thousands of locks?** You cannot pre-order locks across arbitrary user queries -- users decide which rows to update, in what order. So databases use **detection + recovery** rather than prevention: PostgreSQL builds a wait-for graph in shared memory and runs cycle detection every `deadlock_timeout` (default 1s). When it finds a cycle, it aborts the transaction with the youngest XID. The user sees `ERROR: deadlock detected` and is expected to retry. **Common Wrong Answers:** * "Just use the waiter, it is the simplest" -- correct that it is simple, wrong that it scales. Ignores the throughput cost. * "Try-lock without backoff fixes it" -- creates livelock; threads keep retrying in lockstep. * "Acquire both forks atomically" -- not actually possible in standard pthreads without a higher-level coordinator (which is the waiter solution in disguise). **Further Reading:** * Edsger Dijkstra, "Hierarchical Ordering of Sequential Processes" (1971) -- where the dining philosophers were first stated. * Chandy & Misra, "The Drinking Philosophers Problem" (1984) -- generalization with the token algorithm. * Linux kernel `Documentation/filesystems/directory-locking.rst` -- production-scale lock ordering. **Strong Answer Framework:** 1. **First, characterize the wait.** Userspace mutex shows up as `__lll_lock_wait` or `futex_wait` in the stack. Kernel semaphore shows up with the syscall in the upper frames -- the thread is in kernel space, blocked on `down()` or `down_interruptible()`. 2. **Identify the resource each is waiting on.** * For the userspace mutex: in gdb, `print *((pthread_mutex_t*)0xADDRESS)` to see the owner field. That tells you which thread holds it. * For the kernel semaphore: this is harder from userspace. Use `cat /proc/PID/stack` to see the in-kernel call chain, and `cat /proc/PID/wchan` to see what kernel function the thread is sleeping in. 3. **Build the wait-for graph manually.** Thread A waits on userspace mutex M held by Thread B. Thread B is in ioctl waiting on kernel semaphore S. What does S protect? Often a device driver's internal state. If the holder of S is also waiting on M (e.g., a driver callback that calls back into userspace via a signal handler that takes M), you have a deadlock spanning userspace and kernel space. 4. **Use kernel tools:** * `echo l > /proc/sysrq-trigger` dumps all CPU stack traces. * `echo t > /proc/sysrq-trigger` dumps all task states. * `cat /proc/lockdep_chains` if lockdep is on -- shows kernel lock dependency graph. * `bpftrace` to observe semaphore acquisitions in real time: trace `down` and `up` calls. 5. **Common root causes:** * Driver bug: ioctl handler holds a kernel lock and triggers a signal that calls back into userspace, where the signal handler takes a userspace lock that is held by another thread waiting on the ioctl. Classic kernel-userspace inversion. * `mmap`-backed I/O: writing to mmap'd memory triggers a page fault, which takes `mmap_sem`, which conflicts with another thread's `mmap` syscall. This is the famous `mmap_sem` contention problem that drove the kernel's switch to `mmap_lock` (rwsem) in 2020. 6. **Fix patterns:** never call signal handlers or callbacks while holding a kernel lock you might re-enter; never hold a userspace lock while making syscalls that could block on the same resource the lock protects. **Real-World Example:** The Linux kernel's `mmap_sem` was the source of dozens of deadlock bugs over a decade. A classic case: a thread calls `mmap`, which takes `mmap_sem` as writer. The new VMA triggers a page fault on first access; the fault handler also wants `mmap_sem` as reader. Recursive acquisition deadlock. The fix landed in 5.8 (2020): `mmap_sem` was renamed to `mmap_lock` and given proper read/write semantics with annotations enforced by lockdep. Greg Kroah-Hartman cited this as one of the longest-running multi-kernel-version refactors. **Senior Follow-up 1: Why is debugging across the user/kernel boundary so hard?** Userspace tools (gdb, strace) cannot see kernel stack frames except via `/proc/PID/stack`. Kernel tools (ftrace, bpftrace) cannot easily decode userspace symbols without debug info. You usually need both: gdb for userspace state, `cat /proc/PID/stack` plus `crash` (kernel core dump tool) for the kernel side. perf can bridge them with `--call-graph dwarf` but the merged stacks are still imperfect. **Senior Follow-up 2: What is `D state` and how does it relate to deadlock?** `D state` (TASK\_UNINTERRUPTIBLE) means the thread is sleeping in the kernel and cannot be killed -- not even by SIGKILL. It is used for short, non-interruptible waits (e.g., disk I/O) that the kernel guarantees will complete. A thread stuck in D state for minutes usually indicates a driver bug or storage hang -- effectively a kernel deadlock. You see it in `top` as the `D` state column. **Senior Follow-up 3: How do you reproduce a userspace+kernel deadlock for testing?** Hard. Use `stress-ng` for kernel pressure (`--vm`, `--io`, `--futex`), inject `usleep` in userspace handlers to widen the race window, and run on a slow VM where timing differences are amplified. For real reproducibility, use rr (Mozilla's record-and-replay debugger) which captures the exact thread interleaving so you can replay the deadlock deterministically. **Common Wrong Answers:** * "Just kill -9 the process" -- D state ignores SIGKILL; you cannot kill a thread blocked in the kernel. * "Add a timeout to the ioctl" -- often impossible (drivers may not honor timeouts) and treats the symptom. * "Restart the host" -- works but loses the diagnostic state. Capture stacks first. **Further Reading:** * Brendan Gregg, "BPF Performance Tools" -- has a chapter on tracing kernel locks with bpftrace. * Linux kernel `Documentation/admin-guide/sysrq.rst` -- the SysRq interface for emergency diagnostics. * `crash` utility documentation -- post-mortem kernel debugger; essential for analyzing kernel deadlocks from a vmcore. **Strong Answer:** * This is a distributed deadlock, and it is much harder to detect than a local deadlock because no single node has a complete picture of the wait-for graph. In a local deadlock, the OS or database can inspect all threads and locks. In a distributed deadlock, the "locks" span multiple services, databases, and network connections. * The wait cycle here is: Service A holds Lock-X, calls Service B over the network (blocking), Service B holds Lock-Y, calls Service A over the network (blocking). If A's handler for B's request also needs Lock-X, we have a classic circular wait distributed across two machines. * Detection approaches: (1) Timeouts -- the most practical approach. Every RPC call should have a deadline. If Service A's call to B times out, it releases Lock-X and retries or fails. (2) Distributed deadlock detection using a centralized coordinator that collects wait-for edges from all services and runs cycle detection. (3) Wound-wait or wait-die schemes where the younger transaction is always aborted to break cycles. * The fix is architectural: avoid holding locks across network calls. Use a saga pattern where each service does its local work and commits, then sends a message to the next service. If a later step fails, compensating transactions undo earlier steps. **Follow-up: What is the difference between a deadlock and a livelock in a distributed system, and which is harder to debug?** A livelock occurs when services keep retrying failed operations without making progress. For example, if both A and B detect a timeout, release their locks, and immediately retry, they might re-acquire their respective locks and deadlock again, in an infinite loop. Each individual attempt looks healthy -- no stuck threads, no timeout breaches -- but throughput goes to zero. Livelocks are harder to debug because monitoring tools show active, running services with normal CPU usage. The symptom is high retry rates and zero successful completions. The fix is exponential backoff with jitter so the services eventually desynchronize and one succeeds. **Strong Answer:** * lockdep tracks the order in which locks are acquired across all code paths and builds a global lock ordering graph at runtime. Every time a thread acquires a lock, lockdep records the pair (previously held lock, newly acquired lock) as a directed edge. If it ever detects a cycle in this graph, it reports a potential deadlock -- even if the actual deadlock has not occurred yet. * The key insight is that lockdep operates on lock classes, not lock instances. All instances of the same lock type (e.g., all inode mutexes) are treated as one class. If code path 1 acquires class A then B, and code path 2 acquires B then A, lockdep reports a violation even if the actual instances are different. This catches deadlocks that might only manifest under rare timing. * lockdep also tracks locks across interrupt context boundaries. If you acquire a lock in process context and the same lock class is also acquired in interrupt context, lockdep warns you because an interrupt could fire while the process-context code holds the lock. * The trade-off is performance: lockdep adds overhead to every lock acquisition, so it is typically enabled in debug kernels and disabled in production. It has found hundreds of real locking bugs in the Linux kernel. **Follow-up: Why does lockdep use lock classes rather than individual lock instances?** If lockdep tracked individual instances, it would miss the vast majority of potential deadlocks because the specific pair of instances that deadlocks might never be acquired together in testing. By using classes, lockdep reasons about code structure rather than runtime behavior. The downside is false positives: there are cases where ordering within a class is intentionally different (e.g., locking a parent directory inode then a child). The kernel uses annotations (`lock_set_subclass`, `lock_nested`) to inform lockdep about these intentional ordering variations. **Strong Answer:** * The problem: five philosophers sit around a table, each needs two forks to eat. If each picks up their left fork simultaneously, all hold one fork and wait for the other -- deadlock. * Solution 1 -- Resource ordering: Number the forks 0-4. Each philosopher always picks up the lower-numbered fork first. This breaks the circular wait condition. This is the most practical production solution: simple, zero overhead, and never wastes CPU. * Solution 2 -- Global arbitrator: A single mutex protects the "pick up forks" operation. Breaks hold-and-wait but creates a bottleneck -- only one philosopher can attempt to eat at a time. * Solution 3 -- Chandy-Misra: A distributed protocol using dirty/clean fork states. Fully distributed, no central coordinator, designed for systems where global ordering is impractical. But complex to implement correctly. * In production, I use resource ordering every time unless the system is truly distributed. It is easy to reason about, verifiable by lockdep, zero runtime overhead, and scales perfectly. **Follow-up: How does lock ordering work when you have thousands of lock instances, like per-row locks in a database?** You use a natural ordering based on resource identity: lock by table name first, then by primary key value. When a transaction needs to update rows in tables A and B, it acquires all locks on A's rows first (in primary key order), then B's rows. Alternatively, the database engine allows arbitrary lock ordering but runs periodic deadlock detection on the wait-for graph, aborting the youngest transaction when a cycle is found. This is what PostgreSQL does -- it is impossible to enforce ordering across arbitrary user queries, so detection and recovery is the only practical approach.