Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Kernel Synchronization

The Linux kernel is massively concurrent — multiple CPUs executing kernel code simultaneously, interrupts preempting at any moment, and softirqs running in parallel. Understanding synchronization primitives is essential for reading kernel code and debugging race conditions. The analogy: Imagine a busy restaurant kitchen (the kernel). Multiple chefs (CPUs) share the same cutting boards and stoves (shared data). Without coordination, two chefs reach for the same knife (race condition), or one chef waits for a stove that another chef already reserved but walked away from (deadlock). Kernel synchronization is the set of rules and tools that prevent these disasters — and just like in a kitchen, different situations call for different tools. A quick “hands off for one second” (spinlock) works for a short task, but for a 20-minute simmer (sleeping operation), you need a reservation system (mutex) so other chefs can use the stove while waiting.
Interview Frequency: Very High (critical for systems roles)
Key Topics: Spinlocks, mutexes, RCU, memory barriers, deadlock debugging
Time to Master: 14-16 hours

Why Kernel Synchronization is Different

User-space synchronization is simpler because:
  • Processes are preemptible (scheduler handles it)
  • Only threads of the same process share address space
  • You can always sleep when waiting for a lock
Kernel synchronization is harder because:
  • Interrupt context: Cannot sleep, must use spinlocks
  • Preemption: Kernel code can be preempted (on PREEMPT kernels)
  • Multiple CPUs: True parallelism, not just concurrency
  • Diverse contexts: Process context, softirq, hardirq - different rules for each

Synchronization Contexts

┌─────────────────────────────────────────────────────────────────────────────┐
│                   KERNEL EXECUTION CONTEXTS                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Context         Can Sleep?    Preemptible?    Lock Types Allowed           │
│  ────────────────────────────────────────────────────────────────────────   │
│                                                                              │
│  Process         Yes           Yes             mutex, semaphore,             │
│  Context         (GFP_KERNEL)  (CONFIG_       spinlock, rwlock              │
│                                 PREEMPT)                                    │
│                                                                              │
│  Softirq/        No            No              spinlock (with               │
│  Tasklet                                       IRQ disabling if             │
│                                                sharing with hardirq)        │
│                                                                              │
│  Hardirq         No            No              spin_lock_irqsave            │
│                  (atomic)                      only                         │
│                                                                              │
│  NMI             No            No              raw_spinlock,                │
│                  (super                        trylock only                 │
│                   atomic)                                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Atomic Operations

The foundation of all synchronization - operations that cannot be interrupted:

Atomic Types

#include <linux/atomic.h>

// Basic atomic integer
atomic_t counter = ATOMIC_INIT(0);

// 64-bit atomic
atomic64_t big_counter = ATOMIC64_INIT(0);

// Atomic operations
atomic_read(&counter);           // Read value
atomic_set(&counter, 5);         // Set value
atomic_add(10, &counter);        // Add
atomic_sub(3, &counter);         // Subtract
atomic_inc(&counter);            // Increment
atomic_dec(&counter);            // Decrement

// Atomic read-modify-write
int old = atomic_fetch_add(5, &counter);  // Returns old value
int old = atomic_xchg(&counter, 10);       // Exchange

// Atomic test operations
atomic_dec_and_test(&counter);   // Returns true if result is 0
atomic_add_negative(n, &counter); // Returns true if result is negative

// Compare and swap
atomic_cmpxchg(&counter, old, new);  // If counter==old, set to new
atomic_try_cmpxchg(&counter, &old, new);  // Returns bool

When to Use Atomic Operations

// Good: Simple counter
static atomic_t connection_count = ATOMIC_INIT(0);

void on_connect(void) {
    atomic_inc(&connection_count);
}

void on_disconnect(void) {
    atomic_dec(&connection_count);
}

// Bad: Compound operation (not atomic!)
if (atomic_read(&counter) == 0) {  // RACE!
    atomic_inc(&counter);          // Another CPU could change it
}

// Good: Use atomic primitive
atomic_cmpxchg(&counter, 0, 1);  // Atomic "compare and set if zero"

Spinlocks

The most basic lock — spins in a loop waiting for the lock. The analogy: A spinlock is like standing outside a single-occupancy bathroom door, jiggling the handle every millisecond. You’re not doing anything productive while waiting (burning CPU cycles), but the moment the door unlocks, you’re in instantly. This makes spinlocks ideal for very short critical sections (microseconds) where the overhead of “going to sit down and waiting to be called” (sleeping, as mutexes do) would be more expensive than just standing there. The critical rule: Never sleep while holding a spinlock. If you go to sleep, you’re holding the bathroom door locked while napping on the couch — nobody else can get in, and on a uniprocessor system, nobody can wake you up either because the scheduler itself might need that lock.

Basic Spinlock

#include <linux/spinlock.h>

// Static initialization
static DEFINE_SPINLOCK(my_lock);

// Dynamic initialization
spinlock_t my_lock;
spin_lock_init(&my_lock);

// Basic usage
spin_lock(&my_lock);
// ... critical section ...
spin_unlock(&my_lock);

// With trylock (non-blocking)
if (spin_trylock(&my_lock)) {
    // ... critical section ...
    spin_unlock(&my_lock);
} else {
    // Lock not available, handle gracefully
}

Spinlock Variants

// Basic (use in process context when not sharing with interrupts)
spin_lock(&lock);
spin_unlock(&lock);

// Disable softirqs (use when sharing with softirq context)
spin_lock_bh(&lock);      // Disable bottom halves
spin_unlock_bh(&lock);    // Re-enable bottom halves

// Disable all interrupts (use when sharing with hardirq context)
unsigned long flags;
spin_lock_irqsave(&lock, flags);     // Save IRQ state, disable IRQs
// ... critical section ...
spin_unlock_irqrestore(&lock, flags); // Restore IRQ state

// Disable interrupts (unconditionally)
spin_lock_irq(&lock);     // Assumes IRQs were enabled
spin_unlock_irq(&lock);   // Re-enables IRQs

// Warning: Don't nest irq variants wrong!
spin_lock_irq(&lock1);
spin_lock_irq(&lock2);    // WRONG! First unlock will re-enable IRQs
spin_unlock_irq(&lock2);  // IRQs now enabled, but lock1 still held!
spin_unlock_irq(&lock1);

// Correct: Use irqsave for inner locks
spin_lock_irq(&lock1);
spin_lock(&lock2);        // IRQs already disabled
spin_unlock(&lock2);
spin_unlock_irq(&lock1);

Spinlock Selection Guide

┌─────────────────────────────────────────────────────────────────────────────┐
│                   WHICH SPINLOCK VARIANT TO USE?                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Q: Is the lock ever acquired in hardirq context?                           │
│  │                                                                           │
│  ├─ YES → Always use spin_lock_irqsave() / spin_unlock_irqrestore()         │
│  │        (Even from process context - must disable IRQs to prevent         │
│  │         deadlock if interrupt tries to acquire same lock)                │
│  │                                                                           │
│  └─ NO ──► Is the lock ever acquired in softirq/tasklet context?            │
│            │                                                                 │
│            ├─ YES → Use spin_lock_bh() / spin_unlock_bh() from process ctx  │
│            │        Use spin_lock() / spin_unlock() from softirq ctx        │
│            │                                                                 │
│            └─ NO ──► Use spin_lock() / spin_unlock()                        │
│                      (But consider if a mutex would be better)              │
│                                                                              │
│  DANGER: Never sleep while holding a spinlock!                               │
│  The CPU spins, so other CPUs waste time waiting.                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Reader-Writer Locks

When reads are much more common than writes:

Read-Write Spinlocks

#include <linux/rwlock.h>

static DEFINE_RWLOCK(my_rwlock);

// Reader (shared access)
read_lock(&my_rwlock);
// ... read data ...
read_unlock(&my_rwlock);

// Writer (exclusive access)
write_lock(&my_rwlock);
// ... modify data ...
write_unlock(&my_rwlock);

// IRQ-safe variants
read_lock_irqsave(&my_rwlock, flags);
write_lock_irqsave(&my_rwlock, flags);

Seqlock (Sequence Lock)

For data with very frequent reads and rare writes:
#include <linux/seqlock.h>

static DEFINE_SEQLOCK(my_seqlock);

// Writer (takes exclusive lock)
write_seqlock(&my_seqlock);
// ... modify data ...
write_sequnlock(&my_seqlock);

// Reader (lockless, may need to retry)
unsigned int seq;
do {
    seq = read_seqbegin(&my_seqlock);
    // ... read data into local variables ...
} while (read_seqretry(&my_seqlock, seq));

// If read_seqretry returns true, a write occurred during read
// and we must retry
When to use seqlock:
  • Data is read very frequently
  • Writes are rare
  • Readers can tolerate occasional retries
  • Example: System time (jiffies)

Mutexes

For when you can sleep - more efficient than spinlocks for long waits:
#include <linux/mutex.h>

// Static initialization
static DEFINE_MUTEX(my_mutex);

// Dynamic initialization
struct mutex my_mutex;
mutex_init(&my_mutex);

// Basic usage
mutex_lock(&my_mutex);           // Sleep if locked
// ... critical section (can sleep here!) ...
mutex_unlock(&my_mutex);

// Interruptible (can be interrupted by signals)
if (mutex_lock_interruptible(&my_mutex)) {
    return -ERESTARTSYS;  // Signal received
}
// ... critical section ...
mutex_unlock(&my_mutex);

// Killable (only SIGKILL can interrupt)
if (mutex_lock_killable(&my_mutex)) {
    return -ERESTARTSYS;
}

// Trylock (non-blocking)
if (mutex_trylock(&my_mutex)) {
    // ... got the lock ...
    mutex_unlock(&my_mutex);
} else {
    // Lock not available
}

Mutex Rules

  1. Only the owner can unlock: Unlike spinlocks, mutexes track ownership
  2. Cannot be used in interrupt context: They might sleep
  3. Cannot be held across sleep with spinlock: Spinlock holder cannot sleep
  4. Recursive locking forbidden: Will deadlock
// WRONG: Recursive locking
mutex_lock(&my_mutex);
foo();  // foo() also calls mutex_lock(&my_mutex) → DEADLOCK!
mutex_unlock(&my_mutex);

// WRONG: Holding mutex in hardirq
irqreturn_t my_irq_handler(int irq, void *dev_id) {
    mutex_lock(&my_mutex);  // BUG: Cannot sleep in IRQ context!
    // ...
}

Semaphores

Counting semaphores for limiting concurrency:
#include <linux/semaphore.h>

// Initialize with count
static DEFINE_SEMAPHORE(my_sem, 3);  // Allow 3 concurrent accessors

struct semaphore my_sem;
sema_init(&my_sem, 3);

// Acquire (decrement)
down(&my_sem);              // Sleep if count is 0
down_interruptible(&my_sem); // Can be interrupted
down_trylock(&my_sem);      // Non-blocking

// Release (increment)
up(&my_sem);
Use cases:
  • Limiting concurrent I/O operations
  • Resource pools (e.g., connection pool)
  • Producer-consumer with bounded buffer

RCU (Read-Copy-Update)

RCU is Linux’s secret weapon for read-mostly data structures. It allows readers to access data without locks while writers make copies and update pointers atomically.

RCU Concept

┌─────────────────────────────────────────────────────────────────────────────┐
│                        RCU MECHANISM                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Traditional Locking:                                                        │
│  ───────────────────                                                        │
│  Reader 1: [====lock====][read][====unlock====]                             │
│  Reader 2:                    [wait...][====lock====][read][unlock]         │
│  Writer:   [wait..................][lock][write][unlock]                    │
│                                                                              │
│  RCU:                                                                        │
│  ─────                                                                       │
│  Reader 1: [rcu_read_lock][read old data][rcu_read_unlock]                  │
│  Reader 2:     [rcu_read_lock][read][rcu_read_unlock]                       │
│  Writer:   [copy][update copy][publish ptr][wait for readers][free old]    │
│            │                  │            │                                 │
│            │                  │            └─ synchronize_rcu()             │
│            │                  └─ rcu_assign_pointer()                       │
│            └─ kmalloc + copy data                                           │
│                                                                              │
│  Key Insight:                                                                │
│  • Readers never wait                                                       │
│  • Writers copy, update pointer atomically, wait, then free old            │
│  • "Grace period" = time for all pre-existing readers to finish            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

RCU API

#include <linux/rcupdate.h>

// Reader-side
rcu_read_lock();
struct my_data *ptr = rcu_dereference(global_ptr);
// ... use ptr (do not dereference after rcu_read_unlock) ...
rcu_read_unlock();

// Writer-side (update)
struct my_data *new = kmalloc(sizeof(*new), GFP_KERNEL);
*new = *old;                    // Copy old data
new->field = new_value;         // Modify
rcu_assign_pointer(global_ptr, new);  // Atomic publish
synchronize_rcu();              // Wait for readers
kfree(old);                     // Safe to free now

// Asynchronous free (don't wait)
call_rcu(&old->rcu_head, my_free_callback);

RCU-Protected List

#include <linux/rculist.h>

struct my_entry {
    int data;
    struct list_head list;
    struct rcu_head rcu;  // For call_rcu
};

static LIST_HEAD(my_list);
static DEFINE_SPINLOCK(list_lock);  // Protects writers

// Reader (lockless)
void read_list(void)
{
    struct my_entry *entry;
    
    rcu_read_lock();
    list_for_each_entry_rcu(entry, &my_list, list) {
        process(entry->data);
    }
    rcu_read_unlock();
}

// Writer (needs lock for mutual exclusion with other writers)
void add_entry(int data)
{
    struct my_entry *new = kmalloc(sizeof(*new), GFP_KERNEL);
    new->data = data;
    
    spin_lock(&list_lock);
    list_add_rcu(&new->list, &my_list);
    spin_unlock(&list_lock);
}

// Delete with RCU
static void entry_free_rcu(struct rcu_head *head)
{
    struct my_entry *entry = container_of(head, struct my_entry, rcu);
    kfree(entry);
}

void remove_entry(struct my_entry *entry)
{
    spin_lock(&list_lock);
    list_del_rcu(&entry->list);
    spin_unlock(&list_lock);
    
    call_rcu(&entry->rcu, entry_free_rcu);  // Free after grace period
}

When to Use RCU

Use CaseGood for RCU?Why
Read-mostly data✅ YesReaders are lockless
Frequently updated data❌ NoWriters block waiting for grace period
Large data structures⚠️ DependsCopying cost may be high
Pointer-based updates✅ YesSingle pointer update is atomic
Field-by-field updates❌ NoNeed copy-on-write for consistency

Memory Barriers

Compilers and CPUs reorder memory operations for performance. Barriers prevent unwanted reordering. The analogy: Imagine you’re writing instructions on a whiteboard for a coworker. You write “Step 1: Prepare data” and “Step 2: Set flag to ready.” You assume they’ll read top-to-bottom. But the compiler is like an overeager assistant who rearranges your instructions for “efficiency” — it might move Step 2 before Step 1 because writing the flag is simpler. The CPU is even worse: its out-of-order execution pipeline might physically execute Step 2 before Step 1. Memory barriers are your way of saying “No, really, this order matters.” Why this is hard: On x86 (which has a relatively strong memory model), you can often get away without explicit barriers. But on ARM or RISC-V, the memory model is weaker, and reordering is aggressive. Code that works perfectly on your x86 dev machine can have subtle race conditions on ARM servers. This is why the kernel uses smp_wmb(), smp_rmb(), and smp_mb() instead of architecture-specific instructions — they compile to the right barrier for each platform.

Why Memory Barriers?

// Without barriers, this might be reordered!
int ready = 0;
int data = 0;

// CPU 1 (writer)
data = 42;          // Might execute after ready=1!
ready = 1;

// CPU 2 (reader)
while (!ready);     // Might read ready=1 but...
print(data);        // ...data still shows 0!

// With barriers
// CPU 1 (writer)
data = 42;
smp_wmb();          // Write barrier: data visible before ready
ready = 1;

// CPU 2 (reader)
while (!ready);
smp_rmb();          // Read barrier: read data after ready
print(data);        // Now guaranteed to see 42

Barrier Types

#include <linux/barrier.h>

// Full memory barrier (both reads and writes)
mb();           // Barrier for all memory types
smp_mb();       // Only needed on SMP systems

// Write barrier (writes before cannot pass writes after)
wmb();
smp_wmb();

// Read barrier (reads before cannot pass reads after)
rmb();
smp_rmb();

// Acquire/Release semantics (lighter than full barriers)
smp_load_acquire(&variable);   // All reads after this see writes before corresponding release
smp_store_release(&variable, value);  // All writes before this visible to loads after acquire

// Compiler-only barrier (doesn't affect CPU)
barrier();      // Prevents compiler reordering
READ_ONCE(x);   // Prevent compiler from caching/reordering read
WRITE_ONCE(x, val);  // Prevent compiler from tearing/reordering write

Common Barrier Patterns

// Pattern 1: Flag-guarded data
struct {
    int data;
    int ready;  // Flag
} shared;

// Writer
WRITE_ONCE(shared.data, 42);
smp_wmb();                      // Ensure data written before ready
WRITE_ONCE(shared.ready, 1);

// Reader
while (!READ_ONCE(shared.ready))
    cpu_relax();
smp_rmb();                      // Ensure ready read before data
int value = READ_ONCE(shared.data);

// Pattern 2: Ring buffer
// Producer
buffer[head] = item;
smp_wmb();                      // Item visible before head update
WRITE_ONCE(head, next_head);

// Consumer
while (READ_ONCE(head) == tail)
    cpu_relax();
smp_rmb();                      // Head visible before reading item
item = buffer[tail];

Per-CPU Variables

Avoid locking entirely by giving each CPU its own copy:
#include <linux/percpu.h>

// Static per-CPU variable
static DEFINE_PER_CPU(int, my_counter);

// Dynamic allocation
int __percpu *my_ptr;
my_ptr = alloc_percpu(int);

// Access (must disable preemption!)
void increment_counter(void)
{
    preempt_disable();           // Or get_cpu()
    this_cpu_inc(my_counter);    // Increment this CPU's copy
    preempt_enable();            // Or put_cpu()
}

// Alternative: get_cpu() returns CPU number and disables preemption
void access_percpu(void)
{
    int cpu = get_cpu();
    int *ptr = per_cpu_ptr(&my_counter, cpu);
    (*ptr)++;
    put_cpu();
}

// Read all CPUs (for statistics)
int total = 0;
int cpu;
for_each_possible_cpu(cpu) {
    total += per_cpu(my_counter, cpu);
}

// Atomic per-CPU operations (no preemption disable needed)
this_cpu_add(my_counter, 5);
this_cpu_read(my_counter);
this_cpu_write(my_counter, 0);

Deadlock Prevention and Lockdep

Lock Ordering Rules

┌─────────────────────────────────────────────────────────────────────────────┐
│                       DEADLOCK PREVENTION                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Classic Deadlock:                                                           │
│  ─────────────────                                                          │
│  CPU 0:                      CPU 1:                                         │
│  lock(A)                     lock(B)                                        │
│  lock(B)  ← waits for B      lock(A)  ← waits for A                        │
│  ...                         ...                                            │
│  unlock(B)                   unlock(A)                                      │
│  unlock(A)                   unlock(B)                                      │
│                                                                              │
│  Both CPUs wait forever!                                                     │
│                                                                              │
│  Prevention: Consistent Lock Ordering                                        │
│  ─────────────────────────────────────                                      │
│  Rule: Always acquire locks in the same order                               │
│                                                                              │
│  CPU 0:                      CPU 1:                                         │
│  lock(A)                     lock(A)  ← waits for A                        │
│  lock(B)                     ...      ← eventually gets A                   │
│  ...                         lock(B)                                        │
│  unlock(B)                   ...                                            │
│  unlock(A)                   unlock(B)                                      │
│                              unlock(A)                                      │
│                                                                              │
│  Document your lock ordering!                                                │
│  Example: i_mutex → dentry->d_lock → dcache_lock                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Lockdep: Lock Validator

Linux has a runtime lock validator that catches potential deadlocks:
# Enable in kernel config
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCK_STAT=y

# Check lockdep warnings in dmesg
dmesg | grep -i "lockdep\|deadlock\|held lock"

Lockdep Output Example

=============================================
WARNING: possible recursive locking detected
---------------------------------------------
kworker/0:1/1234 is trying to acquire lock:
 (&my_lock){+.+.}, at: my_function+0x20/0x100

but task is already holding lock:
 (&my_lock){+.+.}, at: my_function+0x10/0x100

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&my_lock);
  lock(&my_lock);   <-- Recursive!

 *** DEADLOCK ***

Lock Annotations

// Annotate lock ordering for lockdep
static DEFINE_MUTEX(outer_lock);
static DEFINE_MUTEX(inner_lock);

void my_function(void)
{
    mutex_lock(&outer_lock);
    mutex_lock_nested(&inner_lock, SINGLE_DEPTH_NESTING);  // Tell lockdep this is intentional
    // ...
    mutex_unlock(&inner_lock);
    mutex_unlock(&outer_lock);
}

// Lock class annotation (for locks that seem same but aren't)
static struct lock_class_key my_lock_key;
__mutex_init(&my_lock, "my_lock", &my_lock_key);

Lock Statistics

# Enable lock statistics
echo 1 > /proc/sys/kernel/lock_stat

# View lock contention
cat /proc/lock_stat

# Sample output:
# class name                 con-bounces    contentions   waittime-min   ...
# &mm->mmap_lock-W:            123456         12345          0.50
# &inode->i_lock:               98765          9876          0.10

Complete Example: Thread-Safe Data Structure

#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_data {
    int value;
    struct rcu_head rcu;
};

struct my_container {
    struct my_data __rcu *current;  // RCU-protected pointer
    spinlock_t update_lock;          // Writer mutual exclusion
};

// Initialize
void my_container_init(struct my_container *c)
{
    struct my_data *data = kzalloc(sizeof(*data), GFP_KERNEL);
    spin_lock_init(&c->update_lock);
    rcu_assign_pointer(c->current, data);
}

// Read (lockless, can be called from any context including IRQ)
int my_container_read(struct my_container *c)
{
    int value;
    
    rcu_read_lock();
    struct my_data *data = rcu_dereference(c->current);
    value = data->value;
    rcu_read_unlock();
    
    return value;
}

// Update (process context only, might sleep)
void my_container_update(struct my_container *c, int new_value)
{
    struct my_data *old, *new;
    
    new = kmalloc(sizeof(*new), GFP_KERNEL);
    if (!new)
        return;
    
    spin_lock(&c->update_lock);
    
    old = rcu_dereference_protected(c->current, 
                                    lockdep_is_held(&c->update_lock));
    new->value = new_value;
    rcu_assign_pointer(c->current, new);
    
    spin_unlock(&c->update_lock);
    
    synchronize_rcu();  // Wait for readers
    kfree(old);
}

Interview Questions

Answer:Use spinlock when:
  • Critical section is very short (microseconds)
  • Code might run in interrupt context
  • Cannot sleep (holding other spinlocks)
  • Lock contention is low
Use mutex when:
  • Critical section might be long
  • Code can sleep (e.g., waiting for I/O)
  • Always in process context
  • Better for longer waits (sleeping is more efficient than spinning)
Key difference: Spinlocks busy-wait (waste CPU), mutexes sleep (yield CPU). Spinlocks are faster for very short holds; mutexes are better for anything longer.
Answer:RCU (Read-Copy-Update) provides lockless reads for read-mostly data:How it works:
  1. Readers use rcu_read_lock() - just disables preemption, no actual lock
  2. Writers copy data, modify copy, atomically update pointer
  3. Writers wait for grace period (all pre-existing readers finish)
  4. Writers free old data
Use when:
  • Reads vastly outnumber writes
  • Can afford to copy data on write
  • Updates are pointer-based (not field-by-field)
Examples: Routing tables, configuration data, module listsAdvantage: Zero overhead for readers on the fast path
Answer:Four conditions for deadlock (all must be true):
  1. Mutual exclusion: Resource can only be held by one thread
  2. Hold and wait: Thread holds one lock while waiting for another
  3. No preemption: Locks cannot be forcibly taken
  4. Circular wait: A→B→C→A dependency chain
Prevention strategies:
  1. Lock ordering: Always acquire locks in the same global order
  2. Lock hierarchy: Document and enforce lock levels
  3. Try-lock with backoff: Use trylock, release and retry if fails
  4. Avoid holding locks while calling unknown code
Detection: Use lockdep (CONFIG_PROVE_LOCKING=y)
Answer:Problem: CPUs and compilers reorder memory operations for performance. On multi-core systems, this can cause one CPU to see operations in a different order than another CPU performed them.Example without barriers:
// CPU 1                    // CPU 2
data = 42;                  while (!flag);
flag = 1;                   print(data);  // Might print 0!
Why it fails: CPU 1 might reorder (flag=1 before data=42), or CPU 2 might read stale data from cache.Solution:
// CPU 1                    // CPU 2
data = 42;                  while (!READ_ONCE(flag));
smp_wmb();                  smp_rmb();
WRITE_ONCE(flag, 1);        print(data);  // Now guaranteed 42
Barrier types: smp_wmb() (writes), smp_rmb() (reads), smp_mb() (full)

Summary Table

PrimitiveCan Sleep?Use CaseContext
atomic_tN/ASimple countersAny
spinlockNoShort critical sectionsAny
spin_lock_irqNoShared with hardirqProcess/softirq
rwlockNoRead-heavy, short holdAny
seqlockNoVery frequent readsAny
mutexYesLong critical sectionsProcess only
semaphoreYesCounting/limitingProcess only
RCUReaders: NoRead-mostly dataReaders: Any
per-cpuN/AAvoid sharingAny

Debugging Tips

# Enable lockdep to catch lock ordering bugs at runtime
# These must be set in kernel config (CONFIG_PROVE_LOCKING=y)
# Check if your kernel has lockdep enabled:
cat /proc/lockdep_stats 2>/dev/null && echo "lockdep enabled" || echo "lockdep not available"

# View lockdep warnings (these are always bugs, never false positives)
dmesg | grep -A 20 "possible.*deadlock\|inconsistent lock state\|circular locking"

# View lock contention statistics
echo 1 | sudo tee /proc/sys/kernel/lock_stat
cat /proc/lock_stat | head -30
# Look for high "contentions" and "waittime-avg" -- these are your bottlenecks

# Trace lock hold times with bpftrace (find who holds locks too long)
sudo bpftrace -e '
kprobe:mutex_lock { @start[tid] = nsecs; }
kretprobe:mutex_lock /@start[tid]/ {
    $dur = (nsecs - @start[tid]) / 1000;
    if ($dur > 1000) { // > 1ms acquisition time
        printf("SLOW mutex_lock: %s held for %d us\n", comm, $dur);
    }
    delete(@start[tid]);
}'

# Check for RCU stalls (sign of preemption issues or long critical sections)
dmesg | grep "rcu.*stall"
# RCU stalls typically mean a CPU is stuck in a loop with preemption disabled

# Monitor spinlock contention in real-time
sudo perf lock record -a sleep 5
sudo perf lock report
Debugging Deadlocks in Production: If a system hangs and you suspect deadlock, check /proc/sysrq-trigger. Write ‘t’ to dump all task states: echo t | sudo tee /proc/sysrq-trigger, then check dmesg for tasks in “D” state (uninterruptible sleep) and their stack traces. Two tasks waiting on each other’s locks will show up with matching lock addresses in their stacks.

Common Misconceptions

MisconceptionReality
”Spinlocks are always faster than mutexes”Only true for critical sections shorter than ~5-10 microseconds. For anything longer, mutexes win because they yield the CPU instead of burning cycles
”RCU has no overhead on the read path”Almost true. rcu_read_lock() is just preempt_disable() — a single thread flag change. But on PREEMPT_RT kernels, it becomes a real lock. On standard kernels, the overhead is ~1 nanosecond
”Memory barriers are only needed on ARM/weak architectures”x86 has a strong memory model (TSO), but it still reorders stores past loads. The compiler can also reorder anything. You always need READ_ONCE()/WRITE_ONCE() at minimum, even on x86
”Lockdep only works in debug kernels”True that you need CONFIG_PROVE_LOCKING=y, but many distros (Ubuntu, Fedora) ship debug kernels with lockdep enabled. Use them in CI/testing
”Atomic operations are always lock-free”On x86, atomic_inc() uses a LOCK prefix which temporarily locks the bus/cache line. On heavily contended counters, this can be as slow as a spinlock. Per-CPU variables are truly contention-free

Key Takeaways

  1. Context matters: Know if you can sleep (affects lock choice)
  2. Spin vs sleep: Spin for microseconds, sleep for longer
  3. RCU is magic: Use it for read-mostly data structures
  4. Lock ordering: Document and follow it religiously
  5. Use lockdep: It catches bugs before production
  6. Barriers are subtle: Prefer higher-level primitives when possible

Next Steps


Interview Deep-Dive

Strong Answer:
  • This is a classic deadlock scenario. NAPI poll runs in softirq context (NET_RX_SOFTIRQ). If a process holds the spin_lock() and a softirq fires on the same CPU, the softirq handler will try to acquire the same lock. Since spinlocks are not reentrant and the lock holder is spinning at a lower priority level that cannot run until the softirq completes, we have a deadlock: the softirq waits for the lock, and the lock holder waits for the softirq to finish.
  • The fix is to use spin_lock_bh() (disable bottom halves) in the process context code. This disables softirq processing on the local CPU while the lock is held, preventing the NAPI handler from running on the same CPU. The NAPI handler on other CPUs can still contend for the lock normally using spin_lock(), and since they are already in softirq context, they do not need to disable softirqs.
  • If the data structure were also accessed from a hard IRQ handler, we would need spin_lock_irqsave() in both process and softirq contexts, which disables all interrupts on the local CPU while holding the lock.
  • The general rule: the lock variant must disable any higher-priority context that also acquires the lock. Process context must disable softirqs (if shared with softirq) or IRQs (if shared with hardirq). Softirq context must disable IRQs (if shared with hardirq). Hardirq context needs no disabling (nothing higher priority can preempt).
Follow-up: How would lockdep detect this bug if the developer missed it during code review?Follow-up Answer:
  • Lockdep tracks the context in which each lock is acquired (process, softirq, hardirq) using lock class annotations. When spin_lock() is first acquired in process context, lockdep records that this lock class can be taken in process context. When the same lock class is later acquired in softirq context (NAPI poll), lockdep detects that the lock is used in both contexts and that the process-context acquisition did not disable softirqs. It then prints a warning like “inconsistent lock state — this lock was taken in softirq context, but was acquired in process context without disabling softirqs, possible deadlock.” Lockdep does this analysis statically based on lock class relationships, so it can detect the bug the first time both code paths are exercised, even if the actual deadlock timing never occurred.
Strong Answer:
  • Modern CPUs have store buffers that batch writes before committing them to the cache hierarchy, and they may reorder memory operations for performance. On a single core, this reordering is invisible because the CPU maintains the illusion of sequential execution for a single thread. But on multi-core systems, one CPU’s store buffer contents are not visible to other CPUs until they are flushed to cache, and the order in which writes become visible may differ from program order.
  • Concrete example: a producer-consumer pattern with a flag.
  • Producer (CPU 0): data = 42; ready = 1; — The CPU may reorder these writes, or the store buffer may flush ready = 1 before data = 42, because they are to independent memory locations.
  • Consumer (CPU 1): while (!ready); use(data); — Even if the consumer sees ready == 1, it might still read the old value of data (0) because the data write has not yet propagated to CPU 1’s cache.
  • On a single-core machine, both operations run on the same CPU, so the store buffer is always consistent with the local view — this bug never manifests.
  • The fix: WRITE_ONCE(data, 42); smp_wmb(); WRITE_ONCE(ready, 1); on the producer, and while (!READ_ONCE(ready)); smp_rmb(); use(READ_ONCE(data)); on the consumer. The smp_wmb() ensures the data write is committed before the ready write, and smp_rmb() ensures the consumer reads data after reading ready.
  • READ_ONCE and WRITE_ONCE prevent the compiler from optimizing away or reordering the accesses. The smp_*mb() barriers prevent the CPU from reordering across the barrier.
Follow-up: How do acquire/release semantics simplify this pattern compared to explicit barriers?Follow-up Answer:
  • Acquire/release semantics provide a lighter-weight alternative. smp_store_release(&ready, 1) on the producer ensures all prior writes (including data = 42) are visible before the release store. smp_load_acquire(&ready) on the consumer ensures all subsequent reads (including data) see writes that happened before the corresponding release. This is conceptually simpler: the programmer thinks in terms of “publish” (release) and “consume” (acquire) rather than manually placing barriers between specific operations. On x86, release stores and acquire loads are naturally ordered by the hardware’s TSO (Total Store Order) memory model, so these compile to just regular loads/stores with compiler barriers — no hardware fence instructions needed. On ARM, they compile to actual barrier instructions (dmb ish).
Strong Answer:
  • The claim is partially correct but dangerously oversimplified. Mutexes are more efficient when the critical section is long (milliseconds) because sleeping yields the CPU to other productive work instead of wasting cycles spinning. However, for short critical sections (microseconds), mutexes are significantly worse.
  • The cost of a mutex sleep/wake cycle: the blocking thread calls schedule() which involves saving all registers, switching the stack, running the scheduler’s pick_next_task(), restoring the new task’s context, and flushing the pipeline. When the mutex is released, the sleeping thread must be woken (wake_up_process()), re-scheduled, and context-switched back in. This costs approximately 2-10 microseconds total (two context switches). If the critical section is 100 nanoseconds, the overhead of sleeping is 20-100 times longer than just spinning.
  • Additionally, mutexes cannot be used in interrupt context (they might sleep), softirq context, or while holding a spinlock. Replacing spinlocks with mutexes in code paths that can be called from interrupt handlers would crash the kernel.
  • Modern spinlock implementations are not pure spin loops. They use adaptive spinning: if the lock holder is running on another CPU, spin briefly (since the holder is likely to release soon). If the holder is not running, or after a timeout, fall back to sleeping. The kernel’s mutex_lock() itself uses this adaptive approach for the initial contention check.
  • My evaluation: use mutexes when the critical section may sleep or is longer than ~10 microseconds. Use spinlocks for short critical sections, anything callable from interrupt context, and when holding the lock for less time than a context switch costs.
Follow-up: What are the ownership semantics of mutexes versus spinlocks, and why does this matter for debugging?Follow-up Answer:
  • Mutexes have strict ownership: the thread that locked the mutex must be the one to unlock it. The kernel tracks the owner in mutex->owner. Attempting to unlock a mutex from a different thread triggers a warning. This ownership tracking enables lockdep to detect deadlocks involving mutexes (it knows which thread holds which mutex). Spinlocks do not track ownership — any context can unlock them. This is by design (an interrupt handler might need to release a lock acquired in process context), but it means lockdep has less information for deadlock detection with raw spinlocks. The practical implication: mutex bugs are easier to debug because the kernel will catch ownership violations, while spinlock bugs (especially across contexts) can silently cause data corruption before detection.