Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Kernel Synchronization
The Linux kernel is massively concurrent — multiple CPUs executing kernel code simultaneously, interrupts preempting at any moment, and softirqs running in parallel. Understanding synchronization primitives is essential for reading kernel code and debugging race conditions. The analogy: Imagine a busy restaurant kitchen (the kernel). Multiple chefs (CPUs) share the same cutting boards and stoves (shared data). Without coordination, two chefs reach for the same knife (race condition), or one chef waits for a stove that another chef already reserved but walked away from (deadlock). Kernel synchronization is the set of rules and tools that prevent these disasters — and just like in a kitchen, different situations call for different tools. A quick “hands off for one second” (spinlock) works for a short task, but for a 20-minute simmer (sleeping operation), you need a reservation system (mutex) so other chefs can use the stove while waiting.Key Topics: Spinlocks, mutexes, RCU, memory barriers, deadlock debugging
Time to Master: 14-16 hours
Why Kernel Synchronization is Different
User-space synchronization is simpler because:- Processes are preemptible (scheduler handles it)
- Only threads of the same process share address space
- You can always sleep when waiting for a lock
- Interrupt context: Cannot sleep, must use spinlocks
- Preemption: Kernel code can be preempted (on PREEMPT kernels)
- Multiple CPUs: True parallelism, not just concurrency
- Diverse contexts: Process context, softirq, hardirq - different rules for each
Synchronization Contexts
Atomic Operations
The foundation of all synchronization - operations that cannot be interrupted:Atomic Types
When to Use Atomic Operations
Spinlocks
The most basic lock — spins in a loop waiting for the lock. The analogy: A spinlock is like standing outside a single-occupancy bathroom door, jiggling the handle every millisecond. You’re not doing anything productive while waiting (burning CPU cycles), but the moment the door unlocks, you’re in instantly. This makes spinlocks ideal for very short critical sections (microseconds) where the overhead of “going to sit down and waiting to be called” (sleeping, as mutexes do) would be more expensive than just standing there. The critical rule: Never sleep while holding a spinlock. If you go to sleep, you’re holding the bathroom door locked while napping on the couch — nobody else can get in, and on a uniprocessor system, nobody can wake you up either because the scheduler itself might need that lock.Basic Spinlock
Spinlock Variants
Spinlock Selection Guide
Reader-Writer Locks
When reads are much more common than writes:Read-Write Spinlocks
Seqlock (Sequence Lock)
For data with very frequent reads and rare writes:- Data is read very frequently
- Writes are rare
- Readers can tolerate occasional retries
- Example: System time (jiffies)
Mutexes
For when you can sleep - more efficient than spinlocks for long waits:Mutex Rules
- Only the owner can unlock: Unlike spinlocks, mutexes track ownership
- Cannot be used in interrupt context: They might sleep
- Cannot be held across sleep with spinlock: Spinlock holder cannot sleep
- Recursive locking forbidden: Will deadlock
Semaphores
Counting semaphores for limiting concurrency:- Limiting concurrent I/O operations
- Resource pools (e.g., connection pool)
- Producer-consumer with bounded buffer
RCU (Read-Copy-Update)
RCU is Linux’s secret weapon for read-mostly data structures. It allows readers to access data without locks while writers make copies and update pointers atomically.RCU Concept
RCU API
RCU-Protected List
When to Use RCU
| Use Case | Good for RCU? | Why |
|---|---|---|
| Read-mostly data | ✅ Yes | Readers are lockless |
| Frequently updated data | ❌ No | Writers block waiting for grace period |
| Large data structures | ⚠️ Depends | Copying cost may be high |
| Pointer-based updates | ✅ Yes | Single pointer update is atomic |
| Field-by-field updates | ❌ No | Need copy-on-write for consistency |
Memory Barriers
Compilers and CPUs reorder memory operations for performance. Barriers prevent unwanted reordering. The analogy: Imagine you’re writing instructions on a whiteboard for a coworker. You write “Step 1: Prepare data” and “Step 2: Set flag to ready.” You assume they’ll read top-to-bottom. But the compiler is like an overeager assistant who rearranges your instructions for “efficiency” — it might move Step 2 before Step 1 because writing the flag is simpler. The CPU is even worse: its out-of-order execution pipeline might physically execute Step 2 before Step 1. Memory barriers are your way of saying “No, really, this order matters.” Why this is hard: On x86 (which has a relatively strong memory model), you can often get away without explicit barriers. But on ARM or RISC-V, the memory model is weaker, and reordering is aggressive. Code that works perfectly on your x86 dev machine can have subtle race conditions on ARM servers. This is why the kernel usessmp_wmb(), smp_rmb(), and smp_mb() instead of architecture-specific instructions — they compile to the right barrier for each platform.
Why Memory Barriers?
Barrier Types
Common Barrier Patterns
Per-CPU Variables
Avoid locking entirely by giving each CPU its own copy:Deadlock Prevention and Lockdep
Lock Ordering Rules
Lockdep: Lock Validator
Linux has a runtime lock validator that catches potential deadlocks:Lockdep Output Example
Lock Annotations
Lock Statistics
Complete Example: Thread-Safe Data Structure
Interview Questions
Q: When would you use a spinlock vs a mutex?
Q: When would you use a spinlock vs a mutex?
- Critical section is very short (microseconds)
- Code might run in interrupt context
- Cannot sleep (holding other spinlocks)
- Lock contention is low
- Critical section might be long
- Code can sleep (e.g., waiting for I/O)
- Always in process context
- Better for longer waits (sleeping is more efficient than spinning)
Q: Explain RCU and when you'd use it
Q: Explain RCU and when you'd use it
- Readers use
rcu_read_lock()- just disables preemption, no actual lock - Writers copy data, modify copy, atomically update pointer
- Writers wait for grace period (all pre-existing readers finish)
- Writers free old data
- Reads vastly outnumber writes
- Can afford to copy data on write
- Updates are pointer-based (not field-by-field)
Q: What causes a deadlock and how do you prevent it?
Q: What causes a deadlock and how do you prevent it?
- Mutual exclusion: Resource can only be held by one thread
- Hold and wait: Thread holds one lock while waiting for another
- No preemption: Locks cannot be forcibly taken
- Circular wait: A→B→C→A dependency chain
- Lock ordering: Always acquire locks in the same global order
- Lock hierarchy: Document and enforce lock levels
- Try-lock with backoff: Use trylock, release and retry if fails
- Avoid holding locks while calling unknown code
CONFIG_PROVE_LOCKING=y)Q: Why do you need memory barriers?
Q: Why do you need memory barriers?
data from cache.Solution:smp_wmb() (writes), smp_rmb() (reads), smp_mb() (full)Summary Table
| Primitive | Can Sleep? | Use Case | Context |
|---|---|---|---|
| atomic_t | N/A | Simple counters | Any |
| spinlock | No | Short critical sections | Any |
| spin_lock_irq | No | Shared with hardirq | Process/softirq |
| rwlock | No | Read-heavy, short hold | Any |
| seqlock | No | Very frequent reads | Any |
| mutex | Yes | Long critical sections | Process only |
| semaphore | Yes | Counting/limiting | Process only |
| RCU | Readers: No | Read-mostly data | Readers: Any |
| per-cpu | N/A | Avoid sharing | Any |
Debugging Tips
Common Misconceptions
| Misconception | Reality |
|---|---|
| ”Spinlocks are always faster than mutexes” | Only true for critical sections shorter than ~5-10 microseconds. For anything longer, mutexes win because they yield the CPU instead of burning cycles |
| ”RCU has no overhead on the read path” | Almost true. rcu_read_lock() is just preempt_disable() — a single thread flag change. But on PREEMPT_RT kernels, it becomes a real lock. On standard kernels, the overhead is ~1 nanosecond |
| ”Memory barriers are only needed on ARM/weak architectures” | x86 has a strong memory model (TSO), but it still reorders stores past loads. The compiler can also reorder anything. You always need READ_ONCE()/WRITE_ONCE() at minimum, even on x86 |
| ”Lockdep only works in debug kernels” | True that you need CONFIG_PROVE_LOCKING=y, but many distros (Ubuntu, Fedora) ship debug kernels with lockdep enabled. Use them in CI/testing |
| ”Atomic operations are always lock-free” | On x86, atomic_inc() uses a LOCK prefix which temporarily locks the bus/cache line. On heavily contended counters, this can be as slow as a spinlock. Per-CPU variables are truly contention-free |
Key Takeaways
- Context matters: Know if you can sleep (affects lock choice)
- Spin vs sleep: Spin for microseconds, sleep for longer
- RCU is magic: Use it for read-mostly data structures
- Lock ordering: Document and follow it religiously
- Use lockdep: It catches bugs before production
- Barriers are subtle: Prefer higher-level primitives when possible
Next Steps
- Interrupts → - See where spinlocks are essential
- Process Subsystem → - Scheduler uses many of these primitives
- Data Structures → - RCU-protected data structures
Interview Deep-Dive
You are reviewing a kernel patch that uses spin_lock() to protect a data structure accessed from both process context and a network driver's NAPI poll handler. What bug does this introduce, and how would you fix it?
You are reviewing a kernel patch that uses spin_lock() to protect a data structure accessed from both process context and a network driver's NAPI poll handler. What bug does this introduce, and how would you fix it?
- This is a classic deadlock scenario. NAPI poll runs in softirq context (NET_RX_SOFTIRQ). If a process holds the
spin_lock()and a softirq fires on the same CPU, the softirq handler will try to acquire the same lock. Since spinlocks are not reentrant and the lock holder is spinning at a lower priority level that cannot run until the softirq completes, we have a deadlock: the softirq waits for the lock, and the lock holder waits for the softirq to finish. - The fix is to use
spin_lock_bh()(disable bottom halves) in the process context code. This disables softirq processing on the local CPU while the lock is held, preventing the NAPI handler from running on the same CPU. The NAPI handler on other CPUs can still contend for the lock normally usingspin_lock(), and since they are already in softirq context, they do not need to disable softirqs. - If the data structure were also accessed from a hard IRQ handler, we would need
spin_lock_irqsave()in both process and softirq contexts, which disables all interrupts on the local CPU while holding the lock. - The general rule: the lock variant must disable any higher-priority context that also acquires the lock. Process context must disable softirqs (if shared with softirq) or IRQs (if shared with hardirq). Softirq context must disable IRQs (if shared with hardirq). Hardirq context needs no disabling (nothing higher priority can preempt).
- Lockdep tracks the context in which each lock is acquired (process, softirq, hardirq) using lock class annotations. When
spin_lock()is first acquired in process context, lockdep records that this lock class can be taken in process context. When the same lock class is later acquired in softirq context (NAPI poll), lockdep detects that the lock is used in both contexts and that the process-context acquisition did not disable softirqs. It then prints a warning like “inconsistent lock state — this lock was taken in softirq context, but was acquired in process context without disabling softirqs, possible deadlock.” Lockdep does this analysis statically based on lock class relationships, so it can detect the bug the first time both code paths are exercised, even if the actual deadlock timing never occurred.
Explain memory barriers and why they are necessary on modern multi-core CPUs. Give a concrete example of a bug that only manifests without barriers and would pass all tests on a single-core machine.
Explain memory barriers and why they are necessary on modern multi-core CPUs. Give a concrete example of a bug that only manifests without barriers and would pass all tests on a single-core machine.
- Modern CPUs have store buffers that batch writes before committing them to the cache hierarchy, and they may reorder memory operations for performance. On a single core, this reordering is invisible because the CPU maintains the illusion of sequential execution for a single thread. But on multi-core systems, one CPU’s store buffer contents are not visible to other CPUs until they are flushed to cache, and the order in which writes become visible may differ from program order.
- Concrete example: a producer-consumer pattern with a flag.
- Producer (CPU 0):
data = 42; ready = 1;— The CPU may reorder these writes, or the store buffer may flushready = 1beforedata = 42, because they are to independent memory locations. - Consumer (CPU 1):
while (!ready); use(data);— Even if the consumer seesready == 1, it might still read the old value ofdata(0) because the data write has not yet propagated to CPU 1’s cache. - On a single-core machine, both operations run on the same CPU, so the store buffer is always consistent with the local view — this bug never manifests.
- The fix:
WRITE_ONCE(data, 42); smp_wmb(); WRITE_ONCE(ready, 1);on the producer, andwhile (!READ_ONCE(ready)); smp_rmb(); use(READ_ONCE(data));on the consumer. Thesmp_wmb()ensures the data write is committed before the ready write, andsmp_rmb()ensures the consumer reads data after reading ready. READ_ONCEandWRITE_ONCEprevent the compiler from optimizing away or reordering the accesses. Thesmp_*mb()barriers prevent the CPU from reordering across the barrier.
- Acquire/release semantics provide a lighter-weight alternative.
smp_store_release(&ready, 1)on the producer ensures all prior writes (includingdata = 42) are visible before the release store.smp_load_acquire(&ready)on the consumer ensures all subsequent reads (includingdata) see writes that happened before the corresponding release. This is conceptually simpler: the programmer thinks in terms of “publish” (release) and “consume” (acquire) rather than manually placing barriers between specific operations. On x86, release stores and acquire loads are naturally ordered by the hardware’s TSO (Total Store Order) memory model, so these compile to just regular loads/stores with compiler barriers — no hardware fence instructions needed. On ARM, they compile to actual barrier instructions (dmb ish).
A kernel developer proposes replacing all spinlocks in a subsystem with mutexes 'because mutexes are more efficient since they sleep instead of spinning.' Evaluate this claim and explain when it would actually make things worse.
A kernel developer proposes replacing all spinlocks in a subsystem with mutexes 'because mutexes are more efficient since they sleep instead of spinning.' Evaluate this claim and explain when it would actually make things worse.
- The claim is partially correct but dangerously oversimplified. Mutexes are more efficient when the critical section is long (milliseconds) because sleeping yields the CPU to other productive work instead of wasting cycles spinning. However, for short critical sections (microseconds), mutexes are significantly worse.
- The cost of a mutex sleep/wake cycle: the blocking thread calls
schedule()which involves saving all registers, switching the stack, running the scheduler’spick_next_task(), restoring the new task’s context, and flushing the pipeline. When the mutex is released, the sleeping thread must be woken (wake_up_process()), re-scheduled, and context-switched back in. This costs approximately 2-10 microseconds total (two context switches). If the critical section is 100 nanoseconds, the overhead of sleeping is 20-100 times longer than just spinning. - Additionally, mutexes cannot be used in interrupt context (they might sleep), softirq context, or while holding a spinlock. Replacing spinlocks with mutexes in code paths that can be called from interrupt handlers would crash the kernel.
- Modern spinlock implementations are not pure spin loops. They use adaptive spinning: if the lock holder is running on another CPU, spin briefly (since the holder is likely to release soon). If the holder is not running, or after a timeout, fall back to sleeping. The kernel’s
mutex_lock()itself uses this adaptive approach for the initial contention check. - My evaluation: use mutexes when the critical section may sleep or is longer than ~10 microseconds. Use spinlocks for short critical sections, anything callable from interrupt context, and when holding the lock for less time than a context switch costs.
- Mutexes have strict ownership: the thread that locked the mutex must be the one to unlock it. The kernel tracks the owner in
mutex->owner. Attempting to unlock a mutex from a different thread triggers a warning. This ownership tracking enables lockdep to detect deadlocks involving mutexes (it knows which thread holds which mutex). Spinlocks do not track ownership — any context can unlock them. This is by design (an interrupt handler might need to release a lock acquired in process context), but it means lockdep has less information for deadlock detection with raw spinlocks. The practical implication: mutex bugs are easier to debug because the kernel will catch ownership violations, while spinlock bugs (especially across contexts) can silently cause data corruption before detection.