Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Advanced Synchronization
For high-concurrency systems, standard mutexes are often too slow. A “Senior” engineer must understand how to eliminate locks entirely using RCU and how to verify lock safety using tools like lockdep. The fundamental tension in synchronization is between correctness and performance. A single global lock guarantees correctness but serializes all work. No locks maximize parallelism but invite data corruption. Every technique in this chapter exists somewhere on that spectrum, trading complexity for throughput.Key Internals: RCU grace periods, lockdep dependency graphs, CAS / CMPXCHG, memory barriers
Prerequisites: Threads & Concurrency, Deadlocks
Time to Master: 12-15 hours
0. From Data Races to Correct Patterns
Before diving into RCU and lock-free algorithms, it helps to see the bugs we are trying to avoid and the idioms that fix them.0.1 Broken Counter: Lost Updates
- On a multi-core system,
counter++compiles to load, increment, store. - Two threads can read the same old value and both write back, losing updates.
0.2 Reader-Writer Pattern
- Anti-pattern: Using a single mutex around a large data structure that is read-mostly.
- Pattern: Use a reader-writer lock or RCU so that many readers can proceed in parallel while writers serialize.
0.3 Producer–Consumer Queue
A classic pattern that combines a mutex with a condition variable:1. RCU (Read-Copy-Update)
RCU is the secret weapon of the Linux kernel. It allows multiple readers to access data simultaneously with zero overhead (no locks, no atomic increments), even while a writer is modifying that data. The analogy: imagine a Wikipedia article. Thousands of people read it simultaneously with no coordination. When an editor wants to change it, they create a new version of the page. Readers who started before the edit see the old version; new readers see the new one. Once no reader is still viewing the old version, it is garbage-collected. Nobody was ever blocked.The RCU Philosophy
- Readers: Don’t use any locks. They just read the data.
- Writers: Instead of modifying data in place, they:
- Create a Copy of the data.
- Update the copy.
- Publish the new pointer atomically (using a memory barrier).
- Defer the deletion of the old data until all current readers are finished.
Grace Periods & Quiescent States
How does the writer know when it’s safe to delete the old data?- Quiescent State: A point in time where a CPU is guaranteed not to be in an RCU read-side critical section (e.g., a context switch, returning to user space, or entering the idle loop). The key insight is that RCU read-side critical sections cannot span these boundaries.
- Grace Period: The time interval during which every CPU has passed through at least one quiescent state. Once the grace period ends, no reader can possibly still have a reference to the old data. Grace periods are typically milliseconds long, which is why RCU trades memory for latency: the old copy lives a bit longer, but readers are never stalled.
struct fib_info), the file descriptor table, and the module list all use RCU. Any time you see rcu_read_lock() / rcu_read_unlock() in kernel code, the data structure is optimized for read-mostly access patterns where millions of lookups per second would make conventional locking a bottleneck.
RCU vs. Reader-Writer Locks
| Feature | Reader-Writer Lock (rwlock) | RCU |
|---|---|---|
| Reader Overhead | Atomic increment (cache line bouncing) | Zero |
| Writer Latency | High (must wait for all readers) | Low (asynchronous deletion) |
| Deadlock Risk | Possible | Impossible for readers |
| Scalability | Poor (limited by cache coherency) | Infinite |
2. Kernel Lock Validation: lockdep
Locking is hard. Deadlocks are harder. The Linux kernel uses a runtime validator called lockdep to prove that a deadlock could happen, even if it hasn’t yet. Think of lockdep as a building inspector who examines the wiring diagram of your house: even if no short circuit has occurred yet, the inspector can identify configurations that would cause one under the right conditions.
How lockdep Works
- Lock Classes: Every lock belongs to a class (e.g., all
inode->i_lockinstances belong to one class). lockdep does not track individual lock instances — that would be too expensive. Instead, it tracks the acquisition patterns between classes. If anyinode->i_lockis ever held before anydentry->d_lock, lockdep records the dependencyinode_lock -> dentry_lockfor all instances. - Dependency Tracking: lockdep builds a directed graph of lock acquisition order. If it sees
Lock A -> Lock Bin one code path andLock B -> Lock Ain another, it triggers a “Possible Deadlock” warning — even if the two code paths have never actually run concurrently. This is powerful: it catches bugs that would otherwise only manifest under rare timing conditions. - IRQ Context Tracking: It also tracks if a lock is acquired both with and without interrupts enabled. If a thread holds spinlock S with interrupts on, and an interrupt handler also acquires S, the interrupt will spin forever waiting for S to be released — but the release cannot happen because the interrupt preempted the holder. lockdep catches this by tracking
in_irqvsnot_in_irqcontext for every lock class.
dmesg, it includes the full lock acquisition chain that forms the cycle. Read the “Existing dependency chain” and “New dependency” sections carefully — they tell you exactly which two code paths conflict. Fixing the bug usually means reordering lock acquisition or splitting a lock into finer-grained classes.
3. Lock-Free & Wait-Free Progress
When we say an algorithm is “Lock-Free,” we are talking about Progress Guarantees.- Lock-Free: At least one thread in the system is guaranteed to make progress in a finite number of steps. This prevents system-wide deadlocks but allows individual threads to starve.
- Wait-Free: Every thread is guaranteed to make progress in a finite number of steps. This is the strongest guarantee, common in Hard RTOS.
- Obstruction-Free: A thread makes progress only if it executes in isolation (no other thread interferes).
CAS (Compare-And-Swap) Internals
The foundation of lock-free programming is the atomic CAS instruction. It is the “swap the sign on the door only if nobody changed it since I last looked” primitive.LOCK CMPXCHG instruction. The LOCK prefix acquires exclusive ownership of the cache line for the duration of the operation, preventing any other core from reading or writing it. On ARM, the equivalent uses LDXR/STXR (load-exclusive / store-exclusive) pairs.
Lock-Free Stack Example
To see CAS in action, here is a classic lock-free push operation for a stack:4. The ABA Problem
A classic pitfall in lock-free programming. The name comes from the sequence of values the pointer takes: A, then B, then back to A.- Thread 1 reads value
Afrom a pointer and prepares a CAS to update it. - Thread 2 changes the pointer to
B(maybe freeingA), does some work, allocates new memory that happens to land at the same address as oldA, and writesAback. - Thread 1 performs CAS, sees
A, and thinks nothing has changed — but the underlying data at addressAis completely different. Thread 1 has been tricked into accepting a stale, recycled object.
- Solution: Hazard Pointers (each thread publishes which pointers it is currently examining, preventing reclamation), Epoch-based Reclamation (like RCU — defer freeing until all threads have moved to a new epoch), or Tagged Pointers (pack a monotonically increasing counter into the pointer so CAS detects the version change even if the address is reused).
5. Memory Barriers in Sync
Synchronization is not just about atomicity; it is about Visibility. Modern CPUs and compilers reorder memory operations for performance. Without barriers, Thread A might writedata = 42 then set ready = true, but Thread B could see ready == true while data is still stale — because the CPU reordered the stores or the compiler rearranged the instructions.
Think of it this way: writing to memory is like putting a letter in the outbox. Without a barrier, the postal worker might deliver letters out of order. A memory barrier is a “flush the outbox NOW, in order” instruction.
A mutex lock() and unlock() act as Memory Barriers. They ensure that all memory writes performed while holding the lock are visible to the next thread that acquires the lock.
- Acquire Barrier: Applied on
lock(). Ensures subsequent loads/stores stay after the lock is acquired. No memory operation after the lock can “leak” above it. - Release Barrier: Applied on
unlock(). Ensures preceding loads/stores stay before the lock is released. No memory operation before the unlock can “leak” below it. - Full Barrier (mfence): Prevents any reordering across the barrier in either direction. Used in
__sync_synchronize()andatomic_thread_fence(memory_order_seq_cst).
atomic_store_explicit(&ready, true, memory_order_release) and atomic_load_explicit(&ready, memory_order_acquire) instead of raw barriers. The compiler generates the correct platform-specific barrier instructions for you. In kernel code, use smp_store_release() and smp_load_acquire() — the kernel’s own abstraction over hardware barriers.
6. Interview Deep Dive
Explain RCU and when you would use it instead of a reader-writer lock.
Explain RCU and when you would use it instead of a reader-writer lock.
rcu_read_lock() which compiles to essentially nothing (it just disables preemption). Writers create a new copy of the data, atomically publish the pointer, and defer freeing the old copy until a “grace period” has elapsed — meaning every CPU has gone through at least one context switch.Use RCU when: (a) the data structure is read-mostly (thousands of reads per write), (b) readers need to be fast and never block, and (c) you can tolerate the old version being visible briefly. The Linux routing table, file descriptor table, and module list all use RCU.Use reader-writer locks when: writes are frequent or readers need to see the latest version immediately.What is the ABA problem, and how does it differ from a simple data race?
What is the ABA problem, and how does it differ from a simple data race?
Why are memory barriers necessary, and how do mutex operations relate to them?
Why are memory barriers necessary, and how do mutex operations relate to them?
Summary for Senior Engineers
- Use RCU for read-mostly data structures (like routing tables or file descriptor tables). It is the single most important synchronization primitive in the Linux kernel.
- Understand that
lockdepis your best friend during kernel development. It catches deadlocks that would otherwise only manifest under rare timing conditions in production. - Avoid custom lock-free structures unless you are prepared to handle the ABA problem and memory ordering nuances. In most cases, a well-tuned mutex or RCU is both safer and fast enough.
- Memory Barriers are what actually make synchronization work across multiple CPU cores. If you write lock-free code without understanding barriers, it will work on x86 and break on ARM.
- Know the hierarchy: mutex (simple, correct) -> rwlock (read parallelism) -> RCU (zero-cost reads) -> lock-free CAS (maximum control, maximum risk). Move down this list only when profiling proves you need to.
Interview Deep-Dive
Explain RCU to me as if I am a senior engineer who has never worked with the Linux kernel. When would I reach for RCU instead of a reader-writer lock, and what is the catch?
Explain RCU to me as if I am a senior engineer who has never worked with the Linux kernel. When would I reach for RCU instead of a reader-writer lock, and what is the catch?
- Readers: Enter a critical section with
rcu_read_lock(), which simply disables preemption (a thread-local operation, zero shared state). Read the data. Exit withrcu_read_unlock(). No atomic instructions, no cache-line bouncing. Cost: effectively zero. - Writers: Do not modify in place. Instead: (1) copy the data structure, (2) modify the copy, (3) atomically swap the pointer from old to new via
rcu_assign_pointer(), (4) wait for a “grace period” before freeing the old copy. - Grace period: The writer calls
synchronize_rcu(), which blocks until every CPU that might have been in an RCU read-side critical section has exited. The kernel detects this by watching for context switches and idle periods on each CPU (“quiescent states”).
rcu_dereference() for reads and rcu_assign_pointer() for publishes to get correct memory barriers, especially on weakly-ordered architectures like ARM).Follow-up: How does RCU handle readers that get preempted in the middle of an RCU critical section?In non-preemptible kernels, rcu_read_lock() disables preemption, so this cannot happen. In preemptible kernels (PREEMPT_RT), “preemptible RCU” tracks preempted readers explicitly on per-CPU blocked-tasks lists. The grace period does not end until all such tasks exit their critical sections. This adds complexity but allows RCU to work correctly in fully preemptible kernels.What is the ABA problem in lock-free programming? Give a concrete example and explain how epoch-based reclamation solves it.
What is the ABA problem in lock-free programming? Give a concrete example and explain how epoch-based reclamation solves it.
- Thread T1 reads
top = A(stack:A -> B -> C). It prepares to pop A by doingCAS(top, A, B). - T1 is preempted before the CAS.
- Thread T2 pops A, pops B, pushes A back. Stack:
A -> C. B has been freed. - T1 resumes.
CAS(top, A, B)succeeds because top is still A. But B is freed memory. The stack now points to freed memory — use-after-free.
- The system maintains a global epoch counter (0, 1, 2, cycling).
- Each thread records which epoch it entered when starting an operation.
- When a thread removes a node, it places it on a “retired” list tagged with the current epoch instead of freeing immediately.
- The system periodically checks: are all active threads in the current epoch or later? If so, the previous epoch’s retired nodes can be safely freed.
- Key guarantee: while T1 holds its epoch registration, A cannot be freed. T2 retires A to the list, but freeing is deferred until T1 advances its epoch.
Implement a thread-safe singleton in C++ -- explain why double-checked locking (DCLP) was broken before C++11 and how the standard fixed it.
Implement a thread-safe singleton in C++ -- explain why double-checked locking (DCLP) was broken before C++11 and how the standard fixed it.
- Sketch the naive version — one mutex around
if (instance == nullptr) instance = new Foo();. Correct but every call pays the lock cost. Senior interviewers want you to acknowledge that “correct and slow” is the default starting point. - Show the DCLP attempt and explain why it is broken: the second check appears to avoid the lock on the fast path, but the publish of
instanceis the bug. - Explain the actual race: thread A executes
instance = new Foo();. The compiler/CPU is allowed to reorder this into roughly: (a) allocate memory, (b) assign pointer toinstance, (c) run constructor. Thread B sees a non-nullinstance, skips the lock, and dereferences a half-constructed object. - Show the C++11 fix:
static Foo& get() { static Foo instance; return instance; }. The standard guarantees thread-safe initialization of function-local statics (“magic statics”). Implementation: compilers emit a guard variable plus an acquire/release barrier so other threads either see fully constructed or block. - Acknowledge the manual DCLP fix if needed: use
std::atomic<Foo*>withmemory_order_releaseon the store andmemory_order_acquireon the load. This makes the publish ordering explicit so the constructor cannot leak past the pointer assignment.
__cxa_guard_acquire / __cxa_guard_release, which compile to a fast atomic load on the hot path.volatile not enough to fix DCLP, and what does Java’s volatile do differently?
volatile in C/C++ prevents the compiler from caching reads in registers, but it does NOT emit memory barriers, so CPU reordering still happens. Java’s volatile since JSR-133 (2004) is much stronger: it emits acquire/release barriers and prevents reordering of surrounding accesses. That is why DCLP with volatile works in Java 5+ but never works in C++ regardless of volatile.thread_local instance gives each thread its own copy with zero contention.std::call_once differ from magic statics?
std::call_once lets you initialize state that is not a function-local static (e.g., a global pointer initialized lazily). It works with any callable, including lambdas with captures. Both compile to similar guard logic, but call_once uses an explicit once_flag which you can pass between functions. Performance is comparable.- “Just use a mutex on every call” — correct but the question is specifically about avoiding lock cost on the fast path, so this misses the point of the question.
- “Mark the pointer
volatileand DCLP works” — false in C++; conflates Java semantics with C++ semantics. A senior should know the language-specific rules. - “Use std::shared_ptr with atomic store” — overcomplicated and
std::atomic<std::shared_ptr<T>>was only standardized in C++20. Magic statics orcall_onceare simpler and pre-date this.
- Scott Meyers, Andrei Alexandrescu, “C++ and the Perils of Double-Checked Locking” (2004) — the paper that ended DCLP in C++.
- Hans Boehm, “Threads Cannot be Implemented as a Library” (2005) — argues why C++03 needed a memory model.
- cppreference.com, “Static local variables” — documents the C++11 magic-statics guarantee.
Explain mutex vs. spinlock -- when is each the right choice, and how does Linux's futex blur the line?
Explain mutex vs. spinlock -- when is each the right choice, and how does Linux's futex blur the line?
- Define the contract: a mutex parks waiters in the kernel (sleep + wake), a spinlock keeps waiters busy-waiting in userspace. Both serialize access; they differ in what waiters do.
- State the simple decision rule: spinlock if the expected hold time is less than the context-switch cost (~1-2us on Linux x86). Mutex if longer or if the holder might be preempted.
- Walk through why context matters: in the kernel, you cannot sleep while holding a spinlock (you would deadlock the scheduler). In userspace, you can but spinning while preempted means you spin uselessly.
- Introduce futex as the adaptive answer: tries CAS in userspace (zero syscall on uncontended path), falls back to a kernel wait queue only when contended.
pthread_mutex_ton glibc is a futex with adaptive spinning. - Explain priority inversion implications: a spinlock cannot inherit priority. A high-priority thread spinning for a low-priority holder that is preempted by a medium-priority thread spins forever. Real-time systems use priority-inheriting mutexes (
PTHREAD_PRIO_INHERIT).
spinlock_t uses to mutex in the BKL (Big Kernel Lock) removal effort, finalized around 2011 (kernel 2.6.39). The lesson: hold times that grew past microseconds made spinning a measurable regression. Conversely, the seqlock_t (used in jiffies updates and routing tables) is a spin-based primitive that wins because the writer’s critical section is single-digit nanoseconds.pthread_mutex_lock on glibc spin a few times before sleeping?
Adaptive mutexes (PTHREAD_MUTEX_ADAPTIVE_NP) and the default mutex on modern glibc spin briefly because the wake-up path costs ~5us. If the lock is freed within ~1us, spinning is cheaper than the syscall round trip. The spin count is bounded (typically tens of iterations) so a long-held lock still parks the waiter.- “Spinlocks are always faster because they avoid syscalls” — wrong; only true for very short hold times. A spinlock held for 1ms on 8 cores wastes 7ms of CPU per acquisition.
- “Mutexes always sleep” — wrong on Linux; futex-based mutexes spin briefly first.
- “Spinlocks cause deadlocks more easily” — wrong; both can deadlock identically. They differ in what waiters do, not in deadlock semantics.
- Ulrich Drepper, “Futexes Are Tricky” (2003, updated 2011) — the canonical futex reference, available free.
- Linux kernel
Documentation/locking/spinlocks.rst— spinlock semantics and rules. - Paul McKenney, “Is Parallel Programming Hard, And, If So, What Can You Do About It?” — chapter on locking primitives, free PDF online.
Why is implementing a correct lock-free queue hard? Walk through the ABA problem with a real failure mode.
Why is implementing a correct lock-free queue hard? Walk through the ABA problem with a real failure mode.
- Set up the naive lock-free queue: Michael-Scott queue using CAS on head/tail pointers. Show the basic enqueue/dequeue.
- Trace ABA step by step with a memory reuse scenario (the failure mode is concrete addresses, so use addresses).
- Explain why CAS does not catch it: CAS compares the pointer value, not the object identity. If memory at address X is freed and a new object happens to land at X, CAS sees “still X” and proceeds.
- Walk through three solutions:
- Tagged pointers (lower 16 bits store a counter, CAS sees version change). Caps you at 2^N reuses before wraparound.
- Hazard pointers (each thread publishes pointers it is using; reclamation scans all hazards before freeing). Bounds memory but adds per-access overhead.
- Epoch-based reclamation (defer freeing until all threads have left their entry epoch — userspace RCU). Simpler but unbounded memory if a thread stalls.
- State the meta-lesson: lock-free does not mean “no concurrency bugs.” It means “no blocking.” The data race surface area is just as large; you have moved the problem to memory reclamation.
UnboundedQueue (Facebook, ~2017) uses hazard pointers for the same reason.CMPXCHG16B (double-width CAS) help?
It enables tagged pointers without packing tags into the pointer’s spare bits: store the pointer in 8 bytes, the version counter in another 8 bytes, and CAS both atomically. Useful on systems where the pointer uses all 64 bits. ARM equivalent is CASP (LSE extensions).- “Lock-free is always faster than mutex” — false under contention with retry storms.
- “ABA only matters with manual memory management” — also affects reference-counted structures and any state that recycles values.
- “Just use
std::atomicand you are done” — atomics give you the primitive, not the algorithm. You still must reason about memory reclamation.
- Maged Michael, Michael Scott, “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms” (1996) — the foundational paper.
- Maged Michael, “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects” (2004).
- Crossbeam epoch crate documentation — a clean modern implementation of EBR.
You have 64 cores incrementing a shared counter. The mutex-based approach is a bottleneck. Walk through progressively better solutions.
You have 64 cores incrementing a shared counter. The mutex-based approach is a bottleneck. Walk through progressively better solutions.
Next: File Systems & VFS →