Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Advanced Synchronization

For high-concurrency systems, standard mutexes are often too slow. A “Senior” engineer must understand how to eliminate locks entirely using RCU and how to verify lock safety using tools like lockdep. The fundamental tension in synchronization is between correctness and performance. A single global lock guarantees correctness but serializes all work. No locks maximize parallelism but invite data corruption. Every technique in this chapter exists somewhere on that spectrum, trading complexity for throughput.
Caveat 1: Spin vs. sleep is not “spin = fast, sleep = slow.” Spinning wastes CPU; sleeping wastes cycles via context switches and cache eviction. Pick wrong and your throughput collapses. A pthread mutex held for 100 nanoseconds should NEVER cause a sleep — the wakeup path costs ~5 microseconds and trashes the cache. A pthread mutex held for 10 milliseconds should NEVER cause a spin — you burn an entire core for nothing. Linux’s futex (“fast userspace mutex”) solves this with adaptive behavior: stay in userspace via atomic CAS for the uncontended case, only enter the kernel when the lock is actually contested. When designing your own primitive, the rule of thumb is: spin if the expected hold time is less than 2x the context-switch cost (roughly 1-2 microseconds on modern x86); otherwise sleep.
Pattern: prefer adaptive locks over hand-rolled spin/sleep choices. Use pthread_mutex_t (which becomes a futex on Linux) and std::shared_mutex — they already adapt. If you absolutely need a spinlock, bound the spin count: for (int i = 0; i < 64; i++) if (try_lock()) return; futex_wait(...). The Linux kernel’s adaptive spinning (MUTEX_FLAG_WAITERS + owner-on-CPU check) only spins if the lock holder is currently running on another core — if the holder is descheduled, spinning is pointless and you sleep immediately.
Caveat 2: Lock-free is harder than it looks. It is not “no locks = no bugs.” Lock-free code has its own pantheon of failure modes: the ABA problem (covered in section 4), memory ordering bugs that work on x86 but corrupt data on ARM/POWER, and retry storms where N threads CAS-loop against each other and only one makes progress per cycle (lock-free guarantees system-wide progress, not per-thread fairness). A naive lock-free queue can be slower than a well-tuned mutex queue on contended workloads because every failed CAS invalidates the cache line on every other core.
Pattern: justify lock-free with a benchmark, not a vibe. Before reaching for std::atomic or hazard pointers, prototype with a mutex and measure. If the mutex is <5% of total CPU under target load, lock-free is premature optimization. When you do go lock-free, use perf c2c to find cache-line contention, prefer per-thread shards with periodic aggregation (the LMAX Disruptor pattern), and use established libraries (Folly’s MPMCQueue, Boost.Lockfree, Crossbeam in Rust) rather than rolling your own.
Caveat 3: Reader-writer locks are not always faster than plain mutexes. Two failure modes bite teams: (1) writer starvation — on a read-heavy workload with reader-preference rwlocks, writers can wait indefinitely as new readers keep arriving. (2) Reader cache-line ping-pong — every reader must atomically increment a shared counter, which on a 32-core box causes more cache traffic than a plain mutex would. Below roughly a 10:1 read-to-write ratio, a regular mutex is often faster because it has half the bookkeeping.
Pattern: choose by read:write ratio. Read:write under 10:1 use a plain mutex. Between 10:1 and 100:1 use a writer-preference rwlock (PTHREAD_RWLOCK_PREFER_WRITER_NP on glibc, but verify — glibc treats this as a hint). Above 100:1 use RCU or seqlocks to eliminate reader-side atomics entirely. Always profile under your real read:write mix; synthetic benchmarks often pick rwlocks for workloads where they are a regression in production.
Caveat 4: Lock granularity (coarse vs. fine) — measure before redesigning. “Make the locks finer-grained” is the most common cargo-culted advice in concurrent programming, and it is often a regression. Each fine-grained lock adds: more cache lines bouncing between cores, more potential for lock-ordering deadlocks (now you have N locks to keep ordered instead of one), more overhead per critical section. A single coarse lock with a 50-nanosecond critical section is often faster than ten fine-grained locks with 20-nanosecond sections each, because you save eight lock acquisitions.
Pattern: profile, then split. Use perf lock contention (Linux) or mutex_profile to find which mutex is hot. Split only that one. The Linux kernel scaled the dcache lock through ~15 years of incremental splits, each driven by a measured contention hotspot. Two heuristics: (1) if a single mutex is <10% of contended CPU time, leave it alone; (2) prefer “data sharding” (one lock per shard, threads pick a shard by hash) over “operation sharding” (separate locks for read path and write path) — data sharding scales linearly with shard count, operation sharding usually does not.
Mastery Level: Senior Systems / Kernel Engineer
Key Internals: RCU grace periods, lockdep dependency graphs, CAS / CMPXCHG, memory barriers
Prerequisites: Threads & Concurrency, Deadlocks
Time to Master: 12-15 hours

0. From Data Races to Correct Patterns

Before diving into RCU and lock-free algorithms, it helps to see the bugs we are trying to avoid and the idioms that fix them.

0.1 Broken Counter: Lost Updates

// BUGGY: Incrementing a shared counter without synchronization
volatile int counter = 0;

void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        counter++;  // data race
    }
    return NULL;
}
  • On a multi-core system, counter++ compiles to load, increment, store.
  • Two threads can read the same old value and both write back, losing updates.
Fix with mutex + condition variable (pattern):
int counter = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);
        counter++;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

0.2 Reader-Writer Pattern

  • Anti-pattern: Using a single mutex around a large data structure that is read-mostly.
  • Pattern: Use a reader-writer lock or RCU so that many readers can proceed in parallel while writers serialize.

0.3 Producer–Consumer Queue

A classic pattern that combines a mutex with a condition variable:
pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t q_not_empty = PTHREAD_COND_INITIALIZER;

void enqueue(item *it) {
    pthread_mutex_lock(&q_mutex);
    push_to_queue(it);
    pthread_cond_signal(&q_not_empty);
    pthread_mutex_unlock(&q_mutex);
}

item *dequeue(void) {
    pthread_mutex_lock(&q_mutex);
    while (queue_empty()) {
        pthread_cond_wait(&q_not_empty, &q_mutex);
    }
    item *it = pop_from_queue();
    pthread_mutex_unlock(&q_mutex);
    return it;
}
You will see these same patterns reappear in kernel form (wait queues, spinlocks, RCU lists). Having them clear in user space makes the advanced material below much easier to internalize.

1. RCU (Read-Copy-Update)

RCU is the secret weapon of the Linux kernel. It allows multiple readers to access data simultaneously with zero overhead (no locks, no atomic increments), even while a writer is modifying that data. The analogy: imagine a Wikipedia article. Thousands of people read it simultaneously with no coordination. When an editor wants to change it, they create a new version of the page. Readers who started before the edit see the old version; new readers see the new one. Once no reader is still viewing the old version, it is garbage-collected. Nobody was ever blocked.

The RCU Philosophy

  1. Readers: Don’t use any locks. They just read the data.
  2. Writers: Instead of modifying data in place, they:
    • Create a Copy of the data.
    • Update the copy.
    • Publish the new pointer atomically (using a memory barrier).
    • Defer the deletion of the old data until all current readers are finished.

Grace Periods & Quiescent States

How does the writer know when it’s safe to delete the old data?
  • Quiescent State: A point in time where a CPU is guaranteed not to be in an RCU read-side critical section (e.g., a context switch, returning to user space, or entering the idle loop). The key insight is that RCU read-side critical sections cannot span these boundaries.
  • Grace Period: The time interval during which every CPU has passed through at least one quiescent state. Once the grace period ends, no reader can possibly still have a reference to the old data. Grace periods are typically milliseconds long, which is why RCU trades memory for latency: the old copy lives a bit longer, but readers are never stalled.
Practical tip: In the Linux kernel, the routing table (struct fib_info), the file descriptor table, and the module list all use RCU. Any time you see rcu_read_lock() / rcu_read_unlock() in kernel code, the data structure is optimized for read-mostly access patterns where millions of lookups per second would make conventional locking a bottleneck.

RCU vs. Reader-Writer Locks

FeatureReader-Writer Lock (rwlock)RCU
Reader OverheadAtomic increment (cache line bouncing)Zero
Writer LatencyHigh (must wait for all readers)Low (asynchronous deletion)
Deadlock RiskPossibleImpossible for readers
ScalabilityPoor (limited by cache coherency)Infinite

2. Kernel Lock Validation: lockdep

Locking is hard. Deadlocks are harder. The Linux kernel uses a runtime validator called lockdep to prove that a deadlock could happen, even if it hasn’t yet. Think of lockdep as a building inspector who examines the wiring diagram of your house: even if no short circuit has occurred yet, the inspector can identify configurations that would cause one under the right conditions.

How lockdep Works

  • Lock Classes: Every lock belongs to a class (e.g., all inode->i_lock instances belong to one class). lockdep does not track individual lock instances — that would be too expensive. Instead, it tracks the acquisition patterns between classes. If any inode->i_lock is ever held before any dentry->d_lock, lockdep records the dependency inode_lock -> dentry_lock for all instances.
  • Dependency Tracking: lockdep builds a directed graph of lock acquisition order. If it sees Lock A -> Lock B in one code path and Lock B -> Lock A in another, it triggers a “Possible Deadlock” warning — even if the two code paths have never actually run concurrently. This is powerful: it catches bugs that would otherwise only manifest under rare timing conditions.
  • IRQ Context Tracking: It also tracks if a lock is acquired both with and without interrupts enabled. If a thread holds spinlock S with interrupts on, and an interrupt handler also acquires S, the interrupt will spin forever waiting for S to be released — but the release cannot happen because the interrupt preempted the holder. lockdep catches this by tracking in_irq vs not_in_irq context for every lock class.
Practical tip: When you see a lockdep warning in dmesg, it includes the full lock acquisition chain that forms the cycle. Read the “Existing dependency chain” and “New dependency” sections carefully — they tell you exactly which two code paths conflict. Fixing the bug usually means reordering lock acquisition or splitting a lock into finer-grained classes.
# Check for lockdep warnings in the kernel log
dmesg | grep -i "possible.*deadlock"
dmesg | grep -i "inconsistent lock state"

# lockdep statistics (how many lock classes it is tracking)
cat /proc/lockdep_stats

3. Lock-Free & Wait-Free Progress

When we say an algorithm is “Lock-Free,” we are talking about Progress Guarantees.
  • Lock-Free: At least one thread in the system is guaranteed to make progress in a finite number of steps. This prevents system-wide deadlocks but allows individual threads to starve.
  • Wait-Free: Every thread is guaranteed to make progress in a finite number of steps. This is the strongest guarantee, common in Hard RTOS.
  • Obstruction-Free: A thread makes progress only if it executes in isolation (no other thread interferes).

CAS (Compare-And-Swap) Internals

The foundation of lock-free programming is the atomic CAS instruction. It is the “swap the sign on the door only if nobody changed it since I last looked” primitive.
// Pseudo-code for Atomic Compare and Swap.
// The entire if-then-write happens atomically in hardware --
// no other CPU can interleave between the comparison and the store.
bool CAS(int *addr, int expected, int new_val) {
    if (*addr == expected) {
        *addr = new_val;
        return true;   // Success: we were the one to update it
    }
    return false;       // Failure: someone else changed it first
}
On x86, this is the LOCK CMPXCHG instruction. The LOCK prefix acquires exclusive ownership of the cache line for the duration of the operation, preventing any other core from reading or writing it. On ARM, the equivalent uses LDXR/STXR (load-exclusive / store-exclusive) pairs.

Lock-Free Stack Example

To see CAS in action, here is a classic lock-free push operation for a stack:
// A lock-free stack node
struct node {
    int value;
    struct node *next;
};

// Global stack top pointer (shared between all threads)
struct node *top = NULL;

void push(int value) {
    struct node *new_node = malloc(sizeof(*new_node));
    new_node->value = value;

    do {
        // Snapshot the current top
        new_node->next = top;
        // Attempt to swing the top pointer to our new node.
        // If another thread pushed between our snapshot and this CAS,
        // CAS fails (top != new_node->next), and we retry with
        // the updated snapshot.
    } while (!CAS(&top, new_node->next, new_node));
}
This push is lock-free: if one thread’s CAS fails because another thread succeeded, at least one thread made progress. No thread is ever blocked waiting for a lock.

4. The ABA Problem

A classic pitfall in lock-free programming. The name comes from the sequence of values the pointer takes: A, then B, then back to A.
  1. Thread 1 reads value A from a pointer and prepares a CAS to update it.
  2. Thread 2 changes the pointer to B (maybe freeing A), does some work, allocates new memory that happens to land at the same address as old A, and writes A back.
  3. Thread 1 performs CAS, sees A, and thinks nothing has changed — but the underlying data at address A is completely different. Thread 1 has been tricked into accepting a stale, recycled object.
The real-world analogy: you leave your car parked in spot A. Someone tows it, parks a different car in spot A that happens to be the same color. You return, see “your” car, get in, and drive away in someone else’s vehicle.
  • Solution: Hazard Pointers (each thread publishes which pointers it is currently examining, preventing reclamation), Epoch-based Reclamation (like RCU — defer freeing until all threads have moved to a new epoch), or Tagged Pointers (pack a monotonically increasing counter into the pointer so CAS detects the version change even if the address is reused).

5. Memory Barriers in Sync

Synchronization is not just about atomicity; it is about Visibility. Modern CPUs and compilers reorder memory operations for performance. Without barriers, Thread A might write data = 42 then set ready = true, but Thread B could see ready == true while data is still stale — because the CPU reordered the stores or the compiler rearranged the instructions. Think of it this way: writing to memory is like putting a letter in the outbox. Without a barrier, the postal worker might deliver letters out of order. A memory barrier is a “flush the outbox NOW, in order” instruction. A mutex lock() and unlock() act as Memory Barriers. They ensure that all memory writes performed while holding the lock are visible to the next thread that acquires the lock.
  • Acquire Barrier: Applied on lock(). Ensures subsequent loads/stores stay after the lock is acquired. No memory operation after the lock can “leak” above it.
  • Release Barrier: Applied on unlock(). Ensures preceding loads/stores stay before the lock is released. No memory operation before the unlock can “leak” below it.
  • Full Barrier (mfence): Prevents any reordering across the barrier in either direction. Used in __sync_synchronize() and atomic_thread_fence(memory_order_seq_cst).
// Without barriers, this publish/consume pattern is broken on
// weakly-ordered architectures (ARM, POWER):

// Thread A (publisher)
data = 42;
smp_wmb();          // Write memory barrier: ensures data=42 is visible
                    // before ready=true becomes visible to other CPUs
ready = true;

// Thread B (consumer)
while (!ready) ;    // Spin until ready
smp_rmb();          // Read memory barrier: ensures we see the data
                    // that was written before ready was set
assert(data == 42); // Safe after the barrier
Practical tip: In user-space C11/C++11, use atomic_store_explicit(&ready, true, memory_order_release) and atomic_load_explicit(&ready, memory_order_acquire) instead of raw barriers. The compiler generates the correct platform-specific barrier instructions for you. In kernel code, use smp_store_release() and smp_load_acquire() — the kernel’s own abstraction over hardware barriers.

6. Interview Deep Dive

RCU (Read-Copy-Update) eliminates reader-side overhead entirely. Readers call rcu_read_lock() which compiles to essentially nothing (it just disables preemption). Writers create a new copy of the data, atomically publish the pointer, and defer freeing the old copy until a “grace period” has elapsed — meaning every CPU has gone through at least one context switch.Use RCU when: (a) the data structure is read-mostly (thousands of reads per write), (b) readers need to be fast and never block, and (c) you can tolerate the old version being visible briefly. The Linux routing table, file descriptor table, and module list all use RCU.Use reader-writer locks when: writes are frequent or readers need to see the latest version immediately.
The ABA problem is specific to CAS-based lock-free algorithms. A data race is accessing shared memory without synchronization. The ABA problem occurs even with correct atomic operations: Thread 1 reads pointer value A, prepares a CAS. Thread 2 changes A to B, then back to A (possibly after freeing and reallocating the memory). Thread 1’s CAS succeeds because the pointer is A again, but the data behind A has changed.Solutions: (1) Tagged pointers with a monotonic counter so CAS checks both address and version. (2) Hazard pointers where threads publish which pointers they are examining. (3) Epoch-based reclamation (RCU-style) where memory is not freed until all threads are in a safe state.
Modern CPUs reorder memory operations for performance. Without barriers, writes from one thread can appear in a different order to another thread. A mutex lock acts as an acquire barrier (nothing leaks above it) and unlock acts as a release barrier (nothing leaks below it). Together, they ensure that all data written while holding the lock is visible to the next holder.A senior engineer would note that on x86, the memory model is “Total Store Order” — stores are not reordered with other stores, so many barrier bugs do not manifest on x86 but will crash on ARM or POWER. This is why testing lock-free code on x86 alone is insufficient.

Summary for Senior Engineers

  • Use RCU for read-mostly data structures (like routing tables or file descriptor tables). It is the single most important synchronization primitive in the Linux kernel.
  • Understand that lockdep is your best friend during kernel development. It catches deadlocks that would otherwise only manifest under rare timing conditions in production.
  • Avoid custom lock-free structures unless you are prepared to handle the ABA problem and memory ordering nuances. In most cases, a well-tuned mutex or RCU is both safer and fast enough.
  • Memory Barriers are what actually make synchronization work across multiple CPU cores. If you write lock-free code without understanding barriers, it will work on x86 and break on ARM.
  • Know the hierarchy: mutex (simple, correct) -> rwlock (read parallelism) -> RCU (zero-cost reads) -> lock-free CAS (maximum control, maximum risk). Move down this list only when profiling proves you need to.

Interview Deep-Dive

Strong Answer:RCU (Read-Copy-Update) is the kernel’s answer to the read-mostly data structure problem. Think of it this way: you have a routing table read by every packet (millions of times per second) and updated maybe once every few minutes. A reader-writer lock requires every reader to perform an atomic increment on a shared counter. On a 64-core machine, those atomic increments cause cache-line bouncing that adds 100+ nanoseconds per read.RCU eliminates reader overhead entirely:
  • Readers: Enter a critical section with rcu_read_lock(), which simply disables preemption (a thread-local operation, zero shared state). Read the data. Exit with rcu_read_unlock(). No atomic instructions, no cache-line bouncing. Cost: effectively zero.
  • Writers: Do not modify in place. Instead: (1) copy the data structure, (2) modify the copy, (3) atomically swap the pointer from old to new via rcu_assign_pointer(), (4) wait for a “grace period” before freeing the old copy.
  • Grace period: The writer calls synchronize_rcu(), which blocks until every CPU that might have been in an RCU read-side critical section has exited. The kernel detects this by watching for context switches and idle periods on each CPU (“quiescent states”).
When to use RCU over rwlock: read-to-write ratio of 100:1 or higher, zero reader overhead required, updates infrequent and can tolerate grace period delay.The catch: writer latency (grace period can be milliseconds), memory overhead (old and new versions coexist during grace period), and complexity (must use rcu_dereference() for reads and rcu_assign_pointer() for publishes to get correct memory barriers, especially on weakly-ordered architectures like ARM).Follow-up: How does RCU handle readers that get preempted in the middle of an RCU critical section?In non-preemptible kernels, rcu_read_lock() disables preemption, so this cannot happen. In preemptible kernels (PREEMPT_RT), “preemptible RCU” tracks preempted readers explicitly on per-CPU blocked-tasks lists. The grace period does not end until all such tasks exit their critical sections. This adds complexity but allows RCU to work correctly in fully preemptible kernels.
Strong Answer:The ABA problem is the Achilles’ heel of CAS-based lock-free algorithms. CAS checks “is the value still what I expect?” but cannot detect if the value was changed to something else and changed back.Concrete example with a lock-free stack:
  1. Thread T1 reads top = A (stack: A -> B -> C). It prepares to pop A by doing CAS(top, A, B).
  2. T1 is preempted before the CAS.
  3. Thread T2 pops A, pops B, pushes A back. Stack: A -> C. B has been freed.
  4. T1 resumes. CAS(top, A, B) succeeds because top is still A. But B is freed memory. The stack now points to freed memory — use-after-free.
Epoch-based reclamation solves this by deferring memory reclamation:
  • The system maintains a global epoch counter (0, 1, 2, cycling).
  • Each thread records which epoch it entered when starting an operation.
  • When a thread removes a node, it places it on a “retired” list tagged with the current epoch instead of freeing immediately.
  • The system periodically checks: are all active threads in the current epoch or later? If so, the previous epoch’s retired nodes can be safely freed.
  • Key guarantee: while T1 holds its epoch registration, A cannot be freed. T2 retires A to the list, but freeing is deferred until T1 advances its epoch.
This is essentially user-space RCU. The Crossbeam library in Rust implements this pattern.Follow-up: What are hazard pointers and when would you prefer them over epoch-based reclamation?Hazard pointers bound unreclaimed memory to O(threads * pointers_per_thread) by having each thread publish addresses it is currently accessing. Before freeing, the reclaimer scans all hazard pointer lists. If the address appears, freeing is deferred. The advantage: bounded memory overhead even if one thread stalls. The disadvantage: scanning all hazard pointers on every free is expensive. Use hazard pointers for memory-constrained environments; use epoch-based for simpler code when temporary memory bloat is acceptable.
Strong Answer Framework:
  1. Sketch the naive version — one mutex around if (instance == nullptr) instance = new Foo();. Correct but every call pays the lock cost. Senior interviewers want you to acknowledge that “correct and slow” is the default starting point.
  2. Show the DCLP attempt and explain why it is broken: the second check appears to avoid the lock on the fast path, but the publish of instance is the bug.
  3. Explain the actual race: thread A executes instance = new Foo();. The compiler/CPU is allowed to reorder this into roughly: (a) allocate memory, (b) assign pointer to instance, (c) run constructor. Thread B sees a non-null instance, skips the lock, and dereferences a half-constructed object.
  4. Show the C++11 fix: static Foo& get() { static Foo instance; return instance; }. The standard guarantees thread-safe initialization of function-local statics (“magic statics”). Implementation: compilers emit a guard variable plus an acquire/release barrier so other threads either see fully constructed or block.
  5. Acknowledge the manual DCLP fix if needed: use std::atomic<Foo*> with memory_order_release on the store and memory_order_acquire on the load. This makes the publish ordering explicit so the constructor cannot leak past the pointer assignment.
Real-World Example: Scott Meyers and Andrei Alexandrescu’s 2004 paper “C++ and the Perils of Double-Checked Locking” is the canonical reference — it proved DCLP could not be implemented portably in C++03 because the language had no memory model. The fix landed with C++11’s memory model (N2429, “Concurrency memory model compiler consequences”). LLVM and GCC both implement magic statics via __cxa_guard_acquire / __cxa_guard_release, which compile to a fast atomic load on the hot path.
Senior Follow-up 1: Why is volatile not enough to fix DCLP, and what does Java’s volatile do differently? volatile in C/C++ prevents the compiler from caching reads in registers, but it does NOT emit memory barriers, so CPU reordering still happens. Java’s volatile since JSR-133 (2004) is much stronger: it emits acquire/release barriers and prevents reordering of surrounding accesses. That is why DCLP with volatile works in Java 5+ but never works in C++ regardless of volatile.
Senior Follow-up 2: When would a singleton be the wrong answer entirely? When unit testing — singletons make tests order-dependent and hard to mock. Prefer dependency injection: pass the dependency in. Another case: per-thread state. A singleton serializes access; a thread_local instance gives each thread its own copy with zero contention.
Senior Follow-up 3: How does std::call_once differ from magic statics? std::call_once lets you initialize state that is not a function-local static (e.g., a global pointer initialized lazily). It works with any callable, including lambdas with captures. Both compile to similar guard logic, but call_once uses an explicit once_flag which you can pass between functions. Performance is comparable.
Common Wrong Answers:
  • “Just use a mutex on every call” — correct but the question is specifically about avoiding lock cost on the fast path, so this misses the point of the question.
  • “Mark the pointer volatile and DCLP works” — false in C++; conflates Java semantics with C++ semantics. A senior should know the language-specific rules.
  • “Use std::shared_ptr with atomic store” — overcomplicated and std::atomic<std::shared_ptr<T>> was only standardized in C++20. Magic statics or call_once are simpler and pre-date this.
Further Reading:
  • Scott Meyers, Andrei Alexandrescu, “C++ and the Perils of Double-Checked Locking” (2004) — the paper that ended DCLP in C++.
  • Hans Boehm, “Threads Cannot be Implemented as a Library” (2005) — argues why C++03 needed a memory model.
  • cppreference.com, “Static local variables” — documents the C++11 magic-statics guarantee.
Strong Answer Framework:
  1. Define the contract: a mutex parks waiters in the kernel (sleep + wake), a spinlock keeps waiters busy-waiting in userspace. Both serialize access; they differ in what waiters do.
  2. State the simple decision rule: spinlock if the expected hold time is less than the context-switch cost (~1-2us on Linux x86). Mutex if longer or if the holder might be preempted.
  3. Walk through why context matters: in the kernel, you cannot sleep while holding a spinlock (you would deadlock the scheduler). In userspace, you can but spinning while preempted means you spin uselessly.
  4. Introduce futex as the adaptive answer: tries CAS in userspace (zero syscall on uncontended path), falls back to a kernel wait queue only when contended. pthread_mutex_t on glibc is a futex with adaptive spinning.
  5. Explain priority inversion implications: a spinlock cannot inherit priority. A high-priority thread spinning for a low-priority holder that is preempted by a medium-priority thread spins forever. Real-time systems use priority-inheriting mutexes (PTHREAD_PRIO_INHERIT).
Real-World Example: The Linux kernel switched many spinlock_t uses to mutex in the BKL (Big Kernel Lock) removal effort, finalized around 2011 (kernel 2.6.39). The lesson: hold times that grew past microseconds made spinning a measurable regression. Conversely, the seqlock_t (used in jiffies updates and routing tables) is a spin-based primitive that wins because the writer’s critical section is single-digit nanoseconds.
Senior Follow-up 1: What is “ticket locks” and why did Linux replace them with qspinlock? Ticket locks (introduced in 2008 for x86) gave each waiter a number and spun until the “now serving” counter reached it — FIFO fair, simple. But all CPUs spin on the same cache line, causing massive coherence traffic at high contention. Qspinlock (kernel 4.2, 2015) uses an MCS-style queue where each waiter spins on its own cache line, dramatically reducing bus traffic at scale (32+ cores).
Senior Follow-up 2: Why does pthread_mutex_lock on glibc spin a few times before sleeping? Adaptive mutexes (PTHREAD_MUTEX_ADAPTIVE_NP) and the default mutex on modern glibc spin briefly because the wake-up path costs ~5us. If the lock is freed within ~1us, spinning is cheaper than the syscall round trip. The spin count is bounded (typically tens of iterations) so a long-held lock still parks the waiter.
Senior Follow-up 3: When would you use a spinlock in userspace? Almost never in application code — the OS preempts you arbitrarily and a preempted spinlock holder makes every spinner waste CPU. Exception: realtime threads pinned to dedicated cores (e.g., DPDK, Solarflare Onload, HFT engines) where you control scheduling and can guarantee the holder is never preempted.
Common Wrong Answers:
  • “Spinlocks are always faster because they avoid syscalls” — wrong; only true for very short hold times. A spinlock held for 1ms on 8 cores wastes 7ms of CPU per acquisition.
  • “Mutexes always sleep” — wrong on Linux; futex-based mutexes spin briefly first.
  • “Spinlocks cause deadlocks more easily” — wrong; both can deadlock identically. They differ in what waiters do, not in deadlock semantics.
Further Reading:
  • Ulrich Drepper, “Futexes Are Tricky” (2003, updated 2011) — the canonical futex reference, available free.
  • Linux kernel Documentation/locking/spinlocks.rst — spinlock semantics and rules.
  • Paul McKenney, “Is Parallel Programming Hard, And, If So, What Can You Do About It?” — chapter on locking primitives, free PDF online.
Strong Answer Framework:
  1. Set up the naive lock-free queue: Michael-Scott queue using CAS on head/tail pointers. Show the basic enqueue/dequeue.
  2. Trace ABA step by step with a memory reuse scenario (the failure mode is concrete addresses, so use addresses).
  3. Explain why CAS does not catch it: CAS compares the pointer value, not the object identity. If memory at address X is freed and a new object happens to land at X, CAS sees “still X” and proceeds.
  4. Walk through three solutions:
    • Tagged pointers (lower 16 bits store a counter, CAS sees version change). Caps you at 2^N reuses before wraparound.
    • Hazard pointers (each thread publishes pointers it is using; reclamation scans all hazards before freeing). Bounds memory but adds per-access overhead.
    • Epoch-based reclamation (defer freeing until all threads have left their entry epoch — userspace RCU). Simpler but unbounded memory if a thread stalls.
  5. State the meta-lesson: lock-free does not mean “no concurrency bugs.” It means “no blocking.” The data race surface area is just as large; you have moved the problem to memory reclamation.
Real-World Example: The Linux kernel hit ABA bugs in the 2000s when implementing per-CPU lock-free structures. The fix was to adopt RCU widely. In userspace, the Crossbeam crate (Rust) implements epoch-based reclamation explicitly because Rust’s memory safety does not protect against ABA — the type system cannot reason about cross-thread pointer reuse. Folly’s UnboundedQueue (Facebook, ~2017) uses hazard pointers for the same reason.
Senior Follow-up 1: Does ABA exist in garbage-collected languages? The classic pointer-recycling form does not — the GC will not free an object while any thread holds a reference, so CAS on a reference pointer is safe. But a “logical ABA” can still happen with versioned state machines (e.g., a counter going A -> B -> A) where you need a separate version field regardless of language.
Senior Follow-up 2: How does x86’s CMPXCHG16B (double-width CAS) help? It enables tagged pointers without packing tags into the pointer’s spare bits: store the pointer in 8 bytes, the version counter in another 8 bytes, and CAS both atomically. Useful on systems where the pointer uses all 64 bits. ARM equivalent is CASP (LSE extensions).
Senior Follow-up 3: When is a lock-free queue actually slower than a mutex queue? Under high contention with short critical sections. Lock-free retry storms cause every contending CAS to invalidate the cache line on every other core. A mutex parks losers in a wait queue, so only one thread pounds the cache line per critical section. Folly’s MPMCQueue benchmarks show mutex queues winning on contended SPSC at certain core counts.
Common Wrong Answers:
  • “Lock-free is always faster than mutex” — false under contention with retry storms.
  • “ABA only matters with manual memory management” — also affects reference-counted structures and any state that recycles values.
  • “Just use std::atomic and you are done” — atomics give you the primitive, not the algorithm. You still must reason about memory reclamation.
Further Reading:
  • Maged Michael, Michael Scott, “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms” (1996) — the foundational paper.
  • Maged Michael, “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects” (2004).
  • Crossbeam epoch crate documentation — a clean modern implementation of EBR.
Strong Answer:Each solution trades accuracy or complexity for throughput:Level 1: Mutex (current approach) Every increment acquires a lock, increments, releases. On 64 cores, this serializes all operations. Throughput: ~10-20M increments/sec (limited by lock acquisition at ~50-100ns due to cache-line bouncing).Level 2: Atomic increment Replace the mutex with __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED). Uses hardware LOCK XADD. No lock, but the cache line still bounces across 64 L1 caches via the MESI protocol. Throughput: ~50-100M increments/sec. Better, but still limited by coherence traffic.Level 3: Per-CPU counters Each CPU has its own counter on its own cache line (padded to 64 bytes to avoid false sharing). Threads increment their local counter — no contention. Reading the global total requires summing all per-CPU counters. Throughput for increments: billions per second (thread-local store, no inter-core communication). Trade-off: global read is expensive and slightly stale.
struct padded_counter {
    long value;
    char padding[56];
} __attribute__((aligned(64)));
struct padded_counter per_cpu[MAX_CPUS];

void increment() { per_cpu[sched_getcpu()].value++; }
long read_total() {
    long sum = 0;
    for (int i = 0; i < num_cpus; i++) sum += per_cpu[i].value;
    return sum;
}
Level 4: Approximate counting If exact counts are not needed (rate limiting, statistics), increment a thread-local counter and only propagate to a shared counter every N increments. Reduces shared-state updates by factor N.The meta-lesson: scalability on multi-core requires eliminating shared mutable state. Every shared cache line is a serialization point.Follow-up: What is false sharing and how do you detect it?False sharing: two unrelated variables share the same 64-byte cache line. One core writes variable A, another writes variable B — the hardware invalidates the entire cache line on both cores, causing ping-pong despite no logical contention. Detection: perf c2c on Linux reports cache lines with high contention and identifies source addresses. Fix: pad structures to 64-byte alignment.

Next: File Systems & VFS