Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

CPU Architectures & Microarchitecture

To build high-performance operating systems, a senior engineer must look past the Instruction Set Architecture (ISA) and understand the Microarchitecture — the physical implementation of that ISA. The ISA is the contract (“what instructions exist”), while the microarchitecture is the factory floor (“how those instructions actually get executed”). This chapter bridges the gap between digital logic and kernel-level abstractions.
Interview Frequency: High for systems and performance roles
Key Topics: Pipelines, Branch Prediction, OoO Execution, Memory Consistency, Cache Coherency, MESI
Time to Master: 10-15 hours

1. The Execution Pipeline

Modern CPUs do not execute one instruction at a time. They use a Pipeline to overlap the execution of multiple instructions. The analogy is a car wash: while Car 1 is being dried, Car 2 is being rinsed, and Car 3 is being soaped — all simultaneously. Each “stage” of the pipeline handles one phase of instruction execution, and a new instruction enters the pipeline every cycle.

Pipeline Stages (Classic 5-Stage)

  1. Instruction Fetch (IF): Get instruction from memory/cache.
  2. Instruction Decode (ID): Determine what the instruction does and read registers.
  3. Execute (EX): Perform the actual calculation in the ALU.
  4. Memory Access (MEM): Read/Write data from RAM if needed.
  5. Write Back (WB): Update the register file with the result.

Pipeline Hazards

Hazards are situations that prevent the next instruction in the instruction stream from executing in its designated clock cycle. Returning to the car wash analogy: a hazard is when a car in the rinse stage needs something that the soap stage has not finished yet, causing a “bubble” (stall) in the pipeline.
  • Structural Hazard: Hardware conflict (e.g., two instructions need the same functional unit at the same time). This is like two cars needing the same wash bay simultaneously.
  • Data Hazard: Instruction depends on the result of a previous instruction still in the pipeline. For example, ADD R1, R2, R3 followed by SUB R4, R1, R5 — the SUB needs R1, but ADD has not written it yet.
    • Solution: Data Forwarding (bypassing the register file to send results directly to the next stage, like passing a note ahead in line).
  • Control Hazard: Caused by branches. The CPU does not know which instruction to fetch next until the branch is resolved.
    • Solution: Branch Prediction.
Why this matters for OS engineers: Every context switch flushes the pipeline. On a modern CPU with a 15-20 stage pipeline, that is 15-20 cycles of wasted work per switch. This is one reason why excessive context switching destroys throughput.

2. Branch Prediction & Speculative Execution

If the CPU waited for every branch to finish before fetching the next instruction, performance would collapse — the pipeline would stall on every if, for, and function return. Instead, the CPU guesses the outcome and starts executing along the predicted path. Think of it like a GPS that starts calculating the route for the highway exit it expects you to take. If you take a different exit, it has to recalculate (expensive), but most of the time the guess is right (97%+ accuracy on modern predictors).

Branch Predictor Internals

  • BHT (Branch History Table): A simple table indexed by the lower bits of the PC, storing the history of the branch (Taken/Not Taken).
  • PHT (Pattern History Table): Uses global history (the outcome of the last N branches) to predict the current one.
  • TAGE Predictor: The modern gold standard. It uses multiple tables with different history lengths to capture both short-term and long-term patterns.
  • BTB (Branch Target Buffer): Stores the target address of successful branches so the CPU can start fetching the target before even decoding the instruction.

Speculative Execution

The CPU executes instructions along the predicted path. This is one of the most impactful optimizations in modern processors — and also the source of major security vulnerabilities (Spectre, Meltdown).
  • On Success: The instructions are “retired” or “committed” to the architectural state. The speculation was invisible to correctness.
  • On Failure: The pipeline is flushed, speculative results are discarded, and the CPU restarts from the correct path. This is a massive performance penalty (~15-20 cycles).
Security implication: Even though speculative results are “discarded” architecturally, they leave traces in the cache (side channels). Spectre exploits this: trick the branch predictor into speculating along a path that loads secret data into cache, then measure cache timing to extract it. This is why modern kernels include mitigations like retpolines and IBRS that sacrifice branch prediction performance for security.

3. Out-of-Order (OoO) Execution

Modern “Superscalar” CPUs (x86, ARM, RISC-V) use OoO to maximize throughput by executing instructions as soon as their operands are available, regardless of their original order in the program. Think of a restaurant kitchen: orders do not come out in the sequence they were placed. The salad (fast) comes out before the steak (slow), even if the steak was ordered first. But the waiter (Reorder Buffer) holds the salad at the pass until the steak is ready, so the customer sees both arrive “in order.”

Key Components

  • Register Renaming: Maps “architectural registers” (e.g., RAX) to a much larger pool of “physical registers” (often 180+ on modern x86) to eliminate False Dependencies (WAW/WAR hazards). Without renaming, two independent instructions that happen to use the same register would appear to depend on each other.
  • Reservation Stations (RS): Buffers that hold instructions waiting for data. Once all operands are ready, the instruction is dispatched to an execution unit. A modern Intel core has ~100 RS entries.
  • Reorder Buffer (ROB): Ensures that even if instructions execute out of order, they commit in order. This maintains the “illusion” of sequential execution required for correct exception handling and precise interrupts.
Why this matters for OS engineers: When the kernel takes an interrupt or exception, the ROB guarantees that the architectural state is consistent up to a precise instruction boundary. Without this, the kernel could not reliably determine which instruction faulted or resume execution correctly.

4. Memory Consistency Models

The OS kernel must coordinate data between multiple cores. However, CPUs may reorder memory reads and writes for performance. This is one of the most counterintuitive aspects of modern hardware: the code you wrote is not the code the CPU executes, and the memory operations you see in the assembly are not the order they hit RAM. Think of it like a post office that delivers letters out of order to maximize route efficiency. If you send Letter A then Letter B, the recipient might get B first. Memory barriers are like certified mail — they force the post office to deliver in the order you specified.

TSO (Total Store Order) - x86

  • Behavior: Stores are never reordered with other stores. Loads are never reordered with other loads. However, a Load can be reordered with an earlier Store to a different address.
  • Kernel Impact: Relatively easy to program for. Locks often don’t need explicit memory barriers for every access.

Weak Ordering - ARM / RISC-V

  • Behavior: Any Load/Store can be reordered with any other Load/Store unless there is a data dependency or an explicit barrier.
  • Kernel Impact: Harder to program. The kernel must use Memory Barriers (DMB in ARM, FENCE in RISC-V) to ensure that, for example, a “flag” is seen after the “data” it protects is written.

Memory Barriers

  • Load Barrier: Ensures all loads before the barrier are complete before any load after the barrier.
  • Store Barrier: Ensures all stores before the barrier are visible to other cores before any store after the barrier.
  • Full Barrier: Combines both.

5. Cache Coherency (MESI Protocol)

Caches are local to each core. If Core A modifies a memory location that Core B has cached, Core B’s cache is now “stale.” The hardware uses the MESI Protocol to keep them in sync.

MESI States

  1. Modified (M): The cache line is present only in the current cache and is “dirty” (different from RAM).
  2. Exclusive (E): The cache line is present only in the current cache but matches RAM.
  3. Shared (S): The cache line is present in multiple caches and matches RAM.
  4. Invalid (I): The cache line is invalid.

The “False Sharing” Problem

If two threads on different cores modify different variables that happen to reside on the same cache line (typically 64 bytes), the cores will fight for ownership of that line, causing the line to “ping-pong” between caches. This can degrade performance by 10-100x compared to properly aligned data. Think of it like two people sharing a whiteboard: even though they are writing in different corners, each time one picks up the eraser, the other has to wait. The whiteboard is the cache line — the unit of ownership transfer.
  • Solution: Cache Alignment. Use __attribute__((aligned(64))) in C or alignas(64) in C++ to ensure critical variables are on their own lines.
// BAD: counter_a and counter_b likely share a cache line
struct shared_counters {
    int counter_a;  // Core 0 increments this
    int counter_b;  // Core 1 increments this -- false sharing!
};

// GOOD: each counter gets its own cache line
struct aligned_counters {
    alignas(64) int counter_a;  // Own cache line
    alignas(64) int counter_b;  // Own cache line
};
Practical tip: Use perf c2c on Linux to detect false sharing in production. It highlights cache lines that are being bounced between cores.

6. Privileged Execution & Protection Rings

The CPU hardware enforces the boundary between the OS and Applications.

x86 Rings

  • Ring 0: Kernel Mode. Access to all instructions (e.g., HLT, MOV CR3) and all memory.
  • Ring 3: User Mode. Restricted access.
  • Protection Transition: Triggered by interrupts, exceptions, or the SYSCALL instruction.

ARM Exception Levels (EL)

  • EL0: User Application.
  • EL1: OS Kernel.
  • EL2: Hypervisor.
  • EL3: Secure Monitor (TrustZone).

7. SIMD & Vector Processing

Modern workloads (ML, Crypto, Video) use SIMD (Single Instruction, Multiple Data) to process arrays of data in parallel. Think of it like a teacher handing out exam papers: instead of giving one paper to one student at a time, SIMD lets the teacher hand out 8, 16, or even 64 papers in a single motion.
  • x86: SSE (128-bit), AVX/AVX2 (256-bit), AVX-512 (512-bit registers — that is 64 bytes, an entire cache line, in one instruction).
  • ARM: NEON (128-bit), SVE (Scalable Vector Extension, up to 2048-bit, length determined at runtime by the hardware).
  • Kernel Responsibility: During a context switch, the OS must save the state of these massive registers. AVX-512 alone adds 32 registers of 64 bytes each = 2KB of extra state per context switch. To save time, kernels often use Lazy FP/SIMD Saving: only saving the registers if the next process actually tries to use them. The CPU raises a #NM (Device Not Available) exception on the first FP/SIMD instruction, and the kernel then loads the saved state.
Practical tip: AVX-512 can cause frequency throttling on Intel CPUs — the core drops to a lower clock speed when running 512-bit instructions. This means a single AVX-512 thread can slow down neighboring threads on the same core. This is why some organizations (notably Google) disable AVX-512 in their fleet. Check your CPU’s behavior with turbostat while running SIMD-heavy workloads.

7. From ISA to OS ABI

So far we have focused on microarchitecture (pipelines, OoO, caches). The operating system also cares about the architectural contract between compiled binaries and the kernel, usually called the Application Binary Interface (ABI).

7.1 Calling Conventions and System V ABI

On Linux/x86-64, the System V ABI defines:
  • Which registers are used for function arguments (RDI, RSI, RDX, RCX, R8, R9).
  • Which registers a function must preserve (callee-saved) vs. can freely clobber.
  • How the stack frame is laid out (return address, saved base pointer, locals).
The kernel relies on this when:
  • Creating a new thread or process (it must fabricate a valid stack frame so that returning from the scheduler or execve() lands in user code correctly).
  • Delivering a signal (it builds a signal frame on the user stack that looks like a normal call frame, then adjusts RIP/RSP).

7.2 Syscall ABIs

Each architecture defines how syscalls are invoked from user space:
  • x86-64 Linux:
    • Syscall number in RAX.
    • Arguments in RDI, RSI, RDX, R10, R8, R9.
    • syscall instruction transitions to kernel, return value in RAX.
  • ARM64 Linux:
    • Syscall number in x8.
    • Arguments in x0x5.
    • svc #0 traps into the kernel.
The OS must save and restore the full architectural state (all relevant registers, flags, FP/SIMD state if used) according to these rules on every transition.

7.3 How OS Code Becomes “Architecture-Dependent”

You will see architecture-specific directories in kernels, for example:
  • arch/x86/entry/ for syscall and interrupt entry/exit.
  • arch/arm64/mm/ for ARM64 page table layout and TLB handling.
The high-level logic (“schedule this task”, “map this page”) is often shared across architectures, but:
  • The exact sequence of instructions to switch page tables, flush TLBs, or enter low-power states is ISA-specific.
  • The OS must obey the memory model and barrier instructions of each ISA (TSO vs weak ordering).
Understanding this bridge from ISA to ABI is what lets you read actual kernel code in arch/* directories and connect it back to the abstractions in the rest of this course.

Summary for Senior Engineers

  • Pipeline flushing is the hidden cost of unpredictable code (branches) and context switches. A senior engineer would say: “We need to measure branch misprediction rates before assuming the algorithm is faster — perf stat -e branch-misses tells you immediately.”
  • Memory Ordering is the primary source of subtle concurrency bugs in cross-platform kernels. Code that works on x86 (TSO) may break on ARM (weak ordering) because x86 hides reordering that ARM exposes.
  • Cache Coherency means memory is not a single global pool; it is a hierarchy of ownership. Every shared variable access in a multithreaded program involves the MESI protocol, whether you think about it or not.
  • OoO Execution means the code you see is almost never the code the CPU executes. The CPU is constantly reordering, speculating, and retiring — your job is to understand when this optimization helps and when it creates correctness or security hazards.
Next: Process Management

Production Caveats: Where Microarchitecture Bites Real Code

The hardware is invisible until it is not. Most software pretends the CPU is a simple in-order machine; reality is a wildly out-of-order, speculative, cache-coherent, multi-level-pipelined beast. The bugs that emerge from this gap are the hardest to reproduce and the most expensive to fix.
Microarchitecture traps that have caused real production incidents:
  1. x86’s strong memory model masks bugs that surface on ARM weak ordering. Code that “works” on x86 because TSO hides reordering can race catastrophically on ARM. A classic example: a producer writes data then flag = 1; a consumer reads flag, then data. On x86 this is safe (stores in order). On ARM the consumer can see flag = 1 with data still uninitialized. Java’s Memory Model (JMM) hides most of this, but native code (C, C++, Go cgo, Rust unsafe) does not. Cloudflare’s Graviton migration in 2020 surfaced exactly this bug in a custom rate limiter.
  2. False sharing on cache lines kills performance silently. Two threads writing to adjacent counters in a struct can run 100x slower than separating them. There is no error — just inexplicable slowness. perf c2c is the diagnostic, but most engineers do not know to look. The Linux kernel went through a multi-year refactor to add ____cacheline_aligned_in_smp to many shared structs after profiling revealed false sharing in scheduling and locking hot paths.
  3. Speculation side channels (Spectre, Meltdown, MDS, L1TF) are the hidden cost. Mitigations are real performance costs: KPTI 5 to 30 percent on syscall-heavy workloads, IBRS another 5 to 15 percent, MDS clearing on every transition another 2 to 5 percent. You cannot just disable them without understanding your tenancy model. Cloud providers (AWS, GCP) leave them on; HFT firms running on dedicated hardware turn them off via mitigations=off. Pick your trade-off deliberately.
  4. AVX-512 frequency throttling. Intel CPUs (Skylake-X through Ice Lake) drop core frequency when running 512-bit instructions to stay in power envelope. A single core using AVX-512 can drop neighboring cores by 100 to 500 MHz for tens of milliseconds after the AVX work ends. Glibc’s memcpy accidentally hit this on Skylake; the fix was to detect AVX-512 throttling cost and avoid 512-bit when copies are small. Google reportedly disables AVX-512 in their fleet for this reason.
  5. TLB shootdowns scale poorly. When the kernel modifies a page table entry, every CPU caching that translation must invalidate it. With mmap/munmap storms, each operation can broadcast IPIs to hundreds of CPUs. A NUMA box with 192 cores spends real time on IPIs alone. PCID (x86) and ASID (ARM) help but do not eliminate the cost.
Solutions and patterns:
  • Use std::atomic with explicit memory orderings, not raw pointers. memory_order_acquire/memory_order_release document and enforce ordering. memory_order_relaxed is fine for counters where order does not matter; do not use it for synchronization. Java has volatile for the same purpose; Go has sync/atomic.
  • Pad shared data to cache-line boundaries. alignas(64) in C++, __attribute__((aligned(64))) in C, [CacheLineSize]byte padding in Go. This is the simplest fix for false sharing and worth doing prophylactically on any per-core counter.
  • Profile with hardware counters. perf stat -e cycles,instructions,cache-misses,branch-misses,dTLB-load-misses on production traffic. The IPC (instructions per cycle) tells you whether the CPU is doing useful work or stalling. Below 1.0 means you are memory-bound or branch-bound.
  • Use huge pages to reduce TLB pressure. Transparent Huge Pages (THP) in Linux are automatic for large allocations. Explicit MAP_HUGETLB for databases and JVMs. A single 2MB page maps 2MB of memory in one TLB entry instead of 512 4KB entries. Result: lower TLB miss rate, faster.
  • For lock-free code, use __builtin_prefetch and read your code for fences. Lock-free correctness is a nightmare on weak-memory architectures. Test on ARM, run with thread sanitizer (TSan), and benchmark carefully. The “lock-free is faster” assumption is only true if you got the ordering right.
  • Run TSan, ASan, and MSan in CI. Thread Sanitizer catches data races; Address Sanitizer catches memory errors; Memory Sanitizer catches uses of uninitialized memory. The combination eliminates most undefined-behavior bugs that microarchitecture exposes.

Senior Interview Questions: Hardware-Software Boundary

Strong Answer Framework:
  1. State the difference clearly. x86 implements TSO (Total Store Order): stores are never reordered with stores, loads are never reordered with loads, but a load can be reordered with an earlier store to a different address. ARM implements a weakly-ordered model: any load or store can be reordered with any other unless explicit barriers (DMB, DSB, ISB) prevent it.
  2. The classic example. Producer writes data then flag = 1; consumer reads flag then data.
    // Producer (Thread A)
    data = 42;       // Store 1
    flag = 1;        // Store 2
    
    // Consumer (Thread B)
    while (!flag);   // Load 1
    use(data);       // Load 2
    
    On x86 (TSO), the producer’s two stores hit memory in order; the consumer’s loads also stay in order. The consumer never sees flag == 1 with data still 0. On ARM, both reorderings are allowed — the producer’s stores can hit memory in either order, and the consumer’s loads can complete in either order. The consumer can see flag = 1 with data = 0.
  3. The fix in C++. Use std::atomic with memory_order_release on the producer’s store and memory_order_acquire on the consumer’s load:
    std::atomic<int> flag{0};
    int data = 0;
    
    // Producer
    data = 42;
    flag.store(1, std::memory_order_release);
    
    // Consumer
    while (flag.load(std::memory_order_acquire) == 0);
    use(data);
    
    On x86, this compiles to ordinary mov instructions (TSO already provides the guarantee). On ARM, the compiler emits STLR (store-release) and LDAR (load-acquire), which the hardware honors with the appropriate ordering.
  4. The deeper issue. Many real-world bugs are subtler. Double-checked locking in C without atomics, “publishing” an object via a non-atomic pointer, hand-rolled MPMC queues. All can work for years on x86 then fail on the first ARM deployment. The honest answer is: do not roll your own concurrency primitives unless you have read both the x86 and ARM memory model specs and can prove your code correct under each.
  5. Detection strategy. Run on ARM in CI even if your prod is x86. Run TSan; it instruments memory accesses and detects races. For published lock-free libraries, look at their arm64.s to see what fences they emit.
Real-World Example: Cloudflare’s 2020 migration to ARM (Graviton2) on EC2 surfaced a race in their internal token bucket rate limiter. The C code used a non-atomic counter with a “release” comment but no std::atomic. On x86 it had been silently working because of TSO; on ARM it produced incorrect rates under heavy load. Fix was a one-line change to std::atomic<uint64_t> with relaxed increments and acquire/release on the publication path. They documented the finding publicly to help others migrating.
Senior follow-up: What does std::memory_order_seq_cst cost on ARM versus x86? On x86, it costs roughly the same as acquire/release because TSO already provides most of the ordering — the compiler emits an mfence only on stores. On ARM, sequential consistency requires DMB ISH (data memory barrier, inner shareable) on every operation — expensive. For high-throughput code on ARM, prefer acquire/release over seq_cst.
Senior follow-up: Are there bugs that go in the opposite direction — code that works on ARM but races on x86? Rarely. ARM’s weak ordering is a strict superset of x86’s reorderings, so if you wrote barrier-correct code for ARM, it will work on x86. The asymmetry is in how easily wrong code “happens to work” on each.
Senior follow-up: What is the difference between LL/SC and CAS, and which does each architecture use? CAS (Compare-And-Swap) is x86’s LOCK CMPXCHG: atomic compare-and-replace in one instruction. LL/SC (Load-Linked / Store-Conditional) is ARM/RISC-V’s pair: ldxr reserves the address, stxr stores only if no other CPU has touched the reservation. LL/SC composes nicely (you can build any RMW from it), but it can spuriously fail under contention. Modern ARM has CAS as a separate instruction (CAS/CASA/CASAL); pre-Armv8.1 had only LL/SC.
Common Wrong Answers:
  1. “Just use volatile and you’ll be fine.” volatile in C/C++ does not provide atomicity or ordering across threads. It only prevents the compiler from caching values in registers. It is not a synchronization primitive.
  2. “x86 is sequentially consistent.” Wrong — x86 is TSO, which is weaker than sequential consistency. Specifically, a store followed by a load to a different address can be reordered. This matters for store-buffer-related races.
Further Reading:
  • Sewell, Sarkar, et al., “x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors” (CACM, 2010).
  • ARM Architecture Reference Manual, chapter on Memory Model.
  • Paul McKenney, “Memory Barriers: a Hardware View for Software Hackers” (2010) — the practitioner’s guide.
Strong Answer Framework:
  1. What the TLB is. The Translation Lookaside Buffer is a small, fast cache of recent virtual-to-physical address translations. Every memory access requires translating the virtual address through the page tables (a 4-level walk on x86-64, 5 levels with LA57). The walk takes 4 to 5 memory accesses; the TLB caches the result so the access is one cycle instead of dozens. Modern x86 has 64 to 1024 TLB entries (split between L1, L2, and STLB).
  2. Why context switches affect it. When the kernel switches from process A to process B, A and B have different page tables (different CR3 values). The translations cached in the TLB belong to A; using them under B’s address space would be a memory-isolation disaster. Historically, the CPU flushed the entire TLB on every CR3 write. Result: the new process starts with a cold TLB, and its first thousand or so memory accesses each trigger a 4-level page walk — 50 to 100 cycles instead of 1.
  3. The cost in numbers. A cold TLB on a 1GB working set can cost millions of cycles to warm up. In a fork/exec storm or aggressive scheduling, this dominates. Linux measures this as “TLB miss rate” via perf stat -e dTLB-load-misses.
  4. PCID (Process Context Identifier). Introduced in Westmere (2010) but widely deployed only since Haswell. Each TLB entry is tagged with a 12-bit context ID. On context switch, the kernel writes a new CR3 with PCID = process_B_id and tells the CPU not to flush. TLB entries from A and B coexist; lookups only match entries tagged with the current PCID. Result: switching back to A finds A’s TLB entries warm.
  5. Limitations. Only 4096 PCIDs total, so the kernel maintains a small per-CPU cache mapping process IDs to PCIDs. On overflow, it has to evict (and flush) old PCIDs. Also, PCID does not help with TLB shootdowns — when a page is unmapped, every CPU that has cached the translation must still invalidate it, and the IPI to do that is not free.
  6. PCID and KPTI interaction. KPTI (Meltdown mitigation) doubles each process’s PCID usage: one for kernel page tables, one for user page tables. The CR3 switch happens on every syscall instead of every context switch. PCID becomes more valuable here — without it, KPTI would flush the TLB twice per syscall, costing 30 to 50 percent throughput.
Real-World Example: When Linux 4.14 enabled KPTI for Meltdown in early 2018, AWS’s published benchmarks showed 5 to 30 percent throughput drops on syscall-heavy workloads. Workloads on Haswell or later (with PCID) saw the lower end of this range; workloads on older CPUs (no PCID) saw the higher end. AWS published guidance specifically pointing at PCID as the critical hardware feature for tolerating KPTI overhead.
Senior follow-up: What is a TLB shootdown and why is it expensive? When the kernel modifies a page table entry (like during munmap or mprotect), every CPU that may have cached the old translation must invalidate it. The kernel sends an inter-processor interrupt (IPI) to each affected CPU; each receiving CPU must drop what it is doing, run the invalidation handler, and ack. On a 192-core system, a single shootdown can take tens of microseconds. Workloads that frequently mmap/munmap (JVMs, garbage-collected runtimes) can spend significant CPU on shootdowns.
Senior follow-up: How do huge pages help? A 4KB page maps 4KB with one TLB entry. A 2MB huge page maps 2MB with one TLB entry. For a 1GB working set: 262,144 small-page TLB entries needed (impossible) versus 512 huge-page entries (easily fits in STLB). Result: vastly lower TLB miss rate. PostgreSQL on huge pages can show 10 to 30 percent throughput gains on large databases.
Senior follow-up: What about ARM’s ASID? ASID (Address Space ID) is the ARM equivalent of PCID. ARM has had it since the Cortex-A series; it is a 16-bit (or 8-bit on older parts) tag. Same idea, more bits, fewer overflow concerns. ARM also has VMID (Virtual Machine ID) for hypervisor-side tagging.
Common Wrong Answers:
  1. “The TLB is part of the L1 cache.” No. The TLB is a separate structure that caches address translations. L1/L2/L3 caches store data values. Both are SRAM, both are on-die, but they hold different things.
  2. “PCID makes context switches free.” It removes the TLB flush, which is the dominant cost, but the context switch still saves and restores registers, the FPU/SIMD state, and updates the scheduler’s per-process bookkeeping. Switches are roughly 1 microsecond; PCID brings it from “1 microsecond plus 10 microseconds of cold-TLB recovery” to roughly 1 microsecond.
Further Reading:
  • Intel Software Developer’s Manual, Volume 3, chapter on Paging.
  • “Improving Real-Time Performance by Utilizing Cache Allocation Technology” (Intel whitepaper) — adjacent topic on cache and TLB partitioning.
  • Linux kernel arch/x86/mm/tlb.c — where the actual TLB management lives. Read it.
Strong Answer Framework:
  1. Define what “faster” means. Throughput (operations per second), latency (per-operation), tail latency (p99/p99.9), or scalability (throughput as cores increase). A mutex can win throughput at low contention; a lock-free queue can win scalability at high contention. The right answer depends on the use case.
  2. Reproduce the claim with a benchmark. Run both implementations under the same workload: 1, 4, 16, 64 threads; uncontended, lightly contended, heavily contended. Measure with perf stat for IPC and cache miss rates, and with histograms (HdrHistogram) for latency distributions. A 2x throughput claim that only holds at 64 threads but loses at 4 threads is workload-specific.
  3. Check the memory ordering carefully. Lock-free algorithms are notoriously hard to get right, especially on weak-memory architectures. Run on ARM. Run with TSan. Check whether the implementation uses seq_cst (correct but slow), acquire/release (correct if ordering is reasoned through), or relaxed (probably wrong for synchronization).
  4. Measure on the target platform, not a microbenchmark. Lock-free algorithms often “win” microbenchmarks that fit in L1 cache and lose in production where cache lines bounce between cores. The microbenchmark also misses the second-order effect: lock-free algorithms tend to have higher cache-coherence traffic, which can slow down unrelated cores on the same socket.
  5. Consider the failure modes. Mutexes are well-understood: hold a mutex too long and you slow others down, but at least the algorithm is correct. Lock-free code can have ABA bugs, livelock under contention, memory reclamation bugs (use-after-free across threads), and pathological backoff. The failure modes of lock-free are subtler and the bugs harder to find.
  6. Check whether lock contention is actually a problem. If the mutex is held briefly (under 100ns) and contention is rare, the mutex is essentially free. Replacing it with lock-free adds complexity for no measurable gain. Profile with perf lock report to see where actual contention occurs.
Real-World Example: The Linux kernel’s dcache (directory entry cache) was famously moved from RCU + spinlock to a more aggressively lock-free design (RCU-walk) in Nick Piggin’s 2010 patches. The benchmark wins at 16+ cores were significant — 30 percent throughput improvement on Will-It-Scale’s open test. But the patches were merged only after rigorous TSan-equivalent verification (Paul McKenney’s RCU testing) because the original implementation had subtle ordering bugs that took months to surface.
Senior follow-up: What is RCU and when does it beat both mutexes and lock-free? RCU (Read-Copy-Update) is a synchronization technique optimized for read-heavy workloads. Readers run in a “lockless” critical section (just disable preemption); writers create a new copy, atomically swap, and wait for all readers to finish before freeing the old copy. Reads are roughly free (no atomic ops); writes are expensive (waiting for grace period). For 99-to-1 read-to-write ratios, RCU dominates. The Linux kernel uses it extensively.
Senior follow-up: What is the ABA problem and how does hazard-pointer or epoch-based reclamation solve it? ABA: thread T1 reads pointer P, value A. T2 changes P to B then back to A (with a fresh A). T1 then does CAS (P, A, …) and succeeds, even though the world changed under it. Hazard pointers: T1 publishes “I am reading P” before reading; T2 cannot reclaim A until T1 unpublishes. Epoch-based: T1 enters a global epoch; reclamation only happens in old epochs no thread is in. Both add bookkeeping but eliminate ABA.
Senior follow-up: When should you give up and use a mutex? Almost always, in application code. The exceptions are: (1) you have profiled and lock contention is the dominant cost; (2) you are writing kernel-style code where blocking is not allowed; (3) you have a clear lock-free algorithm with a published correctness proof. For most application code, a mutex with appropriately-scoped critical sections is the right answer.
Common Wrong Answers:
  1. “Lock-free is always faster.” False. Mutexes win at low contention. The right tool depends on the workload.
  2. “If it works in testing, it is correct.” Memory-model bugs are intermittent and platform-specific. Working in testing on x86 says little about correctness on ARM under load.
Further Reading:
  • Maurice Herlihy and Nir Shavit, The Art of Multiprocessor Programming — the textbook on lock-free design.
  • Paul McKenney, “Is Parallel Programming Hard, And, If So, What Can You Do About It?” (free PDF) — the practitioner’s deep dive on RCU and synchronization.
  • Paul McKenney’s Linux Weekly News articles on RCU — accessible explanations of subtle topics.

Interview Deep-Dive

Strong Answer:
  • The most likely cause is the difference in memory consistency models. x86 uses Total Store Order (TSO), which is relatively strong — stores are never reordered with other stores, and most lock-free code written for x86 implicitly relies on this. ARM64 uses a weak memory model where any load or store can be reordered with any other unless explicit barriers (DMB, DSB) are used. If the application uses lock-free algorithms or has subtle ordering assumptions, it may produce incorrect results on ARM (leading to retries or fallback paths) or the compiler/runtime may insert excessive barriers that kill performance.
  • Second, I would check SIMD usage. If the application uses AVX2 or AVX-512 intrinsics (common in database engines, compression, JSON parsing), those do not exist on ARM. The fallback is NEON, which has different widths and instruction semantics. If the code path falls back to scalar code because ARM SIMD was not implemented, that explains the throughput drop.
  • Third, branch prediction and pipeline characteristics differ. ARM cores (like Graviton) typically have different pipeline depths and branch predictor designs. Code with many unpredictable branches might perform differently. I would use perf stat to compare IPC (instructions per cycle), branch misprediction rates, and cache miss rates on both platforms.
  • Investigation approach: run perf stat -d on both platforms for the same workload, compare IPC, L1/L2/LLC miss rates, branch misprediction rates, and TLB miss rates. If IPC is significantly lower on ARM, look at the hottest functions with perf record -g and check if they are memory-bound (high cache misses) or instruction-bound (high branch mispredictions).
Follow-up: What is false sharing, and how would you detect it using hardware performance counters?False sharing occurs when two threads on different cores modify different variables that happen to live on the same cache line (typically 64 bytes). Even though there is no logical data race, the MESI protocol forces the cache line to bounce between cores — one core invalidates the other’s copy on every write. You detect it with perf c2c (cache-to-cache analysis) on Linux, which identifies cache lines with high cross-core invalidation rates. It shows you the exact addresses and the source code lines involved. The fix is to align hot per-thread variables to cache line boundaries using __attribute__((aligned(64))) in C or alignas(64) in C++, or pad structures so that per-thread fields do not share a cache line.
Strong Answer:
  • Speculative execution means the CPU executes instructions along a predicted branch path before the branch condition is resolved. If the prediction is correct, the results are committed; if wrong, they are discarded and the pipeline is flushed. The key insight behind Spectre is that even though architectural state is rolled back on a misprediction, microarchitectural state (specifically, the cache) is not. A speculative load that brings data into the cache leaves a timing side-channel: the attacker can measure access times to determine what data was loaded speculatively.
  • Spectre v1 (bounds check bypass): the attacker tricks the branch predictor into speculatively executing past an array bounds check, loading secret data into the cache. Spectre v2 (branch target injection): the attacker trains the indirect branch predictor to redirect speculative execution to a gadget that leaks data.
  • OS-level mitigations include: retpolines (replacing indirect branches with a return trampoline that prevents the branch predictor from being trained by an attacker — used heavily in the Linux kernel), IBRS/IBPB (hardware features that flush or restrict the branch predictor on privilege-level transitions so user-space training does not affect kernel speculation), STIBP (restricting branch prediction sharing between hyperthreads), and KPTI (Kernel Page Table Isolation) which is actually a Meltdown mitigation — it unmaps kernel memory from user-space page tables so speculative kernel memory reads from user space fault.
  • The performance cost is real: KPTI adds TLB flush overhead on every syscall (mitigated by PCID), retpolines add branch prediction overhead, and IBRS can cost 5-10% on some workloads. This is why production systems sometimes selectively disable mitigations for trusted workloads while keeping them enabled on multi-tenant systems.
Follow-up: If you are running a multi-tenant system and cannot disable these mitigations, what architectural changes would you make to minimize the performance impact?The biggest performance hit comes from frequent kernel/user transitions (each triggers KPTI TLB flushes and potential IBRS updates). I would restructure the application to minimize syscalls: use io_uring to batch I/O operations (one syscall for hundreds of I/O requests instead of one per request), use mmap for file access to avoid read/write syscalls, and use huge pages to reduce TLB pressure. For networking, kernel bypass with DPDK or XDP avoids the syscall path entirely for packet processing. These changes improve performance generally but have outsized benefits when Spectre mitigations are active because they reduce the frequency of the most expensive transitions.
Strong Answer:
  • Out-of-order execution lets the CPU execute instructions as soon as their operands are ready, regardless of program order. This maximizes throughput because if instruction 3 is waiting for a cache miss but instruction 5 has all its data ready, the CPU executes instruction 5 first rather than stalling the entire pipeline.
  • The Reorder Buffer maintains the illusion of sequential execution. It commits (retires) instructions in program order for three critical reasons. First, precise exceptions: if instruction 5 executes but instruction 3 faults (e.g., divide by zero), the CPU must report the fault as if instruction 5 never executed. Second, branch misprediction recovery: if we discover that a branch was mispredicted, we need to discard all instructions after the branch. In-order commit means everything after the branch in the ROB can be cleanly flushed without affecting committed state. Third, memory ordering: on x86 (TSO), stores must appear in program order to other cores. The ROB ensures stores are committed (made visible) in order.
  • The ROB is a circular buffer, typically 200-350 entries on modern x86 cores. When it fills up, the CPU stalls — it cannot issue new instructions until the oldest instruction commits. This is why long-latency operations (like an L3 cache miss that takes 40 cycles) can block the entire pipeline if they are at the head of the ROB.
  • Register renaming works hand-in-hand with the ROB: architectural registers (like RAX) are mapped to a much larger set of physical registers. This eliminates false dependencies (WAW, WAR) so that more instructions can execute in parallel.
Follow-up: How does the store buffer interact with the ROB, and what is store-to-load forwarding?Stores are not written directly to the cache when they execute — they go into a store buffer, which holds them until the ROB commits the store instruction. This keeps speculative stores invisible to other cores. Store-to-load forwarding is an optimization where if a younger load reads from the same address as an older store still in the store buffer, the CPU forwards the data directly from the store buffer to the load without going to cache. However, forwarding can fail if the load is wider than the store, or if the addresses only partially overlap — in that case the load must wait for the store to commit to cache first, causing a pipeline stall.