CPU Architectures & Microarchitecture

To build high-performance operating systems, a “Senior Engineer” must look past the Instruction Set Architecture (ISA) and understand the Microarchitecture—the physical implementation of that ISA. This chapter bridges the gap between digital logic and kernel-level abstractions.

1. The Execution Pipeline

Modern CPUs do not execute one instruction at a time. They use a Pipeline to overlap the execution of multiple instructions, similar to an assembly line.

Pipeline Stages (Classic 5-Stage)

  1. Instruction Fetch (IF): Get instruction from memory/cache.
  2. Instruction Decode (ID): Determine what the instruction does and read registers.
  3. Execute (EX): Perform the actual calculation in the ALU.
  4. Memory Access (MEM): Read/Write data from RAM if needed.
  5. Write Back (WB): Update the register file with the result.

Pipeline Hazards

Hazards are situations that prevent the next instruction in the instruction stream from executing in its designated clock cycle.
  • Structural Hazard: Hardware conflict (e.g., two instructions need the same functional unit).
  • Data Hazard: Instruction depends on the result of a previous instruction still in the pipeline.
    • Solution: Data Forwarding (bypassing the register file to send results directly to the next stage); see the sketch after this list.
  • Control Hazard: Caused by branches. The CPU doesn’t know which instruction to fetch next until the branch is resolved.
    • Solution: Branch Prediction.
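
To make the data-hazard case concrete, below is a minimal C sketch (function names are illustrative). The first loop is one long read-after-write (RAW) chain: every iteration needs the previous result, so forwarding is all that keeps the pipeline from stalling between instructions. The second carries two independent chains that the pipeline can overlap.

    #include <stdint.h>

    /* One serial RAW chain: each multiply-add needs the previous x. */
    uint64_t dependent_chain(uint64_t x, int n) {
        for (int i = 0; i < n; i++)
            x = x * 3 + 1;       /* RAW hazard on x every iteration */
        return x;
    }

    /* Two independent chains: the hardware can keep both in flight. */
    uint64_t independent_chains(uint64_t a, uint64_t b, int n) {
        for (int i = 0; i < n; i++) {
            a = a * 3 + 1;       /* chain 1 */
            b = b * 3 + 1;       /* chain 2, no dependency on a */
        }
        return a ^ b;
    }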

2. Branch Prediction & Speculative Execution

If the CPU waited for every branch to finish before fetching the next instruction, performance would collapse. Instead, it guesses the outcome.

Branch Predictor Internals

  • BHT (Branch History Table): A simple table indexed by the lower bits of the PC, storing the recent Taken/Not-Taken behavior of each branch (typically as 2-bit saturating counters, so a single unusual outcome does not flip the prediction).
  • PHT (Pattern History Table): Uses global history (the outcome of the last N branches) to predict the current one.
  • TAGE Predictor: The modern gold standard. It uses multiple tables with different history lengths to capture both short-term and long-term patterns.
  • BTB (Branch Target Buffer): Stores the target address of successful branches so the CPU can start fetching the target before even decoding the instruction.
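
A classic way to observe these predictors is to run the same branch over sorted versus shuffled data; a minimal sketch (sum_large is an illustrative name):

    #include <stddef.h>
    #include <stdint.h>

    /* The if below compiles to one conditional branch. On sorted input
       its outcome is a long run of not-taken followed by a long run of
       taken, which a BHT/TAGE predictor learns almost perfectly; on
       shuffled input it is ~50/50 and mispredicts constantly, paying
       the pipeline-flush penalty described in the next subsection. */
    int64_t sum_large(const int *a, size_t n) {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (a[i] >= 128)
                sum += a[i];
        }
        return sum;
    }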

Speculative Execution

The CPU executes instructions along the predicted path.
  • On Success: The instructions are “retired” or “committed” to the architectural state.
  • On Failure: The pipeline is flushed, speculative results are discarded, and the CPU restarts from the correct path. This is a massive performance penalty (~15-20 cycles).

3. Out-of-Order (OoO) Execution

Modern superscalar CPUs, whether they implement x86, ARM, or RISC-V, use OoO execution to maximize throughput by executing instructions as soon as their operands are available, regardless of their original order in the program.

Key Components

  • Register Renaming: Maps “architectural registers” (e.g., RAX) to a much larger pool of “physical registers” to eliminate False Dependencies (WAW/WAR).
  • Reservation Stations (RS): Buffers that hold instructions waiting for data. Once all operands are ready, the instruction is dispatched to an execution unit.
  • Reorder Buffer (ROB): Ensures that even if instructions execute out of order, they commit in order. This maintains the “illusion” of sequential execution required for correct exception handling.
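
Software can play along with these structures. A common idiom is splitting one accumulator into several, which breaks a serial RAW chain into independent work for the renamer and scheduler; a minimal sketch:

    #include <stddef.h>

    /* Four independent accumulators = four dependency chains the OoO
       core can interleave. The ROB still retires the adds in program
       order, but they execute whenever their operands arrive. */
    double dot(const double *a, const double *b, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        double s = (s0 + s1) + (s2 + s3);
        for (; i < n; i++)      /* scalar tail */
            s += a[i] * b[i];
        return s;
    }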

4. Memory Consistency Models

The OS kernel must coordinate data between multiple cores. However, CPUs may reorder memory reads and writes for performance.

TSO (Total Store Order) - x86

  • Behavior: Stores are never reordered with other stores, and loads are never reordered with other loads. However, a later Load can be reordered ahead of an earlier Store to a different address, because stores wait in the store buffer before becoming globally visible.
  • Kernel Impact: Relatively easy to program for. Acquire/release-style locking needs no extra barrier instructions; only the Store-Load case requires an explicit fence (MFENCE) or a LOCK-prefixed instruction.

Weak Ordering - ARM / RISC-V

  • Behavior: Any Load/Store can be reordered with any other Load/Store unless there is a data dependency or an explicit barrier.
  • Kernel Impact: Harder to program. The kernel must use Memory Barriers (DMB in ARM, FENCE in RISC-V) to ensure that, for example, a “flag” is seen after the “data” it protects is written.
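
In portable C11 atomics, that flag-and-data pattern looks like the minimal sketch below; the compiler emits whatever barrier the target ISA needs (nothing extra on x86 TSO, DMB/STLR on ARM, FENCE on RISC-V):

    #include <stdatomic.h>

    int data;                       /* the payload                  */
    atomic_int ready;               /* the flag guarding it         */

    void producer(void) {
        data = 42;                  /* write the data first...      */
        atomic_store_explicit(&ready, 1,
                              memory_order_release);  /* ...then publish */
    }

    int consumer(void) {
        while (!atomic_load_explicit(&ready,
                                     memory_order_acquire))
            ;                       /* spin until the flag is set   */
        return data;                /* guaranteed to observe 42     */
    }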

Memory Barriers

  • Load Barrier: Ensures all loads before the barrier are complete before any load after the barrier.
  • Store Barrier: Ensures all stores before the barrier are visible to other cores before any store after the barrier.
  • Full Barrier: Combines both.

5. Cache Coherency (MESI Protocol)

Caches are local to each core. If Core A modifies a memory location that Core B has cached, Core B’s cache is now “stale.” The hardware uses the MESI Protocol to keep them in sync.

MESI States

  1. Modified (M): The cache line is present only in the current cache and is “dirty” (different from RAM).
  2. Exclusive (E): The cache line is present only in the current cache but matches RAM.
  3. Shared (S): The cache line is present in multiple caches and matches RAM.
  4. Invalid (I): The cache line holds no usable data and must be re-fetched from memory or another cache before use.

The “False Sharing” Problem

If two threads on different cores modify different variables that happen to reside on the same cache line (typically 64 bytes), the cores will fight for ownership of that line, causing the line to “ping-pong” between caches.
  • Solution: Cache Alignment. Use __attribute__((aligned(64))) in C to ensure critical variables are on their own lines.
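
For example, per-core counters can be padded so each one owns a full line; a minimal sketch assuming 64-byte lines, as above:

    #include <stdatomic.h>

    #define CACHE_LINE 64

    /* Each counter is forced onto its own 64-byte line; without the
       alignment, adjacent counters would share a line and ping-pong
       between the cores that update them. */
    struct percpu_counter {
        atomic_long value;
    } __attribute__((aligned(CACHE_LINE)));

    struct percpu_counter counters[8];   /* one per core */

    void counter_add(int cpu, long delta) {
        atomic_fetch_add_explicit(&counters[cpu].value, delta,
                                  memory_order_relaxed);
    }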

6. Privileged Execution & Protection Rings

The CPU hardware enforces the boundary between the OS and Applications.

x86 Rings

  • Ring 0: Kernel Mode. Access to all instructions (e.g., HLT, MOV CR3) and all memory.
  • Ring 3: User Mode. Restricted access.
  • Protection Transition: Triggered by interrupts, exceptions, or the SYSCALL instruction.

ARM Exception Levels (EL)

  • EL0: User Application.
  • EL1: OS Kernel.
  • EL2: Hypervisor.
  • EL3: Secure Monitor (TrustZone).

7. SIMD & Vector Processing

Modern workloads (ML, Crypto, Video) use SIMD (Single Instruction, Multiple Data) to process arrays of data in parallel.
  • x86: SSE, AVX, AVX-512 (512-bit registers).
  • ARM: NEON, SVE (Scalable Vector Extension).
  • Kernel Responsibility: During a context switch, the OS must save and restore these large register files. To save time, kernels have traditionally used Lazy FP/SIMD Saving: the switch leaves the old state in place and marks the unit unusable, and the state is saved/restored only when the new process actually executes an FP/SIMD instruction (which traps into the kernel).
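
As a taste of what user-space SIMD looks like, here is a sketch using x86 AVX intrinsics (it assumes a CPU with AVX and compilation with -mavx); one instruction adds eight floats at a time:

    #include <immintrin.h>
    #include <stddef.h>

    void add_arrays(float *dst, const float *a, const float *b, size_t n) {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);       /* load 8 floats   */
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i,
                             _mm256_add_ps(va, vb));  /* 8 adds at once  */
        }
        for (; i < n; i++)                            /* scalar tail     */
            dst[i] = a[i] + b[i];
    }

The __m256 values in this loop are exactly the state the kernel must preserve across a context switch, which is what motivates the lazy-saving strategy above.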

8. From ISA to OS ABI

So far we have focused on microarchitecture (pipelines, OoO, caches). The operating system also cares about the architectural contract between compiled binaries and the kernel, usually called the Application Binary Interface (ABI).

8.1 Calling Conventions and System V ABI

On Linux/x86-64, the System V ABI defines:
  • Which registers are used for function arguments (RDI, RSI, RDX, RCX, R8, R9); see the annotated sketch at the end of this subsection.
  • Which registers a function must preserve (callee-saved) vs. can freely clobber.
  • How the stack frame is laid out (return address, saved base pointer, locals).
The kernel relies on this when:
  • Creating a new thread or process (it must fabricate a valid stack frame so that returning from the scheduler or execve() lands in user code correctly).
  • Delivering a signal (it builds a signal frame on the user stack that looks like a normal call frame, then adjusts RIP/RSP).
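
The argument-register assignment is easy to see on a small function; a minimal sketch, with comments annotating where the System V ABI places each value:

    /* Integer/pointer arguments arrive in RDI, RSI, RDX, RCX, R8, R9,
       in that order; the integer result is returned in RAX. */
    long combine(long a,    /* RDI */
                 long b,    /* RSI */
                 long c,    /* RDX */
                 long d,    /* RCX */
                 long e,    /* R8  */
                 long f)    /* R9  */
    {
        return a + b + c + d + e + f;   /* sum returned in RAX */
    }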

8.2 Syscall ABIs

Each architecture defines how syscalls are invoked from user space (a worked x86-64 example follows this list):
  • x86-64 Linux:
    • Syscall number in RAX.
    • Arguments in RDI, RSI, RDX, R10, R8, R9.
    • syscall instruction transitions to kernel, return value in RAX.
  • ARM64 Linux:
    • Syscall number in x8.
    • Arguments in x0–x5.
    • svc #0 traps into the kernel.
The OS must save and restore the full architectural state (all relevant registers, flags, FP/SIMD state if used) according to these rules on every transition.
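
As a worked example, write(2) can be invoked directly with GCC/Clang inline assembly following the rules above (a minimal sketch; raw_write is an illustrative name):

    /* write(fd, buf, count) without libc: number in RAX, args in
       RDI/RSI/RDX; the syscall instruction clobbers RCX and R11. */
    static long raw_write(int fd, const void *buf, unsigned long count) {
        long ret;
        asm volatile("syscall"
                     : "=a"(ret)              /* return value in RAX  */
                     : "a"(1L),               /* __NR_write == 1      */
                       "D"((long)fd),         /* arg 1 -> RDI         */
                       "S"(buf),              /* arg 2 -> RSI         */
                       "d"(count)             /* arg 3 -> RDX         */
                     : "rcx", "r11", "memory");
        return ret;                           /* byte count or -errno */
    }

A call such as raw_write(1, "hi\n", 3) writes to stdout; on failure the kernel returns a small negative number that libc would normally translate into errno.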

8.3 How OS Code Becomes “Architecture-Dependent”

You will see architecture-specific directories in kernels, for example:
  • arch/x86/entry/ for syscall and interrupt entry/exit.
  • arch/arm64/mm/ for ARM64 page table layout and TLB handling.
The high-level logic (“schedule this task”, “map this page”) is often shared across architectures, but:
  • The exact sequence of instructions to switch page tables, flush TLBs, or enter low-power states is ISA-specific.
  • The OS must obey the memory model and barrier instructions of each ISA (TSO vs weak ordering).
Understanding this bridge from ISA to ABI is what lets you read actual kernel code in arch/* directories and connect it back to the abstractions in the rest of this course.

Summary for Senior Engineers

  • Pipeline flushing is the hidden cost of unpredictable code (branches) and context switches.
  • Memory Ordering is the primary source of subtle concurrency bugs in cross-platform kernels.
  • Cache Coherency means memory is not a single global pool; it’s a hierarchy of ownership.
  • OoO Execution means the code you see is almost never the code the CPU executes.
Next: Process Management