CPU Architectures & Microarchitecture
1. The Execution Pipeline
Pipeline Stages (Classic 5-Stage)
Pipeline Hazards
2. Branch Prediction & Speculative Execution
Branch Predictor Internals
Speculative Execution
3. Out-of-Order (OoO) Execution
Key Components
4. Memory Consistency Models
TSO (Total Store Order) - x86
Weak Ordering - ARM / RISC-V
Memory Barriers
5. Cache Coherency (MESI Protocol)
MESI States
The “False Sharing” Problem
6. Privileged Execution & Protection Rings
x86 Rings
ARM Exception Levels (EL)
7. SIMD & Vector Processing
7. From ISA to OS ABI
7.1 Calling Conventions and System V ABI
7.2 Syscall ABIs
7.3 How OS Code Becomes “Architecture-Dependent”
Summary for Senior Engineers

CPU Architectures & Microarchitecture

To build high-performance operating systems, a senior engineer must look past the Instruction Set Architecture (ISA) and understand the microarchitecture: the physical implementation of that ISA. This chapter bridges the gap between digital logic and kernel-level abstractions.

1. The Execution Pipeline

Modern CPUs do not execute one instruction at a time. They use a Pipeline to overlap the execution of multiple instructions, similar to an assembly line.

Pipeline Stages (Classic 5-Stage)

Instruction Fetch (IF): Get instruction from memory/cache.
Instruction Decode (ID): Determine what the instruction does and read registers.
Execute (EX): Perform the actual calculation in the ALU.
Memory Access (MEM): Read or write data memory (usually the data cache) if the instruction is a load or store.
Write Back (WB): Update the register file with the result.

Pipeline Hazards

Hazards are situations that prevent the next instruction in the instruction stream from executing in its designated clock cycle.

Structural Hazard: Hardware conflict (e.g., two instructions need the same functional unit).
Data Hazard: Instruction depends on the result of a previous instruction still in the pipeline.
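- Solution: Data Forwarding (bypassing the register file to send a result straight from a later pipeline stage to the input of the execute stage). When forwarding cannot supply the value in time, as with a load followed immediately by an instruction that uses it, the pipeline stalls.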
Control Hazard: Caused by branches. The CPU doesn’t know which instruction to fetch next until the branch is resolved.
- Solution: Branch Prediction.
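
The benefit of overlapping stages, and the cost of stalls, can be put into numbers. The following is a minimal sketch under idealized assumptions (one instruction enters the pipeline per cycle, every stage takes one cycle); the function names are illustrative and not from any library.

```c
/* Idealized cycle counts for n instructions on a k-stage pipeline.
 * Assumption: one instruction issues per cycle and each stage takes one cycle. */

/* Without pipelining, each instruction occupies the whole datapath for k cycles. */
static unsigned long unpipelined_cycles(unsigned long n, unsigned stages)
{
    return n * stages;
}

/* With pipelining, the first instruction needs k cycles to drain through,
 * each later one completes one cycle after its predecessor, and every
 * hazard-induced stall (bubble) adds one cycle. */
static unsigned long pipelined_cycles(unsigned long n, unsigned stages,
                                      unsigned long stall_cycles)
{
    if (n == 0)
        return 0;
    return stages + (n - 1) + stall_cycles;
}
```

For 100 instructions on the 5-stage pipeline, this gives 500 cycles unpipelined versus 104 pipelined; 20 stall cycles from hazards raise the pipelined figure to 124.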

2. Branch Prediction & Speculative Execution

If the CPU waited for every branch to finish before fetching the next instruction, performance would collapse. Instead, it guesses the outcome.

Branch Predictor Internals

BHT (Branch History Table): A simple table indexed by the lower bits of the PC, storing the history of the branch (Taken/Not Taken).
PHT (Pattern History Table): A table of saturating counters indexed by a history pattern, commonly the global history (the outcome of the last N branches), to predict the current one.
TAGE Predictor: The modern gold standard. It uses multiple tables with different history lengths to capture both short-term and long-term patterns.
BTB (Branch Target Buffer): Stores the target address of previously taken branches so the CPU can start fetching the target before even decoding the instruction.
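
To make the simplest of these structures concrete, here is a sketch of a BHT built from 2-bit saturating counters, a common textbook design. The table size, the index function, and all names are assumptions for illustration; real predictors are considerably more elaborate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BHT_ENTRIES 1024u /* assumed table size (power of two) */

/* One 2-bit saturating counter per entry:
 * 0, 1 = predict not taken; 2, 3 = predict taken. */
typedef struct {
    uint8_t counter[BHT_ENTRIES];
} bht_t;

static void bht_init(bht_t *t)
{
    for (size_t i = 0; i < BHT_ENTRIES; i++)
        t->counter[i] = 1; /* start at "weakly not taken" */
}

/* Index by the lower bits of the PC (dropping the two lowest bits). */
static size_t bht_index(uint64_t pc)
{
    return (size_t)((pc >> 2) & (BHT_ENTRIES - 1));
}

static bool bht_predict(const bht_t *t, uint64_t pc)
{
    return t->counter[bht_index(pc)] >= 2;
}

/* Move the counter toward the real outcome, saturating at 0 and 3. */
static void bht_update(bht_t *t, uint64_t pc, bool taken)
{
    uint8_t *c = &t->counter[bht_index(pc)];
    if (taken) {
        if (*c < 3)
            (*c)++;
    } else {
        if (*c > 0)
            (*c)--;
    }
}

/* Simulate a loop-closing branch: taken on every iteration except the last,
 * with the whole loop run `rounds` times. Returns the misprediction count. */
static unsigned simulate_loop_branch(bht_t *t, uint64_t pc,
                                     unsigned iters, unsigned rounds)
{
    unsigned mispredicts = 0;
    for (unsigned r = 0; r < rounds; r++) {
        for (unsigned i = 0; i < iters; i++) {
            bool taken = (i + 1 < iters); /* last iteration falls through */
            if (bht_predict(t, pc) != taken)
                mispredicts++;
            bht_update(t, pc, taken);
        }
    }
    return mispredicts;
}
```

Running a 10-iteration loop three times yields 4 mispredictions out of 30 branches: one while the counter warms up, then one per loop exit. The second counter bit is what keeps a single loop exit from flipping the prediction for the next run of the loop.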

Speculative Execution

The CPU executes instructions along the predicted path.

On Success: The instructions are “retired” or “committed” to the architectural state.
On Failure: The pipeline is flushed, speculative results are discarded, and the CPU restarts from the correct path. This is a significant performance penalty (roughly 15-20 cycles on many modern cores).
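
A rough way to see why this penalty matters is to fold it into the average cycles per instruction (CPI). The formula below is a standard first-order model; the numbers plugged in (base CPI of 1.0, 20% branches, 5% misprediction rate, 20-cycle penalty) are illustrative assumptions, not measurements.

```latex
\[
\mathrm{CPI}_{\text{eff}}
  = \mathrm{CPI}_{\text{base}} + f_{\text{branch}} \cdot r_{\text{mispredict}} \cdot P_{\text{flush}}
  = 1.0 + 0.20 \cdot 0.05 \cdot 20
  = 1.2
\]
```

Under these assumptions, a predictor that is right 95% of the time still costs about 20% of throughput.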

3. Out-of-Order (OoO) Execution

Most high-performance cores, across x86, ARM, and RISC-V, are both superscalar (issuing several instructions per cycle) and out-of-order: they maximize throughput by executing instructions as soon as their operands are available, regardless of their original order in the program.

Key Components

Register Renaming: Maps “architectural registers” (e.g., RAX) to a much larger pool of “physical registers” to eliminate False Dependencies (WAW/WAR).
Reservation Stations (RS): Buffers that hold instructions waiting for data. Once all operands are ready, the instruction is dispatched to an execution unit.
Reorder Buffer (ROB): Ensures that even if instructions execute out of order, they commit in order. This maintains the “illusion” of sequential execution required for correct exception handling.
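
Register renaming is easiest to understand by tracing it. The following sketch assumes a tiny machine with 4 architectural and 16 physical registers and a free list that only ever allocates (a real core reclaims physical registers when instructions retire). All type and function names are hypothetical.

```c
#define NUM_ARCH_REGS 4
#define NUM_PHYS_REGS 16

typedef struct { int dst, src1, src2; } insn_t;    /* architectural register numbers */
typedef struct { int dst, src1, src2; } renamed_t; /* physical register numbers */

typedef struct {
    int map[NUM_ARCH_REGS]; /* current architectural -> physical mapping */
    int next_free;          /* simplified free list: bump allocator, never frees */
} rename_table_t;

static void rename_init(rename_table_t *rt)
{
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        rt->map[i] = i;
    rt->next_free = NUM_ARCH_REGS;
}

/* Rename one instruction. Sources read the current mapping (preserving true
 * RAW dependencies); the destination always gets a fresh physical register
 * (removing WAW and WAR conflicts). Returns 0 on success, -1 if no physical
 * register is free. */
static int rename_insn(rename_table_t *rt, insn_t in, renamed_t *out)
{
    if (rt->next_free >= NUM_PHYS_REGS)
        return -1;
    out->src1 = rt->map[in.src1];
    out->src2 = rt->map[in.src2];
    out->dst = rt->next_free++;
    rt->map[in.dst] = out->dst;
    return 0;
}
```

Consider the sequence r1 = r2 + r3; r2 = r0 + r0; r1 = r2 + r0. After renaming it becomes p4 = p2 + p3; p5 = p0 + p0; p6 = p5 + p0. The second instruction no longer overwrites a register the first one reads (WAR removed), the first and third write different registers (WAW removed), and the third still reads the second's result through p5 (RAW kept).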

4. Memory Consistency Models

The OS kernel must coordinate data between multiple cores. However, CPUs may reorder memory reads and writes for performance.

TSO (Total Store Order) - x86

Behavior: Stores are never reordered with other stores. Loads are never reordered with other loads. However, a Load can be reordered with an earlier Store to a different address.
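Kernel Impact: Relatively easy to program for. Ordinary loads and stores already behave much like acquire and release operations, so locks often don’t need explicit memory barriers for every access. A barrier (MFENCE or a LOCK-prefixed instruction) is still needed where a store must become visible before a later load executes.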

Weak Ordering - ARM / RISC-V

Behavior: Loads and stores to different addresses can be reordered with each other unless there is a dependency between them (for example, an address or data dependency) or an explicit barrier.
Kernel Impact: Harder to program. The kernel must use Memory Barriers (DMB in ARM, FENCE in RISC-V) to ensure that, for example, a “flag” is seen after the “data” it protects is written.

Memory Barriers

Load Barrier: Ensures all loads before the barrier are complete before any load after the barrier.
Store Barrier: Ensures all stores before the barrier are visible to other cores before any store after the barrier.
Full Barrier: Combines both.
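
In portable C, these barriers are usually expressed through C11 atomics rather than raw instructions. The sketch below applies them to the flag/data pattern: a release fence on the writer side and an acquire fence on the reader side. The names publish, try_consume, payload, and ready are illustrative. Note that C11 fences are somewhat stronger than pure load or store barriers, and the instructions a compiler emits for them vary by target (typically a DMB on ARM, and no instruction at all for these two fences on x86).

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;       /* plain data protected by the flag */
static atomic_bool ready; /* the "flag"; zero-initialized to false */

/* Writer: make the data visible before the flag. */
static void publish(int value)
{
    payload = value;
    /* Plays the store-barrier role: the payload write may not be
     * reordered after the flag write below. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: only touch the data after seeing the flag. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    /* Plays the load-barrier role: the payload read may not be
     * reordered before the flag read above. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

One thread would call publish() while another polls try_consume(). Without the two fences, a weakly ordered CPU (or the compiler) could let the reader observe the flag as set while still seeing the old payload.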

5. Cache Coherency (MESI Protocol)

Each core has its own private caches (typically L1 and L2). If Core A modifies a memory location that Core B has cached, Core B’s copy is now “stale.” The hardware uses a coherence protocol such as MESI (or a variant like MOESI or MESIF) to keep them in sync.

MESI States

Modified (M): The cache line is present only in the current cache and is “dirty” (different from RAM).
Exclusive (E): The cache line is present only in the current cache but matches RAM.
Shared (S): The cache line may also be present in other caches and matches RAM.
Invalid (I): The cache line holds no valid data; the next access must fetch it again.
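
The four states form a small state machine per cache line. Below is a sketch of the textbook transitions as seen from one cache; the event names and the others_have_copy parameter are assumptions for illustration, and real implementations add bus messages, write-backs, and transient states that are omitted here.

```c
typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_t;

typedef enum {
    LOCAL_READ,   /* this core loads from the line */
    LOCAL_WRITE,  /* this core stores to the line */
    REMOTE_READ,  /* another core's read is snooped */
    REMOTE_WRITE  /* another core's write (or invalidate) is snooped */
} mesi_event_t;

/* Next state of one cache line in this cache.
 * others_have_copy only matters when a local read misses (state I). */
static mesi_t mesi_next(mesi_t s, mesi_event_t e, int others_have_copy)
{
    switch (e) {
    case LOCAL_READ:
        if (s == MESI_I)
            return others_have_copy ? MESI_S : MESI_E;
        return s; /* a hit in M, E, or S changes nothing */
    case LOCAL_WRITE:
        /* From I or S the other copies must be invalidated first;
         * from E the upgrade needs no bus traffic. */
        return MESI_M;
    case REMOTE_READ:
        if (s == MESI_M || s == MESI_E)
            return MESI_S; /* M must also supply or write back the dirty data */
        return s;
    case REMOTE_WRITE:
        return MESI_I; /* our copy is now stale */
    }
    return s;
}
```

A line that starts Invalid, is read with no other sharers, written, read by another core, and finally written by that core walks through I, E, M, S, and back to I.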

The “False Sharing” Problem

If two threads on different cores modify different variables that happen to reside on the same cache line (typically 64 bytes), the cores will fight for ownership of that line, causing the line to “ping-pong” between caches.

Solution: Cache Alignment. Use __attribute__((aligned(64))) in C (GCC/Clang) to align and pad critical variables so that each one occupies its own cache line.
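
Applied to per-core counters, the alignment approach might look like the sketch below. It assumes a 64-byte line and GCC/Clang attribute syntax; the struct names and the choice of four cores are hypothetical.

```c
#define CACHE_LINE 64 /* assumed line size; typical, but not universal */

/* Aligning the struct type to the line size also rounds its size up to a
 * multiple of 64, so adjacent array elements never share a cache line. */
struct padded_counter {
    unsigned long value;
} __attribute__((aligned(CACHE_LINE)));

/* One counter per core; each core only ever writes its own element. */
struct per_core_counters {
    struct padded_counter core[4];
};

_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
               "each counter should fill exactly one cache line");
```

The trade-off is memory: each 8-byte counter now occupies 64 bytes, which is why this is reserved for hot, frequently written data.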

6. Privileged Execution & Protection Rings

The CPU hardware enforces the boundary between the OS and Applications.

x86 Rings

Ring 0: Kernel Mode. Access to all instructions (e.g., HLT, MOV CR3) and all memory.
Ring 3: User Mode. Restricted access.
Protection Transition: Triggered by interrupts, exceptions, or the SYSCALL instruction.

ARM Exception Levels (EL)

EL0: User Application.
EL1: OS Kernel.
EL2: Hypervisor.
EL3: Secure Monitor (TrustZone).

7. SIMD & Vector Processing

Modern workloads (ML, Crypto, Video) use SIMD (Single Instruction, Multiple Data) to process arrays of data in parallel.

x86: SSE, AVX, AVX-512 (512-bit registers).
ARM: NEON, SVE (Scalable Vector Extension).
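Kernel Responsibility: During a context switch, the OS must save the state of these massive registers. To save time, kernels have historically used Lazy FP/SIMD Saving: leaving the old state in the registers and only saving and restoring it if the next process actually tries to use them. Current x86 Linux kernels switch this state eagerly instead; lazy switching was also found to leak register contents through speculative execution (the LazyFP vulnerability, CVE-2018-3665).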
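
To show what “multiple data” means in practice without tying the example to one ISA, the sketch below uses the GCC/Clang vector extension (vector_size), which the compiler lowers to SSE, NEON, or similar where available. The type name v4f and function add4 are illustrative.

```c
#include <string.h>

/* Four packed 32-bit floats in one 128-bit vector (GCC/Clang extension). */
typedef float v4f __attribute__((vector_size(16)));

/* Add four pairs of floats with a single vector operation. */
static void add4(const float a[4], const float b[4], float out[4])
{
    v4f va, vb, vc;
    memcpy(&va, a, sizeof va); /* load four lanes */
    memcpy(&vb, b, sizeof vb);
    vc = va + vb;              /* one lane-wise add instead of four scalar adds */
    memcpy(out, &vc, sizeof vc);
}
```

Every user thread that executes code like this has live vector state, which is what the kernel must preserve across context switches.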

7. From ISA to OS ABI

So far we have focused on microarchitecture (pipelines, OoO, caches). The operating system also cares about the architectural contract between compiled binaries and the kernel, usually called the Application Binary Interface (ABI).

7.1 Calling Conventions and System V ABI

On Linux/x86-64, the System V ABI defines:

Which registers are used for function arguments (RDI, RSI, RDX, RCX, R8, R9).
Which registers a function must preserve (callee-saved) vs. can freely clobber.
How the stack frame is laid out (return address, saved base pointer, locals).

The kernel relies on this when:

Creating a new thread or process (it must fabricate a valid stack frame so that returning from the scheduler or execve() lands in user code correctly).
Delivering a signal (it builds a signal frame on the user stack that looks like a normal call frame, then adjusts RIP/RSP).

7.2 Syscall ABIs

Each architecture defines how syscalls are invoked from user space:

x86-64 Linux:
- Syscall number in RAX.
- Arguments in RDI, RSI, RDX, R10, R8, R9.
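- syscall instruction transitions to kernel, return value in RAX. The instruction itself overwrites RCX and R11 (with the return RIP and RFLAGS), which is why the fourth argument travels in R10 rather than RCX.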
ARM64 Linux:
- Syscall number in x8.
- Arguments in x0–x5.
- svc #0 traps into the kernel, return value in x0.

The OS must save the user-mode register state it is about to overwrite on every transition and restore it before returning, according to these rules. FP/SIMD state is typically saved and restored only when needed, for example at a context switch.
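
From C, the register-level details are normally hidden behind libc. A minimal sketch, assuming Linux with glibc or a compatible libc: the generic syscall() wrapper places the syscall number and arguments into the registers required by the current architecture and executes the trap instruction. The function name raw_getpid is hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke getpid by number instead of through its dedicated libc wrapper.
 * On x86-64 the number ends up in RAX and the result comes back in RAX;
 * on ARM64 the number goes in x8 and the result comes back in x0. */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

The same source compiles unchanged on both architectures; only the libc internals and the kernel entry code differ, which is exactly the split the ABI is meant to provide.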

7.3 How OS Code Becomes “Architecture-Dependent”

You will see architecture-specific directories in kernels, for example:

arch/x86/entry/ for syscall and interrupt entry/exit.
arch/arm64/mm/ for ARM64 page table layout and TLB handling.

The high-level logic (“schedule this task”, “map this page”) is often shared across architectures, but:

The exact sequence of instructions to switch page tables, flush TLBs, or enter low-power states is ISA-specific.
The OS must obey the memory model and barrier instructions of each ISA (TSO vs weak ordering).
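
This layering usually shows up as a small architecture-specific primitive behind a common name. The sketch below picks a full barrier instruction per ISA at compile time, falling back to a compiler builtin elsewhere. The names arch_full_barrier and arch_name are hypothetical and do not correspond to any particular kernel's API.

```c
/* Architecture-specific layer: one name, different instructions per ISA. */
static inline void arch_full_barrier(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("mfence" ::: "memory");
#elif defined(__aarch64__)
    __asm__ __volatile__("dmb ish" ::: "memory");
#elif defined(__riscv)
    __asm__ __volatile__("fence rw, rw" ::: "memory");
#else
    __sync_synchronize(); /* GCC/Clang builtin full barrier as a generic fallback */
#endif
}

/* Report which variant was compiled in. */
static const char *arch_name(void)
{
#if defined(__x86_64__)
    return "x86-64";
#elif defined(__aarch64__)
    return "arm64";
#elif defined(__riscv)
    return "riscv";
#else
    return "generic";
#endif
}
```

Shared code calls arch_full_barrier() without knowing which instruction it becomes, mirroring how generic kernel code relies on per-architecture headers.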

Understanding this bridge from ISA to ABI is what lets you read actual kernel code in arch/* directories and connect it back to the abstractions in the rest of this course.

Summary for Senior Engineers

Pipeline flushing is the hidden cost of unpredictable code (branches) and context switches.
Memory Ordering is the primary source of subtle concurrency bugs in cross-platform kernels.
Cache Coherency means memory is not a single global pool; it’s a hierarchy of ownership.
OoO Execution means the instruction order you see in the code is rarely the order in which the CPU actually executes it.
