The System Boot Process

Booting an operating system is a series of “handoffs” in which each stage initializes more hardware and increases the CPU’s capability until the full kernel is in control. Think of it like a relay race: the firmware sprints the first leg, hands the baton to the bootloader, which hands it to the kernel, which finally hands it to userspace. Each runner can only do what the previous runner set up for it: the bootloader cannot load kernel drivers because the kernel has not initialized its driver model yet, and the kernel cannot mount the real root filesystem because the storage driver might live inside the initramfs. This chain of trust and capability is where the “magic” of the hardware-software boundary happens.
Mastery Level: Senior Systems Engineer
Key Internals: Reset Vector 0xFFFFFFF0, CR0/CR4/EFER registers, GDT/IDT layout, Page Table bootstrapping
Prerequisites: CPU Architectures, Memory Management

1. The Reset Vector: The CPU’s First Breath

When you press the power button, the CPU is in a state of “Real Mode” (16-bit) but with a twist. It does not start at address 0x0000.
  • The Address: On x86-64, the CPU begins execution at 0xFFFFFFF0 (16 bytes below the 4GB mark). This is analogous to a newborn’s first reflex: it is hardwired, not learned. Every x86 CPU since the 80386 wakes up reaching for this exact address (the 8086 and 80286 used the top of their smaller 1MB and 16MB address spaces instead).
  • The Hidden Base: While in 16-bit mode the address space is normally 1MB, at reset the Code Segment (CS) register holds the selector 0xF000 with a hidden base of 0xFFFF0000, and IP is 0xFFF0. Thus CS:IP resolves to base + IP = 0xFFFFFFF0, at the top of the 4GB space, which the motherboard maps to the Flash ROM containing the BIOS or UEFI. The hidden base lasts only until the first far jump reloads CS.
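The arithmetic behind the hidden base can be checked with a short sketch. The constants are the architectural reset values described above; the function names are illustrative and not taken from any firmware source.

```python
# Reset values on 80386-and-later x86 CPUs, as described in the text above.
CS_HIDDEN_BASE = 0xFFFF0000   # cached segment base loaded at reset
CS_SELECTOR = 0xF000          # visible CS selector at reset
IP_AT_RESET = 0xFFF0          # instruction pointer at reset


def first_fetch_address(base: int, ip: int) -> int:
    """Linear address of the first instruction fetch: hidden base + IP."""
    return (base + ip) & 0xFFFFFFFF


def real_mode_address(selector: int, offset: int) -> int:
    """Ordinary Real Mode arithmetic: selector * 16 + offset."""
    return (selector << 4) + offset


reset_vector = first_fetch_address(CS_HIDDEN_BASE, IP_AT_RESET)
print(hex(reset_vector))                                  # 0xfffffff0
print(hex(real_mode_address(CS_SELECTOR, IP_AT_RESET)))   # 0xffff0
print((1 << 32) - reset_vector)                           # 16
```

The second line shows what the same CS:IP pair would address once the hidden base is gone: the familiar 0xFFFF0 location just below 1MB.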
Practical tip: If you are debugging a system that does not POST (no display, fans spinning), the CPU is stuck somewhere between the reset vector and firmware initialization. Check the motherboard’s POST code display (a two-digit hex readout on server boards) — it tells you exactly which firmware stage failed.

2. Firmware: BIOS vs. UEFI

The firmware’s job is to perform POST (Power-On Self Test) and find a bootable device. Think of the firmware as the building superintendent who turns on the lights, checks the plumbing, and unlocks the front door before the tenants (the OS) arrive for the day.
┌─────────────────────────────────────────────────────────────────────┐
│                      BIOS VS UEFI COMPARISON                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  BIOS (Legacy)                          UEFI (Modern)               │
│  ┌───────────────────────────────┐     ┌──────────────────────────┐│
│  │                               │     │                          ││
│  │  1. Power On                  │     │  1. Power On             ││
│  │     • CPU starts in Real Mode │     │     • CPU in 32/64-bit   ││
│  │     • 16-bit addressing       │     │     • Full addressing    ││
│  │                               │     │                          ││
│  │  2. POST (Power-On Self Test) │     │  2. POST + Init          ││
│  │     • Memory check            │     │     • DXE (Driver Exec)  ││
│  │     • Hardware detection      │     │     • Load drivers       ││
│  │                               │     │                          ││
│  │  3. Find Boot Device          │     │  3. Boot Manager         ││
│  │     • Check boot order        │     │     • Read EFI variables ││
│  │     • Read first sector (MBR) │     │     • Load from ESP      ││
│  │     • 512 bytes max           │     │     • Read FAT32 FS      ││
│  │                               │     │                          ││
│  │  4. Load & Execute MBR        │     │  4. Load EFI Application ││
│  │     • Jump to 0x7C00          │     │     • .efi PE/COFF exec  ││
│  │     • 446 bytes of code       │     │     • Graphics, mouse    ││
│  │     • Chain load bootloader   │     │     • Full environment   ││
│  │                               │     │                          ││
│  │  Limitations:                 │     │  Advantages:             ││
│  │  • 2TB disk max (32-bit LBA)  │     │  • 9.4ZB disk (GPT)      ││
│  │  • 4 primary partitions       │     │  • 128 partitions        ││
│  │  • No security features       │     │  • Secure Boot           ││
│  │  • Slow int 13h I/O           │     │  • Fast block I/O        ││
│  │  • Text mode only             │     │  • GUI support           ││
│  │                               │     │                          ││
│  └───────────────────────────────┘     └──────────────────────────┘│
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

2.1 Legacy BIOS (Basic Input/Output System)

  • Mode: Runs entirely in 16-bit Real Mode.
  • Disk Format: Uses MBR (Master Boot Record). The BIOS reads the first 512-byte sector of the disk and jumps to it.
  • Limitations: Max 2TB disks, 4 primary partitions, slow interrupt-based I/O.
MBR Structure:
┌─────────────────────────────────────────────────────────────────────┐
│                  MBR (Master Boot Record) Layout                    │
│                          512 Bytes Total                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Offset 0x000 - 0x1BD (446 bytes): Bootstrap Code                  │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  • First-stage bootloader code                                 │ │
│  │  • Loaded at 0x7C00 in memory                                  │ │
│  │  • Jumps to partition boot sector or loads stage 2             │ │
│  │  • Example: GRUB stage 1                                       │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  Offset 0x1BE - 0x1FD (64 bytes): Partition Table                  │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  Partition 1 (16 bytes):                                       │ │
│  │  ┌──────────────────────────────────────────────────────────┐ │ │
│  │  │  Boot flag    │ Type    │ Start LBA  │ Size (sectors)   │ │ │
│  │  │  0x80 (active)│ 0x83 (Linux) │ 2048  │ 204800          │ │ │
│  │  └──────────────────────────────────────────────────────────┘ │ │
│  │  Partition 2 (16 bytes): [similar structure]                  │ │
│  │  Partition 3 (16 bytes): [similar structure]                  │ │
│  │  Partition 4 (16 bytes): [similar structure]                  │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  Offset 0x1FE - 0x1FF (2 bytes): Boot Signature                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  0x55 0xAA  ← Must be present for BIOS to recognize bootable  │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
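As a rough illustration of the layout above, the following Python sketch decodes the four partition entries from a 512-byte sector. It runs against a synthetic MBR built in memory with the example values from the diagram; the function name parse_mbr and the dictionary keys are illustrative choices. Each 16-byte entry also carries two 3-byte CHS fields that the diagram omits.

```python
import struct


def parse_mbr(sector: bytes) -> list:
    """Return the four partition entries of a 512-byte MBR sector."""
    if len(sector) != 512:
        raise ValueError("an MBR is exactly 512 bytes")
    if sector[510:512] != b"\x55\xAA":
        raise ValueError("missing 0x55 0xAA boot signature")
    entries = []
    for i in range(4):
        raw = sector[446 + 16 * i: 446 + 16 * (i + 1)]
        # boot flag, CHS start (3 bytes), type, CHS end (3 bytes),
        # start LBA (u32 little-endian), sector count (u32 little-endian)
        flag, _chs_start, ptype, _chs_end, lba, count = struct.unpack("<B3sB3sII", raw)
        entries.append({"active": flag == 0x80, "type": ptype,
                        "start_lba": lba, "sectors": count})
    return entries


# Synthetic MBR: empty bootstrap area, one active Linux partition, signature.
entry = struct.pack("<B3sB3sII", 0x80, b"\0\0\0", 0x83, b"\0\0\0", 2048, 204800)
mbr = bytes(446) + entry + bytes(48) + b"\x55\xAA"
first = parse_mbr(mbr)[0]
print(first)
```

To try it on a real disk, read the first 512 bytes of the device (which normally requires root) and pass them to parse_mbr.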
BIOS Boot Process:
; BIOS loads MBR to 0x7C00 and jumps here
org 0x7C00
bits 16

start:
    ; Disable interrupts while the stack is being relocated -- an interrupt
    ; arriving between the SS and SP updates would push onto a bogus stack
    cli

    ; Set up segments -- zero out DS/ES/SS so all memory references
    ; use a flat 0-based address space in Real Mode
    xor ax, ax
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov sp, 0x7C00    ; Stack grows downward from MBR load address

    ; Re-enable interrupts now that the stack is valid --
    ; BIOS INT handlers need a working stack
    sti

    ; Load the rest of the bootloader from disk using BIOS INT 13h.
    ; This is the ONLY way to do disk I/O before drivers exist.
    mov ah, 0x02      ; BIOS "Read Sectors" function
    mov al, 1         ; Read 1 sector (512 bytes)
    mov ch, 0         ; Cylinder 0
    mov cl, 2         ; Sector 2 (sector 1 is the MBR itself)
    mov dh, 0         ; Head 0
    ; DL already holds the boot drive number passed in by the BIOS
    ; (0x80 = first hard disk), so it is deliberately left untouched
    mov bx, 0x7E00    ; Destination: right after MBR in memory
    int 0x13          ; Call BIOS disk interrupt

    jc error          ; Carry flag set = disk read failed

    ; Hand off to the second stage we just loaded
    jmp 0x7E00

error:
    mov si, error_msg
    call print_string
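.hang:
    cli               ; Mask interrupts so they cannot wake the CPU
    hlt               ; Halt -- nothing else we can do
    jmp .hang         ; An NMI can still resume execution; halt again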

print_string:
    lodsb
    or al, al
    jz .done
    mov ah, 0x0E      ; BIOS teletype output
    int 0x10          ; Call BIOS video interrupt
    jmp print_string
.done:
    ret

error_msg: db 'Boot error!', 0

times 510-($-$$) db 0  ; Pad to exactly 510 bytes
dw 0xAA55              ; Magic boot signature -- BIOS checks for this

2.2 Modern UEFI (Unified Extensible Firmware Interface)

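  • Mode: The CPU still resets in Real Mode, but the firmware switches to 32-bit Protected Mode or 64-bit Long Mode within its first few instructions, so UEFI drivers and boot applications run in 32-bit or 64-bit mode from the start.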
  • Disk Format: Uses GPT (GUID Partition Table) and a dedicated EFI System Partition (ESP).
  • The Protocol: Instead of jumping to a sector, UEFI understands filesystems (FAT32) and loads PE/COFF executables (e.g., grubx64.efi).
GPT Structure:
┌─────────────────────────────────────────────────────────────────────┐
│                  GPT (GUID Partition Table) Layout                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  LBA 0: Protective MBR                                              │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  Legacy MBR for backward compatibility                          │ │
│  │  Single partition entry covering entire disk                    │ │
│  │  Type: 0xEE (GPT Protective)                                    │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  LBA 1: GPT Header (Primary)                                        │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  Signature: "EFI PART"                                          │ │
│  │  Revision: 0x00010000                                           │ │
│  │  Header Size: 92 bytes                                          │ │
│  │  CRC32 checksum                                                 │ │
│  │  Current LBA: 1                                                 │ │
│  │  Backup LBA: (last LBA on disk)                                 │ │
│  │  First usable LBA: 34                                           │ │
│  │  Last usable LBA: (disk size - 34)                              │ │
│  │  Disk GUID: unique identifier                                   │ │
│  │  Partition entries start: LBA 2                                 │ │
│  │  Number of partition entries: 128                               │ │
│  │  Size of partition entry: 128 bytes                             │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  LBA 2-33: Partition Entry Array                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  Entry 1: EFI System Partition                                 │ │
│  │  ┌──────────────────────────────────────────────────────────┐ │ │
│  │  │  Partition type GUID:                                     │ │ │
│  │  │  C12A7328-F81F-11D2-BA4B-00A0C93EC93B (ESP)               │ │ │
│  │  │  Unique partition GUID: (random)                          │ │ │
│  │  │  First LBA: 2048                                          │ │ │
│  │  │  Last LBA: 1048575                                        │ │ │
│  │  │  Attributes: 0x00 (no special flags)                      │ │ │
│  │  │  Partition name: "EFI System"                             │ │ │
│  │  └──────────────────────────────────────────────────────────┘ │ │
│  │                                                                 │ │
│  │  Entry 2: Root Partition                                       │ │
│  │  ┌──────────────────────────────────────────────────────────┐ │ │
│  │  │  Partition type GUID:                                     │ │ │
│  │  │  0FC63DAF-8483-4772-8E79-3D69D8477DE4 (Linux FS)         │ │ │
│  │  │  First LBA: 1048576                                       │ │ │
│  │  │  Last LBA: ...                                            │ │ │
│  │  └──────────────────────────────────────────────────────────┘ │ │
│  │                                                                 │ │
│  │  Entries 3-128: (unused or additional partitions)              │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  LBA 34+: Partition Data                                            │
│  LBA (end-32 to end-1): Backup Partition Entry Array                │
│  LBA (end): Backup GPT Header                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
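The LBA numbers in the diagram follow from simple arithmetic: 128 entries of 128 bytes occupy 16,384 bytes, which is 32 sectors of 512 bytes. The sketch below derives the whole layout from the disk size. It assumes 512-byte logical sectors and the default entry count; the function name and dictionary keys are illustrative.

```python
SECTOR_SIZE = 512     # assumed logical sector size
ENTRY_COUNT = 128     # default number of partition entries
ENTRY_SIZE = 128      # bytes per partition entry


def gpt_layout(total_sectors: int) -> dict:
    """Compute where each GPT structure lives on a disk of the given size."""
    array_sectors = (ENTRY_COUNT * ENTRY_SIZE + SECTOR_SIZE - 1) // SECTOR_SIZE
    last_lba = total_sectors - 1
    return {
        "primary_header": 1,
        "primary_entries": (2, 2 + array_sectors - 1),
        "first_usable": 2 + array_sectors,
        "last_usable": last_lba - array_sectors - 1,
        "backup_entries": (last_lba - array_sectors, last_lba - 1),
        "backup_header": last_lba,
    }


layout = gpt_layout(2_097_152)   # a 1 GiB disk
for name, value in layout.items():
    print(f"{name:16} {value}")
```

For the 1 GiB example, the primary entry array lands on LBA 2-33, the first usable LBA is 34, and the last usable LBA is the sector count minus 34.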
UEFI Boot Process:
┌─────────────────────────────────────────────────────────────────────┐
│                      UEFI BOOT FLOW                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. SEC (Security Phase)                                            │
│     • CPU microcode validation                                      │
│     • Cache-as-RAM setup (CAR)                                      │
│     • Temporary memory before DRAM init                             │
│                                                                     │
│  2. PEI (Pre-EFI Initialization)                                    │
│     • Memory controller initialization                              │
│     • Initialize DRAM                                               │
│     • Copy firmware to RAM                                          │
│     • Prepare for DXE phase                                         │
│                                                                     │
│  3. DXE (Driver Execution Environment)                              │
│     • Load hardware drivers                                         │
│     • Initialize PCI, USB, SATA, etc.                               │
│     • Build ACPI tables                                             │
│     • Set up UEFI Boot Services and Runtime Services                │
│                                                                     │
│  4. BDS (Boot Device Selection)                                     │
│     • Read boot order from NVRAM                                    │
│     • BootOrder variable: {0003, 0001, 0002}                        │
│     • Boot0001 = "ubuntu" -> \EFI\ubuntu\shimx64.efi                │
│     • Boot0002 = "Windows" -> \EFI\Microsoft\Boot\bootmgfw.efi      │
│     • Boot0003 = "USB" -> \EFI\Boot\bootx64.efi                     │
│     • Scan ESP (EFI System Partition) on GPT disks                  │
│     • Look for removable media fallback paths                       │
│                                                                     │
│  5. Load EFI Application                                            │
│     • Mount ESP (FAT32 filesystem)                                  │
│     • Load bootloader (e.g., grubx64.efi, shimx64.efi)              │
│     • Provide Boot Services:                                        │
│       - LocateProtocol() - Find device drivers                      │
│       - LoadImage() - Load executables                              │
│       - StartImage() - Execute loaded image                         │
│       - AllocatePool() - Memory allocation                          │
│       - OpenProtocol() - Access device functions                    │
│                                                                     │
│  6. TSL (Transient System Load)                                     │
│     • Bootloader takes control                                      │
│     • Can return to boot menu if needed                             │
│     • Eventually calls ExitBootServices()                           │
│                                                                     │
│  7. RT (Runtime)                                                    │
│     • OS takes over                                                 │
│     • Boot Services terminated                                      │
│     • Runtime Services still available:                             │
│       - GetVariable() / SetVariable() - NVRAM access                │
│       - GetTime() / SetTime() - Hardware clock                      │
│       - ResetSystem() - Reboot/shutdown                             │
│       - UpdateCapsule() - Firmware updates                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
UEFI Services:
  • Boot Services: Used by the bootloader (e.g., to read files). Terminated once the OS starts.
  • Runtime Services: Available even after the OS boots (e.g., setting UEFI variables, NVRAM).
Example: Reading UEFI Variables from Linux:
# List all UEFI variables
efivar -l

# Read boot order
efivar -p -n 8be4df61-93ca-11d2-aa0d-00e098032b8c-BootOrder

# Read specific boot entry
efivar -p -n 8be4df61-93ca-11d2-aa0d-00e098032b8c-Boot0001

# Set new boot entry (requires root)
efibootmgr -c -d /dev/sda -p 1 -L "My Linux" -l "\EFI\linux\vmlinuz.efi"

# Change boot order
efibootmgr -o 0003,0001,0002
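The same BootOrder data can be decoded by hand. On Linux, efivarfs exposes each variable as a file (for BootOrder, /sys/firmware/efi/efivars/BootOrder-8be4df61-93ca-11d2-aa0d-00e098032b8c) whose first four bytes are the attribute flags, followed by the payload. For BootOrder the payload is an array of little-endian 16-bit entry numbers. The sketch below works on a synthetic byte string rather than a live system, and parse_boot_order is an illustrative name.

```python
import struct


def parse_boot_order(efivarfs_bytes: bytes):
    """Split an efivarfs BootOrder file into (attributes, Boot#### names)."""
    attributes = struct.unpack_from("<I", efivarfs_bytes, 0)[0]
    payload = efivarfs_bytes[4:]
    if len(payload) % 2:
        raise ValueError("BootOrder payload must be whole 16-bit values")
    ids = struct.unpack(f"<{len(payload) // 2}H", payload)
    return attributes, [f"Boot{n:04X}" for n in ids]


# Synthetic sample: attributes 0x7 (non-volatile, boot-service and runtime
# access), followed by the order 0003, 0001, 0002 used in the text above.
sample = struct.pack("<I3H", 0x7, 0x0003, 0x0001, 0x0002)
attrs, order = parse_boot_order(sample)
print(hex(attrs), order)
```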
Secure Boot:
┌─────────────────────────────────────────────────────────────────────┐
│                      UEFI SECURE BOOT                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Goal: Prevent unauthorized code from running during boot           │
│                                                                     │
│  Key Databases (stored in NVRAM):                                   │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  PK (Platform Key)                                              │ │
│  │  • Single key, owned by OEM                                     │ │
│  │  • Controls access to KEK database                              │ │
│  │                                                                 │ │
│  │  KEK (Key Exchange Keys)                                        │ │
│  │  • List of keys that can update db/dbx                          │ │
│  │  • Typically includes Microsoft KEK and the OEM's own KEK       │ │
│  │                                                                 │ │
│  │  db (Signature Database - Whitelist)                            │ │
│  │  • Certificates/hashes of allowed bootloaders                   │ │
│  │  • Microsoft Windows cert, shim cert (for Linux), etc.          │ │
│  │                                                                 │ │
│  │  dbx (Forbidden Signatures Database - Blacklist)                │ │
│  │  • Known-bad signatures (revoked certificates)                  │ │
│  │  • Updated via Windows Update or Linux vendor                   │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  Boot Flow with Secure Boot:                                        │
│  1. UEFI loads bootloader image                                     │
│  2. Check signature against db (whitelist)                          │
│  3. Check signature against dbx (blacklist)                         │
│  4. If valid: execute bootloader                                    │
│  5. If invalid: refuse to boot, show error                          │
│                                                                     │
│  Linux Secure Boot:                                                 │
│  • Most distros use "shim" bootloader                               │
│  • shim.efi is signed with Microsoft key (in db)                    │
│  • shim embeds distro cert; MOK (Machine Owner Key) is user-added   │
│  • shim verifies and loads grub/kernel signed by either key         │
│                                                                     │
│  Chain of Trust:                                                    │
│  UEFI → shim.efi (MS-signed) → grubx64.efi (distro-signed)          │
│       → vmlinuz (distro-signed) → kernel modules (kernel-signed)    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
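The allow/deny decision in the boot flow above can be modeled in a few lines. This is a deliberately simplified sketch: it handles hash entries only and hashes the raw bytes with SHA-256, whereas real firmware computes an Authenticode digest of the PE image and also validates X.509 certificate chains. The function name and the sample "images" are hypothetical.

```python
import hashlib


def secure_boot_allows(image: bytes, db: set, dbx: set) -> bool:
    """Allow an image only if its hash is in db and not revoked in dbx."""
    digest = hashlib.sha256(image).hexdigest()
    if digest in dbx:          # the forbidden list always wins
        return False
    return digest in db


good = b"trusted bootloader"
revoked = b"bootloader with a known vulnerability"
unknown = b"unsigned bootloader"

db = {hashlib.sha256(good).hexdigest(), hashlib.sha256(revoked).hexdigest()}
dbx = {hashlib.sha256(revoked).hexdigest()}

for name, image in [("good", good), ("revoked", revoked), ("unknown", unknown)]:
    print(name, secure_boot_allows(image, db, dbx))
```

Checking dbx first reflects the point of revocation: an image that was once allowed must stop booting as soon as its hash is blacklisted, even if it is still present in db.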

3. The Road to 64-bit: Mode Transitions

The most complex part of the boot process is transitioning the CPU from its 16-bit legacy state to full 64-bit Long Mode.

Step 1: Real Mode to Protected Mode (32-bit)

Think of this transition as upgrading from a bicycle to a car — you gain speed and capability, but you need to learn new rules of the road first (segment selectors instead of direct segment registers).
  1. Disable Interrupts: cli (Clear Interrupt Flag) to prevent interrupts before the IDT is ready. An interrupt arriving now would jump to an undefined handler and triple-fault.
  2. Enable A20 Gate: A legacy hack to allow addressing above 1MB. Without it, address bit 20 is forced to zero, and any address above 1MB wraps around.
  3. Load GDT: The Global Descriptor Table defines how memory segments work. The CPU cannot enter Protected Mode without one — it would not know the privilege level or limits of any segment.
  4. Set CR0.PE: Set the Protection Enable bit in the CR0 register. This single bit-flip changes how the CPU interprets every subsequent memory access.
  5. Far Jump: A special jump that flushes the CPU pipeline and loads the new 32-bit Code Segment. This is necessary because the pipeline still contains instructions decoded under Real Mode rules.
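The effect of the A20 gate from step 2 is easy to see numerically. The sketch below models a Real Mode address calculation with address bit 20 either passed through or forced to zero; the function name is illustrative.

```python
A20_MASK = ~(1 << 20)   # clears address bit 20


def physical_address(segment: int, offset: int, a20_enabled: bool) -> int:
    """Real Mode address (segment * 16 + offset), optionally with bit 20 forced to 0."""
    linear = (segment << 4) + offset
    return linear if a20_enabled else linear & A20_MASK


# FFFF:0010 is the first byte above 1MB.
print(hex(physical_address(0xFFFF, 0x0010, a20_enabled=True)))    # 0x100000
print(hex(physical_address(0xFFFF, 0x0010, a20_enabled=False)))   # 0x0
```

With the gate disabled, the access that should land at 1MB silently aliases address 0, which is why the gate must be opened before any code or data is placed above 1MB.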

Step 2: Protected Mode to Long Mode (64-bit)

This is upgrading from the car to a jet — the CPU gains a vastly larger address space (256 TB virtual), but the runway checklist is strict: paging is mandatory, and the CPU will refuse to take off without it.
  1. Set CR4.PAE: Physical Address Extension is required for 64-bit mode. PAE widens page table entries from 32 to 64 bits, which is a prerequisite for the wider address space.
  2. Setup Page Tables: You must enable paging to enter Long Mode. Early boot code builds a minimal “Identity Map” (Virtual Address = Physical Address), for example covering the first 1GB of memory, and loads the physical address of the top-level table into CR3. Without this, the very next instruction fetch after enabling paging would fail, because the CPU would try to translate the instruction pointer through page tables that do not map it.
  3. Set EFER.LME: Set the Long Mode Enable bit in the Extended Feature Enable Register. This arms the transition but does not activate it yet.
  4. Set CR0.PG: Enable Paging. Combined with EFER.LME, the CPU is now in “Compatibility Mode” — a transitional state that runs 32-bit code under 64-bit paging.
  5. Final Far Jump: A jump to a 64-bit code segment officially enters Long Mode. The pipeline flush ensures the CPU starts decoding 64-bit instructions.
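The two step lists above boil down to four control bits plus the code segment's L flag. The sketch below is a simplified model of how those bits determine the operating mode. The bit positions are architectural (CR0.PE is bit 0, CR0.PG is bit 31, CR4.PAE is bit 5, EFER.LME is bit 8); the function cpu_mode is illustrative and ignores the fault a real CPU raises if paging is enabled with LME set but PAE clear.

```python
CR0_PE = 1 << 0     # Protection Enable
CR0_PG = 1 << 31    # Paging
CR4_PAE = 1 << 5    # Physical Address Extension
EFER_LME = 1 << 8   # Long Mode Enable


def cpu_mode(cr0: int, cr4: int, efer: int, cs_long: bool) -> str:
    """Name the operating mode implied by the control bits (simplified)."""
    if not cr0 & CR0_PE:
        return "real mode"
    long_active = bool(cr0 & CR0_PG) and bool(cr4 & CR4_PAE) and bool(efer & EFER_LME)
    if not long_active:
        return "protected mode"
    return "64-bit mode" if cs_long else "compatibility mode"


# Replay the transition sequence described above.
cr0 = cr4 = efer = 0
print(cpu_mode(cr0, cr4, efer, False))   # real mode
cr0 |= CR0_PE
print(cpu_mode(cr0, cr4, efer, False))   # protected mode
cr4 |= CR4_PAE
efer |= EFER_LME
print(cpu_mode(cr0, cr4, efer, False))   # protected mode (armed, not active)
cr0 |= CR0_PG
print(cpu_mode(cr0, cr4, efer, False))   # compatibility mode
print(cpu_mode(cr0, cr4, efer, True))    # 64-bit mode (after the far jump)
```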
Practical tip: If a custom bootloader crashes or resets (triple-faults) immediately after enabling paging, the most likely cause is a missing or incorrect identity mapping. Until paging is on, the instruction pointer is effectively a physical address, so the page tables must map that exact address to itself.

4. Kernel Data Structures: GDT and IDT

The kernel must define how it will handle memory and interrupts before it can do anything else.

4.1 The GDT (Global Descriptor Table)

The GDT is an array of 8-byte descriptors (in Long Mode the TSS descriptor takes two slots, 16 bytes). Even in “Flat” 64-bit mode where segmentation is mostly unused, the GDT is required to define:
  • Kernel Code Segment: Ring 0, Executable, Readable.
  • Kernel Data Segment: Ring 0, Readable, Writable.
  • User Code/Data Segments: Ring 3.
  • TSS (Task State Segment): Holds the stack pointers the CPU switches to when an interrupt arrives from a lower privilege level (RSP0) or uses an Interrupt Stack Table (IST) entry.
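To make the 8-byte descriptor format concrete, the sketch below packs the code and data descriptors listed above. The field layout is the architectural one; the helper name gdt_entry and the particular access/flag values (a common flat-model choice) are illustrative rather than taken from any specific kernel.

```python
import struct


def gdt_entry(base: int, limit: int, access: int, flags: int) -> bytes:
    """Pack a 32-bit base, 20-bit limit, access byte and 4-bit flags."""
    return struct.pack(
        "<HHBBBB",
        limit & 0xFFFF,                                # limit bits 0-15
        base & 0xFFFF,                                 # base bits 0-15
        (base >> 16) & 0xFF,                           # base bits 16-23
        access,                                        # P, DPL, S, type
        ((flags & 0xF) << 4) | ((limit >> 16) & 0xF),  # G, D/B, L + limit 16-19
        (base >> 24) & 0xFF,                           # base bits 24-31
    )


gdt = [
    bytes(8),                              # mandatory null descriptor
    gdt_entry(0, 0xFFFFF, 0x9A, 0xA),      # kernel code: ring 0, L=1 (64-bit)
    gdt_entry(0, 0xFFFFF, 0x92, 0xC),      # kernel data: ring 0, writable
    gdt_entry(0, 0xFFFFF, 0xFA, 0xA),      # user code: ring 3, L=1
    gdt_entry(0, 0xFFFFF, 0xF2, 0xC),      # user data: ring 3, writable
]
for index, descriptor in enumerate(gdt):
    print(f"selector {index * 8:#04x}: {int.from_bytes(descriptor, 'little'):#018x}")
```

The selector for each entry is simply its byte offset into the table, which is why kernel code conventionally ends up at selector 0x08 in layouts like this one.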

4.2 The IDT (Interrupt Descriptor Table)

The IDT maps interrupt vectors (0-255) to handler functions.
  • Vectors 0-31: Reserved for CPU exceptions (Divide by Zero, Page Fault, etc.).
  • Vectors 32-255: Available for hardware interrupts and system calls.
  • Gate Types: Interrupt Gates (clear IF), Trap Gates (don’t clear IF).
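In Long Mode each IDT entry is a 16-byte gate that splits the 64-bit handler address into three pieces. The sketch below packs one and shows the only difference between the two gate types, the low nibble of the type byte. The helper name idt_gate and the handler address are hypothetical; selector 0x08 assumes kernel code is the first GDT entry after the null descriptor.

```python
import struct

INTERRUPT_GATE = 0xE   # CPU clears IF on entry
TRAP_GATE = 0xF        # CPU leaves IF unchanged


def idt_gate(handler: int, selector: int, gate_type: int, dpl: int = 0, ist: int = 0) -> bytes:
    """Pack a 16-byte Long Mode IDT gate descriptor."""
    type_attr = 0x80 | ((dpl & 0x3) << 5) | (gate_type & 0xF)   # P=1, DPL, type
    return struct.pack(
        "<HHBBHII",
        handler & 0xFFFF,                # offset bits 0-15
        selector,                        # code segment selector
        ist & 0x7,                       # Interrupt Stack Table index
        type_attr,
        (handler >> 16) & 0xFFFF,        # offset bits 16-31
        (handler >> 32) & 0xFFFFFFFF,    # offset bits 32-63
        0,                               # reserved
    )


HANDLER = 0xFFFFFFFF81000000             # hypothetical handler address
page_fault = idt_gate(HANDLER, 0x08, INTERRUPT_GATE)
breakpoint_gate = idt_gate(HANDLER, 0x08, TRAP_GATE, dpl=3)
print(page_fault.hex())
print(hex(page_fault[5]), hex(breakpoint_gate[5]))   # 0x8e 0xef
```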

5. The Kernel Entry Sequence (Linux)

Once the bootloader (GRUB) loads the kernel into memory, it jumps to the kernel’s entry point.

5.1 Decompression (head_64.S)

The Linux kernel is actually a self-extracting executable (vmlinuz, where the “z” signals a compressed image). Think of it like a zip file that contains its own unzip utility at the front.
  1. The early code sets up a temporary stack, just enough to run C code.
  2. It decompresses the “real” kernel image into a higher memory address. The algorithm is a build-time choice (gzip, XZ, LZ4, and ZSTD are among the options); LZ4 and ZSTD are popular because they decompress quickly.
  3. It jumps to the decompressed kernel’s entry point, leaving the compressed image behind.

5.2 The start_kernel() Function

This is the “Big Bang” of the operating system. It is architecture-independent C code (finally, after pages of assembly) that:
  1. setup_arch(): Handles CPU-specific initialization — detecting features, calibrating timers, building the memory map from firmware-provided tables (E820 or UEFI memory map).
  2. mm_init() (renamed mm_core_init() in recent kernels): Initializes the full Buddy Allocator and Slab Allocator. Before this point, memory allocation is done through a crude “memblock” early allocator.
  3. sched_init(): Sets up the scheduler and the “Idle” task (the task that runs when there is nothing else to do — it puts the CPU into a low-power halt state).
  4. rest_init(): Spawns Process 1 (init) and Process 2 (kthreadd). From this moment, the system is running real processes with a real scheduler.
Practical tip: To see how long each initialization function takes, boot with initcall_debug on the kernel command line. The kernel then logs every initcall (the driver and subsystem init functions that run once start_kernel() has handed off to the init thread) together with its duration, making it straightforward to identify slow subsystem init.

6. The Handover to User Space

6.1 Initramfs (Initial RAM Filesystem)

The kernel cannot mount the real root disk immediately — it is a chicken-and-egg problem: the storage driver needed to read the disk might itself be on the disk. Initramfs solves this by giving the kernel a tiny “starter kit” already in memory. Think of it like a toolbox that a mechanic brings to a broken-down car. The car has all the tools in the trunk, but you cannot open the trunk until the car is running. The toolbox has just enough to get the engine started.
  1. The bootloader loads a small CPIO archive into memory (initrd/initramfs).
  2. The kernel unpacks it into rootfs, a RAM-backed filesystem that is already mounted as /.
  3. It runs /init from the ramfs, which loads necessary drivers (NVMe, RAID, LVM, LUKS encryption) and finally “Switches Root” to the real disk.
Practical tip: To inspect what is inside your initramfs, run lsinitrd (on Fedora/RHEL) or lsinitramfs (on Debian/Ubuntu). If your system fails to boot with “unable to mount root fs,” the most common cause is a missing storage driver in the initramfs. Rebuild it with dracut or update-initramfs.
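For a feel of what lsinitrd and lsinitramfs walk through, here is a minimal sketch that lists the file names in an uncompressed cpio archive in the “newc” format used for initramfs images. It runs on a tiny archive built in memory. Real images are usually compressed and may consist of several concatenated archives, which this sketch does not handle; the helper names are illustrative.

```python
def align4(n: int) -> int:
    """Round up to the next multiple of four."""
    return (n + 3) & ~3


def make_entry(name: str, data: bytes, mode: int = 0o100755) -> bytes:
    """Build one newc entry: 110-byte ASCII header, name, data, each padded."""
    name_b = name.encode() + b"\0"
    fields = [0, mode, 0, 0, 1, 0, len(data), 0, 0, 0, 0, len(name_b), 0]
    out = b"070701" + b"".join(b"%08X" % f for f in fields) + name_b
    out += b"\0" * (-len(out) % 4)
    out += data
    return out + b"\0" * (-len(out) % 4)


def list_cpio_newc(archive: bytes) -> list:
    """Return the file names stored in an uncompressed newc archive."""
    names, pos = [], 0
    while True:
        header = archive[pos:pos + 110]
        if header[:6] != b"070701":
            raise ValueError("not a newc cpio header")
        # 13 fields of 8 hex digits; index 6 is filesize, index 11 is namesize
        fields = [int(header[6 + 8 * i: 14 + 8 * i], 16) for i in range(13)]
        filesize, namesize = fields[6], fields[11]
        name_start = pos + 110
        name = archive[name_start:name_start + namesize - 1].decode()
        if name == "TRAILER!!!":        # end-of-archive marker
            return names
        names.append(name)
        data_start = align4(name_start + namesize)
        pos = align4(data_start + filesize)


archive = (make_entry("init", b"#!/bin/sh\n")
           + make_entry("etc/fstab", b"")
           + make_entry("TRAILER!!!", b"", mode=0))
print(list_cpio_newc(archive))
```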

6.2 PID 1: systemd / SysV init

The final step is to execute the first user-space process, the “root of the process tree.” Every user-space process on the system is a descendant of PID 1 (kernel threads descend from kthreadd, PID 2).
  • Path: /sbin/init (or whatever the init= kernel parameter specifies). On modern distributions this is almost always systemd.
  • The PID 1 Rule: If this process ever exits, the kernel triggers a Kernel Panic. This is by design: PID 1 adopts orphaned processes and is responsible for reaping them. Without it, zombie processes would accumulate with no parent to collect their exit status.
Practical tip: If you need to debug an unbootable system, pass init=/bin/bash on the kernel command line. The kernel will drop you into a root shell instead of starting systemd, letting you fix configuration issues, repair filesystems, or reset passwords. The root filesystem is usually still mounted read-only at this point, so run mount -o remount,rw / before changing files. Remember to exec /sbin/init when done to start the system normally.

7. Interview Deep Dive: Senior Level

On the original 8086, memory addresses wrapped around at 1MB. When the 80286 arrived, it could address more, but some old programs relied on the wrap-around behavior. IBM added a gate on address line A20 (the 21st line, counting from A0) to manually enable/disable the wrap-around. Even today, x86 systems may come up with A20 disabled for compatibility, so a BIOS bootloader must make sure it is enabled before using memory above 1MB.
The kernel is compiled as a Position Independent Executable (PIE) or uses Relative Addressing. Early boot code uses the Instruction Pointer (RIP) relative instructions to find data. Once the kernel sets up the initial page tables and enables paging, it can “jump” into the virtual address space it has defined.
When the kernel enables paging (setting CR0.PG), the CPU immediately begins interpreting all addresses as virtual. If the kernel didn’t “Identity Map” (Map Virtual 0x1234 to Physical 0x1234) the code it is currently executing, the very next instruction fetch would fail because the MMU wouldn’t know where to find the code, resulting in an immediate crash.
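The identity-mapping argument can be simulated without touching real hardware. The sketch below splits a virtual address into its four page-table indices and models a page directory of 2MB pages that maps the first 1GB onto itself. The flag bit positions are architectural; the function names and the single-level translate helper are simplifications for illustration.

```python
PRESENT, WRITABLE, HUGE = 1 << 0, 1 << 1, 1 << 7
TWO_MIB = 2 * 1024 * 1024


def page_table_indices(vaddr: int) -> dict:
    """Split a virtual address into 4-level paging indices (9 bits each)."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,
        "pdpt": (vaddr >> 30) & 0x1FF,
        "pd": (vaddr >> 21) & 0x1FF,
        "pt": (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }


def identity_map_first_gib() -> list:
    """512 page-directory entries, each mapping one 2MB page onto itself."""
    return [(i * TWO_MIB) | PRESENT | WRITABLE | HUGE for i in range(512)]


def translate(pd: list, vaddr: int) -> int:
    """Translate an address below 1GB through a page directory of 2MB pages."""
    entry = pd[(vaddr >> 21) & 0x1FF]
    if not entry & PRESENT:
        raise LookupError("page fault: address not mapped")
    return (entry & ~(TWO_MIB - 1)) | (vaddr & (TWO_MIB - 1))


pd = identity_map_first_gib()
print(hex(translate(pd, 0x100000)))               # 0x100000: maps to itself
print(page_table_indices(0xFFFFFFFF80000000))     # pml4 511, pdpt 510

pd[0] = 0                                         # "forget" the first 2MB
try:
    translate(pd, 0x100000)
except LookupError as err:
    print(err)
```

Clearing the entry that covers the running code reproduces the failure described above: the very next translation of the instruction pointer has nowhere to go.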

8. Advanced Practice

  1. GDT Inspector: Use gdb and QEMU (-s -S) to inspect the GDT of a booting kernel. From gdb, monitor info registers prints the GDT base and limit, which you can then dump from memory with the x command.
  2. Early Printk: Add a printk("Hello from early boot!\n"); to the start_kernel function in a Linux source tree and compile it. Observe when the message appears during boot.
  3. UEFI Shell: Boot into a UEFI shell and use the ls and map commands to see how the firmware sees your disks and partitions.

9. Checklist: From Power-On to Login Prompt

Use this quick reference to understand (or debug) the complete boot sequence:

Phase-by-Phase Checklist

┌─────────────────────────────────────────────────────────────────────┐
│         BOOT SEQUENCE CHECKLIST (UEFI + Linux)                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  PHASE 1: HARDWARE (< 1 sec)                                        │
│  □ Power supply stabilizes                                          │
│  □ CPU executes reset vector (0xFFFFFFF0)                           │
│  □ UEFI firmware loads from flash ROM                               │
│                                                                     │
│  PHASE 2: FIRMWARE (2-10 sec)                                       │
│  □ SEC: CPU microcode, Cache-as-RAM                                 │
│  □ PEI: Memory controller init, DRAM available                      │
│  □ DXE: Load drivers (USB, SATA, NVMe, Graphics)                    │
│  □ BDS: Read BootOrder from NVRAM                                   │
│  □ Secure Boot: Validate bootloader signature                       │
│                                                                     │
│  PHASE 3: BOOTLOADER (1-3 sec)                                      │
│  □ GRUB/systemd-boot loads from ESP                                 │
│  □ Display boot menu (if configured)                                │
│  □ Load kernel (vmlinuz) + initramfs into RAM                       │
│  □ Pass kernel command line parameters                              │
│  □ ExitBootServices() - firmware hands off to kernel                │
│                                                                     │
│  PHASE 4: KERNEL EARLY (1-5 sec)                                    │
│  □ Decompress kernel (if compressed)                                │
│  □ Setup GDT, IDT, page tables                                      │
│  □ Transition: Real → Protected → Long Mode (BIOS boot only)        │
│  □ start_kernel(): Initialize subsystems                            │
│  □ Mount initramfs as /                                             │
│  □ Run /init from initramfs                                         │
│                                                                     │
│  PHASE 5: INITRAMFS (1-5 sec)                                       │
│  □ Load essential drivers (storage, filesystem, LVM, RAID)          │
│  □ Find real root partition                                         │
│  □ Decrypt root (if LUKS encrypted)                                 │
│  □ switch_root to real filesystem                                   │
│                                                                     │
│  PHASE 6: INIT SYSTEM (2-10 sec)                                    │
│  □ systemd (PID 1) starts                                           │
│  □ Mount remaining filesystems (/etc/fstab)                         │
│  □ Start services (parallel by dependencies)                        │
│  □ Reach default.target (multi-user or graphical)                   │
│  □ Spawn getty on TTYs / Display Manager                            │
│                                                                     │
│  ═══════════════════════════════════════════════════════════════    │
│  LOGIN PROMPT APPEARS!                                              │
└─────────────────────────────────────────────────────────────────────┘

Debugging Boot Issues

Symptom                  | Phase        | Debug Command
-------------------------+--------------+-----------------------------------
No display, fans spin    | Firmware     | Check POST codes, clear CMOS
“No bootable device”     | Firmware/ESP | efibootmgr -v, check ESP mount
GRUB rescue prompt       | Bootloader   | Boot from live USB, reinstall GRUB
Kernel panic early       | Kernel early | Add debug to kernel cmdline
initramfs drops to shell | Initramfs    | Check /dev, load missing drivers
systemd fails            | Init system  | systemctl --failed, journalctl -b

Quick Boot Time Analysis

# Total boot time breakdown
systemd-analyze

# Blame: which services took longest
systemd-analyze blame | head -10

# Critical path visualization
systemd-analyze critical-chain

# Full boot chart (generates SVG)
systemd-analyze plot > boot.svg
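
To compare the firmware, loader, kernel, and userspace split across boots, the summary line printed by systemd-analyze can be broken into one phase per line. The sketch below assumes the usual "Startup finished in ... = ..." summary format; phase_times is a hypothetical helper name, and the sample line uses made-up numbers rather than a real measurement.

```shell
# phase_times: hypothetical helper. Reads systemd-analyze output on stdin and
# prints "<phase> <time>" pairs, one per line.
phase_times() {
    grep '^Startup finished in ' |
        sed -e 's/^Startup finished in //' -e 's/ = .*$//' |
        tr '+' '\n' |
        sed -e 's/^ *//' -e 's/ *$//' -e 's/^\(.*\) (\(.*\))$/\2 \1/'
}

# Example with a captured line (illustrative numbers only):
echo 'Startup finished in 4.520s (firmware) + 2.011s (loader) + 1.845s (kernel) + 6.330s (userspace) = 14.707s' | phase_times
```

On a live host the equivalent call would be systemd-analyze | phase_times. Times that include minutes are passed through as text, so any numeric comparison would need extra parsing.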


Production Caveats: Where the Boot Sequence Bites Real Systems

Boot is the part of the stack with the worst observability. Logs do not exist yet. Console output may not redirect anywhere. The kernel is half-initialized. Everything assumes the previous stage worked, and when it does not, you get a blinking cursor with no diagnostic.
Booting bites senior engineers in these specific ways:
  1. Early-boot debugging is hostile. No syslog, no journald, no SSH. If the kernel panics before start_kernel() completes, you have only the boot console — and on cloud instances, that console may be a serial line that nobody is watching. earlyprintk= and earlycon= on the kernel command line route output to the serial port early; without these flags, an early panic is invisible. On EC2, the “System Log” in the AWS console captures serial output but lags by minutes.
  2. Secure Boot key compromise scenarios. If Microsoft’s UEFI signing key (or one of the OEM platform keys) were compromised, an attacker could sign a malicious bootloader that the firmware would happily run. The dbx (revoked-signature database) update mechanism exists for this, but updating dbx everywhere takes years. The 2020 BootHole vulnerability (CVE-2020-10713) in GRUB allowed a buffer overflow in grub.cfg parsing to bypass Secure Boot; vendors had to ship dbx updates that revoked the vulnerable GRUB and shim builds. Rolling out dbx updates broke many systems with custom-signed kernels because the revocation also blocked legitimate-but-old shims.
  3. BIOS vs UEFI assumptions break on hybrid systems. Some firmware supports both BIOS (Legacy CSM) and UEFI boot. If your installer detects UEFI and writes a GPT disk with an ESP, but someone later switches the firmware to legacy (CSM) boot, or a firmware update resets the boot mode, the firmware may try to BIOS-boot a disk that has no BIOS boot code and fail mysteriously. If you see “no bootable media” after a firmware update, check whether the firmware’s boot mode matches how the disk was installed.
  4. initramfs without the right driver bricks the boot. Add LVM to your root device but forget to regenerate initramfs and the system will not boot the next kernel. Switch from BIOS to UEFI install but the initramfs still has BIOS-only modules and you get an early panic. The number-one cause of “machine that booted yesterday will not boot today” is a stale or wrong initramfs.
  5. The first few seconds are unobservable to your monitoring. Datadog and Prometheus agents do not start until well into systemd’s user-space phase. If something breaks between kernel init and systemd reaching multi-user.target, your metrics will show “host disappeared” with no indication of why. Use IPMI/BMC console logging, EC2 system logs, or a dedicated boot-time monitoring pipeline (systemd-bootchart, journalctl --boot).
Solutions and patterns the senior engineer reaches for:
  • Always set earlycon=, console=ttyS0, and panic=10 on production kernels. These three flags transform “silent panic” into “panic logged and rebooted within 10 seconds.” For cloud VMs add console=tty0 console=ttyS0,115200n8 to log to both the framebuffer and serial.
  • Test recovery before you need it. Boot with init=/bin/bash once a year on a spare host. Practice repairing a broken initramfs from a rescue ISO. The five-minute drill prevents the four-hour 3am incident.
  • Use systemd-boot or rEFInd over GRUB on UEFI when you can. GRUB has had dozens of CVEs (BootHole in 2020 and the follow-up batch often called BootHole2). systemd-boot is simpler: a much smaller body of EFI code with no scripting language to exploit. For dual-boot or fancy menus, GRUB is often the practical choice; for simple single-OS production servers, systemd-boot is the more conservative choice.
  • Encrypt with TPM-bound disk encryption. LUKS keys sealed against TPM PCRs only unseal if the boot chain (firmware, bootloader, kernel) hashes match. An attacker who replaces the kernel with a malicious one cannot unseal the disk. Combined with Secure Boot, this is the defense against evil-maid attacks.
  • Validate boot artifacts in CI. After every kernel package build, boot the resulting image in QEMU with the production initramfs. Catches “broken kernel” before it reaches a production host.
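  • Save GRUB rescue commands to a runbook. From the full grub> prompt, set root=(hd0,gpt2); linux /vmlinuz root=/dev/sda2; initrd /initrd.img; boot starts most systems. From the more limited grub rescue> prompt the linux command is not loaded yet: first run set prefix=(hd0,gpt2)/boot/grub; insmod normal; normal to get the full shell or menu back. Device names such as (hd0,gpt2) and /dev/sda2 vary per machine. Document it; you will need it at 3am.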
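
As a rough way to audit the three flags recommended above, a host's kernel command line can be checked for them. This is a minimal sketch: missing_boot_flags is a hypothetical helper, and matching by prefix is an assumption that treats any console=ttyS0,... variant as acceptable.

```shell
# missing_boot_flags: hypothetical helper. Prints the recommended flags that
# are absent from the given kernel command line (empty output means all set).
missing_boot_flags() {
    # $1: a kernel command line, e.g. the contents of /proc/cmdline
    missing=""
    for want in earlycon console=ttyS0 panic=; do
        found=no
        # Unquoted on purpose: split the command line into words.
        for arg in $1; do
            case "$arg" in
                "$want"*) found=yes ;;
            esac
        done
        [ "$found" = yes ] || missing="$missing $want"
    done
    printf '%s\n' "${missing# }"
}

# Example with a sample command line that only sets the serial console:
missing_boot_flags "root=/dev/sda2 ro quiet console=ttyS0,115200n8"
```

On a running system the input would be "$(cat /proc/cmdline)". The same check could run in a fleet audit job so that hosts without early console output are found before they panic silently.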

Senior Interview Questions: Boot Process Mastery

Strong Answer Framework:
  1. The architectural difference. BIOS is a 16-bit Real Mode firmware that boots a 512-byte sector via the MBR. UEFI is a 32 or 64-bit firmware that runs PE/COFF executables (.efi files) from the FAT-formatted EFI System Partition. The shift is from “execute a sector” to “load and execute a program.”
  2. Disk capacity. BIOS with MBR caps disks at 2.2TB (32-bit LBA). UEFI with GPT supports disks up to 9.4 zettabytes and 128 partitions natively. This alone forced the migration — enterprise storage demanded GPT a decade ago.
  3. Boot speed. UEFI can boot in parallel: device init, driver loading, and bootloader load can overlap. BIOS boots strictly serially through INT calls. A modern UEFI server boots firmware-to-bootloader in 5 to 15 seconds; equivalent BIOS would be 30 to 60. Some of this is actually firmware quality, but UEFI’s architecture enables it.
  4. Security. UEFI has Secure Boot: signed-image verification before execution. BIOS has nothing comparable; any sector at LBA 0 ending in the 0x55AA boot signature runs. Secure Boot enables a verifiable trust chain (firmware verifies bootloader, bootloader verifies kernel, kernel verifies modules). For server workloads handling regulated data, this is increasingly mandatory.
  5. Programming model. UEFI provides Boot Services (filesystem access, network, graphics) and Runtime Services (NVRAM variables, time, reset). Bootloaders can be full C programs that call these services. BIOS only had INT calls and 16-bit conventions; everything was assembly or near-assembly.
  6. The migration cost. UEFI brought new failure modes: corrupted NVRAM (efivars) bricking machines, signed-blob mistakes locking out admins, BootHole-style attacks on bootloader code that does file parsing. The complexity is real.
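
The capacity limits in point 2 follow directly from the width of the sector address. The sketch below works in TiB so the GPT figure fits in 64-bit shell arithmetic; max_disk_tib is a hypothetical helper, and 512-byte sectors are assumed as in the figures quoted above.

```shell
# max_disk_tib: largest addressable disk in TiB for a given LBA width.
# $1 = LBA field width in bits, $2 = log2(sector size), 9 for 512-byte sectors.
# 1 TiB = 2^40 bytes, so the result is 2^(bits + log2(sector) - 40).
max_disk_tib() {
    echo $(( 1 << ($1 + $2 - 40) ))
}

max_disk_tib 32 9   # MBR: 32-bit LBA, 512-byte sectors
max_disk_tib 64 9   # GPT: 64-bit LBA, 512-byte sectors
```

The first call gives 2 TiB, which is the roughly 2.2 TB MBR limit in decimal units. The second gives 2^33 TiB, which is 8 ZiB or about 9.4 zettabytes.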
Real-World Example: When Lenovo shipped the BootHole (CVE-2020-10713) GRUB vulnerability fix in late 2020, several thousand customers reported bricked laptops because the dbx update revoked the bootloader signature on installed Linux distributions. The fix — a clean UEFI variable update — required Lenovo to ship a separate firmware utility. This is the failure mode of high-stakes UEFI updates: “patch the security hole” can equal “brick the fleet” if not coordinated with distro vendors.
Senior follow-up: What does the “Compatibility Support Module” (CSM) do, and why is it being phased out? CSM is a UEFI shim that emulates BIOS — letting a UEFI firmware boot a legacy BIOS-style MBR. Useful during the migration era, but it is a huge attack surface (basically running 16-bit Real Mode code from the firmware). Intel mandated CSM removal from new platforms starting in 2020. Modern firmware is UEFI-only; if you have Windows 7 disks, they will not boot.
Senior follow-up: Is UEFI’s complexity worth it for an embedded system or appliance? Often not. For a single-purpose appliance (router, embedded controller), U-Boot or a direct flash bootloader is simpler, smaller, and easier to audit. UEFI shines when you need a flexible boot environment (multi-OS, network boot, recovery) on commodity hardware. For a static workload, simpler is better.
Senior follow-up: What is the smallest credible attack surface on a UEFI system? The firmware update mechanism. Most UEFI firmware can be reflashed from the OS through capsule updates (the UpdateCapsule() runtime service). A kernel-level attacker can persist by reflashing firmware with a backdoor. Mitigations: Boot Guard (Intel), Platform Secure Boot (PSB on AMD), and signed firmware updates. None of these eliminate the risk entirely; all of them require correct OEM implementation.
Common Wrong Answers:
  1. “UEFI is faster because BIOS is single-threaded.” Misleading. Boot speed is dominated by hardware init (DRAM training, PCIe link training), not by the firmware language. UEFI’s parallelism helps but is not the dominant factor.
  2. “Secure Boot makes systems unhackable.” Secure Boot only protects the boot path. After the OS loads, normal kernel exploits still apply. Secure Boot is one layer, not the answer.
Further Reading:
  • UEFI Specification, version 2.10 (uefi.org) — the formal reference.
  • Adam Williamson’s “UEFI boot: how does that actually work, then?” (happyassassin.net).
  • Eclypsium’s BootHole technical write-up (2020) — shows what an actual UEFI attack looks like.
Strong Answer Framework:
  1. The trust hierarchy. Four NVRAM databases: PK (Platform Key, single, OEM-controlled), KEK (Key Exchange Keys, allow updating db/dbx), db (allowlist of trusted signing certificates), dbx (denylist of revoked signatures). The firmware will only execute boot-stage binaries whose signature chains to a cert in db, unless the binary’s hash is also in dbx (which overrides).
  2. The chain of trust at boot. UEFI firmware verifies shimx64.efi against db (shim is signed by Microsoft’s third-party UEFI CA). shim embeds the distro’s vendor certificate and also consults the MOK (Machine Owner Key) list, and verifies grubx64.efi against either. GRUB then asks shim to verify the kernel against the same keys. Modern kernels verify their own modules against the kernel keyring. Each link breaks if its signature is invalid.
  3. What Secure Boot prevents. Bootkit malware (replacing the bootloader to persist below the OS), unauthorized OS substitution (booting a malicious kernel that masquerades as the legitimate one), evil-maid attacks (briefly accessing a machine to install a tampered bootloader). Combined with TPM-sealed disk encryption, also prevents extracting data from a stolen disk.
  4. What Secure Boot does not prevent. Vulnerabilities in signed bootloaders (BootHole was a buffer overflow in grub.cfg parsing: the GRUB binary was correctly signed, but it processed attacker-controlled data unsafely). User-space malware (Secure Boot’s checks end once the kernel is loaded). Kernel exploits (privilege escalation, container escapes). Hardware attacks (DMA, JTAG, reflashing the firmware itself with an external flash programmer). Physical-layer attacks (cold boot RAM extraction).
  5. The dbx update problem. When a vulnerability is found in a signed bootloader, the response is to add the bootloader’s hash to dbx. But if the same bootloader is installed across millions of machines, mass-revoking it disrupts everything. Vendors must coordinate revocations carefully, ship updated bootloaders first, then revoke old ones. The lag from disclosure to safe revocation is often months to years.
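
The db/dbx decision described in point 1 can be modelled as a small lookup. This is a simplified sketch, not how firmware stores or checks signatures: both databases are plain space-separated lists here, and the certificate and hash names are hypothetical placeholders.

```shell
# in_list NEEDLE "space separated list": succeeds if NEEDLE is one of the words.
in_list() {
    case " $2 " in
        *" $1 "*) return 0 ;;
    esac
    return 1
}

# secure_boot_verdict HASH SIGNER DB DBX
# DB and DBX are space-separated lists of certificate names or image hashes.
# A dbx match always wins over a db match.
secure_boot_verdict() {
    if in_list "$1" "$4" || in_list "$2" "$4"; then
        echo "deny: revoked in dbx"
    elif in_list "$1" "$3" || in_list "$2" "$3"; then
        echo "allow"
    else
        echo "deny: not in db"
    fi
}

db="vendor-uefi-ca"            # hypothetical trusted signer
dbx="hash-of-vulnerable-grub"  # hypothetical revoked image hash
secure_boot_verdict "hash-of-patched-grub" "vendor-uefi-ca" "$db" "$dbx"
secure_boot_verdict "hash-of-vulnerable-grub" "vendor-uefi-ca" "$db" "$dbx"
```

The second call shows the revocation problem from point 5: the signer is still trusted, so only the dbx entry stops the old binary, and every machine needs that entry before it is protected.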
Real-World Example: The 2023 BlackLotus UEFI bootkit bypassed Secure Boot on fully-patched Windows 11 systems by exploiting CVE-2022-21894 (Baton Drop) in a still-signed Windows bootloader. Microsoft revoked the affected bootloader’s hash via dbx update — but the dbx slot was finite (about 130 slots). Microsoft had to coordinate with OEMs to expand the dbx storage capacity in firmware updates. The lesson: Secure Boot is a defense, not an absolute, and revocation logistics constrain its usefulness.
Senior follow-up: What is the difference between Secure Boot and Measured Boot? Secure Boot rejects untrusted binaries at load time. Measured Boot computes hashes of every loaded component into TPM PCRs (Platform Configuration Registers) but does not gate execution. Measured Boot enables remote attestation: a server can request the PCR values, verify them against expected hashes, and trust the system only if they match. Secure Boot prevents bad code from running; Measured Boot proves to a third party what code actually ran.
Senior follow-up: If a kernel is signed but vulnerable, does Secure Boot help? No. Secure Boot validates signatures, not code correctness. A signed kernel with a privilege escalation bug runs just fine until the bug is exploited. The defense in depth here is “kernel hardening” (CFI, KASLR, page table isolation) on top of Secure Boot, not Secure Boot alone.
Senior follow-up: Can I use Secure Boot with a custom kernel? Yes, via the MOK enrollment path: generate your own key with openssl, sign your kernel with sbsign, enroll the public key into the MOK database with mokutil --import. From the next boot, your custom kernel passes verification. The catch is that mokutil requires a physically-present user to confirm enrollment via the MOK Manager UI — this is anti-evil-maid by design, but it makes automated provisioning harder.
Common Wrong Answers:
  1. “Secure Boot prevents all malware.” No. It prevents pre-OS malware. Anything running after the kernel loads is unaffected.
  2. “You should disable Secure Boot if you run Linux.” Linux supports Secure Boot via shim. There is no reason to disable it. Disabling it removes a security layer for no benefit.
Further Reading:
  • UEFI Specification 2.10, chapter 32 (Secure Boot and Driver Signing).
  • Matthew Garrett’s blog series on shim and Secure Boot for Linux (mjg59.dreamwidth.org).
  • Eclypsium’s BlackLotus analysis (2023) — a practical bootkit bypass and how dbx responded.
Strong Answer Framework:
  1. Confirm hardware is alive. Power LED on, fans spinning, drive activity? If not, the problem is below the boot stack: power supply (PSU) or motherboard. If yes, hardware is at least running through POST.
  2. Get serial console access if you do not have it. On servers, IPMI/iDRAC/iLO almost always has a virtual serial console. On cloud instances, AWS has “EC2 Serial Console” (must be enabled per region), GCP has Serial Console API, Azure has Boot Diagnostics. If none of these are available, the next step is moving the drives to another machine to read logs from disk after the fact.
  3. Read POST codes if available. Server motherboards have a 2-digit hex display showing the POST code — the firmware’s exact stage. Decode against the OEM’s POST code table. “Memory training failed” looks very different from “no bootable device” at this layer.
  4. Check for storage failure. If POST passes but the firmware reports “no bootable device,” either the disk is dead, the partition table is corrupted, or the bootloader was overwritten. Boot from rescue media (USB, network), check lsblk, fdisk -l, run smartctl -a on the disk.
  5. If the kernel starts but does not finish. Boot with nomodeset earlyprintk=serial,ttyS0,115200 console=ttyS0,115200n8 debug on the kernel command line. This gives early console output to serial, which IPMI captures. The output reveals which subsystem is failing: hung on udev, hung on mounting root, panic on init.
  6. If everything looks fine but no login prompt. Boot with init=/bin/bash to bypass systemd entirely. If you get a shell, the bug is in user space (probably a service hanging on something like a missing network mount). Boot with systemd.unit=rescue.target for a more controlled diagnostic.
  7. If the bug is reproducible in QEMU. This is the gold scenario. Reproduce locally, attach gdb to QEMU’s stub (-s -S), and step through the boot. Most production boot bugs are reproducible this way.
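
Steps 5 and 7 combine naturally: boot the suspect kernel in QEMU with the serial console flags and a paused gdb stub. The sketch below only prints the command so it can be reviewed or pasted into a runbook; qemu_boot_cmd is a hypothetical helper, and the memory size and file paths are placeholder assumptions.

```shell
# qemu_boot_cmd KERNEL INITRD: prints (does not run) a QEMU command line that
# boots the given kernel with serial console output and a paused gdb stub.
qemu_boot_cmd() {
    printf '%s' "qemu-system-x86_64 -m 2048 -nographic -no-reboot"
    printf ' -kernel %s -initrd %s' "$1" "$2"
    printf ' -append "%s"' "nomodeset earlyprintk=serial,ttyS0,115200 console=ttyS0,115200n8 debug"
    # -s opens a gdb server on TCP port 1234; -S holds the CPU until gdb continues.
    printf ' %s\n' "-s -S"
}

qemu_boot_cmd ./vmlinuz ./initrd.img
```

With the guest paused, a second terminal can attach using gdb with target remote :1234 and then continue. Dropping -s -S gives a plain serial-console boot, which is usually enough for a CI smoke test.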
Real-World Example: In 2018, Cloudflare published a post-mortem of a global outage triggered by a kernel update that hung at boot on a small fraction of servers. The issue was that the new kernel’s IOMMU initialization conflicted with a specific SR-IOV NIC firmware version. Diagnosis required serial console access (IPMI), kernel command-line intel_iommu=off to confirm the IOMMU was at fault, then a coordinated rollback. Total outage was 27 minutes; without IPMI access it would have been hours.
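Senior follow-up: What if you cannot get serial console access at all? Boot a rescue image, mount the failing disk (for example at /mnt), and read its persistent journal: journalctl -D /mnt/var/log/journal --list-boots shows which boots were recorded, and journalctl -D /mnt/var/log/journal -b 0 shows the most recent one (use -b -1 for the boot before it). This only works if the journal is persistent and the failed boot got far enough to start journald, but many systemd failures leave breadcrumbs even when the machine never came up.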
Senior follow-up: How do you avoid this entire class of problem in a fleet? Staged kernel rollouts. Update one host, validate it boots cleanly, then 1 percent, then 10 percent, then 100. Capture serial logs from every host as it boots. Anomaly-detect on boot duration — if a kernel suddenly takes 60 seconds to boot when it usually takes 15, that is a regression worth catching before fleet-wide deployment.
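
The boot-duration check can start as a simple threshold before any real anomaly detection exists. This is a minimal sketch using whole seconds; boot_regressed is a hypothetical helper, and the 50 percent tolerance in the example is an arbitrary assumption.

```shell
# boot_regressed BASELINE_SEC CURRENT_SEC MAX_INCREASE_PCT
# Succeeds (exit 0) if CURRENT exceeds BASELINE by more than the allowed percent.
boot_regressed() {
    [ $(( $2 * 100 )) -gt $(( $1 * (100 + $3) )) ]
}

# Example from the text: a kernel that usually boots in 15 s now takes 60 s.
if boot_regressed 15 60 50; then
    echo "hold rollout: boot time regressed"
fi
```

A rollout script could run this on the canary host and refuse to proceed to the next stage when it succeeds.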
Senior follow-up: What boot debug technique do most engineers not know? kdb (Kernel Debugger) or kgdb (Kernel GDB). Compiled into many distro kernels but disabled by default. Enable with kgdboc=ttyS0,115200 and kgdbwait on the kernel command line; the kernel will stop during boot as soon as the kgdb serial driver is registered and wait for a gdb connection over serial. From there you can step through the subsystem init that follows.
Common Wrong Answers:
  1. “Just reinstall the OS.” Diagnostic abandonment. You will reproduce the same bug next week. Find the root cause.
  2. “Roll back to the previous kernel.” Right tactical move for an incident, wrong long-term answer. After the rollback, you must reproduce in a lab and find the actual fix, or you will be stuck on the old kernel forever.
Further Reading:
  • Linux kernel Documentation/admin-guide/serial-console.rst.
  • Brendan Gregg, “Linux boot bottlenecks” (brendangregg.com).
  • The Linux Programming Interface, Michael Kerrisk, chapter 25 (Process Termination) and chapter 37 (Daemons) for systemd-adjacent debug techniques.

Interview Deep-Dive

Strong Answer:
  • The GRUB rescue prompt means GRUB’s core loaded but cannot find its own modules and configuration under /boot/grub, so it cannot reach the menu or the kernel image. Common causes after a kernel upgrade are that the /boot partition ran out of space, grub.cfg was not regenerated, or the symlinks to vmlinuz/initramfs point to a version that was never installed properly.
  • My first step would be to boot from a live USB or rescue image. Then I mount the root and boot partitions and check whether /boot actually has the new kernel and initramfs files. I would look at ls /boot/ and compare against what grub.cfg references.
  • If the files are there, I chroot into the system, re-run grub-mkconfig -o /boot/grub/grub.cfg and grub-install /dev/sda (or the appropriate disk). Then reboot.
  • If /boot is full, I remove old kernels first. On Debian-based systems, apt autoremove handles this; on RHEL, dnf remove with the old kernel package names.
  • A subtlety people miss: on UEFI systems, the EFI System Partition (ESP) is separate from /boot. You need to check whether the .efi file on the ESP is intact and whether the UEFI boot entry still references the correct path. efibootmgr -v from the rescue environment is the tool for that.
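
The comparison between grub.cfg and the contents of /boot can be scripted from the rescue environment. This is a rough sketch: missing_kernels is a hypothetical helper, it only looks at the first path on each linux or initrd line, and it compares file names only so that it works whether or not /boot is a separate partition.

```shell
# missing_kernels GRUB_CFG BOOT_DIR
# Prints kernel/initrd file names referenced by grub.cfg that are absent from
# BOOT_DIR (for example the mounted /boot of the broken system).
missing_kernels() {
    grep -E '^[[:space:]]*(linux|initrd)(efi)?[[:space:]]' "$1" |
        awk '{ print $2 }' | sort -u |
        while read -r path; do
            file=${path##*/}
            [ -e "$2/$file" ] || echo "$file"
        done
}
```

From a live USB with the root filesystem mounted at /mnt, the call would look like missing_kernels /mnt/boot/grub/grub.cfg /mnt/boot. Empty output suggests the files are present and the problem is more likely the GRUB installation itself or the ESP.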
Follow-up: What is the difference between GRUB Stage 1 and Stage 2, and why does this matter for recovery? Stage 1 is the tiny bootstrap code that lives in the first 440 bytes of the MBR (on legacy BIOS) or within the .efi binary on UEFI. Its only job is to locate and load Stage 2, which is the full-featured GRUB code that can parse grub.cfg, display a menu, and load kernels. If Stage 1 is corrupted (e.g., another OS overwrote the MBR), you get no boot at all: no GRUB prompt, just a blinking cursor. If Stage 2 is broken or its config is missing, you get the GRUB rescue prompt. The recovery strategy differs: Stage 1 corruption requires re-running grub-install from a rescue environment, while Stage 2 or config issues may only need grub-mkconfig. On UEFI, the chain is simpler since shim.efi loads grubx64.efi directly from the ESP filesystem, so you can often just copy a known-good .efi binary back onto the ESP.
Strong Answer:
  • Secure Boot enforces a chain of trust: the UEFI firmware validates each binary it loads against a signature database (db) stored in NVRAM. The typical chain is UEFI firmware verifies shim.efi (signed by Microsoft), shim verifies grubx64.efi (signed with the distro’s key, whose certificate is embedded in shim), GRUB verifies the kernel (signed with the same distro key, or with a user-enrolled Machine Owner Key), and optionally the kernel verifies its own modules.
  • If I compile a custom kernel, it will not be signed by any key in the db or MOK databases. Secure Boot will refuse to load it and the system will not boot.
  • The fix is to enroll my own MOK using mokutil --import my_key.der, reboot into the MOK Manager (a shim-provided UI), accept the key, then sign my kernel with sbsign --key my_key.key --cert my_key.crt --output vmlinuz-signed vmlinuz. From that point, shim will accept my signed kernel.
  • In a production fleet, the better approach is to disable Secure Boot only on dev machines, and for production, use a CI pipeline that signs kernels with a hardware security module (HSM) holding the organization’s MOK private key.
  • The gotcha most people miss: even if the kernel loads, unsigned kernel modules (like out-of-tree drivers from DKMS) will be rejected at module load time if the kernel enforces module signature verification. You need to sign those too, or add module.sig_enforce=0 to the kernel command line (which weakens the trust chain).
Follow-up: What is the security implication of disabling Secure Boot entirely in a cloud or data center environment? Without Secure Boot, a rootkit or bootkit can replace the bootloader or kernel with a tampered version. Since the firmware does not verify signatures, the malicious code runs with full kernel privileges before any OS-level security (like SELinux or AppArmor) is initialized. In a cloud environment where you control physical access, the risk is lower but not zero: a compromised management plane or BMC could inject malicious firmware. Many compliance frameworks (PCI-DSS, FedRAMP) now require Secure Boot or equivalent measured boot (using TPM attestation) as a baseline control.
Strong Answer:
  • A distribution kernel cannot always mount the root filesystem directly, because reaching the root device may require drivers that are built as modules rather than compiled in, plus userspace tools the kernel does not contain. Think of NVMe drivers, RAID controller drivers, LVM userspace tools, or LUKS decryption. All of these need to run before the real root is accessible.
  • initramfs (initial RAM filesystem) is a small CPIO archive loaded into memory by the bootloader alongside the kernel. The kernel mounts it as a temporary root filesystem and executes /init from it. That init script loads the necessary drivers, assembles RAID arrays, decrypts LUKS volumes, activates LVM, and then calls switch_root to pivot to the real root filesystem.
  • You would customize initramfs when: you add a new storage driver the default initramfs does not include, you use full-disk encryption with a non-standard key derivation (like a network-fetched key or a hardware token), you need to add custom network configuration for NFS root or iSCSI boot, or you want to add a custom pre-boot recovery shell.
  • On Debian/Ubuntu, update-initramfs -u regenerates it; on Fedora/RHEL, dracut -f does the same. A common production incident is forgetting to regenerate initramfs after installing a new kernel module that the root device depends on — the system boots fine on the old kernel but panics on the new one because the module is missing from the new initramfs.
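
Before rebooting into a new kernel, the regenerated initramfs can be checked for the modules the root device depends on. The sketch below works on a file listing, such as the output of lsinitramfs on Debian/Ubuntu or lsinitrd on dracut-based systems; missing_modules is a hypothetical helper, and the module names in the usage note are examples only.

```shell
# missing_modules MODULE...
# stdin: file listing of an initramfs. Prints each requested module that has
# no matching .ko file (optionally compressed) in the listing.
missing_modules() {
    # Module names found in the listing, with '-' normalised to '_'.
    have=$(sed -n 's|.*/\([^/]*\)\.ko\(\.[a-z0-9]*\)\{0,1\}$|\1|p' | sed 's/-/_/g')
    for mod in "$@"; do
        want=$(printf '%s\n' "$mod" | sed 's/-/_/g')
        printf '%s\n' "$have" | grep -qxF "$want" || echo "$mod"
    done
}
```

A call might look like lsinitramfs /boot/initrd.img-"$(uname -r)" | missing_modules nvme dm_crypt. Drivers compiled into the kernel never appear as .ko files, so a reported module is a prompt to check the kernel config rather than proof of a broken image.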
Follow-up: What happens if PID 1 (the init process) crashes or exits after the kernel hands off to user space? The kernel treats PID 1 as sacred. If PID 1 exits for any reason, the kernel triggers a kernel panic because it has no fallback: there is no process to reap orphaned children, no process to manage system shutdown. This is why systemd, the most common PID 1, is extremely defensive about its own stability. If systemd itself crashes, the kernel panics and the machine reboots (or hangs, depending on kernel panic settings). This is also why container runtimes must ensure the entrypoint process stays alive: if PID 1 inside a PID namespace exits, the entire container is torn down.
Strong Answer:
  • The x86 architecture requires a specific sequence of mode transitions because each mode depends on data structures that the previous mode sets up. You cannot jump from Real Mode to Long Mode because Long Mode requires paging to be enabled, and paging requires a valid page table hierarchy to be set up in memory. But in Real Mode, you cannot address enough memory to set up those tables conveniently, and the GDT (Global Descriptor Table) that defines the code and data segments for protected execution has not been loaded yet.
  • The sequence is: Real Mode (16-bit, 1MB addressable) then enable A20 gate, load GDT, set CR0.PE bit, far jump to flush pipeline, now in 32-bit Protected Mode, set up initial page tables (identity mapping), enable PAE (CR4.PAE), set EFER.LME (Long Mode Enable), enable paging (CR0.PG), far jump to 64-bit code segment, now in Long Mode.
  • If you tried to set EFER.LME and CR0.PG from Real Mode, the CPU would triple-fault and reset because the required structures (GDT with valid 64-bit code segment, valid page tables loaded in CR3) would not exist. The CPU would attempt to fetch instructions through paging with no valid page tables and immediately fault, then fault again trying to handle the fault (double fault), then fault again (triple fault), triggering a reset.
  • The practical consequence: if you are writing a bootloader or early kernel code, you must follow this exact sequence. UEFI firmware abstracts this away for you — by the time your .efi application runs, the CPU is already in 64-bit mode with paging enabled. This is one of the biggest advantages of UEFI over legacy BIOS boot.
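
The dependency between the control bits in that sequence can be made concrete with a small decoder. This is a simplified model, not a CPU emulator: cpu_mode is a hypothetical helper that looks only at CR0.PE (bit 0), CR0.PG (bit 31), CR4.PAE (bit 5), and EFER.LME (bit 8), and it ignores the code-segment descriptor that selects 64-bit versus compatibility sub-mode.

```shell
# cpu_mode CR0 CR4 EFER (decimal or 0x-prefixed hex)
# Prints the operating mode implied by the control bits, or the reason the
# combination is rejected by the CPU.
cpu_mode() {
    pe=$(( $1 & 1 ))
    pg=$(( ($1 >> 31) & 1 ))
    pae=$(( ($2 >> 5) & 1 ))
    lme=$(( ($3 >> 8) & 1 ))
    if [ "$pe" -eq 0 ]; then
        if [ "$pg" -eq 1 ]; then
            echo "fault: CR0.PG requires CR0.PE"
        else
            echo "real mode"
        fi
    elif [ "$pg" -eq 0 ]; then
        echo "protected mode (paging off)"
    elif [ "$lme" -eq 1 ] && [ "$pae" -eq 0 ]; then
        echo "fault: long mode requires CR4.PAE"
    elif [ "$lme" -eq 1 ]; then
        echo "long mode"
    else
        echo "protected mode (paging on)"
    fi
}

cpu_mode 0x0 0x0 0x0            # state at reset
cpu_mode 0x1 0x0 0x0            # after setting CR0.PE
cpu_mode 0x80000001 0x20 0x100  # PAE and LME set, then paging enabled
```

Stepping through the calls in order mirrors the sequence in the answer: each write is only legal once the bits before it are in place.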
Follow-up: What is identity mapping during boot, and what happens if you get it wrong? Identity mapping means setting up page tables so that virtual address X maps to physical address X. This is critical at the moment you enable paging because the CPU is currently executing code at some physical address. If paging is enabled and the page tables do not map the physical address the instruction pointer currently points to, the very next instruction fetch will trigger a page fault. But the page fault handler itself is not set up yet (or is at an unmapped address), so you get a double fault, then a triple fault, and the CPU resets. The kernel must identity-map at least the region containing the code that enables paging and the page tables themselves. After the jump to the kernel’s high virtual address (typically 0xFFFF…), the identity mapping can be torn down.
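
One way to see what "virtual X maps to physical X" means in practice is to compute the page-directory entries for an identity map built from 2 MiB pages. This is a sketch under stated assumptions: a single page directory covering the first 1 GiB with 512 entries, and the flag value 0x83 (Present, Read/Write, Page Size). The helper names are hypothetical, and a real setup also needs PML4 and PDPT entries that point at this directory.

```shell
# pd_entry INDEX: page-directory entry that identity-maps the INDEX-th 2 MiB
# region. Flags: bit 0 Present, bit 1 Read/Write, bit 7 Page Size (2 MiB page).
pd_entry() {
    printf '0x%x\n' $(( ($1 << 21) | 0x83 ))
}

# pd_index ADDR: which page-directory entry covers a given address.
pd_index() {
    echo $(( ($1 >> 21) & 511 ))
}

pd_entry 0          # maps 0x000000-0x1fffff onto itself
pd_entry 1          # maps 0x200000-0x3fffff onto itself
pd_index 0x100000   # code at the 1 MiB mark is covered by entry 0
```

Because the physical base stored in entry N equals the virtual address that selects entry N, the instruction pointer stays valid across the write that sets CR0.PG.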