Linux Kernel Architecture - Ring layers, modules, and subsystems

Linux Kernel Architecture

Understanding the architecture of the Linux kernel is the foundation for everything else in this course. This module covers how the kernel is organized, why certain design decisions were made, and how to navigate the massive codebase.
Interview Frequency: Very High
Key Topics: Monolithic design, kernel source navigation, boot process, modules
Time to Master: 10-12 hours

| Aspect | Monolithic | Microkernel |
| --- | --- | --- |
| Performance | Direct function calls | IPC overhead for every operation |
| Latency | Lower (no message passing) | Higher (context switches) |
| Complexity | Harder to maintain safety | Cleaner separation |
| Reliability | One bug can crash kernel | Components isolated |
| Development | Faster prototyping | More engineering overhead |
Interview Insight: Linux uses a “modular monolithic” approach: a monolithic core with loadable modules. This gives it monolithic performance plus some of a microkernel's flexibility.

The Famous Torvalds-Tanenbaum Debate

In 1992, Linus Torvalds and Andrew Tanenbaum debated kernel design:
  • Tanenbaum: Microkernels are the future; monolithic is obsolete
  • Torvalds: Performance matters; Linux’s approach is pragmatic
Outcome: Linux became the dominant OS kernel, though microkernels power some embedded systems (QNX in cars, L4 in phones).

Kernel Source Tree Organization

The Linux kernel source is massive (30+ million lines), but well-organized:
linux/
├── arch/           # Architecture-specific code (x86, arm64, riscv)
│   ├── x86/
│   │   ├── boot/       # Boot code
│   │   ├── kernel/     # x86-specific kernel code
│   │   ├── mm/         # x86 memory management
│   │   └── entry/      # Syscall entry points
│   └── arm64/

├── block/          # Block layer, I/O scheduling
├── certs/          # Signing certificates
├── crypto/         # Cryptographic API and algorithms
├── Documentation/  # Kernel documentation

├── drivers/        # Device drivers (largest directory)
│   ├── net/            # Network drivers
│   ├── block/          # Block device drivers
│   ├── gpu/            # GPU drivers (including drm)
│   ├── nvme/           # NVMe drivers
│   └── ...

├── fs/             # Filesystems
│   ├── ext4/           # ext4 filesystem
│   ├── xfs/            # XFS filesystem
│   ├── btrfs/          # Btrfs filesystem
│   ├── proc/           # procfs
│   └── ...

├── include/        # Header files
│   ├── linux/          # Public kernel headers
│   ├── uapi/           # User-space API headers
│   └── asm-generic/    # Generic assembly headers

├── init/           # Kernel initialization
│   └── main.c          # start_kernel() lives here

├── ipc/            # Inter-process communication
├── kernel/         # Core kernel code
│   ├── sched/          # Scheduler
│   ├── locking/        # Locks, mutexes
│   ├── trace/          # Tracing infrastructure
│   └── bpf/            # BPF subsystem

├── lib/            # Kernel libraries
├── mm/             # Memory management
│   ├── slub.c          # SLUB allocator (mm/slab.c was removed in Linux 6.8)
│   ├── page_alloc.c    # Page allocator
│   ├── mmap.c          # Memory mapping
│   └── ...

├── net/            # Networking
│   ├── core/           # Core networking
│   ├── ipv4/           # IPv4 stack
│   ├── ipv6/           # IPv6 stack
│   ├── netfilter/      # Packet filtering
│   └── ...

├── scripts/        # Build and helper scripts
├── security/       # Security modules (SELinux, AppArmor)
├── sound/          # Sound subsystem
├── tools/          # Userspace tools (perf, bpf)
│   ├── perf/           # perf tool
│   ├── bpf/            # BPF tools
│   └── ...

├── Kconfig         # Build configuration
├── Makefile        # Main makefile
└── MAINTAINERS     # Who maintains what

Key Files to Know

| File | Purpose |
| --- | --- |
| init/main.c | Kernel entry point (start_kernel()) |
| arch/x86/entry/entry_64.S | System call entry point |
| kernel/sched/core.c | Scheduler core |
| mm/page_alloc.c | Page allocator (buddy system) |
| mm/slub.c | SLUB allocator (mm/slab.c, the older SLAB implementation, was removed in Linux 6.8) |
| fs/read_write.c | read/write system calls |
| net/core/dev.c | Core networking |
Navigation Tip: Use tools like cscope, ctags, or online browsers like elixir.bootlin.com to navigate the source.

Boot Process Deep Dive

Understanding how Linux boots is essential for systems engineers.

[Figure: Linux Boot Sequence]

start_kernel() - The Heart of Boot

// init/main.c - simplified (exact function names vary between kernel versions)
asmlinkage __visible void __init start_kernel(void)
{
    char *command_line;

    // Very early setup
    set_task_stack_end_magic(&init_task);
    
    // Memory management initialization
    setup_arch(&command_line);      // Arch-specific setup
    mm_init();                      // Memory subsystem (renamed mm_core_init() in recent kernels)
    
    // Core subsystems
    sched_init();                   // Scheduler
    rcu_init();                     // RCU
    
    // Interrupts and timers
    init_IRQ();
    tick_init();
    
    // Various subsystems
    vfs_caches_init();              // VFS
    signals_init();                 // Signals
    
    // Start init process
    rest_init();                    // Creates init process
}

Key Boot Parameters

| Parameter | Purpose | Example |
| --- | --- | --- |
| root= | Root filesystem device | root=/dev/sda1 |
| init= | First user process | init=/bin/bash |
| quiet | Suppress boot messages | quiet |
| debug | Enable debug messages | debug |
| nokaslr | Disable KASLR | nokaslr |
| isolcpus= | Isolate CPUs from scheduler | isolcpus=2,3 |
| nosmp | Disable SMP | nosmp |
Debug Tip: Boot with init=/bin/bash to get a shell before init runs. Useful for recovery.
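
To make the parameter format concrete, here is a minimal user-space sketch in C that looks up one parameter in a command line string such as the contents of /proc/cmdline. It is not the kernel's own parser: `cmdline_lookup` and `read_proc_cmdline` are hypothetical helper names, and quoted values containing spaces are deliberately not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up `key` in a whitespace-separated kernel command line.
 * Returns 1 and copies the value ("" for a bare flag such as "quiet")
 * into `out` when the key is present, 0 otherwise.
 * Simplification: quoted values containing spaces are not handled.
 */
int cmdline_lookup(const char *cmdline, const char *key,
                   char *out, size_t outlen)
{
    assert(cmdline && key && out && outlen > 0);
    size_t keylen = strlen(key);
    const char *p = cmdline;

    while (*p) {
        p += strspn(p, " \t\n");              /* skip separators */
        size_t toklen = strcspn(p, " \t\n");  /* length of this token */
        if (toklen == 0)
            break;
        /* match "key" exactly or "key=value", but not "keyfoo=..." */
        if (toklen >= keylen && strncmp(p, key, keylen) == 0 &&
            (toklen == keylen || p[keylen] == '=')) {
            const char *val = (toklen == keylen) ? p + keylen
                                                 : p + keylen + 1;
            size_t vallen = (size_t)(p + toklen - val);
            if (vallen >= outlen)
                vallen = outlen - 1;          /* truncate to fit */
            memcpy(out, val, vallen);
            out[vallen] = '\0';
            return 1;
        }
        p += toklen;
    }
    return 0;
}

/* Read /proc/cmdline (Linux only) into buf; returns 0 on success. */
int read_proc_cmdline(char *buf, size_t buflen)
{
    assert(buf && buflen > 0);
    FILE *f = fopen("/proc/cmdline", "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, buflen - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 0;
}
```

A caller would typically fill a buffer with `read_proc_cmdline()` and then call `cmdline_lookup(buf, "isolcpus", value, sizeof value)`. Matching on whole tokens keeps `init` from being confused with `initrd=`.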

Kernel Address Space Layout

On x86-64, the virtual address space is split between user and kernel:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    x86-64 VIRTUAL ADDRESS SPACE (48-bit)                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  0xFFFFFFFFFFFFFFFF ┌────────────────────────────────────────────────────┐  │
│                     │                                                     │  │
│                     │              KERNEL SPACE (128 TB)                  │  │
│                     │                                                     │  │
│                     │  0xFFFFFFFF80000000 - Kernel text (code)           │  │
│                     │  0xFFFF880000000000 - Direct physical mapping      │  │
│                     │  0xFFFFC90000000000 - vmalloc area                 │  │
│                     │  0xFFFFEA0000000000 - Virtual memory map           │  │
│                     │                                                     │  │
│  0xFFFF800000000000 ├────────────────────────────────────────────────────┤  │
│                     │              NON-CANONICAL HOLE                     │  │
│                     │         (addresses that cause fault)                │  │
│  0x0000800000000000 ├────────────────────────────────────────────────────┤  │
│                     │                                                     │  │
│                     │              USER SPACE (128 TB)                    │  │
│                     │                                                     │  │
│                     │  Stack (grows down from near top)                  │  │
│                     │  mmap region (shared libs, anonymous maps)         │  │
│                     │  Heap (grows up from end of data)                  │  │
│                     │  BSS, Data, Text (program sections)                │  │
│                     │                                                     │  │
│  0x0000000000000000 └────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
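
The split in the diagram follows from the canonical-address rule: with 48-bit virtual addresses, bits 63 through 47 must all be identical. The user-space sketch below classifies an address under that rule. The names `classify_addr48` and `region_name` are illustrative only, and the check ignores finer details such as the exact upper limit Linux imposes on user mappings.

```c
#include <assert.h>
#include <stdint.h>

enum addr_region { ADDR_USER, ADDR_KERNEL, ADDR_NON_CANONICAL };

/*
 * With 48-bit virtual addresses, bits 63..47 must all be copies of bit 47.
 * All zeros -> lower (user) half; all ones -> upper (kernel) half;
 * anything else falls in the non-canonical hole.
 */
enum addr_region classify_addr48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the 17 bits 63..47 */

    if (top == 0)
        return ADDR_USER;
    if (top == 0x1FFFF)
        return ADDR_KERNEL;
    return ADDR_NON_CANONICAL;
}

/* Human-readable name for a region, for printing. */
const char *region_name(enum addr_region r)
{
    assert(r == ADDR_USER || r == ADDR_KERNEL || r == ADDR_NON_CANONICAL);
    switch (r) {
    case ADDR_USER:   return "user";
    case ADDR_KERNEL: return "kernel";
    default:          return "non-canonical";
    }
}
```

For example, `classify_addr48(0xFFFFFFFF80000000)` falls in the kernel half, while `0x0000800000000000` is the first address inside the hole.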

KASLR (Kernel Address Space Layout Randomization)

KASLR randomizes kernel addresses at boot for security:
# Check whether KASLR was disabled on the command line
# (KASLR also requires a kernel built with CONFIG_RANDOMIZE_BASE)
grep -qw nokaslr /proc/cmdline && echo "KASLR disabled" || echo "KASLR not disabled on the command line"

# See kernel text base (will differ each boot with KASLR)
sudo grep -m1 ' _text$' /proc/kallsyms

Loadable Kernel Modules

Modules allow extending the kernel without recompiling:

Module Structure

// hello_module.c - Simple kernel module
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Your Name");
MODULE_DESCRIPTION("A simple Hello World module");
MODULE_VERSION("1.0");

static int __init hello_init(void)
{
    printk(KERN_INFO "Hello, Kernel!\n");
    return 0;  // 0 = success
}

static void __exit hello_exit(void)
{
    printk(KERN_INFO "Goodbye, Kernel!\n");
}

module_init(hello_init);
module_exit(hello_exit);

Module Loading Process

[Figure: Module Loading Process]

Module Management Commands

# List loaded modules
lsmod

# Module info
modinfo ext4

# Load module
sudo modprobe ext4

# Load with parameters
sudo modprobe loop max_loop=64

# Remove module
sudo rmmod loop

# Show module dependencies
modprobe --show-depends ext4

# Show module parameters
systool -vm ext4

Module Parameters

// Module with parameters
#include <linux/module.h>
#include <linux/moduleparam.h>

static int buffer_size = 1024;
module_param(buffer_size, int, 0644);  // Read-write in sysfs
MODULE_PARM_DESC(buffer_size, "Size of internal buffer");

static char *device_name = "mydev";
module_param(device_name, charp, 0444);  // Read-only
MODULE_PARM_DESC(device_name, "Device name");

Kernel Threads

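Kernel threads (kthreads) are tasks that run entirely in kernel mode and have no user-space address space:
// Creating a kernel thread
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static struct task_struct *my_thread;

static int thread_function(void *data)
{
    while (!kthread_should_stop()) {
        // Do work
        schedule_timeout_interruptible(HZ);  // Sleep up to 1 second
    }
    return 0;
}

// In module init: kthread_run() returns an ERR_PTR on failure, not NULL
my_thread = kthread_run(thread_function, NULL, "my_kthread");
if (IS_ERR(my_thread))
    return PTR_ERR(my_thread);

// In module exit: wakes the thread and waits for thread_function to return
kthread_stop(my_thread);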

Important Kernel Threads

# View kernel threads (names in brackets)
ps aux | grep '\[.*\]'

# Common kernel threads:
# [kthreadd]     - Parent of all kernel threads
# [ksoftirqd/N]  - Soft IRQ handling for CPU N
# [kworker/N:M]  - Workqueue workers
# [kswapd0]      - Memory reclaim
# [jbd2/sda1-8]  - Journal block device (ext4 journaling)
# [kcompactd0]   - Memory compaction
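
Bracketed names in `ps` output are only a display convention. A more direct test is the `PF_KTHREAD` bit in the task flags, which appear as field 9 of /proc/<pid>/stat. The user-space sketch below checks that bit. The helper names are hypothetical, and the flag value is copied from the kernel's include/linux/sched.h, so treat it as an assumption that could change between kernel versions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PF_KTHREAD 0x00200000u  /* value taken from include/linux/sched.h */

/*
 * Returns 1 if the /proc/<pid>/stat line describes a kernel thread,
 * 0 if it does not, and -1 if the line cannot be parsed.
 */
int stat_line_is_kthread(const char *stat_line)
{
    assert(stat_line != NULL);

    /* comm may itself contain spaces or ')', so search for the last ')' */
    const char *p = strrchr(stat_line, ')');
    if (!p)
        return -1;

    char state;
    int ppid, pgrp, session, tty_nr, tpgid;
    unsigned int flags;

    /* fields 3..9: state ppid pgrp session tty_nr tpgid flags */
    if (sscanf(p + 1, " %c %d %d %d %d %d %u",
               &state, &ppid, &pgrp, &session, &tty_nr, &tpgid, &flags) != 7)
        return -1;

    return (flags & PF_KTHREAD) ? 1 : 0;
}

/* Convenience wrapper: read /proc/<pid>/stat and classify it. */
int pid_is_kthread(int pid)
{
    char path[64], line[1024];

    snprintf(path, sizeof path, "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok ? stat_line_is_kthread(line) : -1;
}
```

On a typical Linux system, `pid_is_kthread(2)` would be expected to report a kernel thread, since PID 2 is normally kthreadd, while PID 1 is a user-space process.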

Lab Exercises

Objective: Get comfortable with kernel source tree
# Clone kernel source
git clone --depth=1 https://github.com/torvalds/linux.git
cd linux

# Find start_kernel (recent kernels may place the attributes on the line above the name)
grep -n "start_kernel(void)" init/main.c

# Find syscall table (x86-64)
find arch/x86 -name "*syscall*"

# Count lines in different subsystems
wc -l mm/*.c       # Memory management
wc -l kernel/*.c   # Core kernel
wc -l fs/*.c       # Filesystems
Objective: Write, compile, and load a kernel module
// Save as hello.c
#include <linux/init.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");

static int __init hello_init(void)
{
    pr_info("Hello from kernel module!\n");
    return 0;
}

static void __exit hello_exit(void)
{
    pr_info("Goodbye from kernel module!\n");
}

module_init(hello_init);
module_exit(hello_exit);
# Makefile
obj-m := hello.o

KDIR := /lib/modules/$(shell uname -r)/build

all:
	$(MAKE) -C $(KDIR) M=$(CURDIR) modules

clean:
	$(MAKE) -C $(KDIR) M=$(CURDIR) clean
# Build and test
make
sudo insmod hello.ko
dmesg | tail
sudo rmmod hello
dmesg | tail
Objective: Understand boot timing and initialization
# View boot messages
dmesg | head -100

# Boot timing analysis
systemd-analyze
systemd-analyze blame
systemd-analyze critical-chain

# Kernel command line used
cat /proc/cmdline

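# Initramfs contents (Debian/Ubuntu)
lsinitramfs /boot/initrd.img-$(uname -r) | head -50
# On Fedora/RHEL the equivalent is: lsinitrd /boot/initramfs-$(uname -r).img | head -50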

Interview Questions

Answer: Linux chose monolithic design for performance:
  • Direct function calls between subsystems (no IPC overhead)
  • Single address space eliminates context switch on internal calls
  • Simpler data sharing between components
Trade-offs:
  • A bug in any component can crash the entire kernel
  • Larger attack surface (all code runs privileged)
  • More complex codebase to maintain
Mitigations in Linux:
  • Loadable modules for flexibility
  • Namespaces and cgroups for isolation
  • Seccomp for syscall filtering
  • Strong code review process
Answer (kernel perspective):
  1. Shell process:
    • fork() → creates child process (clone syscall)
    • execve("/bin/ls") → replaces process image
  2. execve processing:
    • Kernel opens ELF binary
    • Maps code/data sections into memory
    • Sets up stack with arguments/environment
    • Loads dynamic linker (ld.so)
  3. ls execution:
    • Dynamic linker loads libc
    • ls calls opendir()/readdir() → getdents64 syscall
    • Kernel reads directory entries from filesystem
    • ls calls write() → output to terminal
  4. Termination:
    • ls calls exit() → exit_group syscall
    • Kernel cleans up resources
    • Parent shell’s wait() returns
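
The shell side of this sequence (fork, exec, wait) can be sketched in a few lines of user-space C. This is an illustration rather than how any particular shell is implemented: `run_command` is a hypothetical helper, and it uses `execvp()`, the libc wrapper that searches PATH before issuing the execve syscall.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] with the given arguments and wait for it to finish.
 * Returns the child's exit status (0-255), 127 if the exec failed,
 * or -1 if fork/wait failed or the child was killed by a signal.
 */
int run_command(char *const argv[])
{
    assert(argv && argv[0]);

    pid_t pid = fork();              /* step 1: duplicate the process */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execvp(argv[0], argv);       /* step 2: replace the process image */
        _exit(127);                  /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)  /* step 4: parent reaps the child */
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with `{"ls", "-l", NULL}` reproduces the flow above. Running the resulting program under `strace -f` should show the clone, execve, getdents64, write, and exit_group calls in order.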
Answer:
| Aspect | Kernel Module | Shared Library |
| --- | --- | --- |
| Privilege | Runs in kernel mode | Runs in user mode |
| Address space | Kernel address space | Process address space |
| Fault impact | Can crash system | Crashes only that process |
| Symbol resolution | Kernel symbol table | User-space linker |
| Memory | Uses kmalloc, vmalloc | Uses malloc, mmap |
| Loading | insmod, modprobe | ld.so, dlopen |
Key insight: Modules are essentially kernel code with a defined entry/exit point, while shared libraries are user-space code loaded by the dynamic linker.
Answer: KASLR (Kernel Address Space Layout Randomization):
  • Randomizes kernel base address at each boot
  • Makes it harder to exploit memory corruption vulnerabilities
  • Attacker can’t hardcode kernel addresses
Implementation:
  • Random offset chosen during early boot
  • All kernel symbols shifted by this offset
  • /proc/kallsyms shows randomized addresses
Limitations:
  • Information leaks can reveal base address
  • Side-channel attacks (Meltdown/Spectre) can bypass it
  • Doesn’t protect against local attackers with kernel memory access
Related protections:
  • SMEP (Supervisor Mode Execution Prevention)
  • SMAP (Supervisor Mode Access Prevention)
  • Stack canaries

Key Takeaways

Architecture Choice

Linux’s monolithic design prioritizes performance while modules add flexibility

Source Organization

Understanding source tree layout is essential for kernel development and debugging

Boot Process

From firmware to init, each stage has specific responsibilities and debugging points

Module System

Modules extend kernel functionality at runtime without recompilation


Interview Deep-Dive

Strong Answer:
  • First, I would parse the oops message to extract the faulting instruction pointer (RIP), the module name, and the call trace. The RIP tells me exactly which function and offset within the module triggered the fault, and I can use addr2line or objdump -d against the module’s .ko file with debug symbols to map that to a source line.
  • Next, I would check whether the module was loaded with modinfo to confirm its version, parameters, and whether it matches the running kernel. A common root cause is loading a module compiled against a different kernel version, which causes struct layout mismatches since the kernel does not guarantee a stable internal ABI.
  • I would also examine the register dump and stack trace to understand what data the function was operating on. For example, if a NULL pointer dereference is indicated, I would look at which struct field was being accessed and trace backward through the call chain to find who passed a NULL pointer.
  • Finally, if this is reproducible, I would boot with module_blacklist=<mod> to confirm the module is the cause, then load it with dynamic debug enabled (dyndbg='+p') or add pr_debug statements to the module source to trace the exact code path leading to the crash.
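
As a small illustration of the first step, the sketch below splits the symbolic location an oops prints after `RIP:` (for example `hello_init+0x1a/0x40 [hello]`) into symbol, offset, function size, and module name. The struct and function names are hypothetical, and the caller is assumed to pass only the `symbol+offset/size [module]` part of the line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct oops_loc {
    char symbol[64];        /* function name */
    unsigned long offset;   /* offset of the faulting instruction */
    unsigned long size;     /* total size of the function */
    char module[64];        /* left empty for built-in kernel code */
};

/* Returns 0 on success, -1 if text does not look like symbol+off/size. */
int parse_oops_loc(const char *text, struct oops_loc *loc)
{
    assert(text && loc);
    memset(loc, 0, sizeof *loc);

    /* %lx accepts an optional 0x prefix; the " [module]" part is optional */
    int n = sscanf(text, "%63[^+ ]+%lx/%lx [%63[^]]]",
                   loc->symbol, &loc->offset, &loc->size, loc->module);
    return n >= 3 ? 0 : -1;
}
```

With the symbol and offset in hand, one common approach is to open the module in gdb (`gdb hello.ko`) and run `list *(hello_init+0x1a)`, assuming the module was built with debug information.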
Follow-up: How does KASLR complicate this investigation, and what would you do about it?
Follow-up Answer:
  • KASLR randomizes the kernel base address at each boot, so the faulting address in the oops does not correspond to the compile-time addresses in the symbol table. To decode the address, I need either the /proc/kallsyms output from that exact boot session (before the panic), or I need to subtract the KASLR offset, which the oops message itself sometimes prints. If the system was configured with kdump, the crash dump captures the full memory image including the randomized layout, and tools like crash can resolve symbols automatically. For modules, the oops typically prints the module load address, and I can compute the offset from there.
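
The offset arithmetic can be sketched in user-space C. The example assumes the common x86-64 default link address for `_text` (0xffffffff81000000), which is an assumption about the kernel build rather than a guarantee, and all function names are hypothetical. It applies to core kernel text only, since modules are loaded in a separate region.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: default link-time address of _text on x86-64 kernels. */
#define DEFAULT_TEXT_BASE UINT64_C(0xffffffff81000000)

/* Parse one /proc/kallsyms line such as "ffffffff9a000000 T _text". */
int parse_kallsyms_line(const char *line, uint64_t *addr, char *type,
                        char name[128])
{
    assert(line && addr && type && name);
    return sscanf(line, "%" SCNx64 " %c %127s", addr, type, name) == 3
               ? 0 : -1;
}

/* Offset that KASLR applied to the kernel text for this boot. */
uint64_t kaslr_offset(uint64_t runtime_text)
{
    return runtime_text - DEFAULT_TEXT_BASE;
}

/* Map an address seen at runtime back to its link-time address. */
uint64_t to_link_time_addr(uint64_t runtime_addr, uint64_t offset)
{
    return runtime_addr - offset;
}
```

The link-time address can then be resolved against the uncompressed `vmlinux` with a tool such as addr2line. Note that /proc/kallsyms shows zeroed addresses to unprivileged readers, so the `_text` line has to be captured as root.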
Strong Answer:
  • The monolithic design means all kernel subsystems — scheduler, memory manager, filesystem, device drivers — share a single address space and communicate via direct function calls. This eliminates IPC overhead that microkernels pay on every cross-subsystem call, which can be hundreds of nanoseconds per message pass. For a general-purpose OS handling millions of syscalls per second, this performance advantage is decisive.
  • The trade-off is reliability and security. A bug in any driver or subsystem can corrupt kernel memory and crash the entire system. Microkernels like QNX isolate each component in its own address space, so a faulty network driver crashes only that process and can be restarted without rebooting.
  • Linux mitigates the monolithic downsides through loadable modules (which keep optional code out of the core image but provide no runtime isolation), static analysis tools like Sparse and Coccinelle, extensive code review, and runtime checkers like KASAN, UBSAN, and lockdep. But these are mitigations, not guarantees.
  • For a safety-critical embedded system — say, an avionics flight controller — I would choose a microkernel like seL4 or a certified RTOS. The formal verification guarantees and fault isolation outweigh the performance overhead, because correctness is non-negotiable. However, for a high-throughput infrastructure server, Linux’s monolithic approach with modules remains the pragmatic choice.
Follow-up: In practice, how do loadable modules blur the line between monolithic and microkernel designs?
Follow-up Answer:
  • Loadable modules give Linux a degree of runtime flexibility that pure monolithic kernels lack. You can load a filesystem driver only when a specific filesystem is mounted, unload a network driver when the interface goes down, and update drivers without rebooting. However, modules still run in kernel address space with full privileges — there is no isolation boundary. A buggy module can still panic the kernel. So modules provide deployment flexibility (similar to microkernel services that can be started/stopped independently) without the fault isolation that defines a true microkernel.
Strong Answer:
  • The initialization order in start_kernel() reflects hard dependencies between subsystems. Memory management must be initialized before the scheduler because the scheduler needs to allocate data structures — runqueues, per-CPU variables, and the initial task_struct copies — and those allocations require a functioning page allocator and slab allocator.
  • If you swapped them, sched_init() would attempt to call kmalloc or alloc_percpu before the memory allocator is ready. This would either trigger a NULL pointer dereference (if the allocator pointers are not yet set up) or corrupt memory by writing to uninitialized data structures. The kernel would panic very early in boot, likely before any console output is visible.
  • This ordering principle extends throughout the boot: interrupts must be initialized before timers (timers are delivered via interrupts), VFS must be initialized before mounting the root filesystem, and RCU must be set up before any RCU-protected data structures are used. Each subsystem’s init function documents its dependencies implicitly through the call order in start_kernel().
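
The dependency argument can be illustrated with a toy model in user-space C. Nothing here is kernel code: the `toy_` functions and `run_boot` are invented for the example, and unlike the real kernel, which would simply crash, the toy scheduler init checks its prerequisite and reports failure.

```c
#include <assert.h>
#include <stddef.h>

struct boot_state {
    int mm_ready;       /* set once the toy allocator is usable */
    int sched_ready;    /* set once the toy scheduler is usable */
};

typedef int (*init_fn)(struct boot_state *);

/* Toy stand-in for memory management init: no prerequisites. */
int toy_mm_init(struct boot_state *s)
{
    s->mm_ready = 1;
    return 0;
}

/* Toy stand-in for scheduler init: needs a working allocator. */
int toy_sched_init(struct boot_state *s)
{
    if (!s->mm_ready)
        return -1;      /* the real kernel would fault here instead */
    s->sched_ready = 1;
    return 0;
}

/*
 * Run the init steps in array order.
 * Returns -1 if every step succeeded, otherwise the index of the
 * first step that failed.
 */
int run_boot(const init_fn *steps, size_t n, struct boot_state *s)
{
    assert(steps && s);
    for (size_t i = 0; i < n; i++)
        if (steps[i](s) != 0)
            return (int)i;
    return -1;
}
```

Running the steps as `{ toy_mm_init, toy_sched_init }` succeeds, while the swapped order fails at index 0, mirroring why the call order in start_kernel() encodes the dependency graph.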
Follow-up: How would you debug a hang during early boot where no console output is visible?
Follow-up Answer:
  • The primary tool is earlyprintk, a kernel boot parameter that configures a simple output driver (serial port, VGA) before the normal console subsystem is ready. For example, earlyprintk=serial,ttyS0,115200 sends kernel messages to the serial port immediately. If even that fails, I would use JTAG or a hardware debugger to set breakpoints at start_kernel() and single-step through the initialization sequence. On virtual machines, QEMU with -serial stdio and -s -S flags lets me attach GDB to the kernel from the very first instruction.
Strong Answer:
  • My primary concern is stability risk. A kernel module runs with full kernel privileges, and any bug — buffer overflow, use-after-free, deadlock — can panic the entire system, not just crash the agent. User-space processes are isolated by virtual memory: a segfault kills the process, not the machine.
  • Second, development velocity suffers. Kernel modules cannot use standard C libraries, memory debugging tools like Valgrind or AddressSanitizer work differently, and testing requires either VMs or rebooting. User-space development is dramatically faster.
  • Third, there is a maintenance burden. The kernel does not guarantee a stable internal ABI, so a module compiled for kernel 5.15 may not load on 5.19 if struct layouts changed. The module must be recompiled for each kernel version the fleet runs.
  • A kernel module is the right choice when you need access to information or hooks that are not exposed to user space — for example, intercepting every context switch for precise scheduling analysis, or implementing a custom block I/O scheduler. However, with eBPF now providing safe, verified access to kernel hooks from user space, most monitoring use cases should prefer eBPF over custom modules.
Follow-up: How does eBPF address the safety concerns of kernel modules while still running code in kernel context?
Follow-up Answer:
  • eBPF programs pass through a verifier before loading that checks termination (bounded loops, no unbounded recursion), memory safety (all pointer accesses bounds-checked), and type safety (correct helper function arguments). The verifier is designed to reject any program that could crash the kernel, although verifier bugs have occasionally been found. Additionally, eBPF programs run in a restricted execution environment: they cannot call arbitrary kernel functions, only approved helper functions, and they have a limited stack (512 bytes). This makes eBPF a safe middle ground between full kernel module access and user-space isolation.
