
Linux Kernel Internals

The Linux kernel is the heart of the Linux operating system. It manages all hardware resources, provides essential abstractions (like processes, files, and memory), and enforces security and isolation. Understanding kernel internals is crucial for systems programming, performance optimization, and senior engineering interviews.

What is a Kernel?

The kernel is the core part of any operating system. It runs in a privileged mode (kernel space) and has direct access to hardware. User applications run in user space and must interact with the kernel to perform any privileged operation (like reading a file or allocating memory). Think of the kernel as the “traffic controller” of your computer, making sure all requests from programs are handled safely and efficiently.
Interview Frequency: High for systems roles
Key Topics: Kernel architecture, system calls, modules, boot process
Time to Master: 15-20 hours

Kernel Architecture

The Linux kernel uses a monolithic architecture: all core services (process management, memory management, device drivers, networking) run in kernel mode and share a single address space. Loadable modules add flexibility by letting code such as device drivers be inserted into or removed from the running kernel. The kernel sits between user applications and hardware, providing a safe and efficient interface.

User Space vs Kernel Space

User space is where regular applications run. Kernel space is reserved for the OS kernel and its extensions. This separation protects the system from bugs and security issues in user programs.
┌─────────────────────────────────────────────────────────────────┐
│                    LINUX KERNEL ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   User Space                                                     │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  Applications (bash, nginx, Chrome, ...)                  │ │
│   └───────────────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  C Library (glibc)                                        │ │
│   └───────────────────────────┬───────────────────────────────┘ │
│                               │ System Calls                     │
│   ════════════════════════════╧═════════════════════════════════│
│                                                                  │
│   Kernel Space                                                   │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │              System Call Interface                        │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
│   ┌─────────────┬─────────────┬─────────────┬─────────────────┐ │
│   │   VFS       │   Scheduler │   Memory    │    Network      │ │
│   │             │             │  Management │     Stack       │ │
│   └─────────────┴─────────────┴─────────────┴─────────────────┘ │
│                                                                  │
│   ┌─────────────┬─────────────┬─────────────┬─────────────────┐ │
│   │  Filesystem │   IPC       │    Virtual  │    Netfilter    │ │
│   │  Drivers    │             │    Memory   │                 │ │
│   └─────────────┴─────────────┴─────────────┴─────────────────┘ │
│                                                                  │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │              Device Drivers (Block, Char, Net)            │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │              Architecture-Specific Code (x86, ARM)        │ │
│   └───────────────────────────────────────────────────────────┘ │
│                               │                                  │
│   ════════════════════════════╧═════════════════════════════════│
│                                                                  │
│   Hardware                                                       │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  CPU  │  Memory  │  Disk  │  Network  │  Peripherals      │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

System Calls

User programs can’t access hardware directly. Instead, they use system calls (syscalls) to request services from the kernel. For example, reading a file, creating a process, or allocating memory all require syscalls. The kernel validates and executes these requests, ensuring security and stability.

System Call Flow

Here’s how a typical system call works:
  1. The application calls a library function (like read() in C).
  2. The library sets up the syscall number and arguments, then triggers a special CPU instruction (like syscall on x86_64).
  3. The CPU switches to kernel mode and jumps to the syscall handler.
  4. The kernel validates arguments, performs the requested action, and returns a result.
  5. The CPU switches back to user mode, and the result is returned to the application.
┌─────────────────────────────────────────────────────────────────┐
│                    SYSTEM CALL FLOW                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   User Space                                                     │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  1. Application calls read(fd, buf, count)                │ │
│   │                              │                             │ │
│   │                              ▼                             │ │
│   │  2. glibc wrapper sets up registers                       │ │
│   │     - rax = __NR_read (syscall number)                    │ │
│   │     - rdi = fd, rsi = buf, rdx = count                    │ │
│   │                              │                             │ │
│   │                              ▼                             │ │
│   │  3. Execute SYSCALL instruction                           │ │
│   └──────────────────────────────┼────────────────────────────┘ │
│                                  │ Trap to kernel                │
│   ════════════════════════════════════════════════════════════  │
│                                  │                               │
│   Kernel Space                   ▼                               │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  4. entry_SYSCALL_64                                      │ │
│   │     - Save user registers                                  │ │
│   │     - Switch to kernel stack                               │ │
│   │                              │                             │ │
│   │                              ▼                             │ │
│   │  5. sys_call_table[rax] → sys_read()                      │ │
│   │                              │                             │ │
│   │                              ▼                             │ │
│   │  6. Execute system call                                   │ │
│   │     - VFS → filesystem driver → disk I/O                  │ │
│   │                              │                             │ │
│   │                              ▼                             │ │
│   │  7. Return: restore registers, SYSRET                     │ │
│   └──────────────────────────────┼────────────────────────────┘ │
│                                  │                               │
│   ════════════════════════════════════════════════════════════  │
│                                  ▼                               │
│   User Space                                                     │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  8. Return value in rax (bytes read or -errno)            │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

System Call Implementation

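// Kernel-side system call definition (simplified; the real code in
// fs/read_write.c also handles file-position locking via fdget_pos())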
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget(fd);
    ssize_t ret = -EBADF;
    
    if (!f.file)
        return ret;
    
    if (!(f.file->f_mode & FMODE_READ))
        goto out;
    
    ret = vfs_read(f.file, buf, count, &f.file->f_pos);
    
out:
    fdput(f);
    return ret;
}

// User-space invocation options (need <unistd.h> and <sys/syscall.h>):

// 1. Direct syscall (rare)
long raw_ret = syscall(SYS_read, fd, buf, count);

// 2. Through glibc wrapper (common)
ssize_t ret = read(fd, buf, count);
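To make the two invocation styles concrete, here is a minimal user-space sketch. The helper names raw_getpid and raw_read_errno are hypothetical and exist only for this illustration. Note that glibc's syscall() function, like the regular wrappers, turns a negative kernel return value into -1 and stores the error code in errno.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask for the PID through the generic syscall(2) entry point
 * instead of the getpid() wrapper. */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Attempt a one-byte read on fd through the raw interface.
 * Returns 0 on success, otherwise the errno value that was reported. */
static int raw_read_errno(int fd)
{
    char byte;
    long ret = syscall(SYS_read, fd, &byte, sizeof byte);

    return ret < 0 ? errno : 0;
}
```

Calling raw_getpid() should give the same value as getpid(), and raw_read_errno(-1) should report EBADF, which matches the -EBADF path in the kernel-side code above.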

Key System Calls

Some of the most important Linux system calls include:
Category   System Calls                            Purpose
Process    fork, execve, exit, wait                Process lifecycle
Memory     mmap, munmap, brk, mprotect             Memory management
File       open, read, write, close, stat          File I/O
Socket     socket, bind, listen, accept, connect   Networking
Signal     kill, sigaction, sigprocmask            Signal handling
IPC        pipe, shmget, msgget, semget            Inter-process communication

Process Management

In Linux, every running program is represented by a task_struct in the kernel. This structure holds all information about the process or thread: its state, scheduling info, memory, open files, credentials, and more.

Task Struct

The task_struct is the kernel’s internal data structure for tracking processes and threads. It’s like a detailed profile for every running program.
// Kernel representation of a process/thread
struct task_struct {
    // Thread state
    volatile long state;    // TASK_RUNNING, TASK_INTERRUPTIBLE, etc. (unsigned int __state since Linux 5.14)
    unsigned int flags;     // PF_EXITING, PF_VCPU, etc.
    
    // Scheduling
    int prio, static_prio, normal_prio;
    struct sched_entity se;
    struct sched_rt_entity rt;
    const struct sched_class *sched_class;
    
    // Process relationships
    struct task_struct *parent;
    struct list_head children;
    struct list_head sibling;
    struct task_struct *group_leader;
    
    // Memory management
    struct mm_struct *mm;           // Memory descriptor
    struct mm_struct *active_mm;    // For kernel threads
    
    // Filesystem info
    struct fs_struct *fs;           // Current directory, root
    struct files_struct *files;     // Open files
    
    // Credentials
    const struct cred *cred;        // UID, GID, capabilities
    
    // Signals
    struct signal_struct *signal;
    struct sighand_struct *sighand;
    sigset_t blocked, real_blocked;
    
    // Namespaces
    struct nsproxy *nsproxy;
    
    // ... many more fields
};

// Get the task running on this CPU.
// 'current' is a macro that expands to get_current(), so it cannot be redeclared.
struct task_struct *task = current;

Process States

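Processes in Linux can be in various states: running, sleeping (waiting), stopped, zombie, and so on. The kernel manages transitions between these states as processes execute, wait for I/O, or terminate. A sleeping task returns to TASK_RUNNING once the event it is waiting for occurs; it becomes EXIT_ZOMBIE only when it exits, and it stays a zombie until its parent reaps it with a wait call.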
┌─────────────────────────────────────────────────────────────────┐
│                    PROCESS STATE MACHINE                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│                        ┌─────────────┐                          │
│                        │   CREATED   │                          │
│                        │  (fork())   │                          │
│                        └──────┬──────┘                          │
│                               │                                  │
│                               ▼                                  │
│     ┌──────────────────────────────────────────────────┐        │
│     │                                                  │        │
│     │  ┌────────────────┐        ┌────────────────┐   │        │
│     │  │ TASK_RUNNING   │◄──────►│ TASK_RUNNING   │   │        │
│     │  │   (ready)      │schedule│   (on CPU)     │   │        │
│     │  └───────┬────────┘        └───────┬────────┘   │        │
│     │          │                         │            │        │
│     └──────────┼─────────────────────────┼────────────┘        │
│                │                         │                      │
│         wait   │                         │ I/O, lock            │
│                ▼                         ▼                      │
│     ┌────────────────────┐    ┌────────────────────┐           │
│     │TASK_INTERRUPTIBLE  │    │TASK_UNINTERRUPTIBLE│           │
│     │(can receive signal)│    │(ignores signals)   │           │
│     └────────────────────┘    └────────────────────┘           │
│                │                         │                      │
│                │         I/O complete    │                      │
│                └────────────┬────────────┘                      │
│                             │                                    │
│                             ▼                                    │
│                   ┌────────────────┐                            │
│                   │  EXIT_ZOMBIE   │                            │
│                   │(wait by parent)│                            │
│                   └────────┬───────┘                            │
│                            │                                     │
│                            ▼                                     │
│                   ┌────────────────┐                            │
│                   │   EXIT_DEAD    │                            │
│                   │   (reaped)     │                            │
│                   └────────────────┘                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Memory Management

Linux uses virtual memory to give each process the illusion of a private, contiguous address space. The kernel manages page tables, handles page faults, and allocates physical memory efficiently.

Address Space Layout

The address space of a process is divided into regions: code (text), data, heap, stack, and memory-mapped areas. The kernel enforces boundaries and permissions for each region, protecting processes from each other.
┌─────────────────────────────────────────────────────────────────┐
│                    VIRTUAL ADDRESS SPACE (x86-64)                │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  0xFFFFFFFFFFFFFFFF ┌──────────────────────────────────────┐    │
│                     │                                      │    │
│                     │        Kernel Space                  │    │
│                     │        (upper 128 TB)                │    │
│                     │                                      │    │
│                     │  - Direct mapping of physical RAM    │    │
│                     │  - vmalloc area                      │    │
│                     │  - Kernel text and data              │    │
│                     │  - Module space                      │    │
│                     │                                      │    │
│  0xFFFF800000000000 ├──────────────────────────────────────┤    │
│                     │        Hole (unmapped)               │    │
│  0x00007FFFFFFFFFFF ├──────────────────────────────────────┤    │
│                     │                                      │    │
│                     │        User Space                    │    │
│                     │        (lower 128 TB)                │    │
│                     │                                      │    │
│                     │  Stack (grows down)     ──┐          │    │
│                     │          │                │          │    │
│                     │          ▼                │          │    │
│                     │        (gap)             mmap        │    │
│                     │          ▲                │          │    │
│                     │          │                │          │    │
│                     │  mmap region (libs, etc.) ◄┘          │    │
│                     │                                      │    │
│                     │        (gap)                         │    │
│                     │          ▲                           │    │
│                     │          │                           │    │
│                     │  Heap (grows up via brk)             │    │
│                     │                                      │    │
│                     │  BSS (uninitialized data)            │    │
│                     │  Data (initialized data)             │    │
│                     │  Text (code)                         │    │
│  0x0000000000000000 └──────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
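The unmapped hole between the two halves follows from how 48-bit addressing works: bits 63 to 47 of a valid ("canonical") address must all be equal. A small sketch of that rule, assuming 4-level paging with 48-bit virtual addresses (is_canonical48 is a hypothetical helper name):

```c
#include <stdbool.h>
#include <stdint.h>

/* With 48-bit virtual addresses, bits 63..47 must all equal bit 47.
 * All zeros selects the user half, all ones the kernel half. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t upper = addr >> 47;    /* the top 17 bits */

    return upper == 0 || upper == 0x1FFFF;
}
```

Under this rule 0x00007FFFFFFFFFFF (top of user space) and 0xFFFF800000000000 (bottom of kernel space) are canonical, while 0x0000800000000000 falls in the hole.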

Page Tables (4-Level)

Modern CPUs (like x86-64) use multi-level page tables to efficiently map virtual addresses to physical memory. The kernel walks these tables to resolve memory accesses and handles page faults when needed.
// x86-64 page table structure
// 48-bit virtual address:
// [47:39] PML4 index (9 bits, 512 entries)
// [38:30] PDPT index (9 bits, 512 entries)  
// [29:21] PD index (9 bits, 512 entries)
// [20:12] PT index (9 bits, 512 entries)
// [11:0]  Page offset (12 bits, 4KB page)

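// Walk the page tables for a kernel-mapped address. Simplified sketch:
// no locking and no huge-page handling. The function name is illustrative.
// With 4-level paging the p4d level is folded into the pgd.
static phys_addr_t walk_page_tables(struct mm_struct *mm, unsigned long address)
{
    pgd_t *pgd = pgd_offset(mm, address);
    if (pgd_none(*pgd)) return 0;

    p4d_t *p4d = p4d_offset(pgd, address);
    if (p4d_none(*p4d)) return 0;

    pud_t *pud = pud_offset(p4d, address);
    if (pud_none(*pud)) return 0;

    pmd_t *pmd = pmd_offset(pud, address);
    if (pmd_none(*pmd)) return 0;

    pte_t *pte = pte_offset_kernel(pmd, address);
    if (!pte_present(*pte)) return 0;

    // Physical frame base from the PTE plus the byte offset within the page
    return (pte_val(*pte) & PTE_PFN_MASK) | (address & ~PAGE_MASK);
}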
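The bit layout described in the comments above can be checked with plain arithmetic in user space. This is only a sketch of the index extraction, not of the hardware walk; struct va_parts and split_va are hypothetical names.

```c
#include <stdint.h>

struct va_parts {
    unsigned pml4, pdpt, pd, pt;  /* 9-bit table indices (0..511) */
    unsigned offset;              /* 12-bit byte offset within a 4 KB page */
};

/* Split a 48-bit virtual address into its four table indices and page offset. */
static struct va_parts split_va(uint64_t va)
{
    struct va_parts p;

    p.pml4   = (unsigned)((va >> 39) & 0x1FF);
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}
```

For example, 0x401000 splits into PML4 index 0, PDPT index 0, PD index 2, PT index 1, and offset 0.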

Memory Allocation Layers

Memory allocation in Linux happens in layers:
  • User programs use malloc() (implemented by libraries like glibc).
  • The library requests memory from the kernel via brk() or mmap().
  • The kernel manages virtual memory areas (VMAs), page tables, and physical memory.
  • For kernel allocations, the slab and buddy allocators provide efficient memory management for different object sizes.
┌─────────────────────────────────────────────────────────────────┐
│                    KERNEL MEMORY ALLOCATION                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   User request: malloc(100)                                      │
│          │                                                       │
│          ▼                                                       │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  User-space allocator (malloc)                          │   │
│   │  - ptmalloc (glibc), jemalloc, tcmalloc                 │   │
│   │  - Caches memory, reduces syscalls                       │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │ brk() or mmap()                  │
│   ════════════════════════════╧═════════════════════════════════│
│                               │                                  │
│   ┌───────────────────────────▼─────────────────────────────┐   │
│   │  Virtual Memory Subsystem                               │   │
│   │  - Creates VMAs (vm_area_struct)                        │   │
│   │  - Handles page faults                                   │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                  │
│   ┌───────────────────────────▼─────────────────────────────┐   │
│   │  Slab Allocator (SLUB)                                  │   │
│   │  - kmalloc(), kmem_cache_alloc()                        │   │
│   │  - Object caching, minimal fragmentation                 │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                  │
│   ┌───────────────────────────▼─────────────────────────────┐   │
│   │  Buddy Allocator (Page Allocator)                       │   │
│   │  - alloc_pages(), __get_free_pages()                    │   │
│   │  - Power-of-2 page blocks                                │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                  │
│                               ▼                                  │
│                        Physical Memory                           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
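The "power-of-2 page blocks" of the buddy allocator mean that a request is rounded up to 2^order pages. The kernel has a get_order() helper for this; the user-space sketch below mirrors the idea under the assumption of 4 KB pages (SKETCH_PAGE_SIZE and order_for_size are hypothetical names).

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((size_t)4096)   /* assumption: 4 KB pages */

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= size.
 * Stops early for sizes too large to be covered by a power-of-two block. */
static unsigned order_for_size(size_t size)
{
    unsigned order = 0;
    size_t block = SKETCH_PAGE_SIZE;

    while (block < size && block <= SIZE_MAX / 2) {
        block <<= 1;
        order++;
    }
    return order;
}
```

A 20000-byte request needs five pages, so it is served from an order-3 block of eight pages; the three unused pages are the internal fragmentation that the slab layer exists to reduce for small objects.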

Kernel Modules

Kernel modules are pieces of code that can be loaded into or removed from the running kernel. They extend kernel functionality (like device drivers) without requiring a reboot. Writing modules is a key skill for Linux systems programming.

Module Structure

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Your Name");
MODULE_DESCRIPTION("Example kernel module");
MODULE_VERSION("1.0");

// Module parameters
static int param_value = 42;
module_param(param_value, int, 0644);
MODULE_PARM_DESC(param_value, "An integer parameter");

// Initialization function
static int __init mymodule_init(void)
{
    pr_info("Module loaded with param_value = %d\n", param_value);
    return 0;  // Success
}

// Cleanup function
static void __exit mymodule_exit(void)
{
    pr_info("Module unloaded\n");
}

module_init(mymodule_init);
module_exit(mymodule_exit);

Building a Module

# Makefile
obj-m := mymodule.o

# For multi-file modules:
# mymodule-objs := file1.o file2.o

KERNEL_DIR ?= /lib/modules/$(shell uname -r)/build

all:
	$(MAKE) -C $(KERNEL_DIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KERNEL_DIR) M=$(PWD) clean

# Build and load
$ make
$ sudo insmod mymodule.ko param_value=100
$ lsmod | grep mymodule
$ cat /sys/module/mymodule/parameters/param_value
$ sudo rmmod mymodule
$ dmesg | tail

Character Device Driver

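#include <linux/module.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/fs.h>
#include <linux/uaccess.h>   // copy_to_user()

// GPL-only symbols such as class_create() need a GPL-compatible license
MODULE_LICENSE("GPL");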

#define DEVICE_NAME "mydev"

static dev_t dev_num;
static struct cdev my_cdev;
static struct class *my_class;

static int mydev_open(struct inode *inode, struct file *file)
{
    pr_info("Device opened\n");
    return 0;
}

static ssize_t mydev_read(struct file *file, char __user *buf,
                          size_t count, loff_t *offset)
{
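    static const char msg[] = "Hello from kernel!\n";
    size_t len = sizeof(msg) - 1;   // exclude the trailing NUL

    if (*offset >= len)
        return 0;                   // end of file

    // Never copy more than the caller's buffer can hold
    count = min(count, len - (size_t)*offset);

    if (copy_to_user(buf, msg + *offset, count))
        return -EFAULT;

    *offset += count;
    return count;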
}

static struct file_operations fops = {
    .owner = THIS_MODULE,
    .open = mydev_open,
    .read = mydev_read,
};

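static int __init mydev_init(void)
{
    int ret;

    // Allocate device number (major chosen dynamically)
    ret = alloc_chrdev_region(&dev_num, 0, 1, DEVICE_NAME);
    if (ret)
        return ret;

    // Create cdev
    cdev_init(&my_cdev, &fops);
    ret = cdev_add(&my_cdev, dev_num, 1);
    if (ret)
        goto err_region;

    // Create device class and device
    // (since Linux 6.4, class_create() takes only the name argument)
    my_class = class_create(THIS_MODULE, DEVICE_NAME);
    if (IS_ERR(my_class)) {
        ret = PTR_ERR(my_class);
        goto err_cdev;
    }
    device_create(my_class, NULL, dev_num, NULL, DEVICE_NAME);
    return 0;

err_cdev:
    cdev_del(&my_cdev);
err_region:
    unregister_chrdev_region(dev_num, 1);
    return ret;
}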

static void __exit mydev_exit(void)
{
    device_destroy(my_class, dev_num);
    class_destroy(my_class);
    cdev_del(&my_cdev);
    unregister_chrdev_region(dev_num, 1);
}

module_init(mydev_init);
module_exit(mydev_exit);
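From user space, a character device is used like any other file: open() reaches the driver's .open handler and read() reaches .read, with a return value of 0 signalling end of file. The sketch below shows that loop; read_until_eof is a hypothetical helper, and the device path to pass in (for instance /dev/mydev once the module above is loaded) is an assumption about how the node is created on a given system.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Read from path until EOF or until buf is full.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t read_until_eof(const char *path, char *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);          /* calls the driver's .open */
    if (fd < 0)
        return -1;

    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);  /* calls .read */
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0)                         /* driver returned 0: end of file */
            break;
        total += (size_t)n;
    }
    close(fd);
    return (ssize_t)total;
}
```

The same helper works on any readable file, which makes it easy to try without the module: reading /dev/null returns 0 bytes immediately.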

Boot Process

┌─────────────────────────────────────────────────────────────────┐
│                    LINUX BOOT SEQUENCE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. BIOS/UEFI                                                    │
│     ┌───────────────────────────────────────────────────────┐   │
│     │ • Power-on self-test (POST)                           │   │
│     │ • Initialize hardware                                  │   │
│     │ • Load bootloader from boot device                     │   │
│     └───────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  2. Bootloader (GRUB)                                            │
│     ┌───────────────────────────────────────────────────────┐   │
│     │ • Display boot menu                                    │   │
│     │ • Load kernel image (vmlinuz)                          │   │
│     │ • Load initial ramdisk (initramfs)                     │   │
│     │ • Pass kernel command line parameters                   │   │
│     │ • Transfer control to kernel                           │   │
│     └───────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  3. Kernel Initialization                                        │
│     ┌───────────────────────────────────────────────────────┐   │
│     │ • Decompress kernel                                    │   │
│     │ • Set up page tables, GDT, IDT                         │   │
│     │ • Initialize memory management                          │   │
│     │ • Initialize scheduler                                  │   │
│     │ • Start kernel threads (kthreadd)                       │   │
│     │ • Mount initramfs as temporary root                     │   │
│     │ • Execute /init from initramfs                         │   │
│     └───────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  4. initramfs                                                    │
│     ┌───────────────────────────────────────────────────────┐   │
│     │ • Load essential drivers (disk, filesystem)            │   │
│     │ • Detect hardware                                       │   │
│     │ • Mount real root filesystem                            │   │
│     │ • switch_root to real root                              │   │
│     │ • exec /sbin/init                                       │   │
│     └───────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  5. Init System (systemd)                                        │
│     ┌───────────────────────────────────────────────────────┐   │
│     │ • PID 1: mother of all processes                       │   │
│     │ • Mount filesystems (/proc, /sys, etc.)                │   │
│     │ • Start services in dependency order                    │   │
│     │ • Reach default.target (multi-user or graphical)       │   │
│     │ • Spawn login prompts                                   │   │
│     └───────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Kernel Command Line

# View current command line
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0 root=UUID=xxx ro quiet splash

# Common parameters:
# root=           Root filesystem device
# ro/rw           Mount root read-only/read-write
# init=           Alternative init program
# single/1        Single user mode
# console=        Console device
# quiet           Suppress boot messages
# debug           Enable debug output
# panic=          Seconds before reboot on panic
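Programs sometimes need to check one of these parameters themselves. The following is a minimal sketch of looking up a key in a command-line string such as the contents of /proc/cmdline. cmdline_lookup is a hypothetical helper, and it deliberately ignores quoted values that contain spaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Look up key in a whitespace-separated kernel command line.
 * For "key=value" the value is copied into out (truncated to fit);
 * for a bare "key" flag, out is set to the empty string.
 * Returns true if the key was found. */
static bool cmdline_lookup(const char *cmdline, const char *key,
                           char *out, size_t cap)
{
    size_t klen = strlen(key);
    const char *p = cmdline;

    while (*p) {
        p += strspn(p, " \t\n");            /* skip separators */
        size_t tok = strcspn(p, " \t\n");   /* length of this token */
        if (tok == 0)
            break;

        /* Match "key" exactly or "key=" as a prefix, never "keyword". */
        if (tok >= klen && strncmp(p, key, klen) == 0 &&
            (tok == klen || p[klen] == '=')) {
            size_t vlen = (tok == klen) ? 0 : tok - klen - 1;
            if (cap == 0)
                return true;
            if (vlen >= cap)
                vlen = cap - 1;
            if (vlen > 0)
                memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return true;
        }
        p += tok;
    }
    return false;
}
```

With the sample line shown above, looking up root yields UUID=xxx, looking up quiet succeeds with an empty value, and looking up debug fails.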

Kernel Debugging

printk and Dynamic Debug

// Log levels
pr_emerg("System unusable\n");     // 0
pr_alert("Action required\n");     // 1
pr_crit("Critical condition\n");   // 2
pr_err("Error condition\n");       // 3
pr_warn("Warning condition\n");    // 4
pr_notice("Normal but significant\n"); // 5
pr_info("Informational\n");        // 6
pr_debug("Debug-level\n");         // 7 (requires DEBUG)

// Dynamic debug
pr_debug("Debug message with args: %d\n", value);
// Enable at runtime:
// echo 'module mymodule +p' > /sys/kernel/debug/dynamic_debug/control

/proc and /sys

# Process information
$ cat /proc/1/status        # PID 1 status
$ sudo cat /proc/1/maps     # Memory mappings (root needed for another user's process)
$ sudo ls -l /proc/1/fd     # Open files (fd is a directory of symlinks)

# System information
$ cat /proc/meminfo         # Memory statistics
$ cat /proc/cpuinfo         # CPU information
$ cat /proc/interrupts      # Interrupt counts

# Kernel tuning
$ cat /proc/sys/kernel/hostname
$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward   # writes need root

# Device information
$ ls /sys/class/net/        # Network interfaces
$ cat /sys/class/net/eth0/address  # MAC address
$ ls /sys/block/            # Block devices

Tracing

# Function tracer (run as root; requires debugfs mounted at /sys/kernel/debug)
$ echo function > /sys/kernel/debug/tracing/current_tracer
$ echo 1 > /sys/kernel/debug/tracing/tracing_on
$ cat /sys/kernel/debug/tracing/trace

# Function graph tracer
$ echo function_graph > /sys/kernel/debug/tracing/current_tracer

# Event tracing
$ echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
$ cat /sys/kernel/debug/tracing/trace

# Using trace-cmd (a command-line front end to ftrace)
$ trace-cmd record -e sched_switch sleep 1
$ trace-cmd report

# BPF tracing (modern)
$ bpftrace -e 'tracepoint:syscalls:sys_enter_read { @[comm] = count(); }'

Interview Deep Dive Questions

Answer:
1. Shell reads input:
Shell → read(STDIN) → "ls\n"
2. Shell parses and finds executable:
Search PATH: /usr/bin/ls found
3. fork() creates child process:
Shell (PID 100)

     │ fork()

Shell (PID 100) ──┬── Child (PID 101)
                  │   - Copy of parent
                  │   - Same code, data, file descriptors
4. execve() replaces child with ls:
// In child process:
char *argv[] = { "ls", NULL };
execve("/usr/bin/ls", argv, environ);

// Kernel actions:
// - Load ELF binary
// - Set up new address space
// - Initialize stack with args/env
// - Jump to entry point (_start → __libc_start_main → main)
5. ls executes:
openat(AT_FDCWD, ".", O_RDONLY | O_DIRECTORY) → Open the current directory
getdents64(dirfd, buffer) → Read directory entries
write(STDOUT, "file1  file2\n") → Output
6. ls exits:
exit_group(0) → Kernel cleanup
- Free memory
- Close file descriptors  
- Set state to EXIT_ZOMBIE
- Signal parent (SIGCHLD)
7. Shell reaps child:
Shell calls wait4() → Gets exit status
Zombie → EXIT_DEAD → Fully removed
Shell displays next prompt
Key system calls: read, fork, execve, openat, getdents64, write, exit_group, wait4
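The fork, exec, and wait steps of this sequence can be sketched in a few lines of user-space C. spawn_and_reap is a hypothetical helper; it uses execvp() rather than execve() so that the PATH search from step 2 is done by the C library, which is a simplification of what a real shell does.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] (searched in PATH) in a child process and return its exit
 * status, or -1 if the child could not be created or did not exit normally. */
static int spawn_and_reap(char *const argv[])
{
    pid_t pid = fork();                 /* step 3: duplicate the caller */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execvp(argv[0], argv);          /* step 4: replace the child image */
        _exit(127);                     /* reached only if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)   /* step 7: reap the zombie */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Exit status 127 for a failed exec follows the usual shell convention for "command not found".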
Answer:
Problem: Without copy-on-write, fork() would have to copy the parent's entire address space. For large processes, this would be slow and wasteful.
Solution: Copy-on-Write
Before fork():
┌─────────────────────────────────────────────────────────────┐
│  Parent (PID 100)                                           │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Page Table → Physical Page A (R/W)                  │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

After fork() (no copy yet!):
┌─────────────────────────────────────────────────────────────┐
│  Parent (PID 100)                Physical Page A            │
│  ┌───────────────┐               ┌───────────┐              │
│  │ Page Table    ├──────────────►│   Data    │◄─────────────┤
│  │ (now R/O)     │               │           │              │
│  └───────────────┘               └───────────┘              │
│                                        ▲                    │
│  Child (PID 101)                       │                    │
│  ┌───────────────┐                     │                    │
│  │ Page Table    ├─────────────────────┘                    │
│  │ (R/O copy)    │   Both point to same physical page!     │
│  └───────────────┘                                          │
└─────────────────────────────────────────────────────────────┘

When child writes (page fault triggers copy):
┌─────────────────────────────────────────────────────────────┐
│  Parent (PID 100)                Physical Page A            │
│  ┌───────────────┐               ┌───────────┐              │
│  │ Page Table    ├──────────────►│   Data    │              │
│  │ (R/O for now) │               │           │              │
│  └───────────────┘               └───────────┘              │
│                                                              │
│  Child (PID 101)                 Physical Page B (NEW)      │
│  ┌───────────────┐               ┌───────────┐              │
│  │ Page Table    ├──────────────►│ Data+Mod  │              │
│  │ (R/W)         │               │           │              │
│  └───────────────┘               └───────────┘              │
└─────────────────────────────────────────────────────────────┘
Implementation:
// During fork():
// 1. Create new page tables (shallow copy)
// 2. Mark writable private pages read-only in both parent and child
// 3. Increment reference count on physical pages

// On a write fault to a COW page (simplified pseudocode):
if (old_page->ref_count > 1) {
    // Page is still shared: allocate a private copy
    new_page = alloc_page();

    // Copy contents
    copy_page(new_page, old_page);

    // Point this process's page table entry at the copy (R/W)
    pte_set(pte, new_page, PTE_W);

    // Drop this process's reference to the shared page
    old_page->ref_count--;
} else {
    // We're the only user left, so just make the page writable again
    pte_set_writable(pte);
}
// In both cases the stale TLB entry for this address must be flushed
Benefits:
  • fork() cost scales with the size of the page tables, not with the amount of data mapped
  • Pages that are never written are never copied
  • exec() right after fork() discards the shared mappings, so almost nothing is copied
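Copy-on-write itself is invisible to programs; what can be observed from user space is the isolation it preserves. The sketch below (the function name is hypothetical) lets a child overwrite one byte of a 1 MiB buffer and then checks that the parent still sees its original data. Under COW only the page the child touched would need to be copied.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the parent's view of buf[0] after the child
 * has overwritten its own copy, or -1 on error. */
int parent_value_after_child_write(void)
{
    size_t len = 1 << 20;             /* 1 MiB, i.e. many pages */
    char *buf = malloc(len);
    if (!buf)
        return -1;
    memset(buf, 'P', len);            /* make sure the pages are populated */

    pid_t pid = fork();               /* pages are now shared read-only */
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        buf[0] = 'C';                 /* write fault: child gets a private page */
        _exit(buf[0] == 'C' ? 0 : 1);
    }

    int status;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = buf[0];              /* parent's page was never modified */

    free(buf);
    return result;
}
```

The parent's value stays `'P'`, because the child's write landed on its own copy of the page.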
Answer:
Page fault types:
  1. Minor fault: Page is already in memory but not mapped in this process’s page table (no disk I/O)
  2. Major fault: Page is not in memory (disk I/O needed)
  3. Invalid fault: Access violation (the process receives SIGSEGV)
Handler flow:
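// Simplified sketch based on arch/x86/mm/fault.c; locking is omitted
// and helper names and signatures vary between kernel versions

void do_page_fault(struct pt_regs *regs, unsigned long error_code,
                   unsigned long address)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned int flags = 0;
    vm_fault_t fault;

    // 1. Faults taken in kernel mode get special handling
    if (fault_in_kernel_mode(regs)) {
        // Either an expected fault with a fixup, or a kernel bug (oops)
        bad_area_nosemaphore(regs, error_code, address);
        return;
    }

    // The x86 error code has a bit that marks write accesses
    if (error_code & X86_PF_WRITE)
        flags |= FAULT_FLAG_WRITE;

    // 2. Find the first VMA that ends above the address
    vma = find_vma(mm, address);

    // 3. Check if address is valid
    if (!vma) {
        // No mapping at or above the address → SIGSEGV
        bad_area(regs, error_code, address);
        return;
    }
    if (address < vma->vm_start) {
        // Address lies in a gap: only valid if a stack VMA can grow down
        if (!(vma->vm_flags & VM_GROWSDOWN) || expand_stack(vma, address)) {
            bad_area(regs, error_code, address);
            return;
        }
    }

    // 4. Check permissions
    if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_WRITE)) {
        // Write to a read-only mapping → SIGSEGV
        bad_area(regs, error_code, address);
        return;
    }

    // 5. Handle the fault (demand paging, COW, swap-in, ...)
    fault = handle_mm_fault(vma, address, flags);

    // VM_FAULT_MAJOR is set in the result if disk I/O was needed;
    // otherwise the fault counts as minor
}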
Common scenarios:
| Scenario | Handling |
| --- | --- |
| Demand paging | Allocate page, zero-fill or read from file |
| Copy-on-write | Copy page, update permissions |
| Stack growth | Extend VMA, allocate pages |
| Swap | Read page from swap, update page table |
| Memory-mapped file | Read page from file |
| Invalid access | Send SIGSEGV to process |
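Demand paging can be observed from user space through the minor-fault counter that `getrusage()` reports. The following is a rough sketch (the function name is hypothetical): it maps anonymous memory, touches each page once, and returns how many minor faults the process took in between. On a typical Linux system the result is usually close to the number of pages, but the exact figure depends on kernel configuration and on the environment, so treat it as indicative only.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of minor faults recorded while
 * touching `npages` freshly mapped anonymous pages, or -1 on error. */
long minor_faults_touching(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0)
        return -1;

    size_t len = npages * (size_t)page;
    volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    struct rusage before, after;
    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap((void *)p, len);
        return -1;
    }

    /* mmap() only created the VMA; the first write to each page faults
     * and the kernel hands back a zero-filled page. */
    for (size_t i = 0; i < npages; i++)
        p[i * (size_t)page] = 1;

    int rc = getrusage(RUSAGE_SELF, &after);
    munmap((void *)p, len);
    if (rc != 0)
        return -1;

    return after.ru_minflt - before.ru_minflt;
}
```

No disk I/O is involved here, so these show up as minor faults (`ru_minflt`) rather than major ones (`ru_majflt`).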
Answer:
Problem: Interrupt handlers must be fast, but some work takes time.
Solution: Deferred work mechanisms
┌─────────────────────────────────────────────────────────────┐
│                    DEFERRED WORK                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  Interrupt Context                    │   │
│  │  • Runs with interrupts disabled                      │   │
│  │  • Cannot sleep or take a mutex; atomic allocs only   │   │
│  │  • Must be very fast (< 1ms)                          │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │ Schedule deferred work           │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │               Softirq / Tasklet                       │   │
│  │  • Runs with interrupts enabled                       │   │
│  │  • Cannot sleep                                       │   │
│  │  • Runs in atomic context                            │   │
│  │  • For time-critical deferred work                    │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │ If work can sleep               │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  Workqueue                            │   │
│  │  • Runs in process context (kernel thread)            │   │
│  │  • CAN sleep, allocate memory, take mutex             │   │
│  │  • Lower priority than softirqs                       │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Comparison:
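| Feature | Softirq | Tasklet | Workqueue |
| --- | --- | --- | --- |
| Context | Atomic | Atomic | Process |
| Can sleep | No | No | Yes |
| Concurrency | Per-CPU | Serialized | Per-worker |
| Priority | High | High | Lower |
| Use case | Network, block I/O | Simple deferred | Complex work |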
Examples:
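// Softirq (static, limited number)
// Used by: NET_RX_SOFTIRQ, NET_TX_SOFTIRQ, BLOCK_SOFTIRQ

// Tasklet (dynamic, built on softirq)
// Two-argument form used by recent kernels; older ones take a third data argument
static void my_handler(struct tasklet_struct *t);
DECLARE_TASKLET(my_tasklet, my_handler);
tasklet_schedule(&my_tasklet);   // typically called from the IRQ handler

// Workqueue (most flexible)
static DEFINE_MUTEX(my_mutex);
static void my_work_handler(struct work_struct *work);
DECLARE_WORK(my_work, my_work_handler);
schedule_work(&my_work);         // queue on the system workqueue

static void my_work_handler(struct work_struct *work) {
    // This can sleep!
    mutex_lock(&my_mutex);
    // ... do work ...
    mutex_unlock(&my_mutex);
}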
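The kernel snippets above only build inside a kernel module. As a user-space analogy (not kernel code, and not one of the mechanisms listed above), a signal handler plays the role of the restricted "top half" and ordinary code plays the role of the deferred "bottom half": the handler only records that work is pending, and the slow part runs later in a context where blocking is allowed. All names in this sketch are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Number of events recorded by the handler and not yet processed. */
static volatile sig_atomic_t pending;

/* "Top half": runs asynchronously, so it does the bare minimum. */
static void on_signal(int signo)
{
    (void)signo;
    pending = pending + 1;
}

/* "Bottom half": runs in normal context, where slow or blocking work is
 * fine. Real code would block the signal while updating the counter. */
static int drain_pending(void)
{
    int handled = 0;
    while (pending > 0) {
        pending = pending - 1;
        handled++;                /* the slow work would go here */
    }
    return handled;
}

/* Deliver `events` signals, then process them all in one deferred pass.
 * Returns the number of events processed, or -1 on error. */
int deferred_work_demo(int events)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    for (int i = 0; i < events; i++)
        raise(SIGUSR1);           /* handler runs before raise() returns */

    return drain_pending();
}
```

The split mirrors the kernel design choice: keep the asynchronous part tiny and predictable, and move everything that may take time into a context that is allowed to wait.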
Answer:
Futex = Fast Userspace muTEX
Goal: Avoid a syscall in the common (uncontended) case.
┌─────────────────────────────────────────────────────────────┐
│                    FUTEX OPERATION                           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Uncontended case (no syscall):                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Thread A:                                            │   │
│  │  atomic_cmpxchg(&futex, 0, 1) → Success              │   │
│  │  // Lock acquired, no kernel involvement!             │   │
│  │                                                       │   │
│  │  atomic_cmpxchg(&futex, 1, 0) → Success              │   │
│  │  // Lock released, still no kernel!                   │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Contended case (syscall needed):                           │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Thread A: holds lock (futex = 1)                     │   │
│  │                                                       │   │
│  │  Thread B:                                            │   │
│  │  atomic_cmpxchg(&futex, 0, 1) → Fails                │   │
│  │  // Lock held, must wait                              │   │
│  │                                                       │   │
│  │  futex(&futex, FUTEX_WAIT, 1)                        │   │
│  │  // Kernel: "sleep if futex is still 1"               │   │
│  │  // Thread B blocks until FUTEX_WAKE                  │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Wake up:                                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Thread A:                                            │   │
│  │  atomic_exchange(&futex, 0)                          │   │
│  │  // Sees there may be waiters                         │   │
│  │                                                       │   │
│  │  futex(&futex, FUTEX_WAKE, 1)                        │   │
│  │  // Kernel: wake one waiter                           │   │
│  │                                                       │   │
│  │  Thread B: wakes up, retries atomic op                │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Kernel implementation:
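// Simplified pseudocode. The real syscall takes six arguments and
// dispatches to futex_wait() / futex_wake()
long futex(u32 __user *uaddr, int op, u32 val)
{
    // Get hash bucket for this address
    struct futex_hash_bucket *hb = hash_futex(uaddr);

    if (op == FUTEX_WAIT) {
        struct futex_q q;
        u32 uval;

        spin_lock(&hb->lock);

        // Re-read the futex word. get_user() returns an error code
        // and stores the value in uval
        if (get_user(uval, uaddr)) {
            spin_unlock(&hb->lock);
            return -EFAULT;
        }

        // Value changed since userspace looked: do not sleep
        if (uval != val) {
            spin_unlock(&hb->lock);
            return -EAGAIN;  // Try again in userspace
        }

        // Mark the task as sleeping and add it to the wait queue
        // while still holding the lock, so no wake-up is lost
        set_current_state(TASK_INTERRUPTIBLE);
        queue_me(&q, hb);

        spin_unlock(&hb->lock);

        // Sleep until FUTEX_WAKE (or a signal) wakes us
        schedule();

        return 0;
    }

    if (op == FUTEX_WAKE) {
        int woken;

        spin_lock(&hb->lock);

        // Wake up to val waiters queued on this address
        woken = wake_up_n(hb, val);

        spin_unlock(&hb->lock);

        return woken;
    }

    return -ENOSYS;
}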
Why it’s fast:
  • Uncontended: Just an atomic operation, no syscall
  • Hash table lookup in kernel is O(1)
  • Foundation for pthread_mutex, semaphores, condition variables
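A user-space lock built on this idea can be sketched as follows. It is a minimal illustration, not the glibc implementation: it uses the common three-state scheme (0 = unlocked, 1 = locked, 2 = locked with possible waiters) instead of the two-state word in the diagram, so that unlock knows whether a FUTEX_WAKE is needed. The type and function names are hypothetical; `SYS_futex`, `FUTEX_WAIT_PRIVATE`, and `FUTEX_WAKE_PRIVATE` come from the Linux headers.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked, 2 = locked and there may be waiters */
typedef struct {
    _Atomic uint32_t state;
} futex_mutex;

/* Thin wrapper: glibc exposes no futex() function, only the raw syscall. */
static long futex_call(_Atomic uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void futex_mutex_lock(futex_mutex *m)
{
    uint32_t expected = 0;

    /* Fast path: 0 -> 1 with one atomic instruction, no syscall. */
    if (atomic_compare_exchange_strong(&m->state, &expected, 1))
        return;

    /* Slow path: mark the lock as contended (2). If the old value was 0
     * we now own the lock; otherwise sleep while the word is still 2. */
    while (atomic_exchange(&m->state, 2) != 0)
        futex_call(&m->state, FUTEX_WAIT_PRIVATE, 2);
}

void futex_mutex_unlock(futex_mutex *m)
{
    /* Enter the kernel only if a waiter may be sleeping. */
    if (atomic_exchange(&m->state, 0) == 2)
        futex_call(&m->state, FUTEX_WAKE_PRIVATE, 1);
}
```

In the uncontended case both functions complete with a single atomic operation each, which is the property the list above describes. A thread that acquires the lock through the slow path leaves the state at 2, so its unlock issues one possibly unnecessary wake; that trade keeps the sketch short and free of lost wake-ups.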

Kernel Exploration Commands

# Kernel version and build info
$ uname -a
$ cat /proc/version

# Boot parameters
$ cat /proc/cmdline

# Loaded modules
$ lsmod
$ modinfo <module_name>

# Hardware and IRQ info
$ cat /proc/interrupts
$ cat /proc/iomem
$ lspci -v

# Process information (reading /proc/<pid>/stack requires root)
$ cat /proc/<pid>/status
$ cat /proc/<pid>/maps
$ cat /proc/<pid>/stack

# System performance
$ cat /proc/stat
$ cat /proc/loadavg
$ cat /proc/vmstat

# Tracing (needs root; tracefs may also be mounted at /sys/kernel/tracing)
$ cat /sys/kernel/debug/tracing/available_tracers
$ perf list
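
The same /proc files can be read from a program, since they behave like ordinary text files. As a small sketch (the function name is hypothetical), this reads one numeric field such as `Pid` or `VmRSS` from /proc/self/status, the current process's equivalent of /proc/<pid>/status:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the numeric value of `key` from
 * /proc/self/status, or -1 if the file or the key is missing. */
long proc_self_status_value(const char *key)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;

    char line[256];
    size_t klen = strlen(key);
    long value = -1;

    while (fgets(line, sizeof line, f)) {
        /* Lines look like "Pid:\t1234" or "VmRSS:\t  1024 kB". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            if (sscanf(line + klen + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }

    fclose(f);
    return value;
}
```

Asking for `"Pid"` should match what `getpid()` returns, and `"VmRSS"` gives the resident set size in kB.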

Key Takeaways

Monolithic but Modular

Linux kernel is monolithic (single address space) but supports loadable modules.

System Call Interface

On x86-64, the SYSCALL instruction transitions to kernel mode. Modern Linux has roughly 400 system calls; the exact count depends on the architecture.

Memory Management

4-level page tables (5-level on some x86-64 systems), buddy + slab allocators, copy-on-write optimization.

Boot Sequence

BIOS/UEFI → Bootloader → Kernel → initramfs → systemd (PID 1)
