Linux Kernel Internals

The Linux kernel is the heart of the Linux operating system. It manages all hardware resources, provides essential abstractions (like processes, files, and memory), and enforces security and isolation. Understanding kernel internals is crucial for systems programming, performance optimization, and senior engineering interviews.

What is a Kernel?

The kernel is the core part of any operating system. It runs in a privileged mode (kernel space) and has direct access to hardware. User applications run in user space and must interact with the kernel to perform any privileged operation (like reading a file or allocating memory). Think of the kernel as the “traffic controller” of your computer, making sure all requests from programs are handled safely and efficiently.

Interview Frequency: High for systems roles
Key Topics: Kernel architecture, system calls, modules, boot process
Time to Master: 15-20 hours

Kernel Architecture

The Linux kernel uses a monolithic architecture: all core services (process management, memory management, device drivers, networking) run in kernel mode within a single address space. Loadable modules add flexibility by letting code such as drivers be linked into the running kernel on demand. The kernel sits between user applications and hardware, providing a safe and efficient interface.

User Space vs Kernel Space

User space is where regular applications run. Kernel space is reserved for the OS kernel and its extensions. This separation protects the system from bugs and security issues in user programs.

┌─────────────────────────────────────────────────────────────────┐
│                    LINUX KERNEL ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   User Space                                                     │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  Applications (bash, nginx, Chrome, ...)                  │ │
│   └───────────────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  C Library (glibc)                                        │ │
│   └───────────────────────────┬───────────────────────────────┘ │
│                               │ System Calls                     │
│   ════════════════════════════╧═════════════════════════════════│
│                                                                  │
│   Kernel Space                                                   │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │              System Call Interface                        │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
│   ┌─────────────┬─────────────┬─────────────┬─────────────────┐ │
│   │   VFS       │   Scheduler │   Memory    │    Network      │ │
│   │             │             │  Management │     Stack       │ │
│   └─────────────┴─────────────┴─────────────┴─────────────────┘ │
│                                                                  │
│   ┌─────────────┬─────────────┬─────────────┬─────────────────┐ │
│   │  Filesystem │   IPC       │    Virtual  │    Netfilter    │ │
│   │  Drivers    │             │    Memory   │                 │ │
│   └─────────────┴─────────────┴─────────────┴─────────────────┘ │
│                                                                  │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │              Device Drivers (Block, Char, Net)            │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │              Architecture-Specific Code (x86, ARM)        │ │
│   └───────────────────────────────────────────────────────────┘ │
│                               │                                  │
│   ════════════════════════════╧═════════════════════════════════│
│                                                                  │
│   Hardware                                                       │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  CPU  │  Memory  │  Disk  │  Network  │  Peripherals      │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

System Calls

User programs can’t access hardware directly. Instead, they use system calls (syscalls) to request services from the kernel. For example, reading a file, creating a process, or allocating memory all require syscalls. The kernel validates and executes these requests, ensuring security and stability.

System Call Flow

Here’s how a typical system call works:

1. The application calls a library function (like read() in C).
2. The library sets up the syscall number and arguments, then triggers a special CPU instruction (like syscall on x86_64).
3. The CPU switches to kernel mode and jumps to the syscall handler.
4. The kernel validates arguments, performs the requested action, and returns a result.
5. The CPU switches back to user mode, and the result is returned to the application.

┌─────────────────────────────────────────────────────────────────┐
│                    SYSTEM CALL FLOW                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   User Space                                                     │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  1. Application calls read(fd, buf, count)                │ │
│   │                              │                             │ │
│   │                              ▼                             │ │
│   │  2. glibc wrapper sets up registers                       │ │
│   │     - rax = __NR_read (syscall number)                    │ │
│   │     - rdi = fd, rsi = buf, rdx = count                    │ │
│   │                              │                             │ │
│   │                              ▼                             │ │
│   │  3. Execute SYSCALL instruction                           │ │
│   └──────────────────────────────┼────────────────────────────┘ │
│                                  │ Trap to kernel                │
│   ════════════════════════════════════════════════════════════  │
│                                  │                               │
│   Kernel Space                   ▼                               │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  4. entry_SYSCALL_64                                      │ │
│   │     - Save user registers                                  │ │
│   │     - Switch to kernel stack                               │ │
│   │                              │                             │ │
│   │                              ▼                             │ │
│   │  5. sys_call_table[rax] → sys_read()                      │ │
│   │                              │                             │ │
│   │                              ▼                             │ │
│   │  6. Execute system call                                   │ │
│   │     - VFS → filesystem driver → disk I/O                  │ │
│   │                              │                             │ │
│   │                              ▼                             │ │
│   │  7. Return: restore registers, SYSRET                     │ │
│   └──────────────────────────────┼────────────────────────────┘ │
│                                  │                               │
│   ════════════════════════════════════════════════════════════  │
│                                  ▼                               │
│   User Space                                                     │
│   ┌───────────────────────────────────────────────────────────┐ │
│   │  8. Return value in rax (bytes read or -errno)            │ │
│   └───────────────────────────────────────────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

System Call Implementation

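// Kernel-side system call definition (simplified for illustration;
// the in-tree implementation lives in ksys_read() in fs/read_write.c)
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget(fd);            // Look up the struct file for this descriptor
    ssize_t ret = -EBADF;

    if (!f.file)                        // Descriptor is not open
        return ret;

    if (!(f.file->f_mode & FMODE_READ)) // File was not opened for reading
        goto out;

    ret = vfs_read(f.file, buf, count, &f.file->f_pos);

out:
    fdput(f);                           // Drop the reference taken by fdget()
    return ret;
}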

// User-space invocation options:

// 1. Direct syscall (rare)
long result = syscall(SYS_read, fd, buf, count);

// 2. Through glibc wrapper (common)
ssize_t result = read(fd, buf, count);
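The kernel-side code reports a bad descriptor by returning -EBADF, while user-space callers see -1 with errno set. The sketch below, assuming Linux with glibc, shows that both invocation paths apply this convention. The helper names read_errno_wrapper and read_errno_raw are illustrative, not part of any library.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Try to read one byte through the glibc wrapper. Returns 0 on success,
 * otherwise the errno value reported for the failed call. */
static int read_errno_wrapper(int fd)
{
    char byte;

    errno = 0;
    return read(fd, &byte, 1) == -1 ? errno : 0;
}

/* Same request through syscall(2), which takes the syscall number
 * explicitly. It follows the same convention: -1 on failure, with
 * the kernel's error code stored in errno. */
static int read_errno_raw(int fd)
{
    char byte;

    errno = 0;
    return syscall(SYS_read, fd, &byte, (size_t)1) == -1 ? errno : 0;
}
```

Calling either helper with an invalid descriptor such as -1 yields EBADF, matching the early return in the kernel-side read implementation.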

Key System Calls

Some of the most important Linux system calls include:

Category	System Calls	Purpose
Process	fork, execve, exit, wait	Process lifecycle
Memory	mmap, munmap, brk, mprotect	Memory management
File	open, read, write, close, stat	File I/O
Socket	socket, bind, listen, accept, connect	Networking
Signal	kill, sigaction, sigprocmask	Signal handling
IPC	pipe, shmget, msgget, semget	Inter-process communication

Process Management

In Linux, every thread of execution, whether a single-threaded process or one thread of a multi-threaded program, is represented by a task_struct in the kernel. This structure holds all information about the task: its state, scheduling info, memory, open files, credentials, and more.

Task Struct

The task_struct is the kernel’s internal data structure for tracking processes and threads. It’s like a detailed profile for every running program.

// Kernel representation of a process/thread
struct task_struct {
    // Thread state
    volatile long state;    // TASK_RUNNING, TASK_INTERRUPTIBLE, etc. (unsigned int __state since Linux 5.14)
    unsigned int flags;     // PF_EXITING, PF_VCPU, etc.
    
    // Scheduling
    int prio, static_prio, normal_prio;
    struct sched_entity se;
    struct sched_rt_entity rt;
    const struct sched_class *sched_class;
    
    // Process relationships
    struct task_struct *parent;
    struct list_head children;
    struct list_head sibling;
    struct task_struct *group_leader;
    
    // Memory management
    struct mm_struct *mm;           // Memory descriptor
    struct mm_struct *active_mm;    // For kernel threads
    
    // Filesystem info
    struct fs_struct *fs;           // Current directory, root
    struct files_struct *files;     // Open files
    
    // Credentials
    const struct cred *cred;        // UID, GID, capabilities
    
    // Signals
    struct signal_struct *signal;
    struct sighand_struct *sighand;
    sigset_t blocked, real_blocked;
    
    // Namespaces
    struct nsproxy *nsproxy;
    
    // ... many more fields
};

// Get current task: 'current' is a macro that expands to get_current(),
// so it cannot be used as a variable name
struct task_struct *task = current;

Process States

Processes in Linux can be in various states: running, waiting, stopped, zombie, etc. The kernel manages transitions between these states as processes execute, wait for I/O, or terminate.

┌─────────────────────────────────────────────────────────────────┐
│                    PROCESS STATE MACHINE                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│                        ┌─────────────┐                          │
│                        │   CREATED   │                          │
│                        │  (fork())   │                          │
│                        └──────┬──────┘                          │
│                               │                                  │
│                               ▼                                  │
│     ┌──────────────────────────────────────────────────┐        │
│     │                                                  │        │
│     │  ┌────────────────┐        ┌────────────────┐   │        │
│     │  │ TASK_RUNNING   │◄──────►│ TASK_RUNNING   │   │        │
│     │  │   (ready)      │schedule│   (on CPU)     │   │        │
│     │  └───────┬────────┘        └───────┬────────┘   │        │
│     │          │                         │            │        │
│     └──────────┼─────────────────────────┼────────────┘        │
│                │                         │                      │
│         wait   │                         │ I/O, lock            │
│                ▼                         ▼                      │
│     ┌────────────────────┐    ┌────────────────────┐           │
│     │TASK_INTERRUPTIBLE  │    │TASK_UNINTERRUPTIBLE│           │
│     │(can receive signal)│    │(ignores signals)   │           │
│     └────────────────────┘    └────────────────────┘           │
│                │                         │                      │
│                └──── wakeup: back to ────┘                      │
│                      TASK_RUNNING (ready)                       │
│                                                                  │
│             exit() from TASK_RUNNING (on CPU)                   │
│                             │                                    │
│                             ▼                                    │
│                   ┌────────────────┐                            │
│                   │  EXIT_ZOMBIE   │                            │
│                   │(wait by parent)│                            │
│                   └────────┬───────┘                            │
│                            │                                     │
│                            ▼                                     │
│                   ┌────────────────┐                            │
│                   │   EXIT_DEAD    │                            │
│                   │   (reaped)     │                            │
│                   └────────────────┘                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
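These states can be observed from user space: the third field of /proc/<pid>/stat is a one-letter state code (R for running, S for interruptible sleep, D for uninterruptible sleep, Z for zombie). The following is a minimal sketch, assuming a Linux system with /proc mounted. The helper names proc_state and state_of_unreaped_child are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return the one-letter state from /proc/<pid>/stat, or '?' on failure. */
static char proc_state(pid_t pid)
{
    char path[64], line[512];
    char *ok, *p;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    f = fopen(path, "r");
    if (!f)
        return '?';
    ok = fgets(line, sizeof(line), f);
    fclose(f);
    if (!ok)
        return '?';

    /* The command name may contain spaces or ')', so search from the right. */
    p = strrchr(line, ')');
    if (!p || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}

/* Fork a child that exits immediately and report its state before it is
 * reaped. WNOWAIT makes waitid() block until the child has exited while
 * leaving it in the zombie state. */
static char state_of_unreaped_child(void)
{
    siginfo_t info;
    char state;
    pid_t pid = fork();

    if (pid < 0)
        return '?';
    if (pid == 0)
        _exit(0);

    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
        return '?';
    state = proc_state(pid);
    waitpid(pid, NULL, 0);   /* Reap: the zombie entry is released */
    return state;
}
```

A process inspecting itself is on the CPU while it reads the file, so proc_state(getpid()) reports R; the exited but unreaped child reports Z.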

Memory Management

Linux uses virtual memory to give each process the illusion of a private, contiguous address space. The kernel manages page tables, handles page faults, and allocates physical memory efficiently.

Address Space Layout

The address space of a process is divided into regions: code (text), data, heap, stack, and memory-mapped areas. The kernel enforces boundaries and permissions for each region, protecting processes from each other.

┌─────────────────────────────────────────────────────────────────┐
│                    VIRTUAL ADDRESS SPACE (x86-64)                │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  0xFFFFFFFFFFFFFFFF ┌──────────────────────────────────────┐    │
│                     │                                      │    │
│                     │        Kernel Space                  │    │
│                     │        (upper 128 TB)                │    │
│                     │                                      │    │
│                     │  - Direct mapping of physical RAM    │    │
│                     │  - vmalloc area                      │    │
│                     │  - Kernel text and data              │    │
│                     │  - Module space                      │    │
│                     │                                      │    │
│  0xFFFF800000000000 ├──────────────────────────────────────┤    │
│                     │        Hole (unmapped)               │    │
│  0x00007FFFFFFFFFFF ├──────────────────────────────────────┤    │
│                     │                                      │    │
│                     │        User Space                    │    │
│                     │        (lower 128 TB)                │    │
│                     │                                      │    │
│                     │  Stack (grows down)     ──┐          │    │
│                     │          │                │          │    │
│                     │          ▼                │          │    │
│                     │        (gap)             mmap        │    │
│                     │          ▲                │          │    │
│                     │          │                │          │    │
│                     │  mmap region (libs, etc.) ◄┘          │    │
│                     │                                      │    │
│                     │        (gap)                         │    │
│                     │          ▲                           │    │
│                     │          │                           │    │
│                     │  Heap (grows up via brk)             │    │
│                     │  BSS (uninitialized data)            │    │
│                     │  Data (initialized data)             │    │
│                     │  Text (code)                         │    │
│  0x0000000000000000 └──────────────────────────────────────┘    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
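The regions in this layout, together with their permissions, are listed per process in /proc/self/maps. As a rough sketch (assuming Linux with /proc mounted; the struct region type and the find_region helper are made up for illustration), the following looks up which mapping contains a given address:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct region {
    char perms[5];    /* e.g. "r-xp" */
    char name[256];   /* e.g. "[stack]", "[heap]", or a file path */
};

/* Find the mapping in /proc/self/maps that contains addr.
 * Returns 0 and fills *out on success, -1 if no mapping matches. */
static int find_region(uintptr_t addr, struct region *out)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int found = -1;

    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        uint64_t lo, hi;
        char perms[5] = "", name[256] = "";

        /* Line format: start-end perms offset dev inode [pathname] */
        if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s %*s %*s %*s %255[^\n]",
                   &lo, &hi, perms, name) < 3)
            continue;

        if (addr >= lo && addr < hi) {
            memcpy(out->perms, perms, sizeof(perms));
            memcpy(out->name, name, sizeof(name));
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Passing the address of a local variable from the main thread typically resolves to the read-write [stack] mapping, while the address of a function resolves to an executable mapping of the program binary.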

Page Tables (4-Level)

Modern CPUs (like x86-64) use multi-level page tables to efficiently map virtual addresses to physical memory. On a TLB miss, the MMU walks these tables in hardware; the kernel builds and maintains them and handles the page fault raised when a translation is missing or the access is not permitted.

// x86-64 page table structure
// 48-bit virtual address:
// [47:39] PML4 index (9 bits, 512 entries)
// [38:30] PDPT index (9 bits, 512 entries)  
// [29:21] PD index (9 bits, 512 entries)
// [20:12] PT index (9 bits, 512 entries)
// [11:0]  Page offset (12 bits, 4KB page)

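// Walk the page tables in software (sketch: no locking, huge pages ignored).
// Returns 0 if the address is not mapped. The helper name is hypothetical.
// With 4-level paging, the p4d level is folded into the pgd.
static unsigned long walk_to_phys(struct mm_struct *mm, unsigned long address)
{
    pgd_t *pgd = pgd_offset(mm, address);
    if (pgd_none(*pgd)) return 0;

    p4d_t *p4d = p4d_offset(pgd, address);
    if (p4d_none(*p4d)) return 0;

    pud_t *pud = pud_offset(p4d, address);
    if (pud_none(*pud)) return 0;

    pmd_t *pmd = pmd_offset(pud, address);
    if (pmd_none(*pmd)) return 0;

    pte_t *pte = pte_offset_kernel(pmd, address);
    if (!pte_present(*pte)) return 0;   // Entry may exist but be swapped out

    // Physical frame address from the PTE plus the offset within the page
    return (pte_val(*pte) & PTE_PFN_MASK) | (address & ~PAGE_MASK);
}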

Memory Allocation Layers

Memory allocation in Linux happens in layers:

1. User programs use malloc() (implemented by libraries like glibc).
2. The library requests memory from the kernel via brk() or mmap().
3. The kernel manages virtual memory areas (VMAs), page tables, and physical memory.
4. For kernel allocations, the slab and buddy allocators provide efficient memory management for different object sizes.

┌─────────────────────────────────────────────────────────────────┐
│                    KERNEL MEMORY ALLOCATION                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   User request: malloc(100)                                      │
│          │                                                       │
│          ▼                                                       │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  User-space allocator (malloc)                          │   │
│   │  - glibc ptmalloc, jemalloc, tcmalloc                   │   │
│   │  - Caches memory, reduces syscalls                       │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │ brk() or mmap()                  │
│   ════════════════════════════╧═════════════════════════════════│
│                               │                                  │
│   ┌───────────────────────────▼─────────────────────────────┐   │
│   │  Virtual Memory Subsystem                               │   │
│   │  - Creates VMAs (vm_area_struct)                        │   │
│   │  - Handles page faults                                   │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                  │
│   ┌───────────────────────────▼─────────────────────────────┐   │
│   │  Slab Allocator (SLUB)                                  │   │
│   │  - kmalloc(), kmem_cache_alloc()                        │   │
│   │  - Object caching, minimal fragmentation                 │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                  │
│   ┌───────────────────────────▼─────────────────────────────┐   │
│   │  Buddy Allocator (Page Allocator)                       │   │
│   │  - alloc_pages(), __get_free_pages()                    │   │
│   │  - Power-of-2 page blocks                                │   │
│   └───────────────────────────┬─────────────────────────────┘   │
│                               │                                  │
│                               ▼                                  │
│                        Physical Memory                           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
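The two kernel interfaces at the user/kernel boundary in this diagram can be exercised directly. The sketch below assumes Linux with glibc and a single-threaded caller; grow_break and map_anon are hypothetical helper names. It moves the program break the way an allocator does for small requests, and creates an anonymous mapping the way it does for large ones.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Raise the program break by n bytes, measure how far it moved, then
 * restore it. Returns the distance in bytes, or -1 on failure. */
static long grow_break(size_t n)
{
    char *before = sbrk(0);             /* sbrk(0) reads the current break */
    char *after;

    if (sbrk((intptr_t)n) == (void *)-1)
        return -1;
    after = sbrk(0);
    sbrk(-(intptr_t)n);                 /* Give the memory back */
    return (long)(after - before);
}

/* Create a private anonymous mapping of len bytes. The kernel only
 * records a VMA here; zero-filled pages are supplied on first access. */
static void *map_anon(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

Real allocators sit on top of these calls and cache the memory they obtain, which is why most malloc() calls never enter the kernel.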

Kernel Modules

Kernel modules are pieces of code that can be loaded into or removed from the running kernel. They extend kernel functionality (like device drivers) without requiring a reboot. Writing modules is a key skill for Linux systems programming.

Module Structure

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Your Name");
MODULE_DESCRIPTION("Example kernel module");
MODULE_VERSION("1.0");

// Module parameters
static int param_value = 42;
module_param(param_value, int, 0644);
MODULE_PARM_DESC(param_value, "An integer parameter");

// Initialization function
static int __init mymodule_init(void)
{
    pr_info("Module loaded with param_value = %d\n", param_value);
    return 0;  // Success
}

// Cleanup function
static void __exit mymodule_exit(void)
{
    pr_info("Module unloaded\n");
}

module_init(mymodule_init);
module_exit(mymodule_exit);

Building a Module

# Makefile
obj-m := mymodule.o

# For multi-file modules:
# mymodule-objs := file1.o file2.o

KERNEL_DIR ?= /lib/modules/$(shell uname -r)/build

all:
	$(MAKE) -C $(KERNEL_DIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KERNEL_DIR) M=$(PWD) clean

# Build and load
$ make
$ sudo insmod mymodule.ko param_value=100
$ lsmod | grep mymodule
$ cat /sys/module/mymodule/parameters/param_value
$ sudo rmmod mymodule
$ dmesg | tail
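The lsmod command formats the contents of /proc/modules, where each line starts with the name of a loaded module. A program can run the same check directly. This is a small sketch assuming /proc is mounted; module_loaded is an illustrative helper name.

```c
#include <stdio.h>
#include <string.h>

/* Check whether a module with the given name is currently loaded.
 * Returns 1 if loaded, 0 if not, -1 if /proc/modules cannot be read. */
static int module_loaded(const char *name)
{
    FILE *f = fopen("/proc/modules", "r");
    char line[512], mod[128];
    int found = 0;

    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        /* The first whitespace-separated field is the module name. */
        if (sscanf(line, "%127s", mod) == 1 && strcmp(mod, name) == 0) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

After the insmod step above, module_loaded("mymodule") would be expected to return 1, and 0 again after rmmod.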

Character Device Driver

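#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/uaccess.h>   // copy_to_user()

MODULE_LICENSE("GPL");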

#define DEVICE_NAME "mydev"

static dev_t dev_num;
static struct cdev my_cdev;
static struct class *my_class;

static int mydev_open(struct inode *inode, struct file *file)
{
    pr_info("Device opened\n");
    return 0;
}

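static ssize_t mydev_read(struct file *file, char __user *buf,
                          size_t count, loff_t *offset)
{
    static const char msg[] = "Hello from kernel!\n";
    size_t len = sizeof(msg) - 1;   // Exclude the trailing NUL

    if (*offset >= len)
        return 0;                   // End of file

    // Never copy more than the caller's buffer can hold
    if (count > len - *offset)
        count = len - *offset;

    if (copy_to_user(buf, msg + *offset, count))
        return -EFAULT;

    *offset += count;
    return count;
}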

static const struct file_operations fops = {
    .owner = THIS_MODULE,
    .open = mydev_open,
    .read = mydev_read,
};

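static int __init mydev_init(void)
{
    struct device *dev;
    int ret;

    // Allocate device number
    ret = alloc_chrdev_region(&dev_num, 0, 1, DEVICE_NAME);
    if (ret)
        return ret;

    // Create cdev
    cdev_init(&my_cdev, &fops);
    ret = cdev_add(&my_cdev, dev_num, 1);
    if (ret)
        goto err_region;

    // Create device class and device
    // (on Linux 6.4 and later, class_create() takes only the name)
    my_class = class_create(THIS_MODULE, DEVICE_NAME);
    if (IS_ERR(my_class)) {
        ret = PTR_ERR(my_class);
        goto err_cdev;
    }
    dev = device_create(my_class, NULL, dev_num, NULL, DEVICE_NAME);
    ret = PTR_ERR_OR_ZERO(dev);
    if (!ret)
        return 0;

    // Unwind in reverse order of setup
    class_destroy(my_class);
err_cdev:
    cdev_del(&my_cdev);
err_region:
    unregister_chrdev_region(dev_num, 1);
    return ret;
}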

static void __exit mydev_exit(void)
{
    device_destroy(my_class, dev_num);
    class_destroy(my_class);
    cdev_del(&my_cdev);
    unregister_chrdev_region(dev_num, 1);
}

module_init(mydev_init);
module_exit(mydev_exit);
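From user space, a character device is read like any other file: the driver's read handler runs on each read() call, and a return value of 0 signals end of file. The sketch below is a generic reader; read_device is a hypothetical helper, and using it on /dev/mydev assumes the module is loaded and the device node exists.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Read from a device node until EOF or until the buffer is full.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t read_device(const char *path, char *buf, size_t cap)
{
    size_t total = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);

        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0)          /* Driver reported end of file */
            break;
        total += (size_t)n;
    }
    close(fd);
    return (ssize_t)total;
}
```

The same function works on standard nodes: /dev/zero fills the whole buffer with zero bytes, while /dev/null reports end of file immediately.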

Boot Process

┌─────────────────────────────────────────────────────────────────┐
│                    LINUX BOOT SEQUENCE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. BIOS/UEFI                                                    │
│     ┌───────────────────────────────────────────────────────┐   │
│     │ • Power-on self-test (POST)                           │   │
│     │ • Initialize hardware                                  │   │
│     │ • Load bootloader from boot device                     │   │
│     └───────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  2. Bootloader (GRUB)                                            │
│     ┌───────────────────────────────────────────────────────┐   │
│     │ • Display boot menu                                    │   │
│     │ • Load kernel image (vmlinuz)                          │   │
│     │ • Load initial ramdisk (initramfs)                     │   │
│     │ • Pass kernel command line parameters                   │   │
│     │ • Transfer control to kernel                           │   │
│     └───────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  3. Kernel Initialization                                        │
│     ┌───────────────────────────────────────────────────────┐   │
│     │ • Decompress kernel                                    │   │
│     │ • Set up page tables, GDT, IDT                         │   │
│     │ • Initialize memory management                          │   │
│     │ • Initialize scheduler                                  │   │
│     │ • Start kernel threads (kthreadd)                       │   │
│     │ • Mount initramfs as temporary root                     │   │
│     │ • Execute /init from initramfs                         │   │
│     └───────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  4. initramfs                                                    │
│     ┌───────────────────────────────────────────────────────┐   │
│     │ • Load essential drivers (disk, filesystem)            │   │
│     │ • Detect hardware                                       │   │
│     │ • Mount real root filesystem                            │   │
│     │ • switch_root to real root                              │   │
│     │ • exec /sbin/init                                       │   │
│     └───────────────────────────────────────────────────────┘   │
│                              │                                   │
│                              ▼                                   │
│  5. Init System (systemd)                                        │
│     ┌───────────────────────────────────────────────────────┐   │
│     │ • PID 1: mother of all processes                       │   │
│     │ • Mount filesystems (/proc, /sys, etc.)                │   │
│     │ • Start services in dependency order                    │   │
│     │ • Reach default.target (multi-user or graphical)       │   │
│     │ • Spawn login prompts                                   │   │
│     └───────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Kernel Command Line

# View current command line
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0 root=UUID=xxx ro quiet splash

# Common parameters:
# root=           Root filesystem device
# ro/rw           Mount root read-only/read-write
# init=           Alternative init program
# single/1        Single user mode
# console=        Console device
# quiet           Suppress boot messages
# debug           Enable debug output
# panic=          Seconds before reboot on panic
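Programs sometimes need to look up one of these parameters. The sketch below parses a command-line string of the form shown in /proc/cmdline; cmdline_get is an illustrative helper, and it assumes parameters are separated by whitespace without quoted values.

```c
#include <stddef.h>
#include <string.h>

/* Look up key in a whitespace-separated kernel command line.
 * Returns 1 if present and copies its value into out (an empty string
 * for bare flags such as "quiet"); returns 0 if the key is absent. */
static int cmdline_get(const char *cmdline, const char *key,
                       char *out, size_t outsz)
{
    size_t klen = strlen(key);
    const char *p = cmdline;

    if (klen == 0)
        return 0;

    while (*p) {
        size_t tlen;

        p += strspn(p, " \t\n");           /* Skip separators */
        tlen = strcspn(p, " \t\n");        /* Length of this token */

        /* Match "key" exactly or "key=value" */
        if (tlen >= klen && strncmp(p, key, klen) == 0 &&
            (tlen == klen || p[klen] == '=')) {
            const char *val = (tlen == klen) ? p + tlen : p + klen + 1;
            size_t vlen = (size_t)(p + tlen - val);

            if (outsz > 0) {
                if (vlen >= outsz)
                    vlen = outsz - 1;      /* Truncate to fit */
                memcpy(out, val, vlen);
                out[vlen] = '\0';
            }
            return 1;
        }
        p += tlen;
    }
    return 0;
}
```

With the sample line above, looking up "root" yields "UUID=xxx", "quiet" is found with an empty value, and "init" is reported as absent.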

Kernel Debugging

printk and Dynamic Debug

// Log levels
pr_emerg("System unusable\n");     // 0
pr_alert("Action required\n");     // 1
pr_crit("Critical condition\n");   // 2
pr_err("Error condition\n");       // 3
pr_warn("Warning condition\n");    // 4
pr_notice("Normal but significant\n"); // 5
pr_info("Informational\n");        // 6
pr_debug("Debug-level\n");         // 7 (requires DEBUG)

// Dynamic debug
pr_debug("Debug message with args: %d\n", value);
// Enable at runtime:
// echo 'module mymodule +p' > /sys/kernel/debug/dynamic_debug/control

/proc and /sys

# Process information
$ cat /proc/1/status        # PID 1 status
$ cat /proc/1/maps          # Memory mappings
$ ls -l /proc/1/fd          # Open files (a directory of symlinks)

# System information
$ cat /proc/meminfo         # Memory statistics
$ cat /proc/cpuinfo         # CPU information
$ cat /proc/interrupts      # Interrupt counts

# Kernel tuning
$ cat /proc/sys/kernel/hostname
$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward   # Writing requires root

# Device information
$ ls /sys/class/net/        # Network interfaces
$ cat /sys/class/net/eth0/address  # MAC address
$ ls /sys/block/            # Block devices
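Because these files are plain text, programs can read them with ordinary file I/O. As a small sketch (assuming /proc is mounted; meminfo_kb is a hypothetical helper), the following extracts one field from /proc/meminfo:

```c
#include <stdio.h>
#include <string.h>

/* Return the value in kB of a /proc/meminfo field such as "MemTotal",
 * or -1 if the file or the field cannot be found. */
static long meminfo_kb(const char *field)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256], name[64];
    long kb, result = -1;

    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        /* Lines look like: "MemTotal:       16316412 kB" */
        if (sscanf(line, "%63[^:]: %ld", name, &kb) == 2 &&
            strcmp(name, field) == 0) {
            result = kb;
            break;
        }
    }
    fclose(f);
    return result;
}
```

Tools such as free obtain their numbers from this same file.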

Tracing

# Function tracer (run as root; tracefs may also be mounted at /sys/kernel/tracing)
$ echo function > /sys/kernel/debug/tracing/current_tracer
$ echo 1 > /sys/kernel/debug/tracing/tracing_on
$ cat /sys/kernel/debug/tracing/trace

# Function graph tracer
$ echo function_graph > /sys/kernel/debug/tracing/current_tracer

# Event tracing
$ echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
$ cat /sys/kernel/debug/tracing/trace

# Using trace-cmd (a front end to ftrace)
$ trace-cmd record -e sched_switch sleep 1
$ trace-cmd report

# BPF tracing (modern)
$ bpftrace -e 'tracepoint:syscalls:sys_enter_read { @[comm] = count(); }'

Interview Deep Dive Questions

Q1: Walk through what happens when you type 'ls' in a terminal

Answer:

1. Shell reads input:

Shell → read(STDIN) → "ls\n"

2. Shell parses and finds executable:

Search PATH: /usr/bin/ls found

3. fork() creates child process:

Shell (PID 100)
     │
     │ fork()
     ▼
Shell (PID 100) ──┬── Child (PID 101)
                  │   - Copy of parent
                  │   - Same code, data, file descriptors

4. execve() replaces child with ls:

// In child process:
char *argv[] = { "ls", NULL };
execve("/usr/bin/ls", argv, environ);

// Kernel actions:
// - Load ELF binary
// - Set up new address space
// - Initialize stack with args/env
// - Jump to entry point (_start → __libc_start_main → main)

5. ls executes:

openat(AT_FDCWD, ".") → Open current directory
getdents64(dirfd, buffer) → Read directory entries
write(STDOUT, "file1  file2\n") → Output

6. ls exits:

exit_group(0) → Kernel cleanup
- Free memory
- Close file descriptors  
- Set state to EXIT_ZOMBIE
- Signal parent (SIGCHLD)

7. Shell reaps child:

Shell calls wait4() → Gets exit status
Zombie → EXIT_DEAD → Fully removed
Shell displays next prompt

Key system calls: read, fork, execve, openat, getdents64, write, exit_group, wait4
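The fork, exec, and wait steps above can be condensed into a few lines of C. This is a minimal sketch of what a shell does for one command, not how any particular shell is implemented; run_and_wait is an illustrative name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a command in a child process and wait for it to finish.
 * Returns the child's exit status (0-255), or -1 if fork() or
 * waitpid() fails or the child was killed by a signal. */
static int run_and_wait(char *const argv[])
{
    int status;
    pid_t pid = fork();              /* Step 3: duplicate the process */

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Step 4: replace the child's image. execvp() searches PATH
         * like the shell does and only returns on failure. */
        execvp(argv[0], argv);
        _exit(127);                  /* Shell convention: command not found */
    }

    /* Step 7: reap the child so it does not remain a zombie */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For instance, running sh -c "exit 7" through this function returns 7, and a nonexistent command returns 127.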

Q2: Explain copy-on-write (COW) in fork()

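Answer:

Problem: A naive fork() would copy the parent’s entire address space. For large processes, this would be slow and wasteful.

Solution: Copy-on-Write. Parent and child share the same physical pages, mapped read-only, and the kernel copies a page only when one of them writes to it.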

Before fork():
┌─────────────────────────────────────────────────────────────┐
│  Parent (PID 100)                                           │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Page Table → Physical Page A (R/W)                  │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

After fork() (no copy yet!):
┌─────────────────────────────────────────────────────────────┐
│  Parent (PID 100)                Physical Page A            │
│  ┌───────────────┐               ┌───────────┐              │
│  │ Page Table    ├──────────────►│   Data    │◄─────────────┤
│  │ (now R/O)     │               │           │              │
│  └───────────────┘               └───────────┘              │
│                                        ▲                    │
│  Child (PID 101)                       │                    │
│  ┌───────────────┐                     │                    │
│  │ Page Table    ├─────────────────────┘                    │
│  │ (R/O copy)    │   Both point to same physical page!     │
│  └───────────────┘                                          │
└─────────────────────────────────────────────────────────────┘

When child writes (page fault triggers copy):
┌─────────────────────────────────────────────────────────────┐
│  Parent (PID 100)                Physical Page A            │
│  ┌───────────────┐               ┌───────────┐              │
│  │ Page Table    ├──────────────►│   Data    │              │
│  │ (R/O for now) │               │           │              │
│  └───────────────┘               └───────────┘              │
│                                                              │
│  Child (PID 101)                 Physical Page B (NEW)      │
│  ┌───────────────┐               ┌───────────┐              │
│  │ Page Table    ├──────────────►│ Data+Mod  │              │
│  │ (R/W)         │               │           │              │
│  └───────────────┘               └───────────┘              │
└─────────────────────────────────────────────────────────────┘

Implementation:

// During fork():
// 1. Create new page tables for the child that point at the same physical pages
// 2. Mark writable private pages read-only in both parent and child
// 3. Increment the reference count on each shared physical page

// On a write fault to such a page (simplified pseudocode):
if (old_page->ref_count > 1) {
    // Page is still shared: allocate a new page
    new_page = alloc_page();
    
    // Copy contents
    copy_page(new_page, old_page);
    
    // Update page table to point to new page (R/W)
    pte_set(pte, new_page, PTE_W);
    
    // Drop this process's reference to the old page
    old_page->ref_count--;
} else {
    // We're the only user, just make writable
    pte_set_writable(pte);
}

Benefits:

fork() copies page tables, not page contents, so it costs far less than copying all of the process’s memory
Pages never written are never copied
exec() right after fork() replaces the address space, so almost nothing is copied
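The visible effect of copy-on-write can be checked from user space: a write in the child must never show up in the parent. The sketch below only demonstrates that isolation; it cannot prove that the copy was lazy, since that happens inside the kernel. The variable and function names are chosen for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives in a private, writable data page shared copy-on-write after fork().
 * volatile keeps the compiler from folding the reads and writes away. */
static volatile int shared_value = 42;

/* Forks, lets the child overwrite shared_value, and returns the value the
 * parent sees afterwards. Returns -1 if fork or waitpid fails. */
int value_seen_by_parent_after_child_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Write fault: the kernel gives the child its own copy of the page. */
        shared_value = 99;
        _exit(shared_value == 99 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return shared_value;   /* the parent's page was never modified */
}
```

The parent still reads 42 after the child has stored 99, because the child's store landed in a separate physical page.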

Q3: How does the kernel handle a page fault?

Answer:

Page fault types:

Minor fault: Page in memory but not mapped
Major fault: Page not in memory (disk I/O needed)
Invalid fault: Access violation (segfault)

Handler flow:

// Simplified sketch of the x86 handler in arch/x86/mm/fault.c.
// Locking, retries, and most error paths are omitted.

void do_page_fault(struct pt_regs *regs, unsigned long error_code,
                   unsigned long address)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned int flags = FAULT_FLAG_DEFAULT;
    vm_fault_t fault;
    
    // 1. Faults on kernel addresses take a separate path
    //    (vmalloc areas, exception fixups, or an oops)
    if (fault_in_kernel_space(address)) {
        do_kern_addr_fault(regs, error_code, address);
        return;
    }
    
    if (error_code & X86_PF_WRITE)
        flags |= FAULT_FLAG_WRITE;
    
    // 2. Find the first VMA that ends above the address
    vma = find_vma(mm, address);
    
    // 3. Check if address is valid
    if (!vma) {
        // No mapping at or above the address → SIGSEGV
        bad_area(regs, error_code, address);
        return;
    }
    if (address < vma->vm_start) {
        // Address is in a gap below the VMA: only valid if the VMA
        // is a stack that can grow down to cover it
        if (!(vma->vm_flags & VM_GROWSDOWN) || expand_stack(vma, address)) {
            bad_area(regs, error_code, address);
            return;
        }
    }
    
    // 4. Check permissions
    if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_WRITE)) {
        // Write to a read-only mapping → SIGSEGV
        bad_area(regs, error_code, address);
        return;
    }
    
    // 5. Handle the fault
    fault = handle_mm_fault(vma, address, flags, regs);
    
    // VM_FAULT_MAJOR is set in the result if I/O was needed;
    // otherwise the fault is accounted as a minor fault
}

Common scenarios:

Scenario	Handling
Demand paging	Allocate page, zero-fill or read from file
Copy-on-write	Copy page, update permissions
Stack growth	Extend VMA, allocate pages
Swap	Read page from swap, update page table
Memory-mapped file	Read page from file
Invalid access	Send SIGSEGV to process
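Demand paging can be observed from user space by touching freshly mapped anonymous memory and comparing the process's minor-fault counter before and after. The following is a rough sketch under the assumption that getrusage reports ru_minflt on the running system; the function name count_minor_faults is invented for this example, and the exact count depends on kernel settings such as transparent huge pages.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Maps npages of anonymous memory, touches each page once, and returns the
 * number of minor faults the process took while doing so (-1 on error).
 * *all_zero is set to 1 if every page read back as zero before the write. */
long count_minor_faults(size_t npages, int *all_zero)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t len = npages * (size_t)page_size;
    struct rusage before, after;

    /* mmap only creates the VMA; no physical pages are allocated yet. */
    unsigned char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap(mem, len);
        return -1;
    }

    *all_zero = 1;
    for (size_t i = 0; i < npages; i++) {
        volatile unsigned char *p = mem + i * (size_t)page_size;
        if (*p != 0)      /* first read: demand-paged memory is zero-filled */
            *all_zero = 0;
        *p = 1;           /* first write: the process gets a private page */
    }

    int rc = getrusage(RUSAGE_SELF, &after);
    munmap(mem, len);
    if (rc != 0)
        return -1;

    return after.ru_minflt - before.ru_minflt;
}
```

None of these faults involve disk I/O, so they are counted as minor faults; touching a page that has been swapped out or a file page not yet in the page cache would show up in ru_majflt instead.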

Q4: Explain the difference between softirq, tasklet, and workqueue

Answer:

Problem: Interrupt handlers must be fast, but some work takes time.

Solution: Deferred work mechanisms

┌─────────────────────────────────────────────────────────────┐
│                    DEFERRED WORK                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  Interrupt Context                    │   │
│  │  • Runs with interrupts disabled                      │   │
│  │  • Cannot sleep, take a mutex, or do blocking allocs  │   │
│  │  • Must be very fast (< 1ms)                          │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │ Schedule deferred work           │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │               Softirq / Tasklet                       │   │
│  │  • Runs with interrupts enabled                       │   │
│  │  • Cannot sleep                                       │   │
│  │  • Runs in atomic context                            │   │
│  │  • For time-critical deferred work                    │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │ If work can sleep               │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  Workqueue                            │   │
│  │  • Runs in process context (kernel thread)            │   │
│  │  • CAN sleep, allocate memory, take mutex             │   │
│  │  • Lower priority than softirqs                       │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Comparison:

Feature	Softirq	Tasklet	Workqueue
Context	Atomic	Atomic	Process
Can sleep	No	No	Yes
Concurrency	Per-CPU	Serialized	Per-worker
Priority	High	High	Lower
Use case	Network, block I/O	Simple deferred	Complex work

Examples:

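// Softirq (statically defined at compile time, fixed small set)
// Examples: NET_RX_SOFTIRQ, NET_TX_SOFTIRQ, BLOCK_SOFTIRQ

// Tasklet (dynamically creatable, built on softirqs)
// Callback signature shown is the one used since Linux 5.9
static void my_handler(struct tasklet_struct *t)
{
    // Atomic context: must not sleep
}
DECLARE_TASKLET(my_tasklet, my_handler);
tasklet_schedule(&my_tasklet);   // typically called from the IRQ handler

// Workqueue (most flexible)
static DEFINE_MUTEX(my_mutex);

static void my_work_handler(struct work_struct *work)
{
    // Process context: this can sleep!
    mutex_lock(&my_mutex);
    // ... do work ...
    mutex_unlock(&my_mutex);
}
DECLARE_WORK(my_work, my_work_handler);
schedule_work(&my_work);         // queue on the system workqueue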

Q5: How does the kernel implement futexes?

Answer:

Futex = Fast Userspace muTEX

Goal: Avoid a syscall in the common (uncontended) case.

┌─────────────────────────────────────────────────────────────┐
│                    FUTEX OPERATION                           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Uncontended case (no syscall):                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Thread A:                                            │   │
│  │  atomic_cmpxchg(&futex, 0, 1) → Success              │   │
│  │  // Lock acquired, no kernel involvement!             │   │
│  │                                                       │   │
│  │  atomic_cmpxchg(&futex, 1, 0) → Success              │   │
│  │  // Lock released, still no kernel!                   │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Contended case (syscall needed):                           │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Thread A: holds lock (futex = 1)                     │   │
│  │                                                       │   │
│  │  Thread B:                                            │   │
│  │  atomic_cmpxchg(&futex, 0, 1) → Fails                │   │
│  │  // Lock held, must wait                              │   │
│  │                                                       │   │
│  │  futex(FUTEX_WAIT, &futex, 1)                        │   │
│  │  // Kernel call: "sleep until futex != 1"             │   │
│  │  // Thread B now blocked                              │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Wake up:                                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Thread A:                                            │   │
│  │  atomic_exchange(&futex, 0)                          │   │
│  │  // Sees there may be waiters                         │   │
│  │                                                       │   │
│  │  futex(FUTEX_WAKE, &futex, 1)                        │   │
│  │  // Kernel: wake one waiter                           │   │
│  │                                                       │   │
│  │  Thread B: wakes up, retries atomic op                │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Kernel implementation:

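// Simplified sketch of futex wait/wake. The real syscall takes six
// arguments; helper names here are illustrative.
SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val, ...)
{
    struct futex_hash_bucket *hb;
    struct futex_q q;
    u32 uval;
    int woken;

    if (op == FUTEX_WAIT) {
        // Get hash bucket for this address
        hb = hash_futex(uaddr);
        
        spin_lock(&hb->lock);
        
        // Re-read the futex word under the bucket lock
        if (get_user(uval, uaddr)) {
            spin_unlock(&hb->lock);
            return -EFAULT;
        }
        
        // Value changed since userspace looked: do not sleep
        if (uval != val) {
            spin_unlock(&hb->lock);
            return -EAGAIN;  // Try again in userspace
        }
        
        // Mark the task as sleeping and add it to the wait queue
        // before dropping the lock, so a concurrent wake is not lost
        set_current_state(TASK_INTERRUPTIBLE);
        queue_me(&q, hb);
        
        spin_unlock(&hb->lock);
        
        // Sleep until woken
        schedule();
        
        return 0;
    }
    
    if (op == FUTEX_WAKE) {
        hb = hash_futex(uaddr);
        
        spin_lock(&hb->lock);
        
        // Wake up to val waiters queued on this address
        woken = wake_up_n(hb, val);
        
        spin_unlock(&hb->lock);
        
        return woken;
    }
    
    return -ENOSYS;
}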

Why it’s fast:

Uncontended: Just an atomic operation, no syscall
Hash table lookup in kernel is O(1)
Foundation for pthread_mutex, semaphores, condition variables
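For the user-space side, here is a minimal mutex sketch built on the futex syscall. It uses the three-state lock word (0 unlocked, 1 locked, 2 locked with possible waiters) described in Ulrich Drepper's "Futexes Are Tricky", rather than the two-state word in the diagram above, so that unlock can skip FUTEX_WAKE when nobody is waiting. The function names are made up for this example, and it assumes atomic_int has the same layout as the 32-bit futex word, which holds on Linux with GCC and Clang. It is not how glibc implements pthread_mutex.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock word: 0 = unlocked, 1 = locked, 2 = locked with possible waiters. */

static long futex_call(atomic_int *addr, int op, int val)
{
    /* glibc provides no futex() wrapper, so go through syscall(2). */
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void futex_lock(atomic_int *word)
{
    int seen = 0;

    /* Fast path: 0 -> 1 with a single atomic operation, no syscall. */
    if (atomic_compare_exchange_strong(word, &seen, 1))
        return;

    /* Slow path: advertise a waiter by setting the word to 2. */
    if (seen != 2)
        seen = atomic_exchange(word, 2);

    while (seen != 0) {
        /* Sleeps only if the word is still 2; otherwise returns at once. */
        futex_call(word, FUTEX_WAIT, 2);
        seen = atomic_exchange(word, 2);
    }
}

void futex_unlock(atomic_int *word)
{
    /* Fast path: 1 -> 0 means nobody was waiting, so no syscall. */
    if (atomic_fetch_sub(word, 1) != 1) {
        atomic_store(word, 0);
        futex_call(word, FUTEX_WAKE, 1);   /* wake one waiter */
    }
}
```

In the uncontended case both functions complete with one atomic instruction each. The kernel is entered only when a thread finds the lock held, or when the unlocker sees the value 2 and has to wake someone.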

Kernel Exploration Commands

# Kernel version and build info
$ uname -a
$ cat /proc/version

# Boot parameters
$ cat /proc/cmdline

# Loaded modules
$ lsmod
$ modinfo <module_name>

# Hardware and IRQ info
$ cat /proc/interrupts
$ cat /proc/iomem
$ lspci -v

# Process information (reading stack requires root)
$ cat /proc/<pid>/status
$ cat /proc/<pid>/maps
$ cat /proc/<pid>/stack

# System performance
$ cat /proc/stat
$ cat /proc/loadavg
$ cat /proc/vmstat

# Tracing (requires root; tracefs may also be mounted at /sys/kernel/tracing)
$ cat /sys/kernel/debug/tracing/available_tracers
$ perf list

End-to-End Walkthrough: `read()` on a TCP Socket

To connect the dots between subsystems, trace a single read() call in a typical server:

1. User-space call:
- The application thread calls read(fd, buf, n) on a TCP socket.
- The C library issues the read system call (e.g., syscall(SYS_read, ...)).

2. Syscall entry:
- The CPU executes the syscall instruction (svc on ARM64).
- Control transfers to the kernel’s syscall entry (entry_SYSCALL_64 on x86-64).
- The kernel locates the struct file for fd and dispatches to the socket’s file operations.

3. VFS and socket layer:
- The VFS read implementation calls into the socket layer (sock_read_iter).
- This eventually calls the protocol-specific recvmsg implementation (e.g., tcp_recvmsg).

4. TCP stack and receive queue:
- If there is already data in the socket’s receive queue (filled by earlier packets), tcp_recvmsg copies it into buf and returns.
- If not, and the socket is blocking, the thread is put to sleep on a wait queue until more data arrives.

5. Network device and driver:
- Incoming packets trigger an interrupt on the NIC.
- The driver’s interrupt handler schedules NAPI polling or other bottom-half work.
- Packets are pulled from the NIC’s DMA ring into memory as sk_buff structures.

6. Protocol processing:
- The kernel’s networking stack parses headers, validates checksums, and places payload bytes into the appropriate socket’s receive buffers.
- When enough data is available, it wakes up the sleeping read() caller.

7. Copy to user and return:
- tcp_recvmsg copies data from kernel buffers into the user-space buf using safe copy helpers.
- The syscall returns to user space with the number of bytes read.

Throughout this path you touch:

Syscall machinery (entry_SYSCALL_*).
VFS (struct file, file operations).
Networking stack (TCP/IP implementation, sk_buff).
Scheduler and wait queues (sleep and wake-up of the thread).
Interrupt handling and drivers (NIC, NAPI, DMA).

When reading the rest of this course, try to anchor concepts back to this kind of end-to-end path.
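To exercise this path on a single machine, the sketch below sets up a TCP connection over the loopback interface inside one process, writes a message on one end, and calls read() on the other. It is a small illustration that assumes loopback networking is available; the function name tcp_loopback_read is invented for this example, and loopback traffic skips the physical NIC and DMA steps while still going through the TCP stack and the socket receive queue.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends msg over a loopback TCP connection and read()s it back into buf.
 * Returns the number of bytes read, or -1 on any error. */
ssize_t tcp_loopback_read(const char *msg, char *buf, size_t buflen)
{
    ssize_t n = -1;
    int listener = -1, client = -1, server = -1;
    struct sockaddr_in addr;
    socklen_t addrlen = sizeof(addr);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                    /* let the kernel pick a free port */

    listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0) goto out;
    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    if (listen(listener, 1) < 0) goto out;
    if (getsockname(listener, (struct sockaddr *)&addr, &addrlen) < 0) goto out;

    client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0) goto out;
    /* The kernel completes the handshake, so connect() returns before accept(). */
    if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    server = accept(listener, NULL, NULL);
    if (server < 0) goto out;

    if (write(client, msg, strlen(msg)) < 0) goto out;

    /* read() -> sock_read_iter -> tcp_recvmsg: copies from the receive queue,
     * or sleeps on the socket's wait queue until data arrives. */
    n = read(server, buf, buflen);

out:
    if (server >= 0) close(server);
    if (client >= 0) close(client);
    if (listener >= 0) close(listener);
    return n;
}
```

Running a program that calls this function under strace shows the user-space half of the walkthrough (socket, connect, write, read), while the tracing tools mentioned earlier can be pointed at tcp_recvmsg to watch the kernel half.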

Key Takeaways

Monolithic but Modular

The Linux kernel is monolithic (core services share a single kernel address space) but supports loadable modules.

System Call Interface

On x86-64, the SYSCALL instruction transitions into the kernel. Modern Linux provides on the order of 400 system calls, varying by architecture.

Memory Management

4-level page tables, buddy + slab allocators, copy-on-write optimization.

Boot Sequence

BIOS/UEFI → Bootloader → Kernel → initramfs → systemd (PID 1)
