Linux Kernel Internals
The Linux kernel is the heart of the Linux operating system. It manages all hardware resources, provides essential abstractions (like processes, files, and memory), and enforces security and isolation. Understanding kernel internals is crucial for systems programming, performance optimization, and senior engineering interviews.
What is a Kernel?
The kernel is the core part of any operating system. It runs in a privileged mode (kernel space) and has direct access to hardware. User applications run in user space and must interact with the kernel to perform any privileged operation (like reading a file or allocating memory).
Think of the kernel as the “traffic controller” of your computer, making sure all requests from programs are handled safely and efficiently.
Interview Frequency: High for systems roles
Key Topics: Kernel architecture, system calls, modules, boot process
Time to Master: 15-20 hours
Kernel Architecture
The Linux kernel uses a monolithic architecture: all core services (process management, memory management, device drivers, networking) run in a single privileged address space. Loadable modules add flexibility by letting drivers and features be inserted or removed at runtime. The kernel sits between user applications and hardware, providing a safe and efficient interface.
User Space vs Kernel Space
User space is where regular applications run. Kernel space is reserved for the OS kernel and its extensions. This separation protects the system from bugs and security issues in user programs.
┌─────────────────────────────────────────────────────────────────┐
│ LINUX KERNEL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User Space │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Applications (bash, nginx, Chrome, ...) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ C Library (glibc) │ │
│ └───────────────────────────┬───────────────────────────────┘ │
│ │ System Calls │
│ ════════════════════════════╧═════════════════════════════════│
│ │
│ Kernel Space │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ System Call Interface │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────┬─────────────┬─────────────┬─────────────────┐ │
│ │ VFS │ Scheduler │ Memory │ Network │ │
│ │ │ │ Management │ Stack │ │
│ └─────────────┴─────────────┴─────────────┴─────────────────┘ │
│ │
│ ┌─────────────┬─────────────┬─────────────┬─────────────────┐ │
│ │ Filesystem │ IPC │ Virtual │ Netfilter │ │
│ │ Drivers │ │ Memory │ │ │
│ └─────────────┴─────────────┴─────────────┴─────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Device Drivers (Block, Char, Net) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Architecture-Specific Code (x86, ARM) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │ │
│ ════════════════════════════╧═════════════════════════════════│
│ │
│ Hardware │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ CPU │ Memory │ Disk │ Network │ Peripherals │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
System Calls
User programs can’t access hardware directly. Instead, they use system calls (syscalls) to request services from the kernel. For example, reading a file, creating a process, or allocating memory all require syscalls. The kernel validates and executes these requests, ensuring security and stability.
System Call Flow
Here’s how a typical system call works:
1. The application calls a library function (like read() in C).
2. The library sets up the syscall number and arguments, then triggers a special CPU instruction (like syscall on x86_64).
3. The CPU switches to kernel mode and jumps to the syscall handler.
4. The kernel validates arguments, performs the requested action, and returns a result.
5. The CPU switches back to user mode, and the result is returned to the application.
┌─────────────────────────────────────────────────────────────────┐
│ SYSTEM CALL FLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User Space │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ 1. Application calls read(fd, buf, count) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ 2. glibc wrapper sets up registers │ │
│ │ - rax = __NR_read (syscall number) │ │
│ │ - rdi = fd, rsi = buf, rdx = count │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ 3. Execute SYSCALL instruction │ │
│ └──────────────────────────────┼────────────────────────────┘ │
│ │ Trap to kernel │
│ ════════════════════════════════════════════════════════════ │
│ │ │
│ Kernel Space ▼ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ 4. entry_SYSCALL_64 │ │
│ │ - Save user registers │ │
│ │ - Switch to kernel stack │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ 5. sys_call_table[rax] → sys_read() │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ 6. Execute system call │ │
│ │ - VFS → filesystem driver → disk I/O │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ 7. Return: restore registers, SYSRET │ │
│ └──────────────────────────────┼────────────────────────────┘ │
│ │ │
│ ════════════════════════════════════════════════════════════ │
│ ▼ │
│ User Space │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ 8. Return value in rax (bytes read or -errno) │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
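The "-errno" convention in step 8 can be observed from user space. The sketch below assumes Linux with a glibc-style syscall() function, which turns a negative kernel return value into -1 and stores the error code in errno. The helper name read_errno_for_fd is hypothetical and exists only for this illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue read(2) through the generic syscall() entry
 * point and report the errno value the C library derived from the kernel's
 * return value. Returns 0 if the call succeeded. */
int read_errno_for_fd(int fd)
{
    char buf[1];

    errno = 0;
    long ret = syscall(SYS_read, fd, buf, sizeof(buf));
    if (ret == -1)
        return errno;   /* kernel returned -errno; libc stored it in errno */
    return 0;
}
```

Calling read_errno_for_fd(-1) should yield EBADF, the same error the kernel-side read path reports for an invalid descriptor.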
System Call Implementation
// Kernel-side system call definition (simplified)
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget(fd);
    ssize_t ret = -EBADF;

    if (!f.file)
        return ret;
    if (!(f.file->f_mode & FMODE_READ))
        goto out;
    ret = vfs_read(f.file, buf, count, &f.file->f_pos);
out:
    fdput(f);
    return ret;
}
// User-space invocation options:
// 1. Direct syscall (rare)
long raw_ret = syscall(SYS_read, fd, buf, count);
// 2. Through glibc wrapper (common)
ssize_t ret = read(fd, buf, count);
Key System Calls
Some of the most important Linux system calls include:
Category   System Calls                             Purpose
--------   --------------------------------------   ---------------------------
Process    fork, execve, exit, wait                 Process lifecycle
Memory     mmap, munmap, brk, mprotect              Memory management
File       open, read, write, close, stat           File I/O
Socket     socket, bind, listen, accept, connect    Networking
Signal     kill, sigaction, sigprocmask             Signal handling
IPC        pipe, shmget, msgget, semget             Inter-process communication
Process Management
In Linux, every running program is represented by a task_struct in the kernel. This structure holds all information about the process or thread: its state, scheduling info, memory, open files, credentials, and more.
Task Struct
The task_struct is the kernel’s internal data structure for tracking processes and threads. It’s like a detailed profile for every running program.
// Kernel representation of a process/thread
struct task_struct {
// Thread state
volatile long state; // TASK_RUNNING, TASK_INTERRUPTIBLE, etc. (unsigned int __state since v5.14)
unsigned int flags; // PF_EXITING, PF_VCPU, etc.
// Scheduling
int prio, static_prio, normal_prio;
struct sched_entity se;
struct sched_rt_entity rt;
const struct sched_class * sched_class;
// Process relationships
struct task_struct * parent;
struct list_head children;
struct list_head sibling;
struct task_struct * group_leader;
// Memory management
struct mm_struct * mm; // Memory descriptor
struct mm_struct * active_mm; // For kernel threads
// Filesystem info
struct fs_struct * fs; // Current directory, root
struct files_struct * files; // Open files
// Credentials
const struct cred * cred; // UID, GID, capabilities
// Signals
struct signal_struct * signal;
struct sighand_struct * sighand;
sigset_t blocked, real_blocked;
// Namespaces
struct nsproxy * nsproxy;
// ... many more fields
};
// Get current task: "current" is a macro (it expands to get_current()),
// so it cannot be redeclared as a variable name
struct task_struct *task = current;
Process States
Processes in Linux can be in various states: running, waiting, stopped, zombie, etc. The kernel manages transitions between these states as processes execute, wait for I/O, or terminate.
┌─────────────────────────────────────────────────────────────────┐
│ PROCESS STATE MACHINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ CREATED │ │
│ │ (fork()) │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ TASK_RUNNING │◄──────►│ TASK_RUNNING │ │ │
│ │ │ (ready) │schedule│ (on CPU) │ │ │
│ │ └───────┬────────┘ └───────┬────────┘ │ │
│ │ │ │ │ │
│ └──────────┼─────────────────────────┼────────────┘ │
│ │ │ │
│ wait │ │ I/O, lock │
│ ▼ ▼ │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │TASK_INTERRUPTIBLE │ │TASK_UNINTERRUPTIBLE│ │
│ │(can receive signal)│ │(ignores signals) │ │
│ └────────────────────┘ └────────────────────┘ │
│ │ │ │
│ │ I/O complete │ │
│ └────────────┬────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ │
│ │ EXIT_ZOMBIE │ │
│ │ (wait by parent)│ │
│ └────────┬───────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ │
│ │ EXIT_DEAD │ │
│ │ (reaped) │ │
│ └────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Memory Management
Linux uses virtual memory to give each process the illusion of a private, contiguous address space. The kernel manages page tables, handles page faults, and allocates physical memory efficiently.
Address Space Layout
The address space of a process is divided into regions: code (text), data, heap, stack, and memory-mapped areas. The kernel enforces boundaries and permissions for each region, protecting processes from each other.
┌─────────────────────────────────────────────────────────────────┐
│ VIRTUAL ADDRESS SPACE (x86-64) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 0xFFFFFFFFFFFFFFFF ┌──────────────────────────────────────┐ │
│ │ │ │
│ │ Kernel Space │ │
│ │ (upper 128 TB) │ │
│ │ │ │
│ │ - Direct mapping of physical RAM │ │
│ │ - vmalloc area │ │
│ │ - Kernel text and data │ │
│ │ - Module space │ │
│ │ │ │
│ 0xFFFF800000000000 ├──────────────────────────────────────┤ │
│ │ Hole (unmapped) │ │
│ 0x00007FFFFFFFFFFF ├──────────────────────────────────────┤ │
│ │ │ │
│ │ User Space │ │
│ │ (lower 128 TB) │ │
│ │ │ │
│ │ Stack (grows down) ──┐ │ │
│ │ │ │ │ │
│ │ ▼ │ │ │
│ │ (gap) mmap │ │
│ │ ▲ │ │ │
│ │ │ │ │ │
│ │ mmap region (libs, etc.) ◄┘ │ │
│ │ │ │
│ │ (gap) │ │
│ │ │ │
│ │ Heap (grows up via brk) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ BSS (uninitialized data) │ │
│ │ Data (initialized data) │ │
│ │ Text (code) │ │
│ 0x0000000000000000 └──────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
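The regions in the diagram correspond to lines in /proc/self/maps, where the main thread's stack is labelled [stack] and the brk heap is labelled [heap]. The sketch below checks which region an address falls into. It assumes a mounted /proc, and the helper name addr_in_region is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if addr lies inside a mapping of the current
 * process whose /proc/self/maps line contains `name` (e.g. "[stack]"),
 * 0 if not, and -1 if the maps file cannot be opened. */
int addr_in_region(const void *addr, const char *name)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    uintptr_t target = (uintptr_t)addr;
    char line[512];
    int found = 0;

    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;

        /* Each line begins with "start-end perms offset dev inode [path]". */
        if (sscanf(line, "%lx-%lx", &start, &end) != 2)
            continue;
        if (target >= start && target < end && strstr(line, name)) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Passing the address of a local variable from the main thread together with "[stack]" should return 1, while the same address tested against "[heap]" should return 0.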
Page Tables (4-Level)
Modern CPUs (like x86-64) use multi-level page tables to efficiently map virtual addresses to physical memory. The kernel walks these tables to resolve memory accesses and handles page faults when needed.
// x86-64 page table structure
// 48-bit virtual address:
//   [47:39] PML4 index  (9 bits, 512 entries)
//   [38:30] PDPT index  (9 bits, 512 entries)
//   [29:21] PD index    (9 bits, 512 entries)
//   [20:12] PT index    (9 bits, 512 entries)
//   [11:0]  Page offset (12 bits, 4KB page)
// Walk page tables (the p4d level is folded away on 4-level systems)
pgd_t *pgd = pgd_offset(mm, address);
if (pgd_none(*pgd)) return NULL;
p4d_t *p4d = p4d_offset(pgd, address);
if (p4d_none(*p4d)) return NULL;
pud_t *pud = pud_offset(p4d, address);
if (pud_none(*pud)) return NULL;
pmd_t *pmd = pmd_offset(pud, address);
if (pmd_none(*pmd)) return NULL;
pte_t *pte = pte_offset_kernel(pmd, address);
if (pte_none(*pte)) return NULL;
unsigned long phys = (pte_val(*pte) & PTE_PFN_MASK) |
                     (address & ~PAGE_MASK);
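The bit layout in the comments above can be checked with plain arithmetic in user space. This is a small sketch, assuming 4-level paging with 4 KiB pages; the struct and function names are made up for the example.

```c
#include <stdint.h>

/* Indices used by a 4-level x86-64 walk with 4 KiB pages. */
struct va_parts {
    unsigned pml4, pdpt, pd, pt;  /* 9 bits each, values 0..511 */
    unsigned offset;              /* 12 bits, values 0..4095 */
};

/* Split a virtual address into its four table indices and page offset. */
struct va_parts split_vaddr(uint64_t va)
{
    struct va_parts p;

    p.pml4   = (va >> 39) & 0x1ff;
    p.pdpt   = (va >> 30) & 0x1ff;
    p.pd     = (va >> 21) & 0x1ff;
    p.pt     = (va >> 12) & 0x1ff;
    p.offset = va & 0xfff;
    return p;
}
```

For the highest user-space address, 0x00007FFFFFFFFFFF, the PML4 index is 255. User space therefore occupies the lower 256 PML4 entries and kernel space the upper 256.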
Memory Allocation Layers
Memory allocation in Linux happens in layers:
User programs use malloc() (implemented by libraries like glibc).
The library requests memory from the kernel via brk() or mmap().
The kernel manages virtual memory areas (VMAs), page tables, and physical memory.
For kernel allocations, the slab and buddy allocators provide efficient memory management for different object sizes.
┌─────────────────────────────────────────────────────────────────┐
│ KERNEL MEMORY ALLOCATION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User request: malloc(100) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ glibc malloc (user space) │ │
│ │ - ptmalloc, jemalloc, tcmalloc │ │
│ │ - Caches memory, reduces syscalls │ │
│ └───────────────────────────┬─────────────────────────────┘ │
│ │ brk() or mmap() │
│ ════════════════════════════╧═════════════════════════════════│
│ │ │
│ ┌───────────────────────────▼─────────────────────────────┐ │
│ │ Virtual Memory Subsystem │ │
│ │ - Creates VMAs (vm_area_struct) │ │
│ │ - Handles page faults │ │
│ └───────────────────────────┬─────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────▼─────────────────────────────┐ │
│ │ Slab Allocator (SLUB) │ │
│ │ - kmalloc(), kmem_cache_alloc() │ │
│ │ - Object caching, minimal fragmentation │ │
│ └───────────────────────────┬─────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────▼─────────────────────────────┐ │
│ │ Buddy Allocator (Page Allocator) │ │
│ │ - alloc_pages(), __get_free_pages() │ │
│ │ - Power-of-2 page blocks │ │
│ └───────────────────────────┬─────────────────────────────┘ │
│ │ │
│ ▼ │
│ Physical Memory │
│ │
└─────────────────────────────────────────────────────────────────┘
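The "power-of-2 page blocks" rule of the buddy allocator means every request is rounded up to 2^order pages. The sketch below computes that order in user space. It follows the same idea as the kernel's get_order() helper but is not the kernel code, and it assumes 4 KiB pages; the names are illustrative.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

/* Smallest order such that 2^order pages are enough to hold `bytes`. */
unsigned int size_to_order(size_t bytes)
{
    size_t pages = (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    unsigned int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

A 12 KiB request needs 3 pages but is served from an order-2 block of 4 pages. That rounding is one reason small objects go through the slab layer instead of the page allocator directly.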
Kernel Modules
Kernel modules are pieces of code that can be loaded into or removed from the running kernel. They extend kernel functionality (like device drivers) without requiring a reboot. Writing modules is a key skill for Linux systems programming.
Module Structure
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Your Name");
MODULE_DESCRIPTION("Example kernel module");
MODULE_VERSION("1.0");
// Module parameters (0644: readable by all, writable by root via sysfs)
static int param_value = 42;
module_param(param_value, int, 0644);
MODULE_PARM_DESC(param_value, "An integer parameter");
// Initialization function
static int __init mymodule_init(void)
{
    pr_info("Module loaded with param_value = %d\n", param_value);
    return 0; // Success
}
// Cleanup function
static void __exit mymodule_exit(void)
{
    pr_info("Module unloaded\n");
}
module_init(mymodule_init);
module_exit(mymodule_exit);
Building a Module
# Makefile (recipe lines must start with a tab)
obj-m := mymodule.o
# For multi-file modules:
# mymodule-objs := file1.o file2.o
KERNEL_DIR ?= /lib/modules/$(shell uname -r)/build
all:
	$(MAKE) -C $(KERNEL_DIR) M=$(PWD) modules
clean:
	$(MAKE) -C $(KERNEL_DIR) M=$(PWD) clean
# Build and load
$ make
$ sudo insmod mymodule.ko param_value=100
$ lsmod | grep mymodule
$ cat /sys/module/mymodule/parameters/param_value
$ sudo rmmod mymodule
$ dmesg | tail
Character Device Driver
#include <linux/module.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#define DEVICE_NAME "mydev"
MODULE_LICENSE("GPL");
static dev_t dev_num;
static struct cdev my_cdev;
static struct class *my_class;
static int mydev_open(struct inode *inode, struct file *file)
{
    pr_info("Device opened\n");
    return 0;
}
static ssize_t mydev_read(struct file *file, char __user *buf,
                          size_t count, loff_t *offset)
{
    static const char msg[] = "Hello from kernel!\n";
    size_t len = sizeof(msg) - 1;   // Exclude the trailing NUL
    if (*offset >= len)
        return 0;                   // EOF
    // Never copy more than the caller asked for
    if (count > len - *offset)
        count = len - *offset;
    if (copy_to_user(buf, msg + *offset, count))
        return -EFAULT;
    *offset += count;
    return count;
}
static const struct file_operations fops = {
    .owner = THIS_MODULE,
    .open = mydev_open,
    .read = mydev_read,
};
static int __init mydev_init(void)
{
    int ret;
    // Allocate device number
    ret = alloc_chrdev_region(&dev_num, 0, 1, DEVICE_NAME);
    if (ret)
        return ret;
    // Create cdev
    cdev_init(&my_cdev, &fops);
    ret = cdev_add(&my_cdev, dev_num, 1);
    if (ret)
        goto err_region;
    // Create device class and device
    // (kernels >= 6.4 drop the first argument: class_create(DEVICE_NAME))
    my_class = class_create(THIS_MODULE, DEVICE_NAME);
    if (IS_ERR(my_class)) {
        ret = PTR_ERR(my_class);
        goto err_cdev;
    }
    device_create(my_class, NULL, dev_num, NULL, DEVICE_NAME);
    return 0;
err_cdev:
    cdev_del(&my_cdev);
err_region:
    unregister_chrdev_region(dev_num, 1);
    return ret;
}
static void __exit mydev_exit(void)
{
    // Tear down in reverse order of creation
    device_destroy(my_class, dev_num);
    class_destroy(my_class);
    cdev_del(&my_cdev);
    unregister_chrdev_region(dev_num, 1);
}
module_init(mydev_init);
module_exit(mydev_exit);
Boot Process
┌─────────────────────────────────────────────────────────────────┐
│ LINUX BOOT SEQUENCE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. BIOS/UEFI │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • Power-on self-test (POST) │ │
│ │ • Initialize hardware │ │
│ │ • Load bootloader from boot device │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 2. Bootloader (GRUB) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • Display boot menu │ │
│ │ • Load kernel image (vmlinuz) │ │
│ │ • Load initial ramdisk (initramfs) │ │
│ │ • Pass kernel command line parameters │ │
│ │ • Transfer control to kernel │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 3. Kernel Initialization │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • Decompress kernel │ │
│ │ • Set up page tables, GDT, IDT │ │
│ │ • Initialize memory management │ │
│ │ • Initialize scheduler │ │
│ │ • Start kernel threads (kthreadd) │ │
│ │ • Mount initramfs as temporary root │ │
│ │ • Execute /init from initramfs │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 4. initramfs │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • Load essential drivers (disk, filesystem) │ │
│ │ • Detect hardware │ │
│ │ • Mount real root filesystem │ │
│ │ • pivot_root to real root │ │
│ │ • exec /sbin/init │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 5. Init System (systemd) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ • PID 1: mother of all processes │ │
│ │ • Mount filesystems (/proc, /sys, etc.) │ │
│ │ • Start services in dependency order │ │
│ │ • Reach default.target (multi-user or graphical) │ │
│ │ • Spawn login prompts │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Kernel Command Line
# View current command line
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0 root=UUID=xxx ro quiet splash
# Common parameters:
# root= Root filesystem device
# ro/rw Mount root read-only/read-write
# init= Alternative init program
# single/1 Single user mode
# console= Console device
# quiet Suppress boot messages
# debug Enable debug output
# panic= Seconds before reboot on panic
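Programs sometimes need to look up a single parameter from this string. The following is a minimal user-space sketch of such a lookup, assuming space-separated key=value tokens and no quoted values. The helper name cmdline_get is hypothetical and is not a kernel or libc API.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: look up `key` in a space-separated kernel command
 * line. For "key=value" the value is copied into out; for a bare flag such
 * as "quiet" out becomes an empty string. Returns 1 if the key is present,
 * 0 otherwise. Quoted values containing spaces are not handled. */
int cmdline_get(const char *cmdline, const char *key, char *out, size_t outlen)
{
    size_t klen = strlen(key);
    const char *p = cmdline;

    while (*p) {
        while (*p == ' ' || *p == '\n')
            p++;
        size_t tlen = strcspn(p, " \n");      /* length of this token */

        if (tlen >= klen && strncmp(p, key, klen) == 0 &&
            (tlen == klen || p[klen] == '=')) {
            const char *val = (tlen == klen) ? "" : p + klen + 1;
            size_t vlen = (tlen == klen) ? 0 : tlen - klen - 1;

            snprintf(out, outlen, "%.*s", (int)vlen, val);
            return 1;
        }
        p += tlen;
    }
    return 0;
}
```

With the example line above, looking up "root" gives "UUID=xxx", and looking up "ro" matches only the standalone flag rather than the prefix of "root=". In a real program the input string would be read from /proc/cmdline.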
Kernel Debugging
printk and Dynamic Debug
// Log levels
pr_emerg("System unusable\n");          // 0
pr_alert("Action required\n");          // 1
pr_crit("Critical condition\n");        // 2
pr_err("Error condition\n");            // 3
pr_warn("Warning condition\n");         // 4
pr_notice("Normal but significant\n");  // 5
pr_info("Informational\n");             // 6
pr_debug("Debug-level\n");              // 7 (compiled out unless DEBUG or dynamic debug is enabled)
// Dynamic debug
pr_debug("Debug message with args: %d\n", value);
// Enable at runtime:
// echo 'module mymodule +p' > /sys/kernel/debug/dynamic_debug/control
/proc and /sys
# Process information
$ cat /proc/1/status # PID 1 status
$ cat /proc/1/maps # Memory mappings
$ ls -l /proc/1/fd # Open files (fd is a directory of symlinks)
# System information
$ cat /proc/meminfo # Memory statistics
$ cat /proc/cpuinfo # CPU information
$ cat /proc/interrupts # Interrupt counts
# Kernel tuning
$ cat /proc/sys/kernel/hostname
$ echo 1 > /proc/sys/net/ipv4/ip_forward
# Device information
$ ls /sys/class/net/ # Network interfaces
$ cat /sys/class/net/eth0/address # MAC address
$ ls /sys/block/ # Block devices
Tracing
# Function tracer
$ echo function > /sys/kernel/debug/tracing/current_tracer
$ echo 1 > /sys/kernel/debug/tracing/tracing_on
$ cat /sys/kernel/debug/tracing/trace
# Function graph tracer
$ echo function_graph > /sys/kernel/debug/tracing/current_tracer
# Event tracing
$ echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
$ cat /sys/kernel/debug/tracing/trace
# Using trace-cmd (a command-line front end to ftrace)
$ trace-cmd record -e sched_switch sleep 1
$ trace-cmd report
# BPF tracing (modern)
$ bpftrace -e 'tracepoint:syscalls:sys_enter_read { @[comm] = count(); }'
Interview Deep Dive Questions
Q1: Walk through what happens when you type 'ls' in a terminal
Answer:
1. Shell reads input:
   Shell → read(STDIN) → "ls\n"
2. Shell parses and finds executable:
   Search PATH: /usr/bin/ls found
3. fork() creates child process:
   Shell (PID 100)
     │
     │ fork()
     ▼
   Shell (PID 100) ──┬── Child (PID 101)
                     │ - Copy of parent
                     │ - Same code, data, file descriptors
4. execve() replaces child with ls:
   // In child process:
   execve("/usr/bin/ls", ["ls"], environ);
   // Kernel actions:
   // - Load ELF binary
   // - Set up new address space
   // - Initialize stack with args/env
   // - Jump to entry point (_start → __libc_start_main → main)
5. ls executes:
   openat(AT_FDCWD, ".") → Open the directory
   getdents64(dirfd, buffer) → Read directory entries
   write(STDOUT, "file1 file2\n") → Output
6. ls exits:
   exit_group(0) → Kernel cleanup
   - Free memory
   - Close file descriptors
   - Set state to EXIT_ZOMBIE
   - Signal parent (SIGCHLD)
7. Shell reaps child:
   Shell calls wait4() → Gets exit status
   Zombie → EXIT_DEAD → Fully removed
   Shell displays next prompt
Key system calls: read, fork, execve, openat, getdents64, write, exit_group, wait4
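Steps 3, 4, and 7 of this walkthrough can be reproduced in a few lines of user-space C. This is a minimal sketch rather than how any particular shell is implemented; the helper name run_and_wait is hypothetical, and error handling is reduced to the essentials.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run a program the way a shell does and return its
 * exit status. Returns 127 if the program could not be executed and -1 if
 * fork/wait failed or the child was killed by a signal. */
int run_and_wait(const char *path, char *const argv[])
{
    pid_t pid = fork();                /* step 3: duplicate the process */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execv(path, argv);             /* step 4: replace the child image */
        _exit(127);                    /* reached only if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)  /* step 7: reap the zombie */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, running /bin/sh with the arguments {"sh", "-c", "exit 7", NULL} should return 7. Until waitpid() returns, the finished child remains a zombie in the process table.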
Q2: Explain copy-on-write (COW) in fork()
Answer:
Problem: A naive fork() would copy the parent’s entire address space. For large processes, that would be slow and wasteful.
Solution: Copy-on-write (COW)
Before fork():
┌─────────────────────────────────────────────────────────────┐
│ Parent (PID 100) │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Page Table → Physical Page A (R/W) │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
After fork() (no copy yet!):
┌─────────────────────────────────────────────────────────────┐
│ Parent (PID 100) Physical Page A │
│ ┌───────────────┐ ┌───────────┐ │
│ │ Page Table ├──────────────►│ Data │◄─────────────┤
│ │ (now R/O) │ │ │ │
│ └───────────────┘ └───────────┘ │
│ ▲ │
│ Child (PID 101) │ │
│ ┌───────────────┐ │ │
│ │ Page Table ├─────────────────────┘ │
│ │ (R/O copy) │ Both point to same physical page! │
│ └───────────────┘ │
└─────────────────────────────────────────────────────────────┘
When child writes (page fault triggers copy):
┌─────────────────────────────────────────────────────────────┐
│ Parent (PID 100) Physical Page A │
│ ┌───────────────┐ ┌───────────┐ │
│ │ Page Table ├──────────────►│ Data │ │
│ │ (R/W again) │ │ │ │
│ └───────────────┘ └───────────┘ │
│ │
│ Child (PID 101) Physical Page B (NEW) │
│ ┌───────────────┐ ┌───────────┐ │
│ │ Page Table ├──────────────►│ Data+Mod │ │
│ │ (R/W) │ │ │ │
│ └───────────────┘ └───────────┘ │
└─────────────────────────────────────────────────────────────┘
Implementation (pseudocode; pte_set(), pte_set_writable() and ref_count are illustrative names, not kernel APIs):
// During fork():
// 1. Create new page tables (shallow copy)
// 2. Mark all writable pages as read-only
// 3. Increment reference count on physical pages
// On write (page fault handler):
if (page->ref_count > 1) {
    // Allocate new page
    new_page = alloc_page();
    // Copy contents
    copy_page(new_page, old_page);
    // Update page table to point to new page (R/W)
    pte_set(pte, new_page, PTE_W);
    // Decrement old page ref count
    old_page->ref_count--;
} else {
    // We're the only user, just make writable
    pte_set_writable(pte);
}
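Benefits:
The cost of fork() scales with the size of the page tables, not with the amount of memory they map
Pages that are never written are never copied
An execve() right after fork() discards the shared mappings, so almost nothing is copied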
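The visible effect of copy-on-write is that parent and child see the same data after fork() until one of them writes. The sketch below demonstrates that isolation from user space. It shows the semantics only: whether the kernel copied the page lazily cannot be observed this way. The names are made up for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives in a private, writable data page that fork() shares copy-on-write. */
static volatile int shared_value = 42;

/* Returns the parent's view of shared_value after the child has overwritten
 * its own copy, or -1 on error. */
int cow_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        shared_value = 99;                  /* write fault: child gets a private page */
        _exit(shared_value == 99 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        return -1;
    return shared_value;                    /* parent's page was never modified */
}
```

The function should return 42: the child's write landed in its own copy of the page, while the parent kept the original.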
Q3: How does the kernel handle a page fault?
Answer:
Page fault types:
Minor fault: Page is already in memory but not mapped in this process’s page tables
Major fault: Page is not in memory (disk I/O needed)
Invalid fault: Access violation (SIGSEGV)
Handler flow:
// Simplified sketch of arch/x86/mm/fault.c (locking omitted; not the literal kernel source)
void do_page_fault(struct pt_regs *regs, unsigned long error_code,
                   unsigned long address)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned int flags = FAULT_FLAG_DEFAULT;
    vm_fault_t fault;
    // 1. Faults raised in kernel mode take a separate path
    //    (exception fixup or oops); fault_in_kernel_mode() is illustrative
    if (fault_in_kernel_mode(regs)) {
        bad_area_nosemaphore(regs, error_code, address);
        return;
    }
    if (error_code & X86_PF_WRITE)
        flags |= FAULT_FLAG_WRITE;
    // 2. Find the first VMA that ends above the address
    vma = find_vma(mm, address);
    // 3. Check if address is valid
    if (!vma) {
        // No mapping at or above the address → SIGSEGV
        bad_area(regs, error_code, address);
        return;
    }
    if (address < vma->vm_start) {
        // Below the VMA: valid only if it is a stack that can grow down
        if (!(vma->vm_flags & VM_GROWSDOWN) || expand_stack(vma, address)) {
            bad_area(regs, error_code, address);
            return;
        }
    }
    // 4. Check permissions
    if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_WRITE)) {
        // Write to read-only mapping → SIGSEGV
        bad_area(regs, error_code, address);
        return;
    }
    // 5. Handle the fault
    fault = handle_mm_fault(vma, address, flags, regs);
    // VM_FAULT_MAJOR is set in the result if I/O was needed;
    // otherwise the fault is counted as minor
}
Common scenarios:
Scenario             Handling
------------------   ------------------------------------------
Demand paging        Allocate page, zero-fill or read from file
Copy-on-write        Copy page, update permissions
Stack growth         Extend VMA, allocate pages
Swap                 Read page from swap, update page table
Memory-mapped file   Read page from file
Invalid access       Send SIGSEGV to process
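The demand-paging scenario can be observed from user space through the minor-fault counter that getrusage() reports. This is a rough sketch, assuming Linux and an anonymous private mapping; the helper name count_touch_faults is hypothetical. The exact count depends on kernel configuration (for example huge-page settings), so only the general behaviour should be relied on.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: map npages of anonymous memory, touch each page once,
 * and return how many minor faults the process took while doing so
 * (-1 if the mapping could not be created). */
long count_touch_faults(size_t npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    struct rusage before, after;

    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    volatile unsigned char *p = mem;   /* volatile: keep every store */

    getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < npages; i++)
        p[i * page] = 1;               /* first touch of each page faults */
    getrusage(RUSAGE_SELF, &after);

    munmap(mem, len);
    return after.ru_minflt - before.ru_minflt;
}
```

mmap() itself only creates the VMA. Physical pages are allocated and zero-filled by the fault handler on first touch, which is why the counter rises during the loop and not during the mmap() call.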
Q4: Explain the difference between softirq, tasklet, and workqueue
Answer:
Problem: Interrupt handlers must be fast, but some work takes time.
Solution: Deferred work mechanisms
┌─────────────────────────────────────────────────────────────┐
│ DEFERRED WORK │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Interrupt Context │ │
│ │ • Runs with interrupts disabled │ │
│ │ • Cannot sleep, allocate memory, or take mutex │ │
│ │ • Must be very fast (< 1ms) │ │
│ └────────────────────────┬─────────────────────────────┘ │
│ │ Schedule deferred work │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Softirq / Tasklet │ │
│ │ • Runs with interrupts enabled │ │
│ │ • Cannot sleep │ │
│ │ • Runs in atomic context │ │
│ │ • For time-critical deferred work │ │
│ └────────────────────────┬─────────────────────────────┘ │
│ │ If work can sleep │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Workqueue │ │
│ │ • Runs in process context (kernel thread) │ │
│ │ • CAN sleep, allocate memory, take mutex │ │
│ │ • Lower priority than softirqs │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Comparison:
Feature       Softirq              Tasklet           Workqueue
-----------   ------------------   ---------------   ------------
Context       Atomic               Atomic            Process
Can sleep     No                   No                Yes
Concurrency   Per-CPU              Serialized        Per-worker
Priority      High                 High              Lower
Use case      Network, block I/O   Simple deferred   Complex work
Examples:
// Softirq (static, limited number)
// Used by: NET_RX_SOFTIRQ, NET_TX_SOFTIRQ, BLOCK_SOFTIRQ
// Tasklet (dynamic, built on softirq)
DECLARE_TASKLET(my_tasklet, my_handler);
tasklet_schedule(&my_tasklet);
// Workqueue (most flexible)
static DEFINE_MUTEX(my_mutex);
// The handler must be defined before DECLARE_WORK() refers to it
static void my_work_handler(struct work_struct *work)
{
    // This can sleep!
    mutex_lock(&my_mutex);
    // ... do work ...
    mutex_unlock(&my_mutex);
}
DECLARE_WORK(my_work, my_work_handler);
schedule_work(&my_work);
Q5: How does the kernel implement futexes?
Answer:
Futex = Fast Userspace muTEX
Goal: Avoid a syscall in the common (uncontended) case.
┌─────────────────────────────────────────────────────────────┐
│ FUTEX OPERATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ Uncontended case (no syscall): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Thread A: │ │
│ │ atomic_cmpxchg(&futex, 0, 1) → Success │ │
│ │ // Lock acquired, no kernel involvement! │ │
│ │ │ │
│ │ atomic_cmpxchg(&futex, 1, 0) → Success │ │
│ │ // Lock released, still no kernel! │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Contended case (syscall needed): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Thread A: holds lock (futex = 1) │ │
│ │ │ │
│ │ Thread B: │ │
│ │ atomic_cmpxchg(&futex, 0, 1) → Fails │ │
│ │ // Lock held, must wait │ │
│ │ │ │
│ │ futex(FUTEX_WAIT, &futex, 1) │ │
│ │ // Kernel call: "sleep until futex != 1" │ │
│ │ // Thread B now blocked │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Wake up: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Thread A: │ │
│ │ atomic_exchange(&futex, 0) │ │
│ │ // Sees there may be waiters │ │
│ │ │ │
│ │ futex(FUTEX_WAKE, &futex, 1) │ │
│ │ // Kernel: wake one waiter │ │
│ │ │ │
│ │ Thread B: wakes up, retries atomic op │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
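Kernel implementation:
// Simplified sketch of the futex syscall. Helper names such as hash_futex()
// and wake_up_n() are illustrative, not the literal kernel source.
SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
                const struct __kernel_timespec __user *, utime,
                u32 __user *, uaddr2, u32, val3)
{
    if (op == FUTEX_WAIT) {
        struct futex_q q;
        u32 uval;
        // Get hash bucket for this address
        struct futex_hash_bucket *hb = hash_futex(uaddr);
        spin_lock(&hb->lock);
        // Re-read the user value; get_user() returns an error code and
        // stores the value in uval (the real code does this with page
        // faults disabled because a spinlock is held)
        if (get_user(uval, uaddr)) {
            spin_unlock(&hb->lock);
            return -EFAULT;
        }
        // Check value hasn't changed (val is the expected value)
        if (uval != val) {
            spin_unlock(&hb->lock);
            return -EAGAIN; // Try again in userspace
        }
        // Add to wait queue
        queue_me(&q, hb);
        spin_unlock(&hb->lock);
        // Sleep
        schedule();
        return 0;
    }
    if (op == FUTEX_WAKE) {
        struct futex_hash_bucket *hb = hash_futex(uaddr);
        spin_lock(&hb->lock);
        // Wake up to val waiters
        wake_up_n(hb, val);
        spin_unlock(&hb->lock);
        return 0;
    }
    return -ENOSYS;
}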
Why it’s fast:
Uncontended: Just an atomic operation, no syscall
Hash table lookup in kernel is O(1)
Foundation for pthread_mutex, semaphores, condition variables
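The user-space half of this protocol can be sketched with C11 atomics and the raw futex syscall. The lock below uses three states instead of the two shown in the diagram, following the approach described in Ulrich Drepper's paper "Futexes Are Tricky", so that unlock can skip the wake-up syscall when nobody is waiting. It is a teaching sketch for Linux, not the glibc implementation, and the function names are made up.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrappers: glibc exposes no futex() function, so use syscall(). */
long futex_wait(atomic_int *addr, int expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

long futex_wake(atomic_int *addr, int count)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, count, NULL, NULL, 0);
}

/* States: 0 = unlocked, 1 = locked, 2 = locked and there may be waiters. */
void sketch_lock(atomic_int *m)
{
    int c = 0;

    if (atomic_compare_exchange_strong(m, &c, 1))
        return;                        /* uncontended: no syscall */
    if (c != 2)
        c = atomic_exchange(m, 2);     /* announce that we will wait */
    while (c != 0) {
        futex_wait(m, 2);              /* sleeps only if *m is still 2 */
        c = atomic_exchange(m, 2);
    }
}

void sketch_unlock(atomic_int *m)
{
    if (atomic_fetch_sub(m, 1) != 1) { /* old value was 2: maybe waiters */
        atomic_store(m, 0);
        futex_wake(m, 1);
    }
}
```

In the uncontended case both functions finish after a single atomic operation. If the word does not hold the expected value, futex_wait() returns -1 immediately with errno set to EAGAIN, which matches the value check in the kernel sketch above.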
Kernel Exploration Commands
# Kernel version and build info
$ uname -a
$ cat /proc/version
# Boot parameters
$ cat /proc/cmdline
# Loaded modules
$ lsmod
$ modinfo <module_name>
# Hardware and IRQ info
$ cat /proc/interrupts
$ cat /proc/iomem
$ lspci -v
# Process information
$ cat /proc/<pid>/status
$ cat /proc/<pid>/maps
$ cat /proc/<pid>/stack
# System performance
$ cat /proc/stat
$ cat /proc/loadavg
$ cat /proc/vmstat
# Tracing
$ cat /sys/kernel/debug/tracing/available_tracers
$ perf list
Key Takeaways
Monolithic but Modular: The Linux kernel is monolithic (all core subsystems share one address space) but supports loadable modules.
System Call Interface: On x86-64 the SYSCALL instruction transitions to kernel mode. Modern Linux exposes several hundred system calls; the exact count depends on architecture and kernel version.
Memory Management: 4-level page tables, buddy + slab allocators, copy-on-write optimization.
Boot Sequence: BIOS/UEFI → Bootloader → Kernel → initramfs → systemd (PID 1)