
Linux Internals Deep Dive

If you love understanding how things actually work, this chapter is for you. If you just want to run commands and get things done, feel free to skip ahead. No judgment.
This chapter takes you beneath the surface of Linux. We will explore how the kernel manages processes, understand how system calls bridge user space and kernel space, and demystify the virtual filesystem. This knowledge is what transforms a Linux user into a Linux engineer.

Why Internals Matter

Understanding Linux internals helps you:
  • Debug performance issues when top and htop are not enough
  • Write better software that works with the kernel, not against it
  • Ace interviews where internals questions are common
  • Understand containers since Docker relies on kernel features
  • Troubleshoot production systems at a deeper level

User Space vs Kernel Space

The most fundamental concept: Linux divides memory into two distinct spaces.
┌─────────────────────────────────────────────────────────────────┐
│                         User Space                               │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐            │
│  │  bash   │  │  nginx  │  │ python  │  │  java   │            │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘            │
│                                                                  │
│  Applications, libraries, user processes                         │
│  - Protected from each other                                     │
│  - Cannot access hardware directly                               │
│  - Uses system calls to request kernel services                  │
├──────────────────────────────────────────────────────────────────┤
│                      System Call Interface                        │
├──────────────────────────────────────────────────────────────────┤
│                        Kernel Space                               │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │  Process Management  │  Memory Management  │  File Systems  │ │
│  ├─────────────────────────────────────────────────────────────┤ │
│  │  Device Drivers  │  Network Stack  │  Security Modules      │ │
│  └─────────────────────────────────────────────────────────────┘ │
│                                                                   │
│  - Full hardware access                                          │
│  - Manages all system resources                                  │
│  - Runs in privileged mode (Ring 0)                              │
├──────────────────────────────────────────────────────────────────┤
│                         Hardware                                  │
│  CPU  │  Memory  │  Disk  │  Network  │  Devices                │
└──────────────────────────────────────────────────────────────────┘
Why this separation?
  • Stability: A buggy user program cannot crash the kernel
  • Security: One process cannot read or corrupt another's memory
  • Abstraction: Applications do not need to know hardware details

System Calls: The Bridge

When a user program needs kernel services (read file, open network connection, create process), it makes a system call.

Anatomy of a System Call

User space:    write(fd, buf, count)
                    │
                    ▼
               libc wrapper (glibc)
                    │  Prepares arguments
                    │  Executes the syscall instruction
                    ▼
──────────────── syscall instruction ────────────────

Kernel space:       │
                    ▼
               System call handler
                    │  Validates arguments
                    │  Performs operation
                    │  Returns result
                    ▼
──────────────── return to user space ────────────────
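To make the round trip concrete, here is a minimal sketch in C (assuming Linux with glibc) that issues the same write() twice: once through the normal libc wrapper and once through glibc's generic syscall() wrapper using the raw syscall number.

/* Minimal sketch: the same write() made two ways. Assumes Linux + glibc. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "hello from user space\n";

    /* 1. Through the libc wrapper (the normal path) */
    write(STDOUT_FILENO, msg, strlen(msg));

    /* 2. Through glibc's generic syscall() wrapper, naming the syscall number */
    syscall(SYS_write, STDOUT_FILENO, msg, strlen(msg));

    /* Either way the CPU switches to kernel mode, the handler validates
       the fd and buffer, copies the data, and returns the byte count. */
    return 0;
}

Either path ends up at the same kernel handler; the wrapper only hides argument marshalling and errno handling.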

Common System Calls

┌──────────┬──────────────────────────────┬──────────────────────────────┐
│ Category │ System Calls                 │ Purpose                      │
├──────────┼──────────────────────────────┼──────────────────────────────┤
│ Process  │ fork, exec, exit, wait       │ Create and manage processes  │
│ File     │ open, read, write, close     │ File operations              │
│ Network  │ socket, bind, listen, accept │ Network operations           │
│ Memory   │ mmap, brk, mprotect          │ Memory management            │
│ IPC      │ pipe, shmget, semget         │ Inter-process communication  │
└──────────┴──────────────────────────────┴──────────────────────────────┘

Tracing System Calls

# Trace system calls of a running process
strace -p 1234

# Trace a command
strace ls -la

# Count system calls
strace -c ls -la

# Example output:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 30.12    0.000142          71         2           getdents64
 21.28    0.000100          14         7           mmap
 15.96    0.000075          10         7           close

Process Management

What is a Process?

A process is a running program. It includes:
  • Code: The program instructions
  • Data: Variables and heap
  • Stack: Function calls and local variables
  • Registers: CPU state
  • File descriptors: Open files, sockets
  • Memory mappings: Virtual memory layout
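To see those pieces come into existence, here is a minimal sketch of the fork/exec/wait cycle that underlies every command your shell runs (it assumes a POSIX system with ls on the PATH; error handling is kept minimal):

/* Minimal sketch of process creation: fork a child, exec a new program
   in it, and wait for it to exit. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();              /* duplicate the calling process */

    if (pid == 0) {
        /* Child: replace our image with a new program */
        execlp("ls", "ls", "-la", (char *)NULL);
        perror("execlp");            /* only reached if exec fails */
        _exit(1);
    } else if (pid > 0) {
        /* Parent: block until the child exits, then reap it */
        int status;
        waitpid(pid, &status, 0);
        printf("child %d exited with status %d\n",
               (int)pid, WEXITSTATUS(status));
    } else {
        perror("fork");
        return 1;
    }
    return 0;
}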

Process Control Block (PCB)

The kernel maintains a task_struct for each process:
// Simplified task_struct (actual is ~600 lines)
struct task_struct {
    volatile long state;        // -1 unrunnable, 0 runnable, >0 stopped
    void *stack;                // Kernel stack
    unsigned int cpu;           // Current CPU
    pid_t pid;                  // Process ID
    pid_t tgid;                 // Thread group ID
    struct task_struct *parent; // Parent process
    struct list_head children;  // Child processes
    struct mm_struct *mm;       // Memory mappings
    struct files_struct *files; // Open files
    // ... hundreds more fields
};
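You cannot read task_struct from user space directly, but much of it is exported through procfs. A small sketch (assuming Linux) that prints a few of the mirrored fields for the current process:

/* Sketch: print a few task_struct-backed fields from /proc/self/status. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        /* State, Pid, PPid and Threads mirror fields of the kernel's task_struct */
        if (strncmp(line, "State:", 6) == 0 ||
            strncmp(line, "Pid:", 4) == 0 ||
            strncmp(line, "PPid:", 5) == 0 ||
            strncmp(line, "Threads:", 8) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}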

Process States

                    fork()
                       │
                       ▼
              ┌────────────────┐
              │   TASK_NEW     │
              └────────┬───────┘
                       │
                       ▼
              ┌────────────────┐
     ┌───────▶│ TASK_RUNNING   │◀──────────┐
     │        │  (Runnable)    │           │
     │        └────────┬───────┘           │
     │                 │                   │
     │   Scheduled     │  Waiting for      │ Event
     │     by CPU      │  I/O, lock, etc   │ occurred
     │                 ▼                   │
     │        ┌────────────────┐           │
     │        │ TASK_RUNNING   │           │
     │        │ (On CPU)       │           │
     │        └────────┬───────┘           │
     │                 │                   │
     │  Preempted      │   Need to wait    │
     │                 ▼                   │
     │        ┌─────────────────┐          │
     └────────│ TASK_INTERR-    │──────────┘
              │ UPTIBLE/        │
              │ TASK_UNINTERR-  │
              │ UPTIBLE         │
              └────────┬────────┘
                       │ exit()
                       ▼
              ┌────────────────┐
              │  TASK_ZOMBIE   │───▶ Parent calls wait()
              └────────────────┘           │
                                           ▼
                                    Process removed

The Scheduler

Linux uses the Completely Fair Scheduler (CFS) for normal processes:
CFS Core Idea: Track "virtual runtime" for each process

Virtual Runtime = Actual Runtime / Weight

Processes with lower virtual runtime get scheduled first.
This ensures fairness - each process gets its fair share.

Example with nice values:
┌──────────┬──────────┬─────────────┬────────────────────┐
│ Process  │  Nice    │   Weight    │ Virtual Runtime    │
├──────────┼──────────┼─────────────┼────────────────────┤
│    A     │    0     │    1024     │  10ms / 1024       │
│    B     │   10     │     110     │  10ms / 110        │
│    C     │  -10     │    9548     │  10ms / 9548       │
└──────────┴──────────┴─────────────┴────────────────────┘

C (nice -10) has lowest vruntime after 10ms, gets scheduled more.
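A toy calculation of the table above, using the chapter's simplified formula (the real kernel additionally scales by the nice-0 weight and uses fixed-point arithmetic):

/* Toy vruntime calculation with the kernel's weight values for
   nice 0, +10 and -10; lowest vruntime is scheduled next. */
#include <stdio.h>

int main(void) {
    const char  *name[]   = {"A (nice 0)", "B (nice 10)", "C (nice -10)"};
    const double weight[] = {1024, 110, 9548};   /* from the kernel's weight table */
    const double runtime_ms = 10.0;              /* each ran 10 ms of real CPU time */

    for (int i = 0; i < 3; i++)
        printf("%-13s vruntime = %.5f\n", name[i], runtime_ms / weight[i]);

    /* C has by far the lowest vruntime, so it is picked far more often than B. */
    return 0;
}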

Real-Time Scheduling

For time-critical tasks, Linux provides real-time schedulers:
┌─────────────────┬─────────────────────────────────────────────────────┐
│ Policy          │ Description                                         │
├─────────────────┼─────────────────────────────────────────────────────┤
│ SCHED_FIFO      │ First-in, first-out; runs until it blocks or yields │
│ SCHED_RR        │ Round-robin, time-sliced FIFO                       │
│ SCHED_DEADLINE  │ Earliest deadline first (newest)                    │
│ SCHED_OTHER     │ Default CFS scheduler                               │
└─────────────────┴─────────────────────────────────────────────────────┘
# Check scheduling policy
chrt -p 1234

# Set real-time priority
chrt -f 50 ./critical-app

# View all runnable tasks with scheduling info
ps -eo pid,ni,pri,pcpu,comm --sort=-pcpu
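The programmatic equivalent of chrt is sched_setscheduler(). A hedged sketch that puts the calling process under SCHED_FIFO at priority 50; it needs root or CAP_SYS_NICE, and a spinning FIFO task can starve a CPU, so treat it as an illustration only:

/* Sketch: switch the current process to the SCHED_FIFO real-time policy. */
#include <stdio.h>
#include <sched.h>

int main(void) {
    struct sched_param param = { .sched_priority = 50 };

    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        perror("sched_setscheduler");   /* typically EPERM without privileges */
        return 1;
    }
    printf("now running under SCHED_FIFO, priority %d\n", param.sched_priority);
    /* ... time-critical work would go here ... */
    return 0;
}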

Memory Management

Virtual Memory

Every process gets its own virtual address space:
Process A Virtual Memory:           Process B Virtual Memory:
┌───────────────────────────┐      ┌───────────────────────────┐
│  0xFFFFFFFF (High)        │      │  0xFFFFFFFF (High)        │
│  Kernel Space (shared)    │      │  Kernel Space (shared)    │
├───────────────────────────┤      ├───────────────────────────┤
│  Stack ▼                  │      │  Stack ▼                  │
│                           │      │                           │
│  (grows down)             │      │  (grows down)             │
│                           │      │                           │
│  Memory Mapped Region     │      │  Memory Mapped Region     │
│  (shared libs, mmap)      │      │  (shared libs, mmap)      │
│                           │      │                           │
│  Heap ▲                   │      │  Heap ▲                   │
│  (grows up)               │      │  (grows up)               │
├───────────────────────────┤      ├───────────────────────────┤
│  BSS (uninitialized data) │      │  BSS (uninitialized data) │
│  Data (initialized data)  │      │  Data (initialized data)  │
│  Text (code)              │      │  Text (code)              │
│  0x00000000 (Low)         │      │  0x00000000 (Low)         │
└───────────────────────────┘      └───────────────────────────┘

Same virtual addresses map to different physical memory!
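You can watch this layout from inside a process. A quick sketch that prints where a few objects live; run it twice, or compare with /proc/self/maps, and you will see each run gets its own (ASLR-randomized) addresses:

/* Sketch: print addresses from the different regions of the address space. */
#include <stdio.h>
#include <stdlib.h>

int global_bss;                 /* uninitialized data -> BSS  */
int global_data = 42;           /* initialized data   -> Data */

int main(void) {
    int   local = 0;            /* lives on the stack */
    void *heap  = malloc(64);   /* lives on the heap  */

    printf("text  (main)        : %p\n", (void *)main);
    printf("data  (global_data) : %p\n", (void *)&global_data);
    printf("bss   (global_bss)  : %p\n", (void *)&global_bss);
    printf("heap  (malloc)      : %p\n", heap);
    printf("stack (local)       : %p\n", (void *)&local);

    free(heap);
    return 0;
}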

Page Tables

Virtual addresses translate to physical addresses via page tables:
Virtual Address: 0x00007f3a8b2c1000
                 │
                 ▼
        ┌─────────────────┐
        │   Page Table    │
        │ (per process)   │
        ├─────────────────┤
        │ VPN → PFN       │
        │ VPN → PFN       │
        │ VPN → PFN       │
        └────────┬────────┘
                 │
                 ▼
Physical Address: 0x0000001a3c560000
The page size is typically 4 KB on x86; huge pages of 2 MB or 1 GB can also be used.
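The translation split itself is easy to mimic in user space: divide an address into a virtual page number and an offset using the system page size. A small sketch:

/* Sketch: split a virtual address into page number and page offset. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);   /* typically 4096 */
    int x = 0;
    uintptr_t addr = (uintptr_t)&x;

    printf("page size : %ld bytes\n", page_size);
    printf("address   : 0x%lx\n", (unsigned long)addr);
    printf("VPN       : 0x%lx\n", (unsigned long)(addr / page_size));
    printf("offset    : 0x%lx\n", (unsigned long)(addr % page_size));
    return 0;
}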

The Page Cache

Linux aggressively caches file data in RAM:
# View memory usage
free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       3.2Gi       8.1Gi       312Mi       4.1Gi        11Gi
                                                                ▲
                                                                │
                                                This is your page cache!
Page cache behavior:
  • Read a file? It stays in cache for future reads
  • Write a file? Goes to cache first, flushed to disk later
  • Running low on memory? Cache pages are evicted first
# Drop caches (for testing, not production)
sync; echo 3 > /proc/sys/vm/drop_caches
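If you want to see the page cache from a program, mincore(2) reports which pages of a mapping are currently resident in RAM. A hedged sketch with minimal error handling; run it on a file you just read, then again after dropping caches, and compare:

/* Sketch: count how many pages of a file are resident in the page cache. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0) { perror(argv[1]); return 1; }

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    long   page  = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + page - 1) / page;
    unsigned char *vec = malloc(pages);

    if (mincore(map, st.st_size, vec) == 0) {        /* one status byte per page */
        size_t resident = 0;
        for (size_t i = 0; i < pages; i++)
            if (vec[i] & 1) resident++;
        printf("%zu of %zu pages in page cache\n", resident, pages);
    }
    free(vec); munmap(map, st.st_size); close(fd);
    return 0;
}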

Memory Allocation

When a process requests memory:
malloc(1024)
      │
      ▼
┌─────────────────────────────────────────────────────┐
│  Is there space in existing heap?                    │
│     Yes → Return pointer to free chunk               │
│     No  → Request more memory from kernel            │
│              │                                       │
│              ▼                                       │
│           brk() or mmap()                           │
│              │                                       │
│              ▼                                       │
│           Kernel allocates virtual pages             │
│           (not physical yet!)                        │
│              │                                       │
│              ▼                                       │
│           First access triggers page fault           │
│              │                                       │
│              ▼                                       │
│           Kernel allocates physical page             │
└─────────────────────────────────────────────────────┘
This is demand paging: a process can allocate more virtual memory than the machine has physical RAM.
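A short sketch makes demand paging visible: reserve a large anonymous mapping, touch only part of it, and watch VmSize versus VmRSS in /proc/<pid>/status or in top:

/* Sketch: 1 GiB of virtual memory is reserved, but physical pages appear
   only when the memory is actually touched. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void) {
    size_t size = 1UL << 30;   /* 1 GiB of virtual address space */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("reserved 1 GiB at %p - no physical pages allocated yet\n", (void *)p);

    /* Touch the first 10 MiB: each first write faults in one physical page */
    memset(p, 1, 10 * 1024 * 1024);
    printf("touched 10 MiB - roughly 2560 pages now backed by RAM\n");

    pause();   /* keep the process alive so you can inspect it */
    return 0;
}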

The Virtual Filesystem (VFS)

Linux abstracts all filesystems through a common interface.

VFS Architecture

User space:    open("/etc/passwd", O_RDONLY)
                    │
                    ▼
               ┌─────────────────────────────────────────────┐
               │              VFS Layer                      │
               │  - Common file operations                   │
               │  - Inode, dentry, file abstractions         │
               └─────────────────────────────────────────────┘
                    │
        ┌───────────┼───────────┬───────────┬────────────┐
        ▼           ▼           ▼           ▼            ▼
    ┌───────┐   ┌───────┐   ┌───────┐   ┌───────┐    ┌───────┐
    │  ext4 │   │  XFS  │   │  NFS  │   │ procfs│    │ tmpfs │
    └───────┘   └───────┘   └───────┘   └───────┘    └───────┘
        │           │           │           │            │
    Physical    Physical     Network     Kernel       Memory
      Disk        Disk        Server      Data

Key VFS Concepts

Inode: Metadata about a file (permissions, size, timestamps, block pointers). Does NOT contain the filename.
# View inode information
stat /etc/passwd
  File: /etc/passwd
  Size: 2446        Blocks: 8          IO Block: 4096   regular file
Device: 259,3       Inode: 131162      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
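The stat command is a thin wrapper over the stat(2) system call, which fills a struct stat from the inode. A minimal sketch printing a few of the same fields:

/* Sketch: read inode metadata for /etc/passwd via stat(2). */
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    struct stat st;
    if (stat("/etc/passwd", &st) != 0) { perror("stat"); return 1; }

    printf("inode : %lu\n", (unsigned long)st.st_ino);
    printf("size  : %lld bytes\n", (long long)st.st_size);
    printf("links : %lu\n", (unsigned long)st.st_nlink);
    printf("mode  : %o\n", st.st_mode & 07777);   /* permission bits only */
    return 0;
}
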
Dentry: Directory entry - maps filename to inode.
Directory: /etc
┌─────────────────┬─────────────────┐
│  Name           │  Inode Number   │
├─────────────────┼─────────────────┤
│  passwd         │  131162         │
│  shadow         │  131163         │
│  hosts          │  131164         │
└─────────────────┴─────────────────┘
File descriptors: Per-process table of open files.
# View open file descriptors for a process
ls -la /proc/1234/fd/
lrwx------ 1 root root 64 Dec  2 10:00 0 -> /dev/null
lrwx------ 1 root root 64 Dec  2 10:00 1 -> /dev/null
l-wx------ 1 root root 64 Dec  2 10:00 2 -> /var/log/nginx/error.log

Everything is a File

This Unix philosophy extends to:
┌────────────────┬──────────────────────────────────┐
│ Path           │ What It Is                       │
├────────────────┼──────────────────────────────────┤
│ /dev/sda       │ Block device (hard drive)        │
│ /dev/null      │ Bit bucket (discards writes)     │
│ /dev/random    │ Random number generator          │
│ /proc/cpuinfo  │ CPU information (kernel data)    │
│ /sys/class/net │ Network interface info           │
│ /dev/stdin     │ Standard input                   │
└────────────────┴──────────────────────────────────┘
# Read from "files" that aren't really files
cat /proc/meminfo
cat /sys/class/net/eth0/address
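The same ordinary file API works on these special files. A small sketch (using /dev/urandom, a close cousin of the /dev/random entry above) that reads a few random bytes with nothing but open() and read():

/* Sketch: a character device behaves like any other file. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    unsigned char buf[8];
    int fd = open("/dev/urandom", O_RDONLY);   /* a device node, not a disk file */
    if (fd < 0 || read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("/dev/urandom");
        return 1;
    }
    for (size_t i = 0; i < sizeof(buf); i++)
        printf("%02x", buf[i]);
    printf("\n");
    close(fd);
    return 0;
}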

Networking Internals

The Network Stack

Application:    curl https://example.com
                     │
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│  Socket Layer (socket, bind, connect, send, recv)               │
├─────────────────────────────────────────────────────────────────┤
│  Transport Layer (TCP, UDP)                                      │
│  - Connection state (SYN, ACK, FIN)                             │
│  - Reliability, flow control                                     │
├─────────────────────────────────────────────────────────────────┤
│  Network Layer (IP)                                              │
│  - Routing decisions                                             │
│  - IP fragmentation                                              │
├─────────────────────────────────────────────────────────────────┤
│  Data Link Layer (Ethernet, WiFi)                               │
│  - Frame construction                                            │
│  - MAC addressing                                                │
├─────────────────────────────────────────────────────────────────┤
│  Physical Layer (NIC Driver)                                     │
│  - DMA to/from network card                                      │
└─────────────────────────────────────────────────────────────────┘
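From user space you only ever touch the top layer. A hedged sketch of a bare TCP client (resolving example.com with getaddrinfo, then socket/connect/send/recv); everything below those calls - the TCP handshake, IP routing, Ethernet framing - happens inside the kernel stack shown above. It assumes outbound network access:

/* Sketch: minimal TCP client exercising the socket layer. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void) {
    struct addrinfo hints = { .ai_family = AF_UNSPEC, .ai_socktype = SOCK_STREAM };
    struct addrinfo *res;
    int err = getaddrinfo("example.com", "80", &hints, &res);
    if (err != 0) { fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err)); return 1; }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        perror("socket/connect");
        return 1;
    }

    const char *req = "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n";
    send(fd, req, strlen(req), 0);

    char buf[512];
    ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);   /* just the start of the reply */
    if (n > 0) { buf[n] = '\0'; printf("%s\n", buf); }

    freeaddrinfo(res);
    close(fd);
    return 0;
}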

Socket Buffers

Data flows through socket buffers:
Send path:
Application ──▶ Socket send buffer ──▶ TCP ──▶ IP ──▶ NIC

Receive path:
NIC ──▶ Driver ──▶ IP ──▶ TCP ──▶ Socket recv buffer ──▶ Application
# View socket buffer sizes
sysctl net.core.rmem_max
sysctl net.core.wmem_max

# View socket statistics
ss -s
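A socket's buffer sizes can also be queried from a program with getsockopt(). A small sketch that prints the defaults for a fresh TCP socket; compare the numbers with the net.core.* sysctls shown above:

/* Sketch: ask the kernel for the default send/receive buffer sizes. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int rcv = 0, snd = 0;
    socklen_t len = sizeof(int);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv, &len);
    len = sizeof(int);
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd, &len);

    printf("default recv buffer: %d bytes\n", rcv);
    printf("default send buffer: %d bytes\n", snd);
    close(fd);
    return 0;
}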

Netfilter and iptables

Netfilter provides hooks for packet processing:
                             Network
                                │
                                ▼
                         ┌──────────────┐
                         │  PREROUTING  │──▶ DNAT (port forwarding)
                         └──────┬───────┘
                                │
                         ┌──────▼───────┐
                    ┌────│   Routing    │────┐
                    │    │   Decision   │    │
                    │    └──────────────┘    │
                    │                        │
             ┌──────▼───────┐         ┌──────▼───────┐
             │    INPUT     │         │   FORWARD    │
             │   (local)    │         │  (routing)   │
             └──────┬───────┘         └──────┬───────┘
                    │                        │
                    ▼                        │
              Local Process                  │
                    │                        │
                    ▼                        │
             ┌──────────────┐                │
             │    OUTPUT    │                │
             └──────┬───────┘                │
                    │                        │
                    └───────────┬────────────┘
                                │
                         ┌──────▼───────┐
                         │ POSTROUTING  │──▶ SNAT (masquerading)
                         └──────┬───────┘
                                │
                                ▼
                             Network

Interview Deep Dive Questions

Question: What happens when you run a program from the shell?
Answer: 1) The shell calls fork() to create a child process, 2) the child calls exec() to load the program, 3) the kernel loads the ELF binary and sets up memory mappings, 4) the kernel sets up the stack with argc, argv, and the environment, 5) control transfers to the program entry point (usually _start, provided by the C runtime), 6) _start calls main(), 7) the program runs, 8) exit() is called and the kernel cleans up resources.

Question: What is the difference between a process and a thread?
Answer: Processes have separate address spaces, file descriptors, and resources. Threads share the address space, heap, and file descriptors of their process but have separate stacks and registers. In Linux both are represented by a task_struct - threads share an mm_struct (memory) while processes have their own. Threads are cheaper to create, but a bug in one thread can corrupt the whole process.

Question: What happens during a context switch?
Answer: When the kernel switches from running one process to another: 1) save the current process registers to its task_struct, 2) save the current MMU context (page tables), 3) select the next process to run (scheduler), 4) load the new process registers from its task_struct, 5) restore its MMU context, 6) jump to the new process instruction pointer. Context switches are expensive (cache invalidation, TLB flush).

Question: Can a process allocate more memory than the system physically has?
Answer: Yes. By default, Linux allows processes to allocate more virtual memory than physical RAM (overcommit). Physical pages are only allocated on first access (demand paging). If the system runs out of memory, the OOM killer selects and kills processes. Behavior is controlled by vm.overcommit_memory: 0=heuristic, 1=always allow, 2=never overcommit.

Question: What is the difference between /proc and /sys?
Answer: Both are virtual filesystems - no actual disk storage. /proc exposes kernel data structures and per-process info (it originated in Unix). /sys is newer and provides structured hardware/driver info (Linux 2.6+). /proc has accumulated cruft, /sys is more organized. Examples: /proc/meminfo, /proc/1234/status, /sys/class/net/eth0/address.

Question: How does the OOM killer decide what to kill?
Answer: The Out-of-Memory killer is invoked when the system is critically low on memory. It calculates an oom_score for each process based on memory usage, runtime, nice value, and whether it is privileged. The highest score gets killed first. oom_score_adj (-1000 to 1000) can be set in /proc/<pid>/oom_score_adj; -1000 makes a process unkillable (risky).

Exploring Internals Yourself

# View kernel parameters
sysctl -a | head -50

# View CPU information
cat /proc/cpuinfo

# View memory information
cat /proc/meminfo

# View process memory map
cat /proc/self/maps

# Trace system calls
strace -c ls

# View scheduling info
cat /proc/1/sched

# View network connections
ss -tulpn

# View file descriptors
ls -la /proc/self/fd

# View mount points
cat /proc/mounts

# View kernel modules
lsmod

Key Takeaways

  1. User space and kernel space are separated - for security and stability
  2. System calls are the bridge - only way to request kernel services
  3. Everything is a file - devices, processes, kernel data exposed as files
  4. Virtual memory provides isolation - each process has its own address space
  5. CFS scheduler ensures fairness - virtual runtime tracks CPU usage
  6. Page cache makes I/O fast - files cached in RAM automatically
  7. VFS abstracts filesystems - same interface for ext4, NFS, procfs
  8. Namespaces and cgroups enable containers - isolation and resource limits

Ready to keep going? Next up: Linux Permissions, where we will dive deep into users, groups, and access control.