Linux Internals Deep Dive

If you love understanding how things actually work, this chapter is for you. If you just want to run commands and get things done, feel free to skip ahead. No judgment.

This chapter takes you beneath the surface of Linux. We will explore how the kernel manages processes, understand how system calls bridge user space and kernel space, and demystify the virtual filesystem. This knowledge is what transforms a Linux user into a Linux engineer.

Why Internals Matter

Understanding Linux internals helps you:

Debug performance issues when top and htop are not enough
Write better software that works with the kernel, not against it
Ace interviews where internals questions are common
Understand containers since Docker relies on kernel features
Troubleshoot production systems at a deeper level

User Space vs Kernel Space

The most fundamental concept: Linux separates execution into two privilege domains, user space and kernel space, each with its own region of memory.

┌─────────────────────────────────────────────────────────────────┐
│                         User Space                               │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐            │
│  │  bash   │  │  nginx  │  │ python  │  │  java   │            │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘            │
│                                                                  │
│  Applications, libraries, user processes                         │
│  - Protected from each other                                     │
│  - Cannot access hardware directly                               │
│  - Uses system calls to request kernel services                  │
├──────────────────────────────────────────────────────────────────┤
│                      System Call Interface                        │
├──────────────────────────────────────────────────────────────────┤
│                        Kernel Space                               │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │  Process Management  │  Memory Management  │  File Systems  │ │
│  ├─────────────────────────────────────────────────────────────┤ │
│  │  Device Drivers  │  Network Stack  │  Security Modules      │ │
│  └─────────────────────────────────────────────────────────────┘ │
│                                                                   │
│  - Full hardware access                                          │
│  - Manages all system resources                                  │
│  - Runs in privileged mode (Ring 0)                              │
├──────────────────────────────────────────────────────────────────┤
│                         Hardware                                  │
│  CPU  │  Memory  │  Disk  │  Network  │  Devices                │
└──────────────────────────────────────────────────────────────────┘

Why this separation?

Security: User programs cannot read kernel memory or touch hardware directly
Stability: A buggy program cannot crash the kernel or corrupt another process
Abstraction: Applications do not need to know hardware details

System Calls: The Bridge

When a user program needs kernel services (read file, open network connection, create process), it makes a system call.

Anatomy of a System Call

User space:    write(fd, buf, count)
                    │
                    ▼
               libc wrapper (glibc)
                    │
                    │  Places syscall number and arguments in registers
                    │  Executes the trap instruction (syscall on x86-64)
                    ▼
──────────────── syscall instruction ────────────────
                    │
Kernel space:       │
                    ▼
               System call handler
                    │
                    │  Validates arguments
                    │  Performs operation
                    │  Returns result
                    ▼
──────────────── return to user space ────────────────
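The same path can be exercised from a high-level language. The sketch below uses Python rather than C and assumes that CPython's os.pipe, os.write, os.read, and os.close are thin wrappers over the system calls of the same names (on Linux, os.pipe typically ends up in pipe2). The helper name round_trip is made up for this example.

```python
import os

def round_trip(payload: bytes) -> bytes:
    """Push bytes through a kernel pipe and read them back."""
    read_fd, write_fd = os.pipe()              # kernel creates a buffer and two descriptors
    try:
        written = os.write(write_fd, payload)  # write(2): copy from user space into the kernel
        return os.read(read_fd, written)       # read(2): copy from the kernel back to user space
    finally:
        os.close(read_fd)                      # close(2) releases each descriptor
        os.close(write_fd)

result = round_trip(b"hello kernel")
```

Each call crosses the user/kernel boundary once. Keep the payload small: a pipe has a limited kernel buffer, and a write larger than that would block until someone reads.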

Common System Calls

Category	System Calls	Purpose
Process	fork, exec, exit, wait	Create and manage processes
File	open, read, write, close	File operations
Network	socket, bind, listen, accept	Network operations
Memory	mmap, brk, mprotect	Memory management
IPC	pipe, shmget, semget	Inter-process communication

Tracing System Calls

# Trace system calls of a running process
strace -p 1234

# Trace a command
strace ls -la

# Count system calls
strace -c ls -la

# Example output:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 30.12    0.000142          71         2           getdents64
 21.28    0.000100          14         7           mmap
 15.96    0.000075          10         7           close

Process Management

What is a Process?

A process is a running instance of a program. It includes:

Code: The program instructions
Data: Variables and heap
Stack: Function calls and local variables
Registers: CPU state
File descriptors: Open files, sockets
Memory mappings: Virtual memory layout

Process Control Block (PCB)

The kernel maintains a task_struct for each task, meaning every process and every thread within it:

// Simplified task_struct (the real definition spans several hundred lines)
struct task_struct {
    volatile long state;        // -1 unrunnable, 0 runnable, >0 stopped (unsigned int __state since Linux 5.14)
    void *stack;                // Kernel stack
    unsigned int cpu;           // Current CPU
    pid_t pid;                  // Process ID
    pid_t tgid;                 // Thread group ID
    struct task_struct *parent; // Parent process
    struct list_head children;  // Child processes
    struct mm_struct *mm;       // Memory mappings
    struct files_struct *files; // Open files
    // ... hundreds more fields
};
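The pid and tgid fields are easy to mix up, so here is a small Python sketch of how they look from user space. It assumes Linux behaviour: os.getpid() reports the thread group ID (tgid), while threading.get_native_id() reports the per-task ID that the kernel stores in task_struct.pid. The function name ids_in_worker is made up for this example.

```python
import os
import threading

def ids_in_worker() -> tuple:
    """Return (process id, kernel task id) as seen from a freshly started thread."""
    seen = []

    def worker():
        # Same tgid as the main thread, but a task id of its own
        seen.append((os.getpid(), threading.get_native_id()))

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return seen[0]

process_id, task_id = ids_in_worker()
```

On Linux the two numbers differ for any thread other than the main one, which is why tools such as ps -eLf can list several task IDs under one process ID.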

Process States

                    fork()
                       │
                       ▼
              ┌────────────────┐
              │   TASK_NEW     │
              └────────┬───────┘
                       │
                       ▼
              ┌────────────────┐
     ┌───────▶│ TASK_RUNNING   │◀──────────┐
     │        │  (Runnable)    │           │
     │        └────────┬───────┘           │
     │                 │                    │
     │   Scheduled     │  Waiting for      │ Event
     │     by CPU      │  I/O, lock, etc   │ occurred
     │                 ▼                    │
     │        ┌────────────────┐           │
     │        │ TASK_RUNNING   │           │
     │        │ (On CPU)       │           │
     │        └────────┬───────┘           │
     │                 │                    │
     │  Preempted      │   Need to wait    │
     │                 ▼                    │
     │        ┌─────────────────┐          │
     └────────│ TASK_INTERR-    │──────────┘
              │ UPTIBLE/        │
              │ TASK_UNINTERR-  │
              │ UPTIBLE         │
              └────────┬────────┘
                       │
                       │ exit()
                       ▼
              ┌────────────────┐
              │  EXIT_ZOMBIE   │───▶ Parent calls wait()
              └────────────────┘           │
                                           ▼
                                    Process removed
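You can see these states from user space: the third field of /proc/<pid>/stat is a one-letter state code. The Python sketch below parses that field from a stat line. The mapping to kernel state names is a simplification, and state_from_stat and the sample line are made up for this example.

```python
STATE_NAMES = {
    "R": "TASK_RUNNING (runnable or on a CPU)",
    "S": "TASK_INTERRUPTIBLE (sleeping, wakes on signals)",
    "D": "TASK_UNINTERRUPTIBLE (usually waiting for I/O)",
    "T": "stopped (job-control signal)",
    "Z": "zombie (exited, not yet reaped by the parent)",
}

def state_from_stat(stat_line: str) -> str:
    """Extract the state letter from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]

sample = "4242 (tmux: server) S 1 4242 4242 0 -1"
letter = state_from_stat(sample)
description = STATE_NAMES.get(letter, "other")
```

To try it on a live process, pass in the text of /proc/self/stat; a process reading its own entry will normally report R.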

The Scheduler

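Linux uses the Completely Fair Scheduler (CFS) for normal processes (kernels 2.6.23 through 6.5; from 6.6 the EEVDF scheduler replaces it, but keeps the weighted virtual-runtime accounting described here):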

CFS Core Idea: Track "virtual runtime" for each process

Virtual Runtime += Actual Runtime × (1024 / Weight)

A weight of 1024 corresponds to nice 0, so a nice-0 process accrues virtual runtime at wall-clock rate.
Processes with lower virtual runtime get scheduled first.
This ensures fairness - each process gets its fair share.

Example with nice values (each process has just run for 10ms):
┌──────────┬──────────┬─────────────┬────────────────────────────┐
│ Process  │  Nice    │   Weight    │ Virtual Runtime            │
├──────────┼──────────┼─────────────┼────────────────────────────┤
│    A     │    0     │    1024     │ 10ms × 1024/1024 = 10.0ms  │
│    B     │   10     │     110     │ 10ms × 1024/110  ≈ 93.1ms  │
│    C     │  -10     │    9548     │ 10ms × 1024/9548 ≈ 1.07ms  │
└──────────┴──────────┴─────────────┴────────────────────────────┘

C (nice -10) has the lowest vruntime after 10ms, so it is picked again soonest and receives the most CPU time.
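To see how "pick the lowest virtual runtime" turns into CPU shares, here is a toy Python simulation. It is a sketch under simplifying assumptions, not the kernel's implementation: it uses a heap instead of a red-black tree, fixed 1ms ticks, floating-point arithmetic, and tasks that never sleep. Virtual runtime advances by tick × 1024 / weight, and simulate is a made-up name.

```python
import heapq

NICE_0_WEIGHT = 1024  # weight of a nice-0 task

def simulate(weights: dict, ticks: int, tick_ms: float = 1.0) -> dict:
    """Return each task's fraction of CPU time after `ticks` scheduling decisions."""
    # Heap of (vruntime, name): the task with the smallest vruntime runs next
    heap = [(0.0, name) for name in sorted(weights)]
    heapq.heapify(heap)
    ran = {name: 0 for name in weights}
    for _ in range(ticks):
        vruntime, name = heapq.heappop(heap)
        ran[name] += 1
        # Heavier tasks accrue virtual runtime more slowly
        vruntime += tick_ms * NICE_0_WEIGHT / weights[name]
        heapq.heappush(heap, (vruntime, name))
    return {name: count / ticks for name, count in ran.items()}

shares = simulate({"A": 1024, "B": 110, "C": 9548}, ticks=10_000)
```

Each task's share comes out close to its weight divided by the sum of all weights, so C ends up with roughly 9548 / 10682 of the CPU.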

Real-Time Scheduling

For time-critical tasks, Linux provides real-time scheduling policies alongside the default one:

Policy	Description
SCHED_FIFO	First-in, first-out; runs until it blocks, yields, or is preempted by a higher-priority task
SCHED_RR	Round-robin, time-sliced FIFO
SCHED_DEADLINE	Earliest deadline first (newest)
SCHED_OTHER	Default time-sharing policy (CFS), not real-time

# Check scheduling policy
chrt -p 1234

# Run a command under SCHED_FIFO at real-time priority 50 (needs root or CAP_SYS_NICE)
chrt -f 50 ./critical-app

# View all processes with nice value and priority, sorted by CPU usage
ps -eo pid,ni,pri,pcpu,comm --sort=-pcpu
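The policy can also be queried from code. The Python sketch below decodes the integer returned by os.sched_getscheduler() into a policy name. The numeric values are taken from the Linux header linux/sched.h as I understand it, so treat the table as an assumption to check against your kernel headers. policy_name and current_policy_name are made-up helper names.

```python
import os

# Policy numbers as defined in the Linux UAPI header <linux/sched.h>
POLICIES = {
    0: "SCHED_OTHER",
    1: "SCHED_FIFO",
    2: "SCHED_RR",
    3: "SCHED_BATCH",
    5: "SCHED_IDLE",
    6: "SCHED_DEADLINE",
}
SCHED_RESET_ON_FORK = 0x40000000  # flag that may be OR-ed into the policy value

def policy_name(raw: int) -> str:
    """Translate a raw policy value into its symbolic name."""
    return POLICIES.get(raw & ~SCHED_RESET_ON_FORK, f"unknown({raw})")

def current_policy_name() -> str:
    """Policy of the calling process (pid 0 means 'myself'); Linux only."""
    return policy_name(os.sched_getscheduler(0))
```

An ordinary shell process will normally report SCHED_OTHER; a program started with chrt -f should report SCHED_FIFO.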

Memory Management

Virtual Memory

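Every process gets its own virtual address space (a classic 32-bit layout is shown for simplicity; 64-bit systems use the same arrangement with far larger addresses):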

Process A Virtual Memory:           Process B Virtual Memory:
┌───────────────────────────┐      ┌───────────────────────────┐
│  0xFFFFFFFF (High)        │      │  0xFFFFFFFF (High)        │
│  Kernel Space (shared)    │      │  Kernel Space (shared)    │
├───────────────────────────┤      ├───────────────────────────┤
│  Stack ▼                  │      │  Stack ▼                  │
│                           │      │                           │
│  (grows down)             │      │  (grows down)             │
│                           │      │                           │
│  Memory Mapped Region     │      │  Memory Mapped Region     │
│  (shared libs, mmap)      │      │  (shared libs, mmap)      │
│                           │      │                           │
│  Heap ▲                   │      │  Heap ▲                   │
│  (grows up)               │      │  (grows up)               │
├───────────────────────────┤      ├───────────────────────────┤
│  BSS (uninitialized data) │      │  BSS (uninitialized data) │
│  Data (initialized data)  │      │  Data (initialized data)  │
│  Text (code)              │      │  Text (code)              │
│  0x00000000 (Low)         │      │  0x00000000 (Low)         │
└───────────────────────────┘      └───────────────────────────┘

Same virtual addresses map to different physical memory!

Page Tables

Virtual addresses translate to physical addresses via page tables:

Virtual Address: 0x00007f3a8b2c1000
                 │
                 ▼
        ┌─────────────────┐
        │   Page Table    │
        │ (per process)   │
        ├─────────────────┤
        │ VPN → PFN       │
        │ VPN → PFN       │
        │ VPN → PFN       │
        └────────┬────────┘
                 │
                 ▼
Physical Address: 0x0000001a3c560000

The default page size on x86-64 is 4 KiB. Huge pages (2 MiB or 1 GiB) are also available and reduce the number of translations the TLB has to cache.
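The translation can be worked through by hand. The Python sketch below assumes x86-64 with 4 KiB pages and 4-level paging (9 index bits per level, 12 offset bits). It splits a virtual address into its table indexes, then translates it using a flat dict as a stand-in for the real multi-level tree. The function names are made up for this example.

```python
PAGE_SHIFT = 12                  # 4 KiB pages -> 12 offset bits
PAGE_SIZE = 1 << PAGE_SHIFT
INDEX_BITS = 9                   # 512 entries per table level

def split_address(vaddr: int) -> tuple:
    """Return (level-4, level-3, level-2, level-1 index, page offset)."""
    offset = vaddr & (PAGE_SIZE - 1)
    vpn = vaddr >> PAGE_SHIFT    # virtual page number
    mask = (1 << INDEX_BITS) - 1
    indexes = [(vpn >> (INDEX_BITS * level)) & mask for level in (3, 2, 1, 0)]
    return (*indexes, offset)

def translate(page_table: dict, vaddr: int) -> int:
    """Map a virtual address to a physical one via a VPN -> PFN dict."""
    pfn = page_table[vaddr >> PAGE_SHIFT]   # a KeyError here models a page fault
    return (pfn << PAGE_SHIFT) | (vaddr & (PAGE_SIZE - 1))

table = {0x7F3A8B2C1: 0x1A3C560}            # the mapping from the diagram above
physical = translate(table, 0x00007F3A8B2C1000)
```

Only the page number is translated; the low 12 bits (the offset within the page) pass through unchanged.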

The Page Cache

Linux aggressively caches file data in RAM:

# View memory usage
free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       3.2Gi       8.1Gi       312Mi       4.1Gi        11Gi
                                                              ▲
                                                              │
                                               This is your page cache!

Page cache behavior:

Read a file? It stays in cache for future reads
Write a file? Goes to cache first, flushed to disk later
Running low on memory? Clean cache pages are reclaimed first; dirty pages must be written back to disk before they can be freed

# Drop caches (run as root; for testing, not production)
sync; echo 3 > /proc/sys/vm/drop_caches

Memory Allocation

When a process requests memory:

malloc(1024)
     │
     ▼
┌─────────────────────────────────────────────────────┐
│  Is there space in existing heap?                    │
│     Yes → Return pointer to free chunk               │
│     No  → Request more memory from kernel            │
│              │                                       │
│              ▼                                       │
│           brk() or mmap()                           │
│              │                                       │
│              ▼                                       │
│           Kernel allocates virtual pages             │
│           (not physical yet!)                        │
│              │                                       │
│              ▼                                       │
│           First access triggers page fault           │
│              │                                       │
│              ▼                                       │
│           Kernel allocates physical page             │
└─────────────────────────────────────────────────────┘

This demand paging means a process can allocate more virtual memory than there is physical RAM.
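Demand paging is easy to observe. The Python sketch below maps an anonymous region (similar to what a large malloc does through mmap) and compares the process's resident set size before and after touching each page. It assumes Linux, where the second field of /proc/self/statm is the resident size in pages. The function names are made up for this example.

```python
import mmap
import os

PAGE = os.sysconf("SC_PAGE_SIZE")

def resident_bytes() -> int:
    """Resident set size of this process, read from /proc/self/statm."""
    with open("/proc/self/statm") as f:
        return int(f.read().split()[1]) * PAGE

def touch_and_measure(size: int) -> tuple:
    """Return RSS growth (after mapping, after touching every page)."""
    base = resident_bytes()
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        mapped = resident_bytes() - base     # virtual pages only: little or no growth
        for offset in range(0, size, PAGE):
            region[offset] = 1               # first write to a page triggers a page fault
        touched = resident_bytes() - base    # now backed by physical pages
        return mapped, touched
    finally:
        region.close()

SIZE = 16 * 1024 * 1024
mapped_growth, touched_growth = touch_and_measure(SIZE)
```

The mapping itself should barely move the resident size; the growth appears only once the pages are written, which is the page-fault step in the diagram.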

The Virtual Filesystem (VFS)

Linux abstracts all filesystems through a common interface.

VFS Architecture

User space:    open("/etc/passwd", O_RDONLY)
                    │
                    ▼
               ┌─────────────────────────────────────────────┐
               │              VFS Layer                       │
               │  - Common file operations                    │
               │  - Inode, dentry, file abstractions         │
               └─────────────────────────────────────────────┘
                    │
        ┌───────────┼───────────┬───────────┬────────────┐
        ▼           ▼           ▼           ▼            ▼
    ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐    ┌───────┐
    │  ext4 │  │  XFS  │  │  NFS  │  │ procfs│    │ tmpfs │
    └───────┘  └───────┘  └───────┘  └───────┘    └───────┘
        │           │           │           │            │
    Physical    Physical     Network    Kernel       Memory
      Disk        Disk       Server     Data

Key VFS Concepts

Inode: Metadata about a file (permissions, size, timestamps, block pointers). Does NOT contain the filename.

# View inode information
stat /etc/passwd
  File: /etc/passwd
  Size: 2446        Blocks: 8          IO Block: 4096   regular file
Device: 259,3       Inode: 131162      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)

Dentry: Directory entry - maps filename to inode.

Directory: /etc
┌─────────────────┬─────────────────┐
│  Name           │  Inode Number   │
├─────────────────┼─────────────────┤
│  passwd         │  131162         │
│  shadow         │  131163         │
│  hosts          │  131164         │
└─────────────────┴─────────────────┘
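Because names live in directory entries and not in the inode, one inode can have several names. The Python sketch below shows this with a hard link: os.link adds a second directory entry, and os.stat reports the same inode number and a link count of 2. The file names and hard_link_demo are made up for this example.

```python
import os
import tempfile

def hard_link_demo() -> tuple:
    """Create a file plus a hard link; return (same inode?, link count)."""
    with tempfile.TemporaryDirectory() as directory:
        original = os.path.join(directory, "original.txt")
        alias = os.path.join(directory, "alias.txt")
        with open(original, "w") as f:
            f.write("one inode, two names\n")
        os.link(original, alias)             # new directory entry; no data is copied
        first, second = os.stat(original), os.stat(alias)
        return first.st_ino == second.st_ino, first.st_nlink

same_inode, link_count = hard_link_demo()
```

This matches the Links field in the stat output: the count drops by one each time a name is removed, and the inode is freed once it reaches zero and no process still has the file open.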

File descriptor: A small integer that indexes the per-process table of open files.

# View open file descriptors for a process
ls -la /proc/1234/fd/
lrwx------ 1 root root 64 Dec  2 10:00 0 -> /dev/null
lrwx------ 1 root root 64 Dec  2 10:00 1 -> /dev/null
l-wx------ 1 root root 64 Dec  2 10:00 2 -> /var/log/nginx/error.log
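A descriptor is only an index; the file offset lives in the kernel object it points to. The Python sketch below opens a file, duplicates the descriptor with os.dup, and reads through both. Because both descriptors refer to the same open file, the second read continues where the first stopped. The function name shared_offset_demo is made up for this example.

```python
import os
import tempfile

def shared_offset_demo() -> tuple:
    """Read 3 bytes through each of two descriptors for the same open file."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"abcdef")
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY)  # new slot in this process's descriptor table
        dup_fd = os.dup(fd)                  # second slot, same underlying open file
        try:
            first = os.read(fd, 3)
            second = os.read(dup_fd, 3)      # offset was advanced by the first read
            return first, second
        finally:
            os.close(fd)
            os.close(dup_fd)

first_chunk, second_chunk = shared_offset_demo()
```

Calling os.open a second time instead of os.dup would create an independent open file with its own offset, so both reads would start at the beginning.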

Everything is a File

This Unix philosophy extends to:

Path	What It Is
/dev/sda	Block device (hard drive)
/dev/null	Bit bucket (discards writes)
/dev/random	Random number generator
/proc/cpuinfo	CPU information (kernel data)
/sys/class/net	Network interface info
/dev/stdin	Standard input

# Read from "files" that aren't really files
cat /proc/meminfo
cat /sys/class/net/eth0/address
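The same idea holds at the API level: the calls used for ordinary files also work on devices. The Python sketch below classifies a path by the type bits in st_mode and reads from a device with a plain open/read. It uses /dev/urandom in place of /dev/random from the table, and the helper names are made up for this example.

```python
import os
import stat

def file_kind(path: str) -> str:
    """Classify a path by the file-type bits the kernel reports."""
    mode = os.stat(path).st_mode
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"

def read_bytes(path: str, count: int) -> bytes:
    """Read up to `count` bytes using the ordinary file API."""
    with open(path, "rb", buffering=0) as f:
        return f.read(count)

random_bytes = read_bytes("/dev/urandom", 8)
```

Reading /dev/null through the same function returns an empty result immediately, because that device always reports end of file.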

Networking Internals

The Network Stack

Application:    curl https://example.com
                    │
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│  Socket Layer (socket, bind, connect, send, recv)               │
├─────────────────────────────────────────────────────────────────┤
│  Transport Layer (TCP, UDP)                                      │
│  - Connection state (SYN, ACK, FIN)                             │
│  - Reliability, flow control                                     │
├─────────────────────────────────────────────────────────────────┤
│  Network Layer (IP)                                              │
│  - Routing decisions                                             │
│  - IP fragmentation                                              │
├─────────────────────────────────────────────────────────────────┤
│  Data Link Layer (Ethernet, WiFi)                               │
│  - Frame construction                                            │
│  - MAC addressing                                                │
├─────────────────────────────────────────────────────────────────┤
│  Physical Layer (NIC Driver)                                     │
│  - DMA to/from network card                                      │
└─────────────────────────────────────────────────────────────────┘

Socket Buffers

Data flows through socket buffers:

Send path:
Application ──▶ Socket send buffer ──▶ TCP ──▶ IP ──▶ NIC

Receive path:
NIC ──▶ Driver ──▶ IP ──▶ TCP ──▶ Socket recv buffer ──▶ Application

# View the maximum socket buffer sizes an application may request (bytes)
sysctl net.core.rmem_max
sysctl net.core.wmem_max

# View socket statistics
ss -s
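Per-socket buffer sizes can be read with getsockopt. The Python sketch below uses a connected AF_UNIX socket pair so it runs without any network access; the same SO_SNDBUF and SO_RCVBUF options apply to TCP sockets. The function name buffer_sizes is made up for this example.

```python
import socket

def buffer_sizes() -> dict:
    """Send a few bytes across a socket pair and report the buffer sizes."""
    sender, receiver = socket.socketpair()   # two connected sockets inside one process
    try:
        sender.sendall(b"ping")              # data now waits in a kernel buffer
        return {
            "sndbuf": sender.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            "rcvbuf": receiver.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
            "received": receiver.recv(16),   # copy from the kernel buffer to user space
        }
    finally:
        sender.close()
        receiver.close()

info = buffer_sizes()
```

The sizes are in bytes. A program can request a different size with setsockopt, subject to the limits in net.core.rmem_max and net.core.wmem_max.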

Netfilter and iptables

Netfilter provides hooks for packet processing:

                              Network
                                 │
                                 ▼
                         ┌──────────────┐
                         │ PREROUTING   │──▶ DNAT (port forwarding)
                         └──────┬───────┘
                                │
                         ┌──────▼───────┐
                    ┌────│   Routing    │────┐
                    │    │   Decision   │    │
                    │    └──────────────┘    │
                    │                        │
             ┌──────▼───────┐         ┌──────▼───────┐
             │    INPUT     │         │   FORWARD    │
             │  (local)     │         │  (routing)   │
             └──────┬───────┘         └──────┬───────┘
                    │                        │
                    ▼                        │
              Local Process                  │
                    │                        │
                    ▼                        │
             ┌──────────────┐               │
             │   OUTPUT     │               │
             └──────┬───────┘               │
                    │                        │
                    └────────┬───────────────┘
                             │
                      ┌──────▼───────┐
                      │ POSTROUTING  │──▶ SNAT (masquerading)
                      └──────┬───────┘
                             │
                             ▼
                          Network
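The diagram boils down to three paths. The Python sketch below encodes them as a lookup. It is a simplified model: the 'local' and 'remote' labels are made up for this example, and traffic from the machine to itself over loopback is not modelled.

```python
def hooks_for(origin: str, destination: str) -> list:
    """Return the Netfilter hooks a packet traverses, in order.

    origin and destination are each 'local' (this machine) or 'remote'.
    """
    if origin == "local":
        # Generated by a local process and sent out
        return ["OUTPUT", "POSTROUTING"]
    if destination == "local":
        # Arrived from the network for a local process
        return ["PREROUTING", "INPUT"]
    # Arrived from the network and routed on to another host
    return ["PREROUTING", "FORWARD", "POSTROUTING"]

forwarded = hooks_for("remote", "remote")
```

This is why DNAT rules belong in PREROUTING (before the routing decision) and SNAT rules in POSTROUTING (after it): forwarded packets pass through both, but never through INPUT or OUTPUT.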

Interview Deep Dive Questions

What happens when you run a program?

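Answer: 1) Shell calls fork() to create a child process, 2) Child calls execve() to load the program, 3) Kernel loads the ELF binary and sets up memory mappings, 4) Kernel sets up the stack with argc, argv, and the environment, 5) Control transfers to the entry point (for dynamically linked binaries the dynamic linker runs first, then _start from the C runtime startup code), 6) _start calls __libc_start_main(), which calls main(), 7) Program runs, 8) exit() is called and the kernel cleans up resources; the parent collects the exit status with wait().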
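The fork, exec, and wait steps can be reproduced in a few lines. The Python sketch below is a minimal stand-in for what a shell does, not how any particular shell is written; run is a made-up helper name, and exit code 127 for a failed exec follows the common shell convention.

```python
import os
import sys

def run(argv: list) -> int:
    """Fork, exec argv in the child, and return the child's exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)                 # reached only if exec failed
    # Parent: block until the child exits, which also reaps it
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1

exit_code = run([sys.executable, "-c", "raise SystemExit(3)"])
```

Between the child's exit and the parent's waitpid call, the child exists only as a zombie entry holding its exit status.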

Explain the difference between processes and threads

Answer: Processes have separate address spaces, file descriptor tables, and resources. Threads share the address space (code, heap, globals) and file descriptors of their process, but each has its own stack and registers. In Linux both are represented by task_struct: threads share one mm_struct (memory) while separate processes each have their own. Threads are cheaper to create (no page tables to duplicate), but a memory-corruption bug in one thread can take down every thread in the process.

What is a context switch?

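Answer: A context switch is the kernel moving a CPU from one task to another: 1) Save the current task's CPU registers (on its kernel stack and in the thread state inside task_struct), 2) Select the next task to run (scheduler), 3) If the next task uses a different address space, switch page tables (MMU context), 4) Restore the next task's registers, 5) Resume at its saved instruction pointer. Context switches are expensive mainly through indirect costs: the new task finds the CPU caches cold, and switching address spaces can flush TLB entries. Switching between threads of the same process skips the page-table switch.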

How does Linux handle memory overcommit?

Answer: By default, Linux allows processes to allocate more virtual memory than there is physical RAM (overcommit). Physical pages are allocated on first access (demand paging). If the system runs out of memory, the OOM killer selects and kills a process. Controlled by vm.overcommit_memory: 0=heuristic (only obviously excessive requests are refused), 1=always allow, 2=strict accounting (total commitments are limited to swap plus vm.overcommit_ratio percent of RAM).
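For strict accounting (mode 2), the limit can be estimated with simple arithmetic. The Python sketch below follows the formula in the kernel's overcommit documentation as I understand it, with the default vm.overcommit_ratio of 50. It ignores vm.overcommit_kbytes and the admin and user reserves, and commit_limit is a made-up helper name.

```python
GIB = 1 << 30

def commit_limit(ram_bytes: int, swap_bytes: int,
                 overcommit_ratio: int = 50, hugetlb_bytes: int = 0) -> int:
    """Approximate CommitLimit under vm.overcommit_memory=2."""
    usable_ram = ram_bytes - hugetlb_bytes   # huge page pools are excluded
    return usable_ram * overcommit_ratio // 100 + swap_bytes

limit = commit_limit(16 * GIB, 4 * GIB)
```

With 16 GiB of RAM and 4 GiB of swap, this gives 12 GiB. The value the kernel actually uses is shown as CommitLimit in /proc/meminfo, next to Committed_AS.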

Explain the purpose of /proc and /sys

Answer: Both are virtual filesystems with no backing disk storage; their contents are generated by the kernel on each read. /proc exposes process information and miscellaneous kernel state (the idea originated in Unix). /sys (sysfs) is newer, introduced in Linux 2.6, and presents devices, drivers, and buses as a structured tree. /proc has accumulated cruft over the years; /sys is more organized. Examples: /proc/meminfo, /proc/1234/status, /sys/class/net/eth0/address.

What is the OOM killer and how does it work?

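Answer: The Out-of-Memory (OOM) killer is invoked when the system is critically low on memory and reclaim cannot free enough. It computes a score for each process based mainly on its memory footprint (resident memory, swap, and page tables), then shifts that score by oom_score_adj. Older kernels also weighed runtime, nice value, and privilege; current kernels do not. The process with the highest score is killed first. oom_score_adj (-1000 to 1000) can be set in /proc/<pid>/oom_score_adj. A value of -1000 exempts the process from the OOM killer entirely (risky, because other processes will be killed in its place).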
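The scoring idea can be sketched in a few lines of Python. This is a simplified model based on my reading of oom_badness() in recent kernels, so treat the exact arithmetic as an assumption and not a reference. The function name badness and the sample numbers are made up for this example.

```python
OOM_SCORE_ADJ_MIN = -1000

def badness(rss_pages: int, swap_pages: int, pgtable_pages: int,
            oom_score_adj: int, total_pages: int):
    """Approximate OOM badness in pages; None means 'never selected'."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None
    points = rss_pages + swap_pages + pgtable_pages
    # Each adj unit is worth one thousandth of the memory available to the task
    points += oom_score_adj * (total_pages // 1000)
    return points

plain = badness(1000, 0, 10, oom_score_adj=0, total_pages=1_000_000)
```

Raising oom_score_adj makes a process a more likely victim regardless of its size, and lowering it protects the process. The kernel's current score for a process can be read from /proc/<pid>/oom_score.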

Exploring Internals Yourself

# View kernel parameters
sysctl -a | head -50

# View CPU information
cat /proc/cpuinfo

# View memory information
cat /proc/meminfo

# View process memory map
cat /proc/self/maps

# Trace system calls
strace -c ls

# View scheduling info
cat /proc/1/sched

# View network connections
ss -tulpn

# View file descriptors
ls -la /proc/self/fd

# View mount points
cat /proc/mounts

# View kernel modules
lsmod

Key Takeaways

User space and kernel space are separated - for security and stability
System calls are the bridge - only way to request kernel services
Everything is a file - devices, processes, kernel data exposed as files
Virtual memory provides isolation - each process has its own address space
CFS scheduler ensures fairness - virtual runtime tracks CPU usage
Page cache makes I/O fast - files cached in RAM automatically
VFS abstracts filesystems - same interface for ext4, NFS, procfs
Namespaces and cgroups enable containers - isolation and resource limits

Ready to master the command line? Next up: Linux Permissions where we will dive deep into users, groups, and access control.
