Linux Internals Deep Dive

If you love understanding how things actually work, this chapter is for you. If you just want to run commands and get things done, feel free to skip ahead. No judgment.
This chapter takes you beneath the surface of Linux. We will explore how the kernel manages processes, understand how system calls bridge user space and kernel space, and demystify the virtual filesystem. This knowledge is what transforms a Linux user into a Linux engineer.

Why Internals Matter

Understanding Linux internals helps you:
  • Debug performance issues when top and htop are not enough
  • Write better software that works with the kernel, not against it
  • Ace interviews where internals questions are common
  • Understand containers since Docker relies on kernel features
  • Troubleshoot production systems at a deeper level

User Space vs Kernel Space

The most fundamental concept: Linux divides memory into two distinct spaces.
┌─────────────────────────────────────────────────────────────────┐
│                         User Space                               │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐            │
│  │  bash   │  │  nginx  │  │ python  │  │  java   │            │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘            │
│                                                                  │
│  Applications, libraries, user processes                         │
│  - Protected from each other                                     │
│  - Cannot access hardware directly                               │
│  - Uses system calls to request kernel services                  │
├──────────────────────────────────────────────────────────────────┤
│                      System Call Interface                        │
├──────────────────────────────────────────────────────────────────┤
│                        Kernel Space                               │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │  Process Management  │  Memory Management  │  File Systems  │ │
│  ├─────────────────────────────────────────────────────────────┤ │
│  │  Device Drivers  │  Network Stack  │  Security Modules      │ │
│  └─────────────────────────────────────────────────────────────┘ │
│                                                                   │
│  - Full hardware access                                          │
│  - Manages all system resources                                  │
│  - Runs in privileged mode (Ring 0)                              │
├──────────────────────────────────────────────────────────────────┤
│                         Hardware                                  │
│  CPU  │  Memory  │  Disk  │  Network  │  Devices                │
└──────────────────────────────────────────────────────────────────┘
Why this separation?
  • Security: Buggy user programs cannot crash the kernel
  • Stability: One process cannot corrupt another
  • Abstraction: Applications do not need to know hardware details

System Calls: The Bridge

When a user program needs kernel services (read file, open network connection, create process), it makes a system call.

Anatomy of a System Call

User space:    write(fd, buf, count)


               libc wrapper (glibc)

                    │  Places the syscall number and arguments in registers
                    │  Traps into the kernel (syscall on x86-64, int 0x80 on legacy 32-bit x86)

──────────────── syscall instruction ────────────────

Kernel space:       │

               System call handler

                    │  Validates arguments
                    │  Performs operation
                    │  Returns result

──────────────── return to user space ────────────────
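
The path in the diagram can be exercised from Python, whose os.write and os.read are thin wrappers over the write(2) and read(2) system calls. This is a minimal sketch, assuming a Linux or other POSIX system; the pipe and the message are illustrative choices, not part of the original text.

```python
import errno
import os

# Create a pipe: the kernel hands back two file descriptors.
read_fd, write_fd = os.pipe()

# os.write() issues write(2); the return value is the kernel's byte count.
message = b"hello kernel\n"
written = os.write(write_fd, message)

# os.read() issues read(2) on the other end of the pipe.
received = os.read(read_fd, 1024)

os.close(read_fd)
os.close(write_fd)

# A failed system call surfaces as OSError carrying the kernel's errno.
error_name = None
try:
    os.write(write_fd, b"x")  # the descriptor is closed now
except OSError as exc:
    error_name = errno.errorcode[exc.errno]

print(written, received, error_name)
```

Running the script under strace should show the same pipe, write, read, and close calls that the Python code makes.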

Common System Calls

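┌──────────┬──────────────────────────────┬──────────────────────────────┐
│ Category │ System Calls                 │ Purpose                      │
├──────────┼──────────────────────────────┼──────────────────────────────┤
│ Process  │ fork, exec, exit, wait       │ Create and manage processes  │
│ File     │ open, read, write, close     │ File operations              │
│ Network  │ socket, bind, listen, accept │ Network operations           │
│ Memory   │ mmap, brk, mprotect          │ Memory management            │
│ IPC      │ pipe, shmget, semget         │ Inter-process communication  │
└──────────┴──────────────────────────────┴──────────────────────────────┘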

Tracing System Calls

strace is one of the most powerful debugging tools in Linux. It shows you exactly what a process is asking the kernel to do, in real time. When a program hangs, crashes, or behaves strangely, strace reveals what is happening at the system call level — which files it is trying to open, which network connections it is making, and where it is getting stuck.
# Trace system calls of an already-running process (attach to PID 1234)
strace -p 1234

# Trace a command from start to finish
strace ls -la

# Count and summarize system calls (great for finding where time is spent)
strace -c ls -la

# Example output:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 30.12    0.000142          71         2           getdents64
 21.28    0.000100          14         7           mmap
 15.96    0.000075          10         7           close

# Trace only file-related system calls (filter out the noise)
strace -e trace=file ls -la

# Trace only network-related calls (invaluable for debugging connection issues)
strace -e trace=network curl https://example.com
When to reach for strace: A process hangs and you do not know why — strace shows what system call it is blocked on. A config file is not being read — strace shows every open() call and which paths it tried. A network connection fails — strace shows the connect() call and the error code. It is the “X-ray vision” of Linux debugging.

Process Management

What is a Process?

A process is a running program. It includes:
  • Code: The program instructions
  • Data: Variables and heap
  • Stack: Function calls and local variables
  • Registers: CPU state
  • File descriptors: Open files, sockets
  • Memory mappings: Virtual memory layout

Process Control Block (PCB)

The kernel maintains a task_struct for each process:
// Simplified task_struct (actual is ~600 lines)
struct task_struct {
    unsigned int __state;       // Run state: 0 runnable, >0 sleeping/stopped (named "state" before 5.14)
    void *stack;                // Kernel stack
    unsigned int cpu;           // Current CPU
    pid_t pid;                  // Process ID
    pid_t tgid;                 // Thread group ID
    struct task_struct *parent; // Parent process
    struct list_head children;  // Child processes
    struct mm_struct *mm;       // Memory mappings
    struct files_struct *files; // Open files
    // ... hundreds more fields
};
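
Several of these fields are visible from user space through /proc/<pid>/status. The sketch below parses that file's "Key: value" format; the sample values are made up for illustration, and the live read assumes a Linux system with /proc mounted.

```python
from pathlib import Path


def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Sample lines in the format the kernel emits (values are made up).
sample = "Name:\tbash\nState:\tS (sleeping)\nTgid:\t4242\nPid:\t4242\nPPid:\t4100\n"
info = parse_status(sample)
print(info["Pid"], info["Tgid"], info["PPid"], info["State"])

# On a live Linux system, read the same fields for the current process.
status_path = Path("/proc/self/status")
if status_path.exists():
    live = parse_status(status_path.read_text())
    print(live["Name"], live["Pid"], live["Tgid"], live["PPid"])
```

Pid and Tgid are equal for a single-threaded process; in a multi-threaded one, every thread shares the Tgid of the thread group leader.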

Process States

                    fork()


              ┌────────────────┐
              │   TASK_NEW     │
              └────────┬───────┘


              ┌────────────────┐
     ┌───────▶│ TASK_RUNNING   │◀──────────┐
     │        │  (Runnable)    │           │
     │        └────────┬───────┘           │
     │                 │                    │
     │   Scheduled     │  Waiting for      │ Event
     │     by CPU      │  I/O, lock, etc   │ occurred
     │                 ▼                    │
     │        ┌────────────────┐           │
     │        │ TASK_RUNNING   │           │
     │        │ (On CPU)       │           │
     │        └────────┬───────┘           │
     │                 │                    │
     │  Preempted      │   Need to wait    │
     │                 ▼                    │
     │        ┌─────────────────┐          │
     └────────│ TASK_INTERR-    │──────────┘
              │ UPTIBLE/        │
              │ TASK_UNINTERR-  │
              │ UPTIBLE         │
              └────────┬────────┘

                       │ exit()

              ┌────────────────┐
              │  EXIT_ZOMBIE   │───▶ Parent calls wait()
              └────────────────┘           │

                                    Process removed
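
The zombie stage at the bottom of the diagram can be observed directly. The following is a sketch, assuming Linux with /proc mounted: it forks a child that exits immediately, reads the child's state letter from /proc/<pid>/stat before the parent calls wait(), then reaps it. The helper names are hypothetical.

```python
import os
import time


def read_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        return None
    # The command name sits in parentheses and may contain spaces,
    # so split after the last ')'.
    return stat.rsplit(")", 1)[1].split()[0]


def observe_zombie(timeout=2.0):
    """Fork a child that exits at once; report its state before and after wait()."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: terminate immediately

    # Parent: poll until the exited child shows up as a zombie ("Z").
    deadline = time.monotonic() + timeout
    state = read_state(pid)
    while state not in ("Z", None) and time.monotonic() < deadline:
        time.sleep(0.01)
        state = read_state(pid)

    _, status = os.waitpid(pid, 0)   # reap the child
    after = read_state(pid)          # the /proc entry is gone once reaped
    return state, os.WEXITSTATUS(status), after


if __name__ == "__main__":
    print(observe_zombie())
```

Until waitpid() runs, the child holds on to its PID and exit status. That is all a zombie is.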

The Scheduler

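Linux used the Completely Fair Scheduler (CFS) for normal processes from kernel 2.6.23 until 6.6, when the EEVDF scheduler replaced it. EEVDF keeps the weighted virtual-runtime accounting described here: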
CFS Core Idea: Track "virtual runtime" for each process

Virtual Runtime += Actual Runtime × (1024 / Weight)

The task with the lowest virtual runtime is scheduled next.
A higher weight makes vruntime grow more slowly, so the task earns more CPU time.

Example with nice values (each task has just run for 10ms):
┌──────────┬──────────┬─────────────┬────────────────────┐
│ Process  │  Nice    │   Weight    │ vruntime added     │
├──────────┼──────────┼─────────────┼────────────────────┤
│    A     │    0     │    1024     │  10.00ms           │
│    B     │   10     │     110     │  ≈ 93.09ms         │
│    C     │  -10     │    9548     │  ≈ 1.07ms          │
└──────────┴──────────┴─────────────┴────────────────────┘

C (nice -10) accumulates the least vruntime per 10ms of CPU time, so it is scheduled more often.
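
The weighting can be worked through numerically. This sketch assumes the kernel's convention that real runtime is scaled by 1024 (the nice-0 weight) divided by the task's weight, and it uses only the three weights from the example; the function names are illustrative.

```python
NICE_0_WEIGHT = 1024  # weight the kernel assigns to nice 0

# Weights for the three nice levels used in the example.
WEIGHTS = {-10: 9548, 0: 1024, 10: 110}


def vruntime_delta_ms(runtime_ms, nice):
    """Virtual runtime charged for runtime_ms of real CPU time."""
    return runtime_ms * NICE_0_WEIGHT / WEIGHTS[nice]


def cpu_share(nice, all_nices):
    """Long-run CPU fraction when every listed task is always runnable."""
    return WEIGHTS[nice] / sum(WEIGHTS[n] for n in all_nices)


for name, nice in (("A", 0), ("B", 10), ("C", -10)):
    delta = vruntime_delta_ms(10, nice)
    share = cpu_share(nice, (0, 10, -10))
    print(f"{name} (nice {nice:>3}): +{delta:6.2f} ms vruntime, {share:6.1%} of the CPU")
```

Because the scheduler keeps picking the task with the smallest vruntime, the task whose vruntime grows slowest ends up with the largest share of CPU time.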

Real-Time Scheduling

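For time-critical tasks, Linux provides real-time scheduling policies alongside the default one:
┌────────────────┬─────────────────────────────────────────────────────┐
│ Policy         │ Description                                         │
├────────────────┼─────────────────────────────────────────────────────┤
│ SCHED_FIFO     │ First-in, first-out; runs until it blocks or yields │
│ SCHED_RR       │ Round-robin; FIFO with a time slice                 │
│ SCHED_DEADLINE │ Earliest deadline first (newest)                    │
│ SCHED_OTHER    │ Default time-sharing policy (CFS), not real-time    │
└────────────────┴─────────────────────────────────────────────────────┘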
# Check scheduling policy
chrt -p 1234

# Run a command under SCHED_FIFO at priority 50 (needs root or CAP_SYS_NICE)
chrt -f 50 ./critical-app

# View all processes with nice value and priority, busiest first
ps -eo pid,ni,pri,pcpu,comm --sort=-pcpu
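
The policy that chrt -p reports can also be queried programmatically. A small sketch, assuming Linux, where Python's os module exposes sched_getscheduler and sched_getparam; describe_policy is a hypothetical helper name.

```python
import os

# Map the policy constants this platform defines to their names.
POLICY_NAMES = {
    getattr(os, name): name
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH", "SCHED_IDLE")
    if hasattr(os, name)
}


def describe_policy(pid=0):
    """Return (policy name, real-time priority) for pid; 0 means this process."""
    policy = os.sched_getscheduler(pid)
    priority = os.sched_getparam(pid).sched_priority
    return POLICY_NAMES.get(policy, f"policy {policy}"), priority


print(describe_policy())
```

An ordinary process normally reports SCHED_OTHER with priority 0; only the real-time policies use a non-zero priority.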

Memory Management

Virtual Memory

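Every process gets its own virtual address space (a classic 32-bit layout is shown; 64-bit systems arrange the same regions across a much larger range):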
Process A Virtual Memory:           Process B Virtual Memory:
┌───────────────────────────┐      ┌───────────────────────────┐
│  0xFFFFFFFF (High)        │      │  0xFFFFFFFF (High)        │
│  Kernel Space (shared)    │      │  Kernel Space (shared)    │
├───────────────────────────┤      ├───────────────────────────┤
│  Stack ▼                  │      │  Stack ▼                  │
│                           │      │                           │
│  (grows down)             │      │  (grows down)             │
│                           │      │                           │
│  Memory Mapped Region     │      │  Memory Mapped Region     │
│  (shared libs, mmap)      │      │  (shared libs, mmap)      │
│                           │      │                           │
│  Heap ▲                   │      │  Heap ▲                   │
│  (grows up)               │      │  (grows up)               │
├───────────────────────────┤      ├───────────────────────────┤
│  BSS (uninitialized data) │      │  BSS (uninitialized data) │
│  Data (initialized data)  │      │  Data (initialized data)  │
│  Text (code)              │      │  Text (code)              │
│  0x00000000 (Low)         │      │  0x00000000 (Low)         │
└───────────────────────────┘      └───────────────────────────┘

Same virtual addresses map to different physical memory!

Page Tables

Virtual addresses translate to physical addresses via page tables:
Virtual Address: 0x00007f3a8b2c1000


        ┌─────────────────┐
        │   Page Table    │
        │ (per process)   │
        ├─────────────────┤
        │ VPN → PFN       │
        │ VPN → PFN       │
        │ VPN → PFN       │
        └────────┬────────┘


Physical Address: 0x0000001a3c560000
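The default page size on x86 is 4 KiB; huge pages of 2 MiB or 1 GiB are also available. Real page tables are multi-level trees (four or five levels on x86-64) rather than the single flat table drawn here.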
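
The arithmetic behind a lookup is simple once the address is split into a virtual page number (VPN) and an offset within the page. The sketch below assumes 4 KiB pages and models the page table as a flat dictionary, which is a deliberate simplification; the single mapping is hypothetical and chosen to match the addresses in the diagram.

```python
PAGE_SIZE = 4096                           # 4 KiB base page
OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # 12 bits address a byte within the page


def split_address(vaddr):
    """Split a virtual address into (virtual page number, offset in page)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)


def translate(vaddr, page_table):
    """Look up the frame for vaddr in a {VPN: PFN} dict and rebuild the address."""
    vpn, offset = split_address(vaddr)
    pfn = page_table[vpn]  # a missing key plays the role of a page fault
    return (pfn << OFFSET_BITS) | offset


# Hypothetical single-entry page table matching the diagram.
table = {0x7F3A8B2C1: 0x1A3C560}
print(hex(translate(0x00007F3A8B2C1000, table)))
print(hex(translate(0x00007F3A8B2C1ABC, table)))
```

The offset passes through translation unchanged; only the page number is mapped.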

The Page Cache

Linux aggressively caches file data in RAM:
# View memory usage
free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       3.2Gi       8.1Gi       312Mi       4.1Gi        11Gi


                                               This is your page cache!
Page cache behavior:
  • Read a file? It stays in cache for future reads
  • Write a file? Goes to cache first, flushed to disk later
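  • Running low on memory? Clean cache pages are reclaimed first; dirty pages must be written back to disk before they can be freed
# Drop clean caches (for testing, not production). Run this as root: with sudo,
# the redirect would be performed by your unprivileged shell and fail.
sync; echo 3 > /proc/sys/vm/drop_caches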

Memory Allocation

When a process requests memory:
malloc(1024)


┌─────────────────────────────────────────────────────┐
│  Is there space in existing heap?                    │
│     Yes → Return pointer to free chunk               │
│     No  → Request more memory from kernel            │
│              │                                       │
│              ▼                                       │
│           brk() or mmap()                           │
│              │                                       │
│              ▼                                       │
│           Kernel allocates virtual pages             │
│           (not physical yet!)                        │
│              │                                       │
│              ▼                                       │
│           First access triggers page fault           │
│              │                                       │
│              ▼                                       │
│           Kernel allocates physical page             │
└─────────────────────────────────────────────────────┘
This demand paging means a process can allocate more virtual memory than there is physical RAM. It is like a restaurant that takes unlimited reservations but only sets a table when you actually show up.
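Demand paging can be watched from inside a process. This sketch assumes Linux with /proc mounted: it reserves 64 MiB of anonymous memory, then touches only 8 MiB of it and compares the resident set size at each step. The sizes and helper names are arbitrary choices for illustration.

```python
import mmap
import os

PAGE = os.sysconf("SC_PAGE_SIZE")


def resident_bytes():
    """Resident set size of this process, or None where /proc is unavailable."""
    try:
        with open("/proc/self/statm") as f:
            return int(f.read().split()[1]) * PAGE
    except OSError:
        return None


def demo(map_size=64 * 1024 * 1024, touch_size=8 * 1024 * 1024):
    """Return RSS before mapping, after mapping, and after touching pages."""
    before = resident_bytes()
    # Reserve virtual address space; no physical pages are assigned yet.
    region = mmap.mmap(-1, map_size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    after_map = resident_bytes()
    # Write one byte per page: each first touch triggers a page fault.
    for offset in range(0, touch_size, PAGE):
        region[offset] = 1
    after_touch = resident_bytes()
    region.close()
    return before, after_map, after_touch


if __name__ == "__main__":
    print(demo())
```

The resident size should barely move after the mapping call and grow by roughly the touched amount afterwards.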
Production gotcha: The OOM Killer. Because Linux allows overcommit by default, the total virtual memory across all processes can exceed physical RAM. This works as long as processes do not all touch their allocations at once. When physical memory and swap run out, the kernel invokes the Out-of-Memory (OOM) Killer. It scores each process mainly by how much memory it uses (resident pages, swap, and page tables), adjusted by oom_score_adj, and kills the highest scorer with SIGKILL: no graceful shutdown, no cleanup. The kill is recorded in the kernel log, which you can read with dmesg or, on Debian-based systems, in /var/log/kern.log. To make a critical process a less likely victim, set its oom_score_adj to a negative value (e.g., echo -500 > /proc/PID/oom_score_adj). Setting it to -1000 exempts the process entirely, but be cautious: if nothing can be killed, the system may hang.
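Both values can be read without special privileges. A minimal sketch, assuming Linux with /proc mounted; oom_info is a hypothetical helper name.

```python
from pathlib import Path


def oom_info(pid="self"):
    """Return (oom_score, oom_score_adj) for a process, or None if unreadable."""
    base = Path("/proc") / str(pid)
    try:
        score = int((base / "oom_score").read_text())
        adj = int((base / "oom_score_adj").read_text())
    except (OSError, ValueError):
        return None
    return score, adj


print(oom_info())
```

Reading is always allowed for your own processes. Lowering oom_score_adj below its current minimum requires the CAP_SYS_RESOURCE capability, which in practice usually means root.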

The Virtual Filesystem (VFS)

Linux abstracts all filesystems through a common interface.

VFS Architecture

User space:    open("/etc/passwd", O_RDONLY)


               ┌─────────────────────────────────────────────┐
               │              VFS Layer                       │
               │  - Common file operations                    │
               │  - Inode, dentry, file abstractions         │
               └─────────────────────────────────────────────┘

        ┌───────────┼───────────┬───────────┬────────────┐
        ▼           ▼           ▼           ▼            ▼
    ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐    ┌───────┐
    │  ext4 │  │  XFS  │  │  NFS  │  │ procfs│    │ tmpfs │
    └───────┘  └───────┘  └───────┘  └───────┘    └───────┘
        │           │           │           │            │
    Physical    Physical     Network    Kernel       Memory
      Disk        Disk       Server     Data

Key VFS Concepts

Inode: Metadata about a file (permissions, size, timestamps, block pointers). Does NOT contain the filename.
# View inode information
stat /etc/passwd
  File: /etc/passwd
  Size: 2446        Blocks: 8          IO Block: 4096   regular file
Device: 259,3       Inode: 131162      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Dentry: Directory entry - maps filename to inode.
Directory: /etc
┌─────────────────┬─────────────────┐
│  Name           │  Inode Number   │
├─────────────────┼─────────────────┤
│  passwd         │  131162         │
│  shadow         │  131163         │
│  hosts          │  131164         │
└─────────────────┴─────────────────┘
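Because names live in directories rather than in the inode, two names can point at the same inode. The sketch below shows this with a hard link in a temporary directory; the file names are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    original = os.path.join(directory, "report.txt")
    alias = os.path.join(directory, "same-report.txt")

    with open(original, "w") as f:
        f.write("data\n")

    # A hard link adds a second directory entry pointing at the same inode.
    os.link(original, alias)

    first, second = os.stat(original), os.stat(alias)
    same_inode = first.st_ino == second.st_ino
    link_count = first.st_nlink

    # The directory itself is just the name -> inode table.
    listing = {entry.name: entry.inode() for entry in os.scandir(directory)}

    # Removing one name leaves the inode (and the data) reachable via the other.
    os.unlink(original)
    with open(alias) as f:
        surviving = f.read()

print(same_inode, link_count, sorted(listing), repr(surviving))
```

The Links field in the stat output counts these directory entries; the inode is freed only when the count reaches zero and no process still has the file open.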
File descriptors: Per-process table of open files.
# View open file descriptors for a process
ls -la /proc/1234/fd/
lrwx------ 1 root root 64 Dec  2 10:00 0 -> /dev/null
lrwx------ 1 root root 64 Dec  2 10:00 1 -> /dev/null
l-wx------ 1 root root 64 Dec  2 10:00 2 -> /var/log/nginx/error.log

Everything is a File

This Unix philosophy extends to:
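┌────────────────┬───────────────────────────────┐
│ Path           │ What It Is                    │
├────────────────┼───────────────────────────────┤
│ /dev/sda       │ Block device (hard drive)     │
│ /dev/null      │ Bit bucket (discards writes)  │
│ /dev/random    │ Random number generator       │
│ /proc/cpuinfo  │ CPU information (kernel data) │
│ /sys/class/net │ Network interface info        │
│ /dev/stdin     │ Standard input                │
└────────────────┴───────────────────────────────┘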
# Read from "files" that aren't really files
cat /proc/meminfo
cat /sys/class/net/eth0/address
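
The same open, read, and write calls work on all of these paths. The sketch below assumes a Linux system; it uses /dev/urandom (the non-blocking sibling of /dev/random, swapped in here) alongside /dev/null and /proc/meminfo.

```python
import os

# /dev/urandom: reading a "file" returns bytes produced by the kernel.
with open("/dev/urandom", "rb") as rng:
    random_bytes = rng.read(16)

# /dev/null: writes succeed and the data is discarded; reads hit end-of-file.
with open("/dev/null", "wb") as sink:
    accepted = sink.write(b"discard me")
with open("/dev/null", "rb") as sink:
    nothing = sink.read()

# /proc/meminfo: the kernel generates the text at read time.
mem_total = None
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                mem_total = line.split()[1] + " kB"
                break

print(len(random_bytes), accepted, nothing, mem_total)
```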

Networking Internals

Understanding the network stack is essential for debugging connectivity problems, container networking, and performance tuning. When your application cannot connect to a database or your API latency spikes, this knowledge tells you where to look.

The Network Stack

Think of the network stack like a mail-room in a large building. Your application writes a letter (data), the mail-room staff put it in an envelope (TCP/UDP), add a street address (IP), put it in a delivery truck (Ethernet frame), and the truck drives to the destination (physical network). Each layer handles one responsibility and passes the packet down.
Application:    curl https://example.com


┌─────────────────────────────────────────────────────────────────┐
│  Socket Layer (socket, bind, connect, send, recv)               │
│  - Your application interacts here via system calls             │
├─────────────────────────────────────────────────────────────────┤
│  Transport Layer (TCP, UDP)                                      │
│  - TCP: reliable, ordered, connection-oriented (HTTP, SSH, DB)  │
│  - UDP: fast, no guarantees (DNS, video streaming, gaming)      │
│  - Connection state (SYN, ACK, FIN)                             │
│  - Reliability, flow control, congestion control                │
├─────────────────────────────────────────────────────────────────┤
│  Network Layer (IP)                                              │
│  - Routing decisions (which interface, which gateway)           │
│  - IP fragmentation for packets larger than MTU                 │
├─────────────────────────────────────────────────────────────────┤
│  Data Link Layer (Ethernet, WiFi)                               │
│  - Frame construction with source/destination MAC addresses     │
│  - ARP resolution (IP address to MAC address)                   │
├─────────────────────────────────────────────────────────────────┤
│  Physical Layer (NIC Driver)                                     │
│  - DMA to/from network card (bypasses CPU for efficiency)       │
└─────────────────────────────────────────────────────────────────┘

Socket Buffers

Data does not go straight from your application to the wire. It passes through kernel buffers at each layer. These buffers are where performance tuning happens — and where problems hide.
Send path:
Application ──▶ Socket send buffer ──▶ TCP ──▶ IP ──▶ NIC

                     │ If buffer is full, write() blocks (or returns EAGAIN in non-blocking mode)

Receive path:
NIC ──▶ Driver ──▶ IP ──▶ TCP ──▶ Socket recv buffer ──▶ Application

                                        │ If buffer is full: TCP advertises a zero window so the sender pauses; UDP datagrams are DROPPED
# View socket buffer sizes (in bytes)
# These are the maximum buffer sizes for receive and send
sysctl net.core.rmem_max    # Max receive buffer (default often 212992)
sysctl net.core.wmem_max    # Max send buffer

# View a summary of sockets by protocol and state
# A quick first check when you suspect network issues (for drop and error counters, use nstat or netstat -s)
ss -s
Production gotcha: If you see packet drops on a high-traffic server, the default socket buffer sizes may be too small. Increasing net.core.rmem_max and net.core.wmem_max to 16MB or more is common for servers handling thousands of concurrent connections. But larger buffers consume more memory per socket — it is a trade-off.
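
An application can inspect and request buffer sizes per socket with the SO_RCVBUF and SO_SNDBUF options. This is a sketch, assuming a system that allows creating an unconnected TCP socket; buffer_sizes is a hypothetical helper and 64 KiB is an arbitrary request.

```python
import socket


def buffer_sizes(requested=None):
    """Return (SO_RCVBUF, SO_SNDBUF) for a fresh TCP socket, optionally after a request."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        if requested is not None:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
        return (
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        )


print("defaults:     ", buffer_sizes())
print("asked for 64K:", buffer_sizes(64 * 1024))
```

On Linux the kernel doubles the requested value to leave room for bookkeeping and caps the request at net.core.rmem_max or net.core.wmem_max, so the number read back rarely equals the number asked for.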

Netfilter and iptables

Netfilter is the kernel’s packet filtering framework. It provides hooks at five points in the packet processing path where you can inspect, modify, or drop packets. iptables and its successor nftables are the user-space tools for configuring these hooks. Netfilter is also the foundation for much of Docker and Kubernetes networking: Docker port mappings and, with kube-proxy in its iptables mode, Service virtual IPs are implemented as Netfilter rules under the hood. NetworkPolicy enforcement depends on the CNI plugin, and some plugins use eBPF instead.
                              Network


                         ┌──────────────┐
                         │ PREROUTING   │──▶ DNAT (port forwarding)
                         └──────┬───────┘

                         ┌──────▼───────┐
                    ┌────│   Routing    │────┐
                    │    │   Decision   │    │
                    │    └──────────────┘    │
                    │                        │
             ┌──────▼───────┐         ┌──────▼───────┐
             │    INPUT     │         │   FORWARD    │
             │  (local)     │         │  (routing)   │
             └──────┬───────┘         └──────┬───────┘
                    │                        │
                    ▼                        │
              Local Process                  │
                    │                        │
                    ▼                        │
             ┌──────────────┐               │
             │   OUTPUT     │               │
             └──────┬───────┘               │
                    │                        │
                    └────────┬───────────────┘

                      ┌──────▼───────┐
                      │ POSTROUTING  │──▶ SNAT (masquerading)
                      └──────┬───────┘


                          Network
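
The diagram reduces to three possible paths through the hooks. The function below is a toy model for reasoning about which chains a rule must live in; it is not an interface to Netfilter, and its name and parameters are hypothetical.

```python
def hook_path(origin, destined_for_local=False):
    """Return the Netfilter hooks a packet crosses, in order.

    origin: "network" for a packet arriving on an interface,
            "local" for one generated by a local process.
    destined_for_local: only meaningful when origin is "network".
    """
    if origin == "local":
        return ["OUTPUT", "POSTROUTING"]
    if origin != "network":
        raise ValueError(f"unknown origin: {origin!r}")
    if destined_for_local:
        return ["PREROUTING", "INPUT"]
    return ["PREROUTING", "FORWARD", "POSTROUTING"]


print("incoming, for this host:", hook_path("network", destined_for_local=True))
print("incoming, routed onward:", hook_path("network"))
print("sent by a local process:", hook_path("local"))
```

The model shows why an INPUT rule never sees forwarded traffic: a packet routed onward to another host or a container goes through FORWARD instead.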

Interview Deep Dive Questions

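Answer: 1) The shell calls fork() to create a child process. 2) The child calls exec() to load the program. 3) The kernel loads the ELF binary and sets up its memory mappings; for a dynamically linked program it also maps the dynamic linker, which loads the shared libraries. 4) The kernel sets up the stack with argc, argv, and the environment. 5) Control transfers to the program entry point (usually _start, provided by the C runtime startup code). 6) _start calls __libc_start_main, which calls main(). 7) The program runs. 8) exit() is called and the kernel cleans up resources; the shell collects the exit status with wait().
Answer: Processes have separate address spaces, file descriptors, and resources. Threads share address space, heap, and file descriptors within a process, but have separate stacks and registers. In Linux, both are represented by a task_struct: threads share one mm_struct (memory) while processes have separate ones. Threads are cheaper to create and to switch between because they share an address space, but a bug in one thread (such as memory corruption) can take down the whole process.
Answer: A context switch happens when the kernel moves a CPU from running one task to another: 1) Save the current task's registers (on its kernel stack and in task_struct), 2) Select the next task to run (scheduler), 3) Switch the MMU context (page tables) if the next task uses a different address space, 4) Load the new task's registers, 5) Resume the new task at its saved instruction pointer. Context switches are expensive mostly through indirect costs: cold CPU caches and, when the address space changes, TLB misses. Switches between threads of the same process skip the address-space change.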
Answer: By default, Linux allows processes to allocate more virtual memory than physical RAM (overcommit). Actual physical pages are allocated on first access (demand paging). If system runs out of memory, OOM killer selects and kills processes. Controlled by vm.overcommit_memory: 0=heuristic, 1=always allow, 2=never overcommit.
Answer: Both are virtual filesystems - no actual disk storage. /proc exposes kernel data structures and process info (originated in Unix). /sys is newer, provides structured hardware/driver info (Linux 2.6+). /proc has accumulated cruft, /sys is more organized. Examples: /proc/meminfo, /proc/1234/status, /sys/class/net/eth0/address.
Answer: The Out-of-Memory killer is invoked when the system is critically low on memory. It calculates an oom_score for each process, based mainly on the memory the process uses (resident pages, swap, and page tables) and shifted by oom_score_adj. Older kernels also factored in runtime, nice value, and privilege. The process with the highest score is killed first. oom_score_adj (-1000 to 1000) is set in /proc/<pid>/oom_score_adj; -1000 exempts the process from the OOM killer (risky).

Exploring Internals Yourself

The best way to internalize these concepts is to poke around on a live system. Most commands below read from /proc or /sys, virtual filesystems that expose kernel internals as ordinary files. None of them changes system state.
# View tunable kernel parameters (there are hundreds -- networking, memory, scheduling)
sysctl -a | head -50

# View CPU details: model, cores, cache size, flags (like virtualization support)
cat /proc/cpuinfo

# View memory breakdown: total, free, cached, swap, buffers
# This is where free -h gets its data
cat /proc/meminfo

# View the virtual memory map of the process doing the reading (here cat itself; note the shared libraries)
cat /proc/self/maps

# Summarize system calls for a command -- reveals what the kernel is doing for you
strace -c ls

# View scheduling statistics for PID 1 (systemd)
cat /proc/1/sched

# View all listening network sockets and which process owns each one
ss -tulpn

# View file descriptors for the current shell (0=stdin, 1=stdout, 2=stderr, plus any open files)
# $$ expands to the shell's PID; /proc/self/fd would show the descriptors of ls instead
ls -la /proc/$$/fd

# View all mounted filesystems (including virtual ones like proc, sys, tmpfs)
cat /proc/mounts

# View loaded kernel modules (drivers for hardware, filesystems, networking features)
lsmod

Key Takeaways

  1. User space and kernel space are separated - for security and stability
  2. System calls are the bridge - only way to request kernel services
  3. Everything is a file - devices, processes, kernel data exposed as files
  4. Virtual memory provides isolation - each process has its own address space
  5. CFS scheduler ensures fairness - virtual runtime tracks CPU usage
  6. Page cache makes I/O fast - files cached in RAM automatically
  7. VFS abstracts filesystems - same interface for ext4, NFS, procfs
  8. Namespaces and cgroups enable containers - isolation and resource limits

Interview Deep-Dive

Strong Answer:
  • The JVM uses more than heap. Total memory includes heap, metaspace, thread stacks (~1MB per thread), JIT code cache, direct byte buffers, and native memory from libraries. A 2GB heap easily results in 3-4GB total.
  • The cgroup memory limit counts all process memory, not just heap. When total RSS exceeds the limit, the OOM killer terminates the process.
  • Fix: use -XX:MaxRAMPercentage=75.0 instead of -Xmx2g. This reserves 25% for non-heap usage. On Java 11+, -XX:+UseContainerSupport is enabled by default.
  • Diagnostic: jcmd <pid> VM.native_memory summary shows the JVM memory breakdown (the JVM must be started with -XX:NativeMemoryTracking=summary). cat /sys/fs/cgroup/memory/memory.usage_in_bytes shows cgroup-level usage on cgroup v1; on cgroup v2, read memory.current in the container's cgroup directory.
Follow-up: How does the OOM killer decide which process to kill? It calculates oom_score per process, based mainly on memory footprint (RSS, swap, page tables) and oom_score_adj (-1000 to 1000); older kernels also gave root-owned processes a small discount. Higher score = killed first. In Kubernetes, the kubelet sets oom_score_adj by QoS class: Guaranteed gets -997 (protected), BestEffort gets 1000 (killed first). Check with cat /proc/<pid>/oom_score.
Strong Answer:
  • fork() creates a copy of the calling process. The child gets a new PID but inherits memory mappings, file descriptors, and environment. Modern Linux uses copy-on-write: pages are shared until one process writes, making fork() fast even for large processes.
  • exec() replaces the current process image with a new program. Loads the ELF binary, sets up a fresh stack, jumps to the entry point. PID stays the same.
  • Separation gives you a window between fork and exec to configure the child: redirect file descriptors (how shell pipes work), change working directory, set environment variables, drop privileges. A single “create process” call would need a massive options struct.
  • This also enables fork-without-exec for pre-forking servers (Apache, PostgreSQL) where workers run the same code with shared file descriptors.
Follow-up: What is the difference between a zombie process and an orphan process? A zombie has exited but its parent has not called wait() to collect the exit status. It consumes a PID and a small amount of kernel memory but cannot be killed (it is already dead). An orphan’s parent has exited; the orphan is adopted by PID 1, which reaps it automatically when it exits. To clean up zombies: fix the parent to call wait(), or kill the parent (turning the zombies into orphans that init reaps).
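The window between fork() and exec() is easiest to see in a redirect such as command > file. The sketch below reproduces that pattern, assuming a POSIX system; the child re-runs the current Python interpreter (an arbitrary stand-in for any program), and run_redirected is a hypothetical helper name.

```python
import os
import sys
import tempfile


def run_redirected(code, out_path):
    """Run `python -c code` in a child with stdout redirected to out_path."""
    pid = os.fork()
    if pid == 0:
        # Child: this is the window between fork() and exec().
        try:
            fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            os.dup2(fd, 1)   # make the file become stdout (fd 1)
            os.close(fd)
            os.execv(sys.executable, [sys.executable, "-c", code])
        finally:
            os._exit(127)    # reached only if setup or exec failed
    # Parent: wait for the child and return its exit code.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


with tempfile.TemporaryDirectory() as directory:
    target = os.path.join(directory, "out.txt")
    exit_code = run_redirected("print('hello from child')", target)
    with open(target) as f:
        captured = f.read()

print(exit_code, repr(captured))
```

The new program never learns that its output was redirected: it inherits file descriptor 1 already pointing at the file, and its PID is the one fork() assigned.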
Strong Answer:
  • Start with vmstat 1 5. High r + low wa = CPU-bound. High wa or b = I/O-bound. Non-zero si/so = memory pressure (swapping).
  • CPU-bound: htop to find the hot process, then perf top -p <pid> for function-level profiling.
  • I/O-bound: iostat -xz 1 to identify the saturated disk (high await, %util above 80%). iotop for per-process I/O.
  • Neither: strace -c -p <pid> to see where time is spent. High time in futex() = lock contention. High time in epoll_wait() or recv() = waiting on network.
  • Network: ss -tn state established for connection counts. curl -w "DNS: %{time_namelookup} Connect: %{time_connect} Total: %{time_total}\n" -o /dev/null -s <URL> for timing breakdown.
Follow-up: What does high time in futex() indicate? futex() underlies userspace synchronization primitives (mutexes, condition variables). High futex time means threads are contending for locks. Diagnose with perf lock report for native code, jstack for Java thread dumps, or pprof blocking profiles for Go. The fix is usually reducing lock scope, using lock-free structures, or reducing parallelism.

Ready to master the command line? Next up: Linux Permissions where we will dive deep into users, groups, and access control.