If you love understanding how things actually work, this chapter is for you. If you just want to run commands and get things done, feel free to skip ahead. No judgment.
This chapter takes you beneath the surface of Linux. We will explore how the kernel manages processes, understand how system calls bridge user space and kernel space, and demystify the virtual filesystem. This knowledge is what transforms a Linux user into a Linux engineer.
strace is one of the most powerful debugging tools in Linux. It shows you exactly what a process is asking the kernel to do, in real time. When a program hangs, crashes, or behaves strangely, strace reveals what is happening at the system call level — which files it is trying to open, which network connections it is making, and where it is getting stuck.
# Trace system calls of an already-running process (attach to PID 1234)
strace -p 1234

# Trace a command from start to finish
strace ls -la

# Count and summarize system calls (great for finding where time is spent)
strace -c ls -la

# Example output:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 30.12    0.000142          71         2           getdents64
 21.28    0.000100          14         7           mmap
 15.96    0.000075          10         7           close

# Trace only file-related system calls (filter out the noise)
strace -e trace=file ls -la

# Trace only network-related calls (invaluable for debugging connection issues)
strace -e trace=network curl https://example.com
When to reach for strace: A process hangs and you do not know why — strace shows what system call it is blocked on. A config file is not being read — strace shows every open() call and which paths it tried. A network connection fails — strace shows the connect() call and the error code. It is the “X-ray vision” of Linux debugging.
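If you want to post-process the strace -c summary instead of reading it by eye, a minimal Python sketch might look like the following. The function name parse_strace_summary and the SAMPLE text are hypothetical; the sample reuses the three rows from the example output and assumes that column layout (the errors column is blank when no call failed).

```python
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 30.12    0.000142          71         2           getdents64
 21.28    0.000100          14         7           mmap
 15.96    0.000075          10         7           close
"""


def parse_strace_summary(text):
    """Parse `strace -c` summary rows into dicts, skipping headers and rules."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a percentage such as "30.12"; skip everything else.
        if len(fields) < 5 or not fields[0].replace(".", "", 1).isdigit():
            continue
        # The final "total" row is a sum, not a syscall.
        if fields[-1] == "total":
            continue
        rows.append({
            "pct_time": float(fields[0]),
            "seconds": float(fields[1]),
            "usecs_per_call": int(fields[2]),
            "calls": int(fields[3]),
            # With a blank errors column the row has only five fields.
            "errors": int(fields[4]) if len(fields) == 6 else 0,
            "syscall": fields[-1],
        })
    return rows


if __name__ == "__main__":
    rows = parse_strace_summary(SAMPLE)
    hottest = max(rows, key=lambda row: row["pct_time"])
    print(hottest["syscall"], hottest["pct_time"])
```

Sorting by pct_time or by calls gives a quick answer to "which system call dominates" when the summary is long.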
malloc(1024)
      │
      ▼
┌─────────────────────────────────────────────────────┐
│  Is there space in existing heap?                   │
│    Yes → Return pointer to free chunk               │
│    No  → Request more memory from kernel            │
│              │                                      │
│              ▼                                      │
│         brk() or mmap()                             │
│              │                                      │
│              ▼                                      │
│    Kernel allocates virtual pages                   │
│    (not physical yet!)                              │
│              │                                      │
│              ▼                                      │
│    First access triggers page fault                 │
│              │                                      │
│              ▼                                      │
│    Kernel allocates physical page                   │
└─────────────────────────────────────────────────────┘
This demand paging means you can allocate more virtual memory than physical RAM exists. It is like a restaurant that takes unlimited reservations but only sets a table when you actually show up.
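To watch demand paging happen, the following Python sketch maps anonymous memory, then touches one byte per page and compares the resident set size before and after. The helper names rss_kib and demand_paging_demo are hypothetical, and the RSS readings assume a Linux /proc; on other systems they come back as None.

```python
import mmap


def rss_kib():
    """Return this process's resident set size in KiB, or None without /proc."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None


def demand_paging_demo(size=64 * 1024 * 1024):
    """Map `size` bytes, touch every page, and report RSS at each stage."""
    before = rss_kib()
    # Anonymous private mapping: the kernel reserves virtual pages only.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    mapped = rss_kib()
    # Writing one byte per page triggers a page fault for each page.
    for offset in range(0, size, mmap.PAGESIZE):
        buf[offset] = 1
    touched = rss_kib()
    pages = size // mmap.PAGESIZE
    buf.close()
    return before, mapped, touched, pages


if __name__ == "__main__":
    before, mapped, touched, pages = demand_paging_demo()
    print("RSS before mmap:  ", before, "KiB")
    print("RSS after mmap:   ", mapped, "KiB")
    print("RSS after touches:", touched, "KiB", "for", pages, "pages")
```

On a typical Linux system the second reading should stay close to the first, and only the third should grow by roughly the size of the mapping.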
Production gotcha: The OOM Killer. Because Linux allows overcommit by default, the total virtual memory across all processes can exceed physical RAM. As long as not everyone actually uses their allocation, this works fine. But when too many processes touch too many pages at once and the system runs out of physical memory and swap, the kernel invokes the Out-of-Memory (OOM) Killer. It picks a victim mainly by how much memory each process is using, adjusted by its oom_score_adj, and kills it with SIGKILL: no graceful shutdown, no cleanup. You discover this in dmesg or /var/log/kern.log (the log path varies by distribution; journalctl -k works on systemd systems). To protect critical processes, set their oom_score_adj to a negative value (e.g., echo -500 > /proc/PID/oom_score_adj, which requires root). To exempt them completely, set it to -1000, but be cautious: if nothing can be killed, the system may hang entirely.
Understanding the network stack is essential for debugging connectivity problems, container networking, and performance tuning. When your application cannot connect to a database or your API latency spikes, this knowledge tells you where to look.
Think of the network stack like a mail-room in a large building. Your application writes a letter (data), the mail-room staff put it in an envelope (TCP/UDP), add a street address (IP), put it in a delivery truck (Ethernet frame), and the truck drives to the destination (physical network). Each layer handles one responsibility and passes the packet down.
Application: curl https://example.com
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│  Socket Layer (socket, bind, connect, send, recv)               │
│  - Your application interacts here via system calls            │
├─────────────────────────────────────────────────────────────────┤
│  Transport Layer (TCP, UDP)                                     │
│  - TCP: reliable, ordered, connection-oriented (HTTP, SSH, DB)  │
│  - UDP: fast, no guarantees (DNS, video streaming, gaming)      │
│  - Connection state (SYN, ACK, FIN)                             │
│  - Reliability, flow control, congestion control                │
├─────────────────────────────────────────────────────────────────┤
│  Network Layer (IP)                                             │
│  - Routing decisions (which interface, which gateway)           │
│  - IP fragmentation for packets larger than MTU                 │
├─────────────────────────────────────────────────────────────────┤
│  Data Link Layer (Ethernet, WiFi)                               │
│  - Frame construction with source/destination MAC addresses     │
│  - ARP resolution (IP address to MAC address)                   │
├─────────────────────────────────────────────────────────────────┤
│  Physical Layer (NIC Driver)                                    │
│  - DMA to/from network card (bypasses CPU for efficiency)       │
└─────────────────────────────────────────────────────────────────┘
Data does not go straight from your application to the wire. It passes through kernel buffers at each layer. These buffers are where performance tuning happens — and where problems hide.
Send path:
Application ──▶ Socket send buffer ──▶ TCP ──▶ IP ──▶ NIC
                        ▲
                        │
        If buffer is full, write() blocks
        (or returns EAGAIN in non-blocking mode)

Receive path:
NIC ──▶ Driver ──▶ IP ──▶ TCP ──▶ Socket recv buffer ──▶ Application
                                          ▲
                                          │
                  If buffer is full, incoming packets are DROPPED
# View socket buffer sizes (in bytes)
# These are the maximum buffer sizes for receive and send
sysctl net.core.rmem_max   # Max receive buffer (default often 212992)
sysctl net.core.wmem_max   # Max send buffer

# View socket statistics -- connection counts, drops, errors
# This is the first command to run when you suspect network issues
ss -s
Production gotcha: If you see packet drops on a high-traffic server, the default socket buffer sizes may be too small. Increasing net.core.rmem_max and net.core.wmem_max to 16MB or more is common for servers handling thousands of concurrent connections. But larger buffers consume more memory per socket — it is a trade-off.
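An application can also ask for a larger buffer on a single socket, and the kernel silently caps the request at net.core.rmem_max. The Python sketch below requests a receive buffer and reads back what was actually granted. The function name request_recv_buffer is hypothetical, and the exact numbers depend on your kernel settings; on Linux the kernel doubles the requested value to leave room for bookkeeping.

```python
import socket


def request_recv_buffer(requested_bytes):
    """Ask for a receive buffer size and return (default, granted) in bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Size the kernel assigned when the socket was created.
        default = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        # The request is capped by net.core.rmem_max for unprivileged processes.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return default, granted


if __name__ == "__main__":
    default, granted = request_recv_buffer(16 * 1024 * 1024)
    print("default receive buffer:", default)
    print("granted after request: ", granted)
```

If the granted value is far below what you asked for, the system-wide maximum is the limit, and raising it is a sysctl change rather than an application change.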
Netfilter is the kernel’s packet filtering framework. It provides hooks at five points in the packet processing path (PREROUTING, INPUT, FORWARD, OUTPUT, and POSTROUTING) where you can inspect, modify, or drop packets. iptables (and its successor nftables) is the user-space tool for configuring these hooks. This is also the foundation for Docker and Kubernetes networking: in their default configurations, port mappings and Service VIPs are implemented as Netfilter rules under the hood, as are NetworkPolicies with many CNI plugins.
Answer: 1) Shell calls fork() to create child process, 2) Child calls exec() to load program, 3) Kernel loads ELF binary (and, for dynamically linked programs, the dynamic linker), sets up memory mappings, 4) Kernel sets up stack with argc, argv, environment, 5) Control transfers to program entry point (usually _start, supplied by the C runtime startup code), 6) _start calls main() (via __libc_start_main() in glibc), 7) Program runs, 8) exit() called, kernel cleans up resources, and the shell collects the exit status with wait().
Explain the difference between processes and threads
Answer: Processes have separate address spaces, file descriptor tables, and resources. Threads share the address space, heap, and file descriptors within a process, but have separate stacks and registers. In Linux, both are represented by task_struct; threads share one mm_struct (memory) while processes each have their own. Threads are cheaper to create (no page tables to duplicate), but they also share failures: a memory-corrupting bug or crash in one thread takes down every thread in the process.
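A small experiment makes the shared-versus-separate address space distinction concrete. In this Python sketch (the function name thread_vs_fork is hypothetical, and it assumes a POSIX system where os.fork() is available), a thread's write to a list is visible to the parent, while a forked child's write is not.

```python
import os
import threading


def thread_vs_fork():
    """Return the counter as seen by the parent after a thread and after a fork."""
    counter = [0]

    def bump():
        counter[0] += 1

    # A thread shares the parent's address space, so its write is visible here.
    worker = threading.Thread(target=bump)
    worker.start()
    worker.join()
    after_thread = counter[0]

    # A forked child gets its own copy-on-write address space.
    pid = os.fork()
    if pid == 0:
        counter[0] += 100  # modifies only the child's private copy
        os._exit(0)
    os.waitpid(pid, 0)
    after_fork = counter[0]

    return after_thread, after_fork


if __name__ == "__main__":
    print(thread_vs_fork())  # prints (1, 1)
```

The thread's increment sticks, and the child's increment disappears with the child.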
What is a context switch?
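Answer: When the kernel switches the CPU from running one task to another: 1) Save the current task's registers in its task_struct, 2) Select the next task to run (scheduler), 3) Switch the MMU context by pointing the CPU at the next task's page tables (skipped when both tasks are threads sharing an address space), 4) Load the next task's registers from its task_struct, 5) Resume at the next task's saved instruction pointer. Context switches are expensive mainly because of indirect costs: cold CPU caches and, when the address space changes, TLB flushes (reduced on CPUs with tagged TLBs such as PCID or ASID).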
How does Linux handle memory overcommit?
Answer: By default, Linux allows processes to allocate more virtual memory than physical RAM (overcommit). Actual physical pages are allocated on first access (demand paging). If the system runs out of memory, the OOM killer selects and kills a process. Controlled by vm.overcommit_memory: 0=heuristic (reject only obviously excessive requests), 1=always allow, 2=strict accounting (refuse allocations beyond swap plus vm.overcommit_ratio percent of RAM).
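For mode 2, the ceiling the kernel enforces is reported as CommitLimit in /proc/meminfo. The arithmetic can be sketched in Python as below. The function name commit_limit_kib is hypothetical, the formula follows the kernel's overcommit accounting documentation, and it assumes vm.overcommit_kbytes is unset so that vm.overcommit_ratio (default 50) applies.

```python
def commit_limit_kib(mem_total_kib, swap_total_kib, overcommit_ratio=50, hugetlb_kib=0):
    """Estimate CommitLimit in KiB under vm.overcommit_memory=2."""
    # RAM reserved for huge pages is excluded before the ratio is applied.
    usable_ram = mem_total_kib - hugetlb_kib
    return usable_ram * overcommit_ratio // 100 + swap_total_kib


if __name__ == "__main__":
    # Hypothetical machine: 16 GiB of RAM and 2 GiB of swap, default ratio.
    limit = commit_limit_kib(16 * 1024 * 1024, 2 * 1024 * 1024)
    print(limit, "KiB")  # prints 10485760 KiB, that is 10 GiB
```

With the default ratio, a machine with little swap can refuse allocations long before RAM is actually full, which is why strict mode usually goes together with a tuned ratio.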
Explain the purpose of /proc and /sys
Answer: Both are virtual filesystems - no actual disk storage. /proc exposes kernel data structures and process info (originated in Unix). /sys is newer, provides structured hardware/driver info (Linux 2.6+). /proc has accumulated cruft, /sys is more organized. Examples: /proc/meminfo, /proc/1234/status, /sys/class/net/eth0/address.
What is the OOM killer and how does it work?
Answer: The Out-of-Memory killer is invoked when the system is critically low on memory. It calculates an oom_score for each process, driven mainly by the memory it uses (resident pages, swap, and page tables) and shifted by oom_score_adj; older kernels also weighed runtime, nice value, and privilege. Highest score gets killed first. oom_score_adj (-1000 to 1000) can be set in /proc/<pid>/oom_score_adj. -1000 makes the process unkillable by the OOM killer (risky).
The best way to internalize these concepts is to poke around on a live system. Most commands below read from /proc or /sys (virtual filesystems that expose kernel internals as ordinary files) or wrap that data in a friendlier tool. None of them changes system state.
# View tunable kernel parameters (there are hundreds -- networking, memory, scheduling)
sysctl -a | head -50

# View CPU details: model, cores, cache size, flags (like virtualization support)
cat /proc/cpuinfo

# View memory breakdown: total, free, cached, swap, buffers
# This is where free -h gets its data
cat /proc/meminfo

# View a process's virtual memory map (here "self" is cat itself; see all those shared libraries loaded?)
cat /proc/self/maps

# Summarize system calls for a command -- reveals what the kernel is doing for you
strace -c ls

# View scheduling statistics for PID 1 (systemd)
cat /proc/1/sched

# View all listening network sockets and which process owns each one
# (run with sudo to see the owners of other users' sockets)
ss -tulpn

# View file descriptors for the current shell (0=stdin, 1=stdout, 2=stderr, plus any open files)
# $$ is the shell's PID; /proc/self here would show the fds of ls instead
ls -la /proc/$$/fd

# View all mounted filesystems (including virtual ones like proc, sys, tmpfs)
cat /proc/mounts

# View loaded kernel modules (drivers for hardware, filesystems, networking features)
lsmod
A Java application in a Docker container is getting OOMKilled even though the JVM heap is set to 2GB and the container limit is 4GB. What is happening?
Strong Answer:
The JVM uses more than heap. Total memory includes heap, metaspace, thread stacks (~1MB per thread), JIT code cache, direct byte buffers, and native memory from libraries. A 2GB heap easily results in 3-4GB total.
The cgroup memory limit counts all process memory, not just heap. When total RSS exceeds the limit, the OOM killer terminates the process.
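Fix: size the heap relative to the container limit with -XX:MaxRAMPercentage instead of hard-coding -Xmx2g. 75.0 is a common starting point that leaves 25% for non-heap usage, but with a 4GB limit it yields a 3GB heap, so if a 2GB heap is already being killed, measure non-heap usage first (for example with -XX:NativeMemoryTracking=summary and jcmd <pid> VM.native_memory summary), then lower the percentage or raise the limit. On Java 10+ (and 8u191+), -XX:+UseContainerSupport is enabled by default.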
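The heap-versus-headroom arithmetic is worth doing explicitly before picking a percentage. A minimal Python sketch, with the hypothetical helper heap_budget_mib, for the 4GB container in this scenario:

```python
def heap_budget_mib(container_limit_mib, max_ram_percentage):
    """Return (max heap, remaining headroom) in MiB for a given MaxRAMPercentage."""
    heap = container_limit_mib * max_ram_percentage / 100
    # Headroom must cover metaspace, thread stacks, code cache, and native memory.
    return heap, container_limit_mib - heap


if __name__ == "__main__":
    print(heap_budget_mib(4096, 75.0))  # prints (3072.0, 1024.0)
    print(heap_budget_mib(4096, 50.0))  # prints (2048.0, 2048.0)
```

At 75% the JVM may grow the heap to 3 GiB and leave 1 GiB for everything else; at 50% the heap matches the original 2 GiB setting with 2 GiB of headroom.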
Follow-up: How does the OOM killer decide which process to kill? It calculates oom_score per process based mainly on memory usage (RSS, swap, and page tables) and oom_score_adj (-1000 to 1000); older kernels also gave root-owned processes a small discount. Higher score = killed first. In Kubernetes, the kubelet sets oom_score_adj by QoS class: Guaranteed gets -997 (protected), BestEffort gets 1000 (killed first), and Burstable falls in between. Check with cat /proc/<pid>/oom_score.
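To see which processes the kernel would currently pick first, you can rank them by oom_score. The Python sketch below walks a proc-style directory and sorts by score. The function name rank_by_oom_score and the proc_root parameter are hypothetical; the parameter exists only so the function can be pointed at a test directory instead of the real /proc.

```python
import os


def rank_by_oom_score(proc_root="/proc"):
    """Return (oom_score, oom_score_adj, pid) tuples, most killable first."""
    ranked = []
    for entry in os.listdir(proc_root):
        # Only numeric directory names are processes.
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "oom_score_adj")) as f:
                adj = int(f.read().strip())
        except (OSError, ValueError):
            continue  # the process exited or its files are unreadable
        ranked.append((score, adj, int(entry)))
    ranked.sort(reverse=True)
    return ranked


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for score, adj, pid in rank_by_oom_score()[:5]:
            print(f"pid={pid} oom_score={score} oom_score_adj={adj}")
```

Processes at the top of this list with a high oom_score_adj are the ones the kubelet or an administrator has marked as expendable.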
Explain fork() and exec() at the kernel level. Why are they two separate system calls?
Strong Answer:
fork() creates a copy of the calling process. The child gets a new PID but inherits memory mappings, file descriptors, and environment. Modern Linux uses copy-on-write: pages are shared until one process writes, making fork() fast even for large processes.
exec() replaces the current process image with a new program. Loads the ELF binary, sets up a fresh stack, jumps to the entry point. PID stays the same.
Separation gives you a window between fork and exec to configure the child: redirect file descriptors (how shell pipes work), change working directory, set environment variables, drop privileges. A single “create process” call would need a massive options struct.
This also enables fork-without-exec for pre-forking servers (Apache, PostgreSQL) where workers run the same code with shared file descriptors.
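The window between fork() and exec() is easiest to appreciate in code. The Python sketch below does what a shell does for a pipe: fork, redirect the child's stdout with dup2(), then exec the program. The function name run_and_capture is hypothetical, and the sketch assumes a POSIX system.

```python
import os
import sys


def run_and_capture(argv):
    """Fork, point the child's stdout at a pipe, exec argv; return (output, exit code)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this is the window between fork() and exec().
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)      # fd 1 (stdout) now refers to the pipe
            os.close(write_fd)
            os.execvp(argv[0], argv)  # replaces the process image; returns only on error
        finally:
            os._exit(127)             # reached only if exec failed

    # Parent: close the write end so read() sees EOF when the child exits.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)

    _, status = os.waitpid(pid, 0)    # reap the child and collect its status
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -os.WTERMSIG(status)
    return b"".join(chunks), code


if __name__ == "__main__":
    output, code = run_and_capture([sys.executable, "-c", "print('hi')"])
    print(output, code)
```

The program being run needs no knowledge of the pipe: it writes to fd 1 as usual, and the redirection was set up for it before exec() replaced the process image.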
Follow-up: What is the difference between a zombie process and an orphan process? A zombie has exited but its parent has not called wait() to collect the exit status. It consumes a PID and kernel memory but cannot be killed (it is already dead). An orphan’s parent has exited; it is adopted by PID 1, which reaps it automatically. To clean up zombies: fix the parent to call wait(), or kill the parent (making zombies into orphans that init reaps).
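You can create a zombie on purpose and watch it disappear when the parent reaps it. In the Python sketch below, the helper names process_state and make_and_reap_zombie are hypothetical; the state lookup assumes a Linux /proc and returns None elsewhere.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # The state field follows the closing parenthesis of the command name.
    return data.rsplit(")", 1)[1].split()[0]


def make_and_reap_zombie(exit_code=7, timeout=2.0):
    """Fork a child that exits at once; return (state before wait, exit code)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child exits immediately

    # Parent: do not call wait() yet, so the exited child lingers as a zombie.
    deadline = time.monotonic() + timeout
    state = process_state(pid)
    while state not in ("Z", None) and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(pid)

    # Reaping collects the exit status and releases the PID.
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)


if __name__ == "__main__":
    state, code = make_and_reap_zombie()
    print("state before wait():", state)
    print("exit code collected:", code)
```

On Linux the state before wait() should read Z, which is the same letter ps shows for defunct processes.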
Your application response time doubled overnight. Using strace and other tools, how do you identify whether the bottleneck is CPU, disk I/O, network, or locks?
Strong Answer:
Start with vmstat 1 5. High r + low wa = CPU-bound. High wa or b = I/O-bound. Non-zero si/so = memory pressure (swapping).
CPU-bound: htop to find the hot process, then perf top -p <pid> for function-level profiling.
I/O-bound: iostat -xz 1 to identify the saturated disk (high await, %util above 80%). iotop for per-process I/O.
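Neither: strace -c -w -f -p <pid> to see where time is spent (-w reports wall-clock time, so time spent blocked in a call is counted; -f includes all threads). High time in futex() = lock contention. High time in epoll_wait() or recv() = waiting on network.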
Network: ss -tn state established for connection counts. curl -w "DNS: %{time_namelookup} Connect: %{time_connect} Total: %{time_total}\n" -o /dev/null -s <URL> for timing breakdown.
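The first triage step can be written down as a small decision function. The Python sketch below encodes the vmstat rules from the first step of this answer; the function name classify_vmstat and every numeric threshold in it are illustrative assumptions, not established cutoffs, so tune them for your own systems.

```python
def classify_vmstat(r, b, wa, si, so, cpus):
    """Classify one vmstat sample as 'memory', 'io', 'cpu', or 'other'.

    r, b: runnable and blocked task counts; wa: I/O wait percent;
    si, so: swap-in and swap-out rates; cpus: number of CPU cores.
    """
    # Any swap traffic points at memory pressure, which also inflates wa.
    if si > 0 or so > 0:
        return "memory"
    # Assumed threshold: 20% I/O wait, or any task blocked on I/O.
    if wa >= 20 or b > 0:
        return "io"
    # Assumed threshold: more runnable tasks than cores with little I/O wait.
    if r > cpus and wa < 5:
        return "cpu"
    # Nothing obvious: look at locks and network next.
    return "other"


if __name__ == "__main__":
    print(classify_vmstat(r=12, b=0, wa=1, si=0, so=0, cpus=4))   # prints cpu
    print(classify_vmstat(r=1, b=3, wa=45, si=0, so=0, cpus=4))   # prints io
```

A single sample can mislead, so in practice you would apply this to several consecutive vmstat lines and skip the first one, which reports averages since boot.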
Follow-up: What does high time in futex() indicate? futex() underlies userspace synchronization primitives (mutexes, condition variables). High futex time means threads are contending for locks. Diagnose with perf lock report for native code, jstack for Java thread dumps, or pprof blocking profiles for Go. The fix is usually reducing lock scope, using lock-free structures, or reducing parallelism.
Ready to master the command line? Next up: Linux Permissions where we will dive deep into users, groups, and access control.