System Call Interface
System calls are the only legitimate way for user-space programs to request services from the kernel. Understanding syscalls deeply is essential for observability engineers who trace application behavior and infrastructure engineers who debug performance issues.
Key Topics: syscall mechanism, vDSO, seccomp, overhead analysis
Time to Master: 12-14 hours
What Are System Calls?
System calls are the interface between user-space applications and the kernel.
Understanding the Transition
When an application makes a system call, the CPU must transition from user mode (Ring 3) to kernel mode (Ring 0). This is a privileged operation that involves:
- Saving user context: All registers are saved so we can return to exactly where we left off
- Switching stacks: User stack → Kernel stack (each process has both)
- Changing privilege level: Ring 3 → Ring 0 (CPU enforces this)
- Executing kernel code: The actual syscall handler runs
- Returning to user mode: Restore context and switch back to Ring 3
x86-64 Syscall Mechanism
The SYSCALL Instruction
On x86-64, the `syscall` instruction is the fast path for entering the kernel.
Register Convention
| Register | Purpose |
|---|---|
| `rax` | Syscall number (input), return value (output) |
| `rdi` | Argument 1 |
| `rsi` | Argument 2 |
| `rdx` | Argument 3 |
| `r10` | Argument 4 (not `rcx`, which the `syscall` instruction overwrites with the return address) |
| `r8` | Argument 5 |
| `r9` | Argument 6 |
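The register convention in the table can be sketched as a small model. This is not how libc is implemented; `marshal_syscall` is a hypothetical helper that only shows which value lands in which register, and the buffer address is made up for illustration.

```python
# Toy model of the x86-64 Linux syscall register convention.
# Argument registers, in order, for arguments 1 through 6.
ARG_REGISTERS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")


def marshal_syscall(number, *args):
    """Return the register assignment a libc wrapper would set up."""
    if len(args) > len(ARG_REGISTERS):
        raise ValueError("a syscall takes at most six arguments")
    regs = {"rax": number}  # rax carries the syscall number on entry
    regs.update(zip(ARG_REGISTERS, args))
    return regs


# write(1, buf, 14): SYS_write is 1 on x86-64; the buffer address is invented.
regs = marshal_syscall(1, 1, 0x7FFD_0000_1000, 14)
print(regs)
```

Only the registers that carry arguments are set; a three-argument call such as `write` never touches `r10`, `r8`, or `r9`.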
MSR Configuration
The CPU needs to know where to jump on `syscall`. The kernel stores the entry point address in the `MSR_LSTAR` model-specific register.
Syscall Entry Point Deep Dive
The syscall entry point is one of the most critical pieces of kernel code.
do_syscall_64 - The C Entry Point
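As a rough mental model, the C entry point bounds-checks the syscall number, indexes the syscall table, and returns `-ENOSYS` for numbers it does not know. The sketch below is a toy in Python, not kernel code: the two handlers are stand-ins, and the table holds only two entries.

```python
# Toy model of syscall dispatch; the real kernel does this in C.
ENOSYS = 38  # Linux errno value for "function not implemented"


def sys_read(fd, buf, count):
    return 0      # stand-in handler: pretend end-of-file


def sys_write(fd, buf, count):
    return count  # stand-in handler: pretend every byte was written


# Index = syscall number (read is 0 and write is 1 on x86-64).
SYS_CALL_TABLE = [sys_read, sys_write]


def do_syscall(nr, *args):
    """Dispatch by number, rejecting anything outside the table."""
    if 0 <= nr < len(SYS_CALL_TABLE):
        return SYS_CALL_TABLE[nr](*args)
    return -ENOSYS


result = do_syscall(1, 1, b"hi", 2)   # write(1, "hi", 2)
missing = do_syscall(9999)            # unknown syscall number
print(result, missing)
```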
System Call Table
The syscall table maps syscall numbers to handler functions.
Finding Syscall Numbers
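One way to look numbers up is to parse the `__NR_*` defines from a kernel UAPI header such as `asm/unistd_64.h`. The sketch below parses an inline excerpt rather than a real file, because the header's location varies between distributions; `parse_syscall_numbers` is a hypothetical helper.

```python
import re

# Matches lines such as "#define __NR_write 1".
NR_DEFINE = re.compile(r"^#define\s+__NR_(\w+)\s+(\d+)\s*$", re.MULTILINE)


def parse_syscall_numbers(header_text):
    """Map syscall names to numbers from the text of a unistd header."""
    return {name: int(num) for name, num in NR_DEFINE.findall(header_text)}


# Excerpt in the style of the x86-64 header (first four entries only).
SAMPLE = """\
#define __NR_read 0
#define __NR_write 1
#define __NR_open 2
#define __NR_close 3
"""

numbers = parse_syscall_numbers(SAMPLE)
print(numbers["write"])
```

To use it on a real system, pass the contents of the header file instead of `SAMPLE`.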
vDSO - Virtual Dynamic Shared Object
The Time Query Problem
Before vDSO, getting the current time was surprisingly expensive.
The problem: applications call `gettimeofday()` or `clock_gettime()` millions of times per second:
- Web servers log every request with timestamps
- Databases track transaction times
- Profilers measure code execution
- Games render frames with timing
vDSO Functions
| Function | What it does |
|---|---|
| `__vdso_clock_gettime` | Get current time (most common) |
| `__vdso_gettimeofday` | Legacy time function |
| `__vdso_time` | Simple seconds since epoch |
| `__vdso_getcpu` | Get current CPU/NUMA node |
How vDSO Works
- Kernel maps a special page into every process
- Page contains code that can read kernel data (time, CPU)
- Kernel updates shared data (timekeeping) periodically
- User-space reads data without entering kernel
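The steps above can be sketched as a toy computation: the kernel publishes a base time plus conversion coefficients, and user-space code turns a cycle counter reading into nanoseconds without entering the kernel. The field names below are illustrative, not the kernel's actual layout, and the sketch leaves out the synchronization a real implementation needs while the kernel updates the data.

```python
from dataclasses import dataclass


@dataclass
class TimeData:
    """Illustrative stand-in for the data page the kernel keeps updated."""
    base_ns: int     # time at the last kernel update, in nanoseconds
    cycle_last: int  # cycle counter value at the last update
    mult: int        # cycles-to-nanoseconds multiplier
    shift: int       # right shift applied after multiplying


def vdso_clock_gettime_ns(data, cycles_now):
    """Compute the current time from shared data plus a counter reading."""
    delta = cycles_now - data.cycle_last
    return data.base_ns + ((delta * data.mult) >> data.shift)


# Assume a 3 GHz counter, so one cycle is 1/3 ns: mult / 2**shift is about 1/3.
data = TimeData(base_ns=1_000_000_000, cycle_last=5_000,
                mult=(1 << 24) // 3, shift=24)

# 3,000,000 cycles after the last update is about 1 ms later.
now = vdso_clock_gettime_ns(data, cycles_now=5_000 + 3_000_000)
print(now - data.base_ns)
```

The multiply-and-shift form avoids a division on every call, which is why the coefficients are published instead of a frequency.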
vDSO Performance Impact
Syscall Overhead Analysis
Why Are Syscalls Expensive?
Syscalls are among the more expensive operations a user-space program performs routinely. Here’s why.
The fundamental problem: CPU privilege levels. User code runs in ring 3 (unprivileged), kernel code runs in ring 0 (privileged). Switching between them is expensive.
What makes it expensive:
- Mode switch: CPU must save all registers, switch stacks, change privilege level
- Security checks: Validate arguments, check permissions, run seccomp filters
- Cache pollution: Kernel code and data evict user code and data from CPU caches
- Spectre/Meltdown mitigations: KPTI (the Meltdown mitigation) switches page tables on every kernel entry and exit
Cost Breakdown
Measuring Syscall Overhead
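A rough way to see the cost from user space is to time a cheap real syscall against a no-op function call. The sketch below uses `os.getppid()` and an empty Python function as a baseline. Interpreter overhead dominates both numbers, so treat the difference as an indication only; a C benchmark or `perf` gives far more precise figures.

```python
import os
import time


def ns_per_call(func, iterations=200_000):
    """Average wall-clock nanoseconds per call of func()."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    return (time.perf_counter_ns() - start) / iterations


def noop():
    """Baseline: a Python call that never leaves user space."""
    return None


baseline = ns_per_call(noop)
with_syscall = ns_per_call(os.getppid)  # a real call into the kernel

print(f"no-op call:   {baseline:.0f} ns")
print(f"getppid call: {with_syscall:.0f} ns")
print(f"difference:   {with_syscall - baseline:.0f} ns")
```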
Spectre Mitigations Impact
After Spectre/Meltdown, syscall overhead increased.
seccomp - Syscall Filtering
The Container Security Problem
Containers provide isolation, but they share the same kernel. This creates a security risk.
The threat: a compromised container could:
- Use `ptrace()` to inspect other processes
- Use `mount()` to escape the container
- Use `reboot()` to crash the host
- Use `kexec_load()` to replace the kernel
- Use `clock_settime()` to break time-based security
seccomp Modes
| Mode | Description | Use Case |
|---|---|---|
| `SECCOMP_MODE_STRICT` | Allow only `read`, `write`, `_exit`, `sigreturn` | Maximum security |
| `SECCOMP_MODE_FILTER` | BPF program filters syscalls | Container sandboxing |
How seccomp-BPF Works
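Conceptually, a filter inspects the syscall number before the handler runs and returns an action. The toy model below captures only that decision logic in Python. A real filter is a BPF program installed in the kernel, and the rule list and action names here are simplified stand-ins.

```python
# Simplified action names; the kernel uses SECCOMP_RET_* constants.
ALLOW, ERRNO, KILL = "ALLOW", "ERRNO", "KILL"


def run_filter(rules, default_action, syscall_nr):
    """Check rules in order and return the first matching action."""
    for nr, action in rules:
        if nr == syscall_nr:
            return action
    return default_action


# x86-64 syscall numbers: read=0, write=1, exit=60, ptrace=101, mount=165.
allowlist = [(0, ALLOW), (1, ALLOW), (60, ALLOW)]

print(run_filter(allowlist, ERRNO, 1))    # write is on the allowlist
print(run_filter(allowlist, ERRNO, 165))  # mount falls through to the default
```

With an allowlist, anything not named falls through to the default action, so newly added kernel syscalls are denied without any change to the filter.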
seccomp Example
seccomp in Containers
Docker uses seccomp to restrict container syscalls.
System Call Tracing
Tracing system calls is an essential skill for observability engineering.
strace - User-Space Tracer
strace Output Analysis
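Each strace output line has the form `name(args) = return`, with the errno name appended on failure. A small parser can tally calls and errors from a captured trace. The sample lines below are illustrative rather than taken from a real run, and `summarize` is a hypothetical helper.

```python
import re
from collections import Counter

# name(args) = return [ERRNO (description)]
LINE = re.compile(
    r"^(?P<name>\w+)\(.*\)\s+=\s+"
    r"(?P<ret>0x[0-9a-f]+|-?\d+|\?)"
    r"(?:\s+(?P<errno>E[A-Z0-9]+))?"
)

SAMPLE = """\
openat(AT_FDCWD, "/etc/hostname", O_RDONLY|O_CLOEXEC) = 3
read(3, "devbox\\n", 4096) = 7
close(3) = 0
openat(AT_FDCWD, "/etc/missing", O_RDONLY) = -1 ENOENT (No such file or directory)
"""


def summarize(trace_text):
    """Count calls per syscall name and failures per errno name."""
    calls, errors = Counter(), Counter()
    for line in trace_text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skip signals, exit notices, unfinished calls
        calls[match["name"]] += 1
        if match["errno"]:
            errors[match["errno"]] += 1
    return calls, errors


calls, errors = summarize(SAMPLE)
print(calls.most_common())
print(errors.most_common())
```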
ltrace - Library Call Tracer
Kernel-Level Tracing
For production, prefer eBPF-based tracing (covered in Track 5), because ptrace-based tools such as strace slow the traced process considerably.
Adding a Custom Syscall (Lab)
Implementing a syscall yourself is a good way to understand the mechanism.
Step 1: Define the Syscall
Step 2: Implement the Handler
Step 3: Add to Syscall Table
Step 4: Test from User Space
Compatibility and ABI
32-bit Compatibility on 64-bit
Syscall Number Differences
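The same syscall has different numbers under the 64-bit and 32-bit x86 ABIs, so a number only means something together with its ABI. The lookup table below lists a few well-known entries; `syscall_number` is a hypothetical helper and the table is far from complete.

```python
# A few syscall numbers under the two x86 Linux ABIs.
SYSCALL_NUMBERS = {
    "read":   {"x86_64": 0,  "i386": 3},
    "write":  {"x86_64": 1,  "i386": 4},
    "open":   {"x86_64": 2,  "i386": 5},
    "close":  {"x86_64": 3,  "i386": 6},
    "getpid": {"x86_64": 39, "i386": 20},
    "exit":   {"x86_64": 60, "i386": 1},
}


def syscall_number(name, abi):
    """Look up the number of a syscall for a given ABI."""
    return SYSCALL_NUMBERS[name][abi]


# Number 1 means write on x86-64 but exit on i386.
print(syscall_number("write", "x86_64"), syscall_number("exit", "i386"))
```

This is why tracing and filtering tools must check the architecture as well as the syscall number.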
Lab Exercises
Lab 1: Syscall Overhead Measurement
Lab 2: strace Deep Analysis
Lab 3: seccomp Filter Implementation
Interview Questions
Q1: Walk through what happens during a read() syscall
1. User space:
   - Application calls `read(fd, buf, count)`
   - libc sets up registers: rax=0 (SYS_read), rdi=fd, rsi=buf, rdx=count
   - Executes the `syscall` instruction
2. Kernel entry:
   - CPU switches to ring 0 and jumps to the entry point; the entry code loads the kernel stack
   - `entry_SYSCALL_64` saves registers
   - `do_syscall_64` looks up `sys_read` in the syscall table
3. Syscall handler (`ksys_read`):
   - Validates fd, gets `struct file *`
   - Calls the file’s read operation (via `file->f_op->read` or `file->f_op->read_iter`)
   - For regular files: checks page cache, reads from disk if needed
   - Copies data to user buffer via `copy_to_user`
4. Return:
   - Returns bytes read (or error)
   - `syscall_exit_to_user_mode`: check signals, scheduling
   - `sysretq`: return to user mode
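From user space, the whole path collapses into a return value: the number of bytes read, or an error code. A minimal illustration using a pipe, with Python's `os.read` as a thin wrapper over the syscall:

```python
import errno
import os

# Create a pipe and put five bytes into it.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"hello")

# read(fd, buf, count): asks for up to 4096 bytes, gets what is available.
data = os.read(read_fd, 4096)

os.close(read_fd)
os.close(write_fd)

# Reading from a closed descriptor fails fd validation in the kernel.
try:
    os.read(read_fd, 1)
    error = None
except OSError as exc:
    error = exc.errno

print(data, errno.errorcode.get(error))
```

The short read (5 bytes for a 4096-byte request) is normal behavior, which is why callers must always check the returned count.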
Q2: How does vDSO improve performance? What syscalls use it?
How it works:
- The kernel maps a special page into every process’s address space
- The page contains code that reads kernel-maintained data
- No mode switch is needed; the code runs entirely in user space
Performance:
- Regular syscall: ~200-500 cycles (mode switch overhead)
- vDSO call: ~10-20 cycles (just a function call)
Syscalls served by the vDSO:
- `gettimeofday()`, `clock_gettime()` (most important)
- `time()`, `getcpu()`
Limitations:
- Only works for read-only data
- The kernel must maintain the shared data (timekeeping)
- Can’t be used for anything requiring kernel intervention
Q3: How does seccomp protect containers?
How it works:
- A BPF program runs on every syscall entry
- It blocks dangerous syscalls before they execute
- Defense in depth: even if a process inside the container is compromised, the syscalls it can make stay limited
Commonly blocked syscalls:
- `mount`, `umount`: prevent filesystem manipulation
- `reboot`, `kexec_load`: prevent system disruption
- `ptrace`: prevent debugging/injection
- `init_module`, `delete_module`: prevent kernel modification
- `clock_settime`: prevent time manipulation
Example scenarios:
- Container exploit tries `ptrace` to escape → blocked
- Malware tries `kexec_load` → blocked
- Process tries to load kernel module → blocked
Q4: What's the overhead of enabling seccomp?
Where the cost comes from:
- The BPF filter runs on every syscall entry
- Simple filters use constant-time operations
- More complex filters mean higher overhead
Typical overhead:
- Simple whitelist: ~20-50 nanoseconds per syscall
- Complex filters with argument checking: 100-200 ns
Why it is acceptable:
- Syscalls already cost 200-500 ns minimum
- 20-50 ns is <25% additional overhead
- The security benefit outweighs the cost
Optimization tips:
- Put common allowed syscalls first in the filter
- Use SECCOMP_RET_ALLOW as the default if mostly allowing
- Profile with `perf` to measure actual impact
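The advice to put common syscalls first can be checked with a little arithmetic. If a filter compares the syscall number against its rules one by one, the average number of comparisons depends on rule order. The sketch below assumes a purely linear filter and an invented call mix; real filters may be structured differently.

```python
def average_comparisons(order, frequency):
    """Average rules checked per syscall for a linear filter.

    order: syscall names in the order the filter tests them.
    frequency: how often each syscall is made (any consistent unit).
    """
    total = sum(frequency.values())
    cost = 0
    for position, name in enumerate(order, start=1):
        cost += position * frequency.get(name, 0)
    return cost / total


# Invented call mix per 1000 syscalls, for illustration only.
frequency = {"read": 500, "write": 300, "futex": 150, "openat": 50}

hot_first = ["read", "write", "futex", "openat"]
cold_first = list(reversed(hot_first))

print(average_comparisons(hot_first, frequency))
print(average_comparisons(cold_first, frequency))
```

With this mix, testing the hot syscalls first nearly halves the average number of comparisons compared with the reverse order.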
Key Takeaways
Syscall Mechanism
vDSO Optimization
seccomp Security
Tracing Skills
Interview Deep-Dive
You are profiling a high-throughput service and discover it makes 500,000 gettimeofday() calls per second. Explain why this is not actually a performance problem on modern Linux, and describe the kernel mechanism that makes it fast.
- On modern Linux, `gettimeofday()` and `clock_gettime()` do not actually enter the kernel. They are served by the vDSO (virtual Dynamic Shared Object), which is a small shared library that the kernel maps into every process’s address space during `execve()`. The vDSO contains code that reads time data from a shared memory page that the kernel updates on each timer tick (typically every 1-4ms).
- The mechanism works as follows: the kernel maintains a `vsyscall_gtod_data` structure (named `vdso_data` in newer kernels) in a page mapped read-only into user space. This structure contains the current time, the clocksource coefficients (TSC multiplier and shift), and the last update timestamp. The vDSO code reads the TSC register directly (via `rdtsc` or `rdtscp`), applies the coefficients to compute the current time, and returns, all without any privilege transition.
- A regular syscall costs 200-500 CPU cycles (mode switch, register save/restore, Spectre mitigations). A vDSO call costs 10-20 cycles (just a function call and a few multiplications). At 500,000 calls per second, that is 5-10 million cycles per second for vDSO versus 100-250 million for real syscalls, or roughly 0.2-0.3% of a 3 GHz core versus 3-8%. This is why you rarely see `gettimeofday` as a performance bottleneck in strace output, and also why strace itself cannot see vDSO calls (they never enter the kernel).
- The vDSO only works when the kernel can provide sufficient information for user-space time computation. It falls back to a real syscall when: the clocksource is not TSC-based (for example, HPET or ACPI PM timer, which require MMIO reads that need kernel privileges), when `CLOCK_PROCESS_CPUTIME_ID` or `CLOCK_THREAD_CPUTIME_ID` are requested (these require reading per-task scheduling data), or when the clock is `CLOCK_TAI` on some kernel versions. You can verify which calls use vDSO by checking whether they appear in strace output: if they do not appear, they are being handled by vDSO.
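A complementary check is to confirm that the vDSO is mapped at all by looking for the `[vdso]` entry in `/proc/<pid>/maps`. The sketch below parses maps text. The sample addresses are made up, and `find_vdso` is a hypothetical helper; on a Linux system you could pass it the contents of `/proc/self/maps`.

```python
def find_vdso(maps_text):
    """Return (start, end) of the [vdso] mapping, or None if absent."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            start, end = (int(part, 16) for part in fields[0].split("-"))
            return start, end
    return None


# Illustrative excerpt in the /proc/<pid>/maps format; addresses are invented.
SAMPLE_MAPS = """\
7ffd3a5f2000-7ffd3a613000 rw-p 00000000 00:00 0                          [stack]
7ffd3a7c4000-7ffd3a7c8000 r--p 00000000 00:00 0                          [vvar]
7ffd3a7c8000-7ffd3a7ca000 r-xp 00000000 00:00 0                          [vdso]
"""

mapping = find_vdso(SAMPLE_MAPS)
if mapping is not None:
    start, end = mapping
    print(f"vDSO mapped at {start:#x}, {end - start} bytes")
```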
A container escape vulnerability has been reported that exploits a race condition in a specific syscall. Explain how seccomp-BPF would protect against this, and discuss the limitations of seccomp as a security boundary.
- Seccomp-BPF filters run before the syscall handler executes. When a container runtime (like runc) starts a container, it installs a BPF filter program via `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...)`. This filter receives the syscall number and arguments as input and returns an action (ALLOW, KILL, ERRNO, TRACE, LOG). If the vulnerable syscall is blocked by the filter, the exploit never reaches the buggy kernel code: the filter returns EPERM or kills the process before the syscall handler is even invoked.
- Docker’s default seccomp profile blocks approximately 44 of the 300+ syscalls, including dangerous ones like `kexec_load`, `mount`, `ptrace`, `init_module`, `delete_module`, and `clock_settime`. This reduces the kernel’s attack surface significantly.
- However, seccomp has important limitations. First, it can only filter on syscall number and the first six arguments. It cannot dereference pointers, so it cannot inspect the contents of buffers or filenames passed to syscalls. Second, the filter is set once and cannot be relaxed (only tightened), following the principle of least privilege. Third, seccomp cannot protect against kernel vulnerabilities in allowed syscalls: if the exploit is in `read()` or `write()`, which must be allowed for the container to function, seccomp cannot help. Finally, seccomp does not protect against hardware-level attacks like Spectre that bypass the syscall interface entirely.
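The pointer limitation follows from what the filter is given to work with. The kernel passes it a fixed-size `struct seccomp_data` holding the syscall number, architecture, instruction pointer, and six raw 64-bit argument values. The `ctypes` mirror below shows that layout; the pathname address is made up, and the point is that the filter sees only the address, never the string behind it.

```python
import ctypes


class SeccompData(ctypes.Structure):
    """Mirror of the kernel's struct seccomp_data."""
    _fields_ = [
        ("nr", ctypes.c_int),                      # syscall number
        ("arch", ctypes.c_uint32),                 # AUDIT_ARCH_* value
        ("instruction_pointer", ctypes.c_uint64),  # caller's instruction pointer
        ("args", ctypes.c_uint64 * 6),             # raw argument register values
    ]


# openat is syscall 257 on x86-64; 0xC000003E is AUDIT_ARCH_X86_64.
data = SeccompData(nr=257, arch=0xC000003E)
data.args[1] = 0x7FFD_0000_2000  # pathname pointer (invented address)

print(ctypes.sizeof(SeccompData))  # total bytes visible to the filter
print(hex(data.args[1]))           # an address, not the filename itself
```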
- Each seccomp filter is a BPF program that runs linearly: the kernel evaluates instructions sequentially for every syscall. A simple allowlist of 20 syscalls might take 20-50 nanoseconds per syscall. A complex filter with argument checking on dozens of syscalls could take 100-200 nanoseconds. Since syscalls already cost 200-500 nanoseconds minimum, a well-designed filter adds less than 25% overhead. For production, I would start with Docker’s default profile, then use `strace -c` to identify which syscalls the service actually uses, and build a tight allowlist. I would put the most frequently called syscalls (read, write, futex, epoll_wait) first in the filter to minimize average evaluation time, and set the default action to SCMP_ACT_ERRNO rather than SCMP_ACT_KILL to avoid silent process deaths during development.
Trace the complete journey of a write(fd, buf, 4096) call from the moment user-space code executes it to the point where data reaches the storage device. Include every privilege transition and major kernel function.
- In user space, `write()` is a libc wrapper that sets up registers per the x86-64 ABI: `rax=1` (SYS_write), `rdi=fd`, `rsi=buf`, `rdx=4096`, then executes the `syscall` instruction.
- The CPU saves RIP and RFLAGS into RCX and R11, loads the kernel entry point from MSR_LSTAR, switches to ring 0, and jumps to `entry_SYSCALL_64` in assembly. This code executes `swapgs` to gain access to per-CPU data, loads the kernel stack pointer from it, saves all user registers onto the kernel stack as a `pt_regs` structure, then calls `do_syscall_64()`.
- `do_syscall_64()` looks up `sys_call_table[1]` (write), which dispatches to `ksys_write()`. This function calls `fdget_pos()` to convert the integer fd to a `struct file *` pointer and acquire the file position lock. It then calls `vfs_write()`, which checks permissions and calls `file->f_op->write_iter()`, the filesystem-specific write function.
- For ext4 buffered writes, `ext4_file_write_iter()` calls `generic_perform_write()`, which finds or creates pages in the page cache (`address_space`), copies the 4096 bytes from user space into the page cache page via `copy_from_user()`, and marks the page dirty. The write call returns to user space at this point: the data is in page cache but not on disk.
- Later, the `writeback` kernel thread (the modern replacement for `pdflush`) wakes up and calls `ext4_writepages()`, which allocates disk blocks, creates `bio` structures describing the I/O, and submits them to the block layer via `submit_bio()`. The block layer’s scheduler (mq-deadline, kyber, or none) may reorder or merge the request, then dispatches it to the device driver, which programs DMA to transfer the page cache data directly to the storage device. The device raises an interrupt on completion, and the `bio` end_io callback marks the page clean.
- After the buffered write returns, data is only in volatile page cache, so a power failure loses it. Calling `fsync(fd)` after the write forces the kernel to flush all dirty pages for that file to disk and wait for the device to confirm they are on persistent storage. Specifically, `fsync()` calls `vfs_fsync()`, which invokes `file->f_op->fsync()` (`ext4_sync_file`), which flushes dirty data pages, writes the inode metadata, and issues a cache flush command to the drive (SYNCHRONIZE CACHE for SCSI/SAS, the Flush command for NVMe). Only after the drive confirms the flush is `fsync()` allowed to return. Note that even fsync does not guarantee safety against drive firmware bugs that falsely acknowledge writes, which is why enterprise drives with power-loss-protected write caches exist.
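The difference between "written" and "durable" is visible in user-space code as two separate calls. A minimal sketch using a temporary file, with Python's `os.write` and `os.fsync` as thin wrappers over the syscalls:

```python
import os
import tempfile

payload = b"x" * 4096

fd, path = tempfile.mkstemp()
try:
    # write() returns once the data has been accepted by the kernel,
    # typically into the page cache; it may not be on disk yet.
    written = os.write(fd, payload)

    # fsync() blocks until the kernel reports the file's dirty data
    # and metadata as flushed to the storage device.
    os.fsync(fd)

    size = os.fstat(fd).st_size
finally:
    os.close(fd)
    os.unlink(path)

print(written, size)
```

A program that needs durability, such as a database commit path, has to pay for the `fsync()`; the fast return of `write()` alone says nothing about persistence.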