System Call Interface
System calls are the only legitimate way for user-space programs to request services from the kernel. Understanding syscalls deeply is essential for observability engineers who trace application behavior and infrastructure engineers who debug performance issues.
Interview Frequency: Very High (especially at observability companies)
Key Topics: syscall mechanism, vDSO, seccomp, overhead analysis
Time to Master: 12-14 hours
What Are System Calls?
System calls are the interface between user-space applications and the kernel.
Figure: System Calls in Linux
Understanding the Transition
When an application makes a system call, the CPU must transition from user mode (Ring 3) to kernel mode (Ring 0). This is a privileged operation that involves:
- Saving user context: All registers are saved so we can return to exactly where we left off
- Switching stacks: User stack → Kernel stack (each process has both)
- Changing privilege level: Ring 3 → Ring 0 (CPU enforces this)
- Executing kernel code: The actual syscall handler runs
- Returning to user mode: Restore context and switch back to Ring 3
x86-64 Syscall Mechanism
The SYSCALL Instruction
On x86-64, the syscall instruction is the fast path for entering the kernel.
Register Convention
| Register | Purpose |
|---|---|
| rax | Syscall number (input), return value (output) |
| rdi | Argument 1 |
| rsi | Argument 2 |
| rdx | Argument 3 |
| r10 | Argument 4 (not rcx, which the syscall instruction overwrites with the return address) |
| r8 | Argument 5 |
| r9 | Argument 6 |
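As a rough illustration of the convention in the table, the Python sketch below first lays out a syscall's arguments the way they would be assigned to registers, then issues a real getpid through libc's generic syscall() wrapper, which performs that register setup on the caller's behalf. The helper names register_layout and raw_getpid are made up for this example. The number 39 is getpid on x86-64 only, so on any other platform the sketch falls back to os.getpid().

```python
import ctypes
import os
import platform
import sys

# Argument registers in order, per the x86-64 Linux syscall convention.
ARG_REGISTERS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")
SYS_GETPID_X86_64 = 39  # getpid on x86-64; other architectures use other numbers


def register_layout(number, *args):
    """Return the register assignment for a syscall (illustrative helper)."""
    if len(args) > len(ARG_REGISTERS):
        raise ValueError("a syscall takes at most six arguments")
    layout = {"rax": number}
    layout.update(zip(ARG_REGISTERS, args))
    return layout


def raw_getpid():
    """Call getpid via libc's syscall() wrapper where the number is known."""
    if sys.platform != "linux" or platform.machine() != "x86_64":
        return os.getpid()  # fallback: number 39 is only valid on x86-64
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(SYS_GETPID_X86_64))


if __name__ == "__main__":
    # read(fd=3, buf=0x7FFD0000, count=4096): rax=0, rdi=3, rsi=buf, rdx=4096
    print(register_layout(0, 3, 0x7FFD0000, 4096))
    print(raw_getpid() == os.getpid())
```

Going through libc's wrapper rather than inline assembly keeps the example in one language; the wrapper itself is what moves the values into rax, rdi, and the other registers before executing the instruction.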
MSR Configuration
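The CPU needs to know where to jump on syscall. The kernel stores the entry address in a model-specific register (MSR_LSTAR) during CPU initialization.
Syscall Entry Point Deep Dive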
The syscall entry point is one of the most critical pieces of kernel code.
do_syscall_64 - The C Entry Point
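The dispatch step performed by the C entry point can be modeled in a few lines. This is a toy Python model under stated assumptions, not kernel code: the handlers sys_getpid and sys_add, their return values, and the table indices are all invented for illustration. It only shows the shape of the logic, which is a bounds check on the syscall number followed by an indexed call, with ENOSYS returned for unknown numbers.

```python
import errno


def sys_getpid():
    return 1234  # placeholder value; a real kernel returns the caller's PID


def sys_add(a, b):
    return a + b  # hypothetical handler, only here to show argument passing


# Index = syscall number, value = handler. The numbering here is arbitrary.
SYSCALL_TABLE = [sys_getpid, sys_add]


def do_syscall(number, *args):
    """Dispatch a syscall: bounds-check the number, then call the handler."""
    if 0 <= number < len(SYSCALL_TABLE):
        return SYSCALL_TABLE[number](*args)
    return -errno.ENOSYS  # numbers outside the table yield ENOSYS


if __name__ == "__main__":
    print(do_syscall(1, 2, 3))
    print(do_syscall(99))
```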
System Call Table
The syscall table maps syscall numbers to handler functions.
Finding Syscall Numbers
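One way to look up syscall numbers is to read the __NR_ definitions from the kernel's user-space headers. The sketch below parses text in that format. The function name parse_syscall_numbers is hypothetical, and the embedded sample is a four-line excerpt typed in by hand. On many x86-64 distributions the real header lives at a path such as /usr/include/asm/unistd_64.h, but that location is an assumption and varies between systems.

```python
import re

# Matches lines of the form: #define __NR_read 0
_DEFINE = re.compile(r"^#define\s+__NR_(\w+)\s+(\d+)\s*$")


def parse_syscall_numbers(header_text):
    """Map syscall names to numbers from '#define __NR_name N' lines."""
    numbers = {}
    for line in header_text.splitlines():
        match = _DEFINE.match(line.strip())
        if match:
            numbers[match.group(1)] = int(match.group(2))
    return numbers


# Hand-written excerpt in the header's format (x86-64 values).
SAMPLE_HEADER = """\
#define __NR_read 0
#define __NR_write 1
#define __NR_open 2
#define __NR_close 3
"""

if __name__ == "__main__":
    print(parse_syscall_numbers(SAMPLE_HEADER))
```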
vDSO - Virtual Dynamic Shared Object
The Time Query Problem
Before vDSO, getting the current time was surprisingly expensive because every call required a full transition into the kernel.
The problem: Applications call gettimeofday() or clock_gettime() millions of times per second:
- Web servers log every request with timestamps
- Databases track transaction times
- Profilers measure code execution
- Games render frames with timing
vDSO Functions
| Function | What it does |
|---|---|
| __vdso_clock_gettime | Get current time (most common) |
| __vdso_gettimeofday | Legacy time function |
| __vdso_time | Simple seconds since epoch |
| __vdso_getcpu | Get current CPU/NUMA node |
How vDSO Works
- Kernel maps a special page into every process
- Page contains code that can read kernel data (time, CPU)
- Kernel updates shared data (timekeeping) periodically
- User-space reads data without entering kernel
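The mapping described in the steps above is visible from user space: on Linux, a process's memory map lists a region labelled [vdso]. The sketch below parses text in the /proc/self/maps format to find it. The helper names are invented for this example, and the sample line is hand-written in the usual format rather than captured from a real process.

```python
def find_vdso(maps_text):
    """Return (start, end) addresses of the [vdso] mapping, or None."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            start, end = (int(part, 16) for part in fields[0].split("-"))
            return start, end
    return None


def vdso_of_current_process():
    """Look up this process's vDSO; returns None where /proc is unavailable."""
    try:
        with open("/proc/self/maps") as maps:
            return find_vdso(maps.read())
    except OSError:
        return None


# Hand-written sample line in the /proc/<pid>/maps format.
SAMPLE_MAPS = (
    "7ffd5a1f2000-7ffd5a1f4000 r-xp 00000000 00:00 0"
    "                          [vdso]\n"
)

if __name__ == "__main__":
    print(find_vdso(SAMPLE_MAPS))
    print(vdso_of_current_process())
```

The region is mapped readable and executable but not writable, which matches the idea that user space only reads what the kernel publishes there.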
vDSO Performance Impact
Interview Insight: “gettimeofday/clock_gettime are the most frequently called syscalls in many applications. vDSO makes them essentially free, which is why you rarely see them in syscall traces as performance problems.”
Syscall Overhead Analysis
Why Are Syscalls Expensive?
Compared with an ordinary function call, a syscall is expensive. Here’s why.
The fundamental problem: CPU privilege levels. User code runs in ring 3 (unprivileged), kernel code runs in ring 0 (privileged). Switching between them is expensive.
What makes it expensive:
- Mode switch: CPU must save all registers, switch stacks, change privilege level
- Security checks: Validate arguments, check permissions, run seccomp filters
- Cache pollution: Kernel code evicts user code from CPU caches
- Meltdown/Spectre mitigations: KPTI, the Meltdown mitigation, adds extra overhead (page table switching on every kernel entry and exit)
Cost Breakdown
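A quick back-of-the-envelope calculation shows why per-call cost matters at high call rates. The sketch below is plain arithmetic, and the function name is made up for this example. It uses the 200-500 ns per-syscall range quoted in this section's interview answers, which is a rough estimate rather than a measurement.

```python
def syscall_cpu_share(calls_per_second, cost_ns):
    """Fraction of one core spent on syscall overhead alone."""
    return calls_per_second * cost_ns / 1e9


if __name__ == "__main__":
    for cost_ns in (200, 500):
        share = syscall_cpu_share(1_000_000, cost_ns)
        print(f"1M syscalls/s at {cost_ns} ns each: {share:.0%} of a core")
```

At one million calls per second, the low end of that range already consumes a fifth of a core before any useful work is done in the handler.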
Measuring Syscall Overhead
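A minimal way to get a feel for syscall cost is to time a cheap call in a loop and compare it with an empty function call. The sketch below does this for os.getppid(), which enters the kernel on each call. The helper names are hypothetical, and because the Python interpreter adds its own per-call overhead, the result should be read as a loose upper bound; a tight loop in C would give a more faithful number.

```python
import os
import time


def ns_per_call(func, iterations):
    """Average wall-clock nanoseconds per call of func over a loop."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    return (time.perf_counter_ns() - start) / iterations


def noop():
    """Baseline: a Python-level call that never enters the kernel."""
    return None


def measure_getppid(iterations=200_000):
    """Compare a no-op call with os.getppid(), which issues a syscall."""
    return {
        "baseline_ns": ns_per_call(noop, iterations),
        "getppid_ns": ns_per_call(os.getppid, iterations),
    }


if __name__ == "__main__":
    result = measure_getppid()
    print(f"no-op call:   {result['baseline_ns']:.0f} ns")
    print(f"os.getppid(): {result['getppid_ns']:.0f} ns")
```

The difference between the two figures approximates the syscall's share, though it is noisy on a busy machine and varies with CPU, kernel version, and enabled mitigations.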
Spectre Mitigations Impact
After Spectre/Meltdown, syscall overhead increased.
seccomp - Syscall Filtering
The Container Security Problem
Containers provide isolation, but they share the same kernel. This creates a security risk.
The threat: A compromised container could:
- Use ptrace() to inspect other processes
- Use mount() to escape the container
- Use reboot() to crash the host
- Use kexec_load() to replace the kernel
- Use clock_settime() to break time-based security
seccomp Modes
| Mode | Description | Use Case |
|---|---|---|
| SECCOMP_MODE_STRICT | Allow only read, write, exit, sigreturn | Maximum security |
| SECCOMP_MODE_FILTER | BPF program filters syscalls | Container sandboxing |
How seccomp-BPF Works
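To make the filtering step concrete, the sketch below builds a small classic-BPF deny-list program and evaluates it with a tiny interpreter, without installing anything into the kernel. The opcode and return-value constants follow the Linux UAPI headers, and the syscall numbers in the usage lines are x86-64 values. The functions build_denylist, run_filter, and assemble are illustrative names, and the interpreter handles only the three instruction kinds this program uses.

```python
import struct

# Classic BPF opcodes used below (values from <linux/bpf_common.h>).
BPF_LD_W_ABS = 0x20  # A = 32-bit word at offset k in seccomp_data
BPF_JEQ_K = 0x15     # if A == k skip jt instructions, else skip jf
BPF_RET_K = 0x06     # return the constant k as the filter verdict

SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_KILL_PROCESS = 0x80000000
AUDIT_ARCH_X86_64 = 0xC000003E
EPERM = 1


def build_denylist(blocked_numbers):
    """Return (code, jt, jf, k) tuples that deny the given syscall numbers."""
    prog = [
        (BPF_LD_W_ABS, 0, 0, 4),               # A = seccomp_data.arch
        (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),  # expected arch: skip the kill
        (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        (BPF_LD_W_ABS, 0, 0, 0),               # A = seccomp_data.nr
    ]
    for number in blocked_numbers:
        prog.append((BPF_JEQ_K, 0, 1, number))  # no match: skip the deny
        prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))  # default verdict
    return prog


def run_filter(prog, number, arch=AUDIT_ARCH_X86_64):
    """Evaluate the program against the first 8 bytes of seccomp_data."""
    data = struct.pack("=iI", number, arch)  # int nr; __u32 arch
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = struct.unpack_from("=I", data, k)[0]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")


def assemble(prog):
    """Pack as struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


if __name__ == "__main__":
    program = build_denylist([101, 165])  # ptrace, mount on x86-64
    print(hex(run_filter(program, 0)))    # read
    print(hex(run_filter(program, 165)))  # mount
```

The architecture check comes first because syscall numbers differ between ABIs, so a filter that compares numbers without checking the architecture could be bypassed by issuing the call through another ABI.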
seccomp Example
seccomp in Containers
Docker uses seccomp to restrict container syscalls.
System Call Tracing
Essential skill for observability engineering.
strace - User-Space Tracer
strace Output Analysis
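When reading strace output, a useful first pass is to count calls and failures per syscall. The sketch below parses lines in the usual "name(args) = result" shape. The sample lines are hand-written for illustration in that format and are not captured output; the function name summarize and the regular expression are assumptions of this example and do not cover every form strace can emit (unfinished calls, signals, and so on are skipped).

```python
import re
from collections import Counter

# name(args) = result, where result is a hex value, an integer, or "?"
LINE = re.compile(
    r"^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>0x[0-9a-f]+|-?\d+|\?)"
)

# Hand-written lines in strace's format, for illustration only.
SAMPLE = r'''
openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
read(3, "web-01\n", 4096) = 7
read(3, "", 4096) = 0
close(3) = 0
openat(AT_FDCWD, "/missing", O_RDONLY) = -1 ENOENT (No such file or directory)
'''


def summarize(trace_text):
    """Return (calls, failures) Counters keyed by syscall name."""
    calls, failures = Counter(), Counter()
    for line in trace_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip blank lines and forms this sketch does not parse
        calls[match["name"]] += 1
        if match["ret"] == "-1":
            failures[match["name"]] += 1
    return calls, failures


if __name__ == "__main__":
    calls, failures = summarize(SAMPLE)
    for name, count in calls.most_common():
        print(f"{name:8s} calls={count} failed={failures[name]}")
```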
ltrace - Library Call Tracer
Kernel-Level Tracing
For production, use eBPF-based tracing (covered in Track 5).
Adding a Custom Syscall (Lab)
Understanding by implementation.
Step 1: Define the Syscall
Step 2: Implement the Handler
Step 3: Add to Syscall Table
Step 4: Test from User Space
Compatibility and ABI
32-bit Compatibility on 64-bit
Syscall Number Differences
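The same syscall usually has different numbers in the 32-bit (i386) and 64-bit (x86-64) ABIs, which is why tools and filters must know which ABI a call came from. The sketch below holds a small hand-picked subset of the two tables and translates between them. The helper names are invented for this example, and the table is far from complete.

```python
# A few well-known syscalls and their numbers in each ABI.
SYSCALL_NUMBERS = {
    "read":   {"i386": 3,  "x86_64": 0},
    "write":  {"i386": 4,  "x86_64": 1},
    "open":   {"i386": 5,  "x86_64": 2},
    "close":  {"i386": 6,  "x86_64": 3},
    "getpid": {"i386": 20, "x86_64": 39},
    "exit":   {"i386": 1,  "x86_64": 60},
}


def name_for(number, abi):
    """Return the syscall name for a number in the given ABI, or None."""
    for name, numbers in SYSCALL_NUMBERS.items():
        if numbers[abi] == number:
            return name
    return None


def translate(number, source_abi, target_abi):
    """Translate a syscall number between ABIs, or None if not in the table."""
    name = name_for(number, source_abi)
    return None if name is None else SYSCALL_NUMBERS[name][target_abi]


if __name__ == "__main__":
    # Number 1 means exit on i386 but write on x86-64.
    print(name_for(1, "i386"), name_for(1, "x86_64"))
    print(translate(4, "i386", "x86_64"))
```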
Lab Exercises
Lab 1: Syscall Overhead Measurement
Objective: Measure and compare syscall overhead
Lab 2: strace Deep Analysis
Objective: Analyze real application syscall patterns
Lab 3: seccomp Filter Implementation
Objective: Create a sandboxed execution environment
Interview Questions
Q1: Walk through what happens during a read() syscall
Answer:
- User space:
  - Application calls read(fd, buf, count)
  - libc sets up registers: rax=0 (SYS_read), rdi=fd, rsi=buf, rdx=count
  - Executes syscall instruction
- Kernel entry:
  - CPU switches to ring 0 and jumps to entry_SYSCALL_64
  - entry_SYSCALL_64 switches to the kernel stack and saves registers
  - do_syscall_64 looks up sys_read in syscall table
- Syscall handler (ksys_read):
  - Validates fd, gets struct file *
  - Calls file’s read operation (via file->f_op->read or file->f_op->read_iter)
  - For regular files: checks page cache, reads from disk if needed
  - Copies data to user buffer via copy_to_user
- Return:
  - Returns bytes read (or error)
  - syscall_exit_to_user_mode: check signals, scheduling
  - sysretq: return to user mode
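The user-visible side of this walkthrough can be observed with a pipe. In the sketch below, each os.read() call issues a read syscall and shows the three outcomes described above: a byte count with data, zero bytes at end of file, and an error when fd validation fails. The function name read_demo is made up for this example, and it assumes a single-threaded program so the closed descriptor number is not reused in between.

```python
import errno
import os


def read_demo(payload):
    """Return (data, at_eof, bad_fd_errno) from three read() calls."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, payload)
    os.close(write_fd)  # no writers left, so the reader sees EOF afterwards

    data = os.read(read_fd, 4096)    # read(fd, buf, count): the bytes copied
    at_eof = os.read(read_fd, 4096)  # zero bytes signals end of file
    os.close(read_fd)

    try:
        os.read(read_fd, 4096)       # fd is closed: validation fails
        bad_fd_errno = None
    except OSError as exc:
        bad_fd_errno = exc.errno     # the kernel's error, surfaced via errno
    return data, at_eof, bad_fd_errno


if __name__ == "__main__":
    data, at_eof, bad_fd_errno = read_demo(b"hello")
    print(data, at_eof, errno.errorcode.get(bad_fd_errno))
```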
Q2: How does vDSO improve performance? What syscalls use it?
Answer:
How it works:
- Kernel maps a special page into every process’s address space
- Page contains code that reads kernel-maintained data
- No mode switch needed: the code runs entirely in user space
Performance:
- Regular syscall: ~200-500 cycles (mode switch overhead)
- vDSO call: ~10-20 cycles (just a function call)
Syscalls that use vDSO:
- gettimeofday(), clock_gettime(): most important
- time(), getcpu()
Limitations:
- Only works for read-only data
- Kernel maintains shared data (timekeeping)
- Can’t be used for anything requiring kernel intervention
Q3: How does seccomp protect containers?
Answer:
Protection mechanism:
- BPF program runs on every syscall entry
- Blocks dangerous syscalls before they execute
- Defense in depth: even if code inside the container is compromised, the syscalls it can make remain limited
Commonly blocked syscalls:
- mount, umount: prevent filesystem manipulation
- reboot, kexec_load: prevent system disruption
- ptrace: prevent debugging/injection
- init_module, delete_module: prevent kernel modification
- clock_settime: prevent time manipulation
Example attacks:
- Container exploit tries ptrace to escape → blocked
- Malware tries kexec_load → blocked
- Process tries to load kernel module → blocked
Q4: What's the overhead of enabling seccomp?
Answer:
Overhead sources:
- BPF filter runs on every syscall entry
- Constant-time operations for simple filters
- More complex filters = higher overhead
Typical overhead:
- Simple whitelist: ~20-50 nanoseconds per syscall
- Complex filters with argument checking: 100-200 ns
In context:
- Syscalls already cost 200-500ns minimum
- 20-50ns is at most 25% additional overhead
- Security benefit outweighs cost
Optimization tips:
- Put common allowed syscalls first in filter
- Use SECCOMP_RET_ALLOW as default if mostly allowing
- Profile with perf to measure actual impact
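The relative-overhead claim in this answer is simple arithmetic on the quoted ranges, sketched below. The function name is hypothetical, and the inputs are the rough 20-50 ns filter cost and 200-500 ns syscall cost given above, not measured values.

```python
def overhead_range(filter_range_ns, syscall_range_ns):
    """Return (best, worst) filter cost as a fraction of syscall cost."""
    (filter_lo, filter_hi) = filter_range_ns
    (syscall_lo, syscall_hi) = syscall_range_ns
    best = filter_lo / syscall_hi   # cheapest filter on the slowest syscall
    worst = filter_hi / syscall_lo  # costliest filter on the fastest syscall
    return best, worst


if __name__ == "__main__":
    best, worst = overhead_range((20, 50), (200, 500))
    print(f"simple filter adds between {best:.0%} and {worst:.0%}")
```

The worst case pairs the slowest simple filter with the cheapest syscall, which is where the 25% ceiling comes from; for syscalls that do real I/O the relative cost is far smaller.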
Key Takeaways
Syscall Mechanism
SYSCALL instruction, register convention, and kernel entry path are fundamental knowledge
vDSO Optimization
Critical for understanding why some “syscalls” have nearly zero overhead
seccomp Security
BPF-based syscall filtering is a core layer of container security
Tracing Skills
strace and understanding syscall patterns are essential for debugging