
System Call Interface

System calls are the only legitimate way for user-space programs to request services from the kernel. Understanding syscalls deeply is essential for observability engineers who trace application behavior and infrastructure engineers who debug performance issues.
Interview Frequency: Very High (especially at observability companies)
Key Topics: syscall mechanism, vDSO, seccomp, overhead analysis
Time to Master: 12-14 hours

What Are System Calls?

System calls are the interface between user-space applications and the kernel:


System Call Transition
┌─────────────────────────────────────────────────────────────────────────────┐
│                     USER SPACE TO KERNEL TRANSITION                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   USER SPACE                                                                 │
│   ──────────                                                                │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  Application                                                         │   │
│   │       ↓                                                              │   │
│   │  libc wrapper (e.g., read())                                        │   │
│   │       ↓                                                              │   │
│   │  Set up registers: syscall number, arguments                        │   │
│   │       ↓                                                              │   │
│   │  SYSCALL instruction                                                │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│   ═══════════════════════════╪═══════════════════════════════════════════   │
│                              │ CPU switches to ring 0                        │
│   ═══════════════════════════╪═══════════════════════════════════════════   │
│                              ↓                                               │
│   KERNEL SPACE                                                               │
│   ────────────                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  entry_SYSCALL_64 (arch/x86/entry/entry_64.S)                       │   │
│   │       ↓                                                              │   │
│   │  Save user registers, switch to kernel stack                        │   │
│   │       ↓                                                              │   │
│   │  Look up syscall in sys_call_table                                  │   │
│   │       ↓                                                              │   │
│   │  Call syscall handler (e.g., ksys_read)                            │   │
│   │       ↓                                                              │   │
│   │  Return to user space via SYSRET/IRET                               │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Understanding the Transition

When an application makes a system call, the CPU must transition from user mode (Ring 3) to kernel mode (Ring 0). This is a privileged operation that involves:
  1. Saving user context: All registers are saved so we can return to exactly where we left off
  2. Switching stacks: User stack → Kernel stack (each process has both)
  3. Changing privilege level: Ring 3 → Ring 0 (CPU enforces this)
  4. Executing kernel code: The actual syscall handler runs
  5. Returning to user mode: Restore context and switch back to Ring 3
This transition is expensive (200-500 CPU cycles) because of security checks, context switching, and cache effects. Understanding this overhead is crucial for writing performant systems.

x86-64 Syscall Mechanism

The SYSCALL Instruction

On x86-64, the syscall instruction is the fast path for entering the kernel:
; User space syscall invocation
mov    rax, 0       ; syscall number (0 = read)
mov    rdi, 0       ; arg1: fd (stdin)
mov    rsi, buffer  ; arg2: buffer pointer
mov    rdx, 100     ; arg3: count
syscall             ; Enter kernel
; Return value in rax

Register Convention

Register  Purpose
rax       Syscall number (input), return value (output)
rdi       Argument 1
rsi       Argument 2
rdx       Argument 3
r10       Argument 4 (not rcx, which the syscall instruction clobbers)
r8        Argument 5
r9        Argument 6

MSR Configuration

The CPU needs to know where to jump on syscall:
// These MSRs are set during boot:
// MSR_LSTAR (0xC0000082) - Syscall entry point address
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

// MSR_STAR - Segment selectors for syscall/sysret
// MSR_SYSCALL_MASK - Flags to clear on syscall

Syscall Entry Point Deep Dive

The syscall entry point is one of the most critical pieces of kernel code:
// Simplified from arch/x86/entry/entry_64.S
SYM_CODE_START(entry_SYSCALL_64)
    swapgs                      // Switch GS base to kernel per-CPU area
    // Stash user RSP in per-CPU storage, then load the kernel stack
    movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
    
    // Push user registers onto kernel stack
    pushq   $__USER_DS          // SS
    pushq   PER_CPU_VAR(...)    // RSP
    pushq   %r11                // RFLAGS (saved by syscall)
    pushq   $__USER_CS          // CS
    pushq   %rcx                // RIP (saved by syscall)
    
    // Save remaining registers
    PUSH_AND_CLEAR_REGS
    
    // Call C handler
    movq    %rsp, %rdi          // pt_regs pointer
    call    do_syscall_64
    
    // Return to user space
    ...
    sysretq
SYM_CODE_END(entry_SYSCALL_64)

do_syscall_64 - The C Entry Point

// arch/x86/entry/common.c
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
    add_random_kstack_offset();  // Security: randomize stack
    nr = syscall_enter_from_user_mode(regs, nr);
    
    if (likely(nr < NR_syscalls)) {
        nr = array_index_nospec(nr, NR_syscalls);  // Spectre mitigation
        regs->ax = sys_call_table[nr](regs);      // Call handler!
    }
    
    syscall_exit_to_user_mode(regs);
}

System Call Table

The syscall table maps syscall numbers to handler functions:
// arch/x86/entry/syscall_64.c
const sys_call_ptr_t sys_call_table[] = {
    [0]   = __x64_sys_read,
    [1]   = __x64_sys_write,
    [2]   = __x64_sys_open,
    [3]   = __x64_sys_close,
    [9]   = __x64_sys_mmap,
    [56]  = __x64_sys_clone,
    [57]  = __x64_sys_fork,
    [59]  = __x64_sys_execve,
    [60]  = __x64_sys_exit,
    ...
};

Finding Syscall Numbers

# Method 1: Header files
grep -E "^#define __NR_" /usr/include/asm/unistd_64.h | head -20

# Method 2: ausyscall tool
ausyscall --dump | head -20

# Method 3: Preprocess the C headers
printf '#include <sys/syscall.h>\nSYS_read SYS_write\n' | gcc -E - | tail -1
# 0 1   (on x86-64)

vDSO - Virtual Dynamic Shared Object

The Time Query Problem

Before vDSO, getting the current time was surprisingly expensive.

The problem: Applications call gettimeofday() or clock_gettime() millions of times per second:
  • Web servers log every request with timestamps
  • Databases track transaction times
  • Profilers measure code execution
  • Games render frames with timing
The cost: Each call was a full syscall (~200-500 cycles of overhead) just to read a number that the kernel updates periodically anyway.

The insight: Time is read-only data that changes slowly (on the order of milliseconds). Why pay for a mode switch to read it?

The solution: vDSO maps kernel code and data into user space, so applications read the time directly from memory with no syscall at all.

vDSO is a kernel optimization that provides certain syscalls without entering kernel mode:
┌─────────────────────────────────────────────────────────────────────────────┐
│                            vDSO MECHANISM                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Without vDSO (slow):                                                       │
│   ┌────────────┐     syscall     ┌────────────┐                             │
│   │ User Space │ ──────────────→ │   Kernel   │                             │
│   │            │ ←────────────── │            │                             │
│   └────────────┘     return      └────────────┘                             │
│        ~100-200 cycles overhead                                              │
│                                                                              │
│   With vDSO (fast):                                                          │
│   ┌────────────┐  function call  ┌────────────┐                             │
│   │ User Space │ ──────────────→ │   vDSO     │  (still user space!)        │
│   │            │ ←────────────── │  (shared)  │                             │
│   └────────────┘     return      └────────────┘                             │
│        ~10-20 cycles (no mode switch)                                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

vDSO Functions

Function                What it does
__vdso_clock_gettime    Get current time (most common)
__vdso_gettimeofday     Legacy time function
__vdso_time             Simple seconds since epoch
__vdso_getcpu           Get current CPU/NUMA node

How vDSO Works

  1. Kernel maps a special page into every process
  2. Page contains code that can read kernel data (time, CPU)
  3. Kernel updates shared data (timekeeping) periodically
  4. User-space reads data without entering kernel
# Find vDSO mapping
cat /proc/self/maps | grep vdso
# 7fff12345000-7fff12346000 r-xp 00000000 00:00 0 [vdso]

# Dump vDSO contents
dd if=/proc/self/mem bs=4096 skip=$((0x7fff12345)) count=1 2>/dev/null | xxd | head

vDSO Performance Impact

// Benchmark: clock_gettime performance
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts, start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    // This path uses the vDSO (~20ns per call), no kernel entry
    for (int i = 0; i < 1000000; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                (end.tv_nsec - start.tv_nsec);
    printf("clock_gettime: %.1f ns/call\n", ns / 1e6);
    return 0;
}
Interview Insight: “gettimeofday/clock_gettime are the most frequently called syscalls in many applications. vDSO makes them essentially free, which is why you rarely see them in syscall traces as performance problems.”

Syscall Overhead Analysis

Why Are Syscalls Expensive?

Syscalls are among the most expensive operations you can perform from user space. Here's why.

The fundamental problem: CPU privilege levels. User code runs in ring 3 (unprivileged), kernel code runs in ring 0 (privileged), and switching between them is costly.

What makes it expensive:
  1. Mode switch: CPU must save all registers, switch stacks, change privilege level
  2. Security checks: Validate arguments, check permissions, run seccomp filters
  3. Cache pollution: Kernel code evicts user code from CPU caches
  4. Spectre mitigations: KPTI adds extra overhead (page table switching)
Real-world impact: A program doing 100,000 syscalls/second spends 2-5% of its CPU time on syscall overhead alone, before doing any actual work.

Understanding syscall overhead is crucial for performance:

Cost Breakdown

┌─────────────────────────────────────────────────────────────────────────────┐
│                      SYSCALL OVERHEAD BREAKDOWN                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. Mode Switch (~50-100 cycles)                                            │
│     - SYSCALL instruction                                                    │
│     - Save user registers                                                    │
│     - Load kernel stack                                                      │
│                                                                              │
│  2. Kernel Entry Work (~50-100 cycles)                                      │
│     - Entry tracing (if enabled)                                            │
│     - Audit logging (if enabled)                                            │
│     - seccomp check (if enabled)                                            │
│                                                                              │
│  3. Actual Syscall Work (varies)                                            │
│     - Read: depends on data availability                                    │
│     - Write: depends on buffer state                                        │
│     - Memory operations: page faults, allocation                            │
│                                                                              │
│  4. Return Path (~50-100 cycles)                                            │
│     - Signal checking                                                        │
│     - Rescheduling if needed                                                │
│     - Exit tracing                                                          │
│     - SYSRET/IRET                                                           │
│                                                                              │
│  Minimum overhead: ~200-500 cycles (trivial syscall)                        │
│  Typical overhead: 1000+ cycles (with I/O)                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Measuring Syscall Overhead

// Measure getpid() overhead (does almost nothing in the kernel)
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

long measure_syscall_overhead(int iterations) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int i = 0; i < iterations; i++) {
        syscall(SYS_getpid);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    // Account for tv_sec as well; nanoseconds alone can wrap negative
    long ns = (end.tv_sec - start.tv_sec) * 1000000000L +
              (end.tv_nsec - start.tv_nsec);
    return ns / iterations;
}
// Typical result: ~200-400ns per syscall

Spectre Mitigations Impact

After Spectre/Meltdown, syscall overhead increased:
# Check for mitigations
cat /sys/devices/system/cpu/vulnerabilities/*

# Mitigations that affect syscall performance:
# - KPTI (Kernel Page Table Isolation): ~100-400 cycles
# - Retpoline: ~10-50 cycles on indirect calls
# - IBRS: ~30-100 cycles

seccomp - Syscall Filtering

The Container Security Problem

Containers provide isolation, but they share the same kernel, and that creates a security risk.

The threat: A compromised container could:
  • Use ptrace() to inspect other processes
  • Use mount() to escape the container
  • Use reboot() to crash the host
  • Use kexec_load() to replace the kernel
  • Use clock_settime() to break time-based security
The challenge: Containers need some syscalls to function, but nowhere near all ~300+ of them.

The solution: seccomp-BPF filters syscalls before they execute. Even if an attacker gains code execution inside a container, dangerous syscalls are blocked at the kernel level.

seccomp-BPF allows filtering syscalls for security:

seccomp Modes

Mode                  Description                               Use Case
SECCOMP_MODE_STRICT   Allow only read, write, _exit, sigreturn  Maximum security
SECCOMP_MODE_FILTER   BPF program filters syscalls              Container sandboxing

How seccomp-BPF Works

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SECCOMP-BPF FILTERING                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Syscall Entry                                                              │
│        ↓                                                                     │
│   ┌─────────────────────────────────────────────────────────────────┐       │
│   │  seccomp BPF filter (runs before syscall)                       │       │
│   │                                                                  │       │
│   │  Input: struct seccomp_data {                                   │       │
│   │           int   nr;          // syscall number                  │       │
│   │           __u32 arch;        // architecture                    │       │
│   │           __u64 instruction_pointer;                            │       │
│   │           __u64 args[6];     // syscall arguments               │       │
│   │         };                                                       │       │
│   │                                                                  │       │
│   │  Output: action                                                  │       │
│   │    SECCOMP_RET_ALLOW  - Allow syscall                           │       │
│   │    SECCOMP_RET_KILL   - Kill process                            │       │
│   │    SECCOMP_RET_ERRNO  - Return error                            │       │
│   │    SECCOMP_RET_TRACE  - Notify tracer (ptrace)                  │       │
│   │    SECCOMP_RET_LOG    - Allow but log                           │       │
│   └─────────────────────────────────────────────────────────────────┘       │
│        ↓                                                                     │
│   If allowed: proceed to syscall handler                                    │
│   If denied: return error or kill process                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

seccomp Example

#include <seccomp.h>

void apply_seccomp_filter() {
    // Create filter context (default: allow all)
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    
    // Block specific syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(ptrace), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(kexec_load), 0);
    
    // Apply filter
    seccomp_load(ctx);
    seccomp_release(ctx);
}

seccomp in Containers

Docker uses seccomp to restrict container syscalls:
# View Docker's default seccomp profile
docker info --format '{{ .SecurityOptions }}'

# Run container with custom seccomp profile
docker run --security-opt seccomp=profile.json myimage

# Run without seccomp (less secure)
docker run --security-opt seccomp=unconfined myimage

System Call Tracing

Essential skill for observability engineering:

strace - User-Space Tracer

# Basic syscall tracing
strace ls

# Trace specific syscalls
strace -e read,write cat /etc/passwd

# With timing information
strace -T ls

# Count syscalls
strace -c ls

# Trace child processes too
strace -f ./multi_threaded_app

# Output to file
strace -o trace.log ls

# Attach to running process
strace -p 1234

strace Output Analysis

$ strace -T cat /etc/passwd 2>&1 | head -20
execve("/usr/bin/cat", ["cat", "/etc/passwd"], ...) = 0 <0.000892>
brk(NULL)                               = 0x5555557c3000 <0.000012>
mmap(NULL, 8192, PROT_READ|PROT_WRITE, ...) = 0x7ffff7fc5000 <0.000023>
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT <0.000015>
openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3 <0.000021>
fstat(3, {st_mode=S_IFREG|0644, st_size=2773, ...}) = 0 <0.000011>
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 131072) = 2773 <0.000018>
write(1, "root:x:0:0:root:/root:/bin/bash\n"..., 2773) = 2773 <0.000042>
close(3)                                = 0 <0.000010>

ltrace - Library Call Tracer

# Trace library calls
ltrace ls

# Trace specific library
ltrace -l libc.so.6 ls

Kernel-Level Tracing

For production, use eBPF-based tracing (covered in Track 5):
# Using bpftrace
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { 
    printf("%s opened %s\n", comm, str(args->filename)); 
}'

# Using perf
sudo perf trace ls

Adding a Custom Syscall (Lab)

Understanding by implementation:

Step 1: Define the Syscall

// Add to include/linux/syscalls.h
asmlinkage long sys_hello(const char __user *name);

Step 2: Implement the Handler

// Create kernel/hello.c
#include <linux/syscalls.h>
#include <linux/uaccess.h>

SYSCALL_DEFINE1(hello, const char __user *, name)
{
    char buf[64];

    // strncpy_from_user stops at the NUL terminator; copy_from_user
    // would fault if fewer than sizeof(buf)-1 bytes were mapped
    if (strncpy_from_user(buf, name, sizeof(buf) - 1) < 0)
        return -EFAULT;

    buf[sizeof(buf) - 1] = '\0';
    pr_info("Hello, %s! (from syscall)\n", buf);

    return 0;
}

Step 3: Add to Syscall Table

// arch/x86/entry/syscalls/syscall_64.tbl
# Add line:
451    common    hello    sys_hello

Step 4: Test from User Space

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SYS_hello 451

int main() {
    long ret = syscall(SYS_hello, "World");
    if (ret < 0)
        perror("hello syscall (is the kernel patched?)");
    return 0;
}

Compatibility and ABI

32-bit Compatibility on 64-bit

// Different syscall tables for different ABIs
// arch/x86/entry/syscall_64.c - 64-bit
// arch/x86/entry/syscall_32.c - 32-bit compat

// Check ABI in seccomp
struct seccomp_data {
    __u32 arch;  // AUDIT_ARCH_X86_64 or AUDIT_ARCH_I386
    ...
};

Syscall Number Differences

# Same syscall, different numbers
# 64-bit: read = 0, write = 1
# 32-bit: read = 3, write = 4

# View 32-bit syscall numbers
cat /usr/include/asm/unistd_32.h

Lab Exercises

Lab 1 - Objective: Measure and compare syscall overhead
// syscall_bench.c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ITERATIONS 10000000

double measure(void (*func)(void)) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) func();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) * 1e9 + 
           (end.tv_nsec - start.tv_nsec);
}

void test_getpid() { syscall(SYS_getpid); }
void test_gettid() { syscall(SYS_gettid); }
void test_getuid() { syscall(SYS_getuid); }

int main() {
    printf("getpid: %.1f ns\n", measure(test_getpid) / ITERATIONS);
    printf("gettid: %.1f ns\n", measure(test_gettid) / ITERATIONS);
    printf("getuid: %.1f ns\n", measure(test_getuid) / ITERATIONS);
    return 0;
}
gcc -O2 syscall_bench.c -o syscall_bench
./syscall_bench
Lab 2 - Objective: Analyze real application syscall patterns
# 1. Compare static vs dynamic linking
strace -c /bin/ls -la
strace -c /bin/busybox ls -la

# 2. Analyze a web server
strace -c -f python3 -m http.server 8000 &
curl http://localhost:8000/
kill %1

# 3. Find slowest syscalls
strace -T -o trace.log dd if=/dev/zero of=/tmp/test bs=1M count=100
sort -t'<' -k2 -n trace.log | tail -20
Lab 3 - Objective: Create a sandboxed execution environment
// sandbox.c
#include <stdio.h>
#include <stdlib.h>
#include <seccomp.h>
#include <unistd.h>
#include <sys/wait.h>

void apply_sandbox() {
    // Default: deny with EPERM. (SCMP_ACT_KILL would deliver SIGSYS on the
    // first blocked syscall, so the printf below would never run.)
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    
    // Allow only essential syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    
    seccomp_load(ctx);
    seccomp_release(ctx);
}

int main() {
    printf("Before sandbox\n");
    apply_sandbox();
    printf("After sandbox\n");
    
    // This will fail (open not allowed)
    FILE *f = fopen("/etc/passwd", "r");
    if (!f) printf("fopen blocked by seccomp!\n");
    
    return 0;
}
gcc sandbox.c -o sandbox -lseccomp
./sandbox

Interview Questions

Q: Walk through exactly what happens when a program calls read(fd, buf, count).

Answer:
  1. User space:
    • Application calls read(fd, buf, count)
    • libc sets up registers: rax=0 (SYS_read), rdi=fd, rsi=buf, rdx=count
    • Executes syscall instruction
  2. Kernel entry:
    • CPU switches to ring 0, loads kernel stack
    • entry_SYSCALL_64 saves registers
    • do_syscall_64 looks up sys_read in syscall table
  3. Syscall handler (ksys_read):
    • Validates fd, gets struct file *
    • Calls file’s read operation (via file->f_op->read)
    • For regular files: checks page cache, reads from disk if needed
    • Copies data to user buffer via copy_to_user
  4. Return:
    • Returns bytes read (or error)
    • syscall_exit_to_user_mode: check signals, scheduling
    • sysretq: return to user mode
Q: What is the vDSO, and why does it matter for performance?

Answer:

How it works:
  • Kernel maps a special page into every process’s address space
  • Page contains code that reads kernel-maintained data
  • No mode switch needed — runs entirely in user space
Performance improvement:
  • Regular syscall: ~200-500 cycles (mode switch overhead)
  • vDSO call: ~10-20 cycles (just a function call)
Functions provided:
  • gettimeofday(), clock_gettime() — most important
  • time(), getcpu()
Why limited:
  • Only works for read-only data
  • Kernel maintains shared data (timekeeping)
  • Can’t be used for anything requiring kernel intervention
Impact: Applications doing millions of time queries (monitoring, logging) would be ~10-50x slower without vDSO.
Q: How does seccomp protect containers, and which syscalls are typically blocked?

Answer:

Protection mechanism:
  • BPF program runs on every syscall entry
  • Blocks dangerous syscalls before they execute
  • Defense in depth — even if container escapes, syscalls limited
Commonly blocked syscalls:
  • mount, umount — prevent filesystem manipulation
  • reboot, kexec_load — prevent system disruption
  • ptrace — prevent debugging/injection
  • init_module, delete_module — prevent kernel modification
  • clock_settime — prevent time manipulation
Docker default profile: Blocks ~44 syscalls out of ~300+.

Example attack prevention:
  • Container exploit tries ptrace to escape → blocked
  • Malware tries kexec_load → blocked
  • Process tries to load kernel module → blocked
Q: What overhead does seccomp filtering add, and why is it acceptable?

Answer:

Overhead sources:
  • BPF filter runs on every syscall entry
  • Constant-time operations for simple filters
  • More complex filters = higher overhead
Typical overhead:
  • Simple whitelist: ~20-50 nanoseconds per syscall
  • Complex filters with argument checking: 100-200 ns
Why it’s acceptable:
  • Syscalls already cost 200-500ns minimum
  • 20-50ns is <25% additional overhead
  • Security benefit outweighs cost
Optimization tips:
  • Put common allowed syscalls first in filter
  • Use SECCOMP_RET_ALLOW as default if mostly allowing
  • Profile with perf to measure actual impact

Key Takeaways

Syscall Mechanism

SYSCALL instruction, register convention, and kernel entry path are fundamental knowledge

vDSO Optimization

Critical for understanding why some “syscalls” have nearly zero overhead

seccomp Security

BPF-based syscall filtering is the foundation of container security

Tracing Skills

strace and understanding syscall patterns are essential for debugging

Next: Kernel Data Structures →