System Call Flow - User space to kernel transition

System Call Interface

System calls are the only legitimate way for user-space programs to request services from the kernel. Understanding syscalls deeply is essential for observability engineers who trace application behavior and infrastructure engineers who debug performance issues.
Interview Frequency: Very High (especially at observability companies)
Key Topics: syscall mechanism, vDSO, seccomp, overhead analysis
Time to Master: 12-14 hours

What Are System Calls?

System calls are the interface between user-space applications and the kernel:

System Calls in Linux

System Call Transition
┌─────────────────────────────────────────────────────────────────────────────┐
│                     USER SPACE TO KERNEL TRANSITION                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   USER SPACE                                                                 │
│   ──────────                                                                │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  Application                                                         │   │
│   │       ↓                                                              │   │
│   │  libc wrapper (e.g., read())                                        │   │
│   │       ↓                                                              │   │
│   │  Set up registers: syscall number, arguments                        │   │
│   │       ↓                                                              │   │
│   │  SYSCALL instruction                                                │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│   ═══════════════════════════╪═══════════════════════════════════════════   │
│                              │ CPU switches to ring 0                        │
│   ═══════════════════════════╪═══════════════════════════════════════════   │
│                              ↓                                               │
│   KERNEL SPACE                                                               │
│   ────────────                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  entry_SYSCALL_64 (arch/x86/entry/entry_64.S)                       │   │
│   │       ↓                                                              │   │
│   │  Save user registers, switch to kernel stack                        │   │
│   │       ↓                                                              │   │
│   │  Look up syscall in sys_call_table                                  │   │
│   │       ↓                                                              │   │
│   │  Call syscall handler (e.g., ksys_read)                            │   │
│   │       ↓                                                              │   │
│   │  Return to user space via SYSRET/IRET                               │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Understanding the Transition

When an application makes a system call, the CPU must transition from user mode (Ring 3) to kernel mode (Ring 0). This is a privileged operation that involves:
  1. Saving user context: All registers are saved so we can return to exactly where we left off
  2. Switching stacks: User stack → kernel stack (each thread has both)
  3. Changing privilege level: Ring 3 → Ring 0 (CPU enforces this)
  4. Executing kernel code: The actual syscall handler runs
  5. Returning to user mode: Restore context and switch back to Ring 3
This transition is expensive (roughly 200-500 CPU cycles) because of the mode switch itself, entry and exit checks, and cache effects. Understanding this overhead is crucial for writing performant systems.

x86-64 Syscall Mechanism

The SYSCALL Instruction

On x86-64, the syscall instruction is the fast path for entering the kernel:
; User space syscall invocation
mov    rax, 0       ; syscall number (0 = read)
mov    rdi, 0       ; arg1: fd (stdin)
mov    rsi, buffer  ; arg2: buffer pointer
mov    rdx, 100     ; arg3: count
syscall             ; Enter kernel
; Return value in rax

Register Convention

Register    Purpose
rax         Syscall number (input), return value (output)
rdi         Argument 1
rsi         Argument 2
rdx         Argument 3
r10         Argument 4 (not rcx, which the syscall instruction overwrites)
r8          Argument 5
r9          Argument 6
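
The register convention can be exercised from C with a few lines of inline assembly. The following is a minimal sketch, assuming an x86-64 Linux target and GCC or Clang; the helper names `raw_syscall3` and `raw_write_stdout` are hypothetical and not part of any library. It loads rax, rdi, rsi, and rdx as the table describes and declares rcx and r11 as clobbered, since the syscall instruction overwrites them.

```c
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <unistd.h>

/* Hypothetical helper: issue a syscall with up to three arguments
 * directly, without the libc wrapper (x86-64 only). */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;
    __asm__ volatile(
        "syscall"
        : "=a"(ret)                            /* rax: return value      */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)   /* rax, rdi, rsi, rdx     */
        : "rcx", "r11", "memory");             /* overwritten by SYSCALL */
    return ret;
}

/* Hypothetical helper: write len bytes of msg to stdout (fd 1). */
long raw_write_stdout(const char *msg, long len)
{
    return raw_syscall3(SYS_write, 1, (long)msg, len);
}
```

Calling `raw_syscall3(SYS_getpid, 0, 0, 0)` should return the same value as `getpid()`. Unlike the libc wrapper, a raw call returns a negative errno value on failure instead of setting `errno`.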

MSR Configuration

The CPU needs to know where to jump on syscall:
// These MSRs are set during boot:
// MSR_LSTAR (0xC0000082) - Syscall entry point address
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

// MSR_STAR - Segment selectors for syscall/sysret
// MSR_SYSCALL_MASK - Flags to clear on syscall

Syscall Entry Point Deep Dive

The syscall entry point is one of the most critical pieces of kernel code:
// Simplified from arch/x86/entry/entry_64.S
SYM_CODE_START(entry_SYSCALL_64)
    // On entry RSP still points at the user stack: stash it, then load the kernel stack
    swapgs                      // Switch GS base to kernel
    movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
    
    // Push user registers onto kernel stack
    pushq   $__USER_DS          // SS
    pushq   PER_CPU_VAR(...)    // RSP
    pushq   %r11                // RFLAGS (saved by syscall)
    pushq   $__USER_CS          // CS
    pushq   %rcx                // RIP (saved by syscall)
    
    // Save remaining registers
    PUSH_AND_CLEAR_REGS
    
    // Call C handler
    movq    %rsp, %rdi          // pt_regs pointer
    call    do_syscall_64
    
    // Return to user space
    ...
    sysretq
SYM_CODE_END(entry_SYSCALL_64)

do_syscall_64 - The C Entry Point

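// Simplified from arch/x86/entry/common.c (details vary by kernel version)
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
    unsigned int unr;

    add_random_kstack_offset();  // Security: randomize kernel stack offset
    unr = syscall_enter_from_user_mode(regs, nr);  // ptrace, seccomp, audit hooks
    
    if (likely(unr < NR_syscalls)) {  // unsigned compare also rejects negative numbers
        unr = array_index_nospec(unr, NR_syscalls);  // Spectre v1 mitigation
        regs->ax = sys_call_table[unr](regs);        // Call handler!
    }
    // Otherwise regs->ax keeps the -ENOSYS value preset by the entry code
    
    syscall_exit_to_user_mode(regs);  // signals, rescheduling, return-path work
}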

System Call Table

The syscall table maps syscall numbers to handler functions:
// arch/x86/entry/syscall_64.c
const sys_call_ptr_t sys_call_table[] = {
    [0]   = __x64_sys_read,
    [1]   = __x64_sys_write,
    [2]   = __x64_sys_open,
    [3]   = __x64_sys_close,
    [9]   = __x64_sys_mmap,
    [56]  = __x64_sys_clone,
    [57]  = __x64_sys_fork,
    [59]  = __x64_sys_execve,
    [60]  = __x64_sys_exit,
    ...
};
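
The dispatch pattern (bounds check, then an indexed call through a table of function pointers) can be modelled in ordinary user-space C. This is a toy sketch, not kernel code: `demo_table`, `demo_dispatch`, and the two handlers are hypothetical names, and unused slots are left NULL here, whereas the kernel fills them with a stub that returns -ENOSYS.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*demo_handler_t)(long a, long b);

static long demo_add(long a, long b) { return a + b; }
static long demo_mul(long a, long b) { return a * b; }

/* Sparse table built with designated initializers, in the style of
 * sys_call_table. Slots 1 and 2 are implicitly NULL. */
static const demo_handler_t demo_table[] = {
    [0] = demo_add,
    [3] = demo_mul,
};

#define DEMO_NR (sizeof(demo_table) / sizeof(demo_table[0]))

/* Look up handler number nr and call it, or return -ENOSYS when the
 * number is out of range or the slot is empty. */
long demo_dispatch(unsigned long nr, long a, long b)
{
    if (nr >= DEMO_NR || demo_table[nr] == NULL)
        return -ENOSYS;
    return demo_table[nr](a, b);
}
```

Taking `nr` as an unsigned value means a single comparison rejects both negative and too-large numbers, which is the same reason the kernel compares an unsigned copy of the syscall number.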

Finding Syscall Numbers

# Method 1: Header files (on Debian/Ubuntu the file lives under /usr/include/x86_64-linux-gnu/asm/)
grep -E "^#define __NR_" /usr/include/asm/unistd_64.h | head -20

# Method 2: ausyscall tool
ausyscall --dump | head -20

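# Method 3: C preprocessor (Python's os module does not expose SYS_* constants)
# On x86-64 the output ends with: 0 1
printf '#include <sys/syscall.h>\nSYS_read SYS_write\n' | gcc -E -P -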

vDSO - Virtual Dynamic Shared Object

The Time Query Problem

Before vDSO, getting the current time was surprisingly expensive.
The problem: Applications call gettimeofday() or clock_gettime() millions of times per second:
  • Web servers log every request with timestamps
  • Databases track transaction times
  • Profilers measure code execution
  • Games render frames with timing
The cost: Each call was a full syscall (~200-500 cycles of overhead) just to read a value the kernel already maintains.
The insight: From user space, time is read-only data, and the kernel only updates its timekeeping base on timer ticks. Why pay for a mode switch to read it?
The solution: vDSO maps kernel-maintained data, plus a small amount of code, into user space. Applications compute the time from that memory with no syscall.
vDSO is a kernel optimization that serves certain syscalls without entering kernel mode:
┌─────────────────────────────────────────────────────────────────────────────┐
│                            vDSO MECHANISM                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Without vDSO (slow):                                                       │
│   ┌────────────┐     syscall     ┌────────────┐                             │
│   │ User Space │ ──────────────→ │   Kernel   │                             │
│   │            │ ←────────────── │            │                             │
│   └────────────┘     return      └────────────┘                             │
│        ~200-500 cycles overhead                                              │
│                                                                              │
│   With vDSO (fast):                                                          │
│   ┌────────────┐  function call  ┌────────────┐                             │
│   │ User Space │ ──────────────→ │   vDSO     │  (still user space!)        │
│   │            │ ←────────────── │  (shared)  │                             │
│   └────────────┘     return      └────────────┘                             │
│        tens of cycles (no mode switch)                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

vDSO Functions

Function               What it does
__vdso_clock_gettime   Get current time (most common)
__vdso_gettimeofday    Legacy time function
__vdso_time            Simple seconds since epoch
__vdso_getcpu          Get current CPU/NUMA node

How vDSO Works

  1. Kernel maps a special page into every process
  2. Page contains code that can read kernel data (time, CPU)
  3. Kernel updates shared data (timekeeping) periodically
  4. User-space reads data without entering kernel
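# Find vDSO mapping
grep vdso /proc/self/maps
# 7fff12345000-7fff12346000 r-xp 00000000 00:00 0 [vdso]

# Dump vDSO contents (illustrative: with ASLR each process gets its own vDSO
# address, so the address must come from the same process that reads /proc/self/mem)
dd if=/proc/self/mem bs=4096 skip=$((0x7fff12345)) count=1 2>/dev/null | xxd | head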
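
A process can also locate its own vDSO without parsing /proc. The sketch below assumes glibc 2.16 or later, which provides `getauxval()`; the function names `vdso_base` and `vdso_looks_like_elf` are hypothetical. It reads the AT_SYSINFO_EHDR entry of the auxiliary vector, which the kernel sets to the vDSO base address, and checks that the mapping starts with the ELF magic bytes.

```c
#include <elf.h>        /* ELFMAG, SELFMAG */
#include <string.h>     /* memcmp, NULL */
#include <sys/auxv.h>   /* getauxval, AT_SYSINFO_EHDR */

/* Returns the vDSO base address for this process, or NULL if the
 * kernel did not map one. */
const void *vdso_base(void)
{
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if a vDSO is mapped and begins with "\177ELF", else 0. */
int vdso_looks_like_elf(void)
{
    const void *base = vdso_base();
    return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}
```

Because the vDSO is a complete ELF shared object, the same base pointer is what a dynamic loader walks to resolve symbols such as `__vdso_clock_gettime`.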

vDSO Performance Impact

// Benchmark: clock_gettime performance
#include <time.h>

void benchmark() {
    struct timespec ts;
    
    // This uses vDSO (~20ns)
    for (int i = 0; i < 1000000; i++) {
        clock_gettime(CLOCK_MONOTONIC, &ts);
    }
}
Interview Insight: “gettimeofday/clock_gettime are the most frequently called syscalls in many applications. vDSO makes them essentially free, which is why you rarely see them in syscall traces as performance problems.”
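
To see the difference on a given machine, the libc wrapper can be timed against a forced kernel entry. This is a rough sketch rather than a rigorous benchmark; the function name `clock_gettime_ns_per_call` is hypothetical. It assumes the libc `clock_gettime()` is served by the vDSO, which is the usual case on x86-64 with a TSC clocksource, while `syscall(SYS_clock_gettime, ...)` always enters the kernel.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average nanoseconds per clock_gettime() call.
 * use_raw_syscall == 0: libc wrapper (normally the vDSO fast path)
 * use_raw_syscall != 0: syscall(SYS_clock_gettime, ...), a real kernel entry */
double clock_gettime_ns_per_call(int use_raw_syscall, long iterations)
{
    struct timespec start, end, ts;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        if (use_raw_syscall)
            syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
        else
            clock_gettime(CLOCK_MONOTONIC, &ts);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* Combine seconds and nanoseconds so runs longer than 1 s work */
    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_nsec - start.tv_nsec);
    return elapsed_ns / iterations;
}
```

Calling it once with `use_raw_syscall` set to 0 and once set to 1, with a few million iterations each, gives two per-call figures to compare. The exact ratio depends on the CPU, the clocksource, and which mitigations are active.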

Syscall Overhead Analysis

Why Are Syscalls Expensive?

Syscalls are one of the most expensive operations you can do in user space. Here’s why:
The fundamental problem: CPU privilege levels. User code runs in ring 3 (unprivileged), kernel code runs in ring 0 (privileged). Switching between them is expensive.
What makes it expensive:
  1. Mode switch: CPU must save all registers, switch stacks, change privilege level
  2. Security checks: Validate arguments, check permissions, run seccomp filters
  3. Cache pollution: Kernel code evicts user code from CPU caches
  4. Spectre mitigations: KPTI adds extra overhead (page table switching)
Real-world impact: A program doing 100,000 syscalls/second spends roughly 1-3% of one core on syscall overhead alone (100,000 × 200-1,000 cycles on a 3 GHz CPU), before doing any actual work.
Understanding where this overhead comes from is crucial for performance:

Cost Breakdown

┌─────────────────────────────────────────────────────────────────────────────┐
│                      SYSCALL OVERHEAD BREAKDOWN                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. Mode Switch (~50-100 cycles)                                            │
│     - SYSCALL instruction                                                    │
│     - Save user registers                                                    │
│     - Load kernel stack                                                      │
│                                                                              │
│  2. Kernel Entry Work (~50-100 cycles)                                      │
│     - Entry tracing (if enabled)                                            │
│     - Audit logging (if enabled)                                            │
│     - seccomp check (if enabled)                                            │
│                                                                              │
│  3. Actual Syscall Work (varies)                                            │
│     - Read: depends on data availability                                    │
│     - Write: depends on buffer state                                        │
│     - Memory operations: page faults, allocation                            │
│                                                                              │
│  4. Return Path (~50-100 cycles)                                            │
│     - Signal checking                                                        │
│     - Rescheduling if needed                                                │
│     - Exit tracing                                                          │
│     - SYSRET/IRET                                                           │
│                                                                              │
│  Minimum overhead: ~200-500 cycles (trivial syscall)                        │
│  Typical overhead: 1000+ cycles (with I/O)                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
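
Per-call cycle counts like those in the breakdown can be turned into a CPU budget with one line of arithmetic. The helper below is a hypothetical sketch; all three inputs are assumptions that should be replaced with values measured on the target machine.

```c
/* Fraction of one core spent purely on syscall entry/exit overhead.
 * syscalls_per_sec:   observed syscall rate
 * cycles_per_syscall: assumed fixed overhead per call
 * cpu_hz:             core clock frequency in Hz */
double syscall_overhead_fraction(double syscalls_per_sec,
                                 double cycles_per_syscall,
                                 double cpu_hz)
{
    return syscalls_per_sec * cycles_per_syscall / cpu_hz;
}
```

For example, 100,000 syscalls per second at 500 cycles each on a 3 GHz core gives 5e7 / 3e9, or about 1.7% of the core. If mitigations push the per-call cost toward 1,000 cycles, the same rate costs about 3.3%.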

Measuring Syscall Overhead

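// Measure getpid() overhead (does almost nothing)
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

long measure_syscall_overhead(int iterations) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    for (int i = 0; i < iterations; i++) {
        syscall(SYS_getpid);  // raw syscall: bypasses any libc caching
    }
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    // Include tv_sec: tv_nsec alone wraps around every second
    long long elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                         + (end.tv_nsec - start.tv_nsec);
    return (long)(elapsed_ns / iterations);
}
// Typical result: roughly 50-200 ns per call, depending on CPU and mitigations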

Spectre Mitigations Impact

After Spectre/Meltdown, syscall overhead increased:
# Check for mitigations
cat /sys/devices/system/cpu/vulnerabilities/*

# Mitigations that affect syscall performance:
# - KPTI (Kernel Page Table Isolation): ~100-400 cycles
# - Retpoline: ~10-50 cycles on indirect calls
# - IBRS: ~30-100 cycles

seccomp - Syscall Filtering

The Container Security Problem

Containers provide isolation, but they share the same kernel. This creates a security risk.
The threat: A compromised container could:
  • Use ptrace() to inspect other processes
  • Use mount() to escape the container
  • Use reboot() to crash the host
  • Use kexec_load() to replace the kernel
  • Use clock_settime() to break time-based security
The challenge: Containers need some syscalls to function, but not all ~300+ of them.
The solution: seccomp-BPF filters syscalls before they execute. Even if an attacker gains code execution in a container, dangerous syscalls are blocked at the kernel level.

seccomp Modes

Mode                  Description                                 Use Case
SECCOMP_MODE_STRICT   Allow only read, write, _exit, sigreturn    Maximum security
SECCOMP_MODE_FILTER   BPF program filters syscalls                Container sandboxing

How seccomp-BPF Works

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SECCOMP-BPF FILTERING                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Syscall Entry                                                              │
│        ↓                                                                     │
│   ┌─────────────────────────────────────────────────────────────────┐       │
│   │  seccomp BPF filter (runs before syscall)                       │       │
│   │                                                                  │       │
│   │  Input: struct seccomp_data {                                   │       │
│   │           int   nr;          // syscall number                  │       │
│   │           __u32 arch;        // architecture                    │       │
│   │           __u64 instruction_pointer;                            │       │
│   │           __u64 args[6];     // syscall arguments               │       │
│   │         };                                                       │       │
│   │                                                                  │       │
│   │  Output: action                                                  │       │
│   │    SECCOMP_RET_ALLOW  - Allow syscall                           │       │
│   │    SECCOMP_RET_KILL   - Kill process                            │       │
│   │    SECCOMP_RET_ERRNO  - Return error                            │       │
│   │    SECCOMP_RET_TRACE  - Notify tracer (ptrace)                  │       │
│   │    SECCOMP_RET_LOG    - Allow but log                           │       │
│   └─────────────────────────────────────────────────────────────────┘       │
│        ↓                                                                     │
│   If allowed: proceed to syscall handler                                    │
│   If denied: return error or kill process                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
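
The flow in the diagram can be reproduced without libseccomp by handing the kernel a small classic-BPF program through prctl(). This is a sketch under stated assumptions: an x86-64 process, kernel headers that provide linux/seccomp.h, and hypothetical helper names `deny_with_eperm` and `seccomp_demo`. The filter loads the `arch` and `nr` fields of struct seccomp_data and returns SECCOMP_RET_ERRNO for one syscall and SECCOMP_RET_ALLOW for everything else. It is installed in a forked child so the caller stays unfiltered.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a filter that makes syscall nr fail with EPERM and allows
 * everything else. Returns 0 on success, -1 on failure. */
int deny_with_eperm(int nr)
{
    struct sock_filter prog[] = {
        /* Load the architecture; kill on anything other than x86-64 */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* Load the syscall number and compare it with nr */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof(prog) / sizeof(prog[0])),
        .filter = prog,
    };

    /* Unprivileged processes must set no_new_privs before loading a filter */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}

/* Fork a child, filter getppid() in it, and report what the child saw.
 * Returns 1 if getppid() failed with EPERM, 0 if it still succeeded,
 * -1 if the filter could not be installed or the child died. */
int seccomp_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (deny_with_eperm(SYS_getppid) != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    return WEXITSTATUS(status) == 1 ? 0 : -1;
}
```

Checking the architecture first matters because syscall numbers differ between ABIs, so a filter that only compares `nr` could be bypassed by switching to the 32-bit entry path.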

seccomp Example

#include <errno.h>
#include <seccomp.h>

int apply_seccomp_filter() {
    // Create filter context (default: allow all)
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (ctx == NULL)
        return -1;
    
    // Block specific syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(ptrace), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(kexec_load), 0);
    
    // Apply filter; seccomp_load returns 0 on success, negative errno on failure
    int rc = seccomp_load(ctx);
    seccomp_release(ctx);
    return rc;
}

seccomp in Containers

Docker uses seccomp to restrict container syscalls:
# Check that seccomp is enabled (lists the daemon's security options)
docker info --format '{{ .SecurityOptions }}'

# Run container with custom seccomp profile
docker run --security-opt seccomp=profile.json myimage

# Run without seccomp (less secure)
docker run --security-opt seccomp=unconfined myimage

System Call Tracing

Essential skill for observability engineering:

strace - User-Space Tracer

# Basic syscall tracing
strace ls

# Trace specific syscalls
strace -e read,write cat /etc/passwd

# With timing information
strace -T ls

# Count syscalls
strace -c ls

# Trace child processes too
strace -f ./multi_threaded_app

# Output to file
strace -o trace.log ls

# Attach to running process
strace -p 1234

strace Output Analysis

$ strace -T cat /etc/passwd 2>&1 | head -20
execve("/usr/bin/cat", ["cat", "/etc/passwd"], ...) = 0 <0.000892>
brk(NULL)                               = 0x5555557c3000 <0.000012>
mmap(NULL, 8192, PROT_READ|PROT_WRITE, ...) = 0x7ffff7fc5000 <0.000023>
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT <0.000015>
openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3 <0.000021>
fstat(3, {st_mode=S_IFREG|0644, st_size=2773, ...}) = 0 <0.000011>
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 131072) = 2773 <0.000018>
write(1, "root:x:0:0:root:/root:/bin/bash\n"..., 2773) = 2773 <0.000042>
close(3)                                = 0 <0.000010>

ltrace - Library Call Tracer

# Trace library calls
ltrace ls

# Trace specific library
ltrace -l libc.so.6 ls

Kernel-Level Tracing

For production, use eBPF-based tracing (covered in Track 5):
# Using bpftrace
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { 
    printf("%s opened %s\n", comm, str(args->filename)); 
}'

# Using perf
sudo perf trace ls

Adding a Custom Syscall (Lab)

Understanding by implementation:

Step 1: Define the Syscall

// Add to include/linux/syscalls.h
asmlinkage long sys_hello(const char __user *name);

Step 2: Implement the Handler

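// Create kernel/hello.c
#include <linux/syscalls.h>
#include <linux/uaccess.h>

SYSCALL_DEFINE1(hello, const char __user *, name)
{
    char buf[64];
    long len;
    
    // strncpy_from_user stops at the NUL terminator, so a short string
    // near the end of a mapping does not fault the way a fixed-size
    // copy_from_user would
    len = strncpy_from_user(buf, name, sizeof(buf) - 1);
    if (len < 0)
        return -EFAULT;
    
    buf[len] = '\0';
    pr_info("Hello, %s! (from syscall)\n", buf);
    
    return 0;
}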

Step 3: Add to Syscall Table

# arch/x86/entry/syscalls/syscall_64.tbl
# Add a line using the next unused number in your kernel tree (451 is only an example):
451    common    hello    sys_hello

Step 4: Test from User Space

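#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SYS_hello 451  // must match the number added to syscall_64.tbl

int main() {
    long ret = syscall(SYS_hello, "World");
    if (ret == -1)
        perror("hello syscall");  // ENOSYS if the running kernel lacks it
    return ret == 0 ? 0 : 1;
}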

Compatibility and ABI

32-bit Compatibility on 64-bit

// Different syscall tables for different ABIs
// arch/x86/entry/syscall_64.c - 64-bit
// arch/x86/entry/syscall_32.c - 32-bit compat

// Check ABI in seccomp
struct seccomp_data {
    __u32 arch;  // AUDIT_ARCH_X86_64 or AUDIT_ARCH_I386
    ...
};

Syscall Number Differences

# Same syscall, different numbers
# 64-bit: read = 0, write = 1
# 32-bit: read = 3, write = 4

# View 32-bit syscall numbers
cat /usr/include/asm/unistd_32.h

Lab Exercises

Objective: Measure and compare syscall overhead
// syscall_bench.c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ITERATIONS 10000000

double measure(void (*func)(void)) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) func();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) * 1e9 + 
           (end.tv_nsec - start.tv_nsec);
}

void test_getpid() { syscall(SYS_getpid); }
void test_gettid() { syscall(SYS_gettid); }
void test_getuid() { syscall(SYS_getuid); }

int main() {
    printf("getpid: %.1f ns\n", measure(test_getpid) / ITERATIONS);
    printf("gettid: %.1f ns\n", measure(test_gettid) / ITERATIONS);
    printf("getuid: %.1f ns\n", measure(test_getuid) / ITERATIONS);
    return 0;
}
gcc -O2 syscall_bench.c -o syscall_bench
./syscall_bench
Objective: Analyze real application syscall patterns
# 1. Compare static vs dynamic linking
strace -c /bin/ls -la
strace -c /bin/busybox ls -la

# 2. Analyze a web server
strace -c -f python3 -m http.server 8000 &
sleep 1    # give the server time to start listening
curl http://localhost:8000/
kill %1

# 3. Find slowest syscalls
strace -T -o trace.log dd if=/dev/zero of=/tmp/test bs=1M count=100
sort -t'<' -k2 -n trace.log | tail -20
Objective: Create a sandboxed execution environment
// sandbox.c
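#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <seccomp.h>
#include <unistd.h>
#include <sys/wait.h>

void apply_sandbox() {
    // Default action: fail with EPERM. With SCMP_ACT_KILL the process
    // would die on the blocked open and never report the failure.
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));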
    
    // Allow only essential syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    
    seccomp_load(ctx);
    seccomp_release(ctx);
}

int main() {
    printf("Before sandbox\n");
    apply_sandbox();
    printf("After sandbox\n");
    
    // This will fail (open not allowed)
    FILE *f = fopen("/etc/passwd", "r");
    if (!f) printf("fopen blocked by seccomp!\n");
    
    return 0;
}
gcc sandbox.c -o sandbox -lseccomp
./sandbox

Interview Questions

Answer:
  1. User space:
    • Application calls read(fd, buf, count)
    • libc sets up registers: rax=0 (SYS_read), rdi=fd, rsi=buf, rdx=count
    • Executes syscall instruction
  2. Kernel entry:
    • CPU switches to ring 0, loads kernel stack
    • entry_SYSCALL_64 saves registers
    • do_syscall_64 looks up sys_read in syscall table
  3. Syscall handler (ksys_read):
    • Validates fd, gets struct file *
    • Calls the file’s read operation (file->f_op->read or file->f_op->read_iter)
    • For regular files: checks page cache, reads from disk if needed
    • Copies data to user buffer via copy_to_user
  4. Return:
    • Returns bytes read (or error)
    • syscall_exit_to_user_mode: check signals, scheduling
    • sysretq: return to user mode
Answer:
How it works:
  • Kernel maps a special page into every process’s address space
  • Page contains code that reads kernel-maintained data
  • No mode switch needed — runs entirely in user space
Performance improvement:
  • Regular syscall: ~200-500 cycles (mode switch overhead)
  • vDSO call: tens of cycles, roughly 20 ns (a function call plus a clock read)
Functions provided:
  • gettimeofday(), clock_gettime() — most important
  • time(), getcpu()
Why limited:
  • Only works for read-only data
  • Kernel maintains shared data (timekeeping)
  • Can’t be used for anything requiring kernel intervention
Impact: Applications doing millions of time queries (monitoring, logging) would spend several times to an order of magnitude longer on those calls without vDSO.
Answer:
Protection mechanism:
  • BPF program runs on every syscall entry
  • Blocks dangerous syscalls before they execute
  • Defense in depth: even if code inside the container is compromised, the syscalls it can use are limited
Commonly blocked syscalls:
  • mount, umount — prevent filesystem manipulation
  • reboot, kexec_load — prevent system disruption
  • ptrace — prevent debugging/injection
  • init_module, delete_module — prevent kernel modification
  • clock_settime — prevent time manipulation
Docker default profile: Blocks ~44 syscalls out of ~300+
Example attack prevention:
  • Container exploit tries ptrace to escape → blocked
  • Malware tries kexec_load → blocked
  • Process tries to load kernel module → blocked
Answer:
Overhead sources:
  • BPF filter runs on every syscall entry
  • Constant-time operations for simple filters
  • More complex filters = higher overhead
Typical overhead:
  • Simple whitelist: ~20-50 nanoseconds per syscall
  • Complex filters with argument checking: 100-200 ns
Why it’s acceptable:
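  • Syscalls already cost on the order of 100 ns or more (200-500 cycles before the real work starts)
  • 20-50 ns is a modest addition, and the syscall’s actual work usually dominates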
  • Security benefit outweighs cost
Optimization tips:
  • Put common allowed syscalls first in filter
  • Use SECCOMP_RET_ALLOW as default if mostly allowing
  • Profile with perf to measure actual impact

Key Takeaways

Syscall Mechanism

SYSCALL instruction, register convention, and kernel entry path are fundamental knowledge

vDSO Optimization

Critical for understanding why some “syscalls” have nearly zero overhead

seccomp Security

BPF-based syscall filtering is the foundation of container security

Tracing Skills

strace and understanding syscall patterns are essential for debugging

Interview Deep-Dive

Strong Answer:
  • On modern Linux, gettimeofday() and clock_gettime() do not actually enter the kernel. They are served by the vDSO (virtual Dynamic Shared Object), which is a small shared library that the kernel maps into every process’s address space during execve(). The vDSO contains code that reads time data from a shared memory page that the kernel updates on each timer tick (typically every 1-4ms).
  • The mechanism works as follows: the kernel maintains a timekeeping data structure (vsyscall_gtod_data in older kernels, vdso_data in newer ones) in a page mapped read-only into user space. This structure contains the current time, the clocksource coefficients (TSC multiplier and shift), and the last update timestamp. The vDSO code reads the TSC register directly (via rdtsc or rdtscp), applies the coefficients to compute the current time, and returns, all without any privilege transition.
  • A regular syscall costs 200-500 CPU cycles (mode switch, register save/restore, Spectre mitigations). A vDSO call costs tens of cycles, about 20 ns (a function call, a TSC read, and a few multiplications). At 500,000 calls per second on a 3 GHz core, that is roughly 1% of a core for vDSO versus about 3-8% for real syscalls. This is why you rarely see gettimeofday as a performance bottleneck in strace output, and also why strace itself cannot see vDSO calls (they never enter the kernel).
Follow-up: Under what circumstances would clock_gettime() actually fall back to a real syscall instead of using vDSO?
Follow-up Answer:
  • The vDSO only works when the kernel can provide sufficient information for user-space time computation. It falls back to a real syscall when: the clocksource is not TSC-based (for example, HPET or ACPI PM timer, which require MMIO reads that need kernel privileges), when CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID are requested (these require reading per-task scheduling data), or when the clock is CLOCK_TAI on some kernel versions. You can verify which calls use vDSO by checking whether they appear in strace output — if they do not appear, they are being handled by vDSO.
Strong Answer:
  • Seccomp-BPF filters run before the syscall handler executes. When a container runtime (like runc) starts a container, it installs a BPF filter program via prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...). This filter receives the syscall number and arguments as input and returns an action (ALLOW, KILL, ERRNO, TRACE, LOG). If the vulnerable syscall is blocked by the filter, the exploit never reaches the buggy kernel code — the filter returns EPERM or kills the process before the syscall handler is even invoked.
  • Docker’s default seccomp profile blocks approximately 44 of the 300+ syscalls, including dangerous ones like kexec_load, mount, ptrace, init_module, delete_module, and clock_settime. This reduces the kernel’s attack surface significantly.
  • However, seccomp has important limitations. First, it can only filter on syscall number and the first six arguments. It cannot dereference pointers, so it cannot inspect the contents of buffers or filenames passed to syscalls. Second, the filter is set once and cannot be relaxed (only tightened), following the principle of least privilege. Third, seccomp cannot protect against kernel vulnerabilities in allowed syscalls — if the exploit is in read() or write(), which must be allowed for the container to function, seccomp cannot help. Finally, seccomp does not protect against hardware-level attacks like Spectre that bypass the syscall interface entirely.
Follow-up: How does the seccomp overhead scale with filter complexity, and how would you design a filter for a production service?
Follow-up Answer:
  • Each seccomp filter is a BPF program that runs linearly: the kernel evaluates instructions sequentially for every syscall. A simple allowlist of 20 syscalls might take 20-50 nanoseconds per syscall. A complex filter with argument checking on dozens of syscalls could take 100-200 nanoseconds. Since a syscall already costs on the order of 100 nanoseconds before it does any real work, a well-designed filter adds a modest fraction on top. For production, I would start with Docker’s default profile, then use strace -c to identify which syscalls the service actually uses, and build a tight allowlist. I would put the most frequently called syscalls (read, write, futex, epoll_wait) first in the filter to minimize average evaluation time, and set the default action to SCMP_ACT_ERRNO rather than SCMP_ACT_KILL to avoid silent process deaths during development.
Strong Answer:
  • In user space, write() is a libc wrapper that sets up registers per the x86-64 ABI: rax=1 (SYS_write), rdi=fd, rsi=buf, rdx=4096, then executes the syscall instruction.
  • The CPU saves RIP and RFLAGS into RCX and R11, loads the kernel entry point from MSR_LSTAR, switches to ring 0, and jumps to entry_SYSCALL_64 in assembly. This code runs swapgs to reach per-CPU data, switches to the kernel stack, saves all user registers onto the kernel stack as a pt_regs structure, then calls do_syscall_64().
  • do_syscall_64() looks up sys_call_table[1] (write), which dispatches to ksys_write(). This function calls fdget_pos() to convert the integer fd to a struct file * pointer and acquire the file position lock. It then calls vfs_write(), which checks permissions and calls file->f_op->write_iter() — the filesystem-specific write function.
  • For ext4 buffered writes, ext4_file_write_iter() calls generic_perform_write(), which finds or creates pages in the page cache (address_space), copies the 4096 bytes from user space into the page cache page via copy_from_user(), and marks the page dirty. The write call returns to user space at this point — the data is in page cache but not on disk.
  • Later, a writeback kworker thread (the modern replacement for pdflush) wakes up and calls ext4_writepages(), which allocates disk blocks, creates bio structures describing the I/O, and submits them to the block layer via submit_bio(). The block layer’s scheduler (mq-deadline, kyber, or none) may reorder or merge the request, then dispatches it to the device driver, which programs DMA to transfer the page cache data directly to the storage device. The device raises an interrupt on completion, and the bio end_io callback marks the page clean.
Follow-up: At which point is the data guaranteed to survive a power failure, and how does fsync() change the flow?
Follow-up Answer:
  • After the buffered write returns, data is only in volatile page cache, so a power failure loses it. Calling fsync(fd) after the write forces the kernel to flush all dirty pages for that file to disk and wait for the device to confirm they are on persistent storage. Specifically, fsync() calls vfs_fsync(), which invokes file->f_op->fsync() (ext4_sync_file), which flushes dirty data pages, writes the inode metadata, and issues a cache flush command to the drive (SYNCHRONIZE CACHE for SCSI/SAS, Flush for NVMe). Only after the drive confirms the flush is fsync() allowed to return. Note that even fsync does not guarantee safety against drive firmware bugs that falsely acknowledge writes, which is why enterprise drives with power-loss-protected write caches exist.
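
From user space, the write-then-fsync sequence described above looks roughly like the following. This is a minimal sketch with a hypothetical helper name, `write_durably`. It treats any write error, including EINTR, as fatal, and it does not fsync the parent directory, which a newly created file would also need for its directory entry to survive a crash.

```c
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes to path and return only after fsync() has succeeded.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may write fewer bytes than asked */
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* At this point the data is only in the page cache */
    if (fsync(fd) != 0) {                /* wait for the device to confirm */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The loop matters because write() is allowed to return after copying only part of the buffer, and the fsync() return value matters because that is where delayed I/O errors are reported.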
