
OS Fundamentals & System Call Internals

Operating Systems exist to manage hardware and provide a safe abstraction for applications. A “Senior” engineer must understand the physical transition between these two worlds.
Interview Frequency: Extremely High (90%+ of system programming interviews)
Key Topics: System calls, kernel/user space, vDSO, context switching, privilege levels
Time to Master: 12-15 hours
Prerequisites: C programming, basic computer architecture

0. What is an Operating System?

At the highest level, an Operating System is a resource manager and isolation layer:
  • Resource Manager:
    • Multiplexes CPU time between many processes.
    • Allocates and reclaims memory, files, sockets, and devices.
    • Schedules and prioritizes work according to policy (throughput, latency, fairness, deadlines).
  • Isolation & Protection Layer:
    • Prevents one program from corrupting another program’s memory.
    • Prevents untrusted code from directly touching hardware.
    • Enforces security boundaries (user vs kernel, containers, VMs).
A helpful analogy is a city operating authority:
  • Streets/highways ⇢ CPU cores and buses.
  • Buildings ⇢ processes.
  • Rooms ⇢ threads.
  • Zoning rules and permits ⇢ permissions and security policies.
  • Traffic lights ⇢ synchronization and scheduling.
The OS makes the city feel orderly and predictable to its “citizens” (programs) even though underneath, the physical world (hardware) is chaotic and failure-prone.

0.1 Core Responsibilities

Every mainstream OS (Linux, Windows, macOS, BSD, RTOS variants) implements the same core ideas:
  • Abstraction: Present simple interfaces (files, sockets, processes) instead of raw devices and registers.
  • Virtualization: Make a single physical CPU and memory look like many virtual CPUs and address spaces.
  • Isolation: Ensure faults in one address space do not corrupt others.
  • Coordination: Provide primitives (locks, signals, pipes, futexes) so concurrent entities can cooperate.
  • Accounting: Track which process used how much CPU, memory, I/O; enforce quotas and limits.

0.2 Types of Operating Systems

Monolithic Kernels

Examples: Linux, traditional Unix
Most services (drivers, file systems, networking) run in kernel mode. Fast but large kernel.

Microkernels

Examples: seL4, Minix3, QNX
Move many services to user space for stronger isolation. More message passing overhead.

Hybrid Kernels

Examples: Windows NT, macOS (XNU)
Combine monolithic and microkernel approaches for balance.

Real-Time OS

Examples: FreeRTOS, VxWorks
Trade general-purpose flexibility for strict deadline guarantees.
In this course, we primarily focus on Linux as the concrete example, but the mental models transfer to all of these.

0.5 From Source Code to Running Process

To make OS fundamentals concrete, walk through what happens when you compile and run a simple C program:
// main.c
#include <stdio.h>

int main(int argc, char **argv) {
    printf("hello\n");
    return 0;
}

Step 1: Compilation and Linking

gcc -E main.c -o main.i
Expands #include and macros into a single translation unit. (The remaining stages, gcc -S to assembly, gcc -c to an object file, and the final link, then produce the ELF executable described below.) The preprocessed output looks like:
// Thousands of lines from stdio.h are now inserted
// ...
int main(int argc, char **argv) {
    printf("hello\n");
    return 0;
}
ELF Structure (Executable and Linkable Format):
┌─────────────────────────┐
│  ELF Header             │  Magic number, entry point, section offsets
├─────────────────────────┤
│  Program Headers        │  How to load segments into memory
├─────────────────────────┤
│  .text (Code)           │  Machine instructions (read-only)
├─────────────────────────┤
│  .rodata (Constants)    │  "hello" string lives here
├─────────────────────────┤
│  .data (Initialized)    │  Initialized global variables
├─────────────────────────┤
│  .bss (Uninitialized)   │  Uninitialized globals (zero-filled by loader)
├─────────────────────────┤
│  Symbol Table           │  Function names, debugging info
├─────────────────────────┤
│  Section Headers        │  Metadata about each section
└─────────────────────────┘
At this point you have a program on disk, not yet a process.

Step 2: Shell Creates a New Process

When you run:
$ ./main
What the shell does:
  1. Your shell (itself a process) parses the command.
  2. The shell calls fork():
    • The kernel creates a child process by copying the parent’s PCB and page tables (copy-on-write).
    • Parent and child now both exist; they differ only in the return value of fork().
// Inside bash or sh:
pid_t pid = fork();
if (pid == 0) {
    // Child process
    execve("./main", argv, envp);
} else {
    // Parent process
    wait(NULL);  // Wait for child to finish
}

Step 3: Child Calls execve()

In the child:
  1. The shell calls execve("./main", ...).
  2. The kernel:
    • Reads the ELF headers from disk.
    • Allocates a new address space.
    • Maps code, data, stack, and shared libraries into that space.
    • Sets up the initial user stack with argc/argv and environment.
    • Sets the program counter to the entry point (for a dynamically linked binary, the dynamic linker ld-linux.so runs first and then jumps to the C runtime entry point, _start).
Memory Layout After execve():
High Address (0x7FFF...)
┌─────────────────────────┐
│  Kernel Space           │ ← Not accessible from user mode
├─────────────────────────┤
│  Stack (grows down ↓)   │ ← argc, argv, environment vars
│         ...             │
├─────────────────────────┤
│  Memory Mapped Region   │ ← Shared libraries (libc.so)
│  (libc, ld-linux.so)    │
├─────────────────────────┤
│  Heap (grows up ↑)      │ ← malloc() allocates here
│         ...             │
├─────────────────────────┤
│  .bss (uninitialized)   │ ← Zero-filled global variables
├─────────────────────────┤
│  .data (initialized)    │ ← Initialized global variables
├─────────────────────────┤
│  .text (code)           │ ← Your program's machine code
└─────────────────────────┘
Low Address (0x0000...)
At this moment the program becomes a process with its own PID and address space.

Step 4: C Runtime → main → Exit

  1. The C runtime (crt1.o) runs first, initializing the runtime and calling your main().
  2. Your code executes (printf("hello\n")), which itself issues syscalls under the hood (write() on stdout).
  3. When main returns, the runtime calls exit(), which:
    • Flushes stdio buffers.
    • Invokes the exit_group syscall.
    • Lets the kernel tear down the process (free memory, close FDs, reap the PCB).
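To see the flush-then-exit_group behavior concretely, here is a small hedged demo (ours, not from the course): returning from main lets exit() flush stdio, while calling _exit() directly goes straight to the exit syscall and the buffered output is lost.
// flush_demo.c - run with no argument (flush happens) or with any argument (_exit skips it)
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    printf("hello");      // no newline: stays in the stdio buffer for now
    if (argc > 1)
        _exit(0);         // straight to the exit syscall: buffer never flushed
    return 0;             // normal path: exit() flushes, "hello" appears
}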
Complete Flow Diagram:
User types "./main"

Shell receives command (shell is a running process, PID 1000)

Shell calls fork() ──────────┐
        ↓                     ↓
Parent (PID 1000)        Child (PID 1001)
calls wait()             calls execve("./main")
blocks...                     ↓
                         Kernel loads ELF binary
                         Sets up address space
                         Maps .text, .data, stack
                         Jumps to _start

                         C runtime initializes

                         main() executes
                         printf() → write syscall

                         main() returns 0

                         exit(0) → exit_group syscall

                         Kernel cleans up process

Parent's wait() returns
Shell prints next prompt
This entire path is the lifecycle of a simple process; later chapters (Processes, Virtual Memory, Scheduling, File Systems) each zoom into one part of this story.

1. The Kernel vs. User Space

The CPU hardware enforces the boundary.

1.1 Privilege Levels (Protection Rings)

Four Rings (though most OSes only use 2):
Ring 0 (Kernel Mode)    ← Full hardware access
Ring 1 (Device Drivers) ← Unused in modern OSes
Ring 2 (Device Drivers) ← Unused in modern OSes
Ring 3 (User Mode)      ← Restricted instructions
The Current Privilege Level (CPL) is stored in the low two bits of the CS (Code Segment) register.
Privileged Instructions (Ring 0 only):
  • HLT - halt the CPU
  • CLI/STI - disable/enable interrupts
  • MOV CR3, reg - change page tables
  • LGDT/LIDT - load GDT/IDT
  • IN/OUT - direct hardware I/O (on some systems)

1.2 What Each Mode Can Do

Capability                     User Space (Ring 3 / EL0)   Kernel Space (Ring 0 / EL1)
Execute normal instructions    Yes                         Yes
Access own virtual memory      Yes                         Yes
Access all physical memory     No                          Yes
Modify page tables             No                          Yes
Disable interrupts             No                          Yes
Execute I/O instructions       No                          Yes
Halt the CPU                   No                          Yes
Load kernel modules            No                          Yes
Why the restriction?
// If user space could do this:
asm("cli");  // Disable interrupts
while(1);    // Infinite loop
// The entire system would freeze!

// Or this:
unsigned long *kernel_memory = (unsigned long *)0xFFFF888000000000;
*kernel_memory = 0x90909090;  // Overwrite kernel code
// System compromised!
The hardware enforces that attempts to execute privileged instructions in user mode trigger a General Protection Fault (x86) or Illegal Instruction exception (ARM/RISC-V), which the kernel handles by terminating the offending process.
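A minimal sketch of that enforcement (our own demo, not from the course): the child executes the privileged HLT instruction in user mode, the CPU raises a fault, and the kernel kills the child with SIGSEGV.
// priv_demo.c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        __asm__ volatile("hlt");   // privileged: legal only in Ring 0
        _exit(0);                  // never reached
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        printf("child killed by signal %d (SIGSEGV expected)\n", WTERMSIG(status));
    return 0;
}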

2. System Call Evolution (x86-64)

How does a program ask the kernel for help?

2.1 Legacy: INT 0x80 (i386)

In the 32-bit era, applications used a software interrupt.
; Legacy 32-bit Linux syscall
mov eax, 4          ; syscall number (write)
mov ebx, 1          ; file descriptor (stdout)
mov ecx, msg        ; buffer
mov edx, 13         ; count
int 0x80            ; Trap to kernel
What happens:
  1. CPU saves registers (CS, EIP, EFLAGS).
  2. CPU looks up interrupt vector 0x80 in the IDT (Interrupt Descriptor Table).
  3. CPU jumps to kernel’s interrupt handler.
  4. Handler switches to kernel stack.
  5. Handler calls the appropriate syscall function.
  6. Handler returns using IRET.
Problem: Very slow (~300-500 cycles) due to:
  • Interrupt controller overhead
  • Full register save/restore
  • Stack switching
  • Permission checks

2.2 Modern: SYSCALL (AMD) / SYSENTER (Intel)

x86-64 introduced a dedicated instruction for syscalls.
SYSCALL Instruction (AMD64):
; Modern 64-bit Linux syscall
mov rax, 1          ; syscall number (write)
mov rdi, 1          ; arg1: file descriptor
mov rsi, msg        ; arg2: buffer
mov rdx, 13         ; arg3: count
syscall             ; Fast system call
Hardware Magic:
  1. No IDT lookup: CPU jumps to address stored in IA32_LSTAR MSR (Model Specific Register).
  2. No stack lookup: Uses IA32_KERNEL_GS_BASE for per-CPU data.
  3. Minimal save: Only saves RIP and RFLAGS to RCX and R11.
Kernel Entry Point (simplified):
; Entry point stored in LSTAR MSR
entry_SYSCALL_64:
    SWAPGS                  ; Switch to kernel GS (per-CPU area)
    mov    QWORD PTR gs:0x14, rsp   ; Save user stack
    mov    rsp, QWORD PTR gs:0x1c   ; Load kernel stack

    push   rax              ; Save registers
    push   rcx              ; (RCX = user RIP)
    push   r11              ; (R11 = user RFLAGS)
    ; ... save more registers

    call   do_syscall_64    ; C function dispatch

    ; ... restore registers
    pop    r11
    pop    rcx
    pop    rax

    mov    rsp, QWORD PTR gs:0x14   ; Restore user stack
    SWAPGS                  ; Switch back to user GS
    sysretq                 ; Return to user space
Performance: Much faster (~100-200 cycles). The savings come from:
  • Direct jump (no table lookup)
  • Minimal register save/restore
  • No interrupt controller involved
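In C you rarely write this assembly yourself; glibc's syscall(2) wrapper issues the same SYSCALL instruction given a syscall number. A small illustrative example (ours):
// raw_number.c
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    long n = syscall(SYS_write, 1, "hello via SYS_write\n", 20);
    printf("SYS_write returned %ld, SYS_getpid says %ld\n", n, (long)syscall(SYS_getpid));
    return 0;
}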

2.3 ARM64: SVC Instruction

; ARM64 syscall
mov x8, #64         ; syscall number (write)
mov x0, #1          ; arg1: fd
ldr x1, =msg        ; arg2: buffer
mov x2, #13         ; arg3: count
svc #0              ; SuperVisor Call
Hardware behavior:
  • Saves PC to ELR_EL1
  • Saves PSTATE to SPSR_EL1
  • Jumps to exception vector
  • Kernel dispatches based on x8

2.4 RISC-V: ECALL Instruction

; RISC-V syscall
li a7, 64           ; syscall number (write)
li a0, 1            ; arg1: fd
la a1, msg          ; arg2: buffer
li a2, 13           ; arg3: count
ecall               ; Environment call
Cost Comparison Table:
Mechanism          Typical Cycles   Use Case
INT 0x80           300-500          Legacy 32-bit x86
SYSENTER           100-150          Intel x86-32 (deprecated)
SYSCALL            100-200          Modern x86-64
SVC                100-200          ARM64
ECALL              100-200          RISC-V
vDSO (no switch)   5-20             Kernel-provided user code

3. The vDSO (Virtual Dynamic Shared Object)

Some system calls are called thousands of times per second (e.g., gettimeofday(), clock_gettime()). Switching to kernel mode every time is a massive waste of CPU.

3.1 How it Works

The vDSO is a special page of memory that the kernel maps into every user process’s address space. This page contains:
  1. Code: Executable functions that run in user mode.
  2. Data: Read-only kernel data (like current time).
Process Address Space:
┌─────────────────────────┐
│  Kernel Space           │
├─────────────────────────┤
│  Stack                  │
├─────────────────────────┤
│  [vdso] ← Magic page!   │ ← Kernel-provided, user-executable
│    - gettimeofday()     │
│    - clock_gettime()    │
│    - getcpu()           │
├─────────────────────────┤
│  Shared Libraries       │
│  (libc.so)              │
├─────────────────────────┤
│  Heap                   │
├─────────────────────────┤
│  .text                  │
└─────────────────────────┘
Example: gettimeofday() Without vDSO:
// Slow path: requires syscall
struct timeval tv;
gettimeofday(&tv, NULL);
// → syscall → kernel mode → read kernel clock → return
// Cost: ~200 cycles
Example: gettimeofday() With vDSO:
// Fast path: executes in user mode
struct timeval tv;
gettimeofday(&tv, NULL);
// → calls vDSO function → reads shared memory → returns
// Cost: ~20 cycles (no mode switch!)

3.2 Implementation Details

Kernel Side (sets up vDSO):
// Simplified sketch of the kernel-side vDSO setup (see arch/x86/entry/vdso/)
static int __init init_vdso(void) {
    // Map vDSO page into every process
    vdso_pages[0] = alloc_page(GFP_KERNEL);
    copy_vdso_to_page(vdso_pages[0]);
    return 0;
}

// Kernel periodically updates shared time data
void update_vsyscall(struct timekeeper *tk) {
    vdso_data->wall_time_sec = tk->wall_time.tv_sec;
    vdso_data->wall_time_nsec = tk->wall_time.tv_nsec;
}
User Side (vDSO function):
// Inside vDSO (simplified)
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) {
    // Read from shared memory (no syscall!)
    ts->tv_sec = vdso_data->wall_time_sec;
    ts->tv_nsec = vdso_data->wall_time_nsec;
    return 0;
}
Data updates: The kernel periodically writes the current time into the data part of the vDSO page, using atomic updates and a seqlock-style counter so user-space readers never observe a torn, inconsistent value.
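A minimal sketch of that seqlock-style read (struct layout, field names, and the helper read_time() are illustrative, not the real kernel vdso_data):
struct vdso_time_data {
    unsigned int seq;        // even = stable, odd = writer in progress
    long wall_time_sec;
    long wall_time_nsec;
};

// Reader retries until it sees the same even sequence number before and after the reads.
static void read_time(const volatile struct vdso_time_data *d, long *sec, long *nsec) {
    unsigned int s;
    do {
        s = d->seq;
        __asm__ volatile("" ::: "memory");   // compiler barrier around the reads
        *sec  = d->wall_time_sec;
        *nsec = d->wall_time_nsec;
        __asm__ volatile("" ::: "memory");
    } while ((s & 1) || s != d->seq);        // retry if a writer raced with us
}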

3.3 Why Some Syscalls Can’t Use vDSO

Safe for vDSO:
  • Read-only operations
  • No side effects
  • Data changes slowly or predictably
Cannot use vDSO:
  • write() - modifies kernel state
  • fork() - creates process
  • mmap() - changes address space
  • open() - allocates file descriptor
Performance Impact:
// Benchmark: 10 million calls
clock_gettime(CLOCK_REALTIME, &ts);

// With vDSO:     ~200 ms (20 ns per call)
// Without vDSO:  ~2000 ms (200 ns per call)
// Speedup: 10x
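A runnable version of this comparison (ours; exact numbers vary by machine and mitigations) forces the slow path with syscall(SYS_clock_gettime, ...) and lets libc take the vDSO fast path otherwise:
// vdso_bench.c
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define N 10000000

static long elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    struct timespec t0, t1, ts;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        clock_gettime(CLOCK_REALTIME, &ts);               // vDSO fast path
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("vDSO:    %ld ns/call\n", elapsed_ns(t0, t1) / N);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        syscall(SYS_clock_gettime, CLOCK_REALTIME, &ts);  // forced kernel entry
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("syscall: %ld ns/call\n", elapsed_ns(t0, t1) / N);

    return 0;
}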

3.4 Finding the vDSO

# View process mappings
$ cat /proc/self/maps | grep vdso
7ffd1a3fe000-7ffd1a400000 r-xp 00000000 00:00 0    [vdso]

# Dump vDSO symbols
$ objdump -T /lib/x86_64-linux-gnu/libc.so.6 | grep vdso
# (libc wraps vDSO calls)

# Using LD_SHOW_AUXV
$ LD_SHOW_AUXV=1 ./program 2>&1 | grep SYSINFO
AT_SYSINFO_EHDR: 0x7ffd1a3fe000 vDSO base address
Result: Zero user/kernel transitions. The “system call” executes entirely in user space.

4. vsyscall: The Legacy Fixed Address

Before the vDSO, there was vsyscall.

4.1 The Problem with vsyscall

Fixed virtual address: 0xffffffffff600000

Every process had:
┌───────────────────────────┐
│ 0xffffffffff600000:       │
│   gettimeofday code       │ ← Same address in EVERY process
│   time code               │
│   getcpu code             │
└───────────────────────────┘
Security Issue: Predictable addresses enable ROP attacks (Return-Oriented Programming). An attacker could reliably:
// Exploit code
return_address = 0xffffffffff600000 + offset;
// Execute attacker-controlled syscalls

4.2 Modern Linux Solution

vsyscall is now emulated:
// In arch/x86/entry/vsyscall/vsyscall_64.c
static bool is_vsyscall_vaddr(unsigned long vaddr) {
    return vaddr >= VSYSCALL_ADDR && vaddr < VSYSCALL_ADDR + PAGE_SIZE;
}

// Trap and emulate (simplified)
void do_page_fault(struct pt_regs *regs, unsigned long address) {
    if (is_vsyscall_vaddr(address)) {
        // Emulate the syscall (SLOW!)
        emulate_vsyscall();
        return;
    }
}
Result: vsyscall addresses still work (for legacy binaries), but they trap to the kernel instead of executing directly. Modern code uses vDSO instead.

5. Kernel Entry: entry_SYSCALL_64

When the SYSCALL instruction is executed, the CPU jumps to this assembly entry point in the kernel.

5.1 Complete Entry Path (x86-64)

; arch/x86/entry/entry_64.S

ENTRY(entry_SYSCALL_64)
    /*
     * Interrupts are off on entry.
     * We are in kernel mode, but need to switch stacks.
     */

    /* 1. SWAPGS: Switch GS register to kernel per-CPU area */
    SWAPGS

    /* 2. Save user stack pointer */
    movq    %rsp, PER_CPU_VAR(rsp_scratch)

    /* 3. Load kernel stack */
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

    /* 4. Build pt_regs structure on kernel stack */
    pushq   $__USER_DS              /* User data segment */
    pushq   PER_CPU_VAR(rsp_scratch)  /* User RSP */
    pushq   %r11                    /* User RFLAGS (saved by SYSCALL) */
    pushq   $__USER_CS              /* User code segment */
    pushq   %rcx                    /* User RIP (saved by SYSCALL) */

    pushq   %rax    /* Syscall number */
    pushq   %rdi    /* Arg 1 */
    pushq   %rsi    /* Arg 2 */
    pushq   %rdx    /* Arg 3 */
    pushq   %r10    /* Arg 4 (RCX was overwritten) */
    pushq   %r8     /* Arg 5 */
    pushq   %r9     /* Arg 6 */

    /* 5. Call the C dispatcher */
    call    do_syscall_64

    /* 6. Restore user registers from pt_regs */
    popq    %r9
    popq    %r8
    popq    %r10
    popq    %rdx
    popq    %rsi
    popq    %rdi
    popq    %rax    /* Return value */

    popq    %rcx    /* User RIP */
    popq    %r11    /* Discard saved CS */
    popq    %r11    /* User RFLAGS */
    popq    %rsp    /* User RSP (remaining __USER_DS slot is abandoned) */

    /* 7. SWAPGS back to user GS */
    SWAPGS

    /* 8. SYSRETQ: Return to user space */
    sysretq
END(entry_SYSCALL_64)

5.2 The C Dispatcher

// In arch/x86/entry/common.c

__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr) {
    // Security checks
    nr = syscall_enter_from_user_mode(regs, nr);

    // Validate syscall number
    if (likely(nr < NR_syscalls)) {
        // Look up in syscall table and call
        regs->ax = sys_call_table[nr](regs);
    }

    syscall_exit_to_user_mode(regs);
}

5.3 The Syscall Table

// In arch/x86/entry/syscall_64.c

const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    [0] = __x64_sys_read,
    [1] = __x64_sys_write,
    [2] = __x64_sys_open,
    [3] = __x64_sys_close,
    [4] = __x64_sys_stat,
    // ... 300+ more syscalls
    [39] = __x64_sys_getpid,
    [57] = __x64_sys_fork,
    [59] = __x64_sys_execve,
    // ...
};
Security Note: The syscall table is marked read-only after boot. Modifying it is a common rootkit technique.

5.4 Complete Flow Diagram

User Program                  CPU Hardware              Kernel
─────────────────────────────────────────────────────────────
mov rax, 1
mov rdi, 1
mov rsi, buffer
mov rdx, 13
syscall ──────────────────>  SYSCALL instruction
                             - Save RIP → RCX
                             - Save RFLAGS → R11
                             - Load RIP from LSTAR MSR ───────> entry_SYSCALL_64:
                             - Switch to Ring 0                   SWAPGS
                                                                  Save user RSP
                                                                  Load kernel stack
                                                                  Push registers (pt_regs)
                                                                  call do_syscall_64

                                                                  sys_call_table[1]

                                                                  __x64_sys_write()

                                                                  ksys_write()

                                                                  vfs_write()

                                                                  [... kernel work ...]

                                                                  return bytes_written

                                                                  Pop registers
                                                                  Restore user RSP
                                                                  SWAPGS
                             SYSRETQ <────────────────────────── sysretq
                             - Restore RIP from RCX
                             - Restore RFLAGS from R11
                             - Switch to Ring 3
RAX = bytes_written <───────
continue program...

6. Processes vs. Threads vs. Kernel Tasks

Before we dive into system call micro-details, it is critical to distinguish the units of execution the kernel manages:

Process

Isolation Unit
  • Own virtual address space
  • Own page tables (mm_struct)
  • Own resources (FDs, signals, cwd)
  • Heavyweight context switch

Thread

Execution Unit
  • Shares address space with process
  • Own stack and registers
  • Own TID
  • Lightweight context switch (same CR3)

Kernel Thread

Kernel Worker
  • No user address space
  • Lives entirely in kernel
  • Examples: kswapd, kworker
  • Never crosses the user/kernel boundary (no mode-switch overhead)

6.1 Visualization

Think of a process as a house, and threads as people inside the house:
Process = House
├── Address Space = House structure (walls, roof)
├── Resources
│   ├── File Descriptors = Shared utilities (kitchen, bathroom)
│   ├── Signal Handlers = House rules (fire alarm protocol)
│   └── Current Directory = Current room the house is in
└── Threads = People inside
    ├── Thread 1
    │   ├── Stack = Private bedroom
    │   └── Registers = Personal belongings
    ├── Thread 2
    │   ├── Stack = Private bedroom
    │   └── Registers = Personal belongings
    └── ...

6.2 Why It Matters for System Calls

Example: read() system call
// Thread 1
ssize_t n = read(fd, buf1, size);  // May block
  • Issued by a thread
  • CPU time charged to the process
  • May block only this thread, not the whole process
  • Other threads can continue executing
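A small demonstration of this (ours; compile with -pthread): a blocking read() parks only the calling thread.
// block_one_thread.c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *reader(void *arg) {
    char buf[64];
    (void)arg;
    ssize_t n = read(0, buf, sizeof buf);   // blocks until you type a line
    printf("reader thread woke up with %zd bytes\n", n);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, reader, NULL);
    for (int i = 0; i < 3; i++) {
        printf("main thread still running (%d)\n", i);   // not blocked by the read
        sleep(1);
    }
    pthread_join(t, NULL);
    return 0;
}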
Example: fork() system call
// Process with 5 threads calls fork()
pid_t pid = fork();
  • Creates a new process with copy-on-write address space
  • The child initially has only one thread (the caller)
  • Even though parent had 5 threads, they don’t get copied
  • Child’s single thread continues from the fork() return point
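And a hedged sketch showing the single-threaded child (the count_tasks() helper is ours; compile with -pthread):
// fork_threads.c
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int count_tasks(void) {             // count entries in /proc/self/task
    DIR *d = opendir("/proc/self/task");
    struct dirent *e;
    int n = 0;
    while ((e = readdir(d)) != NULL)
        if (e->d_name[0] != '.')
            n++;
    closedir(d);
    return n;
}

static void *worker(void *arg) { (void)arg; pause(); return NULL; }

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    printf("threads before fork: %d\n", count_tasks());       // 2

    pid_t pid = fork();
    if (pid == 0) {
        printf("threads in child:   %d\n", count_tasks());    // 1: only the caller
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}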
Kernel’s View:
// Linux doesn't distinguish! Everything is a task_struct

// Thread vs Process determined by clone() flags:
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD, ...);  // = Thread
clone(SIGCHLD, ...);                                                            // = Process (like fork)
You will see these distinctions repeatedly in later chapters (Scheduling, Synchronization, Signals, and Linux Internals).

7. System Call Deep Dive: Real Linux Examples

7.1 Example: write() System Call

User space call:
#include <unistd.h>
ssize_t n = write(1, "Hello\n", 6);
Libc wrapper (glibc/sysdeps/unix/sysv/linux/write.c):
ssize_t __write(int fd, const void *buf, size_t count) {
    return INLINE_SYSCALL_CALL(write, fd, buf, count);
}
INLINE_SYSCALL_CALL expands to:
mov rax, 1      ; __NR_write
mov rdi, 1      ; fd
mov rsi, buf    ; buffer
mov rdx, 6      ; count
syscall
Kernel entry (arch/x86/entry/syscall_64.c):
sys_call_table[1] = __x64_sys_write;
Syscall implementation (fs/read_write.c):
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count) {
    return ksys_write(fd, buf, count);
}

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count) {
    struct fd f = fdget_pos(fd);
    if (!f.file)
        return -EBADF;

    loff_t pos = f.file->f_pos;              // simplified; real code uses file_ppos()
    ssize_t ret = vfs_write(f.file, buf, count, &pos);
    f.file->f_pos = pos;                     // write back the advanced offset
    fdput_pos(f);
    return ret;
}
VFS layer (fs/read_write.c):
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos) {
    ssize_t ret;

    // Check permissions
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;

    // Check file operations
    if (!file->f_op->write && !file->f_op->write_iter)
        return -EINVAL;

    // Call filesystem-specific write
    if (file->f_op->write)
        ret = file->f_op->write(file, buf, count, pos);
    else
        ret = new_sync_write(file, buf, count, pos);

    return ret;
}
Filesystem layer (e.g., ext4):
// fs/ext4/file.c
static ssize_t ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) {
    // Allocate disk blocks
    // Update inode metadata
    // Write to page cache
    // Mark pages dirty
    return generic_file_write_iter(iocb, from);
}
Complete Stack:
User Application: write(1, "Hello\n", 6)

Libc Wrapper: __write()

Assembly: syscall instruction

entry_SYSCALL_64

do_syscall_64

sys_call_table[1] → __x64_sys_write

ksys_write(fd, buf, count)

vfs_write(file, buf, count, pos)

ext4_file_write_iter() [if file is on ext4]

generic_file_write_iter()

Page Cache operations

Return bytes written

sysretq back to user space

7.2 Example: getpid() - Fast Path

User space:
pid_t pid = getpid();
Note: getpid() is not one of the vDSO functions; the x86-64 vDSO only exports the clock/time calls and getcpu(). Its historical fast path lived in glibc itself: before glibc 2.25, the PID was cached in the thread control block and returned without entering the kernel.
Conceptual glibc fast path (pre-2.25, simplified; the field access is illustrative):
pid_t getpid(void) {
    pid_t cached = THREAD_SELF->pid;      // cached in the TCB at fork/clone time
    if (cached != 0)
        return cached;                    // no kernel transition

    // Fallback to the real syscall
    return INLINE_SYSCALL_CALL(getpid);
}
The cache was removed in glibc 2.25 because it could return stale values after a raw clone() or unshare(); modern getpid() always issues the real syscall.
Cost: ~5-10 cycles when served from the old cache; ~100-200 cycles as a real syscall

7.3 Example: open() - Complex Path

int fd = open("/home/user/file.txt", O_RDWR | O_CREAT, 0644);
Syscall path:
// fs/open.c
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) {
    return do_sys_open(AT_FDCWD, filename, flags, mode);
}

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) {
    // 1. Allocate file descriptor
    int fd = get_unused_fd_flags(flags);

    // 2. Path lookup
    struct file *f = do_filp_open(dfd, filename, flags, mode);

    // 3. Install in FD table
    fd_install(fd, f);

    return fd;
}
Path lookup (fs/namei.c):
struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op) {
    // RCU-walk: lockless path walk
    // Falls back to ref-walk if needed
    struct nameidata nd;
    struct file *file = path_openat(&nd, op, flags);   /* flags: lookup flags (simplified) */

    // Creates/opens inode
    // Allocates struct file
    // Links to dentry cache

    return file;
}
Work done:
  • String parsing (/home/user/file.txt → components)
  • Dentry cache lookups (hot path)
  • Inode cache lookups
  • Disk reads (cold path, if not cached)
  • Permission checks (each directory component)
  • File allocation
  • FD table modification
Cost: Highly variable
  • Hot (all cached): 1-2 μs
  • Cold (disk reads): 1-10 ms

8. Performance Analysis of System Calls

8.1 Measuring Syscall Overhead

Microbenchmark:
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <fcntl.h>   /* open(), O_WRONLY */

#define ITERATIONS 10000000

int main() {
    struct timespec start, end;

    // Benchmark getpid() (historically cached in user space by glibc)
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        getpid();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    long ns = (end.tv_sec - start.tv_sec) * 1000000000L +
              (end.tv_nsec - start.tv_nsec);
    printf("getpid: %ld ns per call\n", ns / ITERATIONS);

    // Benchmark write() to /dev/null
    int fd = open("/dev/null", O_WRONLY);
    char buf[1] = {0};

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        write(fd, buf, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    ns = (end.tv_sec - start.tv_sec) * 1000000000L +
         (end.tv_nsec - start.tv_nsec);
    printf("write: %ld ns per call\n", ns / ITERATIONS);

    return 0;
}
Typical Results (modern Intel/AMD):
getpid: 15 ns per call      ← served from user space (old glibc PID cache), no syscall
write:  200 ns per call     ← Real syscall

8.2 Using perf to Analyze

# Count syscalls
perf stat -e 'syscalls:sys_enter_*' ./program

# Trace individual syscalls
strace -c ./program
strace -T ./program  # With timing

# Profile syscall overhead
perf record -e raw_syscalls:sys_enter,raw_syscalls:sys_exit ./program
perf report

8.3 Syscall Batching Strategies

Bad: Many small syscalls
// DON'T DO THIS
for (int i = 0; i < 1000; i++) {
    write(fd, &data[i], 1);  // 1000 syscalls!
}
Good: One large syscall
// DO THIS
write(fd, data, 1000);  // 1 syscall
Better: Vectored I/O
struct iovec iov[10];
for (int i = 0; i < 10; i++) {
    iov[i].iov_base = buffers[i];
    iov[i].iov_len = sizes[i];
}
writev(fd, iov, 10);  // 1 syscall for multiple buffers
Best: Asynchronous I/O (io_uring)
// Submit many operations with zero syscalls
io_uring_prep_write(sqe1, fd, buf1, len1, offset1);
io_uring_prep_write(sqe2, fd, buf2, len2, offset2);
io_uring_prep_write(sqe3, fd, buf3, len3, offset3);
io_uring_submit(ring);  // One syscall for all 3!

9. Security Implications

9.1 Spectre/Meltdown and Syscalls

Meltdown (2018) exploited speculative execution:
// Attack code
char *kernel_addr = (char *)0xFFFF888000000000;
char value = *kernel_addr;  // Would normally fault

// But CPU speculatively loads it!
// Side channel timing attack can extract value
KPTI Fix (Kernel Page Table Isolation): Before KPTI:
User process page tables included kernel mappings
→ Fast syscalls (no CR3 switch)
→ Vulnerable to Meltdown
After KPTI:
User process page tables: User space only
Kernel page tables: Kernel + user space
→ CR3 switch on every syscall/interrupt
→ Immune to Meltdown
→ 5-30% performance hit (mitigated by PCID)
PCID (Process Context Identifier):
Instead of flushing TLB on CR3 switch:
Tag TLB entries with PCID
→ User TLB entries coexist with kernel TLB entries
→ Much lower performance impact

9.2 Syscall Filtering with seccomp

seccomp-BPF: Filter which syscalls a process can make
#include <stddef.h>         /* offsetof */
#include <sys/prctl.h>      /* prctl(), PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP */
#include <sys/syscall.h>    /* __NR_read, __NR_write, __NR_exit */
#include <linux/seccomp.h>
#include <linux/filter.h>

// Allow only read, write, exit
struct sock_filter filter[] = {
    // Load syscall number
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),

    // Allow read (0)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow write (1)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow exit (60)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Kill process for anything else
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};

struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

// Now can only use read, write, exit
Use cases:
  • Docker containers
  • Chrome sandbox
  • systemd service hardening

10. Interview Deep Dive Questions

Question 1: Walk through exactly what happens, step by step, when a user program makes a system call on x86-64.

Complete Answer:

User Space:
  1. Application loads syscall number into RAX
  2. Arguments into RDI, RSI, RDX, R10, R8, R9
  3. Executes SYSCALL instruction
CPU Hardware:
  4. Saves user RIP to RCX (return address)
  5. Saves user RFLAGS to R11
  6. Loads kernel RIP from IA32_LSTAR MSR
  7. Switches to Ring 0 (kernel mode)
  8. Jumps to entry_SYSCALL_64
Kernel Entry:
  9. SWAPGS (switch to kernel GS register)
  10. Save user RSP, load kernel stack
  11. Build pt_regs structure (save all registers)
  12. Call do_syscall_64(pt_regs, nr)
Kernel Dispatch:
  13. Validate syscall number
  14. Look up sys_call_table[nr]
  15. Call syscall handler function
  16. Function does actual work (VFS, scheduler, etc.)
  17. Return value placed in RAX
Kernel Exit:
  18. Restore registers from pt_regs
  19. Restore user RSP
  20. SWAPGS back to user GS
  21. Execute SYSRETQ
CPU Hardware:
  22. Restore RIP from RCX
  23. Restore RFLAGS from R11
  24. Switch to Ring 3 (user mode)
  25. Jump to user code
Cost: ~100-200 CPU cycles on modern hardware
Question 2: What is the vDSO and why does it exist?

Answer:
Problem: Switching into kernel mode is expensive (~100-200 cycles). For frequently called syscalls like gettimeofday(), this overhead dominates the useful work.
Solution: the vDSO (Virtual Dynamic Shared Object).
How it works:
  1. Kernel maps a page of executable code into every process
  2. This page contains implementations of certain syscalls
  3. Kernel periodically updates read-only data in this page
  4. Libc resolves these functions to vDSO instead of syscalls
  5. Calls execute entirely in user space (no mode switch)
Syscalls that use vDSO:
  • gettimeofday() - reads kernel’s time data
  • clock_gettime() - reads clock data
  • getcpu() - reads current CPU number
  • time() - simplified time call
Performance impact:
Without vDSO: ~200 ns per call
With vDSO:    ~20 ns per call
Speedup: 10x
Why only certain syscalls?:
  • Must be read-only (no side effects)
  • Data must be safely readable from user space
  • Data changes must be atomic/consistent
  • Cannot require kernel state modifications
Security: vDSO code is kernel-provided, mapped at random addresses (ASLR), and cannot be modified by user space.
Question 3: What is SWAPGS and why is it security-critical?

Answer:
SWAPGS is an x86-64 instruction that atomically swaps the GS base register with a kernel-specific per-CPU value.
Purpose:
  • GS register points to per-CPU data structures
  • In user mode: GS points to user thread-local storage (TLS)
  • In kernel mode: GS points to kernel per-CPU area
Why it’s needed:
User mode:
  GS:0 → thread ID
  GS:8 → errno location
  GS:16 → TLS data

Kernel mode:
  GS:0 → current task_struct pointer
  GS:8 → CPU number
  GS:16 → kernel stack pointer
Security-critical:
  1. The SWAPGS attack (CVE-2019-1125, a Spectre v1 variant) exploited speculation around a conditionally skipped SWAPGS
  2. CPU could speculatively execute kernel code with user GS
  3. Kernel would read wrong data, leak information
Attack scenario:
// User sets GS to malicious value
syscall();  // Enter kernel

// If SWAPGS not executed:
current = *(task_struct **)GS:0;  // Reads attacker-controlled memory!
Mitigation:
  • SWAPGS must be first instruction in syscall entry
  • Must happen before any GS-relative memory access
  • Hardware barriers prevent speculative execution reordering
Modern defense (FENCE after SWAPGS):
entry_SYSCALL_64:
    SWAPGS
    lfence              ; Speculation barrier
    mov rsp, gs:0x1c    ; Now safe to use GS
Question 4: How does the cost of a system call compare to a regular function call?

Answer:
Function Call (user → user):
call foo
  push return_address
  jmp foo
foo:
  ; function body
  ret
Cost: 5-10 cycles
  • Push return address
  • Branch (usually predicted correctly)
  • Return (predicted via RAS - Return Address Stack)
System Call (user → kernel → user):
syscall
  ; Save RIP, RFLAGS
  ; Switch privilege level
  ; Switch stack
  ; Jump to kernel
[kernel work]
sysretq
  ; Restore RIP, RFLAGS
  ; Switch privilege level
  ; Switch stack
  ; Jump to user
Cost: 100-200+ cycles
  • Mode switch overhead: ~50 cycles
  • Register save/restore: ~20 cycles
  • TLB effects (if PCID not used): ~50 cycles
  • Cache effects (kernel code not in L1): ~50+ cycles
Real-world comparison:
// Function call
int add(int a, int b) { return a + b; }
add(1, 2);  // ~10 cycles

// Syscall
getpid();   // ~200 cycles (a real syscall)
Why syscalls are slow:
  1. Privilege level transition (Ring 3 → Ring 0 → Ring 3)
  2. Page table switch (if KPTI enabled)
  3. TLB flush (if PCID not supported)
  4. Cache pollution (kernel code evicts user code)
  5. Security checks and barriers
Ratio: Syscalls are 20-40x slower than function calls.
Optimization strategies:
  • Batch operations (writev vs many writes)
  • Use vDSO when available
  • Use memory mapping (mmap) to avoid read/write syscalls
  • Use io_uring for async I/O with minimal syscalls
Question 5: How do system call mechanisms differ across x86-64, ARM64, and RISC-V?

Answer:
Aspect        x86-64                        ARM64     RISC-V
Instruction   SYSCALL                       SVC #0    ECALL
Syscall #     RAX                           x8        a7
Args          RDI, RSI, RDX, R10, R8, R9    x0-x5     a0-a5
Return        RAX                           x0        a0
Max Args      6                             6         6
x86-64:
mov rax, 1      ; syscall number
mov rdi, 1      ; arg1
mov rsi, buf    ; arg2
mov rdx, len    ; arg3
syscall         ; trap to kernel
; return value in RAX
ARM64:
mov x8, #64     ; syscall number (write)
mov x0, #1      ; arg1 (fd)
ldr x1, =buf    ; arg2 (buffer)
mov x2, len     ; arg3 (count)
svc #0          ; supervisor call
; return value in x0
RISC-V:
li a7, 64       ; syscall number
li a0, 1        ; arg1
la a1, buf      ; arg2
li a2, len      ; arg3
ecall           ; environment call
; return value in a0
Key differences:
  1. Calling convention:
    • x86-64 uses different registers for syscalls vs function calls
    • ARM/RISC-V use same registers for both
  2. Syscall numbering:
    • Each architecture has different syscall numbers
    • write is #1 on x86-64, #64 on ARM64/RISC-V
    • Forces architecture-specific syscall tables
  3. Mode switching:
    • x86: Ring 0 vs Ring 3
    • ARM: EL0 vs EL1
    • RISC-V: U-mode vs S-mode
  4. Performance:
    • Similar overhead (~100-200 cycles)
    • RISC architectures slightly cleaner (fewer legacy modes)
    • All benefit from vDSO equally

11. Hands-On Practice

Lab 1: Tracing System Calls

# Trace all syscalls
strace ./program

# Count syscalls by type
strace -c ./program

# Trace only open/read/write
strace -e trace=open,read,write ./program

# Show timing per syscall
strace -T ./program

# Attach to running process
strace -p <pid>

Lab 2: Minimal Syscall (No libc)

// syscall_raw.c - Direct syscall without libc
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

// Don't use printf (it uses syscalls internally)
static inline long my_write(int fd, const void *buf, size_t count) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a" (ret)
        : "0"(__NR_write), "D"(fd), "S"(buf), "d"(count)
        : "rcx", "r11", "memory"
    );
    return ret;
}

void _start() {
    const char msg[] = "Hello from raw syscall!\n";
    my_write(1, msg, sizeof(msg) - 1);

    __asm__ volatile (
        "mov $60, %%rax\n"  // __NR_exit
        "xor %%rdi, %%rdi\n"
        "syscall"
        ::: "rax", "rdi"
    );
}
Compile without libc:
gcc -nostdlib -static -o syscall_raw syscall_raw.c
./syscall_raw

Lab 3: Benchmark Syscall Overhead

// benchmark_syscall.c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/syscall.h>   /* SYS_getpid */

#define ITERATIONS 10000000

static inline long long rdtsc() {
    unsigned int lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((long long)hi << 32) | lo;
}

int main() {
    long long start, end;

    // Warm up
    for (int i = 0; i < 1000; i++) getpid();

    // Benchmark getpid() via glibc (older glibc served this from a user-space cache)
    start = rdtsc();
    for (int i = 0; i < ITERATIONS; i++) {
        getpid();
    }
    end = rdtsc();
    printf("getpid: %lld cycles/call\n", (end - start) / ITERATIONS);

    // Benchmark syscall(__NR_getpid) (forces real syscall)
    start = rdtsc();
    for (int i = 0; i < ITERATIONS; i++) {
        syscall(SYS_getpid);
    }
    end = rdtsc();
    printf("syscall(getpid): %lld cycles/call\n", (end - start) / ITERATIONS);

    return 0;
}

Lab 4: Examine vDSO

// find_vdso.c
#include <stdio.h>
#include <stdlib.h>   /* system() */
#include <string.h>

int main(int argc, char **argv, char **envp) {
    // Skip past environment variables
    char **p = envp;
    while (*p) p++;

    // Now at auxiliary vector
    unsigned long *auxv = (unsigned long *)(p + 1);

    while (*auxv) {
        if (auxv[0] == 33) {  // AT_SYSINFO_EHDR
            printf("vDSO base address: 0x%lx\n", auxv[1]);
        }
        auxv += 2;
    }

    // Also check /proc/self/maps
    system("cat /proc/self/maps | grep vdso");

    return 0;
}

Summary for Senior Engineers

System Calls Aren't Free

Even with SYSCALL, there is a 100-200 cycle cost. Batch your syscalls (writev, io_uring).

vDSO is Magic

Some “syscalls” execute entirely in user space. This is why they don’t show in strace.

SWAPGS is Critical

Primary target for Spectre variants. Must happen before any kernel memory access.

Syscall Table is Kernel's API

Modifying it is how rootkits hide. Kernel marks it read-only after boot.
Key Takeaways:
  1. Privilege separation is enforced by hardware (CPU rings/modes)
  2. System calls are the only legitimate way to cross the user/kernel boundary
  3. vDSO eliminates syscall overhead for frequently-used operations
  4. Security mechanisms (KPTI, SWAPGS, seccomp) protect the syscall interface
  5. Performance matters: Modern systems minimize context switches

Next: CPU Architectures & Microarchitecture