OS Fundamentals & System Call Internals

Operating Systems exist to manage hardware and provide a safe abstraction for applications. A “Senior” engineer must understand the physical transition between these two worlds.
Interview Frequency: Extremely High (90%+ of system programming interviews)
Key Topics: System calls, kernel/user space, vDSO, context switching, privilege levels
Time to Master: 12-15 hours
Prerequisites: C programming, basic computer architecture

0. What is an Operating System?

At the highest level, an Operating System is a resource manager and isolation layer:
  • Resource Manager:
    • Multiplexes CPU time between many processes.
    • Allocates and reclaims memory, files, sockets, and devices.
    • Schedules and prioritizes work according to policy (throughput, latency, fairness, deadlines).
  • Isolation & Protection Layer:
    • Prevents one program from corrupting another program’s memory.
    • Prevents untrusted code from directly touching hardware.
    • Enforces security boundaries (user vs kernel, containers, VMs).
A helpful analogy is a city operating authority:
  • Streets/highways ⇢ CPU cores and buses.
  • Buildings ⇢ processes.
  • Rooms ⇢ threads.
  • Zoning rules and permits ⇢ permissions and security policies.
  • Traffic lights ⇢ synchronization and scheduling.
The OS makes the city feel orderly and predictable to its “citizens” (programs) even though underneath, the physical world (hardware) is chaotic and failure-prone.

0.1 Core Responsibilities

Every mainstream OS (Linux, Windows, macOS, BSD, RTOS variants) implements the same core ideas:
  • Abstraction: Present simple interfaces (files, sockets, processes) instead of raw devices and registers.
  • Virtualization: Make a single physical CPU and memory look like many virtual CPUs and address spaces.
  • Isolation: Ensure faults in one address space do not corrupt others.
  • Coordination: Provide primitives (locks, signals, pipes, futexes) so concurrent entities can cooperate.
  • Accounting: Track which process used how much CPU, memory, I/O; enforce quotas and limits.

0.2 Types of Operating Systems

Monolithic Kernels

Examples: Linux, traditional Unix
Most services (drivers, file systems, networking) run in kernel mode. Fast, but the kernel is large.

Microkernels

Examples: seL4, Minix3, QNX
Move many services to user space for stronger isolation, at the cost of more message-passing overhead.

Hybrid Kernels

Examples: Windows NT, macOS (XNU)
Combine monolithic and microkernel approaches for balance.

Real-Time OS

Examples: FreeRTOS, VxWorks
Trade general-purpose flexibility for strict deadline guarantees.
In this course, we primarily focus on Linux as the concrete example, but the mental models transfer to all of these.

0.5 From Source Code to Running Process

To make OS fundamentals concrete, walk through what happens when you compile and run a simple C program:
// main.c
#include <stdio.h>

int main(int argc, char **argv) {
    printf("hello\n");
    return 0;
}

Step 1: Compilation and Linking

Preprocessing:
gcc -E main.c -o main.i
Expands #include and macros into a single translation unit:
// Thousands of lines from stdio.h are now inserted
// ...
int main(int argc, char **argv) {
    printf("hello\n");
    return 0;
}
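The compiler, assembler, and linker then turn that translation unit into an executable file in ELF format.
ELF Structure (Executable and Linkable Format):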
┌─────────────────────────┐
│  ELF Header             │  Magic number, entry point, section offsets
├─────────────────────────┤
│  Program Headers        │  How to load segments into memory
├─────────────────────────┤
│  .text (Code)           │  Machine instructions (read-only)
├─────────────────────────┤
│  .rodata (Constants)    │  "hello" string lives here
├─────────────────────────┤
│  .data (Initialized)    │  Initialized global variables
├─────────────────────────┤
│  .bss (Uninitialized)   │  Uninitialized globals (zero-filled by loader)
├─────────────────────────┤
│  Symbol Table           │  Function names, debugging info
├─────────────────────────┤
│  Section Headers        │  Metadata about each section
└─────────────────────────┘
At this point you have a program on disk, not yet a process.

Step 2: Shell Creates a New Process

When you run:
$ ./main
What the shell does:
  1. Your shell (itself a process) parses the command.
  2. The shell calls fork():
    • The kernel creates a child process by copying the parent’s PCB and page tables (copy-on-write).
    • Parent and child now both exist; they differ only in the return value of fork().
// Inside bash or sh:
pid_t pid = fork();
if (pid == 0) {
    // Child process
    execve("./main", argv, envp);
} else {
    // Parent process
    wait(NULL);  // Wait for child to finish
}

Step 3: Child Calls execve()

In the child:
  1. The shell calls execve("./main", ...).
  2. The kernel:
    • Reads the ELF headers from disk.
    • Allocates a new address space.
    • Maps code, data, stack, and shared libraries into that space.
    • Sets up the initial user stack with argc/argv and environment.
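    • Sets the program counter to the entry point: the C runtime's _start for a statically linked binary, or the dynamic linker (ld-linux.so), which later jumps to _start.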
Memory Layout After execve():
High Address (0x7FFF...)
┌─────────────────────────┐
│  Kernel Space           │ ← Not accessible from user mode
├─────────────────────────┤
│  Stack (grows down ↓)   │ ← argc, argv, environment vars
│         ...             │
├─────────────────────────┤
│  Memory Mapped Region   │ ← Shared libraries (libc.so)
│  (libc, ld-linux.so)    │
├─────────────────────────┤
│  Heap (grows up ↑)      │ ← malloc() allocates here
│         ...             │
├─────────────────────────┤
│  .bss (uninitialized)   │ ← Zero-filled global variables
├─────────────────────────┤
│  .data (initialized)    │ ← Initialized global variables
├─────────────────────────┤
│  .text (code)           │ ← Your program's machine code
└─────────────────────────┘
Low Address (0x0000...)
At this moment the program becomes a process with its own PID and address space.

Step 4: C Runtime → main → Exit

  1. The C runtime (crt1.o) runs first, initializing the runtime and calling your main().
  2. Your code executes (printf("hello\n")), which itself issues syscalls under the hood (write() on stdout).
  3. When main returns, the runtime calls exit(), which:
    • Flushes stdio buffers.
    • Invokes the exit_group syscall.
    • Lets the kernel tear down the process (free memory, close FDs); the PCB itself is released once the parent reaps the child with wait().
Complete Flow Diagram:
User types "./main"

Shell receives command (shell is a running process, PID 1000)

Shell calls fork() ──────────┐
        ↓                     ↓
Parent (PID 1000)        Child (PID 1001)
calls wait()             calls execve("./main")
blocks...                     ↓
                         Kernel loads ELF binary
                         Sets up address space
                         Maps .text, .data, stack
                         Jumps to _start

                         C runtime initializes

                         main() executes
                         printf() → write syscall

                         main() returns 0

                         exit(0) → exit_group syscall

                         Kernel cleans up process

Parent's wait() returns
Shell prints next prompt
This entire path is the lifecycle of a simple process; later chapters (Processes, Virtual Memory, Scheduling, File Systems) each zoom into one part of this story.

1. The Kernel vs. User Space

The CPU hardware enforces the boundary.

1.1 Privilege Levels (Protection Rings)

Four Rings (though most OSes only use 2):
Ring 0 (Kernel Mode)    ← Full hardware access
Ring 1 (Device Drivers) ← Unused in modern OSes
Ring 2 (Device Drivers) ← Unused in modern OSes
Ring 3 (User Mode)      ← Restricted instructions
The Current Privilege Level (CPL) is stored in the low two bits of the CS (Code Segment) register.
Privileged Instructions (only Ring 0):
  • HLT - halt the CPU
  • CLI/STI - disable/enable interrupts
  • MOV CR3, reg - change page tables
  • LGDT/LIDT - load GDT/IDT
  • IN/OUT - direct hardware I/O (on some systems)

1.2 What Each Mode Can Do

Capability                    User Space (Ring 3 / EL0)   Kernel Space (Ring 0 / EL1)
Execute normal instructions   Yes                         Yes
Access own virtual memory     Yes                         Yes
Access all physical memory    No                          Yes
Modify page tables            No                          Yes
Disable interrupts            No                          Yes
Execute I/O instructions      No                          Yes
Halt the CPU                  No                          Yes
Load kernel modules           No                          Yes
Why the restriction?
// If user space could do this:
asm("cli");  // Disable interrupts
while(1);    // Infinite loop
// The entire system would freeze!

// Or this:
unsigned int *kernel_memory = (unsigned int *)0xFFFF888000000000;
*kernel_memory = 0x90909090;  // Overwrite kernel code
// System compromised!
The hardware enforces that attempts to execute privileged instructions in user mode trigger a General Protection Fault (x86) or Illegal Instruction exception (ARM/RISC-V), which the kernel handles by terminating the offending process.

2. System Call Evolution (x86-64)

How does a program ask the kernel for help?

2.1 Legacy: INT 0x80 (i386)

In the 32-bit era, applications used a software interrupt.
; Legacy 32-bit Linux syscall
mov eax, 4          ; syscall number (write)
mov ebx, 1          ; file descriptor (stdout)
mov ecx, msg        ; buffer
mov edx, 13         ; count
int 0x80            ; Trap to kernel
What happens:
  1. CPU looks up interrupt vector 0x80 in the IDT (Interrupt Descriptor Table) and checks the gate's privilege level.
  2. CPU switches to the kernel stack (taken from the TSS) and pushes the user SS, ESP, EFLAGS, CS, and EIP.
  3. CPU jumps to the kernel’s interrupt handler.
  4. Handler saves the remaining registers.
  5. Handler calls the appropriate syscall function.
  6. Handler returns using IRET.
Problem: Very slow (~300-500 cycles) due to:
  • Interrupt controller overhead
  • Full register save/restore
  • Stack switching
  • Permission checks

2.2 Modern: SYSCALL (AMD) / SYSENTER (Intel)

x86-64 introduced a dedicated instruction for syscalls.
SYSCALL Instruction (AMD64):
; Modern 64-bit Linux syscall
mov rax, 1          ; syscall number (write)
mov rdi, 1          ; arg1: file descriptor
mov rsi, msg        ; arg2: buffer
mov rdx, 13         ; arg3: count
syscall             ; Fast system call
Hardware Magic:
  1. No IDT lookup: CPU jumps to address stored in IA32_LSTAR MSR (Model Specific Register).
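  2. No automatic stack switch: The CPU leaves RSP untouched; the kernel uses SWAPGS (which loads IA32_KERNEL_GS_BASE) to reach its per-CPU data and load its own stack.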
  3. Minimal save: Only saves RIP and RFLAGS to RCX and R11.
Kernel Entry Point (simplified):
; Entry point stored in LSTAR MSR
entry_SYSCALL_64:
    SWAPGS                  ; Switch to kernel GS (per-CPU area)
    mov    QWORD PTR gs:0x14, rsp   ; Save user stack
    mov    rsp, QWORD PTR gs:0x1c   ; Load kernel stack

    push   rax              ; Save registers
    push   rcx              ; (RCX = user RIP)
    push   r11              ; (R11 = user RFLAGS)
    ; ... save more registers

    call   do_syscall_64    ; C function dispatch

    ; ... restore registers
    pop    r11
    pop    rcx
    pop    rax

    mov    rsp, QWORD PTR gs:0x14   ; Restore user stack
    SWAPGS                  ; Switch back to user GS
    sysretq                 ; Return to user space
Performance: Much faster (~100-200 cycles). The savings come from:
  • Direct jump (no table lookup)
  • Minimal register save/restore
  • No interrupt controller involved

2.3 ARM64: SVC Instruction

; ARM64 syscall
mov x8, #64         ; syscall number (write)
mov x0, #1          ; arg1: fd
ldr x1, =msg        ; arg2: buffer
mov x2, #13         ; arg3: count
svc #0              ; SuperVisor Call
Hardware behavior:
  • Saves PC to ELR_EL1
  • Saves PSTATE to SPSR_EL1
  • Jumps to exception vector
  • Kernel dispatches based on x8

2.4 RISC-V: ECALL Instruction

; RISC-V syscall
li a7, 64           ; syscall number (write)
li a0, 1            ; arg1: fd
la a1, msg          ; arg2: buffer
li a2, 13           ; arg3: count
ecall               ; Environment call
Cost Comparison Table:
Mechanism          Typical Cycles   Use Case
INT 0x80           300-500          Legacy 32-bit x86
SYSENTER           100-150          Intel x86-32 (deprecated)
SYSCALL            100-200          Modern x86-64
SVC                100-200          ARM64
ECALL              100-200          RISC-V
vDSO (no switch)   5-20             Kernel-provided user code

3. The vDSO (Virtual Dynamic Shared Object)

Some system calls are called thousands of times per second (e.g., gettimeofday(), clock_gettime()). Switching to kernel mode every time is a massive waste of CPU.

3.1 How it Works

The vDSO is a special page of memory that the kernel maps into every user process’s address space. This page contains:
  1. Code: Executable functions that run in user mode.
  2. Data: Read-only kernel data (like current time).
Process Address Space:
┌─────────────────────────┐
│  Kernel Space           │
├─────────────────────────┤
│  Stack                  │
├─────────────────────────┤
│  [vdso] ← Magic page!   │ ← Kernel-provided, user-executable
│    - gettimeofday()     │
│    - clock_gettime()    │
│    - getcpu()           │
├─────────────────────────┤
│  Shared Libraries       │
│  (libc.so)              │
├─────────────────────────┤
│  Heap                   │
├─────────────────────────┤
│  .text                  │
└─────────────────────────┘
Example: gettimeofday() Without vDSO:
// Slow path: requires syscall
struct timeval tv;
gettimeofday(&tv, NULL);
// → syscall → kernel mode → read kernel clock → return
// Cost: ~200 cycles
Example: gettimeofday() With vDSO:
// Fast path: executes in user mode
struct timeval tv;
gettimeofday(&tv, NULL);
// → calls vDSO function → reads shared memory → returns
// Cost: ~20 cycles (no mode switch!)

3.2 Implementation Details

Kernel Side (sets up vDSO):
// In arch/x86/entry/vdso/vdso.c
static int __init init_vdso(void) {
    // Map vDSO page into every process
    vdso_pages[0] = alloc_page(GFP_KERNEL);
    copy_vdso_to_page(vdso_pages[0]);
    return 0;
}

// Kernel periodically updates shared time data
void update_vsyscall(struct timekeeper *tk) {
    vdso_data->wall_time_sec = tk->wall_time.tv_sec;
    vdso_data->wall_time_nsec = tk->wall_time.tv_nsec;
}
User Side (vDSO function):
// Inside vDSO (simplified)
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) {
    // Read from shared memory (no syscall!)
    ts->tv_sec = vdso_data->wall_time_sec;
    ts->tv_nsec = vdso_data->wall_time_nsec;
    return 0;
}
Shared data: The kernel periodically writes the current time into the data part of the vDSO mapping. A sequence counter (seqlock) lets readers detect a concurrent update and retry, so user code never observes a half-written timestamp.

3.3 Why Some Syscalls Can’t Use vDSO

Safe for vDSO:
  • Read-only operations
  • No side effects
  • Data changes slowly or predictably
Cannot use vDSO:
  • write() - modifies kernel state
  • fork() - creates process
  • mmap() - changes address space
  • open() - allocates file descriptor
Performance Impact:
// Benchmark: 10 million calls
clock_gettime(CLOCK_REALTIME, &ts);

// With vDSO:     ~200 ms (20 ns per call)
// Without vDSO:  ~2000 ms (200 ns per call)
// Speedup: 10x

3.4 Finding the vDSO

# View process mappings
$ cat /proc/self/maps | grep vdso
7ffd1a3fe000-7ffd1a400000 r-xp 00000000 00:00 0    [vdso]

# Look for vDSO-related symbols in libc
$ objdump -T /lib/x86_64-linux-gnu/libc.so.6 | grep vdso
# (libc wraps vDSO calls; the vDSO itself is not a file on disk)

# Using LD_SHOW_AUXV
$ LD_SHOW_AUXV=1 ./program 2>&1 | grep SYSINFO
AT_SYSINFO_EHDR: 0x7ffd1a3fe000 vDSO base address
Result: Zero context switches. The system call is “executed” entirely in user space.

4. vsyscall: The Legacy Fixed Address

Before the vDSO, there was vsyscall.

4.1 The Problem with vsyscall

Fixed virtual address: 0xffffffffff600000

Every process had:
┌───────────────────────────┐
│ 0xffffffffff600000:       │
│   gettimeofday code       │ ← Same address in EVERY process
│   time code               │
│   getcpu code             │
└───────────────────────────┘
Security Issue: Predictable addresses enable ROP attacks (Return-Oriented Programming). An attacker could reliably:
// Exploit code
return_address = 0xffffffffff600000 + offset;
// Execute attacker-controlled syscalls

4.2 Modern Linux Solution

vsyscall is now emulated:
// In arch/x86/entry/vsyscall/vsyscall_64.c
static bool is_vsyscall_vaddr(unsigned long vaddr) {
    return vaddr >= VSYSCALL_ADDR && vaddr < VSYSCALL_ADDR + PAGE_SIZE;
}

// Trap and emulate
do_page_fault() {
    if (is_vsyscall_vaddr(address)) {
        // Emulate the syscall (SLOW!)
        emulate_vsyscall();
        return;
    }
}
Result: vsyscall addresses still work (for legacy binaries), but they trap to the kernel instead of executing directly. Modern code uses vDSO instead.

5. Kernel Entry: entry_SYSCALL_64

When the SYSCALL instruction is executed, the CPU jumps to this assembly entry point in the kernel.

5.1 Complete Entry Path (x86-64)

; arch/x86/entry/entry_64.S

ENTRY(entry_SYSCALL_64)
    /*
     * Interrupts are off on entry.
     * We are in kernel mode, but need to switch stacks.
     */

    /* 1. SWAPGS: Switch GS register to kernel per-CPU area */
    SWAPGS

    /* 2. Save user stack pointer */
    movq    %rsp, PER_CPU_VAR(rsp_scratch)

    /* 3. Load kernel stack */
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

    /* 4. Build pt_regs structure on kernel stack */
    pushq   $__USER_DS              /* User data segment */
    pushq   PER_CPU_VAR(rsp_scratch)  /* User RSP */
    pushq   %r11                    /* User RFLAGS (saved by SYSCALL) */
    pushq   $__USER_CS              /* User code segment */
    pushq   %rcx                    /* User RIP (saved by SYSCALL) */

    pushq   %rax    /* Syscall number */
    pushq   %rdi    /* Arg 1 */
    pushq   %rsi    /* Arg 2 */
    pushq   %rdx    /* Arg 3 */
    pushq   %r10    /* Arg 4 (RCX was overwritten) */
    pushq   %r8     /* Arg 5 */
    pushq   %r9     /* Arg 6 */

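    /* 5. Call the C dispatcher: do_syscall_64(regs, nr) */
    movq    %rsp, %rdi              /* Arg 1: pointer to the saved registers */
    movslq  %eax, %rsi              /* Arg 2: syscall number */
    call    do_syscall_64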

    /* 6. Restore user registers from pt_regs */
    popq    %r9
    popq    %r8
    popq    %r10
    popq    %rdx
    popq    %rsi
    popq    %rdi
    popq    %rax    /* Return value */

    popq    %rcx    /* User RIP */
    popq    %r11    /* Discard CS (overwritten by the next pop) */
    popq    %r11    /* User RFLAGS */
    popq    %rsp    /* User RSP (the saved DS slot is left behind on the kernel stack) */

    /* 7. SWAPGS back to user GS */
    SWAPGS

    /* 8. SYSRETQ: Return to user space */
    sysretq
END(entry_SYSCALL_64)

5.2 The C Dispatcher

// In arch/x86/entry/common.c

__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr) {
    // Security checks
    nr = syscall_enter_from_user_mode(regs, nr);

    // Validate syscall number
    if (likely(nr < NR_syscalls)) {
        // Look up in syscall table and call
        regs->ax = sys_call_table[nr](regs);
    }

    syscall_exit_to_user_mode(regs);
}

5.3 The Syscall Table

// In arch/x86/entry/syscall_64.c

const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    [0] = __x64_sys_read,
    [1] = __x64_sys_write,
    [2] = __x64_sys_open,
    [3] = __x64_sys_close,
    [4] = __x64_sys_stat,
    // ... 300+ more syscalls
    [39] = __x64_sys_getpid,
    [57] = __x64_sys_fork,
    [59] = __x64_sys_execve,
    // ...
};
Security Note: The syscall table is marked read-only after boot. Modifying it is a common rootkit technique.

5.4 Complete Flow Diagram

User Program                  CPU Hardware              Kernel
─────────────────────────────────────────────────────────────
mov rax, 1
mov rdi, 1
mov rsi, buffer
mov rdx, 13
syscall ──────────────────>  SYSCALL instruction
                             - Save RIP → RCX
                             - Save RFLAGS → R11
                             - Load RIP from LSTAR MSR ───────> entry_SYSCALL_64:
                             - Switch to Ring 0                   SWAPGS
                                                                  Save user RSP
                                                                  Load kernel stack
                                                                  Push registers (pt_regs)
                                                                  call do_syscall_64

                                                                  sys_call_table[1]

                                                                  __x64_sys_write()

                                                                  ksys_write()

                                                                  vfs_write()

                                                                  [... kernel work ...]

                                                                  return bytes_written

                                                                  Pop registers
                                                                  Restore user RSP
                                                                  SWAPGS
                             SYSRETQ <────────────────────────── sysretq
                             - Restore RIP from RCX
                             - Restore RFLAGS from R11
                             - Switch to Ring 3
RAX = bytes_written <───────
continue program...

6. Processes vs. Threads vs. Kernel Tasks

Before we dive into system call micro-details, it is critical to distinguish the units of execution the kernel manages:

Process

Isolation Unit
  • Own virtual address space
  • Own page tables (mm_struct)
  • Own resources (FDs, signals, cwd)
  • Heavyweight context switch

Thread

Execution Unit
  • Shares address space with process
  • Own stack and registers
  • Own TID
  • Lightweight context switch (same CR3)

Kernel Thread

Kernel Worker
  • No user address space
  • Lives entirely in kernel
  • Examples: kswapd, kworker
  • Never crosses the user/kernel boundary (calls kernel functions directly instead of making syscalls)

6.1 Visualization

Think of a process as a house, and threads as people inside the house:
Process = House
├── Address Space = House structure (walls, roof)
├── Resources
│   ├── File Descriptors = Shared utilities (kitchen, bathroom)
│   ├── Signal Handlers = House rules (fire alarm protocol)
│   └── Current Directory = Current room the house is in
└── Threads = People inside
    ├── Thread 1
    │   ├── Stack = Private bedroom
    │   └── Registers = Personal belongings
    ├── Thread 2
    │   ├── Stack = Private bedroom
    │   └── Registers = Personal belongings
    └── ...

6.2 Why It Matters for System Calls

Example: read() system call
// Thread 1
ssize_t n = read(fd, buf1, size);  // May block
  • Issued by a thread
  • CPU time charged to the process
  • May block only this thread, not the whole process
  • Other threads can continue executing
Example: fork() system call
// Process with 5 threads calls fork()
pid_t pid = fork();
  • Creates a new process with copy-on-write address space
  • The child initially has only one thread (the caller)
  • Even though parent had 5 threads, they don’t get copied
  • Child’s single thread continues from the fork() return point
Kernel’s View:
// Linux doesn't distinguish! Everything is a task_struct

// Thread vs Process determined by clone() flags (arguments abbreviated):
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD);  // = Thread
clone(SIGCHLD);                                                           // = Process
You will see these distinctions repeatedly in later chapters (Scheduling, Synchronization, Signals, and Linux Internals).

7. System Call Deep Dive: Real Linux Examples

7.1 Example: write() System Call

User space call:
#include <unistd.h>
ssize_t n = write(1, "Hello\n", 6);
Libc wrapper (glibc/sysdeps/unix/sysv/linux/write.c):
ssize_t __write(int fd, const void *buf, size_t count) {
    return INLINE_SYSCALL_CALL(write, fd, buf, count);
}
INLINE_SYSCALL_CALL expands to:
mov rax, 1      ; __NR_write
mov rdi, 1      ; fd
mov rsi, buf    ; buffer
mov rdx, 6      ; count
syscall
Kernel entry (arch/x86/entry/syscall_64.c):
sys_call_table[1] = __x64_sys_write;
Syscall implementation (fs/read_write.c):
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count) {
    return ksys_write(fd, buf, count);
}

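ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count) {
    struct fd f = fdget_pos(fd);
    ssize_t ret;
    loff_t pos;

    if (!f.file)
        return -EBADF;

    pos = f.file->f_pos;              // Start from the current file offset
    ret = vfs_write(f.file, buf, count, &pos);
    if (ret >= 0)
        f.file->f_pos = pos;          // Persist the advanced offset
    fdput_pos(f);
    return ret;
}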
VFS layer (fs/read_write.c):
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos) {
    ssize_t ret;

    // Check permissions
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;

    // Check file operations
    if (!file->f_op->write && !file->f_op->write_iter)
        return -EINVAL;

    // Call filesystem-specific write
    if (file->f_op->write)
        ret = file->f_op->write(file, buf, count, pos);
    else
        ret = new_sync_write(file, buf, count, pos);

    return ret;
}
Filesystem layer (e.g., ext4):
// fs/ext4/file.c
static ssize_t ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) {
    // Allocate disk blocks
    // Update inode metadata
    // Write to page cache
    // Mark pages dirty
    return generic_file_write_iter(iocb, from);
}
Complete Stack:
User Application: write(1, "Hello\n", 6)

Libc Wrapper: __write()

Assembly: syscall instruction

entry_SYSCALL_64

do_syscall_64

sys_call_table[1] → __x64_sys_write

ksys_write(fd, buf, count)

vfs_write(file, buf, count, pos)

ext4_file_write_iter() [if file is on ext4]

generic_file_write_iter()

Page Cache operations

Return bytes written

sysretq back to user space

7.2 Example: getpid() - Fast Path

User space:
pid_t pid = getpid();
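getpid() is a real system call, not a vDSO function: the x86-64 vDSO exports time helpers and getcpu(), but not getpid(). glibc cached the PID in user space up to version 2.24; the cache was removed in 2.25, so every call now enters the kernel.
// glibc wrapper (simplified):
pid_t getpid(void) {
    return INLINE_SYSCALL_CALL(getpid);
}
Kernel side (kernel/sys.c):
SYSCALL_DEFINE0(getpid) {
    // Thread Group ID, as seen from the caller's PID namespace
    return task_tgid_vnr(current);
}
Cost: roughly the bare user/kernel round trip, because the handler only reads one field. That makes getpid() a convenient baseline for measuring raw syscall overhead.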

7.3 Example: open() - Complex Path

int fd = open("/home/user/file.txt", O_RDWR | O_CREAT, 0644);
Syscall path:
// fs/open.c
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) {
    return do_sys_open(AT_FDCWD, filename, flags, mode);
}

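long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) {
    // 1. Allocate file descriptor
    int fd = get_unused_fd_flags(flags);
    if (fd < 0)
        return fd;

    // 2. Path lookup (simplified: the real code first copies the name
    //    from user space and builds a struct open_flags)
    struct file *f = do_filp_open(dfd, filename, flags, mode);
    if (IS_ERR(f)) {
        put_unused_fd(fd);            // Release the reserved descriptor
        return PTR_ERR(f);
    }

    // 3. Install in FD table
    fd_install(fd, f);

    return fd;
}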
Path lookup (fs/namei.c):
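struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op) {
    struct nameidata nd;              // (initialization elided)
    int flags = op->lookup_flags;
    struct file *file;

    // RCU-walk: lockless path walk
    file = path_openat(&nd, op, flags | LOOKUP_RCU);

    // Falls back to ref-walk if needed
    if (file == ERR_PTR(-ECHILD))
        file = path_openat(&nd, op, flags);

    // path_openat creates/opens the inode, allocates struct file,
    // and links it to the dentry cache

    return file;
}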
Work done:
  • String parsing (/home/user/file.txt → components)
  • Dentry cache lookups (hot path)
  • Inode cache lookups
  • Disk reads (cold path, if not cached)
  • Permission checks (each directory component)
  • File allocation
  • FD table modification
Cost: Highly variable
  • Hot (all cached): 1-2 μs
  • Cold (disk reads): 1-10 ms

8. Performance Analysis of System Calls

8.1 Measuring Syscall Overhead

Microbenchmark:
#include <fcntl.h>    // open(), O_WRONLY
#include <stdio.h>
#include <unistd.h>
#include <time.h>

#define ITERATIONS 10000000

int main() {
    struct timespec start, end;

    // Benchmark getpid() (a minimal real syscall on glibc 2.25+)
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        getpid();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    long ns = (end.tv_sec - start.tv_sec) * 1000000000L +
              (end.tv_nsec - start.tv_nsec);
    printf("getpid: %ld ns per call\n", ns / ITERATIONS);

    // Benchmark write() to /dev/null
    int fd = open("/dev/null", O_WRONLY);
    char buf[1] = {0};

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        write(fd, buf, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    ns = (end.tv_sec - start.tv_sec) * 1000000000L +
         (end.tv_nsec - start.tv_nsec);
    printf("write: %ld ns per call\n", ns / ITERATIONS);

    return 0;
}
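Typical Results (modern Intel/AMD; exact figures depend on CPU, kernel, and enabled mitigations):
getpid: roughly the bare syscall round trip  ← real syscall on glibc 2.25+ (no PID cache, not in the vDSO)
write:  200 ns per call                      ← syscall plus VFS work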

8.2 Using perf to Analyze

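# Count syscalls (quote the glob so the shell does not expand it)
perf stat -e 'syscalls:sys_enter_*' ./program

# Summarize syscall counts and time per syscall type
strace -c ./program
# Trace individual syscalls, with the time spent in each
strace -T ./program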

# Profile syscall overhead
perf record -e raw_syscalls:sys_enter,raw_syscalls:sys_exit ./program
perf report

8.3 Syscall Batching Strategies

Bad: Many small syscalls
// DON'T DO THIS
for (int i = 0; i < 1000; i++) {
    write(fd, &data[i], 1);  // 1000 syscalls!
}
Good: One large syscall
// DO THIS
write(fd, data, 1000);  // 1 syscall
Better: Vectored I/O
struct iovec iov[10];
for (int i = 0; i < 10; i++) {
    iov[i].iov_base = buffers[i];
    iov[i].iov_len = sizes[i];
}
writev(fd, iov, 10);  // 1 syscall for multiple buffers
Best: Asynchronous I/O (io_uring)
// Queue several operations in user space (no syscall yet)
io_uring_prep_write(sqe1, fd, buf1, len1, offset1);
io_uring_prep_write(sqe2, fd, buf2, len2, offset2);
io_uring_prep_write(sqe3, fd, buf3, len3, offset3);
io_uring_submit(ring);  // One syscall for all 3!

9. Security Implications

9.1 Spectre/Meltdown and Syscalls

Meltdown (2018) exploited speculative execution:
// Attack code
char *kernel_addr = (char *)0xFFFF888000000000;
char value = *kernel_addr;  // Would normally fault

// But CPU speculatively loads it!
// Side channel timing attack can extract value
KPTI Fix (Kernel Page Table Isolation):
Before KPTI:
User process page tables included kernel mappings
→ Fast syscalls (no CR3 switch)
→ Vulnerable to Meltdown
After KPTI:
User process page tables: User space only
Kernel page tables: Kernel + user space
→ CR3 switch on every syscall/interrupt
→ Immune to Meltdown
→ 5-30% performance hit (mitigated by PCID)
PCID (Process Context Identifier):
Instead of flushing TLB on CR3 switch:
Tag TLB entries with PCID
→ User TLB entries coexist with kernel TLB entries
→ Much lower performance impact

9.2 Syscall Filtering with seccomp

seccomp-BPF: Filter which syscalls a process can make
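#include <stddef.h>          // offsetof
#include <sys/prctl.h>       // prctl, PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP
#include <sys/syscall.h>     // __NR_read, __NR_write, __NR_exit
#include <linux/seccomp.h>
#include <linux/filter.h>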

// Allow only read, write, exit
struct sock_filter filter[] = {
    // Load syscall number
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),

    // Allow read (0)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow write (1)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow exit (60)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Kill process for anything else
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};

struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

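// Now can only use read, write, exit.
// Note: glibc's exit() and returning from main() use exit_group, which this
// filter would kill; a practical filter should allow __NR_exit_group as well.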
Use cases:
  • Docker containers
  • Chrome sandbox
  • systemd service hardening

10. Interview Deep Dive Questions

Complete Answer:
User Space:
  1. Application loads syscall number into RAX
  2. Arguments into RDI, RSI, RDX, R10, R8, R9
  3. Executes SYSCALL instruction
CPU Hardware:
  4. Saves user RIP to RCX (return address)
  5. Saves user RFLAGS to R11
  6. Loads kernel RIP from IA32_LSTAR MSR
  7. Switches to Ring 0 (kernel mode)
  8. Jumps to entry_SYSCALL_64
Kernel Entry:
  9. SWAPGS (switch to kernel GS register)
  10. Save user RSP, load kernel stack
  11. Build pt_regs structure (save all registers)
  12. Call do_syscall_64(pt_regs, nr)
Kernel Dispatch:
  13. Validate syscall number
  14. Look up sys_call_table[nr]
  15. Call syscall handler function
  16. Function does actual work (VFS, scheduler, etc.)
  17. Return value placed in RAX
Kernel Exit:
  18. Restore registers from pt_regs
  19. Restore user RSP
  20. SWAPGS back to user GS
  21. Execute SYSRETQ
CPU Hardware:
  22. Restore RIP from RCX
  23. Restore RFLAGS from R11
  24. Switch to Ring 3 (user mode)
  25. Jump to user code
Cost: ~100-200 CPU cycles on modern hardware
Answer:
Problem: Switching into kernel mode is expensive (~100-200 cycles). For frequently called syscalls like gettimeofday(), this overhead dominates.
Solution: vDSO (Virtual Dynamic Shared Object)
How it works:
  1. Kernel maps a page of executable code into every process
  2. This page contains implementations of certain syscalls
  3. Kernel periodically updates read-only data in this page
  4. Libc resolves these functions to vDSO instead of syscalls
  5. Calls execute entirely in user space (no mode switch)
Syscalls that use vDSO:
  • gettimeofday() - reads kernel’s time data
  • clock_gettime() - reads clock data
  • getcpu() - reads current CPU number
  • time() - simplified time call
Performance impact:
Without vDSO: ~200 ns per call
With vDSO:    ~20 ns per call
Speedup: 10x
Why only certain syscalls?:
  • Must be read-only (no side effects)
  • Data must be safely readable from user space
  • Data changes must be atomic/consistent
  • Cannot require kernel state modifications
Security: vDSO code is kernel-provided, mapped at random addresses (ASLR), and cannot be modified by user space.
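One way to see the difference on your own machine is to time the same clock read through libc (which normally resolves to the vDSO) and through a forced real syscall. This is a minimal sketch, not a rigorous benchmark: the function names are hypothetical, the numbers depend heavily on kernel version, mitigations, and virtualization, and on systems where the vDSO clock path is unavailable both functions will report similar costs.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average cost of clock_gettime() through libc (normally the vDSO path). */
double libc_clock_ns_per_call(int iters) {
    struct timespec ts;
    uint64_t start = now_ns();
    for (int i = 0; i < iters; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)(now_ns() - start) / iters;
}

/* Average cost when the same request is forced through a real syscall. */
double raw_clock_ns_per_call(int iters) {
    struct timespec ts;
    uint64_t start = now_ns();
    for (int i = 0; i < iters; i++)
        syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
    return (double)(now_ns() - start) / iters;
}
```

Printing both results for a few million iterations gives a rough feel for the gap described above; treat the ratio as indicative rather than exact.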
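Answer:
SWAPGS is a privileged x86-64 instruction that exchanges the current GS base with the value stored in the IA32_KERNEL_GS_BASE MSR.
Purpose:
  • In kernel mode: GS points to the kernel per-CPU area
  • In user mode: GS holds whatever user space put there (Linux x86-64 uses FS for thread-local storage, so GS is usually unused, but it remains user-controlled)
  • SWAPGS lets the kernel reach its per-CPU data on entry without needing a free register
Why it’s needed (offsets are illustrative, not real layouts):
User mode (user-controlled, hypothetical layout):
  GS:0 → thread ID
  GS:8 → errno location
  GS:16 → TLS data

Kernel mode (per-CPU area):
  GS:0 → current task_struct pointer
  GS:8 → CPU number
  GS:16 → kernel stack pointer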
Security-critical:
  1. The Spectre v1 SWAPGS variant (CVE-2019-1125) targets entry paths where SWAPGS is guarded by a conditional branch
  2. If that branch is mispredicted, the CPU can speculatively execute kernel code with the user GS base
  3. The speculative GS-relative loads read attacker-chosen memory and can leak information through the cache
Attack scenario:
// User sets GS to malicious value
syscall();  // Enter kernel

// If SWAPGS is (speculatively) skipped:
current = *(task_struct **)GS:0;  // Reads attacker-controlled memory!
Mitigation:
  • SWAPGS must be first instruction in syscall entry
  • Must happen before any GS-relative memory access
  • An LFENCE after SWAPGS keeps later loads from running speculatively with the wrong GS base
Modern defense (LFENCE after SWAPGS; simplified sketch, not the literal kernel source):
entry_SYSCALL_64:
    SWAPGS
    lfence              ; Speculation barrier
    mov rsp, gs:0x1c    ; Now safe to use GS
Answer:
Function Call (user → user):
call foo
  push return_address
  jmp foo
foo:
  ; function body
  ret
Cost: 5-10 cycles
  • Push return address
  • Branch (usually predicted correctly)
  • Return (predicted via RAS - Return Address Stack)
System Call (user → kernel → user):
syscall
  ; Save RIP, RFLAGS
  ; Switch privilege level
  ; Switch stack
  ; Jump to kernel
[kernel work]
sysretq
  ; Restore RIP, RFLAGS
  ; Switch privilege level
  ; Switch stack
  ; Jump to user
Cost: 100-200+ cycles
  • Mode switch overhead: ~50 cycles
  • Register save/restore: ~20 cycles
  • TLB effects (if PCID not used): ~50 cycles
  • Cache effects (kernel code not in L1): ~50+ cycles
Real-world comparison:
// Function call
int add(int a, int b) { return a + b; }
add(1, 2);  // ~10 cycles

// Syscall
getpid();   // ~200 cycles (always a real syscall; getpid is not in the vDSO)
Why syscalls are slow:
  1. Privilege level transition (Ring 3 → Ring 0 → Ring 3)
  2. Page table switch (if KPTI enabled)
  3. TLB flush (if PCID not supported)
  4. Cache pollution (kernel code evicts user code)
  5. Security checks and barriers
Ratio: Syscalls are 20-40x slower than function calls
Optimization strategies:
  • Batch operations (writev vs many writes)
  • Use vDSO when available
  • Use memory mapping (mmap) to avoid read/write syscalls
  • Use io_uring for async I/O with minimal syscalls
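As a concrete instance of the first strategy, the sketch below sends several buffers with one writev() call instead of one write() per buffer. write_parts and MAX_PARTS are hypothetical names chosen for illustration; a production version would also handle short writes and EINTR.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_PARTS 16

/* Write up to MAX_PARTS C strings to fd with a single writev() call.
 * Returns the number of bytes written, or -1 on error (see errno). */
ssize_t write_parts(int fd, const char *const parts[], int count) {
    struct iovec iov[MAX_PARTS];
    if (count < 0 || count > MAX_PARTS)
        return -1;
    for (int i = 0; i < count; i++) {
        iov[i].iov_base = (void *)parts[i];  /* writev only reads from it */
        iov[i].iov_len = strlen(parts[i]);
    }
    return writev(fd, iov, count);           /* one kernel entry for all parts */
}
```

For an HTTP-style response made of a status line, headers, and a body, this turns three user/kernel crossings into one while avoiding a copy into a temporary buffer.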
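Answer:
| Aspect | x86-64 | ARM64 | RISC-V |
| --- | --- | --- | --- |
| Instruction | SYSCALL | SVC #0 | ECALL |
| Syscall # | RAX | x8 | a7 |
| Args | RDI, RSI, RDX, R10, R8, R9 | x0-x5 | a0-a5 |
| Return | RAX | x0 | a0 |
| Max Args | 6 | 6 | 6 |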
x86-64:
mov rax, 1      ; syscall number
mov rdi, 1      ; arg1
mov rsi, buf    ; arg2
mov rdx, len    ; arg3
syscall         ; trap to kernel
; return value in RAX
ARM64:
mov x8, #64     ; syscall number (write)
mov x0, #1      ; arg1 (fd)
ldr x1, =buf    ; arg2 (buffer)
mov x2, len     ; arg3 (count)
svc #0          ; supervisor call
; return value in x0
RISC-V:
li a7, 64       ; syscall number
li a0, 1        ; arg1
la a1, buf      ; arg2
li a2, len      ; arg3
ecall           ; environment call
; return value in a0
Key differences:
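  1. Calling convention:
    • x86-64 syscalls differ from the function-call ABI in one register: the fourth argument goes in R10 instead of RCX, because SYSCALL overwrites RCX (and R11)
    • ARM64 and RISC-V pass syscall arguments in the same registers as function arguments; only the syscall number needs an extra register (x8 / a7)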
  2. Syscall numbering:
    • Each architecture has different syscall numbers
    • write is #1 on x86-64, #64 on ARM64/RISC-V
    • Forces architecture-specific syscall tables
  3. Mode switching:
    • x86: Ring 0 vs Ring 3
    • ARM: EL0 vs EL1
    • RISC-V: U-mode vs S-mode
  4. Performance:
    • Similar overhead (~100-200 cycles)
    • RISC architectures slightly cleaner (fewer legacy modes)
    • All benefit from vDSO equally
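In C, these per-architecture differences are hidden behind the SYS_* constants and libc's syscall() wrapper, so the same source compiles to the right registers and numbers on each target. The sketch below is illustrative; arch_name, write_syscall_number, and raw_write are hypothetical helpers, and the preprocessor checks cover only the three architectures in the table.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Name of the architecture this file was compiled for. */
const char *arch_name(void) {
#if defined(__x86_64__)
    return "x86-64";
#elif defined(__aarch64__)
    return "arm64";
#elif defined(__riscv)
    return "riscv";
#else
    return "other";
#endif
}

/* The write syscall number that the kernel headers define for this target. */
long write_syscall_number(void) {
    return SYS_write;
}

/* Portable raw write: libc maps the arguments onto the target's
 * syscall registers and issues SYSCALL, SVC #0, or ECALL as appropriate. */
long raw_write(int fd, const void *buf, size_t len) {
    return syscall(SYS_write, fd, buf, len);
}
```

Printing arch_name() next to write_syscall_number() on different machines is a quick way to confirm the numbering difference noted above.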

11. Hands-On Practice

Lab 1: Tracing System Calls

# Trace all syscalls
strace ./program

# Count syscalls by type
strace -c ./program

# Trace only open/read/write (modern glibc opens files via openat)
strace -e trace=open,openat,read,write ./program

# Show timing per syscall
strace -T ./program

# Attach to running process
strace -p <pid>

Lab 2: Minimal Syscall (No libc)

// syscall_raw.c - Direct syscall without libc
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

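// printf is unavailable: no libc is linked, so we issue the syscall ourselves
static inline long my_write(int fd, const void *buf, size_t count) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a" (ret)
        : "0"((long)__NR_write), "D"((long)fd), "S"(buf), "d"(count)
        : "rcx", "r11", "memory"   // SYSCALL overwrites RCX and R11
    );
    return ret;
}

void _start(void) {
    // static storage: no stack buffer, so no stack-protector canary is involved
    static const char msg[] = "Hello from raw syscall!\n";
    my_write(1, msg, sizeof(msg) - 1);

    __asm__ volatile (
        "mov $60, %%rax\n"    // __NR_exit
        "xor %%rdi, %%rdi\n"  // exit status 0
        "syscall"
        ::: "rax", "rdi", "rcx", "r11", "memory"
    );
    __builtin_unreachable();  // exit never returns
}
Compile without libc (the stack protector needs libc support, so disable it):
gcc -nostdlib -static -fno-stack-protector -o syscall_raw syscall_raw.c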
./syscall_raw

Lab 3: Benchmark Syscall Overhead

// benchmark_syscall.c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/syscall.h>  // SYS_getpid

#define ITERATIONS 10000000

static inline long long rdtsc() {
    unsigned int lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((long long)hi << 32) | lo;
}

int main() {
    long long start, end;

    // Warm up
    for (int i = 0; i < 1000; i++) getpid();

    // Benchmark getpid() via the libc wrapper (not a vDSO call; glibc before 2.25 cached the PID in user space)
    start = rdtsc();
    for (int i = 0; i < ITERATIONS; i++) {
        getpid();
    }
    end = rdtsc();
    printf("getpid: %lld cycles/call\n", (end - start) / ITERATIONS);

    // Benchmark syscall(__NR_getpid) (forces real syscall)
    start = rdtsc();
    for (int i = 0; i < ITERATIONS; i++) {
        syscall(SYS_getpid);
    }
    end = rdtsc();
    printf("syscall(getpid): %lld cycles/call\n", (end - start) / ITERATIONS);

    return 0;
}

Lab 4: Examine vDSO

// find_vdso.c
#include <stdio.h>
#include <stdlib.h>  // system()

int main(int argc, char **argv, char **envp) {
    // Skip past environment variables
    char **p = envp;
    while (*p) p++;

    // Now at auxiliary vector
    unsigned long *auxv = (unsigned long *)(p + 1);

    while (*auxv) {
        if (auxv[0] == 33) {  // AT_SYSINFO_EHDR
            printf("vDSO base address: 0x%lx\n", auxv[1]);
        }
        auxv += 2;
    }

    // Also check /proc/self/maps
    system("cat /proc/self/maps | grep vdso");

    return 0;
}
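Walking the auxiliary vector by hand shows the mechanism, but glibc and musl also expose it through getauxval(). The sketch below is a shorter alternative under that assumption; vdso_base and looks_like_elf are hypothetical helper names.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Base address of the vDSO ELF image, or 0 if the kernel did not map one. */
unsigned long vdso_base(void) {
    return getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the memory at base starts with the ELF magic bytes. */
int looks_like_elf(unsigned long base) {
    return base != 0 && memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

Because the vDSO is an ordinary ELF shared object mapped into the process, looks_like_elf(vdso_base()) is expected to return 1 whenever a vDSO is present.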

Summary for Senior Engineers

System Calls Aren't Free

Even with SYSCALL, there is a 100-200 cycle cost. Batch your syscalls (writev, io_uring).

vDSO is Magic

Some “syscalls” execute entirely in user space. This is why they don’t show in strace.

SWAPGS is Critical

Target of the Spectre v1 SWAPGS variant (CVE-2019-1125). Must happen before any GS-relative kernel memory access.

Syscall Table is Kernel's API

Modifying it is how rootkits hide. Kernel marks it read-only after boot.
Key Takeaways:
  1. Privilege separation is enforced by hardware (CPU rings/modes)
  2. System calls are the only way for user code to deliberately request kernel services (interrupts and exceptions also enter the kernel, but not at the program's request)
  3. vDSO eliminates syscall overhead for frequently-used operations
  4. Security mechanisms (KPTI, SWAPGS, seccomp) protect the syscall interface
  5. Performance matters: Modern systems minimize context switches


Production Caveats: What Goes Wrong at the User-Kernel Boundary

The textbook explanation makes syscalls look clean. Production reality is messier. Most kernel-related performance bugs and security incidents trace back to specific failure modes around mode switching, syscall semantics, and kernel architecture choices.
Real-world pitfalls senior engineers have learned the hard way:
  1. The “syscall is cheap” assumption breaks at scale. A modern syscall costs roughly 100 to 200 cycles bare; with KPTI it costs 300 to 500. At 200K syscalls per second per core (entirely realistic for a busy nginx or Redis), that is 60 to 100 million cycles per second purely on mode switching — before any useful work. Apps tuned on a 4.x kernel and deployed on a KPTI-enabled 5.x kernel can lose 20 percent throughput on this alone. Always measure with perf stat -e cycles,instructions,raw_syscalls:sys_enter.
  2. Monolithic vs microkernel is not a universal answer. “Microkernels are safer” sounds compelling until you measure the IPC overhead — early Mach was 5x slower than monolithic Unix on the same hardware. “Monolithic kernels are faster” sounds true until a single buggy driver crashes your entire fleet. Both architectures work; the right answer depends on the workload’s tolerance for failure vs latency. Linux’s middle-ground answer (loadable modules + eBPF + user-mode helpers) emerged because pure ideologies on either side lost.
  3. vDSO is invisible in strace. If you are profiling and see “no syscalls” for clock_gettime, you have not eliminated the calls — they are routing through the vDSO. This matters when reasoning about performance. The vDSO is fast (under 20ns) but it still reads memory under contention; on a heavily oversubscribed system the vDSO can be slower than expected because its data page is bouncing between CPU caches.
  4. fork() cost is not constant. A small process forks in microseconds. A process with 100GB of RSS and 25 million page table entries can take 200ms to fork because the kernel must copy all those page tables. PostgreSQL’s fork() per connection model breaks at scale partly because of this. The fix is connection pooling (pgbouncer) or moving to thread-based engines.
  5. SIGSEGV and SIGBUS are not always bugs in your code. A successful mmap() followed by reads from the mapped region raises SIGBUS if the underlying file was truncated below the accessed offset. Memory pressure can cause demand-paged executable pages to be evicted; reading them re-pages from disk. If the disk is gone (NFS unmount, removed USB device), you also get SIGBUS. The kernel will not magically save you from physical-layer failures.
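The arithmetic in the first pitfall is simple enough to keep as a helper when sizing a service. This is a back-of-the-envelope sketch with hypothetical function names; the 3 GHz clock used in the note below is an assumption, not a measurement.

```c
/* Cycles per second spent purely on syscall entry and exit. */
double syscall_cycles_per_sec(double syscalls_per_sec, double cycles_per_syscall) {
    return syscalls_per_sec * cycles_per_syscall;
}

/* Fraction of one core that those cycles represent at a given clock rate. */
double core_fraction(double cycles_per_sec, double clock_hz) {
    return cycles_per_sec / clock_hz;
}
```

With 200K syscalls per second at 300 to 500 cycles each, syscall_cycles_per_sec gives 60 to 100 million cycles per second; on an assumed 3 GHz core that is roughly 2 to 3.3 percent of the core before the syscall bodies do any work.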
Solutions and patterns:
  • Batch syscalls aggressively. writev/readv combine multiple buffers into one call. sendmmsg/recvmmsg batch network packets. io_uring is the endgame: dozens of operations per syscall. If you are doing more than 50K syscalls per second per core, start here.
  • Use vDSO functions when available. clock_gettime(CLOCK_MONOTONIC, ...) is vDSO; clock_gettime(CLOCK_REALTIME_COARSE, ...) is even cheaper (the kernel updates it once per tick, so it is accurate to one tick: roughly 4ms at HZ=250). For high-frequency timing, the difference is real.
  • Profile with perf trace not just strace. strace adds 100x overhead because each syscall round-trips through ptrace. perf trace uses tracepoints and is 50 to 100x cheaper. Run strace for correctness, perf trace for performance.
  • Set RLIMIT_NOFILE and RLIMIT_NPROC defensively. Default limits are often inadequate (1024 file descriptors). Set them in systemd unit files (LimitNOFILE=, LimitNPROC=) or at container launch (for example docker run --ulimit), not in app startup; a Dockerfile has no instruction for ulimits, and if you set them at app startup, the first connection burst can hit the old limit.
  • Use seccomp profiles for security boundaries. Docker’s default seccomp profile blocks roughly 50 syscalls. If you can identify your app’s actual syscall set with perf trace -e raw_syscalls:sys_enter, write a tighter profile. Smaller syscall surface equals smaller attack surface.
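To check what the coarse clocks actually give up on a particular machine, clock_getres() reports each clock's resolution. A minimal sketch, assuming Linux and a hypothetical helper name clock_resolution_ns:

```c
#define _GNU_SOURCE
#include <time.h>

/* Resolution of a clock in nanoseconds, or -1 if the clock is unsupported. */
long clock_resolution_ns(clockid_t clk) {
    struct timespec res;
    if (clock_getres(clk, &res) != 0)
        return -1;
    return (long)res.tv_sec * 1000000000L + res.tv_nsec;
}
```

Comparing clock_resolution_ns(CLOCK_MONOTONIC) with clock_resolution_ns(CLOCK_REALTIME_COARSE) shows the tick-sized granularity of the coarse clock on kernels that expose it; the exact value depends on the kernel's tick configuration.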

Senior Interview Questions: Kernel Architecture and Syscall Semantics

Strong Answer Framework:
  1. Define the architectural difference. Monolithic kernels (Linux) run drivers, file systems, network stack, and scheduler all in kernel mode (ring 0). They communicate via direct function calls and shared data structures. Microkernels (L4, seL4, QNX) keep only the absolute minimum (address space management, IPC, scheduling) in kernel mode and move everything else (file systems, drivers, networking) to user-space servers.
  2. Performance reality. A function call in a monolithic kernel is roughly 5 cycles. The equivalent operation in a microkernel is an IPC round-trip with multiple context switches and cache effects: typically 1 to 10 microseconds on older designs, and sub-microsecond (roughly 0.1 to 1 microsecond) on optimized L4. For an operation invoked 100K times per second (typical for a busy file server), 0.1 to 1 microsecond per round-trip is 1 to 10 percent of a core as pure overhead; at 10 microseconds it would consume the entire core.
  3. Reliability reality. A bug in a Linux driver can kernel-panic the whole system. A bug in a microkernel driver crashes the driver process; the kernel can restart it. seL4 takes this further: it is formally verified, with a machine-checked proof that the kernel's implementation matches its formal specification, which rules out whole classes of bugs (buffer overflows, null dereferences) under the proof's assumptions. For mission-critical systems (defense, medical, automotive), this matters more than performance.
  4. Linux’s middle path. Linux is mostly monolithic but with three escape valves: loadable kernel modules (drivers can be added or removed at runtime without rebuilding the kernel, although they still run in ring 0 and a buggy module can panic the system), eBPF (sandboxed user-supplied code in kernel space), and user-mode helpers (FUSE, vhost-user, CUSE). Drivers that need isolation run as VFIO-passed user-space processes. This pragmatic hybrid captures most microkernel benefits at a fraction of the cost.
  5. Why monolithic won market share. Linux’s monolithic design was good enough plus orders of magnitude faster than 1990s-era microkernels (Mach was the cautionary tale). By the time microkernel IPC got fast (L4 in 2000s), Linux had already eaten desktop, server, mobile, and embedded markets. Network effects and ecosystem gravity outweigh architectural elegance.
Real-World Example: Apple’s macOS kernel (XNU) is a hybrid: a Mach core with a BSD personality running in the same kernel address space. The microkernel philosophy lost in practice: Apple’s BSD layer ended up doing most of the work directly, with Mach IPC bypassed for hot paths. This is the textbook case of “microkernel theory meets monolithic performance pressure.” Conversely, QNX (microkernel) dominates automotive infotainment because a crashing radio app must never affect the engine controller.
Senior follow-up: Where does eBPF fit on this spectrum? eBPF is not quite either. It runs in kernel mode (monolithic-fast) but with verifier-enforced safety properties (microkernel-isolated). It has effectively become a “user-extensible kernel” mechanism without the IPC tax. This is why eBPF has exploded since 2018 — it gives Linux microkernel-like flexibility without giving up monolithic speed.
Senior follow-up: Why is seL4 not winning market share despite being formally verified? Two reasons. First, the verification covers the kernel only — the user-space servers (file system, network) are not verified, so the system-level guarantees are weaker than they sound. Second, the development cost is enormous; porting drivers and building a userspace ecosystem is a decade-long effort. seL4 wins where the math matters more than the ecosystem (military, aerospace) and loses where ecosystem matters more (general computing).
Senior follow-up: If you were designing an OS today from scratch, which architecture would you pick? For a general-purpose OS, I would replicate Linux’s hybrid: monolithic core with a strong extension story (eBPF, modules, FUSE). For a specialized OS (real-time, security-sensitive), I would lean toward microkernel (seL4 or its descendants). The “right” answer is workload-specific, which is why neither pure architecture has won outright.
Common Wrong Answers:
  1. “Microkernels are objectively safer and the industry is wrong.” Andy Tanenbaum’s argument from the 1992 Tanenbaum-Torvalds debate. It misses that “safer” without “fast and ecosystem-rich” does not win market share. seL4 is safer; almost no one runs it.
  2. “Linux is monolithic so it cannot be reliable.” Linux runs Google, Facebook, AWS, and most of the planet’s infrastructure with five to seven nines availability. Reliability is engineering practice (testing, fuzzing, eBPF-based verification, hardening), not architecture per se.
Further Reading:
  • Liedtke, “On µ-kernel construction” (1995) — the foundational L4 paper.
  • Klein et al., “seL4: formal verification of an OS kernel” (SOSP 2009) — how you actually prove a kernel correct.
  • Linus Torvalds vs Andy Tanenbaum, USENET archive (1992) — the original “Linux is obsolete” debate, still worth reading for the framing.
Strong Answer Framework:
  1. Baseline cost on x86-64. A bare syscall instruction round-trip is roughly 100 cycles (about 30ns at 3 GHz). With KPTI (Meltdown mitigation) it jumps to 200 to 500 cycles because of the CR3 page-table switch. With Spectre mitigations (IBPB, IBRS) and KPTI fully enforced, it can hit 1000+ cycles. The variance comes from PCID support (mitigates KPTI cost), microarchitectural state, and what the syscall actually does.
  2. Where the time goes. Roughly: 30 cycles for the privilege transition itself (SWAPGS, register save), 50 to 100 cycles for the page-table switch and TLB effects under KPTI, 20 to 50 cycles for the dispatcher (validate syscall number, look up in sys_call_table, security checks via LSM hooks), then the actual syscall body, then the reverse. The fixed overhead is 100 to 500 cycles regardless of what the syscall does.
  3. vDSO eliminates the overhead entirely for read-only operations. clock_gettime, gettimeofday, getcpu, time are mapped as user-space code that reads kernel-maintained data via a seqlock. Cost: roughly 10 to 20 cycles. Speedup: 10 to 30x.
  4. io_uring batches syscalls. Submit a queue of operations (read, write, accept, recv) and reap completions, with a single (or zero, with IORING_SETUP_SQPOLL) syscall. For 100 ops, cost goes from 100 syscalls (10K to 50K cycles) to 1 syscall (100 to 500 cycles): roughly a 100x reduction in fixed entry/exit overhead. Used by databases (Ceph, ScyllaDB), high-perf web (proxygen).
  5. Other techniques. eBPF for in-kernel processing without round-tripping to userspace (XDP for networking, BPF LSM for security). Kernel bypass via DPDK/RDMA/SPDK skips the syscall entirely for I/O. Shared memory + lock-free queues for IPC where syscalls are not needed (futex only on contention).
Real-World Example: ScyllaDB’s published benchmarks (2020) showed that switching from epoll-based I/O to io_uring with IORING_SETUP_SQPOLL doubled their throughput on NVMe storage workloads. The gain was almost entirely from eliminating syscalls, not from faster I/O. They went from roughly 250K IOPS per core to 500K IOPS per core on the same hardware.
Senior follow-up: What is the security risk of io_uring and why have some platforms restricted it? io_uring’s submission queue lets userspace describe operations that the kernel executes asynchronously. Several CVEs (CVE-2022-2602, CVE-2022-29582) found memory-safety bugs in this fast path. Google ChromeOS and Android disabled io_uring by default in 2023; AWS recommends sysctl restrictions. The pattern: any major new kernel API trades audit maturity for performance, and you should not enable it on internet-facing untrusted-tenant workloads until it has hardened.
Senior follow-up: When does the vDSO fail to help? When the data the vDSO reads is contended. The vDSO reads a shared page; if many cores are reading frequently and the kernel timer interrupt is updating, you can see cache-line bouncing under heavy load. Also, vDSO is read-only — anything that mutates kernel state still needs a real syscall.
Senior follow-up: Should I rewrite my service to use io_uring? Only if syscalls are a measured bottleneck. For a service doing under 50K syscalls per second, io_uring is dead weight — adds complexity for no measurable gain. For a storage engine or proxy doing 500K+ per core, io_uring can be a 2x win. As always: profile first.
Common Wrong Answers:
  1. “Syscalls always cost about 1 microsecond.” Wrong by an order of magnitude in either direction depending on architecture, mitigations, and what the syscall does. The honest answer is “between 30ns and 1 microsecond, measure it on your kernel.”
  2. “io_uring makes everything faster.” No. For low-throughput workloads, the queue management overhead exceeds the saved syscall cost. io_uring is for high-throughput, not for general use.
Further Reading:
  • Jens Axboe, “Efficient I/O with io_uring” (kernel.org, 2019) — the original design document.
  • Brendan Gregg, “Linux System Call Performance” (LWN-style write-up, 2018) — how to measure syscall overhead in production.
  • Linux kernel Documentation/userspace-api/vsyscall.rst and Documentation/ABI/stable/vdso — the canonical references.
Strong Answer Framework:
  1. What SWAPGS does mechanically. It exchanges the GS base register with the IA32_KERNEL_GS_BASE MSR, switching between the user-mode value (set by user space, for example a TLS pointer) and the kernel-mode value (per-CPU data pointer). On entry to the kernel, the kernel needs to access per-CPU data structures (current task_struct, kernel stack pointer, etc.) immediately. Those are addressed via gs:offset. SWAPGS makes this work without needing a free general-purpose register.
  2. Why it is security-critical. Before SWAPGS executes, the GS register still points at user-controlled data. If the CPU speculatively executes a memory access using gs:offset before SWAPGS retires, it dereferences attacker-controlled memory. The kernel then reads from wherever the user pointed GS, potentially leaking data through cache side channels.
  3. The Spectre v1 SWAPGS variant. Researchers found that on entry paths where SWAPGS is conditional (interrupt and exception handlers first check whether they came from user mode), a mispredicted branch lets the CPU speculatively execute kernel code with the wrong GS value, even though architecturally the correct path is taken. The speculative reads completed, polluted the cache, and leaked data to a measuring attacker, even though the speculation was eventually discarded.
  4. The LFENCE mitigation. LFENCE is a load fence — it serializes loads, preventing the CPU from speculatively executing loads after the LFENCE until prior loads complete. Placing LFENCE immediately after SWAPGS guarantees that any subsequent gs:offset access happens with the correct (kernel) GS value.
  5. The performance cost. LFENCE serializes the pipeline; on a modern OoO core that costs roughly 10 to 30 cycles per syscall. Across millions of syscalls per second, this is real overhead. The kernel applies the fence selectively (only on entry, only on architectures that need it) to minimize impact.
Real-World Example: The Spectre-SWAPGS variant (CVE-2019-1125) was disclosed in August 2019. The Linux kernel patches added LFENCE on entry; Microsoft Windows shipped equivalent mitigations the same week. Phoronix benchmarks measured roughly 3 to 5 percent overhead on syscall-heavy workloads. The trade-off was deemed acceptable because the alternative was kernel memory disclosure.
Senior follow-up: Are there architectures that do not have this problem? ARM64 has separate registers for kernel and user TLS (TPIDR_EL0 and TPIDR_EL1) so there is no swap operation at the same level. RISC-V’s sscratch register serves a similar role. The x86 design (one GS base, swap on entry) is a relic of the original AMD64 design that became a footgun under speculation.
Senior follow-up: What other speculation mitigations does the kernel apply on syscall entry? Retpolines (replacing indirect branches with safe trampolines that prevent BTB poisoning), IBPB (Indirect Branch Predictor Barrier on context switch), STIBP (Single Thread Indirect Branch Predictor) on hyperthread boundaries, and KPTI (Kernel Page Table Isolation) for Meltdown. Each has a measurable cost; full mitigations stack to 10 to 30 percent on syscall-heavy workloads.
Senior follow-up: When would you turn off these mitigations? For dedicated single-tenant workloads where no untrusted code runs (high-frequency trading, dedicated compute clusters), mitigations=off on the kernel command line restores 10 to 30 percent throughput. For multi-tenant systems (cloud, containers from untrusted images), never turn them off.
Common Wrong Answers:
  1. “SWAPGS is just a privilege transition instruction.” It is more specific than that; it does not change the privilege level (the CPL change from SYSCALL does that). It only swaps a register. The conflation of “kernel transition” with SWAPGS leads to confusion about what each piece actually protects.
  2. “LFENCE prevents Spectre.” LFENCE prevents one specific class of Spectre variants involving speculative loads after a barrier. It does not prevent BTB-based variants (those need retpolines or IBRS).
Further Reading:
  • Bitdefender’s original SWAPGS variant disclosure (August 2019).
  • Linux kernel arch/x86/entry/entry_64.S — read the actual assembly with comments.
  • Kiriansky and Waldspurger, “Speculative Buffer Overflows: Attacks and Defenses” (2018), for background on speculation-based attacks.

Interview Deep-Dive

Strong Answer:The way I think about a system call is as a controlled boundary crossing with non-trivial cost. Here is the full sequence on x86-64:
  • User-space setup: The C library (glibc) places the syscall number in RAX and arguments in RDI, RSI, RDX, R10, R8, R9. Then it executes the syscall instruction.
  • Hardware transition: The CPU reads the target address from the IA32_LSTAR MSR (set at boot by the kernel), saves the return address in RCX and flags in R11, switches to ring 0, and jumps to entry_SYSCALL_64.
  • Kernel entry: The kernel executes SWAPGS to load the per-CPU kernel data area, saves the user stack pointer, loads the kernel stack, and pushes a full register frame (pt_regs). On systems with KPTI (Kernel Page Table Isolation, the Meltdown mitigation), the kernel must also switch CR3 to the kernel page table, which invalidates TLB entries unless PCIDs are used.
  • Dispatch: The kernel indexes into sys_call_table[RAX] and calls the appropriate handler (e.g., ksys_write()).
  • Return: Reverse the process — restore registers, SWAPGS back, switch CR3 if KPTI, execute sysretq to return to user space at the address saved in RCX.
Performance implications: a “naked” syscall costs 100-200 cycles. With KPTI, add another 50-100 cycles for the CR3 switch. At a rate of 100K syscalls/second (common for busy web servers), that is 10-20 million cycles per second spent just crossing the boundary.
The vDSO (Virtual Dynamic Shared Object) optimizes this for calls that only need to read kernel-maintained data. The kernel maps a read-only page into every process’s address space containing executable code and shared data (like the current time). When you call clock_gettime(), glibc routes to the vDSO function, which reads a memory-mapped time value updated by the kernel’s timer interrupt, with no mode switch at all. Cost drops from 200 cycles to about 20 cycles. The functions typically available via vDSO are gettimeofday, clock_gettime, getcpu, and time.
The key insight for interviews: not every “system call” is actually a system call. The vDSO makes some of the most frequently called functions essentially free, which is why you do not see them in strace output: strace only intercepts actual syscall instructions.
Follow-up: If vDSO runs in user space with kernel data, how does the kernel keep the time value up to date without a race condition?
The kernel uses a seqlock pattern. The vDSO page contains a sequence counter and the time data. The kernel’s timer interrupt updates the time data and increments the sequence counter (odd during write, even when stable). The vDSO reader code loops: read the sequence counter, read the time, read the sequence counter again. If the counter changed or was odd, retry. This guarantees the reader always gets a consistent snapshot without any locks or atomic instructions on the read path. The retry is almost never needed because the timer interrupt is very brief.
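The seqlock protocol described above can be sketched in portable C11. This is not the kernel's implementation: struct time_page and the three functions are hypothetical, a single writer is assumed, and the relaxed atomic loads stand in for the plain loads plus barriers that the real vDSO code uses.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the vDSO data page. */
struct time_page {
    atomic_uint seq;           /* odd while a write is in progress */
    _Atomic uint64_t sec;
    _Atomic uint64_t nsec;
};

void time_page_init(struct time_page *p) {
    atomic_init(&p->seq, 0);
    atomic_init(&p->sec, 0);
    atomic_init(&p->nsec, 0);
}

/* Writer (the "timer interrupt"): make seq odd, update, make seq even. */
void time_page_publish(struct time_page *p, uint64_t sec, uint64_t nsec) {
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&p->nsec, nsec, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader (the "vDSO function"): retry until seq is even and unchanged. */
void time_page_read(struct time_page *p, uint64_t *sec, uint64_t *nsec) {
    unsigned before, after;
    do {
        before = atomic_load_explicit(&p->seq, memory_order_acquire);
        *sec = atomic_load_explicit(&p->sec, memory_order_relaxed);
        *nsec = atomic_load_explicit(&p->nsec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```

The reader never writes to shared memory, so any number of readers can run concurrently without contending with each other; only a write that overlaps a read forces a retry.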
Strong Answer:
  • Abstraction (hiding hardware complexity): The OS presents uniform interfaces (files, sockets, processes) regardless of underlying hardware. A production failure from broken abstraction: a cloud provider migrated VMs from Intel to AMD hosts. Applications using RDTSC directly (bypassing the OS clock abstraction) started producing incorrect timestamps because TSC behavior differs between CPU vendors. The fix was to use clock_gettime() (the proper OS abstraction) instead of raw hardware instructions. Lesson: when you bypass the OS abstraction, you take on hardware portability risk.
  • Multiplexing (sharing resources among competing users): The OS divides CPU, memory, I/O, and network among processes. A production failure from broken multiplexing: a noisy neighbor on a shared Kubernetes node consumed all available I/O bandwidth (no blkio cgroup limits were set). The database on the same node saw query latency spike from 5ms to 500ms because its fsync calls were queued behind the neighbor’s bulk writes. The OS was multiplexing the I/O device fairly by default (CFQ scheduler), but “fair” meant the database got equal share, not prioritized share. Fix: set blkio cgroup weights and move latency-sensitive workloads to dedicated nodes.
  • Isolation (preventing interference between processes): The OS ensures one process cannot corrupt another’s memory or resources. A production failure from broken isolation: the Meltdown vulnerability (2018) showed that speculative execution could leak kernel memory to user space, breaking the fundamental isolation between kernel and user. A malicious process could read passwords, encryption keys, and other secrets from kernel memory at roughly 500KB/s. The fix (KPTI) restored isolation but cost 5-30% performance on syscall-heavy workloads. This is arguably the most expensive isolation failure in computing history.
The meta-lesson: all three properties are load-bearing pillars. Weaken any one, and the system fails in surprising, hard-to-diagnose ways. In system design interviews, I always think about which of these three is most critical for the system under discussion and what happens when it breaks.
Follow-up: How do containers provide isolation, and where do they fall short compared to VMs?
Containers use kernel namespaces (PID, mount, network, UTS, IPC, cgroup, user, time) to create isolated views of system resources, and cgroups to limit resource consumption. But all containers share the same kernel. This means a kernel vulnerability affects every container on the host. VMs, by contrast, run separate kernels on a hypervisor; a vulnerability in the guest kernel does not affect the host or other guests (assuming no hypervisor escape). The classic trade-off: containers are lighter (millisecond startup, megabytes of overhead) but weaker isolation; VMs are heavier (second startup, gigabytes of overhead) but stronger isolation. For multi-tenant environments processing untrusted code (like CI runners), you want VMs or microVMs (Firecracker) for the outer boundary and containers inside for convenience.
Strong Answer:This is a great question because it gets at the fundamental reason operating systems exist. The answer comes down to trust and shared resources.
  • Mutual distrust: Your web browser, your editor, and a random npm package you installed all run as user-space processes. None of them should be able to read each other’s memory, delete each other’s files, or monopolize the CPU. The kernel/user-space boundary, enforced by hardware (CPU privilege rings), is the mechanism that makes this isolation real. Without it, any process could overwrite any other process’s memory with a simple pointer dereference.
  • Hardware protection requires privilege: Certain operations — modifying page tables, programming the interrupt controller, accessing I/O ports, halting the CPU — would allow a single process to break the entire system if performed incorrectly. The hardware restricts these to ring 0 (kernel mode). The kernel acts as a trusted intermediary that validates requests before executing privileged operations.
  • Resource accounting: The kernel tracks who owns what — which process has which memory pages, file descriptors, and CPU time. This accounting is what enables fair scheduling, memory limits, and cgroups. If everything ran in user space with equal privilege, there would be no authority to enforce limits.
  • Crash containment: When a user-space program dereferences a NULL pointer, the kernel catches the page fault and delivers SIGSEGV, which by default kills just that process. If that code were running in kernel mode, the same NULL dereference would trigger a kernel oops or panic and could take down the entire system.
The cost of this boundary (roughly 100-200 cycles per syscall, before mitigations such as KPTI) is the price of safety. The industry has explored alternatives: microkernels (move more code to user space and use IPC instead of syscalls, although IPC overhead often exceeds syscall overhead), unikernels (link a single application together with the kernel in one address space, giving up internal isolation for maximum performance), and library OSes (like Demikernel) that give each application its own kernel library for I/O. Each trades isolation for performance in different ways.
The pragmatic answer: for 99% of production systems, the syscall overhead is negligible compared to the cost of a security breach or a system crash. The 1% that cannot afford it (high-frequency trading, DPDK networking) use kernel bypass techniques that are carefully scoped to specific I/O paths while keeping the general-purpose kernel for everything else.
Follow-up: What is a microkernel and why hasn’t it replaced the monolithic kernel in practice?
A microkernel runs the absolute minimum in kernel mode (address space management, IPC, scheduling) and moves everything else (file systems, device drivers, networking) into user-space servers that communicate via IPC. The theory is beautiful: a smaller trusted computing base, better fault isolation (a crashed driver does not bring down the kernel), and a cleaner architecture. In practice, the IPC overhead has historically been devastating. Every operation that was a function call in a monolithic kernel becomes a context switch plus a message copy. Mach-based systems were notoriously slow, which is why Apple’s XNU kernel (macOS) runs its BSD layer in the same kernel address space as Mach rather than as user-space servers, giving up most of the microkernel’s isolation benefit. L4 and seL4 have made progress with highly optimized IPC (sub-microsecond), and seL4 is formally verified. But Linux’s monolithic design with loadable modules has won in practice because the performance advantage is large and its extension mechanisms (eBPF, user-mode drivers), while not as clean, provide “good enough” fault isolation.
