OS Fundamentals & System Call Internals

Operating Systems exist to manage hardware and provide a safe abstraction for applications. A “Senior” engineer must understand the transition between the two worlds this creates: user space and kernel space.

Interview Frequency: Extremely High (90%+ of system programming interviews)
Key Topics: System calls, kernel/user space, vDSO, context switching, privilege levels
Time to Master: 12-15 hours
Prerequisites: C programming, basic computer architecture

0. What is an Operating System?

At the highest level, an Operating System is a resource manager and isolation layer:

Resource Manager:
- Multiplexes CPU time between many processes.
- Allocates and reclaims memory, files, sockets, and devices.
- Schedules and prioritizes work according to policy (throughput, latency, fairness, deadlines).
Isolation & Protection Layer:
- Prevents one program from corrupting another program’s memory.
- Prevents untrusted code from directly touching hardware.
- Enforces security boundaries (user vs kernel, containers, VMs).

A helpful analogy is a city operating authority:

Streets/highways ⇢ CPU cores and buses.
Buildings ⇢ processes.
Rooms ⇢ threads.
Zoning rules and permits ⇢ permissions and security policies.
Traffic lights ⇢ synchronization and scheduling.

The OS makes the city feel orderly and predictable to its “citizens” (programs) even though underneath, the physical world (hardware) is chaotic and failure-prone.

0.1 Core Responsibilities

Every mainstream OS (Linux, Windows, macOS, BSD, RTOS variants) implements the same core ideas:

Abstraction: Present simple interfaces (files, sockets, processes) instead of raw devices and registers.
Virtualization: Make a single physical CPU and memory look like many virtual CPUs and address spaces.
Isolation: Ensure faults in one address space do not corrupt others.
Coordination: Provide primitives (locks, signals, pipes, futexes) so concurrent entities can cooperate.
Accounting: Track which process used how much CPU, memory, I/O; enforce quotas and limits.

0.2 Types of Operating Systems

Monolithic Kernels

Examples: Linux, traditional Unix

Most services (drivers, file systems, networking) run in kernel mode. Fast but large kernel.

Microkernels

Examples: seL4, Minix3, QNX

Move many services to user space for stronger isolation. More message passing overhead.

Hybrid Kernels

Examples: Windows NT, macOS (XNU)

Combine monolithic and microkernel approaches for balance.

Real-Time OS

Examples: FreeRTOS, VxWorks

Trade general-purpose flexibility for strict deadline guarantees.

In this course, we primarily focus on Linux as the concrete example, but the mental models transfer to all of these.

0.5 From Source Code to Running Process

To make OS fundamentals concrete, walk through what happens when you compile and run a simple C program:

// main.c
#include <stdio.h>

int main(int argc, char **argv) {
    printf("hello\n");
    return 0;
}

Step 1: Compilation and Linking

The build runs in four stages: preprocessing, compilation, assembly, and linking.

Preprocessing:

gcc -E main.c -o main.i

Expands #include and macros into a single translation unit:

// Thousands of lines from stdio.h are now inserted
// ...
int main(int argc, char **argv) {
    printf("hello\n");
    return 0;
}

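Compilation:

gcc -S main.i -o main.s

Turns C into assembly (shown here in Intel syntax, which gcc emits when you add -masm=intel; the default is AT&T syntax):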

main:
    push    rbp
    mov     rbp, rsp
    lea     rdi, [rip + .LC0]
    call    puts
    mov     eax, 0
    pop     rbp
    ret
.LC0:
    .string "hello"

Assembly:

gcc -c main.s -o main.o

Turns assembly into machine code (object file). Still has unresolved symbols like puts.

Linking:

gcc main.o -o main

Invokes the linker to:

Resolve symbols (puts from libc)
Combine sections (.text, .data, .bss)
Produce ELF executable with entry point

ELF Structure (Executable and Linkable Format):

┌─────────────────────────┐
│  ELF Header             │  Magic number, entry point, section offsets
├─────────────────────────┤
│  Program Headers        │  How to load segments into memory
├─────────────────────────┤
│  .text (Code)           │  Machine instructions (read-only)
├─────────────────────────┤
│  .rodata (Constants)    │  "hello" string lives here
├─────────────────────────┤
│  .data (Initialized)    │  Initialized global variables
├─────────────────────────┤
│  .bss (Uninitialized)   │  Uninitialized globals (zero-filled by loader)
├─────────────────────────┤
│  Symbol Table           │  Function names, debugging info
├─────────────────────────┤
│  Section Headers        │  Metadata about each section
└─────────────────────────┘

At this point you have a program on disk, not yet a process.
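To connect the ELF diagram to something observable, a program can read the header of its own executable. The following is a minimal sketch, assuming a 64-bit Linux system that ships `<elf.h>`; the helper name `read_elf_entry` is made up for illustration and is not part of any library.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads the ELF header of `path` and stores the
 * entry point address in *entry.
 * Returns 0 on success, -1 on I/O error or bad magic, -2 if the file
 * is a valid ELF image but not a 64-bit one. */
int read_elf_entry(const char *path, unsigned long *entry) {
    Elf64_Ehdr hdr;
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    size_t n = fread(&hdr, 1, sizeof hdr, f);
    fclose(f);
    if (n != sizeof hdr)
        return -1;

    /* The first four bytes must be 0x7f 'E' 'L' 'F'. */
    if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0)
        return -1;
    if (hdr.e_ident[EI_CLASS] != ELFCLASS64)
        return -2;

    *entry = (unsigned long)hdr.e_entry;
    return 0;
}
```

Calling `read_elf_entry("/proc/self/exe", &entry)` inspects the running program itself; the value can be compared against the "Entry point address" line printed by `readelf -h ./main`.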

Step 2: Shell Creates a New Process

When you run:

$ ./main

What the shell does:

Your shell (itself a process) parses the command.
The shell calls fork():
- The kernel creates a child process by copying the parent’s PCB and page tables (copy-on-write).
- Parent and child now both exist; they differ only in the return value of fork().

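// Inside bash or sh (simplified):
pid_t pid = fork();
if (pid < 0) {
    perror("fork");              // Could not create a child
} else if (pid == 0) {
    // Child process
    execve("./main", argv, envp);
    _exit(127);                  // Only reached if execve() failed
} else {
    // Parent process
    waitpid(pid, NULL, 0);       // Wait for this child to finish
}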

Step 3: Child Calls `execve()`

In the child:

The shell calls execve("./main", ...).
The kernel:
- Reads the ELF headers from disk.
- Allocates a new address space.
- Maps code, data, stack, and shared libraries into that space.
- Sets up the initial user stack with argc/argv and environment.
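- Sets the program counter to the entry point. For a dynamically linked program this is the dynamic linker (ld-linux.so), which loads the shared libraries and then jumps to the C runtime entry point (_start); a static binary starts at _start directly.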

Memory Layout After execve():

High Address (0x7FFF...)
┌─────────────────────────┐
│  Kernel Space           │ ← Not accessible from user mode
├─────────────────────────┤
│  Stack (grows down ↓)   │ ← argc, argv, environment vars
│         ...             │
├─────────────────────────┤
│  Memory Mapped Region   │ ← Shared libraries (libc.so)
│  (libc, ld-linux.so)    │
├─────────────────────────┤
│  Heap (grows up ↑)      │ ← malloc() allocates here
│         ...             │
├─────────────────────────┤
│  .bss (uninitialized)   │ ← Zero-filled global variables
├─────────────────────────┤
│  .data (initialized)    │ ← Initialized global variables
├─────────────────────────┤
│  .text (code)           │ ← Your program's machine code
└─────────────────────────┘
Low Address (0x0000...)

At this moment the program becomes a process with its own PID and address space.

Step 4: C Runtime → `main` → Exit

The C runtime (crt1.o) runs first, initializing the runtime and calling your main().
Your code executes (printf("hello\n")), which itself issues syscalls under the hood (write() on stdout).
When main returns, the runtime calls exit(), which:
- Flushes stdio buffers.
- Invokes the exit_group syscall.
- Lets the kernel tear down the process (free memory, close FDs). The exit status stays in a zombie entry until the parent reaps it with wait().
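Steps 2 through 4 can be condensed into one small function. This is a sketch under stated assumptions: `run_shell` is a hypothetical helper name, and it launches `/bin/sh -c` rather than `./main` so that it works without a compiled binary on disk.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: runs `/bin/sh -c command` in a child process and
 * returns its exit status, or -1 if the child could not be created or
 * did not exit normally. */
int run_shell(const char *command) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork failed */

    if (pid == 0) {
        /* Child: replace this process image with the shell. */
        char *argv[] = { "sh", "-c", (char *)command, NULL };
        execve("/bin/sh", argv, environ);
        _exit(127);                     /* only reached if execve failed */
    }

    /* Parent: block until the child terminates, then decode its status. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, `run_shell("exit 7")` returns 7: the value travels from the child's exit call, through the kernel's record of the terminated process, to the parent's `waitpid()`.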

Complete Flow Diagram:

User types "./main"
        ↓
Shell receives command (shell is a running process, PID 1000)
        ↓
Shell calls fork() ──────────┐
        ↓                     ↓
Parent (PID 1000)        Child (PID 1001)
calls wait()             calls execve("./main")
blocks...                     ↓
                         Kernel loads ELF binary
                         Sets up address space
                         Maps .text, .data, stack
                         Jumps to _start
                              ↓
                         C runtime initializes
                              ↓
                         main() executes
                         printf() → write syscall
                              ↓
                         main() returns 0
                              ↓
                         exit(0) → exit_group syscall
                              ↓
                         Kernel cleans up process
                              ↓
Parent's wait() returns
Shell prints next prompt

This entire path is the lifecycle of a simple process; later chapters (Processes, Virtual Memory, Scheduling, File Systems) each zoom into one part of this story.

1. The Kernel vs. User Space

The CPU hardware enforces the boundary.

1.1 Privilege Levels (Protection Rings)

Each major architecture (x86/x86-64, ARM, RISC-V) defines its own set of levels.

x86/x86-64: Four Rings (though most OSes only use 2):

Ring 0 (Kernel Mode)    ← Full hardware access
Ring 1 (Device Drivers) ← Unused in modern OSes
Ring 2 (Device Drivers) ← Unused in modern OSes
Ring 3 (User Mode)      ← Restricted instructions

The Current Privilege Level (CPL) is stored in the low two bits of the CS register (Code Segment).

Privileged Instructions (only Ring 0):

HLT - halt the CPU
CLI/STI - disable/enable interrupts
MOV CR3, reg - change page tables
LGDT/LIDT - load GDT/IDT
IN/OUT - direct hardware I/O (on some systems)

ARM: Exception Levels (EL):

EL3 (Secure Monitor)   ← TrustZone, secure boot
EL2 (Hypervisor)       ← KVM, Xen
EL1 (OS Kernel)        ← Linux kernel
EL0 (User Application) ← Your programs

Transitions happen via:

SVC (SuperVisor Call) - syscall
HVC (HyperVisor Call)
SMC (Secure Monitor Call)

RISC-V: Privilege Modes:

M-Mode (Machine)       ← Firmware, bootloader
S-Mode (Supervisor)    ← OS kernel
U-Mode (User)          ← Applications

Optional:

Hypervisor extension (adds HS-mode, plus VS-mode and VU-mode for guests) for virtualization

1.2 What Each Mode Can Do

Capability	User Space (Ring 3 / EL0)	Kernel Space (Ring 0 / EL1)
Execute normal instructions	✓	✓
Access own virtual memory	✓	✓
Access all physical memory	✗	✓
Modify page tables	✗	✓
Disable interrupts	✗	✓
Execute I/O instructions	✗	✓
Halt the CPU	✗	✓
Load kernel modules	✗	✓

Why the restriction?

// If user space could do this:
asm("cli");  // Disable interrupts
while(1);    // Infinite loop
// The entire system would freeze!

// Or this:
volatile unsigned int *kernel_memory = (volatile unsigned int *)0xFFFF888000000000;
*kernel_memory = 0x90909090;  // Overwrite kernel memory
// System compromised!

The hardware enforces the boundary: executing a privileged instruction in user mode triggers a General Protection Fault (x86) or an illegal/undefined instruction exception (ARM/RISC-V), and touching a kernel address from user mode triggers a page fault. The kernel handles these by sending the offending process a signal (typically SIGSEGV or SIGILL), which terminates it by default.
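The enforcement described above can be observed safely by sacrificing a child process. The sketch below assumes x86-64 Linux with GCC or Clang inline assembly; `signal_from_privileged_insn` is a hypothetical helper name, and on other architectures it simply reports that the experiment does not apply.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: executes HLT in a child process and returns the
 * number of the signal that killed it, or -1 if the experiment is not
 * applicable (non-x86-64 build, fork failure, child not signaled). */
int signal_from_privileged_insn(void) {
#if defined(__x86_64__)
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);   /* make sure the default action applies */
        __asm__ volatile ("hlt");   /* privileged: faults at Ring 3 */
        _exit(0);                   /* not expected to be reached */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
#else
    return -1;
#endif
}
```

On x86-64 Linux the general protection fault is reported to the process as SIGSEGV, so the parent sees the child killed by that signal while the rest of the system keeps running.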

2. System Call Evolution (x86-64)

How does a program ask the kernel for help?

2.1 Legacy: `INT 0x80` (i386)

In the 32-bit era, applications used a software interrupt.

; Legacy 32-bit Linux syscall
mov eax, 4          ; syscall number (write)
mov ebx, 1          ; file descriptor (stdout)
mov ecx, msg        ; buffer
mov edx, 13         ; count
int 0x80            ; Trap to kernel

What happens:

CPU looks up interrupt vector 0x80 in the IDT (Interrupt Descriptor Table) and checks the gate's privilege level.
CPU switches to the kernel stack (loaded from the TSS) and pushes the user SS, ESP, EFLAGS, CS, and EIP.
CPU jumps to kernel’s interrupt handler.
Handler saves the remaining registers.
Handler calls the appropriate syscall function.
Handler returns using IRET.

Problem: Very slow (~300-500 cycles) due to:

IDT lookup and gate descriptor checks
Full register save/restore
Stack switching
Permission checks

2.2 Modern: `SYSCALL` (AMD) / `SYSENTER` (Intel)

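Dedicated fast system call instructions replaced the software interrupt: Intel added SYSENTER for 32-bit code, and AMD added SYSCALL, which became the standard entry mechanism in 64-bit mode.

SYSCALL Instruction (AMD64):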

; Modern 64-bit Linux syscall
mov rax, 1          ; syscall number (write)
mov rdi, 1          ; arg1: file descriptor
mov rsi, msg        ; arg2: buffer
mov rdx, 13         ; arg3: count
syscall             ; Fast system call

Hardware Magic:

No IDT lookup: CPU jumps to address stored in IA32_LSTAR MSR (Model Specific Register).
No hardware stack switch: RSP is left untouched; the kernel uses SWAPGS (IA32_KERNEL_GS_BASE) to reach its per-CPU data and loads a kernel stack itself.
Minimal save: Only saves RIP to RCX and RFLAGS to R11.

Kernel Entry Point (simplified):

; Entry point stored in LSTAR MSR
entry_SYSCALL_64:
    SWAPGS                  ; Switch to kernel GS (per-CPU area)
    mov    QWORD PTR gs:0x14, rsp   ; Save user stack
    mov    rsp, QWORD PTR gs:0x1c   ; Load kernel stack

    push   rax              ; Save registers
    push   rcx              ; (RCX = user RIP)
    push   r11              ; (R11 = user RFLAGS)
    ; ... save more registers

    call   do_syscall_64    ; C function dispatch

    ; ... restore registers
    pop    r11
    pop    rcx
    pop    rax

    mov    rsp, QWORD PTR gs:0x14   ; Restore user stack
    SWAPGS                  ; Switch back to user GS
    sysretq                 ; Return to user space

Performance: Much faster (~100-200 cycles). The savings come from:

Direct jump (no table lookup)
Minimal register save/restore
No IDT gate or segment descriptor loads (CS and SS come from the IA32_STAR MSR)
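The register convention above can be exercised from C without going through libc's wrappers. This is a minimal sketch, assuming GCC or Clang extended inline assembly on x86-64 Linux; `raw_syscall3` is a hypothetical helper, and on other architectures it falls back to libc's `syscall()`, which reports errors through `errno` instead of a negative return value.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issues a system call with up to three arguments.
 * On x86-64 the kernel's raw return value is passed through unchanged,
 * so failures appear as negative errno values (for example -EBADF). */
long raw_syscall3(long nr, long a1, long a2, long a3) {
#if defined(__x86_64__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                              /* RAX: return value      */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)     /* RAX, RDI, RSI, RDX     */
        : "rcx", "r11", "memory");               /* clobbered by SYSCALL   */
    return ret;
#else
    return syscall(nr, a1, a2, a3);
#endif
}
```

The clobber list is the part most easily forgotten: SYSCALL overwrites RCX with the return address and R11 with RFLAGS, so the compiler must be told not to keep live values there.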

2.3 ARM64: `SVC` Instruction

; ARM64 syscall
mov x8, #64         ; syscall number (write)
mov x0, #1          ; arg1: fd
ldr x1, =msg        ; arg2: buffer
mov x2, #13         ; arg3: count
svc #0              ; SuperVisor Call

Hardware behavior:

Saves PC to ELR_EL1
Saves PSTATE to SPSR_EL1
Jumps to exception vector
Kernel dispatches based on x8

2.4 RISC-V: `ECALL` Instruction

; RISC-V syscall
li a7, 64           ; syscall number (write)
li a0, 1            ; arg1: fd
la a1, msg          ; arg2: buffer
li a2, 13           ; arg3: count
ecall               ; Environment call

Cost Comparison Table:

Mechanism	Typical Cycles	Use Case
`INT 0x80`	300-500	Legacy 32-bit x86
`SYSENTER`	100-150	Intel x86-32 (deprecated)
`SYSCALL`	100-200	Modern x86-64
`SVC`	100-200	ARM64
`ECALL`	100-200	RISC-V
vDSO (no switch)	5-20	Kernel-provided user code

3. The vDSO (Virtual Dynamic Shared Object)

Some system calls are called thousands of times per second (e.g., gettimeofday(), clock_gettime()). Switching to kernel mode every time is a massive waste of CPU.

3.1 How it Works

The vDSO is a small ELF shared object (a few pages) that the kernel maps into every user process’s address space. It contains:

Code: Executable functions that run in user mode.
Data: Read-only kernel data (like current time), exposed through an adjacent data mapping ([vvar]).

Process Address Space:
┌─────────────────────────┐
│  Kernel Space           │
├─────────────────────────┤
│  Stack                  │
├─────────────────────────┤
│  [vdso] ← Magic page!   │ ← Kernel-provided, user-executable
│    - gettimeofday()     │
│    - clock_gettime()    │
│    - getcpu()           │
├─────────────────────────┤
│  Shared Libraries       │
│  (libc.so)              │
├─────────────────────────┤
│  Heap                   │
├─────────────────────────┤
│  .text                  │
└─────────────────────────┘

Example: gettimeofday() Without vDSO:

// Slow path: requires syscall
struct timeval tv;
gettimeofday(&tv, NULL);
// → syscall → kernel mode → read kernel clock → return
// Cost: ~200 cycles

Example: gettimeofday() With vDSO:

// Fast path: executes in user mode
struct timeval tv;
gettimeofday(&tv, NULL);
// → calls vDSO function → reads shared memory → returns
// Cost: ~20 cycles (no mode switch!)

3.2 Implementation Details

Kernel Side (sets up vDSO):

// In arch/x86/entry/vdso/vdso.c
static int __init init_vdso(void) {
    // Map vDSO page into every process
    vdso_pages[0] = alloc_page(GFP_KERNEL);
    copy_vdso_to_page(vdso_pages[0]);
    return 0;
}

// Kernel periodically updates shared time data
void update_vsyscall(struct timekeeper *tk) {
    vdso_data->wall_time_sec = tk->wall_time.tv_sec;
    vdso_data->wall_time_nsec = tk->wall_time.tv_nsec;
}

User Side (vDSO function):

// Inside vDSO (simplified)
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) {
    // Read from shared memory (no syscall!)
    ts->tv_sec = vdso_data->wall_time_sec;
    ts->tv_nsec = vdso_data->wall_time_nsec;
    return 0;
}

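Data consistency: The kernel periodically writes the current time into the data area shared with the vDSO, and readers use a sequence counter (seqlock) so they never observe a half-updated value. The real __vdso_clock_gettime also converts a hardware counter (such as the TSC) into nanoseconds; the simplified snippets here omit both details.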
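The sequence-counter idea can be sketched in portable C11. This is an illustration of the general seqlock pattern, not the kernel's implementation; `struct time_page`, `publish_time`, and `read_time` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared data page. */
struct time_page {
    atomic_uint seq;          /* odd while an update is in progress */
    _Atomic uint64_t sec;
    _Atomic uint64_t nsec;
};

/* Writer side (the kernel's role): bump seq to odd, update, bump to even. */
void publish_time(struct time_page *p, uint64_t sec, uint64_t nsec) {
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&p->nsec, nsec, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader side (the vDSO's role): retry until a stable, even seq is seen
 * both before and after reading the data. */
void read_time(struct time_page *p, uint64_t *sec, uint64_t *nsec) {
    unsigned before, after;
    do {
        before = atomic_load_explicit(&p->seq, memory_order_acquire);
        *sec = atomic_load_explicit(&p->sec, memory_order_relaxed);
        *nsec = atomic_load_explicit(&p->nsec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```

Readers never write to the shared data and never block the writer, which is why this pattern suits a page that user space maps read-only.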

3.3 Why Some Syscalls Can’t Use vDSO

Safe for vDSO:

Read-only operations
No side effects
Data changes slowly or predictably

Cannot use vDSO:

write() - modifies kernel state
fork() - creates process
mmap() - changes address space
open() - allocates file descriptor

Performance Impact:

// Benchmark: 10 million calls
clock_gettime(CLOCK_REALTIME, &ts);

// With vDSO:     ~200 ms (20 ns per call)
// Without vDSO:  ~2000 ms (200 ns per call)
// Speedup: 10x

3.4 Finding the vDSO

# View process mappings
$ cat /proc/self/maps | grep vdso
7ffd1a3fe000-7ffd1a400000 r-xp 00000000 00:00 0    [vdso]

# Dump vDSO symbols
$ objdump -T /lib/x86_64-linux-gnu/libc.so.6 | grep vdso
# (libc wraps vDSO calls)

# Using LD_SHOW_AUXV
$ LD_SHOW_AUXV=1 ./program 2>&1 | grep SYSINFO
AT_SYSINFO_EHDR: 0x7ffd1a3fe000  ← vDSO base address

Result: Zero mode switches. The “system call” runs entirely in user space.
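The same base address can be obtained programmatically from the auxiliary vector. The sketch below assumes Linux with glibc or musl, where `getauxval()` is declared in `<sys/auxv.h>`; the helper names `vdso_base` and `looks_like_elf` are made up for illustration.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Hypothetical helper: returns the address where the kernel mapped the
 * vDSO, or NULL if the auxiliary vector has no AT_SYSINFO_EHDR entry. */
const void *vdso_base(void) {
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* Hypothetical helper: checks for the 4-byte ELF magic at `image`. */
int looks_like_elf(const void *image) {
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };
    return image != NULL && memcmp(image, magic, sizeof magic) == 0;
}
```

If `vdso_base()` returns a non-NULL pointer, `looks_like_elf()` should report 1 for it, which shows that the mapping is an ordinary ELF image that the dynamic linker can resolve symbols from.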

4. `vsyscall`: The Legacy Fixed Address

Before the vDSO, there was vsyscall.

4.1 The Problem with vsyscall

Fixed virtual address: 0xffffffffff600000

Every process had:
┌───────────────────────────┐
│ 0xffffffffff600000:       │
│   gettimeofday code       │ ← Same address in EVERY process
│   time code               │
│   getcpu code             │
└───────────────────────────┘

Security Issue: Predictable addresses enable ROP attacks (Return-Oriented Programming). An attacker could reliably:

// Exploit code
return_address = 0xffffffffff600000 + offset;
// Execute attacker-controlled syscalls

4.2 Modern Linux Solution

vsyscall is now emulated:

// In arch/x86/entry/vsyscall/vsyscall_64.c
static bool is_vsyscall_vaddr(unsigned long vaddr) {
    return vaddr >= VSYSCALL_ADDR && vaddr < VSYSCALL_ADDR + PAGE_SIZE;
}

// Trap and emulate
do_page_fault() {
    if (is_vsyscall_vaddr(address)) {
        // Emulate the syscall (SLOW!)
        emulate_vsyscall();
        return;
    }
}

Result: vsyscall addresses still work (for legacy binaries), but they trap to the kernel instead of executing directly. Modern code uses vDSO instead.

5. Kernel Entry: `entry_SYSCALL_64`

When the SYSCALL instruction is executed, the CPU jumps to this assembly entry point in the kernel.

5.1 Complete Entry Path (x86-64)

; arch/x86/entry/entry_64.S

ENTRY(entry_SYSCALL_64)
    /*
     * Interrupts are off on entry.
     * We are in kernel mode, but need to switch stacks.
     */

    /* 1. SWAPGS: Switch GS register to kernel per-CPU area */
    SWAPGS

    /* 2. Save user stack pointer */
    movq    %rsp, PER_CPU_VAR(rsp_scratch)

    /* 3. Load kernel stack */
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

    /* 4. Build pt_regs structure on kernel stack */
    pushq   $__USER_DS              /* User data segment */
    pushq   PER_CPU_VAR(rsp_scratch)  /* User RSP */
    pushq   %r11                    /* User RFLAGS (saved by SYSCALL) */
    pushq   $__USER_CS              /* User code segment */
    pushq   %rcx                    /* User RIP (saved by SYSCALL) */

    pushq   %rax    /* Syscall number */
    pushq   %rdi    /* Arg 1 */
    pushq   %rsi    /* Arg 2 */
    pushq   %rdx    /* Arg 3 */
    pushq   %r10    /* Arg 4 (RCX was overwritten) */
    pushq   %r8     /* Arg 5 */
    pushq   %r9     /* Arg 6 */

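    /* 5. Call the C dispatcher: do_syscall_64(regs, nr) */
    movq    %rsp, %rdi              /* Arg 1: pointer to pt_regs */
    movslq  %eax, %rsi              /* Arg 2: syscall number */
    call    do_syscall_64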

    /* 6. Restore user registers from pt_regs */
    popq    %r9
    popq    %r8
    popq    %r10
    popq    %rdx
    popq    %rsi
    popq    %rdi
    popq    %rax    /* Return value */

    popq    %rcx    /* User RIP */
    addq    $8, %rsp  /* Skip CS */
    popq    %r11    /* User RFLAGS */
    popq    %rsp    /* User RSP (the saved DS/SS slot above it is abandoned) */

    /* 7. SWAPGS back to user GS */
    SWAPGS

    /* 8. SYSRETQ: Return to user space */
    sysretq
END(entry_SYSCALL_64)

5.2 The C Dispatcher

// In arch/x86/entry/common.c

__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr) {
    // Security checks
    nr = syscall_enter_from_user_mode(regs, nr);

    // Validate syscall number (the unsigned compare also rejects negative values)
    if (likely((unsigned int)nr < NR_syscalls)) {
        // Look up in syscall table and call
        regs->ax = sys_call_table[nr](regs);
    }

    syscall_exit_to_user_mode(regs);
}

5.3 The Syscall Table

// In arch/x86/entry/syscall_64.c

const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    [0] = __x64_sys_read,
    [1] = __x64_sys_write,
    [2] = __x64_sys_open,
    [3] = __x64_sys_close,
    [4] = __x64_sys_stat,
    // ... 300+ more syscalls
    [39] = __x64_sys_getpid,
    [57] = __x64_sys_fork,
    [59] = __x64_sys_execve,
    // ...
};

Security Note: The syscall table is marked read-only after boot. Modifying it is a common rootkit technique.
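The dispatcher and table boil down to a bounds check followed by an indirect call. The toy below runs in user space and mirrors that shape; every name in it (`toy_table`, `toy_dispatch`, and the two handlers) is hypothetical and unrelated to kernel symbols.

```c
#include <errno.h>

typedef long (*toy_handler_t)(long a, long b);

static long toy_add(long a, long b) { return a + b; }
static long toy_sub(long a, long b) { return a - b; }

/* Designated initializers index the table by call number,
 * in the same style as sys_call_table. */
static const toy_handler_t toy_table[] = {
    [0] = toy_add,
    [1] = toy_sub,
};

#define TOY_NR_CALLS (sizeof toy_table / sizeof toy_table[0])

/* Validates the number, then calls through the table.
 * Unknown numbers yield -ENOSYS, the error Linux uses for
 * unimplemented system calls. */
long toy_dispatch(unsigned int nr, long a, long b) {
    if (nr >= TOY_NR_CALLS)
        return -ENOSYS;
    return toy_table[nr](a, b);
}
```

Taking `nr` as an unsigned value means a negative number passed by a caller wraps to a huge index and fails the bounds check, so one comparison covers both ends of the range.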

5.4 Complete Flow Diagram

User Program                  CPU Hardware              Kernel
─────────────────────────────────────────────────────────────
mov rax, 1
mov rdi, 1
mov rsi, buffer
mov rdx, 13
syscall ──────────────────>  SYSCALL instruction
                             - Save RIP → RCX
                             - Save RFLAGS → R11
                             - Load RIP from LSTAR MSR ───────> entry_SYSCALL_64:
                             - Switch to Ring 0                   SWAPGS
                                                                  Save user RSP
                                                                  Load kernel stack
                                                                  Push registers (pt_regs)
                                                                  call do_syscall_64
                                                                     ↓
                                                                  sys_call_table[1]
                                                                     ↓
                                                                  __x64_sys_write()
                                                                     ↓
                                                                  ksys_write()
                                                                     ↓
                                                                  vfs_write()
                                                                     ↓
                                                                  [... kernel work ...]
                                                                     ↓
                                                                  return bytes_written
                                                                     ↓
                                                                  Pop registers
                                                                  Restore user RSP
                                                                  SWAPGS
                             SYSRETQ <────────────────────────── sysretq
                             - Restore RIP from RCX
                             - Restore RFLAGS from R11
                             - Switch to Ring 3
RAX = bytes_written <───────
continue program...

6. Processes vs. Threads vs. Kernel Tasks

Before tracing real system calls end to end, it is critical to distinguish the units of execution the kernel manages:

Process

Isolation Unit

Own virtual address space
Own page tables (mm_struct)
Own resources (FDs, signals, cwd)
Heavyweight context switch

Thread

Execution Unit

Shares address space with process
Own stack and registers
Own TID
Lightweight context switch (same CR3)

Kernel Thread

Kernel Worker

No user address space
Lives entirely in kernel
Examples: kswapd, kworker
No user/kernel mode switches (calls kernel functions directly, never issues syscalls)

6.1 Visualization

Think of a process as a house, and threads as people inside the house:

Process = House
├── Address Space = House structure (walls, roof)
├── Resources
│   ├── File Descriptors = Shared utilities (kitchen, bathroom)
│   ├── Signal Handlers = House rules (fire alarm protocol)
│   └── Current Directory = Current room the house is in
└── Threads = People inside
    ├── Thread 1
    │   ├── Stack = Private bedroom
    │   └── Registers = Personal belongings
    ├── Thread 2
    │   ├── Stack = Private bedroom
    │   └── Registers = Personal belongings
    └── ...

6.2 Why It Matters for System Calls

Example: read() system call

// Thread 1
ssize_t n = read(fd, buf1, size);  // May block

Issued by a thread
CPU time charged to the process
May block only this thread, not the whole process
Other threads can continue executing

Example: fork() system call

// Process with 5 threads calls fork()
pid_t pid = fork();

Creates a new process with copy-on-write address space
The child initially has only one thread (the caller)
Even though parent had 5 threads, they don’t get copied
Child’s single thread continues from the fork() return point
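The copy-on-write behaviour of fork() is easy to confirm: a write in the child must not be visible in the parent. This is a minimal sketch; `fork_isolation_demo` and the variable `shared_looking_value` are hypothetical names.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Looks shared because both processes use the same virtual address,
 * but after fork() each process has its own copy. */
static int shared_looking_value = 1;

/* Hypothetical helper: the child overwrites the variable and reports
 * what it sees through its exit status; the parent reports its own
 * view. Returns 0 on success, -1 on failure. */
int fork_isolation_demo(int *parent_value, int *child_value) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        shared_looking_value = 99;      /* triggers a private page copy */
        _exit(shared_looking_value);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    *child_value = WEXITSTATUS(status);
    *parent_value = shared_looking_value;
    return 0;
}
```

After the call, the child's value is 99 while the parent's is still 1. With threads the outcome would differ, because threads share one address space.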

Kernel’s View:

// Linux doesn't distinguish! Everything is a task_struct

// Thread vs Process determined by clone() flags (simplified):
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD);  // = Thread
clone(SIGCHLD);                                                          // = Process (what fork() does)

You will see these distinctions repeatedly in later chapters (Scheduling, Synchronization, Signals, and Linux Internals).

7. System Call Deep Dive: Real Linux Examples

7.1 Example: `write()` System Call

User space call:

#include <unistd.h>
ssize_t n = write(1, "Hello\n", 6);

Libc wrapper (glibc/sysdeps/unix/sysv/linux/write.c):

ssize_t __write(int fd, const void *buf, size_t count) {
    return INLINE_SYSCALL_CALL(write, fd, buf, count);
}

INLINE_SYSCALL_CALL expands to:

mov rax, 1      ; __NR_write
mov rdi, 1      ; fd
mov rsi, buf    ; buffer
mov rdx, 6      ; count
syscall

Kernel entry (arch/x86/entry/syscall_64.c):

sys_call_table[1] = __x64_sys_write;

Syscall implementation (fs/read_write.c):

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count) {
    return ksys_write(fd, buf, count);
}

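ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count) {
    struct fd f = fdget_pos(fd);
    if (!f.file)
        return -EBADF;

    loff_t pos = f.file->f_pos;          // Start from the current file offset
    ssize_t ret = vfs_write(f.file, buf, count, &pos);
    if (ret >= 0)
        f.file->f_pos = pos;             // Advance the offset on success
    fdput_pos(f);
    return ret;
}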

VFS layer (fs/read_write.c):

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos) {
    ssize_t ret;

    // Check permissions
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;

    // Check file operations
    if (!file->f_op->write && !file->f_op->write_iter)
        return -EINVAL;

    // Call filesystem-specific write
    if (file->f_op->write)
        ret = file->f_op->write(file, buf, count, pos);
    else
        ret = new_sync_write(file, buf, count, pos);

    return ret;
}

Filesystem layer (e.g., ext4):

// fs/ext4/file.c
static ssize_t ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) {
    // Allocate disk blocks
    // Update inode metadata
    // Write to page cache
    // Mark pages dirty
    return generic_file_write_iter(iocb, from);
}

Complete Stack:

User Application: write(1, "Hello\n", 6)
        ↓
Libc Wrapper: __write()
        ↓
Assembly: syscall instruction
        ↓
entry_SYSCALL_64
        ↓
do_syscall_64
        ↓
sys_call_table[1] → __x64_sys_write
        ↓
ksys_write(fd, buf, count)
        ↓
vfs_write(file, buf, count, pos)
        ↓
ext4_file_write_iter() [if file is on ext4]
        ↓
generic_file_write_iter()
        ↓
Page Cache operations
        ↓
Return bytes written
        ↓
sysretq back to user space

7.2 Example: `getpid()` - Minimal Path

User space:

pid_t pid = getpid();

getpid() is a real system call. The vDSO does not export it, and glibc has not cached the PID in user space since version 2.25, so every call enters the kernel:

// glibc wrapper (simplified)
pid_t getpid(void) {
    return INLINE_SYSCALL_CALL(getpid);
}

Kernel implementation (kernel/sys.c):

SYSCALL_DEFINE0(getpid) {
    // Thread Group ID of the caller, as seen from its PID namespace
    return task_tgid_vnr(current);
}

Cost: roughly the bare syscall round trip (~100-200 cycles), since the handler does almost no work. That makes getpid() a convenient way to measure pure syscall overhead.

7.3 Example: `open()` - Complex Path

int fd = open("/home/user/file.txt", O_RDWR | O_CREAT, 0644);

Syscall path:

// fs/open.c
SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode) {
    return do_sys_open(AT_FDCWD, filename, flags, mode);
}

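long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode) {
    // 1. Reserve an unused file descriptor number
    int fd = get_unused_fd_flags(flags);
    if (fd < 0)
        return fd;

    // 2. Path lookup and open (simplified argument list)
    struct file *f = do_filp_open(dfd, filename, flags, mode);
    if (IS_ERR(f)) {
        put_unused_fd(fd);          // Give the reserved number back
        return PTR_ERR(f);
    }

    // 3. Install in FD table
    fd_install(fd, f);

    return fd;
}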

Path lookup (fs/namei.c):

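struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op) {
    struct nameidata nd;
    int flags = op->lookup_flags;
    struct file *filp;

    // RCU-walk: lockless path walk
    filp = path_openat(&nd, op, flags | LOOKUP_RCU);
    // Falls back to ref-walk if the lockless walk had to bail out
    if (unlikely(filp == ERR_PTR(-ECHILD)))
        filp = path_openat(&nd, op, flags);

    // path_openat creates/opens the inode, allocates struct file,
    // and links it to the dentry cache
    return filp;
}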

Work done:

String parsing (/home/user/file.txt → components)
Dentry cache lookups (hot path)
Inode cache lookups
Disk reads (cold path, if not cached)
Permission checks (each directory component)
File allocation
FD table modification

Cost: Highly variable

Hot (all cached): 1-2 μs
Cold (disk reads): 1-10 ms

8. Performance Analysis of System Calls

8.1 Measuring Syscall Overhead

Microbenchmark:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 10000000

// Elapsed nanoseconds between two timestamps
static long elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    struct timespec start, end, ts;

    // Benchmark clock_gettime() (vDSO, no kernel entry)
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &ts);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("clock_gettime: %ld ns per call\n", elapsed_ns(start, end) / ITERATIONS);

    // Benchmark write() to /dev/null (real syscall)
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    char buf[1] = {0};

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        write(fd, buf, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("write: %ld ns per call\n", elapsed_ns(start, end) / ITERATIONS);

    close(fd);
    return 0;
}

Typical Results (modern Intel/AMD):

clock_gettime: ~20 ns per call   ← vDSO, no syscall
write:         ~200 ns per call  ← Real syscall

8.2 Using `perf` to Analyze

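# Count syscalls by type (quote the glob so the shell does not expand it)
perf stat -e 'syscalls:sys_enter_*' ./program

# Summarize or trace syscalls with strace
strace -c ./program  # Per-syscall counts and total time
strace -T ./program  # Each call with the time spent in it

# Profile syscall overhead
perf record -e raw_syscalls:sys_enter,raw_syscalls:sys_exit ./program
perf report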

8.3 Syscall Batching Strategies

Bad: Many small syscalls

// DON'T DO THIS
for (int i = 0; i < 1000; i++) {
    write(fd, &data[i], 1);  // 1000 syscalls!
}

Good: One large syscall

// DO THIS
write(fd, data, 1000);  // 1 syscall

Better: Vectored I/O

struct iovec iov[10];
for (int i = 0; i < 10; i++) {
    iov[i].iov_base = buffers[i];
    iov[i].iov_len = sizes[i];
}
writev(fd, iov, 10);  // 1 syscall for multiple buffers
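A complete, checkable version of the vectored write looks like this. It is a sketch with made-up buffer contents; `write_three_parts` is a hypothetical helper, and writing into a pipe is simply a convenient way to read the result back.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: sends three separate buffers to `fd` with one
 * writev() call. Returns the number of bytes written, or -1 on error. */
ssize_t write_three_parts(int fd) {
    char part1[] = "GET ";
    char part2[] = "/index.html";
    char part3[] = "\n";

    struct iovec iov[3] = {
        { .iov_base = part1, .iov_len = strlen(part1) },
        { .iov_base = part2, .iov_len = strlen(part2) },
        { .iov_base = part3, .iov_len = strlen(part3) },
    };

    /* The kernel gathers the buffers in order, as if they had been
     * concatenated, without the caller copying them together first. */
    return writev(fd, iov, 3);
}
```

The benefit over building one large buffer is that headers and payloads living in different places can be sent with a single kernel entry and no extra user-space copy.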

Best: Asynchronous I/O (io_uring)

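// Queue several operations in the shared submission ring (liburing),
// then hand them all to the kernel with a single syscall
struct io_uring_sqe *sqe1 = io_uring_get_sqe(ring);
io_uring_prep_write(sqe1, fd, buf1, len1, offset1);
struct io_uring_sqe *sqe2 = io_uring_get_sqe(ring);
io_uring_prep_write(sqe2, fd, buf2, len2, offset2);
struct io_uring_sqe *sqe3 = io_uring_get_sqe(ring);
io_uring_prep_write(sqe3, fd, buf3, len3, offset3);
io_uring_submit(ring);  // One syscall for all 3!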

9. Security Implications

9.1 Spectre/Meltdown and Syscalls

Meltdown (2018) exploited speculative execution:

// Attack code
char *kernel_addr = (char *)0xFFFF888000000000;
char value = *kernel_addr;  // Would normally fault

// But CPU speculatively loads it!
// Side channel timing attack can extract value

KPTI Fix (Kernel Page Table Isolation):

Before KPTI:

User process page tables included kernel mappings
→ Fast syscalls (no CR3 switch)
→ Vulnerable to Meltdown

After KPTI:

User process page tables: User space only
Kernel page tables: Kernel + user space
→ CR3 switch on every syscall/interrupt
→ Immune to Meltdown
→ 5-30% performance hit (mitigated by PCID)

PCID (Process Context Identifier):

Instead of flushing TLB on CR3 switch:
Tag TLB entries with PCID
→ User TLB entries coexist with kernel TLB entries
→ Much lower performance impact
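Whether a given machine runs with this mitigation can be checked from user space. The sketch below assumes a Linux kernel that exposes `/sys/devices/system/cpu/vulnerabilities/meltdown`; `read_first_line` is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies the first line of `path` into `buf`
 * (without the trailing newline). Returns 0 on success, -1 on error. */
int read_first_line(const char *path, char *buf, size_t len) {
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (!ok)
        return -1;

    buf[strcspn(buf, "\n")] = '\0';   /* strip the newline, if any */
    return 0;
}
```

Passing the sysfs path above typically yields a short status string such as "Mitigation: PTI" or "Not affected"; the exact wording depends on the kernel version and CPU.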

9.2 Syscall Filtering with seccomp

seccomp-BPF: Filter which syscalls a process can make

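#include <stddef.h>          // offsetof
#include <linux/filter.h>    // struct sock_filter, BPF_STMT, BPF_JUMP
#include <linux/seccomp.h>   // SECCOMP_RET_*, struct seccomp_data
#include <sys/prctl.h>       // prctl
#include <sys/syscall.h>     // __NR_read, __NR_write, __NR_exit

// Allow only read, write, exit (a production filter should also check seccomp_data.arch)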
struct sock_filter filter[] = {
    // Load syscall number
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),

    // Allow read (0)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow write (1)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow exit (60)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Kill process for anything else
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};

struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

// Now can only use read, write, exit

Use cases:

Docker containers
Chrome sandbox
systemd service hardening
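
To watch a filter fire without killing your own shell session, the check can run in a forked child. This is a sketch under assumptions: it is Linux-only, the helper name is hypothetical, and for brevity it skips the architecture check that a production filter should include.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

// Forks a child, installs a filter that allows only write and exit_group,
// then has the child call getpid (not on the allow list).
// Returns 1 if the kernel killed the child with SIGSYS, 0 if the child
// survived, -1 if fork failed or the filter could not be installed.
int seccomp_kills_forbidden_syscall(void) {
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 2, 0),       // -> ALLOW
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 1, 0),  // -> ALLOW
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);             // seccomp unavailable in this environment
        syscall(SYS_getpid);      // forbidden: the kernel raises SIGSYS
        _exit(0);                 // only reached if the filter did not fire
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSYS) return 1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2) return -1;
    return 0;
}
```

Because the filter is installed after fork(), only the child is restricted; the parent just inspects the wait status.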

10. Interview Deep Dive Questions

Q1: Explain exactly what happens during a system call on x86-64

Complete Answer:

User Space:

1. Application loads syscall number into RAX
2. Arguments into RDI, RSI, RDX, R10, R8, R9
3. Executes SYSCALL instruction

CPU Hardware:

4. Saves user RIP to RCX (return address)
5. Saves user RFLAGS to R11
6. Loads kernel RIP from IA32_LSTAR MSR
7. Switches to Ring 0 (kernel mode)
8. Jumps to entry_SYSCALL_64

Kernel Entry:

9. SWAPGS (switch to kernel GS register)
10. Save user RSP, load kernel stack
11. Build pt_regs structure (save all registers)
12. Call do_syscall_64(pt_regs, nr)

Kernel Dispatch:

13. Validate syscall number
14. Look up sys_call_table[nr]
15. Call syscall handler function
16. Function does actual work (VFS, scheduler, etc.)
17. Return value placed in RAX

Kernel Exit:

18. Restore registers from pt_regs
19. Restore user RSP
20. SWAPGS back to user GS
21. Execute SYSRETQ

CPU Hardware:

22. Restore RIP from RCX
23. Restore RFLAGS from R11
24. Switch to Ring 3 (user mode)
25. Jump to user code

Cost: ~100-200 CPU cycles on modern hardware

Q2: Why does the vDSO exist and what syscalls benefit from it?

Answer:

Problem: Switching to kernel mode and back is expensive (~100-200 cycles). For frequently called syscalls like gettimeofday(), this overhead dominates.

Solution: vDSO (Virtual Dynamic Shared Object)

How it works:

Kernel maps a small shared object of executable code into every process
This code contains user-space implementations of certain syscalls
Kernel keeps the data they need (e.g. time values) up to date in an adjacent page that user space can only read
Libc resolves these functions to vDSO instead of syscalls
Calls execute entirely in user space (no mode switch)

Syscalls that use vDSO:

gettimeofday() - reads kernel’s time data
clock_gettime() - reads clock data
getcpu() - reads current CPU number
time() - simplified time call

Performance impact:

Without vDSO: ~200 ns per call
With vDSO:    ~20 ns per call
Speedup: 10x

Why only certain syscalls?:

Must be read-only (no side effects)
Data must be safely readable from user space
Data changes must be atomic/consistent
Cannot require kernel state modifications

Security: vDSO code is kernel-provided, mapped at random addresses (ASLR), and cannot be modified by user space.
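
The difference can be measured directly by timing the libc call against a forced syscall. A rough sketch, not a rigorous benchmark; `clock_gettime_ns_per_call` is a hypothetical helper name, and results vary with CPU, clocksource, and enabled mitigations:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// Average wall-clock nanoseconds per clock_gettime call.
// force_syscall != 0 goes through syscall(2) and bypasses the vDSO.
double clock_gettime_ns_per_call(int force_syscall, long iters) {
    struct timespec start, end, ts;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++) {
        if (force_syscall)
            syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);  // real mode switch
        else
            clock_gettime(CLOCK_MONOTONIC, &ts);               // vDSO if available
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double total_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                    + (double)(end.tv_nsec - start.tv_nsec);
    return total_ns / (double)iters;
}
```

Calling it with `force_syscall` set to 0 and then 1 over a few million iterations usually makes the gap visible. If both numbers are similar, the clocksource may not be vDSO-capable on that machine.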

Q3: What is SWAPGS and why is it critical for security?

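Answer:

SWAPGS is an x86-64 instruction that exchanges the current GS base with the value held in the IA32_KERNEL_GS_BASE MSR, which the kernel points at its per-CPU data.

Purpose:

In kernel mode: GS base points to the kernel per-CPU area
In user mode: GS base is whatever user space set it to (on Linux x86-64, thread-local storage normally lives behind FS, so GS is often unused)
At kernel entry no register or stack can be trusted yet, so SWAPGS is how the kernel finds its own data

Why it’s needed:

User mode:
  GS base → user-controlled address (contents are up to the program)

Kernel mode (offsets are illustrative):
  GS:0 → current task_struct pointer
  GS:8 → CPU number
  GS:16 → kernel stack pointer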

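Security-critical:

The SWAPGS attack (CVE-2019-1125) is a Spectre v1 variant
Exception and interrupt entry paths execute SWAPGS conditionally (only when arriving from user mode)
A mispredicted branch can make the CPU speculatively run kernel code with the user GS base
Speculative GS-relative loads then read attacker-chosen addresses and can leak data through a cache side channel

Attack scenario:

// User sets GS base to an attacker-chosen value (e.g. via arch_prctl(ARCH_SET_GS))
// An exception or interrupt enters the kernel

// If SWAPGS is speculatively skipped:
current = *(task_struct **)GS:0;  // Speculatively reads attacker-chosen memory!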

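Mitigation:

SWAPGS must execute before any GS-relative memory access in the kernel
The syscall entry runs SWAPGS unconditionally as its first instruction
Entry paths that run SWAPGS conditionally add an LFENCE so speculation cannot run ahead with the wrong GS base

Modern defense (LFENCE after conditional SWAPGS), simplified; CS_OFFSET and CURRENT_TASK are placeholder names:

exception_entry:
    test byte [rsp + CS_OFFSET], 3  ; Did we arrive from user mode?
    jz   from_kernel                ; Branch can be mispredicted
    swapgs
    lfence                          ; Speculation barrier
    mov  rax, gs:[CURRENT_TASK]     ; Now safe to use GS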
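
User space can inspect its own FS and GS bases, which shows which one actually carries thread-local storage on a given system. A sketch assuming Linux on x86-64 (other architectures simply report failure); `read_segment_bases` is a hypothetical helper name:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>
#if defined(__x86_64__)
#include <asm/prctl.h>   // ARCH_GET_FS, ARCH_GET_GS
#endif

// Stores the calling thread's user-mode FS and GS base addresses.
// Returns 0 on success, -1 on failure or on non-x86-64 builds.
int read_segment_bases(unsigned long *fs_base, unsigned long *gs_base) {
#if defined(__x86_64__)
    if (syscall(SYS_arch_prctl, ARCH_GET_FS, fs_base) != 0) return -1;
    if (syscall(SYS_arch_prctl, ARCH_GET_GS, gs_base) != 0) return -1;
    return 0;
#else
    (void)fs_base;
    (void)gs_base;
    return -1;
#endif
}
```

With glibc the FS base is non-zero because it points at the thread control block, while the GS base is usually zero unless the program sets it. Either way the value is under user control, which is why the kernel must never dereference GS before SWAPGS has run.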

Q4: Compare the cost of system calls vs function calls

Answer:

Function Call (user → user):

call foo
  push return_address
  jmp foo
foo:
  ; function body
  ret

Cost: 5-10 cycles

Push return address
Branch (usually predicted correctly)
Return (predicted via RAS - Return Address Stack)

System Call (user → kernel → user):

syscall
  ; Save RIP, RFLAGS
  ; Switch privilege level
  ; Switch stack
  ; Jump to kernel
[kernel work]
sysretq
  ; Restore RIP, RFLAGS
  ; Switch privilege level
  ; Switch stack
  ; Jump to user

Cost: 100-200+ cycles

Mode switch overhead: ~50 cycles
Register save/restore: ~20 cycles
TLB effects (if PCID not used): ~50 cycles
Cache effects (kernel code not in L1): ~50+ cycles

Real-world comparison:

// Function call
int add(int a, int b) { return a + b; }
add(1, 2);  // ~10 cycles

// Syscall
getpid();   // ~200 cycles (always a real syscall; getpid is not in the vDSO)

Why syscalls are slow:

Privilege level transition (Ring 3 → Ring 0 → Ring 3)
Page table switch (if KPTI enabled)
TLB flush (if PCID not supported)
Cache pollution (kernel code evicts user code)
Security checks and barriers

Ratio: Syscalls are 20-40x slower than function calls

Optimization strategies:

Batch operations (writev vs many writes)
Use vDSO when available
Use memory mapping (mmap) to avoid read/write syscalls
Use io_uring for async I/O with minimal syscalls
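
As an illustration of the mmap strategy, the sketch below scans a file through a mapping instead of a read() loop. `count_lines_mmap` is a hypothetical helper name, and error handling is kept minimal:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Counts '\n' bytes in a file by mapping it into memory.
// Returns the count, or -1 on error.
long count_lines_mmap(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }   // mmap rejects length 0

    const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_PRIVATE, fd, 0);
    close(fd);                     // the mapping stays valid after close
    if (data == MAP_FAILED) return -1;

    long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n') lines++;

    munmap((void *)data, (size_t)st.st_size);
    return lines;
}
```

The fixed cost is a handful of syscalls (open, fstat, mmap, close, munmap) regardless of file size. The trade-off is that first access to each page still enters the kernel through a page fault, so mmap helps most when data is reused or accessed non-sequentially.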

Q5: How do system calls differ across architectures?

Answer:

Aspect	x86-64	ARM64	RISC-V
Instruction	`SYSCALL`	`SVC #0`	`ECALL`
Syscall #	RAX	x8	a7
Args	RDI,RSI,RDX,R10,R8,R9	x0-x5	a0-a5
Return	RAX	x0	a0
Max Args	6	6	6

x86-64:

mov rax, 1      ; syscall number
mov rdi, 1      ; arg1
mov rsi, buf    ; arg2
mov rdx, len    ; arg3
syscall         ; trap to kernel
; return value in RAX

ARM64:

mov x8, #64     ; syscall number (write)
mov x0, #1      ; arg1 (fd)
ldr x1, =buf    ; arg2 (buffer)
mov x2, len     ; arg3 (count)
svc #0          ; supervisor call
; return value in x0

RISC-V:

li a7, 64       ; syscall number
li a0, 1        ; arg1
la a1, buf      ; arg2
li a2, len      ; arg3
ecall           ; environment call
; return value in a0

Key differences:

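Calling convention:
- x86-64 differs from its function-call ABI in one register: the 4th argument goes in R10 instead of RCX, because SYSCALL overwrites RCX with the return address
- ARM64/RISC-V pass syscall arguments in the same registers as function arguments (x0-x5, a0-a5); only the syscall number register (x8, a7) is extra
Syscall numbering:
- Syscall numbers are defined per architecture
- write is #1 on x86-64, #64 on ARM64 and RISC-V (both use the kernel's generic syscall table)
- Forces architecture-specific syscall tables
Mode switching:
- x86: Ring 0 vs Ring 3
- ARM: EL0 vs EL1
- RISC-V: U-mode vs S-mode
Performance:
- Similar order of magnitude (~100-200 cycles), varying with microarchitecture and enabled mitigations
- RISC architectures slightly cleaner (fewer legacy modes)
- All provide a vDSO, though the set of accelerated functions differs per architecture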
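
The register conventions can be put side by side in one C function. This is a sketch assuming GCC or Clang on Linux; `raw_syscall3` is a hypothetical helper, and architectures other than x86-64 and ARM64 (including RISC-V here) fall back to libc's generic syscall() wrapper:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

// Issues a 3-argument syscall and returns the kernel's raw result:
// the return value on success, or a negative errno value on failure.
long raw_syscall3(long nr, long a1, long a2, long a3) {
#if defined(__x86_64__)
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)                            // result in RAX
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)   // RAX, RDI, RSI, RDX
                      : "rcx", "r11", "memory");             // clobbered by SYSCALL
    return ret;
#elif defined(__aarch64__)
    register long x8 __asm__("x8") = nr;   // syscall number
    register long x0 __asm__("x0") = a1;   // arg1, also the return value
    register long x1 __asm__("x1") = a2;
    register long x2 __asm__("x2") = a3;
    __asm__ volatile ("svc #0"
                      : "+r"(x0)
                      : "r"(x8), "r"(x1), "r"(x2)
                      : "memory");
    return x0;
#else
    // Fallback: libc wrapper, converted back to the raw kernel convention
    long ret = syscall(nr, a1, a2, a3);
    return (ret == -1) ? -errno : ret;
#endif
}
```

Passing the per-architecture `SYS_write` constant from <sys/syscall.h> keeps the caller portable even though the number itself differs (1 vs 64). Note the raw convention: the kernel returns `-EBADF` directly, and it is libc that turns this into -1 plus errno.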

11. Hands-On Practice

Lab 1: Tracing System Calls

# Trace all syscalls
strace ./program

# Count syscalls by type
strace -c ./program

# Trace only file opens, reads, and writes (modern libc opens files with openat)
strace -e trace=open,openat,read,write ./program

# Show timing per syscall
strace -T ./program

# Attach to running process
strace -p <pid>

Lab 2: Minimal Syscall (No libc)

// syscall_raw.c - Direct syscall without libc (x86-64 Linux)
#include <stddef.h>        // size_t
#include <sys/syscall.h>   // __NR_write (header only, no libc code is linked)

// printf is unavailable here: no libc is linked
static inline long my_write(int fd, const void *buf, size_t count) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a" (ret)
        : "0"(__NR_write), "D"(fd), "S"(buf), "d"(count)
        : "rcx", "r11", "memory"   // SYSCALL overwrites RCX and R11
    );
    return ret;
}

void _start(void) {
    // static: keeps the string off the stack, so no stack-protector code is emitted
    static const char msg[] = "Hello from raw syscall!\n";
    my_write(1, msg, sizeof(msg) - 1);

    // exit(0): _start has no caller to return to
    __asm__ volatile (
        "mov $60, %%rax\n"  // __NR_exit
        "xor %%rdi, %%rdi\n"
        "syscall"
        ::: "rax", "rdi", "rcx", "r11", "memory"
    );
    __builtin_unreachable();
}

Compile without libc:

gcc -nostdlib -static -fno-stack-protector -o syscall_raw syscall_raw.c
./syscall_raw

Lab 3: Benchmark Syscall Overhead

// benchmark_syscall.c
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>   // SYS_getpid

#define ITERATIONS 10000000

// RDTSC counts reference (TSC) ticks, which approximate but do not equal core cycles
static inline long long rdtsc(void) {
    unsigned int lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((long long)hi << 32) | lo;
}

int main(void) {
    long long start, end;

    // Warm up
    for (int i = 0; i < 1000; i++) getpid();

    // Benchmark getpid() through the libc wrapper
    // (getpid is not in the vDSO; glibc before 2.25 cached the PID in user space)
    start = rdtsc();
    for (int i = 0; i < ITERATIONS; i++) {
        getpid();
    }
    end = rdtsc();
    printf("getpid: %lld cycles/call\n", (end - start) / ITERATIONS);

    // Benchmark syscall(SYS_getpid) (generic wrapper, always a real syscall)
    start = rdtsc();
    for (int i = 0; i < ITERATIONS; i++) {
        syscall(SYS_getpid);
    }
    end = rdtsc();
    printf("syscall(getpid): %lld cycles/call\n", (end - start) / ITERATIONS);

    return 0;
}

Lab 4: Examine vDSO

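// find_vdso.c
#include <stdio.h>
#include <string.h>
#include <elf.h>   // AT_SYSINFO_EHDR

int main(int argc, char **argv, char **envp) {
    (void)argc;
    (void)argv;

    // Skip past environment variables
    char **p = envp;
    while (*p) p++;

    // The auxiliary vector starts right after the NULL that ends envp
    unsigned long *auxv = (unsigned long *)(p + 1);

    // Entries are (type, value) pairs, terminated by AT_NULL (0)
    while (*auxv) {
        if (auxv[0] == AT_SYSINFO_EHDR) {  // 33
            printf("vDSO base address: 0x%lx\n", auxv[1]);
        }
        auxv += 2;
    }

    // Also check /proc/self/maps. Read it in-process: system("cat /proc/self/maps")
    // would print the maps of the cat child, not of this program.
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[256];
    while (maps && fgets(line, sizeof line, maps)) {
        if (strstr(line, "[vdso]")) fputs(line, stdout);
    }
    if (maps) fclose(maps);

    return 0;
}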
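
Walking the stack layout by hand is instructive, but glibc also exposes the auxiliary vector through getauxval(). A shorter sketch, assuming glibc 2.16 or later; both helper names are hypothetical:

```c
#include <elf.h>        // AT_SYSINFO_EHDR, ELFMAG, SELFMAG
#include <string.h>
#include <sys/auxv.h>   // getauxval

// Returns the vDSO base address, or 0 if the kernel did not provide one.
unsigned long vdso_base(void) {
    return getauxval(AT_SYSINFO_EHDR);
}

// Returns 1 if the memory at `base` starts with the ELF magic bytes.
int looks_like_elf(unsigned long base) {
    return base != 0 && memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

The magic-byte check confirms that the vDSO is an ordinary ELF shared object sitting in the process's address space, which is why the dynamic linker can resolve symbols from it like any other library.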

Summary for Senior Engineers

System Calls Aren't Free

Even with SYSCALL, there is a 100-200 cycle cost. Batch your syscalls (writev, io_uring).

vDSO is Magic

Some “syscalls” execute entirely in user space. This is why they don’t show in strace.

SWAPGS is Critical

A Spectre v1 variant (the SWAPGS attack) targeted it. It must happen before any GS-relative kernel memory access.

Syscall Table is Kernel's API

Modifying it is how rootkits hide. Kernel marks it read-only after boot.

Key Takeaways:

Privilege separation is enforced by hardware (CPU rings/modes)
System calls are the only way for user code to deliberately request kernel services (interrupts and exceptions also enter the kernel, but not at the program's request)
vDSO eliminates syscall overhead for frequently-used operations
Security mechanisms (KPTI, SWAPGS, seccomp) protect the syscall interface
Performance matters: Modern systems minimize user/kernel mode switches
