Operating Systems exist to manage hardware and provide a safe abstraction for applications. A senior engineer must understand the actual hardware mechanics of crossing between these two worlds.
Interview Frequency: Extremely High (90%+ of system programming interviews)
Key Topics: System calls, kernel/user space, vDSO, context switching, privilege levels
Time to Master: 12-15 hours
Prerequisites: C programming, basic computer architecture
At the highest level, an Operating System is a resource manager and isolation layer:
Resource Manager:
Multiplexes CPU time between many processes.
Allocates and reclaims memory, files, sockets, and devices.
Schedules and prioritizes work according to policy (throughput, latency, fairness, deadlines).
Isolation & Protection Layer:
Prevents one program from corrupting another program’s memory.
Prevents untrusted code from directly touching hardware.
Enforces security boundaries (user vs kernel, containers, VMs).
A helpful analogy is a city operating authority:
Streets/highways ⇢ CPU cores and buses.
Buildings ⇢ processes.
Rooms ⇢ threads.
Zoning rules and permits ⇢ permissions and security policies.
Traffic lights ⇢ synchronization and scheduling.
The OS makes the city feel orderly and predictable to its “citizens” (programs) even though underneath, the physical world (hardware) is chaotic and failure-prone.
The C runtime (crt1.o) runs first, initializing the runtime and calling your main().
Your code executes (printf("hello\n")), which itself issues syscalls under the hood (write() on stdout).
When main returns, the runtime calls exit(), which:
Flushes stdio buffers.
Invokes the exit_group syscall.
Lets the kernel tear down the process (free memory, close FDs, reap the PCB).
Complete Flow Diagram:
```text
User types "./main"
        ↓
Shell receives command (shell is a running process, PID 1000)
        ↓
Shell calls fork() ──────────┐
        ↓                    ↓
Parent (PID 1000)      Child (PID 1001)
calls wait()           calls execve("./main")
blocks...                    ↓
                       Kernel loads ELF binary
                       Sets up address space
                       Maps .text, .data, stack
                       Jumps to _start
                             ↓
                       C runtime initializes
                             ↓
                       main() executes
                       printf() → write syscall
                             ↓
                       main() returns 0
                             ↓
                       exit(0) → exit_group syscall
                             ↓
                       Kernel cleans up process
                             ↓
Parent's wait() returns
Shell prints next prompt
```
This entire path is the lifecycle of a simple process; later chapters (Processes, Virtual Memory, Scheduling, File Systems) each zoom into one part of this story.
```text
Ring 0 (Kernel Mode)     ← Full hardware access
Ring 1 (Device Drivers)  ← Unused in modern OSes
Ring 2 (Device Drivers)  ← Unused in modern OSes
Ring 3 (User Mode)       ← Restricted instructions
```
The Current Privilege Level (CPL) is stored in the CS (Code Segment) register.

Privileged Instructions (only Ring 0):
```c
// If user space could do this:
asm("cli");   // Disable interrupts
while (1);    // Infinite loop
// The entire system would freeze!

// Or this:
void *kernel_memory = (void *)0xFFFF888000000000;
*kernel_memory = 0x90909090;  // Overwrite kernel code
// System compromised!
```
The hardware enforces that attempts to execute privileged instructions in user mode trigger a General Protection Fault (x86) or Illegal Instruction exception (ARM/RISC-V), which the kernel handles by terminating the offending process.
x86-64 introduced a dedicated instruction for syscalls.

SYSCALL Instruction (AMD64):
```asm
; Modern 64-bit Linux syscall
mov rax, 1     ; syscall number (write)
mov rdi, 1     ; arg1: file descriptor
mov rsi, msg   ; arg2: buffer
mov rdx, 13    ; arg3: count
syscall        ; Fast system call
```
Hardware Magic:
No IDT lookup: CPU jumps to address stored in IA32_LSTAR MSR (Model Specific Register).
No stack lookup: Uses IA32_KERNEL_GS_BASE for per-CPU data.
Minimal save: Only saves RIP and RFLAGS to RCX and R11.
Kernel Entry Point (simplified):
```asm
; Entry point stored in LSTAR MSR
entry_SYSCALL_64:
    SWAPGS                        ; Switch to kernel GS (per-CPU area)
    mov QWORD PTR gs:0x14, rsp    ; Save user stack
    mov rsp, QWORD PTR gs:0x1c    ; Load kernel stack
    push rax                      ; Save registers
    push rcx                      ; (RCX = user RIP)
    push r11                      ; (R11 = user RFLAGS)
    ; ... save more registers
    call do_syscall_64            ; C function dispatch
    ; ... restore registers
    pop r11
    pop rcx
    pop rax
    mov rsp, QWORD PTR gs:0x14    ; Restore user stack
    SWAPGS                        ; Switch back to user GS
    sysretq                       ; Return to user space
```
Performance: Much faster than the legacy int 0x80 interrupt path (~100-200 cycles). The savings come from the points above: no IDT lookup, MSR-based entry, and minimal hardware register saving.
Some system calls are called thousands of times per second (e.g., gettimeofday(), clock_gettime()). Switching to kernel mode every time is a massive waste of CPU.
```c
// In arch/x86/entry/vdso/vdso.c (simplified)
static int __init init_vdso(void) {
    // Map vDSO page into every process
    vdso_pages[0] = alloc_page(GFP_KERNEL);
    copy_vdso_to_page(vdso_pages[0]);
    return 0;
}

// Kernel periodically updates shared time data
void update_vsyscall(struct timekeeper *tk) {
    vdso_data->wall_time_sec  = tk->wall_time.tv_sec;
    vdso_data->wall_time_nsec = tk->wall_time.tv_nsec;
}
```
User Side (vDSO function):

```c
// Inside vDSO (simplified)
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) {
    // Read from shared memory (no syscall!)
    ts->tv_sec  = vdso_data->wall_time_sec;
    ts->tv_nsec = vdso_data->wall_time_nsec;
    return 0;
}
```
Data consistency: The kernel periodically writes the current time into the data portion of the vDSO page, using seqlocks (or atomic updates) so that user-space readers always see a consistent snapshot.
```c
// Benchmark: 10 million calls
clock_gettime(CLOCK_REALTIME, &ts);

// With vDSO:    ~200 ms  (20 ns per call)
// Without vDSO: ~2000 ms (200 ns per call)
// Speedup: 10x
```
The legacy vsyscall mechanism (the vDSO's predecessor) mapped code at a fixed virtual address: 0xffffffffff600000. Every process had:

```text
┌───────────────────────────┐
│ 0xffffffffff600000:       │
│   gettimeofday code       │ ← Same address in EVERY process
│   time code               │
│   getcpu code             │
└───────────────────────────┘
```
```c
// In arch/x86/entry/common.c (simplified)
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr) {
    // Security checks
    nr = syscall_enter_from_user_mode(regs, nr);

    // Validate syscall number
    if (likely(nr < NR_syscalls)) {
        // Look up in syscall table and call
        regs->ax = sys_call_table[nr](regs);
    }
    syscall_exit_to_user_mode(regs);
}
```
```c
// Thread 1
ssize_t n = read(fd, buf1, size);  // May block
```
Issued by a thread
CPU time charged to the process
May block only this thread, not the whole process
Other threads can continue executing
Example: fork() system call
```c
// Process with 5 threads calls fork()
pid_t pid = fork();
```
Creates a new process with copy-on-write address space
The child initially has only one thread (the caller)
Even though the parent had 5 threads, the other four are not copied
Child’s single thread continues from the fork() return point
Kernel’s View:
```c
// Linux doesn't distinguish! Everything is a task_struct
// Thread vs Process determined by clone() flags:
clone(CLONE_VM | CLONE_FS | CLONE_FILES);  // = Thread
clone(SIGCHLD);                            // = Process
```
You will see these distinctions repeatedly in later chapters (Scheduling, Synchronization, Signals, and Linux Internals).
```c
// DON'T DO THIS
for (int i = 0; i < 1000; i++) {
    write(fd, &data[i], 1);  // 1000 syscalls!
}
```
Good: One large syscall
```c
// DO THIS
write(fd, data, 1000);  // 1 syscall
```
Better: Vectored I/O
```c
struct iovec iov[10];
for (int i = 0; i < 10; i++) {
    iov[i].iov_base = buffers[i];
    iov[i].iov_len  = sizes[i];
}
writev(fd, iov, 10);  // 1 syscall for multiple buffers
```
Best: Asynchronous I/O (io_uring)
```c
// Submit many operations with zero syscalls
io_uring_prep_write(sqe1, fd, buf1, len1, offset1);
io_uring_prep_write(sqe2, fd, buf2, len2, offset2);
io_uring_prep_write(sqe3, fd, buf3, len3, offset3);
io_uring_submit(ring);  // One syscall for all 3!
```
```c
// Attack code
char *kernel_addr = (char *)0xFFFF888000000000;
char value = *kernel_addr;  // Would normally fault
// But the CPU speculatively loads it!
// A side-channel timing attack can extract the value
```
Before KPTI:

```text
User process page tables included kernel mappings
→ Fast syscalls (no CR3 switch)
→ Vulnerable to Meltdown
```
After KPTI:
```text
User process page tables: User space only
Kernel page tables: Kernel + user space
→ CR3 switch on every syscall/interrupt
→ Immune to Meltdown
→ 5-30% performance hit (mitigated by PCID)
```
PCID (Process Context Identifier):
```text
Instead of flushing the TLB on CR3 switch:
Tag TLB entries with a PCID
→ User TLB entries coexist with kernel TLB entries
→ Much lower performance impact
```
Q1: Explain exactly what happens during a system call on x86-64
Complete Answer:

User Space:
Application loads syscall number into RAX
Arguments into RDI, RSI, RDX, R10, R8, R9
Executes SYSCALL instruction
CPU Hardware:
4. Saves user RIP to RCX (return address)
5. Saves user RFLAGS to R11
6. Loads kernel RIP from IA32_LSTAR MSR
7. Switches to Ring 0 (kernel mode)
8. Jumps to entry_SYSCALL_64

Kernel Entry:
9. SWAPGS (switch to kernel GS register)
10. Save user RSP, load kernel stack
11. Build pt_regs structure (save all registers)
12. Call do_syscall_64(pt_regs, nr)

Kernel Dispatch:
13. Validate syscall number
14. Look up sys_call_table[nr]
15. Call syscall handler function
16. Function does actual work (VFS, scheduler, etc.)
17. Return value placed in RAX

Kernel Exit:
18. Restore registers from pt_regs
19. Restore user RSP
20. SWAPGS back to user GS
21. Execute SYSRETQ

CPU Hardware:
22. Restore RIP from RCX
23. Restore RFLAGS from R11
24. Switch to Ring 3 (user mode)
25. Jump to user code

Cost: ~100-200 CPU cycles on modern hardware
Q2: Why does the vDSO exist and what syscalls benefit from it?
Answer:

Problem: Context switching to kernel mode is expensive (~100-200 cycles). For frequently-called syscalls like gettimeofday(), this overhead dominates.

Solution: vDSO (Virtual Dynamic Shared Object)

How it works:
Kernel maps a page of executable code into every process
This page contains implementations of certain syscalls
Kernel periodically updates read-only data in this page
Libc resolves these functions to vDSO instead of syscalls
Calls execute entirely in user space (no mode switch)
Syscalls that use vDSO:
gettimeofday() - reads kernel’s time data
clock_gettime() - reads clock data
getcpu() - reads current CPU number
time() - simplified time call
Performance impact:
```text
Without vDSO: ~200 ns per call
With vDSO:     ~20 ns per call
Speedup: 10x
```
Why only certain syscalls?:
Must be read-only (no side effects)
Data must be safely readable from user space
Data changes must be atomic/consistent
Cannot require kernel state modifications
Security: vDSO code is kernel-provided, mapped at random addresses (ASLR), and cannot be modified by user space.
Q3: What is SWAPGS and why is it critical for security?
Answer:

SWAPGS is an x86-64 instruction that atomically swaps the GS base register with a kernel-specific per-CPU value.

Purpose:
GS register points to per-CPU data structures
In user mode: GS points to user thread-local storage (TLS)
In kernel mode: GS points to kernel per-CPU area
Why it’s needed:
```text
User mode:
  GS:0  → thread ID
  GS:8  → errno location
  GS:16 → TLS data

Kernel mode:
  GS:0  → current task_struct pointer
  GS:8  → CPU number
  GS:16 → kernel stack pointer
```
Security-critical:
The SWAPGS speculation attack (CVE-2019-1125, a Spectre v1 variant) exploited paths where SWAPGS could be speculatively skipped
CPU could speculatively execute kernel code with user GS
Kernel would read wrong data, leak information
Attack scenario:
```c
// User sets GS to a malicious value, then:
syscall();  // Enter kernel
// If SWAPGS is not executed:
current = *(task_struct **)GS:0;  // Reads attacker-controlled memory!
```
```shell
# Trace all syscalls
strace ./program

# Count syscalls by type
strace -c ./program

# Trace only open/read/write
strace -e trace=open,read,write ./program

# Show timing per syscall
strace -T ./program

# Attach to a running process
strace -p <pid>
```