System Call Interface

System calls are the only legitimate way for user-space programs to request services from the kernel. Understanding syscalls deeply is essential for observability engineers who trace application behavior and infrastructure engineers who debug performance issues.

Interview Frequency: Very High (especially at observability companies)
Key Topics: syscall mechanism, vDSO, seccomp, overhead analysis
Time to Master: 12-14 hours

What Are System Calls?

System calls are the interface between user-space applications and the kernel:

System Calls in Linux

┌─────────────────────────────────────────────────────────────────────────────┐
│                     USER SPACE TO KERNEL TRANSITION                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   USER SPACE                                                                 │
│   ──────────                                                                │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  Application                                                         │   │
│   │       ↓                                                              │   │
│   │  libc wrapper (e.g., read())                                        │   │
│   │       ↓                                                              │   │
│   │  Set up registers: syscall number, arguments                        │   │
│   │       ↓                                                              │   │
│   │  SYSCALL instruction                                                │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│   ═══════════════════════════╪═══════════════════════════════════════════   │
│                              │ CPU switches to ring 0                        │
│   ═══════════════════════════╪═══════════════════════════════════════════   │
│                              ↓                                               │
│   KERNEL SPACE                                                               │
│   ────────────                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  entry_SYSCALL_64 (arch/x86/entry/entry_64.S)                       │   │
│   │       ↓                                                              │   │
│   │  Save user registers, switch to kernel stack                        │   │
│   │       ↓                                                              │   │
│   │  Look up syscall in sys_call_table                                  │   │
│   │       ↓                                                              │   │
│   │  Call syscall handler (e.g., ksys_read)                            │   │
│   │       ↓                                                              │   │
│   │  Return to user space via SYSRET/IRET                               │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Understanding the Transition

When an application makes a system call, the CPU must transition from user mode (Ring 3) to kernel mode (Ring 0). This is a privileged operation that involves:

Saving user context: All registers are saved so we can return to exactly where we left off
Switching stacks: User stack → Kernel stack (each process has both)
Changing privilege level: Ring 3 → Ring 0 (CPU enforces this)
Executing kernel code: The actual syscall handler runs
Returning to user mode: Restore context and switch back to Ring 3

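This transition is expensive (roughly 200-500 CPU cycles for a trivial call) because of the mode switch itself, the entry and exit work such as security checks, and cache effects. Note that a syscall is a mode switch, not a context switch: the same process keeps running unless the call blocks. Understanding this overhead is crucial for writing performant systems.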

x86-64 Syscall Mechanism

The SYSCALL Instruction

On x86-64, the syscall instruction is the fast path for entering the kernel:

; User space syscall invocation
mov    rax, 0       ; syscall number (0 = read)
mov    rdi, 0       ; arg1: fd (stdin)
mov    rsi, buffer  ; arg2: buffer pointer
mov    rdx, 100     ; arg3: count
syscall             ; Enter kernel
; Return value in rax

Register Convention

Register	Purpose
`rax`	Syscall number (input), return value (output)
`rdi`	Argument 1
`rsi`	Argument 2
`rdx`	Argument 3
`r10`	Argument 4 (not rcx: the syscall instruction overwrites rcx with the return address and r11 with RFLAGS)
`r8`	Argument 5
`r9`	Argument 6

MSR Configuration

The CPU needs to know where to jump on syscall:

// These MSRs are set during boot:
// MSR_LSTAR (0xC0000082) - Syscall entry point address
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

// MSR_STAR - Segment selectors for syscall/sysret
// MSR_SYSCALL_MASK - Flags to clear on syscall

Syscall Entry Point Deep Dive

The syscall entry point is one of the most critical pieces of kernel code:

// Simplified from arch/x86/entry/entry_64.S
SYM_CODE_START(entry_SYSCALL_64)
    swapgs                      // Switch GS base to kernel per-CPU data
    // Stash user RSP in per-CPU storage, then load the kernel stack
    movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
    
    // Push user registers onto kernel stack
    pushq   $__USER_DS          // SS
    pushq   PER_CPU_VAR(...)    // RSP
    pushq   %r11                // RFLAGS (saved by syscall)
    pushq   $__USER_CS          // CS
    pushq   %rcx                // RIP (saved by syscall)
    
    // Save remaining registers
    PUSH_AND_CLEAR_REGS
    
    // Call C handler
    movq    %rsp, %rdi          // arg 1: pt_regs pointer
    movslq  %eax, %rsi          // arg 2: syscall number (sign-extended)
    call    do_syscall_64
    
    // Return to user space
    ...
    sysretq
SYM_CODE_END(entry_SYSCALL_64)

do_syscall_64 - The C Entry Point

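// arch/x86/entry/common.c (simplified; details vary by kernel version)
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
    add_random_kstack_offset();  // Security: randomize kernel stack offset
    nr = syscall_enter_from_user_mode(regs, nr);  // ptrace, seccomp, audit hooks
    
    // Unsigned compare so negative numbers fail the range check
    if (likely((unsigned int)nr < NR_syscalls)) {
        nr = array_index_nospec(nr, NR_syscalls);  // Spectre v1 mitigation
        regs->ax = sys_call_table[nr](regs);       // Call handler
    }
    // Otherwise regs->ax keeps the -ENOSYS preset by the entry code
    
    syscall_exit_to_user_mode(regs);  // signals, rescheduling, exit tracing
}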

System Call Table

The syscall table maps syscall numbers to handler functions:

// arch/x86/entry/syscall_64.c
const sys_call_ptr_t sys_call_table[] = {
    [0]   = __x64_sys_read,
    [1]   = __x64_sys_write,
    [2]   = __x64_sys_open,
    [3]   = __x64_sys_close,
    [9]   = __x64_sys_mmap,
    [56]  = __x64_sys_clone,
    [57]  = __x64_sys_fork,
    [59]  = __x64_sys_execve,
    [60]  = __x64_sys_exit,
    ...
};
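
The table-plus-bounds-check pattern can be modelled in ordinary user-space C. The sketch below is a toy, not kernel code: every `toy_` name is hypothetical, the register struct is cut down to four fields, and the two handlers are arbitrary. It only illustrates how a number indexes an array of function pointers and why out-of-range numbers end up as `-ENOSYS`.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for struct pt_regs: return value plus three arguments. */
struct toy_regs {
    long ax, di, si, dx;
};

typedef long (*toy_call_ptr_t)(const struct toy_regs *);

static long toy_add(const struct toy_regs *r) { return r->di + r->si; }
static long toy_neg(const struct toy_regs *r) { return -r->di; }

/* Number -> handler mapping, using designated initializers like the kernel. */
static const toy_call_ptr_t toy_call_table[] = {
    [0] = toy_add,
    [1] = toy_neg,
};

#define TOY_NR_CALLS (sizeof(toy_call_table) / sizeof(toy_call_table[0]))

/* Dispatch: validate the number, then call through the table. */
static void toy_do_syscall(struct toy_regs *regs, unsigned long nr)
{
    if (nr < TOY_NR_CALLS)
        regs->ax = toy_call_table[nr](regs);
    else
        regs->ax = -ENOSYS;   /* unknown number: "function not implemented" */
}
```

With `di = 2` and `si = 3`, number 0 leaves 5 in `ax`, number 1 leaves -2, and any number past the end of the table leaves `-ENOSYS`.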

Finding Syscall Numbers

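# Method 1: Header files
# (on Debian/Ubuntu the path is /usr/include/x86_64-linux-gnu/asm/unistd_64.h)
grep -E "^#define __NR_" /usr/include/asm/unistd_64.h | head -20

# Method 2: ausyscall tool (part of the Linux audit userspace tools)
ausyscall --dump | head -20

# Method 3: C preprocessor
printf '#include <sys/syscall.h>\nSYS_read SYS_write\n' | gcc -E -P - | tail -n 1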

vDSO - Virtual Dynamic Shared Object

The Time Query Problem

Before vDSO, getting the current time was surprisingly expensive.

The problem: Applications call gettimeofday() or clock_gettime() very frequently, in some cases millions of times per second:

Web servers log every request with timestamps
Databases track transaction times
Profilers measure code execution
Games render frames with timing

The cost: Each call was a full syscall (~200-500 cycles of overhead) just to read a value the kernel already maintains.

The insight: Applications only ever read timekeeping data, they never write it. Why enter the kernel to read it?

The solution: The vDSO maps kernel-maintained timekeeping data, together with a small amount of code, into every process. That code combines the data with a hardware counter (the TSC on x86) to compute the current time entirely in user space, with no syscall in the common case.

vDSO is a kernel mechanism that serves certain syscalls without entering kernel mode:

┌─────────────────────────────────────────────────────────────────────────────┐
│                            vDSO MECHANISM                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Without vDSO (slow):                                                       │
│   ┌────────────┐     syscall     ┌────────────┐                             │
│   │ User Space │ ──────────────→ │   Kernel   │                             │
│   │            │ ←────────────── │            │                             │
│   └────────────┘     return      └────────────┘                             │
│        ~200-500 cycles overhead                                              │
│                                                                              │
│   With vDSO (fast):                                                          │
│   ┌────────────┐  function call  ┌────────────┐                             │
│   │ User Space │ ──────────────→ │   vDSO     │  (still user space!)        │
│   │            │ ←────────────── │  (shared)  │                             │
│   └────────────┘     return      └────────────┘                             │
│        tens of cycles (no mode switch)                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

vDSO Functions

Function	What it does
`__vdso_clock_gettime`	Get current time (most common)
`__vdso_gettimeofday`	Legacy time function
`__vdso_time`	Simple seconds since epoch
`__vdso_getcpu`	Get current CPU/NUMA node

How vDSO Works

Kernel maps a special page into every process
Page contains code that can read kernel data (time, CPU)
Kernel updates shared data (timekeeping) periodically
User-space reads data without entering kernel

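# Find vDSO mapping
grep vdso /proc/self/maps
# 7fff12345000-7fff12346000 r-xp 00000000 00:00 0 [vdso]

# Dump vDSO contents (the address is per-process, so read it from the same process)
python3 - <<'EOF'
line = next(l for l in open('/proc/self/maps') if '[vdso]' in l)
start, end = (int(x, 16) for x in line.split()[0].split('-'))
mem = open('/proc/self/mem', 'rb', 0); mem.seek(start)
open('/tmp/vdso.so', 'wb').write(mem.read(end - start))
EOF
objdump -T /tmp/vdso.so    # list the exported __vdso_* symbols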
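
A process can also locate its own vDSO without parsing /proc: the kernel passes the base address in the auxiliary vector. The sketch below assumes glibc's `getauxval()` and the `AT_SYSINFO_EHDR` entry; the function names `vdso_base` and `vdso_is_elf` are illustrative. It checks only that the mapping begins with an ELF header, which is what makes the vDSO usable as a shared object.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Returns the vDSO base address for this process, or NULL if none is mapped. */
static const void *vdso_base(void)
{
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the vDSO mapping starts with the ELF magic bytes, else 0. */
static int vdso_is_elf(void)
{
    const unsigned char *base = vdso_base();

    return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}
```

Because the address comes from the process itself, this approach is unaffected by address space randomization. Some sandboxed environments map no vDSO at all, in which case `vdso_base()` returns NULL.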

vDSO Performance Impact

// Benchmark: clock_gettime performance
#include <time.h>

void benchmark() {
    struct timespec ts;
    
    // This uses vDSO (~20ns)
    for (int i = 0; i < 1000000; i++) {
        clock_gettime(CLOCK_MONOTONIC, &ts);
    }
}

Interview Insight: “gettimeofday/clock_gettime are the most frequently called syscalls in many applications. vDSO makes them essentially free, which is why you rarely see them in syscall traces as performance problems.”
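
One way to see the difference is to time the same query twice: once through libc, which normally resolves to the vDSO, and once forced through the real syscall path with `syscall(SYS_clock_gettime, ...)`. This is a rough sketch, not a rigorous benchmark; the function names are illustrative, and on systems where the clock source cannot be read from user space both paths will cost about the same.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC samples. */
static double elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1e9 + (b->tv_nsec - a->tv_nsec);
}

/* Average cost of clock_gettime() through libc (normally the vDSO path). */
static double ns_per_call_libc(int iterations)
{
    struct timespec start, end, ts;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(&start, &end) / iterations;
}

/* Same query, forced through the kernel entry path on every call. */
static double ns_per_call_syscall(int iterations)
{
    struct timespec start, end, ts;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++)
        syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(&start, &end) / iterations;
}
```

Printing both results for a few million iterations gives a direct view of what the mode switch costs on a particular machine.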

Syscall Overhead Analysis

Why Are Syscalls Expensive?

Syscalls are among the more expensive operations a user-space program performs routinely.

The fundamental problem: CPU privilege levels. User code runs in ring 3 (unprivileged), kernel code runs in ring 0 (privileged). Switching between them is expensive.

What makes it expensive:

Mode switch: CPU must save all registers, switch stacks, change privilege level
Security checks: Validate arguments, check permissions, run seccomp filters
Cache pollution: Kernel code evicts user code from CPU caches
Meltdown/Spectre mitigations: KPTI (the Meltdown mitigation) adds page table switching on every entry and exit

Real-world impact: Assuming roughly 200-500 ns per call, a program making 100,000 syscalls/second spends 20-50 ms of every second, or 2-5% of one core, on syscall overhead before doing any actual work.
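
The 2-5% estimate is simple arithmetic, which a small helper makes explicit. This is a minimal sketch: the function names are illustrative, and the per-call cost is an assumed input, not a measurement.

```c
/* Converts a cycle count to nanoseconds at a given clock rate in GHz. */
static double cycles_to_ns(double cycles, double ghz)
{
    return cycles / ghz;
}

/* Fraction of one CPU core consumed by per-call overhead alone. */
static double overhead_fraction(double calls_per_sec, double ns_per_call)
{
    return calls_per_sec * ns_per_call / 1e9;
}
```

`overhead_fraction(100000, 200)` gives 0.02 and `overhead_fraction(100000, 500)` gives 0.05, which is the 2-5% range. By comparison, `cycles_to_ns(500, 3.0)` is only about 167 ns, so cycle-based and nanosecond-based figures should be converted before they are compared.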

Cost Breakdown

┌─────────────────────────────────────────────────────────────────────────────┐
│                      SYSCALL OVERHEAD BREAKDOWN                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. Mode Switch (~50-100 cycles)                                            │
│     - SYSCALL instruction                                                    │
│     - Save user registers                                                    │
│     - Load kernel stack                                                      │
│                                                                              │
│  2. Kernel Entry Work (~50-100 cycles)                                      │
│     - Entry tracing (if enabled)                                            │
│     - Audit logging (if enabled)                                            │
│     - seccomp check (if enabled)                                            │
│                                                                              │
│  3. Actual Syscall Work (varies)                                            │
│     - Read: depends on data availability                                    │
│     - Write: depends on buffer state                                        │
│     - Memory operations: page faults, allocation                            │
│                                                                              │
│  4. Return Path (~50-100 cycles)                                            │
│     - Signal checking                                                        │
│     - Rescheduling if needed                                                │
│     - Exit tracing                                                          │
│     - SYSRET/IRET                                                           │
│                                                                              │
│  Minimum overhead: ~200-500 cycles (trivial syscall)                        │
│  Typical overhead: 1000+ cycles (with I/O)                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Measuring Syscall Overhead

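// Measure getpid() overhead (does almost nothing)
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// Returns the average cost of one syscall in nanoseconds
long measure_syscall_overhead(int iterations) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    for (int i = 0; i < iterations; i++) {
        syscall(SYS_getpid);
    }
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    // Include tv_sec: tv_nsec alone wraps around every second
    long long total_ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                       + (end.tv_nsec - start.tv_nsec);
    return (long)(total_ns / iterations);
}
// Typical result: ~200-400ns per syscall (varies with CPU and enabled mitigations)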

Spectre Mitigations Impact

After Spectre/Meltdown, syscall overhead increased:

# Check for mitigations (grep prints each file name next to its status)
grep . /sys/devices/system/cpu/vulnerabilities/*

# Mitigations that affect syscall performance:
# - KPTI (Kernel Page Table Isolation, for Meltdown): ~100-400 cycles
# - Retpoline (Spectre v2): ~10-50 cycles on indirect calls
# - IBRS (Spectre v2): ~30-100 cycles

seccomp - Syscall Filtering

The Container Security Problem

Containers provide isolation, but they share the same kernel. This creates a security risk.

The threat: A compromised container could:

Use ptrace() to inspect other processes
Use mount() to escape the container
Use reboot() to crash the host
Use kexec_load() to replace the kernel
Use clock_settime() to break time-based security

The challenge: Containers need some syscalls to function, but not all of the 300+ that Linux provides.

The solution: seccomp-BPF filters syscalls before they execute. Even if an attacker gains code execution in a container, dangerous syscalls are blocked at the kernel level.

seccomp Modes

Mode	Description	Use Case
`SECCOMP_MODE_STRICT`	Allow only read, write, _exit, sigreturn	Maximum security
`SECCOMP_MODE_FILTER`	BPF program filters syscalls	Container sandboxing

How seccomp-BPF Works

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SECCOMP-BPF FILTERING                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Syscall Entry                                                              │
│        ↓                                                                     │
│   ┌─────────────────────────────────────────────────────────────────┐       │
│   │  seccomp BPF filter (runs before syscall)                       │       │
│   │                                                                  │       │
│   │  Input: struct seccomp_data {                                   │       │
│   │           int   nr;          // syscall number                  │       │
│   │           __u32 arch;        // architecture                    │       │
│   │           __u64 instruction_pointer;                            │       │
│   │           __u64 args[6];     // syscall arguments               │       │
│   │         };                                                       │       │
│   │                                                                  │       │
│   │  Output: action                                                  │       │
│   │    SECCOMP_RET_ALLOW  - Allow syscall                           │       │
│   │    SECCOMP_RET_KILL_PROCESS - Kill process                      │       │
│   │    SECCOMP_RET_ERRNO  - Return error                            │       │
│   │    SECCOMP_RET_TRACE  - Notify tracer (ptrace)                  │       │
│   │    SECCOMP_RET_LOG    - Allow but log                           │       │
│   └─────────────────────────────────────────────────────────────────┘       │
│        ↓                                                                     │
│   If allowed: proceed to syscall handler                                    │
│   If denied: return error or kill process                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
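
The flow in the diagram can be reproduced without libseccomp by writing the BPF program by hand and installing it with `prctl()`. The sketch below assumes x86-64 and a kernel with `SECCOMP_RET_KILL_PROCESS` (4.14 or later). The function names are hypothetical, and `getppid` is blocked only because it is harmless to block. The filter is installed in a forked child because a seccomp filter cannot be removed once loaded.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Installs a filter that makes getppid() fail with EPERM and allows
 * everything else. Returns 0 on success, -1 on failure. */
static int install_deny_getppid_filter(void)
{
    struct sock_filter filter[] = {
        /* Load seccomp_data.arch and kill on anything but x86-64,
         * because syscall numbers differ between ABIs. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Load seccomp_data.nr and compare with the blocked syscall. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Required for unprivileged processes before installing a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0 ? 0 : -1;
}

/* Runs the filter in a child. Returns 0 if getppid() was blocked with
 * EPERM, 1 if it was not, 2 if the filter could not be installed,
 * and -1 on fork/wait errors. */
static int getppid_blocked_in_child(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_deny_getppid_filter() != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The architecture check comes first for a reason: the same number means different syscalls under different ABIs, so a filter that only inspects `nr` can be bypassed by switching ABI.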

seccomp Example

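#include <errno.h>    // EPERM
#include <seccomp.h>  // libseccomp; link with -lseccomp

// Returns 0 on success, -1 on failure
int apply_seccomp_filter(void) {
    // Create filter context (default: allow all)
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (ctx == NULL)
        return -1;
    
    // Block specific syscalls: they fail with EPERM instead of running
    int rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(ptrace), 0);
    if (rc == 0)
        rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(kexec_load), 0);
    
    // Apply filter to the calling process
    if (rc == 0)
        rc = seccomp_load(ctx);
    seccomp_release(ctx);
    return rc == 0 ? 0 : -1;
}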

seccomp in Containers

Docker uses seccomp to restrict container syscalls:

# Check that seccomp is enabled and which profile Docker applies by default
docker info --format '{{ .SecurityOptions }}'

# Run container with custom seccomp profile
docker run --security-opt seccomp=profile.json myimage

# Run without seccomp (less secure)
docker run --security-opt seccomp=unconfined myimage

System Call Tracing

Essential skill for observability engineering:

strace - User-Space Tracer

# Basic syscall tracing
strace ls

# Trace specific syscalls
strace -e read,write cat /etc/passwd

# With timing information
strace -T ls

# Count syscalls
strace -c ls

# Trace child processes too
strace -f ./multi_threaded_app

# Output to file
strace -o trace.log ls

# Attach to running process
strace -p 1234

strace Output Analysis

$ strace -T cat /etc/passwd 2>&1 | head -20
execve("/usr/bin/cat", ["cat", "/etc/passwd"], ...) = 0 <0.000892>
brk(NULL)                               = 0x5555557c3000 <0.000012>
mmap(NULL, 8192, PROT_READ|PROT_WRITE, ...) = 0x7ffff7fc5000 <0.000023>
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT <0.000015>
openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3 <0.000021>
fstat(3, {st_mode=S_IFREG|0644, st_size=2773, ...}) = 0 <0.000011>
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 131072) = 2773 <0.000018>
write(1, "root:x:0:0:root:/root:/bin/bash\n"..., 2773) = 2773 <0.000042>
close(3)                                = 0 <0.000010>
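
The value in angle brackets at the end of each line is the time spent in that syscall, in seconds. When post-processing a trace in a program instead of with sort, the field can be pulled out with a few lines of C. This is a small sketch; `parse_strace_duration` is a hypothetical helper, and it assumes the `-T` format shown above.

```c
#include <stdlib.h>
#include <string.h>

/* Parses the trailing "<seconds>" field of an `strace -T` output line.
 * Returns 0 and stores the duration on success, -1 if the field is
 * absent or malformed. */
static int parse_strace_duration(const char *line, double *seconds)
{
    /* Use the last '<' so that '<' inside string arguments is ignored. */
    const char *open = strrchr(line, '<');
    char *end;
    double value;

    if (open == NULL)
        return -1;

    value = strtod(open + 1, &end);
    if (end == open + 1 || *end != '>')
        return -1;   /* no digits, or no closing '>' */

    *seconds = value;
    return 0;
}
```

Lines without a timing field, such as the final exit notice, make the helper return -1 so callers can skip them.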

ltrace - Library Call Tracer

# Trace library calls
ltrace ls

# Trace specific library
ltrace -l libc.so.6 ls

Kernel-Level Tracing

For production, use eBPF-based tracing (covered in Track 5):

# Using bpftrace
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { 
    printf("%s opened %s\n", comm, str(args->filename)); 
}'

# Using perf
sudo perf trace ls

Adding a Custom Syscall (Lab)

Understanding by implementation:

Step 1: Define the Syscall

// Add to include/linux/syscalls.h
asmlinkage long sys_hello(const char __user *name);

Step 2: Implement the Handler

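// Create kernel/hello.c (and add "obj-y += hello.o" to kernel/Makefile)
#include <linux/kernel.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>

SYSCALL_DEFINE1(hello, const char __user *, name)
{
    char buf[64];
    long len;
    
    // Copy a NUL-terminated string; stops at the terminator instead of
    // always reading 63 bytes from the user pointer
    len = strncpy_from_user(buf, name, sizeof(buf));
    if (len < 0)
        return len;               // -EFAULT on a bad pointer
    if (len == sizeof(buf))
        return -ENAMETOOLONG;     // no terminator within 64 bytes
    
    pr_info("Hello, %s! (from syscall)\n", buf);
    
    return 0;
}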

Step 3: Add to Syscall Table

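// arch/x86/entry/syscalls/syscall_64.tbl
# Add a line using the next unused number in your kernel tree.
# 451 is only an example; kernels 6.5 and later already assign it.
451    common    hello    sys_hello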

Step 4: Test from User Space

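#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SYS_hello 451  // must match the number added to syscall_64.tbl

int main(void) {
    if (syscall(SYS_hello, "World") == -1) {
        perror("hello syscall");
        return 1;
    }
    // The greeting goes to the kernel log: check with `dmesg | tail`
    return 0;
}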
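
Before booting the patched kernel, it helps to know what failure looks like. On a kernel that does not implement a given number, the call returns -1 with `errno` set to `ENOSYS`. The sketch below wraps that check; `try_syscall` is a hypothetical helper. Note that a seccomp profile may report a different error, such as `EPERM`, for numbers it does not recognize.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invokes syscall `nr` with a single string argument.
 * Returns 0 on success, otherwise the errno value reported for the call. */
static int try_syscall(long nr, const char *arg)
{
    errno = 0;
    if (syscall(nr, arg) == -1)
        return errno;
    return 0;
}
```

For example, `try_syscall(SYS_getpid, "ignored")` returns 0 because getpid ignores the extra argument, while an out-of-range number such as 100000 (chosen arbitrarily here) normally yields `ENOSYS`.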

Compatibility and ABI

32-bit Compatibility on 64-bit

// Different syscall tables for different ABIs
// arch/x86/entry/syscall_64.c - 64-bit
// arch/x86/entry/syscall_32.c - 32-bit compat

// Check ABI in seccomp
struct seccomp_data {
    __u32 arch;  // AUDIT_ARCH_X86_64 or AUDIT_ARCH_I386
    ...
};

Syscall Number Differences

# Same syscall, different numbers
# 64-bit: read = 0, write = 1
# 32-bit: read = 3, write = 4

# View 32-bit syscall numbers
cat /usr/include/asm/unistd_32.h
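
The practical consequence is that a bare number is ambiguous until the ABI is known. The sketch below hard-codes four well-known entries from the two x86 tables to make that concrete; the struct and function names are illustrative, and the table is deliberately tiny.

```c
#include <stddef.h>
#include <string.h>

/* A few syscalls with their numbers under each x86 ABI. */
struct abi_numbers {
    const char *name;
    int nr_x86_64;
    int nr_i386;
};

static const struct abi_numbers abi_table[] = {
    { "read",  0, 3 },
    { "write", 1, 4 },
    { "open",  2, 5 },
    { "close", 3, 6 },
};

#define ABI_TABLE_LEN (sizeof(abi_table) / sizeof(abi_table[0]))

/* Returns the number for `name` under the chosen ABI, or -1 if unknown. */
static int syscall_number(const char *name, int is_64bit)
{
    for (size_t i = 0; i < ABI_TABLE_LEN; i++)
        if (strcmp(abi_table[i].name, name) == 0)
            return is_64bit ? abi_table[i].nr_x86_64 : abi_table[i].nr_i386;
    return -1;
}

/* Returns the name for `nr` under the chosen ABI, or NULL if not listed. */
static const char *syscall_name(int nr, int is_64bit)
{
    for (size_t i = 0; i < ABI_TABLE_LEN; i++)
        if ((is_64bit ? abi_table[i].nr_x86_64 : abi_table[i].nr_i386) == nr)
            return abi_table[i].name;
    return NULL;
}
```

Number 3 resolves to close under the 64-bit table but to read under the 32-bit one, which is why seccomp filters check the `arch` field before trusting `nr`.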

Lab Exercises

Lab 1: Syscall Overhead Measurement

Objective: Measure and compare syscall overhead

// syscall_bench.c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ITERATIONS 10000000

double measure(void (*func)(void)) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) func();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) * 1e9 + 
           (end.tv_nsec - start.tv_nsec);
}

void test_getpid() { syscall(SYS_getpid); }
void test_gettid() { syscall(SYS_gettid); }
void test_getuid() { syscall(SYS_getuid); }

int main() {
    printf("getpid: %.1f ns\n", measure(test_getpid) / ITERATIONS);
    printf("gettid: %.1f ns\n", measure(test_gettid) / ITERATIONS);
    printf("getuid: %.1f ns\n", measure(test_getuid) / ITERATIONS);
    return 0;
}

gcc -O2 syscall_bench.c -o syscall_bench
./syscall_bench

Lab 2: strace Deep Analysis

Objective: Analyze real application syscall patterns

# 1. Compare static vs dynamic linking
strace -c /bin/ls -la
strace -c /bin/busybox ls -la

# 2. Analyze a web server
strace -c -f python3 -m http.server 8000 &
sleep 1    # give the server time to start listening
curl http://localhost:8000/
kill %1

# 3. Find slowest syscalls
strace -T -o trace.log dd if=/dev/zero of=/tmp/test bs=1M count=100
sort -t'<' -k2 -n trace.log | tail -20

Lab 3: seccomp Filter Implementation

Objective: Create a sandboxed execution environment

// sandbox.c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <seccomp.h>
#include <unistd.h>
#include <sys/wait.h>

void apply_sandbox() {
    // Default action: fail with EPERM. With SCMP_ACT_KILL the first
    // disallowed syscall would terminate the program instead of
    // returning an error that the code below can report.
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    
    // Allow only essential syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    
    seccomp_load(ctx);
    seccomp_release(ctx);
}

int main() {
    printf("Before sandbox\n");
    apply_sandbox();
    printf("After sandbox\n");
    
    // fopen() fails with EPERM: openat is not on the allow list
    FILE *f = fopen("/etc/passwd", "r");
    if (!f) printf("fopen blocked by seccomp!\n");
    
    return 0;
}

gcc sandbox.c -o sandbox -lseccomp
./sandbox

Interview Questions

Q1: Walk through what happens during a read() syscall

Answer:

User space:
- Application calls read(fd, buf, count)
- libc sets up registers: rax=0 (SYS_read), rdi=fd, rsi=buf, rdx=count
- Executes syscall instruction
Kernel entry:
- CPU switches to ring 0, loads kernel stack
- entry_SYSCALL_64 saves registers
- do_syscall_64 looks up __x64_sys_read in the syscall table
Syscall handler (ksys_read):
- Validates fd, gets struct file *
- Calls the file’s read operation through vfs_read (file->f_op->read or ->read_iter)
- For regular files: checks page cache, reads from disk if needed
- Copies data to user buffer via copy_to_user
Return:
- Returns bytes read (or error)
- syscall_exit_to_user_mode: check signals, scheduling
- sysretq: return to user mode

Q2: How does vDSO improve performance? What syscalls use it?

Answer:

How it works:

Kernel maps a special page into every process’s address space
Page contains code that reads kernel-maintained data
No mode switch needed — runs entirely in user space

Performance improvement:

Regular syscall: ~200-500 cycles (mode switch overhead)
vDSO call: tens of cycles (a function call plus a hardware counter read)

Functions provided:

gettimeofday(), clock_gettime() — most important
time(), getcpu()

Why limited:

Only works for read-only data
Kernel maintains shared data (timekeeping)
Can’t be used for anything requiring kernel intervention

Impact: Applications doing millions of time queries (monitoring, logging) would be ~10-50x slower without vDSO.

Q3: How does seccomp protect containers?

Answer:

Protection mechanism:

BPF program runs on every syscall entry
Blocks dangerous syscalls before they execute
Defense in depth — even if container escapes, syscalls limited

Commonly blocked syscalls:

mount, umount — prevent filesystem manipulation
reboot, kexec_load — prevent system disruption
ptrace — prevent debugging/injection
init_module, delete_module — prevent kernel modification
clock_settime — prevent time manipulation

Docker default profile: Blocks ~44 syscalls out of ~300+

Example attack prevention:

Container exploit tries ptrace to escape → blocked
Malware tries kexec_load → blocked
Process tries to load kernel module → blocked

Q4: What's the overhead of enabling seccomp?

Answer:

Overhead sources:

BPF filter runs on every syscall entry
Constant-time operations for simple filters
More complex filters = higher overhead

Typical overhead:

Simple whitelist: ~20-50 nanoseconds per syscall
Complex filters with argument checking: 100-200 ns

Why it’s acceptable:

Syscalls already cost 200-500ns minimum
20-50ns is <25% additional overhead
Security benefit outweighs cost

Optimization tips:

Put common allowed syscalls first in filter
Use SECCOMP_RET_ALLOW as default if mostly allowing
Profile with perf to measure actual impact

Key Takeaways

Syscall Mechanism

SYSCALL instruction, register convention, and kernel entry path are fundamental knowledge

vDSO Optimization

Critical for understanding why some “syscalls” have nearly zero overhead

seccomp Security

BPF-based syscall filtering is the foundation of container security

Tracing Skills

strace and understanding syscall patterns are essential for debugging
