> ## Documentation Index > Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt > Use this file to discover all available pages before exploring further. # Operating System Interfaces > Master OS internals through xv6 - system calls, processes, I/O, pipes, and file systems # Operating System Interfaces - Complete Study Guide Learn operating system fundamentals by understanding the **xv6 teaching operating system** - a simple, Unix-like OS that demonstrates core OS concepts in \~9,000 lines of readable code. *** ## Core Concept: What is an Operating System? An operating system has **three primary jobs**: Programs don't need to know specific hardware details (e.g., which disk type, GPU model) Multiple programs run simultaneously (or appear to) by multiplexing CPU, memory, and I/O Programs can communicate and share data safely through controlled mechanisms ### Key Design Tension > \[!IMPORTANT] > **The Interface Dilemma** > > * **Simple/narrow interface** = easier to implement correctly, but limited features > * **Complex interface** = more features but harder to maintain and secure > * **Solution**: Few powerful mechanisms that combine for generality *** ## The xv6 Operating System ### What is xv6? **xv6** is a teaching OS based on Unix design by Ken Thompson & Dennis Ritchie. It provides basic Unix interfaces and helps you understand modern operating systems like BSD, Linux, macOS, Solaris, and even Windows. **Key Features:** * Small enough to understand completely (\~9,000 lines) * Real enough to be useful (actual working OS) * Provides essential Unix interfaces * Runs on RISC-V architecture ### Architecture Overview ![User Space vs Kernel Space Architecture](C:/Users/Zeeshan/.gemini/antigravity/brain/d5e5ab3f-4854-4e6e-9158-da0f7cc9eac9/user_kernel_space_1765856429239.png) **Key Terms:** * **Process**: Running program with memory (instructions, data, stack) * **Kernel**: Special program providing services to processes * **System Call**: Process requests kernel service (e.g., `fork()`, `open()`, `read()`) * **User Space**: Normal programs run here (limited privileges) * **Kernel Space**: Kernel runs here (full hardware access) **Process alternates**: User space → System call → Kernel space → Back to user space *** ## System Calls Reference ### Process Management | System Call | Description | Return Value | | -------------------- | -------------------------------- | --------------------------------- | | `fork()` | Create new process | Child PID to parent, 0 to child | | `exit(status)` | Terminate process | No return (0=success, 1=failure) | | `wait(*status)` | Wait for child exit | Child PID or -1 | | `kill(pid)` | Terminate process | 0 or -1 | | `getpid()` | Get current process PID | PID | | `sleep(n)` | Pause for n clock ticks | 0 | | `exec(file, argv[])` | Replace process with new program | No return on success, -1 on error | | `sbrk(n)` | Grow memory by n bytes | Address of new memory | ### File Operations | System Call | Description | Return Value | | ------------------- | ----------------------- | --------------------------- | | `open(file, flags)` | Open file | File descriptor (fd) or -1 | | `read(fd, buf, n)` | Read n bytes | Bytes read or 0 at EOF | | `write(fd, buf, n)` | Write n bytes | Bytes written | | `close(fd)` | Release file descriptor | 0 or -1 | | `dup(fd)` | Duplicate fd | New fd to same file | | `pipe(p[])` | Create pipe | 0 (p\[0]=read, p\[1]=write) | ### File System | System Call | Description | Return Value | | --------------------------- | ------------------------ | ------------ | | `chdir(dir)` | Change current directory | 0 or -1 | | `mkdir(dir)` | Create directory | 0 or -1 | | `mknod(file, major, minor)` | Create device file | 0 or -1 | | `fstat(fd, *st)` | Get file info from fd | 0 or -1 | | `stat(file, *st)` | Get file info from path | 0 or -1 | | `link(file1, file2)` | Create new name for file | 0 or -1 | | `unlink(file)` | Remove file name | 0 or -1 | > \[!NOTE] > **Convention**: Unless stated otherwise, system calls return 0 for success and -1 for error. *** ## Processes and Memory ### Process Structure Each process has: * **User-space memory**: instructions + data + stack * **Kernel-private state**: CPU registers, PID, etc. * **PID**: Process identifier (unique number) ### fork() - Creating Processes The `fork()` system call creates an **exact copy** of the parent process. **Behavior:** 1. Creates exact copy of parent process memory 2. Both processes continue execution after `fork()` 3. Only difference: return value **Return values:** * **Parent gets**: child's PID (positive number) * **Child gets**: 0 * **Error**: -1 ![Fork and Exec Process Flow](C:/Users/Zeeshan/.gemini/antigravity/brain/d5e5ab3f-4854-4e6e-9158-da0f7cc9eac9/fork_exec_flow_1765856454912.png) **Example:** ```c theme={null} int pid = fork(); if(pid > 0) { // Parent branch: fork() returned the child's PID (a positive number). // The parent continues executing from this point with its own copy of memory. printf("parent: child=%d\n", pid); pid = wait((int *) 0); // Block until the child calls exit() printf("child %d is done\n", pid); } else if(pid == 0) { // Child branch: fork() returned 0, which is how the child knows it IS the child. // This code runs in a completely separate process with its own address space. printf("child: exiting\n"); exit(0); } else { // fork() returned -1: the kernel could not allocate a new process // (e.g., out of memory or process table full). printf("fork error\n"); } ``` **Output order unpredictable:** ``` parent: child=1234 child: exiting ``` OR ``` child: exiting parent: child=1234 ``` **Then:** ``` parent: child 1234 is done ``` > \[!IMPORTANT] > **Critical Points:** > > * Parent and child have **SEPARATE memory** > * Changing variable in one doesn't affect other > * Both start with identical memory contents > * Each has separate registers ### wait() - Synchronizing Processes **Purpose:** Parent waits for child to finish **Behavior:** * Returns PID of exited child * Copies child's exit status to provided address * If no children exited yet, **blocks** until one does * If no children exist, returns -1 immediately **Usage:** ```c theme={null} int status; int pid = wait(&status); // Get exit status // OR int pid = wait(0); // Don't care about status ``` ### exec() - Running Programs **Purpose:** Replace current process with new program > \[!WARNING] > **Critical characteristic**: Does NOT create new process, **replaces** current one! **Arguments:** 1. Path to executable file 2. Array of string arguments (NULL-terminated) **File format:** ELF (Executable and Linkable Format) **Behavior:** * Loads new program from file * Replaces process memory completely * **Preserves**: PID, file descriptors, open files * If successful: **NEVER returns** (new program runs) * If error: returns to caller **Example:** ```c theme={null} char *argv[3]; argv[0] = "echo"; // Program name (convention: argv[0] is the program name) argv[1] = "hello"; // First real argument passed to the program argv[2] = 0; // NULL terminator -- exec() needs this to know where the array ends exec("/bin/echo", argv); // Replaces this process's memory with /bin/echo's code printf("exec error\n"); // This line only executes if exec() FAILED (returned -1). // On success, the process is now running /bin/echo -- // the old code, stack, and heap are gone. ``` > \[!NOTE] > **argv\[0] convention**: First argument is program name (mostly ignored by program) ### How Shell Uses These Calls **Shell main loop:** ```c theme={null} // Simplified shell structure while(getcmd(buf, sizeof(buf)) >= 0) { if(fork() == 0) { // Child: run command runcmd(parsecmd(buf)); } // Parent: wait for child wait(0); } ``` **For command `echo hello`:** 1. Shell forks 2. Parent waits 3. Child calls `exec("/bin/echo", ["echo", "hello", 0])` 4. echo runs and calls `exit()` 5. Parent's `wait()` returns 6. Shell ready for next command **Why fork + exec are separate:** * Allows I/O redirection between `fork()` and `exec()`. The child can close, open, and dup file descriptors in the gap between fork and exec, customizing its I/O environment without affecting the parent. * Shell can modify child's file descriptors before `exec()` * Parent's I/O remains unchanged This two-step design is one of the most influential decisions in Unix's history. It means the shell -- not the program -- controls I/O routing. A program like `cat` does not need any code for handling redirection, pipes, or output capture. All of that is the shell's job, using fork+exec with fd manipulation in between. **Alternative (worse) designs:** * Combined `forkexec()` call -- awkward for I/O redirection because you would need to pass the full fd configuration as parameters * Shell modifies own I/O, then undoes -- error-prone and not thread-safe * Every program handles own redirection -- duplicated work across every utility ### Memory Management **Implicit allocation:** * `fork()` - Allocates memory for child's copy * `exec()` - Allocates memory for new executable **Explicit allocation:** * `sbrk(n)` - Grow data memory by n bytes * Returns location of new memory * Used by `malloc()` implementation **Optimization:** * `fork()` uses **copy-on-write (COW)** * Doesn't actually copy memory until modified * Avoids waste when `exec()` immediately follows `fork()` *** ## I/O and File Descriptors ### File Descriptor Concept **Definition:** Small integer representing kernel-managed I/O object **Can refer to:** * Regular files * Directories * Devices (keyboard, screen, etc.) * Pipes **Key abstraction:** Everything looks like **stream of bytes** **How to obtain:** * Open file/directory/device * Create pipe * Duplicate existing descriptor ### File Descriptor Table ![File Descriptor Table](C:/Users/Zeeshan/.gemini/antigravity/brain/d5e5ab3f-4854-4e6e-9158-da0f7cc9eac9/file_descriptor_table_1765856474592.png) **Per-process table:** * Each process has private fd space * FDs start at 0 * Kernel uses fd as index into table **Standard convention:** * **FD 0** = standard input (stdin) * **FD 1** = standard output (stdout) * **FD 2** = standard error (stderr) **Shell ensures:** Always has FDs 0, 1, 2 open (default to console) ### read() and write() **read(fd, buf, n):** * Reads **UP TO** n bytes from fd * Copies into buf * Returns number of bytes actually read * Returns 0 at end of file * Each fd has file offset that advances automatically **write(fd, buf, n):** * Writes n bytes from buf to fd * Returns number of bytes written * Less than n only on error * Offset advances automatically **Sequential access:** ```c theme={null} // First read gets bytes 0-99 read(fd, buf, 100); // Next read gets bytes 100-199 read(fd, buf, 100); ``` ### cat Example - The Power of Abstraction ```c theme={null} char buf[512]; int n; for(;;) { n = read(0, buf, sizeof buf); // Read from stdin if(n == 0) break; // EOF if(n < 0) { fprintf(2, "read error\n"); // Error to stderr exit(1); } if(write(1, buf, n) != n) { // Write to stdout fprintf(2, "write error\n"); exit(1); } } ``` **Why this is powerful:** * `cat` does not know if it is reading from a file, the console, or a pipe * `cat` does not know if it is writing to a file, the console, or a pipe * Same code works for all cases -- this is the Unix "everything is a file" philosophy in action * Shell controls actual I/O sources/destinations via fd manipulation This is the reason the same `cat` binary works in all of these contexts without modification: ```bash theme={null} cat file.txt # read from file, write to console cat < input.txt # read from redirected stdin echo hello | cat # read from pipe cat > output.txt # write to file instead of console cat < in.txt | sort > out.txt # pipe chain with redirection ``` ### File Descriptor Allocation **close(fd):** * Releases file descriptor * Makes fd available for reuse **New fd allocation:** * Always uses **LOWEST unused number** * This is critical for I/O redirection ### I/O Redirection Mechanism **Example: `cat < input.txt`** ```c theme={null} char *argv[2]; argv[0] = "cat"; argv[1] = 0; if(fork() == 0) { close(0); // Free fd 0 open("input.txt", O_RDONLY); // Gets fd 0 (lowest available) exec("cat", argv); // cat reads from fd 0 } ``` **Why this works:** 1. Child closes stdin (fd 0) 2. `open()` gets fd 0 (lowest available) 3. `exec()` preserves file descriptor table 4. `cat` reads from fd 0, which now refers to `input.txt` **Parent unaffected:** Only child's fds are modified ### open() Flags **Defined in fcntl.h:** * `O_RDONLY` - Read only * `O_WRONLY` - Write only * `O_RDWR` - Read and write * `O_CREATE` - Create if doesn't exist * `O_TRUNC` - Truncate to zero length **Combining flags:** Use bitwise OR ```c theme={null} fd = open("file", O_CREATE|O_WRONLY); ``` ### fork() and File Descriptors **Behavior:** `fork()` copies file descriptor table **Shared file offset example:** ```c theme={null} if(fork() == 0) { write(1, "hello ", 6); exit(0); } else { wait(0); write(1, "world\n", 6); } ``` **Result:** File contains `"hello world"` **Why:** Parent and child share same underlying file offset * Child writes `"hello "` at position 0 * Offset advances to 6 * Parent writes `"world\n"` at position 6 **Use case:** Sequential output from shell commands ```bash theme={null} (echo hello; echo world) > output.txt ``` ### dup() - Duplicating Descriptors **Purpose:** Create new fd referring to same file **Behavior:** * Returns new fd (lowest available) * Both fds share offset (like `fork()`) **Example:** ```c theme={null} fd = dup(1); write(1, "hello ", 6); write(fd, "world\n", 6); ``` **Result:** `"hello world"` (sequential writes) **When offsets are shared:** * Derived from same original fd by `fork()`/`dup()` **When offsets are NOT shared:** * Separate `open()` calls, even for same file ### Error Redirection **Command:** `ls existing non-existing > tmp1 2>&1` **Meaning:** * `> tmp1` - Redirect stdout (fd 1) to tmp1 * `2>&1` - Redirect stderr (fd 2) to duplicate of fd 1 **Implementation:** ```c theme={null} close(1); open("tmp1", O_CREATE|O_WRONLY); // Gets fd 1 close(2); dup(1); // Gets fd 2, refers to tmp1 ``` **Result:** Both stdout and stderr go to tmp1 *** ## Pipes ### Pipe Concept **Definition:** Kernel buffer with two file descriptors * One fd for reading * One fd for writing **Purpose:** Inter-process communication ![Pipe Communication](C:/Users/Zeeshan/.gemini/antigravity/brain/d5e5ab3f-4854-4e6e-9158-da0f7cc9eac9/pipe_communication_1765856492853.png) **Creation:** ```c theme={null} int p[2]; pipe(p); // p[0] = read end // p[1] = write end ``` ### Pipe Example - Running wc ```c theme={null} int p[2]; char *argv[2]; argv[0] = "wc"; argv[1] = 0; pipe(p); // Create the pipe: p[0]=read end, p[1]=write end if(fork() == 0) { // Child -- will become `wc`, reading from the pipe close(0); // Free fd 0 (stdin) so the next open/dup gets it dup(p[0]); // Duplicate pipe read end -> gets fd 0 (stdin) close(p[0]); // Close the original pipe read fd (now redundant) close(p[1]); // CRITICAL: close write end so wc will see EOF exec("/bin/wc", argv); // wc reads from fd 0, which is now the pipe } else { // Parent -- writes data into the pipe close(p[0]); // Close read end (parent won't read from pipe) write(p[1], "hello world\n", 12); // Send data to child close(p[1]); // Close write end -> triggers EOF on read end } ``` **Flow:** 1. Parent creates pipe 2. Fork creates child with copy of pipe fds 3. Child redirects stdin to pipe read end 4. Child closes both pipe fds (already has copy at fd 0) 5. Parent closes read end (won't read) 6. Parent writes data to pipe 7. Parent closes write end (signals EOF) 8. Child reads from stdin (pipe), processes, exits > \[!IMPORTANT] > **Critical: Why close pipe fds?** > > * Child must close write end before `exec()` > * Otherwise `wc` would never see EOF (it has write end open) > * If ANY process has write end open, read won't return EOF ### Pipe Blocking Behavior **When reading from pipe:** * If data available: returns data * If no data AND write end open: **blocks** (waits for data) * If no data AND all write ends closed: returns 0 (EOF) **This is why closing is important!** ### Shell Pipeline Implementation **Command:** `grep fork sh.c | wc -l` **Process tree:** ``` shell | v pipe process / \ / \ grep wc ``` **Steps:** 1. Create pipe 2. Fork for grep * Redirect stdout to pipe write end * exec grep 3. Fork for wc * Redirect stdin to pipe read end * exec wc 4. Wait for both children **Complex pipeline:** `a | b | c` ``` shell | process1 / \ a process2 / \ b c ``` **Tree structure:** * Leaves = commands * Interior nodes = processes managing pipes ### Pipes vs Temporary Files **Command comparison:** ```bash theme={null} # With pipe echo hello world | wc # With temp file echo hello world > /tmp/xyz; wc < /tmp/xyz ``` **Pipe advantages:** 1. **Auto cleanup** -- No temp files to delete; the pipe buffer is reclaimed when both ends close 2. **Streaming** -- The writer does not need to finish before the reader starts. Data flows through the pipe in a FIFO stream, bounded by a kernel buffer (typically 64 KB on Linux). If the buffer is full, the writer blocks until the reader drains some data. 3. **Parallel execution** -- Both programs run simultaneously, overlapping CPU and I/O work. `grep` can start processing output from `find` while `find` is still traversing directories. 4. **Backpressure** -- If the consumer is slow, the pipe buffer fills up and the producer naturally pauses. This prevents unbounded memory growth without any explicit coordination. **Temp file disadvantages:** 1. Must manually clean up (and handle cleanup on errors/crashes) 2. Need disk space for all data -- a pipeline processing 10 TB would require 10 TB of temp storage 3. Sequential execution (first program must finish before second starts) 4. Security risk: temp files can be read by other processes unless permissions are carefully managed *** ## File System ### Structure **Hierarchical organization:** * Tree of directories * Root directory: `/` * Data files: uninterpreted byte arrays * Directories: named references to files/directories **Path examples:** * `/a/b/c` - Absolute path from root * `a/b/c` - Relative to current directory ### Directory Navigation **Both open same file:** ```c theme={null} // Method 1 chdir("/a"); chdir("b"); open("c", O_RDONLY); // Method 2 open("/a/b/c", O_RDONLY); ``` **First method:** * Changes process current directory to `/a`, then `/a/b` * Opens `c` relative to `/a/b` **Second method:** * No directory change * Direct absolute path ### Creating Files and Directories ```c theme={null} mkdir("/dir"); // Create directory fd = open("/dir/file", O_CREATE|O_WRONLY); // Create file close(fd); mknod("/console", 1, 1); // Create device file ``` **mknod creates device file:** * First number = major device number * Second number = minor device number * Together uniquely identify kernel device * `read()`/`write()` calls diverted to device driver ### Inodes and Links ![Inode and Link Relationships](C:/Users/Zeeshan/.gemini/antigravity/brain/d5e5ab3f-4854-4e6e-9158-da0f7cc9eac9/inode_links_1765856511362.png) **Inode = actual file data structure** **Contains:** * File type (file, directory, device) * File length * Location of content on disk * Number of links to file **Link = name for inode** > \[!IMPORTANT] > **Key insight**: File name ≠ file itself > > * One inode can have multiple names (links) > * Each link is directory entry: name + inode reference **Example:** ```c theme={null} open("a", O_CREATE|O_WRONLY); link("a", "b"); ``` **Result:** * One inode * Two names: `"a"` and `"b"` * Both refer to same content * Reading/writing `"a"` = reading/writing `"b"` ### stat Structure ```c theme={null} struct stat { int dev; // File system's disk device uint ino; // Inode number (unique) short type; // T_DIR, T_FILE, or T_DEVICE short nlink; // Number of links to file uint64 size; // Size in bytes }; ``` **Using fstat:** ```c theme={null} int fd = open("a", O_RDONLY); struct stat st; fstat(fd, &st); printf("Inode: %d, Links: %d\n", st.ino, st.nlink); ``` **After linking:** * Both `"a"` and `"b"` have same `ino` * `nlink` = 2 ### unlink() - Removing Names **Behavior:** * Removes name from file system * Decrements link count * Inode and disk space freed ONLY when: * Link count = 0 AND * No file descriptors refer to it **Example sequence:** ```c theme={null} open("a", O_CREATE|O_WRONLY); link("a", "b"); unlink("a"); ``` **Result:** * Name `"a"` removed * File still accessible as `"b"` * Inode unchanged **Temporary file idiom:** ```c theme={null} fd = open("/tmp/xyz", O_CREATE|O_RDWR); unlink("/tmp/xyz"); // File has no name now // Will be deleted when process closes fd or exits ``` ### Built-in vs External Commands **External (user-level programs):** * `mkdir` * `ln` * `rm` * `ls` * `cat` * Most commands **Why external:** * Anyone can add new commands * No need to modify kernel/shell * Easier to extend system **Built-in (part of shell):** * `cd` **Why cd is built-in:** ```c theme={null} // If cd was external: if(fork() == 0) { // Child changes its own directory chdir("/new/dir"); exit(0); } // Parent (shell) directory unchanged! ``` > \[!WARNING] > **Problem**: `cd` must change SHELL's directory, not child's *** ## System Call Flow Understanding how system calls work is crucial for OS internals. ![System Call Flow](C:/Users/Zeeshan/.gemini/antigravity/brain/d5e5ab3f-4854-4e6e-9158-da0f7cc9eac9/system_call_flow_1765856532301.png) ### Tracing fork() from User to Kernel **Step 1: User Program Calls fork()** ```c theme={null} // user/sh.c int pid = fork(); ``` **Step 2: User Library Wrapper (user/usys.S)** ```assembly theme={null} .global fork fork: li a7, SYS_fork # a7 = system call number (1) ecall # Trap into kernel! ret # Return with result in a0 ``` **Step 3: Trap Entry (kernel/trampoline.S)** ```assembly theme={null} uservec: # Save user registers csrw sscratch, a0 sd ra, 40(a0) sd sp, 48(a0) # ... save all 32 registers ... # Load kernel page table ld t1, 0(a0) csrw satp, t1 sfence.vma zero, zero # Jump to usertrap() in C jr t0 ``` **Step 4: Trap Handler (kernel/trap.c)** ```c theme={null} void usertrap(void) { struct proc *p = myproc(); p->trapframe->epc = r_sepc(); if(r_scause() == 8){ // System call p->trapframe->epc += 4; // Advance PC past ecall intr_on(); syscall(); // Handle the system call! } usertrapret(); } ``` **Step 5: System Call Dispatcher (kernel/syscall.c)** ```c theme={null} void syscall(void) { int num; struct proc *p = myproc(); num = p->trapframe->a7; // Get syscall number if(num > 0 && num < NELEM(syscalls) && syscalls[num]) { p->trapframe->a0 = syscalls[num](); // Call handler } else { printf("%d %s: unknown sys call %d\n", p->pid, p->name, num); p->trapframe->a0 = -1; } } ``` **Step 6: Fork Implementation (kernel/proc.c)** ```c theme={null} int fork(void) { struct proc *np; struct proc *p = myproc(); // Allocate process if((np = allocproc()) == 0) return -1; // Copy user memory from parent to child if(uvmcopy(p->pagetable, np->pagetable, p->sz) < 0){ freeproc(np); return -1; } np->sz = p->sz; // Copy saved user registers *(np->trapframe) = *(p->trapframe); np->trapframe->a0 = 0; // Child returns 0 // Copy open file descriptors for(int i = 0; i < NOFILE; i++) if(p->ofile[i]) np->ofile[i] = filedup(p->ofile[i]); np->state = RUNNABLE; return np->pid; // Parent returns child's PID } ``` *** ## Real World Context ### Unix Philosophy **"Software tools" culture:** * Small programs doing one thing well * Combine via pipes * Shell as "scripting language" **Key innovations:** * Standard file descriptors (0, 1, 2) * Pipes for composition * Simple but powerful shell syntax ### POSIX Standard **Purpose:** Standardize Unix system call interface **xv6 vs POSIX:** * xv6 NOT POSIX compliant * Missing: `lseek()`, many other calls * Different implementations of existing calls * Goal: Simplicity for teaching **Modern systems:** * Many more system calls * Networking, windowing, threads * Many device drivers * Continuous evolution ### Alternative Designs **Plan 9:** * Extended "everything is a file" concept * Applied to networks, graphics * Most Unix systems didn't follow **Multics:** * Predecessor of Unix * Files looked like memory * Very different interface * Complexity influenced Unix designers to simplify **Other OS models:** * Different abstractions possible * File descriptors not only solution * Unix model proven very successful ### xv6 Limitations **No user protection:** * All processes run as root * No isolation between users * Teaching-focused tradeoff **Limited features:** * Minimal system calls * No networking * No windowing * Basic functionality only *** ## Key Takeaways ### Design Principles Few powerful mechanisms that combine for generality Hide hardware details behind clean interfaces Processes isolated from each other for safety Explicit mechanisms for safe interaction ### Core Abstractions 1. **Processes** - Unit of execution with isolated memory 2. **File descriptors** - Unified I/O interface for files, pipes, devices 3. **Pipes** - Communication channels between processes 4. **File system** - Persistent storage hierarchy with inodes and links ### Why These Abstractions Work **File descriptors:** * Hide differences (files, pipes, devices) * Enable I/O redirection * Simple but powerful **fork/exec separation:** * Enables I/O redirection * Shell controls child environment * Parent unaffected **Pipes:** * Clean inter-process communication * Better than temp files * Enable parallel execution **Inodes/links:** * Separate names from content * Multiple names for same file * Automatic cleanup when unused *** ## Getting Started with xv6 ### Installation (Ubuntu/Debian) ```bash theme={null} # Install dependencies sudo apt-get install git build-essential gdb-multiarch \ qemu-system-misc gcc-riscv64-linux-gnu binutils-riscv64-linux-gnu # Clone xv6 git clone git://github.com/mit-pdos/xv6-riscv.git cd xv6-riscv # Build and run make qemu # You'll see xv6 boot and get a shell prompt: $ ``` ### First Commands to Try ```bash theme={null} $ ls # List files $ cat README # Show file contents $ echo hello xv6 # Print text $ grep the README # Search in file $ wc README # Word count $ mkdir test # Create directory $ sh # Start new shell ``` *** ## Caveats and Common Pitfalls xv6 is one of the best ways to learn how an operating system actually works. It is also one of the easiest ways to develop misleading mental models of how *real* operating systems work. The simplifications that make xv6 readable in 9,000 lines also make it dangerously different from production code. Treat xv6 as a teaching diagram, not a blueprint. **Traps when applying xv6 lessons to real systems:** 1. **xv6 is teaching code, not production code.** It deliberately omits everything that makes real OS code complex: SMP-aware schedulers (xv6's scheduler is essentially per-CPU round robin, no load balancing), modern filesystems (xv6's FS is a simple log-structured layout, no journaling at the level of ext4 or btrfs, no extents, no compression), interrupt coalescing, NUMA awareness, security hardening, or any of the kernel hardening that consumes significant fractions of Linux. If you are studying xv6 and a Linux kernel bug at the same time, do not assume the xv6 mechanism applies. 2. **xv6 is different from real Linux in critical ways -- do not transfer assumptions blindly.** The system call table, virtual memory layout, scheduler, file system, and interrupt handling are all simplified. xv6 has a single page table per process; Linux has KPTI page tables (split user/kernel) for Meltdown mitigation. xv6 fork is a literal page-by-page copy; Linux fork is COW. xv6 has no preemptive in-kernel preemption; Linux has CONFIG\_PREEMPT options that radically change scheduling behavior. Reading xv6 source and assuming "this is how Linux does it" is a recipe for confidently wrong answers in interviews. 3. **RISC-V vs x86 versions of xv6 diverge -- pick the one your course aligns with.** The original xv6 was x86; the modern MIT version is RISC-V. The two have different boot flows, different trap mechanisms, different privilege models (RISC-V has M/S/U modes, x86 has rings), and different conventions. If your course or job interview is on RISC-V (current MIT 6.S081/6.1810), study the RISC-V version. If it is on x86 (older xv6, many third-party forks), study that one. Mixing them creates confusion that hurts you in interviews. 4. **xv6's locking is subtler than it looks.** xv6 uses spinlocks and disables interrupts inside critical sections, which is correct but creates surprising properties: you cannot sleep with a lock held (would deadlock), and lock acquisition order matters (xv6 has a documented lock ordering). Real systems have RCU, lockdep, sleeping locks, and many more primitives. Do not assume "if xv6 does it this way, it is the right way" -- it is the simple way that fits in a teaching kernel. **How to use xv6 effectively:** * **Use xv6 to build mechanism intuition, then read Linux source for production patterns.** xv6 teaches you what a page table looks like, how a context switch works, and what fork actually does at the lowest level. Once that is internalized, read `arch/x86/mm/fault.c` or `kernel/sched/fair.c` in Linux to see what production code looks like. The contrast is the lesson. * **Match your xv6 version to your target course or interview.** MIT 6.S081 is RISC-V; many older textbooks use x86. Decide which platform matters for your goals and stick with it. If both, learn one fully before touching the other. * **Treat xv6 modifications as the real exercise.** Reading xv6 teaches some; modifying it (adding COW fork, lazy allocation, mmap, multithreading inside a process) teaches everything. The MIT 6.S081 lab assignments are public and excellent. Do them; do not just read solutions. * **When asked about Linux in an interview, do not reflexively answer with xv6 details.** "In xv6, fork copies every page" is true and irrelevant if the interviewer asked about Linux. State "in real Linux this uses copy-on-write; in the simpler xv6 teaching kernel, it is a literal copy" to show you know the difference. * **Read the xv6 book alongside the source.** The xv6 book is one of the best OS texts ever written -- \~150 pages of clear prose explaining every line. Reading source without the book is much harder than reading both together. *** ## Interview Deep-Dive **Strong Answer Framework:** 1. **xv6's scheduler is per-CPU round robin.** Each CPU independently runs `scheduler()` in `proc.c`. It walks the process table looking for a RUNNABLE process and runs it. There is no priority, no fairness accounting, no virtual runtime, no load balancing across CPUs, no I/O priority boost, no sleep-time accounting. A long-running CPU-bound process gets exactly the same share as an interactive shell. 2. **Linux CFS tracks vruntime (virtual runtime).** Each task has a vruntime that increments based on actual CPU time consumed, weighted by nice value. The scheduler picks the task with the lowest vruntime, stored in a red-black tree for O(log n) selection. Tasks that sleep accumulate vruntime slowly while sleeping (capped to prevent infinite catch-up), so they get a "wakeup boost" that makes them latency-friendly. 3. **What is missing in xv6.** Fairness across nice values, load balancing across CPUs (xv6 actively wastes CPU time when one CPU is idle and another has multiple runnable tasks), CPU affinity hints, group scheduling (cgroups), real-time scheduling classes (SCHED\_FIFO / SCHED\_RR / SCHED\_DEADLINE), and any I/O latency awareness. The scheduler is also not preemptive in the kernel -- once a task is running in kernel mode in xv6, it runs until it voluntarily yields. 4. **What Linux gains.** Linux CFS handles workloads from a 4-thread embedded device to a 256-CPU server with thousands of tasks. The complexity buys: predictable interactive latency under load, fairness guarantees that hold over time (not just instantaneously), the ability to express priority via cgroups for containers, and the ability to mix latency-sensitive and batch workloads on the same machine. 5. **What Linux trades.** Complexity (CFS is several thousand lines vs xv6's hundred), debugging difficulty (vruntime quirks like the "wakeup buddy" heuristic took years to tune), and edge cases (CFS bandwidth control's period-boundary throttle problem in containers). Simpler is sometimes a feature -- which is why ULE (FreeBSD), CFS, and EEVDF have all evolved over decades. **Real-World Example:** In 2023, Linux 6.6 replaced CFS with EEVDF (Earliest Eligible Virtual Deadline First, by Peter Zijlstra). EEVDF was actually proposed in academic literature in 1995 -- it took 28 years for the kernel to adopt it because CFS was "good enough" for most workloads, and the community was conservative about scheduler changes. The trigger was that CFS had accumulated too many heuristics for tail-latency tuning, and EEVDF's deadline model unified them more cleanly. The lesson: scheduler design is hard, and even good ideas take decades to ship. **Senior follow-up 1:** "What is the 'thundering herd' problem in scheduling, and does xv6 have it?" When many tasks block on one event and all wake at once, they all become RUNNABLE simultaneously. xv6 has it (sleep/wakeup wakes all waiters); Linux mitigates it via wait queue exclusivity flags and EPOLLEXCLUSIVE. **Senior follow-up 2:** "Why does Linux have separate scheduling classes (CFS, RT, deadline, idle) instead of one unified scheduler?" Different workloads have fundamentally incompatible requirements. Real-time tasks need bounded worst-case latency; CFS provides fairness; idle tasks should never run when anything else is runnable. A unified scheduler tuned for one class compromises the others. The class hierarchy lets each class implement what it needs. **Senior follow-up 3:** "What is preemption disabling in the kernel and why does it matter?" When kernel code holds a spinlock or runs in interrupt context, it cannot be preempted -- the scheduler will not switch tasks. Long-held preemption-disabled sections cause latency spikes. CONFIG\_PREEMPT\_RT (recently merged) replaces most spinlocks with sleeping mutexes to allow preemption almost everywhere, dramatically reducing worst-case latency at some throughput cost. **Common Wrong Answers:** 1. *"xv6's scheduler is the same as Linux, just smaller."* Wrong -- the algorithm is fundamentally different (round-robin vs vruntime-based). Saying "same but smaller" misses the point that fairness is an algorithmic property, not a code-size property. 2. *"Linux CFS is fair to the millisecond."* Misleading -- CFS targets long-run fairness, not instantaneous fairness. Over a few milliseconds, one task can dominate due to wakeup heuristics; over seconds, weights are honored. **Further Reading:** * "Operating Systems: Three Easy Pieces", chapter on scheduling -- best free scheduling textbook. * Documentation/scheduler/sched-design-CFS.rst in the Linux kernel source. * LWN article "An EEVDF CPU scheduler for Linux" (2023) -- explains the CFS-to-EEVDF transition. **Strong Answer Framework:** 1. **The disk layout.** xv6 divides the disk into named regions: boot block (block 0), super block (block 1, contains FS metadata), log blocks (for crash recovery), inode blocks (each block holds several inodes), bitmap blocks (track which data blocks are free), and data blocks (the actual file contents). The super block tells you where each region starts. 2. **Inodes.** Every file or directory is an inode -- a fixed-size structure containing type (file/dir/device), size, and a list of block pointers. xv6 has 12 direct pointers and 1 indirect pointer (which points to a block of more pointers), giving max file size \~70KB. Real Linux ext4 uses extents and triple-indirect blocks for files in the terabyte range, but the inode concept is identical. 3. **Directory entries.** A directory in xv6 is just a file whose contents are an array of `struct dirent { ushort inum; char name[14]; }`. Looking up a name means scanning the directory's data blocks for a matching name and returning the inode number. 4. **Resolving /a/b.txt.** Start at inode 1 (root directory, hardcoded). Read its data, scan for "a", find inum N. Read inode N (must be a directory), read its data, scan for "b.txt", find inum M. Read inode M -- this is the file. Each inode read may require a disk I/O (cached in the inode cache after first read). 5. **The block cache and log.** Every disk block goes through a buffer cache (`buf.c`) for performance. Writes go through a write-ahead log: changes are written to the log first, then applied to actual locations after commit. On crash, replay the log to recover. This gives crash consistency similar to ext4 with journaling. 6. **The path lookup function.** In xv6, `namei()` and `nameiparent()` walk paths. Their pseudocode: split path on `/`, start at root (or cwd), repeat: get current inode, lock it, look up next component, follow to next inode. Locking matters because two threads renaming files concurrently could deadlock without ordered acquisition. **Real-World Example:** In 2018, a CVE in ext4 (CVE-2018-1093) was found in path traversal: a crafted directory entry with specific bytes could cause the kernel to read past the end of a directory block. The fix was a few lines of bounds checking in `ext4_dx_find_entry()`. The lesson: path resolution is one of the most performance-critical and security-critical paths in any kernel, and the simple xv6 implementation glosses over many edge cases (Unicode normalization, race conditions, symlinks, mount points). The MIT 6.S081 fs lab walks through exactly these issues at a teaching scale. **Senior follow-up 1:** "What happens if the system crashes in the middle of writing a directory entry?" Without journaling, you could see a half-written entry -- corrupt filesystem. xv6's log handles this by writing all changes to the log first, then replaying on boot. ext4 does the same with its journal. ZFS/btrfs use copy-on-write so old data is never overwritten until new data is fully durable. **Senior follow-up 2:** "How does xv6 handle the inode cache and reference counting?" Each in-memory inode has a reference count. `iget` increments it; `iput` decrements and may free if zero. Locking is separate from reference counting -- you can hold a reference without holding the lock. Real Linux dcache/icache works similarly but is much more complex due to RCU. **Senior follow-up 3:** "What is missing from xv6's filesystem compared to a real production FS?" No journaling at the metadata level (xv6's log is per-syscall, not full FS journaling), no extents (extents pack large contiguous ranges into one descriptor), no soft updates or delayed allocation, no copy-on-write snapshots, no compression, no encryption, no extended attributes, no quotas. Each of these features represents a decade of FS research. **Common Wrong Answers:** 1. *"xv6 uses ext2-style inodes."* Sort of -- ext2 inspired the design, but xv6's inode is much simpler (12 direct + 1 indirect, no triple indirect, no extended attributes). Saying "ext2-style" without acknowledging the simplifications shows shallow understanding. 2. *"Path resolution is just a series of lookups."* Strictly true but misses the locking, caching, and crash-recovery layers that make real path resolution work. A senior engineer should mention all three. **Further Reading:** * The xv6 book, chapter 8 on the file system -- one of the best filesystem tutorials in print. * "Operating Systems: Three Easy Pieces", chapters on file systems and FFS/LFS. * ext4 Wikipedia article and `Documentation/filesystems/ext4.rst` in Linux source for comparison. **Strong Answer Framework:** 1. **What COW fork does.** Instead of copying every page on fork (xv6 default), mark all pages read-only in both parent and child page tables. When either writes, take a page fault, copy that page, mark it writable, restore the writer. Most pages are never written before exec(), so most pages are never copied -- huge speedup for fork+exec. 2. **Files to modify.** `kernel/vm.c` (page table walk and protection bits), `kernel/proc.c` (the `fork()` syscall implementation), `kernel/trap.c` (the page fault handler -- need to detect write-to-RO and handle it), and `kernel/kalloc.c` (need a reference count per physical page so we know when to free). 3. **The reference counting problem.** When parent and child share a page, we cannot free it until both are done. Add an array indexed by physical frame number, with a reference count. Increment on share, decrement on unshare. Free when count hits zero. The reference count must be lock-protected (xv6 is multi-CPU). 4. **The page-fault handler.** On a write fault to a RO page that is shared (refcount > 1), allocate a new page, copy contents, decrement old page's refcount, install new page in the faulting process's page table as writable. If refcount is exactly 1 (we are the only user), just flip RO to RW -- no copy needed. This is the optimization that makes COW efficient. 5. **Gotchas.** (a) The kernel itself sometimes touches user pages (`copyin`, `copyout`) -- these need COW handling too, or they will silently write to a shared page. (b) `fork()` of a fork()ed child means a page might have refcount 3 -- the chain of COW must work correctly. (c) Stacks grow via faults -- the COW path must not confuse stack growth faults with COW faults. (d) Race condition: two threads in the parent could both write to the same shared page simultaneously and both take a fault -- locking is essential. **Real-World Example:** Linux's COW implementation has had several major CVEs -- most famously **Dirty COW (CVE-2016-5195)**, discovered by Phil Oester in 2016. The bug was a race between madvise(MADV\_DONTNEED) and the COW page-fault handler that allowed a non-privileged user to write to read-only memory mappings. Even root files like /etc/passwd became writable. The fix was a synchronization change in `mm/gup.c`. Every modern kernel has been fixed, but the lesson is that COW logic is subtle even after 30 years of refinement. If you implement COW in xv6, expect your first version to have at least one race. **Senior follow-up 1:** "How does the MIT 6.S081 'cowgc' (COW garbage collection) lab differ from a real Linux implementation?" The lab uses a per-physical-page refcount array, similar to Linux. The real Linux uses `struct page` (one per physical page) with a refcount field, plus a separate mapcount for "mapped into how many user page tables." The two-counter approach handles file-mapped pages (where many processes map the same page-cache page) correctly. **Senior follow-up 2:** "Can COW pages cause OOM in unexpected ways?" Yes -- if many children fork from a parent that allocates a large array, then each child writes to different parts of it, every write triggers a copy. The system can run out of memory even though logically nothing was allocated. The fix is `madvise(MADV_DONTFORK)` to mark large mappings as not inherited, or `vfork()` for the case where the child immediately execs. **Senior follow-up 3:** "How does COW interact with the TLB?" When you flip a page from RW to RO, you must invalidate that TLB entry on every CPU that might have it cached. This is a TLB shootdown -- expensive on multi-CPU systems. xv6's `sfence.vma` handles this for RISC-V; Linux uses inter-processor interrupts for x86 TLB shootdowns. High fork rates can become TLB-shootdown bound. **Common Wrong Answers:** 1. *"Just mark pages read-only and copy on fault -- done."* Skips reference counting, which means you cannot tell when to free. A naive implementation leaks every COW-shared page. 2. *"COW always saves memory."* Not always -- if the child writes to most pages (which is common in some long-lived child processes), you pay the fault overhead and end up copying anyway. The savings are biggest when the child immediately execs. **Further Reading:** * MIT 6.S081 "cowgc" lab handout (free at pdos.csail.mit.edu) -- step-by-step COW implementation. * Linux kernel source `mm/memory.c`, function `do_wp_page()` -- the production COW handler. * "Dirty COW" CVE write-ups -- excellent for understanding the subtlety of COW races. *** ## Interview Relevance Understanding OS interfaces helps you: * Design better APIs and abstractions * Understand performance implications * Make informed technology choices * Debug production issues * How does `fork()` work? What's copy-on-write? * Explain file descriptors and I/O redirection * How do pipes work? When to use them? * What's the difference between hard links and symbolic links? * How does the shell implement pipelines? * Explain the system call mechanism * Understand trade-offs in OS design * Explain why Unix chose these abstractions * Compare with alternative designs * Discuss performance implications * Debug issues at the system call level *** ## Resources ### Official Materials * [xv6 Book](https://pdos.csail.mit.edu/6.828/2023/xv6/book-riscv-rev3.pdf) - Complete guide * [xv6 Source Code](https://github.com/mit-pdos/xv6-riscv) - GitHub repository * [MIT 6.S081](https://pdos.csail.mit.edu/6.828/2023/) - OS Engineering course ### Next Steps Deep dive into process lifecycle, scheduling, and context switching Understand paging, page tables, and memory management Learn about inodes, directories, and file system implementation Explore how these concepts scale in production systems *** > \[!TIP] > **Learning Path**: Start by reading the xv6 book alongside the source code. Try modifying xv6 to add features or change behavior. This hands-on experience is invaluable for truly understanding operating systems.