# OS Security: From Hardware to Userspace

> Deep dive into access control, capabilities, SELinux, ASLR, DEP, stack canaries, and microarchitectural attacks

# Operating System Security

Modern OS security is a multi-layered defense against both software vulnerabilities and hardware-level attacks. From memory protection to mandatory access control, understanding these mechanisms is crucial for building secure systems.

**Mastery Level**: Senior Security Engineer

**Key Internals**: Page Table Permissions, Capabilities, LSM hooks, CPU security features, Speculative execution mitigations

**Prerequisites**: [Virtual Memory](/operating-systems/virtual-memory), [Process Internals](/operating-systems/processes)

***

## 1. Memory Protection Fundamentals

### 1.1 Page-Level Protection (NX/DEP)

**No-Execute (NX)** / **Data Execution Prevention (DEP)** marks memory pages as non-executable.

```
Traditional (Insecure):
┌────────────────────────────────┐
│ Stack                          │ Executable!
│ ├─ Return addresses            │ ← Attacker can inject shellcode
│ └─ Local variables             │
├────────────────────────────────┤
│ Heap                           │ Executable!
│ ├─ Malloc'd buffers            │ ← Attacker can put code here
│ └─ Dynamic data                │
├────────────────────────────────┤
│ Data/BSS                       │ Executable!
└────────────────────────────────┘

With NX/DEP:
┌────────────────────────────────┐
│ Stack (NX bit set)             │ NOT Executable
│ ├─ Return addresses            │ ← Shellcode won't execute!
│ └─ Local variables │ ├────────────────────────────────┤ │ Heap (NX bit set) │ NOT Executable │ ├─ Malloc'd buffers │ │ └─ Dynamic data │ ├────────────────────────────────┤ │ Data/BSS (NX bit set) │ NOT Executable ├────────────────────────────────┤ │ Text (executable) │ Executable │ └─ Program code │ └────────────────────────────────┘ ``` **Implementation**: ```c theme={null} // Kernel sets page table entry (PTE) NX bit // x86-64 page table entry structure struct pte { unsigned long present : 1; // Page is in memory unsigned long rw : 1; // Read/Write permission unsigned long user : 1; // User-mode accessible unsigned long pwt : 1; // Page write-through unsigned long pcd : 1; // Page cache disabled unsigned long accessed : 1; // Page was accessed unsigned long dirty : 1; // Page was written unsigned long pat : 1; // Page attribute table unsigned long global : 1; // Global page unsigned long avail : 3; // Available for OS use unsigned long pfn : 40; // Physical frame number unsigned long avail2 : 11; // Available unsigned long nx : 1; // NO-EXECUTE bit (bit 63) }; // Kernel code for stack allocation (simplified from mm/mmap.c) unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long pgoff) { struct vm_area_struct *vma; vma = vm_area_alloc(current->mm); vma->vm_start = addr; vma->vm_end = addr + len; // Stack protection: read/write but NOT executable if (flags & MAP_STACK) { vma->vm_flags = VM_READ | VM_WRITE | VM_GROWSDOWN; // NX bit will be set in page table entries } // Code segment: executable but NOT writable if (prot & PROT_EXEC) { vma->vm_flags |= VM_EXEC; vma->vm_flags &= ~VM_WRITE; // W^X: Write XOR Execute } return addr; } ``` **W^X Policy** (Write XOR Execute): * A page can be writable OR executable, but never both * Prevents attacker from modifying code or executing data * The logic is straightforward: if you can write it, the attacker can inject code there, so it must not 
execute. If it executes, it must be immutable. **Practical tip**: JIT compilers (V8, JVM HotSpot) are the main exception -- they must generate code at runtime. They handle this by allocating pages as RW, writing machine code, then calling `mprotect()` to flip them to RX before execution. This W-then-X pattern is audited carefully in security-critical JITs like Firefox's Wasm compiler. **Check NX status**: ```bash theme={null} # Check if NX is enabled dmesg | grep NX # NX (Execute Disable) protection: active # Check process memory maps cat /proc/self/maps # 7ffff7dd1000-7ffff7df3000 r-xp ... /lib/x86_64-linux-gnu/ld-2.31.so (executable) # 7ffff7df3000-7ffff7df4000 r--p ... /lib/x86_64-linux-gnu/ld-2.31.so (read-only) # 7ffffffde000-7ffffffff000 rw-p ... [stack] (no 'x'!) # Check if binary has NX enabled readelf -l /bin/ls | grep GNU_STACK # GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x10 # ^^^ (RW, not RWE) ``` ### 1.2 Address Space Layout Randomization (ASLR) **Problem**: Without ASLR, addresses are predictable. ``` Without ASLR (Predictable): ┌─────────────────────────────────┐ │ Stack: 0x7ffffffde000 │ ← Always same address! │ Heap: 0x555555559000 │ ← Attacker knows these │ libc: 0x7ffff7a0d000 │ ← Can hardcode in exploit │ Program: 0x555555554000 │ │ vDSO: 0x7ffff7fc9000 │ └─────────────────────────────────┘ With ASLR (Randomized): Run 1: Run 2: ┌─────────────────────────────┐ ┌─────────────────────────────┐ │ Stack: 0x7ffc9e3a2000 │ │ Stack: 0x7ffe1b8d6000 │ │ Heap: 0x5643ab123000 │ │ Heap: 0x55e2d9abc000 │ │ libc: 0x7f8a2e456000 │ │ libc: 0x7f3c81de2000 │ │ Program: 0x5643ab11f000 │ │ Program: 0x55e2d9ab8000 │ │ vDSO: 0x7f8a2f1c3000 │ │ vDSO: 0x7f3c82b4f000 │ └─────────────────────────────┘ └─────────────────────────────┘ ↑ Different every time! 
                              ↑
```

**Kernel Implementation**:

```c theme={null}
// Simplified from arch/x86/mm/mmap.c
unsigned long arch_mmap_rnd(void)
{
    unsigned long rnd;

    // Get random bits from kernel PRNG
    if (mmap_is_ia32()) {
        // 32-bit (compat) task: smaller address space, fewer random bits
        rnd = get_random_long() & ((1UL << mmap_rnd_compat_bits) - 1);
    } else {
        // 64-bit task
        rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);
    }

    return rnd << PAGE_SHIFT; // Align to page boundary
}

unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
                                     unsigned long len, unsigned long pgoff,
                                     unsigned long flags)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned long start_addr;

    // Add random offset
    if (!(flags & MAP_FIXED)) {
        // Random offset for ASLR
        start_addr = mm->mmap_base + arch_mmap_rnd();
    } else {
        start_addr = addr;
    }

    // Find free region starting at randomized address
    vma = find_vma(mm, start_addr);
    // ... allocation logic ...

    return start_addr;
}
```

**Entropy Sources**:

```
ASLR Entropy (bits of randomness; typical values, which vary with kernel version and configuration):

Stack:       19 bits (on x86-64) = 524,288 possible locations
Heap:        13 bits = 8,192 possible locations
Libraries:   28 bits = 268 million possible locations
PIE binary:  28 bits = 268 million possible locations

Formula: Worst-case brute force attempts = 2^(entropy_bits), average = 2^(entropy_bits - 1)

Example: 28 bits → attacker needs avg 2^27 = 134 million attempts

If each attempt crashes the program (1 sec delay):
  134 million seconds ≈ 1,553 days!

But if the process doesn't re-randomize between attempts (fork server whose
children inherit the parent's layout):
  Attacker can brute force in minutes!
```

**KASLR (Kernel ASLR)**:

```c theme={null}
// Kernel virtual address randomization (from arch/x86/boot/compressed/kaslr.c)
void choose_random_location(unsigned long input, unsigned long input_size,
                            unsigned long *output, unsigned long output_size,
                            unsigned long *virt_addr)
{
    unsigned long random_addr, min_addr;

    // Get entropy from:
    // 1. RDRAND/RDSEED (CPU instructions)
    // 2. RDTSC (timestamp counter)
    // 3.
    //    Boot parameters
    random_addr = get_random_long();

    // Align and constrain to valid kernel address range
    min_addr = min(*output, *virt_addr);
    random_addr = find_random_phys_addr(min_addr, output_size);

    *output = random_addr;
    *virt_addr = random_addr + __START_KERNEL_map;
}
```

**Check ASLR status**:

```bash theme={null}
# View ASLR setting
cat /proc/sys/kernel/randomize_va_space
# 0 = Disabled
# 1 = Randomize stack, libraries, mmap
# 2 = Full randomization (includes heap)

# Enable full ASLR
echo 2 | sudo tee /proc/sys/kernel/randomize_va_space

# Test ASLR (each cat is a new process, so each run shows a different stack address)
for i in {1..5}; do cat /proc/self/maps | grep stack; done
# 7ffc12345000-7ffc12366000 (different)
# 7ffe9abcd000-7ffe9abee000 (different)
# 7ffd45678000-7ffd45699000 (different)
```

### 1.3 Stack Canaries (Stack Smashing Protection)

A **stack canary** is a random value placed on the stack between local variables and the return address.

```
Stack Layout with Canary:
─────────────────────────
High Address
┌──────────────────────┐
│ Return Address       │ ← Protected by canary!
├──────────────────────┤
│ Saved Frame Pointer  │
├──────────────────────┤
│ CANARY (random)      │ ← __stack_chk_guard
├──────────────────────┤
│ Local Variables      │ ← Buffer overflow starts here
│ char buf[100];       │
└──────────────────────┘
Low Address

Attack Scenario:
1. Attacker overflows buf
2. Overwrites canary (but doesn't know correct value)
3. Function is about to return
4. Compiler-inserted epilogue checks: canary == __stack_chk_guard?
5. Mismatch → Stack smashing detected! → Abort
```

**Compiler Implementation**:

```c theme={null}
// Original vulnerable code
void vulnerable_function(char *input) {
    char buffer[64];
    strcpy(buffer, input); // Buffer overflow!
}

// Compiled with -fstack-protector-strong (conceptual equivalent)
void vulnerable_function(char *input) {
    char buffer[64];
    unsigned long canary = __stack_chk_guard; // Load canary

    strcpy(buffer, input);

    if (canary != __stack_chk_guard) {
        __stack_chk_fail(); // Stack smashing detected!
} } // __stack_chk_fail() implementation (in glibc) void __attribute__((noreturn)) __stack_chk_fail(void) { __fortify_fail("stack smashing detected"); } void __attribute__((noreturn)) __fortify_fail(const char *msg) { // Log the error syslog(LOG_CRIT, "%s: %s terminated", __progname, msg); // Terminate immediately abort(); } ``` **Canary Types**: ```c theme={null} // Uses NULL, CR, LF, EOF (0x00, 0x0D, 0x0A, 0xFF) // Idea: strcpy stops at NULL, gets/printf stop at CR/LF unsigned long canary = 0x000d0aff00000000; // Weakness: Attacker can guess/brute-force known bytes ``` ```c theme={null} // Completely random value generated at program startup // In kernel (from arch/x86/kernel/cpu/common.c) void __init cpu_init(void) { // ... __stack_chk_guard = get_random_canary(); // ... } unsigned long get_random_canary(void) { unsigned long canary; // Use hardware RNG if available if (cpu_has_rdrand()) { rdrand_long(&canary); } else { // Fall back to PRNG get_random_bytes(&canary, sizeof(canary)); } return canary; } // Strength: Unpredictable, unique per process ``` ```c theme={null} // XOR of return address, frame pointer, and random value unsigned long canary = __stack_chk_guard ^ (unsigned long)__builtin_return_address(0) ^ (unsigned long)__builtin_frame_address(0); // Idea: Even if attacker overwrites return address, // canary changes correspondingly // Weakness: Complex, rarely used ``` **Compiler Flags**: ```bash theme={null} # -fstack-protector: Protect functions with vulnerable buffers gcc -fstack-protector vulnerable.c # -fstack-protector-strong: Protect more functions (recommended) gcc -fstack-protector-strong vulnerable.c # -fstack-protector-all: Protect ALL functions (performance cost) gcc -fstack-protector-all vulnerable.c # Check if binary has stack protector readelf -s /bin/ls | grep stack_chk # 123: 00000000000060a0 8 OBJECT GLOBAL DEFAULT 25 __stack_chk_fail@@GLIBC_2.4 ``` **Bypass Techniques** (and mitigations): | Attack | Mitigation | | 
| ------------------------------------------- | --------------------------------------- |
| Leak canary via format string               | Use fortified functions (`__printf_chk`) |
| Overwrite canary with correct value         | Use random canary per thread            |
| Jump over canary (partial overflow)         | Place canary near variables             |
| Fork before overflow (canary same in child) | Re-randomize after fork                 |

***

## 2. Control Flow Integrity (CFI)

CFI ensures program control flow follows legitimate paths (no arbitrary jumps).

### 2.1 Forward-Edge CFI (Indirect Calls)

**Problem**: Function pointers can be hijacked.

```c theme={null}
// Vulnerable code
struct ops {
    void (*process)(char *data);
};

struct ops *vtable = malloc(sizeof(struct ops));
vtable->process = legitimate_function;

// ... buffer overflow ...
// Attacker overwrites vtable->process to point to shellcode

vtable->process(data); // Calls shellcode!
```

**CFI Solution**:

```c theme={null}
// Compiler generates CFI check before indirect call

// Original code
vtable->process(data);

// Compiled with CFI
void *target = vtable->process;

// Check 1: Is target a valid code address?
if (!is_valid_code_address(target)) {
    cfi_violation();
}

// Check 2: Is target in allowed set for this call site?
if (!is_allowed_target(call_site_id, target)) {
    cfi_violation();
}

// Perform call
((void (*)(char *))target)(data);
```

**Allowed Target Sets**:

```
Function Signature-Based CFI:

void func_a(int x);        ← Set 1: void (int)
void func_b(int x);        ←
int func_c(int x, int y);  ← Set 2: int (int, int)
int func_d(int x, int y);  ←
void func_e(void);         ← Set 3: void (void)

Rule: Indirect call with signature void(int) can only jump to Set 1.

Implementation:
1. Compiler assigns ID to each function signature
2. Compiler tags each function with its ID
3.
   Before indirect call, check ID matches expected signature
```

**Clang CFI**:

```bash theme={null}
# Compile with CFI (requires LTO and hidden visibility)
clang -fsanitize=cfi -flto -fvisibility=hidden program.c

# CFI variants
-fsanitize=cfi-icall            # Indirect calls
-fsanitize=cfi-vcall            # Virtual function calls (C++)
-fsanitize=cfi-derived-cast     # Bad casts from base to derived class
-fsanitize=cfi-unrelated-cast   # Bad casts from void* or unrelated types

# Generate CFI violation report
# (by default a violation traps; build with -fno-sanitize-trap=cfi to get diagnostics)
UBSAN_OPTIONS=print_stacktrace=1 ./program

# Example violation
# SUMMARY: UndefinedBehaviorSanitizer: cfi-check-fail
# pc 0x55f8a2b3c4d5 in main program.c:42
```

### 2.2 Backward-Edge CFI (Return Address Protection)

**Shadow Stack**: Hardware-protected copy of return addresses.

```
Regular Stack             Shadow Stack (Protected)
─────────────────         ────────────────────────
┌─────────────┐           ┌─────────────┐
│ Ret Addr 3  │ ◄──────►  │ Ret Addr 3  │  (Copy)
├─────────────┤           ├─────────────┤
│ Locals      │           │             │
├─────────────┤           │             │
│ Ret Addr 2  │ ◄──────►  │ Ret Addr 2  │
├─────────────┤           ├─────────────┤
│ Locals      │           │             │
├─────────────┤           │             │
│ Ret Addr 1  │ ◄──────►  │ Ret Addr 1  │
└─────────────┘           └─────────────┘
      ↑                         ↑
  Writable!                 Read-Only!
  (Attacker can             (CPU enforced,
   modify)                   not writable by ordinary stores)

On Function Return:
1. Pop return address from regular stack → addr_stack
2. Pop return address from shadow stack → addr_shadow
3. Compare: addr_stack == addr_shadow?
4.
Mismatch → #CP exception (Control Protection) → Crash ``` **Intel CET (Control-flow Enforcement Technology)**: ```c theme={null} // CPU features for shadow stack #define X86_FEATURE_SHSTK (1 << 7) // Shadow stack #define X86_FEATURE_IBT (1 << 20) // Indirect branch tracking // Enable shadow stack (kernel code) void cet_enable(void) { u64 msr_val; // Check if CPU supports CET if (!boot_cpu_has(X86_FEATURE_SHSTK)) return; // Enable in MSR rdmsrl(MSR_IA32_S_CET, msr_val); msr_val |= MSR_IA32_CET_SHSTK_EN; // Enable shadow stack wrmsrl(MSR_IA32_S_CET, msr_val); // Allocate shadow stack for current thread unsigned long ssp = alloc_shstk(); // Shadow stack pointer wrmsrl(MSR_IA32_PL3_SSP, ssp); } // Shadow stack operations (new x86 instructions) // INCSSP - Increment shadow stack pointer // RDSSP - Read shadow stack pointer // SAVEPREVSSP - Save previous SSP // RSTORSSP - Restore SSP // WRSSD/WRSSQ - Write to shadow stack // SETSSBSY - Mark shadow stack busy ``` **ARM Pointer Authentication**: ```c theme={null} // ARM PAuth uses cryptographic signing of return addresses // On function prologue: // PAC (Pointer Authentication Code) = sign(return_addr, context_key) // Store: PAC || return_addr on stack // On function epilogue: // Verify: sign(return_addr, context_key) == PAC? 
// If mismatch → Fault

// ARM instructions
PACIA X30, SP   // Sign return address (X30) with stack pointer (SP)
RETAA           // Authenticate and return
```

**Software Shadow Stack** (Android):

```c theme={null}
// Conceptual model of a software shadow stack (not hardware-protected).
// Android's ShadowCallStack achieves the same effect through compiler
// instrumentation rather than explicit entry/exit hooks like these.
__thread void *shadow_stack[1024];
__thread int shadow_stack_ptr = 0;

void function_entry(void *return_addr) {
    shadow_stack[shadow_stack_ptr++] = return_addr;
}

void function_exit(void *return_addr) {
    void *expected = shadow_stack[--shadow_stack_ptr];
    if (return_addr != expected) {
        abort(); // Stack corruption detected
    }
}

// Weakness: Attacker can corrupt shadow_stack too if memory bug exists
// Strength: Works on CPUs without hardware support
```

***

## 3. Privilege Separation & Capabilities

### 3.1 Traditional Unix DAC (Discretionary Access Control)

```
User/Group/Other Permissions:

File: /etc/shadow
Owner: root
Group: shadow
Permissions: rw-r-----
             │││││││││
             ││││││││└─ Other: no execute
             │││││││└── Other: no write
             ││││││└─── Other: no read
             │││││└──── Group: no execute
             ││││└───── Group: no write
             │││└────── Group: read
             ││└─────── Owner: no execute
             │└──────── Owner: write
             └───────── Owner: read

Problem: All-or-nothing root privileges!
- Process needs root to bind port 80 → runs fully as root
- Process needs root to read /etc/shadow → runs fully as root
```

### 3.2 POSIX Capabilities

**Divide root privileges into distinct units**:

```c theme={null}
// From /usr/include/linux/capability.h
#define CAP_CHOWN             0  // Change file ownership
#define CAP_DAC_OVERRIDE      1  // Bypass file permission checks
#define CAP_DAC_READ_SEARCH   2  // Bypass read/search permissions
#define CAP_FOWNER            3  // Bypass permission checks on file operations
#define CAP_FSETID            4  // Don't clear setuid/setgid bits
#define CAP_KILL              5  // Bypass permission checks for sending signals
#define CAP_SETGID            6  // Make arbitrary setgid calls
#define CAP_SETUID            7  // Make arbitrary setuid calls
#define CAP_NET_BIND_SERVICE 10  // Bind to ports < 1024
#define CAP_NET_RAW          13  // Use RAW and PACKET sockets
#define CAP_SYS_MODULE       16  // Load/unload kernel modules
#define CAP_SYS_PTRACE       19  // Trace arbitrary processes
#define CAP_SYS_ADMIN        21  // Lots of system admin operations
// ... 41 capabilities total ...
```

**Capability Sets**:

```c theme={null}
// Each process has 5 capability sets
struct cred {
    // ...
    kernel_cap_t cap_inheritable; // Inherited by exec'd programs
    kernel_cap_t cap_permitted;   // Can be enabled (superset)
    kernel_cap_t cap_effective;   // Actually active NOW
    kernel_cap_t cap_bset;        // Bounding set (limits what exec can grant)
    kernel_cap_t cap_ambient;     // Ambient set (new in Linux 4.3)
};

// Each set is a 64-bit bitmask: one bit per capability (room for 64)
typedef struct {
    __u32 cap[_LINUX_CAPABILITY_U32S_3]; // 2 × 32 bits
} kernel_cap_t;
```

**Capability Semantics**:

```
Permitted (P):   Capabilities the process CAN use
Effective (E):   Capabilities CURRENTLY active
Inheritable (I): Capabilities that can be inherited across exec
Ambient (A):     Capabilities preserved across exec of an unprivileged program
Bounding (B):    Upper limit on capabilities (cannot gain capabilities not in B)

On exec():
  A' = (file is privileged) ? 0 : A
  P' = (I & F_inheritable) | (F_permitted & B) | A'
  E' = F_effective ? P' : A'
  I' = I
  B' = B

Where F_* are file capabilities (set on executable); a "privileged" file is
one with file capabilities or the setuid/setgid bit set
```

**Using Capabilities**:

```bash theme={null}
# Give ping the ability to create raw sockets (no setuid needed!)
sudo setcap cap_net_raw+ep /bin/ping

# Verify
getcap /bin/ping
# /bin/ping = cap_net_raw+ep

# Now ping works without setuid bit!
ls -l /bin/ping
# -rwxr-xr-x ... /bin/ping (no 's' bit!)

# Remove capabilities
sudo setcap -r /bin/ping

# Set multiple capabilities
sudo setcap cap_net_bind_service,cap_net_raw+ep /usr/bin/server
```

```c theme={null}
#include <stdio.h>
#include <netinet/in.h>
#include <sys/capability.h>
#include <sys/socket.h>

int main() {
    cap_t caps;
    cap_value_t cap_list[1] = {CAP_NET_BIND_SERVICE};

    // Get current capabilities
    caps = cap_get_proc();

    // Raise the capability in the effective set. It must already be in the
    // permitted set (granted here through file capabilities): a process
    // cannot add capabilities it was never permitted.
    cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_list, CAP_SET);

    // Apply capabilities
    if (cap_set_proc(caps) != 0) {
        perror("cap_set_proc");
        cap_free(caps);
        return 1;
    }

    // Now we can bind port 80!
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(80),
        .sin_addr.s_addr = INADDR_ANY
    };

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        printf("Successfully bound to port 80\n");
    }

    // Drop capabilities we no longer need
    cap_clear(caps);
    cap_set_proc(caps);
    cap_free(caps);

    return 0;
}

// Compile: gcc -o server server.c -lcap
// Setup:   sudo setcap cap_net_bind_service+p ./server
```

```bash theme={null}
# View capabilities of running process
grep Cap /proc/self/status
# CapInh: 0000000000000000
# CapPrm: 0000000000000000
# CapEff: 0000000000000000
# CapBnd: 000001ffffffffff
# CapAmb: 0000000000000000

# Decode capability mask
capsh --decode=000001ffffffffff
# 0x000001ffffffffff=cap_chown,cap_dac_override,...

# View capabilities of any process
grep Cap /proc/1234/status

# List all capabilities
capsh --print
```

```c theme={null}
// Ambient capabilities (Linux 4.3+)
// Allows non-root processes to exec and retain capabilities
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/capability.h>

int main() {
    // Raise CAP_NET_RAW in ambient set.
    // Precondition: the capability must already be in both the permitted and
    // inheritable sets of this process, otherwise the call fails with EPERM.
    if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_NET_RAW, 0, 0) != 0) {
        perror("prctl");
        return 1;
    }

    // exec another program
    execl("/usr/bin/ping", "ping", "8.8.8.8", NULL);

    // /usr/bin/ping inherits CAP_NET_RAW!
    // (even though it's not setuid and has no file capabilities)
    return 0;
}

// Use case: Container init process grants capabilities to children
```

### 3.3 Seccomp (Secure Computing Mode)

**Seccomp-BPF**: Restrict system calls a process can make using BPF filters.

```c theme={null}
#include <fcntl.h>
#include <unistd.h>
#include <seccomp.h>

int main() {
    scmp_filter_ctx ctx;
    char buf[64];

    // Open needed files BEFORE loading the filter. A seccomp filter sees only
    // raw argument values; a path argument is a pointer that the filter cannot
    // dereference, so rules cannot match on file names.
    int fd = open("/tmp/allowed.txt", O_RDONLY);

    // Create filter: default action = KILL
    ctx = seccomp_init(SCMP_ACT_KILL);

    // Allow essential syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Argument filters compare integer values: allow close() only for our fd
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 1,
                     SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)fd));

    // Load filter into kernel
    seccomp_load(ctx);

    // After this point, any syscall not explicitly allowed → SIGSYS (kill)
    read(fd, buf, sizeof(buf));    // ✓ Allowed
    open("/etc/passwd", O_RDONLY); // ✗ Killed!

    seccomp_release(ctx);
    return 0;
}

// Compile: gcc -o sandbox sandbox.c -lseccomp
```

**Raw Seccomp-BPF**:

```c theme={null}
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

void install_seccomp_filter() {
    // Note: a production filter should first check seccomp_data.arch
    struct sock_filter filter[] = {
        // Load syscall number
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),

        // Allow exit syscall
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        // Allow write syscall
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        // Kill on any other syscall
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    };

    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    // Enable seccomp
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); // Cannot gain privileges
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

int main() {
    install_seccomp_filter();

    write(1, "Hello\n", 6); // ✓ Allowed
    getpid();               // ✗ Killed (SIGSYS)

    return 0;
}
```

**Seccomp Actions**:

| Action                     | Effect                   |
| -------------------------- | ------------------------ |
| `SECCOMP_RET_KILL_PROCESS` | Kill entire process      |
| `SECCOMP_RET_KILL_THREAD`  | Kill only current thread |
| `SECCOMP_RET_TRAP`         | Send SIGSYS signal       |
| `SECCOMP_RET_ERRNO`        | Return error code        |
| `SECCOMP_RET_TRACE`        | Notify tracer (ptrace)   |
| `SECCOMP_RET_LOG`          | Log and allow            |
| `SECCOMP_RET_ALLOW`        | Allow syscall            |

**Real-World Usage**:

```bash theme={null}
# Chrome sandbox
ps aux | grep chrome
# ... --type=renderer --enable-sandbox ...

# Docker seccomp profile
docker run --security-opt seccomp=default.json alpine sh

# systemd service with seccomp
cat /etc/systemd/system/myservice.service
# [Service]
# SystemCallFilter=@system-service
# SystemCallFilter=~@privileged @resources

# View seccomp status of process
grep Seccomp /proc/self/status
# Seccomp: 2 (mode 2 = filter active)
```

***

## 4.
Mandatory Access Control (MAC) ### 4.1 SELinux (Security-Enhanced Linux) **SELinux adds mandatory access control on top of DAC**. ``` DAC says: "Can user alice read file.txt?" → Check: alice's UID vs file owner, group, permissions SELinux says: "Can process with label X access file with label Y?" → Check: Policy rules for (process_label, file_label, operation) Both must succeed for access! ``` **SELinux Components**: ``` ┌─────────────────────────────────────────────────┐ │ SELinux Policy │ │ ┌───────────────────────────────────────────┐ │ │ │ Type Enforcement (TE) Rules │ │ │ │ allow httpd_t http_port_t:tcp_socket bind│ │ │ │ allow httpd_t httpd_sys_content_t:file r │ │ │ └───────────────────────────────────────────┘ │ │ ┌───────────────────────────────────────────┐ │ │ │ Security Contexts (Labels) │ │ │ │ user:role:type:level │ │ │ │ system_u:system_r:httpd_t:s0 │ │ │ └───────────────────────────────────────────┘ │ └─────────────────────────────────────────────────┘ ↓ ┌──────────────────────┐ │ LSM (Linux Security │ │ Module) Framework │ └──────────────────────┘ ↓ Kernel enforces at runtime ``` **Security Context**: ```bash theme={null} # View file context ls -Z /var/www/html/index.html # system_u:object_r:httpd_sys_content_t:s0 /var/www/html/index.html # │ │ │ │ # │ │ │ └─ MLS level (sensitivity) # │ │ └─ Type (most important!) 
# │ └─ Role # └─ User # View process context ps -Z 1234 # system_u:system_r:httpd_t:s0 /usr/sbin/httpd # Change file context chcon -t httpd_sys_content_t /var/www/html/newfile.html # Restore default contexts restorecon -Rv /var/www/html/ ``` **Type Enforcement Rules**: ``` # From /etc/selinux/targeted/policy/policy.conf (compiled binary) # Allow httpd to bind TCP sockets on http_port_t (port 80, 443) allow httpd_t http_port_t:tcp_socket { bind listen }; # Allow httpd to read files labeled httpd_sys_content_t allow httpd_t httpd_sys_content_t:file { read getattr open }; # Allow httpd to write to log files allow httpd_t httpd_log_t:file { write append create }; # Deny (example) # If no rule allows, default is deny! # httpd_t trying to access user_home_t → DENIED ``` **SELinux Modes**: ```bash theme={null} # Active enforcement getenforce # Enforcing # Violations blocked # AVC denials logged sestatus # SELinux status: enabled # Current mode: enforcing ``` ```bash theme={null} # Log-only mode setenforce 0 getenforce # Permissive # Violations logged # but NOT blocked # Good for debugging ``` ```bash theme={null} # SELinux completely off # Edit /etc/selinux/config SELINUX=disabled # Reboot required # NO security benefit! 
``` **Debugging SELinux**: ```bash theme={null} # View denials ausearch -m avc -ts recent # Example denial type=AVC msg=audit(1234567890.123:456): avc: denied { read } for pid=1234 comm="httpd" name="secret.txt" dev="sda1" ino=123456 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:user_home_t:s0 tclass=file permissive=0 # Translation: httpd_t tried to read user_home_t file → DENIED # Generate policy module to allow audit2allow -a -M mypolicy # module mypolicy 1.0; # require { # type httpd_t; # type user_home_t; # class file read; # } # allow httpd_t user_home_t:file read; # Install policy module semodule -i mypolicy.pp # List loaded modules semodule -l # Remove module semodule -r mypolicy ``` **SELinux Booleans** (runtime toggles): ```bash theme={null} # List all booleans getsebool -a | grep httpd # httpd_can_network_connect --> off # httpd_can_network_connect_db --> off # httpd_enable_cgi --> on # Enable httpd network connections setsebool -P httpd_can_network_connect on # -P makes it persistent across reboot ``` ### 4.2 AppArmor **AppArmor is path-based MAC** (vs SELinux's label-based). ``` SELinux: "Process with label X can access file with label Y" → Requires labeling entire filesystem AppArmor: "Process can access /var/www/* with read permission" → Based on filesystem paths (easier to understand) ``` **AppArmor Profile**: ```bash theme={null} # /etc/apparmor.d/usr.sbin.nginx #include /usr/sbin/nginx { #include #include capability dac_override, capability net_bind_service, capability setgid, capability setuid, /etc/nginx/** r, /var/log/nginx/** rw, /var/www/** r, /run/nginx.pid rw, # Network network inet stream, network inet6 stream, # Deny everything else (default) } ``` **Profile Modes**: ```bash theme={null} # Enforce mode aa-enforce /usr/sbin/nginx # Complain mode (log-only) aa-complain /usr/sbin/nginx # Disable profile aa-disable /usr/sbin/nginx # View status aa-status # apparmor module is loaded. # 12 profiles are loaded. 
# 10 profiles are in enforce mode.
# 2 profiles are in complain mode.
```

**Creating Profiles**:

```bash theme={null}
# Generate profile automatically
aa-genprof /usr/bin/myapp

# Steps:
# 1. Runs app in learning mode
# 2. Exercise all app functionality
# 3. Reviews logged accesses
# 4. Generates profile

# Manually create profile
cat > /etc/apparmor.d/usr.bin.myapp <<'EOF'
/usr/bin/myapp {
  /etc/myapp/** r,
  /var/lib/myapp/** rw,
  /tmp/** rw,

  capability net_bind_service,

  deny /etc/shadow r,
}
EOF

# Load profile
apparmor_parser -r /etc/apparmor.d/usr.bin.myapp
```

**SELinux vs AppArmor**:

| Feature            | SELinux              | AppArmor             |
| ------------------ | -------------------- | -------------------- |
| **Granularity**    | Very fine (labels)   | Coarse (paths)       |
| **Complexity**     | High                 | Low                  |
| **Performance**    | Small overhead       | Very small           |
| **Learning curve** | Steep                | Gentle               |
| **Flexibility**    | Maximum              | Good                 |
| **Default**        | RHEL, Fedora, CentOS | Debian, Ubuntu, SUSE |

***

## 5. Microarchitectural Attacks & Mitigations

### 5.1 Spectre & Meltdown

**Speculative Execution**: CPU predicts branch and executes ahead, then discards if wrong.

```
// Vulnerable code
if (x < array1_size) {      // Bounds check
    y = array2[array1[x]];  // Out-of-bounds access
}

Without Speculation:
1. Check: x < array1_size?
2. If true, execute access
3. If false, skip

With Speculation (vulnerable):
1. CPU predicts branch will be taken
2. Speculatively executes: y = array2[array1[x]]
   Even if x >= array1_size!
3. Loads array1[x] (out of bounds!)
4. Uses it to index array2
5. array2[...] brought into cache ← SIDE EFFECT!
6. Branch misprediction detected
7. Architectural state rolled back
8. BUT: Cache state NOT rolled back!

Attacker observes cache timing → leaks array1[x]!
```

**Meltdown (CVE-2017-5754)**:

```c theme={null}
// Leak kernel memory from user space

// 1. Flush every probe line from the cache
for (int i = 0; i < 256; i++)
    clflush(&probe_array[i * 4096]);

// 2. Access kernel memory (faults architecturally, but the load executes transiently first)
unsigned char kernel_byte = *(unsigned char *)kernel_address;

// 3.
//    Use leaked byte to index array
char dummy = probe_array[kernel_byte * 4096];
// This brings probe_array[kernel_byte * 4096] into cache

// 4. The fault is raised when the load retires and the transient
//    instructions are squashed (no branch misprediction is involved)
//    But probe_array[...] is NOW in cache!

// 5. Time accesses to probe_array
for (int i = 0; i < 256; i++) {
    t0 = rdtsc();
    dummy = probe_array[i * 4096];
    t1 = rdtsc();

    if (t1 - t0 < THRESHOLD) {
        // Cache hit! i == kernel_byte
        printf("Leaked kernel byte: 0x%02x\n", i);
    }
}

// Result: Leaked kernel memory byte-by-byte at ~100 KB/s!
```

**Mitigation: KPTI (Kernel Page Table Isolation)**:

```
Without KPTI:
┌────────────────────────────────────┐
│ User Page Tables                   │
│  ┌──────────────────────────────┐  │
│  │ User Space Mappings          │  │
│  │ 0x00000000 - 0x7fffffffffff  │  │
│  ├──────────────────────────────┤  │
│  │ Kernel Space Mappings        │  │
│  │ 0xffff800000000000 - ...     │  │ ← Meltdown can read this!
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

With KPTI (two sets of page tables):
┌────────────────────────────────────┐
│ User Page Tables                   │
│  ┌──────────────────────────────┐  │
│  │ User Space Mappings          │  │
│  ├──────────────────────────────┤  │
│  │ Minimal Kernel (entry/exit)  │  │ ← Only small trampoline
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

┌────────────────────────────────────┐
│ Kernel Page Tables                 │
│  ┌──────────────────────────────┐  │
│  │ User Space Mappings          │  │
│  ├──────────────────────────────┤  │
│  │ Full Kernel Space            │  │ ← Full kernel mapped
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

On syscall: Switch from User PT → Kernel PT (CR3 register swap)
On return:  Switch from Kernel PT → User PT

Cost: ~5-30% performance penalty, depending on workload
      (paid on every user/kernel transition, so syscall-heavy workloads suffer most)
```

**Kernel Implementation** (simplified from arch/x86/mm/pti.c):

```c theme={null}
// Enable KPTI (illustrative sketch, not the actual kernel code)
void pti_init(void)
{
    if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN))
        return; // CPU not vulnerable

    pr_info("Kernel/User page tables isolation: enabled\n");

    //
    // Allocate separate user page tables
    pgd_t *user_pgd = (pgd_t *)__get_free_page(GFP_KERNEL);

    // Copy user space mappings
    clone_pgd_range(user_pgd, kernel_pgd, KERNEL_PGD_PTRS);

    // Map minimal kernel trampolines (entry/exit stubs)
    map_entry_trampoline(user_pgd);

    // Install user page tables
    current->mm->pgd = user_pgd;
}

// Syscall entry: switch to kernel page tables
ENTRY(entry_SYSCALL_64)
    swapgs                                    // Swap GS (per-CPU data)
    movq %rsp, PER_CPU_VAR(rsp_scratch)       // Stash the user stack pointer

    /* Switch page tables first; %rsp is free to use as scratch here */
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp     // Load kernel CR3

    /* Only now load the kernel stack, which is not mapped in user tables */
    movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

    /* ... handle syscall ... */

    SWITCH_TO_USER_CR3 scratch_reg=%rsp       // Load user CR3 (simplified)
    swapgs
    sysretq
END(entry_SYSCALL_64)
```

**Spectre (CVE-2017-5753/5715)**:

```c theme={null}
// Bounds Check Bypass (Spectre v1, CVE-2017-5753)
// Spectre v2 (Branch Target Injection, CVE-2017-5715) poisons the indirect
// branch predictor instead, but leaks through the same cache side channel.

// Victim code
if (x < array1_size) {
    y = probe_array[array1[x] * 256];
}

// Attacker code
void attack() {
    // 1. Train branch predictor
    for (int i = 0; i < 1000; i++) {
        victim_function(valid_x);  // Train: branch TAKEN
    }

    // 2. Flush every probe line from the cache
    for (int i = 0; i < 256; i++) {
        clflush(&probe_array[i * 256]);
    }

    // 3. Call with malicious x
    victim_function(malicious_x);  // x >= array1_size
    // CPU speculatively executes (branch predictor says TAKEN)
    // Leaks out-of-bounds memory into cache

    // 4. Time side-channel to recover
    for (int i = 0; i < 256; i++) {
        t0 = rdtsc();
        dummy = probe_array[i * 256];
        t1 = rdtsc();

        if (t1 - t0 < THRESHOLD) {
            printf("Leaked: 0x%02x\n", i);
        }
    }
}

// Result: leaks any memory the victim code path can read, for example
// kernel data when the victim is a syscall handler.
```

**Mitigation: Retpoline (Return Trampoline)**:

```asm theme={null}
; Traditional indirect jump (vulnerable)
jmp *%rax

; Retpoline (safe)
    call set_target       ; Pushes the address of capture_spec as return address
capture_spec:
    pause                 ; Speculation after the ret lands here...
    lfence                ; ...and is serialized
    jmp capture_spec      ; Loop is only ever executed speculatively
set_target:
    mov %rax, (%rsp)      ; Overwrite the return address with the real target
    ret                   ; Architecturally jumps to *%rax

; The return stack buffer predicts a return to capture_spec,
; so speculation is trapped in the loop.
; The indirect jump has been converted into a return instruction.
```

**Kernel Implementation**:

```c theme={null}
// Compiler flag
KBUILD_CFLAGS += -mindirect-branch=thunk-extern
KBUILD_CFLAGS += -mindirect-branch-register

// Generated code
// Before:
//   call *%rax
// After:
//   call __x86_indirect_thunk_rax

__x86_indirect_thunk_rax:
    call .Ldo_rop
.Lspec_trap:
    pause
    lfence
    jmp .Lspec_trap       // Reached only speculatively
.Ldo_rop:
    mov %rax, (%rsp)      // Replace the return address with the real target
    ret                   // Architecturally jumps to *%rax
```

**Hardware Mitigations**:

```bash theme={null}
# Check CPU vulnerabilities (grep prints each file name with its content)
grep . /sys/devices/system/cpu/vulnerabilities/*

# meltdown: Mitigation: PTI
# spectre_v1: Mitigation: usercopy/swapgs barriers and __user pointer sanitization
# spectre_v2: Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, RSB filling

# IBRS (Indirect Branch Restricted Speculation)
# IBPB (Indirect Branch Predictor Barrier)
# STIBP (Single Thread Indirect Branch Predictors)
# SSBD (Speculative Store Bypass Disable)

# Disable mitigations (for benchmarking)
# WARNING: Insecure!
# These debugfs switches exist on RHEL/CentOS 7 kernels; upstream kernels
# use boot parameters instead (nopti, nospectre_v2, mitigations=off).
echo 0 > /sys/kernel/debug/x86/pti_enabled
echo 0 > /sys/kernel/debug/x86/retp_enabled
```

### 5.2 Rowhammer

**DRAM vulnerability**: Rapidly accessing one row can flip bits in adjacent rows.

```
DRAM Organization:
┌─────────────────────────────────┐
│  Bank 0                         │
│  ┌───────────────────────────┐  │
│  │ Row 0  [data]             │  │
│  │ Row 1  [data] ← Target    │  │ ← Victim row
│  │ Row 2  [data]             │  │ ← Hammered (read repeatedly)
│  │ Row 3  [data] ← Target    │  │ ← Victim row
│  │ ...                       │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘

Attack:
1. Find adjacent rows in DRAM
2. Rapidly read from Row 2 (millions of times)
3.
Electrical interference causes bit flips in Row 1 and Row 3 4. Attacker doesn't directly access victim rows! ``` **Exploitation**: ```c theme={null} // Rowhammer exploit (simplified) // 1. Spray memory with target pattern char *spray[1000]; for (int i = 0; i < 1000; i++) { spray[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); memset(spray[i], 0xFF, 4096); // All bits set } // 2. Find adjacent rows (via DRAM addressing) uint64_t *hammer_addr1 = find_row_address(2); uint64_t *hammer_addr2 = find_row_address(4); // 3. Hammer rows for (int i = 0; i < 1000000; i++) { *hammer_addr1; // Read (causes DRAM row activation) *hammer_addr2; clflush(hammer_addr1); // Evict from cache (force DRAM access) clflush(hammer_addr2); } // 4. Check for bit flips in victim rows for (int i = 0; i < 1000; i++) { for (int j = 0; j < 4096; j++) { if (spray[i][j] != 0xFF) { printf("Bit flip at %p: 0x%02x\n", &spray[i][j], spray[i][j]); } } } // Real attacks: // - Flip bit in page table → gain access to kernel memory // - Flip bit in SELinux context → privilege escalation // - Flip bit in RSA key → factor private key ``` **Mitigations**: ``` Error-Correcting Code (ECC): - Detects and corrects single-bit errors - Detects (but can't correct) multi-bit errors Cost: ~10% more expensive Performance: Slight overhead Widely used in servers, rare in consumer devices ``` ``` Hardware solution by DRAM vendors: - Monitor row activation counters - If row accessed frequently, refresh adjacent rows - Prevents charge leak that causes bit flips Implemented in DDR4/DDR5 DRAM Effectiveness: Good but not perfect (bypasses exist with careful timing) ``` ```bash theme={null} # Limit cache flush instructions (clflush) # (Requires kernel patch) # Increase DRAM refresh rate # (Reduces performance) # Memory deduplication disabled echo 0 > /sys/kernel/mm/ksm/run # Prevent predictable physical addresses # (KASLR + physical address randomization) ``` ```c theme={null} // Double-sided 
// rowhammer protection (illustrative heuristic)
// Kernel detects an unusually high rate of DRAM accesses
void dram_protect(void) {
    // Monitor last-level cache misses: every miss is a DRAM access,
    // so a sustained spike is a proxy for row activations
    u64 llc_misses = read_pmc(LLC_MISS_EVENT);

    if (llc_misses > THRESHOLD) {
        // Potential rowhammer attack
        // Reading the rows next to the hot addresses recharges them
        refresh_neighbor_rows();  // hypothetical helper

        // Log for analysis
        pr_warn("Potential Rowhammer attack detected\n");
    }
}
```

***

## 6. Sandboxing Techniques

### 6.1 Namespaces (Containers)

Linux namespaces isolate resources between processes.

```c theme={null}
// Namespace types (Linux 5.6 added an eighth: time)
#define CLONE_NEWNS     0x00020000  // Mount namespace
#define CLONE_NEWUTS    0x04000000  // UTS (hostname) namespace
#define CLONE_NEWIPC    0x08000000  // IPC namespace
#define CLONE_NEWPID    0x20000000  // PID namespace
#define CLONE_NEWNET    0x40000000  // Network namespace
#define CLONE_NEWUSER   0x10000000  // User namespace
#define CLONE_NEWCGROUP 0x02000000  // Cgroup namespace
#define CLONE_NEWTIME   0x00000080  // Time namespace (Linux 5.6+)
```

**Creating Isolated Environment**:

```c theme={null}
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

int sandbox_init(void *arg) {
    // Make mounts private so changes do not propagate to the host
    mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);

    // New hostname
    sethostname("sandbox", 7);

    // New root filesystem
    chroot("/var/sandbox");
    chdir("/");

    // Execute sandboxed program
    execl("/bin/sh", "sh", (char *)NULL);
    return 1;  // Only reached if execl fails
}

int main(void) {
    // The child needs its own stack; 4 KB is too small for real programs
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return 1;

    // Create new namespaces (needs CAP_SYS_ADMIN, so run as root)
    int flags = CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID |
                CLONE_NEWNET | CLONE_NEWIPC;

    // Clone process with new namespaces; the stack grows down, so pass its top
    if (clone(sandbox_init, stack + STACK_SIZE, flags | SIGCHLD, NULL) == -1)
        return 1;

    wait(NULL);
    free(stack);
    return 0;
}
```

**PID Namespace** (process isolation):

```c theme={null}
// Parent namespace
// PID 1: init
// PID 123: parent
// PID 124: child (clone with CLONE_NEWPID)

// Inside child's PID namespace
getpid();  // Returns 1 (child is PID 1 in its namespace)

// Child can only see processes in its namespace
ps aux  // Only shows processes in this namespace

// Parent can still see child
// PID 124 in parent namespace == PID 1 in child
namespace ``` **Network Namespace** (network isolation): ```bash theme={null} # Create new network namespace ip netns add sandbox # Execute command in namespace ip netns exec sandbox ip link list # 1: lo: state DOWN # (Only loopback, no network access!) # Create virtual interface pair ip link add veth0 type veth peer name veth1 # Move veth1 to sandbox namespace ip link set veth1 netns sandbox # Configure networking ip addr add 10.0.0.1/24 dev veth0 ip link set veth0 up ip netns exec sandbox ip addr add 10.0.0.2/24 dev veth1 ip netns exec sandbox ip link set veth1 up # Now sandbox can communicate via veth interface ``` ### 6.2 Chrome Multi-Process Sandbox ``` Chrome Architecture: ──────────────────── ┌───────────────────────────────────────────────┐ │ Browser Process │ │ - Runs with full privileges │ │ - Manages windows, tabs, plugins │ │ - Opens files, sockets on behalf of renderers│ │ - Passes FDs via SCM_RIGHTS │ └────────────┬──────────────────────────────────┘ │ ┌───────┼───────┬──────────┐ │ │ │ │ ┌────▼─────┐ │ ┌────▼─────┐ ┌▼──────────┐ │ Renderer │ │ │ Renderer │ │ GPU │ │ (Tab 1) │ │ │ (Tab 2) │ │ Process │ │ │ │ │ │ │ │ │ Sandbox: │ │ │ Sandbox: │ │ Sandbox: │ │ - seccomp│ │ │ - seccomp│ │ - seccomp │ │ - No FS │ │ │ - No FS │ │ - Limited │ │ - No net │ │ │ - No net │ │ access │ └──────────┘ │ └──────────┘ └───────────┘ │ ┌────▼─────┐ │ Plugin │ │ Process │ │ (Flash) │ │ Sandbox │ └──────────┘ Sandbox Restrictions (Linux): 1. Seccomp-BPF: Allow only ~30 syscalls 2. Namespaces: PID, NET, IPC isolation 3. chroot: Fake root filesystem 4. No setuid/setgid 5. No capabilities 6. Read-only /proc, /sys ``` **Chrome Sandbox Code** (simplified from sandbox/linux/): ```c theme={null} // Renderer process startup void RendererMain() { // 1. Drop all capabilities drop_all_capabilities(); // 2. Enter namespaces unshare(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWIPC); // 3. chroot to empty directory chroot("/var/empty"); chdir("/"); // 4. 
    // Drop privileges: group ID first, because setgid() fails once the UID is unprivileged
    setgid(nobody_gid);
    setuid(nobody_uid);

    // 5. Enable no_new_privs (required before an unprivileged process may load a seccomp filter)
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

    // 6. Install seccomp filter last, so the setup calls above are not blocked by it
    install_renderer_seccomp_filter();

    // 7. Run renderer
    RunRendererLoop();
}

void install_renderer_seccomp_filter() {
    // Allow only essential syscalls
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

    // Read/write/close
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);

    // Memory management
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);

    // IPC (to browser process)
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(recvmsg), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sendmsg), 0);

    // Clean exit and signal return must always stay allowed
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);

    // DENY: open, socket, execve, fork, etc.

    seccomp_load(ctx);
    seccomp_release(ctx);
}
```

**Escape Detection** (from browser process):

```c theme={null}
// Browser monitors renderer health
// Note: the tracer only sees PTRACE_EVENT_SECCOMP for syscalls whose filter
// action is SCMP_ACT_TRACE; SCMP_ACT_KILL terminates the renderer directly.
void MonitorRenderer(int renderer_pid) {
    // Check if renderer tries forbidden syscalls
    ptrace(PTRACE_SEIZE, renderer_pid, NULL, PTRACE_O_TRACESECCOMP);

    while (1) {
        int status;
        if (waitpid(renderer_pid, &status, 0) == -1)
            return;  // Renderer is gone
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return;  // Renderer terminated on its own

        if (WIFSTOPPED(status) &&
            status >> 8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8))) {
            // Seccomp violation detected!
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, renderer_pid, NULL, &regs);

            long syscall_nr = regs.orig_rax;
            log_security_violation(renderer_pid, syscall_nr);

            // Kill malicious renderer
            kill(renderer_pid, SIGKILL);
            respawn_renderer();
            return;
        }
    }
}
```

***

## 6.5 Production Caveats and Common Pitfalls

Linux security primitives are individually well-designed. The failure mode is composition: each primitive looks correct in isolation, then a subtle interaction between their assumptions creates an escape route. Below are four traps that bite even experienced security engineers, paired with the patterns that close them.
**Pitfall 1: Reaching for setuid when modern Linux wants fine-grained capabilities**

The historical Unix model is "either you are root or you are not." A setuid binary runs with the file owner's privileges -- almost always root -- which means a single bug in `ping` or `mount` or `passwd` is a path to total compromise. CVE history is full of escalations through privileged binaries and file attributes: `pkexec` (CVE-2021-4034 PwnKit), `sudo` (CVE-2021-3156 Baron Samedit), Ubuntu's OverlayFS (CVE-2023-2640). The trap is that engineers reach for setuid out of habit because "I just need to bind to port 80" or "I need to read this hardware register," when modern Linux offers a far narrower grant.

The other side of the same trap: people use Docker's `--privileged` flag because they ran into a permission error and wanted to make it go away. `--privileged` gives the container all capabilities, exposes the host's devices, and turns off the default seccomp and AppArmor/SELinux confinement. The namespaces are still there, but with that much power inside them they stop being a boundary. It is the Docker equivalent of `chmod 777`.

**Solution: file capabilities and ambient capabilities**

Linux capabilities split root's powers into about 40 fine-grained privileges. Bind to low ports? `CAP_NET_BIND_SERVICE`. Send raw packets? `CAP_NET_RAW`. Trace or inspect arbitrary processes? `CAP_SYS_PTRACE`. Grant only what the binary actually needs:

```bash theme={null}
# Old way: setuid root
chmod u+s /usr/bin/myserver

# Modern way: file capability
setcap cap_net_bind_service=+ep /usr/bin/myserver

# Verify
getcap /usr/bin/myserver

# Run as a normal user; bind succeeds, nothing else is privileged
```

For containers, drop everything and add back what you need:

```yaml theme={null}
# Kubernetes container securityContext
securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]
  runAsNonRoot: true
  readOnlyRootFilesystem: true
```

The mental model: capabilities are the principle of least privilege made concrete. Anything you cannot justify by name should not be in the bag.
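One way to confirm that a process holds only the capability it was granted is to decode the `CapEff` line of `/proc/self/status`. The following is a minimal sketch, not a replacement for `libcap`: the helper names (`parse_cap_eff`, `cap_mask_has`, `read_own_cap_eff`) and the `SKETCH_CAP_*` constants are invented for this example, and the bit numbers are assumed to match `<linux/capability.h>` (10, 13, and 21).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Bit positions as defined in <linux/capability.h>; names are local to this sketch. */
enum {
    SKETCH_CAP_NET_BIND_SERVICE = 10,
    SKETCH_CAP_NET_RAW          = 13,
    SKETCH_CAP_SYS_ADMIN        = 21
};

/* Extract the CapEff bitmask from text in /proc/<pid>/status format.
 * Returns true on success and stores the mask in *mask. */
bool parse_cap_eff(const char *status_text, uint64_t *mask) {
    assert(status_text != NULL && mask != NULL);
    const char *label = "CapEff:";
    const char *line = strstr(status_text, label);
    if (line == NULL)
        return false;
    const char *digits = line + strlen(label);
    char *end = NULL;
    unsigned long long value = strtoull(digits, &end, 16);
    if (end == digits)
        return false;              /* no hex digits after the label */
    *mask = (uint64_t)value;
    return true;
}

/* True if capability number `cap` is set in `mask`. */
bool cap_mask_has(uint64_t mask, int cap) {
    return cap >= 0 && cap < 64 && ((mask >> cap) & 1u) != 0;
}

/* Read the calling process's own effective capability mask. */
bool read_own_cap_eff(uint64_t *mask) {
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return false;
    char buf[4096];
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_cap_eff(buf, mask);
}
```

A server started from a binary carrying only `cap_net_bind_service` should report a mask of `0x400` (bit 10 alone); any additional bit is a grant that needs a justification.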
**Pitfall 2: seccomp filter holes -- library calls that transitively reach forbidden syscalls**

Engineers reach for seccomp profiles assuming the syscalls their own code makes are the whole list. They are not. A library function you call can issue syscalls you blocked.

Classic example: you block `mprotect` because you do not want anyone changing page permissions. But the dynamic loader calls `mprotect` itself: to apply RELRO after relocation, and whenever it maps a library at run time -- which glibc does lazily for NSS lookups (`getaddrinfo`, `getpwnam`) and `iconv`. Block `mprotect` and the process is killed with `SIGSYS` the first time one of those paths runs.

The general pattern: glibc and the dynamic loader have invisible dependencies on `mmap`, `mprotect`, `arch_prctl`, `sigaltstack`, `prctl`, `rseq`, and others. A "minimal" syscall allowlist generated by `strace` on a happy-path test can miss some of these because they only fire on certain code paths -- error handling, signal delivery, malloc growth, TLS allocation. The application crashes hours into production with `SIGSYS`.

**Solution: build seccomp profiles iteratively, log first, kill later**

The first profile you deploy should be `SCMP_ACT_LOG` (audit-only). Run the application under realistic load -- including failure scenarios -- and watch `/var/log/audit/audit.log` for `SECCOMP` events. Add anything legitimate to the allowlist. Only after a stable observation window do you flip to `SCMP_ACT_KILL_PROCESS`.

```c theme={null}
// Phase 1: log mode -- production observation
seccomp_init(SCMP_ACT_LOG);

// Phase 2: enforce, but with structured fallback
seccomp_init(SCMP_ACT_ERRNO(EPERM));  // return EPERM to userspace
// gives the app a chance to log and fail gracefully

// Phase 3: hard enforcement
seccomp_init(SCMP_ACT_KILL_PROCESS);
```

For containers, do not write a seccomp profile from scratch. Start from the runtime default (`docker/default`, or `RuntimeDefault` in current Kubernetes), audit it for your workload, and tighten.
Tools that record the syscalls of a running workload, such as `oci-seccomp-bpf-hook` and the Kubernetes Security Profiles Operator, help generate realistic profiles from real workloads.

Practical rule: if a syscall is required for crash handling (`rt_sigreturn`, `restart_syscall`, `exit`, `exit_group`), it stays unconditionally. Locking these out turns recoverable errors into `SIGSYS` kills or hung processes.

**Pitfall 3: ASLR with insufficient entropy -- 32-bit and PIE-disabled binaries**

ASLR works by randomizing base addresses. The strength is set by the entropy in those addresses. On 64-bit systems, libraries get \~28-30 bits of randomization, which makes brute force impractical. On 32-bit systems, you have at most 16 bits of entropy for shared libraries -- and given typical alignment and layout constraints, often closer to 8-12 effective bits. That is 256 to 4096 guesses to defeat. A network-facing service that survives crashes (forking server, supervisor that restarts) gives an attacker effectively unlimited attempts.

Worse, ASLR only randomizes binaries that opted in. If a binary is built without `-fPIE -pie`, its `.text` section sits at a fixed address regardless of ASLR settings. CVE history is rich with examples: Apache modules built without PIE on RHEL, vendor binaries shipped without ASLR-aware compilation flags, JIT-compiled code regions that the runtime maps at deterministic addresses.

**Solution: enforce 64-bit, PIE, and full RELRO at the toolchain level**

```bash theme={null}
# Compiler flags for ASLR-aware binaries
# (-O2 is needed because _FORTIFY_SOURCE only takes effect with optimization)
gcc -O2 -fPIE -pie -Wl,-z,relro,-z,now -fstack-protector-strong \
    -D_FORTIFY_SOURCE=2 myapp.c -o myapp

# Verify the result
checksec --file=./myapp
# Expected: PIE enabled, Full RELRO, NX enabled, Canary found, FORTIFY enabled
```

Audit your fleet with `checksec` or `hardening-check` across every shipped binary. Treat any binary without PIE as a finding.

For 32-bit -- the honest answer in 2026 is "do not ship 32-bit network services."
If you have legacy 32-bit binaries that must remain, run them inside a stricter sandbox: gVisor, Firecracker, or at minimum a dedicated user namespace with no network capabilities. Address randomization alone cannot defend a 32-bit address space against a determined attacker.

Modern bonus: enable `-fcf-protection=full` on x86 to opt into Intel CET (Indirect Branch Tracking and Shadow Stack), and `-mbranch-protection=standard` on AArch64 for PAC and BTI. These are the hardware-supported successors to ASLR-only defenses.

**Pitfall 4: namespace escape via /proc/self vs procfs assumptions**

User namespaces let unprivileged users gain capabilities scoped to a new namespace. The classic attack pattern: create a user namespace, become "root" inside it, then exploit a kernel bug that does not properly check whether your capability is namespaced or global. Older kernels had many capability checks that were not namespace-aware, leading to escapes via `mount`, `keyctl`, and `bpf`.

The procfs version of the same trap: `/proc/self` resolves relative to the *kernel's view* of the calling process, which can differ from the namespace's view in subtle ways. A container that mounts `/proc` from the host (rather than its own private procfs) leaks information about every process on the host, and `/proc/self/exe` and `/proc/self/root` can be used to bypass chroot in some configurations. Worse, `/proc/<pid>/setgroups`, `/proc/<pid>/uid_map`, and `/proc/<pid>/gid_map` are the gatekeepers for user namespace permissions -- a misconfigured container that allows write access to these can be escaped from.

In 2019, runc had CVE-2019-5736 -- a container could overwrite the host's runc binary by exploiting the way `/proc/self/exe` resolved at exec time. The fix was substantial: runc now copies its own binary into a memfd and re-execs from there.
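The memfd idea behind that fix can be sketched in a few lines. This is a simplified illustration under stated assumptions, not runc's actual code: the function name `clone_exe_to_memfd` and the label `"sealed-exe"` are invented for the example, `EINTR` handling is omitted, and it assumes Linux 3.17+ with glibc 2.27+ for `memfd_create`. It copies a binary into an anonymous memory file and then seals it so the bytes can no longer be changed through any descriptor.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy the file at `path` into a sealed, memory-backed fd.
 * Returns the fd on success, -1 on failure. */
int clone_exe_to_memfd(const char *path) {
    assert(path != NULL);

    int src = open(path, O_RDONLY | O_CLOEXEC);
    if (src < 0)
        return -1;

    /* MFD_ALLOW_SEALING is required, otherwise F_ADD_SEALS is refused. */
    int mfd = memfd_create("sealed-exe", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (mfd < 0) {
        close(src);
        return -1;
    }

    char buf[65536];
    ssize_t n;
    while ((n = read(src, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;
        while (off < n) {            /* handle short writes */
            ssize_t w = write(mfd, buf + off, (size_t)(n - off));
            if (w < 0) {
                close(src);
                close(mfd);
                return -1;
            }
            off += w;
        }
    }
    close(src);
    if (n < 0) {
        close(mfd);
        return -1;
    }

    /* Freeze size and contents, then forbid adding or removing seals. */
    int seals = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL;
    if (fcntl(mfd, F_ADD_SEALS, seals) < 0) {
        close(mfd);
        return -1;
    }
    return mfd;
}
```

A runtime following this pattern would then execute the sealed copy, for example with `fexecve(mfd, argv, envp)`, so a process inside the container that reaches the running binary through `/proc/<pid>/exe` only ever sees the sealed copy, not the file on the host.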
**Solution: defense in depth around procfs, plus user-namespace guardrails**

```bash theme={null}
# Mount /proc with hidepid -- containers cannot see other processes
mount -t proc -o hidepid=2 proc /proc

# Disable unprivileged user namespaces if you do not use them
# (Debian/Ubuntu-specific sysctl)
sysctl -w kernel.unprivileged_userns_clone=0
# (upstream kernels: cap the number of user namespaces at zero)
sysctl -w user.max_user_namespaces=0

# In containers, mount /proc as a fresh procfs scoped to the namespace
# Most container runtimes do this; verify with /proc/self/mountinfo
```

For container runtimes, follow the runc-CVE-2019-5736 lesson: never re-exec from a path the sandbox can write to. Modern runtimes use `memfd_create` plus `execveat` to load the runtime binary from a memory-backed fd that no namespaced process can touch.

Auditing approach: enumerate every path in `/proc` your container can read or write, and for each, ask "what does this give an attacker if they can write arbitrary bytes here?" The answers are sometimes scary -- `/proc/sys/kernel/core_pattern` historically allowed pipe-to-program syntax, which let containers execute host commands by triggering a core dump. CVE-2022-0492 exploited the same class of bug through the cgroup v1 `release_agent` file.

Stronger pattern: use rootless containers (Podman's default, Docker's optional mode) so that root inside the container maps to an unprivileged user on the host. The escape primitives that need real `CAP_SYS_ADMIN` or host root simply do not apply.

***

## 7. Interview Questions & Answers

**NX (No-Execute) / DEP (Data Execution Prevention)** uses the CPU's NX bit in page table entries.

**Page Table Entry Structure** (x86-64):

* Bit 63: NX (No-Execute) bit
* When set: Page cannot be executed (will fault with #PF if IP points here)
* When clear: Page is executable

**Kernel Implementation**:

```c theme={null}
// When mapping stack
vma->vm_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
// NO VM_EXEC flag!

// Page table entry will have NX bit SET:
// vm_page_prot is derived from vm_flags, and without VM_EXEC it includes NX
pte = pfn_pte(pfn, vma->vm_page_prot);
set_pte(pte_addr, pte);
```

**Protection**:

1. Attacker overflows buffer on stack
2. Injects shellcode
3.
Overwrites return address to point to shellcode 4. Function returns, jumps to shellcode address 5. **CPU checks NX bit** → Page is not executable 6. **#PF (Page Fault)** → Kernel kills process **W^X Policy**: Page is writable OR executable, never both. * Stack/Heap: Writable, NOT executable * Code: Executable, NOT writable * Prevents: Code injection attacks **Bypass**: Return-Oriented Programming (ROP) - reuse existing executable code instead of injecting new code. **ASLR (Address Space Layout Randomization)** randomizes memory layout at program start. **Randomized Regions**: * Stack base address * Heap base address * Libraries (libc, etc.) * Executable base (if PIE - Position Independent Executable) * vDSO, vvar **Entropy** (x86-64 Linux): * Stack: 19 bits → 524,288 possible positions * Heap: 13 bits → 8,192 possible positions * Libraries: 28 bits → 268 million possible positions * PIE executable: 28 bits → 268 million possible positions **How It Prevents Exploitation**: Traditional exploit (no ASLR): ``` Attacker knows: libc is at 0x7ffff7a0d000 Attacker's ROP chain: return to 0x7ffff7a52390 (system) argument: 0x7ffff7b99d88 ("/bin/sh") Works every time! ``` With ASLR: ``` Run 1: libc at 0x7f8a2e456000 Run 2: libc at 0x7f3c81de2000 Run 3: libc at 0x7fb1c9a2f000 Attacker's hardcoded addresses: WRONG! Exploit crashes instead of succeeding ``` **Weaknesses**: 1. **Information Leak**: * Pointer disclosure → calculate base addresses → bypass ASLR * Format string bugs, memory corruption leaks 2. **Entropy Limitations**: * 13 bits (heap) = 8,192 attempts * If process doesn't crash (fork server), brute-forceable 3. **32-bit Systems**: * Limited address space → low entropy * 8 bits library randomization → 256 attempts 4. **Non-PIE Executables**: * Main executable at fixed address * Contains ROP gadgets at known addresses 5. 
**Cache Timing Attacks**:
   * Side-channel attacks can determine addresses

**Mitigations for Weaknesses**:

* Use PIE (Position Independent Executable)
* Fix information leaks
* Do not hand attackers unlimited tries: re-exec workers instead of only forking, so each one gets a fresh layout
* Use Control Flow Integrity (CFI)
* Combine with other defenses (NX, stack canaries)

**Stack Canary**: Random value placed between local variables and return address.

**Mechanism**:

```
Stack Frame:
┌──────────────────┐ High address
│ Return Address   │ ← Protected
├──────────────────┤
│ Saved RBP        │
├──────────────────┤
│ CANARY (random)  │ ← Copy of the guard value (fs:0x28 in TLS on x86-64)
├──────────────────┤
│ Local vars       │
│ char buf[100]    │ ← Overflow starts here
└──────────────────┘ Low address

Function Prologue:
  mov rax, fs:0x28        ; Load canary from TLS
  mov [rbp-8], rax        ; Store on stack

Function Epilogue:
  mov rax, [rbp-8]        ; Load stack canary
  xor rax, fs:0x28        ; Compare with original
  je .L_OK                ; Match? OK
  call __stack_chk_fail   ; Mismatch? ABORT
.L_OK:
  ret
```

**Detection**:

1. Buffer overflow overwrites local variables
2. Overflow continues, overwrites canary
3. Function reaches its epilogue
4. Compiler-inserted epilogue code checks: stack canary == reference value at `fs:0x28`?
5. Mismatch → `__stack_chk_fail()` reports "stack smashing detected" and aborts, before the corrupted return address is used

**Bypass Techniques**:

**1. Leak Canary**:

```c theme={null}
// Format string vulnerability
printf(user_input);  // User provides: "%p %p %p ..."
// Leaks stack contents, including canary!

// Attacker:
// 1. Leak canary value
// 2. Craft overflow to include correct canary value
// 3. Overflow succeeds without detection
```

**2. Overwrite Pointer Before Canary**:

```c theme={null}
// Illustrative stack layout, low to high addresses
// (GCC's stack protector normally places arrays next to the canary
//  to block this; it still applies to structs and unprotected builds)
char buf[100];
char *ptr = &authorized;
unsigned long canary;
void *return_address;

// Overflow overwrites ptr but not canary
strcpy(buf, malicious_input);  // Overflow only buf and ptr
// ptr now points to attacker-controlled memory
// Canary intact → No detection!
```

**3.
Fork Without Re-randomization** (rare): ```c theme={null} // Parent forks children with same canary while (1) { if (fork() == 0) { handle_request(); // Sandbox child exit(0); } } // Attacker brute-forces canary byte-by-byte // Try 0x00, 0x01, 0x02, ... 0xFF for first byte // If child crashes: wrong guess // If child doesn't crash: correct! Move to next byte // 8 bytes × 256 attempts = 2,048 attempts max ``` **4. Partial Overflow**: ```c theme={null} // Overflow only return address, not canary // (Requires knowledge of stack layout) ┌──────────────┐ │ Ret Addr │ ← Overflow 1 byte (change low byte only) ├──────────────┤ │ Saved RBP │ ← Skip ├──────────────┤ │ Canary │ ← Leave untouched! ├──────────────┤ │ buf[100] │ └──────────────┘ // Careful overflow changes return address without touching canary ``` **Mitigations**: * Combine with ASLR (randomize canary address) * Use fortified functions (\_strcpy\_chk) to prevent overflows * Re-randomize canary after fork * Stack Clash protection (prevent jumping over canary) **Traditional setuid**: ```bash theme={null} # setuid binary runs with owner's privileges ls -l /usr/bin/passwd # -rwsr-xr-x root root /usr/bin/passwd # ↑ setuid bit # When user runs passwd: # 1. Process starts with user's UID # 2. Kernel sees setuid bit # 3. Sets effective UID to file owner (root) # 4. Process has FULL root privileges # Problem: All or nothing! # passwd only needs to write /etc/shadow # But gets ALL root capabilities ``` **Capabilities**: ``` Divide root into 41 distinct privileges: CAP_CHOWN - Change file ownership CAP_DAC_OVERRIDE - Bypass file permissions CAP_NET_BIND_SERVICE - Bind ports < 1024 CAP_NET_RAW - Use raw sockets CAP_SYS_ADMIN - System administration CAP_SYS_MODULE - Load kernel modules ... 35 more ... Process gets ONLY what it needs! 
``` **Comparison**: | Feature | setuid | Capabilities | | ---------------- | -------------------------------------- | ------------------------------ | | **Granularity** | All or nothing | Fine-grained (41 capabilities) | | **Security** | Over-privileged | Least privilege | | **Persistence** | Lost on exec (unless binary is setuid) | Can be inherited | | **Auditability** | Hard to see why root is needed | Clear which capability is used | **Example: Network Server**: ```c theme={null} // Old way: setuid root binary int main() { // Running as root (UID 0) // Can do ANYTHING! int sock = socket(AF_INET, SOCK_STREAM, 0); bind(sock, ..., 80); // Bind port 80 (needs root) setuid(nobody); // Drop privileges after bind // Problem: Race window while root // If exploit before setuid(), full root access! } // New way: Capabilities int main() { // Running as nobody (UID 65534) // Has ONLY CAP_NET_BIND_SERVICE int sock = socket(AF_INET, SOCK_STREAM, 0); bind(sock, ..., 80); // Works! (has capability) open("/etc/shadow", O_RDONLY); // FAIL! (no CAP_DAC_OVERRIDE) // Even if exploited, attacker only has port binding // Cannot read files, cannot exec as root, etc. } ``` **Setting Capabilities**: ```bash theme={null} # Give binary capability instead of setuid # Before: chmod u+s /usr/bin/ping # setuid root (dangerous!) # After: setcap cap_net_raw+ep /usr/bin/ping # Only raw socket capability # Verify getcap /usr/bin/ping # /usr/bin/ping = cap_net_raw+ep # Now ping can create raw sockets but has NO other root powers ``` **Why Capabilities Are Better**: 1. **Principle of Least Privilege**: Only grant necessary permissions 2. **Reduced Attack Surface**: Exploit gets limited capabilities, not full root 3. **Better Auditability**: Clear why each capability is needed 4. **Flexibility**: Can grant to non-root users 5. 
**Inheritance**: Can design capability-aware services **Real-World Usage**: * systemd services with capabilities * Docker containers (run as non-root with specific capabilities) * Network daemons (CAP\_NET\_BIND\_SERVICE instead of setuid) **Meltdown Vulnerability**: ``` CPU speculatively executes kernel memory access from user mode: // User-mode code char kernel_byte = *(char *)0xffff880000000000; // Kernel address // CPU behavior: // 1. Starts speculative execution before permission check // 2. Loads kernel memory (should fault, but hasn't checked yet) // 3. Uses loaded byte to index array: probe[kernel_byte * 4096] // 4. This brings probe[...] into cache ← SIDE EFFECT! // 5. Permission check completes → Exception! // 6. Architectural state rolled back // 7. But cache state remains! ← LEAK! // Attacker measures cache timing → recovers kernel_byte ``` **KPTI (Kernel Page Table Isolation)** Solution: ``` Without KPTI (vulnerable): ┌─────────────────────────────┐ │ User Mode (CR3 = user_pgd) │ │ │ │ User virtual addresses │ │ 0x0 - 0x7fffffffffff │ │ │ │ │ ├─> User pages │ │ │ │ │ Kernel virtual addresses │ │ 0xffff800000000000 - ... │ ← Mapped in user page tables! │ │ │ ← Meltdown can speculatively access │ ├─> Kernel pages │ └─────────────────────────────┘ With KPTI (secure): User Mode: ┌─────────────────────────────┐ │ CR3 = user_pgd │ │ │ │ User virtual addresses │ │ 0x0 - 0x7fffffffffff │ │ ├─> User pages │ │ │ │ Kernel virtual addresses │ │ 0xffff800000000000 - ... │ │ ├─> MINIMAL kernel stubs │ ← Only entry/exit trampolines! 
│  │   (entry_SYSCALL_64)     │ ← Rest of kernel NOT MAPPED
└─────────────────────────────┘

Kernel Mode (after syscall):
┌─────────────────────────────┐
│ CR3 = kernel_pgd            │
│                             │
│  User virtual addresses     │
│  ├─> User pages             │
│                             │
│  Kernel virtual addresses   │
│  ├─> FULL kernel mapping    │ ← All kernel code/data accessible
└─────────────────────────────┘
```

**Syscall Flow with KPTI**:

```asm theme={null}
; User-mode application
mov rax, 1          ; SYS_write
mov rdi, 1          ; fd = stdout
syscall             ; Enter kernel

; ← CPU switches to kernel mode ←

entry_SYSCALL_64:
    ; Still using user page tables!
    swapgs                  ; Swap GS (kernel per-CPU data)

    ; SWITCH PAGE TABLES (the expensive part!)
    ; The user PGD sits 4 KB above the kernel PGD, so bit 12 of CR3 selects it.
    ; (Real code uses a scratch register so the syscall number in rax survives.)
    mov rax, CR3            ; Read current CR3 (user page table)
    and rax, ~0x1000        ; Clear bit 12 → kernel page tables
    mov CR3, rax            ; ← PAGE TABLE SWITCH (TLB flush without PCID)

    ; Now kernel is fully mapped
    ; Execute syscall handler...
    call do_syscall_64

    ; SWITCH BACK to user page tables
    mov rax, CR3
    or rax, 0x1000          ; Set bit 12 → user page tables
    mov CR3, rax            ; ← PAGE TABLE SWITCH (TLB flush without PCID)

    swapgs
    sysretq                 ; Return to user mode
```

**Performance Cost**:

**What makes it expensive**:

1. **CR3 Write** (page table switch):
   * \~150-300 CPU cycles per switch
   * 2 switches per syscall (enter + exit)

2. **TLB Flush**:
   * Translation Lookaside Buffer caches virtual→physical address translations
   * Without PCID, changing CR3 flushes the TLB (translations must be reloaded from memory)
   * Each TLB miss then costs a page-table walk, on the order of \~100 cycles

3. **Frequency of Syscalls**:
   * I/O-heavy workloads: Many syscalls → high overhead
   * CPU-bound workloads: Few syscalls → low overhead

**Measured Impact** (varies by workload):

| Workload Type                        | Performance Loss |
| ------------------------------------ | ---------------- |
| CPU-intensive (scientific computing) | 0-3%             |
| Light I/O (web browsing)             | 3-5%             |
| Heavy I/O (file server)              | 5-10%            |
| Heavy syscalls (database, Redis)     | 10-30%           |

**Optimizations**:

1. **PCID (Process Context ID)**:
   * Tag TLB entries with PCID
   * Avoid full TLB flush on CR3 switch
   * Reduces overhead to 1-5%

2.
**Lazy TLB Switching**: * Kernel threads don't switch page tables * Reuse previous user's kernel mapping 3. **CPU Microcode Updates**: * Intel CPUs without Meltdown bug → no KPTI needed * Check: `cat /sys/devices/system/cpu/vulnerabilities/meltdown` * If says "Not affected" → KPTI not active **Disable KPTI** (for testing/benchmarking only!): ```bash theme={null} # Boot parameter nopti # Or runtime (requires recompiled kernel) echo 0 > /sys/kernel/debug/x86/pti_enabled # WARNING: Disabling KPTI leaves system vulnerable to Meltdown! ``` **Spectre Vulnerability** (Branch Target Injection): **CPU Speculative Execution**: ``` // Victim code if (x < array_size) { // ← Branch y = array[x]; // ← Speculative execution } CPU's Branch Predictor: - Predicts if branch will be taken or not - Speculatively executes ahead while check happens - If prediction wrong, rollback - If prediction right, save time! Problem: Rollback discards architectural state but NOT cache state! ``` **Attack**: ```c theme={null} // Step 1: Train branch predictor for (int i = 0; i < 1000; i++) { victim_function(valid_x); // x < array_size, branch TAKEN } // Branch predictor learns: "This branch is ALWAYS taken" // Step 2: Prepare side-channel for (int i = 0; i < 256; i++) { clflush(&probe_array[i * 4096]); // Flush cache } // Step 3: Attack with out-of-bounds x victim_function(malicious_x); // malicious_x >= array_size // What happens: // 1. Branch predictor predicts: TAKEN (based on training) // 2. CPU speculatively executes: y = array[malicious_x] // 3. This accesses out-of-bounds memory (kernel memory!) // 4. Uses leaked byte to index: probe_array[y * 4096] // 5. This brings probe_array[y * 4096] into cache ← LEAK! // 6. Branch check completes: x < array_size? FALSE // 7. Rollback! Discard y, but cache state remains! 
// Step 4: Recover leaked byte via timing
for (int i = 0; i < 256; i++) {
    uint64_t t0 = rdtsc();
    volatile uint8_t temp = probe_array[i * 4096];
    uint64_t t1 = rdtsc();
    if (t1 - t0 < THRESHOLD) {
        printf("Leaked byte: 0x%02x\n", i);  // Cache hit!
        break;
    }
}

// Result: Read memory the victim can access but the attacker should not see!
```

**Why Retpolines Work**:

**Problem with Indirect Branches** (Spectre Variant 2, Branch Target Injection):

```asm theme={null}
; Vulnerable indirect jump
jmp *%rax           ; Jump to address in rax

; Attacker can manipulate branch predictor to:
; 1. Predict wrong target
; 2. Cause speculative execution to gadget
; 3. Leak data via cache side-channel
```

**Retpoline (Return Trampoline)**:

```asm theme={null}
; Instead of:
jmp *%rax

; Use:
    call .set_target        ; Push address of .capture_spec as the return address
.capture_spec:              ; Only ever reached speculatively
    pause                   ; Spin-loop hint (keeps the trap loop cheap)
    lfence                  ; Stops speculation from running ahead
    jmp .capture_spec       ; Infinite loop (never executed architecturally)
.set_target:
    mov %rax, (%rsp)        ; Replace return address with rax
    ret                     ; Return to rax

; Why this is safe:
; CPU's Return Stack Buffer (RSB):
; - Separate predictor for RET instructions
; - Tracks call/return pairs
; - Not affected by poisoning of the indirect branch predictor (BTB)

; When ret executes:
; - CPU predicts target from RSB
; - RSB says: return to .capture_spec (the address the call pushed)
; - Speculative execution goes to .capture_spec
; - NOT to attacker-controlled address!
; - Architecturally, ret goes to the real target in rax
```

**Visual Comparison**:

```
Traditional Indirect Jump (vulnerable):
┌──────────────────┐
│ jmp *rax         │ → Branch predictor → Attacker controls prediction
└──────────────────┘
          ↓
  Speculative execution to gadget
          ↓
  Leak via cache timing

Retpoline (safe):
┌──────────────────┐
│ call .set_target │ → Push addr of .capture_spec on stack
│ .set_target:     │
│ mov rax,(SP)     │ → Replace return addr with rax
│ ret              │ → RSB predicts return to .capture_spec
└──────────────────┘   (NOT attacker-controlled!)
          ↓
.capture_spec:
    pause
    lfence
    jmp .capture_spec   ← Speculation contained in loop
                        ← No leak possible!
```

**Kernel Implementation** (simplified):

```c theme={null}
// Compiler generates retpolines for indirect branches
// gcc -mindirect-branch=thunk-extern

// Original code:
void (*func_ptr)(void);
func_ptr();  // Indirect call

// Compiled without retpoline:
call *%rax

// Compiled with retpoline:
call __x86_indirect_thunk_rax

// Retpoline thunk (arch/x86/lib/retpoline.S):
SYM_FUNC_START(__x86_indirect_thunk_rax)
    JMP_NOSPEC %rax
SYM_FUNC_END(__x86_indirect_thunk_rax)

// Simplified expansion of the thunk body
#define JMP_NOSPEC(reg)             \
    call .Ldo_rop_##reg;            \
.Lspec_trap_##reg:                  \
    pause;                          \
    lfence;                         \
    jmp .Lspec_trap_##reg;          \
.Ldo_rop_##reg:                     \
    mov %reg, (%rsp);               \
    ret
```

**Performance Impact**:

* Retpolines are **slower** than plain indirect branches (overheads of roughly 5-20% on indirect-branch-heavy workloads)
* But necessary for security on vulnerable CPUs
* Modern CPUs have hardware mitigations (enhanced IBRS, where IBRS stands for Indirect Branch Restricted Speculation)

**Check Mitigations**:

```bash theme={null}
cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
# Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, RSB filling

# Retpoline: Software mitigation (compiler-generated)
# IBRS: Hardware mitigation (CPU feature)
# IBPB: Indirect Branch Predictor Barrier (flush predictor)
# RSB filling: Prevent RSB underflow attacks
```

**Why Effective**:

1. **Return instructions are different**: They are predicted from the RSB, which BTB poisoning does not influence
2. **Speculation contained**: Loop prevents speculative execution reaching gadgets
3. **No hardware support needed**: It is a pure software mitigation, although CPUs that fall back to the BTB when the RSB underflows also need RSB filling
4. **Comprehensive**: Protects every indirect branch in code compiled with retpolines

**Limitations**:

* Performance overhead (modern CPUs use enhanced IBRS instead)
* Doesn't protect against Spectre v1 (bounds check bypass)
* Doesn't protect against other speculative execution attacks (L1TF, MDS, etc.)
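Since retpolines leave Spectre v1 open, bounds-checked array accesses need their own treatment. The usual software approach is index masking: clamp the index with a branch-free mask so that a mispredicted bounds check can only read element 0. The sketch below is modeled on the generic mask used by the Linux kernel's `array_index_nospec()`; the names `index_mask_nospec` and `read_clamped` are made up for this illustration, and the inline-asm barrier assumes GCC or Clang.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* All-ones when index < size, zero otherwise, computed without a branch.
 * Assumes size <= LONG_MAX and an arithmetic right shift of negative
 * values (true for GCC and Clang; implementation-defined in ISO C). */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    const unsigned int bits = sizeof(long) * CHAR_BIT;

    /* Hide the value from the optimizer so the mask is not folded away
     * when the caller has already compared index against size. */
    __asm__ volatile("" : "+r"(index));

    return (unsigned long)(~(long)(index | (size - 1UL - index)) >> (bits - 1));
}

/* Bounds-checked read. Even if the branch is mispredicted, the masked
 * index is 0, so speculation cannot reach out-of-bounds memory. */
static uint8_t read_clamped(const uint8_t *array, size_t size, size_t index)
{
    if (index < size) {
        index &= (size_t)index_mask_nospec(index, size);
        return array[index];
    }
    return 0;
}
```

The mask is derived from data rather than from the branch outcome, so the CPU has nothing to mispredict: the AND is a data dependency that must resolve before the load address is known.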
**Fundamental Difference**: **SELinux**: Label-based MAC ``` Security Context: user:role:type:level Files: httpd_sys_content_t Process: httpd_t Rule: allow httpd_t httpd_sys_content_t:file { read open }; ─────────────── ────────────────── ──── ──────────── Subject Object Class Permissions Decision: Based on labels (NOT paths) ``` **AppArmor**: Path-based MAC ``` Profile: /usr/sbin/nginx { /etc/nginx/** r, /var/www/** r, /var/log/nginx/** rw, deny /etc/shadow r, } Decision: Based on absolute filesystem paths ``` *** **Detailed Comparison**: **1. Security Model**: **SELinux**: * Type Enforcement (TE): Subjects (processes) have types, objects (files) have types * Multi-Level Security (MLS): Confidentiality levels (Top Secret, Secret, etc.) * Multi-Category Security (MCS): Categories for compartmentalization * Very fine-grained control **AppArmor**: * Path-based access control * Capabilities control * Network access control (protocol/address) * Simpler model, easier to understand **2. Complexity**: **SELinux**: ```bash theme={null} # Policy is complex # Example policy snippet: allow httpd_t httpd_sys_content_t:file { getattr read open }; allow httpd_t http_port_t:tcp_socket { bind listen }; allow httpd_t httpd_log_t:file { write append create }; allow httpd_t proc_t:file read; allow httpd_t self:capability { setgid setuid }; # Hundreds of rules per service! # Requires understanding of: # - Type enforcement # - Security contexts # - Policy language # - Domain transitions ``` **AppArmor**: ```bash theme={null} # Policy is readable /usr/sbin/nginx { #include #include capability dac_override, capability net_bind_service, capability setgid, capability setuid, /etc/nginx/** r, /var/log/nginx/** rw, /var/www/** r, network inet stream, } # Human-readable! # Easy to audit ``` **3. 
Administration**:

| Task                | SELinux                          | AppArmor               |
| ------------------- | -------------------------------- | ---------------------- |
| **Create policy**   | Complex (audit2allow helps)      | Simple (aa-genprof)    |
| **Debug denials**   | ausearch, sealert                | aa-logprof, dmesg      |
| **Enable/Disable**  | setenforce                       | aa-enforce/aa-complain |
| **View status**     | sestatus, getenforce             | aa-status              |
| **Temporary allow** | setsebool, `semanage permissive` | aa-complain mode       |

**4. Performance**:

**SELinux**:

* Label lookups in xattrs (extended attributes)
* Hash table lookups for policy decisions
* Overhead: 3-7% typically

**AppArmor**:

* Path resolution for every access
* Simpler policy checks
* Overhead: 1-3% typically

**5. Filesystem Requirements**:

**SELinux**:

* Requires filesystem with xattr support
* Labels stored as extended attributes
* `ls -Z` shows labels
* Relabeling filesystem can be slow

**AppArmor**:

* No special filesystem requirements
* Works on any filesystem (even FAT, NFS)
* No labels to manage

**6. Use Cases**:

**Use SELinux when**:

* Maximum security required (government, military)
* Need MLS/MCS (confidentiality levels)
* Want very fine-grained control
* Already familiar with it (RHEL/Fedora/CentOS)
* Need label-based security (labels follow files even if moved)

**Use AppArmor when**:

* Simplicity preferred over maximum granularity
* Easier policy management desired
* Filesystem doesn't support xattrs (NFS, FAT)
* Developers/admins less experienced with MAC
* Debian/Ubuntu/SUSE environment

**7. Real-World Scenarios**:

**Scenario 1: Web Server**

SELinux:

```bash theme={null}
# Pre-defined policy exists
# But need to handle custom app

# App stores files in /opt/myapp/
# SELinux denies access (wrong label)

# Solution:
semanage fcontext -a -t httpd_sys_content_t "/opt/myapp(/.*)?"
restorecon -R /opt/myapp

# More denials?
# Debug with:
ausearch -m avc -ts recent
audit2allow -a -M mypolicy
semodule -i mypolicy.pp
```

AppArmor:

```bash theme={null}
# Create profile
cat > /etc/apparmor.d/usr.sbin.myapp <<'EOF'
/usr/sbin/myapp {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  /opt/myapp/** r,
  /var/log/myapp/** rw,
  network inet stream,
  capability net_bind_service,
}
EOF

# Load and enforce
apparmor_parser -r /etc/apparmor.d/usr.sbin.myapp

# Done! Much simpler.
```

**Scenario 2: Container Security**

SELinux:

* Docker/Podman use SELinux contexts
* Each container gets unique MCS label
* Container processes run as `container_t`; their files are labeled `container_file_t` (formerly `svirt_sandbox_file_t`)
* Strong isolation via labels

AppArmor:

* Docker uses AppArmor profiles
* Default profile restricts mount, capabilities, etc.
* Custom profiles for specific containers
* Path-based restrictions easier to understand

**8. Policy Portability**:

**SELinux**:

* Labels stored with files (xattrs)
* Policy is separate from filesystem
* Moving files between systems: labels can be lost
* Need to relabel after restore from backup

**AppArmor**:

* Policy references absolute paths
* Moving profile to different system: works if paths same
* But path changes require profile updates

***

**Recommendation Matrix**:

| Priority             | Choose                          |
| -------------------- | ------------------------------- |
| Maximum security     | SELinux                         |
| Ease of use          | AppArmor                        |
| Fine-grained control | SELinux                         |
| Simple policies      | AppArmor                        |
| RHEL/CentOS          | SELinux (default)               |
| Debian/Ubuntu        | AppArmor (default)              |
| NFS/non-xattr FS     | AppArmor                        |
| MLS/MCS required     | SELinux                         |
| Container host       | Both work (SELinux more common) |

**Can you use both?**: Not in practice. Both are "major" LSMs, and kernels have traditionally allowed only one of them to be active at a time. Choose one.

**Neither?**: Not recommended. MAC adds a significant security layer beyond DAC.

**Seccomp-BPF** (Secure Computing with Berkeley Packet Filter):

**Core Concept**: Whitelist syscalls a process can make using BPF bytecode filters.
***

**Architecture**:

```
User Space Process
      │
      │ syscall (e.g., open, read, write)
      ↓
┌─────────────────────────────┐
│ Syscall Entry Point         │
│ (entry_SYSCALL_64)          │
└─────────┬───────────────────┘
          │
          │ ① Check: Seccomp filter installed?
          ↓
┌─────────────────────────────┐
│ Seccomp BPF Filter          │
│                             │
│ BPF Program:                │
│  - Load syscall number      │
│  - Load arguments           │
│  - Check against rules      │
│  - Return action:           │
│    • ALLOW                  │
│    • KILL                   │
│    • ERRNO                  │
│    • TRAP                   │
└─────────┬───────────────────┘
          │
          │ ② Action
          ↓
ALLOW? ──────────────────> Execute Syscall
KILL?  ──────────────────> SIGSYS (kill process)
ERRNO? ──────────────────> Return error code
TRAP?  ──────────────────> Send SIGSYS signal (catchable by a handler)
```

***

**BPF Filter Structure**:

```c theme={null}
// Seccomp data available to BPF program
struct seccomp_data {
    int nr;                     // Syscall number
    __u32 arch;                 // Architecture (x86-64, ARM, etc.)
    __u64 instruction_pointer;
    __u64 args[6];              // Syscall arguments
};

// BPF filter example
struct sock_filter filter[] = {
    // Check the architecture first: syscall numbers differ per ABI
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

    // Load syscall number into accumulator
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),

    // Allow SYS_read
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow SYS_write
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow SYS_exit
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Default: KILL
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};
```

***

**Container Security Use Case**:

**Problem**: Containers share the kernel with the host. A malicious container can exploit kernel vulnerabilities.

**Seccomp Solution**: Reduce attack surface by blocking dangerous syscalls.
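A filter array like the one under **BPF Filter Structure** does nothing until it is handed to the kernel. The following is a minimal sketch of that step, assuming Linux on x86-64 or arm64 with reasonably recent kernel headers; `install_demo_filter` and `demo_filtered_child` are names invented for this example. To keep the demonstration harmless, the filter denies a single syscall (`getppid`) with `EPERM` and allows everything else, which is the opposite of a production allowlist.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86-64 and arm64"
#endif

/* Install a filter that makes getppid() fail with EPERM and allows all
 * other syscalls. Returns 0 on success, -1 on error. */
static int install_demo_filter(void)
{
    struct sock_filter filter[] = {
        /* Kill on a foreign architecture: syscall numbers differ per ABI */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, DEMO_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* getppid -> EPERM, anything else -> allow */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required for unprivileged processes before installing a filter */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;
    return 0;
}

/* Fork a child, install the filter there, and report what getppid() did.
 * Returns 1 if the call failed with EPERM, 0 if it went through, and -1
 * if the filter could not be installed or the child died abnormally. */
static int demo_filtered_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_demo_filter() != 0)
            _exit(2);
        errno = 0;
        long r = syscall(__NR_getppid);
        _exit((r == -1 && errno == EPERM) ? 1 : 0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}
```

Running the filter in a forked child matters because a seccomp filter cannot be removed once installed; the parent stays unfiltered and only observes the child's exit code.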
**Docker Default Seccomp Profile** (simplified and annotated; real profiles are plain JSON without comments):

```json theme={null}
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "read", "write", "open", "close", "stat", "fstat",
        "mmap", "mprotect", "munmap", "brk", "ioctl", "writev",
        "access", "socket", "connect", "accept", "bind", "listen",
        "select", "poll", "epoll_wait"
        /* ... ~300 allowed syscalls ... */
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      // Shown explicitly for illustration: in the real profile these are
      // simply absent from the allowlist or gated on a capability
      "names": [
        "reboot",              // Cannot reboot host!
        "swapon", "swapoff",   // Cannot manage swap
        "mount", "umount2",    // Cannot mount filesystems
        "pivot_root",          // Cannot change root
        "kexec_load",          // Cannot load kernel
        "add_key",             // Cannot add keyring keys
        "request_key",
        "bpf",                 // Cannot load BPF programs
        "perf_event_open",     // Cannot use perf
        "ptrace"               // Cannot trace other processes
      ],
      "action": "SCMP_ACT_ERRNO"  // Return EPERM
    }
  ]
}
```

**Why Critical for Containers**:

1. **Kernel Exploit Mitigation**:

```
Without seccomp: Container → Exploit in keyctl() → Kernel code execution → Host compromise
With seccomp:    Container → keyctl() → EPERM (syscall blocked) → Exploit fails
```

2. **Privilege Escalation Prevention**:

```bash theme={null}
# Without seccomp
docker run -it --security-opt seccomp=unconfined ubuntu
# Inside container:
unshare --user --map-root-user --mount --uts --ipc --net --pid --fork /bin/bash
# Typically succeeds: new namespaces expose more kernel code → potential escape

# With seccomp (default)
unshare --user --map-root-user --mount ...
# unshare: unshare failed: Operation not permitted
# Blocked! (unshare is only allowed with CAP_SYS_ADMIN in the default profile)
```

3. **Attack Surface Reduction**:

```
Linux x86-64: 300+ syscalls
Docker default seccomp: disables around 44 of them (per Docker's documentation)

Blocked or capability-gated, for example:
- Kernel module loading (init_module, finit_module)
- Namespace manipulation (setns, unshare)
- Performance monitoring (perf_event_open)
- System administration (reboot, sethostname)
- Key management (add_key, keyctl)
- BPF programs (bpf)

Result: the blocked calls are few, but they are the privileged, rarely needed ones where kernel bugs hurt most
```

***

**Implementing Custom Seccomp**:

**Example: Strict Sandbox**:

```c theme={null}
#include <seccomp.h>   // libseccomp; link with -lseccomp
#include <unistd.h>

void install_strict_seccomp(void) {
    scmp_filter_ctx ctx;

    // Default: KILL (strictest!)
    ctx = seccomp_init(SCMP_ACT_KILL);

    // Allow ONLY these syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Conditional: allow write ONLY to stdout. Filters can compare scalar
    // arguments such as file descriptors, but cannot inspect path strings,
    // so "open only under /tmp" is not expressible in seccomp
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                     SCMP_A0(SCMP_CMP_EQ, STDOUT_FILENO));

    // Load filter
    seccomp_load(ctx);
    seccomp_release(ctx);

    // After this point:
    // - read/exit: OK
    // - write(1, ...): OK
    // - write(3, ...): KILLED!
    // - openat("/etc/passwd"): KILLED! (open needed files before loading the filter)
    // - socket(): KILLED!
    // - fork(): KILLED!
}
```

**Docker Custom Profile**:

```bash theme={null}
# custom-seccomp.json
# (illustrative: a real container also needs execve, mmap, and friends to start)
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "exit", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

# Run container with custom profile
docker run --security-opt seccomp=custom-seccomp.json myimage
```

***

**Debugging Seccomp Violations**:

```bash theme={null}
# Choose which seccomp actions may be logged (a list of action names, not 0/1)
echo "kill_process kill_thread trap errno trace log" > /proc/sys/kernel/seccomp/actions_logged

# Run container
docker run --rm -it --security-opt seccomp=strict.json ubuntu bash

# Inside container, try forbidden syscall:
mount -t tmpfs tmpfs /mnt
# bash: mount: Operation not permitted

# Check dmesg
dmesg | tail
# [12345.678] audit: type=1326 audit(1234567890.123:456): auid=1000 uid=0 gid=0
#             ses=3 pid=12345 comm="mount" exe="/bin/mount" sig=0 arch=c000003e
#             syscall=165 compat=0 ip=0x7f... code=0x50001
#             ^^^^^^^^^^^
#             syscall 165 = mount (BLOCKED!)
#             code 0x50001 = SECCOMP_RET_ERRNO | EPERM

# Syscall 165 (mount) was blocked by seccomp
```

***

**Why BPF**:

1. **Efficiency**: JIT-compiled to native code (fast!)
2. **Safety**: BPF verifier ensures filter cannot crash kernel
3. **Flexibility**: Can inspect syscall arguments, not just number

4.
**Performance**: Evaluated in kernel space (no context switch)

**Without BPF** (old seccomp mode 1):

* Could only allow `read()`, `write()`, `_exit()`, and `sigreturn()`
* No flexibility

**With BPF** (seccomp mode 2):

* Can allow specific syscalls
* Can inspect scalar arguments (e.g., allow `socket()` but only for `AF_UNIX`)
* Can return different actions (ERRNO, TRAP, LOG, ALLOW)

***

**Limitations**:

1. **Cannot inspect pointers**: BPF cannot dereference user-space pointers (no access to path strings, only FDs and other scalar values)
2. **Time-of-check-time-of-use (TOCTOU)**: The filter runs before the syscall, so memory behind pointer arguments can change after the check
3. **Bypass via allowed syscalls**: If `write()` is allowed, an attacker might abuse it
4. **Complexity**: Writing correct BPF filters is hard

***

**Summary**:

Seccomp-BPF is critical for containers because:

* ✅ Reduces kernel attack surface (Docker's default profile disables around 44 of 300+ syscalls; custom profiles can go much further)
* ✅ Prevents privilege escalation (blocks namespace manipulation)
* ✅ Mitigates kernel exploits (blocks vulnerable syscalls)
* ✅ Fast (BPF JIT compilation)
* ✅ Flexible (programmable filters)
* ✅ Safe to load (the BPF verifier rejects filters that could crash or hang the kernel)

Without it, containers can reach every syscall the kernel offers (300+) → a much larger attack surface.

***

## 12. Threat Modeling for OS-backed Services

When designing secure services, think systematically about OS-level attack surfaces.
### The STRIDE Model Applied to OS

| Threat                     | OS Attack Vector                        | Mitigation                             |
| -------------------------- | --------------------------------------- | -------------------------------------- |
| **S**poofing               | Process impersonation, UID manipulation | User namespaces, strong authentication |
| **T**ampering              | Memory corruption, file modification    | ASLR, KASLR, read-only mounts          |
| **R**epudiation            | Log deletion, timestamp manipulation    | Append-only logs, audit subsystem      |
| **I**nfo Disclosure        | `/proc` leaks, side channels            | `hidepid=2`, Spectre mitigations       |
| **D**enial of Service      | Fork bombs, memory exhaustion           | Cgroups limits, ulimits, quotas        |
| **E**levation of Privilege | Kernel exploits, setuid abuse           | Seccomp, drop capabilities             |

### Defense-in-Depth Checklist

```bash theme={null}
# 1. Principle of Least Privilege
capsh --print                           # See current capabilities
setcap cap_net_bind_service=+ep ./app   # Grant only specific caps

# 2. Namespaces: Reduce Visibility
unshare --user --map-root-user --pid --mount-proc bash

# 3. Read-only Filesystems
mount -o remount,ro /                   # Make root read-only

# 4. Resource Limits (DoS protection)
echo "100M" > /sys/fs/cgroup/myapp/memory.max
echo "50" > /sys/fs/cgroup/myapp/pids.max
```

***

## Summary

**Key Takeaways**:

1. **Memory Protection**: NX/DEP, ASLR, and stack canaries are foundational defenses against memory corruption attacks.
2. **Control Flow Integrity**: Forward-edge CFI and shadow stacks (backward-edge CFI) prevent control-flow hijacking.
3. **Privilege Separation**: Capabilities provide fine-grained privileges instead of all-or-nothing root access.
4. **Mandatory Access Control**: SELinux (label-based) and AppArmor (path-based) enforce policies beyond DAC.
5. **Microarchitectural Attacks**: Spectre and Meltdown exploit speculative execution. KPTI and retpolines mitigate but with performance cost.
6. **Sandboxing**: Namespaces, seccomp, and combinations thereof create strong isolation for untrusted code.
**Defense in Depth**: No single mechanism is perfect. Modern systems combine multiple layers:

* ASLR + NX + Stack Canaries + CFI (memory safety)
* Capabilities + Seccomp + Namespaces (privilege reduction)
* SELinux/AppArmor (mandatory access control)
* KPTI + Retpolines + CPU features (hardware attack mitigation)

**Performance vs Security**: Many mitigations have performance costs. Understand trade-offs and apply based on threat model.

***

## Interview Deep-Dive

**Strong Answer:**

These three mechanisms form a layered defense against the classic buffer overflow attack chain. To understand why you need all three, walk through what an attacker must accomplish to exploit a buffer overflow:

* **Step 1: Overwrite the return address** -- The attacker provides input that overflows a stack buffer and overwrites the saved return address (RIP) on the stack, redirecting execution to attacker-controlled code.
* **Stack canaries** intervene here. A random value (the "canary") is placed between local variables and the saved return address at function entry. Before the function returns, the compiler-inserted code checks if the canary was modified. If it was (because the overflow overwrote it on the way to the return address), the program aborts immediately. The attacker must either guess the canary (2^56 possibilities on 64-bit Linux, because glibc forces the low byte to zero so that string functions stop at it) or find a way to overwrite the return address without touching the canary (possible with format string bugs or non-contiguous overwrites, but much harder).
* **Step 2: Redirect execution to shellcode** -- If the attacker bypasses the canary, they redirect execution to injected code (shellcode) in the buffer itself.
* **NX (No-Execute) / DEP** intervenes here. The stack (and heap, and data sections) are marked non-executable at the page table level. The CPU enforces this in hardware: executing an instruction from an NX page triggers a page fault. The attacker's shellcode on the stack cannot execute.
This forces the attacker to use Return-Oriented Programming (ROP) -- chaining existing code snippets ("gadgets") from the binary and libraries. * **Step 3: Locate usable code gadgets** -- The attacker needs to find executable code at known addresses to build ROP chains. * **ASLR** intervenes here. The kernel randomizes the base addresses of the stack, heap, shared libraries, and (with KASLR) the kernel itself at each process start. The attacker cannot hard-code addresses of gadgets because they change every run. On 64-bit systems, the entropy is typically 28-30 bits for library randomization, making brute force impractical. Together, the attacker must: bypass the canary (hard without an information leak), cannot inject code (NX), and cannot find existing code to reuse (ASLR). Breaking one is insufficient -- you need to break at least two. Where it still fails: * **Information leaks**: A separate vulnerability that leaks memory addresses (e.g., a format string bug that prints stack values) can defeat both ASLR (reveals addresses) and canaries (reveals the canary value). This is why modern defenses add CFI (Control Flow Integrity) as a fourth layer -- even if the attacker knows addresses, they cannot redirect execution to arbitrary gadgets because the CPU verifies that indirect branches target valid function entries. **Follow-up: What is KASLR and why was KPTI needed despite it?** KASLR randomizes the kernel's base address in virtual memory at each boot. The idea is that even if an attacker has a kernel vulnerability, they cannot exploit it without knowing where kernel functions are located. KPTI (Kernel Page Table Isolation) was needed because the Meltdown vulnerability allowed user-space code to speculatively read kernel memory through the CPU's speculative execution, bypassing KASLR entirely -- the attacker could read kernel addresses at \~500KB/s and then use those addresses for their exploit. 
KPTI unmaps the kernel from user-space page tables entirely, so there is nothing for Meltdown to speculatively read. The cost is that every syscall now requires a CR3 switch between user and kernel page tables (5-30% overhead on older CPUs).

**Strong Answer:**

Seccomp-BPF and SELinux operate at completely different layers and are complementary, not interchangeable.

* **Seccomp-BPF (System Call Filter)**: Intercepts every syscall at the entry point and runs a BPF filter that decides allow/deny/kill based on the syscall number and (with some limitations) its arguments. It answers: "Can this process invoke this kernel API?" Seccomp cannot distinguish between files, network addresses, or process targets -- if you allow `open()`, the process can open any file. If you allow `connect()`, it can connect to any address.
* **SELinux (Mandatory Access Control)**: Assigns security labels to every process, file, socket, and kernel object. A policy defines which labels can perform which operations on which other labels. It answers: "Can this specific subject access this specific object in this specific way?" SELinux can say "process with label httpd\_t can read files with label httpd\_sys\_content\_t but cannot write to files with label etc\_t." This is far more granular than seccomp.

For hardening an untrusted container, I would use both:

* **Seccomp-BPF**: Block all syscalls the container does not need. A web server does not need `mount`, `reboot`, `kexec_load`, `ptrace`, `init_module`, or `io_uring_setup`. Docker ships a default seccomp profile that, per its documentation, disables around 44 system calls out of 300+. For untrusted workloads, I would create a custom profile that allowlists only the \~50 syscalls the application actually uses (determined by running `strace` during testing). This shrinks the kernel attack surface enormously -- most kernel CVEs are in obscure syscall handlers that a web server never touches.
* **SELinux (or AppArmor)**: Apply a policy that restricts what the container can access even with the allowed syscalls. The container process can call `open()`, but SELinux ensures it can only open files in its designated directory. It can call `connect()`, but SELinux restricts it to specific ports and network labels. This prevents a compromised container from reading `/etc/shadow`, connecting to the metadata service (a common cloud attack vector), or accessing the Docker socket.

The layers complement each other: seccomp removes dangerous kernel entry points, SELinux restricts what the remaining entry points can access. Neither alone is sufficient. Seccomp without SELinux means a process with `open()` allowed can read any file. SELinux without seccomp means a process can invoke dangerous syscalls (even if they fail due to policy, the syscall handler code still runs, potentially triggering kernel bugs).

**Follow-up: What is the performance overhead of running both seccomp-BPF and SELinux simultaneously?**

Seccomp-BPF adds 10-50 nanoseconds per syscall (running a small BPF program in the syscall entry path). SELinux adds 100-500 nanoseconds per security check (which happens on syscalls that access objects -- file open, socket connect, etc.). As an upper bound, assume every syscall pays both costs, 110-550 nanoseconds in total. For a web server making 10K syscalls per second, that is 10,000 × 110-550 ns, roughly 1-5.5 milliseconds per second (0.1-0.55% of one core) -- negligible. For a storage-intensive application making 500K syscalls per second, the same estimate gives 55-275 milliseconds per second (5.5-27.5% of one core). The practical impact depends entirely on the syscall rate and on how many of those syscalls actually trigger an SELinux check. For most workloads, the overhead is under 1% and invisible in application-level metrics. The security benefit far outweighs the cost.

**Strong Answer:**

Both Spectre and Meltdown exploit speculative execution -- the CPU's optimization of executing instructions ahead of time before knowing whether they are needed. The critical difference is the trust boundary they violate.
* **Meltdown (CVE-2017-5754)**: Exploits the fact that on vulnerable Intel CPUs, speculative loads from kernel memory are not immediately stopped by the permission check. The CPU speculatively reads kernel data into a register, uses it to access a cache line (encoding the secret in a cache side channel), and then throws away the speculative result when the permission check fails. But the cache side channel remains -- the attacker can probe which cache line was accessed and recover the kernel data. Meltdown crosses the user/kernel boundary and allows reading arbitrary kernel memory. * **Spectre (CVE-2017-5753 Variant 1, CVE-2017-5715 Variant 2)**: Exploits the CPU's branch prediction. In Variant 1 (bounds check bypass), the attacker trains the branch predictor to predict that a bounds check will pass, then triggers speculative execution past the check with an out-of-bounds index. The speculative load accesses secret data and encodes it in the cache. In Variant 2 (branch target injection), the attacker poisons the Branch Target Buffer (BTB) to redirect speculative execution of an indirect branch to attacker-chosen code ("gadgets") within the victim's address space. Why Spectre is harder to mitigate: * **Meltdown has a clean fix**: KPTI (Kernel Page Table Isolation) unmaps the kernel from user-space page tables. If the kernel memory is not even mapped during user-space execution, the speculative load has nothing to read. The fix is at the OS level and is complete (with a 5-30% performance cost). * **Spectre crosses any trust boundary**: Spectre does not require reading kernel memory. It can leak data between processes, between VMs, between JavaScript contexts in a browser, between a sandbox and its host. Any code running on the same CPU can potentially be a Spectre victim or attacker. 
* **Software mitigations are partial**: Retpolines (replacing indirect branches with a return trampoline that defeats BTB poisoning) mitigate Variant 2 but add overhead to every indirect call. Array bounds masking (inserting an AND instruction after bounds checks to zero out speculative out-of-bounds accesses) mitigates Variant 1 but requires compiler changes and careful code auditing. Neither is a complete fix.
* **New variants keep appearing**: Spectre is a class of vulnerabilities, not a single bug. Speculative Store Bypass (Variant 4), SpectreRSB, Spectre-BHB, and Retbleed all abuse a different predictor, and related transient-execution issues such as MDS (Microarchitectural Data Sampling) follow the same theme. Each requires its own mitigation.

The fundamental problem is that speculative execution is not a bug -- it is a deliberate performance feature that modern out-of-order CPUs depend on. Disabling speculation entirely would cost a large share of single-thread performance. The industry is converging on hardware mitigations in newer CPUs (for example, enhanced IBRS on recent Intel parts and automatic IBRS on AMD Zen 4), but older hardware remains vulnerable.

**Follow-up: How do cloud providers like AWS protect against cross-VM Spectre attacks on shared hardware?**

Multiple layers: hardware partitioning (Intel CAT/MBA to partition the L3 cache between VMs, reducing cache side-channel leakage), microcode updates (clearing branch predictor state on VM entry/exit), hypervisor patches (KVM flushes speculation buffers on VMEXIT), and core scheduling (ensuring untrusted VMs do not share SMT siblings, since Hyper-Threading shares the branch predictor and L1 cache between logical cores). AWS's Nitro system goes further by offloading virtualization to dedicated hardware, reducing the hypervisor attack surface. Despite all this, the most sensitive workloads (HSMs, cryptographic key storage) run on dedicated single-tenant hosts with no sharing.

**Strong Answer Framework:**

1.
**Establish what the attacker should not be able to do.** Before reaching for tools, define the boundary. "Cannot read `/etc/shadow`" is different from "cannot exfiltrate any data," which is different again from "cannot persist a backdoor." Threat modeling forces you to choose mechanisms that match the goal.
2. **Apply user separation as the floor.** Run the binary as a dedicated unprivileged user with no shell, no sudo entries, no group memberships beyond its own. This is the cheapest layer and rules out most trivial attacks. Anyone who skips this layer because "I have stronger mechanisms above" loses if the stronger mechanisms have a bug.
3. **Drop capabilities to the minimum.** Use `prctl(PR_CAPBSET_DROP)` to drop the bounding set, set `PR_SET_NO_NEW_PRIVS` so that setuid binaries and file capabilities cannot re-elevate across `execve()`, optionally set `SECBIT_NOROOT` so that UID 0 gains no capabilities automatically, and add only the capabilities the workload needs. For most workloads, the answer is zero capabilities.
4. **Apply seccomp to shrink the kernel surface.** Custom syscall whitelist generated from observed behavior. The kernel has 350+ syscalls; a typical workload uses 50-80. Blocking the rest closes off entire classes of kernel CVEs preemptively.
5. **Use namespaces to make the world smaller.** Mount namespace with a chroot or pivot\_root into a private rootfs. Network namespace with no interfaces (or just a loopback). PID namespace so the process cannot see or signal anything outside. User namespace with the workload mapped to a non-overlapping host UID, so even root-in-namespace is unprivileged on the host.
6. **Layer mandatory access control on top.** SELinux or AppArmor profile that restricts what the workload can read or write even if it somehow got privilege. This is the layer that catches the bug in your seccomp profile.
7. **Cgroup limits for blast radius.** Memory limit, PID limit, CPU quota, IO weight. These do not stop intrusions, but they bound the damage of fork bombs, memory hogs, and crypto-miners-as-payload.

8.
**Audit what you cannot prevent.** Even with all of the above, log every syscall through audit subsystem or eBPF tracing. The goal is detection within hours of a successful attack, not just prevention.

**Real-World Example:**

Google Chrome's renderer sandbox is the public reference design. Each renderer process drops all capabilities, applies a strict seccomp filter (about 65 syscalls allowed), runs in a user namespace with the renderer UID mapped to nobody, has no filesystem access (uses Mojo IPC to the privileged broker for file IO), and is restricted by SELinux on Android. The 2014 Pwn2Own attack on Chrome required chaining a renderer RCE with a seccomp escape and a kernel privilege escalation -- three independent vulnerabilities. The 2024 attack on Chrome's V8 still required two more bugs to escape the renderer sandbox to host code execution.

**Senior follow-up 1: Why is user namespace mapping the root inside the namespace to a non-zero UID outside considered the strongest single primitive?**

Because capabilities are scoped to a user namespace. The kernel checks resources owned by a namespace with `ns_capable()` against that namespace, but checks host-global operations with `capable()`, which requires the capability in the initial user namespace. If your namespace's UID 0 maps to host UID 100000, a successful exploit that gives the attacker capabilities only does so within the namespace. Operations that affect the host kernel (loading modules, mounting filesystems on host paths, ptrace of host processes) see an unprivileged host UID with no capabilities in the initial namespace. This is why rootless containers are a meaningful security improvement -- not just a usability one.

**Senior follow-up 2: A seccomp profile is too restrictive in unpredictable ways. What is your debug strategy?**

Set the default action to `SECCOMP_RET_LOG` instead of `SECCOMP_RET_KILL`, run the workload through realistic scenarios (not just happy path -- include error handling, signal delivery, malloc growth), and watch `/var/log/audit/audit.log` for `SECCOMP` records.
Each entry shows the syscall number that would have been killed; map those to names with `ausyscall`. After a clean observation window, flip the default action to `SECCOMP_RET_ERRNO` with `EPERM` (`SCMP_ACT_ERRNO(EPERM)` in libseccomp) for one more cycle so the application can fail gracefully, then to `SECCOMP_RET_KILL_PROCESS` for hard enforcement. Recording tools such as `oci-seccomp-bpf-hook` and the Kubernetes Security Profiles Operator automate much of this loop, and Falco or `kubectl-trace` can help observe syscalls in running workloads.

**Senior follow-up 3: Where does gVisor fit relative to seccomp + namespaces, and when is the extra cost justified?** gVisor reimplements the Linux syscall surface in a userspace process (Sentry) that sits between the application and the host kernel. Calls that look like `read()` to the application are actually intercepted, validated, and either handled in Sentry or proxied to the host. This eliminates an entire class of risk: kernel CVEs in syscall handlers do not affect gVisor-sandboxed workloads because those handlers never execute on the host kernel for sandboxed traffic. The cost is real -- gVisor adds 10-50 percent overhead on syscall-heavy workloads and is incompatible with some applications (Linux-namespace-specific tools, applications that mmap and then expect specific kernel behaviors). The cost is justified for workloads where you genuinely cannot trust the binary -- shared CI runners, untrusted user code in PaaS, multi-tenant function execution. It is overkill for first-party microservices.

**Common Wrong Answers:**

* "Just put it in Docker." Docker by default runs as root inside the container, with a default capability set that still includes powerful entries such as `CAP_DAC_OVERRIDE`, `CAP_SETUID`, and `CAP_NET_RAW`, and a default seccomp profile tuned for compatibility rather than minimal surface. Docker is a packaging tool first; security depends on configuration that must be applied explicitly.
* "Use a VM." VMs have a smaller attack surface than namespaces against most threat models, but the hypervisor still has CVE history (CVE-2017-2596 KVM, CVE-2020-29569 Xen). Saying "use a VM" without acknowledging hypervisor risk hand-waves the problem.
* "Drop capabilities and you are done."
Capabilities are necessary but not sufficient. A process with zero capabilities can still read every world-readable file, connect to localhost services, and exploit kernel bugs in syscalls that do not require capabilities.

**Further Reading:**

* "Sandboxing and Workload Isolation" (Google production hardening guide) -- the gVisor design rationale and threat model
* Jess Frazelle, "Hard multi-tenancy in Kubernetes" -- pragmatic stack for untrusted workloads
* Linux source: `kernel/seccomp.c`, `kernel/user_namespace.c`, `security/security.c` for the LSM hook integration

**Strong Answer Framework:**

1. **Capabilities answer: what privileged operations can this process invoke?** Drop all capabilities and the process cannot bind low ports, change UIDs, mount filesystems, load kernel modules, ptrace others, or do anything else that historically required root. Capabilities do not restrict file access (DAC handles that) or syscall surface (seccomp handles that).
2. **Seccomp answers: what syscalls can this process make at all?** Even without capabilities, a process can call hundreds of syscalls. Many have CVE history. Seccomp shrinks the kernel attack surface by blocking syscalls the workload does not need. A filter can compare the raw values of syscall arguments, but it cannot dereference pointers, so it never sees a path string and cannot say "open files only in /tmp." For pointer arguments it can only say "you can or cannot call open at all."
3. **Namespaces answer: what does this process see?** Mount namespace = its own filesystem view. Network namespace = its own network stack. PID namespace = its own process tree. User namespace = its own UID/GID mapping. Namespaces isolate visibility and resource scope, not privilege. Two processes can be in the same namespace and one can attack the other; namespaces only protect across the boundary.
4.
**Docker is the orchestration that wires these together.** A Docker container is, mechanically, a process tree with namespaces, a default seccomp profile, dropped capabilities (most are off by default), an AppArmor or SELinux profile, and cgroup limits. Docker is not a new isolation mechanism -- it is a configuration that combines existing kernel mechanisms.
5. **Where they fail to compose:** seccomp filters by syscall number, but a syscall you allow can transitively reach functionality you blocked (for example, `io_uring` can perform file and network operations that never pass through the filtered syscalls). Namespaces leak through `/proc`, `/sys`, kernel keyrings, and shared kernel data structures. User namespaces have escalation paths through misconfigured uid\_map. Capabilities have surprising scopes -- `CAP_SYS_ADMIN` is "nearly root" because dozens of operations gate on it. Combining all four is necessary; each individually has gaps the others fill.

**Real-World Example:** The 2024 Leaky Vessels CVEs (CVE-2024-21626 in runc, CVE-2024-23651 in BuildKit) escaped containers despite seccomp, capability dropping, and AppArmor all being in place. The runc escape worked through a file descriptor leak across the namespace boundary -- runc leaked a host directory file descriptor into the container, reachable via `/proc/self/fd`, and a malicious image or `exec` could use that FD as its working directory to reach the host filesystem. None of the standard hardening primitives caught this because the attack did not violate any one mechanism's contract -- it exploited the gap between them. The fix was at the runtime level: runc closes leaked FDs before exec, a behavior that should have been there all along.

**Senior follow-up 1: Why does a default Docker container still have CAP\_NET\_RAW and CAP\_NET\_BIND\_SERVICE despite the security guidance to drop everything?** Because Docker's defaults are tuned for compatibility with common workloads -- ping, DHCP clients, web servers binding to ports below 1024 in legacy configurations.
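One way to check which capabilities a container actually holds is to decode the `CapEff` bitmask from `/proc/self/status`. The sketch below decodes such a mask into names, assuming the capability numbering 0 to 31 from `<linux/capability.h>`. The helper names (`cap_is_set`, `mask_has`, `count_caps`, `print_caps`) are hypothetical, and the sample mask `0xa80425fb` is the value commonly reported for a default Docker container; verify it against your own runtime rather than relying on it.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Capability names indexed by capability number (0..31), following
 * the order in <linux/capability.h>. Higher-numbered capabilities
 * are omitted from this sketch. */
static const char *const cap_names[32] = {
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE",
    "CAP_AUDIT_CONTROL", "CAP_SETFCAP"
};

/* Return 1 if bit `cap` is set in the mask, 0 otherwise. */
int cap_is_set(uint64_t mask, int cap)
{
    return cap >= 0 && cap < 64 && ((mask >> cap) & 1U);
}

/* Return 1 if the named capability is present in the mask. */
int mask_has(uint64_t mask, const char *name)
{
    for (int i = 0; i < 32; i++)
        if (strcmp(cap_names[i], name) == 0)
            return cap_is_set(mask, i);
    return 0; /* unknown name: treat as absent */
}

/* Count how many of the first 32 capabilities are set. */
int count_caps(uint64_t mask)
{
    int n = 0;
    for (int i = 0; i < 32; i++)
        n += cap_is_set(mask, i);
    return n;
}

/* Print one capability name per line for every bit that is set. */
void print_caps(uint64_t mask)
{
    for (int i = 0; i < 32; i++)
        if (cap_is_set(mask, i))
            printf("%s\n", cap_names[i]);
}
```

Calling `print_caps(0xa80425fbULL)` lists the capabilities encoded in that mask, including `CAP_NET_RAW` and `CAP_NET_BIND_SERVICE`. The mask for a running process can be read with `grep CapEff /proc/self/status`, and `capsh --decode=<mask>` performs the same decoding from the command line.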
Most real workloads do not need either capability and should drop them explicitly with `--cap-drop=ALL --cap-add=...`. The Docker maintainers chose compatibility-first defaults so `docker run` would just work for as many users as possible, accepting a less-defensive baseline as the cost. For production, your image build or orchestration layer should override this default.

**Senior follow-up 2: What is the difference between SECCOMP\_FILTER\_FLAG\_TSYNC and per-thread seccomp filters, and when does it matter?** `SECCOMP_FILTER_FLAG_TSYNC` synchronizes a seccomp filter across all threads in the process at install time, ensuring no thread escapes the filter. The flag is only available through the `seccomp(2)` syscall, not through `prctl`. Without it, a multithreaded process can install a filter on the calling thread but other existing threads keep running unfiltered until they call `prctl(PR_SET_SECCOMP)` themselves. For a single-threaded program this is fine; for anything threaded (which is most modern code), `TSYNC` is mandatory or you have a race where a thread spawned during filter installation never gets the filter.

**Senior follow-up 3: When would you choose AppArmor over SELinux, or vice versa, and is there ever a case to run both?** AppArmor uses path-based rules -- "process X cannot write to /etc/\*". It is easier to write profiles for and easier to reason about, especially in containerized environments where filesystem layout is predictable. SELinux uses type labels assigned to files via xattrs -- "process labeled httpd\_t cannot write to files labeled etc\_t". This is more powerful (the label travels with the file regardless of path) but harder to debug. Use AppArmor for application-specific containment in container environments (Ubuntu, Debian, SUSE all default to AppArmor). Use SELinux for whole-system mandatory access control where the broader policy benefits outweigh complexity (RHEL, Fedora, Android).
Running both simultaneously is theoretically possible but practically unwise -- LSM stacking still has rough edges, debugging conflicts is painful, and the marginal security from running both is small compared to running either one well.

**Common Wrong Answers:**

* "Containers are basically VMs." They are emphatically not. A VM has a hardware-virtualized hypervisor between guest and host kernel; a container shares the host kernel directly. Container escapes target host kernel bugs; VM escapes target hypervisor bugs (rarer, smaller surface).
* "Seccomp blocks system calls and that is enough." Seccomp does not see filesystem paths or network addresses. A process with `open` allowed can read every file your DAC allows; a process with `connect` allowed can reach every IP your network namespace permits.
* "If I drop all capabilities I am safe." Many CVEs do not need capabilities. Reading sensitive files via standard DAC, exploiting kernel bugs in syscalls that do not require privilege, and lateral movement through the container's mount namespace are all capability-free.

**Further Reading:**

* "Container Security" by Liz Rice -- the cleanest book-length tour of the kernel primitives and how Docker/Kubernetes wire them
* LWN article: "Capabilities for system calls" (Mickael Salaun) -- why caps and seccomp are complementary
* runc CVE-2024-21626 writeup -- a real-world example of compositional failure
* Linux source: `Documentation/userspace-api/seccomp_filter.rst`, `Documentation/security/credentials.rst`

**Strong Answer Framework:**

1. **The CPU's perspective: speculation as a performance feature.** A modern CPU does not wait for a branch's condition to resolve before fetching, decoding, and executing instructions on one of the predicted paths. Branch predictors -- including the Pattern History Table for conditional branches and the Branch Target Buffer (BTB) for indirect branches -- predict where execution is going.
The CPU executes speculatively, retains results in the Reorder Buffer, and either commits them (prediction correct) or discards them (prediction wrong). The trick is that "discards them" is not perfect: side effects on microarchitectural state -- cache lines loaded, branch predictor state updated -- persist even when the architectural result is rolled back.
2. **The attacker's perspective: turning microarchitectural side effects into a data leak.** Spectre Variant 1 (bounds check bypass): the attacker trains the branch predictor to expect a bounds check to pass, then triggers the speculative path with an out-of-bounds index. The speculative load reads secret memory, uses the secret as an index into a probe array, and brings a specific cache line into L1. The architectural result is rolled back, but the cache state is not. The attacker times accesses to the probe array; the line that hits in cache encodes the secret byte. With this primitive, the attacker reads memory at a rate on the order of 10-100 KB/sec.
3. **Spectre Variant 2 (branch target injection): poisoning indirect branches.** The attacker pollutes the BTB with branch targets that, when used speculatively by the victim, redirect speculative execution to attacker-chosen code -- "gadgets" -- in the victim's address space. Now the speculative-execution-and-cache-side-channel pattern can read across security boundaries (kernel space, other VMs, browser sandboxes).
4. **Kernel mitigations: per-variant.** Variant 1 is mitigated with array index masking (`array_index_nospec`) and LFENCE / speculation barriers in kernel hot paths -- a compiler and code-review job, painful and incomplete. Variant 2 is mitigated with retpolines on x86 (replacing indirect branches with a return trampoline that defeats BTB poisoning) and the IBRS / IBPB / STIBP CPU features (restricting or clearing predictor state at boundary crossings). Protection between SMT siblings (cross-process and cross-VM) comes from core scheduling -- never schedule mutually untrusting tasks on sibling threads of the same physical core.
5. **What you give up.** Retpolines add 5-30 percent overhead to indirect-call-heavy workloads (interpreters, VM monitors, system call entry). KPTI (which mitigates Meltdown but is part of the same family) costs 5-30 percent on syscall-heavy workloads, especially on older CPUs without PCID. Disabling SMT for security on multi-tenant hosts halves logical core count. The total cost on a Skylake-era Xeon running a syscall-heavy workload is non-trivial -- often 10-25 percent throughput loss compared to a baseline with all mitigations disabled.

**Real-World Example:** When Spectre was disclosed in January 2018, AWS deployed mitigations in two phases: first, Linux KPTI plus retpolines on hosts (immediately, for all workloads), and second, a Nitro-based approach that moved virtualization to dedicated hardware so the hypervisor surface no longer ran on the same cores as guest code. Internal AWS benchmarks reported 1-5 percent average overhead for typical workloads, but specific workloads (Redis, syscall-heavy databases) showed 20-30 percent regressions until application-level tuning recovered most of it. Public web search engines saw similar cost; Google's response involved refactoring V8's JIT to insert speculation barriers in a way that did not pay the full retpoline cost on every JS function call.

**Senior follow-up 1: Why is Meltdown easier to fully mitigate than Spectre?** Meltdown exploits a specific CPU implementation bug, most prominently on Intel parts: speculative loads to kernel addresses from user mode were not properly checked. The mitigation -- KPTI, unmapping the kernel from user-space page tables -- removes the speculative load's target entirely. There is nothing for the speculation to read. Spectre, in contrast, exploits a deliberate CPU feature (branch prediction) that you cannot remove without crippling performance. Every mitigation is partial: you fix one branch site, the next one is still vulnerable. You add a barrier somewhere, the attacker finds a different speculation primitive.
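The per-site nature of the Variant 1 fix is visible in how index masking works: each vulnerable array access needs its own clamp. The sketch below is modeled on the generic C fallback of the kernel's `array_index_mask_nospec`, under stated assumptions: both arguments are at most `LONG_MAX`, and right-shifting a negative `long` is arithmetic (true for GCC and Clang, but implementation-defined in ISO C). The names `index_mask_nospec` and `load_clamped` are hypothetical, and plain C gives no guarantee that a compiler will keep the mask branch-free, which is why x86 kernels use an inline-assembly version instead.

```c
#include <limits.h>
#include <stddef.h>

/* Return all ones when index < size and zero otherwise, computed
 * without a conditional branch.
 * Assumptions: index and size are <= LONG_MAX, and >> on a negative
 * long is an arithmetic shift (GCC/Clang behavior). */
unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    /* If index < size, both operands of | have a clear top bit, so
     * the complement is negative and the shift smears it to all ones.
     * If index >= size, (size - 1 - index) wraps and sets the top
     * bit, so the complement is non-negative and the shift gives 0. */
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

/* Hypothetical bounds-checked load. The branch is the architectural
 * check; the mask forces the index to 0 on any speculative path that
 * runs past the check with an out-of-range index. Returns -1 when
 * the index is out of range. */
int load_clamped(const int *table, unsigned long size, unsigned long index)
{
    if (index >= size)
        return -1;
    index &= index_mask_nospec(index, size);
    return table[index];
}
```

The mask does not stop the CPU from speculating past the bounds check; it only makes the speculated load harmless by pointing it at element 0. Every array access reachable with an attacker-controlled index needs the same treatment, which is what makes this mitigation a code-review problem rather than a one-time switch.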
The arms race is structurally asymmetric.

**Senior follow-up 2: How does retpoline actually defeat branch target injection, mechanically?** A normal indirect branch (`jmp *%rax`) consults the BTB for prediction, which the attacker has poisoned. Retpoline replaces it with a thunk: a `call` pushes a return address that points at a small capture loop, the thunk then overwrites that on-stack return address with the real target, and finally executes `ret`. The CPU predicts `ret` with its return address predictor (Return Stack Buffer, RSB) instead of the BTB. The attacker cannot easily poison the RSB because it is filled by `call` instructions, which the attacker controls less, and here the RSB entry points at the capture loop. The speculation that does happen therefore spins in that loop (`pause; lfence; jmp` back to itself) until the `ret` resolves architecturally to the real target, so the speculative path does not perform any useful work for an attacker. The cost is a few extra instructions per indirect call plus the loss of useful indirect-branch prediction. On newer Intel CPUs with `eIBRS` (Enhanced Indirect Branch Restricted Speculation) and AMD CPUs with Automatic IBRS, retpoline can be replaced with a hardware mode that gives comparable protection at lower overhead.

**Senior follow-up 3: When should I disable Spectre mitigations on a host I control?** Realistic case: a single-tenant host running first-party trusted code only, where every binary on the box is built and signed by you, the kernel CVEs you fear are not in the speculation family, and the 5-25 percent throughput loss matters more than defense in depth. HPC clusters running tightly-controlled workloads disable some mitigations for this reason (`mitigations=off` or specific flags like `nopti`, `spectre_v2=off`). The risk you accept: an unknown future CVE that uses speculation to escape userspace, or a supply-chain compromise in a trusted dependency. Most production environments cannot make this tradeoff because the trust assumptions do not actually hold; HPC and game servers can. Document the decision explicitly so the next operator does not assume mitigations are on.

**Common Wrong Answers:**

* "Spectre and Meltdown are the same thing."
They share a primitive (cache side channel after speculation) but differ in trust boundary and mitigation profile. Conflating them suggests you have read the headline but not the technical writeup.
* "Just patch your CPU microcode." Microcode updates are part of the mitigation but cannot fix Spectre fully because Spectre is a behavior, not a bug. Software mitigations remain mandatory.
* "Disable Hyper-Threading and you are safe." Disabling SMT helps against L1TF and MDS variants where SMT siblings share microarchitectural state, but does nothing for cross-process Spectre on the same core. It is one mitigation, not the answer.

**Further Reading:**

* The original Spectre paper: Kocher et al., "Spectre Attacks: Exploiting Speculative Execution" (IEEE Symposium on Security and Privacy, 2019; preprint released January 2018)
* LWN article: "The current state of kernel page-table isolation" -- comprehensive KPTI walkthrough
* Intel "Speculative Execution Side Channel Mitigations" white paper -- the vendor's view
* Linux source: `arch/x86/kernel/cpu/bugs.c`, `arch/x86/include/asm/nospec-branch.h`

***

**Next**: [Boot Process & Initialization](/operating-systems/boot-process) →