Modern OS security is a multi-layered defense against both software vulnerabilities and hardware-level attacks. From memory protection to mandatory access control, understanding these mechanisms is crucial for building secure systems.
No-Execute (NX) / Data Execution Prevention (DEP) marks memory pages as non-executable.
Traditional (Insecure):
┌────────────────────────────────┐
│ Stack                          │ Executable!
│ ├─ Return addresses            │ ← Attacker can inject shellcode
│ └─ Local variables             │
├────────────────────────────────┤
│ Heap                           │ Executable!
│ ├─ Malloc'd buffers            │ ← Attacker can put code here
│ └─ Dynamic data                │
├────────────────────────────────┤
│ Data/BSS                       │ Executable!
└────────────────────────────────┘

With NX/DEP:
┌────────────────────────────────┐
│ Stack (NX bit set)             │ NOT Executable
│ ├─ Return addresses            │ ← Shellcode won't execute!
│ └─ Local variables             │
├────────────────────────────────┤
│ Heap (NX bit set)              │ NOT Executable
│ ├─ Malloc'd buffers            │
│ └─ Dynamic data                │
├────────────────────────────────┤
│ Data/BSS (NX bit set)          │ NOT Executable
├────────────────────────────────┤
│ Text (executable)              │ Executable
│ └─ Program code                │
└────────────────────────────────┘
Implementation:
// Kernel sets page table entry (PTE) NX bit
// x86-64 page table entry structure
struct pte {
    unsigned long present  : 1;   // Page is in memory
    unsigned long rw       : 1;   // Read/Write permission
    unsigned long user     : 1;   // User-mode accessible
    unsigned long pwt      : 1;   // Page write-through
    unsigned long pcd      : 1;   // Page cache disabled
    unsigned long accessed : 1;   // Page was accessed
    unsigned long dirty    : 1;   // Page was written
    unsigned long pat      : 1;   // Page attribute table
    unsigned long global   : 1;   // Global page
    unsigned long avail    : 3;   // Available for OS use
    unsigned long pfn      : 40;  // Physical frame number
    unsigned long avail2   : 11;  // Available
    unsigned long nx       : 1;   // NO-EXECUTE bit (bit 63)
};

// Kernel code for stack allocation (simplified from mm/mmap.c)
unsigned long do_mmap(struct file *file, unsigned long addr,
                      unsigned long len, unsigned long prot,
                      unsigned long flags, unsigned long pgoff) {
    struct vm_area_struct *vma;

    vma = vm_area_alloc(current->mm);
    vma->vm_start = addr;
    vma->vm_end = addr + len;

    // Stack protection: read/write but NOT executable
    if (flags & MAP_STACK) {
        vma->vm_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
        // NX bit will be set in page table entries
    }

    // Code segment: executable but NOT writable
    if (prot & PROT_EXEC) {
        vma->vm_flags |= VM_EXEC;
        vma->vm_flags &= ~VM_WRITE;  // W^X: Write XOR Execute
    }

    return addr;
}
W^X Policy (Write XOR Execute):
A page can be writable OR executable, but never both
Prevents attacker from modifying code or executing data
The logic is straightforward: if you can write it, the attacker can inject code there, so it must not execute. If it executes, it must be immutable.
Practical tip: JIT compilers (V8, JVM HotSpot) are the main exception: they must generate code at runtime. They handle this by allocating pages as RW, writing machine code, then calling mprotect() to flip them to RX before execution. This W-then-X pattern is audited carefully in security-critical JITs like Firefox’s Wasm compiler.
Check NX status:
# Check if NX is enabled
dmesg | grep NX
# NX (Execute Disable) protection: active

# Check process memory maps
cat /proc/self/maps
# 7ffff7dd1000-7ffff7df3000 r-xp ... /lib/x86_64-linux-gnu/ld-2.31.so (executable)
# 7ffff7df3000-7ffff7df4000 r--p ... /lib/x86_64-linux-gnu/ld-2.31.so (read-only)
# 7ffffffde000-7ffffffff000 rw-p ... [stack] (no 'x'!)

# Check if binary has NX enabled
readelf -l /bin/ls | grep GNU_STACK
# GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x10
#                                                         ^^^ (RW, not RWE)
// Simplified from arch/x86/mm/mmap.c
unsigned long arch_mmap_rnd(void) {
    unsigned long rnd;

    // Get random bits from kernel PRNG
    if (mmap_is_ia32()) {
        // 32-bit (compat) tasks use the smaller compat entropy setting
        rnd = get_random_long() & ((1UL << mmap_rnd_compat_bits) - 1);
    } else {
        rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);
    }

    return rnd << PAGE_SHIFT;  // Align to page boundary
}

unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
                                     unsigned long len, unsigned long pgoff,
                                     unsigned long flags) {
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned long start_addr;

    // Add random offset
    if (!(flags & MAP_FIXED)) {
        // Random offset for ASLR
        start_addr = mm->mmap_base + arch_mmap_rnd();
    } else {
        start_addr = addr;
    }

    // Find free region starting at randomized address
    vma = find_vma(mm, start_addr);
    // ... allocation logic ...

    return start_addr;
}
Entropy Sources:
ASLR Entropy (bits of randomness):
Stack:      19 bits (on x86-64) = 524,288 possible locations
Heap:       13 bits = 8,192 possible locations
Libraries:  28 bits = 268 million possible locations
PIE binary: 28 bits = 268 million possible locations

Formula: Worst-case brute force attempts = 2^(entropy_bits); the average is half of that, 2^(entropy_bits - 1)
Example: 28 bits → attacker needs avg 2^27 = 134 million attempts

If each attempt crashes the program (1 sec delay):
  134 million seconds ≈ 1,553 days!
But if a failed guess does not re-randomize the layout (fork server whose children share the parent's layout):
  Attacker can brute force in minutes!
KASLR (Kernel ASLR):
// Kernel virtual address randomization (from arch/x86/boot/compressed/kaslr.c)
void choose_random_location(unsigned long input, unsigned long input_size,
                            unsigned long *output, unsigned long output_size,
                            unsigned long *virt_addr) {
    unsigned long random_addr, min_addr;

    // Get entropy from:
    // 1. RDRAND/RDSEED (CPU instructions)
    // 2. RDTSC (timestamp counter)
    // 3. Boot parameters
    random_addr = get_random_long();

    // Align and constrain to valid kernel address range
    min_addr = min(*output, *virt_addr);
    random_addr = find_random_phys_addr(min_addr, output_size);

    *output = random_addr;
    *virt_addr = random_addr + __START_KERNEL_map;
}
Check ASLR status:
# View ASLR setting
cat /proc/sys/kernel/randomize_va_space
# 0 = Disabled
# 1 = Randomize stack, libraries, mmap
# 2 = Full randomization (includes heap)

# Enable full ASLR
echo 2 | sudo tee /proc/sys/kernel/randomize_va_space

# Test ASLR
for i in {1..5}; do cat /proc/self/maps | grep stack; done
# 7ffc12345000-7ffc12366000 (different)
# 7ffe9abcd000-7ffe9abee000 (different)
# 7ffd45678000-7ffd45699000 (different)
// Uses NULL, CR, LF, EOF (0x00, 0x0D, 0x0A, 0xFF)
// Idea: strcpy stops at NULL, gets/printf stop at CR/LF
unsigned long canary = 0x000d0aff00000000;
// Weakness: Attacker can guess/brute-force known bytes

// Completely random value generated at program startup
// In kernel (from arch/x86/kernel/cpu/common.c)
void __init cpu_init(void) {
    // ...
    __stack_chk_guard = get_random_canary();
    // ...
}

unsigned long get_random_canary(void) {
    unsigned long canary;

    // Use hardware RNG if available
    if (cpu_has_rdrand()) {
        rdrand_long(&canary);
    } else {
        // Fall back to PRNG
        get_random_bytes(&canary, sizeof(canary));
    }
    return canary;
}
// Strength: Unpredictable, unique per process

// XOR of return address, frame pointer, and random value
unsigned long canary = __stack_chk_guard
                     ^ (unsigned long)__builtin_return_address(0)
                     ^ (unsigned long)__builtin_frame_address(0);
// Idea: Even if attacker overwrites return address,
//       canary changes correspondingly
// Weakness: Complex, rarely used
Compiler Flags:
# -fstack-protector: Protect functions with vulnerable buffers
gcc -fstack-protector vulnerable.c

# -fstack-protector-strong: Protect more functions (recommended)
gcc -fstack-protector-strong vulnerable.c

# -fstack-protector-all: Protect ALL functions (performance cost)
gcc -fstack-protector-all vulnerable.c

# Check if binary has stack protector (look for the imported __stack_chk_fail function)
readelf -s /bin/ls | grep stack_chk
# 123: 0000000000000000 0 FUNC GLOBAL DEFAULT UND __stack_chk_fail@GLIBC_2.4
// Vulnerable code
struct ops {
    void (*process)(char *data);
};

struct ops *vtable = malloc(sizeof(struct ops));
vtable->process = legitimate_function;

// ... buffer overflow ...
// Attacker overwrites vtable->process to point to shellcode

vtable->process(data);  // Calls shellcode!
CFI Solution:
// Compiler generates CFI check before indirect call

// Original code
vtable->process(data);

// Compiled with CFI
void *target = vtable->process;

// Check 1: Is target a valid code address?
if (!is_valid_code_address(target)) {
    cfi_violation();
}

// Check 2: Is target in allowed set for this call site?
if (!is_allowed_target(call_site_id, target)) {
    cfi_violation();
}

// Perform call
((void (*)(char *))target)(data);
Allowed Target Sets:
Function Signature-Based CFI:
void func_a(int x);          ← Set 1: void (int)
void func_b(int x);          ←
int func_c(int x, int y);    ← Set 2: int (int, int)
int func_d(int x, int y);    ←
void func_e(void);           ← Set 3: void (void)

Rule: Indirect call with signature void(int) can only jump to Set 1.

Implementation:
1. Compiler assigns ID to each function signature
2. Compiler tags each function with its ID
3. Before indirect call, check ID matches expected signature
Clang CFI:
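# Compile with CFI (requires LTO and hidden default visibility)
clang -fsanitize=cfi -flto -fvisibility=hidden program.c

# CFI variants
-fsanitize=cfi-icall            # Indirect calls
-fsanitize=cfi-vcall            # Virtual function calls (C++)
-fsanitize=cfi-derived-cast     # Bad base-to-derived casts (C++)
-fsanitize=cfi-unrelated-cast   # Bad casts from void* or unrelated types (C++)

# Generate CFI violation report
# (build with -fno-sanitize-trap=cfi so a failure prints a diagnostic instead of trapping)
UBSAN_OPTIONS=print_stacktrace=1 ./program

# Example violation
SUMMARY: UndefinedBehaviorSanitizer: cfi-check-fail
pc 0x55f8a2b3c4d5 in main program.c:42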
Shadow Stack: Hardware-protected copy of return addresses.
Regular Stack              Shadow Stack (Protected)
─────────────────          ────────────────────────
┌─────────────┐            ┌─────────────┐
│ Ret Addr 3  │ ◄──────►   │ Ret Addr 3  │ (Copy)
├─────────────┤            ├─────────────┤
│ Locals      │            │             │
├─────────────┤            │             │
│ Ret Addr 2  │ ◄──────►   │ Ret Addr 2  │
├─────────────┤            ├─────────────┤
│ Locals      │            │             │
├─────────────┤            │             │
│ Ret Addr 1  │ ◄──────►   │ Ret Addr 1  │
└─────────────┘            └─────────────┘
       ↑                          ↑
   Writable!                 Read-Only!
 (Attacker can              (CPU enforced,
    modify)                 not accessible)

On Function Return:
1. Pop return address from regular stack → addr_stack
2. Pop return address from shadow stack → addr_shadow
3. Compare: addr_stack == addr_shadow?
4. Mismatch → #CP exception (Control Protection) → Crash
Intel CET (Control-flow Enforcement Technology):
// CPU features for shadow stack
#define X86_FEATURE_SHSTK (1 << 7)   // Shadow stack
#define X86_FEATURE_IBT   (1 << 20)  // Indirect branch tracking

// Enable user-mode shadow stack (simplified kernel code)
void cet_enable(void) {
    u64 msr_val;
    unsigned long ssp;

    // Check if CPU supports CET
    if (!boot_cpu_has(X86_FEATURE_SHSTK))
        return;

    // Enable in the user-mode CET MSR
    // (IA32_S_CET is the supervisor-mode counterpart)
    rdmsrl(MSR_IA32_U_CET, msr_val);
    msr_val |= MSR_IA32_CET_SHSTK_EN;  // Enable shadow stack
    wrmsrl(MSR_IA32_U_CET, msr_val);

    // Allocate shadow stack for current thread
    ssp = alloc_shstk();  // Shadow stack pointer
    wrmsrl(MSR_IA32_PL3_SSP, ssp);
}

// Shadow stack operations (new x86 instructions)
// INCSSP       - Increment shadow stack pointer
// RDSSP        - Read shadow stack pointer
// SAVEPREVSSP  - Save previous SSP
// RSTORSSP     - Restore SSP
// WRSSD/WRSSQ  - Write to shadow stack
// SETSSBSY     - Mark shadow stack busy
ARM Pointer Authentication:
// ARM PAuth uses cryptographic signing of return addresses

// On function prologue:
//   PAC (Pointer Authentication Code) = sign(return_addr, context_key)
//   Store: PAC || return_addr on stack

// On function epilogue:
//   Verify: sign(return_addr, context_key) == PAC?
//   If mismatch → Fault

// ARM instructions
PACIA X30, SP   // Sign return address (X30) with stack pointer (SP)
RETAA           // Authenticate and return
Software Shadow Stack (Android):
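// Conceptual sketch of a software shadow stack (not hardware-protected).
// Android's ShadowCallStack does this with compiler instrumentation
// (Clang's -fsanitize=shadow-call-stack on AArch64) rather than library calls.
#include <stdlib.h>

#define SHADOW_DEPTH 1024

__thread void *shadow_stack[SHADOW_DEPTH];
__thread int shadow_stack_ptr = 0;

void function_entry(void *return_addr) {
    if (shadow_stack_ptr >= SHADOW_DEPTH)
        abort();  // Shadow stack exhausted
    shadow_stack[shadow_stack_ptr++] = return_addr;
}

void function_exit(void *return_addr) {
    if (shadow_stack_ptr <= 0)
        abort();  // Unbalanced return
    void *expected = shadow_stack[--shadow_stack_ptr];
    if (return_addr != expected) {
        abort();  // Stack corruption detected
    }
}

// Weakness: Attacker can corrupt shadow_stack too if memory bug exists
// Strength: Works on CPUs without hardware support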
// From /usr/include/linux/capability.h
#define CAP_CHOWN             0   // Change file ownership
#define CAP_DAC_OVERRIDE      1   // Bypass file permission checks
#define CAP_DAC_READ_SEARCH   2   // Bypass read/search permissions
#define CAP_FOWNER            3   // Bypass permission checks on file operations
#define CAP_FSETID            4   // Don't clear setuid/setgid bits
#define CAP_KILL              5   // Bypass permission checks for sending signals
#define CAP_SETGID            6   // Make arbitrary setgid calls
#define CAP_SETUID            7   // Make arbitrary setuid calls
#define CAP_NET_BIND_SERVICE  10  // Bind to ports < 1024
#define CAP_NET_RAW           13  // Use RAW and PACKET sockets
#define CAP_SYS_MODULE        16  // Load/unload kernel modules
#define CAP_SYS_PTRACE        19  // Trace arbitrary processes
#define CAP_SYS_ADMIN         21  // Lots of system admin operations
// ... 41 capabilities total ...
Capability Sets:
// Each process has 5 capability sets
struct cred {
    // ...
    kernel_cap_t cap_inheritable;  // Inherited by exec'd programs
    kernel_cap_t cap_permitted;    // Can be enabled (superset)
    kernel_cap_t cap_effective;    // Actually active NOW
    kernel_cap_t cap_bset;         // Bounding set (limits inheritable)
    kernel_cap_t cap_ambient;      // Ambient set (new in Linux 4.3)
};

// Each set is a 64-bit bitmask: one bit per capability, so room for 64 capabilities
typedef struct {
    __u32 cap[_LINUX_CAPABILITY_U32S_3];  // 2 × 32 bits
} kernel_cap_t;
Capability Semantics:
Permitted (P): Capabilities the process CAN use
Effective (E): Capabilities CURRENTLY active
Inheritable (I): Capabilities that can be inherited across exec
Ambient (A): Capabilities automatically granted after exec
Bounding (B): Upper limit on capabilities (cannot gain capabilities not in B)

On exec():
A' = (file is privileged) ? 0 : A
P' = (I & F_inheritable) | (F_permitted & B) | A'
E' = F_effective ? P' : A'
I' = I
B' = B

Where F_* are file capabilities (set on executable). A file is privileged if it has file capabilities or is setuid/setgid.
Using Capabilities:
Set File Capabilities

# Give ping the ability to create raw sockets (no setuid needed!)
sudo setcap cap_net_raw+ep /bin/ping

# Verify
getcap /bin/ping
# /bin/ping = cap_net_raw+ep

# Now ping works without setuid bit!
ls -l /bin/ping
# -rwxr-xr-x ... /bin/ping (no 's' bit!)

# Remove capabilities
sudo setcap -r /bin/ping

# Set multiple capabilities
sudo setcap cap_net_bind_service,cap_net_raw+ep /usr/bin/server

Capability-Aware Code

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/capability.h>

int main() {
    cap_t caps;
    cap_value_t cap_list[1] = {CAP_NET_BIND_SERVICE};

    // Get current capabilities
    caps = cap_get_proc();
    if (caps == NULL) {
        perror("cap_get_proc");
        return 1;
    }

    // Raise the capability in the effective set. It must already be in the
    // permitted set (granted via setcap below); a process cannot add
    // capabilities to its own permitted set.
    cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_list, CAP_SET);

    // Apply capabilities
    if (cap_set_proc(caps) != 0) {
        perror("cap_set_proc");
        return 1;
    }

    // Now we can bind port 80!
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(80),
        .sin_addr.s_addr = INADDR_ANY
    };
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        printf("Successfully bound to port 80\n");
    }

    // Drop capabilities we no longer need
    cap_clear(caps);
    cap_set_proc(caps);
    cap_free(caps);
    return 0;
}
// Compile: gcc -o server server.c -lcap
// Run: setcap cap_net_bind_service+ep ./server

View Process Capabilities

# View capabilities of running process
grep Cap /proc/self/status
# CapInh: 0000000000000000
# CapPrm: 0000000000000000
# CapEff: 0000000000000000
# CapBnd: 000001ffffffffff
# CapAmb: 0000000000000000

# Decode capability mask
capsh --decode=000001ffffffffff
# 0x000001ffffffffff=cap_chown,cap_dac_override,...

# View capabilities of any process
grep Cap /proc/1234/status

# Show the current shell's capability sets
capsh --print

Ambient Capabilities

// Ambient capabilities (Linux 4.3+)
// Allows non-root processes to exec and retain capabilities
#include <stdio.h>
#include <unistd.h>
#include <sys/capability.h>
#include <sys/prctl.h>

int main() {
    // A capability can be raised in the ambient set only if it is already in
    // both the permitted and inheritable sets, so add it to inheritable first
    cap_value_t cap = CAP_NET_RAW;
    cap_t caps = cap_get_proc();
    if (caps == NULL ||
        cap_set_flag(caps, CAP_INHERITABLE, 1, &cap, CAP_SET) != 0 ||
        cap_set_proc(caps) != 0) {
        perror("cap_set_proc");
        return 1;
    }
    cap_free(caps);

    // Raise CAP_NET_RAW in ambient set
    if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_NET_RAW, 0, 0) != 0) {
        perror("prctl");
        return 1;
    }

    // exec another program: /usr/bin/ping inherits CAP_NET_RAW
    // (even though it's not setuid and has no file capabilities)
    execl("/usr/bin/ping", "ping", "8.8.8.8", NULL);
    perror("execl");  // Only reached if exec fails
    return 1;
}
// Use case: Container init process grants capabilities to children
SELinux adds mandatory access control on top of DAC.
DAC says: "Can user alice read file.txt?"
  → Check: alice's UID vs file owner, group, permissions

SELinux says: "Can process with label X access file with label Y?"
  → Check: Policy rules for (process_label, file_label, operation)

Both must succeed for access!
# View denials
ausearch -m avc -ts recent

# Example denial
type=AVC msg=audit(1234567890.123:456): avc: denied { read } for pid=1234 comm="httpd" name="secret.txt" dev="sda1" ino=123456 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:user_home_t:s0 tclass=file permissive=0
# Translation: httpd_t tried to read user_home_t file → DENIED

# Generate policy module to allow
audit2allow -a -M mypolicy
# module mypolicy 1.0;
# require {
#     type httpd_t;
#     type user_home_t;
#     class file read;
# }
# allow httpd_t user_home_t:file read;

# Install policy module
semodule -i mypolicy.pp

# List loaded modules
semodule -l

# Remove module
semodule -r mypolicy
SELinux Booleans (runtime toggles):
# List all booleans
getsebool -a | grep httpd
# httpd_can_network_connect --> off
# httpd_can_network_connect_db --> off
# httpd_enable_cgi --> on

# Enable httpd network connections
setsebool -P httpd_can_network_connect on
# -P makes it persistent across reboot
AppArmor is path-based MAC (vs SELinux’s label-based).
SELinux: "Process with label X can access file with label Y"
  → Requires labeling entire filesystem

AppArmor: "Process can access /var/www/* with read permission"
  → Based on filesystem paths (easier to understand)
Speculative Execution: CPU predicts branch and executes ahead, then discards if wrong.
// Vulnerable code
if (x < array1_size) {          // Bounds check
    y = array2[array1[x]];      // Out-of-bounds access
}

Without Speculation:
1. Check: x < array1_size?
2. If true, execute access
3. If false, skip

With Speculation (vulnerable):
1. CPU predicts branch will be taken
2. Speculatively executes: y = array2[array1[x]]
   Even if x >= array1_size!
3. Loads array1[x] (out of bounds!)
4. Uses it to index array2
5. array2[...] brought into cache ← SIDE EFFECT!
6. Branch misprediction detected
7. Architectural state rolled back
8. BUT: Cache state NOT rolled back!

Attacker observes cache timing → leaks array1[x]!
Meltdown (CVE-2017-5754):
// Leak kernel memory from user space

// 1. Flush every probe line from the cache
for (int i = 0; i < 256; i++) {
    clflush(&probe_array[i * 4096]);
}

// 2. Access kernel memory (faults architecturally, but dependent loads run transiently first)
unsigned char kernel_byte = *(unsigned char *)kernel_address;

// 3. Use leaked byte to index array
char dummy = probe_array[kernel_byte * 4096];
// This brings probe_array[kernel_byte * 4096] into cache

// 4. The fault is raised and the transient results are discarded
//    (the attacker catches SIGSEGV or suppresses the fault)
//    But probe_array[...] is NOW in cache!

// 5. Time accesses to probe_array
for (int i = 0; i < 256; i++) {
    t0 = rdtsc();
    dummy = probe_array[i * 4096];
    t1 = rdtsc();
    if (t1 - t0 < THRESHOLD) {
        // Cache hit! i == kernel_byte
        printf("Leaked kernel byte: 0x%02x\n", i);
    }
}

// Result: Leaked kernel memory byte-by-byte at ~100 KB/s!
Mitigation: KPTI (Kernel Page Table Isolation):
Without KPTI:
┌────────────────────────────────────┐
│ User Page Tables                   │
│ ┌──────────────────────────────┐   │
│ │ User Space Mappings          │   │
│ │ 0x00000000 - 0x7fffffffffff  │   │
│ ├──────────────────────────────┤   │
│ │ Kernel Space Mappings        │   │
│ │ 0xffff800000000000 - ...     │   │ ← Meltdown can read this!
│ └──────────────────────────────┘   │
└────────────────────────────────────┘

With KPTI (two sets of page tables):
┌────────────────────────────────────┐
│ User Page Tables                   │
│ ┌──────────────────────────────┐   │
│ │ User Space Mappings          │   │
│ ├──────────────────────────────┤   │
│ │ Minimal Kernel (entry/exit)  │   │ ← Only small trampoline
│ └──────────────────────────────┘   │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│ Kernel Page Tables                 │
│ ┌──────────────────────────────┐   │
│ │ User Space Mappings          │   │
│ ├──────────────────────────────┤   │
│ │ Full Kernel Space            │   │ ← Full kernel mapped
│ └──────────────────────────────┘   │
└────────────────────────────────────┘

On syscall: Switch from User PT → Kernel PT (CR3 register swap)
On return: Switch from Kernel PT → User PT
Cost: ~5-30% performance penalty (context switch overhead)
Kernel Implementation (simplified from arch/x86/mm/pti.c):
// Rowhammer exploit (simplified fragment; find_row_address() and clflush() are placeholder helpers)

// 1. Spray memory with target pattern
uint8_t *spray[1000];  // Unsigned bytes, so the 0xFF comparison below is meaningful
for (int i = 0; i < 1000; i++) {
    spray[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    memset(spray[i], 0xFF, 4096);  // All bits set
}

// 2. Find adjacent rows (via DRAM addressing)
volatile uint64_t *hammer_addr1 = find_row_address(2);
volatile uint64_t *hammer_addr2 = find_row_address(4);

// 3. Hammer rows
for (int i = 0; i < 1000000; i++) {
    (void)*hammer_addr1;     // Volatile read (causes DRAM row activation)
    (void)*hammer_addr2;
    clflush(hammer_addr1);   // Evict from cache (force DRAM access)
    clflush(hammer_addr2);
}

// 4. Check for bit flips in victim rows
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 4096; j++) {
        if (spray[i][j] != 0xFF) {
            printf("Bit flip at %p: 0x%02x\n", (void *)&spray[i][j], spray[i][j]);
        }
    }
}

// Real attacks:
// - Flip bit in page table → gain access to kernel memory
// - Flip bit in SELinux context → privilege escalation
// - Flip bit in RSA key → factor private key
Mitigations:
ECC Memory
Error-Correcting Code (ECC):
- Detects and corrects single-bit errors
- Detects (but can't correct) multi-bit errors

Cost: ~10% more expensive
Performance: Slight overhead
Widely used in servers, rare in consumer devices
Target Row Refresh (TRR)
Hardware solution by DRAM vendors:
- Monitor row activation counters
- If row accessed frequently, refresh adjacent rows
- Prevents charge leak that causes bit flips

Implemented in DDR4/DDR5 DRAM
Effectiveness: Good but not perfect
(bypasses exist with careful timing)
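#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

int sandbox_init(void *arg) {
    // New mount namespace: make mounts private so changes stay in the sandbox
    mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);

    // New hostname
    sethostname("sandbox", 7);

    // New root filesystem
    if (chroot("/var/sandbox") != 0 || chdir("/") != 0) {
        perror("chroot");
        return 1;
    }

    // Execute sandboxed program
    execl("/bin/sh", "sh", NULL);
    perror("execl");  // Only reached if exec fails
    return 1;
}

int main() {
    // Child stack on the heap; a 4 KiB array is too small for the child
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        perror("malloc");
        return 1;
    }

    // Create new namespaces (requires root or CAP_SYS_ADMIN)
    int flags = CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID |
                CLONE_NEWNET | CLONE_NEWIPC;

    // Clone process with new namespaces; the stack grows down, so pass its top
    pid_t pid = clone(sandbox_init, stack + STACK_SIZE, flags | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");
        return 1;
    }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}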
PID Namespace (process isolation):
// Parent namespace
// PID 1: init
// PID 123: parent
// PID 124: child (clone with CLONE_NEWPID)

// Inside child's PID namespace
getpid();  // Returns 1 (child is PID 1 in its namespace)

// Child can only see processes in its namespace
ps aux     // Only shows processes in this namespace

// Parent can still see child
// PID 124 in parent namespace == PID 1 in child namespace
Network Namespace (network isolation):
# Create new network namespace
ip netns add sandbox

# Execute command in namespace
ip netns exec sandbox ip link list
# 1: lo: <LOOPBACK> state DOWN
# (Only loopback, no network access!)

# Create virtual interface pair
ip link add veth0 type veth peer name veth1

# Move veth1 to sandbox namespace
ip link set veth1 netns sandbox

# Configure networking
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up
ip netns exec sandbox ip addr add 10.0.0.2/24 dev veth1
ip netns exec sandbox ip link set veth1 up

# Now sandbox can communicate via veth interface
Linux security primitives are individually well-designed. The failure mode is composition: each primitive looks correct in isolation, then a subtle assumption interaction creates an escape route. Below are four traps that bite even experienced security engineers, paired with the patterns that close them.
Pitfall 1: Reaching for setuid when modern Linux wants fine-grained capabilities
The historical Unix model is “either you are root or you are not.” A setuid binary runs with the file owner’s privileges, almost always root, which means a single bug in ping or mount or passwd is a path to total compromise. CVE history is full of setuid escalations: pkexec (CVE-2021-4034, PwnKit), sudo (CVE-2021-3156, Baron Samedit), OverlayFS setuid copy-up (CVE-2023-0386). The trap is that engineers reach for setuid out of habit because “I just need to bind to port 80” or “I need to read this hardware register,” when modern Linux offers a far narrower grant.
The other side of the same trap: people use Docker’s --privileged flag because they ran into a permission error and wanted to make it go away. --privileged gives the container all capabilities, exposes host devices, and disables the seccomp and AppArmor confinement. It is the Docker equivalent of chmod 777.
Solution: file capabilities and ambient capabilities
Linux capabilities split root’s powers into about 40 fine-grained privileges. Bind to low ports? CAP_NET_BIND_SERVICE. Send raw packets? CAP_NET_RAW. Trace arbitrary processes? CAP_SYS_PTRACE. Grant only what the binary actually needs:
# Old way: setuid root
chmod u+s /usr/bin/myserver

# Modern way: file capability
setcap cap_net_bind_service=+ep /usr/bin/myserver

# Verify
getcap /usr/bin/myserver

# Run as a normal user; bind succeeds, nothing else is privileged
For containers, drop everything and add back what you need, for example with Docker's --cap-drop=ALL followed by --cap-add=NET_BIND_SERVICE.
The mental model: capabilities are the principle of least privilege made concrete. Anything you cannot justify by name should not be in the bag.
Pitfall 2: seccomp filter holes (library code that transitively needs forbidden syscalls)
Engineers reach for seccomp profiles assuming the syscalls their own code makes are the whole list. They are not. Library code you call issues syscalls you never wrote. Classic example: you block mprotect because you do not want anyone changing page permissions. But glibc and the dynamic loader call mprotect themselves: the loader uses it to apply RELRO protections and to map shared objects, and libc functions such as getaddrinfo load NSS modules through dlopen on first use. Block mprotect and an innocuous-looking library call fails the first time it reaches the loader.
The general pattern: glibc and the dynamic loader have invisible dependencies on mmap, mprotect, arch_prctl, sigaltstack, prctl, rseq, and others. A “minimal” syscall whitelist generated by strace on a happy-path test will miss many of these because they only fire on certain code paths: error handling, signal delivery, malloc growth, TLS allocation. The application crashes hours into production with SIGSYS.
Solution: build seccomp profiles iteratively, log first, kill later
The first profile you deploy should be SCMP_ACT_LOG (audit-only). Run the application under realistic load, including failure scenarios, and watch /var/log/audit/audit.log for SECCOMP events. Add anything legitimate to the allowlist. Only after a stable observation window do you flip to SCMP_ACT_KILL_PROCESS.
// Phase 1: log mode -- production observation
seccomp_init(SCMP_ACT_LOG);

// Phase 2: enforce, but with structured fallback
seccomp_init(SCMP_ACT_ERRNO(EPERM));  // return EPERM to userspace
// gives the app a chance to log and fail gracefully

// Phase 3: hard enforcement
seccomp_init(SCMP_ACT_KILL_PROCESS);
For containers, do not write a seccomp profile from scratch. Start from Docker's default profile, audit it for your workload, and tighten. Tools that record the syscalls a real workload makes, such as oci-seccomp-bpf-hook, help generate realistic profiles.
Practical rule: if a syscall is required for signal return or process exit (rt_sigreturn, restart_syscall, exit, exit_group), it stays unconditionally. Locking these out turns recoverable errors into hard kills or hung processes.
Pitfall 3: ASLR with insufficient entropy (32-bit and PIE-disabled binaries)
ASLR works by randomizing base addresses. The strength is set by the entropy in those addresses. On 64-bit systems, libraries get ~28-30 bits of randomization, which makes brute force impractical. On 32-bit systems, you have at most 16 bits of entropy for shared libraries, and given typical alignment and layout constraints, often closer to 8-12 effective bits. That is 256 to 4096 guesses to defeat. A network-facing service that survives crashes (forking server, supervisor that restarts) gives an attacker effectively unlimited attempts.
Worse, ASLR only randomizes binaries that opted in. If a binary is built without -fPIE -pie, its .text section sits at a fixed address regardless of ASLR settings. CVE history is rich with examples: Apache modules built without PIE on RHEL, vendor binaries shipped without ASLR-aware compilation flags, JIT-compiled code regions that the runtime maps at deterministic addresses.
Solution: enforce 64-bit, PIE, and full RELRO at the toolchain level
# Compiler flags for ASLR-aware binaries
# (-O2 is included because _FORTIFY_SOURCE only takes effect with optimization enabled)
gcc -O2 -fPIE -pie -Wl,-z,relro,-z,now -fstack-protector-strong \
    -D_FORTIFY_SOURCE=2 myapp.c -o myapp

# Verify the result
checksec --file=./myapp
# Expected: PIE enabled, Full RELRO, NX enabled, Canary found, FORTIFY enabled
Audit your fleet with checksec or hardening-check across every shipped binary. Treat any binary without PIE as a finding.
For 32-bit, the honest answer in 2026 is “do not ship 32-bit network services.” If you have legacy 32-bit binaries that must remain, run them inside a stricter sandbox: gVisor, Firecracker, or at minimum a dedicated user namespace with no network capabilities. The CPU is the wrong place to defend a 32-bit address space against a determined attacker.
Modern bonus: enable -fcf-protection=full on x86 to opt into Intel CET (Indirect Branch Tracking and Shadow Stack), and -mbranch-protection=standard on ARM for PAC and BTI. These are the hardware-supported successors to ASLR-only defenses.
Pitfall 4: namespace escape via /proc/self vs procfs assumptions
User namespaces let unprivileged users gain capabilities scoped to a new namespace. The classic attack pattern: create a user namespace, become “root” inside it, then exploit a kernel bug that does not properly check whether your capability is namespaced or global. Pre-2018 kernels were riddled with these checks-without-namespaces, leading to escapes via mount, keyctl, and bpf.
The procfs version of the same trap: /proc/self resolves relative to the kernel’s view of the calling process, which can differ from the namespace’s view in subtle ways. A container that mounts /proc from the host (rather than its own private procfs) leaks information about every process on the host, and /proc/self/exe and /proc/self/root can be used to bypass chroot in some configurations. Worse, /proc/<pid>/setgroups, /proc/<pid>/uid_map, /proc/<pid>/gid_map are the gatekeepers for user namespace permissions: a misconfigured container that allows write access to these can be escaped from.
In 2019, runc had CVE-2019-5736: a container could overwrite the host’s runc binary by exploiting the way /proc/self/exe resolved at exec time. The fix was substantial: runc now copies its own binary into a memfd and re-execs from there.
Solution: defense in depth around procfs, plus user-namespace guardrails
# Mount /proc with hidepid -- containers cannot see other processes
mount -t proc -o hidepid=2 proc /proc

# Disable unprivileged user namespaces if you do not use them
# (this sysctl is a Debian/Ubuntu kernel patch; on other kernels use user.max_user_namespaces=0)
sysctl -w kernel.unprivileged_userns_clone=0

# In containers, mount /proc as a fresh procfs scoped to the namespace
# Most container runtimes do this; verify with /proc/self/mountinfo
For container runtimes, follow the runc CVE-2019-5736 lesson: never re-exec from a path the sandbox can write to. Modern runtimes use memfd_create plus execveat to load the runtime binary from a memory-backed fd that no namespaced process can touch.
Auditing approach: enumerate every path in /proc your container can read or write, and for each, ask “what does this give an attacker if they can write arbitrary bytes here?” The answers are sometimes scary: /proc/sys/kernel/core_pattern accepts pipe-to-program syntax, which lets a container that can write it execute host commands by triggering a core dump. CVE-2022-0492 abused the same kind of host-side helper hook through the cgroup v1 release_agent file.
Stronger pattern: use rootless containers (Podman’s default, Docker’s optional mode) so that root inside the container maps to an unprivileged user on the host. The escape primitives that need real CAP_SYS_ADMIN or host root simply do not apply.
Q1: How does NX/DEP prevent code execution on the stack?
NX (No-Execute) / DEP (Data Execution Prevention) uses the CPU’s NX bit in page table entries.
Page Table Entry Structure (x86-64):
Bit 63: NX (No-Execute) bit
When set: Page cannot be executed (will fault with #PF if IP points here)
When clear: Page is executable
Kernel Implementation:
// When mapping the stack
vma->vm_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
// NO VM_EXEC flag!
// The PTE takes its protection bits from the VMA, so the NX bit ends up SET
pte = pfn_pte(pfn, vma->vm_page_prot);  // user page protection, no execute
set_pte(pte_addr, pte);
Protection:
Attacker overflows buffer on stack
Injects shellcode
Overwrites return address to point to shellcode
Function returns, jumps to shellcode address
CPU checks NX bit → Page is not executable
#PF (Page Fault) → Kernel delivers SIGSEGV, which normally kills the process
W^X Policy: Page is writable OR executable, never both.
Stack/Heap: Writable, NOT executable
Code: Executable, NOT writable
Prevents: Code injection attacks
Bypass: Return-Oriented Programming (ROP) - reuse existing executable code instead of injecting new code.
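The W^X rule can be checked from userspace by reading the permission column of a process's memory map. The sketch below assumes the usual /proc/<pid>/maps line layout; `SAMPLE_MAPS` is invented data and `writable_and_executable` is a hypothetical helper name.

```python
# Sketch: find mappings that are both writable and executable (W^X violations).
# SAMPLE_MAPS is illustrative; real input would be the text of /proc/<pid>/maps.

SAMPLE_MAPS = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/demo
00651000-00652000 rw-p 00051000 08:02 173521 /usr/bin/demo
7f0000000000-7f0000021000 rwxp 00000000 00:00 0
7ffd5c3e1000-7ffd5c402000 rw-p 00000000 00:00 0 [stack]
"""


def writable_and_executable(maps_text):
    """Return (address_range, perms, name) for every mapping with w and x set."""
    violations = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip malformed lines
        address_range, perms = fields[0], fields[1]
        name = fields[5] if len(fields) > 5 else "[anonymous]"
        if "w" in perms and "x" in perms:
            violations.append((address_range, perms, name))
    return violations


if __name__ == "__main__":
    for entry in writable_and_executable(SAMPLE_MAPS):
        print("W^X violation:", entry)
```

In the sample, the text segment is r-x and the stack is rw-, as the policy expects; only the anonymous rwx region is reported. JIT compilers are a common legitimate source of such regions.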
Q2: Explain ASLR and how it prevents exploitation. What are its weaknesses?
ASLR (Address Space Layout Randomization) randomizes memory layout at program start.
Randomized Regions:
Stack base address
Heap base address
Libraries (libc, etc.)
Executable base (if PIE - Position Independent Executable)
vDSO, vvar
Entropy (x86-64 Linux, typical figures; actual values depend on kernel version and settings such as vm.mmap_rnd_bits):
Stack: 19 bits → 524,288 possible positions
Heap: 13 bits → 8,192 possible positions
Libraries: 28 bits → 268 million possible positions
PIE executable: 28 bits → 268 million possible positions
How It Prevents Exploitation:
Traditional exploit (no ASLR):
Attacker knows: libc is at 0x7ffff7a0d000
Attacker's ROP chain:
  return to 0x7ffff7a52390 (system)
  argument: 0x7ffff7b99d88 ("/bin/sh")
Works every time!
With ASLR:
Run 1: libc at 0x7f8a2e456000
Run 2: libc at 0x7f3c81de2000
Run 3: libc at 0x7fb1c9a2f000
Attacker's hardcoded addresses: WRONG!
Exploit crashes instead of succeeding
Weaknesses:
Information Leak:
Pointer disclosure → calculate base addresses → bypass ASLR
Format string bugs, memory corruption leaks
Entropy Limitations:
13 bits (heap) = 8,192 attempts
If process doesn’t crash (fork server), brute-forceable
32-bit Systems:
Limited address space → low entropy
8 bits library randomization → 256 attempts
Non-PIE Executables:
Main executable at fixed address
Contains ROP gadgets at known addresses
Cache Timing Attacks:
Side-channel attacks can determine addresses
Mitigations for Weaknesses:
Use PIE (Position Independent Executable)
Fix information leaks
Restart crashed workers via exec rather than fork alone, so each attempt faces a freshly randomized layout
Use Control Flow Integrity (CFI)
Combine with other defenses (NX, stack canaries)
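The entropy figures above translate directly into brute-force effort. The sketch below reproduces that arithmetic; the function names are hypothetical, and the two guessing models (layout fixed across attempts, as with a fork server, versus re-randomized on every attempt) are simplifying assumptions.

```python
# Sketch: how ASLR entropy bits map to positions and expected guesses.

def positions(entropy_bits):
    """Number of equally likely base addresses for a given entropy."""
    return 2 ** entropy_bits


def expected_guesses(entropy_bits, layout_fixed_across_attempts):
    """Average number of guesses until the right base address is hit.

    Fixed layout (fork server): the attacker never repeats a wrong guess,
    so the mean is (n + 1) / 2.
    Fresh layout per attempt: each guess succeeds with probability 1/n,
    so the mean is n.
    """
    n = positions(entropy_bits)
    if layout_fixed_across_attempts:
        return (n + 1) / 2
    return float(n)


if __name__ == "__main__":
    for region, bits in [("heap", 13), ("stack", 19), ("libraries", 28)]:
        print(region, positions(bits), expected_guesses(bits, True))
```

The gap between the two models is why re-randomizing on restart matters: a fixed layout lets the attacker eliminate candidates, while a fresh layout makes every attempt independent.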
Q3: How do stack canaries detect buffer overflows? Can they be bypassed?
Stack Canary: Random value placed between local variables and return address.
Mechanism:
Stack Frame:
┌──────────────────┐ High address
│ Return Address   │ ← Protected
├──────────────────┤
│ Saved RBP        │
├──────────────────┤
│ CANARY (random)  │ ← __stack_chk_guard (stored in TLS)
├──────────────────┤
│ Local vars       │
│ char buf[100]    │ ← Overflow starts here
└──────────────────┘ Low address

Function Prologue:
    mov rax, fs:0x28        ; Load canary from TLS
    mov [rbp-8], rax        ; Store on stack
Function Epilogue:
    mov rax, [rbp-8]        ; Load stack canary
    xor rax, fs:0x28        ; Compare with original
    je .L_OK                ; Match? OK
    call __stack_chk_fail   ; Mismatch? ABORT
.L_OK:
    ret
Detection:
Buffer overflow overwrites local variables
Overflow continues, overwrites canary
Function reaches its epilogue, just before the ret
Compiler-inserted code (not the kernel) checks: stack canary == reference value in TLS?
Mismatch → Stack smashing detected! → abort()
Bypass Techniques:
1. Leak Canary:
// Format string vulnerability
printf(user_input);  // User provides: "%p %p %p ..."
// Leaks stack contents, including canary!
// Attacker:
// 1. Leak canary value
// 2. Craft overflow to include correct canary value
// 3. Overflow succeeds without detection
2. Overwrite Pointer Before Canary:
// Conceptual stack layout, from low to high address
char buf[100];
char *ptr = &authorized;
unsigned long canary;
void *return_address;

// Overflow overwrites ptr but not canary
strcpy(buf, malicious_input);  // Overflow only buf and ptr
// ptr now points to attacker-controlled memory
// Canary intact → No detection!
3. Fork Without Re-randomization (rare):
// Parent forks children with same canary
while (1) {
    if (fork() == 0) {
        handle_request();  // Sandbox child
        exit(0);
    }
}
// Attacker brute-forces canary byte-by-byte
// Try 0x00, 0x01, 0x02, ... 0xFF for first byte
// If child crashes: wrong guess
// If child doesn't crash: correct! Move to next byte
// 8 bytes × 256 attempts = 2,048 attempts max
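4. Partial Overflow (non-contiguous write):
// Overwrite only the return address, not the canary.
// A linear overflow from buf cannot skip the canary; this needs a write
// primitive that lands past it, such as an attacker-controlled array index.
// (Requires knowledge of stack layout)
┌──────────────┐
│ Ret Addr     │ ← Overwrite 1 byte (change low byte only)
├──────────────┤
│ Saved RBP    │ ← Skip
├──────────────┤
│ Canary       │ ← Leave untouched!
├──────────────┤
│ buf[100]     │
└──────────────┘
// Targeted write changes return address without touching canary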
Mitigations:
Combine with ASLR (randomize canary address)
Use fortified functions (__strcpy_chk, enabled by _FORTIFY_SOURCE) to prevent overflows
Re-randomize canary after fork
Stack Clash protection (prevent jumping over the stack guard page)
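The 2,048-attempt bound for the fork-without-re-randomization case can be checked with a pure simulation. Nothing here touches a real process: `make_oracle` is a hypothetical stand-in for "did the forked child survive", assuming every child inherits the same 8-byte canary.

```python
# Sketch: simulate byte-by-byte canary recovery against a forking server
# whose children all share one canary. Simulation only.
import os

CANARY_SIZE = 8


def make_oracle(canary):
    """Return (child_survives, calls). child_survives(prefix) is True when
    overwriting the first len(prefix) canary bytes with prefix leaves them
    unchanged, i.e. the simulated child does not abort."""
    calls = {"count": 0}

    def child_survives(prefix):
        calls["count"] += 1
        return canary[:len(prefix)] == prefix

    return child_survives, calls


def brute_force(child_survives):
    """Recover the canary one byte at a time, at most 256 tries per byte."""
    recovered = b""
    for _ in range(CANARY_SIZE):
        for guess in range(256):
            candidate = recovered + bytes([guess])
            if child_survives(candidate):
                recovered = candidate
                break
        else:
            # Happens if the canary changes between attempts
            raise RuntimeError("no byte value survived")
    return recovered


if __name__ == "__main__":
    secret = b"\x00" + os.urandom(CANARY_SIZE - 1)
    oracle, calls = make_oracle(secret)
    assert brute_force(oracle) == secret
    print(f"recovered in {calls['count']} attempts (max {CANARY_SIZE * 256})")
```

If the oracle compared against a fresh random canary on every call, as re-randomizing after fork would, the inner loop would stop finding stable matches and the attacker would be back to guessing all 8 bytes at once.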
Q4: What is the difference between capabilities and setuid? Why are capabilities better?
Traditional setuid:
# setuid binary runs with owner's privileges
ls -l /usr/bin/passwd
# -rwsr-xr-x root root /usr/bin/passwd
#    ↑ setuid bit
# When user runs passwd:
# 1. Process starts with user's UID
# 2. Kernel sees setuid bit
# 3. Sets effective UID to file owner (root)
# 4. Process has FULL root privileges
# Problem: All or nothing!
# passwd only needs to write /etc/shadow
# But gets ALL root capabilities
Capabilities:
Divide root into 41 distinct privileges:
CAP_CHOWN            - Change file ownership
CAP_DAC_OVERRIDE     - Bypass file permissions
CAP_NET_BIND_SERVICE - Bind ports < 1024
CAP_NET_RAW          - Use raw sockets
CAP_SYS_ADMIN        - System administration
CAP_SYS_MODULE       - Load kernel modules
... 35 more ...
Process gets ONLY what it needs!
Comparison:
| Feature | setuid | Capabilities |
| --- | --- | --- |
| Granularity | All or nothing | Fine-grained (41 capabilities) |
| Security | Over-privileged | Least privilege |
| Persistence | Lost on exec (unless binary is setuid) | Can be inherited |
| Auditability | Hard to see why root is needed | Clear which capability is used |
Example: Network Server:
// Old way: setuid root binary
int main() {
    // Running as root (UID 0)
    // Can do ANYTHING!
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    bind(sock, ..., 80);   // Bind port 80 (needs root)
    setuid(nobody);        // Drop privileges after bind
    // Problem: Race window while root
    // If exploit before setuid(), full root access!
}

// New way: Capabilities
int main() {
    // Running as nobody (UID 65534)
    // Has ONLY CAP_NET_BIND_SERVICE
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    bind(sock, ..., 80);             // Works! (has capability)
    open("/etc/shadow", O_RDONLY);   // FAIL! (no CAP_DAC_OVERRIDE)
    // Even if exploited, attacker only has port binding
    // Cannot read files, cannot exec as root, etc.
}
Setting Capabilities:
# Give binary capability instead of setuid
# Before:
chmod u+s /usr/bin/ping              # setuid root (dangerous!)
# After:
setcap cap_net_raw+ep /usr/bin/ping  # Only raw socket capability
# Verify
getcap /usr/bin/ping
# /usr/bin/ping = cap_net_raw+ep
# Now ping can create raw sockets but has NO other root powers
Why Capabilities Are Better:
Principle of Least Privilege: Only grant necessary permissions
Reduced Attack Surface: Exploit gets limited capabilities, not full root
Better Auditability: Clear why each capability is needed
Flexibility: Can grant to non-root users
Inheritance: Can design capability-aware services
Real-World Usage:
systemd services with capabilities
Docker containers (run as non-root with specific capabilities)
Network daemons (CAP_NET_BIND_SERVICE instead of setuid)
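To see which capabilities a running process actually holds, the CapEff line in /proc/<pid>/status can be decoded bit by bit. The sketch below is a rough illustration: `CAP_NAMES` lists only the capabilities named in this section, with bit numbers taken from linux/capability.h, and other bits are printed generically; the helper names are hypothetical.

```python
# Sketch: decode a CapEff hex mask from /proc/<pid>/status into capability names.
# CAP_NAMES is deliberately partial; unlisted bits are reported as "cap_<n>".

CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(hex_mask):
    """Return the names of all capability bits set in hex_mask."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


def cap_eff_from_status(status_text):
    """Extract the CapEff hex string from the text of a status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    return None


if __name__ == "__main__":
    sample_status = "Name:\tdemo\nCapEff:\t0000000000000400\n"  # made-up sample
    print(decode_caps(cap_eff_from_status(sample_status)))
```

A mask of 0000000000000400 has only bit 10 set, which matches the "nobody plus CAP_NET_BIND_SERVICE" server described above; a mask with all 41 low bits set corresponds to traditional full root.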
Q5: How does KPTI mitigate Meltdown? What is the performance cost?
Meltdown Vulnerability:
CPU speculatively executes kernel memory access from user mode:
// User-mode code
char kernel_byte = *(char *)0xffff880000000000;  // Kernel address
// CPU behavior:
// 1. Starts speculative execution before permission check
// 2. Loads kernel memory (should fault, but hasn't checked yet)
// 3. Uses loaded byte to index array: probe[kernel_byte * 4096]
// 4. This brings probe[...] into cache ← SIDE EFFECT!
// 5. Permission check completes → Exception!
// 6. Architectural state rolled back
// 7. But cache state remains! ← LEAK!
// Attacker measures cache timing → recovers kernel_byte
KPTI (Kernel Page Table Isolation) Solution:
Without KPTI (vulnerable):
┌─────────────────────────────┐
│ User Mode (CR3 = user_pgd)  │
│                             │
│ User virtual addresses      │
│ 0x0 - 0x7fffffffffff        │
│ │                           │
│ ├─> User pages              │
│ │                           │
│ Kernel virtual addresses    │
│ 0xffff800000000000 - ...    │ ← Mapped in user page tables!
│ │                           │ ← Meltdown can speculatively access
│ ├─> Kernel pages            │
└─────────────────────────────┘

With KPTI (secure):
User Mode:
┌─────────────────────────────┐
│ CR3 = user_pgd              │
│                             │
│ User virtual addresses      │
│ 0x0 - 0x7fffffffffff        │
│ ├─> User pages              │
│                             │
│ Kernel virtual addresses    │
│ 0xffff800000000000 - ...    │
│ ├─> MINIMAL kernel stubs    │ ← Only entry/exit trampolines!
│ │   (entry_SYSCALL_64)      │ ← Rest of kernel NOT MAPPED
└─────────────────────────────┘

Kernel Mode (after syscall):
┌─────────────────────────────┐
│ CR3 = kernel_pgd            │
│                             │
│ User virtual addresses      │
│ ├─> User pages              │
│                             │
│ Kernel virtual addresses    │
│ ├─> FULL kernel mapping     │ ← All kernel code/data accessible
└─────────────────────────────┘
Syscall Flow with KPTI:
; User-mode application
mov rax, 1          ; SYS_write
mov rdi, 1          ; fd = stdout
syscall             ; Enter kernel
; ← CPU switches to kernel mode ←
entry_SYSCALL_64:   ; Still using user page tables!
    swapgs              ; Switch to kernel GS base (per-CPU data)
    ; SWITCH PAGE TABLES (the expensive part!)
    ; Simplified: real code uses a scratch register so rax keeps the syscall number
    mov rax, CR3        ; Read current CR3 (user page table)
    and rax, ~0x1000    ; Clear bit 12: Linux keeps the user PGD one page above the kernel PGD
    mov CR3, rax        ; ← PAGE TABLE SWITCH (TLB flush unless PCID is in use)
    ; Now kernel is fully mapped
    ; Execute syscall handler...
    call do_syscall_64
    ; SWITCH BACK to user page tables
    mov rax, CR3
    or rax, 0x1000      ; Set bit 12: back to the user PGD
    mov CR3, rax        ; ← PAGE TABLE SWITCH (TLB flush unless PCID is in use)
    swapgs
    sysretq             ; Return to user mode
// Victim code
if (x < array_size) {   // ← Branch
    y = array[x];       // ← Speculative execution
}
CPU's Branch Predictor:
- Predicts if branch will be taken or not
- Speculatively executes ahead while check happens
- If prediction wrong, rollback
- If prediction right, save time!
Problem: Rollback discards architectural state but NOT cache state!
Attack:
// Step 1: Train branch predictor
for (int i = 0; i < 1000; i++) {
    victim_function(valid_x);  // x < array_size, branch TAKEN
}
// Branch predictor learns: "This branch is ALWAYS taken"

// Step 2: Prepare side-channel
for (int i = 0; i < 256; i++) {
    clflush(&probe_array[i * 4096]);  // Flush cache
}

// Step 3: Attack with out-of-bounds x
victim_function(malicious_x);  // malicious_x >= array_size
// What happens:
// 1. Branch predictor predicts: TAKEN (based on training)
// 2. CPU speculatively executes: y = array[malicious_x]
// 3. This accesses out-of-bounds memory (any secret mapped in the victim's address space)
// 4. Uses leaked byte to index: probe_array[y * 4096]
// 5. This brings probe_array[y * 4096] into cache ← LEAK!
// 6. Branch check completes: x < array_size? FALSE
// 7. Rollback! Discard y, but cache state remains!

// Step 4: Recover leaked byte via timing
for (int i = 0; i < 256; i++) {
    t0 = rdtsc();
    temp = probe_array[i * 4096];
    t1 = rdtsc();
    if (t1 - t0 < THRESHOLD) {
        printf("Leaked byte: 0x%02x\n", i);  // Cache hit!
        break;
    }
}
// Result: Read arbitrary memory across privilege boundaries!
Why Retpolines Work:
Problem with Indirect Branches:
; Vulnerable indirect jump
jmp *%rax   ; Jump to address in rax
; Attacker can manipulate branch predictor to:
; 1. Predict wrong target
; 2. Cause speculative execution to gadget
; 3. Leak data via cache side-channel
Retpoline (Return Trampoline):
; Instead of: jmp *%rax
; Use:
    call .set_target        ; Push address of .capture_spec as the return address
.capture_spec:              ; Reached only speculatively
    pause                   ; Spin-loop hint while speculation is trapped here
    lfence                  ; Stop speculation from running ahead
    jmp .capture_spec       ; Infinite loop (never executed architecturally)
.set_target:
    mov %rax, (%rsp)        ; Replace return address with rax
    ret                     ; Return to rax
; Why this is safe:
; CPU's Return Stack Buffer (RSB):
; - Separate predictor for RET instructions
; - Tracks call/return pairs
; - Not steered by the poisoned indirect-branch predictor
; When ret executes:
; - CPU predicts target from RSB
; - RSB says: return to .capture_spec
; - Speculative execution goes to .capture_spec
; - NOT to attacker-controlled address!
Visual Comparison:
Traditional Indirect Jump (vulnerable):
┌─────────────┐
│ jmp *rax    │ → Branch predictor → Attacker controls prediction
└─────────────┘
       ↓
Speculative execution to gadget
       ↓
Leak via cache timing

Retpoline (safe):
┌─────────────┐
│ call .label │ → Push return addr on stack
│ .label:     │
│ mov rax,SP  │ → Replace return addr with rax
│ ret         │ → RSB predicts return to .capture
└─────────────┘   (NOT attacker-controlled!)
       ↓
.capture_spec:
    pause
    lfence
    jmp .capture_spec   ← Speculation contained in loop
                        ← No leak possible!
Kernel Implementation:
// Compiler generates retpolines for indirect branches
// gcc -mindirect-branch=thunk-extern

// Original code:
void (*func_ptr)(void);
func_ptr();  // Indirect call

// Compiled without retpoline:
call *%rax

// Compiled with retpoline:
call __x86_indirect_thunk_rax

// Retpoline thunk (arch/x86/lib/retpoline.S):
SYM_FUNC_START(__x86_indirect_thunk_rax)
    JMP_NOSPEC %rax
SYM_FUNC_END(__x86_indirect_thunk_rax)

#define JMP_NOSPEC(reg) \
    call .Ldo_rop_##reg; \
.Lspec_trap_##reg: \
    pause; \
    lfence; \
    jmp .Lspec_trap_##reg; \
.Ldo_rop_##reg: \
    mov %reg, (%rsp); \
    ret
Performance Impact:
Retpolines are slower than ordinary indirect branches (5-20% overhead, depending on how many indirect branches the workload executes)
But necessary for security on vulnerable CPUs
Modern CPUs have hardware mitigations (IBRS - Indirect Branch Restricted Speculation)
Default profile restricts mount, capabilities, etc.
Custom profiles for specific containers
Path-based restrictions easier to understand
8. Policy Portability:
SELinux:
Labels stored with files (xattrs)
Policy is separate from filesystem
Moving files between systems: labels can be lost
Need to relabel after restore from backup
AppArmor:
Policy references absolute paths
Moving profile to different system: works if paths same
But path changes require profile updates
Recommendation Matrix:
| Priority | Choose |
| --- | --- |
| Maximum security | SELinux |
| Ease of use | AppArmor |
| Fine-grained control | SELinux |
| Simple policies | AppArmor |
| RHEL/CentOS | SELinux (default) |
| Debian/Ubuntu | AppArmor (default) |
| NFS/non-xattr FS | AppArmor |
| MLS/MCS required | SELinux |
| Container host | Both work (SELinux more common) |
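Can you use both?: Not in practice. SELinux and AppArmor are both “exclusive” major LSMs, and mainline kernels allow only one of them to be active at a time. Choose one.
Neither?: Not recommended. MAC adds a significant security layer beyond DAC.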
Q8: How does seccomp-BPF work and why is it critical for container security?
Seccomp-BPF (Secure Computing with Berkeley Packet Filter):
Core Concept: Whitelist syscalls a process can make using BPF bytecode filters.
Architecture:
User Space Process
        │
        │ syscall (e.g., open, read, write)
        ↓
┌─────────────────────────────┐
│ Syscall Entry Point         │
│ (entry_SYSCALL_64)          │
└─────────┬───────────────────┘
          │
          │ ① Check: Seccomp filter installed?
          ↓
┌─────────────────────────────┐
│ Seccomp BPF Filter          │
│                             │
│ BPF Program:                │
│ - Load syscall number       │
│ - Load arguments            │
│ - Check against rules       │
│ - Return action:            │
│   • ALLOW                   │
│   • KILL                    │
│   • ERRNO                   │
│   • TRAP                    │
└─────────┬───────────────────┘
          │
          │ ② Action
          ↓
ALLOW? ──────────────────> Execute Syscall
KILL?  ──────────────────> SIGSYS (kill process)
ERRNO? ──────────────────> Return error code
TRAP?  ──────────────────> Send SIGSYS signal (catchable by a handler)
Explain how ASLR, stack canaries, and NX bits work together to defend against buffer overflow attacks. What is the attack sequence an adversary must defeat, and where does each mitigation intervene?
Strong Answer:These three mechanisms form a layered defense against the classic buffer overflow attack chain. To understand why you need all three, walk through what an attacker must accomplish to exploit a buffer overflow:
Step 1: Overwrite the return address — The attacker provides input that overflows a stack buffer and overwrites the saved return address (RIP) on the stack, redirecting execution to attacker-controlled code.
Stack canaries intervene here. A random value (the “canary”) is placed between local variables and the saved return address at function entry. Before the function returns, the compiler-inserted code checks if the canary was modified. If it was (because the overflow overwrote it on the way to the return address), the program aborts immediately. The attacker must either guess the canary (up to 2^64 possibilities on 64-bit; glibc zeroes the low byte, leaving 2^56) or find a way to overwrite the return address without touching the canary (possible with format string bugs or non-contiguous overwrites, but much harder).
Step 2: Redirect execution to shellcode — If the attacker bypasses the canary, they redirect execution to injected code (shellcode) in the buffer itself.
NX (No-Execute) / DEP intervenes here. The stack (and heap, and data sections) are marked non-executable at the page table level. The CPU enforces this in hardware: executing an instruction from an NX page triggers a page fault. The attacker’s shellcode on the stack cannot execute. This forces the attacker to use Return-Oriented Programming (ROP) — chaining existing code snippets (“gadgets”) from the binary and libraries.
Step 3: Locate usable code gadgets — The attacker needs to find executable code at known addresses to build ROP chains.
ASLR intervenes here. The kernel randomizes the base addresses of the stack, heap, and shared libraries at each process start, and (with KASLR) the kernel’s own base address at each boot. The attacker cannot hard-code addresses of gadgets because they change every run. On 64-bit systems, the entropy is typically 28-30 bits for library randomization, making brute force impractical.
Together, the attacker must bypass the canary (hard without an information leak), cannot inject code (NX), and cannot easily find existing code to reuse (ASLR). Breaking one is insufficient; you need to break at least two.
Where it still fails:
Information leaks: A separate vulnerability that leaks memory addresses (e.g., a format string bug that prints stack values) can defeat both ASLR (reveals addresses) and canaries (reveals the canary value). This is why modern defenses add CFI (Control Flow Integrity) as a fourth layer: even if the attacker knows addresses, they cannot redirect execution to arbitrary gadgets because compiler-inserted checks or hardware support (such as Intel CET) verify that indirect branches target valid function entries.
Follow-up: What is KASLR and why was KPTI needed despite it?
KASLR randomizes the kernel’s base address in virtual memory at each boot. The idea is that even if an attacker has a kernel vulnerability, they cannot exploit it without knowing where kernel functions are located. KPTI (Kernel Page Table Isolation) was needed because the Meltdown vulnerability allowed user-space code to speculatively read kernel memory through the CPU’s speculative execution, bypassing KASLR entirely: the attacker could read kernel addresses at ~500KB/s and then use those addresses for their exploit. KPTI unmaps the kernel from user-space page tables entirely, so there is nothing for Meltdown to speculatively read. The cost is that every syscall now requires a CR3 switch between user and kernel page tables (5-30% overhead on older CPUs).
Compare seccomp-BPF and SELinux as security mechanisms. If you were hardening a container running an untrusted workload, which would you use and why?
Strong Answer:Seccomp-BPF and SELinux operate at completely different layers and are complementary, not interchangeable.
Seccomp-BPF (System Call Filter): Intercepts every syscall at the entry point and runs a BPF filter that decides allow/deny/kill based on the syscall number and (with some limitations) its arguments. It answers: “Can this process invoke this kernel API?” Seccomp cannot distinguish between files, network addresses, or process targets — if you allow open(), the process can open any file. If you allow connect(), it can connect to any address.
SELinux (Mandatory Access Control): Assigns security labels to every process, file, socket, and kernel object. A policy defines which labels can perform which operations on which other labels. It answers: “Can this specific subject access this specific object in this specific way?” SELinux can say “process with label httpd_t can read files with label httpd_content_t but cannot write to files with label etc_t.” This is far more granular than seccomp.
For hardening an untrusted container, I would use both:
Seccomp-BPF: Block all syscalls the container does not need. A web server does not need mount, reboot, kexec_load, ptrace, init_module, or io_uring_setup. Docker ships a default seccomp profile that, per Docker’s documentation, disables around 44 of the 300+ syscalls. For untrusted workloads, I would create a custom profile that allowlists only the ~50 syscalls the application actually uses (determined by running strace during testing). This shrinks the kernel attack surface enormously, because most kernel CVEs are in obscure syscall handlers that a web server never touches.
SELinux (or AppArmor): Apply a policy that restricts what the container can access even with the allowed syscalls. The container process can call open(), but SELinux ensures it can only open files in its designated directory. It can call connect(), but SELinux restricts it to specific ports and network labels. This prevents a compromised container from reading /etc/shadow, connecting to the metadata service (a common cloud attack vector), or accessing the Docker socket.
The layers complement each other: seccomp removes dangerous kernel entry points, SELinux restricts what the remaining entry points can access. Neither alone is sufficient. Seccomp without SELinux means a process with open() allowed can read any file. SELinux without seccomp means a process can invoke dangerous syscalls (even if they fail due to policy, the syscall handler code still runs, potentially triggering kernel bugs).
Follow-up: What is the performance overhead of running both seccomp-BPF and SELinux simultaneously?
Seccomp-BPF adds 10-50 nanoseconds per syscall (running a small BPF program in the syscall entry path). SELinux adds 100-500 nanoseconds per security check (which happens on syscalls that access objects, such as file open and socket connect). For a web server making 10K syscalls per second, the combined overhead is roughly 0.5-5 milliseconds per second, which is negligible. For a storage-intensive application making 500K syscalls per second, the overhead is 25-250 milliseconds per second (2.5-25% of one core). The practical impact depends entirely on the syscall rate. For most workloads, the overhead is under 1% and invisible in application-level metrics. The security benefit far outweighs the cost.
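The overhead figures in the follow-up are simple rate-times-cost arithmetic. The sketch below reproduces them under one assumption: the quoted totals correspond to roughly 50 to 500 ns of combined seccomp plus SELinux cost per syscall, which is an inference from the numbers above rather than a measured value. The function names are hypothetical.

```python
# Sketch: per-second overhead from a per-syscall cost, as in the follow-up above.
# The 50 ns and 500 ns bounds are inferred from the quoted totals, not measured.

def overhead_ms_per_second(syscalls_per_second, ns_per_syscall):
    """Milliseconds of added CPU time per wall-clock second."""
    return syscalls_per_second * ns_per_syscall / 1_000_000


def percent_of_one_core(syscalls_per_second, ns_per_syscall):
    """Same overhead as a percentage of one core (1000 ms = 100%)."""
    return overhead_ms_per_second(syscalls_per_second, ns_per_syscall) / 10


if __name__ == "__main__":
    for rate in (10_000, 500_000):
        low = overhead_ms_per_second(rate, 50)
        high = overhead_ms_per_second(rate, 500)
        print(f"{rate} syscalls/s: {low}-{high} ms per second")
```

Because the relationship is linear in the syscall rate, measuring a workload's actual rate (for example with strace -c or perf) is enough to estimate where it falls in this range.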
Explain the Spectre vulnerability. How does it differ from Meltdown, and why is Spectre considered harder to mitigate fully?
Strong Answer:Both Spectre and Meltdown exploit speculative execution — the CPU’s optimization of executing instructions ahead of time before knowing whether they are needed. The critical difference is the trust boundary they violate.
Meltdown (CVE-2017-5754): Exploits the fact that on vulnerable Intel CPUs, speculative loads from kernel memory are not immediately stopped by the permission check. The CPU speculatively reads kernel data into a register, uses it to access a cache line (encoding the secret in a cache side channel), and then throws away the speculative result when the permission check fails. But the cache side channel remains — the attacker can probe which cache line was accessed and recover the kernel data. Meltdown crosses the user/kernel boundary and allows reading arbitrary kernel memory.
Spectre (CVE-2017-5753 Variant 1, CVE-2017-5715 Variant 2): Exploits the CPU’s branch prediction. In Variant 1 (bounds check bypass), the attacker trains the branch predictor to predict that a bounds check will pass, then triggers speculative execution past the check with an out-of-bounds index. The speculative load accesses secret data and encodes it in the cache. In Variant 2 (branch target injection), the attacker poisons the Branch Target Buffer (BTB) to redirect speculative execution of an indirect branch to attacker-chosen code (“gadgets”) within the victim’s address space.
Why Spectre is harder to mitigate:
Meltdown has a clean fix: KPTI (Kernel Page Table Isolation) unmaps the kernel from user-space page tables. If the kernel memory is not even mapped during user-space execution, the speculative load has nothing to read. The fix is at the OS level and is complete (with a 5-30% performance cost).
Spectre crosses any trust boundary: Spectre does not require reading kernel memory. It can leak data between processes, between VMs, between JavaScript contexts in a browser, between a sandbox and its host. Any code running on the same CPU can potentially be a Spectre victim or attacker.
Software mitigations are partial: Retpolines (replacing indirect branches with a return trampoline that defeats BTB poisoning) mitigate Variant 2 but add overhead to every indirect call. Array bounds masking (inserting an AND instruction after bounds checks to zero out speculative out-of-bounds accesses) mitigates Variant 1 but requires compiler changes and careful code auditing. Neither is a complete fix.
New variants keep appearing: Spectre is a class of vulnerabilities, not a single bug. Spectre-v3a, Spectre-RSB, Spectre-BHB, and MDS (Microarchitectural Data Sampling) are all variations on the same theme. Each requires its own mitigation.
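The array bounds masking mentioned above can be illustrated with its arithmetic alone. The sketch below is modeled on the branch-free mask used by the Linux kernel's generic array_index_mask_nospec(); Python cannot demonstrate speculation itself, so this only shows that the mask is all ones for in-bounds indices and zero otherwise. The 64-bit emulation and the name `clamp_index` are assumptions made for the example.

```python
# Sketch: branch-free index masking for Spectre Variant 1, emulated on 64-bit
# unsigned integers. Assumes 0 <= index < 2**64 and 0 < size <= 2**63.

MASK64 = (1 << 64) - 1


def array_index_mask_nospec(index, size):
    """Return all ones if index < size, else 0, without a conditional on index.

    (size - 1 - index) has its top bit set exactly when index >= size;
    OR-ing with index also catches indices with the top bit set. Inverting
    and taking the sign bit yields the mask.
    """
    v = (index | ((size - 1 - index) & MASK64)) & MASK64
    inverted = ~v & MASK64
    sign_bit = inverted >> 63
    return MASK64 * sign_bit  # emulates an arithmetic shift right by 63


def clamp_index(index, size):
    """Index to use for the load: unchanged in bounds, forced to 0 otherwise."""
    return index & array_index_mask_nospec(index, size)


if __name__ == "__main__":
    for idx in (3, 9, 10, 1 << 40):
        print(idx, clamp_index(idx, 10))
```

The point of the construction is that the clamped index depends on data flow rather than on a predicted branch, so a mistrained bounds check cannot make the subsequent load use the out-of-bounds value.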
The fundamental problem is that speculative execution is not a bug; it is a deliberate performance feature that provides 10-100x speedup for branch-heavy code. Disabling speculation entirely would reduce modern CPUs to 1990s performance levels. The industry is converging on hardware fixes in newer CPUs (Intel Golden Cove, AMD Zen 4) that add speculation barriers in microcode, but older hardware remains vulnerable.
Follow-up: How do cloud providers like AWS protect against cross-VM Spectre attacks on shared hardware?
Multiple layers: hardware partitioning (Intel CAT/MBA to partition the L3 cache between VMs, reducing cache side-channel leakage), microcode updates (clearing branch predictor state on VM entry/exit), hypervisor patches (KVM flushes speculation buffers on VMEXIT), and core scheduling (ensuring untrusted VMs do not share SMT siblings, since Hyper-Threading shares the branch predictor and L1 cache between logical cores). AWS’s Nitro system goes further by offloading virtualization to dedicated hardware, reducing the hypervisor attack surface. Despite all this, the most sensitive workloads (HSMs, cryptographic key storage) run on dedicated single-tenant hosts with no sharing.
Threat model: someone gives you a binary you have to run, and you have to assume it is malicious. What boundaries does Linux give you, and how do you compose them?
Strong Answer Framework:
Establish what the attacker should not be able to do. Before reaching for tools, define the boundary. “Cannot read /etc/shadow” is different from “cannot exfiltrate any data” is different from “cannot persist a backdoor.” Threat modeling forces you to choose mechanisms that match the goal.
Apply user separation as the floor. Run the binary as a dedicated unprivileged user with no shell, no sudo entries, no group memberships beyond its own. This is the cheapest layer and rules out 80 percent of trivial attacks. Anyone who skips this layer because “I have stronger mechanisms above” loses if the stronger mechanisms have a bug.
Drop capabilities to the minimum. Use prctl(PR_CAPBSET_DROP) to drop the bounding set, set PR_SET_NO_NEW_PRIVS so that setuid binaries and file capabilities cannot re-elevate across exec, set SECBIT_NOROOT so that UID 0 gets no automatic capabilities, and add only the capabilities the workload needs. For most workloads, the answer is zero capabilities.
Apply seccomp to shrink the kernel surface. Custom syscall whitelist generated from observed behavior. The kernel has 350+ syscalls; a typical workload uses 50-80. Blocking the rest closes off entire classes of kernel CVEs preemptively.
Use namespaces to make the world smaller. Mount namespace with a chroot or pivot_root into a private rootfs. Network namespace with no interfaces (or just a loopback). PID namespace so the process cannot see or signal anything outside. User namespace with the workload mapped to a non-overlapping host UID, so even root-in-namespace is unprivileged on the host.
Layer mandatory access control on top. SELinux or AppArmor profile that restricts what the workload can read or write even if it somehow got privilege. This is the layer that catches the bug in your seccomp profile.
Cgroup limits for blast radius. Memory limit, PID limit, CPU quota, IO weight. These do not stop intrusions, but they bound the damage of fork bombs, memory hogs, and crypto-miners-as-payload.
Audit what you cannot prevent. Even with all of the above, log every syscall through audit subsystem or eBPF tracing. The goal is detection within hours of a successful attack, not just prevention.
Real-World Example: Google Chrome’s renderer sandbox is the public reference design. Each renderer process drops all capabilities, applies a strict seccomp filter (about 65 syscalls allowed), runs in a user namespace with the renderer UID mapped to nobody, has no filesystem access (uses Mojo IPC to the privileged broker for file IO), and is restricted by SELinux on Android. The 2014 Pwn2Own attack on Chrome required chaining a renderer RCE with a seccomp escape and a kernel privilege escalation — three independent vulnerabilities. The 2024 attack on Chrome’s V8 still required two more bugs to escape the renderer sandbox to host code execution.
Senior follow-up 1: Why is user namespace mapping the root inside the namespace to a non-zero UID outside considered the strongest single primitive?
Because capability checks are made relative to the user namespace that owns the resource. If your namespace’s UID 0 maps to host UID 100000, a successful exploit that gives the attacker capabilities only does so within the namespace. Operations that affect the host kernel (loading modules, mounting filesystems on host paths, ptrace of host processes) require the capability in the initial user namespace, where the process is just unprivileged UID 100000. This is why rootless containers are a meaningful security improvement, not just a usability one.
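The mapping from namespaced UID to host UID is defined by the lines of /proc/<pid>/uid_map, each of the form "inside-start outside-start length". A minimal sketch of that translation follows; the helper names are hypothetical and the sample map is the 0 → 100000 example from the answer above.

```python
# Sketch: translate a UID inside a user namespace to the host UID using
# uid_map entries of the form "<inside_start> <outside_start> <length>".

def parse_uid_map(text):
    """Parse uid_map text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        inside, outside, length = (int(p) for p in parts)
        entries.append((inside, outside, length))
    return entries


def to_host_uid(entries, ns_uid):
    """Return the host UID for ns_uid, or None if the UID is unmapped."""
    for inside, outside, length in entries:
        if inside <= ns_uid < inside + length:
            return outside + (ns_uid - inside)
    return None


if __name__ == "__main__":
    rootless = parse_uid_map("0 100000 65536\n")
    print(to_host_uid(rootless, 0))      # root in the namespace
    print(to_host_uid(rootless, 1000))   # an ordinary user in the namespace
    print(to_host_uid(rootless, 70000))  # outside the mapped range
```

An identity map such as "0 0 4294967295" is what a process in the initial namespace sees; a container with that map gains nothing from the user namespace, because its root is host root.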
Senior follow-up 2: A seccomp profile is too restrictive in unpredictable ways. What is your debug strategy?
Set the default action to SECCOMP_RET_LOG instead of SECCOMP_RET_KILL, run the workload through realistic scenarios (not just the happy path: include error handling, signal delivery, malloc growth), and watch /var/log/audit/audit.log for SECCOMP records. Each entry shows the syscall number that would have been killed; map those to names with ausyscall. After a clean observation window, flip default action to SECCOMP_RET_ERRNO(EPERM) for one more cycle (so the application can fail gracefully), then to SECCOMP_RET_KILL_PROCESS for hard enforcement. Tools like containerd-shim’s seccomp recorder, Falco, and kubectl-trace automate this loop.
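The staged rollout described above (log, then errno, then kill) can be expressed as a profile generator. The sketch assumes the OCI/Docker JSON seccomp profile layout (`defaultAction`, `syscalls`, `names`, `action`) and the SCMP_ACT_* action names; `build_profile` and the observed syscall list are hypothetical.

```python
# Sketch: build an OCI-style seccomp profile from observed syscall names,
# with the default action chosen per rollout stage. Illustrative only.
import json

# Rollout order from the text: observe, fail gracefully, then enforce.
STAGES = ("SCMP_ACT_LOG", "SCMP_ACT_ERRNO", "SCMP_ACT_KILL_PROCESS")


def build_profile(observed_syscalls, default_action):
    """Allow exactly the observed syscalls; apply default_action to the rest."""
    if default_action not in STAGES:
        raise ValueError(f"unexpected default action: {default_action}")
    return {
        "defaultAction": default_action,
        "syscalls": [
            {
                "names": sorted(set(observed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


if __name__ == "__main__":
    observed = ["read", "write", "openat", "close", "read"]  # made-up sample
    for stage in STAGES:
        profile = build_profile(observed, stage)
        print(stage, json.dumps(profile["syscalls"][0]["names"]))
```

A profile written this way could be passed to a runtime that accepts OCI seccomp JSON, for example via Docker's --security-opt seccomp=<file>; a production profile would normally also list architectures and any argument filters.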
Senior follow-up 3: Where does gVisor fit relative to seccomp + namespaces, and when is the extra cost justified?
gVisor reimplements the Linux syscall surface in a userspace process (Sentry) that sits between the application and the host kernel. Calls that look like read() to the application are actually intercepted, validated, and either handled in Sentry or proxied to the host. This eliminates an entire class of risk: kernel CVEs in syscall handlers do not affect gVisor-sandboxed workloads because those handlers never execute on the host kernel for sandboxed traffic. The cost is real: gVisor adds 10-50 percent overhead on syscall-heavy workloads and is incompatible with some applications (Linux-namespace-specific tools, applications that mmap and then expect specific kernel behaviors). The cost is justified for workloads where you genuinely cannot trust the binary: shared CI runners, untrusted user code in PaaS, multi-tenant function execution. It is overkill for first-party microservices.
Common Wrong Answers:
“Just put it in Docker.” Docker by default runs as root inside the container, with a default set of about 14 capabilities, and a seccomp profile that still allows most syscalls. Docker is a packaging tool first; security depends on configuration that must be applied explicitly.
“Use a VM.” VMs have a smaller attack surface than namespaces against most threat models, but the hypervisor still has CVE history (CVE-2017-2596 KVM, CVE-2020-29569 Xen). Saying “use a VM” without acknowledging hypervisor risk hand-waves the problem.
“Drop capabilities and you are done.” Capabilities are necessary but not sufficient. A process with zero capabilities can still read every world-readable file, connect to localhost services, and exploit kernel bugs in syscalls that do not require capabilities.
Further Reading:
“Sandboxing and Workload Isolation” (Google production hardening guide) — the gVisor design rationale and threat model
Jess Frazelle, “Hard multi-tenancy in Kubernetes” — pragmatic stack for untrusted workloads
Linux source: kernel/seccomp.c, kernel/user_namespace.c, security/security.c for the LSM hook integration
Compare what seccomp, capabilities, namespaces, and Docker each contribute to container security. Where do they overlap, where do they fail to compose, and what should I expect each one to catch?
Strong Answer Framework:
Capabilities answer: what privileged operations can this process invoke? Drop all capabilities and the process cannot bind low ports, change UIDs, mount filesystems, load kernel modules, ptrace others, or do anything else that historically required root. Capabilities do not restrict file access (DAC handles that) or syscall surface (seccomp handles that).
Seccomp answers: what syscalls can this process make at all? Even without capabilities, a process can call hundreds of syscalls. Many have CVE history. Seccomp shrinks the kernel attack surface by blocking syscalls the workload does not need. Its view of arguments is shallow: a filter can compare the raw register values (flags, file descriptors, sizes) but cannot dereference pointers, so it never sees a path string and cannot say “open files only in /tmp.” For open, the practical choice is “you can or cannot call open at all,” optionally narrowed by flags.
Namespaces answer: what does this process see? Mount namespace = its own filesystem view. Network namespace = its own network stack. PID namespace = its own process tree. User namespace = its own UID/GID mapping. Namespaces isolate visibility and resource scope, not privilege. Two processes can be in the same namespace and one can attack the other; namespaces only protect across the boundary.
Docker is the orchestration that wires these together. A Docker container is, mechanically, a process tree with namespaces, a default seccomp profile, dropped capabilities (most are off by default), an AppArmor or SELinux profile, and cgroup limits. Docker is not a new isolation mechanism — it is a configuration that combines existing kernel mechanisms.
Where they fail to compose: seccomp filters by syscall number, but a syscall you allow can transitively reach functionality you blocked (the mprotect-via-printf issue). Namespaces leak through /proc, /sys, kernel keyrings, and shared kernel data structures. User namespaces have escalation paths through misconfigured uid_map. Capabilities have surprising scopes — CAP_SYS_ADMIN is “nearly root” because dozens of operations gate on it. Combining all four is necessary; each individually has gaps the others fill.
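To make the seccomp limitation above concrete, here is a plain-C model of what a filter is allowed to look at. It is a sketch under stated assumptions, not a real BPF program: `syscall_view`, `filter_decision`, and the `VERDICT_*` values are hypothetical names, and the syscall numbers are the x86-64 ones. The point is that the path argument arrives as an opaque pointer value, so the only narrowing available for `openat` is on scalar arguments such as the flags.

```c
#include <fcntl.h>
#include <stdint.h>

/* x86-64 syscall numbers (other architectures use different values). */
enum { NR_READ = 0, NR_WRITE = 1, NR_EXIT_GROUP = 231, NR_OPENAT = 257 };

/* Hypothetical verdicts for this model only. */
enum verdict { VERDICT_ALLOW, VERDICT_DENY };

/* What a filter gets to inspect: the syscall number and six raw
 * register values. Pointers are just numbers here. */
struct syscall_view {
    int      nr;
    uint64_t args[6];
};

static enum verdict filter_decision(const struct syscall_view *v)
{
    switch (v->nr) {
    case NR_READ:
    case NR_WRITE:
    case NR_EXIT_GROUP:
        return VERDICT_ALLOW;
    case NR_OPENAT:
        /* args[1] is the pathname pointer. A filter cannot dereference
         * it, so "only under /tmp" is not expressible. args[2] holds the
         * flags, a scalar, so "read-only opens" is expressible. */
        if ((v->args[2] & (uint64_t)O_ACCMODE) == (uint64_t)O_RDONLY)
            return VERDICT_ALLOW;
        return VERDICT_DENY;
    default:
        /* Everything not explicitly listed is denied. */
        return VERDICT_DENY;
    }
}
```

Two `openat` calls with the same pointer value but different flags get different verdicts, while two calls with the same flags and different paths are indistinguishable to the filter. Path-level policy has to come from the mount namespace or an LSM such as AppArmor or SELinux.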
Real-World Example: The 2024 Leaky Vessels CVEs (CVE-2024-21626 in runc, CVE-2024-23651 in BuildKit) escaped containers despite seccomp, capability dropping, and AppArmor all being in place. The runc escape worked through a file descriptor leak across the namespace boundary: runc was leaking host file descriptors into containers, reachable via /proc/self/fd, and a malicious container could traverse those FDs to reach the host filesystem. None of the standard hardening primitives caught this because the attack did not violate any one mechanism’s contract; it exploited the gap between them. The fix was at the runtime level: runc closes all FDs before exec, a behavior that should have been there all along.
Senior follow-up 1: Why does a default Docker container still have CAP_NET_RAW and CAP_NET_BIND_SERVICE despite the security guidance to drop everything?
Because Docker’s defaults are tuned for compatibility with common workloads: ping, DHCP clients, web servers binding to ports below 1024 in legacy configurations. Most real workloads do not need either capability and should drop them explicitly with --cap-drop=ALL --cap-add=.... The Docker maintainers chose permissive, compatibility-first defaults so docker run would just work for as many users as possible, accepting a less-defensive baseline as the cost. For production, your image build or orchestration layer should override this default.
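The capability sets discussed above are just bitmasks, which is why the `CapEff` line in `/proc/<pid>/status` is a hex number. The sketch below builds the mask for Docker's default set and for the result of `--cap-drop=ALL` plus one `--cap-add`. Treat the list of defaults as an assumption to verify against your Docker version; the capability numbers follow `<linux/capability.h>`, and the helper names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Capability numbers as defined in <linux/capability.h>. */
enum {
    CAP_CHOWN = 0, CAP_DAC_OVERRIDE = 1, CAP_FOWNER = 3, CAP_FSETID = 4,
    CAP_KILL = 5, CAP_SETGID = 6, CAP_SETUID = 7, CAP_SETPCAP = 8,
    CAP_NET_BIND_SERVICE = 10, CAP_NET_RAW = 13, CAP_SYS_CHROOT = 18,
    CAP_SYS_ADMIN = 21, CAP_MKNOD = 27, CAP_AUDIT_WRITE = 29,
    CAP_SETFCAP = 31
};

static uint64_t cap_bit(int cap)
{
    return UINT64_C(1) << cap;
}

static int has_cap(uint64_t mask, int cap)
{
    return (mask & cap_bit(cap)) != 0;
}

/* Assumed Docker default set (14 capabilities); check your engine version. */
static uint64_t docker_default_caps(void)
{
    static const int defaults[] = {
        CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_MKNOD,
        CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETFCAP, CAP_SETPCAP,
        CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_KILL, CAP_AUDIT_WRITE
    };
    uint64_t mask = 0;
    for (size_t i = 0; i < sizeof defaults / sizeof defaults[0]; i++)
        mask |= cap_bit(defaults[i]);
    return mask;
}

/* Model of "--cap-drop=ALL --cap-add=<cap>": start empty, add one bit. */
static uint64_t drop_all_then_add(int cap)
{
    return cap_bit(cap);
}
```

Under these assumptions the default mask works out to `0xa80425fb`, which is what `grep CapEff /proc/1/status` typically prints inside a default container. A web server that only needs a low port ends up with `0x400` after `--cap-drop=ALL --cap-add=NET_BIND_SERVICE`.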
Senior follow-up 2: What is the difference between SECCOMP_FILTER_FLAG_TSYNC and per-thread seccomp filters, and when does it matter?
SECCOMP_FILTER_FLAG_TSYNC, passed to the seccomp() syscall, synchronizes a seccomp filter across all threads in the process at install time, ensuring no thread escapes the filter. Without it, a filter is installed on the calling thread only: threads created afterwards inherit it, but threads that already exist keep running unfiltered until they install a filter themselves. For a single-threaded program this is fine; for anything threaded (which is most modern code), either install the filter before any thread is created or use TSYNC, otherwise threads that were already running never get the filter.
Senior follow-up 3: When would you choose AppArmor over SELinux, or vice versa, and is there ever a case to run both?
AppArmor uses path-based rules: “process X cannot write to /etc/*”. It is easier to write profiles for and easier to reason about, especially in containerized environments where filesystem layout is predictable. SELinux uses type labels assigned to files via xattrs: “process labeled httpd_t cannot write to files labeled etc_t”. This is more powerful (the label travels with the file regardless of path) but harder to debug. Use AppArmor for application-specific containment in container environments (Ubuntu, Debian, and historically SUSE default to AppArmor). Use SELinux for whole-system mandatory access control where the broader policy benefits outweigh complexity (RHEL, Fedora, Android). Running both simultaneously is theoretically possible but practically unwise: LSM stacking still has rough edges, debugging conflicts is painful, and the marginal security from running both is small compared to running either one well.
Common Wrong Answers:
“Containers are basically VMs.” They are emphatically not. A VM has a hardware-virtualized hypervisor between guest and host kernel; a container shares the host kernel directly. Container escapes target host kernel bugs; VM escapes target hypervisor bugs (rarer, smaller surface).
“Seccomp blocks system calls and that is enough.” Seccomp does not see filesystem paths or network addresses. A process with open allowed can read every file your DAC allows; a process with connect allowed can reach every IP your network namespace permits.
“If I drop all capabilities I am safe.” Many CVEs do not need capabilities. Reading sensitive files via standard DAC, exploiting kernel bugs in syscalls that do not require privilege, and lateral movement through the container’s mount namespace are all capability-free.
Further Reading:
“Container Security” by Liz Rice — the cleanest book-length tour of the kernel primitives and how Docker/Kubernetes wire them
LWN article: “Capabilities for system calls” (Mickael Salaun) — why caps and seccomp are complementary
runc CVE-2024-21626 writeup — a real-world example of compositional failure
Linux source: Documentation/userspace-api/seccomp_filter.rst, Documentation/security/credentials.rst
Walk me through how Spectre actually works -- from the CPU's perspective and from the attacker's. Then explain what mitigations the kernel applies and what you give up.
Strong Answer Framework:
The CPU’s perspective: speculation as a performance feature. A modern CPU does not wait for a branch’s condition to resolve before fetching, decoding, and executing instructions on the predicted path. Branch predictors, including the Pattern History Table for conditional branch direction and the Branch Target Buffer (BTB) for indirect branch targets, predict where execution is going. The CPU executes speculatively, retains results in the Reorder Buffer, and either commits them (prediction correct) or discards them (prediction wrong). The trick is that “discards them” is not perfect: side effects on microarchitectural state (cache lines loaded, branch predictor state updated) persist even when the architectural result is rolled back.
The attacker’s perspective: turning microarchitectural side effects into a data leak. Spectre Variant 1 (bounds check bypass): the attacker trains the branch predictor to expect a bounds check to pass, then triggers the speculative path with an out-of-bounds index. The speculative load reads secret memory, uses the secret as an index into a probe array, and brings a specific cache line into L1. The architectural result is rolled back, but the cache state is not. The attacker times accesses to the probe array; the line that hits in cache encodes the secret byte. With this primitive, the attacker reads memory at the rate of about 10-100 KB/sec.
Spectre Variant 2 (branch target injection): poisoning indirect branches. The attacker pollutes the BTB with branch targets that, when used speculatively by the victim, redirect speculative execution to attacker-chosen code — “gadgets” — in the victim’s address space. Now the speculative-execution-and-cache-side-channel pattern can read across security boundaries (kernel space, other VMs, browser sandboxes).
Kernel mitigations: per-variant. Variant 1 is mitigated with array bounds masking (array_index_nospec) and LFENCE / speculation barriers in kernel hot paths. Finding every vulnerable site is a compiler-assisted code review job, painful and incomplete. Variant 2 is mitigated with retpolines on x86 (replacing indirect branches with a return trampoline that defeats BTB poisoning) and the IBRS / IBPB / STIBP CPU features (restricting or clearing predictor state at boundary crossings). Cross-thread protection on SMT hardware comes from core scheduling: never run mutually untrusting tasks (or VMs) at the same time on sibling hardware threads of one physical core.
What you give up. Retpolines add 5-30 percent overhead to indirect-call-heavy workloads (interpreters, VM monitors, system call entry). KPTI (which mitigates Meltdown but is part of the same family) costs 5-30 percent on syscall-heavy workloads, especially on older CPUs without PCID. Disabling SMT for security on multi-tenant hosts halves logical core count. The total cost on a Skylake-era Xeon running a syscall-heavy workload is non-trivial: often 10-25 percent throughput loss compared to a baseline with all mitigations disabled.
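The Variant 1 mitigation named above, bounds masking, can be sketched in portable C. This follows the shape of the kernel's generic `array_index_mask_nospec` fallback but is only an illustration: it assumes the compiler implements right shift of a negative `long` as an arithmetic shift (GCC and Clang do) and that `size` fits in a `long`. The kernel itself uses architecture-specific assembly on x86 so the compiler cannot turn the mask back into a branch. `read_element` is a hypothetical helper.

```c
#include <limits.h>

/* Return all ones when index < size, zero otherwise, without branching.
 * For an in-bounds index both operands of the OR have the top bit clear,
 * so the complement is negative and the shift smears the sign bit. For an
 * out-of-bounds index the subtraction wraps, the top bit is set, and the
 * result is zero. */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

/* Bounds-checked read whose index is also clamped by data dependency.
 * Even if the CPU speculates past the bounds check with a bad index,
 * the mask forces the speculative load to element 0 instead of an
 * attacker-chosen address. */
static int read_element(const unsigned char *array, unsigned long size,
                        unsigned long index, unsigned char *out)
{
    if (index >= size)
        return -1;                              /* architectural check */
    index &= index_mask_nospec(index, size);    /* speculative clamp */
    *out = array[index];
    return 0;
}
```

The design choice is to replace a control dependency (which the predictor can guess wrong) with a data dependency (which the load must wait for). That is why it costs a few ALU instructions rather than a full LFENCE on every access.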
Real-World Example: When Spectre was disclosed in January 2018, AWS deployed mitigations in two phases: first, Linux KPTI plus retpolines on hosts (immediately, for all workloads), and second, a Nitro-based approach that moved virtualization to dedicated hardware so the hypervisor surface no longer ran on the same cores as guest code. Internal AWS benchmarks reported 1-5 percent average overhead for typical workloads, but specific workloads (Redis, syscall-heavy databases) showed 20-30 percent regressions until application-level tuning recovered most of it. Browser vendors saw a similar cost; Google’s response involved refactoring V8’s JIT to insert speculation barriers in a way that did not pay the full retpoline cost on every JS function call.
Senior follow-up 1: Why is Meltdown easier to fully mitigate than Spectre?
Meltdown exploits a specific CPU bug, most prominently on Intel parts: a speculative load from user mode to a kernel address forwarded its data to dependent instructions before the permission check took effect. The mitigation (KPTI, unmapping the kernel from user-space page tables) removes the speculative load’s target entirely. There is nothing for the speculation to read. Spectre, in contrast, exploits a deliberate CPU feature (branch prediction) that you cannot remove without crippling performance. Every mitigation is partial: you fix one branch site, the next one is still vulnerable. You add a barrier somewhere, the attacker finds a different speculation primitive. The arms race is structurally asymmetric.
Senior follow-up 2: How does retpoline actually defeat branch target injection, mechanically?
A normal indirect branch (jmp *%rax) consults the BTB for prediction, which the attacker has poisoned. Retpoline replaces it with a sequence: a call pushes a return address, the code overwrites that return address on the stack with the real target, then executes ret. For ret, the CPU uses its return address predictor (Return Stack Buffer, RSB) instead of the BTB. The attacker cannot easily poison the RSB because it is filled by call instructions, which the attacker controls less. The RSB entry pushed by the retpoline’s own call points at a capture loop (pause; lfence; jump back to the loop), so the speculation that does happen spins there and does no useful work for an attacker, while the architectural ret goes to the real target. The cost is a few extra instructions per indirect call plus a deliberately mispredicted return. On newer Intel CPUs with eIBRS (Enhanced Indirect Branch Restricted Speculation), and on recent AMD CPUs with a comparable automatic IBRS mode, retpoline can be replaced with a hardware mode flag that gives comparable protection at lower overhead.
Senior follow-up 3: When should I disable Spectre mitigations on a host I control?
Realistic case: a single-tenant host running first-party trusted code only, where every binary on the box is built and signed by you, the kernel CVEs you fear are not in the speculation family, and the 5-25 percent throughput loss matters more than defense in depth. HPC clusters running tightly-controlled workloads disable some mitigations for this reason (mitigations=off or specific flags like nopti, spectre_v2=off). The risk you accept: an unknown future CVE that uses speculation to escape userspace, or a supply-chain compromise in a trusted dependency. Most production environments cannot make this tradeoff because the trust assumptions do not actually hold; HPC and game servers can. Document the decision explicitly so the next operator does not assume mitigations are on.
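One way to keep the next operator from assuming mitigations are on is to check what the kernel reports under `/sys/devices/system/cpu/vulnerabilities/`. Each file there holds one line that starts with `Not affected`, `Vulnerable`, or `Mitigation: ...`. The sketch below classifies such a line. The enum and function names are hypothetical, and the text after the prefix varies by kernel version, so only the prefix is inspected.

```c
#include <stdio.h>
#include <string.h>

enum vuln_status {
    VULN_NOT_AFFECTED,
    VULN_MITIGATED,
    VULN_VULNERABLE,
    VULN_UNKNOWN
};

/* Classify one status line by its documented prefix. */
static enum vuln_status classify_status(const char *line)
{
    if (strncmp(line, "Not affected", 12) == 0)
        return VULN_NOT_AFFECTED;
    if (strncmp(line, "Mitigation:", 11) == 0)
        return VULN_MITIGATED;
    if (strncmp(line, "Vulnerable", 10) == 0)
        return VULN_VULNERABLE;
    return VULN_UNKNOWN;
}

/* Read and classify one entry, e.g. name = "spectre_v2" or "meltdown".
 * Returns VULN_UNKNOWN if the file is missing (old kernel, non-Linux). */
static enum vuln_status read_status(const char *name)
{
    char path[256];
    char line[512];
    enum vuln_status status = VULN_UNKNOWN;

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/vulnerabilities/%s", name);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return VULN_UNKNOWN;
    if (fgets(line, sizeof line, f) != NULL)
        status = classify_status(line);
    fclose(f);
    return status;
}
```

A monitoring agent could call `read_status("spectre_v2")` at boot and alert when the result is `VULN_VULNERABLE` on a host that is not on the documented exception list.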
Common Wrong Answers:
“Spectre and Meltdown are the same thing.” They share a primitive (cache side channel after speculation) but differ in trust boundary and mitigation profile. Conflating them suggests you have read the headline but not the technical writeup.
“Just patch your CPU microcode.” Microcode updates are part of the mitigation but cannot fix Spectre fully because Spectre is a behavior, not a bug. Software mitigations remain mandatory.
“Disable Hyper-Threading and you are safe.” Disabling SMT helps against L1TF and MDS variants where SMT siblings share microarchitectural state, but does nothing for cross-process Spectre on the same core. It is one mitigation, not the answer.
Further Reading:
The original Spectre paper: Kocher et al., “Spectre Attacks: Exploiting Speculative Execution” (preprint 2018; IEEE Symposium on Security and Privacy, 2019)
LWN article: “The current state of kernel page-table isolation” — comprehensive KPTI walkthrough
Intel “Speculative Execution Side Channel Mitigations” white paper — the vendor’s view
Linux source: arch/x86/kernel/cpu/bugs.c, arch/x86/include/asm/nospec-branch.h