Operating System Security

Modern OS security is a multi-layered defense against both software vulnerabilities and hardware-level attacks. From memory protection to mandatory access control, understanding these mechanisms is crucial for building secure systems.
Mastery Level: Senior Security Engineer
Key Internals: Page Table Permissions, Capabilities, LSM hooks, CPU security features, Speculative execution mitigations
Prerequisites: Virtual Memory, Process Internals

1. Memory Protection Fundamentals

1.1 Page-Level Protection (NX/DEP)

No-Execute (NX) / Data Execution Prevention (DEP) marks memory pages as non-executable.
Traditional (Insecure):
┌────────────────────────────────┐
│  Stack                         │  Executable!
│  ├─ Return addresses           │  ← Attacker can inject shellcode
│  └─ Local variables            │
├────────────────────────────────┤
│  Heap                          │  Executable!
│  ├─ Malloc'd buffers           │  ← Attacker can put code here
│  └─ Dynamic data               │
├────────────────────────────────┤
│  Data/BSS                      │  Executable!
└────────────────────────────────┘

With NX/DEP:
┌────────────────────────────────┐
│  Stack (NX bit set)            │  NOT Executable
│  ├─ Return addresses           │  ← Shellcode won't execute!
│  └─ Local variables            │
├────────────────────────────────┤
│  Heap (NX bit set)             │  NOT Executable
│  ├─ Malloc'd buffers           │
│  └─ Dynamic data               │
├────────────────────────────────┤
│  Data/BSS (NX bit set)         │  NOT Executable
├────────────────────────────────┤
│  Text (executable)             │  Executable
│  └─ Program code               │
└────────────────────────────────┘
Implementation:
// Kernel sets page table entry (PTE) NX bit

// x86-64 page table entry structure
struct pte {
    unsigned long present : 1;     // Page is in memory
    unsigned long rw : 1;          // Read/Write permission
    unsigned long user : 1;        // User-mode accessible
    unsigned long pwt : 1;         // Page write-through
    unsigned long pcd : 1;         // Page cache disabled
    unsigned long accessed : 1;    // Page was accessed
    unsigned long dirty : 1;       // Page was written
    unsigned long pat : 1;         // Page attribute table
    unsigned long global : 1;      // Global page
    unsigned long avail : 3;       // Available for OS use
    unsigned long pfn : 40;        // Physical frame number
    unsigned long avail2 : 11;     // Available
    unsigned long nx : 1;          // NO-EXECUTE bit (bit 63)
};

// Conceptual sketch of how mapping flags become VMA permissions
// (loosely modeled on mm/mmap.c; not the actual kernel code)
unsigned long do_mmap(struct file *file, unsigned long addr,
                     unsigned long len, unsigned long prot,
                     unsigned long flags, unsigned long pgoff) {
    struct vm_area_struct *vma;

    vma = vm_area_alloc(current->mm);
    vma->vm_start = addr;
    vma->vm_end = addr + len;

    // Stack protection: read/write but NOT executable
    if (flags & MAP_STACK) {
        vma->vm_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
        // NX bit will be set in page table entries
    }

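    // Code segment: executable but NOT writable.
    // Note: mainline Linux does not strip VM_WRITE here; a strict W^X
    // policy like this is enforced by hardening layers (e.g. SELinux execmem).
    if (prot & PROT_EXEC) {
        vma->vm_flags |= VM_EXEC;
        vma->vm_flags &= ~VM_WRITE;  // W^X: Write XOR Execute
    }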

    return addr;
}
W^X Policy (Write XOR Execute):
  • A page can be writable OR executable, but never both
  • Prevents attacker from modifying code or executing data
  • The logic is straightforward: if you can write it, the attacker can inject code there, so it must not execute. If it executes, it must be immutable.
Practical tip: JIT compilers (V8, JVM HotSpot) are the main exception, because they must generate code at runtime. They handle this by allocating pages as RW, writing machine code, then calling mprotect() to flip them to RX before execution. This W-then-X pattern is audited carefully in security-critical JITs like Firefox’s Wasm compiler.
Check NX status:
# Check if NX is enabled
dmesg | grep NX
# NX (Execute Disable) protection: active

# Check process memory maps
cat /proc/self/maps
# 7ffff7dd1000-7ffff7df3000 r-xp ... /lib/x86_64-linux-gnu/ld-2.31.so  (executable)
# 7ffff7df3000-7ffff7df4000 r--p ... /lib/x86_64-linux-gnu/ld-2.31.so  (read-only)
# 7ffffffde000-7ffffffff000 rw-p ... [stack]                           (no 'x'!)

# Check if binary has NX enabled
readelf -l /bin/ls | grep GNU_STACK
# GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x10
#                                                                 ^^^ (RW, not RWE)

1.2 Address Space Layout Randomization (ASLR)

Problem: Without ASLR, addresses are predictable.
Without ASLR (Predictable):
┌─────────────────────────────────┐
│ Stack:         0x7ffffffde000    │  ← Always same address!
│ Heap:          0x555555559000    │  ← Attacker knows these
│ libc:          0x7ffff7a0d000    │  ← Can hardcode in exploit
│ Program:       0x555555554000    │
│ vDSO:          0x7ffff7fc9000    │
└─────────────────────────────────┘

With ASLR (Randomized):
Run 1:                            Run 2:
┌─────────────────────────────┐  ┌─────────────────────────────┐
│ Stack:    0x7ffc9e3a2000    │  │ Stack:    0x7ffe1b8d6000    │
│ Heap:     0x5643ab123000    │  │ Heap:     0x55e2d9abc000    │
│ libc:     0x7f8a2e456000    │  │ libc:     0x7f3c81de2000    │
│ Program:  0x5643ab11f000    │  │ Program:  0x55e2d9ab8000    │
│ vDSO:     0x7f8a2f1c3000    │  │ vDSO:     0x7f3c82b4f000    │
└─────────────────────────────┘  └─────────────────────────────┘
                ↑ Different every time! ↑
Kernel Implementation:
// Simplified from arch/x86/mm/mmap.c

unsigned long arch_mmap_rnd(void) {
    unsigned long rnd;

    // Get random bits from kernel PRNG
    if (mmap_is_ia32()) {
        // 32-bit (compat) tasks get the smaller compat entropy
        rnd = get_random_long() & ((1UL << mmap_rnd_compat_bits) - 1);
    } else {
        rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);
    }

    return rnd << PAGE_SHIFT;  // Align to page boundary
}

unsigned long arch_get_unmapped_area(struct file *filp,
                                    unsigned long addr,
                                    unsigned long len,
                                    unsigned long pgoff,
                                    unsigned long flags) {
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned long start_addr;

    // Add random offset
    if (!(flags & MAP_FIXED)) {
        // Random offset for ASLR
        start_addr = mm->mmap_base + arch_mmap_rnd();
    } else {
        start_addr = addr;
    }

    // Find free region starting at randomized address
    vma = find_vma(mm, start_addr);
    // ... allocation logic ...

    return start_addr;
}
Entropy Sources:
ASLR Entropy (typical bits of randomness on x86-64; exact values depend on kernel version and configuration):

Stack:      19 bits              = 524,288 possible locations
Heap:       13 bits              = 8,192 possible locations
Libraries:  28 bits              = 268 million possible locations
PIE binary: 28 bits              = 268 million possible locations

Formula: Worst-case brute force attempts = 2^(entropy_bits); average = 2^(entropy_bits - 1)

Example: 28 bits → attacker needs avg 2^27 ≈ 134 million attempts
If each attempt crashes the program (1 sec delay):
  134 million seconds ≈ 1,553 days (over 4 years)!

But if a forking server respawns children with the same layout (no re-randomization after a crash):
  Attacker can probe incrementally and brute force in minutes!
KASLR (Kernel ASLR):
// Kernel virtual address randomization (from arch/x86/boot/compressed/kaslr.c)

void choose_random_location(unsigned long input,
                           unsigned long input_size,
                           unsigned long *output,
                           unsigned long output_size,
                           unsigned long *virt_addr) {
    unsigned long random_addr, min_addr;

    // Get entropy from:
    // 1. RDRAND/RDSEED (CPU instructions)
    // 2. RDTSC (timestamp counter)
    // 3. Boot parameters
    random_addr = get_random_long();

    // Align and constrain to valid kernel address range
    min_addr = min(*output, *virt_addr);
    random_addr = find_random_phys_addr(min_addr, output_size);

    *output = random_addr;
    *virt_addr = random_addr + __START_KERNEL_map;
}
Check ASLR status:
# View ASLR setting
cat /proc/sys/kernel/randomize_va_space
# 0 = Disabled
# 1 = Randomize stack, libraries, mmap
# 2 = Full randomization (includes heap)

# Enable full ASLR
echo 2 | sudo tee /proc/sys/kernel/randomize_va_space

# Test ASLR
for i in {1..5}; do cat /proc/self/maps | grep stack; done
# 7ffc12345000-7ffc12366000 (different)
# 7ffe9abcd000-7ffe9abee000 (different)
# 7ffd45678000-7ffd45699000 (different)

1.3 Stack Canaries (Stack Smashing Protection)

A stack canary is a random value placed on the stack between local variables and the return address.
Stack Layout with Canary:
─────────────────────────

High Address
┌──────────────────────┐
│  Return Address      │  ← Protected by canary!
├──────────────────────┤
│  Saved Frame Pointer │
├──────────────────────┤
│  CANARY (random)     │  ← __stack_chk_guard
├──────────────────────┤
│  Local Variables     │  ← Buffer overflow starts here
│  char buf[100];      │
└──────────────────────┘
Low Address

Attack Scenario:
1. Attacker overflows buf
2. Overwrites canary (but doesn't know correct value)
3. Function returns
4. Compiler-inserted epilogue checks: canary == __stack_chk_guard?
5. Mismatch → Stack smashing detected! → Abort
Compiler Implementation:
// Original vulnerable code
void vulnerable_function(char *input) {
    char buffer[64];
    strcpy(buffer, input);  // Buffer overflow!
}

// Compiled with -fstack-protector-strong
void vulnerable_function(char *input) {
    char buffer[64];
    unsigned long canary = __stack_chk_guard;  // Load canary

    strcpy(buffer, input);

    if (canary != __stack_chk_guard) {
        __stack_chk_fail();  // Stack smashing detected!
    }
}

// __stack_chk_fail() (conceptual sketch of the glibc behavior)
void __attribute__((noreturn)) __stack_chk_fail(void) {
    __fortify_fail("stack smashing detected");
}

void __attribute__((noreturn)) __fortify_fail(const char *msg) {
    // Report the error (glibc prints "*** stack smashing detected ***: terminated"
    // to stderr; syslog is used here only for illustration)
    syslog(LOG_CRIT, "%s: %s terminated", __progname, msg);

    // Terminate immediately
    abort();
}
Canary Types:
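// Terminator canary
// Uses NUL, CR, LF, EOF (0x00, 0x0D, 0x0A, 0xFF)
// Idea: strcpy stops at NUL and gets stops at LF or EOF, so string functions cannot write these bytes

unsigned long canary = 0x000d0aff00000000;

// Weakness: the value is fixed and known, so an overflow through memcpy/read
// (which ignore terminators) can rewrite it exactly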
Compiler Flags:
# -fstack-protector: Protect functions with vulnerable buffers
gcc -fstack-protector vulnerable.c

# -fstack-protector-strong: Protect more functions (recommended)
gcc -fstack-protector-strong vulnerable.c

# -fstack-protector-all: Protect ALL functions (performance cost)
gcc -fstack-protector-all vulnerable.c

# Check if binary has stack protector
readelf -s /bin/ls | grep stack_chk
#    123: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __stack_chk_fail@GLIBC_2.4
Bypass Techniques (and mitigations):
Attack                                         Mitigation
─────────────────────────────────────────────  ──────────────────────────────────────
Leak canary via format string                  Use fortified functions (__printf_chk)
Overwrite canary with correct value            Use random canary per thread
Jump over canary (partial overflow)            Place canary near variables
Fork before overflow (canary same in child)    Re-randomize after fork

2. Control Flow Integrity (CFI)

CFI ensures program control flow follows legitimate paths (no arbitrary jumps).

2.1 Forward-Edge CFI (Indirect Calls)

Problem: Function pointers can be hijacked.
// Vulnerable code
struct ops {
    void (*process)(char *data);
};

struct ops *vtable = malloc(sizeof(struct ops));
vtable->process = legitimate_function;

// ... buffer overflow ...
// Attacker overwrites vtable->process to point to shellcode

vtable->process(data);  // Calls shellcode!
CFI Solution:
// Compiler generates CFI check before indirect call

// Original code
vtable->process(data);

// Compiled with CFI
void *target = vtable->process;

// Check 1: Is target a valid code address?
if (!is_valid_code_address(target)) {
    cfi_violation();
}

// Check 2: Is target in allowed set for this call site?
if (!is_allowed_target(call_site_id, target)) {
    cfi_violation();
}

// Perform call
((void (*)(char *))target)(data);
Allowed Target Sets:
Function Signature-Based CFI:

void func_a(int x);           ← Set 1: void (int)
void func_b(int x);           ←
int func_c(int x, int y);     ← Set 2: int (int, int)
int func_d(int x, int y);     ←
void func_e(void);            ← Set 3: void (void)

Rule: Indirect call with signature void(int) can only jump to Set 1.

Implementation:
1. Compiler assigns ID to each function signature
2. Compiler tags each function with its ID
3. Before indirect call, check ID matches expected signature
Clang CFI:
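# Compile with CFI (requires LTO and hidden visibility)
clang -fsanitize=cfi -flto -fvisibility=hidden program.c

# CFI variants
-fsanitize=cfi-icall           # Indirect calls
-fsanitize=cfi-vcall           # Virtual function calls (C++)
-fsanitize=cfi-derived-cast    # Bad casts from base to derived class (C++)
-fsanitize=cfi-unrelated-cast  # Bad casts from void* or unrelated types (C++)

# Generate CFI violation report (by default a violation just traps;
# build with -fno-sanitize-trap=cfi to get diagnostics)
UBSAN_OPTIONS=print_stacktrace=1 ./program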

# Example violation
SUMMARY: UndefinedBehaviorSanitizer: cfi-check-fail
pc 0x55f8a2b3c4d5 in main program.c:42

2.2 Backward-Edge CFI (Return Address Protection)

Shadow Stack: Hardware-protected copy of return addresses.
Regular Stack           Shadow Stack (Protected)
─────────────────       ────────────────────────
┌─────────────┐         ┌─────────────┐
│  Ret Addr 3 │ ◄──────►│  Ret Addr 3 │  (Copy)
├─────────────┤         ├─────────────┤
│  Locals     │         │             │
├─────────────┤         │             │
│  Ret Addr 2 │ ◄──────►│  Ret Addr 2 │
├─────────────┤         ├─────────────┤
│  Locals     │         │             │
├─────────────┤         │             │
│  Ret Addr 1 │ ◄──────►│  Ret Addr 1 │
└─────────────┘         └─────────────┘
     ↑                         ↑
  Writable!              Protected!
  (Attacker can          (CPU enforced: ordinary
   modify)                stores fault)

On Function Return:
1. Pop return address from regular stack → addr_stack
2. Pop return address from shadow stack → addr_shadow
3. Compare: addr_stack == addr_shadow?
4. Mismatch → #CP exception (Control Protection) → Crash
Intel CET (Control-flow Enforcement Technology):
// CPUID leaf 7 (subleaf 0) feature bits for CET
#define X86_FEATURE_SHSTK  (1 << 7)   // Shadow stack (ECX bit 7)
#define X86_FEATURE_IBT    (1 << 20)  // Indirect branch tracking (EDX bit 20)

// Enable user-mode shadow stack (conceptual kernel code)
void cet_enable(void) {
    u64 msr_val;

    // Check if CPU supports CET
    if (!boot_cpu_has(X86_FEATURE_SHSTK))
        return;

    // CR4.CET must also be set before CET takes effect (omitted here)

    // Enable in the user-mode CET MSR (IA32_S_CET is the supervisor-mode one)
    rdmsrl(MSR_IA32_U_CET, msr_val);
    msr_val |= MSR_IA32_CET_SHSTK_EN;  // Enable shadow stack
    wrmsrl(MSR_IA32_U_CET, msr_val);

    // Allocate shadow stack for current thread
    unsigned long ssp = alloc_shstk();  // Shadow stack pointer
    wrmsrl(MSR_IA32_PL3_SSP, ssp);      // Ring-3 shadow stack pointer
}

// Shadow stack operations (new x86 instructions)
// INCSSP - Increment shadow stack pointer
// RDSSP  - Read shadow stack pointer
// SAVEPREVSSP - Save previous SSP
// RSTORSSP - Restore SSP
// WRSSD/WRSSQ - Write to shadow stack
// SETSSBSY - Mark shadow stack busy
ARM Pointer Authentication:
// ARM PAuth uses cryptographic signing of return addresses

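// On function prologue:
// PAC (Pointer Authentication Code) = sign(return_addr, context, key)
// The PAC is stored in the unused upper bits of the return address itself

// On function epilogue:
// Verify: recompute sign(return_addr, context, key) and compare with the embedded PAC
// If mismatch → the pointer is made invalid, so the return faults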

// ARM instructions
PACIA  X30, SP   // Sign return address (X30) with stack pointer (SP)
RETAA            // Authenticate and return
Software Shadow Stack (Android):
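// Conceptual sketch of a software shadow stack (not hardware-protected).
// Android's production scheme is Clang's ShadowCallStack (-fsanitize=shadow-call-stack),
// which is compiler-inserted and reloads the return address rather than comparing it.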

__thread void *shadow_stack[1024];
__thread int shadow_stack_ptr = 0;

void function_entry(void *return_addr) {
    shadow_stack[shadow_stack_ptr++] = return_addr;
}

void function_exit(void *return_addr) {
    void *expected = shadow_stack[--shadow_stack_ptr];
    if (return_addr != expected) {
        abort();  // Stack corruption detected
    }
}

// Weakness: Attacker can corrupt shadow_stack too if memory bug exists
// Strength: Works on CPUs without hardware support

3. Privilege Separation & Capabilities

3.1 Traditional Unix DAC (Discretionary Access Control)

User/Group/Other Permissions:

File: /etc/shadow
Owner: root
Group: shadow
Permissions: rw-r-----
             │││││││││
             ││││││││└─ Other: no execute
             │││││││└── Other: no write
             ││││││└─── Other: no read
             │││││└──── Group: no execute
             ││││└───── Group: no write
             │││└────── Group: read
             ││└─────── Owner: no execute
             │└──────── Owner: write
             └───────── Owner: read

Problem: All-or-nothing root privileges!
- Process needs root to bind port 80 → runs fully as root
- Process needs root to read /etc/shadow → runs fully as root

3.2 POSIX Capabilities

Divide root privileges into distinct units:
// From /usr/include/linux/capability.h

#define CAP_CHOWN            0   // Change file ownership
#define CAP_DAC_OVERRIDE     1   // Bypass file permission checks
#define CAP_DAC_READ_SEARCH  2   // Bypass read/search permissions
#define CAP_FOWNER           3   // Bypass permission checks on file operations
#define CAP_FSETID           4   // Don't clear setuid/setgid bits
#define CAP_KILL             5   // Bypass permission checks for sending signals
#define CAP_SETGID           6   // Make arbitrary setgid calls
#define CAP_SETUID           7   // Make arbitrary setuid calls
#define CAP_NET_BIND_SERVICE 10  // Bind to ports < 1024
#define CAP_NET_RAW          13  // Use RAW and PACKET sockets
#define CAP_SYS_MODULE       16  // Load/unload kernel modules
#define CAP_SYS_PTRACE       19  // Trace arbitrary processes
#define CAP_SYS_ADMIN        21  // Lots of system admin operations
// ... 41 capabilities total (as of Linux 5.9) ...
Capability Sets:
// Each process has 5 capability sets

struct cred {
    // ...
    kernel_cap_t cap_inheritable;  // Inherited by exec'd programs
    kernel_cap_t cap_permitted;    // Can be enabled (superset)
    kernel_cap_t cap_effective;    // Actually active NOW
    kernel_cap_t cap_bset;         // Bounding set (limits inheritable)
    kernel_cap_t cap_ambient;      // Ambient set (new in Linux 4.3)
};

// Each set is a 64-bit bitmask (room for up to 64 capabilities)
typedef struct {
    __u32 cap[_LINUX_CAPABILITY_U32S_3];  // 2 × 32 bits
} kernel_cap_t;
Capability Semantics:
Permitted (P):   Capabilities the process CAN use
Effective (E):   Capabilities CURRENTLY active
Inheritable (I): Capabilities that can be inherited across exec
Ambient (A):     Capabilities automatically granted after exec
Bounding (B):    Upper limit on capabilities (cannot gain capabilities not in B)

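On exec() (non-root process; B = bounding set):
A' = (file is privileged) ? 0 : A
P' = (I & F_inheritable) | (F_permitted & B) | A'
E' = F_effective ? P' : A'
I' = I

Where F_* are file capabilities (set on executable); a file is
"privileged" if it has file capabilities or is setuid/setgid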
Using Capabilities:
# Give ping the ability to create raw sockets (no setuid needed!)
sudo setcap cap_net_raw+ep /bin/ping

# Verify
getcap /bin/ping
# /bin/ping = cap_net_raw+ep

# Now ping works without setuid bit!
ls -l /bin/ping
# -rwxr-xr-x  ... /bin/ping  (no 's' bit!)

# Remove capabilities
sudo setcap -r /bin/ping

# Set multiple capabilities
sudo setcap cap_net_bind_service,cap_net_raw+ep /usr/bin/server

3.3 Seccomp (Secure Computing Mode)

Seccomp-BPF: Restrict system calls a process can make using BPF filters.
#include <fcntl.h>
#include <seccomp.h>
#include <unistd.h>

int main(void) {
    scmp_filter_ctx ctx;
    char buf[64];

    // Open files BEFORE sandboxing: a seccomp filter sees only the syscall
    // number and raw argument values, so it cannot compare path strings
    int fd = open("/tmp/allowed.txt", O_RDONLY);

    // Create filter: default action = KILL
    ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        return 1;

    // Allow essential syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Allow write only to stdout (integer arguments CAN be compared)
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                     SCMP_A0(SCMP_CMP_EQ, STDOUT_FILENO));

    // Load filter into kernel
    if (seccomp_load(ctx) < 0)
        return 1;

    // After this point, any syscall not explicitly allowed → SIGSYS (kill)

    read(fd, buf, sizeof(buf));         // ✓ Allowed (fd was opened earlier)
    write(STDOUT_FILENO, "ok\n", 3);    // ✓ Allowed (fd == 1)
    open("/etc/passwd", O_RDONLY);      // ✗ Killed! (open/openat not allowed)

    seccomp_release(ctx);
    return 0;
}

// Compile: gcc -o sandbox sandbox.c -lseccomp
Raw Seccomp-BPF:
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

void install_seccomp_filter(void) {
    struct sock_filter filter[] = {
        // Load syscall number
        // (production filters first validate seccomp_data.arch)
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),

        // Allow exit_group syscall (what glibc uses when main() returns)
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        // Allow write syscall
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        // Kill on any other syscall
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    };

    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    // Enable seccomp
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);  // Cannot gain privileges
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

int main(void) {
    install_seccomp_filter();

    write(1, "Hello\n", 6);  // ✓ Allowed
    getpid();                 // ✗ Killed (SIGSYS)

    return 0;
}
Seccomp Actions:
Action                      Effect
──────────────────────────  ────────────────────────
SECCOMP_RET_KILL_PROCESS    Kill entire process
SECCOMP_RET_KILL_THREAD     Kill only current thread
SECCOMP_RET_TRAP            Send SIGSYS signal
SECCOMP_RET_ERRNO           Return error code
SECCOMP_RET_TRACE           Notify tracer (ptrace)
SECCOMP_RET_LOG             Log and allow
SECCOMP_RET_ALLOW           Allow syscall
Real-World Usage:
# Chrome sandbox
ps aux | grep chrome
# ... --type=renderer --enable-sandbox ...

# Docker seccomp profile
docker run --security-opt seccomp=default.json alpine sh

# systemd service with seccomp
cat /etc/systemd/system/myservice.service
# [Service]
# SystemCallFilter=@system-service
# SystemCallFilter=~@privileged @resources

# View seccomp status of process
grep Seccomp /proc/self/status
# Seccomp: 2  (mode 2 = filter active)

4. Mandatory Access Control (MAC)

4.1 SELinux (Security-Enhanced Linux)

SELinux adds mandatory access control on top of DAC.
DAC says: "Can user alice read file.txt?"
  → Check: alice's UID vs file owner, group, permissions

SELinux says: "Can process with label X access file with label Y?"
  → Check: Policy rules for (process_label, file_label, operation)

Both must succeed for access!
SELinux Components:
┌─────────────────────────────────────────────────┐
│              SELinux Policy                      │
│  ┌───────────────────────────────────────────┐  │
│  │  Type Enforcement (TE) Rules              │  │
│  │  allow httpd_t http_port_t:tcp_socket bind│  │
│  │  allow httpd_t httpd_sys_content_t:file r │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  Security Contexts (Labels)               │  │
│  │  user:role:type:level                     │  │
│  │  system_u:system_r:httpd_t:s0             │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘

         ┌──────────────────────┐
         │  LSM (Linux Security │
         │  Module) Framework   │
         └──────────────────────┘

      Kernel enforces at runtime
Security Context:
# View file context
ls -Z /var/www/html/index.html
# system_u:object_r:httpd_sys_content_t:s0 /var/www/html/index.html
#    │        │           │              │
#    │        │           │              └─ MLS level (sensitivity)
#    │        │           └─ Type (most important!)
#    │        └─ Role
#    └─ User

# View process context
ps -Z 1234
# system_u:system_r:httpd_t:s0 /usr/sbin/httpd

# Change file context
chcon -t httpd_sys_content_t /var/www/html/newfile.html

# Restore default contexts
restorecon -Rv /var/www/html/
Type Enforcement Rules:
# Source-form rules (the compiled binary policy is stored under /etc/selinux/targeted/policy/)

# Allow httpd to bind TCP sockets to ports labeled http_port_t (port 80, 443)
allow httpd_t http_port_t:tcp_socket name_bind;

# Allow httpd to read files labeled httpd_sys_content_t
allow httpd_t httpd_sys_content_t:file { read getattr open };

# Allow httpd to write to log files
allow httpd_t httpd_log_t:file { write append create };

# Deny (example)
# If no rule allows, default is deny!
# httpd_t trying to access user_home_t → DENIED
SELinux Modes:

Enforcing

# Active enforcement
getenforce
# Enforcing

# Violations blocked
# AVC denials logged

sestatus
# SELinux status: enabled
# Current mode: enforcing

Permissive

# Log-only mode
setenforce 0

getenforce
# Permissive

# Violations logged
# but NOT blocked

# Good for debugging

Disabled

# SELinux completely off

# Edit /etc/selinux/config
SELINUX=disabled

# Reboot required

# NO security benefit!
Debugging SELinux:
# View denials
ausearch -m avc -ts recent

# Example denial
type=AVC msg=audit(1234567890.123:456): avc: denied { read } for pid=1234
  comm="httpd" name="secret.txt" dev="sda1" ino=123456
  scontext=system_u:system_r:httpd_t:s0
  tcontext=system_u:object_r:user_home_t:s0
  tclass=file permissive=0

# Translation: httpd_t tried to read user_home_t file → DENIED

# Generate policy module to allow
audit2allow -a -M mypolicy
# module mypolicy 1.0;
# require {
#     type httpd_t;
#     type user_home_t;
#     class file read;
# }
# allow httpd_t user_home_t:file read;

# Install policy module
semodule -i mypolicy.pp

# List loaded modules
semodule -l

# Remove module
semodule -r mypolicy
SELinux Booleans (runtime toggles):
# List all booleans
getsebool -a | grep httpd
# httpd_can_network_connect --> off
# httpd_can_network_connect_db --> off
# httpd_enable_cgi --> on

# Enable httpd network connections
setsebool -P httpd_can_network_connect on

# -P makes it persistent across reboot

4.2 AppArmor

AppArmor is path-based MAC (vs SELinux’s label-based).
SELinux: "Process with label X can access file with label Y"
  → Requires labeling entire filesystem

AppArmor: "Process can access /var/www/* with read permission"
  → Based on filesystem paths (easier to understand)
AppArmor Profile:
# /etc/apparmor.d/usr.sbin.nginx
#include <tunables/global>

/usr/sbin/nginx {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  capability dac_override,
  capability net_bind_service,
  capability setgid,
  capability setuid,

  /etc/nginx/** r,
  /var/log/nginx/** rw,
  /var/www/** r,
  /run/nginx.pid rw,

  # Network
  network inet stream,
  network inet6 stream,

  # Deny everything else (default)
}
Profile Modes:
# Enforce mode
aa-enforce /usr/sbin/nginx

# Complain mode (log-only)
aa-complain /usr/sbin/nginx

# Disable profile
aa-disable /usr/sbin/nginx

# View status
aa-status
# apparmor module is loaded.
# 12 profiles are loaded.
# 10 profiles are in enforce mode.
# 2 profiles are in complain mode.
Creating Profiles:
# Generate profile automatically
aa-genprof /usr/bin/myapp

# Steps:
# 1. Runs app in learning mode
# 2. Exercise all app functionality
# 3. Reviews logged accesses
# 4. Generates profile

# Manually create profile
cat > /etc/apparmor.d/usr.bin.myapp <<EOF
/usr/bin/myapp {
  #include <abstractions/base>

  /etc/myapp/** r,
  /var/lib/myapp/** rw,
  /tmp/** rw,

  capability net_bind_service,

  deny /etc/shadow r,
}
EOF

# Load profile
apparmor_parser -r /etc/apparmor.d/usr.bin.myapp
SELinux vs AppArmor:
Feature         SELinux                AppArmor
──────────────  ─────────────────────  ────────────────────
Granularity     Very fine (labels)     Coarse (paths)
Complexity      High                   Low
Performance     Small overhead         Very small
Learning curve  Steep                  Gentle
Flexibility     Maximum                Good
Default         RHEL, Fedora, CentOS   Debian, Ubuntu, SUSE

5. Microarchitectural Attacks & Mitigations

5.1 Spectre & Meltdown

Speculative Execution: CPU predicts branch and executes ahead, then discards if wrong.
// Vulnerable code
if (x < array1_size) {         // Bounds check
    y = array2[array1[x]];     // Out-of-bounds access
}

Without Speculation:
1. Check: x < array1_size?
2. If true, execute access
3. If false, skip

With Speculation (vulnerable):
1. CPU predicts branch will be taken
2. Speculatively executes: y = array2[array1[x]]
   Even if x >= array1_size!
3. Loads array1[x] (out of bounds!)
4. Uses it to index array2
5. array2[...] brought into cache ← SIDE EFFECT!
6. Branch misprediction detected
7. Architectural state rolled back
8. BUT: Cache state NOT rolled back!

Attacker observes cache timing → leaks array1[x]!
Meltdown (CVE-2017-5754):
// Leak kernel memory from user space

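// 1. Flush the probe array from the cache (every line that may be touched)
for (int i = 0; i < 256; i++)
    clflush(&probe_array[i * 4096]);

// 2. Access kernel memory (faults architecturally, but the load executes transiently first)
unsigned char kernel_byte = *(unsigned char *)kernel_address;

// 3. Use leaked byte to index array
char dummy = probe_array[kernel_byte * 4096];
// This brings probe_array[kernel_byte * 4096] into cache

// 4. The fault is raised at retirement and the transient results are discarded
// (the attacker catches SIGSEGV or suppresses the fault)
// But probe_array[...] is NOW in cache!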

// 5. Time accesses to probe_array
for (int i = 0; i < 256; i++) {
    t0 = rdtsc();
    dummy = probe_array[i * 4096];
    t1 = rdtsc();

    if (t1 - t0 < THRESHOLD) {
        // Cache hit! i == kernel_byte
        printf("Leaked kernel byte: 0x%02x\n", i);
    }
}

// Result: kernel memory leaked byte by byte (the original paper reports rates up to roughly 500 KB/s)!
Mitigation: KPTI (Kernel Page Table Isolation):
Without KPTI:
┌────────────────────────────────────┐
│  User Page Tables                  │
│  ┌──────────────────────────────┐  │
│  │  User Space Mappings          │  │
│  │  0x00000000 - 0x7fffffffffff  │  │
│  ├──────────────────────────────┤  │
│  │  Kernel Space Mappings        │  │
│  │  0xffff800000000000 - ...     │  │ ← Meltdown can read this!
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

With KPTI (two sets of page tables):
┌────────────────────────────────────┐
│  User Page Tables                  │
│  ┌──────────────────────────────┐  │
│  │  User Space Mappings          │  │
│  ├──────────────────────────────┤  │
│  │  Minimal Kernel (entry/exit)  │  │ ← Only small trampoline
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

┌────────────────────────────────────┐
│  Kernel Page Tables                │
│  ┌──────────────────────────────┐  │
│  │  User Space Mappings          │  │
│  ├──────────────────────────────┤  │
│  │  Full Kernel Space            │  │ ← Full kernel mapped
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

On syscall: Switch from User PT → Kernel PT (CR3 register swap)
On return:  Switch from Kernel PT → User PT

Cost: ~5-30% performance penalty on syscall-heavy workloads (extra CR3 switches and TLB flushes on every kernel entry/exit; PCID reduces the cost)
Kernel Implementation (simplified from arch/x86/mm/pti.c):
// Enable KPTI
void pti_init(void) {
    if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN))
        return;  // CPU not vulnerable

    pr_info("Kernel/User page tables isolation: enabled\n");

    // Allocate separate user page tables
    pgd_t *user_pgd = (pgd_t *)__get_free_page(GFP_KERNEL);

    // Copy user space mappings (the lower PGD entries, below the kernel boundary)
    clone_pgd_range(user_pgd, kernel_pgd, KERNEL_PGD_BOUNDARY);

    // Map minimal kernel trampolines (entry/exit stubs)
    map_entry_trampoline(user_pgd);

    // Install user page tables
    current->mm->pgd = user_pgd;
}

// Syscall entry: switch to kernel page tables (simplified)
ENTRY(entry_SYSCALL_64)
    swapgs                        // Swap GS (per-CPU data)
    movq %rsp, PER_CPU_VAR(rsp_scratch)    // Save user stack pointer

    /* Switch page tables (%rsp is free to use as scratch after the save) */
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp  // Load kernel CR3
    movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp  // Switch to kernel stack

    /* ... handle syscall, restore user registers ... */

    SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi  // Load user CR3
    swapgs
    sysretq
END(entry_SYSCALL_64)
Spectre (CVE-2017-5753/5715):
// Bounds Check Bypass (Spectre v1, CVE-2017-5753)
// Spectre v2 (Branch Target Injection, CVE-2017-5715) instead poisons the
// indirect branch predictor; retpoline targets that variant

// Victim code
void victim_function(size_t x) {
    if (x < array1_size) {
        y = array2[array1[x] * 256];
    }
}

// Attacker code
void attack() {
    // 1. Train branch predictor
    for (int i = 0; i < 1000; i++) {
        victim_function(valid_x);  // Train: bounds check passes
    }

    // 2. Flush the probe array (array2) from the cache
    for (int i = 0; i < 256; i++) {
        clflush(&array2[i * 256]);
    }

    // 3. Call with malicious x
    victim_function(malicious_x);  // x >= array1_size

    // CPU speculatively executes the body (predictor expects the check to pass)
    // The out-of-bounds byte array1[x] selects which array2 line gets cached

    // 4. Time side-channel to recover
    for (int i = 0; i < 256; i++) {
        t0 = rdtsc();
        dummy = array2[i * 256];
        t1 = rdtsc();

        if (t1 - t0 < THRESHOLD) {
            printf("Leaked: 0x%02x\n", i);
        }
    }
}

// Result: leaks memory the victim can address but never meant to expose
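The bounds check in the victim code above can be hardened by clamping the index with a branch-free mask, so that a mispredicted branch still cannot read out of bounds. The following is a minimal sketch of that idea, modeled on the approach behind the Linux kernel's array_index_nospec(). The function names are illustrative, and it assumes a 64-bit long with arithmetic right shift (GCC or Clang on Linux). Production code should prefer the kernel or compiler-provided primitive, since an optimizer is in principle free to reintroduce a branch.

```c
#include <limits.h>

/* Returns all-ones when index < size and zero otherwise, without branching.
 * Assumes index and size are both <= LONG_MAX and that >> on a negative
 * long is an arithmetic shift (true for GCC and Clang). */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

/* In-bounds indices pass through unchanged; out-of-bounds ones become 0. */
static unsigned long clamp_index_nospec(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}

/* Hypothetical victim with the mask applied right after the bounds check. */
static unsigned char read_element(const unsigned char *array1,
                                  unsigned long array1_size, unsigned long x)
{
    if (x < array1_size) {
        /* Under misprediction x may be out of bounds here; the mask forces 0. */
        x = clamp_index_nospec(x, array1_size);
        return array1[x];
    }
    return 0;
}
```

The mask costs a few ALU instructions and creates a data dependency between the comparison and the load, which is what stops the speculative out-of-bounds access.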
Mitigation: Retpoline (Return Trampoline):
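; Traditional indirect jump (vulnerable)
jmp *%rax

; Retpoline (safe)
    call set_up_target     ; Pushes the address of capture_spec as return address
capture_spec:
    pause                  ; Speculation predicted from the RSB lands here...
    lfence
    jmp capture_spec       ; ...and spins harmlessly until it is squashed
set_up_target:
    mov %rax, (%rsp)       ; Overwrite the return address with the real target
    ret                    ; Architecturally "returns" to *%rax

; The ret is predicted from the return stack buffer (RSB), not from the
; attacker-trainable indirect branch predictor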
Kernel Implementation:
// Compiler flag
KBUILD_CFLAGS += -mindirect-branch=thunk-extern
KBUILD_CFLAGS += -mindirect-branch-register

// Generated code
// Before:
//   call *%rax
// After:
//   call __x86_indirect_thunk_rax

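__x86_indirect_thunk_rax:
    call set_up_target
capture_spec:
    pause
    lfence
    jmp capture_spec      // Speculative execution is trapped in this loop
set_up_target:
    mov %rax, (%rsp)      // Replace return address with the real target
    ret                   // Architecturally jumps to *%rax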
Hardware Mitigations:
# Check CPU vulnerabilities
cat /sys/devices/system/cpu/vulnerabilities/*

# meltdown: Mitigation: PTI
# spectre_v1: Mitigation: usercopy/swapgs barriers and __user pointer sanitization
# spectre_v2: Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, RSB filling

# IBRS (Indirect Branch Restricted Speculation)
# IBPB (Indirect Branch Predictor Barrier)
# STIBP (Single Thread Indirect Branch Predictors)
# SSBD (Speculative Store Bypass Disable)

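# Disable mitigations (for benchmarking)
# WARNING: Insecure!
# Upstream kernels: use boot parameters such as pti=off (or nopti),
# spectre_v2=off, or mitigations=off
# RHEL/CentOS kernels additionally expose debugfs toggles:
echo 0 > /sys/kernel/debug/x86/pti_enabled
echo 0 > /sys/kernel/debug/x86/retp_enabled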

5.2 Rowhammer

DRAM vulnerability: Rapidly accessing one row can flip bits in adjacent rows.
DRAM Organization:
┌─────────────────────────────────┐
│  Bank 0                         │
│  ┌───────────────────────────┐  │
│  │ Row 0   [data]            │  │
│  │ Row 1   [data] ← Target   │  │ ← Victim row
│  │ Row 2   [data]            │  │ ← Hammered (read repeatedly)
│  │ Row 3   [data] ← Target   │  │ ← Victim row
│  │ ...                       │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘

Attack:
1. Find adjacent rows in DRAM
2. Rapidly read from Row 2 (millions of times)
3. Electrical interference causes bit flips in Row 1 and Row 3
4. Attacker doesn't directly access victim rows!
Exploitation:
// Rowhammer exploit (simplified)

// 1. Spray memory with target pattern
unsigned char *spray[1000];  // unsigned, so the comparison against 0xFF below is meaningful
for (int i = 0; i < 1000; i++) {
    spray[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    memset(spray[i], 0xFF, 4096);  // All bits set
}

// 2. Find adjacent rows (via DRAM addressing)
// find_row_address() is a placeholder for platform-specific address mapping
volatile uint64_t *hammer_addr1 = find_row_address(2);
volatile uint64_t *hammer_addr2 = find_row_address(4);

// 3. Hammer rows
for (int i = 0; i < 1000000; i++) {
    (void)*hammer_addr1;  // Volatile read (causes DRAM row activation)
    (void)*hammer_addr2;
    clflush(hammer_addr1);  // Evict from cache (force DRAM access)
    clflush(hammer_addr2);
}

// 4. Check for bit flips in victim rows
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 4096; j++) {
        if (spray[i][j] != 0xFF) {
            printf("Bit flip at %p: 0x%02x\n", &spray[i][j], spray[i][j]);
        }
    }
}

// Real attacks:
// - Flip bit in page table → gain access to kernel memory
// - Flip bit in SELinux context → privilege escalation
// - Flip bit in RSA public-key modulus → modulus becomes easy to factor
Mitigations:

ECC Memory

Error-Correcting Code (ECC):
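- Detects and corrects single-bit errors
- Detects (but can't correct) double-bit errors
- Three or more flips in one word may go undetected, so ECC raises the cost of Rowhammer rather than ruling it out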

Cost: ~10% more expensive
Performance: Slight overhead

Widely used in servers, rare in consumer devices
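To make the "correct one flip, detect two" behavior concrete, here is a toy SECDED code over a 4-bit value: Hamming(7,4) plus one overall parity bit. This is an illustrative sketch only. Real ECC DIMMs use wider codes (typically 64 data bits plus 8 check bits), and the function and enum names below are made up for the example.

```c
#include <stdint.h>

enum ecc_status { ECC_OK = 0, ECC_CORRECTED = 1, ECC_UNCORRECTABLE = 2 };

static unsigned bit(uint8_t w, unsigned pos) { return (w >> pos) & 1u; }

/* Encode 4 data bits into 8: Hamming positions 1..7 live in bits 0..6,
 * bit 7 is the parity of bits 0..6. */
uint8_t secded_encode(uint8_t nibble)
{
    unsigned d0 = bit(nibble, 0), d1 = bit(nibble, 1);
    unsigned d2 = bit(nibble, 2), d3 = bit(nibble, 3);
    unsigned p1 = d0 ^ d1 ^ d3;   /* covers positions 1,3,5,7 */
    unsigned p2 = d0 ^ d2 ^ d3;   /* covers positions 2,3,6,7 */
    unsigned p4 = d1 ^ d2 ^ d3;   /* covers positions 4,5,6,7 */
    uint8_t cw = (uint8_t)(p1 | p2 << 1 | d0 << 2 | p4 << 3 |
                           d1 << 4 | d2 << 5 | d3 << 6);
    unsigned overall = 0;
    for (unsigned i = 0; i < 7; i++)
        overall ^= bit(cw, i);
    return (uint8_t)(cw | overall << 7);
}

/* Decode, fixing a single flipped bit and flagging double flips. */
enum ecc_status secded_decode(uint8_t cw, uint8_t *nibble_out)
{
    unsigned s1 = bit(cw, 0) ^ bit(cw, 2) ^ bit(cw, 4) ^ bit(cw, 6);
    unsigned s2 = bit(cw, 1) ^ bit(cw, 2) ^ bit(cw, 5) ^ bit(cw, 6);
    unsigned s4 = bit(cw, 3) ^ bit(cw, 4) ^ bit(cw, 5) ^ bit(cw, 6);
    unsigned syndrome = s1 | s2 << 1 | s4 << 2;  /* position of a single error */
    unsigned parity = 0;
    enum ecc_status status = ECC_OK;

    for (unsigned i = 0; i < 8; i++)
        parity ^= bit(cw, i);

    if (parity) {                 /* odd number of flips: assume one, fix it */
        if (syndrome)
            cw ^= (uint8_t)(1u << (syndrome - 1));
        status = ECC_CORRECTED;
    } else if (syndrome) {        /* even number of flips: cannot locate them */
        return ECC_UNCORRECTABLE;
    }

    *nibble_out = (uint8_t)(bit(cw, 2) | bit(cw, 4) << 1 |
                            bit(cw, 5) << 2 | bit(cw, 6) << 3);
    return status;
}
```

Three flips in one codeword look like a single flip to this decoder and get "corrected" to the wrong value, which is the gap multi-bit Rowhammer patterns aim for.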

Target Row Refresh (TRR)

Hardware solution by DRAM vendors:

- Monitor row activation counters
- If row accessed frequently, refresh adjacent rows
- Prevents charge leak that causes bit flips

Implemented in DDR4 DRAM (DDR5 adds refresh management commands)

Effectiveness: Good but not perfect
(many-sided hammering patterns such as TRRespass bypass it)
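The counter-and-refresh idea can be sketched as a small simulation. This is a simplified model under stated assumptions, not how any vendor implements TRR. Real devices track only a handful of rows, which is one reason bypasses exist, while this sketch keeps a counter per row. NUM_ROWS, ACT_THRESHOLD, and all function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

#define NUM_ROWS      1024     /* assumed rows per bank in this model */
#define ACT_THRESHOLD 50000u   /* assumed activation limit per refresh window */

struct trr_bank {
    uint32_t activations[NUM_ROWS];        /* activations since last reset */
    uint32_t neighbor_refreshes[NUM_ROWS]; /* extra refreshes each row received */
};

void trr_init(struct trr_bank *b)
{
    memset(b, 0, sizeof(*b));
}

/* Record one activation of `row`. When the row reaches the threshold,
 * refresh both physical neighbors and reset its counter.
 * Returns 1 if a targeted refresh was issued, 0 otherwise. */
int trr_activate(struct trr_bank *b, uint32_t row)
{
    if (row >= NUM_ROWS)
        return 0;
    if (++b->activations[row] < ACT_THRESHOLD)
        return 0;

    if (row > 0)
        b->neighbor_refreshes[row - 1]++;
    if (row + 1 < NUM_ROWS)
        b->neighbor_refreshes[row + 1]++;
    b->activations[row] = 0;
    return 1;
}

/* The regular refresh cycle recharges every row, so counters start over. */
void trr_refresh_window(struct trr_bank *b)
{
    memset(b->activations, 0, sizeof(b->activations));
}
```

Hammering row 2 past the threshold triggers a refresh of rows 1 and 3 before enough charge has leaked to flip bits.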

Software Mitigations

# Limit cache flush instructions (clflush)
# (Requires kernel patch)

# Increase DRAM refresh rate
# (Reduces performance)

# Memory deduplication disabled
echo 0 > /sys/kernel/mm/ksm/run

# Prevent predictable physical addresses
# (KASLR + physical address randomization)

OS-Level

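// Conceptual sketch of software Rowhammer detection (not mainline Linux;
// modeled on research prototypes such as ANVIL)

void dram_protect(void) {
    // Monitor last-level cache misses (proxy for DRAM row activations,
    // since hammering needs uncached accesses).
    // read_pmc() and LLC_MISS_EVENT are placeholders.
    u64 llc_misses = read_pmc(LLC_MISS_EVENT);

    if (llc_misses > THRESHOLD) {
        // Potential rowhammer attack
        // Read the rows next to the hot addresses so DRAM recharges them
        // (refresh_neighbor_rows() is a hypothetical helper)
        refresh_neighbor_rows();

        // Log for analysis
        pr_warn("Potential Rowhammer attack detected\n");
    }
}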

6. Sandboxing Techniques

6.1 Namespaces (Containers)

Linux namespaces isolate resources between processes.
// 8 types of namespaces (the time namespace was added in Linux 5.6)

#define CLONE_NEWNS     0x00020000  // Mount namespace
#define CLONE_NEWUTS    0x04000000  // UTS (hostname) namespace
#define CLONE_NEWIPC    0x08000000  // IPC namespace
#define CLONE_NEWPID    0x20000000  // PID namespace
#define CLONE_NEWNET    0x40000000  // Network namespace
#define CLONE_NEWUSER   0x10000000  // User namespace
#define CLONE_NEWCGROUP 0x02000000  // Cgroup namespace
#define CLONE_NEWTIME   0x00000080  // Time namespace
Creating Isolated Environment:
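#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

int sandbox_init(void *arg) {
    (void)arg;

    // Make mounts private so changes do not propagate to the host
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
        perror("mount");

    // New hostname (visible only inside this UTS namespace)
    sethostname("sandbox", 7);

    // New root filesystem (production sandboxes prefer pivot_root)
    if (chroot("/var/sandbox") != 0 || chdir("/") != 0) {
        perror("chroot");
        return 1;
    }

    // Execute sandboxed program
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");  // Only reached if exec fails
    return 1;
}

int main(void) {
    // clone() needs a dedicated child stack; 4 KiB is too small for exec
    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return 1;

    // Create new namespaces (requires CAP_SYS_ADMIN, so run as root)
    int flags = CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID |
                CLONE_NEWNET | CLONE_NEWIPC;

    // The stack grows down, so pass the top of the buffer
    pid_t pid = clone(sandbox_init, stack + STACK_SIZE, flags | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");
        free(stack);
        return 1;
    }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}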
PID Namespace (process isolation):
// Parent namespace
// PID 1: init
// PID 123: parent
// PID 124: child (clone with CLONE_NEWPID)

// Inside child's PID namespace
getpid();  // Returns 1 (child is PID 1 in its namespace)

// Child can only see processes in its namespace
ps aux  // Only shows processes in this namespace

// Parent can still see child
// PID 124 in parent namespace == PID 1 in child namespace
Network Namespace (network isolation):
# Create new network namespace
ip netns add sandbox

# Execute command in namespace
ip netns exec sandbox ip link list
# 1: lo: <LOOPBACK> state DOWN
# (Only loopback, no network access!)

# Create virtual interface pair
ip link add veth0 type veth peer name veth1

# Move veth1 to sandbox namespace
ip link set veth1 netns sandbox

# Configure networking
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up

ip netns exec sandbox ip addr add 10.0.0.2/24 dev veth1
ip netns exec sandbox ip link set veth1 up

# Now sandbox can communicate via veth interface

6.2 Chrome Multi-Process Sandbox

Chrome Architecture:
────────────────────

┌───────────────────────────────────────────────┐
│             Browser Process                    │
│  - Runs with full privileges                  │
│  - Manages windows, tabs, plugins             │
│  - Opens files, sockets on behalf of renderers│
│  - Passes FDs via SCM_RIGHTS                  │
└────────────┬──────────────────────────────────┘

     ┌───────┼───────┬──────────┐
     │       │       │          │
┌────▼─────┐ │  ┌────▼─────┐  ┌▼──────────┐
│ Renderer │ │  │ Renderer │  │   GPU     │
│ (Tab 1)  │ │  │ (Tab 2)  │  │  Process  │
│          │ │  │          │  │           │
│ Sandbox: │ │  │ Sandbox: │  │ Sandbox:  │
│ - seccomp│ │  │ - seccomp│  │ - seccomp │
│ - No FS  │ │  │ - No FS  │  │ - Limited │
│ - No net │ │  │ - No net │  │   access  │
└──────────┘ │  └──────────┘  └───────────┘

        ┌────▼─────┐
        │  Plugin  │
        │ Process  │
        │ (Flash)  │
        │ Sandbox  │
        └──────────┘

Sandbox Restrictions (Linux):
1. Seccomp-BPF: Allow only ~30 syscalls
2. Namespaces: PID, NET, IPC isolation
3. chroot: Fake root filesystem
4. No setuid/setgid
5. No capabilities
6. Read-only /proc, /sys
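The "Passes FDs via SCM_RIGHTS" step in the diagram is what lets a renderer with no filesystem access still read a file: the browser opens it and hands over the descriptor. Below is a minimal sketch of that mechanism using the standard sendmsg/recvmsg ancillary-data API over a Unix domain socket. The helper names send_fd and recv_fd are illustrative and are not Chrome's actual IPC layer.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a Unix domain socket. Returns 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 'F';  /* at least one byte of real data must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;  /* forces correct alignment of buf */
    } ctrl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new fd, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;  /* a new descriptor referring to the same open file */
}
```

The receiver gets its own descriptor number for the same open file description, so the sandboxed side never needs open() in its seccomp allowlist.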
Chrome Sandbox Code (simplified from sandbox/linux/):
// Renderer process startup

void RendererMain() {
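    // 1. Enter namespaces (needs privileges, so this comes first)
    unshare(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWIPC);

    // 2. chroot to empty directory (needs CAP_SYS_CHROOT)
    chroot("/var/empty");
    chdir("/");

    // 3. Drop privileges: group first, because setgid() fails once the UID is gone
    setgid(nobody_gid);
    setuid(nobody_uid);

    // 4. Drop all remaining capabilities
    drop_all_capabilities();

    // 5. Enable no_new_privs (required before an unprivileged seccomp load)
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

    // 6. Install seccomp filter last: it would block the calls above
    install_renderer_seccomp_filter();

    // 7. Run renderer
    RunRendererLoop();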
}

void install_renderer_seccomp_filter() {
    // Allow only essential syscalls
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

    // Read/write/close
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);

    // Memory management
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);

    // IPC (to browser process)
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(recvmsg), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sendmsg), 0);

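    // Process exit and signal return must always stay allowed
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);

    // DENY (default action): open, socket, execve, fork, etc.

    seccomp_load(ctx);
    seccomp_release(ctx);  // Frees the context; the loaded filter stays active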
}
Escape Detection (from browser process):
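// Browser monitors renderer health
// Note: PTRACE_EVENT_SECCOMP stops fire only for rules that return
// SCMP_ACT_TRACE; a filter using SCMP_ACT_KILL terminates the process directly.

void MonitorRenderer(int renderer_pid) {
    // Get notified when the renderer hits a traced seccomp rule
    ptrace(PTRACE_SEIZE, renderer_pid, NULL, PTRACE_O_TRACESECCOMP);

    while (1) {
        int status;
        if (waitpid(renderer_pid, &status, 0) < 0 ||
            WIFEXITED(status) || WIFSIGNALED(status))
            break;  // Renderer is gone

        if (WIFSTOPPED(status) && status >> 8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8))) {
            // Seccomp violation detected!
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, renderer_pid, NULL, &regs);

            long syscall_nr = regs.orig_rax;
            log_security_violation(renderer_pid, syscall_nr);

            // Kill malicious renderer
            kill(renderer_pid, SIGKILL);
            respawn_renderer();
            break;
        }

        // Any other stop: let the renderer continue
        ptrace(PTRACE_CONT, renderer_pid, NULL, NULL);
    }
}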

6.5 Production Caveats and Common Pitfalls

Linux security primitives are individually well-designed. The failure mode is composition: each primitive looks correct in isolation, then a subtle assumption interaction creates an escape route. Below are four traps that bite even experienced security engineers, paired with the patterns that close them.
Pitfall 1: Reaching for setuid when modern Linux wants fine-grained capabilities
The historical Unix model is “either you are root or you are not.” A setuid binary runs with the file owner’s privileges, almost always root, which means a single bug in ping or mount or passwd is a path to total compromise. CVE history is full of local root escalations in this area: pkexec (CVE-2021-4034, PwnKit), sudo (CVE-2021-3156, Baron Samedit), Ubuntu’s OverlayFS (CVE-2023-2640). The trap is that engineers reach for setuid out of habit because “I just need to bind to port 80” or “I need to read this hardware register,” when modern Linux offers a far narrower grant.
The other side of the same trap: people use Docker’s --privileged flag because they ran into a permission error and wanted to make it go away. --privileged gives the container all capabilities, exposes host devices, and disables the seccomp and AppArmor profiles. It is the Docker equivalent of chmod 777.
Solution: file capabilities and ambient capabilities
Linux capabilities split root’s powers into about 40 fine-grained privileges. Bind to low ports? CAP_NET_BIND_SERVICE. Send raw packets? CAP_NET_RAW. Trace processes owned by other users? CAP_SYS_PTRACE. Grant only what the binary actually needs:
# Old way: setuid root
chmod u+s /usr/bin/myserver

# Modern way: file capability
setcap cap_net_bind_service=+ep /usr/bin/myserver
# Verify
getcap /usr/bin/myserver
# Run as a normal user; bind succeeds, nothing else is privileged
For containers, drop everything and add back what you need:
# Kubernetes pod spec
securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]
  runAsNonRoot: true
  readOnlyRootFilesystem: true
The mental model: capabilities are the principle of least privilege made concrete. Anything you cannot justify by name should not be in the bag.
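One way to check what is actually "in the bag" is to read the CapEff line from /proc/<pid>/status and decode the bitmask. The sketch below does the decoding part only. The helper names are illustrative, and the capability numbers are copied from <linux/capability.h> so the example compiles without kernel headers.

```c
#include <stdint.h>
#include <stdio.h>

/* Capability bit numbers as defined in <linux/capability.h>. */
enum {
    CAPNR_NET_BIND_SERVICE = 10,
    CAPNR_NET_RAW          = 13,
    CAPNR_SYS_ADMIN        = 21
};

/* Parse a line such as "CapEff:\t0000000000000400" into a bitmask.
 * Returns 0 on success, -1 if the line is not a CapEff line. */
int parse_capeff(const char *line, uint64_t *mask)
{
    unsigned long long value;

    if (sscanf(line, "CapEff: %llx", &value) != 1)
        return -1;
    *mask = (uint64_t)value;
    return 0;
}

/* Returns 1 if capability number `cap` is present in the mask. */
int cap_is_set(uint64_t mask, int cap)
{
    return (int)((mask >> cap) & 1u);
}

/* Returns 1 if the mask holds nothing beyond the `allowed` bits. */
int caps_within(uint64_t mask, uint64_t allowed)
{
    return (mask & ~allowed) == 0;
}
```

A deployment check could read /proc/self/status at startup and refuse to run when caps_within() fails against the expected set.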
Pitfall 2: seccomp filter holes: library code that needs syscalls you blocked
Engineers write seccomp profiles assuming the syscalls their own code makes are the whole list. They are not. Library code you call makes syscalls on your behalf, often only on rarely taken paths. Classic example: you block mprotect because you do not want anyone changing page permissions. But the dynamic loader calls mprotect when it maps a shared object, for instance to make relocated data read-only (RELRO), and glibc calls it when it sets up a new thread stack. Block mprotect and the first dlopen or pthread_create dies with SIGSYS.
The general pattern: glibc and the dynamic loader have invisible dependencies on mmap, mprotect, arch_prctl, sigaltstack, prctl, rseq, and others. A “minimal” syscall allowlist generated by strace on a happy-path test will miss many of these because they only fire on certain code paths: error handling, signal delivery, malloc growth, TLS allocation. The application crashes hours into production with SIGSYS.
Solution: build seccomp profiles iteratively, log first, kill later
The first profile you deploy should be SCMP_ACT_LOG (audit-only). Run the application under realistic load, including failure scenarios, and watch /var/log/audit/audit.log for SECCOMP events. Add anything legitimate to the allowlist. Only after a stable observation window do you flip to SCMP_ACT_KILL_PROCESS.
// Phase 1: log mode -- production observation
seccomp_init(SCMP_ACT_LOG);

// Phase 2: enforce, but with structured fallback
seccomp_init(SCMP_ACT_ERRNO(EPERM));  // return EPERM to userspace
// gives the app a chance to log and fail gracefully

// Phase 3: hard enforcement
seccomp_init(SCMP_ACT_KILL_PROCESS);
For containers, do not write a seccomp profile from scratch. Start from Docker’s default profile, audit it for your workload, and tighten. Recording tools such as oci-seccomp-bpf-hook and the Kubernetes Security Profiles Operator can generate a starting profile from a real workload run.
Practical rule: if a syscall is required for signal return or process exit (rt_sigreturn, restart_syscall, exit, exit_group), it stays unconditionally. Locking these out turns an ordinary signal-handler return or clean shutdown into a SIGSYS kill.
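During the log-only phase, the allowlist grows by pulling syscall numbers out of the audit records. The following is a minimal sketch of that extraction, assuming the usual key=value layout of type=SECCOMP audit lines. The function name is illustrative, and the sample line in the usage note is made up for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Extract the syscall number from one audit log line.
 * Returns the number, or -1 if the line is not a SECCOMP record
 * or carries no parsable syscall= field. */
long seccomp_audit_syscall(const char *line)
{
    static const char key[] = " syscall=";
    const char *field;
    char *end;
    long nr;

    if (strstr(line, "type=SECCOMP") == NULL)
        return -1;

    field = strstr(line, key);
    if (field == NULL)
        return -1;
    field += sizeof(key) - 1;   /* skip past " syscall=" */

    nr = strtol(field, &end, 10);
    if (end == field || nr < 0)
        return -1;
    return nr;
}
```

For a record containing `arch=c000003e syscall=10`, the function returns 10, which is mprotect on x86-64. Syscall numbers are architecture-specific, so any real tooling should map them to names using the arch field.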
Pitfall 3: ASLR with insufficient entropy (32-bit and PIE-disabled binaries)
ASLR works by randomizing base addresses. The strength is set by the entropy in those addresses. On 64-bit systems, libraries get ~28-30 bits of randomization, which makes brute force impractical. On 32-bit systems, you have at most 16 bits of entropy for shared libraries, and given typical alignment and layout constraints, often closer to 8-12 effective bits. That is 256 to 4096 guesses to defeat. A network-facing service that survives crashes (forking server, supervisor that restarts) gives an attacker effectively unlimited attempts.
Worse, ASLR only randomizes binaries that opted in. If a binary is built without -fPIE -pie, its .text section sits at a fixed address regardless of ASLR settings. CVE history is rich with examples: Apache modules built without PIE on RHEL, vendor binaries shipped without ASLR-aware compilation flags, JIT-compiled code regions that the runtime maps at deterministic addresses.
Solution: enforce 64-bit, PIE, and full RELRO at the toolchain level
# Compiler flags for ASLR-aware binaries
# (-O2 is needed: _FORTIFY_SOURCE has no effect without optimization)
gcc -O2 -fPIE -pie -Wl,-z,relro,-z,now -fstack-protector-strong \
    -D_FORTIFY_SOURCE=2 myapp.c -o myapp

# Verify the result
checksec --file=./myapp
# Expected: PIE enabled, Full RELRO, NX enabled, Canary found, FORTIFY enabled
Audit your fleet with checksec or hardening-check across every shipped binary. Treat any binary without PIE as a finding.
For 32-bit, the honest answer in 2026 is “do not ship 32-bit network services.” If you have legacy 32-bit binaries that must remain, run them inside a stricter sandbox: gVisor, Firecracker, or at minimum a dedicated user namespace with no network capabilities. Address randomization alone cannot defend a 32-bit address space against a determined attacker.
Modern bonus: enable -fcf-protection=full on x86 to opt into Intel CET (Indirect Branch Tracking and Shadow Stack), and -mbranch-protection=standard on ARM for PAC and BTI. These are the hardware-supported successors to ASLR-only defenses.
Pitfall 4: namespace escape via /proc/self vs procfs assumptions
User namespaces let unprivileged users gain capabilities scoped to a new namespace. The classic attack pattern: create a user namespace, become “root” inside it, then exploit a kernel bug that does not properly check whether your capability is namespaced or global. Pre-2018 kernels were riddled with these namespace-unaware checks, leading to escapes via mount, keyctl, and bpf.
The procfs version of the same trap: /proc/self resolves relative to the kernel’s view of the calling process, which can differ from the namespace’s view in subtle ways. A container that mounts /proc from the host (rather than its own private procfs) leaks information about every process on the host, and /proc/self/exe and /proc/self/root can be used to bypass chroot in some configurations. Worse, /proc/<pid>/setgroups, /proc/<pid>/uid_map, and /proc/<pid>/gid_map are the gatekeepers for user namespace permissions, so a misconfigured container that allows write access to these can be escaped from.
In 2019, runc had CVE-2019-5736: a container could overwrite the host’s runc binary by exploiting the way /proc/self/exe resolved at exec time. The fix was substantial: runc now copies its own binary into a memfd and re-execs from there.
Solution: defense in depth around procfs, plus user-namespace guardrails
# Mount /proc with hidepid -- containers cannot see other processes
mount -t proc -o hidepid=2 proc /proc

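# Disable unprivileged user namespaces if you do not use them
# (this sysctl exists on Debian/Ubuntu kernels; elsewhere,
#  user.max_user_namespaces=0 disables user namespaces entirely)
sysctl -w kernel.unprivileged_userns_clone=0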

# In containers, mount /proc as a fresh procfs scoped to the namespace
# Most container runtimes do this; verify with /proc/self/mountinfo
For container runtimes, follow the lesson of CVE-2019-5736: never re-exec from a path the sandbox can write to. Modern runtimes use memfd_create plus execveat to load the runtime binary from a memory-backed fd that no namespaced process can touch.
Auditing approach: enumerate every path in /proc your container can read or write, and for each, ask “what does this give an attacker if they can write arbitrary bytes here?” The answers are sometimes scary: /proc/sys/kernel/core_pattern accepts pipe-to-program syntax, so a container that can write it can execute host commands by triggering a core dump. CVE-2022-0492 abused the analogous cgroup v1 release_agent file.
Stronger pattern: use rootless containers (Podman’s default, Docker’s optional mode) so that root inside the namespace maps to an unprivileged user on the host. Escape primitives that need real host root or CAP_SYS_ADMIN in the initial namespace then lead nowhere.

7. Interview Questions & Answers

NX (No-Execute) / DEP (Data Execution Prevention) uses the CPU’s NX bit in page table entries.
Page Table Entry Structure (x86-64):
  • Bit 63: NX (No-Execute) bit
  • When set: Page cannot be executed (will fault with #PF if IP points here)
  • When clear: Page is executable
Kernel Implementation:
// When mapping stack
vma->vm_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
// NO VM_EXEC flag!

// Page table entry will have NX bit SET
pte = pfn_pte(pfn, vma->vm_page_prot);  // Protection derived from vm_flags (no VM_EXEC → NX set)
set_pte(pte_addr, pte);
Protection:
  1. Attacker overflows buffer on stack
  2. Injects shellcode
  3. Overwrites return address to point to shellcode
  4. Function returns, jumps to shellcode address
  5. CPU checks NX bit → Page is not executable
  6. #PF (Page Fault) → Kernel kills process
W^X Policy: Page is writable OR executable, never both.
  • Stack/Heap: Writable, NOT executable
  • Code: Executable, NOT writable
  • Prevents: Code injection attacks
Bypass: Return-Oriented Programming (ROP) - reuse existing executable code instead of injecting new code.
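The W^X policy can be audited from user space by scanning /proc/<pid>/maps for mappings that are both writable and executable. The sketch below checks a single maps line. The function name is illustrative, and it assumes the standard `start-end perms offset dev inode path` layout.

```c
#include <stdio.h>
#include <string.h>

/* Inspect one line of /proc/<pid>/maps.
 * Returns 1 if the mapping is writable AND executable (a W^X violation),
 * 0 if it is not, and -1 if the line cannot be parsed. */
int maps_line_is_wx(const char *line)
{
    unsigned long start, end;
    char perms[5];

    /* Example line: "7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0 [stack]" */
    if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
        return -1;
    if (strlen(perms) != 4)
        return -1;

    /* perms is "rwxp"-style: index 1 is write, index 2 is execute */
    return perms[1] == 'w' && perms[2] == 'x';
}
```

JIT runtimes are the usual source of legitimate hits. Well-behaved ones map the region writable, emit code, then flip it to read+execute with mprotect.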
ASLR (Address Space Layout Randomization) randomizes memory layout at program start.
Randomized Regions:
  • Stack base address
  • Heap base address
  • Libraries (libc, etc.)
  • Executable base (if PIE - Position Independent Executable)
  • vDSO, vvar
Entropy (x86-64 Linux):
  • Stack: 22 bits at page granularity → about 4 million possible positions (the often-quoted 19 bits is the 32-bit x86 figure)
  • Heap: 13 bits → 8,192 possible positions
  • Libraries: 28 bits → 268 million possible positions
  • PIE executable: 28 bits → 268 million possible positions
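The arithmetic behind these figures is simple enough to sketch: n bits of entropy give 2^n positions, and an attacker probing a layout that stays fixed between attempts needs about half of them on average. The helper names below are illustrative, and the attempt rate in the usage note is an assumed number, not a measurement.

```c
#include <stdint.h>

/* Number of distinct base addresses for a given entropy in bits. */
uint64_t aslr_positions(unsigned bits)
{
    return bits >= 64 ? UINT64_MAX : (uint64_t)1 << bits;
}

/* Average guesses needed when the layout does NOT change between attempts
 * (for example a forking server): roughly half the search space. */
uint64_t expected_guesses_fixed_layout(unsigned bits)
{
    return bits == 0 ? 1 : aslr_positions(bits - 1);
}

/* Worst-case seconds to sweep the whole space at a given attempt rate. */
double seconds_to_sweep(unsigned bits, double attempts_per_sec)
{
    if (attempts_per_sec <= 0.0)
        return 0.0;
    return (double)aslr_positions(bits) / attempts_per_sec;
}
```

At an assumed 100 attempts per second, 13 bits fall in about 82 seconds, while 28 bits take roughly a month of continuous crashing. That is why low-entropy regions and crash-tolerant services are the weak points.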
How It Prevents Exploitation:
Traditional exploit (no ASLR):
Attacker knows: libc is at 0x7ffff7a0d000
Attacker's ROP chain:
  return to 0x7ffff7a52390 (system)
  argument: 0x7ffff7b99d88 ("/bin/sh")

Works every time!
With ASLR:
Run 1: libc at 0x7f8a2e456000
Run 2: libc at 0x7f3c81de2000
Run 3: libc at 0x7fb1c9a2f000

Attacker's hardcoded addresses: WRONG!
Exploit crashes instead of succeeding
Weaknesses:
  1. Information Leak:
    • Pointer disclosure → calculate base addresses → bypass ASLR
    • Format string bugs, memory corruption leaks
  2. Entropy Limitations:
    • 13 bits (heap) = 8,192 attempts
    • If process doesn’t crash (fork server), brute-forceable
  3. 32-bit Systems:
    • Limited address space → low entropy
    • 8 bits library randomization → 256 attempts
  4. Non-PIE Executables:
    • Main executable at fixed address
    • Contains ROP gadgets at known addresses
  5. Cache Timing Attacks:
    • Side-channel attacks can determine addresses
Mitigations for Weaknesses:
  • Use PIE (Position Independent Executable)
  • Fix information leaks
  • Re-exec workers instead of forking them, so a crashed guess is followed by a fresh layout
  • Use Control Flow Integrity (CFI)
  • Combine with other defenses (NX, stack canaries)
Stack Canary: Random value placed between local variables and return address.
Mechanism:
Stack Frame:
┌──────────────────┐ High address
│ Return Address   │ ← Protected
├──────────────────┤
│ Saved RBP        │
├──────────────────┤
│ CANARY (random)  │ ← __stack_chk_guard (stored in TLS)
├──────────────────┤
│ Local vars       │
│ char buf[100]    │ ← Overflow starts here
└──────────────────┘ Low address

Function Prologue:
  mov rax, fs:0x28      ; Load canary from TLS
  mov [rbp-8], rax      ; Store on stack

Function Epilogue:
  mov rax, [rbp-8]      ; Load stack canary
  xor rax, fs:0x28      ; Compare with original
  je .L_OK              ; Match? OK
  call __stack_chk_fail ; Mismatch? ABORT
.L_OK:
  ret
Detection:
  1. Buffer overflow overwrites local variables
  2. Overflow continues, overwrites canary
  3. Function reaches its epilogue, just before ret
  4. Compiler-inserted check runs in user space: stack canary == __stack_chk_guard?
  5. Mismatch → __stack_chk_fail() → “stack smashing detected” → abort()
Bypass Techniques:
1. Leak Canary:
// Format string vulnerability
printf(user_input);  // User provides: "%p %p %p ..."
// Leaks stack contents, including canary!

// Attacker:
// 1. Leak canary value
// 2. Craft overflow to include correct canary value
// 3. Overflow succeeds without detection
2. Overwrite Pointer Before Canary:
char buf[100];
char *ptr = &authorized;
unsigned long canary;
void *return_address;

// Overflow overwrites ptr but not canary
strcpy(buf, malicious_input);  // Overflow only buf and ptr

// ptr now points to attacker-controlled memory
// Canary intact → No detection!
3. Fork Without Re-randomization (typical of forking servers):
// fork() copies the parent's memory, so every child shares the same canary
while (1) {
    if (fork() == 0) {
        handle_request();  // Child crashes on a wrong guess; parent keeps serving
        exit(0);
    }
}

// Attacker brute-forces canary byte-by-byte
// Try 0x00, 0x01, 0x02, ... 0xFF for first byte
// If child crashes: wrong guess
// If child doesn't crash: correct! Move to next byte
// 8 bytes × 256 attempts = 2,048 attempts max
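The byte-by-byte search above can be simulated without any real process, which makes the 2,048-attempt bound easy to verify. In this sketch the "child survives" oracle is modeled as a prefix comparison against the secret. All names are illustrative, and no actual exploitation is involved.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CANARY_LEN 8

/* Simulated forked child: it survives only if the first n bytes the
 * attacker wrote over the canary match the real value. */
static int child_survives(const uint8_t *secret, const uint8_t *guess, size_t n)
{
    return memcmp(secret, guess, n) == 0;
}

/* Recover the canary one byte at a time, as the oracle allows.
 * Writes the result to `recovered` and returns the number of attempts. */
unsigned brute_force_canary(const uint8_t secret[CANARY_LEN],
                            uint8_t recovered[CANARY_LEN])
{
    unsigned attempts = 0;

    for (size_t pos = 0; pos < CANARY_LEN; pos++) {
        for (unsigned value = 0; value < 256; value++) {
            recovered[pos] = (uint8_t)value;
            attempts++;
            if (child_survives(secret, recovered, pos + 1))
                break;  /* no crash: this byte is correct, move on */
        }
    }
    return attempts;
}
```

Each position costs at most 256 tries because earlier bytes are already known, so the total is bounded by 8 × 256 rather than 256^8. Re-randomizing the canary per child, for example by re-executing the worker, removes the oracle.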
4. Partial Overflow (non-linear write):
// Overwrite only the return address, not the canary
// (Requires knowledge of stack layout and a non-contiguous write primitive,
//  e.g. an attacker-controlled array index; a linear overflow cannot skip the canary)

┌──────────────┐
│ Ret Addr     │ ← Overwrite 1 byte (change low byte only)
├──────────────┤
│ Saved RBP    │ ← Skip
├──────────────┤
│ Canary       │ ← Leave untouched!
├──────────────┤
│ buf[100]     │
└──────────────┘

// Targeted write changes return address without touching canary
Mitigations:
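  • Combine with ASLR (a leaked canary alone does not tell the attacker where to jump)
  • Use fortified functions (__strcpy_chk, enabled via _FORTIFY_SOURCE) to prevent overflows
  • Re-randomize canary after fork
  • Stack Clash protection (-fstack-clash-protection; prevents jumping over the stack guard page)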
Traditional setuid:
# setuid binary runs with owner's privileges

ls -l /usr/bin/passwd
# -rwsr-xr-x root root /usr/bin/passwd
#    ↑ setuid bit

# When user runs passwd:
# 1. Process starts with user's UID
# 2. Kernel sees setuid bit
# 3. Sets effective UID to file owner (root)
# 4. Process has FULL root privileges

# Problem: All or nothing!
# passwd only needs to write /etc/shadow
# But gets ALL root capabilities
Capabilities:
Divide root into 41 distinct privileges:

CAP_CHOWN            - Change file ownership
CAP_DAC_OVERRIDE     - Bypass file permissions
CAP_NET_BIND_SERVICE - Bind ports < 1024
CAP_NET_RAW          - Use raw sockets
CAP_SYS_ADMIN        - System administration
CAP_SYS_MODULE       - Load kernel modules
... 35 more ...

Process gets ONLY what it needs!
Comparison:
| Feature | setuid | Capabilities |
|---|---|---|
| Granularity | All or nothing | Fine-grained (41 capabilities) |
| Security | Over-privileged | Least privilege |
| Persistence | Lost on exec (unless binary is setuid) | Can be inherited |
| Auditability | Hard to see why root is needed | Clear which capability is used |
Example: Network Server:
// Old way: setuid root binary
int main() {
    // Running as root (UID 0)
    // Can do ANYTHING!

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    bind(sock, ..., 80);  // Bind port 80 (needs root)

    setuid(nobody);  // Drop privileges after bind

    // Problem: Race window while root
    // If exploit before setuid(), full root access!
}

// New way: Capabilities
int main() {
    // Running as nobody (UID 65534)
    // Has ONLY CAP_NET_BIND_SERVICE

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    bind(sock, ..., 80);  // Works! (has capability)

    open("/etc/shadow", O_RDONLY);  // FAIL! (no CAP_DAC_OVERRIDE)

    // Even if exploited, attacker only has port binding
    // Cannot read files, cannot exec as root, etc.
}
Setting Capabilities:
# Give binary capability instead of setuid
# Before:
chmod u+s /usr/bin/ping  # setuid root (dangerous!)

# After:
setcap cap_net_raw+ep /usr/bin/ping  # Only raw socket capability

# Verify
getcap /usr/bin/ping
# /usr/bin/ping = cap_net_raw+ep

# Now ping can create raw sockets but has NO other root powers
Why Capabilities Are Better:
  1. Principle of Least Privilege: Only grant necessary permissions
  2. Reduced Attack Surface: Exploit gets limited capabilities, not full root
  3. Better Auditability: Clear why each capability is needed
  4. Flexibility: Can grant to non-root users
  5. Inheritance: Can design capability-aware services
Real-World Usage:
  • systemd services with capabilities
  • Docker containers (run as non-root with specific capabilities)
  • Network daemons (CAP_NET_BIND_SERVICE instead of setuid)
Meltdown Vulnerability:
CPU speculatively executes kernel memory access from user mode:

// User-mode code
char kernel_byte = *(char *)0xffff880000000000;  // Kernel address

// CPU behavior:
// 1. Starts speculative execution before permission check
// 2. Loads kernel memory (should fault, but hasn't checked yet)
// 3. Uses loaded byte to index array: probe[kernel_byte * 4096]
// 4. This brings probe[...] into cache ← SIDE EFFECT!
// 5. Permission check completes → Exception!
// 6. Architectural state rolled back
// 7. But cache state remains! ← LEAK!

// Attacker measures cache timing → recovers kernel_byte
KPTI (Kernel Page Table Isolation) Solution:
Without KPTI (vulnerable):
┌─────────────────────────────┐
│ User Mode (CR3 = user_pgd)  │
│                             │
│ User virtual addresses      │
│ 0x0 - 0x7fffffffffff        │
│   │                         │
│   ├─> User pages            │
│   │                         │
│ Kernel virtual addresses    │
│ 0xffff800000000000 - ...    │ ← Mapped in user page tables!
│   │                         │ ← Meltdown can speculatively access
│   ├─> Kernel pages          │
└─────────────────────────────┘

With KPTI (secure):
User Mode:
┌─────────────────────────────┐
│ CR3 = user_pgd              │
│                             │
│ User virtual addresses      │
│ 0x0 - 0x7fffffffffff        │
│   ├─> User pages            │
│                             │
│ Kernel virtual addresses    │
│ 0xffff800000000000 - ...    │
│   ├─> MINIMAL kernel stubs  │ ← Only entry/exit trampolines!
│   │    (entry_SYSCALL_64)   │ ← Rest of kernel NOT MAPPED
└─────────────────────────────┘

Kernel Mode (after syscall):
┌─────────────────────────────┐
│ CR3 = kernel_pgd            │
│                             │
│ User virtual addresses      │
│   ├─> User pages            │
│                             │
│ Kernel virtual addresses    │
│   ├─> FULL kernel mapping   │ ← All kernel code/data accessible
└─────────────────────────────┘
Syscall Flow with KPTI:
; User-mode application
mov rax, 1        ; SYS_write
mov rdi, 1        ; fd = stdout
syscall           ; Enter kernel

; ← CPU switches to kernel mode ←

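entry_SYSCALL_64:
    ; Still using user page tables!
    swapgs                    ; Swap GS (get kernel per-CPU data)
    mov gs:[rsp_scratch], rsp ; Save user stack pointer; rsp is now free as scratch

    ; SWITCH PAGE TABLES (the expensive part!)
    ; The user PGD sits 4 KiB above the kernel PGD, so bit 12 of CR3 selects it
    mov rsp, CR3              ; Read current CR3 (user page table)
    and rsp, ~0x1000          ; Clear bit 12 to switch to kernel tables
    mov CR3, rsp              ; ← PAGE TABLE SWITCH (TLB flush without PCID!)
    mov rsp, gs:[cpu_current_top_of_stack]  ; Load kernel stack

    ; Now kernel is fully mapped
    ; Execute syscall handler...
    call do_syscall_64

    ; SWITCH BACK to user page tables (rdi used as scratch, restored afterwards)
    mov rdi, CR3
    or rdi, 0x1000            ; Set bit 12 to switch to user tables
    mov CR3, rdi              ; ← PAGE TABLE SWITCH (TLB flush without PCID!)

    swapgs
    sysretq                   ; Return to user mode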
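In Linux's implementation the two top-level tables are allocated as one 8 KiB-aligned pair, with the kernel PGD in the first 4 KiB page and the user PGD in the second. Switching between them therefore comes down to toggling bit 12 of CR3. The sketch below shows only that address arithmetic. The macro names mirror the kernel's but are redefined here, the helpers are illustrative, and the PCID bits that the real entry code also manages are ignored.

```c
#include <stdint.h>

/* Bit 12 (PAGE_SHIFT) distinguishes the user PGD from the kernel PGD. */
#define PTI_USER_PGTABLE_BIT  12
#define PTI_USER_PGTABLE_MASK ((uint64_t)1 << PTI_USER_PGTABLE_BIT)

/* CR3 value to load on kernel entry: clear the user-table bit. */
uint64_t cr3_to_kernel(uint64_t cr3)
{
    return cr3 & ~PTI_USER_PGTABLE_MASK;
}

/* CR3 value to load before returning to user mode: set the bit. */
uint64_t cr3_to_user(uint64_t cr3)
{
    return cr3 | PTI_USER_PGTABLE_MASK;
}

/* Returns 1 if this CR3 value points at the user copy of the tables. */
int cr3_is_user(uint64_t cr3)
{
    return (cr3 & PTI_USER_PGTABLE_MASK) != 0;
}
```

Because the switch is a single bit flip on a value already in a register, the entry code needs no memory access to find the other table, which matters while only the trampoline is mapped.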
Performance Cost:
What makes it expensive:
  1. CR3 Write (page table switch):
    • ~150-300 CPU cycles per switch
    • 2 switches per syscall (enter + exit)
  2. TLB Flush:
    • Translation Lookaside Buffer caches virtual→physical address translations
    • Changing CR3 flushes TLB (must reload from memory)
    • TLB misses add ~100 cycles per memory access
  3. Frequency of Syscalls:
    • I/O-heavy workloads: Many syscalls → high overhead
    • CPU-bound workloads: Few syscalls → low overhead
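A back-of-the-envelope model ties these factors together: the direct cost is two CR3 writes per syscall, multiplied by the syscall rate and divided by the clock frequency. The function below is a rough sketch of that arithmetic only. It ignores the TLB-refill cost, which often dominates, and the example inputs in the usage note are assumed values rather than measurements.

```c
/* Fraction of CPU time spent on KPTI's CR3 writes alone.
 * syscalls_per_sec:     syscall rate of the workload
 * cycles_per_cr3_write: cost of one CR3 write (the text cites ~150-300)
 * cpu_hz:               core clock frequency in Hz */
double kpti_direct_overhead(double syscalls_per_sec,
                            double cycles_per_cr3_write,
                            double cpu_hz)
{
    if (cpu_hz <= 0.0)
        return 0.0;
    /* Two switches per syscall: one on entry, one on exit. */
    return syscalls_per_sec * 2.0 * cycles_per_cr3_write / cpu_hz;
}
```

With 500,000 syscalls per second, 250 cycles per write, and a 3 GHz core, the result is about 0.083, or roughly 8% of one core. At 1,000 syscalls per second the same formula gives under 0.02%, which matches the split between I/O-heavy and CPU-bound workloads.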
Measured Impact (varies by workload):
| Workload Type | Performance Loss |
|---|---|
| CPU-intensive (scientific computing) | 0-3% |
| Light I/O (web browsing) | 3-5% |
| Heavy I/O (file server) | 5-10% |
| Heavy syscalls (database, Redis) | 10-30% |
Optimizations:
  1. PCID (Process Context ID):
    • Tag TLB entries with PCID
    • Avoid full TLB flush on CR3 switch
    • Reduces overhead to 1-5%
  2. Lazy TLB Switching:
    • Kernel threads don’t switch page tables
    • Reuse previous user’s kernel mapping
  3. Hardware Fixes:
    • CPUs without the Meltdown bug (AMD, and newer Intel parts fixed in silicon) → no KPTI needed
    • Check: cat /sys/devices/system/cpu/vulnerabilities/meltdown
    • If it says “Not affected” → KPTI not active
Disable KPTI (for testing/benchmarking only!):
# Boot parameter
nopti            # equivalent to pti=off

# Or at runtime, only on kernels that expose this debugfs knob
# (some distribution kernels do; mainline does not)
echo 0 > /sys/kernel/debug/x86/pti_enabled

# WARNING: Disabling KPTI leaves system vulnerable to Meltdown!
Spectre Vulnerability (Bounds Check Bypass and Branch Target Injection):
CPU Speculative Execution:
// Victim code
if (x < array_size) {          // ← Branch
    y = array[x];              // ← Speculative execution
}

CPU's Branch Predictor:
- Predicts if branch will be taken or not
- Speculatively executes ahead while check happens
- If prediction wrong, rollback
- If prediction right, save time!

Problem: Rollback discards architectural state but NOT cache state!
Attack:
// Step 1: Train branch predictor
for (int i = 0; i < 1000; i++) {
    victim_function(valid_x);  // x < array_size, branch TAKEN
}
// Branch predictor learns: "This branch is ALWAYS taken"

// Step 2: Prepare side-channel
for (int i = 0; i < 256; i++) {
    clflush(&probe_array[i * 4096]);  // Flush cache
}

// Step 3: Attack with out-of-bounds x
victim_function(malicious_x);  // malicious_x >= array_size

// What happens:
// 1. Branch predictor predicts: TAKEN (based on training)
// 2. CPU speculatively executes: y = array[malicious_x]
// 3. This accesses out-of-bounds memory (any secret mapped in the victim's address space)
// 4. Uses leaked byte to index: probe_array[y * 4096]
// 5. This brings probe_array[y * 4096] into cache ← LEAK!
// 6. Branch check completes: x < array_size? FALSE
// 7. Rollback! Discard y, but cache state remains!

// Step 4: Recover leaked byte via timing
for (int i = 0; i < 256; i++) {
    t0 = rdtsc();
    temp = probe_array[i * 4096];
    t1 = rdtsc();

    if (t1 - t0 < THRESHOLD) {
        printf("Leaked byte: 0x%02x\n", i);  // Cache hit!
        break;
    }
}

// Result: Read memory the victim can access, even though the bounds check architecturally failed!
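The four attack steps can be sketched as a toy model. This is not a working exploit: the "cache" is a Python set, the trained branch predictor is a boolean flag, and the timing probe is a membership test. All names here are hypothetical and exist only to show why the rolled-back load still leaks.

```python
SECRET = b"K"                      # byte the attacker should not be able to read
ARRAY = bytes([1, 2, 3, 4])        # victim's in-bounds data
MEMORY = ARRAY + SECRET            # secret sits right after the array
PAGE = 4096


def victim(x, cache, predict_taken):
    """Bounds-checked read; a mispredicted branch still touches the cache."""
    in_bounds = x < len(ARRAY)
    if in_bounds or predict_taken:           # speculative path runs on prediction
        y = MEMORY[x]                        # transient (possibly out-of-bounds) load
        cache.add(y * PAGE)                  # dependent load leaves a cache footprint
    return MEMORY[x] if in_bounds else None  # architectural result: nothing leaks


def attack(malicious_x):
    cache = set()                            # empty set models the clflush step
    victim(malicious_x, cache, predict_taken=True)  # predictor already trained
    # "Timing" probe: a line present in the cache counts as a fast access
    hits = [i for i in range(256) if i * PAGE in cache]
    return hits[0] if hits else None


leaked = attack(len(ARRAY))                  # index 4 is one past the end
print(f"Leaked byte: 0x{leaked:02x}")
```

The architectural return value is `None`, yet the probe still recovers the secret byte, which is the whole point of the attack: the rollback restores registers but not cache contents.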
Why Retpolines Work:
Problem with Indirect Branches:
; Vulnerable indirect jump
jmp *%rax              ; Jump to address in rax

; Attacker can manipulate branch predictor to:
; 1. Predict wrong target
; 2. Cause speculative execution to gadget
; 3. Leak data via cache side-channel
Retpoline (Return Trampoline):
; Instead of: jmp *%rax
; Use:

    call .set_target      ; Push address of .capture_spec (stack and RSB)
.capture_spec:
    pause                 ; Be polite to the sibling thread while spinning
    lfence                ; Stop speculation from running ahead
    jmp .capture_spec     ; Loop is only ever executed speculatively
.set_target:
    mov %rax, (%rsp)      ; Replace return address with rax
    ret                   ; Architecturally returns to rax

; Why this is safe:

; CPU's Return Stack Buffer (RSB):
; - Separate predictor for RET instructions
; - Tracks call/return pairs
; - Top entry was just pushed by the call above, so BTB poisoning is irrelevant

; When ret executes:
; - CPU predicts target from RSB
; - RSB says: return to .capture_spec
; - Speculative execution goes to .capture_spec and spins there
; - NOT to attacker-controlled address!
; - Once the real return address (rax) is read from the stack, the
;   misprediction is squashed and execution continues at rax
Visual Comparison:
Traditional Indirect Jump (vulnerable):
┌─────────────┐
│   jmp *rax  │ → Branch predictor → Attacker controls prediction
└─────────────┘                      ↓
                              Speculative execution to gadget

                              Leak via cache timing

Retpoline (safe):
┌─────────────────────┐
│ call .set_target    │ → Push return addr (.capture_spec) on stack and RSB
│ .capture_spec:      │ ← RSB predicts the ret will land here
│   pause             │
│   lfence            │
│   jmp .capture_spec │ ← Speculation contained in loop
│ .set_target:        │
│   mov rax,(rsp)     │ → Replace return addr with rax
│   ret               │ → Architecturally jumps to rax
└─────────────────────┘    (prediction is NOT attacker-controlled!)
                           No leak possible!
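The difference between the two cases can be modeled with a toy predictor. This is a minimal sketch under strong simplifying assumptions: the BTB is a dictionary the attacker can write to, the RSB is a list, and all addresses are made up.

```python
class Predictor:
    """Toy front end: a BTB keyed by branch address plus a return stack (RSB)."""

    def __init__(self):
        self.btb = {}   # branch address -> last trained target (attacker can train it)
        self.rsb = []   # return addresses pushed by call instructions

    def speculate_indirect_jmp(self, branch_addr):
        return self.btb.get(branch_addr)      # whatever was trained wins

    def call(self, return_addr):
        self.rsb.append(return_addr)

    def speculate_ret(self):
        return self.rsb.pop() if self.rsb else None


GADGET, CAPTURE_SPEC, REAL_TARGET = 0xBAD, 0x100, 0x200   # made-up addresses

cpu = Predictor()
cpu.btb[0x40] = GADGET                        # attacker poisons the BTB entry

# Plain indirect jump at address 0x40: speculation follows the poisoned entry
plain_spec = cpu.speculate_indirect_jmp(0x40)

# Retpoline: the call pushes the capture loop's address, the ret is predicted from the RSB
cpu.call(CAPTURE_SPEC)
retpoline_spec = cpu.speculate_ret()
architectural_target = REAL_TARGET            # the overwritten stack slot decides the real jump

print(hex(plain_spec), hex(retpoline_spec), hex(architectural_target))
```

In the model, poisoning the BTB changes where the plain jump speculates but has no effect on the retpoline, whose speculative target is fixed by the call that immediately precedes the ret.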
Kernel Implementation:
// Compiler generates retpolines for indirect branches
// gcc -mindirect-branch=thunk-extern

// Original code:
void (*func_ptr)(void);
func_ptr();  // Indirect call

// Compiled without retpoline:
call *%rax

// Compiled with retpoline:
call __x86_indirect_thunk_rax

// Retpoline thunk (arch/x86/lib/retpoline.S):
SYM_FUNC_START(__x86_indirect_thunk_rax)
    JMP_NOSPEC %rax
SYM_FUNC_END(__x86_indirect_thunk_rax)

#define JMP_NOSPEC(reg)                 \
    call    .Ldo_rop_##reg;             \
.Lspec_trap_##reg:                      \
    pause;                              \
    lfence;                             \
    jmp .Lspec_trap_##reg;              \
.Ldo_rop_##reg:                         \
    mov %reg, (%rsp);                   \
    ret
Performance Impact:
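  • Retpolines are slower than ordinary indirect branches, because every one forces the CPU to discard a deliberately wrong prediction (5-20% overhead, depending on how many indirect branches the workload executes)
  • But necessary for security on vulnerable CPUs
  • Modern CPUs have hardware mitigations (enhanced IBRS, the always-on form of Indirect Branch Restricted Speculation)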
Check Mitigations:
cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
# Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, RSB filling

# Retpoline: Software mitigation (compiler-generated)
# IBRS: Hardware mitigation (CPU feature)
# IBPB: Indirect Branch Predictor Barrier (flush predictor)
# RSB filling: Prevent RSB underflow attacks
Why Effective:
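  1. Return prediction is pinned: the call inside the thunk has just pushed the RSB entry that the ret consumes, so the attacker's BTB training is never consulted
  2. Speculation contained: Loop prevents speculative execution reaching gadgets
  3. Works without hardware support: software-only mitigation (CPUs that fall back to the BTB when the RSB underflows also need RSB filling)
  4. Broad coverage: Protects every indirect branch the compiler emits (hand-written assembly must be converted separately)
Limitations:
  • Performance overhead (modern CPUs use enhanced IBRS instead)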
  • Doesn’t protect against Spectre v1 (bounds check bypass)
  • Doesn’t protect against other speculative execution attacks (L1TF, MDS, etc.)
Fundamental Difference:
SELinux: Label-based MAC
Security Context: user:role:type:level

Files:   httpd_sys_content_t
Process: httpd_t

Rule: allow httpd_t httpd_sys_content_t:file { read open };
      ─────────────── ──────────────────  ──── ────────────
         Subject          Object          Class Permissions

Decision: Based on labels (NOT paths)
AppArmor: Path-based MAC
Profile:
/usr/sbin/nginx {
    /etc/nginx/** r,
    /var/www/** r,
    /var/log/nginx/** rw,
    deny /etc/shadow r,
}

Decision: Based on absolute filesystem paths
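The practical consequence of labels versus paths shows up when a file is moved. The sketch below is a toy model, not how either LSM is implemented: the policies are hypothetical Python structures that mirror the two examples above, and `fnmatch` stands in for AppArmor's glob matching.

```python
from fnmatch import fnmatch

# Hypothetical toy policies mirroring the two examples above
SELINUX_ALLOW = {("httpd_t", "httpd_sys_content_t", "read")}
APPARMOR_READ_GLOBS = ["/etc/nginx/**", "/var/www/**"]


def selinux_allows(proc_type, file_label, perm):
    """Label-based: the decision depends only on the two labels."""
    return (proc_type, file_label, perm) in SELINUX_ALLOW


def apparmor_allows_read(path):
    """Path-based: the decision depends only on where the file lives."""
    # fnmatch's "*" also crosses "/", so "**" behaves like a recursive match here
    return any(fnmatch(path, glob) for glob in APPARMOR_READ_GLOBS)


# A file keeps its label when renamed within a filesystem, but its path changes
file = {"path": "/var/www/index.html", "label": "httpd_sys_content_t"}
before = (selinux_allows("httpd_t", file["label"], "read"),
          apparmor_allows_read(file["path"]))

file["path"] = "/srv/site/index.html"
after = (selinux_allows("httpd_t", file["label"], "read"),
         apparmor_allows_read(file["path"]))

print(before, after)
```

After the move the label-based check still allows the read, while the path-based check denies it until the profile is updated.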

Detailed Comparison:
1. Security Model:
SELinux:
  • Type Enforcement (TE): Subjects (processes) have types, objects (files) have types
  • Multi-Level Security (MLS): Confidentiality levels (Top Secret, Secret, etc.)
  • Multi-Category Security (MCS): Categories for compartmentalization
  • Very fine-grained control
AppArmor:
  • Path-based access control
  • Capabilities control
  • Network access control (protocol/address)
  • Simpler model, easier to understand
2. Complexity:
SELinux:
# Policy is complex
# Example policy snippet:
allow httpd_t httpd_sys_content_t:file { getattr read open };
allow httpd_t http_port_t:tcp_socket { bind listen };
allow httpd_t httpd_log_t:file { write append create };
allow httpd_t proc_t:file read;
allow httpd_t self:capability { setgid setuid };

# Hundreds of rules per service!
# Requires understanding of:
# - Type enforcement
# - Security contexts
# - Policy language
# - Domain transitions
AppArmor:
# Policy is readable
/usr/sbin/nginx {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  capability dac_override,
  capability net_bind_service,
  capability setgid,
  capability setuid,

  /etc/nginx/** r,
  /var/log/nginx/** rw,
  /var/www/** r,

  network inet stream,
}

# Human-readable!
# Easy to audit
3. Administration:

| Task | SELinux | AppArmor |
|---|---|---|
| Create policy | Complex (audit2allow helps) | Simple (aa-genprof) |
| Debug denials | ausearch, sealert | aa-logprof, dmesg |
| Enable/Disable | setenforce | aa-enforce/aa-complain |
| View status | sestatus, getenforce | aa-status |
| Temporary allow | Policy booleans (setsebool) | aa-complain mode |
4. Performance:
SELinux:
  • Label lookups in xattrs (extended attributes)
  • Hash table lookups for policy decisions
  • Overhead: 3-7% typically
AppArmor:
  • Path resolution for every access
  • Simpler policy checks
  • Overhead: 1-3% typically
5. Filesystem Requirements:
SELinux:
  • Requires filesystem with xattr support
  • Labels stored as extended attributes
  • ls -Z shows labels
  • Relabeling filesystem can be slow
AppArmor:
  • No special filesystem requirements
  • Works on any filesystem (even FAT, NFS)
  • No labels to manage
6. Use Cases:
Use SELinux when:
  • Maximum security required (government, military)
  • Need MLS/MCS (confidentiality levels)
  • Want very fine-grained control
  • Already familiar with it (RHEL/Fedora/CentOS)
  • Need label-based security (labels follow files even if moved)
Use AppArmor when:
  • Simplicity preferred over maximum granularity
  • Easier policy management desired
  • Filesystem doesn’t support xattrs (NFS, FAT)
  • Developers/admins less experienced with MAC
  • Debian/Ubuntu/SUSE environment
7. Real-World Scenarios:
Scenario 1: Web Server
SELinux:
# Pre-defined policy exists
# But need to handle custom app

# App stores files in /opt/myapp/
# SELinux denies access (wrong label)

# Solution:
semanage fcontext -a -t httpd_sys_content_t "/opt/myapp(/.*)?"
restorecon -R /opt/myapp

# More denials? Debug with:
ausearch -m avc -ts recent
audit2allow -a -M mypolicy
semodule -i mypolicy.pp
AppArmor:
# Create profile
cat > /etc/apparmor.d/usr.sbin.myapp <<EOF
/usr/sbin/myapp {
  #include <abstractions/base>
  #include <abstractions/apache2-common>

  /opt/myapp/** r,
  /var/log/myapp/** rw,

  network inet stream,
  capability net_bind_service,
}
EOF

# Load and enforce
apparmor_parser -r /etc/apparmor.d/usr.sbin.myapp

# Done! Much simpler.
Scenario 2: Container Security
SELinux:
  • Docker/Podman use SELinux contexts
  • Each container gets unique MCS label
  • Container processes run as container_t; container files are labeled container_file_t (formerly svirt_sandbox_file_t)
  • Strong isolation via labels
AppArmor:
  • Docker uses AppArmor profiles
  • Default profile restricts mount, capabilities, etc.
  • Custom profiles for specific containers
  • Path-based restrictions easier to understand
8. Policy Portability:
SELinux:
  • Labels stored with files (xattrs)
  • Policy is separate from filesystem
  • Moving files between systems: labels can be lost
  • Need to relabel after restore from backup
AppArmor:
  • Policy references absolute paths
  • Moving profile to different system: works if paths same
  • But path changes require profile updates

Recommendation Matrix:

| Priority | Choose |
|---|---|
| Maximum security | SELinux |
| Ease of use | AppArmor |
| Fine-grained control | SELinux |
| Simple policies | AppArmor |
| RHEL/CentOS | SELinux (default) |
| Debian/Ubuntu | AppArmor (default) |
| NFS/non-xattr FS | AppArmor |
| MLS/MCS required | SELinux |
| Container host | Both work (SELinux more common) |
Can you use both?: Not in practice. Both are “major” LSMs, and a mainline kernel runs only one of them at a time. Choose one.
Neither?: Not recommended. MAC adds a significant security layer beyond DAC.
Seccomp-BPF (Secure Computing with Berkeley Packet Filter):
Core Concept: Whitelist syscalls a process can make using BPF bytecode filters.
Architecture:
User Space Process

     │ syscall (e.g., open, read, write)

┌─────────────────────────────┐
│   Syscall Entry Point       │
│   (entry_SYSCALL_64)        │
└─────────┬───────────────────┘

          │ ① Check: Seccomp filter installed?

┌─────────────────────────────┐
│   Seccomp BPF Filter        │
│                             │
│   BPF Program:              │
│   - Load syscall number     │
│   - Load arguments          │
│   - Check against rules     │
│   - Return action:          │
│     • ALLOW                 │
│     • KILL                  │
│     • ERRNO                 │
│     • TRAP                  │
└─────────┬───────────────────┘

          │ ② Action

    ALLOW?  ──────────────────> Execute Syscall
    KILL?   ──────────────────> Kill thread/process (as if by uncatchable SIGSYS)
    ERRNO?  ──────────────────> Skip syscall, return error code
    TRAP?   ──────────────────> Skip syscall, deliver catchable SIGSYS

BPF Filter Structure:
// Seccomp data available to BPF program
struct seccomp_data {
    int nr;                  // Syscall number
    __u32 arch;              // Architecture (x86-64, ARM, etc.)
    __u64 instruction_pointer;
    __u64 args[6];           // Syscall arguments
};

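// BPF filter example
// (needs <stddef.h>, <sys/syscall.h>, <linux/audit.h>, <linux/filter.h>, <linux/seccomp.h>)
struct sock_filter filter[] = {
    // Check the architecture first: syscall numbers differ per ABI
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

    // Load syscall number into accumulator
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),

    // Allow SYS_read (jt=0: fall through to ALLOW, jf=1: skip it)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow SYS_write
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow SYS_exit
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Default: KILL the whole process
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};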
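The jump offsets in `BPF_JUMP` are easy to misread, so here is a small interpreter for the syscall-number part of the filter. It is a sketch of the classic BPF control flow only: the opcode names and tuple layout are invented for illustration, and the syscall numbers are the x86-64 values.

```python
# x86-64 syscall numbers
NR_READ, NR_WRITE, NR_OPEN, NR_EXIT = 0, 1, 2, 60

# (opcode, k, jt, jf): a tiny, made-up subset of classic BPF
FILTER = [
    ("LD_NR", None, 0, 0),
    ("JEQ", NR_READ, 0, 1), ("RET", "ALLOW", 0, 0),
    ("JEQ", NR_WRITE, 0, 1), ("RET", "ALLOW", 0, 0),
    ("JEQ", NR_EXIT, 0, 1), ("RET", "ALLOW", 0, 0),
    ("RET", "KILL", 0, 0),
]


def run_filter(prog, syscall_nr):
    """Evaluate the filter for one syscall number and return its action."""
    acc, pc = None, 0
    while True:
        op, k, jt, jf = prog[pc]
        pc += 1                          # offsets are relative to the NEXT instruction
        if op == "LD_NR":
            acc = syscall_nr             # load seccomp_data.nr into the accumulator
        elif op == "JEQ":
            pc += jt if acc == k else jf
        elif op == "RET":
            return k


for name, nr in [("read", NR_READ), ("write", NR_WRITE),
                 ("exit", NR_EXIT), ("open", NR_OPEN)]:
    print(f"{name}: {run_filter(FILTER, nr)}")
```

A match takes `jt=0` and falls through to the `ALLOW` return; a mismatch takes `jf=1` and skips over it to the next comparison, so anything that matches nothing reaches the final `KILL`.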

Container Security Use Case:
Problem: Containers share kernel with host. Malicious container can exploit kernel vulnerabilities.
Seccomp Solution: Reduce attack surface by blocking dangerous syscalls.
Docker Default Seccomp Profile (simplified):
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "read", "write", "open", "close", "stat",
        "fstat", "mmap", "mprotect", "munmap",
        "brk", "ioctl", "writev", "access",
        "socket", "connect", "accept", "bind",
        "listen", "select", "poll", "epoll_wait"
        /* ... ~300 allowed syscalls ... */
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": [
        "reboot",           // Cannot reboot host!
        "swapon", "swapoff", // Cannot manage swap
        "mount", "umount",  // Cannot mount filesystems
        "pivot_root",       // Cannot change root
        "kexec_load",       // Cannot load kernel
        "add_key",          // Cannot add keyring keys
        "request_key",
        "bpf",              // Cannot load BPF programs
        "perf_event_open",  // Cannot use perf
        "ptrace"            // Cannot trace other processes
      ],
      "action": "SCMP_ACT_ERRNO"  // Return EPERM
    }
  ]
}
Why Critical for Containers:
  1. Kernel Exploit Mitigation:
Without seccomp:
  Container → Exploit in bpf() → Kernel code execution → Host compromise

With seccomp:
  Container → bpf() → EPERM (syscall blocked) → Exploit fails
  2. Privilege Escalation Prevention:
# Without seccomp
docker run -it --security-opt seccomp=unconfined ubuntu
# Inside container:
unshare --user --map-root-user --mount /bin/bash
# Succeeds if the host allows unprivileged user namespaces
# New namespaces expose extra kernel code → potential escape

# With seccomp (default profile)
docker run -it ubuntu
unshare --user --map-root-user --mount /bin/bash
# unshare: unshare failed: Operation not permitted
# Blocked! (unshare is denied without CAP_SYS_ADMIN)
  3. Attack Surface Reduction:
Linux kernel: ~450 syscalls
Docker default seccomp: ~300 allowed

Blocked (~150 syscalls):
- Kernel module loading (init_module, finit_module)
- Namespace manipulation (setns, unshare)
- Performance monitoring (perf_event_open)
- System administration (reboot, sethostname)
- Capability manipulation (capset)
- Key management (add_key, keyctl)
- BPF programs (bpf)

Result: about a third of the syscall table is unreachable (exact counts vary by kernel and Docker version)

Implementing Custom Seccomp:
Example: Strict Sandbox:
#include <seccomp.h>   // libseccomp; link with -lseccomp
#include <stdlib.h>

void install_strict_seccomp(void) {
    // Default: kill the process on any syscall not listed below (strictest!)
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (ctx == NULL)
        abort();

    // Allow ONLY these syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Conditional: allow write ONLY to stdout/stderr (fd 1 or 2).
    // Filters can compare scalar arguments but cannot read strings behind
    // pointers, so a rule like "openat only under /tmp/" cannot be expressed.
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                     SCMP_A0(SCMP_CMP_EQ, 1));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                     SCMP_A0(SCMP_CMP_EQ, 2));

    // Load filter
    if (seccomp_load(ctx) != 0)
        abort();
    seccomp_release(ctx);

    // After this point:
    // - read/exit/exit_group: OK
    // - write(1, ...) and write(2, ...): OK
    // - write(3, ...): KILLED!
    // - openat(): KILLED!
    // - socket(): KILLED!
    // - fork(): KILLED!
}
Docker Custom Profile:
# custom-seccomp.json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "exit", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

# Run container with custom profile
docker run --security-opt seccomp=custom-seccomp.json myimage
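JSON has no comment syntax, so annotation lines such as the `# custom-seccomp.json` header above must not end up in the file itself. One way to avoid hand-editing mistakes is to generate the profile. The sketch below uses only the profile keys shown in this section; `make_profile` is a hypothetical helper, not part of Docker.

```python
import json


def make_profile(allowed, default_action="SCMP_ACT_ERRNO"):
    """Build an allowlist-style seccomp profile as a plain dict."""
    return {
        "defaultAction": default_action,
        "syscalls": [
            {"names": sorted(set(allowed)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


profile = make_profile(["read", "write", "exit", "exit_group"])
text = json.dumps(profile, indent=2)
print(text)  # save this text as custom-seccomp.json
```

A profile this small is for illustration only; a real container needs many more syscalls (for example `execve` and `mmap`) before it can even start.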

Debugging Seccomp Violations:
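# See which seccomp actions the kernel is allowed to log.
# The file holds a list of action names (e.g. "kill_process errno log"), not 0/1.
cat /proc/sys/kernel/seccomp/actions_logged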

# Run container
docker run --rm -it --security-opt seccomp=strict.json ubuntu bash

# Inside container, try forbidden syscall:
mount -t tmpfs tmpfs /mnt
# bash: mount: Operation not permitted

# Check dmesg
dmesg | tail
# [12345.678] audit: type=1326 audit(1234567890.123:456): auid=1000 uid=0 gid=0
#              ses=3 pid=12345 comm="mount" exe="/bin/mount" sig=0 arch=c000003e
#              syscall=165 compat=0 ip=0x7f... code=0x7ffc0000
#              ^^^^^^^^^^
#              syscall 165 = mount (BLOCKED!)

# Syscall 165 (mount) was blocked by seccomp
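Reading these records by eye gets tedious, so a small parser helps. This is a minimal sketch that assumes the `key=value` layout of the sample record above; the syscall table is deliberately partial and only valid for x86-64, and `parse_seccomp_record` is a hypothetical helper name.

```python
import re

# Partial x86-64 syscall table; numbers are architecture-specific
X86_64_SYSCALLS = {0: "read", 1: "write", 2: "open", 60: "exit", 165: "mount"}


def parse_seccomp_record(line):
    """Extract the interesting fields from a type=1326 (seccomp) audit record."""
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    nr = int(fields["syscall"])
    return {
        "comm": fields["comm"].strip('"'),
        "syscall": X86_64_SYSCALLS.get(nr, f"unknown({nr})"),
        "x86_64": fields.get("arch") == "c000003e",   # AUDIT_ARCH_X86_64
    }


record = ('audit: type=1326 audit(1234567890.123:456): auid=1000 uid=0 gid=0 '
          'ses=3 pid=12345 comm="mount" exe="/bin/mount" sig=0 arch=c000003e '
          'syscall=165 compat=0 ip=0x7f... code=0x7ffc0000')
result = parse_seccomp_record(record)
print(result)
```

Checking the `arch` field before translating the number matters because the same syscall number means different calls on different architectures.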

Why BPF:
  1. Efficiency: JIT-compiled to native code (fast!)
  2. Safety: BPF verifier ensures filter cannot crash kernel
  3. Flexibility: Can inspect syscall arguments, not just number
  4. Performance: Evaluated in kernel space (no context switch)
Without BPF (old seccomp mode 1):
  • Could only allow read/write/_exit/sigreturn
  • No flexibility
With BPF (seccomp mode 2):
  • Can allow specific syscalls
  • Can inspect scalar arguments (e.g., allow write but only to file descriptor 1)
  • Can return different actions (ERRNO, TRAP, LOG, ALLOW)

Limitations:
  1. Cannot inspect pointers: BPF cannot dereference user-space pointers (no access to path strings, only FDs)
  2. Time-of-check-time-of-use (TOCTOU): Arguments checked before syscall, but can change
  3. Bypass via allowed syscalls: If write() allowed, attacker might abuse it
  4. Complexity: Writing correct BPF filters is hard

Summary:
Seccomp-BPF is critical for containers because:
  • ✅ Reduces kernel attack surface (blocks ~1/3 of syscalls)
  • ✅ Prevents privilege escalation (blocks namespace manipulation)
  • ✅ Mitigates kernel exploits (blocks vulnerable syscalls)
  • ✅ Fast (BPF JIT compilation)
  • ✅ Flexible (programmable filters)
  • ✅ Safe to load (filters are validated before they run, so a bad filter cannot crash the kernel)
Without it, containers have full access to ~450 syscalls → much larger attack surface.

12. Threat Modeling for OS-backed Services

When designing secure services, think systematically about OS-level attack surfaces.

The STRIDE Model Applied to OS

| Threat | OS Attack Vector | Mitigation |
|---|---|---|
| Spoofing | Process impersonation, UID manipulation | User namespaces, strong authentication |
| Tampering | Memory corruption, file modification | ASLR, KASLR, read-only mounts |
| Repudiation | Log deletion, timestamp manipulation | Append-only logs, audit subsystem |
| Info Disclosure | /proc leaks, side channels | hidepid=2, Spectre mitigations |
| Denial of Service | Fork bombs, memory exhaustion | Cgroups limits, ulimits, quotas |
| Elevation of Privilege | Kernel exploits, setuid abuse | Seccomp, drop capabilities |

Defense-in-Depth Checklist

# 1. Principle of Least Privilege
capsh --print                           # See current capabilities
setcap cap_net_bind_service=+ep ./app   # Grant only specific caps

# 2. Namespaces: Reduce Visibility
unshare --user --map-root-user --pid --mount-proc bash

# 3. Read-only Filesystems
mount -o remount,ro /                   # Make root read-only

# 4. Resource Limits (DoS protection)
echo "100M" > /sys/fs/cgroup/myapp/memory.max
echo "50" > /sys/fs/cgroup/myapp/pids.max

Summary

Key Takeaways:
  1. Memory Protection: NX/DEP, ASLR, and stack canaries are foundational defenses against memory corruption attacks.
  2. Control Flow Integrity: Forward-edge CFI and shadow stacks (backward-edge CFI) prevent control-flow hijacking.
  3. Privilege Separation: Capabilities provide fine-grained privileges instead of all-or-nothing root access.
  4. Mandatory Access Control: SELinux (label-based) and AppArmor (path-based) enforce policies beyond DAC.
  5. Microarchitectural Attacks: Spectre and Meltdown exploit speculative execution. KPTI and retpolines mitigate but with performance cost.
  6. Sandboxing: Namespaces, seccomp, and combinations thereof create strong isolation for untrusted code.
Defense in Depth: No single mechanism is perfect. Modern systems combine multiple layers:
  • ASLR + NX + Stack Canaries + CFI (memory safety)
  • Capabilities + Seccomp + Namespaces (privilege reduction)
  • SELinux/AppArmor (mandatory access control)
  • KPTI + Retpolines + CPU features (hardware attack mitigation)
Performance vs Security: Many mitigations have performance costs. Understand trade-offs and apply based on threat model.

Interview Deep-Dive

Strong Answer:
These three mechanisms form a layered defense against the classic buffer overflow attack chain. To understand why you need all three, walk through what an attacker must accomplish to exploit a buffer overflow:
  • Step 1: Overwrite the return address — The attacker provides input that overflows a stack buffer and overwrites the saved return address (RIP) on the stack, redirecting execution to attacker-controlled code.
    • Stack canaries intervene here. A random value (the “canary”) is placed between local variables and the saved return address at function entry. Before the function returns, compiler-inserted code checks whether the canary was modified. If it was (because the overflow overwrote it on the way to the return address), the program aborts immediately. The attacker must either guess the canary (2^56 possibilities on 64-bit Linux, where glibc zeroes one byte of the canary as a string terminator) or find a way to overwrite the return address without touching the canary (possible with format string bugs or non-contiguous overwrites, but much harder).
  • Step 2: Redirect execution to shellcode — If the attacker bypasses the canary, they redirect execution to injected code (shellcode) in the buffer itself.
    • NX (No-Execute) / DEP intervenes here. The stack (and heap, and data sections) are marked non-executable at the page table level. The CPU enforces this in hardware: executing an instruction from an NX page triggers a page fault. The attacker’s shellcode on the stack cannot execute. This forces the attacker to use Return-Oriented Programming (ROP) — chaining existing code snippets (“gadgets”) from the binary and libraries.
  • Step 3: Locate usable code gadgets — The attacker needs to find executable code at known addresses to build ROP chains.
    • ASLR intervenes here. The kernel randomizes the base addresses of the stack, heap, shared libraries, and (with KASLR) the kernel itself at each process start. The attacker cannot hard-code addresses of gadgets because they change every run. On 64-bit systems, the entropy is typically 28-30 bits for library randomization, making brute force impractical.
Together, the layers mean the attacker must bypass the canary (hard without an information leak), cannot inject code (NX), and cannot easily locate existing code to reuse (ASLR). Breaking one layer is insufficient: a working exploit has to get past every layer in its path.
Where it still fails:
  • Information leaks: A separate vulnerability that leaks memory addresses (e.g., a format string bug that prints stack values) can defeat both ASLR (reveals addresses) and canaries (reveals the canary value). This is why modern defenses add CFI (Control Flow Integrity) as a fourth layer — even if the attacker knows addresses, they cannot redirect execution to arbitrary gadgets because the CPU verifies that indirect branches target valid function entries.
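To put the entropy figures in perspective, here is a rough brute-force estimate. It is a sketch under stated assumptions: each guess is independent and uniform, the attacker manages 1,000 crash-and-retry attempts per second (an illustrative rate, not a measurement), and the 56-bit canary figure assumes one canary byte is fixed at zero.

```python
def expected_guesses(entropy_bits):
    """Average attempts needed to hit one value drawn uniformly from 2**bits."""
    return 2 ** entropy_bits / 2


def days_to_bruteforce(entropy_bits, attempts_per_sec):
    """Expected wall-clock days at a fixed guessing rate."""
    return expected_guesses(entropy_bits) / attempts_per_sec / 86_400


# 28 and 30 bits: library ASLR; 56 bits: stack canary with one zeroed byte
for bits in (28, 30, 56):
    print(f"{bits} bits: {expected_guesses(bits):.3g} guesses, "
          f"{days_to_bruteforce(bits, 1_000):.3g} days")
```

Every wrong guess crashes the target process, so even where the raw numbers look feasible the attack is loud; this is why real exploits rely on an information leak rather than guessing.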
Follow-up: What is KASLR and why was KPTI needed despite it?
KASLR randomizes the kernel’s base address in virtual memory at each boot. The idea is that even if an attacker has a kernel vulnerability, they cannot exploit it without knowing where kernel functions are located. KPTI (Kernel Page Table Isolation) was needed because the Meltdown vulnerability allowed user-space code to read kernel memory through the CPU’s speculative execution, bypassing KASLR entirely: the attacker could read kernel memory at ~500KB/s and then use the recovered addresses for their exploit. KPTI removes almost all kernel mappings from the user-space page tables (only a minimal entry trampoline remains), so there is nothing useful for Meltdown to speculatively read. The cost is that every syscall now requires a CR3 switch between user and kernel page tables (5-30% overhead on older CPUs).
Strong Answer:
Seccomp-BPF and SELinux operate at completely different layers and are complementary, not interchangeable.
  • Seccomp-BPF (System Call Filter): Intercepts every syscall at the entry point and runs a BPF filter that decides allow/deny/kill based on the syscall number and (with some limitations) its arguments. It answers: “Can this process invoke this kernel API?” Seccomp cannot distinguish between files, network addresses, or process targets — if you allow open(), the process can open any file. If you allow connect(), it can connect to any address.
  • SELinux (Mandatory Access Control): Assigns security labels to every process, file, socket, and kernel object. A policy defines which labels can perform which operations on which other labels. It answers: “Can this specific subject access this specific object in this specific way?” SELinux can say “process with label httpd_t can read files with label httpd_sys_content_t but cannot write to files with label etc_t.” This is far more granular than seccomp.
For hardening an untrusted container, I would use both:
  • Seccomp-BPF: Block all syscalls the container does not need. A web server does not need mount, reboot, kexec_load, ptrace, init_module, or io_uring_setup. Docker ships a default seccomp profile that blocks dozens of dangerous syscalls (the exact count varies by Docker version). For untrusted workloads, I would create a custom profile that allowlists only the ~50 syscalls the application actually uses (determined by running strace during testing). This shrinks the kernel attack surface enormously, because most kernel CVEs are in obscure syscall handlers that a web server never touches.
  • SELinux (or AppArmor): Apply a policy that restricts what the container can access even with the allowed syscalls. The container process can call open(), but SELinux ensures it can only open files in its designated directory. It can call connect(), but SELinux restricts it to specific ports and network labels. This prevents a compromised container from reading /etc/shadow, connecting to the metadata service (a common cloud attack vector), or accessing the Docker socket.
The layers complement each other: seccomp removes dangerous kernel entry points, SELinux restricts what the remaining entry points can access. Neither alone is sufficient. Seccomp without SELinux means a process with open() allowed can read any file. SELinux without seccomp means a process can invoke dangerous syscalls (even if they fail due to policy, the syscall handler code still runs, potentially triggering kernel bugs).
Follow-up: What is the performance overhead of running both seccomp-BPF and SELinux simultaneously?
Seccomp-BPF adds 10-50 nanoseconds per syscall (running a small BPF program in the syscall entry path). SELinux adds 100-500 nanoseconds per security check (which happens on syscalls that access objects, such as file open and socket connect). Taking the worst case where every syscall pays both costs (110-550 ns combined), a web server making 10K syscalls per second spends roughly 1.1-5.5 milliseconds per second on them, which is negligible. A storage-intensive application making 500K syscalls per second spends 55-275 milliseconds per second (5.5-27.5% of one core). The practical impact depends entirely on the syscall rate. For most workloads, the overhead is under 1% and invisible in application-level metrics. The security benefit far outweighs the cost.
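A quick sketch of that arithmetic, assuming the worst case in which every syscall pays both the 10-50 ns seccomp cost and the 100-500 ns SELinux cost. The function name is hypothetical and the per-call figures are the estimates quoted in the answer, not measurements.

```python
def mac_overhead_ms_per_sec(syscalls_per_sec, seccomp_ns, selinux_ns):
    """Worst case: every syscall runs the seccomp filter and one SELinux check."""
    return syscalls_per_sec * (seccomp_ns + selinux_ns) / 1_000_000  # ns -> ms


for rate in (10_000, 500_000):
    low = mac_overhead_ms_per_sec(rate, 10, 100)
    high = mac_overhead_ms_per_sec(rate, 50, 500)
    # 1 ms per second equals 0.1% of one core
    print(f"{rate:>7,} syscalls/s: {low:.1f} to {high:.1f} ms per second "
          f"({low / 10:.2f}% to {high / 10:.2f}% of one core)")
```

Because many syscalls never trigger an SELinux check, real overhead sits below these upper bounds; the useful takeaway is that cost scales linearly with syscall rate.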
Strong Answer:
Both Spectre and Meltdown exploit speculative execution, the CPU’s optimization of executing instructions ahead of time before knowing whether they are needed. The critical difference is the trust boundary they violate.
  • Meltdown (CVE-2017-5754): Exploits the fact that on vulnerable Intel CPUs, speculative loads from kernel memory are not immediately stopped by the permission check. The CPU speculatively reads kernel data into a register, uses it to access a cache line (encoding the secret in a cache side channel), and then throws away the speculative result when the permission check fails. But the cache side channel remains — the attacker can probe which cache line was accessed and recover the kernel data. Meltdown crosses the user/kernel boundary and allows reading arbitrary kernel memory.
  • Spectre (CVE-2017-5753 Variant 1, CVE-2017-5715 Variant 2): Exploits the CPU’s branch prediction. In Variant 1 (bounds check bypass), the attacker trains the branch predictor to predict that a bounds check will pass, then triggers speculative execution past the check with an out-of-bounds index. The speculative load accesses secret data and encodes it in the cache. In Variant 2 (branch target injection), the attacker poisons the Branch Target Buffer (BTB) to redirect speculative execution of an indirect branch to attacker-chosen code (“gadgets”) within the victim’s address space.
Why Spectre is harder to mitigate:
  • Meltdown has a clean fix: KPTI (Kernel Page Table Isolation) unmaps the kernel from user-space page tables. If the kernel memory is not even mapped during user-space execution, the speculative load has nothing to read. The fix is at the OS level and is complete (with a 5-30% performance cost).
  • Spectre crosses any trust boundary: Spectre does not require reading kernel memory. It can leak data between processes, between VMs, between JavaScript contexts in a browser, between a sandbox and its host. Any code running on the same CPU can potentially be a Spectre victim or attacker.
  • Software mitigations are partial: Retpolines (replacing indirect branches with a return trampoline that defeats BTB poisoning) mitigate Variant 2 but add overhead to every indirect call. Array bounds masking (inserting an AND instruction after bounds checks to zero out speculative out-of-bounds accesses) mitigates Variant 1 but requires compiler changes and careful code auditing. Neither is a complete fix.
  • New variants keep appearing: Spectre is a class of vulnerabilities, not a single bug. Spectre-RSB and Spectre-BHB are variations on the same theme, while Variant 3a and MDS (Microarchitectural Data Sampling) are related transient-execution bugs that leak through system registers and internal CPU buffers. Each requires its own mitigation.
The fundamental problem is that speculative execution is not a bug. It is a deliberate performance feature that every modern out-of-order CPU depends on, so disabling it entirely is not a realistic option. Newer CPU generations add hardware mitigations (such as enhanced IBRS), but older hardware remains vulnerable.
Follow-up: How do cloud providers like AWS protect against cross-VM Spectre attacks on shared hardware?
Multiple layers: hardware partitioning (Intel CAT/MBA to partition the L3 cache between VMs, reducing cache side-channel leakage), microcode updates (clearing branch predictor state on VM entry/exit), hypervisor patches (KVM flushes speculation buffers on VMEXIT), and core scheduling (ensuring untrusted VMs do not share SMT siblings, since Hyper-Threading shares the branch predictor and L1 cache between logical cores). AWS’s Nitro system goes further by offloading virtualization to dedicated hardware, reducing the hypervisor attack surface. Despite all this, the most sensitive workloads (HSMs, cryptographic key storage) run on dedicated single-tenant hosts with no sharing.
Strong Answer Framework:
  1. Establish what the attacker should not be able to do. Before reaching for tools, define the boundary. “Cannot read /etc/shadow” is different from “cannot exfiltrate any data” is different from “cannot persist a backdoor.” Threat modeling forces you to choose mechanisms that match the goal.
  2. Apply user separation as the floor. Run the binary as a dedicated unprivileged user with no shell, no sudo entries, no group memberships beyond its own. This is the cheapest layer and rules out most trivial attacks. Anyone who skips this layer because “I have stronger mechanisms above” loses if the stronger mechanisms have a bug.
  3. Drop capabilities to the minimum. Use prctl(PR_CAPBSET_DROP) to drop the bounding set, set SECBIT_NOROOT so that UID 0 no longer regains capabilities automatically on exec, set PR_SET_NO_NEW_PRIVS so that setuid binaries and file capabilities cannot re-elevate, and add only the capabilities the workload needs. For most workloads, the answer is zero capabilities.
  4. Apply seccomp to shrink the kernel surface. Generate a custom syscall allowlist from observed behavior. The kernel has 350+ syscalls; a typical workload uses 50-80. Blocking the rest closes off entire classes of kernel CVEs preemptively.
  5. Use namespaces to make the world smaller. Mount namespace with a chroot or pivot_root into a private rootfs. Network namespace with no interfaces (or just a loopback). PID namespace so the process cannot see or signal anything outside. User namespace with the workload mapped to a non-overlapping host UID, so even root-in-namespace is unprivileged on the host.
  6. Layer mandatory access control on top. SELinux or AppArmor profile that restricts what the workload can read or write even if it somehow got privilege. This is the layer that catches the bug in your seccomp profile.
  7. Cgroup limits for blast radius. Memory limit, PID limit, CPU quota, IO weight. These do not stop intrusions, but they bound the damage of fork bombs, memory hogs, and crypto-miners-as-payload.
  8. Audit what you cannot prevent. Even with all of the above, log every syscall through audit subsystem or eBPF tracing. The goal is detection within hours of a successful attack, not just prevention.
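As a minimal sketch of the capability step above, assuming a Linux host with glibc: the helper names `lock_no_new_privs` and `drop_bounding_set` are hypothetical, chosen only for this illustration. The first call works without privilege; dropping from the bounding set needs CAP_SETPCAP, so an unprivileged caller should expect nothing to be dropped.

```c
#include <assert.h>
#include <sys/prctl.h>
#include <linux/capability.h>

/* Forbid privilege gain across execve(): after this, setuid bits and
 * file capabilities no longer grant anything. Needs no privilege.
 * Returns 0 on success, -1 on failure (errno is set by prctl). */
int lock_no_new_privs(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) != 0)
        return -1;
    /* The flag is one-way, so reading it back should yield 1. */
    assert(prctl(PR_GET_NO_NEW_PRIVS, 0UL, 0UL, 0UL, 0UL) == 1);
    return 0;
}

/* Try to remove every capability from the bounding set.
 * Returns how many were actually dropped. Dropping requires
 * CAP_SETPCAP, so an unprivileged caller typically gets 0. */
int drop_bounding_set(void)
{
    int dropped = 0;

    for (unsigned long cap = 0; cap <= CAP_LAST_CAP; cap++) {
        /* Skip capabilities already absent or unknown to this kernel. */
        if (prctl(PR_CAPBSET_READ, cap, 0UL, 0UL, 0UL) != 1)
            continue;
        if (prctl(PR_CAPBSET_DROP, cap, 0UL, 0UL, 0UL) == 0)
            dropped++;
    }
    return dropped;
}
```

A launcher would typically call both helpers after setting up namespaces and before installing the seccomp filter and calling execve(); the ordering here is a suggestion, not a requirement stated by the kernel.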
Real-World Example: Google Chrome’s renderer sandbox is the public reference design. Each renderer process drops all capabilities, applies a strict seccomp filter (about 65 syscalls allowed), runs in a user namespace with the renderer UID mapped to nobody, has no filesystem access (uses Mojo IPC to the privileged broker for file IO), and is restricted by SELinux on Android. The 2014 Pwn2Own attack on Chrome required chaining a renderer RCE with a seccomp escape and a kernel privilege escalation — three independent vulnerabilities. The 2024 attack on Chrome’s V8 still required two more bugs to escape the renderer sandbox to host code execution.
Senior follow-up 1: Why is mapping root inside a user namespace to a non-zero UID outside considered the strongest single primitive?
Because the kernel evaluates capabilities relative to the user namespace that owns the resource being touched. If your namespace’s UID 0 maps to host UID 100000, a successful exploit that gives the attacker capabilities only does so within the namespace. Operations that affect the host kernel (loading modules, mounting filesystems on host paths, ptrace of host processes) require the capability in the initial user namespace, where the process is an ordinary unprivileged user. This is why rootless containers are a meaningful security improvement, not just a usability one.
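To make the UID mapping concrete, here is a small sketch that translates a UID inside a user namespace to the UID the parent namespace sees, given one line in the `/proc/<pid>/uid_map` format (`<inside-start> <outside-start> <length>`). The function name `map_uid_line` is hypothetical, and the sketch handles a single mapping line only.

```c
#include <assert.h>
#include <stdio.h>

/* Translate inside_uid using one uid_map line.
 * Returns 0 and stores the mapped UID in *outside, or -1 if the line
 * is malformed or its range does not cover inside_uid. */
int map_uid_line(const char *line, unsigned long inside_uid,
                 unsigned long *outside)
{
    unsigned long in_start, out_start, len;

    assert(line != NULL && outside != NULL);
    if (sscanf(line, "%lu %lu %lu", &in_start, &out_start, &len) != 3)
        return -1;
    if (inside_uid < in_start || inside_uid - in_start >= len)
        return -1;
    *outside = out_start + (inside_uid - in_start);
    return 0;
}
```

With the line `0 100000 65536`, root in the namespace (UID 0) maps to host UID 100000, which is the situation the answer above describes.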
Senior follow-up 2: A seccomp profile is too restrictive in unpredictable ways. What is your debug strategy?
Set the default action to SECCOMP_RET_LOG instead of SECCOMP_RET_KILL, run the workload through realistic scenarios (not just happy path; include error handling, signal delivery, malloc growth), and watch /var/log/audit/audit.log (or the kernel log if auditd is not running) for SECCOMP records. Each entry shows the number of a syscall that the enforcing profile would have blocked; map those to names with ausyscall. After a clean observation window, flip default action to SECCOMP_RET_ERRNO(EPERM) for one more cycle (so the application can fail gracefully), then to SECCOMP_RET_KILL_PROCESS for hard enforcement. Tools such as oci-seccomp-bpf-hook and the Kubernetes Security Profiles Operator can record a profile from observed behavior and automate much of this loop.
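The staged rollout described above amounts to changing one value: the filter's default action. The following sketch builds (but does not install) a tiny seccomp BPF program whose default action is a parameter. The function name `build_filter`, the three-syscall allowlist, and the x86-64 architecture check are assumptions made for illustration; a real profile would allow far more syscalls and would be installed with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) or seccomp().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

#define FILTER_LEN 9

/* Fill prog (at least FILTER_LEN entries) with an allowlist filter.
 * default_action is what happens to every syscall not on the list:
 * SECCOMP_RET_LOG while observing, SECCOMP_RET_ERRNO | EPERM while
 * soft-enforcing, SECCOMP_RET_KILL_PROCESS for hard enforcement. */
size_t build_filter(struct sock_filter *prog, uint32_t default_action)
{
    const struct sock_filter insns[FILTER_LEN] = {
        /* [0] Load the architecture; anything but x86-64 is killed. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* [3] Load the syscall number and compare against the list. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read, 3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 1, 0),
        /* [7] Not on the list: apply the configurable default. */
        BPF_STMT(BPF_RET | BPF_K, default_action),
        /* [8] On the list: allow. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    assert(prog != NULL);
    memcpy(prog, insns, sizeof insns);
    return FILTER_LEN;
}
```

Keeping the allowlist fixed and varying only `default_action` means the observed, soft-enforced, and hard-enforced profiles are guaranteed to cover the same syscalls.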
Senior follow-up 3: Where does gVisor fit relative to seccomp + namespaces, and when is the extra cost justified?
gVisor reimplements the Linux syscall surface in a userspace process (Sentry) that sits between the application and the host kernel. Calls that look like read() to the application are actually intercepted, validated, and either handled in Sentry or proxied to the host. This removes most of an entire class of risk: kernel CVEs in syscall handlers largely do not affect gVisor-sandboxed workloads because those handlers never execute on the host kernel for sandboxed traffic, and the Sentry itself uses only a small, filtered set of host syscalls. The cost is real: gVisor adds 10-50 percent overhead on syscall-heavy workloads and is incompatible with some applications (Linux-namespace-specific tools, applications that mmap and then expect specific kernel behaviors). The cost is justified for workloads where you genuinely cannot trust the binary: shared CI runners, untrusted user code in PaaS, multi-tenant function execution. It is overkill for first-party microservices.
Common Wrong Answers:
  • “Just put it in Docker.” Docker by default runs as root inside the container, keeps a default capability set that still includes CAP_DAC_OVERRIDE, CAP_SETUID, and CAP_NET_RAW, and applies a default seccomp profile that allows the large majority of syscalls. Docker is a packaging tool first; security depends on configuration that must be applied explicitly.
  • “Use a VM.” VMs have a smaller attack surface than namespaces against most threat models, but the hypervisor still has CVE history (CVE-2017-2596 KVM, CVE-2020-29569 Xen). Saying “use a VM” without acknowledging hypervisor risk hand-waves the problem.
  • “Drop capabilities and you are done.” Capabilities are necessary but not sufficient. A process with zero capabilities can still read every file world-readable, connect to localhost services, and exploit kernel bugs in syscalls that do not require capabilities.
Further Reading:
  • “Sandboxing and Workload Isolation” (Google production hardening guide) — the gVisor design rationale and threat model
  • Jess Frazelle, “Hard multi-tenancy in Kubernetes” — pragmatic stack for untrusted workloads
  • Linux source: kernel/seccomp.c, kernel/user_namespace.c, security/security.c for the LSM hook integration
Strong Answer Framework:
  1. Capabilities answer: what privileged operations can this process invoke? Drop all capabilities and the process cannot bind low ports, change UIDs, mount filesystems, load kernel modules, ptrace others, or do anything else that historically required root. Capabilities do not restrict file access (DAC handles that) or syscall surface (seccomp handles that).
  2. Seccomp answers: what syscalls can this process make at all? Even without capabilities, a process can call hundreds of syscalls. Many have CVE history. Seccomp shrinks the kernel attack surface by blocking syscalls the workload does not need. A filter can compare raw argument values (integers, flags) but cannot dereference pointers, so it never sees a path string and cannot say “open files only in /tmp.” For path-taking syscalls it effectively just says “you can or cannot call open at all.”
  3. Namespaces answer: what does this process see? Mount namespace = its own filesystem view. Network namespace = its own network stack. PID namespace = its own process tree. User namespace = its own UID/GID mapping. Namespaces isolate visibility and resource scope, not privilege. Two processes can be in the same namespace and one can attack the other; namespaces only protect across the boundary.
  4. Docker is the orchestration that wires these together. A Docker container is, mechanically, a process tree with namespaces, a default seccomp profile, dropped capabilities (most are off by default), an AppArmor or SELinux profile, and cgroup limits. Docker is not a new isolation mechanism — it is a configuration that combines existing kernel mechanisms.
  5. Where they fail to compose: seccomp filters by syscall number, but a syscall you allow can transitively reach functionality you blocked (the mprotect-via-printf issue). Namespaces leak through /proc, /sys, kernel keyrings, and shared kernel data structures. User namespaces have escalation paths through misconfigured uid_map. Capabilities have surprising scopes — CAP_SYS_ADMIN is “nearly root” because dozens of operations gate on it. Combining all four is necessary; each individually has gaps the others fill.
Real-World Example: The 2024 Leaky Vessels CVEs (CVE-2024-21626 in runc, CVE-2024-23651 in BuildKit) escaped containers despite seccomp, capability dropping, and AppArmor all being in place. The runc escape worked through a file descriptor leak across the namespace boundary: runc left a host directory file descriptor open in the container process, reachable via /proc/self/fd, and a malicious container could traverse that FD to reach the host filesystem. None of the standard hardening primitives caught this because the attack did not violate any one mechanism’s contract; it exploited the gap between them. The fix was at the runtime level: runc now closes leaked FDs before exec, a behavior that should have been there all along.
Senior follow-up 1: Why does a default Docker container still have CAP_NET_RAW and CAP_NET_BIND_SERVICE despite the security guidance to drop everything?
Because Docker’s defaults are tuned for compatibility with common workloads: ping, DHCP clients, web servers binding to ports below 1024 in legacy configurations. Most real workloads do not need either capability and should drop them explicitly with --cap-drop=ALL --cap-add=.... The Docker maintainers chose compatibility-first defaults so docker run would just work for as many users as possible, accepting a less-defensive baseline as the cost. For production, your image build or orchestration layer should override this default.
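One way to check what a container actually holds is to decode the `CapEff` line from `/proc/<pid>/status`, which is a hexadecimal bitmask indexed by capability number. The sketch below assumes the hex string has already been extracted; `parse_cap_mask` and `has_cap` are hypothetical helper names. The sample value `00000000a80425fb` is the mask commonly seen in a default Docker container, but treat it as illustrative since defaults can change between versions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <linux/capability.h>

/* Parse the hex string that follows "CapEff:" in /proc/<pid>/status. */
uint64_t parse_cap_mask(const char *hex)
{
    assert(hex != NULL);
    return strtoull(hex, NULL, 16);
}

/* Return 1 if capability number cap (e.g. CAP_NET_RAW) is set. */
int has_cap(uint64_t mask, int cap)
{
    assert(cap >= 0 && cap < 64);
    return (int)((mask >> cap) & 1u);
}
```

For the sample mask, `has_cap(mask, CAP_NET_RAW)` and `has_cap(mask, CAP_NET_BIND_SERVICE)` are both 1 while `has_cap(mask, CAP_SYS_ADMIN)` is 0; after `--cap-drop=ALL` the whole mask would be zero.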
Senior follow-up 2: What is the difference between SECCOMP_FILTER_FLAG_TSYNC and per-thread seccomp filters, and when does it matter?
Seccomp filters are per-thread. SECCOMP_FILTER_FLAG_TSYNC, passed to the seccomp() syscall, applies the new filter to every thread in the process at install time, ensuring no thread escapes it. Without it, a multithreaded process installs the filter only on the calling thread; threads that already exist keep running unfiltered until they install a filter themselves (threads created afterwards inherit it from their creator). For a single-threaded program this is fine; for anything already threaded (which is most modern code), use TSYNC or install the filter before any thread is spawned, otherwise a thread started before or during installation never gets the filter.
Senior follow-up 3: When would you choose AppArmor over SELinux, or vice versa, and is there ever a case to run both?
AppArmor uses path-based rules: “process X cannot write to /etc/*”. It is easier to write profiles for and easier to reason about, especially in containerized environments where filesystem layout is predictable. SELinux uses type labels assigned to files via xattrs: “process labeled httpd_t cannot write to files labeled etc_t”. This is more powerful (the label travels with the file regardless of path) but harder to debug. Use AppArmor for application-specific containment in container environments (Ubuntu, Debian, SUSE all default to AppArmor). Use SELinux for whole-system mandatory access control where the broader policy benefits outweigh complexity (RHEL, Fedora, Android). Running both simultaneously is practically unwise and often not possible: LSM stacking of the major modules is still incomplete on mainline kernels, debugging conflicts is painful, and the marginal security from running both is small compared to running either one well.
Common Wrong Answers:
  • “Containers are basically VMs.” They are emphatically not. A VM has a hardware-virtualized hypervisor between guest and host kernel; a container shares the host kernel directly. Container escapes target host kernel bugs; VM escapes target hypervisor bugs (rarer, smaller surface).
  • “Seccomp blocks system calls and that is enough.” Seccomp does not see filesystem paths or network addresses. A process with open allowed can read every file your DAC allows; a process with connect allowed can reach every IP your network namespace permits.
  • “If I drop all capabilities I am safe.” Many CVEs do not need capabilities. Reading sensitive files via standard DAC, exploiting kernel bugs in syscalls that do not require privilege, and lateral movement through the container’s mount namespace are all capability-free.
Further Reading:
  • “Container Security” by Liz Rice — the cleanest book-length tour of the kernel primitives and how Docker/Kubernetes wire them
  • LWN article: “Capabilities for system calls” (Mickael Salaun) — why caps and seccomp are complementary
  • runc CVE-2024-21626 writeup — a real-world example of compositional failure
  • Linux source: Documentation/userspace-api/seccomp_filter.rst, Documentation/security/credentials.rst
Strong Answer Framework:
  1. The CPU’s perspective: speculation as a performance feature. A modern CPU does not wait for a branch’s condition to resolve before fetching, decoding, and executing instructions on the predicted path. Branch predictors, including the Pattern History Table for conditional branch direction and the Branch Target Buffer (BTB) for branch targets such as indirect branches, predict where execution is going. The CPU executes speculatively, retains results in the Reorder Buffer, and either commits them (prediction correct) or discards them (prediction wrong). The trick is that “discards them” is not perfect: side effects on microarchitectural state (cache lines loaded, branch predictor state updated) persist even when the architectural result is rolled back.
  2. The attacker’s perspective: turning microarchitectural side effects into a data leak. Spectre Variant 1 (bounds check bypass): the attacker trains the branch predictor to expect a bounds check to pass, then triggers the speculative path with an out-of-bounds index. The speculative load reads secret memory, uses the secret as an index into a probe array, and brings a specific cache line into L1. The architectural result is rolled back, but the cache state is not. The attacker times accesses to the probe array; the line that hits in cache encodes the secret byte. The original unoptimized proof of concept read memory at roughly 10 KB/sec with this primitive.
  3. Spectre Variant 2 (branch target injection): poisoning indirect branches. The attacker pollutes the BTB with branch targets that, when used speculatively by the victim, redirect speculative execution to attacker-chosen code — “gadgets” — in the victim’s address space. Now the speculative-execution-and-cache-side-channel pattern can read across security boundaries (kernel space, other VMs, browser sandboxes).
  4. Kernel mitigations: per-variant. Variant 1 is mitigated with array bounds masking (array_index_nospec) and LFENCE speculation barriers in kernel hot paths; finding every vulnerable site is a compiler and code review job, painful and incomplete. Variant 2 is mitigated with retpolines on x86 (replacing indirect branches with a return trampoline that defeats BTB poisoning) and the IBRS / IBPB / STIBP CPU features (restricting or clearing predictor state at boundary crossings). Cross-process and cross-VM protection between SMT siblings comes from core scheduling: never run mutually untrusted tasks on sibling threads of the same physical core.
  5. What you give up. Retpolines add 5-30 percent overhead to indirect-call-heavy workloads (interpreters, VM monitors, system call entry). KPTI (which mitigates Meltdown but is part of the same family) costs 5-30 percent on syscall-heavy workloads, especially on older CPUs without PCID. Disabling SMT for security on multi-tenant hosts halves logical core count. The total cost on a Skylake-era Xeon running a syscall-heavy workload is non-trivial — often 10-25 percent throughput loss compared to a fully-mitigation-disabled baseline.
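The bounds masking mentioned for Variant 1 can be sketched in portable C. This follows the idea behind the kernel's generic array_index_nospec: compute a mask without a branch, so that even a mispredicted bounds check cannot speculatively use an out-of-range index. The names `index_mask_nospec` and `read_clamped` are hypothetical; the sketch assumes arithmetic right shift of negative signed values (true for GCC and Clang), and unlike the kernel it does nothing to stop a compiler from turning the mask back into a branch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Returns all ones if index < size, and 0 otherwise, without branching.
 * Valid for size <= LONG_MAX. */
unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    assert(size <= (unsigned long)LONG_MAX);
    /* If index >= size, either index has its top bit set or
     * (size - 1 - index) wraps and sets it; the complement then has
     * a clear sign bit and the shift yields 0. */
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

/* Bounds-checked read whose index is clamped to 0 under misspeculation. */
unsigned char read_clamped(const unsigned char *table, unsigned long size,
                           unsigned long index)
{
    assert(table != NULL);
    if (index >= size)
        return 0;
    /* Architecturally a no-op here; speculatively forces index to 0. */
    index &= index_mask_nospec(index, size);
    return table[index];
}
```

The key design point is that the mask depends on the index through data flow rather than control flow, so the branch predictor has nothing to mispredict in the masking step itself.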
Real-World Example: When Spectre was disclosed in January 2018, AWS deployed mitigations in two phases: first, Linux KPTI plus retpolines on hosts (immediately, for all workloads), and second, a Nitro-based approach that moved virtualization to dedicated hardware so the hypervisor surface no longer ran on the same cores as guest code. Internal AWS benchmarks reported 1-5 percent average overhead for typical workloads, but specific workloads (Redis, syscall-heavy databases) showed 20-30 percent regressions until application-level tuning recovered most of it. Public web search engines saw similar cost; Google’s response involved refactoring V8’s JIT to insert speculation barriers in a way that did not pay the full retpoline cost on every JS function call.
Senior follow-up 1: Why is Meltdown easier to fully mitigate than Spectre?
Meltdown exploits a specific CPU implementation flaw (found mostly in Intel designs): speculative loads to kernel addresses from user mode forwarded data before the permission check took effect. The mitigation (KPTI, unmapping the kernel from user-space page tables) removes the speculative load’s target entirely. There is nothing for the speculation to read. Spectre, in contrast, exploits a deliberate CPU feature (branch prediction) that you cannot remove without crippling performance. Every mitigation is partial: you fix one branch site, the next one is still vulnerable. You add a barrier somewhere, the attacker finds a different speculation primitive. The arms race is structurally asymmetric.
Senior follow-up 2: How does retpoline actually defeat branch target injection, mechanically?
A normal indirect branch (jmp *%rax) consults the BTB for prediction, which the attacker has poisoned. Retpoline replaces it with a thunk: a call pushes a return address that points at a capture loop, the thunk then overwrites that return address on the stack with the real target, and finally executes ret. The CPU predicts the ret with its return address predictor (Return Stack Buffer, RSB) instead of the BTB, and the RSB entry was just filled by the thunk’s own call, not by the attacker. Speculation therefore lands in the capture loop (pause; lfence; jmp back to the loop), so the speculative path performs no useful work for an attacker, while the architectural ret goes to the real target. The cost is a few extra instructions and a guaranteed misprediction per indirect call. On newer Intel CPUs with eIBRS (Enhanced Indirect Branch Restricted Speculation) and AMD CPUs with Automatic IBRS, the kernel can rely on a hardware mode instead, which gives comparable protection at lower overhead.
Senior follow-up 3: When should I disable Spectre mitigations on a host I control?
Realistic case: a single-tenant host running first-party trusted code only, where every binary on the box is built and signed by you, the kernel CVEs you fear are not in the speculation family, and the 5-25 percent throughput loss matters more than defense in depth. HPC clusters running tightly-controlled workloads disable some mitigations for this reason (mitigations=off or specific flags like nopti, spectre_v2=off). The risk you accept: an unknown future CVE that uses speculation to escape userspace, or a supply-chain compromise in a trusted dependency. Most production environments cannot make this tradeoff because the trust assumptions do not actually hold; HPC and game servers can. Document the decision explicitly so the next operator does not assume mitigations are on.
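Whichever way the decision goes, it helps to verify what the running kernel actually reports. Linux exposes one file per issue under /sys/devices/system/cpu/vulnerabilities/ (for example spectre_v2), whose content begins with "Not affected", "Vulnerable", or "Mitigation: ...". The sketch below reads and classifies such a file; the function and enum names are hypothetical, and only those three documented prefixes are assumed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum vuln_state {
    VULN_NOT_AFFECTED,
    VULN_MITIGATED,
    VULN_VULNERABLE,
    VULN_UNKNOWN
};

#define STARTS_WITH(s, lit) (strncmp((s), (lit), sizeof(lit) - 1) == 0)

/* Classify one status string as reported by the kernel. */
enum vuln_state classify_vuln(const char *status)
{
    assert(status != NULL);
    if (STARTS_WITH(status, "Not affected"))
        return VULN_NOT_AFFECTED;
    if (STARTS_WITH(status, "Mitigation:"))
        return VULN_MITIGATED;
    if (STARTS_WITH(status, "Vulnerable"))
        return VULN_VULNERABLE;
    return VULN_UNKNOWN;
}

/* Read the status line for one issue (e.g. "spectre_v2") into buf.
 * Returns 0 on success, -1 if the file is missing or unreadable. */
int read_vuln_status(const char *name, char *buf, size_t len)
{
    char path[256];
    int n;
    FILE *f;
    char *ok;

    assert(name != NULL && buf != NULL && len > 0);
    n = snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/vulnerabilities/%s", name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = fgets(buf, (int)len, f);
    fclose(f);
    if (ok == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0'; /* strip trailing newline */
    return 0;
}
```

A fleet check could call `read_vuln_status("spectre_v2", ...)` on each host and alert when `classify_vuln` returns `VULN_VULNERABLE` on a machine that is not on the documented exception list.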
Common Wrong Answers:
  • “Spectre and Meltdown are the same thing.” They share a primitive (cache side channel after speculation) but differ in trust boundary and mitigation profile. Conflating them suggests you have read the headline but not the technical writeup.
  • “Just patch your CPU microcode.” Microcode updates are part of the mitigation but cannot fix Spectre fully because Spectre is a behavior, not a bug. Software mitigations remain mandatory.
  • “Disable Hyper-Threading and you are safe.” Disabling SMT helps against L1TF and MDS variants where SMT siblings share microarchitectural state, but does nothing for cross-process Spectre on the same core. It is one mitigation, not the answer.
Further Reading:
  • The original Spectre paper: Kocher et al., “Spectre Attacks: Exploiting Speculative Execution” (2018 preprint; IEEE Symposium on Security and Privacy, 2019)
  • LWN article: “The current state of kernel page-table isolation” — comprehensive KPTI walkthrough
  • Intel “Speculative Execution Side Channel Mitigations” white paper — the vendor’s view
  • Linux source: arch/x86/kernel/cpu/bugs.c, arch/x86/include/asm/nospec-branch.h
