Operating System Security

Modern OS security is a multi-layered defense against both software vulnerabilities and hardware-level attacks. From memory protection to mandatory access control, understanding these mechanisms is crucial for building secure systems.
Mastery Level: Senior Security Engineer
Key Internals: Page Table Permissions, Capabilities, LSM hooks, CPU security features, Speculative execution mitigations
Prerequisites: Virtual Memory, Process Internals

1. Memory Protection Fundamentals

1.1 Page-Level Protection (NX/DEP)

No-Execute (NX) / Data Execution Prevention (DEP) marks memory pages as non-executable.
Traditional (Insecure):
┌────────────────────────────────┐
│  Stack                         │  Executable!
│  ├─ Return addresses           │  ← Attacker can inject shellcode
│  └─ Local variables            │
├────────────────────────────────┤
│  Heap                          │  Executable!
│  ├─ Malloc'd buffers           │  ← Attacker can put code here
│  └─ Dynamic data               │
├────────────────────────────────┤
│  Data/BSS                      │  Executable!
└────────────────────────────────┘

With NX/DEP:
┌────────────────────────────────┐
│  Stack (NX bit set)            │  NOT Executable
│  ├─ Return addresses           │  ← Shellcode won't execute!
│  └─ Local variables            │
├────────────────────────────────┤
│  Heap (NX bit set)             │  NOT Executable
│  ├─ Malloc'd buffers           │
│  └─ Dynamic data               │
├────────────────────────────────┤
│  Data/BSS (NX bit set)         │  NOT Executable
├────────────────────────────────┤
│  Text (executable)             │  Executable
│  └─ Program code               │
└────────────────────────────────┘
Implementation:
// Kernel sets page table entry (PTE) NX bit

// x86-64 page table entry structure
struct pte {
    unsigned long present : 1;     // Page is in memory
    unsigned long rw : 1;          // Read/Write permission
    unsigned long user : 1;        // User-mode accessible
    unsigned long pwt : 1;         // Page write-through
    unsigned long pcd : 1;         // Page cache disabled
    unsigned long accessed : 1;    // Page was accessed
    unsigned long dirty : 1;       // Page was written
    unsigned long pat : 1;         // Page attribute table
    unsigned long global : 1;      // Global page
    unsigned long avail : 3;       // Available for OS use
    unsigned long pfn : 40;        // Physical frame number
    unsigned long avail2 : 11;     // Available
    unsigned long nx : 1;          // NO-EXECUTE bit (bit 63)
};

// Illustrative pseudocode for stack/code mappings, loosely modeled on mm/mmap.c (not actual kernel source)
unsigned long do_mmap(struct file *file, unsigned long addr,
                     unsigned long len, unsigned long prot,
                     unsigned long flags, unsigned long pgoff) {
    struct vm_area_struct *vma;

    vma = vm_area_alloc(current->mm);
    vma->vm_start = addr;
    vma->vm_end = addr + len;

    // Stack protection: read/write but NOT executable
    if (flags & MAP_STACK) {
        vma->vm_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
        // NX bit will be set in page table entries
    }

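    // Code segment: executable but NOT writable.
    // Stripping VM_WRITE here models a strict W^X policy; mainline Linux
    // leaves that decision to policy (e.g. SELinux execmem) rather than mmap
    if (prot & PROT_EXEC) {
        vma->vm_flags |= VM_EXEC;
        vma->vm_flags &= ~VM_WRITE;  // W^X: Write XOR Execute
    }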

    return addr;
}
W^X Policy (Write XOR Execute):
  • A page can be writable OR executable, but never both
  • Prevents attacker from modifying code or executing data
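The W^X rule can be expressed as a small check on requested page protections. The following is a minimal sketch, assuming POSIX `PROT_*` flags; `violates_wx` and `mprotect_wx` are hypothetical helper names, not kernel or libc APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

// Returns true if the requested protection would make a page both
// writable and executable, which a strict W^X policy must refuse.
bool violates_wx(int prot) {
    return (prot & PROT_WRITE) && (prot & PROT_EXEC);
}

// Hypothetical policy wrapper: rejects W+X requests before calling mprotect().
int mprotect_wx(void *addr, size_t len, int prot) {
    assert(addr != NULL);
    if (violates_wx(prot))
        return -1;  // rejected by policy, mprotect() is never reached
    return mprotect(addr, len, prot);
}
```

A JIT compiler that follows this rule would map a region read/write, emit code, and only then switch it to read/execute with a second call.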
Check NX status:
# Check if NX is enabled
dmesg | grep NX
# NX (Execute Disable) protection: active

# Check process memory maps
cat /proc/self/maps
# 7ffff7dd1000-7ffff7df3000 r-xp ... /lib/x86_64-linux-gnu/ld-2.31.so  (executable)
# 7ffff7df3000-7ffff7df4000 r--p ... /lib/x86_64-linux-gnu/ld-2.31.so  (read-only)
# 7ffffffde000-7ffffffff000 rw-p ... [stack]                           (no 'x'!)

# Check if binary has NX enabled
readelf -l /bin/ls | grep GNU_STACK
# GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x10
#                                                                 ^^^ (RW, not RWE)
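The same check can be automated by scanning the permission column of each maps line. This is a rough sketch that assumes the usual `start-end perms ...` layout of `/proc/<pid>/maps`; `maps_line_is_wx` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

// Parses one line in /proc/<pid>/maps format and reports whether the
// mapping is both writable and executable (a W^X violation).
bool maps_line_is_wx(const char *line) {
    unsigned long start, end;
    char perms[5] = {0};

    assert(line != NULL);
    // Expected layout: start-end perms offset dev inode [path]
    if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
        return false;  // not a maps line
    return perms[1] == 'w' && perms[2] == 'x';
}
```

Feeding every line of `/proc/self/maps` through this function and printing the ones that return true gives a quick audit for writable-and-executable regions.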

1.2 Address Space Layout Randomization (ASLR)

Problem: Without ASLR, addresses are predictable.
Without ASLR (Predictable):
┌─────────────────────────────────┐
│ Stack:         0x7ffffffde000    │  ← Always same address!
│ Heap:          0x555555559000    │  ← Attacker knows these
│ libc:          0x7ffff7a0d000    │  ← Can hardcode in exploit
│ Program:       0x555555554000    │
│ vDSO:          0x7ffff7fc9000    │
└─────────────────────────────────┘

With ASLR (Randomized):
Run 1:                            Run 2:
┌─────────────────────────────┐  ┌─────────────────────────────┐
│ Stack:    0x7ffc9e3a2000    │  │ Stack:    0x7ffe1b8d6000    │
│ Heap:     0x5643ab123000    │  │ Heap:     0x55e2d9abc000    │
│ libc:     0x7f8a2e456000    │  │ libc:     0x7f3c81de2000    │
│ Program:  0x5643ab11f000    │  │ Program:  0x55e2d9ab8000    │
│ vDSO:     0x7f8a2f1c3000    │  │ vDSO:     0x7f3c82b4f000    │
└─────────────────────────────┘  └─────────────────────────────┘
                ↑ Different every time! ↑
Kernel Implementation:
// Simplified from arch/x86/mm/mmap.c

unsigned long arch_mmap_rnd(void) {
    unsigned long rnd;

    // Get random bits from kernel PRNG
    // 32-bit (compat) tasks have a smaller address space and use fewer bits
    if (mmap_is_ia32()) {
        rnd = get_random_long() & ((1UL << mmap_rnd_compat_bits) - 1);
    } else {
        rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);
    }

    return rnd << PAGE_SHIFT;  // Align to page boundary
}

unsigned long arch_get_unmapped_area(struct file *filp,
                                    unsigned long addr,
                                    unsigned long len,
                                    unsigned long pgoff,
                                    unsigned long flags) {
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned long start_addr;

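    // Add random offset
    // (Simplified: the real kernel folds the random offset into
    // mm->mmap_base once at exec time, not on every mmap call)
    if (!(flags & MAP_FIXED)) {
        // Random offset for ASLR
        start_addr = mm->mmap_base + arch_mmap_rnd();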
    } else {
        start_addr = addr;
    }

    // Find free region starting at randomized address
    vma = find_vma(mm, start_addr);
    // ... allocation logic ...

    return start_addr;
}
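The mask-and-shift step in `arch_mmap_rnd()` can be tried in user space. A minimal sketch, assuming 4 KiB pages; `page_aligned_rnd` is an illustrative name and the random value is passed in rather than drawn from a PRNG.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  // 4 KiB pages, as on x86-64

// Keeps the low `rnd_bits` bits of a random value and shifts them so the
// resulting offset is always a multiple of the page size.
uint64_t page_aligned_rnd(uint64_t random_value, unsigned rnd_bits) {
    assert(rnd_bits > 0 && rnd_bits < 64 - PAGE_SHIFT);
    uint64_t mask = (UINT64_C(1) << rnd_bits) - 1;
    return (random_value & mask) << PAGE_SHIFT;
}
```

With 28 bits the largest possible offset is just under 2^40 bytes (1 TiB), which is why this much entropy is only practical in a 64-bit address space.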
Entropy Sources:
ASLR Entropy (approximate bits of randomness; exact values depend on kernel version and configuration):

Stack:      19 bits (on x86-64)  = 524,288 possible locations
Heap:       13 bits              = 8,192 possible locations
Libraries:  28 bits              = 268 million possible locations
PIE binary: 28 bits              = 268 million possible locations

Formula: Worst-case brute force attempts = 2^(entropy_bits); on average, half of that

Example: 28 bits → attacker needs avg 2^27 ≈ 134 million attempts
If each attempt crashes the program (1 sec delay):
  2^27 seconds ≈ 1,553 days (about 4.25 years)

But if a forking server hands every child the same layout (fork server):
  A crash costs the attacker almost nothing and the target never moves,
  so low-entropy regions can be brute forced in minutes!
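The arithmetic above can be reproduced with a few lines of C. This is a sketch under one stated assumption: the layout stays fixed between attempts, so on average half of the search space is tried. The function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

// Expected number of guesses to hit one of 2^bits equally likely layouts
// when the layout does not change between attempts.
uint64_t expected_attempts(unsigned entropy_bits) {
    assert(entropy_bits >= 1 && entropy_bits <= 63);
    return UINT64_C(1) << (entropy_bits - 1);
}

// Expected wall-clock time in days for a fixed cost per attempt.
double expected_days(unsigned entropy_bits, double seconds_per_attempt) {
    return (double)expected_attempts(entropy_bits) * seconds_per_attempt / 86400.0;
}
```

For 28 bits at one second per attempt this gives roughly 1,553 days; for the 13-bit heap figure it gives 4,096 attempts, a little over an hour at the same rate.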
KASLR (Kernel ASLR):
// Kernel address randomization (simplified from arch/x86/boot/compressed/kaslr.c)

void choose_random_location(unsigned long input,
                           unsigned long input_size,
                           unsigned long *output,
                           unsigned long output_size,
                           unsigned long *virt_addr) {
    unsigned long random_addr, min_addr;

    // The kernel PRNG is not running this early in boot, so entropy is
    // mixed from:
    // 1. RDRAND (CPU instruction, if available)
    // 2. RDTSC (timestamp counter)
    // 3. A hash of boot parameters and build information

    // Physical placement: pick a random slot large enough for the image
    min_addr = min(*output, 512UL << 20);
    random_addr = find_random_phys_addr(min_addr, output_size);
    if (random_addr)
        *output = random_addr;

    // Virtual placement: on x86-64 this is randomized independently
    // of the physical address
    random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);
    *virt_addr = random_addr;
}
Check ASLR status:
# View ASLR setting
cat /proc/sys/kernel/randomize_va_space
# 0 = Disabled
# 1 = Randomize stack, libraries, mmap
# 2 = Full randomization (includes heap)

# Enable full ASLR
echo 2 | sudo tee /proc/sys/kernel/randomize_va_space

# Test ASLR
for i in {1..5}; do cat /proc/self/maps | grep stack; done
# 7ffc12345000-7ffc12366000 (different)
# 7ffe9abcd000-7ffe9abee000 (different)
# 7ffd45678000-7ffd45699000 (different)

1.3 Stack Canaries (Stack Smashing Protection)

A stack canary is a random value placed on the stack between local variables and the return address.
Stack Layout with Canary:
─────────────────────────

High Address
┌──────────────────────┐
│  Return Address      │  ← Protected by canary!
├──────────────────────┤
│  Saved Frame Pointer │
├──────────────────────┤
│  CANARY (random)     │  ← __stack_chk_guard
├──────────────────────┤
│  Local Variables     │  ← Buffer overflow starts here
│  char buf[100];      │
└──────────────────────┘
Low Address

Attack Scenario:
1. Attacker overflows buf
2. Overwrites canary (but doesn't know correct value)
3. Function reaches its epilogue, just before returning
4. Compiler-inserted check: canary == __stack_chk_guard?
5. Mismatch → Stack smashing detected! → Abort
Compiler Implementation:
// Original vulnerable code
void vulnerable_function(char *input) {
    char buffer[64];
    strcpy(buffer, input);  // Buffer overflow!
}

// Compiled with -fstack-protector-strong
void vulnerable_function(char *input) {
    char buffer[64];
    unsigned long canary = __stack_chk_guard;  // Load canary

    strcpy(buffer, input);

    if (canary != __stack_chk_guard) {
        __stack_chk_fail();  // Stack smashing detected!
    }
}

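// __stack_chk_fail() implementation (simplified from glibc)
void __attribute__((noreturn)) __stack_chk_fail(void) {
    __fortify_fail("stack smashing detected");
}

void __attribute__((noreturn)) __fortify_fail(const char *msg) {
    // Report the error on stderr:
    //   *** stack smashing detected ***: terminated
    // (__libc_message is a glibc-internal helper; its exact signature
    // varies between glibc versions)
    __libc_message("*** %s ***: terminated\n", msg);

    // Terminate immediately
    abort();
}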
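The prologue/epilogue logic can be modelled without corrupting a real stack. The sketch below uses a struct to stand in for a stack frame, so the overflow stays inside one object; the layout, the `copy_and_check` name, and the guard value passed by the caller are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Models a stack frame: the buffer sits directly below the canary,
// so a linear overflow must cross the canary first.
struct frame {
    char     buf[16];
    uint64_t canary;
};

// Copies `len` bytes into the frame without checking against sizeof(buf)
// (the bug), then performs the epilogue check. Returns true if intact.
bool copy_and_check(const void *input, size_t len, uint64_t guard) {
    struct frame f;
    assert(len <= sizeof f);   // keep the demonstration inside the model
    f.canary = guard;          // prologue: store the canary
    memcpy(&f, input, len);    // may spill from buf into canary
    return f.canary == guard;  // epilogue: compare before "returning"
}
```

Copying 16 bytes leaves the canary intact; copying 17 or more changes at least one canary byte, which is the condition under which the real epilogue calls `__stack_chk_fail()`.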
Canary Types:
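// Terminator canary
// Uses NUL, CR, LF, EOF (0x00, 0x0D, 0x0A, 0xFF)
// Idea: strcpy stops at NUL and gets stops at LF, so string functions
// cannot write the canary value back

unsigned long canary = 0x000d0aff00000000;

// Weakness: The value is fixed and publicly known, so overflows through
// memcpy/read (which do not stop at these bytes) can simply rewrite it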
Compiler Flags:
# -fstack-protector: Protect functions with vulnerable buffers
gcc -fstack-protector vulnerable.c

# -fstack-protector-strong: Protect more functions (recommended)
gcc -fstack-protector-strong vulnerable.c

# -fstack-protector-all: Protect ALL functions (performance cost)
gcc -fstack-protector-all vulnerable.c

# Check if binary has stack protector (the symbol is imported from glibc, so it shows as an undefined FUNC)
readelf -s /bin/ls | grep stack_chk
#     55: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __stack_chk_fail@GLIBC_2.4 (3)
Bypass Techniques (and mitigations):
Attack                                       | Mitigation
---------------------------------------------|---------------------------------------
Leak canary via format string                | Use fortified functions (__printf_chk)
Overwrite canary with correct value          | Use random canary per thread
Jump over canary (partial overflow)          | Place canary near variables
Fork before overflow (canary same in child)  | Re-randomize after fork

2. Control Flow Integrity (CFI)

CFI ensures program control flow follows legitimate paths (no arbitrary jumps).

2.1 Forward-Edge CFI (Indirect Calls)

Problem: Function pointers can be hijacked.
// Vulnerable code
struct ops {
    void (*process)(char *data);
};

struct ops *vtable = malloc(sizeof(struct ops));
vtable->process = legitimate_function;

// ... buffer overflow ...
// Attacker overwrites vtable->process to point to shellcode

vtable->process(data);  // Calls shellcode!
CFI Solution:
// Compiler generates CFI check before indirect call

// Original code
vtable->process(data);

// Compiled with CFI
void *target = vtable->process;

// Check 1: Is target a valid code address?
if (!is_valid_code_address(target)) {
    cfi_violation();
}

// Check 2: Is target in allowed set for this call site?
if (!is_allowed_target(call_site_id, target)) {
    cfi_violation();
}

// Perform call
((void (*)(char *))target)(data);
Allowed Target Sets:
Function Signature-Based CFI:

void func_a(int x);           ← Set 1: void (int)
void func_b(int x);           ←
int func_c(int x, int y);     ← Set 2: int (int, int)
int func_d(int x, int y);     ←
void func_e(void);            ← Set 3: void (void)

Rule: Indirect call with signature void(int) can only jump to Set 1.

Implementation:
1. Compiler assigns ID to each function signature
2. Compiler tags each function with its ID
3. Before indirect call, check ID matches expected signature
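The three implementation steps can be sketched as a lookup table keyed by function address. This is a toy model only: real compilers encode the type ID next to the function or in a jump table rather than searching a list, and the IDs, table, and `cfi_check` name here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical signature IDs; a real compiler derives these from the type.
enum { SIG_VOID_INT = 1, SIG_INT_INT_INT = 2 };

// Each address-taken function is registered with its signature ID.
struct cfi_entry {
    void    (*fn)(void);  // stored as a generic function pointer
    uint32_t sig_id;
};

void func_a(int x) { (void)x; }
int  func_c(int x, int y) { return x + y; }

const struct cfi_entry cfi_table[] = {
    { (void (*)(void))func_a, SIG_VOID_INT },
    { (void (*)(void))func_c, SIG_INT_INT_INT },
};

// Returns true if `target` is registered with the signature the call site expects.
bool cfi_check(const struct cfi_entry *table, size_t n,
               void (*target)(void), uint32_t expected_sig) {
    assert(table != NULL);
    for (size_t i = 0; i < n; i++)
        if (table[i].fn == target)
            return table[i].sig_id == expected_sig;
    return false;  // unknown address: not a valid indirect-call target
}
```

A call site of type `void(int)` would call `cfi_check(..., SIG_VOID_INT)` first and abort on false, so a pointer redirected to `func_c` or to an unregistered address is rejected.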
Clang CFI:
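# Compile with CFI (requires LTO; the C++ schemes also need hidden visibility)
clang -fsanitize=cfi -flto -fvisibility=hidden program.c

# CFI variants
-fsanitize=cfi-icall            # Indirect calls
-fsanitize=cfi-vcall            # Virtual function calls (C++)
-fsanitize=cfi-derived-cast     # Bad casts from base to derived class (C++)
-fsanitize=cfi-unrelated-cast   # Bad casts from void* or unrelated types (C++)

# By default a violation executes a trap instruction and the process dies.
# To get a diagnostic report instead, build with -fno-sanitize-trap=cfi:
clang -fsanitize=cfi -fno-sanitize-trap=cfi -flto -fvisibility=hidden program.c
UBSAN_OPTIONS=print_stacktrace=1 ./program

# Example violation (illustrative; exact wording varies by Clang version)
# program.c:42: runtime error: control flow integrity check for type 'void (char *)' failed during indirect function call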

2.2 Backward-Edge CFI (Return Address Protection)

Shadow Stack: Hardware-protected copy of return addresses.
Regular Stack           Shadow Stack (Protected)
─────────────────       ────────────────────────
┌─────────────┐         ┌─────────────┐
│  Ret Addr 3 │ ◄──────►│  Ret Addr 3 │  (Copy)
├─────────────┤         ├─────────────┤
│  Locals     │         │             │
├─────────────┤         │             │
│  Ret Addr 2 │ ◄──────►│  Ret Addr 2 │
├─────────────┤         ├─────────────┤
│  Locals     │         │             │
├─────────────┤         │             │
│  Ret Addr 1 │ ◄──────►│  Ret Addr 1 │
└─────────────┘         └─────────────┘
     ↑                         ↑
  Writable!              Read-Only!
  (Attacker can          (CPU enforced,
   modify)                not accessible)

On Function Return:
1. Pop return address from regular stack → addr_stack
2. Pop return address from shadow stack → addr_shadow
3. Compare: addr_stack == addr_shadow?
4. Mismatch → #CP exception (Control Protection) → Crash
Intel CET (Control-flow Enforcement Technology):
// CPU features for CET (enumerated by CPUID leaf 7, subleaf 0)
#define X86_FEATURE_SHSTK  (1 << 7)   // Shadow stack: ECX bit 7
#define X86_FEATURE_IBT    (1 << 20)  // Indirect branch tracking: EDX bit 20

// Enable user-mode shadow stack (simplified kernel-style pseudocode)
void cet_enable(void) {
    u64 msr_val;
    unsigned long ssp;

    // Check if CPU supports CET
    if (!boot_cpu_has(X86_FEATURE_SHSTK))
        return;

    // Enable in MSR (CR4.CET must already be set).
    // User-mode shadow stacks are controlled by IA32_U_CET;
    // IA32_S_CET governs supervisor mode.
    rdmsrl(MSR_IA32_U_CET, msr_val);
    msr_val |= MSR_IA32_CET_SHSTK_EN;  // Enable shadow stack
    wrmsrl(MSR_IA32_U_CET, msr_val);

    // Allocate shadow stack for current thread
    ssp = alloc_shstk();  // Shadow stack pointer
    wrmsrl(MSR_IA32_PL3_SSP, ssp);
}

// Shadow stack operations (new x86 instructions)
// INCSSP - Increment shadow stack pointer
// RDSSP  - Read shadow stack pointer
// SAVEPREVSSP - Save previous SSP
// RSTORSSP - Restore SSP
// WRSSD/WRSSQ - Write to shadow stack
// SETSSBSY - Mark shadow stack busy
ARM Pointer Authentication:
// ARM PAuth uses cryptographic signing of return addresses

// On function prologue:
// PAC (Pointer Authentication Code) = sign(return_addr, context, key)
// The PAC is stored in the unused upper bits of the return address itself,
// so the signed pointer still occupies a single 64-bit stack slot

// On function epilogue:
// Verify: sign(return_addr, context, key) == PAC?
// If mismatch → Fault

// ARM instructions
PACIA  X30, SP   // Sign return address (X30) with stack pointer (SP)
RETAA            // Authenticate and return
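The sign/verify flow can be imitated in software to show where the PAC lives. This is a toy sketch only: the mixing function below is NOT the cipher real hardware uses and is not secure, and the 48-bit address width, key, and function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ADDR_BITS 48                                // assumed virtual address width
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

// Toy keyed mix standing in for the hardware cipher (NOT secure).
uint64_t toy_mac(uint64_t addr, uint64_t context, uint64_t key) {
    uint64_t h = (addr ^ key) * UINT64_C(0x9E3779B97F4A7C15);
    h ^= context + (h >> 29);
    h *= UINT64_C(0xBF58476D1CE4E5B9);
    return h >> ADDR_BITS;                          // keep 16 bits for the PAC
}

// Sign: place the PAC in the unused upper bits of the pointer.
uint64_t pac_sign(uint64_t addr, uint64_t context, uint64_t key) {
    assert((addr & ~ADDR_MASK) == 0);
    return addr | (toy_mac(addr, context, key) << ADDR_BITS);
}

// Authenticate: recompute the PAC; on success strip it and return true.
bool pac_auth(uint64_t signed_ptr, uint64_t context, uint64_t key, uint64_t *out) {
    uint64_t addr = signed_ptr & ADDR_MASK;
    if (pac_sign(addr, context, key) != signed_ptr)
        return false;                               // hardware would fault here
    *out = addr;
    return true;
}
```

Using the stack pointer as `context`, as `PACIA X30, SP` does, ties a signed return address to one particular frame, so a valid signed pointer copied from elsewhere is unlikely to verify.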
Software Shadow Stack (Android):
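// Conceptual model (not hardware-protected). Android's ShadowCallStack is
// emitted by the compiler (-fsanitize=shadow-call-stack), not by libc calls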

__thread void *shadow_stack[1024];
__thread int shadow_stack_ptr = 0;

void function_entry(void *return_addr) {
    shadow_stack[shadow_stack_ptr++] = return_addr;
}

void function_exit(void *return_addr) {
    void *expected = shadow_stack[--shadow_stack_ptr];
    if (return_addr != expected) {
        abort();  // Stack corruption detected
    }
}

// Weakness: Attacker can corrupt shadow_stack too if memory bug exists
// Strength: Works on CPUs without hardware support

3. Privilege Separation & Capabilities

3.1 Traditional Unix DAC (Discretionary Access Control)

User/Group/Other Permissions:

File: /etc/shadow
Owner: root
Group: shadow
Permissions: rw-r-----
             │││││││││
             ││││││││└─ Other: no execute
             │││││││└── Other: no write
             ││││││└─── Other: no read
             │││││└──── Group: no execute
             ││││└───── Group: no write
             │││└────── Group: read
             ││└─────── Owner: no execute
             │└──────── Owner: write
             └───────── Owner: read

Problem: All-or-nothing root privileges!
- Process needs root to bind port 80 → runs fully as root
- Process needs root to read /etc/shadow → runs fully as root
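The owner/group/other decision can be written out as a short function. This is a simplified sketch: it ignores supplementary groups, ACLs, and the root override, and the UID/GID values in the usage note are examples only.

```c
#include <assert.h>
#include <stdbool.h>

enum { WANT_R = 4, WANT_W = 2, WANT_X = 1 };

// Classic DAC check. Exactly one permission class applies: owner if the
// UID matches, else group if the GID matches, else other.
bool dac_allows(unsigned mode, unsigned file_uid, unsigned file_gid,
                unsigned uid, unsigned gid, unsigned want) {
    assert((want & ~7u) == 0);
    unsigned bits;
    if (uid == file_uid)
        bits = (mode >> 6) & 7;   // owner class
    else if (gid == file_gid)
        bits = (mode >> 3) & 7;   // group class
    else
        bits = mode & 7;          // other class
    return (bits & want) == want;
}
```

For `/etc/shadow` with mode `0640`, a process in the file's group can read but not write, and an unrelated user gets nothing, which is why programs that need the file have historically run as root.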

3.2 POSIX Capabilities

Divide root privileges into distinct units:
// From /usr/include/linux/capability.h

#define CAP_CHOWN            0   // Change file ownership
#define CAP_DAC_OVERRIDE     1   // Bypass file permission checks
#define CAP_DAC_READ_SEARCH  2   // Bypass read/search permissions
#define CAP_FOWNER           3   // Bypass permission checks on file operations
#define CAP_FSETID           4   // Don't clear setuid/setgid bits
#define CAP_KILL             5   // Bypass permission checks for sending signals
#define CAP_SETGID           6   // Make arbitrary setgid calls
#define CAP_SETUID           7   // Make arbitrary setuid calls
#define CAP_NET_BIND_SERVICE 10  // Bind to ports < 1024
#define CAP_NET_RAW          13  // Use RAW and PACKET sockets
#define CAP_SYS_MODULE       16  // Load/unload kernel modules
#define CAP_SYS_PTRACE       19  // Trace arbitrary processes
#define CAP_SYS_ADMIN        21  // Lots of system admin operations
// ... 41 capabilities total ...
Capability Sets:
// Each process has 5 capability sets

struct cred {
    // ...
    kernel_cap_t cap_inheritable;  // Inherited by exec'd programs
    kernel_cap_t cap_permitted;    // Can be enabled (superset of effective)
    kernel_cap_t cap_effective;    // Actually active NOW
    kernel_cap_t cap_bset;         // Bounding set (limits what exec can grant)
    kernel_cap_t cap_ambient;      // Ambient set (new in Linux 4.3)
};

// Each set is a 64-bit bitmask: one bit per capability (room for up to 64)
typedef struct {
    __u32 cap[_LINUX_CAPABILITY_U32S_3];  // 2 × 32 bits
} kernel_cap_t;
Capability Semantics:
Permitted (P):   Capabilities the process CAN use
Effective (E):   Capabilities CURRENTLY active
Inheritable (I): Capabilities that can be inherited across exec
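Ambient (A):     Capabilities preserved across exec of unprivileged programs
Bounding (B):    Upper limit on capabilities (cannot gain capabilities not in B)

On exec() (per capabilities(7)):
A' = (file is privileged) ? 0 : A
P' = (I & F_inheritable) | (F_permitted & B) | A'
E' = F_effective ? P' : A'
I' = I
B' = B

Where F_* are file capabilities (set on executable). A file is "privileged"
if it has file capabilities or a set-user-ID/set-group-ID bit.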
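To make the exec transformation concrete, here is a sketch that follows the rules documented in capabilities(7) for a non-root process. The struct and function names are illustrative, and set-user-ID handling is left out: a file counts as privileged here only if it carries file capabilities.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CAP_NET_BIND_SERVICE 10
#define CAP_NET_RAW          13
#define CAP_BIT(c) (UINT64_C(1) << (c))

struct proc_caps { uint64_t permitted, effective, inheritable, bounding, ambient; };
struct file_caps { uint64_t permitted, inheritable; bool effective; };

// Computes the capability sets of a process after execve().
struct proc_caps caps_after_exec(struct proc_caps p, struct file_caps f) {
    // Invariant: ambient is always a subset of permitted and inheritable.
    assert((p.ambient & ~(p.permitted & p.inheritable)) == 0);

    bool privileged_file = f.permitted || f.inheritable || f.effective;
    struct proc_caps n;
    n.ambient     = privileged_file ? 0 : p.ambient;
    n.permitted   = (p.inheritable & f.inheritable)
                  | (f.permitted & p.bounding)
                  | n.ambient;
    n.effective   = f.effective ? n.permitted : n.ambient;
    n.inheritable = p.inheritable;
    n.bounding    = p.bounding;
    return n;
}
```

An unprivileged shell executing a binary tagged `cap_net_raw+ep` ends up with only `CAP_NET_RAW` permitted and effective; if that capability has been dropped from the bounding set, the result is empty.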
Using Capabilities:
# Give ping the ability to create raw sockets (no setuid needed!)
sudo setcap cap_net_raw+ep /bin/ping

# Verify
getcap /bin/ping
# /bin/ping = cap_net_raw+ep

# Now ping works without setuid bit!
ls -l /bin/ping
# -rwxr-xr-x  ... /bin/ping  (no 's' bit!)

# Remove capabilities
sudo setcap -r /bin/ping

# Set multiple capabilities
sudo setcap cap_net_bind_service,cap_net_raw+ep /usr/bin/server

3.3 Seccomp (Secure Computing Mode)

Seccomp-BPF: Restrict system calls a process can make using BPF filters.
#include <fcntl.h>
#include <seccomp.h>
#include <unistd.h>

int main(void) {
    scmp_filter_ctx ctx;
    char buf[64];

    // Open everything we need BEFORE sandboxing. A seccomp filter only sees
    // raw argument values; it cannot dereference pointers, so a rule cannot
    // match on a pathname string.
    int fd = open("/tmp/allowed.txt", O_RDONLY);
    if (fd < 0)
        return 1;

    // Create filter: default action = KILL
    ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        return 1;

    // Allow essential syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Allow read only on the pre-opened descriptor (argument 0 == fd)
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 1,
                     SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)fd));

    // Load filter into kernel
    if (seccomp_load(ctx) != 0)
        return 1;

    // After this point, any syscall not explicitly allowed → SIGSYS (kill)

    read(fd, buf, sizeof(buf));          // ✓ Allowed (matches fd)
    open("/etc/passwd", O_RDONLY);       // ✗ Killed! (open/openat not allowed)

    seccomp_release(ctx);                // Not reached
    return 0;
}

// Compile: gcc -o sandbox sandbox.c -lseccomp
Raw Seccomp-BPF:
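#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

void install_seccomp_filter(void) {
    struct sock_filter filter[] = {
        // Check the architecture first: syscall numbers differ per ABI
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

        // Load syscall number
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),

        // Allow exit syscall
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        // Allow exit_group syscall (what returning from main uses)
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        // Allow write syscall
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        // Kill on any other syscall
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    };

    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    // Enable seccomp
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);  // Cannot gain privileges
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

int main(void) {
    install_seccomp_filter();

    write(1, "Hello\n", 6);  // ✓ Allowed
    getpid();                 // ✗ Killed (SIGSYS)

    return 0;
}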
Seccomp Actions:
Action                    | Effect
--------------------------|--------------------------
SECCOMP_RET_KILL_PROCESS  | Kill entire process
SECCOMP_RET_KILL_THREAD   | Kill only current thread
SECCOMP_RET_TRAP          | Send SIGSYS signal
SECCOMP_RET_ERRNO         | Return error code
SECCOMP_RET_TRACE         | Notify tracer (ptrace)
SECCOMP_RET_LOG           | Log and allow
SECCOMP_RET_ALLOW         | Allow syscall
Real-World Usage:
# Chrome sandbox
ps aux | grep chrome
# ... --type=renderer --enable-sandbox ...

# Docker seccomp profile
docker run --security-opt seccomp=default.json alpine sh

# systemd service with seccomp
cat /etc/systemd/system/myservice.service
# [Service]
# SystemCallFilter=@system-service
# SystemCallFilter=~@privileged @resources

# View seccomp status of process
grep Seccomp /proc/self/status
# Seccomp: 2  (mode 2 = filter active)

4. Mandatory Access Control (MAC)

4.1 SELinux (Security-Enhanced Linux)

SELinux adds mandatory access control on top of DAC.
DAC says: "Can user alice read file.txt?"
  → Check: alice's UID vs file owner, group, permissions

SELinux says: "Can process with label X access file with label Y?"
  → Check: Policy rules for (process_label, file_label, operation)

Both must succeed for access!
SELinux Components:
┌─────────────────────────────────────────────────┐
│              SELinux Policy                      │
│  ┌───────────────────────────────────────────┐  │
│  │  Type Enforcement (TE) Rules              │  │
│  │  allow httpd_t http_port_t:tcp_socket bind│  │
│  │  allow httpd_t httpd_sys_content_t:file r │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  Security Contexts (Labels)               │  │
│  │  user:role:type:level                     │  │
│  │  system_u:system_r:httpd_t:s0             │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘

         ┌──────────────────────┐
         │  LSM (Linux Security │
         │  Module) Framework   │
         └──────────────────────┘

      Kernel enforces at runtime
Security Context:
# View file context
ls -Z /var/www/html/index.html
# system_u:object_r:httpd_sys_content_t:s0 /var/www/html/index.html
#    │        │           │              │
#    │        │           │              └─ MLS level (sensitivity)
#    │        │           └─ Type (most important!)
#    │        └─ Role
#    └─ User

# View process context
ps -Z 1234
# system_u:system_r:httpd_t:s0 /usr/sbin/httpd

# Change file context
chcon -t httpd_sys_content_t /var/www/html/newfile.html

# Restore default contexts
restorecon -Rv /var/www/html/
Type Enforcement Rules:
# Policy source rules (compiled into the binary policy under /etc/selinux/targeted/policy/)

# Allow httpd to bind to ports labeled http_port_t (port 80, 443)
allow httpd_t http_port_t:tcp_socket name_bind;

# Allow httpd to read files labeled httpd_sys_content_t
allow httpd_t httpd_sys_content_t:file { read getattr open };

# Allow httpd to write to log files
allow httpd_t httpd_log_t:file { write append create };

# Deny (example)
# If no rule allows, default is deny!
# httpd_t trying to access user_home_t → DENIED
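The default-deny behaviour can be modelled as a lookup over an allow list. This is a toy sketch: the real kernel consults a compiled policy through an access vector cache, and the `te_rule` struct, `te_policy` table, and `te_allowed` name are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct te_rule {
    const char *source;  // process (domain) type
    const char *target;  // object type
    const char *tclass;  // object class
    const char *perm;    // a single permission
};

// A tiny allow list mirroring some of the rules above.
const struct te_rule te_policy[] = {
    { "httpd_t", "httpd_sys_content_t", "file", "read"   },
    { "httpd_t", "httpd_sys_content_t", "file", "open"   },
    { "httpd_t", "httpd_log_t",         "file", "append" },
};

// Default deny: access is granted only if a rule matches all four fields.
bool te_allowed(const char *source, const char *target,
                const char *tclass, const char *perm) {
    assert(source && target && tclass && perm);
    for (size_t i = 0; i < sizeof te_policy / sizeof te_policy[0]; i++) {
        const struct te_rule *r = &te_policy[i];
        if (!strcmp(r->source, source) && !strcmp(r->target, target) &&
            !strcmp(r->tclass, tclass) && !strcmp(r->perm, perm))
            return true;
    }
    return false;  // no matching allow rule
}
```

Note that there is no "deny" entry for `user_home_t`: the request fails simply because nothing in the table grants it.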
SELinux Modes:

Enforcing

# Active enforcement
getenforce
# Enforcing

# Violations blocked
# AVC denials logged

sestatus
# SELinux status: enabled
# Current mode: enforcing

Permissive

# Log-only mode
setenforce 0

getenforce
# Permissive

# Violations logged
# but NOT blocked

# Good for debugging

Disabled

# SELinux completely off

# Edit /etc/selinux/config
SELINUX=disabled

# Reboot required

# NO security benefit!
Debugging SELinux:
# View denials
ausearch -m avc -ts recent

# Example denial
type=AVC msg=audit(1234567890.123:456): avc: denied { read } for pid=1234
  comm="httpd" name="secret.txt" dev="sda1" ino=123456
  scontext=system_u:system_r:httpd_t:s0
  tcontext=system_u:object_r:user_home_t:s0
  tclass=file permissive=0

# Translation: httpd_t tried to read user_home_t file → DENIED

# Generate policy module to allow
audit2allow -a -M mypolicy
# module mypolicy 1.0;
# require {
#     type httpd_t;
#     type user_home_t;
#     class file read;
# }
# allow httpd_t user_home_t:file read;

# Install policy module
semodule -i mypolicy.pp

# List loaded modules
semodule -l

# Remove module
semodule -r mypolicy
SELinux Booleans (runtime toggles):
# List all booleans
getsebool -a | grep httpd
# httpd_can_network_connect --> off
# httpd_can_network_connect_db --> off
# httpd_enable_cgi --> on

# Enable httpd network connections
setsebool -P httpd_can_network_connect on

# -P makes it persistent across reboot

4.2 AppArmor

AppArmor is path-based MAC (vs SELinux’s label-based).
SELinux: "Process with label X can access file with label Y"
  → Requires labeling entire filesystem

AppArmor: "Process can access /var/www/* with read permission"
  → Based on filesystem paths (easier to understand)
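Path-based matching can be illustrated with a small glob matcher. This sketch covers only the two wildcards used in the profiles in this section, assuming `*` matches within one path component and `**` may also cross `/`; real AppArmor globbing has more syntax, and `glob_match` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Matches `path` against `pat`, where "*" matches any run of characters
// except '/', and "**" matches any run of characters including '/'.
bool glob_match(const char *pat, const char *path) {
    assert(pat != NULL && path != NULL);
    if (*pat == '\0')
        return *path == '\0';
    if (pat[0] == '*' && pat[1] == '*') {
        // Try every possible split point, including across '/'
        for (const char *p = path; ; p++) {
            if (glob_match(pat + 2, p)) return true;
            if (*p == '\0') return false;
        }
    }
    if (pat[0] == '*') {
        // Try every split point, but stop at a path separator
        for (const char *p = path; ; p++) {
            if (glob_match(pat + 1, p)) return true;
            if (*p == '\0' || *p == '/') return false;
        }
    }
    // Literal character: must match exactly
    return *pat == *path && glob_match(pat + 1, path + 1);
}
```

Because the decision depends on the path string alone, the same file reached through a different path (for example a hard link or bind mount) may be treated differently, which is the usual argument for label-based systems.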
AppArmor Profile:
# /etc/apparmor.d/usr.sbin.nginx
#include <tunables/global>

/usr/sbin/nginx {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  capability dac_override,
  capability net_bind_service,
  capability setgid,
  capability setuid,

  /etc/nginx/** r,
  /var/log/nginx/** rw,
  /var/www/** r,
  /run/nginx.pid rw,

  # Network
  network inet stream,
  network inet6 stream,

  # Deny everything else (default)
}
Profile Modes:
# Enforce mode
aa-enforce /usr/sbin/nginx

# Complain mode (log-only)
aa-complain /usr/sbin/nginx

# Disable profile
aa-disable /usr/sbin/nginx

# View status
aa-status
# apparmor module is loaded.
# 12 profiles are loaded.
# 10 profiles are in enforce mode.
# 2 profiles are in complain mode.
Creating Profiles:
# Generate profile automatically
aa-genprof /usr/bin/myapp

# Steps:
# 1. Runs the app in learning (complain) mode
# 2. You exercise all app functionality
# 3. Reviews the logged accesses with you
# 4. Generates the profile

# Manually create profile
cat > /etc/apparmor.d/usr.bin.myapp <<EOF
/usr/bin/myapp {
  #include <abstractions/base>

  /etc/myapp/** r,
  /var/lib/myapp/** rw,
  /tmp/** rw,

  capability net_bind_service,

  deny /etc/shadow r,
}
EOF

# Load profile
apparmor_parser -r /etc/apparmor.d/usr.bin.myapp
SELinux vs AppArmor:
Feature         | SELinux               | AppArmor
----------------|-----------------------|---------------------
Granularity     | Very fine (labels)    | Coarse (paths)
Complexity      | High                  | Low
Performance     | Small overhead        | Very small
Learning curve  | Steep                 | Gentle
Flexibility     | Maximum               | Good
Default         | RHEL, Fedora, CentOS  | Debian, Ubuntu, SUSE

5. Microarchitectural Attacks & Mitigations

5.1 Spectre & Meltdown

Speculative Execution: CPU predicts branch and executes ahead, then discards if wrong.
// Vulnerable code
if (x < array1_size) {         // Bounds check
    y = array2[array1[x]];     // Out of bounds if run with x >= array1_size
}

Without Speculation:
1. Check: x < array1_size?
2. If true, execute access
3. If false, skip

With Speculation (vulnerable):
1. CPU predicts branch will be taken
2. Speculatively executes: y = array2[array1[x]]
   Even if x >= array1_size!
3. Loads array1[x] (out of bounds!)
4. Uses it to index array2
5. array2[...] brought into cache ← SIDE EFFECT!
6. Branch misprediction detected
7. Architectural state rolled back
8. BUT: Cache state NOT rolled back!

Attacker observes cache timing → leaks array1[x]!
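One common software mitigation for this pattern is to clamp the index without a branch, so that even a mispredicted path cannot use an out-of-range value. The sketch below is modelled on the idea behind the Linux kernel's index-masking helper; the names are illustrative, it assumes arithmetic right shift of negative values (true for GCC and Clang), and a production version also needs a compiler barrier so the mask is not optimized away.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

// Returns all-ones if index < size, otherwise zero, without branching.
unsigned long index_mask(unsigned long index, unsigned long size) {
    assert(size <= LONG_MAX);
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

// Bounds-checked load whose index is forced to 0 if it is out of range,
// even when the branch is speculatively mispredicted.
uint8_t load_nospec(const uint8_t *array, unsigned long size, unsigned long x) {
    if (x < size) {
        x &= index_mask(x, size);
        return array[x];
    }
    return 0;
}
```

The mask is computed from data rather than from the branch outcome, so the speculative load can touch `array[0]` at worst instead of attacker-chosen memory.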
Meltdown (CVE-2017-5754):
// Leak kernel memory from user space

// 1. Flush every probe line from the cache
for (int i = 0; i < 256; i++)
    clflush(&probe_array[i * 4096]);

// 2. Access kernel memory (faults, but dependent instructions run transiently first)
unsigned char kernel_byte = *(unsigned char *)kernel_address;

// 3. Use leaked byte to index array
char dummy = probe_array[kernel_byte * 4096];
// This brings probe_array[kernel_byte * 4096] into cache

// 4. The fault is raised when the load retires and the transient
// instructions are squashed (the attacker catches or suppresses the fault)
// But probe_array[...] is NOW in cache!

// 5. Time accesses to probe_array
for (int i = 0; i < 256; i++) {
    t0 = rdtsc();
    dummy = probe_array[i * 4096];
    t1 = rdtsc();

    if (t1 - t0 < THRESHOLD) {
        // Cache hit! i == kernel_byte
        printf("Leaked kernel byte: 0x%02x\n", i);
    }
}

// Result: Leaked kernel memory byte-by-byte (the original paper reports rates of up to about 500 KB/s)
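The final decoding step is independent of how the timings were obtained, so it can be shown with plain data. A minimal sketch, assuming one latency measurement per candidate byte value; `recover_byte` and the threshold are illustrative, and real attacks repeat the measurement many times to filter noise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Given one access latency per candidate byte value, returns the value whose
// probe line was fastest, or -1 if no latency is below the hit threshold.
int recover_byte(const uint64_t latency[256], uint64_t hit_threshold) {
    assert(latency != NULL);
    int best = -1;
    for (int i = 0; i < 256; i++) {
        if (latency[i] < hit_threshold &&
            (best < 0 || latency[i] < latency[best]))
            best = i;
    }
    return best;
}
```

Choosing the minimum below the threshold, rather than the first hit, makes the decoder a little more tolerant of a stray cached line.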
Mitigation: KPTI (Kernel Page Table Isolation):
Without KPTI:
┌────────────────────────────────────┐
│  User Page Tables                  │
│  ┌──────────────────────────────┐  │
│  │  User Space Mappings          │  │
│  │  0x00000000 - 0x7fffffffffff  │  │
│  ├──────────────────────────────┤  │
│  │  Kernel Space Mappings        │  │
│  │  0xffff800000000000 - ...     │  │ ← Meltdown can read this!
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

With KPTI (two sets of page tables):
┌────────────────────────────────────┐
│  User Page Tables                  │
│  ┌──────────────────────────────┐  │
│  │  User Space Mappings          │  │
│  ├──────────────────────────────┤  │
│  │  Minimal Kernel (entry/exit)  │  │ ← Only small trampoline
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

┌────────────────────────────────────┐
│  Kernel Page Tables                │
│  ┌──────────────────────────────┐  │
│  │  User Space Mappings          │  │
│  ├──────────────────────────────┤  │
│  │  Full Kernel Space            │  │ ← Full kernel mapped
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

On syscall: Switch from User PT → Kernel PT (CR3 register swap)
On return:  Switch from Kernel PT → User PT

Cost: roughly 5-30% on syscall-heavy workloads (every kernel entry and exit now switches CR3); PCID support reduces the TLB-flush cost
Kernel Implementation (simplified from arch/x86/mm/pti.c):
// Enable KPTI (conceptual sketch; the real pti_init() clones only the few
// kernel regions that the user copy of the page tables needs)
void pti_init(void) {
    if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN))
        return;  // CPU not vulnerable

    pr_info("Kernel/User page tables isolation: enabled\n");

    // Allocate separate user page tables
    pgd_t *user_pgd = (pgd_t *)__get_free_page(GFP_KERNEL);

    // Copy user space mappings
    clone_pgd_range(user_pgd, kernel_pgd, KERNEL_PGD_PTRS);

    // Map minimal kernel trampolines (entry/exit stubs)
    map_entry_trampoline(user_pgd);

    // The user tables are not installed in mm->pgd; they are loaded into
    // CR3 on each return to user mode by the entry/exit code
}

// Syscall entry: switch to kernel page tables (simplified)
ENTRY(entry_SYSCALL_64)
    swapgs                        // Swap GS (per-CPU data)
    movq %rsp, PER_CPU_VAR(rsp_scratch)    // Stash the user stack pointer

    /* Switch page tables first; %rsp is free to use as scratch */
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp  // Load kernel CR3

    /* The kernel stack is now mapped: load it */
    movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

    /* ... handle syscall ... */

    SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi  // Load user CR3
    swapgs
    sysretq
END(entry_SYSCALL_64)
Spectre (CVE-2017-5753/5715):
// Bounds Check Bypass (Spectre v1, CVE-2017-5753)
// (Spectre v2, CVE-2017-5715, instead poisons indirect branch prediction)

// Victim code
if (x < array1_size) {
    y = array2[array1[x] * 256];
}

// Attacker code
void attack() {
    // 1. Train branch predictor
    for (int i = 0; i < 1000; i++) {
        victim_function(valid_x);  // Train: branch TAKEN
    }

    // 2. Flush the probe array (array2) and the bound from the cache
    for (int i = 0; i < 256; i++)
        clflush(&array2[i * 256]);
    clflush(&array1_size);  // Slow bounds check = longer speculation window

    // 3. Call with malicious x
    victim_function(malicious_x);  // x >= array1_size

    // CPU speculatively executes (branch predictor says TAKEN)
    // Leaks out-of-bounds memory into cache

    // 4. Time side-channel to recover
    for (int i = 0; i < 256; i++) {
        t0 = rdtsc();
        dummy = array2[i * 256];
        t1 = rdtsc();

        if (t1 - t0 < THRESHOLD) {
            printf("Leaked: 0x%02x\n", i);
        }
    }
}

// Result: Can leak arbitrary memory across privilege boundaries!
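Mitigation: Retpoline (Return Trampoline):
; Traditional indirect jump (vulnerable)
jmp *%rax

; Retpoline (safe)
    call set_up_target     ; Pushes the address of capture_spec; the RSB records it
capture_spec:
    pause                  ; Speculation is trapped here
    lfence                 ; Stop further speculative execution
    jmp capture_spec       ; Loop is only ever executed speculatively
set_up_target:
    mov %rax, (%rsp)       ; Overwrite the return address with the real target
    ret                    ; Architecturally "returns" to *%rax

; The ret is predicted from the return stack buffer (RSB), which points at
; capture_spec, not from the attacker-trainable indirect branch predictor
; Indirect jump converted to return instruction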
Kernel Implementation:
# Compiler flags (Makefile)
KBUILD_CFLAGS += -mindirect-branch=thunk-extern
KBUILD_CFLAGS += -mindirect-branch-register

// Generated code
// Before:
//   call *%rax
// After:
//   call __x86_indirect_thunk_rax

__x86_indirect_thunk_rax:
    call set_up_target
capture_spec:
    pause                 // Reached only speculatively
    lfence
    jmp capture_spec
set_up_target:
    mov %rax, (%rsp)      // Executed for real: overwrite return address with target
    ret                   // Returns to *%rax
Hardware Mitigations:
# Check CPU vulnerabilities
cat /sys/devices/system/cpu/vulnerabilities/*

# meltdown: Mitigation: PTI
# spectre_v1: Mitigation: usercopy/swapgs barriers and __user pointer sanitization
# spectre_v2: Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, RSB filling

# IBRS (Indirect Branch Restricted Speculation)
# IBPB (Indirect Branch Predictor Barrier)
# STIBP (Single Thread Indirect Branch Predictors)
# SSBD (Speculative Store Bypass Disable)

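# Disable mitigations (for benchmarking)
# WARNING: Insecure!
# Mainline kernels: boot with nopti, nospectre_v2, or mitigations=off
# Some distribution kernels (e.g. RHEL 7) also expose debugfs switches:
echo 0 > /sys/kernel/debug/x86/pti_enabled
echo 0 > /sys/kernel/debug/x86/retp_enabled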
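
A monitoring agent often needs to turn these sysfs status lines into a coarse state. The sketch below assumes the three line prefixes the kernel documents for these files ("Not affected", "Mitigation:", "Vulnerable"); the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Coarse state of one file under /sys/devices/system/cpu/vulnerabilities/. */
enum vuln_state {
    VULN_NOT_AFFECTED,
    VULN_MITIGATED,
    VULN_VULNERABLE,
    VULN_UNKNOWN
};

/* Classify a status line by its leading keyword (hypothetical helper). */
enum vuln_state classify_vuln_status(const char *line)
{
    assert(line != NULL);
    if (strncmp(line, "Not affected", 12) == 0)
        return VULN_NOT_AFFECTED;
    if (strncmp(line, "Mitigation:", 11) == 0)
        return VULN_MITIGATED;
    if (strncmp(line, "Vulnerable", 10) == 0)
        return VULN_VULNERABLE;
    return VULN_UNKNOWN;   /* unrecognized text: treat as needing review */
}
```

Feeding it each line read from the directory gives a quick fleet-wide check; anything classified as VULN_VULNERABLE or VULN_UNKNOWN deserves a closer look.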

5.2 Rowhammer

DRAM vulnerability: Rapidly accessing one row can flip bits in adjacent rows.
DRAM Organization:
┌─────────────────────────────────┐
│  Bank 0                         │
│  ┌───────────────────────────┐  │
│  │ Row 0   [data]            │  │
│  │ Row 1   [data] ← Target   │  │ ← Victim row
│  │ Row 2   [data]            │  │ ← Hammered (read repeatedly)
│  │ Row 3   [data] ← Target   │  │ ← Victim row
│  │ ...                       │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘

Attack:
1. Find adjacent rows in DRAM
2. Rapidly read from Row 2 (millions of times)
3. Electrical interference causes bit flips in Row 1 and Row 3
4. Attacker doesn't directly access victim rows!
Exploitation:
// Rowhammer exploit (simplified)

// 1. Spray memory with target pattern
unsigned char *spray[1000];  // unsigned, so the 0xFF comparison below is exact
for (int i = 0; i < 1000; i++) {
    spray[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    memset(spray[i], 0xFF, 4096);  // All bits set
}

// 2. Find aggressor rows (via DRAM addressing)
//    Rows 2 and 4 sandwich victim row 3 (double-sided hammering)
volatile uint64_t *hammer_addr1 = find_row_address(2);
volatile uint64_t *hammer_addr2 = find_row_address(4);

// 3. Hammer rows
for (int i = 0; i < 1000000; i++) {
    (void)*hammer_addr1;  // Volatile read (causes DRAM row activation)
    (void)*hammer_addr2;
    clflush((const void *)hammer_addr1);  // Evict from cache (force DRAM access)
    clflush((const void *)hammer_addr2);
}

// 4. Check for bit flips in victim rows
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 4096; j++) {
        if (spray[i][j] != 0xFF) {
            printf("Bit flip at %p: 0x%02x\n", (void *)&spray[i][j], spray[i][j]);
        }
    }
}

// Real attacks:
// - Flip bit in page table → gain access to kernel memory
// - Flip bit in SELinux context → privilege escalation
// - Flip bit in RSA key → factor private key
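
The flip check in step 4 reports whole bytes, while an attacker usually cares about which individual bits changed. A minimal sketch of a bit-level counter follows; the function name is illustrative and the buffer is assumed to have been filled with a single known pattern before hammering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the bits in buf that differ from the fill pattern. */
size_t count_bit_flips(const uint8_t *buf, size_t len, uint8_t pattern)
{
    size_t flips = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        uint8_t diff = buf[i] ^ pattern;   /* 1-bits mark flipped positions */
        while (diff) {
            diff &= (uint8_t)(diff - 1);   /* clear the lowest set bit */
            flips++;
        }
    }
    return flips;
}
```

Running it once with a 0xFF fill and once with a 0x00 fill separates 1→0 flips from 0→1 flips, which matters because a given DRAM cell tends to flip in only one direction.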
Mitigations:

ECC Memory

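Error-Correcting Code (ECC):
- Typical SECDED ECC corrects single-bit errors in each protected word
- Detects (but can't correct) double-bit errors; three or more flips in one word can go unnoticed
- Raises the bar rather than eliminating the attack: Rowhammer against ECC memory has been demonstrated (ECCploit)

Cost: extra DRAM chips per module (commonly quoted as ~10% more expensive)
Performance: Slight overhead

Widely used in servers, rare in consumer devices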

Target Row Refresh (TRR)

Hardware solution by DRAM vendors:

- Monitor row activation counters
- If row accessed frequently, refresh adjacent rows
- Prevents charge leak that causes bit flips

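Implemented in DDR4/LPDDR4 modules (vendor-specific and largely undocumented); DDR5 adds further refresh-management features

Effectiveness: Good but not perfect
(many-sided hammering patterns such as TRRespass overwhelm the limited number of rows a TRR implementation can track)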

Software Mitigations

# Limit cache flush instructions (clflush)
# (Requires kernel patch)

# Increase DRAM refresh rate
# (Reduces performance)

# Memory deduplication disabled
echo 0 > /sys/kernel/mm/ksm/run

# Prevent predictable physical addresses
# (KASLR + physical address randomization)

OS-Level

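// Rowhammer detection sketch (in the spirit of the ANVIL research prototype)
// Not a mainline kernel feature; read_pmc(), LLC_MISS_EVENT and
// refresh_neighbor_rows() are hypothetical names

void dram_protect(void) {
    // Monitor last-level cache misses: every miss reaches DRAM, so a high
    // miss rate on a few addresses is a proxy for repeated row activations
    u64 llc_misses = read_pmc(LLC_MISS_EVENT);

    if (llc_misses > THRESHOLD) {
        // Potential rowhammer attack
        // Read from the rows next to the hot rows; reading a row recharges its cells
        refresh_neighbor_rows();

        // Log for analysis
        pr_warn("Potential Rowhammer attack detected\n");
    }
}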

6. Sandboxing Techniques

6.1 Namespaces (Containers)

Linux namespaces isolate resources between processes.
// 8 types of namespaces (time namespaces were added in Linux 5.6)

#define CLONE_NEWNS     0x00020000  // Mount namespace
#define CLONE_NEWUTS    0x04000000  // UTS (hostname) namespace
#define CLONE_NEWIPC    0x08000000  // IPC namespace
#define CLONE_NEWPID    0x20000000  // PID namespace
#define CLONE_NEWNET    0x40000000  // Network namespace
#define CLONE_NEWUSER   0x10000000  // User namespace
#define CLONE_NEWCGROUP 0x02000000  // Cgroup namespace
#define CLONE_NEWTIME   0x00000080  // Time namespace
Creating Isolated Environment:
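#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];  // 4 KiB is too small for a real child

int sandbox_init(void *arg) {
    (void)arg;

    // Make mounts private so changes do not propagate to the host
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
        perror("mount");
        return 1;
    }

    // New hostname (visible only inside the UTS namespace)
    sethostname("sandbox", 7);

    // New root filesystem (production sandboxes prefer pivot_root,
    // because chroot alone can be escaped by a privileged process)
    if (chroot("/var/sandbox") == -1 || chdir("/") == -1) {
        perror("chroot");
        return 1;
    }

    // Execute sandboxed program
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return 1;
}

int main(void) {
    // Create new namespaces (requires CAP_SYS_ADMIN, i.e. run as root)
    int flags = CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID |
                CLONE_NEWNET | CLONE_NEWIPC;

    // Clone process with new namespaces; the stack grows down,
    // so pass a pointer to the top of the buffer
    pid_t pid = clone(sandbox_init, child_stack + STACK_SIZE,
                      flags | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");
        return EXIT_FAILURE;
    }

    waitpid(pid, NULL, 0);
    return 0;
}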
PID Namespace (process isolation):
// Parent namespace
// PID 1: init
// PID 123: parent
// PID 124: child (clone with CLONE_NEWPID)

// Inside child's PID namespace
getpid();  // Returns 1 (child is PID 1 in its namespace)

// Child can only see processes in its namespace
// (ps aux shows only those once /proc is remounted inside the namespace)

// Parent can still see child
// PID 124 in parent namespace == PID 1 in child namespace
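
One way to confirm that two processes really live in different namespaces is to compare the namespace handles under /proc. The sketch below assumes a mounted /proc and permission to stat the other process's ns entries; the helper name is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if both PIDs share the namespace of the given type
 * ("pid", "net", "mnt", ...), 0 if they differ, -1 on error. */
int same_namespace(pid_t a, pid_t b, const char *type)
{
    char path_a[64], path_b[64];
    struct stat sa, sb;

    assert(type != NULL);
    snprintf(path_a, sizeof(path_a), "/proc/%ld/ns/%s", (long)a, type);
    snprintf(path_b, sizeof(path_b), "/proc/%ld/ns/%s", (long)b, type);

    /* stat() follows the magic link; device + inode identify the namespace */
    if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

From the parent, `same_namespace(getpid(), child_pid, "pid")` would be expected to return 0 for a child created with CLONE_NEWPID and 1 for an ordinary fork.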
Network Namespace (network isolation):
# Create new network namespace
ip netns add sandbox

# Execute command in namespace
ip netns exec sandbox ip link list
# 1: lo: <LOOPBACK> state DOWN
# (Only loopback, no network access!)

# Create virtual interface pair
ip link add veth0 type veth peer name veth1

# Move veth1 to sandbox namespace
ip link set veth1 netns sandbox

# Configure networking
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up

ip netns exec sandbox ip addr add 10.0.0.2/24 dev veth1
ip netns exec sandbox ip link set veth1 up

# Now sandbox can communicate via veth interface

6.2 Chrome Multi-Process Sandbox

Chrome Architecture:
────────────────────

┌───────────────────────────────────────────────┐
│             Browser Process                    │
│  - Runs with full privileges                  │
│  - Manages windows, tabs, plugins             │
│  - Opens files, sockets on behalf of renderers│
│  - Passes FDs via SCM_RIGHTS                  │
└────────────┬──────────────────────────────────┘

     ┌───────┼───────┬──────────┐
     │       │       │          │
┌────▼─────┐ │  ┌────▼─────┐  ┌▼──────────┐
│ Renderer │ │  │ Renderer │  │   GPU     │
│ (Tab 1)  │ │  │ (Tab 2)  │  │  Process  │
│          │ │  │          │  │           │
│ Sandbox: │ │  │ Sandbox: │  │ Sandbox:  │
│ - seccomp│ │  │ - seccomp│  │ - seccomp │
│ - No FS  │ │  │ - No FS  │  │ - Limited │
│ - No net │ │  │ - No net │  │   access  │
└──────────┘ │  └──────────┘  └───────────┘

        ┌────▼─────┐
        │  Plugin  │
        │ Process  │
        │ (Flash)  │
        │ Sandbox  │
        └──────────┘

Sandbox Restrictions (Linux):
1. Seccomp-BPF: Allow only ~30 syscalls
2. Namespaces: PID, NET, IPC isolation
3. chroot: Fake root filesystem
4. No setuid/setgid
5. No capabilities
6. Read-only /proc, /sys
Chrome Sandbox Code (simplified from sandbox/linux/):
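// Renderer process startup (illustrative; Chrome's real sandbox uses a
// setuid helper or user namespaces plus its own seccomp-bpf policy code)

void RendererMain() {
    // 1. Enter namespaces (needs privileges, so do this first)
    unshare(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWIPC);

    // 2. chroot to empty directory
    chroot("/var/empty");
    chdir("/");

    // 3. Drop privileges (group first: setgid fails once the UID is dropped)
    setgid(nobody_gid);
    setuid(nobody_uid);

    // 4. Drop all remaining capabilities
    drop_all_capabilities();

    // 5. Enable no_new_privs (required before an unprivileged process may install a filter)
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

    // 6. Install seccomp filter last, so the setup calls above are not blocked
    install_renderer_seccomp_filter();

    // 7. Run renderer
    RunRendererLoop();
}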

void install_renderer_seccomp_filter() {
    // Allow only essential syscalls
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

    // Read/write/close
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);

    // Memory management
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);

    // IPC (to browser process)
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(recvmsg), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sendmsg), 0);

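    // Process exit
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Everything else (open, socket, execve, fork, ...) hits the default KILL action

    seccomp_load(ctx);
    seccomp_release(ctx);  // Free the libseccomp context; the loaded filter stays active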
}
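Escape Detection (from browser process):
// Browser monitors renderer health (illustrative sketch)
// The tracer is notified only for syscalls whose filter action is
// SECCOMP_RET_TRACE; SECCOMP_RET_KILL ends the renderer without a ptrace stop.

void MonitorRenderer(int renderer_pid) {
    // Ask to be notified when the seccomp filter returns SECCOMP_RET_TRACE
    ptrace(PTRACE_SEIZE, renderer_pid, NULL, PTRACE_O_TRACESECCOMP);

    while (1) {
        int status;
        if (waitpid(renderer_pid, &status, 0) == -1)
            break;  // Renderer no longer exists

        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;  // Renderer exited or was killed

        if (WIFSTOPPED(status) && status >> 8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8))) {
            // Seccomp violation detected!
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, renderer_pid, NULL, &regs);

            long syscall_nr = regs.orig_rax;
            log_security_violation(renderer_pid, syscall_nr);

            // Kill malicious renderer
            kill(renderer_pid, SIGKILL);
            waitpid(renderer_pid, &status, 0);  // Reap it
            respawn_renderer();
            break;
        }

        // Any other stop: let the renderer continue
        // (simplified: real code would re-inject pending signals)
        ptrace(PTRACE_CONT, renderer_pid, NULL, 0);
    }
}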
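
The status test in the monitor loop is easy to get wrong, so it can help to isolate it in a small predicate. This is a sketch assuming glibc's `<sys/ptrace.h>`, which provides `PTRACE_EVENT_SECCOMP`; the function name is illustrative.

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Returns 1 if a waitpid() status describes a PTRACE_EVENT_SECCOMP stop. */
int is_seccomp_stop(int status)
{
    if (!WIFSTOPPED(status))
        return 0;   /* exited, killed, or continued: not a ptrace stop */

    /* For ptrace event stops, the upper bits of the status carry
     * SIGTRAP in the low byte and the event code in the next byte. */
    return (status >> 8) == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8));
}
```

A plain signal stop such as SIGSTOP has no event code in the upper byte, so the predicate rejects it.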

7. Interview Questions & Answers

NX (No-Execute) / DEP (Data Execution Prevention) uses the CPU’s NX bit in page table entries.
Page Table Entry Structure (x86-64):
  • Bit 63: NX (No-Execute) bit
  • When set: Page cannot be executed (will fault with #PF if IP points here)
  • When clear: Page is executable
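
As a small illustration of the bit layout above, the check the MMU performs on an instruction fetch can be modelled in a few lines. The macro and function names are made up for this sketch, and it ignores the higher page-table levels, whose NX bits also apply.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)    /* bit 0: page is mapped */
#define PTE_NX      (1ULL << 63)   /* bit 63: no-execute */

/* Returns 1 if an instruction fetch from this page would be allowed. */
int pte_allows_exec(uint64_t pte)
{
    return (pte & PTE_PRESENT) && !(pte & PTE_NX);
}
```

The NX bit is only honoured when the CPU has the feature enabled (EFER.NXE on x86-64), which every 64-bit Linux kernel does at boot.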
Kernel Implementation:
// When mapping stack
vma->vm_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
// NO VM_EXEC flag!

// Page table entry will have NX bit SET
pte = pfn_pte(pfn, vma->vm_page_prot);  // Protection derived from vm_flags (no VM_EXEC → NX)
set_pte(pte_addr, pte);
Protection:
  1. Attacker overflows buffer on stack
  2. Injects shellcode
  3. Overwrites return address to point to shellcode
  4. Function returns, jumps to shellcode address
  5. CPU checks NX bit on the instruction fetch → Page is not executable
  6. #PF (Page Fault) → Kernel delivers SIGSEGV, which kills the process by default
W^X Policy: Page is writable OR executable, never both.
  • Stack/Heap: Writable, NOT executable
  • Code: Executable, NOT writable
  • Prevents: Code injection attacks
Bypass: Return-Oriented Programming (ROP) - reuse existing executable code instead of injecting new code.
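
A quick way to audit the W^X policy on a live process is to look for mappings that are both writable and executable. The sketch below assumes the Linux `/proc/<pid>/maps` line format ("start-end perms offset dev inode path"); both helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if one maps-format line describes a writable AND executable
 * mapping, 0 otherwise. */
int maps_line_is_wx(const char *line)
{
    char perms[8] = {0};
    assert(line != NULL);
    /* Skip the address range, then read the permission field, e.g. "rw-p" */
    if (sscanf(line, "%*s %7s", perms) != 1 || strlen(perms) < 3)
        return 0;
    return perms[1] == 'w' && perms[2] == 'x';
}

/* Counts W+X mappings in a maps file; returns -1 if it cannot be opened. */
int count_wx_mappings(const char *maps_path)
{
    char line[4352];   /* room for the fixed fields plus a long path */
    int count = 0;
    FILE *f = fopen(maps_path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
        count += maps_line_is_wx(line);
    fclose(f);
    return count;
}
```

Calling `count_wx_mappings("/proc/self/maps")` in an ordinary compiled program is expected to find none; JIT engines are the usual exception, and they typically toggle between writable and executable instead of holding both.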
ASLR (Address Space Layout Randomization) randomizes memory layout at program start.
Randomized Regions:
  • Stack base address
  • Heap base address
  • Libraries (libc, etc.)
  • Executable base (if PIE - Position Independent Executable)
  • vDSO, vvar
Entropy (x86-64 Linux; approximate figures that vary with kernel version and the vm.mmap_rnd_bits setting):
  • Stack: 19 bits → 524,288 possible positions
  • Heap: 13 bits → 8,192 possible positions
  • Libraries: 28 bits → 268 million possible positions
  • PIE executable: 28 bits → 268 million possible positions
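
The arithmetic behind these figures is simply 2^bits. A small sketch also shows what that means for an attacker who can retry against a layout that stays fixed between attempts; the function names are illustrative and the uniform-distribution assumption is an idealization.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct base addresses for a given amount of entropy. */
uint64_t aslr_positions(unsigned bits)
{
    assert(bits < 64);
    return 1ULL << bits;
}

/* Expected number of guesses when the layout does NOT change between
 * attempts and the attacker never repeats a guess: (N + 1) / 2. */
double expected_guesses_fixed_layout(unsigned bits)
{
    return ((double)aslr_positions(bits) + 1.0) / 2.0;
}
```

For 13 bits this gives 8,192 positions and about 4,096 expected guesses, which is why low-entropy regions are realistic brute-force targets when a service restarts with the same layout.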
How It Prevents Exploitation:
Traditional exploit (no ASLR):
Attacker knows: libc is at 0x7ffff7a0d000
Attacker's ROP chain:
  return to 0x7ffff7a52390 (system)
  argument: 0x7ffff7b99d88 ("/bin/sh")

Works every time!
With ASLR:
Run 1: libc at 0x7f8a2e456000
Run 2: libc at 0x7f3c81de2000
Run 3: libc at 0x7fb1c9a2f000

Attacker's hardcoded addresses: WRONG!
Exploit crashes instead of succeeding
Weaknesses:
  1. Information Leak:
    • Pointer disclosure → calculate base addresses → bypass ASLR
    • Format string bugs, memory corruption leaks
  2. Entropy Limitations:
    • 13 bits (heap) = 8,192 attempts
    • If process doesn’t crash (fork server), brute-forceable
  3. 32-bit Systems:
    • Limited address space → low entropy
    • 8 bits library randomization → 256 attempts
  4. Non-PIE Executables:
    • Main executable at fixed address
    • Contains ROP gadgets at known addresses
  5. Cache Timing Attacks:
    • Side-channel attacks can determine addresses
Mitigations for Weaknesses:
  • Use PIE (Position Independent Executable)
  • Fix information leaks
  • Avoid fork-only worker models: re-exec after a crash so every attempt faces a fresh layout
  • Use Control Flow Integrity (CFI)
  • Combine with other defenses (NX, stack canaries)
Stack Canary: Random value placed between local variables and return address.
Mechanism:
Stack Frame:
┌──────────────────┐ High address
│ Return Address   │ ← Protected
├──────────────────┤
│ Saved RBP        │
├──────────────────┤
│ CANARY (random)  │ ← __stack_chk_guard (stored in TLS)
├──────────────────┤
│ Local vars       │
│ char buf[100]    │ ← Overflow starts here
└──────────────────┘ Low address

Function Prologue:
  mov rax, fs:0x28      ; Load canary from TLS
  mov [rbp-8], rax      ; Store on stack

Function Epilogue:
  mov rax, [rbp-8]      ; Load stack canary
  xor rax, fs:0x28      ; Compare with original
  je .L_OK              ; Match? OK
  call __stack_chk_fail ; Mismatch? ABORT
.L_OK:
  ret
Detection:
  1. Buffer overflow overwrites local variables
  2. Overflow continues, overwrites canary
  3. Function returns
  4. Compiler-inserted epilogue checks: stack_canary == __stack_chk_guard?
  5. Mismatch → Stack smashing detected! → abort()
Bypass Techniques:
1. Leak Canary:
// Format string vulnerability
printf(user_input);  // User provides: "%p %p %p ..."
// Leaks stack contents, including canary!

// Attacker:
// 1. Leak canary value
// 2. Craft overflow to include correct canary value
// 3. Overflow succeeds without detection
2. Overwrite Pointer Before Canary:
char buf[100];
char *ptr = &authorized;
unsigned long canary;
void *return_address;

// Overflow overwrites ptr but not canary
strcpy(buf, malicious_input);  // Overflow only buf and ptr

// ptr now points to attacker-controlled memory
// Canary intact → No detection!
3. Fork Without Re-randomization (forking servers):
// Parent forks children with same canary
while (1) {
    if (fork() == 0) {
        handle_request();  // Child inherits the parent's canary
        exit(0);
    }
}

// Attacker brute-forces canary byte-by-byte
// Try 0x00, 0x01, 0x02, ... 0xFF for first byte
// If child crashes: wrong guess
// If child doesn't crash: correct! Move to next byte
// 8 bytes × 256 attempts = 2,048 attempts max
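
The byte-by-byte search can be simulated without any real overflow. In the sketch below, `child_survives` stands in for the forked worker (it reports whether the guessed prefix matches the secret); both function names and the 8-byte canary length are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CANARY_LEN 8

/* Stand-in for the forking server: 1 if the first n guessed bytes match
 * the secret canary (child survives), 0 otherwise (child crashes). */
int child_survives(const uint8_t *secret, const uint8_t *guess, size_t n)
{
    return memcmp(secret, guess, n) == 0;
}

/* Recover the canary one byte at a time; returns the number of attempts. */
unsigned recover_canary(const uint8_t *secret, uint8_t *out)
{
    unsigned attempts = 0;
    for (size_t i = 0; i < CANARY_LEN; i++) {
        for (unsigned v = 0; v < 256; v++) {
            out[i] = (uint8_t)v;          /* overwrite only byte i */
            attempts++;
            if (child_survives(secret, out, i + 1))
                break;                    /* byte i confirmed, move on */
        }
    }
    assert(attempts <= CANARY_LEN * 256); /* worst case from the text */
    return attempts;
}
```

The attempt count grows linearly with the canary length instead of exponentially, which is the whole point of the attack: 8 × 256 guesses at most, versus 2^64 for guessing the full value at once.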
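4. Non-Linear Write (skipping the canary):
// A linear overflow from buf must cross the canary to reach the return
// address. A write that targets an address directly (for example through an
// attacker-controlled array index or a corrupted pointer) can skip it.
// (Requires knowledge of stack layout)

┌──────────────┐
│ Ret Addr     │ ← Written directly (e.g. low byte only)
├──────────────┤
│ Saved RBP    │ ← Skip
├──────────────┤
│ Canary       │ ← Leave untouched!
├──────────────┤
│ buf[100]     │
└──────────────┘

// The targeted write changes the return address without touching the canary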
Mitigations:
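  • Combine with ASLR (a leaked or guessed canary is of little use without known target addresses)
  • Use fortified functions (__strcpy_chk, enabled via -D_FORTIFY_SOURCE=2) to prevent overflows
  • Re-randomize canary after fork (or re-exec workers)
  • Stack Clash protection (-fstack-clash-protection stops large allocations from jumping over the stack guard page)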
Traditional setuid:
# setuid binary runs with owner's privileges

ls -l /usr/bin/passwd
# -rwsr-xr-x root root /usr/bin/passwd
#    ↑ setuid bit

# When user runs passwd:
# 1. Process starts with user's UID
# 2. Kernel sees setuid bit
# 3. Sets effective UID to file owner (root)
# 4. Process has FULL root privileges

# Problem: All or nothing!
# passwd only needs to write /etc/shadow
# But gets ALL root capabilities
Capabilities:
Divide root into 41 distinct privileges (as of Linux 5.9):

CAP_CHOWN            - Change file ownership
CAP_DAC_OVERRIDE     - Bypass file permissions
CAP_NET_BIND_SERVICE - Bind ports < 1024
CAP_NET_RAW          - Use raw sockets
CAP_SYS_ADMIN        - System administration
CAP_SYS_MODULE       - Load kernel modules
... 35 more ...

Process gets ONLY what it needs!
Comparison:
Feature        setuid                                   Capabilities
Granularity    All or nothing                           Fine-grained (41 capabilities)
Security       Over-privileged                          Least privilege
Persistence    Lost on exec (unless binary is setuid)   Can be inherited
Auditability   Hard to see why root is needed           Clear which capability is used
Example: Network Server:
// Old way: setuid root binary
int main() {
    // Running as root (UID 0)
    // Can do ANYTHING!

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    bind(sock, ..., 80);  // Bind port 80 (needs root)

    setuid(nobody);  // Drop privileges after bind

    // Problem: Race window while root
    // If exploit before setuid(), full root access!
}

// New way: Capabilities
int main() {
    // Running as nobody (UID 65534)
    // Has ONLY CAP_NET_BIND_SERVICE

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    bind(sock, ..., 80);  // Works! (has capability)

    open("/etc/shadow", O_RDONLY);  // FAIL! (no CAP_DAC_OVERRIDE)

    // Even if exploited, attacker only has port binding
    // Cannot read files, cannot exec as root, etc.
}
Setting Capabilities:
# Give binary capability instead of setuid
# Before:
chmod u+s /usr/bin/ping  # setuid root (dangerous!)

# After:
setcap cap_net_raw+ep /usr/bin/ping  # Only raw socket capability

# Verify
getcap /usr/bin/ping
# /usr/bin/ping = cap_net_raw+ep

# Now ping can create raw sockets but has NO other root powers
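
At runtime, the effective capability set of a process is visible as a hex bitmask on the `CapEff:` line of `/proc/<pid>/status`. The sketch below decodes such a line; the `EX_CAP_*` constants mirror the kernel's capability numbers (`CAP_NET_BIND_SERVICE` = 10, `CAP_NET_RAW` = 13, `CAP_SYS_ADMIN` = 21) but are defined locally under made-up names, and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Local copies of a few capability numbers (see linux/capability.h). */
enum {
    EX_CAP_NET_BIND_SERVICE = 10,
    EX_CAP_NET_RAW          = 13,
    EX_CAP_SYS_ADMIN        = 21
};

/* Returns 1 if capability number cap is set in the bitmask. */
int cap_mask_has(uint64_t mask, int cap)
{
    assert(cap >= 0 && cap < 64);
    return (int)((mask >> cap) & 1);
}

/* Parses a "CapEff:\t<hex>" line; returns 0 on success, -1 otherwise. */
int parse_capeff_line(const char *line, uint64_t *mask)
{
    unsigned long long value;
    if (sscanf(line, "CapEff: %llx", &value) != 1)
        return -1;
    *mask = value;
    return 0;
}
```

A ping process started from a binary with only cap_net_raw would be expected to show a mask of 0x2000, that is, bit 13 and nothing else.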
Why Capabilities Are Better:
  1. Principle of Least Privilege: Only grant necessary permissions
  2. Reduced Attack Surface: Exploit gets limited capabilities, not full root
  3. Better Auditability: Clear why each capability is needed
  4. Flexibility: Can grant to non-root users
  5. Inheritance: Can design capability-aware services
Real-World Usage:
  • systemd services with capabilities
  • Docker containers (run as non-root with specific capabilities)
  • Network daemons (CAP_NET_BIND_SERVICE instead of setuid)
Meltdown Vulnerability:
CPU speculatively executes kernel memory access from user mode:

// User-mode code
char kernel_byte = *(char *)0xffff880000000000;  // Kernel address

// CPU behavior:
// 1. Starts speculative execution before permission check
// 2. Loads kernel memory (should fault, but hasn't checked yet)
// 3. Uses loaded byte to index array: probe[kernel_byte * 4096]
// 4. This brings probe[...] into cache ← SIDE EFFECT!
// 5. Permission check completes → Exception!
// 6. Architectural state rolled back
// 7. But cache state remains! ← LEAK!

// Attacker measures cache timing → recovers kernel_byte
KPTI (Kernel Page Table Isolation) Solution:
Without KPTI (vulnerable):
┌─────────────────────────────┐
│ User Mode (CR3 = user_pgd)  │
│                             │
│ User virtual addresses      │
│ 0x0 - 0x7fffffffffff        │
│   │                         │
│   ├─> User pages            │
│   │                         │
│ Kernel virtual addresses    │
│ 0xffff800000000000 - ...    │ ← Mapped in user page tables!
│   │                         │ ← Meltdown can speculatively access
│   ├─> Kernel pages          │
└─────────────────────────────┘

With KPTI (secure):
User Mode:
┌─────────────────────────────┐
│ CR3 = user_pgd              │
│                             │
│ User virtual addresses      │
│ 0x0 - 0x7fffffffffff        │
│   ├─> User pages            │
│                             │
│ Kernel virtual addresses    │
│ 0xffff800000000000 - ...    │
│   ├─> MINIMAL kernel stubs  │ ← Only entry/exit trampolines!
│   │    (entry_SYSCALL_64)   │ ← Rest of kernel NOT MAPPED
└─────────────────────────────┘

Kernel Mode (after syscall):
┌─────────────────────────────┐
│ CR3 = kernel_pgd            │
│                             │
│ User virtual addresses      │
│   ├─> User pages            │
│                             │
│ Kernel virtual addresses    │
│   ├─> FULL kernel mapping   │ ← All kernel code/data accessible
└─────────────────────────────┘
Syscall Flow with KPTI:
; User-mode application
mov rax, 1        ; SYS_write
mov rdi, 1        ; fd = stdout
syscall           ; Enter kernel

; ← CPU switches to kernel mode ←

entry_SYSCALL_64:
    ; Still using user page tables!
    swapgs                    ; Swap GS (get kernel stack)

    ; SWITCH PAGE TABLES (the expensive part!)
    mov rax, CR3              ; Read current CR3 (user page table)
    or rax, 0x1000            ; Set bit to switch to kernel tables
    mov CR3, rax              ; ← PAGE TABLE SWITCH (TLB flush!)

    ; Now kernel is fully mapped
    ; Execute syscall handler...
    call do_syscall_64

    ; SWITCH BACK to user page tables
    mov rax, CR3
    and rax, ~0x1000
    mov CR3, rax              ; ← PAGE TABLE SWITCH (TLB flush!)

    swapgs
    sysretq                   ; Return to user mode
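
The CR3 switch boils down to flipping one address bit. In mainline Linux the two top-level tables are allocated as adjacent 4 KiB pages with the kernel copy first, so bit 12 selects the user copy. The sketch below models only that bit (PCID bits are ignored); the macro names follow the kernel's `PTI_USER_PGTABLE_BIT` convention, and the helper functions are made up.

```c
#include <assert.h>
#include <stdint.h>

#define PTI_USER_PGTABLE_BIT   12   /* equals PAGE_SHIFT for 4 KiB pages */
#define PTI_USER_PGTABLE_MASK  (1ULL << PTI_USER_PGTABLE_BIT)

/* CR3 value that selects the full kernel page tables. */
uint64_t cr3_to_kernel(uint64_t cr3)
{
    return cr3 & ~PTI_USER_PGTABLE_MASK;
}

/* CR3 value that selects the stripped-down user page tables. */
uint64_t cr3_to_user(uint64_t cr3)
{
    return cr3 | PTI_USER_PGTABLE_MASK;
}
```

Because the pair is 8 KiB aligned, no memory lookup is needed on the entry path: the entry code derives one table address from the other with a single AND or OR.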
Performance Cost:
What makes it expensive:
  1. CR3 Write (page table switch):
    • ~150-300 CPU cycles per switch
    • 2 switches per syscall (enter + exit)
  2. TLB Flush:
    • Translation Lookaside Buffer caches virtual→physical address translations
    • Changing CR3 flushes TLB (must reload from memory)
    • TLB misses add ~100 cycles per memory access
  3. Frequency of Syscalls:
    • I/O-heavy workloads: Many syscalls → high overhead
    • CPU-bound workloads: Few syscalls → low overhead
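
A back-of-the-envelope model ties the first and third factors together: direct switch cost is syscall rate × 2 switches × cycles per switch, divided by the clock rate. The function below is a rough sketch that deliberately ignores the TLB refill cost, so it gives a lower bound; its name and parameters are illustrative.

```c
#include <assert.h>

/* Fraction of one core's time spent purely on KPTI CR3 switches. */
double kpti_switch_overhead(double syscalls_per_sec,
                            double cycles_per_switch,
                            double cpu_hz)
{
    assert(cpu_hz > 0.0);
    /* Two switches per syscall: one on entry, one on exit */
    return (syscalls_per_sec * 2.0 * cycles_per_switch) / cpu_hz;
}
```

With 100,000 syscalls per second, 200 cycles per switch, and a 3 GHz core, the direct cost is about 1.3%; at a million syscalls per second it is several percent before TLB effects are even counted, which matches the pattern that syscall-heavy workloads suffer most.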
Measured Impact (varies by workload):
Workload Type                           Performance Loss
CPU-intensive (scientific computing)    0-3%
Light I/O (web browsing)                3-5%
Heavy I/O (file server)                 5-10%
Heavy syscalls (database, Redis)        10-30%
Optimizations:
  1. PCID (Process Context ID):
    • Tag TLB entries with PCID
    • Avoid full TLB flush on CR3 switch
    • Reduces overhead to 1-5%
  2. Lazy TLB Switching:
    • Kernel threads don’t switch page tables
    • Reuse previous user’s kernel mapping
  3. Hardware Fixes:
    • CPUs without the Meltdown bug (newer Intel parts fixed in silicon, AMD CPUs) → no KPTI needed
    • Check: cat /sys/devices/system/cpu/vulnerabilities/meltdown
    • If it says “Not affected” → KPTI not active
Disable KPTI (for testing/benchmarking only!):
# Boot parameter
nopti

# Or at runtime, only on distribution kernels that ship this debugfs switch (e.g. RHEL 7)
echo 0 > /sys/kernel/debug/x86/pti_enabled

# WARNING: Disabling KPTI leaves system vulnerable to Meltdown!
Spectre Vulnerability:
CPU Speculative Execution (illustrated with the simpler bounds check bypass, Spectre v1; branch target injection, Spectre v2, exploits the same rollback gap through indirect branch prediction):
// Victim code
if (x < array_size) {                    // ← Branch
    y = probe_array[array[x] * 4096];    // ← Speculative execution
}

CPU's Branch Predictor:
- Predicts if branch will be taken or not
- Speculatively executes ahead while check happens
- If prediction wrong, rollback
- If prediction right, save time!

Problem: Rollback discards architectural state but NOT cache state!
Attack:
// Step 1: Train branch predictor
for (int i = 0; i < 1000; i++) {
    victim_function(valid_x);  // x < array_size, branch TAKEN
}
// Branch predictor learns: "This branch is ALWAYS taken"

// Step 2: Prepare side-channel
for (int i = 0; i < 256; i++) {
    clflush(&probe_array[i * 4096]);  // Flush cache
}

// Step 3: Attack with out-of-bounds x
victim_function(malicious_x);  // malicious_x >= array_size

// What happens:
// 1. Branch predictor predicts: TAKEN (based on training)
// 2. CPU speculatively reads array[malicious_x]
// 3. This accesses out-of-bounds memory (a secret elsewhere in the victim's address space)
// 4. Uses leaked byte to index: probe_array[y * 4096]
// 5. This brings probe_array[y * 4096] into cache ← LEAK!
// 6. Branch check completes: x < array_size? FALSE
// 7. Rollback! Discard y, but cache state remains!

// Step 4: Recover leaked byte via timing
for (int i = 0; i < 256; i++) {
    t0 = rdtsc();
    temp = probe_array[i * 4096];
    t1 = rdtsc();

    if (t1 - t0 < THRESHOLD) {
        printf("Leaked byte: 0x%02x\n", i);  // Cache hit!
        break;
    }
}

// Result: Read memory the victim can access but the attacker cannot (across a privilege or sandbox boundary)!
Why Retpolines Work:
Problem with Indirect Branches:
; Vulnerable indirect jump
jmp *%rax              ; Jump to address in rax

; Attacker can manipulate branch predictor to:
; 1. Predict wrong target
; 2. Cause speculative execution to gadget
; 3. Leak data via cache side-channel
Retpoline (Return Trampoline):
; Instead of: jmp *%rax
; Use:

    call .set_target      ; Push address of .capture_spec; the RSB records it too
.capture_spec:
    pause                 ; Speculation lands here and spins harmlessly
    lfence                ; Stop further speculative execution
    jmp .capture_spec     ; Loop is never executed architecturally
.set_target:
    mov %rax, (%rsp)      ; Replace return address with rax
    ret                   ; Architecturally returns to rax

; Why this is safe:

; CPU's Return Stack Buffer (RSB):
; - Separate predictor for RET instructions
; - Tracks call/return pairs
; - Not steered by indirect branch predictor poisoning
;   (as long as the RSB does not underflow, hence "RSB filling")

; When ret executes:
; - CPU predicts target from RSB
; - RSB says: return to .capture_spec
; - Speculative execution goes to .capture_spec
; - NOT to attacker-controlled address!
; - Once the real return address (rax) is read from the stack,
;   the misprediction is squashed and execution continues at rax
Visual Comparison:
Traditional Indirect Jump (vulnerable):
┌─────────────┐
│   jmp *rax  │ → Branch predictor → Attacker controls prediction
└─────────────┘                      ↓
                              Speculative execution to gadget

                              Leak via cache timing

Retpoline (safe):
┌──────────────────┐
│ call .set_target │ → Push addr of .capture_spec (stack + RSB)
│ .capture_spec:   │ → pause / lfence / jmp .capture_spec
│ .set_target:     │
│  mov %rax,(%rsp) │ → Replace return addr with rax
│  ret             │ → RSB predicts return to .capture_spec
└──────────────────┘    (NOT attacker-controlled!)

Speculative path:    .capture_spec loop  ← Speculation contained in loop
Architectural path:  continues at *%rax  ← No attacker-chosen gadget runs speculatively
Kernel Implementation:
// Compiler generates retpolines for indirect branches
// gcc -mindirect-branch=thunk-extern

// Original code:
void (*func_ptr)(void);
func_ptr();  // Indirect call

// Compiled without retpoline:
call *%rax

// Compiled with retpoline:
call __x86_indirect_thunk_rax

// Retpoline thunk (arch/x86/lib/retpoline.S):
SYM_FUNC_START(__x86_indirect_thunk_rax)
    JMP_NOSPEC %rax
SYM_FUNC_END(__x86_indirect_thunk_rax)

#define JMP_NOSPEC(reg)                 \
    call    .Ldo_rop_##reg;             \
.Lspec_trap_##reg:                      \
    pause;                              \
    lfence;                             \
    jmp .Lspec_trap_##reg;              \
.Ldo_rop_##reg:                         \
    mov %reg, (%rsp);                   \
    ret
Performance Impact:
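  • Retpolines are slower than ordinary indirect branches (overhead is workload-dependent; 5-20% is quoted for indirect-branch-heavy code)
  • But necessary for security on vulnerable CPUs
  • Modern CPUs have hardware mitigations (enhanced IBRS, an always-on form of Indirect Branch Restricted Speculation)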
Check Mitigations:
cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
# Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, RSB filling

# Retpoline: Software mitigation (compiler-generated)
# IBRS: Hardware mitigation (CPU feature)
# IBPB: Indirect Branch Predictor Barrier (flush predictor)
# RSB filling: Prevent RSB underflow attacks
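Why Effective:
  1. Return instructions are predicted differently: the RSB is not trained by indirect branch poisoning
  2. Speculation contained: Loop prevents speculative execution reaching gadgets
  3. No hardware support needed: Software mitigation (works without microcode or new CPU features)
  4. Comprehensive: Covers every indirect branch the compiler emits (hand-written assembly must be converted separately)
Limitations:
  • Performance overhead (newer CPUs use enhanced IBRS instead)
  • Needs RSB filling on CPUs that fall back to the indirect predictor when the RSB is empty; Retbleed (2022) showed that returns can still be mispredicted on some CPUs
  • Doesn’t protect against Spectre v1 (bounds check bypass)
  • Doesn’t protect against other speculative execution attacks (L1TF, MDS, etc.)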
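
Because retpolines leave bounds check bypass untouched, Spectre v1 is usually handled in the code itself by clamping the index with a branch-free mask, so that even a mispredicted bounds check cannot produce an out-of-bounds address. The sketch below follows the generic (non-assembly) form of the Linux kernel's `array_index_nospec()` idea; the function names here are illustrative, it assumes index and size are at most LONG_MAX, and it relies on arithmetic right shift of negative values, which mainstream compilers provide.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* All-ones if index < size, zero otherwise, computed without a branch. */
unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    /* (index | (size - 1 - index)) has its top bit set exactly when
     * index >= size; invert and smear the sign bit across the word. */
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

/* In-bounds indexes pass through unchanged; out-of-bounds ones become 0. */
unsigned long clamp_index_nospec(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}

/* Bounds-checked table read that stays in bounds even under misprediction. */
int read_table(const unsigned char *table, unsigned long size,
               unsigned long index)
{
    assert(table != NULL);
    if (index >= size)
        return -1;
    index = clamp_index_nospec(index, size);  /* data dependency, no branch */
    return table[index];
}
```

The clamp is a data dependency, not a predicted branch, so the speculative load can only ever touch `table[0..size-1]`. A compiler is free to transform plain C, which is why the kernel's x86 version uses inline assembly for the mask.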
Fundamental Difference:
SELinux: Label-based MAC
Security Context: user:role:type:level

Files:   httpd_sys_content_t
Process: httpd_t

Rule: allow httpd_t httpd_sys_content_t:file { read open };
      ─────────────── ──────────────────  ──── ────────────
         Subject          Object          Class Permissions

Decision: Based on labels (NOT paths)
AppArmor: Path-based MAC
Profile:
/usr/sbin/nginx {
    /etc/nginx/** r,
    /var/www/** r,
    /var/log/nginx/** rw,
    deny /etc/shadow r,
}

Decision: Based on absolute filesystem paths
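
The practical consequence of label-based versus path-based decisions shows up when a file is moved. The toy model below hard-codes one rule per system (an httpd_t to httpd_sys_content_t read rule, and a /var/www/ path prefix); the struct and function names are invented for illustration, and real policy evaluation is far richer than this.

```c
#include <assert.h>
#include <string.h>

/* Toy file object: a path plus a security label stored with the file. */
struct toy_file {
    const char *path;
    const char *label;
};

/* Label-based check (SELinux style): only the types matter. */
int label_allows_read(const char *subject_type, const struct toy_file *f)
{
    assert(subject_type != NULL && f != NULL);
    return strcmp(subject_type, "httpd_t") == 0 &&
           strcmp(f->label, "httpd_sys_content_t") == 0;
}

/* Path-based check (AppArmor style): only the location matters. */
int path_allows_read(const struct toy_file *f)
{
    static const char prefix[] = "/var/www/";
    assert(f != NULL);
    return strncmp(f->path, prefix, sizeof(prefix) - 1) == 0;
}
```

If a file labeled httpd_sys_content_t is renamed from /var/www/index.html to /srv/site/index.html on the same filesystem, the label-based check still allows the read while the path-based check now denies it. Neither behaviour is universally better: labels follow the data, paths follow the administrator's mental model of the filesystem.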

Detailed Comparison:
1. Security Model:
SELinux:
  • Type Enforcement (TE): Subjects (processes) have types, objects (files) have types
  • Multi-Level Security (MLS): Confidentiality levels (Top Secret, Secret, etc.)
  • Multi-Category Security (MCS): Categories for compartmentalization
  • Very fine-grained control
AppArmor:
  • Path-based access control
  • Capabilities control
  • Network access control (by address family and socket type)
  • Simpler model, easier to understand
2. Complexity:
SELinux:
# Policy is complex
# Example policy snippet:
allow httpd_t httpd_sys_content_t:file { getattr read open };
allow httpd_t http_port_t:tcp_socket { bind listen };
allow httpd_t httpd_log_t:file { write append create };
allow httpd_t proc_t:file read;
allow httpd_t self:capability { setgid setuid };

# Hundreds of rules per service!
# Requires understanding of:
# - Type enforcement
# - Security contexts
# - Policy language
# - Domain transitions
AppArmor:
# Policy is readable
/usr/sbin/nginx {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  capability dac_override,
  capability net_bind_service,
  capability setgid,
  capability setuid,

  /etc/nginx/** r,
  /var/log/nginx/** rw,
  /var/www/** r,

  network inet stream,
}

# Human-readable!
# Easy to audit
3. Administration:
Task              SELinux                        AppArmor
Create policy     Complex (audit2allow helps)    Simple (aa-genprof)
Debug denials     ausearch, sealert              aa-logprof, dmesg
Enable/Disable    setenforce                     aa-enforce/aa-complain
View status       sestatus, getenforce           aa-status
Temporary allow   setsebool (SELinux booleans)   aa-complain mode
4. Performance:
SELinux:
  • Label lookups in xattrs (extended attributes)
  • Hash table lookups for policy decisions
  • Overhead: 3-7% typically
AppArmor:
  • Path resolution for every access
  • Simpler policy checks
  • Overhead: 1-3% typically
5. Filesystem Requirements:
SELinux:
  • Requires filesystem with xattr support
  • Labels stored as extended attributes
  • ls -Z shows labels
  • Relabeling filesystem can be slow
AppArmor:
  • No special filesystem requirements
  • Works on any filesystem (even FAT, NFS)
  • No labels to manage
6. Use Cases:
Use SELinux when:
  • Maximum security required (government, military)
  • Need MLS/MCS (confidentiality levels)
  • Want very fine-grained control
  • Already familiar with it (RHEL/Fedora/CentOS)
  • Need label-based security (labels follow files even if moved)
Use AppArmor when:
  • Simplicity preferred over maximum granularity
  • Easier policy management desired
  • Filesystem doesn’t support xattrs (NFS, FAT)
  • Developers/admins less experienced with MAC
  • Debian/Ubuntu/SUSE environment
7. Real-World Scenarios:
Scenario 1: Web Server
SELinux:
# Pre-defined policy exists
# But need to handle custom app

# App stores files in /opt/myapp/
# SELinux denies access (wrong label)

# Solution:
semanage fcontext -a -t httpd_sys_content_t "/opt/myapp(/.*)?"
restorecon -R /opt/myapp

# More denials? Debug with:
ausearch -m avc -ts recent
audit2allow -a -M mypolicy
semodule -i mypolicy.pp
AppArmor:
# Create profile
cat > /etc/apparmor.d/usr.sbin.myapp <<EOF
/usr/sbin/myapp {
  #include <abstractions/base>
  #include <abstractions/apache2-common>

  /opt/myapp/** r,
  /var/log/myapp/** rw,

  network inet stream,
  capability net_bind_service,
}
EOF

# Load and enforce
apparmor_parser -r /etc/apparmor.d/usr.sbin.myapp

# Done! Much simpler.
Scenario 2: Container Security
SELinux:
  • Docker/Podman use SELinux contexts
  • Each container gets unique MCS label
  • Container processes run as container_t; container files are labeled container_file_t (older policies: svirt_lxc_net_t and svirt_sandbox_file_t)
  • Strong isolation via labels
AppArmor:
  • Docker uses AppArmor profiles
  • Default profile restricts mount, capabilities, etc.
  • Custom profiles for specific containers
  • Path-based restrictions easier to understand
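The per-container MCS labels mentioned above isolate containers because access requires the process's category set to include every category on the file. The following is a simplified sketch of that check, assuming 1024 categories (c0 to c1023) and ignoring type enforcement and sensitivity levels, which real SELinux evaluates as well. The names mcs_set, mcs_pair, and mcs_dominates are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MCS_CATEGORIES 1024                 /* assumed range: c0 .. c1023 */
#define MCS_WORDS (MCS_CATEGORIES / 64)

/* Hypothetical type: a set of MCS categories stored as a bitmap. */
typedef struct { uint64_t bits[MCS_WORDS]; } mcs_set;

/* Build the two-category set a container runtime typically assigns,
 * e.g. mcs_pair(123, 456) for "s0:c123,c456". Both values must be
 * below MCS_CATEGORIES. */
static mcs_set mcs_pair(unsigned a, unsigned b)
{
    mcs_set s;
    memset(&s, 0, sizeof s);
    s.bits[a / 64] |= UINT64_C(1) << (a % 64);
    s.bits[b / 64] |= UINT64_C(1) << (b % 64);
    return s;
}

/* Access is allowed only if every category on the object
 * is also held by the subject. */
static bool mcs_dominates(const mcs_set *subject, const mcs_set *object)
{
    for (int i = 0; i < MCS_WORDS; i++)
        if (object->bits[i] & ~subject->bits[i])
            return false;
    return true;
}
```

A container labeled c123,c456 dominates its own files, while a neighbor labeled c123,c789 does not, even though both run with the same SELinux type.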
8. Policy Portability:
SELinux:
  • Labels stored with files (xattrs)
  • Policy is separate from filesystem
  • Moving files between systems: labels can be lost
  • Need to relabel after restore from backup
AppArmor:
  • Policy references absolute paths
  • Moving profile to different system: works if paths same
  • But path changes require profile updates

Recommendation Matrix:
Priority              | Choose
Maximum security      | SELinux
Ease of use           | AppArmor
Fine-grained control  | SELinux
Simple policies       | AppArmor
RHEL/CentOS           | SELinux (default)
Debian/Ubuntu         | AppArmor (default)
NFS/non-xattr FS      | AppArmor
MLS/MCS required      | SELinux
Container host        | Both work (SELinux more common)
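Can you use both?: Not in practice. Both are full-system ("major") LSMs, and mainline kernels have historically allowed only one such module to be active at a time. Choose one.
Neither?: Not recommended. MAC adds a significant security layer beyond DAC.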
Seccomp-BPF (Secure Computing with Berkeley Packet Filter):
Core Concept: Whitelist syscalls a process can make using BPF bytecode filters.
Architecture:
User Space Process

     │ syscall (e.g., open, read, write)

┌─────────────────────────────┐
│   Syscall Entry Point       │
│   (entry_SYSCALL_64)        │
└─────────┬───────────────────┘

          │ ① Check: Seccomp filter installed?

┌─────────────────────────────┐
│   Seccomp BPF Filter        │
│                             │
│   BPF Program:              │
│   - Load syscall number     │
│   - Load arguments          │
│   - Check against rules     │
│   - Return action:          │
│     • ALLOW                 │
│     • KILL                  │
│     • ERRNO                 │
│     • TRAP                  │
└─────────┬───────────────────┘

          │ ② Action

    ALLOW?  ──────────────────> Execute Syscall
    KILL?   ──────────────────> SIGSYS (kill process)
    ERRNO?  ──────────────────> Return error code
    TRAP?   ──────────────────> Send catchable SIGSYS (handler can inspect the call)

BPF Filter Structure:
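// Seccomp data available to the BPF program
// (defined in <linux/seccomp.h>; shown here for reference)
struct seccomp_data {
    int nr;                  // Syscall number
    __u32 arch;              // Architecture (AUDIT_ARCH_X86_64, AUDIT_ARCH_AARCH64, etc.)
    __u64 instruction_pointer;
    __u64 args[6];           // Syscall arguments (raw register values)
};

// BPF filter example
// Needs <linux/audit.h>, <linux/filter.h>, <linux/seccomp.h>,
// <stddef.h>, and <sys/syscall.h>
struct sock_filter filter[] = {
    // Check the architecture first: syscall numbers differ between ABIs
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

    // Load syscall number into accumulator
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),

    // Allow SYS_read (on mismatch, skip the next instruction)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow SYS_write
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow SYS_exit (glibc's exit() and _exit() call exit_group,
    // which a real filter would usually allow as well)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Default: KILL
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};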
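A filter array like the one above does nothing until it is attached to a process. The following is a minimal sketch, assuming Linux on x86-64 or AArch64, of installing a small allowlist with prctl() and observing the kill action from a parent process. The names NATIVE_ARCH, install_filter, and sandbox_demo are hypothetical, and the allowlist here (write and exit_group only) is chosen for the demo and is not the one from the example above.

```c
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Attach an allowlist of write and exit_group; everything else is killed. */
static int install_filter(void)
{
    struct sock_filter filter[] = {
        /* 0 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        /* 1 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        /* 2 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* 3 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* 4 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
        /* 5 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
        /* 6 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* 7 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required for unprivileged processes before a filter can be attached. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Fork a child, sandbox it, and let it attempt a syscall that is not allowed.
 * Returns the signal that terminated the child, 0 if it exited normally,
 * or -1 on error. */
int sandbox_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_filter() != 0)
            _exit(2);
        getppid();   /* not on the allowlist: the kernel kills the child */
        _exit(0);    /* not reached */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The filter is attached in a forked child because seccomp filters cannot be removed once installed. With the kill action, the parent sees the child terminated by SIGSYS.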

Container Security Use Case:
Problem: Containers share kernel with host. Malicious container can exploit kernel vulnerabilities.
Seccomp Solution: Reduce attack surface by blocking dangerous syscalls.
Docker Default Seccomp Profile (simplified):
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "read", "write", "open", "close", "stat",
        "fstat", "mmap", "mprotect", "munmap",
        "brk", "ioctl", "writev", "access",
        "socket", "connect", "accept", "bind",
        "listen", "select", "poll", "epoll_wait"
        /* ... ~300 allowed syscalls ... */
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": [
        "reboot",           // Cannot reboot host!
        "swapon", "swapoff", // Cannot manage swap
        "mount", "umount", "umount2", // Cannot mount or unmount filesystems
        "pivot_root",       // Cannot change root
        "kexec_load",       // Cannot load kernel
        "add_key",          // Cannot add keyring keys
        "request_key",
        "bpf",              // Cannot load BPF programs
        "perf_event_open",  // Cannot use perf
        "ptrace"            // Cannot trace other processes (blocked only on kernels older than 4.8)
      ],
      "action": "SCMP_ACT_ERRNO"  // Return EPERM
    }
  ]
}
Why Critical for Containers:
  1. Kernel Exploit Mitigation:
Without seccomp:
  Container → Exploit in keyctl() → Kernel code execution → Host compromise

With seccomp:
  Container → keyctl() → EPERM (syscall blocked) → Exploit fails
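  2. Privilege Escalation Prevention:
# Without seccomp (filter explicitly disabled)
docker run -it --security-opt seccomp=unconfined ubuntu
# Inside container:
unshare --user --map-root-user --mount --uts --ipc --net --pid --fork /bin/bash
# Succeeds if the host permits unprivileged user namespaces → more kernel code reachable

# With seccomp (default)
docker run -it ubuntu
# Inside container:
unshare --user --map-root-user --mount ...
# unshare: unshare failed: Operation not permitted
# Blocked! (the default profile denies unshare without CAP_SYS_ADMIN)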
  3. Attack Surface Reduction:
Linux kernel: ~450 syscalls
Docker default seccomp: ~300 allowed
(rough figures; exact counts depend on kernel version, architecture, and Docker release)

Blocked (~150 syscalls):
- Kernel module loading (init_module, finit_module)
- Namespace manipulation (setns, unshare)
- Performance monitoring (perf_event_open)
- System administration (reboot, sethostname)
- Kernel replacement (kexec_load, kexec_file_load)
- Key management (add_key, keyctl)
- BPF programs (bpf)

Result: roughly one third of the syscall interface is unreachable from the container.

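Implementing Custom Seccomp:
Example: Strict Sandbox:
// Build with: cc sandbox.c -lseccomp
#include <fcntl.h>
#include <seccomp.h>
#include <stdlib.h>

void install_strict_seccomp(void) {
    // Default: KILL (strictest!)
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        abort();

    // Allow ONLY these syscalls (each call returns 0 on success)
    int rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Conditional: allow openat ONLY for read-only access.
    // Argument 2 holds the flags. Seccomp sees the path (argument 1) only
    // as a pointer value and cannot compare the string, so per-path rules
    // need a path-aware mechanism such as AppArmor.
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 1,
                           SCMP_A2(SCMP_CMP_MASKED_EQ, O_ACCMODE, O_RDONLY));

    // Load filter
    if (rc != 0 || seccomp_load(ctx) != 0)
        abort();
    seccomp_release(ctx);

    // After this point:
    // - read/write/exit: OK
    // - openat(..., O_RDONLY): OK (path is still subject to DAC/MAC checks)
    // - openat(..., O_WRONLY): KILLED!
    // - socket(): KILLED!
    // - fork(): KILLED!
}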
Docker Custom Profile:
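# custom-seccomp.json
# (illustrative only: a real workload also needs execve, mmap, brk, and more,
#  otherwise the container cannot even start its entrypoint)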
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "exit", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

# Run container with custom profile
docker run --security-opt seccomp=custom-seccomp.json myimage

Debugging Seccomp Violations:
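# List the actions the kernel is allowed to log
# (a space-separated list of action names, not a 0/1 switch)
cat /proc/sys/kernel/seccomp/actions_logged

# For debugging, set "defaultAction": "SCMP_ACT_LOG" in strict.json so that
# syscalls missing from the allowlist are permitted but recorded
docker run --rm -it --security-opt seccomp=strict.json ubuntu bash

# Inside container, try a syscall that is not on the allowlist:
mount -t tmpfs tmpfs /mnt

# Check dmesg (or /var/log/audit/audit.log when auditd is running)
dmesg | tail
# [12345.678] audit: type=1326 audit(1234567890.123:456): auid=1000 uid=0 gid=0
#              ses=3 pid=12345 comm="mount" exe="/bin/mount" sig=0 arch=c000003e
#              syscall=165 compat=0 ip=0x7f... code=0x7ffc0000
#              ^^^^^^^^^^^                     ^^^^^^^^^^^^^^^
#              syscall 165 = mount (x86-64)    0x7ffc0000 = SECCOMP_RET_LOG

# Add the logged syscalls the workload really needs to the allowlist,
# then switch defaultAction back to SCMP_ACT_ERRNO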
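The code= field in a seccomp audit record is the raw value the filter returned: the upper bits select the action and the low 16 bits carry action data such as an errno. The following is a minimal sketch of decoding it with the constants from <linux/seccomp.h>, assuming kernel headers from 4.14 or later. The helper names seccomp_action_name and seccomp_action_data are hypothetical.

```c
#include <linux/seccomp.h>
#include <stdint.h>

/* Hypothetical helper: name the action encoded in an audit "code=" value. */
const char *seccomp_action_name(uint32_t code)
{
    switch (code & SECCOMP_RET_ACTION_FULL) {
    case SECCOMP_RET_KILL_PROCESS: return "KILL_PROCESS";
    case SECCOMP_RET_KILL_THREAD:  return "KILL_THREAD";
    case SECCOMP_RET_TRAP:         return "TRAP";
    case SECCOMP_RET_ERRNO:        return "ERRNO";
    case SECCOMP_RET_TRACE:        return "TRACE";
    case SECCOMP_RET_LOG:          return "LOG";
    case SECCOMP_RET_ALLOW:        return "ALLOW";
    default:                       return "UNKNOWN";
    }
}

/* Hypothetical helper: extract the low 16 bits of action data,
 * e.g. the errno value attached to an ERRNO action. */
unsigned seccomp_action_data(uint32_t code)
{
    return code & SECCOMP_RET_DATA;
}
```

For the sample record above, 0x7ffc0000 decodes to LOG, meaning the syscall was recorded rather than denied by the filter. A denial with EPERM would carry 0x00050001, which decodes to ERRNO with data 1.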

Why BPF:
  1. Efficiency: JIT-compiled to native code (fast!)
  2. Safety: BPF verifier ensures filter cannot crash kernel
  3. Flexibility: Can inspect syscall arguments, not just number
  4. Performance: Evaluated in kernel space (no context switch)
Without BPF (old seccomp mode 1):
  • Could only allow read(), write(), _exit(), and sigreturn()
  • No flexibility
With BPF (seccomp mode 2):
  • Can allow specific syscalls
  • Can inspect integer arguments (e.g., allow write only on a specific FD, or openat only with O_RDONLY)
  • Can return different actions (ERRNO, TRAP, LOG, ALLOW)
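Argument inspection in mode 2 can be sketched with a hand-written filter that allows write() only when the file descriptor is stderr and returns EPERM otherwise. This is a minimal sketch, assuming Linux on a little-endian 64-bit CPU. The names ARG0_LO, ARG0_HI, install_write_fd_filter, and write_fd_filter_demo are hypothetical, and the architecture check a production filter should start with is left out to keep the argument logic visible.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Byte offsets of the two 32-bit halves of args[0]; assumes little-endian. */
#define ARG0_LO ((unsigned)offsetof(struct seccomp_data, args[0]))
#define ARG0_HI (ARG0_LO + 4u)

static int install_write_fd_filter(void)
{
    struct sock_filter filter[] = {
        /* 0 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* 1 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 1, 0),
        /* 2 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),  /* not write */
        /* 3 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG0_HI),
        /* 4 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 3),  /* high half must be 0 */
        /* 5 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG0_LO),
        /* 6 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, STDERR_FILENO, 0, 1),
        /* 7 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),  /* write(2, ...) */
        /* 8 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns 0 if write() to stderr passed the filter and write() to stdout
 * was denied with EPERM, a non-zero value otherwise. */
int write_fd_filter_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_write_fd_filter() != 0)
            _exit(3);
        /* Zero-length write: reaches the kernel but produces no output. */
        ssize_t n = write(STDERR_FILENO, "", 0);
        int allowed = !(n == -1 && errno == EPERM);
        int denied = write(STDOUT_FILENO, "x", 1) == -1 && errno == EPERM;
        _exit(allowed && denied ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The filter compares the FD as a plain integer. The buffer pointer in args[1] is equally visible, but only as an address, which is why path strings and buffer contents cannot be matched this way.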

Limitations:
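  1. Cannot inspect pointers: BPF cannot dereference user-space pointers, so a filter sees path arguments only as addresses. It can compare integer arguments such as FDs and flags, but not path strings.
  2. Time-of-check-time-of-use (TOCTOU): Arguments are checked before the syscall runs, and memory behind a pointer argument can be changed by another thread in between.
  3. Bypass via allowed syscalls: Every allowed syscall stays reachable. If write() is allowed, an attacker can still write to any FD the process already holds.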
  4. Complexity: Writing correct BPF filters is hard

Summary:
Seccomp-BPF is critical for containers because:
  • ✅ Reduces kernel attack surface (blocks ~1/3 of syscalls)
  • ✅ Prevents privilege escalation (blocks namespace manipulation)
  • ✅ Mitigates kernel exploits (blocks vulnerable syscalls)
  • ✅ Fast (BPF JIT compilation)
  • ✅ Flexible (programmable filters)
  • ✅ Safe (the BPF verifier rejects filters that could crash or hang the kernel; it does not catch policy mistakes)
Without it, containers have full access to ~450 syscalls → much larger attack surface.

12. Threat Modeling for OS-backed Services

When designing secure services, think systematically about OS-level attack surfaces.

The STRIDE Model Applied to OS

Threat                  | OS Attack Vector                         | Mitigation
Spoofing                | Process impersonation, UID manipulation  | User namespaces, strong authentication
Tampering               | Memory corruption, file modification     | ASLR, KASLR, read-only mounts
Repudiation             | Log deletion, timestamp manipulation     | Append-only logs, audit subsystem
Info Disclosure         | /proc leaks, side channels               | hidepid=2, Spectre mitigations
Denial of Service       | Fork bombs, memory exhaustion            | Cgroups limits, ulimits, quotas
Elevation of Privilege  | Kernel exploits, setuid abuse            | Seccomp, drop capabilities

Defense-in-Depth Checklist

# 1. Principle of Least Privilege
capsh --print                           # See current capabilities
setcap cap_net_bind_service=+ep ./app   # Grant only specific caps

# 2. Namespaces: Reduce Visibility
unshare --user --map-root-user --pid --fork --mount-proc bash

# 3. Read-only Filesystems
mount -o remount,ro /                   # Make root read-only

# 4. Resource Limits (DoS protection; cgroup v2, assumes the myapp cgroup already exists)
echo "100M" > /sys/fs/cgroup/myapp/memory.max
echo "50" > /sys/fs/cgroup/myapp/pids.max
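Cgroup limits are applied from outside the service. The ulimits mentioned in the STRIDE table can also be lowered by the service itself at startup. The following is a minimal sketch using the POSIX getrlimit()/setrlimit() calls. The helper name cap_soft_limit is hypothetical.

```c
#include <sys/resource.h>

/* Hypothetical helper: lower the soft limit of `resource` to `value`,
 * leaving the hard limit untouched. The soft limit is never raised.
 * Returns 0 on success, -1 on error. */
int cap_soft_limit(int resource, rlim_t value)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;

    if (value < rl.rlim_cur) {
        rl.rlim_cur = value;            /* stays <= rl.rlim_max */
        if (setrlimit(resource, &rl) != 0)
            return -1;
    }
    return 0;
}
```

For example, cap_soft_limit(RLIMIT_CORE, 0) disables core dumps for the process, and cap_soft_limit(RLIMIT_NOFILE, 256) bounds the number of open descriptors. Lowering a soft limit needs no privileges, so this works even after capabilities have been dropped.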

Summary

Key Takeaways:
  1. Memory Protection: NX/DEP, ASLR, and stack canaries are foundational defenses against memory corruption attacks.
  2. Control Flow Integrity: Forward-edge CFI and shadow stacks (backward-edge CFI) prevent control-flow hijacking.
  3. Privilege Separation: Capabilities provide fine-grained privileges instead of all-or-nothing root access.
  4. Mandatory Access Control: SELinux (label-based) and AppArmor (path-based) enforce policies beyond DAC.
  5. Microarchitectural Attacks: Spectre and Meltdown exploit speculative execution. KPTI and retpolines mitigate but with performance cost.
  6. Sandboxing: Namespaces, seccomp, and combinations thereof create strong isolation for untrusted code.
Defense in Depth: No single mechanism is perfect. Modern systems combine multiple layers:
  • ASLR + NX + Stack Canaries + CFI (memory safety)
  • Capabilities + Seccomp + Namespaces (privilege reduction)
  • SELinux/AppArmor (mandatory access control)
  • KPTI + Retpolines + CPU features (hardware attack mitigation)
Performance vs Security: Many mitigations have performance costs. Understand trade-offs and apply based on threat model.