Operating System Security

Modern OS security is a multi-layered defense against both software vulnerabilities and hardware-level attacks. From memory protection to mandatory access control, understanding these mechanisms is crucial for building secure systems.

Mastery Level: Senior Security Engineer
Key Internals: Page Table Permissions, Capabilities, LSM hooks, CPU security features, Speculative execution mitigations
Prerequisites: Virtual Memory, Process Internals

1. Memory Protection Fundamentals

1.1 Page-Level Protection (NX/DEP)

No-Execute (NX) / Data Execution Prevention (DEP) marks memory pages as non-executable.

Traditional (Insecure):
┌────────────────────────────────┐
│  Stack                         │  Executable!
│  ├─ Return addresses           │  ← Attacker can inject shellcode
│  └─ Local variables            │
├────────────────────────────────┤
│  Heap                          │  Executable!
│  ├─ Malloc'd buffers           │  ← Attacker can put code here
│  └─ Dynamic data               │
├────────────────────────────────┤
│  Data/BSS                      │  Executable!
└────────────────────────────────┘

With NX/DEP:
┌────────────────────────────────┐
│  Stack (NX bit set)            │  NOT Executable
│  ├─ Return addresses           │  ← Shellcode won't execute!
│  └─ Local variables            │
├────────────────────────────────┤
│  Heap (NX bit set)             │  NOT Executable
│  ├─ Malloc'd buffers           │
│  └─ Dynamic data               │
├────────────────────────────────┤
│  Data/BSS (NX bit set)         │  NOT Executable
├────────────────────────────────┤
│  Text (executable)             │  Executable
│  └─ Program code               │
└────────────────────────────────┘

Implementation:

// Kernel sets page table entry (PTE) NX bit

// x86-64 page table entry structure
struct pte {
    unsigned long present : 1;     // Page is in memory
    unsigned long rw : 1;          // Read/Write permission
    unsigned long user : 1;        // User-mode accessible
    unsigned long pwt : 1;         // Page write-through
    unsigned long pcd : 1;         // Page cache disabled
    unsigned long accessed : 1;    // Page was accessed
    unsigned long dirty : 1;       // Page was written
    unsigned long pat : 1;         // Page attribute table
    unsigned long global : 1;      // Global page
    unsigned long avail : 3;       // Available for OS use
    unsigned long pfn : 40;        // Physical frame number
    unsigned long avail2 : 11;     // Available
    unsigned long nx : 1;          // NO-EXECUTE bit (bit 63)
};

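// Illustrative pseudocode: how mapping flags become page permissions
// (loosely modeled on mm/mmap.c; the real do_mmap() is far more involved)
unsigned long do_mmap(struct file *file, unsigned long addr,
                     unsigned long len, unsigned long prot,
                     unsigned long flags, unsigned long pgoff) {
    struct vm_area_struct *vma;

    vma = vm_area_alloc(current->mm);
    vma->vm_start = addr;
    vma->vm_end = addr + len;

    // Translate the requested protection into VMA flags
    if (prot & PROT_READ)
        vma->vm_flags |= VM_READ;
    if (prot & PROT_WRITE)
        vma->vm_flags |= VM_WRITE;
    if (prot & PROT_EXEC)
        vma->vm_flags |= VM_EXEC;

    // Stacks are mapped read/write and grow downward. Without VM_EXEC,
    // the NX bit is set in every page table entry of the region.
    if (flags & MAP_GROWSDOWN)
        vma->vm_flags |= VM_GROWSDOWN;

    // Strict W^X (NOT mainline Linux behavior): strip write permission from
    // mappings that are also executable. Mainline allows W+X mappings unless
    // a policy such as SELinux's execmem check denies the request.
    if ((vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_EXEC))
        vma->vm_flags &= ~VM_WRITE;

    return addr;
}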

W^X Policy (Write XOR Execute):

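A page can be writable OR executable, but never both.
This prevents an attacker from modifying existing code or executing injected data.
Mainline Linux does not enforce W^X by default: mmap() with PROT_WRITE | PROT_EXEC succeeds unless a policy (for example SELinux's execmem permission) forbids it. JIT compilers typically switch pages between writable and executable with mprotect().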

Check NX status:

# Check if NX is enabled
dmesg | grep NX
# NX (Execute Disable) protection: active

# Check process memory maps
cat /proc/self/maps
# 7ffff7dd1000-7ffff7df3000 r-xp ... /lib/x86_64-linux-gnu/ld-2.31.so  (executable)
# 7ffff7df3000-7ffff7df4000 r--p ... /lib/x86_64-linux-gnu/ld-2.31.so  (read-only)
# 7ffffffde000-7ffffffff000 rw-p ... [stack]                           (no 'x'!)

# Check if binary has NX enabled
readelf -l /bin/ls | grep GNU_STACK
# GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x10
#                                                                 ^^^ (RW, not RWE)
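The /proc/self/maps check above can be automated. The following is a minimal sketch, assuming the usual "start-end perms offset dev inode path" line format; the helper name maps_line_is_wx is hypothetical and not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Returns true if one line of /proc/<pid>/maps describes a region that is
 * both writable and executable (a W^X violation). Hypothetical helper. */
bool maps_line_is_wx(const char *line) {
    unsigned long start, end;
    char perms[5];

    assert(line != NULL);
    /* Line format: start-end perms offset dev inode [path] */
    if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
        return false;                       /* not a maps line */
    return perms[1] == 'w' && perms[2] == 'x';
}
```

Feeding every line of /proc/<pid>/maps through this function and reporting the hits gives a quick audit of a running process; a hardened process should normally produce none.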

1.2 Address Space Layout Randomization (ASLR)

Problem: Without ASLR, addresses are predictable.

Without ASLR (Predictable):
┌─────────────────────────────────┐
│ Stack:         0x7ffffffde000    │  ← Always same address!
│ Heap:          0x555555559000    │  ← Attacker knows these
│ libc:          0x7ffff7a0d000    │  ← Can hardcode in exploit
│ Program:       0x555555554000    │
│ vDSO:          0x7ffff7fc9000    │
└─────────────────────────────────┘

With ASLR (Randomized):
Run 1:                            Run 2:
┌─────────────────────────────┐  ┌─────────────────────────────┐
│ Stack:    0x7ffc9e3a2000    │  │ Stack:    0x7ffe1b8d6000    │
│ Heap:     0x5643ab123000    │  │ Heap:     0x55e2d9abc000    │
│ libc:     0x7f8a2e456000    │  │ libc:     0x7f3c81de2000    │
│ Program:  0x5643ab11f000    │  │ Program:  0x55e2d9ab8000    │
│ vDSO:     0x7f8a2f1c3000    │  │ vDSO:     0x7f3c82b4f000    │
└─────────────────────────────┘  └─────────────────────────────┘
                ↑ Different every time! ↑

Kernel Implementation:

// Simplified from arch/x86/mm/mmap.c

unsigned long arch_mmap_rnd(void) {
    unsigned long rnd;

    // Get random bits from the kernel CSPRNG
    if (mmap_is_ia32()) {
        // 32-bit (compat) task: fewer random bits fit in the address space
        rnd = get_random_long() & ((1UL << mmap_rnd_compat_bits) - 1);
    } else {
        rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);
    }

    return rnd << PAGE_SHIFT;  // Convert a page count into a page-aligned byte offset
}

unsigned long arch_get_unmapped_area(struct file *filp,
                                    unsigned long addr,
                                    unsigned long len,
                                    unsigned long pgoff,
                                    unsigned long flags) {
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned long start_addr;

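    // Pick where the search starts
    if (!(flags & MAP_FIXED)) {
        // mm->mmap_base already contains the ASLR offset: arch_mmap_rnd()
        // is applied once per exec when the layout is chosen, not per mmap()
        start_addr = mm->mmap_base;
    } else {
        start_addr = addr;
    }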

    // Find free region starting at randomized address
    vma = find_vma(mm, start_addr);
    // ... allocation logic ...

    return start_addr;
}

Entropy Sources:

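ASLR Entropy (bits of randomness; typical x86-64 defaults, exact values depend on kernel version and settings such as vm.mmap_rnd_bits):

Stack:      22 bits (on x86-64)  = 4,194,304 possible locations
Heap:       13 bits              = 8,192 possible locations
Libraries:  28 bits              = 268 million possible locations
PIE binary: 28 bits              = 268 million possible locations

Formula: Worst-case brute force attempts = 2^(entropy_bits); expected = 2^(entropy_bits - 1)

Example: 28 bits → attacker needs on average 2^27 ≈ 134 million attempts
If each attempt crashes the program (1 sec delay):
  134 million seconds ≈ 1,553 days (about 4.3 years)

But if failed guesses are cheap (e.g. a forking server whose children all inherit the parent's layout):
  Attacker can keep probing the same layout and brute force it in minutes!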
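The brute-force arithmetic above can be written down directly. This is a small sketch under the same assumptions as the text (uniformly random layout, fixed delay per failed guess); the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Expected number of guesses to hit one of 2^bits equally likely layouts:
 * on average the attacker searches half of the space. */
uint64_t expected_attempts(unsigned bits) {
    assert(bits >= 1 && bits <= 63);
    return UINT64_C(1) << (bits - 1);
}

/* Expected brute-force time in whole days for a fixed cost per failed guess. */
uint64_t expected_days(unsigned bits, uint64_t seconds_per_attempt) {
    return expected_attempts(bits) * seconds_per_attempt / 86400;
}
```

For 28 bits and one second per attempt, expected_attempts(28) is 2^27 = 134,217,728 and expected_days(28, 1) is 1553. Each bit of entropy removed halves both numbers, which is why low-entropy regions such as the heap are the usual target.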

KASLR (Kernel ASLR):

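// Kernel address randomization (simplified from arch/x86/boot/compressed/kaslr.c)

void choose_random_location(unsigned long input,
                           unsigned long input_size,
                           unsigned long *output,
                           unsigned long output_size,
                           unsigned long *virt_addr) {
    unsigned long random_addr, min_addr;

    // The boot-time random helpers mix entropy from:
    // 1. RDRAND (CPU instruction, if available)
    // 2. RDTSC (timestamp counter)
    // 3. Boot parameters
    // (the kernel's normal random pool does not exist yet this early)

    // Physical address: pick a random slot in usable memory that is
    // large enough for the decompressed kernel
    min_addr = min(*output, 512UL << 20);
    random_addr = find_random_phys_addr(min_addr, output_size);
    if (random_addr)
        *output = random_addr;

    // Virtual address: on x86-64 it is randomized independently of the
    // physical address, so leaking one does not reveal the other
    random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);
    *virt_addr = random_addr;
}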

Check ASLR status:

# View ASLR setting
cat /proc/sys/kernel/randomize_va_space
# 0 = Disabled
# 1 = Randomize stack, libraries, mmap
# 2 = Full randomization (includes heap)

# Enable full ASLR
echo 2 | sudo tee /proc/sys/kernel/randomize_va_space

# Test ASLR
for i in {1..5}; do cat /proc/self/maps | grep stack; done
# 7ffc12345000-7ffc12366000 (different)
# 7ffe9abcd000-7ffe9abee000 (different)
# 7ffd45678000-7ffd45699000 (different)

1.3 Stack Canaries (Stack Smashing Protection)

A stack canary is a random value placed on the stack between a function's local variables and its saved frame pointer and return address.

Stack Layout with Canary:
─────────────────────────

High Address
┌──────────────────────┐
│  Return Address      │  ← Protected by canary!
├──────────────────────┤
│  Saved Frame Pointer │
├──────────────────────┤
│  CANARY (random)     │  ← __stack_chk_guard
├──────────────────────┤
│  Local Variables     │  ← Buffer overflow starts here
│  char buf[100];      │
└──────────────────────┘
Low Address

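Attack Scenario:
1. Attacker overflows buf
2. The overflow overwrites the canary on its way to the return address (attacker doesn't know the correct value)
3. Before the function returns, compiler-inserted epilogue code compares the stack canary with the reference value (__stack_chk_guard, or a thread-local slot such as %fs:0x28 on x86-64 Linux)
4. Mismatch → __stack_chk_fail() → "stack smashing detected" → Abort
5. The corrupted return address is never used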

Compiler Implementation:

// Original vulnerable code
void vulnerable_function(char *input) {
    char buffer[64];
    strcpy(buffer, input);  // Buffer overflow!
}

// Compiled with -fstack-protector-strong
void vulnerable_function(char *input) {
    char buffer[64];
    unsigned long canary = __stack_chk_guard;  // Load canary

    strcpy(buffer, input);

    if (canary != __stack_chk_guard) {
        __stack_chk_fail();  // Stack smashing detected!
    }
}

// __stack_chk_fail() implementation (simplified from glibc)
void __attribute__((noreturn)) __stack_chk_fail(void) {
    __fortify_fail("stack smashing detected");
}

void __attribute__((noreturn)) __fortify_fail(const char *msg) {
    // Report on stderr (not syslog); the user sees
    // "*** stack smashing detected ***: terminated"
    __libc_message("*** %s ***: terminated\n", msg);

    // Terminate immediately (SIGABRT)
    abort();
}
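The check the compiler inserts can be modeled in portable C. The sketch below is only a model: the struct frame_model layout and the function copy_and_check are hypothetical, and a real frame is laid out by the compiler, not by a struct. The copy is bounded by the size of the model so the demonstration itself stays memory-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 64

/* Hypothetical model of a frame: buffer, then canary, then return address. */
struct frame_model {
    char     buffer[BUF_SIZE];
    uint64_t canary;
    uint64_t return_addr;
};

/* Copies len bytes over the frame starting at buffer, as an unchecked copy
 * would, then reports whether the canary still holds the guard value. */
bool copy_and_check(const void *input, size_t len, uint64_t guard) {
    struct frame_model frame = { .canary = guard, .return_addr = 0x401136 };

    assert(input != NULL && len <= sizeof(frame)); /* keep the model in bounds */
    memcpy(&frame, input, len);                    /* may run past buffer[] */

    return frame.canary == guard;                  /* the epilogue check */
}
```

A 64-byte input leaves the canary intact, while anything longer has to pass through the canary before it can reach return_addr, which is exactly the property the epilogue check relies on.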

Canary Types:

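Terminator Canary

// Built from bytes that terminate string operations: NUL, CR, LF, 0xFF (0x00, 0x0D, 0x0A, 0xFF)
// Idea: strcpy() stops at NUL and gets() stops at LF, so a string-based
// overflow cannot write these bytes and continue past the canary

unsigned long canary = 0x000d0aff00000000;

// Weakness: the value is fixed and publicly known, so an overflow through
// memcpy() or read() (which copy arbitrary bytes) can simply rewrite it

Random Canary

// Completely random value generated at startup

// In kernel (simplified from include/linux/random.h and
// arch/x86/include/asm/stackprotector.h)
unsigned long get_random_canary(void) {
    // CANARY_MASK clears one byte on 64-bit, so the canary contains a NUL
    // that stops string-based overflows and leaks
    return get_random_long() & CANARY_MASK;
}

void boot_init_stack_canary(void) {
    // ...
    current->stack_canary = get_random_canary();
    // ...
}

// In user space, glibc derives the canary at program startup from the
// random bytes the kernel passes in the AT_RANDOM auxiliary vector entry

// Strength: Unpredictable, unique per process
// (fork()ed children share the parent's value)

XOR Canary

// XOR of return address, frame pointer, and random value

unsigned long canary = __stack_chk_guard ^
                       (unsigned long)__builtin_return_address(0) ^
                       (unsigned long)__builtin_frame_address(0);

// Idea: The stored canary is bound to the control data it protects. If an
// attacker overwrites the return address (even without touching the canary),
// the recomputed value no longer matches

// Weakness: Complex, rarely used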

Compiler Flags:

# -fstack-protector: Protect functions that call alloca() or have buffers of 8 bytes or more
gcc -fstack-protector vulnerable.c

# -fstack-protector-strong: Also protect functions with any local array or whose locals have their address taken (recommended)
gcc -fstack-protector-strong vulnerable.c

# -fstack-protector-all: Protect ALL functions (performance cost)
gcc -fstack-protector-all vulnerable.c

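# Check if binary has stack protector (look for an undefined reference to __stack_chk_fail)
readelf -s /bin/ls | grep stack_chk
#    123: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __stack_chk_fail@GLIBC_2.4
# (illustrative; the symbol index and version suffix vary between systems)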

Bypass Techniques (and mitigations):

Attack	Mitigation
Leak canary via format string	Use fortified functions (__printf_chk)
Overwrite canary with correct value	Use random canary per thread
Jump over canary (partial overflow)	Place canary near variables
Fork before overflow (canary same in child)	Re-randomize after fork

2. Control Flow Integrity (CFI)

CFI ensures program control flow follows legitimate paths (no arbitrary jumps).

2.1 Forward-Edge CFI (Indirect Calls)

Problem: Function pointers can be hijacked.

// Vulnerable code
struct ops {
    void (*process)(char *data);
};

struct ops *vtable = malloc(sizeof(struct ops));
vtable->process = legitimate_function;

// ... buffer overflow ...
// Attacker overwrites vtable->process to point to shellcode

vtable->process(data);  // Calls shellcode!

CFI Solution:

// Compiler generates CFI check before indirect call

// Original code
vtable->process(data);

// Compiled with CFI
void *target = vtable->process;

// Check 1: Is target a valid code address?
if (!is_valid_code_address(target)) {
    cfi_violation();
}

// Check 2: Is target in allowed set for this call site?
if (!is_allowed_target(call_site_id, target)) {
    cfi_violation();
}

// Perform call
((void (*)(char *))target)(data);

Allowed Target Sets:

Function Signature-Based CFI:

void func_a(int x);           ← Set 1: void (int)
void func_b(int x);           ←
int func_c(int x, int y);     ← Set 2: int (int, int)
int func_d(int x, int y);     ←
void func_e(void);            ← Set 3: void (void)

Rule: Indirect call with signature void(int) can only jump to Set 1.

Implementation:
1. Compiler assigns ID to each function signature
2. Compiler tags each function with its ID
3. Before indirect call, check ID matches expected signature
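The three implementation steps can be sketched in plain C. This is a hand-written model, not what Clang emits: the enum sig_id values, struct tagged_fn, and checked_call_void_int are hypothetical names, and a real compiler stores the tag next to the function's code instead of in a separate struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Step 1: hypothetical IDs a compiler might assign per signature. */
enum sig_id { SIG_VOID_INT = 1, SIG_INT_INT_INT = 2 };

/* Step 2: every address-taken function is tagged with its signature ID. */
struct tagged_fn {
    enum sig_id id;
    void (*fn)(void);        /* generic function pointer, cast back before use */
};

int last_seen;
void func_a(int x) { last_seen = x; }
int func_c(int x, int y) { return x + y; }

const struct tagged_fn func_a_entry = { SIG_VOID_INT,    (void (*)(void))func_a };
const struct tagged_fn func_c_entry = { SIG_INT_INT_INT, (void (*)(void))func_c };

/* Step 3: a call site that expects void(int) refuses any other tag. */
bool checked_call_void_int(const struct tagged_fn *target, int arg) {
    assert(target != NULL);
    if (target->id != SIG_VOID_INT)
        return false;                      /* CFI violation: wrong target set */
    ((void (*)(int))target->fn)(arg);      /* tag matches the real type */
    return true;
}
```

Calling checked_call_void_int(&func_a_entry, 7) runs func_a, while passing &func_c_entry is rejected before the call. Note the limit of this scheme: any function with a matching signature is still an allowed target.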

Clang CFI:

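# Compile with CFI (requires LTO and hidden visibility)
clang -fsanitize=cfi -flto -fvisibility=hidden program.c

# CFI variants
-fsanitize=cfi-icall            # Indirect calls
-fsanitize=cfi-vcall            # Virtual function calls (C++)
-fsanitize=cfi-derived-cast     # Bad casts from base to derived class
-fsanitize=cfi-unrelated-cast   # Bad casts from void* or unrelated types

# By default a violation executes a trap instruction (process dies, no message).
# For a diagnostic report, build with -fno-sanitize-trap=cfi and run:
UBSAN_OPTIONS=print_stacktrace=1 ./program

# Example violation (illustrative; exact output varies by Clang version)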
SUMMARY: UndefinedBehaviorSanitizer: cfi-check-fail
pc 0x55f8a2b3c4d5 in main program.c:42

2.2 Backward-Edge CFI (Return Address Protection)

Shadow Stack: Hardware-protected copy of return addresses.

Regular Stack           Shadow Stack (Protected)
─────────────────       ────────────────────────
┌─────────────┐         ┌─────────────┐
│  Ret Addr 3 │ ◄──────►│  Ret Addr 3 │  (Copy)
├─────────────┤         ├─────────────┤
│  Locals     │         │             │
├─────────────┤         │             │
│  Ret Addr 2 │ ◄──────►│  Ret Addr 2 │
├─────────────┤         ├─────────────┤
│  Locals     │         │             │
├─────────────┤         │             │
│  Ret Addr 1 │ ◄──────►│  Ret Addr 1 │
└─────────────┘         └─────────────┘
     ↑                         ↑
  Writable!              Read-Only!
  (Attacker can          (CPU enforced,
   modify)                not accessible)

On Function Return:
1. Pop return address from regular stack → addr_stack
2. Pop return address from shadow stack → addr_shadow
3. Compare: addr_stack == addr_shadow?
4. Mismatch → #CP exception (Control Protection) → Crash

Intel CET (Control-flow Enforcement Technology):

// CPU features for CET, as CPUID leaf 7 (subleaf 0) bit positions
#define X86_FEATURE_SHSTK  (1 << 7)   // Shadow stack (ECX bit 7)
#define X86_FEATURE_IBT    (1 << 20)  // Indirect branch tracking (EDX bit 20)

// Enable user-mode shadow stack for the current thread (simplified kernel code)
void cet_enable(void) {
    u64 msr_val;
    unsigned long ssp;

    // Check if CPU supports CET shadow stacks
    if (!boot_cpu_has(X86_FEATURE_SHSTK))
        return;

    // CR4.CET is the master switch for CET
    cr4_set_bits(X86_CR4_CET);

    // Enable shadow stack for user mode
    // (the U_CET MSR; S_CET controls supervisor mode)
    rdmsrl(MSR_IA32_U_CET, msr_val);
    msr_val |= CET_SHSTK_EN;
    wrmsrl(MSR_IA32_U_CET, msr_val);

    // Allocate shadow stack for current thread
    ssp = alloc_shstk();  // Shadow stack pointer
    wrmsrl(MSR_IA32_PL3_SSP, ssp);
}

// Shadow stack operations (new x86 instructions)
// INCSSP - Increment shadow stack pointer
// RDSSP  - Read shadow stack pointer
// SAVEPREVSSP - Save previous SSP
// RSTORSSP - Restore SSP
// WRSSD/WRSSQ - Write to shadow stack
// SETSSBSY - Mark shadow stack busy

ARM Pointer Authentication:

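// ARM PAuth (ARMv8.3-A) cryptographically signs return addresses

// On function prologue:
// PAC (Pointer Authentication Code) = MAC(return_addr, modifier = SP, key)
// The PAC is stored in the unused upper bits of the return address itself,
// so the signed pointer still fits in one 64-bit register or stack slot

// On function epilogue:
// Recompute MAC(return_addr, SP, key) and compare with the embedded PAC
// If mismatch → the pointer is invalidated and the return faults

// ARM instructions
PACIA  X30, SP   // Sign return address (X30) using key A, with SP as modifier
RETAA            // Authenticate X30 (key A, SP as modifier) and return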

Software Shadow Stack (Android):

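// Portable sketch of the idea (not hardware-protected).
// Android's actual mechanism is Clang's ShadowCallStack
// (-fsanitize=shadow-call-stack): compiler instrumentation that keeps the
// shadow stack pointer in a reserved register (x18 on AArch64)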

__thread void *shadow_stack[1024];
__thread int shadow_stack_ptr = 0;

void function_entry(void *return_addr) {
    shadow_stack[shadow_stack_ptr++] = return_addr;
}

void function_exit(void *return_addr) {
    void *expected = shadow_stack[--shadow_stack_ptr];
    if (return_addr != expected) {
        abort();  // Stack corruption detected
    }
}

// Weakness: Attacker can corrupt shadow_stack too if memory bug exists
// Strength: Works on CPUs without hardware support

3. Privilege Separation & Capabilities

3.1 Traditional Unix DAC (Discretionary Access Control)

User/Group/Other Permissions:

File: /etc/shadow
Owner: root
Group: shadow
Permissions: rw-r-----
             │││││││││
             ││││││││└─ Other: no execute
             │││││││└── Other: no write
             ││││││└─── Other: no read
             │││││└──── Group: no execute
             ││││└───── Group: no write
             │││└────── Group: read
             ││└─────── Owner: no execute
             │└──────── Owner: write
             └───────── Owner: read

Problem: All-or-nothing root privileges!
- Process needs root to bind port 80 → runs fully as root
- Process needs root to read /etc/shadow → runs fully as root
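The owner/group/other decision, including the all-or-nothing root override, fits in a few lines. This is a simplified sketch: it ignores supplementary groups, ACLs, and capabilities, and the function name dac_allows and the WANT_* constants are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

#define WANT_READ  4u
#define WANT_WRITE 2u
#define WANT_EXEC  1u

/* Classic DAC check: exactly one permission triplet applies to the caller. */
bool dac_allows(mode_t mode, uid_t file_uid, gid_t file_gid,
                uid_t uid, gid_t gid, unsigned want) {
    unsigned bits;

    assert(want != 0 && (want & ~7u) == 0);
    if (uid == 0)
        return true;                   /* simplified: root bypasses the check */
    if (uid == file_uid)
        bits = (mode >> 6) & 7u;       /* owner triplet */
    else if (gid == file_gid)
        bits = (mode >> 3) & 7u;       /* group triplet */
    else
        bits = mode & 7u;              /* other triplet */
    return (bits & want) == want;
}
```

With mode 0640 and a file owned by root with group ID 42, a user in group 42 may read but not write, an unrelated user gets nothing, and UID 0 passes every check. That last branch is the problem the next section addresses.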

3.2 POSIX Capabilities

Divide root privileges into distinct units:

// From /usr/include/linux/capability.h

#define CAP_CHOWN            0   // Change file ownership
#define CAP_DAC_OVERRIDE     1   // Bypass file permission checks
#define CAP_DAC_READ_SEARCH  2   // Bypass read/search permissions
#define CAP_FOWNER           3   // Bypass permission checks on file operations
#define CAP_FSETID           4   // Don't clear setuid/setgid bits
#define CAP_KILL             5   // Bypass permission checks for sending signals
#define CAP_SETGID           6   // Make arbitrary setgid calls
#define CAP_SETUID           7   // Make arbitrary setuid calls
#define CAP_NET_BIND_SERVICE 10  // Bind to ports < 1024
#define CAP_NET_RAW          13  // Use RAW and PACKET sockets
#define CAP_SYS_MODULE       16  // Load/unload kernel modules
#define CAP_SYS_PTRACE       19  // Trace arbitrary processes
#define CAP_SYS_ADMIN        21  // Lots of system admin operations
// ... 41 capabilities total ...

Capability Sets:

// Each process has 5 capability sets

struct cred {
    // ...
    kernel_cap_t cap_inheritable;  // Inherited by exec'd programs
    kernel_cap_t cap_permitted;    // Can be enabled (superset)
    kernel_cap_t cap_effective;    // Actually active NOW
    kernel_cap_t cap_bset;         // Bounding set (limits inheritable)
    kernel_cap_t cap_ambient;      // Ambient set (new in Linux 4.3)
};

// Each set is a 64-bit bitmask: one bit per capability, so room for up to 64
typedef struct {
    __u32 cap[_LINUX_CAPABILITY_U32S_3];  // 2 × 32 bits
} kernel_cap_t;

Capability Semantics:

Permitted (P):   Capabilities the process CAN use
Effective (E):   Capabilities CURRENTLY active
Inheritable (I): Capabilities that can be inherited across exec
Ambient (A):     Capabilities kept across exec of a program that has no file capabilities and is not setuid/setgid
Bounding (B):    Upper limit on capabilities (cannot gain capabilities not in B)

On exec() (per capabilities(7)):
A' = (file has capabilities or is setuid/setgid) ? 0 : A
P' = (I & F_inheritable) | (F_permitted & B) | A'
E' = F_effective ? P' : A'
I' = I
B' = B

Where F_* are file capabilities (set on executable)
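The exec() transformation, as documented in capabilities(7), is just bit arithmetic on the sets. The following is a sketch for a non-root process; the struct and function names are hypothetical, and "privileged file" is modeled only as "has any file capability" (setuid/setgid handling is left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct proc_caps { uint64_t permitted, effective, inheritable, bounding, ambient; };
struct file_caps { uint64_t permitted, inheritable; bool effective; };

/* Capability sets of a non-root process after execve(), per capabilities(7). */
struct proc_caps caps_after_exec(struct proc_caps p, struct file_caps f) {
    struct proc_caps n;
    bool privileged_file = f.permitted || f.inheritable || f.effective;

    /* Kernel invariant: ambient is a subset of permitted AND inheritable. */
    assert((p.ambient & ~(p.permitted & p.inheritable)) == 0);

    n.ambient     = privileged_file ? 0 : p.ambient;
    n.permitted   = (p.inheritable & f.inheritable)
                  | (f.permitted & p.bounding)
                  | n.ambient;
    n.effective   = f.effective ? n.permitted : n.ambient;
    n.inheritable = p.inheritable;
    n.bounding    = p.bounding;
    return n;
}
```

Two cases are worth tracing by hand: a plain process executing a binary with cap_net_raw+ep ends up with CAP_NET_RAW permitted and effective, and a process holding CAP_NET_RAW in its ambient set keeps it across exec of a binary with no file capabilities.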

Using Capabilities:

Set File Capabilities

# Give ping the ability to create raw sockets (no setuid needed!)
sudo setcap cap_net_raw+ep /bin/ping

# Verify
getcap /bin/ping
# /bin/ping = cap_net_raw+ep

# Now ping works without setuid bit!
ls -l /bin/ping
# -rwxr-xr-x  ... /bin/ping  (no 's' bit!)

# Remove capabilities
sudo setcap -r /bin/ping

# Set multiple capabilities
sudo setcap cap_net_bind_service,cap_net_raw+ep /usr/bin/server

Capability-Aware Code

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/capability.h>

int main(void) {
    cap_value_t cap_list[1] = {CAP_NET_BIND_SERVICE};

    // Get current capabilities
    cap_t caps = cap_get_proc();
    if (caps == NULL) {
        perror("cap_get_proc");
        return 1;
    }

    // Raise the capability in the effective set. This only works if it is
    // already in the permitted set (granted below via file capabilities):
    // a process cannot give itself capabilities it does not hold.
    cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_list, CAP_SET);

    // Apply capabilities
    if (cap_set_proc(caps) != 0) {
        perror("cap_set_proc");
        cap_free(caps);
        return 1;
    }

    // Now we can bind port 80!
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(80),
        .sin_addr.s_addr = htonl(INADDR_ANY)
    };

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        printf("Successfully bound to port 80\n");
    }

    // Drop all capabilities: they are no longer needed after bind()
    cap_clear(caps);
    cap_set_proc(caps);

    cap_free(caps);
    return 0;
}

// Compile: gcc -o server server.c -lcap
// Grant:   sudo setcap cap_net_bind_service+p ./server
// Run:     ./server

View Process Capabilities

# View capabilities of running process
grep Cap /proc/self/status
# CapInh: 0000000000000000
# CapPrm: 0000000000000000
# CapEff: 0000000000000000
# CapBnd: 000001ffffffffff
# CapAmb: 0000000000000000

# Decode capability mask
capsh --decode=000001ffffffffff
# 0x000001ffffffffff=cap_chown,cap_dac_override,...

# View capabilities of any process
grep Cap /proc/1234/status

# List all capabilities
capsh --print
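The hex masks in /proc/<pid>/status can also be decoded by hand. The helpers below are a minimal sketch of what capsh --decode does at the bit level; their names are hypothetical and they do not map bit numbers to capability names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a hex mask as printed on the Cap* lines of /proc/<pid>/status. */
uint64_t parse_cap_mask(const char *hex) {
    assert(hex != NULL);
    return strtoull(hex, NULL, 16);
}

/* True if capability number cap (e.g. 13 for CAP_NET_RAW) is in the mask. */
bool cap_is_set(uint64_t mask, unsigned cap) {
    assert(cap < 64);
    return (mask >> cap) & 1u;
}

/* Number of capabilities in the mask. */
unsigned cap_count(uint64_t mask) {
    unsigned n = 0;
    for (; mask != 0; mask &= mask - 1)   /* clears the lowest set bit */
        n++;
    return n;
}
```

For the bounding set shown above, 000001ffffffffff has bits 0 through 40 set, so cap_count returns 41, matching the total number of capabilities listed in the header excerpt.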

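Ambient Capabilities

// Ambient capabilities (Linux 4.3+)
// Allow non-root processes to exec and retain capabilities

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/capability.h>

int main(void) {
    // Raise CAP_NET_RAW in the ambient set. The capability must already be in
    // BOTH the permitted and inheritable sets, otherwise this fails with EPERM.
    if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE,
              CAP_NET_RAW, 0, 0) != 0) {
        perror("prctl");
        return 1;
    }

    // exec another program: /usr/bin/ping inherits CAP_NET_RAW
    // (even though it's not setuid and has no file capabilities)
    execl("/usr/bin/ping", "ping", "8.8.8.8", (char *)NULL);

    // Only reached if execl() failed
    perror("execl");
    return 1;
}

// Use case: Container init process grants capabilities to children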

3.3 Seccomp (Secure Computing Mode)

Seccomp-BPF: Restrict system calls a process can make using BPF filters.

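#include <fcntl.h>
#include <unistd.h>
#include <seccomp.h>

int main(void) {
    // Open everything we need BEFORE sandboxing. Seccomp filters see only the
    // syscall number and raw argument registers: a path argument is just a
    // pointer value, so a filter cannot compare the string it points to.
    int fd = open("/tmp/allowed.txt", O_RDONLY);

    // Create filter: default action = kill the process
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (ctx == NULL)
        return 1;

    // Allow essential syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Allow write only to stdout: integer arguments CAN be compared
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                     SCMP_A0(SCMP_CMP_EQ, STDOUT_FILENO));

    // Load filter into kernel
    if (seccomp_load(ctx) != 0)
        return 1;

    // After this point, any syscall not explicitly allowed → SIGSYS (kill)

    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));  // ✓ Allowed (fd opened earlier)
    if (n > 0)
        write(STDOUT_FILENO, buf, n);        // ✓ Allowed (fd 1)
    open("/etc/passwd", O_RDONLY);           // ✗ Killed! (open/openat not allowed)

    seccomp_release(ctx);
    return 0;
}

// Compile: gcc -o sandbox sandbox.c -lseccomp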

Raw Seccomp-BPF:

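#include <stddef.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

int install_seccomp_filter(void) {
    struct sock_filter filter[] = {
        // Refuse to run under a different syscall ABI (numbers would differ)
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

        // Load syscall number
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),

        // Allow exit_group syscall (used by exit() and by returning from main)
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        // Allow write syscall
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        // Kill on any other syscall
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    };

    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    // Required for unprivileged processes: promise never to gain privileges
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

int main(void) {
    if (install_seccomp_filter() != 0)
        return 1;

    write(1, "Hello\n", 6);  // ✓ Allowed
    getpid();                 // ✗ Killed (SIGSYS)

    return 0;
}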

Seccomp Actions:

Action	Effect
`SECCOMP_RET_KILL_PROCESS`	Kill entire process
`SECCOMP_RET_KILL_THREAD`	Kill only current thread
`SECCOMP_RET_TRAP`	Send SIGSYS signal
`SECCOMP_RET_ERRNO`	Return error code
`SECCOMP_RET_TRACE`	Notify tracer (ptrace)
`SECCOMP_RET_LOG`	Log and allow
`SECCOMP_RET_ALLOW`	Allow syscall

Real-World Usage:

# Chrome sandbox
ps aux | grep chrome
# ... --type=renderer --enable-sandbox ...

# Docker seccomp profile
docker run --security-opt seccomp=default.json alpine sh

# systemd service with seccomp
cat /etc/systemd/system/myservice.service
# [Service]
# SystemCallFilter=@system-service
# SystemCallFilter=~@privileged @resources

# View seccomp status of process
grep Seccomp /proc/self/status
# Seccomp: 2  (mode 2 = filter active)

4. Mandatory Access Control (MAC)

4.1 SELinux (Security-Enhanced Linux)

SELinux adds mandatory access control on top of DAC.

DAC says: "Can user alice read file.txt?"
  → Check: alice's UID vs file owner, group, permissions

SELinux says: "Can process with label X access file with label Y?"
  → Check: Policy rules for (process_label, file_label, operation)

Both must succeed for access!

SELinux Components:

┌─────────────────────────────────────────────────┐
│              SELinux Policy                      │
│  ┌───────────────────────────────────────────┐  │
│  │  Type Enforcement (TE) Rules              │  │
│  │  allow httpd_t http_port_t:tcp_socket bind│  │
│  │  allow httpd_t httpd_sys_content_t:file r │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  Security Contexts (Labels)               │  │
│  │  user:role:type:level                     │  │
│  │  system_u:system_r:httpd_t:s0             │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
                     ↓
         ┌──────────────────────┐
         │  LSM (Linux Security │
         │  Module) Framework   │
         └──────────────────────┘
                     ↓
      Kernel enforces at runtime

Security Context:

# View file context
ls -Z /var/www/html/index.html
# system_u:object_r:httpd_sys_content_t:s0 /var/www/html/index.html
#    │        │           │              │
#    │        │           │              └─ MLS level (sensitivity)
#    │        │           └─ Type (most important!)
#    │        └─ Role
#    └─ User

# View process context
ps -Z 1234
# system_u:system_r:httpd_t:s0 /usr/sbin/httpd

# Change file context
chcon -t httpd_sys_content_t /var/www/html/newfile.html

# Restore default contexts
restorecon -Rv /var/www/html/

Type Enforcement Rules:

# Rules in policy source form (the kernel loads the compiled binary policy from /etc/selinux/targeted/policy/)

# Allow httpd to bind TCP sockets to ports labeled http_port_t (port 80, 443)
allow httpd_t http_port_t:tcp_socket name_bind;

# Allow httpd to read files labeled httpd_sys_content_t
allow httpd_t httpd_sys_content_t:file { read getattr open };

# Allow httpd to write to log files
allow httpd_t httpd_log_t:file { write append create };

# Deny (example)
# If no rule allows, default is deny!
# httpd_t trying to access user_home_t → DENIED
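The default-deny lookup behind type enforcement can be modeled as a table search. This is only a conceptual sketch: the rule table, struct te_rule, and te_allowed are hypothetical, and the real kernel evaluates a compiled policy and caches decisions in the access vector cache instead of comparing strings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct te_rule { const char *source, *target, *tclass, *perm; };

/* Hypothetical in-memory copy of a few allow rules. */
const struct te_rule rules[] = {
    { "httpd_t", "httpd_sys_content_t", "file", "read"   },
    { "httpd_t", "httpd_sys_content_t", "file", "open"   },
    { "httpd_t", "httpd_log_t",         "file", "append" },
};

/* Default deny: access is granted only if an allow rule matches exactly. */
bool te_allowed(const char *source, const char *target,
                const char *tclass, const char *perm) {
    assert(source && target && tclass && perm);
    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        const struct te_rule *r = &rules[i];
        if (strcmp(r->source, source) == 0 && strcmp(r->target, target) == 0 &&
            strcmp(r->tclass, tclass) == 0 && strcmp(r->perm, perm) == 0)
            return true;
    }
    return false;   /* no rule: denied, and logged as an AVC denial */
}
```

There is no "deny" entry anywhere in the table: httpd_t reading a user_home_t file fails simply because no rule mentions that pair.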

SELinux Modes:

Enforcing

# Active enforcement
getenforce
# Enforcing

# Violations blocked
# AVC denials logged

sestatus
# SELinux status: enabled
# Current mode: enforcing

Permissive

# Log-only mode
setenforce 0

getenforce
# Permissive

# Violations logged
# but NOT blocked

# Good for debugging

Disabled

# SELinux completely off

# Edit /etc/selinux/config
SELINUX=disabled

# Reboot required

# NO security benefit!

Debugging SELinux:

# View denials
ausearch -m avc -ts recent

# Example denial
type=AVC msg=audit(1234567890.123:456): avc: denied { read } for pid=1234
  comm="httpd" name="secret.txt" dev="sda1" ino=123456
  scontext=system_u:system_r:httpd_t:s0
  tcontext=system_u:object_r:user_home_t:s0
  tclass=file permissive=0

# Translation: httpd_t tried to read user_home_t file → DENIED

# Generate policy module to allow
audit2allow -a -M mypolicy
# module mypolicy 1.0;
# require {
#     type httpd_t;
#     type user_home_t;
#     class file read;
# }
# allow httpd_t user_home_t:file read;

# Install policy module
semodule -i mypolicy.pp

# List loaded modules
semodule -l

# Remove module
semodule -r mypolicy

SELinux Booleans (runtime toggles):

# List all booleans
getsebool -a | grep httpd
# httpd_can_network_connect --> off
# httpd_can_network_connect_db --> off
# httpd_enable_cgi --> on

# Enable httpd network connections
setsebool -P httpd_can_network_connect on

# -P makes it persistent across reboot

4.2 AppArmor

AppArmor is path-based MAC (vs SELinux’s label-based).

SELinux: "Process with label X can access file with label Y"
  → Requires labeling entire filesystem

AppArmor: "Process can access /var/www/* with read permission"
  → Based on filesystem paths (easier to understand)

AppArmor Profile:

# /etc/apparmor.d/usr.sbin.nginx
#include <tunables/global>

/usr/sbin/nginx {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  capability dac_override,
  capability net_bind_service,
  capability setgid,
  capability setuid,

  /etc/nginx/** r,
  /var/log/nginx/** rw,
  /var/www/** r,
  /run/nginx.pid rw,

  # Network
  network inet stream,
  network inet6 stream,

  # Deny everything else (default)
}
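Path-based matching can be approximated with POSIX fnmatch(3). The sketch below copies the file rules from the nginx profile above into a hypothetical table; it is an approximation only, since AppArmor compiles its own glob syntax (where * and ** differ) and fnmatch without FNM_PATHNAME lets any wildcard cross directory separators.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct path_rule { const char *glob; const char *perms; };

/* File rules of the nginx profile, as a hypothetical lookup table. */
const struct path_rule nginx_rules[] = {
    { "/etc/nginx/**",     "r"  },
    { "/var/log/nginx/**", "rw" },
    { "/var/www/**",       "r"  },
    { "/run/nginx.pid",    "rw" },
};

/* True if some rule matches the path and grants perm ('r' or 'w'). */
bool path_allowed(const char *path, char perm) {
    assert(path != NULL && perm != '\0');
    for (size_t i = 0; i < sizeof(nginx_rules) / sizeof(nginx_rules[0]); i++) {
        if (fnmatch(nginx_rules[i].glob, path, 0) == 0 &&
            strchr(nginx_rules[i].perms, perm) != NULL)
            return true;
    }
    return false;   /* default deny */
}
```

Because the decision depends on the path string alone, renaming or hard-linking a file changes which rule applies; label-based systems attach the decision to the file itself.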

Profile Modes:

# Enforce mode
aa-enforce /usr/sbin/nginx

# Complain mode (log-only)
aa-complain /usr/sbin/nginx

# Disable profile
aa-disable /usr/sbin/nginx

# View status
aa-status
# apparmor module is loaded.
# 12 profiles are loaded.
# 10 profiles are in enforce mode.
# 2 profiles are in complain mode.

Creating Profiles:

# Generate profile automatically
aa-genprof /usr/bin/myapp

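# Steps:
# 1. aa-genprof puts the app's profile in complain (learning) mode
# 2. You exercise all app functionality in another terminal
# 3. aa-genprof scans the logged accesses and asks which to allow
# 4. The resulting profile is saved and switched to enforce mode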

# Manually create profile
cat > /etc/apparmor.d/usr.bin.myapp <<EOF
/usr/bin/myapp {
  #include <abstractions/base>

  /etc/myapp/** r,
  /var/lib/myapp/** rw,
  /tmp/** rw,

  capability net_bind_service,

  deny /etc/shadow r,
}
EOF

# Load profile
apparmor_parser -r /etc/apparmor.d/usr.bin.myapp

SELinux vs AppArmor:

Feature	SELinux	AppArmor
Granularity	Very fine (labels)	Coarse (paths)
Complexity	High	Low
Performance	Small overhead	Very small
Learning curve	Steep	Gentle
Flexibility	Maximum	Good
Default	RHEL, Fedora, CentOS	Debian, Ubuntu, SUSE

5. Microarchitectural Attacks & Mitigations

5.1 Spectre & Meltdown

Speculative Execution: CPU predicts branch and executes ahead, then discards if wrong.

// Vulnerable code
if (x < array1_size) {         // Bounds check
    y = array2[array1[x]];     // Out-of-bounds access
}

Without Speculation:
1. Check: x < array1_size?
2. If true, execute access
3. If false, skip

With Speculation (vulnerable):
1. CPU predicts branch will be taken
2. Speculatively executes: y = array2[array1[x]]
   Even if x >= array1_size!
3. Loads array1[x] (out of bounds!)
4. Uses it to index array2
5. array2[...] brought into cache ← SIDE EFFECT!
6. Branch misprediction detected
7. Architectural state rolled back
8. BUT: Cache state NOT rolled back!

Attacker observes cache timing → leaks array1[x]!

Meltdown (CVE-2017-5754):

// Leak kernel memory from user space (pseudocode)

// 1. Flush every probe line from the cache
for (int i = 0; i < 256; i++)
    clflush(&probe_array[i * 4096]);

// 2. Access kernel memory. The load faults architecturally, but on vulnerable
//    CPUs dependent instructions run transiently before the fault retires
unsigned char kernel_byte = *(volatile unsigned char *)kernel_address;

// 3. Use leaked byte to index array (executes only transiently)
char dummy = probe_array[kernel_byte * 4096];
// This brings probe_array[kernel_byte * 4096] into cache

// 4. The fault is raised and the transient results are discarded
//    (the attacker catches SIGSEGV or suppresses the fault, e.g. with TSX)
// But probe_array[...] is NOW in cache!

// 5. Time accesses to probe_array
for (int i = 0; i < 256; i++) {
    uint64_t t0 = rdtsc();
    dummy = probe_array[i * 4096];
    uint64_t t1 = rdtsc();

    if (t1 - t0 < THRESHOLD) {
        // Cache hit! i == kernel_byte
        printf("Leaked kernel byte: 0x%02x\n", i);
    }
}

// Result: Leaked kernel memory byte-by-byte at ~100 KB/s!

Mitigation: KPTI (Kernel Page Table Isolation):

Without KPTI:
┌────────────────────────────────────┐
│  User Page Tables                  │
│  ┌──────────────────────────────┐  │
│  │  User Space Mappings          │  │
│  │  0x00000000 - 0x7fffffffffff  │  │
│  ├──────────────────────────────┤  │
│  │  Kernel Space Mappings        │  │
│  │  0xffff800000000000 - ...     │  │ ← Meltdown can read this!
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

With KPTI (two sets of page tables):
┌────────────────────────────────────┐
│  User Page Tables                  │
│  ┌──────────────────────────────┐  │
│  │  User Space Mappings          │  │
│  ├──────────────────────────────┤  │
│  │  Minimal Kernel (entry/exit)  │  │ ← Only small trampoline
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

┌────────────────────────────────────┐
│  Kernel Page Tables                │
│  ┌──────────────────────────────┐  │
│  │  User Space Mappings          │  │
│  ├──────────────────────────────┤  │
│  │  Full Kernel Space            │  │ ← Full kernel mapped
│  └──────────────────────────────┘  │
└────────────────────────────────────┘

On syscall: Switch from User PT → Kernel PT (CR3 register swap)
On return:  Switch from Kernel PT → User PT

Cost: ~0-30% performance penalty depending on workload (paid on every kernel entry/exit, so syscall-heavy workloads suffer most)

Kernel Implementation (conceptual sketch; the real code is in arch/x86/mm/pti.c):

// Enable KPTI (pseudocode: kernel_pgd and map_entry_trampoline are illustrative names)
void pti_init(void) {
    if (!boot_cpu_has_bug(X86_BUG_CPU_MELTDOWN))
        return;  // CPU not vulnerable

    pr_info("Kernel/User page tables isolation: enabled\n");

    // Allocate a separate top-level page table for user mode
    pgd_t *user_pgd = (pgd_t *)__get_free_page(GFP_KERNEL);

    // Share the user-space half of the address space
    // (entries below KERNEL_PGD_BOUNDARY; the kernel half stays unmapped)
    clone_pgd_range(user_pgd, kernel_pgd, KERNEL_PGD_BOUNDARY);

    // Map minimal kernel trampolines (entry/exit stubs)
    map_entry_trampoline(user_pgd);

    // mm->pgd keeps pointing at the full kernel tables; the user copy
    // is loaded into CR3 on every return to user mode
}

// Syscall entry: switch to kernel page tables
ENTRY(entry_SYSCALL_64)
    swapgs                        // Swap GS (per-CPU data)
    movq %rsp, PER_CPU_VAR(rsp_scratch)    // Save user stack pointer

    /* Switch page tables; %rsp is free to use as scratch now */
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp  // Load kernel CR3

    /* Load the kernel stack only after the switch */
    movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

    /* ... handle syscall ... */

    /* Exit: switch back while running on the entry trampoline stack */
    SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi  // Load user CR3
    /* ... restore %rdi and the user %rsp ... */
    swapgs
    sysretq
END(entry_SYSCALL_64)

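Spectre (CVE-2017-5753 bounds check bypass, CVE-2017-5715 branch target injection):

// Bounds Check Bypass (Spectre v1)
// Branch Target Injection (Spectre v2) uses the same cache side channel,
// but mistrains indirect-branch prediction instead of a conditional branch.

// Victim code
void victim_function(size_t x) {
    if (x < array1_size) {
        y = array2[array1[x] * 4096];  // array2 doubles as the probe array
    }
}

// Attacker code
void attack() {
    // 1. Train branch predictor
    for (int i = 0; i < 1000; i++) {
        victim_function(valid_x);  // Train: branch TAKEN
    }

    // 2. Flush the probe array and the bound from cache
    for (int i = 0; i < 256; i++)
        clflush(&array2[i * 4096]);
    clflush(&array1_size);  // Slow bounds check widens the speculation window

    // 3. Call with malicious x
    victim_function(malicious_x);  // x >= array1_size

    // CPU speculatively executes (branch predictor says TAKEN)
    // Leaks out-of-bounds memory into cache

    // 4. Time side-channel to recover
    for (int i = 0; i < 256; i++) {
        t0 = rdtsc();
        dummy = array2[i * 4096];
        t1 = rdtsc();

        if (t1 - t0 < THRESHOLD) {
            printf("Leaked: 0x%02x\n", i);
        }
    }
}

// Result: leaks any memory the victim can read, across software-enforced
// boundaries (e.g. a JavaScript sandbox, or the kernel via a syscall gadget)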

Mitigation: Retpoline (Return Trampoline):

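; Traditional indirect jump (vulnerable)
jmp *%rax

; Retpoline (safe)
    call set_up_target     ; Pushes address of capture_spec (stack and RSB)
capture_spec:
    pause                  ; Spin-loop hint
    lfence                 ; Stop speculation from running ahead
    jmp capture_spec       ; Only ever reached speculatively
set_up_target:
    mov %rax, (%rsp)       ; Overwrite return address with the real target
    ret                    ; Architecturally jumps to *%rax

; RET is predicted from the return stack buffer, which points at capture_spec,
; so mis-speculation spins harmlessly instead of following a poisoned predictor entry
; Indirect jump converted to return instruction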

Kernel Implementation:

# Compiler flags (kernel Makefile)
KBUILD_CFLAGS += -mindirect-branch=thunk-extern
KBUILD_CFLAGS += -mindirect-branch-register

// Generated code
// Before:
//   call *%rax
// After:
//   call __x86_indirect_thunk_rax

__x86_indirect_thunk_rax:
    call set_up_target
capture_spec:
    pause
    lfence
    jmp capture_spec      // Reached only by speculation
set_up_target:
    mov %rax, (%rsp)      // Executed for real: replaces the return address
    ret                   // Returns to the address that was in %rax

Hardware Mitigations:

# Check CPU vulnerabilities
cat /sys/devices/system/cpu/vulnerabilities/*

# meltdown: Mitigation: PTI
# spectre_v1: Mitigation: usercopy/swapgs barriers and __user pointer sanitization
# spectre_v2: Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, RSB filling

# IBRS (Indirect Branch Restricted Speculation)
# IBPB (Indirect Branch Predictor Barrier)
# STIBP (Single Thread Indirect Branch Predictors)
# SSBD (Speculative Store Bypass Disable)

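# Disable mitigations (for benchmarking)
# WARNING: Insecure!
# Mainline kernels use boot parameters:
#   pti=off (or nopti), spectre_v2=off (or nospectre_v2), mitigations=off
# RHEL/CentOS kernels additionally expose runtime debugfs switches:
echo 0 > /sys/kernel/debug/x86/pti_enabled
echo 0 > /sys/kernel/debug/x86/retp_enabled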

5.2 Rowhammer

DRAM vulnerability: Rapidly accessing one row can flip bits in adjacent rows.

DRAM Organization:
┌─────────────────────────────────┐
│  Bank 0                         │
│  ┌───────────────────────────┐  │
│  │ Row 0   [data]            │  │
│  │ Row 1   [data] ← Target   │  │ ← Victim row
│  │ Row 2   [data]            │  │ ← Hammered (read repeatedly)
│  │ Row 3   [data] ← Target   │  │ ← Victim row
│  │ ...                       │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘

Attack:
1. Find adjacent rows in DRAM
2. Rapidly read from Row 2 (millions of times)
3. Electrical interference causes bit flips in Row 1 and Row 3
4. Attacker doesn't directly access victim rows!

Exploitation:

// Rowhammer exploit (simplified)

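// 1. Spray memory with target pattern
// (unsigned char: with plain char, 0xFF reads back as -1 and the
//  comparison against 0xFF in step 4 would always report a flip)
unsigned char *spray[1000];
for (int i = 0; i < 1000; i++) {
    spray[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    memset(spray[i], 0xFF, 4096);  // All bits set
}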

// 2. Find adjacent rows (via DRAM addressing)
// find_row_address() is a hypothetical helper; rows 2 and 4 sandwich
// victim row 3 (double-sided hammering)
volatile uint64_t *hammer_addr1 = find_row_address(2);
volatile uint64_t *hammer_addr2 = find_row_address(4);

// 3. Hammer rows
for (int i = 0; i < 1000000; i++) {
    (void)*hammer_addr1;  // Volatile read (causes DRAM row activation)
    (void)*hammer_addr2;
    clflush((void *)hammer_addr1);  // Evict from cache (force DRAM access)
    clflush((void *)hammer_addr2);
}

// 4. Check for bit flips in victim rows
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 4096; j++) {
        if (spray[i][j] != 0xFF) {
            printf("Bit flip at %p: 0x%02x\n", &spray[i][j], spray[i][j]);
        }
    }
}

// Real attacks:
// - Flip bit in page table → gain access to kernel memory
// - Flip bit in SELinux context → privilege escalation
// - Flip bit in RSA public key modulus → modulus becomes factorable (Flip Feng Shui)

Mitigations:

ECC Memory

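Error-Correcting Code (ECC):
- Detects and corrects single-bit errors (per memory word, with standard SECDED codes)
- Detects (but can't correct) double-bit errors
- Three or more flips in one word can go undetected (demonstrated by the ECCploit attack)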

Cost: ~10% more expensive
Performance: Slight overhead

Widely used in servers, rare in consumer devices

Target Row Refresh (TRR)

Hardware solution by DRAM vendors:

- Monitor row activation counters
- If row accessed frequently, refresh adjacent rows
- Prevents charge leak that causes bit flips

Implemented in DDR4/DDR5 DRAM

Effectiveness: Good but not perfect
(many-sided hammering patterns such as TRRespass overwhelm the tracker and still flip bits)

Software Mitigations

# Limit cache flush instructions (clflush)
# (Requires kernel patch)

# Increase DRAM refresh rate
# (Reduces performance)

# Memory deduplication disabled
echo 0 > /sys/kernel/mm/ksm/run

# Prevent predictable physical addresses
# (KASLR + physical address randomization)

OS-Level

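// Rowhammer detection in the style of ANVIL (illustrative pseudocode:
// read_pmc(), sampled_hot_rows() and refresh_neighbors() are hypothetical helpers)
// Kernel detects excessive row activations

void dram_protect(void) {
    // Monitor last-level cache misses (proxy for DRAM row activations;
    // TLB misses do not track DRAM accesses)
    u64 llc_misses = read_pmc(LLC_MISS_EVENT);

    if (llc_misses > THRESHOLD) {
        // Potential rowhammer attack:
        // reading the rows next to the hot rows recharges (refreshes) them.
        // Flushing caches with wbinvd would not refresh DRAM.
        refresh_neighbors(sampled_hot_rows());

        // Log for analysis
        pr_warn("Potential Rowhammer attack detected\n");
    }
}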

6. Sandboxing Techniques

6.1 Namespaces (Containers)

Linux namespaces isolate resources between processes.

// 8 types of namespaces (the time namespace was added in Linux 5.6)

#define CLONE_NEWNS     0x00020000  // Mount namespace
#define CLONE_NEWUTS    0x04000000  // UTS (hostname) namespace
#define CLONE_NEWIPC    0x08000000  // IPC namespace
#define CLONE_NEWPID    0x20000000  // PID namespace
#define CLONE_NEWNET    0x40000000  // Network namespace
#define CLONE_NEWUSER   0x10000000  // User namespace
#define CLONE_NEWCGROUP 0x02000000  // Cgroup namespace
#define CLONE_NEWTIME   0x00000080  // Time namespace
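When auditing a sandbox it helps to see which namespaces a given clone()/unshare() flag word actually requests. The sketch below decodes a flag mask into the short names used under /proc/<pid>/ns/. The table repeats the kernel flag values (including CLONE_NEWTIME, 0x00000080, added in Linux 5.6) so that it compiles without Linux headers; describe_ns_flags is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Flag value -> name as it appears under /proc/<pid>/ns/ */
static const struct ns_flag {
    unsigned long flag;
    const char *name;
} ns_flags[] = {
    { 0x00020000UL, "mnt"    },
    { 0x04000000UL, "uts"    },
    { 0x08000000UL, "ipc"    },
    { 0x20000000UL, "pid"    },
    { 0x40000000UL, "net"    },
    { 0x10000000UL, "user"   },
    { 0x02000000UL, "cgroup" },
    { 0x00000080UL, "time"   },
};

/* Writes a comma-separated list of requested namespaces into buf
 * (truncated if buf is too small) and returns how many were requested. */
static size_t describe_ns_flags(unsigned long flags, char *buf, size_t len)
{
    size_t count = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(ns_flags) / sizeof(ns_flags[0]); i++) {
        if (!(flags & ns_flags[i].flag))
            continue;
        size_t used = strlen(buf);
        snprintf(buf + used, len - used, "%s%s",
                 count ? "," : "", ns_flags[i].name);
        count++;
    }
    return count;
}
```

For example, the mask CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWIPC (0x6C020000) decodes to five namespaces, with no user namespace among them.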

Creating Isolated Environment:

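#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

int sandbox_init(void *arg) {
    (void)arg;

    // Make mounts private so changes do not propagate to the host
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
        perror("mount");
        return 1;
    }

    // New hostname (visible only inside the UTS namespace)
    sethostname("sandbox", 7);

    // New root filesystem
    // (a privileged process can escape chroot; production code uses pivot_root)
    if (chroot("/var/sandbox") == -1 || chdir("/") == -1) {
        perror("chroot");
        return 1;
    }

    // Execute sandboxed program
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return 1;
}

int main(void) {
    // The child needs its own stack; 4 KB inside main's frame is too small
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return 1;

    // Create new namespaces (requires CAP_SYS_ADMIN, e.g. run as root)
    int flags = CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID |
                CLONE_NEWNET | CLONE_NEWIPC;

    // Clone process with new namespaces (stack grows down: pass its top)
    pid_t pid = clone(sandbox_init, stack + STACK_SIZE, flags | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");
        return 1;
    }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}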

PID Namespace (process isolation):

// Parent namespace
// PID 1: init
// PID 123: parent
// PID 124: child (clone with CLONE_NEWPID)

// Inside child's PID namespace
getpid();  // Returns 1 (child is PID 1 in its namespace)

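// Child can only see processes in its namespace, but tools such as ps
// read /proc, so /proc must be remounted inside the new namespace first:
//   mount -t proc proc /proc
//   ps aux   # now shows only processes in this namespace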

// Parent can still see child
// PID 124 in parent namespace == PID 1 in child namespace

Network Namespace (network isolation):

# Create new network namespace
ip netns add sandbox

# Execute command in namespace
ip netns exec sandbox ip link list
# 1: lo: <LOOPBACK> state DOWN
# (Only loopback, no network access!)

# Create virtual interface pair
ip link add veth0 type veth peer name veth1

# Move veth1 to sandbox namespace
ip link set veth1 netns sandbox

# Configure networking
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up

ip netns exec sandbox ip addr add 10.0.0.2/24 dev veth1
ip netns exec sandbox ip link set veth1 up

# Now sandbox can communicate via veth interface

6.2 Chrome Multi-Process Sandbox

Chrome Architecture:
────────────────────

┌───────────────────────────────────────────────┐
│             Browser Process                    │
│  - Runs with full privileges                  │
│  - Manages windows, tabs, plugins             │
│  - Opens files, sockets on behalf of renderers│
│  - Passes FDs via SCM_RIGHTS                  │
└────────────┬──────────────────────────────────┘
             │
     ┌───────┼───────┬──────────┐
     │       │       │          │
┌────▼─────┐ │  ┌────▼─────┐  ┌▼──────────┐
│ Renderer │ │  │ Renderer │  │   GPU     │
│ (Tab 1)  │ │  │ (Tab 2)  │  │  Process  │
│          │ │  │          │  │           │
│ Sandbox: │ │  │ Sandbox: │  │ Sandbox:  │
│ - seccomp│ │  │ - seccomp│  │ - seccomp │
│ - No FS  │ │  │ - No FS  │  │ - Limited │
│ - No net │ │  │ - No net │  │   access  │
└──────────┘ │  └──────────┘  └───────────┘
             │
        ┌────▼─────┐
        │  Plugin  │
        │ Process  │
        │ (Flash)  │
        │ Sandbox  │
        └──────────┘

Sandbox Restrictions (Linux):
1. Seccomp-BPF: Allow only ~30 syscalls
2. Namespaces: PID, NET, IPC isolation
3. chroot: Fake root filesystem
4. No setuid/setgid
5. No capabilities
6. Read-only /proc, /sys

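Chrome Sandbox Code (illustrative sketch of the layering, loosely based on sandbox/linux/):

// Renderer process startup
// Order matters: every step that needs privilege must run before the
// step that removes it, and the seccomp filter goes last.

void RendererMain() {
    // 1. Enter namespaces (needs CAP_SYS_ADMIN; the new PID namespace
    //    applies to children created afterwards)
    unshare(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWIPC);

    // 2. chroot to empty directory (needs CAP_SYS_CHROOT)
    chroot("/var/empty");
    chdir("/");

    // 3. Drop privileges: group first, because setgid() fails once the UID is dropped
    setgid(nobody_gid);
    setuid(nobody_uid);

    // 4. Drop all remaining capabilities
    drop_all_capabilities();

    // 5. Enable no_new_privs (required before an unprivileged process
    //    may install a seccomp filter)
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

    // 6. Install seccomp filter last: it would kill the process on the
    //    setup syscalls above
    install_renderer_seccomp_filter();

    // 7. Run renderer
    RunRendererLoop();
}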

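void install_renderer_seccomp_filter() {
    // Allow only essential syscalls; anything else kills the process
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        abort();

    // Read/write/close
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);

    // Memory management
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);

    // IPC (to browser process)
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(recvmsg), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sendmsg), 0);

    // Process exit (without this, even a clean exit is killed by the filter)
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // DENY (by default): open, socket, execve, fork, etc.
    // A real renderer needs a longer allow list (futex, clock_gettime, ...)

    if (seccomp_load(ctx) != 0)
        abort();
    seccomp_release(ctx);  // Frees the context; the loaded filter stays active
}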

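Escape Detection (hypothetical ptrace-based monitor in the browser process):

// Browser monitors renderer health
// Note: PTRACE_EVENT_SECCOMP is delivered only for filter rules that
// return SCMP_ACT_TRACE. A filter whose action is SCMP_ACT_KILL
// terminates the renderer without notifying the tracer.

void MonitorRenderer(int renderer_pid) {
    // Check if renderer tries forbidden syscalls
    ptrace(PTRACE_SEIZE, renderer_pid, NULL, PTRACE_O_TRACESECCOMP);

    while (1) {
        int status;
        if (waitpid(renderer_pid, &status, 0) == -1)
            break;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;  // Renderer is gone

        if (WIFSTOPPED(status) && status >> 8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8))) {
            // Seccomp violation detected!
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, renderer_pid, NULL, &regs);

            long syscall_nr = regs.orig_rax;
            log_security_violation(renderer_pid, syscall_nr);

            // Kill malicious renderer
            kill(renderer_pid, SIGKILL);
            waitpid(renderer_pid, &status, 0);  // Reap it
            respawn_renderer();
            break;  // This PID no longer exists
        }
        // Other stops (signals etc.) would be resumed with PTRACE_CONT here
    }
}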

7. Interview Questions & Answers

Q1: How does NX/DEP prevent code execution on the stack?

NX (No-Execute) / DEP (Data Execution Prevention) uses the CPU’s NX bit in page table entries.

Page Table Entry Structure (x86-64):

Bit 63: NX (No-Execute) bit
When set: Page cannot be executed (will fault with #PF if IP points here)
When clear: Page is executable
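To make the bit layout concrete, here is a small sketch that decodes a raw 64-bit PTE value. The PTE_* masks follow the x86-64 layout described above (bit 0 present, bit 1 writable, bit 63 NX); the helper names are illustrative and not kernel APIs. It assumes NX is enabled in the CPU (EFER.NXE set), otherwise bit 63 is not interpreted as NX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT (UINT64_C(1) << 0)   /* Page is mapped */
#define PTE_RW      (UINT64_C(1) << 1)   /* Writable */
#define PTE_USER    (UINT64_C(1) << 2)   /* User-mode accessible */
#define PTE_NX      (UINT64_C(1) << 63)  /* No-Execute */

/* A present page is executable only when the NX bit is clear. */
static bool pte_is_executable(uint64_t pte)
{
    return (pte & PTE_PRESENT) && !(pte & PTE_NX);
}

/* W^X violation: present, writable and executable at the same time. */
static bool pte_violates_wx(uint64_t pte)
{
    return (pte & PTE_PRESENT) && (pte & PTE_RW) && !(pte & PTE_NX);
}

/* Mark an existing mapping non-executable (what the kernel does for
 * stack and heap pages). */
static uint64_t pte_set_nx(uint64_t pte)
{
    assert(pte & PTE_PRESENT);
    return pte | PTE_NX;
}
```

A writable data page such as 0x8000000000000067 (present, writable, user, accessed, dirty, NX set) is therefore not executable, while the same value with bit 63 cleared would violate W^X.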

Kernel Implementation:

// When mapping stack
vma->vm_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
// NO VM_EXEC flag!

// Page table entry will have NX bit SET:
// vm_page_prot is derived from vm_flags, and without VM_EXEC it includes NX
pte = pfn_pte(pfn, vma->vm_page_prot);
set_pte(pte_addr, pte);

Protection:

1. Attacker overflows buffer on stack
2. Injects shellcode
3. Overwrites return address to point to shellcode
4. Function returns, jumps to shellcode address
5. CPU checks NX bit → Page is not executable
6. #PF (Page Fault) → Kernel kills process (SIGSEGV)

W^X Policy: Page is writable OR executable, never both.

Stack/Heap: Writable, NOT executable
Code: Executable, NOT writable
Prevents: Code injection attacks

Bypass: Return-Oriented Programming (ROP) - reuse existing executable code instead of injecting new code.

Q2: Explain ASLR and how it prevents exploitation. What are its weaknesses?

ASLR (Address Space Layout Randomization) randomizes memory layout at program start.

Randomized Regions:

Stack base address
Heap base address
Libraries (libc, etc.)
Executable base (if PIE - Position Independent Executable)
vDSO, vvar

Entropy (x86-64 Linux, approximate defaults; mmap entropy is tunable via vm.mmap_rnd_bits):

Stack: 22 bits → ~4.2 million possible positions
Heap (brk offset): 13 bits → 8,192 possible positions
Libraries: 28 bits → 268 million possible positions
PIE executable: 28 bits → 268 million possible positions
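The "bits → positions" arithmetic, and what it means for a brute-force attacker, can be sketched in a few lines. The function names are illustrative. The two expectation formulas assume uniformly random placement: a layout that stays fixed between attempts (children forked from one parent) versus one that is re-randomized on every attempt (a fresh exec each time).

```c
#include <assert.h>
#include <stdint.h>

/* Number of equally likely base addresses for `bits` of entropy. */
static uint64_t aslr_positions(unsigned bits)
{
    assert(bits < 64);
    return UINT64_C(1) << bits;
}

/* Layout fixed across attempts: guessing without repetition takes
 * (N + 1) / 2 tries on average. */
static double expected_guesses_fixed(unsigned bits)
{
    return ((double)aslr_positions(bits) + 1.0) / 2.0;
}

/* Layout re-randomized on every attempt: each try succeeds with
 * probability 1/N, so the mean is N tries. */
static double expected_guesses_rerandomized(unsigned bits)
{
    return (double)aslr_positions(bits);
}
```

With 13 bits this gives 8,192 positions and about 4,096 expected guesses against a forking server, which is why low-entropy regions are considered brute-forceable when the target does not re-randomize.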

How It Prevents Exploitation:

Traditional exploit (no ASLR):

Attacker knows: libc is at 0x7ffff7a0d000
Attacker's ROP chain:
  return to 0x7ffff7a52390 (system)
  argument: 0x7ffff7b99d88 ("/bin/sh")

Works every time!

With ASLR:

Run 1: libc at 0x7f8a2e456000
Run 2: libc at 0x7f3c81de2000
Run 3: libc at 0x7fb1c9a2f000

Attacker's hardcoded addresses: WRONG!
Exploit crashes instead of succeeding

Weaknesses:

1. Information Leak:
   - Pointer disclosure → calculate base addresses → bypass ASLR
   - Format string bugs, memory corruption leaks
2. Entropy Limitations:
   - 13 bits (heap) = 8,192 attempts
   - If process doesn’t crash (fork server), brute-forceable
3. 32-bit Systems:
   - Limited address space → low entropy
   - 8 bits library randomization → 256 attempts
4. Non-PIE Executables:
   - Main executable at fixed address
   - Contains ROP gadgets at known addresses
5. Cache Timing Attacks:
   - Side-channel attacks can determine addresses

Mitigations for Weaknesses:

Use PIE (Position Independent Executable)
Fix information leaks
Re-randomize after a crash (exec a fresh process per worker instead of forking from one long-lived parent)
Use Control Flow Integrity (CFI)
Combine with other defenses (NX, stack canaries)

Q3: How do stack canaries detect buffer overflows? Can they be bypassed?

Stack Canary: Random value placed between local variables and return address.

Mechanism:

Stack Frame:
┌──────────────────┐ High address
│ Return Address   │ ← Protected
├──────────────────┤
│ Saved RBP        │
├──────────────────┤
│ CANARY (random)  │ ← Copied from the per-thread guard at fs:0x28 (TLS)
├──────────────────┤
│ Local vars       │
│ char buf[100]    │ ← Overflow starts here
└──────────────────┘ Low address

Function Prologue:
  mov rax, fs:0x28      ; Load canary from TLS
  mov [rbp-8], rax      ; Store on stack

Function Epilogue:
  mov rax, [rbp-8]      ; Load stack canary
  xor rax, fs:0x28      ; Compare with original
  je .L_OK              ; Match? OK
  call __stack_chk_fail ; Mismatch? ABORT
.L_OK:
  ret

Detection:

1. Buffer overflow overwrites local variables
2. Overflow continues, overwrites canary
3. Function reaches its epilogue, just before ret
4. Compiler-inserted check (not the kernel): stack canary == guard at fs:0x28?
5. Mismatch → __stack_chk_fail() → "stack smashing detected" → abort()

Bypass Techniques:

1. Leak Canary:

// Format string vulnerability
printf(user_input);  // User provides: "%p %p %p ..."
// Leaks stack contents, including canary!

// Attacker:
// 1. Leak canary value
// 2. Craft overflow to include correct canary value
// 3. Overflow succeeds without detection

2. Overwrite Pointer Before Canary:

char buf[100];
char *ptr = &authorized;
unsigned long canary;
void *return_address;

// Overflow overwrites ptr but not canary
strcpy(buf, malicious_input);  // Overflow only buf and ptr

// ptr now points to attacker-controlled memory
// Canary intact → No detection!

3. Fork Without Re-randomization (typical for forking servers):

// fork() copies the parent's TLS, so every child has the same canary
while (1) {
    if (fork() == 0) {
        handle_request();  // Child handles one connection
        exit(0);
    }
}

// Attacker brute-forces canary byte-by-byte
// Try 0x00, 0x01, 0x02, ... 0xFF for first byte
// If child crashes: wrong guess
// If child doesn't crash: correct! Move to next byte
// 8 bytes × 256 attempts = 2,048 attempts max
// (glibc fixes the lowest byte to 0x00, leaving 7 × 256 = 1,792)
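The byte-by-byte search can be simulated without any real exploit. In the sketch below, child_survives() is a hypothetical stand-in for "send an overflow that overwrites the first n canary bytes and observe whether the forked child crashes"; everything runs in one process and no memory is actually corrupted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CANARY_LEN 8

/* Oracle: the child survives only if every overwritten canary byte
 * matches the real value. */
static int child_survives(const uint8_t *secret, const uint8_t *guess, size_t n)
{
    return memcmp(secret, guess, n) == 0;
}

/* Recovers the canary one byte at a time, lowest address first.
 * Returns the number of oracle queries used. */
static unsigned brute_force_canary(const uint8_t secret[CANARY_LEN],
                                   uint8_t out[CANARY_LEN])
{
    unsigned attempts = 0;

    assert(secret != NULL && out != NULL);
    for (size_t pos = 0; pos < CANARY_LEN; pos++) {
        for (unsigned b = 0; b < 256; b++) {
            out[pos] = (uint8_t)b;
            attempts++;
            if (child_survives(secret, out, pos + 1))
                break;          /* this byte is correct, move on */
        }
    }
    return attempts;
}
```

The search is linear in the canary length (at most 8 × 256 queries) rather than exponential (2^64), which is the whole point of the attack: each surviving child confirms one more byte.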

4. Non-Linear Write (skipping the canary):

// A contiguous overflow cannot reach the return address without
// crossing the canary. An indexed or pointer-based write can:
//   buf[attacker_index] = attacker_value;
// (Requires knowledge of stack layout)

┌──────────────┐
│ Ret Addr     │ ← Written directly (e.g. change low byte only)
├──────────────┤
│ Saved RBP    │ ← Skip
├──────────────┤
│ Canary       │ ← Leave untouched!
├──────────────┤
│ buf[100]     │
└──────────────┘

// Targeted write changes return address without touching canary

Mitigations:

Combine with ASLR (a leaked canary alone does not reveal where to return)
Use fortified functions (__strcpy_chk via -D_FORTIFY_SOURCE=2) to prevent overflows
Re-randomize canary after fork
Stack Clash protection (-fstack-clash-protection: prevents jumping over the stack guard page)

Q4: What is the difference between capabilities and setuid? Why are capabilities better?

Traditional setuid:

# setuid binary runs with owner's privileges

ls -l /usr/bin/passwd
# -rwsr-xr-x root root /usr/bin/passwd
#    ↑ setuid bit

# When user runs passwd:
# 1. Process starts with user's UID
# 2. Kernel sees setuid bit
# 3. Sets effective UID to file owner (root)
# 4. Process has FULL root privileges

# Problem: All or nothing!
# passwd only needs to write /etc/shadow
# But gets ALL root capabilities

Capabilities:

Divide root into 41 distinct privileges:

CAP_CHOWN            - Change file ownership
CAP_DAC_OVERRIDE     - Bypass file permissions
CAP_NET_BIND_SERVICE - Bind ports < 1024
CAP_NET_RAW          - Use raw sockets
CAP_SYS_ADMIN        - System administration
CAP_SYS_MODULE       - Load kernel modules
... 35 more ...

Process gets ONLY what it needs!

Comparison:

Feature	setuid	Capabilities
Granularity	All or nothing	Fine-grained (41 capabilities)
Security	Over-privileged	Least privilege
Across exec	Re-derived from the setuid bit of each executed binary	Carried over via inheritable/ambient sets
Auditability	Hard to see why root is needed	Clear which capability is used

Example: Network Server:

// Old way: setuid root binary
int main() {
    // Running as root (UID 0)
    // Can do ANYTHING!

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    bind(sock, ..., 80);  // Bind port 80 (needs root)

    setuid(nobody);  // Drop privileges after bind

    // Problem: Race window while root
    // If exploit before setuid(), full root access!
}

// New way: Capabilities
int main() {
    // Running as nobody (UID 65534)
    // Has ONLY CAP_NET_BIND_SERVICE

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    bind(sock, ..., 80);  // Works! (has capability)

    open("/etc/shadow", O_RDONLY);  // FAIL! (no CAP_DAC_OVERRIDE)

    // Even if exploited, attacker only has port binding
    // Cannot read files, cannot exec as root, etc.
}

Setting Capabilities:

# Give binary capability instead of setuid
# Before:
chmod u+s /usr/bin/ping  # setuid root (dangerous!)

# After:
setcap cap_net_raw+ep /usr/bin/ping  # Only raw socket capability

# Verify
getcap /usr/bin/ping
# /usr/bin/ping = cap_net_raw+ep

# Now ping can create raw sockets but has NO other root powers
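For a running process, the kernel reports the effective capability set as a hexadecimal bitmask in the CapEff field of /proc/<pid>/status. The sketch below checks individual bits in such a mask. The CAP_NR_* constants repeat the numbering from <linux/capability.h> so the example compiles without kernel headers, and cap_mask_has is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Capability numbers as defined in <linux/capability.h> */
enum {
    CAP_NR_DAC_OVERRIDE     = 1,
    CAP_NR_NET_BIND_SERVICE = 10,
    CAP_NR_NET_RAW          = 13,
    CAP_NR_SYS_ADMIN        = 21,
};

/* Returns true if capability number `cap` is set in a hex mask such as
 * the CapEff value from /proc/<pid>/status. */
static bool cap_mask_has(const char *hex_mask, int cap)
{
    assert(hex_mask != NULL && cap >= 0 && cap < 64);
    uint64_t mask = strtoull(hex_mask, NULL, 16);
    return ((mask >> cap) & 1u) != 0;
}
```

A process that holds only cap_net_raw shows the mask 0000000000002000 (bit 13): cap_mask_has() reports CAP_NET_RAW as present and CAP_SYS_ADMIN as absent, which is exactly the reduced blast radius the ping example aims for.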

Why Capabilities Are Better:

Principle of Least Privilege: Only grant necessary permissions
Reduced Attack Surface: Exploit gets limited capabilities, not full root
Better Auditability: Clear why each capability is needed
Flexibility: Can grant to non-root users
Inheritance: Can design capability-aware services

Real-World Usage:

systemd services with capabilities
Docker containers (run as non-root with specific capabilities)
Network daemons (CAP_NET_BIND_SERVICE instead of setuid)

Q5: How does KPTI mitigate Meltdown? What is the performance cost?

Meltdown Vulnerability:

CPU speculatively executes kernel memory access from user mode:

// User-mode code
char kernel_byte = *(char *)0xffff880000000000;  // Kernel address

// CPU behavior:
// 1. Starts speculative execution before permission check
// 2. Loads kernel memory (should fault, but hasn't checked yet)
// 3. Uses loaded byte to index array: probe[kernel_byte * 4096]
// 4. This brings probe[...] into cache ← SIDE EFFECT!
// 5. Permission check completes → Exception!
// 6. Architectural state rolled back
// 7. But cache state remains! ← LEAK!

// Attacker measures cache timing → recovers kernel_byte

KPTI (Kernel Page Table Isolation) Solution:

Without KPTI (vulnerable):
┌─────────────────────────────┐
│ User Mode (CR3 = user_pgd)  │
│                             │
│ User virtual addresses      │
│ 0x0 - 0x7fffffffffff        │
│   │                         │
│   ├─> User pages            │
│   │                         │
│ Kernel virtual addresses    │
│ 0xffff800000000000 - ...    │ ← Mapped in user page tables!
│   │                         │ ← Meltdown can speculatively access
│   ├─> Kernel pages          │
└─────────────────────────────┘

With KPTI (secure):
User Mode:
┌─────────────────────────────┐
│ CR3 = user_pgd              │
│                             │
│ User virtual addresses      │
│ 0x0 - 0x7fffffffffff        │
│   ├─> User pages            │
│                             │
│ Kernel virtual addresses    │
│ 0xffff800000000000 - ...    │
│   ├─> MINIMAL kernel stubs  │ ← Only entry/exit trampolines!
│   │    (entry_SYSCALL_64)   │ ← Rest of kernel NOT MAPPED
└─────────────────────────────┘

Kernel Mode (after syscall):
┌─────────────────────────────┐
│ CR3 = kernel_pgd            │
│                             │
│ User virtual addresses      │
│   ├─> User pages            │
│                             │
│ Kernel virtual addresses    │
│   ├─> FULL kernel mapping   │ ← All kernel code/data accessible
└─────────────────────────────┘

Syscall Flow with KPTI:

; User-mode application
mov rax, 1        ; SYS_write
mov rdi, 1        ; fd = stdout
syscall           ; Enter kernel

; ← CPU switches to kernel mode ←

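entry_SYSCALL_64:
    ; Still using user page tables!
    swapgs                    ; Swap GS base (access per-CPU kernel data)
    mov gs:[rsp_scratch], rsp ; Save user stack pointer; rsp becomes scratch
                              ; (rax holds the syscall number and must survive)

    ; SWITCH PAGE TABLES (the expensive part!)
    mov rsp, cr3              ; Read current CR3 (user page table)
    and rsp, ~0x1000          ; Clear bit 12: kernel PGD sits 4 KB below the user PGD
    mov cr3, rsp              ; ← PAGE TABLE SWITCH (TLB flush unless PCID is used)

    mov rsp, gs:[cpu_current_top_of_stack]  ; Load kernel stack

    ; Now kernel is fully mapped
    ; Execute syscall handler...
    call do_syscall_64

    ; SWITCH BACK to user page tables
    ; (the real code saves and restores the scratch register)
    mov rdi, cr3
    or rdi, 0x1000            ; Set bit 12: select the user PGD
    mov cr3, rdi              ; ← PAGE TABLE SWITCH (TLB flush unless PCID is used)

    swapgs
    sysretq                   ; Return to user mode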

Performance Cost:

What makes it expensive:

1. CR3 Write (page table switch):
   - ~150-300 CPU cycles per switch
   - 2 switches per syscall (enter + exit)
2. TLB Flush:
   - The Translation Lookaside Buffer caches virtual→physical address translations
   - Without PCID, changing CR3 flushes the TLB (translations must be reloaded by page walks)
   - Each TLB miss then costs a page walk, up to ~100 cycles when page-table entries are not cached
3. Frequency of Syscalls:
   - I/O-heavy workloads: Many syscalls → high overhead
   - CPU-bound workloads: Few syscalls → low overhead
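A back-of-the-envelope model ties these factors together: the fraction of CPU time spent on CR3 switches is syscalls per second × 2 switches × cycles per switch, divided by the clock rate. The sketch below implements only that direct cost and ignores the indirect cost of TLB refills; the function name and the example inputs (200 cycles per switch, 3 GHz clock) are assumptions for illustration, not measurements.

```c
#include <assert.h>

/* Fraction of one core's cycles spent on CR3 switches alone.
 * Each syscall pays two switches: one on entry, one on exit. */
static double kpti_switch_overhead(double syscalls_per_sec,
                                   double cycles_per_switch,
                                   double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return syscalls_per_sec * 2.0 * cycles_per_switch / cpu_hz;
}
```

Under these assumptions, a service issuing one million syscalls per second spends about 13% of a core on page table switches, while one issuing ten thousand per second spends about 0.13%, which matches the intuition that syscall-heavy workloads are hit hardest.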

Measured Impact (varies by workload):

Workload Type	Performance Loss
CPU-intensive (scientific computing)	0-3%
Light I/O (web browsing)	3-5%
Heavy I/O (file server)	5-10%
Heavy syscalls (database, Redis)	10-30%

Optimizations:

1. PCID (Process Context ID):
   - Tag TLB entries with PCID
   - Avoid full TLB flush on CR3 switch
   - Reduces overhead to 1-5%
2. Lazy TLB Switching:
   - Kernel threads don’t switch page tables
   - They borrow the previous task’s address space (kernel mappings are identical in all of them)
3. CPUs Fixed in Hardware:
   - CPUs without the Meltdown bug (AMD, and newer Intel parts that report RDCL_NO) → no KPTI needed
   - Check: cat /sys/devices/system/cpu/vulnerabilities/meltdown
   - If it says “Not affected” → KPTI not active

Disable KPTI (for testing/benchmarking only!):

# Boot parameter
nopti        # or: pti=off

# Runtime toggle: not available on mainline kernels.
# RHEL/CentOS kernels expose a debugfs switch:
echo 0 > /sys/kernel/debug/x86/pti_enabled

# WARNING: Disabling KPTI leaves system vulnerable to Meltdown!

Q6: Explain how Spectre works and why retpolines are an effective mitigation.

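Spectre Vulnerability:

The walkthrough below uses the bounds check bypass variant (Spectre v1) because it shows the mechanism most clearly. Branch Target Injection (Spectre v2), the variant retpolines address, applies the same idea to indirect branches.

CPU Speculative Execution:

// Victim code
if (x < array_size) {                  // ← Branch
    y = array[x];                      // ← Speculative execution
    temp = probe_array[y * 4096];      // ← Secret-dependent load
}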

CPU's Branch Predictor:
- Predicts if branch will be taken or not
- Speculatively executes ahead while check happens
- If prediction wrong, rollback
- If prediction right, save time!

Problem: Rollback discards architectural state but NOT cache state!

Attack:

// Step 1: Train branch predictor
for (int i = 0; i < 1000; i++) {
    victim_function(valid_x);  // x < array_size, branch TAKEN
}
// Branch predictor learns: "This branch is ALWAYS taken"

// Step 2: Prepare side-channel
for (int i = 0; i < 256; i++) {
    clflush(&probe_array[i * 4096]);  // Flush cache
}

// Step 3: Attack with out-of-bounds x
victim_function(malicious_x);  // malicious_x >= array_size

// What happens:
// 1. Branch predictor predicts: TAKEN (based on training)
// 2. CPU speculatively executes: y = array[malicious_x]
// 3. This reads out-of-bounds memory (any secret in the victim's address space)
// 4. Uses leaked byte to index: probe_array[y * 4096]
// 5. This brings probe_array[y * 4096] into cache ← LEAK!
// 6. Branch check completes: x < array_size? FALSE
// 7. Rollback! Discard y, but cache state remains!

// Step 4: Recover leaked byte via timing
for (int i = 0; i < 256; i++) {
    t0 = rdtsc();
    temp = probe_array[i * 4096];
    t1 = rdtsc();

    if (t1 - t0 < THRESHOLD) {
        printf("Leaked byte: 0x%02x\n", i);  // Cache hit!
        break;
    }
}

// Result: Read memory that the victim can access but the attacker should not!

Why Retpolines Work:

Problem with Indirect Branches:

; Vulnerable indirect jump
jmp *%rax              ; Jump to address in rax

; Attacker can manipulate branch predictor to:
; 1. Predict wrong target
; 2. Cause speculative execution to gadget
; 3. Leak data via cache side-channel

Retpoline (Return Trampoline):

; Instead of: jmp *%rax
; Use:

    call .set_target      ; Push return address (= .capture_spec) on stack and RSB
.capture_spec:
    pause                 ; Spin-loop hint
    lfence                ; Speculation barrier
    jmp .capture_spec     ; Loop reached only by speculation
.set_target:
    mov %rax, (%rsp)      ; Replace return address with rax
    ret                   ; Return to rax

; Why this is safe:

; CPU's Return Stack Buffer (RSB):
; - Separate predictor for RET instructions
; - Tracks call/return pairs
; - Not steered by the poisoned indirect-branch predictor

; When ret executes:
; - CPU predicts target from RSB
; - RSB says: return to .capture_spec (the address pushed by call)
; - Speculative execution goes to .capture_spec and spins there
; - NOT to attacker-controlled address!
; - Once the real return address is read from the stack, execution continues at rax

Visual Comparison:

Traditional Indirect Jump (vulnerable):
┌─────────────┐
│   jmp *rax  │ → Branch predictor → Attacker controls prediction
└─────────────┘                      ↓
                              Speculative execution to gadget
                              ↓
                              Leak via cache timing

Retpoline (safe):
┌─────────────────────┐
│ call .set_target    │ → Push return addr (.capture_spec) on stack and RSB
│ .capture_spec:      │
│   pause             │ ← RSB-predicted speculation lands here
│   lfence            │
│   jmp .capture_spec │ ← Speculation contained in loop, nothing leaks
│ .set_target:        │
│   mov %rax, (%rsp)  │ → Replace return addr with rax
│   ret               │ → Architecturally returns to rax
└─────────────────────┘    (prediction NOT attacker-controlled!)

Kernel Implementation:

// Compiler generates retpolines for indirect branches
// gcc -mindirect-branch=thunk-extern

// Original code:
void (*func_ptr)(void);
func_ptr();  // Indirect call

// Compiled without retpoline:
call *%rax

// Compiled with retpoline:
call __x86_indirect_thunk_rax

// Retpoline thunk (arch/x86/lib/retpoline.S):
SYM_FUNC_START(__x86_indirect_thunk_rax)
    JMP_NOSPEC %rax
SYM_FUNC_END(__x86_indirect_thunk_rax)

// With retpolines enabled, JMP_NOSPEC expands to this sequence (simplified):
.macro RETPOLINE_JMP reg
    call    .Ldo_rop_\@
.Lspec_trap_\@:
    pause
    lfence
    jmp     .Lspec_trap_\@
.Ldo_rop_\@:
    mov     \reg, (%rsp)
    ret
.endm

Performance Impact:

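Retpolines are slower than plain indirect branches (the cost depends on how many indirect calls the workload makes; 5-20% is often quoted for call-heavy code)
But necessary for security on vulnerable CPUs
Newer CPUs provide a hardware alternative (enhanced IBRS; IBRS = Indirect Branch Restricted Speculation)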

Check Mitigations:

cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
# Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, RSB filling

# Retpoline: Software mitigation (compiler-generated)
# IBRS: Hardware mitigation (CPU feature)
# IBPB: Indirect Branch Predictor Barrier (flush predictor)
# RSB filling: Prevent RSB underflow attacks

Why Effective:

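Return instructions are predicted differently: the RSB is not steered by the poisoned indirect-branch predictor
Speculation contained: Loop prevents speculative execution reaching gadgets
No hardware support needed: Software mitigation (on Skylake-era and later Intel CPUs it must be paired with RSB filling, because an empty RSB falls back to the indirect predictor)
Comprehensive: Protects every indirect branch in code compiled with retpolines (firmware and hand-written assembly need IBRS or manual conversion)

Limitations:

Performance overhead (newer CPUs use enhanced IBRS instead)
Doesn’t protect against Spectre v1 (bounds check bypass)
Doesn’t protect against other speculative execution attacks (L1TF, MDS, etc.)
Return prediction itself can be attacked on some CPUs (Retbleed, 2022)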

Q7: Compare SELinux vs AppArmor. When would you use each?

Fundamental Difference:

SELinux: Label-based MAC

Security Context: user:role:type:level

Files:   httpd_sys_content_t
Process: httpd_t

Rule: allow httpd_t httpd_sys_content_t:file { read open };
      ─────────────── ──────────────────  ──── ────────────
         Subject          Object          Class Permissions

Decision: Based on labels (NOT paths)
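Since every SELinux decision starts from the context string, a small parser makes the user:role:type:level structure tangible. This is a minimal sketch, not libselinux: struct selinux_context and parse_selinux_context are hypothetical names, fields are capped at 63 characters, and a four-field (MLS/MCS) context is assumed. The level field may itself contain colons (for example s0:c1,c2), so only the first three colons act as separators.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIELD_LEN 64

struct selinux_context {
    char user[FIELD_LEN];
    char role[FIELD_LEN];
    char type[FIELD_LEN];
    char level[FIELD_LEN];
};

/* Copies n bytes into dst as a C string; rejects empty or oversized fields. */
static int copy_field(char *dst, const char *src, size_t n)
{
    if (n == 0 || n >= FIELD_LEN)
        return -1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return 0;
}

/* Splits "user:role:type:level" into its fields.
 * Returns 0 on success, -1 on malformed input. */
static int parse_selinux_context(const char *ctx, struct selinux_context *out)
{
    assert(ctx != NULL && out != NULL);
    char *fields[3] = { out->user, out->role, out->type };
    const char *p = ctx;

    for (int i = 0; i < 3; i++) {
        const char *colon = strchr(p, ':');
        if (colon == NULL)
            return -1;
        if (copy_field(fields[i], p, (size_t)(colon - p)) != 0)
            return -1;
        p = colon + 1;
    }
    /* Everything after the third colon is the level, colons included. */
    return copy_field(out->level, p, strlen(p));
}
```

Parsing system_u:object_r:httpd_sys_content_t:s0 yields the type httpd_sys_content_t, which is the field that type enforcement rules such as the allow rule above actually match on.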

AppArmor: Path-based MAC

Profile:
/usr/sbin/nginx {
    /etc/nginx/** r,
    /var/www/** r,
    /var/log/nginx/** rw,
    deny /etc/shadow r,
}

Decision: Based on absolute filesystem paths

Detailed Comparison:

1. Security Model:

SELinux:

Type Enforcement (TE): Subjects (processes) have types, objects (files) have types
Multi-Level Security (MLS): Confidentiality levels (Top Secret, Secret, etc.)
Multi-Category Security (MCS): Categories for compartmentalization
Very fine-grained control

AppArmor:

Path-based access control
Capabilities control
Network access control (protocol/address)
Simpler model, easier to understand

2. Complexity:

SELinux:

# Policy is complex
# Example policy snippet:
allow httpd_t httpd_sys_content_t:file { getattr read open };
allow httpd_t http_port_t:tcp_socket { bind listen };
allow httpd_t httpd_log_t:file { write append create };
allow httpd_t proc_t:file read;
allow httpd_t self:capability { setgid setuid };

# Hundreds of rules per service!
# Requires understanding of:
# - Type enforcement
# - Security contexts
# - Policy language
# - Domain transitions

AppArmor:

# Policy is readable
/usr/sbin/nginx {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  capability dac_override,
  capability net_bind_service,
  capability setgid,
  capability setuid,

  /etc/nginx/** r,
  /var/log/nginx/** rw,
  /var/www/** r,

  network inet stream,
}

# Human-readable!
# Easy to audit

3. Administration:

Task	SELinux	AppArmor
Create policy	Complex (audit2allow helps)	Simple (aa-genprof)
Debug denials	ausearch, sealert	aa-logprof, dmesg
Switch enforcing/permissive	setenforce	aa-enforce/aa-complain
View status	sestatus, getenforce	aa-status
Temporary allow	setsebool, semanage permissive	aa-complain mode

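4. Performance:

SELinux:

Labels are read from xattrs (extended attributes) and cached with the in-memory inode
Policy decisions are cached in the Access Vector Cache (AVC)
Overhead: 3-7% is a commonly quoted range; the real cost is workload-dependent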

AppArmor:

Path resolution and pattern matching on each mediated access
Simpler policy checks
Overhead: 1-3% is a commonly quoted range; the real cost is workload-dependent

5. Filesystem Requirements:

SELinux:

Requires filesystem with xattr support
Labels stored as extended attributes
ls -Z shows labels
Relabeling filesystem can be slow

AppArmor:

No special filesystem requirements
Works on any filesystem (even FAT, NFS)
No labels to manage

6. Use Cases:

Use SELinux when:

Maximum security required (government, military)
Need MLS/MCS (confidentiality levels)
Want very fine-grained control
Already familiar with it (RHEL/Fedora/CentOS)
Need label-based security (a label stays with the file when it is renamed or moved within a filesystem)

Use AppArmor when:

Simplicity preferred over maximum granularity
Easier policy management desired
Filesystem doesn’t support xattrs (NFS, FAT)
Developers/admins less experienced with MAC
Debian/Ubuntu/SUSE environment

7. Real-World Scenarios:

Scenario 1: Web Server

SELinux:

# Pre-defined policy exists
# But need to handle custom app

# App stores files in /opt/myapp/
# SELinux denies access (wrong label)

# Solution:
semanage fcontext -a -t httpd_sys_content_t "/opt/myapp(/.*)?"
restorecon -R /opt/myapp

# More denials? Debug with:
ausearch -m avc -ts recent
audit2allow -a -M mypolicy
semodule -i mypolicy.pp
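
The pattern passed to semanage fcontext is a regular expression that must match the whole path. A minimal sketch in C using POSIX extended regular expressions shows which paths "/opt/myapp(/.*)?" covers. The function name is hypothetical, and libselinux uses its own regex engine and specificity ordering, so treat this only as an approximation of the matching step.

```c
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/* Return true if `path` is matched in full by a file-context pattern.
 * The pattern is wrapped in ^...$ because file-context entries are
 * matched against the entire path, not a substring. */
bool fcontext_matches(const char *pattern, const char *path)
{
    char anchored[256];
    regex_t re;

    int n = snprintf(anchored, sizeof anchored, "^%s$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return false;                       /* pattern too long for this sketch */
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                       /* invalid regular expression */

    bool matched = regexec(&re, path, 0, NULL, 0) == 0;
    regfree(&re);
    return matched;
}
```

The optional group (/.*)? is what makes the pattern cover both the directory itself and everything beneath it, while still excluding siblings such as /opt/myapp2.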

AppArmor:

# Create profile
cat > /etc/apparmor.d/usr.sbin.myapp <<EOF
/usr/sbin/myapp {
  #include <abstractions/base>
  #include <abstractions/apache2-common>

  /opt/myapp/** r,
  /var/log/myapp/** rw,

  network inet stream,
  capability net_bind_service,
}
EOF

# Load and enforce
apparmor_parser -r /etc/apparmor.d/usr.sbin.myapp

# Done! Much simpler.

Scenario 2: Container Security

SELinux:

Docker/Podman use SELinux contexts
Each container gets a unique MCS label (a random pair of categories)
Container processes run as container_t; container files are labeled container_file_t (formerly svirt_sandbox_file_t)
Strong isolation via labels
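
The MCS part of that isolation reduces to a set comparison: a process may touch a file only if its category set contains every category on the file. The sketch below models that check in C. The names are hypothetical, sensitivity levels and category ranges such as c0.c1023 are ignored, and type enforcement still applies on top of this in a real system.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MCS_CATEGORIES 1024   /* categories c0 .. c1023 */

struct mcs_set {
    uint64_t bits[MCS_CATEGORIES / 64];
};

/* Parse the category list of a level such as "s0:c12,c345".
 * A bare "s0" yields the empty set. */
bool mcs_parse(const char *level, struct mcs_set *out)
{
    memset(out, 0, sizeof *out);

    const char *p = strchr(level, ':');
    if (p == NULL)
        return true;                        /* no categories */
    p++;
    while (*p != '\0') {
        if (*p != 'c')
            return false;
        char *end;
        unsigned long c = strtoul(p + 1, &end, 10);
        if (end == p + 1 || c >= MCS_CATEGORIES)
            return false;
        out->bits[c / 64] |= (uint64_t)1 << (c % 64);
        p = end;
        if (*p == ',')
            p++;
        else if (*p != '\0')
            return false;
    }
    return true;
}

/* True if the process level carries every category found on the file. */
bool mcs_dominates(const char *proc_level, const char *file_level)
{
    struct mcs_set proc, file;

    if (!mcs_parse(proc_level, &proc) || !mcs_parse(file_level, &file))
        return false;
    for (size_t i = 0; i < MCS_CATEGORIES / 64; i++)
        if (file.bits[i] & ~proc.bits[i])
            return false;                   /* file has a category the process lacks */
    return true;
}
```

Two containers labeled s0:c1,c2 and s0:c3,c4 therefore cannot read each other's files even though both run as the same SELinux type.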

AppArmor:

Docker uses AppArmor profiles
Default profile restricts mount, capabilities, etc.
Custom profiles for specific containers
Path-based restrictions easier to understand

8. Policy Portability:

SELinux:

Labels stored with files (xattrs)
Policy is separate from filesystem
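Moving files between systems: labels can be lost unless the copy tool preserves xattrs (for example tar --selinux or rsync -X)
Need to relabel after restoring from a backup that did not preserve labels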

AppArmor:

Policy references absolute paths
Moving profile to different system: works if paths same
But path changes require profile updates

Recommendation Matrix:

Priority	Choose
Maximum security	SELinux
Ease of use	AppArmor
Fine-grained control	SELinux
Simple policies	AppArmor
RHEL/CentOS	SELinux (default)
Debian/Ubuntu	AppArmor (default)
NFS/non-xattr FS	AppArmor
MLS/MCS required	SELinux
Container host	Both work (SELinux more common)

Can you use both?: Not in practice. Both are exclusive "major" LSMs that claim the same security hooks, so a system runs one or the other. Choose one.

Neither?: Not recommended. MAC adds a significant security layer beyond DAC.

Q8: How does seccomp-BPF work and why is it critical for container security?

Seccomp-BPF (Secure Computing with Berkeley Packet Filter):

Core Concept: Whitelist syscalls a process can make using BPF bytecode filters.

Architecture:

User Space Process
     │
     │ syscall (e.g., open, read, write)
     ↓
┌─────────────────────────────┐
│   Syscall Entry Point       │
│   (entry_SYSCALL_64)        │
└─────────┬───────────────────┘
          │
          │ ① Check: Seccomp filter installed?
          ↓
┌─────────────────────────────┐
│   Seccomp BPF Filter        │
│                             │
│   BPF Program:              │
│   - Load syscall number     │
│   - Load arguments          │
│   - Check against rules     │
│   - Return action:          │
│     • ALLOW                 │
│     • KILL                  │
│     • ERRNO                 │
│     • TRAP                  │
└─────────┬───────────────────┘
          │
          │ ② Action
          ↓
    ALLOW?  ──────────────────> Execute Syscall
    KILL?   ──────────────────> Terminate thread/process (as if by SIGSYS)
    ERRNO?  ──────────────────> Skip syscall, return error code
    TRAP?   ──────────────────> Deliver catchable SIGSYS to the process
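
Each action in the diagram is encoded in the 32-bit value the filter returns: the upper 16 bits select the action and the lower 16 bits carry data such as the errno. The sketch below decodes such a value in C. The constants are copied from <linux/seccomp.h> so the example compiles anywhere, while the RET_ prefix and function names are hypothetical. It also shows the rule the kernel applies when several filters are stacked: the most restrictive result wins, which amounts to comparing the action bits as a signed integer.

```c
#include <stdint.h>

/* Values mirror SECCOMP_RET_* in <linux/seccomp.h>. */
#define RET_KILL_PROCESS 0x80000000u
#define RET_KILL_THREAD  0x00000000u
#define RET_TRAP         0x00030000u
#define RET_ERRNO        0x00050000u
#define RET_USER_NOTIF   0x7fc00000u
#define RET_TRACE        0x7ff00000u
#define RET_LOG          0x7ffc0000u
#define RET_ALLOW        0x7fff0000u
#define RET_ACTION_FULL  0xffff0000u
#define RET_DATA         0x0000ffffu

/* Name of the action selected by a filter return value. */
const char *seccomp_action_name(uint32_t ret)
{
    switch (ret & RET_ACTION_FULL) {
    case RET_KILL_PROCESS: return "KILL_PROCESS";
    case RET_KILL_THREAD:  return "KILL_THREAD";
    case RET_TRAP:         return "TRAP";
    case RET_ERRNO:        return "ERRNO";
    case RET_USER_NOTIF:   return "USER_NOTIF";
    case RET_TRACE:        return "TRACE";
    case RET_LOG:          return "LOG";
    case RET_ALLOW:        return "ALLOW";
    default:               return "UNKNOWN";
    }
}

/* Low 16 bits: the errno for ERRNO, or a cookie for TRAP/TRACE. */
uint16_t seccomp_action_data(uint32_t ret)
{
    return (uint16_t)(ret & RET_DATA);
}

/* With stacked filters the most restrictive result is used. Treating the
 * action bits as signed puts KILL_PROCESS (top bit set) lowest. The cast
 * assumes two's-complement conversion, as on all mainstream compilers. */
uint32_t seccomp_most_restrictive(uint32_t a, uint32_t b)
{
    int32_t action_a = (int32_t)(a & RET_ACTION_FULL);
    int32_t action_b = (int32_t)(b & RET_ACTION_FULL);
    return action_a < action_b ? a : b;
}
```

For example, a return value of 0x00050001 decodes to ERRNO with data 1, which is EPERM on Linux.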

BPF Filter Structure:

// Seccomp data available to BPF program
struct seccomp_data {
    int nr;                  // Syscall number
    __u32 arch;              // Architecture (x86-64, ARM, etc.)
    __u64 instruction_pointer;
    __u64 args[6];           // Syscall arguments
};

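// BPF filter example (x86-64)
#include <stddef.h>          // offsetof
#include <linux/audit.h>     // AUDIT_ARCH_X86_64
#include <linux/filter.h>    // struct sock_filter, BPF_STMT, BPF_JUMP
#include <linux/seccomp.h>   // struct seccomp_data, SECCOMP_RET_*
#include <sys/syscall.h>     // __NR_* syscall numbers

struct sock_filter filter[] = {
    // Check the architecture first: syscall numbers differ between ABIs,
    // so a filter that skips this check can be bypassed
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

    // Load syscall number into accumulator
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),

    // Allow SYS_read (on mismatch, skip 1 instruction to the next test)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow SYS_write
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow SYS_exit
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Default: KILL (SECCOMP_RET_KILL terminates the calling thread)
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};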
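
To see how such a filter is evaluated, the following sketch interprets the three instruction forms used above (load word, jump-if-equal, return) in user space. It is not the kernel's evaluator: the struct and macro names are local stand-ins for struct sock_filter and struct seccomp_data, the opcode values are copied from <linux/bpf_common.h>, and the syscall numbers are the x86-64 ones (read = 0, write = 1, exit = 60).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the kernel structures, so the sketch needs no Linux headers. */
struct insn    { uint16_t code; uint8_t jt, jf; uint32_t k; };
struct sc_data { int32_t nr; uint32_t arch; uint64_t ip; uint64_t args[6]; };

#define OP_LD_W_ABS 0x20u   /* BPF_LD  | BPF_W   | BPF_ABS */
#define OP_JEQ_K    0x15u   /* BPF_JMP | BPF_JEQ | BPF_K   */
#define OP_RET_K    0x06u   /* BPF_RET | BPF_K             */

#define RET_ALLOW 0x7fff0000u
#define RET_KILL  0x00000000u

/* Run a filter against one syscall record and return its verdict. */
uint32_t run_filter(const struct insn *prog, size_t len, const struct sc_data *d)
{
    uint32_t acc = 0;                       /* the BPF accumulator */

    for (size_t pc = 0; pc < len; pc++) {
        const struct insn *i = &prog[pc];
        switch (i->code) {
        case OP_LD_W_ABS:                   /* acc = 32-bit word at offset k */
            if (i->k > sizeof *d - 4)
                return RET_KILL;            /* out-of-bounds load */
            memcpy(&acc, (const unsigned char *)d + i->k, 4);
            break;
        case OP_JEQ_K:                      /* skip jt or jf instructions */
            pc += (acc == i->k) ? i->jt : i->jf;
            break;
        case OP_RET_K:
            return i->k;
        default:
            return RET_KILL;                /* unsupported in this sketch */
        }
    }
    return RET_KILL;                        /* fell off the end */
}

/* The read/write/exit allowlist from the text, without the arch check. */
static const struct insn allow_read_write_exit[] = {
    { OP_LD_W_ABS, 0, 0, offsetof(struct sc_data, nr) },
    { OP_JEQ_K, 0, 1, 0 },  { OP_RET_K, 0, 0, RET_ALLOW },   /* read  */
    { OP_JEQ_K, 0, 1, 1 },  { OP_RET_K, 0, 0, RET_ALLOW },   /* write */
    { OP_JEQ_K, 0, 1, 60 }, { OP_RET_K, 0, 0, RET_ALLOW },   /* exit  */
    { OP_RET_K, 0, 0, RET_KILL },
};
```

Because jumps only move forward, every filter terminates after at most one pass over its instructions, which is part of what makes classic BPF safe to run on the syscall path.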

Container Security Use Case:

Problem: Containers share kernel with host. Malicious container can exploit kernel vulnerabilities.

Seccomp Solution: Reduce attack surface by blocking dangerous syscalls.

Docker Default Seccomp Profile (simplified):

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": [
        "read", "write", "open", "close", "stat",
        "fstat", "mmap", "mprotect", "munmap",
        "brk", "ioctl", "writev", "access",
        "socket", "connect", "accept", "bind",
        "listen", "select", "poll", "epoll_wait"
        /* ... ~300 allowed syscalls ... */
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": [
        "reboot",           // Cannot reboot host!
        "swapon", "swapoff", // Cannot manage swap
        "mount", "umount",  // Cannot mount filesystems
        "pivot_root",       // Cannot change root
        "kexec_load",       // Cannot load kernel
        "add_key",          // Cannot add keyring keys
        "request_key",
        "bpf",              // Cannot load BPF programs
        "perf_event_open",  // Cannot use perf
        "ptrace"            // Cannot trace other processes (blocked on kernels older than 4.8)
      ],
      "action": "SCMP_ACT_ERRNO"  // Return EPERM
    }
  ]
}

Why Critical for Containers:

Kernel Exploit Mitigation:

Without seccomp:
  Container → Exploit in keyctl() → Kernel code execution → Host compromise

With seccomp:
  Container → keyctl() → EPERM (syscall blocked) → Exploit fails

Privilege Escalation Prevention:

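# Without seccomp (filter disabled explicitly)
docker run -it --security-opt seccomp=unconfined ubuntu
# Inside container:
unshare --user --map-root-user --mount --uts --ipc --net --pid --fork /bin/bash
# Can succeed where the host permits unprivileged user namespaces
# → new namespaces expose far more kernel code to the container

# With seccomp (default profile)
docker run -it ubuntu
unshare --user --map-root-user --mount ...
# unshare: unshare failed: Operation not permitted
# Blocked! (unshare syscall not allowed without CAP_SYS_ADMIN)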

Attack Surface Reduction:

Linux kernel (x86-64): 300+ syscalls
Docker default seccomp: blocks around 44 of them (figure from Docker's documentation; the exact set varies by version and granted capabilities)

Blocked by default (examples):
- Kernel module loading (init_module, finit_module)
- Namespace manipulation (setns, unshare)
- Performance monitoring (perf_event_open)
- System administration (reboot, sethostname)
- Clock changes (settimeofday, clock_settime)
- Key management (add_key, keyctl)
- BPF programs (bpf)

Result: the rarely needed, high-risk part of the syscall table is unreachable from the container

Implementing Custom Seccomp:

Example: Strict Sandbox:

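#include <fcntl.h>      // O_ACCMODE, O_RDONLY
#include <stddef.h>     // NULL
#include <seccomp.h>    // libseccomp (link with -lseccomp)

// Returns 0 on success, -1 if the filter could not be built or loaded
int install_strict_seccomp(void) {
    scmp_filter_ctx ctx;
    int rc = -1;

    // Default: KILL (strictest!)
    ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        return -1;

    // Allow ONLY these syscalls
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0) < 0)
        goto out;

    // Conditional: allow openat ONLY when the access mode is O_RDONLY.
    // Seccomp compares integer arguments (here: flags, argument 2); it
    // cannot dereference the path pointer, so a rule such as
    // "only /tmp/*" cannot be expressed in a seccomp filter.
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 1,
                         SCMP_A2(SCMP_CMP_MASKED_EQ, O_ACCMODE, O_RDONLY)) < 0)
        goto out;

    // Load filter
    rc = seccomp_load(ctx) < 0 ? -1 : 0;

    // After a successful load:
    // - read/write/exit: OK
    // - openat(..., O_RDONLY): OK
    // - openat(..., O_WRONLY) or O_RDWR: KILLED!
    // - socket(): KILLED!
    // - fork(): KILLED!
out:
    seccomp_release(ctx);
    return rc;
}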

Docker Custom Profile:

# custom-seccomp.json (illustrative: a real image also needs execve, mmap, brk, and others just to start)
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "exit", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

# Run container with custom profile
docker run --security-opt seccomp=custom-seccomp.json myimage

Debugging Seccomp Violations:

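# Check which actions the kernel may log (a list of action names, not a 0/1 switch)
cat /proc/sys/kernel/seccomp/actions_logged
# To change it, write the desired names, e.g.:
#   echo "kill_process kill_thread errno log" > /proc/sys/kernel/seccomp/actions_logged
# ERRNO, TRAP, and TRACE results are logged only for filters loaded with SECCOMP_FILTER_FLAG_LOG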

# Run container
docker run --rm -it --security-opt seccomp=strict.json ubuntu bash

# Inside container, try forbidden syscall:
mount -t tmpfs tmpfs /mnt
# bash: mount: Operation not permitted

# Check dmesg
dmesg | tail
# [12345.678] audit: type=1326 audit(1234567890.123:456): auid=1000 uid=0 gid=0
#              ses=3 pid=12345 comm="mount" exe="/bin/mount" sig=0 arch=c000003e
#              syscall=165 compat=0 ip=0x7f... code=0x7ffc0000
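#              ^^^^^^^^^^^
#              syscall 165 = mount on x86-64 (arch=c000003e)
#              code=0x7ffc0000 is SECCOMP_RET_LOG; an ERRNO result would appear as 0x0005xxxx

# Syscall 165 (mount) was flagged by the seccomp filter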
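
When triaging many such records it helps to pull out the syscall and code fields mechanically. The helper below is a small sketch in C for that. The function name is hypothetical, and it only handles numeric fields written in decimal or with a 0x prefix, so arch=c000003e (bare hex) is deliberately reported as unparsable.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Find "key=<number>" in an audit record and store the number in *out.
 * The key must start the record or follow a space, so that "uid" does
 * not match inside "auid". Returns false if the field is missing or
 * its value is not a decimal or 0x-prefixed number. */
bool audit_field_ul(const char *record, const char *key, unsigned long *out)
{
    size_t klen = strlen(key);

    if (klen == 0)
        return false;
    for (const char *p = record; (p = strstr(p, key)) != NULL; p += klen) {
        bool at_boundary = (p == record) || (p[-1] == ' ');
        if (!at_boundary || p[klen] != '=')
            continue;

        const char *value = p + klen + 1;
        char *end;
        unsigned long v = strtoul(value, &end, 0);   /* base 0: accepts 0x... */
        if (end == value)
            return false;                            /* no digits parsed */
        *out = v;
        return true;
    }
    return false;
}
```

The upper 16 bits of the code field identify the seccomp action, so masking it with 0xffff0000 tells a logged-but-allowed call apart from one that was rejected or killed.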

Why BPF:

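Efficiency: JIT-compiled to native code where the BPF JIT is enabled (fast!)
Safety: The kernel validates each filter before loading (bounded size, forward jumps only), so it always terminates and cannot crash the kernel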
Flexibility: Can inspect syscall arguments, not just number
Performance: Evaluated in kernel space (no context switch)

Without BPF (old seccomp mode 1):

Could only allow read(), write(), _exit(), and sigreturn()
No flexibility

With BPF (seccomp mode 2):

Can allow specific syscalls
Can inspect integer arguments (e.g., allow write() only on specific file descriptors; paths behind pointers cannot be checked)
Can return different actions (KILL, ERRNO, TRAP, TRACE, LOG, ALLOW)

Limitations:

Cannot inspect pointers: BPF cannot dereference user-space pointers (no access to path strings, only FDs)
Time-of-check-time-of-use (TOCTOU): Memory behind pointer arguments can change after the filter has run, so only register values can be trusted
Bypass via allowed syscalls: If write() allowed, attacker might abuse it
Complexity: Writing correct BPF filters is hard

Summary:

Seccomp-BPF is critical for containers because:

✅ Reduces kernel attack surface (Docker's default profile blocks dozens of high-risk syscalls)
✅ Prevents privilege escalation (blocks namespace manipulation)
✅ Mitigates kernel exploits (blocks vulnerable syscalls)
✅ Fast (BPF JIT compilation)
✅ Flexible (programmable filters)
✅ Secure (filters are validated at load time, so a filter cannot loop or crash the kernel)

Without it, containers can reach every syscall the kernel exposes, limited only by capability and LSM checks → much larger attack surface.

12. Threat Modeling for OS-backed Services

When designing secure services, think systematically about OS-level attack surfaces.

The STRIDE Model Applied to OS

Threat	OS Attack Vector	Mitigation
Spoofing	Process impersonation, UID manipulation	User namespaces, strong authentication
Tampering	Memory corruption, file modification	ASLR, KASLR, read-only mounts
Repudiation	Log deletion, timestamp manipulation	Append-only logs, audit subsystem
Info Disclosure	`/proc` leaks, side channels	`hidepid=2`, Spectre mitigations
Denial of Service	Fork bombs, memory exhaustion	Cgroups limits, ulimits, quotas
Elevation of Privilege	Kernel exploits, setuid abuse	Seccomp, drop capabilities

Defense-in-Depth Checklist

# 1. Principle of Least Privilege
capsh --print                           # See current capabilities
setcap cap_net_bind_service=+ep ./app   # Grant only specific caps

# 2. Namespaces: Reduce Visibility
unshare --user --map-root-user --pid --mount-proc bash

# 3. Read-only Filesystems
mount -o remount,ro /                   # Make root read-only

# 4. Resource Limits (DoS protection)
echo "100M" > /sys/fs/cgroup/myapp/memory.max
echo "50" > /sys/fs/cgroup/myapp/pids.max
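
The cgroup limits above apply to a whole group of processes. A service can additionally cap its own resources from inside the process with the POSIX rlimit interface (the "ulimits" mentioned in the STRIDE table). The following is a minimal sketch in C; the helper name is hypothetical.

```c
#include <sys/resource.h>

/* Set the soft limit for `resource` (e.g. RLIMIT_NOFILE, RLIMIT_NPROC)
 * to `n`, clamped to the current hard limit. Returns 0 on success and
 * -1 on failure, with errno set by getrlimit() or setrlimit(). */
int set_soft_limit(int resource, rlim_t n)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (n > rl.rlim_max)
        n = rl.rlim_max;        /* an unprivileged process cannot exceed the hard limit */
    rl.rlim_cur = n;
    return setrlimit(resource, &rl);
}
```

Calling set_soft_limit(RLIMIT_NOFILE, 64) early in main(), before handling untrusted input, bounds how many descriptors a compromised or buggy process can hold open; child processes inherit the limit.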

Summary

Key Takeaways:

Memory Protection: NX/DEP, ASLR, and stack canaries are foundational defenses against memory corruption attacks.
Control Flow Integrity: Forward-edge CFI and shadow stacks (backward-edge CFI) prevent control-flow hijacking.
Privilege Separation: Capabilities provide fine-grained privileges instead of all-or-nothing root access.
Mandatory Access Control: SELinux (label-based) and AppArmor (path-based) enforce policies beyond DAC.
Microarchitectural Attacks: Spectre and Meltdown exploit speculative execution. KPTI and retpolines mitigate but with performance cost.
Sandboxing: Namespaces, seccomp, and combinations thereof create strong isolation for untrusted code.

Defense in Depth: No single mechanism is perfect. Modern systems combine multiple layers:

ASLR + NX + Stack Canaries + CFI (memory safety)
Capabilities + Seccomp + Namespaces (privilege reduction)
SELinux/AppArmor (mandatory access control)
KPTI + Retpolines + CPU features (hardware attack mitigation)

Performance vs Security: Many mitigations have performance costs. Understand trade-offs and apply based on threat model.
