Operating System Security
Modern OS security is a multi-layered defense against both software vulnerabilities and hardware-level attacks. From memory protection to mandatory access control, understanding these mechanisms is crucial for building secure systems.
Mastery Level: Senior Security Engineer
Key Internals: Page Table Permissions, Capabilities, LSM hooks, CPU security features, Speculative execution mitigations
Prerequisites: Virtual Memory, Process Internals
1. Memory Protection Fundamentals
1.1 Page-Level Protection (NX/DEP)
No-Execute (NX) / Data Execution Prevention (DEP) marks memory pages as non-executable. It is usually paired with a W^X (write XOR execute) policy:
- A page can be writable OR executable, but never both
- Prevents attacker from modifying code or executing data
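The W^X rule above can be checked mechanically. The following is a minimal sketch, assuming mappings are listed in the Linux /proc/<pid>/maps text format; the sample addresses and the path /usr/bin/example are made up for illustration, and `find_wx_violations` is a hypothetical helper name.

```python
from typing import List, Tuple

# Hypothetical sample in the format of Linux /proc/<pid>/maps:
# address range, permissions, offset, device, inode, optional path.
SAMPLE_MAPS = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
00651000-00652000 rw-p 00051000 08:02 173521 /usr/bin/example
7f1c2a000000-7f1c2a021000 rwxp 00000000 00:00 0
7ffd5c3e0000-7ffd5c401000 rw-p 00000000 00:00 0 [stack]
"""


def find_wx_violations(maps_text: str) -> List[Tuple[str, str]]:
    """Return (address_range, label) for mappings that are writable AND executable."""
    violations = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        addr_range, perms = fields[0], fields[1]
        label = fields[5] if len(fields) > 5 else "[anonymous]"
        if "w" in perms and "x" in perms:
            violations.append((addr_range, label))
    return violations


if __name__ == "__main__":
    for addr_range, label in find_wx_violations(SAMPLE_MAPS):
        print(f"W+X mapping: {addr_range} {label}")
```

On a Linux host the same function could be fed the contents of /proc/self/maps; JIT-heavy processes are the usual place where writable and executable mappings show up.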
1.2 Address Space Layout Randomization (ASLR)
Problem: Without ASLR, addresses are predictable.
1.3 Stack Canaries (Stack Smashing Protection)
A stack canary is a random value placed on the stack between local variables and the return address. Common canary types:
- Terminator Canary
- Random Canary
- XOR Canary
| Attack | Mitigation |
|---|---|
| Leak canary via format string | Use fortified functions (__printf_chk) |
| Overwrite canary with correct value | Use random canary per thread |
| Jump over canary (partial overflow) | Place canary near variables |
| Fork before overflow (canary same in child) | Re-randomize after fork |
2. Control Flow Integrity (CFI)
CFI ensures program control flow follows legitimate paths (no arbitrary jumps).
2.1 Forward-Edge CFI (Indirect Calls)
Problem: Function pointers can be hijacked.
2.2 Backward-Edge CFI (Return Address Protection)
Shadow Stack: Hardware-protected copy of return addresses.
3. Privilege Separation & Capabilities
3.1 Traditional Unix DAC (Discretionary Access Control)
3.2 POSIX Capabilities
Divide root privileges into distinct units:
- Set File Capabilities
- Capability-Aware Code
- View Process Capabilities
- Ambient Capabilities
3.3 Seccomp (Secure Computing Mode)
Seccomp-BPF: Restrict the system calls a process can make using BPF filters.
| Action | Effect |
|---|---|
| SECCOMP_RET_KILL_PROCESS | Kill entire process |
| SECCOMP_RET_KILL_THREAD | Kill only current thread |
| SECCOMP_RET_TRAP | Send SIGSYS signal |
| SECCOMP_RET_ERRNO | Return error code |
| SECCOMP_RET_TRACE | Notify tracer (ptrace) |
| SECCOMP_RET_LOG | Log and allow |
| SECCOMP_RET_ALLOW | Allow syscall |
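When several filters are installed on one process, the kernel runs all of them and applies the highest-precedence action returned, and the table above lists the actions in that decreasing order. The sketch below only models that decision in plain Python; it does not install a real filter, and `allowlist` and `evaluate` are hypothetical helper names.

```python
from typing import Callable, Iterable, List

# Actions in decreasing order of precedence (same order as the table above).
PRECEDENCE = [
    "SECCOMP_RET_KILL_PROCESS",
    "SECCOMP_RET_KILL_THREAD",
    "SECCOMP_RET_TRAP",
    "SECCOMP_RET_ERRNO",
    "SECCOMP_RET_TRACE",
    "SECCOMP_RET_LOG",
    "SECCOMP_RET_ALLOW",
]

Filter = Callable[[str], str]


def allowlist(allowed: Iterable[str], default: str) -> Filter:
    """Build a toy filter: ALLOW the listed syscalls, return `default` otherwise."""
    allowed_set = frozenset(allowed)
    return lambda syscall: "SECCOMP_RET_ALLOW" if syscall in allowed_set else default


def evaluate(filters: List[Filter], syscall: str) -> str:
    """Run every filter and keep the most restrictive (highest-precedence) result."""
    results = [f(syscall) for f in filters]
    # With no filters installed, every syscall is allowed.
    return min(results, key=PRECEDENCE.index, default="SECCOMP_RET_ALLOW")


if __name__ == "__main__":
    base = allowlist({"read", "write", "exit_group", "openat"}, "SECCOMP_RET_KILL_PROCESS")
    tighter = allowlist({"read", "write", "exit_group"}, "SECCOMP_RET_ERRNO")
    for name in ("read", "openat", "ptrace"):
        print(name, "->", evaluate([base, tighter], name))
```

A consequence of this rule is that a later filter can only tighten the policy: once any installed filter denies a syscall, no other filter can re-allow it.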
4. Mandatory Access Control (MAC)
4.1 SELinux (Security-Enhanced Linux)
SELinux adds mandatory access control on top of DAC. It runs in one of three modes:
- Enforcing
- Permissive
- Disabled
4.2 AppArmor
AppArmor is path-based MAC (vs SELinux’s label-based).
| Feature | SELinux | AppArmor |
|---|---|---|
| Granularity | Very fine (labels) | Coarse (paths) |
| Complexity | High | Low |
| Performance | Small overhead | Very small |
| Learning curve | Steep | Gentle |
| Flexibility | Maximum | Good |
| Default | RHEL, Fedora, CentOS | Debian, Ubuntu, SUSE |
5. Microarchitectural Attacks & Mitigations
5.1 Spectre & Meltdown
Speculative Execution: The CPU predicts a branch and executes ahead, then discards the results if the prediction was wrong.
5.2 Rowhammer
DRAM vulnerability: Rapidly accessing one row can flip bits in adjacent rows. Mitigations:
- ECC Memory
- Target Row Refresh (TRR)
- Software Mitigations
- OS-Level
6. Sandboxing Techniques
6.1 Namespaces (Containers)
Linux namespaces isolate resources between processes.
6.2 Chrome Multi-Process Sandbox
7. Interview Questions & Answers
Q1: How does NX/DEP prevent code execution on the stack?
NX (No-Execute) / DEP (Data Execution Prevention) uses the CPU’s NX bit in page table entries.
Page Table Entry Structure (x86-64):
- Bit 63: NX (No-Execute) bit
- When set: Page cannot be executed (an instruction fetch from it raises #PF)
- When clear: Page is executable
Example (shellcode injected onto the stack):
- Attacker overflows buffer on stack
- Injects shellcode
- Overwrites return address to point to shellcode
- Function returns, jumps to shellcode address
- CPU checks NX bit → Page is not executable
- #PF (Page Fault) → Kernel kills the process
Protection:
- Stack/Heap: Writable, NOT executable
- Code: Executable, NOT writable
- Prevents: Code injection attacks (but not code reuse such as ROP, which runs code that is already executable)
Q2: Explain ASLR and how it prevents exploitation. What are its weaknesses?
ASLR (Address Space Layout Randomization) randomizes the memory layout at program start, so an attacker cannot hard-code the addresses an exploit needs.
Randomized Regions:
- Stack base address
- Heap base address
- Libraries (libc, etc.)
- Executable base (if PIE - Position Independent Executable)
- vDSO, vvar
With ASLR (illustrative entropy figures; actual values vary by kernel, architecture, and configuration):
- Stack: 19 bits → 524,288 possible positions
- Heap: 13 bits → 8,192 possible positions
- Libraries: 28 bits → 268 million possible positions
- PIE executable: 28 bits → 268 million possible positions
Weaknesses:
- Information Leak:
  - Pointer disclosure → calculate base addresses → bypass ASLR
  - Format string bugs, memory corruption leaks
- Entropy Limitations:
  - 13 bits (heap) = at most 8,192 attempts
  - If the layout survives a crash (fork server), it is brute-forceable
- 32-bit Systems:
  - Limited address space → low entropy
  - 8 bits library randomization → 256 attempts
- Non-PIE Executables:
  - Main executable at fixed address
  - Contains ROP gadgets at known addresses
- Cache Timing Attacks:
  - Side-channel attacks can determine addresses
Mitigations:
- Use PIE (Position Independent Executable)
- Fix information leaks
- Crash on exploit attempts and restart with a fresh layout (don’t fork)
- Use Control Flow Integrity (CFI)
- Combine with other defenses (NX, stack canaries)
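The entropy figures in this answer translate directly into brute-force cost. The sketch below works through the arithmetic, assuming a uniformly random base address; the bit counts are the ones quoted above, not measurements, and `expected_guesses` is a hypothetical helper name.

```python
# Entropy figures quoted in the answer above (bits of randomness per region).
ENTROPY_BITS = {
    "stack": 19,
    "heap": 13,
    "libraries": 28,
    "pie_executable": 28,
    "libraries_32bit": 8,
}


def positions(bits: int) -> int:
    """Number of equally likely base addresses for the given entropy."""
    return 2 ** bits


def expected_guesses(bits: int, rerandomized: bool) -> float:
    """Expected number of attempts before the attacker hits the right base.

    - Layout fixed across attempts (e.g. a fork server): the attacker can
      eliminate wrong guesses, so the expectation is (N + 1) / 2.
    - Layout re-randomized on every attempt: each guess succeeds with
      probability 1/N, so the expectation is N.
    """
    n = positions(bits)
    return float(n) if rerandomized else (n + 1) / 2


if __name__ == "__main__":
    for region, bits in ENTROPY_BITS.items():
        fixed = expected_guesses(bits, rerandomized=False)
        fresh = expected_guesses(bits, rerandomized=True)
        print(f"{region:16s} {bits:2d} bits  {positions(bits):>11,d} positions  "
              f"fork server: {fixed:,.1f}  re-exec: {fresh:,.1f}")
```

The gap between the two columns is why re-executing a crashed worker, rather than forking a new one from the same parent, is listed among the mitigations.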
Q3: How do stack canaries detect buffer overflows? Can they be bypassed?
Stack Canary: Random value placed between local variables and the return address.
Mechanism: The compiler-inserted function prologue copies the guard value into the frame, and the epilogue checks it before returning.
Detection:
- Buffer overflow overwrites local variables
- Overflow continues, overwrites canary
- Function returns
- Epilogue checks: stack_canary == __stack_chk_guard?
- Mismatch → Stack smashing detected! → abort()
Bypasses:
- Leak the canary (for example via a format string bug), then write it back intact
- Overwrite a pointer located before the canary
- Fork without re-randomization (rare): the child inherits the parent’s canary
- Partial overflow (jump over the canary)
Mitigations:
- Combine with ASLR (randomize canary address)
- Use fortified functions (__strcpy_chk) to prevent overflows
- Re-randomize canary after fork
- Stack Clash protection (prevent jumping over canary)
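The detection flow and the leak-based bypass can be modeled without touching a real stack. This is a toy simulation under stated assumptions: a frame is a flat byte array laid out as buffer, canary, return address, and the `Frame` class and its sizes are invented for illustration rather than taken from any compiler.

```python
import secrets
import struct

BUF_SIZE = 16     # local buffer
CANARY_SIZE = 8   # guard value between the buffer and the return address


class StackSmashingDetected(Exception):
    """Raised where a real program would call abort()."""


class Frame:
    """Toy stack frame: [buffer][canary][return address]."""

    def __init__(self, return_address: int) -> None:
        # Random canary with a zero first byte (terminator-style), so string
        # copies that stop at NUL cannot write it back verbatim.
        self.guard = b"\x00" + secrets.token_bytes(CANARY_SIZE - 1)
        self.memory = (bytearray(BUF_SIZE) + bytearray(self.guard)
                       + bytearray(struct.pack("<Q", return_address)))

    def unsafe_copy(self, data: bytes) -> None:
        """Copy into the buffer with no bounds check (clamped to the frame)."""
        n = min(len(data), len(self.memory))
        self.memory[:n] = data[:n]

    def ret(self) -> int:
        """Epilogue: verify the canary, then 'return' to the saved address."""
        canary = bytes(self.memory[BUF_SIZE:BUF_SIZE + CANARY_SIZE])
        if canary != self.guard:
            raise StackSmashingDetected("canary mismatch")
        return struct.unpack("<Q", self.memory[BUF_SIZE + CANARY_SIZE:])[0]


if __name__ == "__main__":
    frame = Frame(0x401000)
    frame.unsafe_copy(b"A" * 32)  # linear overflow tramples the canary
    try:
        frame.ret()
    except StackSmashingDetected as exc:
        print("detected:", exc)

    leaked = Frame(0x401000)
    # If the attacker has leaked the guard, the overflow can restore it.
    leaked.unsafe_copy(b"A" * BUF_SIZE + leaked.guard + struct.pack("<Q", 0xDEADBEEF))
    print("hijacked return:", hex(leaked.ret()))
```

The second case is why the answer pairs canaries with fixing information leaks: the check only proves the guard bytes are unchanged, not that the return address is.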
Q4: What is the difference between capabilities and setuid? Why are capabilities better?
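Traditional setuid: The binary runs with the full privileges of its owner (typically root), whatever it actually needs.
Capabilities: Root’s privileges are divided into distinct units that can be granted individually.
Comparison: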
| Feature | setuid | Capabilities |
|---|---|---|
| Granularity | All or nothing | Fine-grained (41 capabilities) |
| Security | Over-privileged | Least privilege |
| Persistence | Lost on exec (unless binary is setuid) | Can be inherited |
| Auditability | Hard to see why root is needed | Clear which capability is used |
Why Capabilities Are Better:
- Principle of Least Privilege: Only grant necessary permissions
- Reduced Attack Surface: Exploit gets limited capabilities, not full root
- Better Auditability: Clear why each capability is needed
- Flexibility: Can grant to non-root users
- Inheritance: Can design capability-aware services
Real-world usage:
- systemd services with capabilities
- Docker containers (run as non-root with specific capabilities)
- Network daemons (CAP_NET_BIND_SERVICE instead of setuid)
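To see which capabilities a process actually holds, Linux exposes hexadecimal bitmasks such as the CapEff field in /proc/<pid>/status, where bit N corresponds to capability number N. The sketch below decodes such a mask; it names only a handful of capabilities for brevity, and `decode_caps` is a hypothetical helper.

```python
from typing import List

# A few capability numbers from the Linux capability list (bit N = capability N).
# Only a subset is named here; unnamed bits are reported by number.
CAP_NAMES = {
    0: "CAP_CHOWN",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(mask_hex: str) -> List[str]:
    """Decode a hex capability mask (as shown in CapEff) into capability names."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


if __name__ == "__main__":
    # A daemon granted only the right to bind ports below 1024:
    print(decode_caps("0000000000000400"))
    # A full-root mask with all 41 capabilities set:
    print(len(decode_caps("000001ffffffffff")), "capabilities")
```

Comparing the two outputs makes the least-privilege argument concrete: a compromised daemon with the first mask holds one capability instead of forty-one.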
Q5: How does KPTI mitigate Meltdown? What is the performance cost?
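Meltdown Vulnerability: On affected CPUs, user-mode code can speculatively read kernel memory that is mapped in its own page tables, then recover the data through a cache side channel.
KPTI (Kernel Page Table Isolation) Solution: Each process gets two sets of page tables. The user-mode set maps only the minimal kernel entry/exit code; the full kernel mapping is used only while running in kernel mode.
Syscall Flow with KPTI: On entry the kernel switches CR3 to the kernel page tables, handles the syscall, and switches CR3 back before returning to user mode.
What makes it expensive:
- CR3 Write (page table switch):
  - ~150-300 CPU cycles per switch
  - 2 switches per syscall (enter + exit)
- TLB Flush:
  - The Translation Lookaside Buffer caches virtual→physical address translations
  - Without PCID, changing CR3 flushes the TLB (translations must be reloaded from memory)
  - TLB misses add ~100 cycles per memory access
- Frequency of Syscalls:
  - I/O-heavy workloads: Many syscalls → high overhead
  - CPU-bound workloads: Few syscalls → low overhead
Performance Cost: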
| Workload Type | Performance Loss |
|---|---|
| CPU-intensive (scientific computing) | 0-3% |
| Light I/O (web browsing) | 3-5% |
| Heavy I/O (file server) | 5-10% |
| Heavy syscalls (database, Redis) | 10-30% |
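A back-of-the-envelope model shows why the loss scales with syscall rate. The sketch below counts only the CR3 switches (two per syscall) and ignores the TLB-miss cost that follows them, so it gives a lower bound; the 3 GHz clock and 200 cycles per switch are assumed values picked from within the ranges quoted in this answer.

```python
def kpti_overhead_fraction(syscalls_per_sec: float,
                           cycles_per_switch: float,
                           cpu_hz: float,
                           switches_per_syscall: int = 2) -> float:
    """Fraction of one core's cycles spent on KPTI page-table switches.

    Lower bound only: the extra TLB misses after each switch are not modeled.
    """
    extra_cycles = syscalls_per_sec * switches_per_syscall * cycles_per_switch
    return extra_cycles / cpu_hz


if __name__ == "__main__":
    CPU_HZ = 3e9             # assumed 3 GHz core
    CYCLES_PER_SWITCH = 200  # assumed, within the ~150-300 range above
    for rate in (10_000, 100_000, 1_000_000):
        frac = kpti_overhead_fraction(rate, CYCLES_PER_SWITCH, CPU_HZ)
        print(f"{rate:>9,d} syscalls/s -> {frac:.1%} of one core")
```

Under these assumptions a million syscalls per second costs about 13% of a core in switches alone, which is in line with the heavy-syscall row of the table.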
Optimizations:
- PCID (Process Context ID):
  - Tag TLB entries with PCID
  - Avoid full TLB flush on CR3 switch
  - Reduces overhead to roughly 1-5%
- Lazy TLB Switching:
  - Kernel threads don’t switch page tables
  - They reuse the previous task’s address space
- Hardware Fixes:
  - CPUs without the Meltdown bug → no KPTI needed
  - Check: cat /sys/devices/system/cpu/vulnerabilities/meltdown
  - If it says “Not affected” → KPTI not active
Q6: Explain how Spectre works and why retpolines are an effective mitigation.
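Spectre Vulnerability (Branch Target Injection): The CPU speculatively executes the predicted target of an indirect branch. By training the branch predictor, an attacker can steer a victim’s speculation into a chosen gadget, whose cache side effects leak data even though the speculated results are discarded.
Problem with Indirect Branches: Their predicted targets come from the indirect branch predictor, which an attacker can poison.
Retpoline (Return Trampoline): Each indirect branch is replaced with a call/return sequence. The return’s predicted target lands in a harmless loop that captures speculation, while the real target is reached architecturally by overwriting the return address on the stack.
Performance Impact:
- Retpolines are slower than plain indirect branches (5-20% overhead)
- But necessary for security on vulnerable CPUs
- Modern CPUs have hardware mitigations (IBRS - Indirect Branch Restricted Speculation)
Why Effective:
- Return instructions are different: they are predicted from the Return Stack Buffer (RSB), which the retpoline sequence itself sets up
- Speculation contained: Loop prevents speculative execution reaching gadgets
- No hardware support needed: It is a pure software mitigation applied by the compiler
- Comprehensive: Protects all indirect branches compiled with it
Limitations:
- Performance overhead (modern CPUs use IBRS instead)
- Doesn’t protect against Spectre v1 (bounds check bypass)
- Doesn’t protect against other speculative execution attacks (L1TF, MDS, etc.)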
Q7: Compare SELinux vs AppArmor. When would you use each?
Fundamental Difference: SELinux is label-based MAC; AppArmor is path-based MAC.
Can you use both? No, they conflict (both use LSM hooks). Choose one.
Neither? Not recommended. MAC adds a significant security layer beyond DAC.
Detailed Comparison:
Security Model:
SELinux:
- Type Enforcement (TE): Subjects (processes) have types, objects (files) have types
- Multi-Level Security (MLS): Confidentiality levels (Top Secret, Secret, etc.)
- Multi-Category Security (MCS): Categories for compartmentalization
- Very fine-grained control
AppArmor:
- Path-based access control
- Capabilities control
- Network access control (protocol/address)
- Simpler model, easier to understand
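Type Enforcement, as described above, boils down to a lookup keyed by the subject's type, the object's type, and the object class. The sketch below models that lookup; the rule table is a hypothetical two-rule policy written in the spirit of SELinux `allow` statements, not an excerpt from a real policy.

```python
from typing import Dict, FrozenSet, Tuple

Rule = Tuple[str, str, str]  # (source type, target type, object class)

# Hypothetical policy, equivalent in spirit to:
#   allow httpd_t httpd_sys_content_t:file { read getattr open };
#   allow httpd_t httpd_log_t:file { append getattr open };
ALLOW_RULES: Dict[Rule, FrozenSet[str]] = {
    ("httpd_t", "httpd_sys_content_t", "file"): frozenset({"read", "getattr", "open"}),
    ("httpd_t", "httpd_log_t", "file"): frozenset({"append", "getattr", "open"}),
}


def te_allows(source: str, target: str, tclass: str, perm: str) -> bool:
    """Default deny: access is granted only if an allow rule lists the permission."""
    return perm in ALLOW_RULES.get((source, target, tclass), frozenset())


if __name__ == "__main__":
    checks = [
        ("httpd_t", "httpd_sys_content_t", "file", "read"),
        ("httpd_t", "httpd_sys_content_t", "file", "write"),
        ("httpd_t", "shadow_t", "file", "read"),
    ]
    for check in checks:
        print(check, "->", "allowed" if te_allows(*check) else "denied")
```

Because the decision depends on types rather than paths, a web server process confined this way is denied access to a password file even if it runs as root and DAC would permit the read.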
Administration:
| Task | SELinux | AppArmor |
|---|---|---|
| Create policy | Complex (audit2allow helps) | Simple (aa-genprof) |
| Debug denials | ausearch, sealert | aa-logprof, dmesg |
| Enable/Disable | setenforce | aa-enforce/aa-complain |
| View status | sestatus, getenforce | aa-status |
| Temporary allow | setsebool (booleans) | aa-complain mode |
Performance:
SELinux:
- Label lookups in xattrs (extended attributes)
- Hash table lookups for policy decisions
- Overhead: 3-7% typically
AppArmor:
- Path resolution for every access
- Simpler policy checks
- Overhead: 1-3% typically
Filesystem requirements:
SELinux:
- Requires filesystem with xattr support
- Labels stored as extended attributes
- ls -Z shows labels
- Relabeling filesystem can be slow
AppArmor:
- No special filesystem requirements
- Works on any filesystem (even FAT, NFS)
- No labels to manage
Use SELinux when:
- Maximum security required (government, military)
- Need MLS/MCS (confidentiality levels)
- Want very fine-grained control
- Already familiar with it (RHEL/Fedora/CentOS)
- Need label-based security (labels follow files even if moved)
Use AppArmor when:
- Simplicity preferred over maximum granularity
- Easier policy management desired
- Filesystem doesn’t support xattrs (NFS, FAT)
- Developers/admins less experienced with MAC
- Debian/Ubuntu/SUSE environment
Container security:
SELinux:
- Docker/Podman use SELinux contexts
- Each container gets unique MCS label
- Container: svirt_sandbox_file_t, host: container_file_t
- Strong isolation via labels
AppArmor:
- Docker uses AppArmor profiles
- Default profile restricts mount, capabilities, etc.
- Custom profiles for specific containers
- Path-based restrictions easier to understand
Policy portability:
SELinux:
- Labels stored with files (xattrs)
- Policy is separate from filesystem
- Moving files between systems: labels can be lost
- Need to relabel after restore from backup
AppArmor:
- Policy references absolute paths
- Moving profile to different system: works if paths same
- But path changes require profile updates
Recommendation Matrix:
| Priority | Choose |
|---|---|
| Maximum security | SELinux |
| Ease of use | AppArmor |
| Fine-grained control | SELinux |
| Simple policies | AppArmor |
| RHEL/CentOS | SELinux (default) |
| Debian/Ubuntu | AppArmor (default) |
| NFS/non-xattr FS | AppArmor |
| MLS/MCS required | SELinux |
| Container host | Both work (SELinux more common) |
Q8: How does seccomp-BPF work and why is it critical for container security?
Seccomp-BPF (Secure Computing with Berkeley Packet Filter):
Core Concept: Whitelist syscalls a process can make using BPF bytecode filters.
Container Security Use Case:
Problem: Containers share the kernel with the host. A malicious container can exploit kernel vulnerabilities.
Seccomp Solution: Reduce attack surface by blocking dangerous syscalls.
Why Critical for Containers:
- Kernel Exploit Mitigation
- Privilege Escalation Prevention
- Attack Surface Reduction
Why BPF:
- Efficiency: JIT-compiled to native code (fast!)
- Safety: BPF verifier ensures filter cannot crash kernel
- Flexibility: Can inspect syscall arguments, not just number
- Performance: Evaluated in kernel space (no context switch)
Strict mode (original seccomp):
- Only allows read, write, _exit, and sigreturn
- No flexibility
Seccomp-BPF:
- Can allow specific syscalls
- Can inspect scalar arguments (e.g., allow socket only for AF_UNIX)
- Can return different actions (ERRNO, TRAP, LOG, ALLOW)
Limitations:
- Cannot inspect pointers: BPF cannot dereference user-space pointers (no access to path strings, only to raw argument values such as FDs and flags)
- Time-of-check-time-of-use (TOCTOU): Memory behind pointer arguments can change after the filter has run
- Bypass via allowed syscalls: If write() is allowed, an attacker might still abuse it
- Complexity: Writing correct BPF filters is hard
Summary: Seccomp-BPF is critical for containers because:
- ✅ Reduces kernel attack surface (Docker’s default profile blocks around 44 of 300+ syscalls)
- ✅ Prevents privilege escalation (blocks namespace manipulation)
- ✅ Mitigates kernel exploits (blocks vulnerable syscalls)
- ✅ Fast (BPF JIT compilation)
- ✅ Flexible (programmable filters)
- ✅ Safe to load (the kernel validates filters before installing them)
12. Threat Modeling for OS-backed Services
When designing secure services, think systematically about OS-level attack surfaces.
The STRIDE Model Applied to OS
| Threat | OS Attack Vector | Mitigation |
|---|---|---|
| Spoofing | Process impersonation, UID manipulation | User namespaces, strong authentication |
| Tampering | Memory corruption, file modification | ASLR, KASLR, read-only mounts |
| Repudiation | Log deletion, timestamp manipulation | Append-only logs, audit subsystem |
| Info Disclosure | /proc leaks, side channels | hidepid=2, Spectre mitigations |
| Denial of Service | Fork bombs, memory exhaustion | Cgroups limits, ulimits, quotas |
| Elevation of Privilege | Kernel exploits, setuid abuse | Seccomp, drop capabilities |
Defense-in-Depth Checklist
Summary
Key Takeaways:
- Memory Protection: NX/DEP, ASLR, and stack canaries are foundational defenses against memory corruption attacks.
- Control Flow Integrity: Forward-edge CFI and shadow stacks (backward-edge CFI) prevent control-flow hijacking.
- Privilege Separation: Capabilities provide fine-grained privileges instead of all-or-nothing root access.
- Mandatory Access Control: SELinux (label-based) and AppArmor (path-based) enforce policies beyond DAC.
- Microarchitectural Attacks: Spectre and Meltdown exploit speculative execution. KPTI and retpolines mitigate but with performance cost.
- Sandboxing: Namespaces, seccomp, and combinations thereof create strong isolation for untrusted code.
Defense in depth:
- ASLR + NX + Stack Canaries + CFI (memory safety)
- Capabilities + Seccomp + Namespaces (privilege reduction)
- SELinux/AppArmor (mandatory access control)
- KPTI + Retpolines + CPU features (hardware attack mitigation)