Operating System Security

OS security protects system resources from unauthorized access and malicious attacks. Understanding security principles is essential for building robust systems.
Interview Frequency: Medium-High
Key Topics: Access control, capabilities, sandboxing, containers
Time to Master: 12-15 hours

Security Principles

Security Fundamentals

Privilege Rings


Mode Transitions

// User to Kernel transition triggers:
// 1. System call (intentional)
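long result = syscall(SYS_read, fd, buffer, size);

// 2. Exception (page fault, divide by zero)
volatile int zero = 0;  // volatile keeps the compiler from folding the division
int crash = 1 / zero;   // Divide error on x86 (delivered as SIGFPE); undefined behavior in C

// 3. Hardware interrupt (timer, I/O)
// Raised asynchronously by hardware; the kernel's handler runs

// Kernel to User transition:
// - Return from system call/interrupt
// - sigreturn (return from a signal handler)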
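
As a minimal sketch of the intentional transition, the function below issues getpid through the generic syscall() wrapper instead of the usual libc function. The helper name raw_getpid is an illustrative assumption, and the example assumes Linux with glibc or a compatible libc.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_getpid */
#include <unistd.h>       /* syscall(), getpid() */

/* Hypothetical helper: enter the kernel explicitly and return its answer. */
long raw_getpid(void) {
    /* User mode -> kernel mode happens inside this call;
     * kernel mode -> user mode happens when it returns. */
    return syscall(SYS_getpid);
}
```

Comparing raw_getpid() with getpid() should give the same value, since both end up in the same kernel handler.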

Access Control

Discretionary Access Control (DAC)

Owner decides who can access:
# Traditional Unix permissions
$ ls -l myfile
-rwxr-x--- 1 alice devs 4096 Jan 1 10:00 myfile
 │  │  │
 │  │  └── Others: no access (---)
 │  └───── Group (devs): read, execute (r-x)
 └──────── Owner (alice): read, write, execute (rwx)

# Numeric representation
chmod 750 myfile  # rwxr-x---

# Change owner
chown alice:devs myfile
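
The permission bits above can be decoded with a few shifts and masks. The sketch below is a simplified model, not the kernel's implementation: the function names are illustrative, and supplementary groups, ACLs and root's override are deliberately ignored.

```c
#include <stdbool.h>
#include <sys/types.h>  /* uid_t, gid_t, mode_t */

/* Render the low nine mode bits as "rwxr-x---". out must hold 10 bytes. */
void mode_to_string(mode_t mode, char out[10]) {
    const char symbols[] = "rwxrwxrwx";
    for (int i = 0; i < 9; i++) {
        /* Bit 8 is owner-read, bit 0 is other-execute. */
        out[i] = (mode & (1u << (8 - i))) ? symbols[i] : '-';
    }
    out[9] = '\0';
}

/* Simplified DAC check: exactly one class (owner, group or other) applies,
 * and the requested bits (4 = r, 2 = w, 1 = x) must all be granted by it. */
bool dac_allows(mode_t mode, uid_t file_uid, gid_t file_gid,
                uid_t uid, gid_t gid, unsigned want) {
    unsigned granted;
    if (uid == file_uid)
        granted = (mode >> 6) & 7;   /* owner bits */
    else if (gid == file_gid)
        granted = (mode >> 3) & 7;   /* group bits */
    else
        granted = mode & 7;          /* other bits */
    return (granted & want) == want;
}
```

With mode 0750, a member of the file's group may read and execute but not write, and everyone else is refused.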

Access Control Matrix


POSIX ACLs

Extended access control:
# View ACL
$ getfacl myfile
# file: myfile
# owner: alice
# group: devs
user::rwx
user:bob:r-x       # Specific user permission
group::r-x
group:admins:rwx   # Specific group permission
mask::rwx
other::---

# Set ACL
$ setfacl -m u:bob:rx myfile    # Add user permission
$ setfacl -m g:admins:rwx myfile  # Add group permission
$ setfacl -x u:bob myfile       # Remove user permission
$ setfacl -b myfile             # Remove all ACLs
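
The mask entry in the listing caps what named-user and group entries actually grant. A minimal sketch of that rule, assuming a hypothetical in-memory entry list (the struct, the function name and the UID 1001 for bob are illustrative, not part of libacl):

```c
#include <stddef.h>
#include <sys/types.h>  /* uid_t */

enum { PERM_R = 4, PERM_W = 2, PERM_X = 1 };

/* Hypothetical representation of one "user:<name>:<perms>" entry. */
struct named_user_entry {
    uid_t uid;
    unsigned perms;
};

/* Effective permission bits for uid: the entry's bits limited by the mask.
 * Returns -1 if no named-user entry matches; a full implementation would
 * then continue with the group entries and finally "other". */
int acl_named_user_perms(const struct named_user_entry *entries, size_t n,
                         unsigned mask, uid_t uid) {
    for (size_t i = 0; i < n; i++) {
        if (entries[i].uid == uid)
            return (int)(entries[i].perms & mask);
    }
    return -1;
}
```

If bob has r-x but the mask is r--, his effective permission is r--, which getfacl reports in an "#effective:" comment.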

Mandatory Access Control (MAC)

System enforces policy, owner cannot override:

SELinux

# SELinux context format: user:role:type:level
$ ls -Z /var/www/html/index.html
-rw-r--r--. root root unconfined_u:object_r:httpd_sys_content_t:s0 index.html

# The policy lets httpd (Apache) serve files labeled httpd_sys_content_t
# If the label is wrong, SELinux blocks access even when file permissions allow it

# Check current mode
$ getenforce
Enforcing

# Temporarily switch to permissive mode (denials are logged but not enforced; avoid in production)
$ setenforce 0

# Change file context
$ chcon -t httpd_sys_content_t /var/www/html/newfile

# Restore default context
$ restorecon -Rv /var/www/html/

AppArmor

# Profile example: /etc/apparmor.d/usr.sbin.nginx
#include <tunables/global>

/usr/sbin/nginx {
  #include <abstractions/base>
  
  # Network access
  network inet stream,
  network inet6 stream,
  
  # File access
  /var/www/** r,
  /var/log/nginx/** rw,
  /run/nginx.pid rw,
  
  # Config files
  /etc/nginx/** r,
  
  # Deny everything else by default
}

MAC vs DAC Comparison

Aspect     | DAC             | MAC
Control    | Owner decides   | System policy
Override   | Owner can grant | Cannot bypass
Complexity | Simple          | Complex
Example    | chmod           | SELinux
Use case   | General files   | High security

Linux Capabilities

Split root privileges into smaller units:
# Traditional: root or nothing
# Capabilities: Fine-grained privileges

# View process capabilities
$ getpcaps $$
1234: = cap_chown,cap_dac_override+ep

# Key capabilities
CAP_CHOWN           # Change file ownership
CAP_DAC_OVERRIDE    # Bypass file permissions
CAP_NET_ADMIN       # Network configuration
CAP_NET_BIND_SERVICE # Bind to ports < 1024
CAP_NET_RAW         # Raw sockets
CAP_SYS_ADMIN       # Many admin operations
CAP_SYS_PTRACE      # Debug other processes

# Set file capability
$ setcap 'cap_net_bind_service=+ep' /usr/bin/myserver
# Now myserver can bind port 80 without running as root

# Remove capabilities
$ setcap -r /usr/bin/myserver
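
Capability sets are bitmasks, which the kernel exposes as hex values (for example the CapEff line in /proc/<pid>/status). The sketch below tests individual bits without libcap; both function names are illustrative, and it assumes Linux with the kernel headers installed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <linux/capability.h>  /* CAP_* numbers */

/* True if capability number cap is set in a 64-bit capability mask. */
bool cap_mask_has(uint64_t mask, int cap) {
    return (mask >> cap) & 1u;
}

/* Read the effective capability mask of the calling process.
 * Returns 0 on success, -1 if the CapEff line cannot be read. */
int read_effective_caps(uint64_t *out) {
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[256];
    int rc = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        unsigned long long mask;
        /* The line looks like "CapEff:<tab>0000000000000400". */
        if (sscanf(line, "CapEff: %llx", &mask) == 1) {
            *out = mask;
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}
```

A server started through the setcap example would be expected to show the CAP_NET_BIND_SERVICE bit set in its effective mask while CAP_SYS_ADMIN stays clear.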

Capability Sets

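// Each thread has several capability sets. The three classic ones:
// Permitted (P): Upper limit on the capabilities the thread may enable
// Effective (E): Capabilities the kernel checks right now
// Inheritable (I): Capabilities preserved across execve(), subject to the file's inheritable set
// (Linux also has bounding and ambient sets, not shown here.)

// Drop capabilities after setup (libcap; link with -lcap)
#include <sys/capability.h>

int drop_caps(void) {
    cap_t caps = cap_get_proc();
    if (caps == NULL)
        return -1;

    // Keep only what we need
    cap_value_t keep[] = {CAP_NET_BIND_SERVICE};

    int rc = -1;
    if (cap_clear(caps) == 0 &&
        cap_set_flag(caps, CAP_PERMITTED, 1, keep, CAP_SET) == 0 &&
        cap_set_flag(caps, CAP_EFFECTIVE, 1, keep, CAP_SET) == 0 &&
        cap_set_proc(caps) == 0)
        rc = 0;

    cap_free(caps);
    return rc;
}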

Sandboxing

Seccomp (Secure Computing Mode)

Filter system calls:
#include <seccomp.h>  // libseccomp; link with -lseccomp
#include <unistd.h>   // STDIN_FILENO

int sandbox_process(void) {
    // Default: kill on any syscall that has no matching rule
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        return -1;

    // Allow specific syscalls
    int rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    if (rc == 0)
        rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    if (rc == 0)
        rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Allow read only from stdin. An additional unconditional read rule
    // would make this argument filter pointless, so there is none.
    if (rc == 0)
        rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 1,
                              SCMP_A0(SCMP_CMP_EQ, STDIN_FILENO));

    // Apply filter
    if (rc == 0)
        rc = seccomp_load(ctx);
    seccomp_release(ctx);

    // On success, any other syscall is now fatal (SIGSYS), e.g.:
    // open("/etc/passwd", O_RDONLY);
    return rc;
}
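
For comparison, the kernel also offers the original strict seccomp mode, which needs no library: after enabling it, only read, write, _exit and sigreturn are permitted and anything else is answered with SIGKILL. The sketch below is a different mechanism from the libseccomp filter above. It runs the experiment in a forked child so the caller survives; the function name is illustrative, and it assumes a Linux kernel built with seccomp support.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>  /* SECCOMP_MODE_STRICT */
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, put it into strict seccomp mode, and report how it ended.
 * Returns the terminating signal number, 0 for a clean exit,
 * or -1 if fork or seccomp setup failed. */
int run_in_strict_seccomp(int misbehave) {
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(3);              /* seccomp not available */
        if (misbehave)
            syscall(SYS_getpid);   /* not on the allow list: SIGKILL */
        syscall(SYS_exit, 0);      /* plain exit is allowed; exit_group is not */
        _exit(4);                  /* not reached */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}
```

On a system with seccomp available, run_in_strict_seccomp(1) should report SIGKILL and run_in_strict_seccomp(0) a clean exit.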

Namespaces

Isolate system resources:
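#define _GNU_SOURCE  // Needed for clone(), unshare() and the CLONE_NEW* flags
#include <sched.h>
#include <signal.h>  // SIGCHLD
#include <stdio.h>   // perror

// Create a child in new namespaces
int clone_flags = CLONE_NEWPID |  // New PID namespace
                  CLONE_NEWNET |  // New network namespace
                  CLONE_NEWNS  |  // New mount namespace
                  CLONE_NEWUTS |  // New hostname and domain name
                  CLONE_NEWIPC |  // New IPC namespace
                  CLONE_NEWUSER | // New user namespace
                  SIGCHLD;        // Signal the parent on exit so it can wait()

// child_func and stack_top are supplied by the caller; stack_top points
// to the top of the child's stack because stacks grow downward
pid_t pid = clone(child_func, stack_top, clone_flags, NULL);
if (pid == -1)
    perror("clone");

// Or move the calling process into a new namespace with unshare:
if (unshare(CLONE_NEWNS) == -1)  // New mount namespace
    perror("unshare");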

Control Groups (cgroups)

Limit resource usage:
# Create a cgroup
mkdir /sys/fs/cgroup/memory/myapp
mkdir /sys/fs/cgroup/cpu/myapp

# Set limits
echo 100000000 > /sys/fs/cgroup/memory/myapp/memory.limit_in_bytes  # 100MB
echo 50000 > /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us              # 50% CPU

# Add process to cgroup
echo $$ > /sys/fs/cgroup/memory/myapp/cgroup.procs
echo $$ > /sys/fs/cgroup/cpu/myapp/cgroup.procs

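# Using cgroup v2 (unified hierarchy):
mkdir /sys/fs/cgroup/myapp
# Controllers are enabled in the parent's cgroup.subtree_control
echo "+memory +cpu" > /sys/fs/cgroup/cgroup.subtree_control
echo "100M" > /sys/fs/cgroup/myapp/memory.max
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max  # 50 ms per 100 ms period = 50% CPU
echo $$ > /sys/fs/cgroup/myapp/cgroup.procs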
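
The cpu.max value is a quota and a period in microseconds, and their ratio is the CPU share. A small sketch of that arithmetic (the function name and the sentinel return values are illustrative choices):

```c
#include <stdio.h>
#include <string.h>

/* Parse a cgroup v2 cpu.max value of the form "<quota> <period>" or
 * "max <period>". Returns the limit as a fraction of one CPU
 * (0.5 = 50%), -1.0 for "max" (no limit), or -2.0 on malformed input. */
double cpu_max_fraction(const char *value) {
    long quota, period;

    if (strncmp(value, "max", 3) == 0)
        return -1.0;
    if (sscanf(value, "%ld %ld", &quota, &period) != 2 ||
        quota < 0 || period <= 0)
        return -2.0;
    return (double)quota / (double)period;
}
```

For "50000 100000" this gives 0.5, matching the 50% limit above; a quota larger than the period means more than one CPU.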

Container Security Stack

Container Isolation

Memory Protection

Address Space Layout Randomization (ASLR)

ASLR is a critical security technique: it randomizes the addresses of the stack, heap, shared libraries and (for PIE binaries) the executable itself, so an exploit cannot rely on hard-coded addresses.
ASLR Visualized

ASLR Configuration and Usage

# Check ASLR status
$ cat /proc/sys/kernel/randomize_va_space
2  # Full randomization

# 0: Disabled (NEVER use in production!)
# 1: Conservative (stack, mmap, vDSO)
# 2: Full (+ brk/heap) - RECOMMENDED

# View randomized addresses (note: changes each run)
$ cat /proc/self/maps
5648a0200000-5648a0201000 r--p /usr/bin/cat  # PIE executable
7f3e4c200000-7f3e4c3a0000 r--p /usr/lib/libc.so.6  # Library
7ffd9a100000-7ffd9a121000 rw-p [stack]  # Stack

# Compare two runs - addresses differ:
$ for i in {1..3}; do cat /proc/self/maps | grep stack; done
7ffd9a100000-7ffd9a121000 rw-p [stack]
7ffe3b200000-7ffe3b221000 rw-p [stack]  ◄── Different!
7ffc2c300000-7ffc2c321000 rw-p [stack]  ◄── Different!

# Disable ASLR for a single process (debugging only)
$ setarch $(uname -m) -R ./myprogram

# Or system-wide (affects every new process until re-enabled)
$ echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
$ ./myprogram
$ echo 2 | sudo tee /proc/sys/kernel/randomize_va_space  # Re-enable!

# Check if binary is position-independent (PIE)
$ readelf -h /bin/cat | grep Type
  Type:                              DYN (Position-Independent Executable file)

# Compile with PIE (modern default)
$ gcc -fPIE -pie program.c -o program

# Without PIE (the executable itself loads at a fixed address, so ASLR covers only stack, heap and libraries)
$ gcc -no-pie program.c -o program
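
A program can check the system setting itself before relying on randomization. The sketch below reads the same sysctl file as the commands above; the function names are illustrative, and the labels simply restate the three levels listed earlier.

```c
#include <stdio.h>

/* Map a randomize_va_space value to a short label. */
const char *aslr_level_name(int level) {
    switch (level) {
    case 0:  return "disabled";
    case 1:  return "conservative";
    case 2:  return "full";
    default: return "unknown";
    }
}

/* Read the current setting; returns -1 if the file cannot be read. */
int read_aslr_level(void) {
    FILE *f = fopen("/proc/sys/kernel/randomize_va_space", "r");
    if (f == NULL)
        return -1;

    int level = -1;
    if (fscanf(f, "%d", &level) != 1)
        level = -1;
    fclose(f);
    return level;
}
```

A hardening check could refuse to start, or at least log a warning, when aslr_level_name(read_aslr_level()) is anything other than "full".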

ASLR Limitations and Bypass Techniques

ASLR Bypass Techniques

ASLR Best Practices

ASLR Strengthening

Stack Canaries

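Stack canaries are random values placed on the stack between local variables and control data (saved frame pointer and return address). The function checks the value before returning, so a buffer overflow that reaches the return address is detected first.
Stack Canary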

Stack Canary Implementation

// Example function with canary protection
void vulnerable() {
    char buffer[16];
    // Compiler-generated stack layout:
    // [buffer] [canary] [saved_rbp] [return_addr]
    
    gets(buffer);  // Overflow!
    // If buffer overflow overwrites canary:
    //   __stack_chk_fail() called → process terminates
}

// What the compiler generates (simplified):
void vulnerable_protected() {
    // PROLOGUE
    unsigned long canary;
    canary = __stack_chk_guard;  // Load from TLS/global
    
    char buffer[16];
    // ... function body ...
    gets(buffer);
    
    // EPILOGUE
    if (canary != __stack_chk_guard) {
        __stack_chk_fail();  // ABORT! Stack smashed!
    }
    return;
}

// Canary is typically stored in:
// x86-64: fs:0x28 (Thread Local Storage segment)
// ARM64 (Linux user space): the global __stack_chk_guard variable by default

// View canary in GDB:
// (gdb) x/gx $fs_base+0x28
// 0x7ffff7fb3028: 0x1234567890abcd00  ◄── Random canary (glibc zeroes the low byte)

// Compile options:
// -fstack-protector         Protect functions with char buffers of 8+ bytes or alloca()
// -fstack-protector-strong  Protect more functions (any array, address-taken locals)
// -fstack-protector-all     Protect ALL functions (performance cost!)
// -fno-stack-protector      Disable (only for debugging!)

// Check if binary has stack protector:
$ readelf -s /bin/cat | grep stack_chk
    42: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __stack_chk_fail
// Presence of __stack_chk_fail means canaries enabled
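
The check itself can be imitated in portable C. The sketch below is an emulation, not what the compiler emits: the guard is a fixed constant rather than a per-process random value, it lives in a struct rather than a real stack frame, and all names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_VALUE UINT64_C(0x5AFEC0DE5AFEC0DE)

/* Emulated frame: the guard sits directly after the buffer. */
struct guarded_frame {
    char buffer[16];
    uint64_t guard;
};

/* Copy len bytes into the frame the way an unchecked copy would, then
 * report whether the guard survived. len is clamped to the struct size
 * so the demonstration itself never writes out of bounds. */
bool copy_and_check(struct guarded_frame *frame, const void *src, size_t len) {
    frame->guard = GUARD_VALUE;            /* "prologue": place the canary */
    if (len > sizeof *frame)
        len = sizeof *frame;
    memcpy(frame, src, len);               /* may run past buffer into guard */
    return frame->guard == GUARD_VALUE;    /* "epilogue": false = smashed */
}
```

Copying 16 bytes leaves the guard intact, while a single extra byte changes it and is detected. Real canaries are random so that an attacker cannot simply write the expected value back.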

Bypassing Stack Canaries

Stack Canary Bypass Techniques

Defense In Depth

Stack canaries are ONE layer. Always combine:

✓ Stack canaries      → Detect buffer overflow
✓ ASLR               → Randomize addresses  
✓ NX/DEP             → Prevent code execution
✓ RELRO              → Protect GOT/PLT
✓ PIE                → Position independent executable
✓ Seccomp            → Limit syscalls
✓ Safe coding        → No gets(), strcpy(), sprintf()

Modern systems use ALL of these together!

Non-Executable Memory (NX/DEP)

// Mark memory regions as non-executable
// Prevents code injection attacks

#include <stdio.h>     // perror
#include <sys/mman.h>

// Allocate writable, non-executable memory for generated code (rare, e.g. JITs)
void *mem = mmap(NULL, size,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mem == MAP_FAILED)
    perror("mmap");

// ... write the code into mem ...

// Then make it executable but no longer writable (W^X)
if (mprotect(mem, size, PROT_READ | PROT_EXEC) == -1)
    perror("mprotect");

// Stack and heap are typically:
// PROT_READ | PROT_WRITE (no PROT_EXEC)
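
Whether a process honors W^X can be checked from its /proc/<pid>/maps entries. A minimal sketch that classifies one line (the function name is illustrative; the line format is the standard maps layout of address range followed by permissions):

```c
#include <stdio.h>

/* Parse one /proc/<pid>/maps line and report whether the mapping is
 * both writable and executable (a W^X violation).
 * Returns 1 if W+X, 0 if not, -1 if the line cannot be parsed. */
int mapping_is_wx(const char *line) {
    unsigned long start, end;
    char perms[5];  /* e.g. "rw-p" plus terminator */

    if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
        return -1;
    return perms[1] == 'w' && perms[2] == 'x';
}
```

Feeding every line of /proc/self/maps through this function is a quick way to spot regions that are writable and executable at the same time.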

Kernel Security

KASLR (Kernel ASLR)

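# Check whether KASLR was disabled on the kernel command line
$ grep -o nokaslr /proc/cmdline
# No output = not disabled (the kernel must also be built with CONFIG_RANDOMIZE_BASE)

# Kernel symbol addresses are randomized at each boot (root is needed to see real addresses)
$ sudo cat /proc/kallsyms
ffffffffc0123456 t some_kernel_function
# The address differs on the next boot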
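
A plain substring search would also match a parameter that merely contains the text nokaslr. The sketch below checks for the flag as a whole token in a command-line string such as the contents of /proc/cmdline; the function name is illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if flag appears as a whole whitespace-separated token in cmdline. */
bool cmdline_has_flag(const char *cmdline, const char *flag) {
    size_t flag_len = strlen(flag);
    const char *p = cmdline;

    while (*p != '\0') {
        while (*p == ' ' || *p == '\n')   /* skip separators */
            p++;
        const char *start = p;
        while (*p != '\0' && *p != ' ' && *p != '\n')  /* find token end */
            p++;
        if (flag_len > 0 && (size_t)(p - start) == flag_len &&
            strncmp(start, flag, flag_len) == 0)
            return true;
    }
    return false;
}
```

With this check, "foo=nokaslrx" does not count as disabling KASLR, while a standalone "nokaslr" does.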

Kernel Hardening

# Restrict dmesg access
$ sysctl kernel.dmesg_restrict=1

# Restrict kernel pointer leaks
$ sysctl kernel.kptr_restrict=2

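# Disable kernel module loading (cannot be undone until reboot)
$ sysctl kernel.modules_disabled=1

# Forbid mapping the lowest 64 KiB (mitigates NULL-pointer dereference exploits)
$ sysctl vm.mmap_min_addr=65536

# Restrict ptrace
$ sysctl kernel.yama.ptrace_scope=2
# 0: Classic (any process with the same UID)
# 1: Restricted (only descendants can be traced)
# 2: Admin only (CAP_SYS_PTRACE required)
# 3: Disabled completely (cannot be changed until reboot)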

Authentication & Authorization

PAM (Pluggable Authentication Modules)

# /etc/pam.d/sshd
auth       required     pam_unix.so        # Password check
auth       required     pam_faillock.so    # Lock after failures
auth       optional     pam_gnome_keyring.so

account    required     pam_unix.so
account    required     pam_nologin.so     # Check /etc/nologin

password   required     pam_pwquality.so   # Password quality
password   required     pam_unix.so sha512

session    required     pam_limits.so      # Resource limits
session    required     pam_unix.so

sudo Configuration

# /etc/sudoers (use visudo to edit!)

# User alice can run any command as root
alice ALL=(ALL) ALL

# User bob can only restart nginx
bob ALL=(root) /usr/bin/systemctl restart nginx

# Group wheel can sudo without password (dangerous!)
%wheel ALL=(ALL) NOPASSWD: ALL

# Logging
Defaults    logfile=/var/log/sudo.log
Defaults    log_input, log_output

Cryptographic Security

Disk Encryption (LUKS)

# Set up an encrypted partition (luksFormat destroys existing data on the device)
$ cryptsetup luksFormat /dev/sda2
$ cryptsetup luksOpen /dev/sda2 encrypted_data
$ mkfs.ext4 /dev/mapper/encrypted_data
$ mount /dev/mapper/encrypted_data /mnt/secure

# Close when done
$ umount /mnt/secure
$ cryptsetup luksClose encrypted_data

Secure Boot Chain


Interview Deep Dive Questions

Answer:
Principle: Grant the minimum access needed to perform a function.
Examples:
  1. File Permissions:
    # Bad: World-readable config with passwords
    -rw-r--r-- config.ini
    
    # Good: Only owner can read
    -rw------- config.ini
    
  2. Service Accounts:
    # Bad: Web server runs as root
    User root
    
    # Good: Dedicated unprivileged user
    User www-data
    
  3. Capabilities:
    # Bad: Run as root to bind port 80
    sudo ./webserver
    
    # Good: Grant only needed capability
    setcap 'cap_net_bind_service=+ep' ./webserver
    ./webserver
    
  4. Containers:
    # Kubernetes: Drop all capabilities, add only needed
    securityContext:
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]
      runAsNonRoot: true
      readOnlyRootFilesystem: true
    
  5. Database Access:
    -- Bad: App connects as admin
    GRANT ALL ON *.* TO 'app'@'%';
    
    -- Good: Only needed permissions
    GRANT SELECT, INSERT, UPDATE ON shop.* TO 'app'@'%';
    
Benefits:
  • Limits blast radius of compromise
  • Reduces attack surface
  • Easier auditing
  • Defense in depth
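
For the service-account example, a daemon that must start as root can give up its privileges once setup is done. A minimal sketch under stated assumptions: the function name is illustrative, the target uid and gid are supplied by the caller, and capabilities are not handled here.

```c
#define _DEFAULT_SOURCE   /* setgroups() */
#include <grp.h>
#include <sys/types.h>
#include <unistd.h>

/* Permanently switch to an unprivileged uid/gid.
 * Returns 0 on success, -1 on failure. */
int drop_privileges(uid_t uid, gid_t gid) {
    /* Only a privileged process may (and needs to) clear supplementary groups. */
    if (geteuid() == 0 && setgroups(0, NULL) != 0)
        return -1;
    /* Group first: after setuid() we could no longer change it. */
    if (setgid(gid) != 0)
        return -1;
    if (setuid(uid) != 0)
        return -1;
    /* Verify the drop is irreversible: regaining root must fail. */
    if (uid != 0 && setuid(0) == 0)
        return -1;
    return 0;
}
```

The order matters: dropping the user ID first would remove the right to change groups, leaving the process with root's group memberships.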
Answer:
Layered Isolation:
Container Layered Isolation
What containers DON’T isolate:
  • Kernel (shared with host)
  • Time (system clock)
  • Kernel keyring
  • Some /proc, /sys entries
Container vs VM:
Aspect         | Container     | VM
Kernel         | Shared        | Separate
Overhead       | Low           | High
Startup        | Milliseconds  | Seconds
Isolation      | Process-level | Hardware-level
Attack surface | Larger        | Smaller
Best practices:
# Run as non-root
docker run --user 1000:1000 myapp

# Read-only filesystem
docker run --read-only myapp

# Drop capabilities
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE myapp

# Use seccomp profile
docker run --security-opt seccomp=profile.json myapp
Answer:
Attack:
void vulnerable() {
    char buffer[64];
    gets(buffer);  // No bounds checking!
    // Attacker sends 100 bytes, overwrites return address
}

// Stack before overflow:
[buffer (64 bytes)] [saved_rbp] [return_addr]

// After overflow:
[shellcode..............................] [jmp_to_buf]
                                               ↑
                                    Now points to shellcode
Protections:
  1. Stack Canaries:
    [buffer] [CANARY] [saved_rbp] [return_addr]
    
    Before return: check if canary changed
    If changed: __stack_chk_fail() → abort
    
  2. ASLR:
    Addresses randomized each run
    Attacker can't know where to jump
    
  3. NX/DEP:
    Stack marked non-executable
    Shellcode can't run even if injected
    
  4. RELRO (Relocation Read-Only):
    GOT made read-only after startup
    Prevents GOT overwrite attacks
    
  5. PIE (Position Independent Executable):
    Code segment also randomized
    No fixed addresses to target
    
Checking protections:
$ checksec --file=/usr/bin/nginx
RELRO         STACK CANARY    NX      PIE
Full RELRO    Canary found    NX on   PIE enabled
Modern attacks bypass with:
  • ROP (Return-Oriented Programming)
  • Information leaks to defeat ASLR
  • Heap exploitation
Answer:
Core Concept: Every process and file has a security context.
Format: user:role:type:level
Example: system_u:system_r:httpd_t:s0
Type Enforcement:
httpd_t (Apache process type)
   │
   ├── CAN access httpd_sys_content_t (web files)
   ├── CAN access httpd_log_t (logs)
   ├── CANNOT access user_home_t (home directories)
   └── CANNOT access shadow_t (/etc/shadow)
Policy Rules:
# Allow httpd to open and read content
allow httpd_t httpd_sys_content_t:file { open read getattr };

# Allow httpd to open and write logs
allow httpd_t httpd_log_t:file { open write append };

# Deny by default - anything not allowed is blocked
Workflow when Apache accesses a file:
1. Apache (httpd_t) tries to read /var/www/index.html
2. File has context httpd_sys_content_t
3. SELinux checks policy: httpd_t → httpd_sys_content_t:file:read
4. Policy allows → access granted

If Apache tries to read /etc/shadow:
1. File has context shadow_t
2. SELinux checks: httpd_t → shadow_t:file:read
3. No policy allows this → DENIED
4. Even if DAC allows (unlikely), SELinux blocks
Troubleshooting:
# Check for denials
$ ausearch -m AVC -ts recent

# Generate policy module for denial
$ audit2allow -a -M mymodule
$ semodule -i mymodule.pp

# Temporarily set permissive (logs only)
$ setenforce 0
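
The default-deny lookup described in this answer can be sketched as a table search. This is a toy model based on the allow rules above, not how the kernel stores policy: the struct, the table and the function name are illustrative, and the object class "file" is implied.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct te_rule {
    const char *source;  /* process type (domain) */
    const char *target;  /* object type */
    const char *perm;    /* one permission, e.g. "read" */
};

/* Toy policy: every entry is an allow rule. */
static const struct te_rule policy[] = {
    {"httpd_t", "httpd_sys_content_t", "read"},
    {"httpd_t", "httpd_sys_content_t", "getattr"},
    {"httpd_t", "httpd_log_t", "write"},
    {"httpd_t", "httpd_log_t", "append"},
};

/* Default deny: access is granted only if some allow rule matches. */
bool te_allowed(const char *source, const char *target, const char *perm) {
    for (size_t i = 0; i < sizeof policy / sizeof policy[0]; i++) {
        if (strcmp(policy[i].source, source) == 0 &&
            strcmp(policy[i].target, target) == 0 &&
            strcmp(policy[i].perm, perm) == 0)
            return true;
    }
    return false;
}
```

A request from httpd_t to read shadow_t falls through the whole table and is denied, which mirrors the /etc/shadow walk-through.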
Answer:
Requirements:
  • Multiple customers share infrastructure
  • Complete isolation between tenants
  • Resource limits per tenant
  • Audit logging
Architecture:
Secure Multi-Tenant Architecture
Isolation Mechanisms:
  1. Network Level:
    # Kubernetes NetworkPolicy
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: tenant-a-isolation  # hypothetical name
    spec:
      podSelector:
        matchLabels:
          tenant: A
      policyTypes:
        - Ingress
      ingress:
        - from:
          - podSelector:
              matchLabels:
                tenant: A
      # Tenant A pods accept traffic only from other Tenant A pods
    
  2. Data Level:
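    -- Row-level security (PostgreSQL)
    ALTER TABLE data ENABLE ROW LEVEL SECURITY;
    CREATE POLICY tenant_isolation ON data
    USING (tenant_id = current_setting('app.tenant_id'));
    -- current_setting() returns text; cast it if tenant_id is not a text column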
    
  3. Compute Level:
    # Resource quotas
    resources:
      limits:
        cpu: "2"
        memory: "4Gi"
      requests:
        cpu: "500m"
        memory: "1Gi"
    
  4. Audit:
    # All API calls logged with tenant context
    # Immutable audit trail
    # Anomaly detection
    
Defense in Depth:
  • Every layer assumes others might fail
  • Multiple redundant security controls
  • Zero trust between components

Security Checklist

1. Principle of Least Privilege
Run services with the minimum required permissions. Use capabilities instead of root.

2. Enable Security Features
ASLR, stack canaries, NX bit, RELRO, PIE, SELinux/AppArmor.

3. Limit Attack Surface
Remove unused software, close unused ports, disable unnecessary services.

4. Implement Defense in Depth
Use multiple security layers. Don’t rely on a single control.

5. Audit and Monitor
Log security events, monitor for anomalies, and run regular vulnerability scans.

Key Takeaways

Least Privilege

Always grant the minimum necessary access. Drop capabilities as soon as privileged setup is done.

Defense in Depth

Multiple security layers. Seccomp + capabilities + namespaces + MAC.

Container Security

Containers share kernel. Use all isolation mechanisms together.

Memory Protection

ASLR, canaries, and NX work together. Modern attacks bypass them with ROP and information leaks.
