
Security Modules & Capabilities

Linux security is multi-layered. Understanding these mechanisms is essential for infrastructure engineers building secure container platforms and debugging permission issues.
Interview Frequency: High (especially for container/cloud roles)
Key Topics: LSM framework, capabilities, seccomp-bpf, SELinux/AppArmor
Time to Master: 12-14 hours

Linux Security Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LINUX SECURITY LAYERS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  User Space                                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │  Application                                                             ││
│  │  └── Runs with specific UID/GID                                         ││
│  │  └── Limited capabilities                                               ││
│  │  └── Seccomp filter active                                              ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
│  ════════════════════════════════════════════════════════════════════════   │
│                         System Call Interface                                │
│  ════════════════════════════════════════════════════════════════════════   │
│                                                                              │
│  Kernel Space                                                                │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                                                                          ││
│  │  1. SECCOMP-BPF (System Call Filter)                                    ││
│  │     └── Runs first: allows/denies/traces syscalls                      ││
│  │     └── Cannot be bypassed                                              ││
│  │                                                                          ││
│  │  2. CAPABILITIES CHECK                                                   ││
│  │     └── Fine-grained root privileges                                    ││
│  │     └── Checked before operations                                       ││
│  │                                                                          ││
│  │  3. DAC (Discretionary Access Control)                                  ││
│  │     └── Traditional Unix permissions                                    ││
│  │     └── UID/GID ownership, rwx bits                                    ││
│  │                                                                          ││
│  │  4. LSM HOOKS (Mandatory Access Control)                                ││
│  │     └── SELinux, AppArmor, or other modules                            ││
│  │     └── Policy-based, cannot be overridden by owner                    ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Capabilities: Dividing Root Power

In traditional Unix, UID 0 (root) bypasses all permission checks. Linux capabilities split that power into about 40 distinct privileges (41 as of Linux 5.9) that can be granted or dropped independently.

Common Capabilities

| Capability | What It Allows |
| --- | --- |
| CAP_NET_ADMIN | Configure network interfaces, routing, firewall |
| CAP_NET_RAW | Use raw and packet sockets |
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 |
| CAP_SYS_ADMIN | "Kitchen sink": mount, swapon, sethostname, and many other admin operations |
| CAP_SYS_PTRACE | Trace any process |
| CAP_DAC_OVERRIDE | Bypass file permission checks |
| CAP_CHOWN | Make arbitrary changes to file UIDs and GIDs |
| CAP_SETUID | Set arbitrary UID |
| CAP_SETGID | Set arbitrary GID |
| CAP_KILL | Bypass permission checks when sending signals |
| CAP_SYS_RESOURCE | Override resource limits |
| CAP_SYS_TIME | Set system clock |
| CAP_SYS_MODULE | Load/unload kernel modules |
| CAP_MKNOD | Create special files with mknod |
| CAP_AUDIT_WRITE | Write to audit log |
| CAP_SETFCAP | Set file capabilities |

Capability Sets

Each process has multiple capability sets:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CAPABILITY SETS                                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Process capability sets:                                                    │
│                                                                              │
│  EFFECTIVE (CapEff)                                                          │
│  └── Capabilities the kernel actually checks                                │
│  └── This is what matters for permission checks                             │
│                                                                              │
│  PERMITTED (CapPrm)                                                          │
│  └── Maximum capabilities this process CAN have                             │
│  └── Can add caps from permitted to effective                               │
│  └── Once dropped from permitted, cannot be regained                        │
│                                                                              │
│  INHERITABLE (CapInh)                                                        │
│  └── Capabilities preserved across execve()                                 │
│  └── Combined with file capabilities on exec                                │
│                                                                              │
│  BOUNDING (CapBnd)                                                           │
│  └── Limit on capabilities that can ever be gained                          │
│  └── Dropping is irreversible                                               │
│                                                                              │
│  AMBIENT (CapAmb)                                                            │
│  └── Capabilities preserved across exec of unprivileged programs           │
│  └── Subset of permitted ∩ inheritable                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
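How these sets interact on execve() can be sketched as a small model. The formulas below follow the rules given in the capabilities(7) man page for a non-root process; the struct and function names (`proc_caps`, `file_caps`, `caps_after_execve`) are made up for this illustration and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Capability sets as 64-bit masks: bit N stands for capability number N. */
struct proc_caps {
    uint64_t permitted, effective, inheritable, bounding, ambient;
};

/* File capabilities as stored by setcap; "effective" is a single flag. */
struct file_caps {
    uint64_t permitted, inheritable;
    bool effective;
    bool privileged;   /* file has capabilities or a set-UID/set-GID bit */
};

/* Model of the capabilities(7) transformation rules during execve(). */
struct proc_caps caps_after_execve(struct proc_caps p, struct file_caps f)
{
    struct proc_caps n;

    n.ambient     = f.privileged ? 0 : p.ambient;
    n.permitted   = (p.inheritable & f.inheritable) |
                    (f.permitted & p.bounding) | n.ambient;
    n.effective   = f.effective ? n.permitted : n.ambient;
    n.inheritable = p.inheritable;   /* unchanged */
    n.bounding    = p.bounding;      /* unchanged */
    return n;
}
```

For example, a binary marked with `cap_net_bind_service=+ep` has bit 10 in its file permitted set and the effective flag on, so an unprivileged process that executes it ends up with bit 10 in both CapPrm and CapEff, provided the bounding set still contains that bit.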

Working with Capabilities

# View process capabilities
cat /proc/$$/status | grep Cap
# CapInh: 0000000000000000
# CapPrm: 0000003fffffffff
# CapEff: 0000003fffffffff
# CapBnd: 0000003fffffffff
# CapAmb: 0000000000000000

# Decode capability bits
capsh --decode=0000003fffffffff

# View capabilities of a process
getpcaps $$

# Set file capabilities (run as non-root but with cap)
sudo setcap cap_net_bind_service=+ep /path/to/binary

# View file capabilities
getcap /path/to/binary

# Run with specific capabilities
capsh --caps="cap_net_bind_service+eip" -- -c "./my_server"

# Drop all capabilities except specific ones
capsh --drop=all --caps="cap_net_raw+eip" -- -c "./my_program"
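What `capsh --decode` does with those hex strings can be approximated by hand: each bit position is a capability number. The sketch below assumes the capability numbers from `<linux/capability.h>` (only three are listed), and the helper names (`cap_mask_parse`, `cap_mask_has`, `cap_mask_count`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* A few capability numbers as defined in <linux/capability.h>. */
enum {
    CAP_NET_BIND_SERVICE_NR = 10,
    CAP_NET_RAW_NR          = 13,
    CAP_SYS_ADMIN_NR        = 21,
};

/* Parse a Cap* value from /proc/<pid>/status, e.g. "00000000a80425fb". */
uint64_t cap_mask_parse(const char *hex)
{
    return strtoull(hex, NULL, 16);
}

/* True if capability number `cap` is set in the mask. */
bool cap_mask_has(uint64_t mask, int cap)
{
    return cap >= 0 && cap < 64 && ((mask >> cap) & 1);
}

/* Number of capabilities present in the mask. */
int cap_mask_count(uint64_t mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)   /* clear the lowest set bit each pass */
        n++;
    return n;
}
```

The mask `00000000a80425fb`, which is commonly seen as CapEff inside a default Docker container, has 14 bits set, including CAP_NET_BIND_SERVICE and CAP_NET_RAW but not CAP_SYS_ADMIN. The mask `0000003fffffffff` from the output above has bits 0 through 37 set.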

Kernel Capability Checks

// Kernel checks capabilities like this (simplified; details and helper
// names vary by kernel version):
// kernel/capability.c

bool capable(int cap)
{
    return ns_capable(&init_user_ns, cap);
}

bool ns_capable(struct user_namespace *ns, int cap)
{
    if (unlikely(!cap_valid(cap)))
        return false;
    
    if (security_capable(current_cred(), ns, cap, CAP_OPT_NONE) == 0)
        return true;
    
    return false;
}

// Example: binding to privileged port
// net/ipv4/af_inet.c
if (snum < inet_prot_sock(net) &&
    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
    return -EACCES;

Linux Security Modules (LSM)

LSM provides hooks throughout the kernel for mandatory access control:

LSM Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LSM HOOK ARCHITECTURE                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Kernel Operation (e.g., file open)                                          │
│                                                                              │
│  vfs_open()                                                                  │
│  │                                                                           │
│  ├─── Check DAC permissions (traditional Unix)                              │
│  │                                                                           │
│  └─── Call LSM hook: security_file_open()                                   │
│       │                                                                      │
│       ▼                                                                      │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                  LSM Framework                                         │  │
│  │                                                                        │  │
│  │  security_file_open()                                                  │  │
│  │  │                                                                     │  │
│  │  ├─── call_int_hook(file_open, ...)                                   │  │
│  │  │    │                                                                │  │
│  │  │    ├─► selinux_file_open()     ← SELinux check                    │  │
│  │  │    │   └── Check if access allowed by policy                       │  │
│  │  │    │                                                                │  │
│  │  │    ├─► apparmor_file_open()    ← AppArmor check                   │  │
│  │  │    │   └── Check if profile allows access                         │  │
│  │  │    │                                                                │  │
│  │  │    └─► bpf_lsm_file_open()     ← BPF LSM (if enabled)             │  │
│  │  │        └── Custom eBPF policy                                      │  │
│  │  │                                                                     │  │
│  │  └─── All hooks must return 0 for access to be granted               │  │
│  │                                                                        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
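The "all hooks must return 0" rule in the diagram can be illustrated with a user-space model of the hook chain. Everything here is a stand-in: `file_req`, the two demo hooks, and `demo_security_file_open` are invented for the example and only mimic the shape of the kernel's dispatch, which stops at the first module that returns a non-zero value.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct file. */
struct file_req { const char *path; };

typedef int (*file_open_hook)(const struct file_req *);

/* Plays the role of a label-based module such as SELinux. */
static int demo_label_hook(const struct file_req *f)
{
    return strcmp(f->path, "/etc/shadow") == 0 ? -EACCES : 0;
}

/* Plays the role of a path-based module such as AppArmor. */
static int demo_profile_hook(const struct file_req *f)
{
    return strncmp(f->path, "/root/", 6) == 0 ? -EACCES : 0;
}

/* How many hooks the last call consulted (for demonstration only). */
static int hooks_called;

int demo_security_file_open(const struct file_req *f)
{
    static const file_open_hook hooks[] = { demo_label_hook, demo_profile_hook };

    hooks_called = 0;
    for (size_t i = 0; i < sizeof(hooks) / sizeof(hooks[0]); i++) {
        hooks_called++;
        int rc = hooks[i](f);
        if (rc != 0)
            return rc;   /* first denial wins; later modules are not asked */
    }
    return 0;            /* every module agreed: access may proceed */
}
```

A request is granted only if no module objects, so stacking modules can only narrow what DAC already allowed, never widen it.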

LSM Hooks

// include/linux/lsm_hooks.h (partial list, simplified; recent kernels generate
// these members from include/linux/lsm_hook_defs.h and signatures vary by version)
union security_list_options {
    // Credential operations
    int (*cred_prepare)(struct cred *new, const struct cred *old, gfp_t gfp);
    
    // File operations
    int (*file_permission)(struct file *file, int mask);
    int (*file_open)(struct file *file);
    int (*file_mprotect)(struct vm_area_struct *vma, unsigned long reqprot,
                         unsigned long prot);
    
    // Process operations (task_alloc replaced the older task_create hook)
    int (*task_alloc)(struct task_struct *task, unsigned long clone_flags);
    int (*task_kill)(struct task_struct *p, struct kernel_siginfo *info,
                     int sig, const struct cred *cred);
    
    // Socket operations
    int (*socket_create)(int family, int type, int protocol, int kern);
    int (*socket_bind)(struct socket *sock, struct sockaddr *address, int addrlen);
    int (*socket_connect)(struct socket *sock, struct sockaddr *address, int addrlen);
    
    // And many more...
};

SELinux

SELinux implements type enforcement: every process and object carries a security context, and policy decides which types may interact:

SELinux Context

# View file context
ls -Z /etc/passwd
# -rw-r--r--. root root system_u:object_r:passwd_file_t:s0 /etc/passwd

# Context format: user:role:type:level
# user:    SELinux user (system_u, unconfined_u)
# role:    Role (object_r for files, unconfined_r for processes)
# type:    Type/domain (passwd_file_t, httpd_t)
# level:   MLS/MCS level (s0, s0-s15:c0.c1023)

# View process context
ps -Z
# LABEL                               PID TTY          TIME CMD
# unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 1234 pts/0 00:00:00 bash

# Current context
id -Z
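One detail of the `user:role:type:level` format is easy to miss when parsing it: the level field may itself contain colons (as in `s0-s0:c0.c1023`), so only the first three colons are separators. A minimal sketch, with a hypothetical `se_context` struct and `se_context_parse` helper (libselinux has its own context API, which this does not use):

```c
#include <stdio.h>
#include <string.h>

struct se_context {
    char user[64], role[64], type[64], level[64];
};

/* Split "user:role:type:level"; returns 0 on success, -1 if a field is missing. */
int se_context_parse(const char *ctx, struct se_context *out)
{
    const char *c1 = strchr(ctx, ':');
    const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
    const char *c3 = c2 ? strchr(c2 + 1, ':') : NULL;

    if (!c3)
        return -1;

    snprintf(out->user, sizeof(out->user), "%.*s", (int)(c1 - ctx), ctx);
    snprintf(out->role, sizeof(out->role), "%.*s", (int)(c2 - c1 - 1), c1 + 1);
    snprintf(out->type, sizeof(out->type), "%.*s", (int)(c3 - c2 - 1), c2 + 1);
    /* Everything after the third colon is the level, colons included. */
    snprintf(out->level, sizeof(out->level), "%s", c3 + 1);
    return 0;
}
```

For type enforcement the `type` field is the one that matters most: it is the domain for a process and the type for a file.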

SELinux Policy Rules

# Type enforcement rule format:
# allow source_type target_type : object_class { permissions }

# Example: Allow httpd to read passwd_file_t
allow httpd_t passwd_file_t : file { read open getattr };

# Example: Allow httpd to connect to HTTP ports
allow httpd_t http_port_t : tcp_socket { name_connect };

# Query policy
sesearch --allow --source httpd_t --target passwd_file_t

# View denials in audit log
ausearch -m avc -ts today

# Generate policy from denials
audit2allow -a -M mypolicy
semodule -i mypolicy.pp
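The semantics of an allow rule are default-deny: an access is permitted only if some rule lists the source type, target type, object class, and permission. The toy lookup below models that with the two example rules above. It is an illustration only; the real kernel uses a compiled policy and an access vector cache, and `te_rule`, `demo_policy`, and `te_allowed` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct te_rule {
    const char *source, *target, *tclass;
    const char *perms[4];   /* unused slots stay NULL */
};

/* The two allow rules from the example above. */
static const struct te_rule demo_policy[] = {
    { "httpd_t", "passwd_file_t", "file",       { "read", "open", "getattr" } },
    { "httpd_t", "http_port_t",   "tcp_socket", { "name_connect" } },
};

bool te_allowed(const char *source, const char *target,
                const char *tclass, const char *perm)
{
    for (size_t i = 0; i < sizeof(demo_policy) / sizeof(demo_policy[0]); i++) {
        const struct te_rule *r = &demo_policy[i];

        if (strcmp(r->source, source) != 0 || strcmp(r->target, target) != 0 ||
            strcmp(r->tclass, tclass) != 0)
            continue;
        for (size_t j = 0; j < 4 && r->perms[j]; j++)
            if (strcmp(r->perms[j], perm) == 0)
                return true;
    }
    return false;   /* no matching allow rule: the access is denied */
}
```

With this policy, `httpd_t` may read `passwd_file_t` files but not write them, and any access to a type with no rule at all (for example `admin_home_t`) is denied.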

SELinux Modes

# Check current mode
getenforce   # Enforcing, Permissive, or Disabled

# Temporarily change mode
setenforce 0   # Permissive (logs but doesn't block)
setenforce 1   # Enforcing

# Permanent change: edit /etc/selinux/config
SELINUX=enforcing
SELINUXTYPE=targeted

SELinux Booleans

# List all booleans
getsebool -a

# Common examples
setsebool -P httpd_can_network_connect on
setsebool -P container_manage_cgroup on

# View what booleans affect a domain
semanage boolean -l | grep httpd

AppArmor

AppArmor provides profile-based MAC keyed on file paths, which is generally simpler to write than SELinux policy:

AppArmor Profiles

# View loaded profiles
aa-status

# Profile locations
ls /etc/apparmor.d/

# Example profile: /etc/apparmor.d/usr.sbin.nginx
#include <tunables/global>

/usr/sbin/nginx {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  capability net_bind_service,
  capability setgid,
  capability setuid,

  /etc/nginx/** r,
  /var/log/nginx/** rw,
  /var/www/** r,
  /run/nginx.pid rw,
  
  network inet tcp,
  network inet6 tcp,
  
  deny /etc/shadow r,
}
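To make the path rules in this profile concrete, here is a simplified model of how they could be evaluated. It assumes only two glob forms (`*` stays within one path component, `**` may cross `/`) and that a matching `deny` rule overrides any allow. Real AppArmor compiles profiles to a DFA and supports more syntax and edge cases, and the included abstractions grant additional access that this sketch ignores. All names (`glob_match`, `aa_rule`, `aa_allows`) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified glob: "*" matches within a component, "**" also crosses '/'. */
static bool glob_match(const char *pat, const char *path)
{
    if (*pat == '\0')
        return *path == '\0';
    if (pat[0] == '*') {
        bool cross = (pat[1] == '*');
        const char *rest = pat + (cross ? 2 : 1);

        for (const char *p = path; ; p++) {
            if (glob_match(rest, p))
                return true;
            if (*p == '\0' || (!cross && *p == '/'))
                return false;
        }
    }
    return *path == *pat && glob_match(pat + 1, path + 1);
}

struct aa_rule { const char *pattern; const char *perms; bool deny; };

/* File rules from the nginx profile above (abstractions omitted). */
static const struct aa_rule nginx_rules[] = {
    { "/etc/nginx/**",     "r",  false },
    { "/var/log/nginx/**", "rw", false },
    { "/var/www/**",       "r",  false },
    { "/run/nginx.pid",    "rw", false },
    { "/etc/shadow",       "r",  true  },
};

/* perm is 'r' or 'w'. A matching deny rule always wins. */
bool aa_allows(const char *path, char perm)
{
    bool allowed = false;

    for (size_t i = 0; i < sizeof(nginx_rules) / sizeof(nginx_rules[0]); i++) {
        const struct aa_rule *r = &nginx_rules[i];

        if (!strchr(r->perms, perm) || !glob_match(r->pattern, path))
            continue;
        if (r->deny)
            return false;
        allowed = true;
    }
    return allowed;   /* no matching allow rule means the access is denied */
}
```

In this model nginx can read `/etc/nginx/conf.d/site.conf` and write `/var/log/nginx/access.log`, but cannot write under `/var/www` or read `/etc/shadow`.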

AppArmor Modes

# Enforce mode (default)
aa-enforce /etc/apparmor.d/usr.sbin.nginx

# Complain mode (log but don't block)
aa-complain /etc/apparmor.d/usr.sbin.nginx

# Disable
aa-disable /etc/apparmor.d/usr.sbin.nginx

# Generate profile interactively
aa-genprof /usr/sbin/nginx

# Update profile from logs
aa-logprof

AppArmor in Containers

# Docker run with AppArmor profile
docker run --security-opt apparmor=docker-default nginx

# Kubernetes pod with AppArmor (annotation form; Kubernetes 1.30+ also supports the securityContext.appArmorProfile field)
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    container.apparmor.security.beta.kubernetes.io/nginx: runtime/default
spec:
  containers:
  - name: nginx
    image: nginx

Seccomp-BPF

Filter system calls with BPF programs:

How Seccomp Works

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SECCOMP-BPF FILTERING                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  System Call Entry                                                           │
│  │                                                                           │
│  ▼                                                                           │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                    Seccomp BPF Filter                                    ││
│  │                                                                          ││
│  │  Input: syscall number, arguments                                       ││
│  │                                                                          ││
│  │  ┌─────────────────────────────────────────────────────────────────┐    ││
│  │  │  BPF Program evaluates:                                          │    ││
│  │  │  if (syscall == SYS_open) {                                     │    ││
│  │  │      return SECCOMP_RET_ALLOW;                                  │    ││
│  │  │  } else if (syscall == SYS_ptrace) {                           │    ││
│  │  │      return SECCOMP_RET_KILL;                                  │    ││
│  │  │  } else {                                                       │    ││
│  │  │      return SECCOMP_RET_ERRNO | EPERM;                         │    ││
│  │  │  }                                                              │    ││
│  │  └─────────────────────────────────────────────────────────────────┘    ││
│  │                                                                          ││
│  │  Return values:                                                          ││
│  │  SECCOMP_RET_ALLOW     → Continue to syscall                            ││
│  │  SECCOMP_RET_KILL      → Kill thread (KILL_PROCESS: whole process)      ││
│  │  SECCOMP_RET_TRAP      → Send SIGSYS signal                             ││
│  │  SECCOMP_RET_ERRNO     → Return error without executing                 ││
│  │  SECCOMP_RET_TRACE     → Notify tracer (for debugging)                  ││
│  │  SECCOMP_RET_LOG       → Allow but log                                  ││
│  │  SECCOMP_RET_USER_NOTIF → Notify userspace handler                      ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
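The pseudocode in the diagram corresponds to a classic BPF program over `struct seccomp_data`. The following sketch builds a small deny-list filter that fails `ptrace` with EPERM and allows everything else. It assumes an x86-64 build (other architectures need a different `AUDIT_ARCH_*` constant) and kernel headers new enough to define `SECCOMP_RET_KILL_PROCESS`; the names `deny_ptrace` and `install_deny_ptrace` are made up for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Assumption: the process runs on x86-64. */
static struct sock_filter deny_ptrace[] = {
    /* [0] A = seccomp_data.arch */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* [1] if A == x86-64 jump to [3], otherwise fall through to [2] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    /* [2] unexpected architecture: kill the whole process */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [3] A = seccomp_data.nr (the syscall number) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [4] if A == ptrace fall through to [5], otherwise jump to [6] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
    /* [5] fail the call with EPERM without executing it */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* [6] every other syscall is allowed */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Install the filter for the calling thread; returns 0 on success. */
int install_deny_ptrace(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(deny_ptrace) / sizeof(deny_ptrace[0])),
        .filter = deny_ptrace,
    };

    /* Unprivileged processes must set no_new_privs before adding a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Checking the architecture first matters because syscall numbers differ between ABIs. Once installed, a filter cannot be removed, and it is inherited by children created with fork() or clone() and kept across execve().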

Writing Seccomp Filters

// Simple allow-list filter using libseccomp (link with -lseccomp)
#include <fcntl.h>      // O_ACCMODE, O_RDONLY
#include <seccomp.h>

int setup_seccomp(void)
{
    scmp_filter_ctx ctx;
    int rc = -1;

    // Default action for any syscall not listed below. SCMP_ACT_KILL kills
    // the offending thread; SCMP_ACT_KILL_PROCESS (libseccomp 2.4+) kills
    // the whole process.
    ctx = seccomp_init(SCMP_ACT_KILL);
    if (!ctx)
        return -1;

    // Allow specific syscalls unconditionally
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0) < 0)
        goto out;

    // Allow open() only when the access mode in its flags argument
    // (argument 1) is O_RDONLY. Modern libc usually calls openat(), where
    // the flags are argument 2, so a real filter would cover both.
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(open), 1,
                         SCMP_A1(SCMP_CMP_MASKED_EQ, O_ACCMODE, O_RDONLY)) < 0)
        goto out;

    // Load the filter into the kernel
    if (seccomp_load(ctx) < 0)
        goto out;

    rc = 0;
out:
    seccomp_release(ctx);
    return rc;
}

Docker Seccomp Profile

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "defaultErrnoRet": 1,
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [
        {
            "names": ["read", "write", "open", "close", "stat", "fstat"],
            "action": "SCMP_ACT_ALLOW"
        },
        {
            "names": ["ptrace"],
            "action": "SCMP_ACT_ERRNO",
            "errnoRet": 1
        },
        {
            "names": ["clone"],
            "action": "SCMP_ACT_ALLOW",
            "args": [
                {
                    "index": 0,
                    "op": "SCMP_CMP_MASKED_EQ",
                    "value": 2114060288,
                    "valueTwo": 0
                }
            ]
        }
    ]
}

# Run container with seccomp profile
docker run --security-opt seccomp=/path/to/profile.json myimage
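The `clone` rule in this profile is worth decoding. `SCMP_CMP_MASKED_EQ` matches when `(arg & value) == valueTwo`, and the decimal `value` 2114060288 is the OR of the namespace-creating clone flags, so the rule allows `clone` only when none of those flags is set. The sketch below checks that arithmetic; the flag values match `<linux/sched.h>`, while the `NS_*` names and both helper functions are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Namespace-creating clone flags; values match <linux/sched.h>. */
enum {
    NS_NEWNS     = 0x00020000,
    NS_NEWCGROUP = 0x02000000,
    NS_NEWUTS    = 0x04000000,
    NS_NEWIPC    = 0x08000000,
    NS_NEWUSER   = 0x10000000,
    NS_NEWPID    = 0x20000000,
    NS_NEWNET    = 0x40000000,
};

#define PROFILE_MASK      2114060288ULL   /* "value" in the profile */
#define PROFILE_VALUE_TWO 0ULL            /* "valueTwo" in the profile */

/* OR of all namespace flags; equals 0x7E020000. */
uint64_t ns_flags_mask(void)
{
    return NS_NEWNS | NS_NEWCGROUP | NS_NEWUTS | NS_NEWIPC |
           NS_NEWUSER | NS_NEWPID | NS_NEWNET;
}

/* True if the profile's clone rule matches, i.e. the call is allowed. */
bool clone_rule_matches(uint64_t clone_flags)
{
    return (clone_flags & PROFILE_MASK) == PROFILE_VALUE_TWO;
}
```

An ordinary thread creation (flags such as CLONE_VM and CLONE_THREAD) matches the rule and is allowed. A call that includes CLONE_NEWUSER does not match, so it falls through to the profile's default action and fails with errno 1 (EPERM).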

eBPF LSM

Write custom security policies with eBPF:
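// Custom BPF LSM programs (needs CONFIG_BPF_LSM and "bpf" in the active LSM list)
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

#define EPERM   1
#define SIGKILL 9

// BPF LSM programs must declare a GPL-compatible license
char LICENSE[] SEC("license") = "GPL";

// PID to protect; set from user space before the program is loaded
const volatile pid_t protected_pid = 0;

SEC("lsm/file_open")
int BPF_PROG(restrict_file_open, struct file *file)
{
    const unsigned char *filename;
    char buf[256];
    
    filename = BPF_CORE_READ(file, f_path.dentry, d_name.name);
    bpf_probe_read_kernel_str(buf, sizeof(buf), filename);
    
    // Block files whose final path component is exactly "shadow".
    // d_name holds only the basename, so this covers /etc/shadow but
    // also any other file with that name.
    if (buf[0] == 's' && buf[1] == 'h' && buf[2] == 'a' &&
        buf[3] == 'd' && buf[4] == 'o' && buf[5] == 'w' && buf[6] == '\0') {
        return -EPERM;
    }
    
    return 0;  // Allow
}

SEC("lsm/task_kill")
int BPF_PROG(restrict_kill, struct task_struct *p,
             struct kernel_siginfo *info, int sig,
             const struct cred *cred)
{
    // Block SIGKILL to the protected process (tgid is the user-space PID)
    if (sig == SIGKILL && BPF_CORE_READ(p, tgid) == protected_pid) {
        return -EPERM;
    }
    return 0;
}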

Container Security Stack

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CONTAINER SECURITY LAYERS                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Container Runtime Security                                                  │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                                                                          ││
│  │  1. NAMESPACES (Isolation)                                              ││
│  │     ├── PID: Isolated process tree                                     ││
│  │     ├── Network: Separate network stack                                ││
│  │     ├── Mount: Separate filesystem view                                ││
│  │     ├── User: UID/GID mapping (rootless containers)                   ││
│  │     └── IPC, UTS, Cgroup: Various isolations                          ││
│  │                                                                          ││
│  │  2. CGROUPS (Resource Limits)                                           ││
│  │     ├── CPU: Quota, shares                                             ││
│  │     ├── Memory: Limits, swap                                           ││
│  │     └── IO: Bandwidth limits                                           ││
│  │                                                                          ││
│  │  3. CAPABILITIES (Reduced Privileges)                                   ││
│  │     ├── Drop all, add only what's needed                              ││
│  │     └── Default: ~14 capabilities granted                             ││
│  │                                                                          ││
│  │  4. SECCOMP (Syscall Filtering)                                         ││
│  │     ├── Default profile blocks ~44 dangerous syscalls                  ││
│  │     └── Custom profiles for specific workloads                         ││
│  │                                                                          ││
│  │  5. LSM (SELinux/AppArmor)                                              ││
│  │     ├── Mandatory access control                                       ││
│  │     └── Per-container policies                                         ││
│  │                                                                          ││
│  │  6. READ-ONLY ROOTFS                                                     ││
│  │     └── Prevent persistent changes                                     ││
│  │                                                                          ││
│  │  7. NO-NEW-PRIVILEGES                                                    ││
│  │     └── Prevent privilege escalation via setuid                        ││
│  │                                                                          ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Docker Security Options

# Run with minimal capabilities
docker run --cap-drop=all --cap-add=net_bind_service nginx

# No new privileges (block setuid)
docker run --security-opt=no-new-privileges:true myapp

# Custom seccomp profile
docker run --security-opt seccomp=profile.json myapp

# Custom AppArmor profile
docker run --security-opt apparmor=my-profile myapp

# Read-only rootfs
docker run --read-only --tmpfs /tmp myapp

# Run as non-root
docker run --user 1000:1000 myapp

# Completely unprivileged (rootless Docker)
# Docker daemon runs as non-root user
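The no-new-privileges option above maps to a per-process kernel flag that a program can also set on itself. A minimal sketch using `prctl(2)`; the two wrapper names are hypothetical, and the flag cannot be cleared once set.

```c
#include <sys/prctl.h>

/* Set no_new_privs for the calling thread; returns 0 on success. */
int enable_no_new_privs(void)
{
    return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}

/* Returns 1 if no_new_privs is set, 0 if not, -1 on error. */
int no_new_privs_enabled(void)
{
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

After the flag is set, execve() no longer grants extra privileges through set-UID or set-GID bits or file capabilities, and the setting is inherited by child processes.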

Kubernetes Pod Security

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
          - ALL
        add:
          - NET_BIND_SERVICE

Debugging Security Issues

Capability Denied

# Check what capabilities a process has
cat /proc/<pid>/status | grep Cap

# Check what capability is needed
# Look for "Operation not permitted" in strace
strace -f ./myapp 2>&1 | grep EPERM

# Common: check if binary has file capabilities
getcap /path/to/binary

# Add file capability
sudo setcap cap_net_bind_service=+ep /path/to/binary

SELinux Denials

# Check for denials
ausearch -m avc -ts recent

# Example denial:
# type=AVC msg=audit(...): avc: denied { read } for pid=1234
#   comm="myapp" name="secret" dev="sda1" ino=12345
#   scontext=system_u:system_r:httpd_t:s0
#   tcontext=system_u:object_r:admin_home_t:s0
#   tclass=file

# What this means:
# Process with type httpd_t tried to read file with type admin_home_t
# There's no allow rule for this

# Generate and install policy to allow
ausearch -m avc -ts recent | audit2allow -M myfix
semodule -i myfix.pp

# Or fix by changing file context
chcon -t httpd_sys_content_t /path/to/file

AppArmor Denials

# Check for denials
dmesg | grep apparmor

# Example:
# apparmor="DENIED" operation="open" profile="/usr/sbin/nginx"
#   name="/etc/secret" pid=1234 comm="nginx"

# Put profile in complain mode to debug
aa-complain /etc/apparmor.d/usr.sbin.nginx

# Update profile from logs
aa-logprof

Seccomp Violations

# Check for seccomp kills
dmesg | grep seccomp

# Run with tracing to see what's blocked
strace -f ./myapp 2>&1 | head -100

# Use audit mode in seccomp profile to log without blocking
# "defaultAction": "SCMP_ACT_LOG"

Interview Questions

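Answer: Multiple layers:
  1. User namespace: Map container UID 0 to an unprivileged host UID
# Enable remapping on the daemon ("userns-remap": "default" in /etc/docker/daemon.json).
# Note: --userns=host does the opposite and disables remapping for that container.
docker run --user 1000:1000 myapp
  2. Capabilities: Drop all, add only needed
docker run --cap-drop=all --cap-add=net_bind_service myapp
  3. No new privileges: Prevent setuid escalation
docker run --security-opt=no-new-privileges:true myapp
  4. Seccomp: Filter dangerous syscalls
  5. Read-only rootfs: Prevent persistence
  6. In application code:
// Drop all capabilities programmatically
// (libcap: #include <sys/capability.h>, link with -lcap)
cap_t caps = cap_init();      // cap_init() returns an empty capability set
if (caps != NULL) {
    cap_set_proc(caps);       // returns -1 on failure
    cap_free(caps);
}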
Answer:
| Aspect | SELinux | AppArmor |
| --- | --- | --- |
| Model | Type Enforcement (labels) | Path-based profiles |
| Complexity | Complex, fine-grained | Simpler, easier to learn |
| Default on | RHEL, Fedora, CentOS | Ubuntu, SUSE |
| Policy | Default deny: access needs an explicit allow rule | Only programs with a profile are confined |
| Learning | audit2allow helps | aa-genprof helps |
| Containers | Full support | Full support |
Key difference: SELinux labels objects (files, processes) with security contexts, and policy rules define the allowed interactions between types. AppArmor uses pathnames: a profile defines which paths and capabilities a program can access.
When to use which:
  • SELinux for high-security environments needing fine-grained control
  • AppArmor when simplicity is preferred
Answer:
The problem: Containers share the host kernel. A container could exploit kernel vulnerabilities via syscalls.
Seccomp-bpf solution: Filter syscalls before they execute:
  1. Container runtime installs BPF filter at container start
  2. Every syscall is checked against filter
  3. Dangerous syscalls are blocked (e.g., mount, kexec_load, init_module)
Docker default profile blocks:
  • kexec_load - Replace running kernel
  • mount - Mount filesystems
  • ptrace - Debug/trace processes (blocked only on kernels older than 4.8)
  • create_module - Load kernel modules
  • init_module - Load kernel modules
  • And roughly 40 more (about 44 in total)
Performance: Very low overhead - the filter runs in the kernel on the syscall path, with no extra context switches
Answer:
Traditional problem:
  • UID 0 = all privileges
  • Regular user = almost no privileges
  • Programs needing one privilege (bind port 80) got all of root’s power
Capabilities solution: Break root into ~40 discrete privileges:
  • CAP_NET_BIND_SERVICE - Bind to ports < 1024
  • CAP_SYS_ADMIN - Various admin tasks
  • etc.
Benefits:
  • Least privilege: Grant only what’s needed
  • Defense in depth: Compromised process has limited power
  • Container isolation: Different containers get different caps
Example: Web server needs only CAP_NET_BIND_SERVICE, not full root:
setcap cap_net_bind_service=+ep /usr/sbin/nginx
# Now nginx can bind port 80 without running as root

Summary

| Mechanism | Purpose | Scope |
| --- | --- | --- |
| DAC (Unix permissions) | Owner-controlled access | Files, basic |
| Capabilities | Fine-grained privileges | Processes |
| Seccomp-BPF | System call filtering | Per-process |
| SELinux | Mandatory access control | Whole system |
| AppArmor | Profile-based access | Per-program |
| Namespaces | Resource isolation | Per-container |
| Cgroups | Resource limits | Per-container |

Next Steps