Security Modules & Capabilities

Linux security is multi-layered. Understanding these mechanisms is essential for infrastructure engineers building secure container platforms and debugging permission issues. Think of Linux security as a series of checkpoints a request must pass through — each layer can say “no” independently, and the request only succeeds if every layer says “yes.” This defense-in-depth approach means that even if one layer is misconfigured or exploited, the others still provide protection.
Interview Frequency: High (especially for container/cloud roles)
Key Topics: LSM framework, capabilities, seccomp-bpf, SELinux/AppArmor
Time to Master: 12-14 hours

Linux Security Architecture

+-----------------------------------------------------------------------------+
|                    LINUX SECURITY LAYERS                                     |
+-----------------------------------------------------------------------------+
|                                                                              |
|  User Space                                                                  |
|  +-------------------------------------------------------------------------+|
|  |  Application                                                             ||
|  |  -- Runs with specific UID/GID                                          ||
|  |  -- Limited capabilities (only what it needs)                           ||
|  |  -- Seccomp filter active (restricted syscall set)                      ||
|  +-------------------------------------------------------------------------+|
|                                                                              |
|  ========================================================================   |
|                         System Call Interface                                |
|  ========================================================================   |
|                                                                              |
|  Kernel Space (checks happen IN THIS ORDER)                                  |
|  +-------------------------------------------------------------------------+|
|  |                                                                          ||
|  |  1. SECCOMP-BPF (System Call Filter)                                    ||
|  |     -- Runs first: allows/denies/traces syscalls                        ||
|  |     -- Cannot be bypassed (runs before ANY other check)                 ||
|  |     -- Decision based on syscall number + args, NOT identity            ||
|  |                                                                          ||
|  |  2. CAPABILITIES CHECK                                                   ||
|  |     -- Fine-grained root privileges (~40 distinct caps)                 ||
|  |     -- Checked before privileged operations                              ||
|  |     -- Replaces the binary "is this root?" check                        ||
|  |                                                                          ||
|  |  3. DAC (Discretionary Access Control)                                  ||
|  |     -- Traditional Unix permissions                                      ||
|  |     -- UID/GID ownership, rwx bits, ACLs                               ||
|  |     -- "Discretionary" = owner can change the policy                    ||
|  |                                                                          ||
|  |  4. LSM HOOKS (Mandatory Access Control)                                ||
|  |     -- SELinux, AppArmor, or other modules                              ||
|  |     -- Policy-based, cannot be overridden by file owner                 ||
|  |     -- "Mandatory" = only admin can change the policy                   ||
|  |                                                                          ||
|  +-------------------------------------------------------------------------+|
|                                                                              |
+-----------------------------------------------------------------------------+
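The ordering matters. Seccomp runs first because it is evaluated at syscall entry, before any syscall-specific kernel code executes: a rejected syscall never reaches the code paths behind it, and the check itself is cheap (a small BPF program over the syscall number and arguments). Capabilities and DAC come next because they are fast credential and permission-bit lookups. LSM hooks typically run after DAC succeeds and can involve complex policy evaluation (especially SELinux, which consults an in-kernel policy database). The exact sequence of capability, DAC, and LSM checks varies by operation; the diagram shows the common file-access case.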
A senior engineer would say: “Security is not a feature you bolt on at the end. Each layer addresses a different threat model: seccomp limits the kernel’s attack surface, capabilities implement least privilege, DAC controls data access, and LSM enforces organizational policy. If someone asks you to ‘just disable SELinux,’ they are asking you to remove one of those layers — and you should understand which threats you are accepting before you do.”
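The "every layer must say yes" model can be sketched as a chain of independent checks. This is a toy model in Python, not kernel code: the `Request` fields and the four lambda checks are hypothetical stand-ins for the real seccomp, capability, DAC, and LSM logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    """Toy description of one operation (all fields are illustrative)."""
    syscall: str
    allowed_syscalls: frozenset   # what the seccomp filter lets through
    needs_cap: Optional[str]      # capability the operation requires, if any
    caps: frozenset               # the process's effective capabilities
    dac_ok: bool                  # outcome of the UID/GID/rwx check
    mac_ok: bool                  # outcome of the SELinux/AppArmor policy check


# Checks run in the order shown in the diagram above.
LAYERS = [
    ("seccomp", lambda r: r.syscall in r.allowed_syscalls),
    ("capabilities", lambda r: r.needs_cap is None or r.needs_cap in r.caps),
    ("dac", lambda r: r.dac_ok),
    ("lsm", lambda r: r.mac_ok),
]


def first_denial(req):
    """Return the name of the first layer that denies, or None if all allow."""
    for name, check in LAYERS:
        if not check(req):
            return name
    return None
```

A request succeeds only when `first_denial` returns `None`; flipping any single field is enough to block it, which is the defense-in-depth property described above.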

Capabilities: Dividing Root Power

Traditional Unix has a binary security model: UID 0 (root) can do everything, everyone else is restricted. This means a web server that needs to bind to port 80 must run as root, getting ALL of root’s power including the ability to load kernel modules, read any file, kill any process, and change the system clock. Capabilities break this “all or nothing” model into approximately 40 distinct privileges.

Common Capabilities

| Capability | What It Allows | Why You Care |
|---|---|---|
| CAP_NET_ADMIN | Configure network interfaces, routing, firewall | Needed for CNI plugins, network debugging |
| CAP_NET_RAW | Use raw and packet sockets | Needed for ping, tcpdump, network monitoring |
| CAP_NET_BIND_SERVICE | Bind to ports below 1024 | The reason nginx/apache historically needed root |
| CAP_SYS_ADMIN | "Kitchen sink": mount, sethostname, setns, many more | Avoid granting this. It is nearly equivalent to full root. |
| CAP_SYS_PTRACE | Trace any process, read /proc/pid/mem | Needed for debuggers, strace. Dangerous in containers. |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks | Effectively ignores rwx bits on all files |
| CAP_CHOWN | Make arbitrary changes to file UIDs and GIDs | Can take ownership of any file |
| CAP_SETUID | Set arbitrary UID | Can become any user, including root |
| CAP_SETGID | Set arbitrary GID | Can join any group |
| CAP_KILL | Send signals to any process | Can kill processes owned by other users |
| CAP_SYS_RESOURCE | Override resource limits (ulimits) | Can exceed rlimits and disk quotas |
| CAP_SYS_TIME | Set system clock | Can break time-dependent security (TLS certs, Kerberos) |
| CAP_SYS_MODULE | Load/unload kernel modules | Full kernel code execution. Never grant in containers. |
| CAP_MKNOD | Create special device files | Can create /dev/sda and read raw disk |
| CAP_AUDIT_WRITE | Write to audit log | Needed for applications that must log to auditd |
| CAP_SETFCAP | Set file capabilities | Can grant capabilities to other binaries |
The CAP_SYS_ADMIN trap: CAP_SYS_ADMIN is the most commonly requested and most dangerous capability. It controls over 30 different operations including mount(), sethostname(), setns(), pivot_root(), ioctl() on many devices, and more. Granting CAP_SYS_ADMIN to a container is essentially the same as running it as --privileged. If an application claims to need CAP_SYS_ADMIN, push back and find out which specific operation it needs — there may be a narrower capability or an alternative approach.

Capability Sets

Each process has multiple capability sets that interact during permission checks and across execve() boundaries. This is where capabilities get subtle.
+-----------------------------------------------------------------------------+
|                    CAPABILITY SETS                                           |
+-----------------------------------------------------------------------------+
|                                                                              |
|  Process capability sets:                                                    |
|                                                                              |
|  EFFECTIVE (CapEff)                                                          |
|  -- Capabilities the kernel actually checks for permission                  |
|  -- This is the ONLY set that matters at the moment of a check              |
|  -- Think of it as "what powers I am currently wielding"                    |
|                                                                              |
|  PERMITTED (CapPrm)                                                          |
|  -- Maximum capabilities this process CAN have in Effective                 |
|  -- A process can copy caps from Permitted to Effective at will             |
|  -- Once dropped from Permitted, CANNOT be regained (one-way drop)         |
|  -- Think of it as "what powers I could wield if I chose to"               |
|                                                                              |
|  INHERITABLE (CapInh)                                                        |
|  -- Capabilities preserved across execve()                                  |
|  -- Combined with file capabilities on the new binary                       |
|  -- Think of it as "what powers I can pass to programs I launch"            |
|                                                                              |
|  BOUNDING (CapBnd)                                                           |
|  -- Hard ceiling on capabilities that can ever be gained                    |
|  -- Dropping from Bounding is IRREVERSIBLE (even for root)                  |
|  -- Think of it as "the maximum powers anything in this process tree        |
|     can ever have, no matter what"                                          |
|                                                                              |
|  AMBIENT (CapAmb)                                                            |
|  -- Capabilities preserved across exec of unprivileged programs             |
|  -- Must be subset of Permitted AND Inheritable                             |
|  -- Solves the "how do I pass caps to a non-setuid binary" problem          |
|  -- Added in kernel 4.3                                                     |
|                                                                              |
+-----------------------------------------------------------------------------+
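How the five sets combine across execve() is easier to follow as set arithmetic. The sketch below follows the transformation rules documented in capabilities(7), modelling each set as a Python frozenset of capability names. It deliberately ignores the special handling of UID 0 and of no_new_privs, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcCaps:
    permitted: frozenset
    effective: frozenset
    inheritable: frozenset
    bounding: frozenset
    ambient: frozenset


@dataclass(frozen=True)
class FileCaps:
    permitted: frozenset = frozenset()
    inheritable: frozenset = frozenset()
    effective_bit: bool = False      # the single "e" flag, as in =ep
    setuid_or_setgid: bool = False


def caps_after_execve(p, f):
    """Compute the new process sets after exec'ing a file (non-root case)."""
    # A file with capabilities or a set-UID/set-GID bit is "privileged"
    # and clears the ambient set.
    privileged = bool(f.permitted or f.inheritable
                      or f.effective_bit or f.setuid_or_setgid)
    ambient = frozenset() if privileged else p.ambient
    permitted = ((p.inheritable & f.inheritable)
                 | (f.permitted & p.bounding)
                 | ambient)
    effective = permitted if f.effective_bit else ambient
    # Inheritable and bounding pass through unchanged.
    return ProcCaps(permitted, effective, p.inheritable, p.bounding, ambient)
```

Two consequences fall out directly: a file capability is granted only if it is still in the bounding set, and ambient capabilities are the only way an unprivileged binary without file capabilities ends up with anything in Effective.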

Working with Capabilities

# View process capabilities (hex-encoded bitmask)
grep Cap /proc/$$/status
# CapInh: 0000000000000000  -- Inheritable: none
# CapPrm: 0000003fffffffff  -- Permitted: all caps (we are root)
# CapEff: 0000003fffffffff  -- Effective: all caps (actively wielded)
# CapBnd: 0000003fffffffff  -- Bounding: all caps (ceiling)
# CapAmb: 0000000000000000  -- Ambient: none

# Decode capability bits to human-readable names
capsh --decode=0000003fffffffff
# 0x0000003fffffffff=cap_chown,cap_dac_override,...(38 caps, bits 0-37)
# Newer kernels define 41 capabilities, so a full set reads 000001ffffffffff

# View capabilities of a specific process
getpcaps $$
# Shows: cap_chown,cap_dac_override,... for the current shell

# Set file capabilities (allow binary to run with specific caps as non-root)
# This is the SECURE alternative to setuid root
sudo setcap cap_net_bind_service=+ep /path/to/binary
# +e = add to Effective on exec, +p = add to Permitted on exec
# Now this binary can bind port 80 without being root

# View file capabilities
getcap /path/to/binary
# /path/to/binary cap_net_bind_service=ep

# Run a command with specific capabilities only
capsh --caps="cap_net_bind_service+eip" -- -c "./my_server"

# Drop ALL capabilities except specific ones
# This is what container runtimes do for unprivileged containers
capsh --drop=all --caps="cap_net_raw+eip" -- -c "./my_program"
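The hex masks in /proc/<pid>/status can also be decoded without capsh. A minimal Python sketch, assuming the bit numbering from linux/capability.h (bit 0 is CAP_CHOWN, bit 40 is CAP_CHECKPOINT_RESTORE); `decode_caps` is a hypothetical helper name, not part of any library:

```python
# Capability names indexed by bit number, as defined in linux/capability.h.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(hex_mask):
    """Turn a CapEff/CapPrm/... hex string into a list of capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


if __name__ == "__main__":
    # Decode the current process's own sets (Linux only).
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("Cap"):
                field, value = line.split()
                print(field, ",".join(decode_caps(value)) or "(none)")
```

Bits beyond the table (capabilities added by a future kernel) are silently ignored in this sketch.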

Kernel Capability Checks

Understanding how the kernel checks capabilities helps you debug “permission denied” errors. Every privileged operation in the kernel calls capable() or ns_capable() before proceeding.
// Kernel checks capabilities like this:
// kernel/capability.c

// capable() checks against the initial (root) user namespace
bool capable(int cap)
{
    return ns_capable(&init_user_ns, cap);
}

// ns_capable() is the namespace-aware version
// In a user namespace, you can have capabilities that are ONLY valid
// within that namespace. A container can be "root" inside its namespace
// but unprivileged on the host.
bool ns_capable(struct user_namespace *ns, int cap)
{
    if (unlikely(!cap_valid(cap)))  // Sanity check: is this a real cap?
        return false;
    
    // security_capable() calls into the LSM framework
    // This is where SELinux/AppArmor can DENY even if the process
    // technically has the capability. MAC overrides capabilities.
    if (security_capable(current_cred(), ns, cap, CAP_OPT_NONE) == 0)
        return true;
    
    return false;
}

// Example: how the kernel enforces port binding restrictions
// net/ipv4/af_inet.c
// When bind() is called with a port below 1024:
if (snum < inet_prot_sock(net) &&
    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
    return -EACCES;  // This becomes errno=EACCES in userspace
// The process needs CAP_NET_BIND_SERVICE in the network namespace's
// owning user namespace -- not just in any namespace
Debugging capability denials: When you get EACCES or EPERM and cannot figure out why, use strace to find the failing syscall, then search the kernel source for that syscall’s capable() or ns_capable() checks. This tells you exactly which capability is needed. For example: strace -e trace=bind ./myapp 2>&1 | grep EACCES shows the bind() call failing, and the kernel source for inet_bind() shows it needs CAP_NET_BIND_SERVICE.
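On a long trace, the first step (finding the failing syscall) can be automated. The following Python sketch filters strace output for calls that returned EACCES or EPERM. The sample lines are illustrative rather than captured output, and `failed_syscalls` is a hypothetical helper.

```python
import re

# Matches failed calls such as: bind(...) = -1 EACCES (Permission denied)
# The optional "[pid N]" prefix appears when strace follows children (-f).
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(.*\)\s+=\s+-1\s+(E[A-Z0-9]+)"
)


def failed_syscalls(lines, errnos=("EACCES", "EPERM")):
    """Return (syscall, errno) pairs for calls that failed with one of errnos."""
    hits = []
    for line in lines:
        match = FAILED_CALL.match(line)
        if match and match.group(2) in errnos:
            hits.append((match.group(1), match.group(2)))
    return hits


SAMPLE = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3',
    'bind(3, {sa_family=AF_INET, sin_port=htons(80), '
    'sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)',
    '[pid  4242] openat(AT_FDCWD, "/missing", O_RDONLY) = -1 ENOENT '
    '(No such file or directory)',
]

if __name__ == "__main__":
    for name, err in failed_syscalls(SAMPLE):
        print(f"{name} failed with {err}")
```

Each syscall name it reports is the starting point for the kernel-source search described above.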

Linux Security Modules (LSM)

LSM provides a framework of hooks throughout the kernel for mandatory access control. Unlike DAC (where the file owner controls permissions), MAC policies are set by the administrator and cannot be overridden by users, even root. The key mental model: LSM hooks are checkpoints inserted at every security-relevant kernel operation. Each registered security module gets a chance to say “deny” at each checkpoint.

LSM Architecture

+-----------------------------------------------------------------------------+
|                    LSM HOOK ARCHITECTURE                                     |
+-----------------------------------------------------------------------------+
|                                                                              |
|  Kernel Operation (e.g., file open)                                          |
|                                                                              |
|  vfs_open()                                                                  |
|  |                                                                           |
|  +--- Check DAC permissions (traditional Unix rwx/ACL)                      |
|  |                                                                           |
|  +--- Call LSM hook: security_file_open()                                   |
|       |                                                                      |
|       v                                                                      |
|  +-------------------------------------------------------------------+      |
|  |                  LSM Framework                                     |      |
|  |                                                                    |      |
|  |  security_file_open()                                              |      |
|  |  |                                                                 |      |
|  |  +--- call_int_hook(file_open, ...)                               |      |
|  |  |    |                                                            |      |
|  |  |    +--> selinux_file_open()     <-- SELinux check              |      |
|  |  |    |   -- Check type enforcement policy                        |      |
|  |  |    |   -- "Can httpd_t read passwd_file_t?"                    |      |
|  |  |    |                                                            |      |
|  |  |    +--> apparmor_file_open()    <-- AppArmor check             |      |
|  |  |    |   -- Check if profile allows this path                    |      |
|  |  |    |   -- "Can /usr/sbin/nginx read /etc/nginx/*?"             |      |
|  |  |    |                                                            |      |
|  |  |    +--> bpf_lsm_file_open()    <-- BPF LSM (if enabled)       |      |
|  |  |        -- Custom eBPF policy program                           |      |
|  |  |        -- Runtime-configurable without kernel recompile        |      |
|  |  |                                                                 |      |
|  |  +--- ALL hooks must return 0 for access to be granted            |      |
|  |       (any single "deny" blocks the operation)                    |      |
|  |                                                                    |      |
|  +-------------------------------------------------------------------+      |
|                                                                              |
+-----------------------------------------------------------------------------+
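Linux supports “stacking” multiple LSMs, with one restriction: SELinux, AppArmor, and Smack are mutually exclusive in mainline kernels, so only one of them can be active at a time (the diagram lists both SELinux and AppArmor only to illustrate the hook chain). Smaller modules such as Yama, Landlock, and BPF LSM stack alongside the chosen major module, so SELinux + BPF LSM can be active simultaneously on kernel 5.7 or later. The set of modules and their order are determined by CONFIG_LSM at compile time and the lsm= boot parameter. Each hook in the chain must approve the operation for it to proceed.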
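The rule that every hook in the chain must approve can be modelled in a few lines. This Python sketch mimics the behaviour of call_int_hook for integer-returning hooks: modules are called in registration order and the first non-zero return value ends the walk. The three example hooks are hypothetical stand-ins for real module callbacks.

```python
import errno


def call_int_hook(hooks, *args):
    """Call each registered hook in order; the first non-zero result denies."""
    for hook in hooks:
        rc = hook(*args)
        if rc != 0:
            return rc          # later modules are not consulted
    return 0                   # every module returned 0: access granted


# Hypothetical per-module callbacks for a "file open" style hook.
def major_lsm_file_open(path):
    return -errno.EACCES if path == "/etc/shadow" else 0


def bpf_lsm_file_open(path):
    return -errno.EPERM if path.startswith("/proc/kcore") else 0


def permissive_file_open(path):
    return 0


FILE_OPEN_HOOKS = [major_lsm_file_open, bpf_lsm_file_open, permissive_file_open]
```

For example, `call_int_hook(FILE_OPEN_HOOKS, "/etc/shadow")` returns `-errno.EACCES` from the first module, while a path that no module objects to returns 0.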

LSM Hooks

// include/linux/lsm_hooks.h (partial list of the ~200 hooks)
// These hooks cover every security-relevant operation in the kernel
union security_list_options {
    // Credential operations -- control who a process can become
    int (*cred_prepare)(struct cred *new, const struct cred *old, gfp_t gfp);
    
    // File operations -- control what files a process can access
    int (*file_permission)(struct file *file, int mask);
    int (*file_open)(struct file *file);
    int (*file_mprotect)(struct vm_area_struct *vma, unsigned long reqprot,
                         unsigned long prot);
    // file_mprotect is critical: it lets a policy block W+X (write+execute)
    // memory, which many exploits rely on (write shellcode, then execute it)
    
    // Process operations -- control what processes can do to each other
    int (*task_alloc)(struct task_struct *task, unsigned long clone_flags);
    // task_alloc replaced the older task_create hook
    int (*task_kill)(struct task_struct *p, struct kernel_siginfo *info,
                     int sig, const struct cred *cred);
    
    // Socket operations -- control network access
    int (*socket_create)(int family, int type, int protocol, int kern);
    int (*socket_bind)(struct socket *sock, struct sockaddr *address, int addrlen);
    int (*socket_connect)(struct socket *sock, struct sockaddr *address, int addrlen);
    // socket_connect is what prevents a compromised web server from
    // making outbound connections to attacker-controlled servers
    
    // And approximately 190 more hooks...
};

SELinux

SELinux is built on Type Enforcement: every subject (process) and object (file, socket, etc.) is labeled with a security context. Policy rules define which types can interact with which other types and how. If there is no explicit “allow” rule, the access is denied. This “default deny” model is what makes SELinux so effective and so frustrating.

SELinux Context

# View file context -- the -Z flag shows SELinux labels
ls -Z /etc/passwd
# -rw-r--r--. root root system_u:object_r:passwd_file_t:s0 /etc/passwd

# Context format: user:role:type:level
# user:    SELinux user identity (system_u = system, unconfined_u = regular user)
# role:    RBAC role (object_r for files, system_r for daemons, unconfined_r for users)
# type:    THE MOST IMPORTANT FIELD -- this is what Type Enforcement uses
#          passwd_file_t, httpd_t, container_t, etc.
# level:   MLS/MCS level (s0 = base level, s0-s15:c0.c1023 = multi-category)
#          MCS is used by containers to isolate them from each other

# View process context
ps -Z
# LABEL                               PID TTY          TIME CMD
# unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 1234 pts/0 00:00:00 bash
# "unconfined" means this process is NOT restricted by SELinux policy
# A properly confined process would show httpd_t, sshd_t, container_t, etc.

# Current context
id -Z
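When scripting around labels, the one subtlety is that the level field can itself contain colons (for example s0-s0:c0.c1023), so a context must be split on the first three colons only. A small Python sketch, assuming four-field contexts as shown above; `parse_context` is a hypothetical helper:

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(label):
    """Split 'user:role:type:level' without breaking up the MLS/MCS level."""
    parts = label.strip().split(":", 3)   # at most 3 splits -> 4 fields
    if len(parts) != 4:
        raise ValueError(f"expected user:role:type:level, got {label!r}")
    return SELinuxContext(*parts)


if __name__ == "__main__":
    ctx = parse_context("unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023")
    print(ctx.type, ctx.level)
```

The `type` field is the one to compare against policy rules; the `level` field matters for MCS separation between containers.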

SELinux Policy Rules

# Type enforcement rule format:
# allow source_type target_type : object_class { permissions };
# "Allow processes with type X to do Y to objects with type Z"

# Example: Allow httpd (web server) to read password files
allow httpd_t passwd_file_t : file { read open getattr };

# Example: Allow httpd to connect to HTTP ports (80, 443)
allow httpd_t http_port_t : tcp_socket { name_connect };
# Without this rule, your web server cannot make outbound HTTP connections
# This is INTENTIONAL -- a compromised web server should not be able to
# connect to arbitrary servers on the internet

# Query the loaded policy for existing rules
sesearch --allow --source httpd_t --target passwd_file_t

# View AVC (Access Vector Cache) denials in audit log
# This is HOW you find out what SELinux is blocking
ausearch -m avc -ts today

# Generate a policy module from observed denials
# audit2allow reads denial logs and creates allow rules
audit2allow -a -M mypolicy    # -a = read all denials, -M = create module
semodule -i mypolicy.pp       # Install the generated policy module
# WARNING: audit2allow can be too permissive. Review generated rules
# before installing. It may allow more than the minimum needed.
The audit2allow trap: It is tempting to run audit2allow -a -M fix && semodule -i fix.pp every time SELinux blocks something. This gradually opens up your security policy until SELinux is technically enabled but not actually protecting anything. A better approach: understand WHY the denial happened. Often the file has the wrong context (fix with restorecon) or a boolean needs to be set (fix with setsebool). Only create custom policy rules when the standard policy genuinely does not cover your use case.
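The default-deny behaviour of Type Enforcement can be illustrated with a toy rule table. This Python sketch is not how the kernel stores policy (the real implementation uses a compiled policy database plus the AVC cache); the `Policy` class is hypothetical and only models allow rules.

```python
from collections import defaultdict


class Policy:
    """Toy Type Enforcement table: anything without an allow rule is denied."""

    def __init__(self):
        self._allow = defaultdict(set)

    def allow(self, source, target, tclass, perms):
        """Equivalent of: allow source target : tclass { perms };"""
        self._allow[(source, target, tclass)].update(perms)

    def check(self, source, target, tclass, perm):
        """True only if an allow rule covers this exact access."""
        return perm in self._allow.get((source, target, tclass), ())


policy = Policy()
policy.allow("httpd_t", "passwd_file_t", "file", {"read", "open", "getattr"})
policy.allow("httpd_t", "http_port_t", "tcp_socket", {"name_connect"})
```

With these two rules loaded, `policy.check("httpd_t", "passwd_file_t", "file", "read")` is allowed, but a write to the same file, or any access to a type with no rule at all, is denied without needing an explicit deny rule.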

SELinux Modes

# Check current mode
getenforce   # Returns: Enforcing, Permissive, or Disabled

# Temporarily change mode (does not survive reboot)
setenforce 0   # Permissive: logs denials but does NOT block them
setenforce 1   # Enforcing: logs AND blocks

# Permanent change: edit /etc/selinux/config
SELINUX=enforcing    # enforcing, permissive, or disabled
SELINUXTYPE=targeted  # targeted (confine specific daemons) or mls (multi-level)
# NOTE: Changing from disabled to enforcing requires a full filesystem relabel
# which can take 10-30 minutes on boot. Plan accordingly.
The “just disable SELinux” anti-pattern: When SELinux blocks something, the temptation is to run setenforce 0 or add SELINUX=disabled to the config. This is the security equivalent of turning off smoke detectors because they beep. Instead: set the specific domain to permissive (semanage permissive -a httpd_t) to debug that one service without disabling protection for everything else. Then fix the root cause and re-enforce.

SELinux Booleans

Booleans are pre-defined policy switches that enable/disable common configurations without writing custom policy.
# List all booleans (there are hundreds)
getsebool -a

# Common examples that come up in production:
setsebool -P httpd_can_network_connect on   # Let Apache make outbound connections
setsebool -P container_manage_cgroup on     # Let containers manage their cgroups
setsebool -P httpd_can_sendmail on          # Let Apache send email
# -P = persistent (survives reboot)

# View what booleans affect a specific domain
semanage boolean -l | grep httpd
# httpd_can_network_connect    (off  , off)  Allow httpd to make network connections
# httpd_can_network_relay      (off  , off)  Allow httpd to act as a reverse proxy

AppArmor

AppArmor is profile-based MAC. It is simpler than SELinux because it identifies files by pathname rather than by label. Each confined program has a profile that lists exactly which files, capabilities, and network operations it can use. If it is not in the profile, it is denied.

AppArmor Profiles

# View loaded profiles and their modes
aa-status
# Shows: enforced profiles, complain mode profiles, unconfined processes

# Profile locations (each file is a profile for one program)
ls /etc/apparmor.d/

# Example profile: /etc/apparmor.d/usr.sbin.nginx
#include <tunables/global>       # Include system-wide variable definitions

/usr/sbin/nginx {
  #include <abstractions/base>        # Common rules (libc, ld.so, etc.)
  #include <abstractions/nameservice>  # DNS resolution, NSS

  # Capabilities this program is allowed to use
  capability net_bind_service,   # Bind to port 80/443
  capability setgid,             # Drop to worker group
  capability setuid,             # Drop to worker user

  # File access rules: path + permission (r=read, w=write, k=lock, m=mmap exec)
  /etc/nginx/** r,               # Read nginx config files
  /var/log/nginx/** rw,          # Read/write log files
  /var/www/** r,                 # Read web content
  /run/nginx.pid rw,             # Read/write PID file
  
  # Network access rules
  network inet tcp,              # Allow IPv4 TCP (for serving HTTP)
  network inet6 tcp,             # Allow IPv6 TCP
  
  # Explicit denials (override any includes that might allow it)
  deny /etc/shadow r,            # Even if nginx is root, deny shadow file access
}
# Everything NOT listed is implicitly denied
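The file rules in the profile above can be approximated with a small matcher. The Python sketch below is a simplification, not the AppArmor implementation: it assumes `*` matches within one path component, `**` matches across components, and a matching deny rule always overrides allow rules. Real AppArmor globbing has more syntax (character classes, alternation) that is not modelled here.

```python
import re


def glob_to_regex(pattern):
    """Translate a simplified AppArmor path glob into a compiled regex."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")        # crosses directory boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # stays within one path component
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def is_allowed(rules, path, perm):
    """rules: list of (kind, glob, perms) where kind is 'allow' or 'deny'."""
    allowed = False
    for kind, glob, perms in rules:
        if perm in perms and glob_to_regex(glob).fullmatch(path):
            if kind == "deny":
                return False        # explicit deny wins over any allow
            allowed = True
    return allowed                  # no matching rule -> implicit deny


NGINX_RULES = [
    ("allow", "/etc/nginx/**", "r"),
    ("allow", "/var/log/nginx/**", "rw"),
    ("allow", "/var/www/**", "r"),
    ("allow", "/run/nginx.pid", "rw"),
    ("deny", "/etc/shadow", "r"),
]
```

Because unmatched paths fall through to `False`, the sketch reproduces the implicit-deny behaviour noted in the profile's final comment.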

AppArmor Modes

# Enforce mode (default) -- denials are logged AND blocked
aa-enforce /etc/apparmor.d/usr.sbin.nginx

# Complain mode -- denials are LOGGED but NOT blocked
# Use this to discover what a program needs before writing the profile
aa-complain /etc/apparmor.d/usr.sbin.nginx

# Disable a profile entirely
aa-disable /etc/apparmor.d/usr.sbin.nginx

# Generate a new profile interactively
# Runs the program, logs what it does, and asks you to allow/deny each action
aa-genprof /usr/sbin/nginx

# Update an existing profile from recent log entries
# Reads denials from the log and offers to add allow rules
aa-logprof

AppArmor in Containers

# Docker run with a specific AppArmor profile
# docker-default is the built-in profile that blocks dangerous operations
docker run --security-opt apparmor=docker-default nginx

# Kubernetes pod with AppArmor (annotation-based beta API; Kubernetes 1.30+
# also provides a securityContext.appArmorProfile field)
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    # The profile must be loaded on the node where the pod runs
    container.apparmor.security.beta.kubernetes.io/nginx: runtime/default
spec:
  containers:
  - name: nginx
    image: nginx
SELinux vs AppArmor for containers: Docker and Kubernetes work with both. SELinux uses MCS (Multi-Category Security) labels to isolate containers from each other — each container gets a unique category pair (e.g., s0:c123,c456) so container A cannot access container B’s files even if both run as the same UID. AppArmor uses per-container profiles to restrict file access and capabilities. In practice, most teams use whichever their distro defaults to: SELinux on RHEL/Fedora/CentOS, AppArmor on Ubuntu/SUSE/Debian.
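The MCS isolation described above comes down to a set comparison: a process may access an object only if the process's categories include all of the object's categories. A simplified Python sketch, assuming plain levels of the form s0:c123,c456 or s0:c0.c1023 and ignoring sensitivity levels and Type Enforcement; both function names are hypothetical.

```python
def parse_categories(level):
    """Return the category numbers in a level such as 's0:c123,c456'."""
    _, _, cats = level.partition(":")
    result = set()
    for item in filter(None, cats.split(",")):
        if "." in item:                       # range, e.g. c0.c1023
            low, high = item.split(".")
            result.update(range(int(low[1:]), int(high[1:]) + 1))
        else:                                 # single category, e.g. c123
            result.add(int(item[1:]))
    return frozenset(result)


def mcs_allows(subject_level, object_level):
    """True if the subject's categories are a superset of the object's."""
    return parse_categories(subject_level) >= parse_categories(object_level)
```

With container A labelled s0:c123,c456 and container B's files labelled s0:c7,c900, `mcs_allows` returns `False` even when both run as the same UID, while a process with the full range s0:c0.c1023 can reach both.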

Seccomp-BPF

Seccomp (Secure Computing) with BPF filters lets you restrict which system calls a process can make. This shrinks the kernel attack surface exposed to a compromised process: even if an attacker gets code execution inside your container, they can only invoke the ~300 syscalls that the filter allows, not the ~400+ that the kernel provides.

How Seccomp Works

+-----------------------------------------------------------------------------+
|                    SECCOMP-BPF FILTERING                                     |
+-----------------------------------------------------------------------------+
|                                                                              |
|  System Call Entry                                                           |
|  |                                                                           |
|  v                                                                           |
|  +-------------------------------------------------------------------------+|
|  |                    Seccomp BPF Filter                                    ||
|  |                                                                          ||
|  |  Input: syscall number, architecture, argument values                   ||
|  |  (arguments are read-only -- filter cannot modify them)                 ||
|  |                                                                          ||
|  |  +-------------------------------------------------------------+       ||
|  |  |  BPF Program evaluates:                                      |       ||
|  |  |  if (syscall == SYS_read || syscall == SYS_write) {         |       ||
|  |  |      return SECCOMP_RET_ALLOW;   // proceed normally        |       ||
|  |  |  } else if (syscall == SYS_ptrace) {                        |       ||
|  |  |      return SECCOMP_RET_KILL;    // kill immediately        |       ||
|  |  |  } else {                                                    |       ||
|  |  |      return SECCOMP_RET_ERRNO | EPERM;  // return error     |       ||
|  |  |  }                                                           |       ||
|  |  +-------------------------------------------------------------+       ||
|  |                                                                          ||
|  |  Return values (in priority order, lowest value wins):                  ||
|  |  SECCOMP_RET_KILL_PROCESS  --> Kill entire process (not just thread)    ||
|  |  SECCOMP_RET_KILL_THREAD   --> Kill calling thread                      ||
|  |  SECCOMP_RET_TRAP          --> Send SIGSYS signal (debuggable)          ||
|  |  SECCOMP_RET_ERRNO         --> Return error without executing syscall   ||
|  |  SECCOMP_RET_USER_NOTIF    --> Notify userspace handler (supervisor)    ||
|  |  SECCOMP_RET_TRACE         --> Notify tracer (for debugging/logging)    ||
|  |  SECCOMP_RET_LOG           --> Allow but log (audit mode)               ||
|  |  SECCOMP_RET_ALLOW         --> Continue to syscall normally             ||
|  |                                                                          ||
|  +-------------------------------------------------------------------------+|
|                                                                              |
+-----------------------------------------------------------------------------+
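A filter is attached with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) or the seccomp(2) system call, and the caller must first set no_new_privs (or hold CAP_SYS_ADMIN in its user namespace). The filter is inherited by all child processes and preserved across execve(). Once attached, it cannot be removed or weakened: a process can only stack additional filters, and the most restrictive result among them wins. This “no weakening” property is what makes seccomp safe against privilege escalation.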
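When several filters are stacked, every filter runs and the kernel keeps the highest-precedence action. The Python sketch below models that selection using the constant values from linux/seccomp.h: the action bits are compared as a signed 32-bit number, which is why SECCOMP_RET_KILL_PROCESS (0x80000000) outranks everything else. The function names are hypothetical.

```python
# Action values from linux/seccomp.h
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_TRAP = 0x00030000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_USER_NOTIF = 0x7FC00000
SECCOMP_RET_TRACE = 0x7FF00000
SECCOMP_RET_LOG = 0x7FFC0000
SECCOMP_RET_ALLOW = 0x7FFF0000

SECCOMP_RET_ACTION_FULL = 0xFFFF0000   # upper 16 bits: the action
SECCOMP_RET_DATA = 0x0000FFFF          # lower 16 bits: e.g. the errno


def _signed_action(ret):
    """Action bits interpreted as a signed 32-bit integer."""
    action = ret & SECCOMP_RET_ACTION_FULL
    return action - (1 << 32) if action & 0x80000000 else action


def combine(results):
    """Pick the winning value from the results of all attached filters.

    Pass results newest filter first; on a tie min() keeps the first one.
    """
    return min(results, key=_signed_action)
```

For instance, if one filter returns SECCOMP_RET_ALLOW and another returns SECCOMP_RET_ERRNO | 1, the syscall fails with errno 1; adding a filter can therefore only tighten the outcome.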

Writing Seccomp Filters

#include <fcntl.h>      // O_RDONLY, O_ACCMODE
#include <seccomp.h>    // libseccomp (higher-level API); link with -lseccomp

int setup_seccomp(void)
{
    scmp_filter_ctx ctx;
    int rc = -1;

    // Default action: kill the whole process if a disallowed syscall is attempted
    // (SCMP_ACT_KILL_PROCESS needs libseccomp 2.4+; plain SCMP_ACT_KILL only
    // kills the calling thread)
    // Alternative: SCMP_ACT_ERRNO(EPERM) to return error instead of killing
    // KILL is safer (no chance of ignoring the error), ERRNO is more debuggable
    ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (!ctx)
        return -1;

    // Allowlist: explicitly permit specific syscalls
    // This is a DENY-by-default approach (much safer than blocklisting)
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0) < 0)
        goto out;

    // Conditional allow: permit open()/openat() ONLY for read-only access
    // O_RDONLY is 0, so test (flags & O_ACCMODE) == O_RDONLY; comparing the
    // whole flags word with O_RDONLY would reject O_RDONLY | O_CLOEXEC
    // flags is argument 1 of open() and argument 2 of openat(), which
    // modern libc versions use to implement open()
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(open), 1,
                         SCMP_A1(SCMP_CMP_MASKED_EQ, O_ACCMODE, O_RDONLY)) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 1,
                         SCMP_A2(SCMP_CMP_MASKED_EQ, O_ACCMODE, O_RDONLY)) < 0)
        goto out;

    // Load the filter into the kernel
    // After this call, the filter is ACTIVE and IRREVOCABLE
    if (seccomp_load(ctx) == 0)
        rc = 0;

out:
    seccomp_release(ctx);  // Free the userspace context (a loaded filter stays in the kernel)
    return rc;
}
TOCTOU warning: A seccomp filter sees only the raw argument values passed in registers; it cannot dereference pointers. Integer arguments such as flags can be checked reliably, but for a pointer argument (like the filename in open()) the filter sees only an address, and the memory behind that address stays under the process's control. Another thread could rewrite that memory between the seccomp check and the kernel’s own read of it, so any decision that depends on pointed-to data is inherently racy. Use an LSM (SELinux/AppArmor) for path-based access control, and use seccomp for filtering on syscall numbers and scalar arguments.
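To make concrete what a filter can and cannot inspect, here is a minimal Python sketch of the struct seccomp_data record that the kernel passes to a seccomp-BPF program. The layout follows <linux/seccomp.h>; the helper names and the user-space address are hypothetical and chosen only for illustration.

```python
import struct

# struct seccomp_data from <linux/seccomp.h>:
#   int nr; __u32 arch; __u64 instruction_pointer; __u64 args[6];
SECCOMP_DATA = struct.Struct("<iIQ6Q")
AUDIT_ARCH_X86_64 = 0xC000003E


def pack_seccomp_data(nr, args, arch=AUDIT_ARCH_X86_64, ip=0):
    """Build the record a filter would receive for one syscall."""
    padded = list(args) + [0] * (6 - len(args))
    return SECCOMP_DATA.pack(nr, arch, ip, *padded)


def filter_view(blob):
    """Everything a seccomp-BPF program can look at: numbers only."""
    nr, arch, ip, *args = SECCOMP_DATA.unpack(blob)
    return {"nr": nr, "arch": arch, "ip": ip, "args": args}


# open("/etc/passwd", O_RDONLY) is syscall 2 on x86_64.
# The filter receives the ADDRESS of the path string, never its bytes.
path_address = 0x7FFD12345000  # hypothetical user-space pointer
view = filter_view(pack_seccomp_data(2, [path_address, 0]))
```

The flags value (args[1]) is a plain integer and can be compared safely, while args[0] is only an address, which is why path decisions belong in an LSM.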

Docker Seccomp Profile

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "defaultErrnoRet": 1,
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [
        {
            "names": ["read", "write", "open", "close", "stat", "fstat"],
            "action": "SCMP_ACT_ALLOW"
        },
        {
            "names": ["ptrace"],
            "action": "SCMP_ACT_ERRNO",
            "errnoRet": 1
        },
        {
            "names": ["clone"],
            "action": "SCMP_ACT_ALLOW",
            "args": [
                {
                    "index": 0,
                    "op": "SCMP_CMP_MASKED_EQ",
                    "value": 2114060288,
                    "valueTwo": 0
                }
            ]
        }
    ]
}
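The clone rule's magic number is easier to read once decoded. The sketch below assumes the standard CLONE_NEW* flag values from <linux/sched.h> and models SCMP_CMP_MASKED_EQ as (arg & value) == valueTwo; the function names are illustrative, not part of any library.

```python
# Namespace-creating clone() flags from <linux/sched.h>
CLONE_NS_FLAGS = {
    "CLONE_NEWNS": 0x00020000,
    "CLONE_NEWCGROUP": 0x02000000,
    "CLONE_NEWUTS": 0x04000000,
    "CLONE_NEWIPC": 0x08000000,
    "CLONE_NEWUSER": 0x10000000,
    "CLONE_NEWPID": 0x20000000,
    "CLONE_NEWNET": 0x40000000,
}

PROFILE_MASK = 2114060288  # "value" in the profile's clone rule
CLONE_VM = 0x00000100
CLONE_THREAD = 0x00010000


def decode_mask(mask):
    """Names of the namespace flags covered by a mask."""
    return sorted(n for n, bit in CLONE_NS_FLAGS.items() if mask & bit)


def masked_eq(arg, mask, expected):
    """SCMP_CMP_MASKED_EQ semantics: (arg & mask) == expected."""
    return (arg & mask) == expected


# A thread-creating clone() sets none of the masked bits, so it is allowed.
thread_ok = masked_eq(CLONE_VM | CLONE_THREAD, PROFILE_MASK, 0)
# A clone() asking for a new user namespace fails the comparison and
# falls through to the profile's default action.
userns_ok = masked_eq(CLONE_NS_FLAGS["CLONE_NEWUSER"], PROFILE_MASK, 0)
```

In other words, the rule allows clone() only when no namespace-creating flag is set.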
# Run container with custom seccomp profile
docker run --security-opt seccomp=/path/to/profile.json myimage

# Run with NO seccomp (dangerous, but useful for debugging)
docker run --security-opt seccomp=unconfined myimage

# Docker's default profile blocks approximately 50 syscalls including:
# kexec_load, kexec_file_load -- replace the running kernel
# mount, umount2 -- modify filesystem mounts
# ptrace -- debug/trace other processes (blocked on kernels before 4.8)
# create_module, init_module, delete_module -- load kernel modules
# reboot -- reboot the host
# swapon, swapoff -- modify swap
# These are syscalls that could let a container escape to or disrupt the host

eBPF LSM

Write custom security policies with eBPF programs that attach to LSM hooks. This gives you the flexibility of custom kernel modules without the stability risk — BPF programs are verified for safety before loading.
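// Build against a vmlinux.h generated from the kernel's BTF
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

// errno values and signal numbers are macros, so vmlinux.h does not carry them
#define EPERM   1
#define SIGKILL 9

char LICENSE[] SEC("license") = "GPL";  // LSM programs need a GPL-compatible license

// PID to protect, filled in by the userspace loader before the program is loaded
// (global variables are backed by a BPF map)
const volatile int protected_pid = 0;

// Custom LSM program: block access to /etc/shadow
// This runs at the security_file_open hook, AFTER DAC and capabilities
SEC("lsm/file_open")
int BPF_PROG(restrict_file_open, struct file *file)
{
    const unsigned char *filename;
    char buf[256] = {};

    // Read the filename from the kernel's dentry structure
    // BPF_CORE_READ handles cross-kernel-version struct layout differences
    filename = BPF_CORE_READ(file, f_path.dentry, d_name.name);
    if (bpf_probe_read_kernel_str(buf, sizeof(buf), filename) < 0)
        return 0;  // Name could not be read: do not block

    // Block access to shadow file
    // NOTE: This string comparison is simplified. Production code should
    // check the full path, not just the filename, to avoid false positives
    // on files coincidentally named "shadow" in other directories.
    // The buf[6] check rejects longer names such as "shadow-" or "shadow.bak".
    if (buf[0] == 's' && buf[1] == 'h' && buf[2] == 'a' &&
        buf[3] == 'd' && buf[4] == 'o' && buf[5] == 'w' &&
        buf[6] == '\0') {
        return -EPERM;  // Deny access
    }

    return 0;  // Allow access (let other LSM hooks also decide)
}

// Custom LSM program: prevent killing a protected process
SEC("lsm/task_kill")
int BPF_PROG(restrict_kill, struct task_struct *p,
             struct kernel_siginfo *info, int sig,
             const struct cred *cred)
{
    // Block SIGKILL to a specific PID (set via BPF map)
    // This could protect a critical monitoring agent from being killed
    // The PID seen from userspace is the kernel's tgid; task->pid is the thread ID
    if (sig == SIGKILL && BPF_CORE_READ(p, tgid) == protected_pid) {
        return -EPERM;
    }
    return 0;
}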
When to use BPF LSM vs SELinux/AppArmor: Use SELinux/AppArmor for standard server hardening — they have mature tooling, well-tested policies, and broad community support. Use BPF LSM for dynamic, application-specific policies that need to change at runtime without restarting services. For example, a security team might deploy a BPF LSM program that blocks a specific CVE’s exploitation technique fleet-wide within minutes, without modifying any SELinux policy files or restarting any services.
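Before deploying a BPF LSM program it helps to confirm the kernel has the BPF LSM active (it needs kernel 5.7 or later built with CONFIG_BPF_LSM, and "bpf" in the active LSM list). The following Python sketch parses the comma-separated list exposed at /sys/kernel/security/lsm; the function names are hypothetical and the sample strings in the usage note are made up for illustration.

```python
from pathlib import Path

LSM_LIST_PATH = Path("/sys/kernel/security/lsm")


def parse_lsm_list(text):
    """Split the comma-separated LSM list into names."""
    return [name for name in text.strip().split(",") if name]


def bpf_lsm_active(text):
    """True if the BPF LSM appears in the active list."""
    return "bpf" in parse_lsm_list(text)


def read_active_lsms(path=LSM_LIST_PATH):
    """Return the active LSMs, or None if securityfs is not readable."""
    try:
        return parse_lsm_list(path.read_text())
    except OSError:
        return None
```

For example, bpf_lsm_active("lockdown,capability,yama,apparmor,bpf") is true, while a host reporting "capability,selinux" would need "bpf" added to the lsm= boot parameter before lsm/ programs can attach.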

Container Security Stack

+-----------------------------------------------------------------------------+
|                    CONTAINER SECURITY LAYERS                                 |
+-----------------------------------------------------------------------------+
|                                                                              |
|  Container Runtime Security (defense in depth -- each layer independent)     |
|  +-------------------------------------------------------------------------+|
|  |                                                                          ||
|  |  1. NAMESPACES (Isolation -- what can the container SEE?)               ||
|  |     +-- PID: Isolated process tree (cannot see host processes)         ||
|  |     +-- Network: Separate network stack (own IP, routes, iptables)     ||
|  |     +-- Mount: Separate filesystem view (own /, /proc, /sys)           ||
|  |     +-- User: UID/GID mapping (root inside = nobody outside)          ||
|  |     +-- IPC, UTS, Cgroup: Various isolations                          ||
|  |                                                                          ||
|  |  2. CGROUPS (Resource Limits -- what can the container USE?)            ||
|  |     +-- CPU: Quota and shares (prevent CPU starvation)                 ||
|  |     +-- Memory: Hard limits with OOM kill (prevent host OOM)           ||
|  |     +-- IO: Bandwidth and IOPS limits (prevent disk monopolization)    ||
|  |                                                                          ||
|  |  3. CAPABILITIES (Reduced Privileges -- what can the container DO?)     ||
|  |     +-- Drop all, add only what's needed                               ||
|  |     +-- Docker default: ~14 of ~40 capabilities granted                ||
|  |     +-- Secure: drop ALL, add only NET_BIND_SERVICE if needed          ||
|  |                                                                          ||
|  |  4. SECCOMP (Syscall Filtering -- what kernel APIs can it CALL?)        ||
|  |     +-- Default profile blocks ~50 dangerous syscalls                   ||
|  |     +-- Custom profiles for specific workloads                          ||
|  |     +-- Prevents kernel exploit via restricted syscall surface          ||
|  |                                                                          ||
|  |  5. LSM (SELinux/AppArmor -- what POLICY governs it?)                   ||
|  |     +-- Mandatory access control per container                          ||
|  |     +-- SELinux MCS labels isolate containers from each other           ||
|  |     +-- AppArmor profiles restrict file/network/capability access       ||
|  |                                                                          ||
|  |  6. READ-ONLY ROOTFS (Immutability -- can it PERSIST changes?)          ||
|  |     +-- Prevent malware from writing to the container filesystem        ||
|  |     +-- Mount tmpfs for /tmp and /run as needed                         ||
|  |                                                                          ||
|  |  7. NO-NEW-PRIVILEGES (Escalation Prevention)                            ||
|  |     +-- Prevent privilege escalation via setuid/setgid binaries         ||
|  |     +-- Blocks execve() from gaining capabilities from file caps        ||
|  |                                                                          ||
|  +-------------------------------------------------------------------------+|
|                                                                              |
+-----------------------------------------------------------------------------+

Docker Security Options

# Run with minimal capabilities (drop all, add only what's needed)
docker run --cap-drop=all --cap-add=net_bind_service nginx
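# This nginx can bind port 80 but cannot: load kernel modules, mount
# filesystems, change file ownership, send signals to other containers, etc.
# Caveat: an image whose entrypoint starts as root and then switches to a
# worker user (as the stock nginx image does) typically also needs
# CHOWN, SETGID and SETUID to start.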

# No new privileges (block setuid escalation)
docker run --security-opt=no-new-privileges:true myapp
# Even if there is a setuid-root binary inside the container, execve()
# will NOT grant elevated privileges. Critical for defense in depth.

# Custom seccomp profile
docker run --security-opt seccomp=profile.json myapp

# Custom AppArmor profile
docker run --security-opt apparmor=my-profile myapp

# Read-only rootfs (any writes fail with EROFS)
docker run --read-only --tmpfs /tmp myapp
# The application can write to /tmp (tmpfs, in-memory) but nothing else
# This prevents an attacker from dropping persistence mechanisms

# Run as non-root user
docker run --user 1000:1000 myapp
# Even inside the container, the process runs as UID 1000, not root
# Combined with user namespaces, this UID maps to an unprivileged host UID

# Completely unprivileged (rootless Docker)
# Docker daemon itself runs as a non-root user
# Uses user namespaces, unprivileged network setup (slirp4netns)
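The options above are usually combined. As a minimal sketch, the hypothetical helper below assembles one docker run command line from the flags shown in this section; it only builds the argument list and does not invoke Docker.

```python
import shlex


def hardened_docker_args(image, caps=(), uid=1000, gid=1000,
                         tmpfs=("/tmp",), seccomp_profile=None):
    """Compose a docker run argv using only the flags shown above."""
    args = ["docker", "run", "--cap-drop=all"]
    args += [f"--cap-add={cap}" for cap in caps]
    args += ["--security-opt=no-new-privileges:true", "--read-only"]
    for path in tmpfs:
        args += ["--tmpfs", path]
    if seccomp_profile:
        args += ["--security-opt", f"seccomp={seccomp_profile}"]
    args += ["--user", f"{uid}:{gid}", image]
    return args


example = hardened_docker_args("myapp", caps=["net_bind_service"])
command_line = shlex.join(example)
```

Keeping the defaults restrictive and making every relaxation an explicit argument mirrors the drop-all, add-back approach described above.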

Kubernetes Pod Security

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true          # kubelet rejects pods that try to run as root
    runAsUser: 1000             # Force UID 1000
    fsGroup: 2000               # All mounted volumes owned by GID 2000
    seccompProfile:
      type: RuntimeDefault      # Use container runtime's default seccomp profile
  containers:
  - name: app
    image: myapp
    securityContext:
      allowPrivilegeEscalation: false  # Equivalent to no-new-privileges
      readOnlyRootFilesystem: true     # Read-only container filesystem
      capabilities:
        drop:
          - ALL                 # Drop every capability
        add:
          - NET_BIND_SERVICE    # Add back only what we need
      # Result: this container can bind port 80, and NOTHING else privileged.
      # It cannot: read /etc/shadow on the host, load kernel modules,
      # mount filesystems, ptrace other containers, change its UID, etc.
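Pod Security Standards (PSS): The built-in Pod Security Admission controller (stable since Kubernetes 1.25) enforces Pod Security Standards per namespace once the namespace is labeled. The three levels are: privileged (no restrictions), baseline (blocks known privilege escalations), and restricted (hardened: requires non-root, allowPrivilegeEscalation: false, a seccomp profile, and dropping all capabilities, with only NET_BIND_SERVICE allowed back). Note that restricted does not require a read-only root filesystem; set readOnlyRootFilesystem yourself. Apply restricted to all production namespaces: kubectl label namespace production pod-security.kubernetes.io/enforce=restricted. Many “it worked in dev but not prod” security issues are caused by running baseline in dev and restricted in prod.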
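A quick pre-flight check can catch most restricted-level rejections before a manifest reaches the cluster. The sketch below covers only a subset of the restricted profile (privilege escalation, capabilities, non-root, seccomp) on a pod spec already parsed into a dict; it is an illustration with hypothetical names, not the admission controller's actual logic.

```python
ALLOWED_SECCOMP = {"RuntimeDefault", "Localhost"}


def restricted_violations(pod_spec):
    """List the ways a pod spec misses a subset of the restricted profile."""
    pod_sc = pod_spec.get("securityContext", {})
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "?")
        sc = container.get("securityContext", {})
        caps = sc.get("capabilities", {})
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in caps.get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
        extra = sorted(set(caps.get("add", [])) - {"NET_BIND_SERVICE"})
        if extra:
            problems.append(f"{name}: capabilities not allowed: {extra}")
        # Container-level settings override pod-level ones.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            problems.append(f"{name}: runAsNonRoot must be true")
        profile = sc.get("seccompProfile", pod_sc.get("seccompProfile", {}))
        if profile.get("type") not in ALLOWED_SECCOMP:
            problems.append(f"{name}: seccompProfile must be set")
    return problems


# The secure-pod manifest above, as a dict
secure_pod = {
    "securityContext": {"runAsNonRoot": True, "runAsUser": 1000,
                        "seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [{
        "name": "app",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
        },
    }],
}
```

Running restricted_violations(secure_pod) returns an empty list, while a container with no securityContext at all fails every check.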

Debugging Security Issues

Capability Denied

# Check what capabilities a running process actually has
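grep Cap /proc/<pid>/status
# Then decode: capsh --decode=<hex_value>

# Find which syscall is failing and what capability it needs
# strace shows the syscall and the errno
strace -f ./myapp 2>&1 | grep -E 'EPERM|EACCES'
# Example output: bind(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
# A missing capability usually shows up as EPERM, but bind() to a port below 1024 reports EACCES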

# Common fix: check if binary has file capabilities set
getcap /path/to/binary
# If empty, the binary has no file capabilities

# Add the needed file capability
sudo setcap cap_net_bind_service=+ep /path/to/binary
# Now the binary can bind port 80 as a non-root user

# For containers: check if the capability is in the allowed list
docker inspect <container> | grep -A 20 CapAdd
# If NET_BIND_SERVICE is not in CapAdd and ALL is in CapDrop, that is the problem
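When capsh is not installed (common in minimal images), the hex masks in /proc/<pid>/status can be decoded by hand. This sketch assumes the capability numbering from <linux/capability.h> (bit 0 is CAP_CHOWN, bit 40 is CAP_CHECKPOINT_RESTORE); the function name is illustrative.

```python
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask):
    """Turn a CapEff/CapPrm/CapBnd hex string into capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


# A process started with --cap-drop=all --cap-add=net_bind_service
only_bind = decode_caps("0000000000000400")
# A mask commonly seen for Docker's default capability set
docker_default = decode_caps("00000000a80425fb")
```

If the capability you expect is missing from the CapEff decode but present in CapPrm, the process holds it but has not raised it into its effective set.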

SELinux Denials

# Check for recent AVC (Access Vector Cache) denials
ausearch -m avc -ts recent

# Example denial message:
# type=AVC msg=audit(...): avc: denied { read } for pid=1234
#   comm="myapp" name="secret" dev="sda1" ino=12345
#   scontext=system_u:system_r:httpd_t:s0
#   tcontext=system_u:object_r:admin_home_t:s0
#   tclass=file

# Translation: Process with type httpd_t tried to read a file with type
# admin_home_t, and there is no allow rule for httpd_t -> admin_home_t : file { read }

# FIX OPTION 1: Change the file's context to something httpd can read
chcon -t httpd_sys_content_t /path/to/file
# Or permanently (survives restorecon):
semanage fcontext -a -t httpd_sys_content_t "/path/to/file(/.*)?"
restorecon -Rv /path/to/file

# FIX OPTION 2: Generate and install a custom policy module
ausearch -m avc -ts recent | audit2allow -M myfix
# REVIEW the generated policy before installing it: cat myfix.te
semodule -i myfix.pp

# FIX OPTION 3: Enable a boolean (if one exists for this use case)
semanage boolean -l | grep httpd
# Then turn on the matching boolean persistently: setsebool -P <boolean_name> on
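The translation step from a denial to the missing allow rule can be sketched in a few lines of Python. This is a simplified illustration of what audit2allow derives, not a replacement for it; the regular expression and function names are assumptions, and the sample record in the usage note is adapted from the example above.

```python
import re

AVC_RE = re.compile(
    r"avc:\s+denied\s+\{\s*(?P<perms>[^}]+?)\s*\}.*?"
    r"scontext=(?P<scontext>\S+)\s+"
    r"tcontext=(?P<tcontext>\S+)\s+"
    r"tclass=(?P<tclass>\S+)",
    re.S,
)


def type_of(context):
    """Extract the type field from user:role:type:level."""
    return context.split(":")[2]


def avc_to_rule(message):
    """Return the allow rule that would permit a denied access, or None."""
    m = AVC_RE.search(message)
    if not m:
        return None
    perms = " ".join(m["perms"].split())
    source = type_of(m["scontext"])
    target = type_of(m["tcontext"])
    return f"allow {source} {target}:{m['tclass']} {{ {perms} }};"


sample = (
    'type=AVC msg=audit(1700000000.123:456): avc:  denied  { read } for '
    'pid=1234 comm="myapp" name="secret" dev="sda1" ino=12345 '
    "scontext=system_u:system_r:httpd_t:s0 "
    "tcontext=system_u:object_r:admin_home_t:s0 tclass=file"
)
rule = avc_to_rule(sample)
```

Seeing the rule spelled out makes it easier to judge whether granting it is reasonable or whether relabeling the file is the better fix.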

AppArmor Denials

# Check for AppArmor denials in kernel logs
dmesg | grep apparmor

# Example:
# apparmor="DENIED" operation="open" profile="/usr/sbin/nginx"
#   name="/etc/secret" pid=1234 comm="nginx"
# Translation: Nginx tried to open /etc/secret but its profile does not allow it

# Put the profile in complain mode to log all violations without blocking
aa-complain /etc/apparmor.d/usr.sbin.nginx
# Run the application, exercise all code paths
# Then update the profile from the logged violations:
aa-logprof
# Review the suggested changes, accept the ones that are legitimate
aa-enforce /etc/apparmor.d/usr.sbin.nginx  # Re-enforce after fixing

Seccomp Violations

# Check for seccomp kills in kernel logs
dmesg | grep seccomp
# Example: audit: seccomp action=kill pid=1234 comm="myapp" syscall=101
# syscall=101 = ptrace on x86_64

# Run with strace to see what syscalls the application needs
strace -f -c ./myapp
# -c prints a per-syscall summary when the program exits; -f follows child processes
# Use this to build an allowlist for your seccomp profile

# Use audit mode to log without blocking (find all needed syscalls first)
# In Docker seccomp profile: "defaultAction": "SCMP_ACT_LOG"
# This lets the application run normally but logs every syscall
# Then review logs and build the allowlist from actual usage

# For Kubernetes: check if the container's seccomp profile is too restrictive
kubectl get pod <pod-name> -o yaml | grep -A 2 seccompProfile
# Look for "seccompProfile" in both the pod-level and container-level security context
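Kernel log entries report the blocked syscall by number, which depends on the architecture. The sketch below pulls the fields out of a log line and resolves the number against a deliberately tiny x86_64 table; the table is an excerpt for illustration and the function name is hypothetical.

```python
import re

# Excerpt of the x86_64 syscall table (illustrative, not complete)
X86_64_SYSCALLS = {
    101: "ptrace",
    165: "mount",
    166: "umount2",
    169: "reboot",
    175: "init_module",
    246: "kexec_load",
}

SYSCALL_RE = re.compile(r"\bsyscall=(\d+)")
COMM_RE = re.compile(r'\bcomm="([^"]*)"')


def describe_seccomp_event(line):
    """Return (comm, syscall number, syscall name) or None if no match."""
    nr_match = SYSCALL_RE.search(line)
    if not nr_match:
        return None
    nr = int(nr_match.group(1))
    comm_match = COMM_RE.search(line)
    comm = comm_match.group(1) if comm_match else None
    return comm, nr, X86_64_SYSCALLS.get(nr, f"unknown({nr})")


event = describe_seccomp_event(
    'audit: seccomp action=kill pid=1234 comm="myapp" syscall=101'
)
```

On a real host, the ausyscall tool from the audit package (where installed) performs the same lookup against the complete table.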
The debugging workflow for “container cannot do X”: (1) Check seccomp: is the syscall blocked by the profile? dmesg | grep seccomp. (2) Check capabilities: docker inspect <container> | grep Cap. (3) Check DAC: plain file permissions. (4) Check LSM: ausearch -m avc for SELinux or dmesg | grep apparmor. The kernel evaluates the layers in roughly this order: seccomp runs at syscall entry, and LSM hooks run after the DAC and capability checks. The FIRST denial wins, so start from the outermost layer (seccomp) and work inward.

Interview Questions

Answer: Multiple layers:
  1. User namespace: Map container UID 0 to an unprivileged host UID
# Remapping is a daemon-level setting: start dockerd with --userns-remap=default
# (--userns=host does the opposite: it disables remapping for that container)
docker run --user 1000:1000 myapp
  2. Capabilities: Drop all, add only needed
docker run --cap-drop=all --cap-add=net_bind_service myapp
  3. No new privileges: Prevent setuid escalation
docker run --security-opt=no-new-privileges:true myapp
  4. Seccomp: Filter dangerous syscalls
  5. Read-only rootfs: Prevent persistence
  6. In application code:
#include <sys/capability.h>  // libcap; link with -lcap
// Drop capabilities programmatically after initialization
cap_t caps = cap_init();  // Empty capability set
cap_set_proc(caps);       // Apply: now process has NO capabilities (check the return value in real code)
cap_free(caps);           // Free the capability structure
Answer:
| Aspect | SELinux | AppArmor |
| --- | --- | --- |
| Model | Type Enforcement (labels on everything) | Path-based profiles (filenames) |
| Complexity | Complex, fine-grained, steep learning curve | Simpler, easier to learn and debug |
| Default distro | RHEL, Fedora, CentOS | Ubuntu, SUSE, Debian |
| Policy model | Default-deny (must have explicit allow rule) | Default-allow for unconfined programs |
| Learning | audit2allow generates rules from denials | aa-genprof generates profiles interactively |
| Container support | MCS labels isolate containers from each other | Per-container profiles |

Key difference: SELinux labels objects (files, processes) with security contexts. Policy rules define allowed interactions between types. A file moved to a different directory keeps its SELinux label. AppArmor uses pathnames: profiles define what paths a program can access. A file moved to a different path might gain or lose protection depending on the profile rules.

When to use which:
  • SELinux for high-security environments needing fine-grained control (government, finance)
  • AppArmor when simplicity and rapid deployment are preferred
  • Both provide strong security when properly configured
Answer: The problem: Containers share the host kernel. A container could exploit kernel vulnerabilities via syscalls. Every syscall is an entry point into the kernel, and historically, many kernel CVEs are triggered by specific syscall sequences.

Seccomp-bpf solution: Filter syscalls before they execute:
  1. Container runtime installs a BPF filter at container start
  2. Every syscall is checked against the filter at syscall entry, before the kernel runs the syscall's handler
  3. Dangerous syscalls are blocked (e.g., ptrace, mount, kexec_load)
Docker default profile blocks:
  • kexec_load - Replace running kernel (game over if allowed)
  • mount - Mount filesystems (escape container filesystem isolation)
  • ptrace - Debug/trace processes (read other containers’ memory)
  • create_module / init_module - Load kernel modules (arbitrary kernel code)
  • Plus dozens of other dangerous syscalls (roughly 50 in total)
Performance: Very low overhead — the BPF filter runs in kernel space, evaluated in nanoseconds per syscall with no context switches. It is effectively free compared to the cost of the syscall itself.
Answer: Traditional problem:
  • UID 0 = all privileges (approximately 40 distinct powers)
  • Regular user = almost no privileges
  • Programs needing one privilege (bind port 80) got ALL of root’s power
  • A compromised web server running as root could load kernel modules
Capabilities solution: Break root into ~40 discrete privileges:
  • CAP_NET_BIND_SERVICE - Bind to ports below 1024
  • CAP_SYS_ADMIN - Various admin tasks (too broad, avoid this one)
  • CAP_SYS_MODULE - Load kernel modules
  • etc.
Benefits:
  • Least privilege: Grant only what is needed, nothing more
  • Defense in depth: Compromised process has limited blast radius
  • Container isolation: Different containers get different capability sets
  • Auditable: getpcaps shows exactly what a process can do
Example: Web server needs only CAP_NET_BIND_SERVICE, not full root:
setcap cap_net_bind_service=+ep /usr/sbin/nginx
# Now nginx can bind port 80 without running as root
# If nginx is compromised, the attacker CANNOT: load kernel modules,
# read /etc/shadow, mount filesystems, or do anything else that root can do

Interview Deep-Dive

Strong Answer:
  • I would work through the security layers from outermost to innermost to understand what the attacker could and could not do, then investigate what actually happened.
  • Seccomp (layer 1): If the RuntimeDefault seccomp profile was applied, the attacker cannot call ptrace (cannot debug other containers’ processes), mount (cannot mount the host filesystem), kexec_load (cannot replace the kernel), or init_module (cannot load kernel modules). This eliminates the most common container escape techniques. I would check kubectl get pod -o yaml to verify the seccomp profile was actually applied — if seccompProfile is not set, no seccomp filter was active.
  • Capabilities (layer 2): If drop: ALL was set with only specific capabilities added back, the attacker cannot perform privileged operations even though they may be root inside the container. Without CAP_SYS_ADMIN, they cannot call mount() or setns() to access other namespaces. Without CAP_NET_RAW, they cannot sniff network traffic. I would check the pod spec for the capabilities section and cross-reference with the container runtime’s default capability list (Docker grants 14 by default if you do not specify).
  • Namespace isolation (layer 3): The PID namespace means the attacker sees only their container’s processes. The network namespace means they only see their container’s network stack (though they may be able to reach other pods via the pod network if NetworkPolicies are not in place). The mount namespace means /proc and /sys show the container’s view, not the host’s. User namespaces (if enabled) mean that root inside the container maps to an unprivileged UID on the host.
  • LSM (layer 4): SELinux MCS labels (if enabled) prevent the container from accessing files belonging to other containers. Even if the attacker breaks out of mount namespace isolation, the SELinux label mismatch blocks access. AppArmor profiles restrict file access to paths explicitly listed in the profile.
  • For investigation: I would start with the audit log (ausearch -m avc for SELinux denials, dmesg | grep seccomp for seccomp blocks). These logs tell me what the attacker TRIED to do that was blocked. Then I would examine the container’s filesystem (if not read-only) for dropped tools or modified files. kubectl logs and container runtime logs show the initial compromise vector. For network-level investigation, I would check Cilium/Calico flow logs to see what connections the compromised pod made — did it try to reach the metadata service? The Kubernetes API? Other pods?
Follow-up: What if the pod was running as privileged: true?
Follow-up Answer:
  • A privileged container effectively disables ALL security layers: no seccomp filter, all capabilities granted (including CAP_SYS_ADMIN), access to all host devices via /dev, and no LSM confinement. The attacker has essentially root access on the host. They can mount the host filesystem (mount /dev/sda1 /mnt), read any file, load kernel modules, attach to any namespace (nsenter -t 1 -m -u -i -n -p), and compromise every container on the node. This is why privileged: true should NEVER be used in production. The only legitimate use cases are system-level DaemonSets (CNI plugins, node monitoring agents) that genuinely need host access, and even those should be scrutinized for whether they can use specific capabilities instead.
Strong Answer:
  • Multi-tenancy on shared Kubernetes infrastructure requires isolation at every layer: compute, network, storage, and the Kubernetes API itself. I would implement the following:
  • Namespace-level isolation: Each team gets a dedicated Kubernetes namespace. Apply Pod Security Standards at restricted level: kubectl label namespace team-a pod-security.kubernetes.io/enforce=restricted. The admission controller then rejects pods that do not run as non-root, drop all capabilities, disable privilege escalation, and set a seccomp profile (RuntimeDefault or Localhost). A read-only root filesystem is not part of restricted, so enforce it separately with a policy engine such as OPA Gatekeeper. Teams that need exceptions go through a review process.
  • Network isolation: Apply default-deny NetworkPolicies in every namespace. By default, pods in team-a’s namespace cannot communicate with pods in team-b’s namespace. Specific cross-namespace communication is explicitly allowed via policy. Use Cilium for L7-aware policies (allow HTTP GET to the API but block POST) and DNS-aware policies (allow connections to api.example.com but not arbitrary IPs).
  • Resource isolation: ResourceQuotas per namespace cap total CPU, memory, and storage. LimitRanges set per-pod defaults and maximums. This prevents one team from consuming all cluster resources. For performance isolation, use dedicated node pools with taints/tolerations for latency-sensitive workloads, preventing noisy neighbors.
  • RBAC (API-level isolation): Each team gets a Kubernetes Role scoped to their namespace. They can create/delete pods and services in their namespace but cannot access other namespaces, nodes, or cluster-level resources. Use ClusterRole bindings sparingly. Audit all RBAC permissions with kubectl auth can-i --list --as=team-a-user.
  • Image security: Enforce signed images with admission controllers (Sigstore/cosign, OPA Gatekeeper). Block latest tag usage. Scan all images for CVEs before allowing deployment. Restrict image sources to approved registries only.
  • Runtime security: Deploy Falco or Tetragon as a DaemonSet for runtime threat detection. Alert on anomalous behavior: unexpected process execution (shell in a web server container), unexpected network connections (outbound to unknown IPs), filesystem modifications in read-only containers, privilege escalation attempts.
  • SELinux/AppArmor: With SELinux, each namespace’s pods get a unique MCS category via seLinuxOptions in the pod security context. Pod A in team-a (s0:c1,c2) cannot access files created by pod B in team-b (s0:c3,c4) even if both run as the same UID and share a persistent volume.
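The MCS comparison in the last bullet can be illustrated with a small Python sketch. It models the check as "the process's categories must include all of the file's categories" at the same sensitivity, which is a simplification: it ignores category ranges such as c0.c1023 and the type-enforcement rules that apply alongside MCS. The function names are hypothetical.

```python
def parse_level(level):
    """Split 's0:c1,c2' into (sensitivity, set of categories)."""
    sensitivity, _, categories = level.partition(":")
    return sensitivity, {c for c in categories.split(",") if c}


def mcs_allows(process_level, file_level):
    """Simplified dominance check: same sensitivity, superset of categories."""
    p_sens, p_cats = parse_level(process_level)
    f_sens, f_cats = parse_level(file_level)
    return p_sens == f_sens and f_cats <= p_cats


team_a_reads_team_b = mcs_allows("s0:c1,c2", "s0:c3,c4")
team_a_reads_own = mcs_allows("s0:c1,c2", "s0:c1,c2")
```

Because the check is independent of UID and file mode, two pods running as the same user still cannot read each other's files when their category pairs differ.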
Follow-up: A developer argues that all these restrictions slow down their development workflow. How do you balance security with developer experience?
Follow-up Answer:
  • I would create a tiered environment approach. Development namespaces run baseline Pod Security Standards (not restricted), allowing developers to iterate quickly without fighting security restrictions. Staging namespaces run restricted with the same policies as production. CI/CD pipelines automatically deploy to staging and reject promotions to production if the pod spec violates restricted policies. This way, developers discover security issues in staging (where they can fix them at their pace) rather than in production (where it is an emergency). I would also invest in self-service tooling: a Helm chart library with pre-hardened pod security contexts, a policy-as-code repository where teams can request exceptions with justification, and clear documentation explaining WHY each restriction exists and what the secure alternative is. When developers understand that drop: ALL protects their service from being used as a lateral movement pivot after another team’s container is compromised, they become allies rather than adversaries.
Strong Answer:
  • User namespaces create a mapping between UIDs inside the namespace and UIDs outside. When a process has UID 0 inside a user namespace, it has full capabilities WITHIN that namespace, but those capabilities are scoped to the namespace’s resources only. The kernel checks ns_capable() which verifies that the process has the capability in the correct namespace for the operation being attempted.
  • Here is the mechanism: when a rootless container starts, it calls clone(CLONE_NEWUSER) which creates a new user namespace. A UID mapping like 0 100000 65536 is then written to /proc/self/uid_map, meaning UID 0 inside maps to UID 100000 outside (an unprivileged user), and UIDs 1-65535 inside map to 100001-165535 outside. An unprivileged process may map only its own UID directly, so multi-UID ranges like this one are written through the setuid helper newuidmap, which validates them against /etc/subuid. Inside the namespace, the process has all capabilities in its effective set. It can call mount() for a limited set of filesystem types and perform other privileged operations, but ONLY on resources owned by the namespace.
  • The security boundary is the namespace’s resource scope. CAP_NET_ADMIN inside a user namespace lets you configure the network stack of network namespaces owned by that user namespace, but NOT the host’s network stack. CAP_SYS_ADMIN lets you mount filesystems (with restrictions: only certain fs types like tmpfs, proc, sysfs are allowed), but not mount raw block devices. CAP_MKNOD does not help with devices either: creating character or block device nodes requires CAP_MKNOD in the initial user namespace, so mknod for real devices fails with EPERM.
  • The critical invariant: a user namespace cannot grant privileges that its creator did not have in the parent namespace. If the parent process had no capabilities in the parent namespace, the child’s capabilities (even though they are “all” inside the new namespace) cannot affect anything outside the new namespace’s scope. The bounding set in the parent namespace remains the hard ceiling.
  • Practical security implications for rootless containers: the container’s “root” can install packages, bind to port 80 inside the container’s network namespace, and manage processes — all without any privilege on the host. If the container is compromised, the attacker has UID 100000 on the host (an unprivileged user) and cannot read /etc/shadow, load kernel modules, or affect any other container or the host.
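The UID translation described above is simple range arithmetic. Here is a minimal sketch that parses uid_map-style text (inside-start, outside-start, count per line) and maps a container UID to its host UID; the function names are illustrative.

```python
def parse_uid_map(text):
    """Parse /proc/<pid>/uid_map content into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_host_uid(uid, entries):
    """Translate a UID inside the namespace to the UID seen on the host."""
    for inside, outside, count in entries:
        if inside <= uid < inside + count:
            return outside + (uid - inside)
    return None  # unmapped UIDs appear as the overflow UID (65534 by default)


rootless_map = parse_uid_map("0 100000 65536\n")
container_root_on_host = to_host_uid(0, rootless_map)
```

With this mapping, a file owned by container root shows up on the host as owned by UID 100000, which is why an escape lands the attacker in an unprivileged account.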
Follow-up: What are the limitations of rootless containers that prevent some workloads from running?
Follow-up Answer:
  • Several operations are either impossible or require workarounds in rootless containers. First, network: rootless containers cannot create veth pairs or configure bridge networking on the host directly because those operations require real CAP_NET_ADMIN in the initial namespace. Rootless Docker uses slirp4netns (userspace TCP/IP stack) or pasta for networking, which adds overhead compared to bridge networking (the ~10-20% latency figure is a rough estimate and varies by workload). Second, storage: rootless containers cannot use some storage drivers (devicemapper, native overlay on kernels below 5.11). Mounting overlayfs inside a user namespace is supported upstream from kernel 5.11, using the userxattr mount option; older kernels typically fall back to fuse-overlayfs. Third, cgroup delegation: cgroups v2 with the systemd cgroup driver supports delegation to unprivileged users, but cgroups v1 does not. This means rootless containers on cgroups v1 systems cannot set memory limits on sub-containers. Fourth, certain syscalls like mknod for real devices, mount for block devices, and setxattr for security labels are restricted even inside the user namespace for safety reasons.
Strong Answer:
  • I would use a four-phase approach: discover, build, test, and monitor.
  • Phase 1 (Discover): Run the application with SCMP_ACT_LOG as the default action in the seccomp profile. This allows all syscalls but logs every one. Simultaneously, run the application’s full test suite and exercise all code paths (including error paths, graceful shutdown, log rotation, etc.). Collect the syscall audit logs: grep 'type=SECCOMP' /var/log/audit/audit.log | grep -o 'syscall=[0-9]*' | sort -u gives the set of syscall numbers the application used, and ausyscall can translate the numbers to names. Alternatively, use strace -f -c ./myapp for a summary, or OCI runtime tools like oci-seccomp-bpf-hook which automatically generate seccomp profiles from observed behavior.
  • Phase 2 (Build): Start with an empty allowlist and add only the syscalls discovered in Phase 1. For syscalls with argument-level sensitivity (like clone, which should not be called with CLONE_NEWUSER or CLONE_NEWNS flags from inside a container), add argument filters: SCMP_CMP_MASKED_EQ to check specific flag bits. For the default action, I prefer SCMP_ACT_ERRNO(EPERM) over SCMP_ACT_KILL during rollout because it returns an error rather than killing the process, making it easier to discover missing syscalls. Switch to SCMP_ACT_KILL_PROCESS once the profile is validated.
  • Phase 3 (Test): Deploy the profile in a staging environment with the application under realistic load (not just unit tests, but also integration tests, performance tests, and chaos tests). Monitor for two things: application errors (EPERM in logs indicates a missing syscall in the allowlist) and application functionality (all features work correctly). Run for at least one full application lifecycle including startup, steady state, graceful shutdown, and log rotation. Do not forget to test container restart, OOM recovery, and signal handling.
  • Phase 4 (Monitor): In production, run with SCMP_ACT_LOG as the default action for the first week, so syscalls outside the allowlist are still permitted but logged. Monitor for unexpected syscall attempts: these could be legitimate code paths you missed in testing or could be actual attack attempts. After one week of clean logs, switch to SCMP_ACT_KILL_PROCESS for full enforcement. Keep the monitoring active permanently: any seccomp log entry after enforcement means a process was killed, which indicates either a gap in the profile or an attack attempt, and both warrant investigation.
  • A practical shortcut for most teams: start with Docker’s default profile, which blocks the ~50 most dangerous syscalls. Only build a custom minimal profile for security-critical services or services exposed to untrusted input. The effort of maintaining a minimal profile (updating it with every dependency change) is significant and not always worth the marginal security improvement over the default.
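The discover and build phases can be tied together with a short script. The sketch below extracts syscall names from strace -f output and emits an allowlist profile in the same JSON shape as the Docker profile shown earlier; the regular expression, function names, and sample trace are assumptions for illustration, and a real profile would still need argument filters and review.

```python
import json
import re

# Matches lines such as 'openat(AT_FDCWD, ...' or '[pid  4242] read(3, ...'
STRACE_CALL = re.compile(r"^(?:\[pid\s+\d+\]\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(text):
    """Collect the distinct syscall names that appear in strace output."""
    names = set()
    for line in text.splitlines():
        match = STRACE_CALL.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


def build_profile(names, default_action="SCMP_ACT_ERRNO"):
    """Build a deny-by-default profile that allows only the given syscalls."""
    return {
        "defaultAction": default_action,
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": list(names), "action": "SCMP_ACT_ALLOW"}],
    }


sample_trace = "\n".join([
    'openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3',
    '[pid  4242] read(3, "127.0.0.1 localhost\\n", 4096) = 20',
    "[pid  4242] close(3) = 0",
    "--- SIGCHLD {si_signo=SIGCHLD} ---",
    "+++ exited with 0 +++",
])
profile_json = json.dumps(build_profile(syscalls_from_strace(sample_trace)), indent=4)
```

Starting with SCMP_ACT_ERRNO as the default matches the rollout advice in Phase 2; switch the default_action argument once the profile is validated.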
Follow-up: How do you handle seccomp profiles when the application uses dynamic languages (Python, Node.js) that may invoke different syscalls depending on which code path is taken?
Follow-up Answer:
  • Dynamic languages are harder to profile because their syscall surface depends on which modules are loaded, which Python C extensions are called, and even which JIT paths the runtime takes. My approach changes in two ways: First, the discovery phase must be longer and more thorough. I would run the application for multiple days in SCMP_ACT_LOG mode under production-like traffic, not just test traffic, to capture rare code paths. I would also parse the application’s dependency tree to identify C extensions (which make direct syscalls) and research their syscall requirements. Second, I would use a slightly broader allowlist than for a static binary — including syscalls that the runtime might use for garbage collection (madvise, mprotect), JIT compilation (mmap with PROT_EXEC), and dynamic module loading (openat, mmap). The profile for a Python application might allow 150 syscalls versus 50 for a Go static binary, but it is still significantly smaller than the full ~400 available, eliminating the most dangerous attack surface.

Summary

| Mechanism | Purpose | Scope | Enforcement |
| --- | --- | --- | --- |
| DAC (Unix permissions) | Owner-controlled access | Files, basic | Discretionary (owner can change) |
| Capabilities | Fine-grained privileges | Processes | Kernel-enforced per-operation |
| Seccomp-BPF | System call filtering | Per-process | Irrevocable once applied |
| SELinux | Mandatory access control (labels) | Whole system | Administrator-managed policy |
| AppArmor | Profile-based access (paths) | Per-program | Administrator-managed profiles |
| Namespaces | Resource isolation | Per-container | Kernel-enforced |
| Cgroups | Resource limits | Per-container | Kernel-enforced |

Next Steps

  • Namespaces - Container isolation primitives
  • Cgroups - Resource limiting
  • eBPF - Custom security with BPF LSM