Linux security is multi-layered. Understanding these mechanisms is essential for infrastructure engineers building secure container platforms and debugging permission issues. Think of Linux security as a series of checkpoints a request must pass through — each layer can say “no” independently, and the request only succeeds if every layer says “yes.” This defense-in-depth approach means that even if one layer is misconfigured or exploited, the others still provide protection.
Interview Frequency: High (especially for container/cloud roles)
Key Topics: LSM framework, capabilities, seccomp-bpf, SELinux/AppArmor
Time to Master: 12-14 hours
LINUX SECURITY LAYERS

User Space
  Application
    -- Runs with specific UID/GID
    -- Limited capabilities (only what it needs)
    -- Seccomp filter active (restricted syscall set)

======================== System Call Interface ========================

Kernel Space (checks happen IN THIS ORDER)

  1. SECCOMP-BPF (System Call Filter)
     -- Runs first: allows/denies/traces syscalls
     -- Cannot be bypassed (runs before ANY other check)
     -- Decision based on syscall number + args, NOT identity

  2. CAPABILITIES CHECK
     -- Fine-grained root privileges (~40 distinct caps)
     -- Checked before privileged operations
     -- Replaces the binary "is this root?" check

  3. DAC (Discretionary Access Control)
     -- Traditional Unix permissions
     -- UID/GID ownership, rwx bits, ACLs
     -- "Discretionary" = owner can change the policy

  4. LSM HOOKS (Mandatory Access Control)
     -- SELinux, AppArmor, or other modules
     -- Policy-based, cannot be overridden by file owner
     -- "Mandatory" = only admin can change the policy
The ordering matters. Seccomp runs first because the filter is evaluated at syscall entry, before the kernel dispatches to the syscall's handler. It is also the cheapest check (a BPF program evaluating the syscall number and raw arguments), so rejecting a syscall at this stage avoids the cost of all subsequent checks. Capabilities and DAC run next, inside the handler, because they are fast lookups. LSM hooks generally run last, once DAC has passed, because they can involve complex policy evaluation (especially SELinux, which consults an in-kernel policy database).
A senior engineer would say: “Security is not a feature you bolt on at the end. Each layer addresses a different threat model: seccomp limits the kernel’s attack surface, capabilities implement least privilege, DAC controls data access, and LSM enforces organizational policy. If someone asks you to ‘just disable SELinux,’ they are asking you to remove one of those layers — and you should understand which threats you are accepting before you do.”
Traditional Unix has a binary security model: UID 0 (root) can do everything, everyone else is restricted. This means a web server that needs to bind to port 80 must run as root, getting ALL of root’s power including the ability to load kernel modules, read any file, kill any process, and change the system clock. Capabilities break this “all or nothing” model into approximately 40 distinct privileges.
Avoid granting this. It is nearly equivalent to full root.
| Capability | What it allows | Notes |
|---|---|---|
| CAP_SYS_PTRACE | Trace any process, read /proc/pid/mem | Needed for debuggers, strace. Dangerous in containers. |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks | Effectively ignores rwx bits on all files |
| CAP_CHOWN | Make arbitrary changes to file UIDs and GIDs | Can take ownership of any file |
| CAP_SETUID | Set arbitrary UID | Can become any user, including root |
| CAP_SETGID | Set arbitrary GID | Can join any group |
| CAP_KILL | Send signals to any process | Can kill processes owned by other users |
| CAP_SYS_RESOURCE | Override resource limits (ulimits) | Escape cgroup-like constraints |
| CAP_SYS_TIME | Set system clock | Can break time-dependent security (TLS certs, Kerberos) |
| CAP_SYS_MODULE | Load/unload kernel modules | Full kernel code execution. Never grant in containers. |
| CAP_MKNOD | Create special device files | Can create /dev/sda and read raw disk |
| CAP_AUDIT_WRITE | Write to audit log | Needed for applications that must log to auditd |
| CAP_SETFCAP | Set file capabilities | Can grant capabilities to other binaries |
The CAP_SYS_ADMIN trap: CAP_SYS_ADMIN is the most commonly requested and most dangerous capability. It controls over 30 different operations including mount(), sethostname(), setns(), pivot_root(), ioctl() on many devices, and more. Granting CAP_SYS_ADMIN to a container is essentially the same as running it as --privileged. If an application claims to need CAP_SYS_ADMIN, push back and find out which specific operation it needs; there may be a narrower capability or an alternative approach.
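One quick way to audit for this is to read the CapEff mask from /proc/<pid>/status and test bit 21, which is CAP_SYS_ADMIN in linux/capability.h. The sketch below is a minimal POSIX shell helper. The function name has_cap is made up for illustration, and the sample masks in the comments are only example data.

```sh
# has_cap <hex mask> <bit number>
# Succeeds (exit 0) when the given capability bit is set in the mask.
# The mask is the bare hex string found in /proc/<pid>/status (no 0x prefix).
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Hypothetical usage: warn if a process holds CAP_SYS_ADMIN (bit 21).
#   mask=$(awk '/^CapEff:/ { print $2 }' /proc/1234/status)
#   has_cap "$mask" 21 && echo "PID 1234 holds CAP_SYS_ADMIN"
```

The same helper works for any capability once you know its bit number, for example 10 for CAP_NET_BIND_SERVICE or 16 for CAP_SYS_MODULE.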
Each process has multiple capability sets that interact during permission checks and across execve() boundaries. This is where capabilities get subtle.
CAPABILITY SETS

Process capability sets:

  EFFECTIVE (CapEff)
    -- Capabilities the kernel actually checks for permission
    -- This is the ONLY set that matters at the moment of a check
    -- Think of it as "what powers I am currently wielding"

  PERMITTED (CapPrm)
    -- Maximum capabilities this process CAN have in Effective
    -- A process can copy caps from Permitted to Effective at will
    -- Once dropped from Permitted, CANNOT be regained (one-way drop)
    -- Think of it as "what powers I could wield if I chose to"

  INHERITABLE (CapInh)
    -- Capabilities preserved across execve()
    -- Combined with file capabilities on the new binary
    -- Think of it as "what powers I can pass to programs I launch"

  BOUNDING (CapBnd)
    -- Hard ceiling on capabilities that can ever be gained
    -- Dropping from Bounding is IRREVERSIBLE (even for root)
    -- Think of it as "the maximum powers anything in this process tree
       can ever have, no matter what"

  AMBIENT (CapAmb)
    -- Capabilities preserved across exec of unprivileged programs
    -- Must be subset of Permitted AND Inheritable
    -- Solves the "how do I pass caps to a non-setuid binary" problem
    -- Added in kernel 4.3
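How these sets combine across execve() is easier to see as arithmetic. The sketch below follows the transformation rules documented in capabilities(7) for a non-root process. It deliberately ignores the special handling of UID 0 and of setuid/setgid binaries, and the function name cap_after_exec is an illustrative assumption, not an existing tool.

```sh
# cap_after_exec P_inh P_bnd P_amb F_inh F_prm F_eff
#   P_* = process sets before execve(), F_* = file capability sets,
#   F_eff = the file's effective bit (0 or 1). Masks may be written as 0x... values.
cap_after_exec() {
  p_inh=$(( $1 )); p_bnd=$(( $2 )); p_amb=$(( $3 ))
  f_inh=$(( $4 )); f_prm=$(( $5 )); f_eff=$6

  # Ambient caps are cleared when the file is "privileged".
  # Only file capabilities are modeled here, not setuid/setgid bits.
  new_amb=$p_amb
  if [ "$f_inh" -ne 0 ] || [ "$f_prm" -ne 0 ]; then new_amb=0; fi

  # P'(permitted) = (P(inh) & F(inh)) | (F(prm) & P(bounding)) | P'(ambient)
  new_prm=$(( (p_inh & f_inh) | (f_prm & p_bnd) | new_amb ))

  # P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
  if [ "$f_eff" -eq 1 ]; then new_eff=$new_prm; else new_eff=$new_amb; fi

  printf 'CapPrm=%016x CapEff=%016x\n' "$new_prm" "$new_eff"
}

# Unprivileged process runs a binary marked cap_net_bind_service=ep (bit 10 = 0x400):
#   cap_after_exec 0 0x3fffffffff 0 0 0x400 1
```

Note how the bounding set acts as a mask on the file's permitted caps: with P_bnd set to 0, the same binary ends up with no capabilities at all.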
# View process capabilities (hex-encoded bitmask)
grep Cap /proc/$$/status
# CapInh: 0000000000000000   -- Inheritable: none
# CapPrm: 0000003fffffffff   -- Permitted: all caps (we are root)
# CapEff: 0000003fffffffff   -- Effective: all caps (actively wielded)
# CapBnd: 0000003fffffffff   -- Bounding: all caps (ceiling)
# CapAmb: 0000000000000000   -- Ambient: none

# Decode capability bits to human-readable names
capsh --decode=0000003fffffffff
# 0x0000003fffffffff=cap_chown,cap_dac_override,...
# This mask has 38 bits set (the full set on older kernels).
# Kernels 5.9+ define 41 capabilities, so a full set reads 000001ffffffffff there.

# View capabilities of a specific process
getpcaps $$
# Shows: cap_chown,cap_dac_override,... for the current shell

# Set file capabilities (allow binary to run with specific caps as non-root)
# This is the SECURE alternative to setuid root
sudo setcap cap_net_bind_service=+ep /path/to/binary
# +e = add to Effective on exec, +p = add to Permitted on exec
# Now this binary can bind port 80 without being root

# View file capabilities
getcap /path/to/binary
# /path/to/binary cap_net_bind_service=ep

# Run a command with specific capabilities only
capsh --caps="cap_net_bind_service+eip" -- -c "./my_server"

# Drop ALL capabilities except specific ones
# This is what container runtimes do for unprivileged containers
capsh --drop=all --caps="cap_net_raw+eip" -- -c "./my_program"
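On minimal images capsh is often not installed. As a fallback, the mask can be decoded with plain shell arithmetic. This is a sketch under the assumption that the bit order matches linux/capability.h (bit 0 is cap_chown, bit 40 is cap_checkpoint_restore); the function name decode_caps and the variable CAP_NAMES are hypothetical.

```sh
# Capability names in bit order (bit 0 first), following linux/capability.h.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps <hex mask>   (bare hex string, as printed in /proc/<pid>/status)
decode_caps() {
  mask=$(( 0x$1 ))
  bit=0
  out=""
  for name in $CAP_NAMES; do
    # Append the name when the corresponding bit is set.
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out,}cap_$name"
    fi
    bit=$(( bit + 1 ))
  done
  printf '%s\n' "$out"
}

# Usage: decode the effective set of the current shell
#   decode_caps "$(awk '/^CapEff:/ { print $2 }' /proc/$$/status)"
```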
Understanding how the kernel checks capabilities helps you debug “permission denied” errors. Every privileged operation in the kernel calls capable() or ns_capable() before proceeding.
// Kernel checks capabilities like this:
// kernel/capability.c

// capable() checks against the initial (root) user namespace
bool capable(int cap)
{
    return ns_capable(&init_user_ns, cap);
}

// ns_capable() is the namespace-aware version
// In a user namespace, you can have capabilities that are ONLY valid
// within that namespace. A container can be "root" inside its namespace
// but unprivileged on the host.
bool ns_capable(struct user_namespace *ns, int cap)
{
    if (unlikely(!cap_valid(cap)))   // Sanity check: is this a real cap?
        return false;

    // security_capable() calls into the LSM framework
    // This is where SELinux/AppArmor can DENY even if the process
    // technically has the capability. MAC overrides capabilities.
    if (security_capable(current_cred(), ns, cap, CAP_OPT_NONE) == 0)
        return true;

    return false;
}

// Example: how the kernel enforces port binding restrictions
// net/ipv4/af_inet.c
// When bind() is called with a port below 1024:
if (snum < inet_prot_sock(net) &&
    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
    return -EACCES;   // This becomes errno=EACCES in userspace

// The process needs CAP_NET_BIND_SERVICE in the network namespace's
// owning user namespace -- not just in any namespace
Debugging capability denials: When you get EACCES or EPERM and cannot figure out why, use strace to find the failing syscall, then search the kernel source for that syscall’s capable() or ns_capable() checks. This tells you exactly which capability is needed. For example: strace -e trace=bind ./myapp 2>&1 | grep EACCES shows the bind() call failing, and the kernel source for inet_bind() shows it needs CAP_NET_BIND_SERVICE.
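When the strace log is long, it helps to reduce it to just the calls that failed with a permission error. The filter below is a rough sketch that assumes strace's usual one-line output format (optionally prefixed with "[pid N]" when -f is used); the function name failed_privileged_calls is hypothetical.

```sh
# Reads strace output on stdin and prints "<syscall> <errno>" for every call
# that failed with EACCES or EPERM.
failed_privileged_calls() {
  sed -nE 's/^(\[pid +[0-9]+\] )?([a-z0-9_]+)\(.*\) = -1 (EACCES|EPERM) .*$/\2 \3/p'
}

# Usage (strace writes to stderr, so redirect it into the pipe):
#   strace -f ./myapp 2>&1 | failed_privileged_calls | sort | uniq -c
```

Each syscall name in the output is a starting point for the kernel-source search described above.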
LSM provides a framework of hooks throughout the kernel for mandatory access control. Unlike DAC (where the file owner controls permissions), MAC policies are set by the administrator and cannot be overridden by users, even root. The key mental model: LSM hooks are checkpoints inserted at every security-relevant kernel operation. Each registered security module gets a chance to say “deny” at each checkpoint.
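Linux supports "stacking" multiple LSM modules: several can be active at once, and each hook in the chain must approve the operation for it to proceed. The set of modules and their order are fixed at compile time (CONFIG_LSM) and can be overridden at boot with the lsm= parameter (kernel 5.1+). In practice this means one "major" module such as SELinux or AppArmor plus smaller ones such as Yama, Landlock, or BPF LSM (available since kernel 5.7). For example, you can have SELinux + BPF LSM active simultaneously.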
// include/linux/lsm_hooks.h (partial list of the ~200 hooks)
// These hooks cover every security-relevant operation in the kernel
union security_list_options {
    // Credential operations -- control who a process can become
    int (*cred_prepare)(struct cred *new, const struct cred *old, gfp_t gfp);

    // File operations -- control what files a process can access
    int (*file_permission)(struct file *file, int mask);
    int (*file_open)(struct file *file);
    int (*file_mprotect)(struct vm_area_struct *vma, unsigned long reqprot,
                         unsigned long prot);
    // file_mprotect is critical: prevents W+X (write+execute) memory
    // which is how many exploits work (write shellcode, then execute it)

    // Process operations -- control what processes can do to each other
    // (task_alloc replaced the older task_create hook)
    int (*task_alloc)(struct task_struct *task, unsigned long clone_flags);
    int (*task_kill)(struct task_struct *p, struct kernel_siginfo *info,
                     int sig, const struct cred *cred);

    // Socket operations -- control network access
    int (*socket_create)(int family, int type, int protocol, int kern);
    int (*socket_bind)(struct socket *sock, struct sockaddr *address, int addrlen);
    int (*socket_connect)(struct socket *sock, struct sockaddr *address, int addrlen);
    // socket_connect is what prevents a compromised web server from
    // making outbound connections to attacker-controlled servers

    // And approximately 190 more hooks...
};
Type Enforcement security — every subject (process) and object (file, socket, etc.) is labeled with a security context. Policy rules define which types can interact with which other types and how. If there is no explicit “allow” rule, the access is denied. This “default deny” model is what makes SELinux so effective and so frustrating.
# View file context -- the -Z flag shows SELinux labels
ls -Z /etc/passwd
# -rw-r--r--. root root system_u:object_r:passwd_file_t:s0 /etc/passwd

# Context format: user:role:type:level
#   user:  SELinux user identity (system_u = system, unconfined_u = regular user)
#   role:  RBAC role (object_r for files, system_r for daemons, unconfined_r for users)
#   type:  THE MOST IMPORTANT FIELD -- this is what Type Enforcement uses
#          passwd_file_t, httpd_t, container_t, etc.
#   level: MLS/MCS level (s0 = base level, s0-s15:c0.c1023 = multi-category)
#          MCS is used by containers to isolate them from each other

# View process context
ps -Z
# LABEL                                                  PID TTY      TIME     CMD
# unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 1234 pts/0    00:00:00 bash
# "unconfined" means this process is NOT restricted by SELinux policy
# A properly confined process would show httpd_t, sshd_t, container_t, etc.

# Current context
id -Z
# Type enforcement rule format:
#   allow source_type target_type : object_class { permissions };
# "Allow processes with type X to do Y to objects with type Z"

# Example: Allow httpd (web server) to read password files
allow httpd_t passwd_file_t : file { read open getattr };

# Example: Allow httpd to connect to HTTP ports (80, 443)
allow httpd_t http_port_t : tcp_socket { name_connect };
# Without this rule, your web server cannot make outbound HTTP connections
# This is INTENTIONAL -- a compromised web server should not be able to
# connect to arbitrary servers on the internet

# Query the loaded policy for existing rules
sesearch --allow --source httpd_t --target passwd_file_t

# View AVC (Access Vector Cache) denials in audit log
# This is HOW you find out what SELinux is blocking
ausearch -m avc -ts today

# Generate a policy module from observed denials
# audit2allow reads denial logs and creates allow rules
audit2allow -a -M mypolicy   # -a = read all denials, -M = create module
semodule -i mypolicy.pp      # Install the generated policy module
# WARNING: audit2allow can be too permissive. Review generated rules
# before installing. It may allow more than the minimum needed.
The audit2allow trap: It is tempting to run audit2allow -a -M fix && semodule -i fix.pp every time SELinux blocks something. This gradually opens up your security policy until SELinux is technically enabled but not actually protecting anything. A better approach: understand WHY the denial happened. Often the file has the wrong context (fix with restorecon) or a boolean needs to be set (fix with setsebool). Only create custom policy rules when the standard policy genuinely does not cover your use case.
# Check current mode
getenforce   # Returns: Enforcing, Permissive, or Disabled

# Temporarily change mode (does not survive reboot)
setenforce 0   # Permissive: logs denials but does NOT block them
setenforce 1   # Enforcing: logs AND blocks

# Permanent change: edit /etc/selinux/config
SELINUX=enforcing      # enforcing, permissive, or disabled
SELINUXTYPE=targeted   # targeted (confine specific daemons) or mls (multi-level)

# NOTE: Changing from disabled to enforcing requires a full filesystem relabel
# which can take 10-30 minutes on boot. Plan accordingly.
The “just disable SELinux” anti-pattern: When SELinux blocks something, the temptation is to run setenforce 0 or add SELINUX=disabled to the config. This is the security equivalent of turning off smoke detectors because they beep. Instead: set the specific domain to permissive (semanage permissive -a httpd_t) to debug that one service without disabling protection for everything else. Then fix the root cause and re-enforce.
Booleans are pre-defined policy switches that enable/disable common configurations without writing custom policy.
# List all booleans (there are hundreds)
getsebool -a

# Common examples that come up in production:
setsebool -P httpd_can_network_connect on   # Let Apache make outbound connections
setsebool -P container_manage_cgroup on     # Let containers manage their cgroups
setsebool -P httpd_can_sendmail on          # Let Apache send email
# -P = persistent (survives reboot)

# View what booleans affect a specific domain
semanage boolean -l | grep httpd
# httpd_can_network_connect  (off , off)  Allow httpd to make network connections
# httpd_can_network_relay    (off , off)  Allow httpd to act as a reverse proxy
Profile-based MAC — simpler than SELinux because it uses pathnames rather than labels. Each confined program has a profile that lists exactly which files, capabilities, and network operations it can use. If it is not in the profile, it is denied.
# Enforce mode (default) -- denials are logged AND blocked
aa-enforce /etc/apparmor.d/usr.sbin.nginx

# Complain mode -- denials are LOGGED but NOT blocked
# Use this to discover what a program needs before writing the profile
aa-complain /etc/apparmor.d/usr.sbin.nginx

# Disable a profile entirely
aa-disable /etc/apparmor.d/usr.sbin.nginx

# Generate a new profile interactively
# Runs the program, logs what it does, and asks you to allow/deny each action
aa-genprof /usr/sbin/nginx

# Update an existing profile from recent log entries
# Reads denials from the log and offers to add allow rules
aa-logprof
# Docker run with a specific AppArmor profile
# docker-default is the built-in profile that blocks dangerous operations
docker run --security-opt apparmor=docker-default nginx

# Kubernetes pod with AppArmor (annotation-based, beta API;
# Kubernetes 1.30+ also accepts a securityContext.appArmorProfile field)
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    # The profile must be loaded on the node where the pod runs
    container.apparmor.security.beta.kubernetes.io/nginx: runtime/default
spec:
  containers:
  - name: nginx
    image: nginx
SELinux vs AppArmor for containers: Docker and Kubernetes work with both. SELinux uses MCS (Multi-Category Security) labels to isolate containers from each other — each container gets a unique category pair (e.g., s0:c123,c456) so container A cannot access container B’s files even if both run as the same UID. AppArmor uses per-container profiles to restrict file access and capabilities. In practice, most teams use whichever their distro defaults to: SELinux on RHEL/Fedora/CentOS, AppArmor on Ubuntu/SUSE/Debian.
Seccomp (Secure Computing) with BPF filters lets you restrict which system calls a process can make. This is your last line of defense against kernel exploits: even if an attacker gets code execution inside your container, they can only invoke the ~300 syscalls that the filter allows, not the ~400+ that the kernel provides.
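The filter is attached to a process with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) or the newer seccomp(2) syscall, and is inherited by all child processes (including across execve()). An unprivileged process must first set no_new_privs (prctl(PR_SET_NO_NEW_PRIVS, 1)) before it is allowed to install a filter. Once attached, a filter cannot be removed or weakened: a process can only add more restrictive filters on top. This "no weakening" property is what makes seccomp safe against privilege escalation.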
#include <errno.h>
#include <fcntl.h>     // O_ACCMODE, O_RDONLY
#include <seccomp.h>   // libseccomp (higher-level API, recommended for readability)

int setup_seccomp(void)
{
    scmp_filter_ctx ctx;
    int rc = -1;

    // Default action: kill the process if a disallowed syscall is attempted
    // (plain SCMP_ACT_KILL only kills the offending thread)
    // Alternative: SCMP_ACT_ERRNO(EPERM) to return error instead of killing
    // KILL is safer (no chance of ignoring the error), ERRNO is more debuggable
    ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (!ctx)
        return -1;

    // Allowlist: explicitly permit specific syscalls
    // This is a DENY-by-default approach (much safer than blocklisting)
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0) < 0)
        goto out;

    // Conditional allow: permit open()/openat() ONLY with O_RDONLY access mode
    // This prevents the process from opening files for writing.
    // A masked compare is needed because the flags word usually carries
    // extra bits (O_CLOEXEC, O_NONBLOCK, ...); modern libc calls openat(),
    // where the flags are argument 2 rather than argument 1.
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(open), 1,
                         SCMP_A1(SCMP_CMP_MASKED_EQ, O_ACCMODE, O_RDONLY)) < 0 ||
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 1,
                         SCMP_A2(SCMP_CMP_MASKED_EQ, O_ACCMODE, O_RDONLY)) < 0)
        goto out;

    // Load the filter into the kernel
    // After this call, the filter is ACTIVE and IRREVOCABLE
    if (seccomp_load(ctx) < 0)
        goto out;

    rc = 0;
out:
    seccomp_release(ctx);   // Free the userspace context (filter is in kernel now)
    return rc;
}
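To confirm from the outside that a filter is actually attached, the kernel exposes the seccomp mode and the no_new_privs flag in /proc/<pid>/status (Seccomp: 0 = disabled, 1 = strict, 2 = filter). The helper below is a small sketch; the function name seccomp_status is hypothetical, and it only parses text, so it can be fed a status file from any process.

```sh
# Reads the contents of /proc/<pid>/status on stdin and prints a one-line summary.
seccomp_status() {
  awk '
    /^NoNewPrivs:/ { nnp = $2 }
    /^Seccomp:/    { mode = $2 }     # does not match the "Seccomp_filters:" line
    END {
      name = "unknown"
      if (mode == "0") name = "disabled"
      if (mode == "1") name = "strict"
      if (mode == "2") name = "filter"
      printf "seccomp=%s no_new_privs=%s\n", name, nnp
    }'
}

# Usage:
#   seccomp_status < /proc/self/status
#   seccomp_status < /proc/1234/status
```

A containerized process that reports seccomp=disabled is usually running with seccomp=unconfined or without any profile configured.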
TOCTOU warning: A seccomp filter sees only the raw syscall arguments (the register values); it cannot dereference pointers. When an argument is a pointer, such as the filename passed to open(), the data it points to lives in userspace memory, and another thread could change it between the seccomp check and the kernel's use of that data. For this reason, seccomp cannot safely make decisions based on the contents of pointer arguments. Use LSM (SELinux/AppArmor) for path-based access control, and use seccomp for filtering on syscall numbers and scalar arguments.
# Run container with custom seccomp profile
docker run --security-opt seccomp=/path/to/profile.json myimage

# Run with NO seccomp (dangerous, but useful for debugging)
docker run --security-opt seccomp=unconfined myimage

# Docker's default profile blocks approximately 50 syscalls including:
#   kexec_load, kexec_file_load                 -- replace the running kernel
#   mount, umount2                              -- modify filesystem mounts
#   ptrace                                      -- debug/trace other processes
#                                                  (blocked on kernels older than 4.8)
#   create_module, init_module, delete_module   -- load kernel modules
#   reboot                                      -- reboot the host
#   swapon, swapoff                             -- modify swap
# These are syscalls that would let a container escape to the host
Write custom security policies with eBPF programs that attach to LSM hooks. This gives you the flexibility of custom kernel modules without the stability risk — BPF programs are verified for safety before loading.
// Assumes a libbpf CO-RE build (vmlinux.h generated with bpftool) and a kernel
// with CONFIG_BPF_LSM=y and "bpf" in the active LSM list.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

#define EPERM   1
#define SIGKILL 9

char LICENSE[] SEC("license") = "GPL";

// PID to protect; filled in by the loader before the program is attached
const volatile pid_t protected_pid = 0;

// Custom LSM program: block access to /etc/shadow
// This runs at the security_file_open hook, AFTER DAC and capabilities
SEC("lsm/file_open")
int BPF_PROG(restrict_file_open, struct file *file)
{
    const unsigned char *filename;
    char buf[256];

    // Read the filename from the kernel's dentry structure
    // BPF_CORE_READ handles cross-kernel-version struct layout differences
    filename = BPF_CORE_READ(file, f_path.dentry, d_name.name);
    bpf_probe_read_kernel_str(buf, sizeof(buf), filename);

    // Block access to shadow file
    // NOTE: This string comparison is simplified. Production code should
    // check the full path, not just the filename, to avoid false positives
    // on files coincidentally named "shadow" in other directories.
    // The trailing '\0' check keeps names like "shadow-" from matching.
    if (buf[0] == 's' && buf[1] == 'h' && buf[2] == 'a' &&
        buf[3] == 'd' && buf[4] == 'o' && buf[5] == 'w' && buf[6] == '\0') {
        return -EPERM;   // Deny access
    }
    return 0;   // Allow access (let other LSM hooks also decide)
}

// Custom LSM program: prevent killing a protected process
SEC("lsm/task_kill")
int BPF_PROG(restrict_kill, struct task_struct *p, struct kernel_siginfo *info,
             int sig, const struct cred *cred)
{
    // Block SIGKILL to a specific PID (protected_pid, set by the loader)
    // This could protect a critical monitoring agent from being killed
    if (sig == SIGKILL && BPF_CORE_READ(p, pid) == protected_pid) {
        return -EPERM;
    }
    return 0;
}
When to use BPF LSM vs SELinux/AppArmor: Use SELinux/AppArmor for standard server hardening — they have mature tooling, well-tested policies, and broad community support. Use BPF LSM for dynamic, application-specific policies that need to change at runtime without restarting services. For example, a security team might deploy a BPF LSM program that blocks a specific CVE’s exploitation technique fleet-wide within minutes, without modifying any SELinux policy files or restarting any services.
# Run with minimal capabilities (drop all, add only what's needed)
docker run --cap-drop=all --cap-add=net_bind_service nginx
# This nginx can bind port 80 but cannot: load kernel modules, mount
# filesystems, change file ownership, send signals to other containers, etc.

# No new privileges (block setuid escalation)
docker run --security-opt=no-new-privileges:true myapp
# Even if there is a setuid-root binary inside the container, execve()
# will NOT grant elevated privileges. Critical for defense in depth.

# Custom seccomp profile
docker run --security-opt seccomp=profile.json myapp

# Custom AppArmor profile
docker run --security-opt apparmor=my-profile myapp

# Read-only rootfs (any writes fail with EROFS)
docker run --read-only --tmpfs /tmp myapp
# The application can write to /tmp (tmpfs, in-memory) but nothing else
# This prevents an attacker from dropping persistence mechanisms

# Run as non-root user
docker run --user 1000:1000 myapp
# Even inside the container, the process runs as UID 1000, not root
# Combined with user namespaces, this UID maps to an unprivileged host UID

# Completely unprivileged (rootless Docker)
# Docker daemon itself runs as a non-root user
# Uses user namespaces, unprivileged network setup (slirp4netns)
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true        # kubelet rejects pods that try to run as root
    runAsUser: 1000           # Force UID 1000
    fsGroup: 2000             # All mounted volumes owned by GID 2000
    seccompProfile:
      type: RuntimeDefault    # Use container runtime's default seccomp profile
  containers:
  - name: app
    image: myapp
    securityContext:
      allowPrivilegeEscalation: false   # Equivalent to no-new-privileges
      readOnlyRootFilesystem: true      # Read-only container filesystem
      capabilities:
        drop:
        - ALL                 # Drop every capability
        add:
        - NET_BIND_SERVICE    # Add back only what we need
# Result: this container can bind port 80, and NOTHING else privileged.
# It cannot: read /etc/shadow on the host, load kernel modules,
# mount filesystems, ptrace other containers, change its UID, etc.
Pod Security Standards (PSS): Since Kubernetes 1.25, the built-in Pod Security Admission controller enforces Pod Security Standards at the namespace level. The three levels are: privileged (no restrictions), baseline (blocks known privilege escalations), and restricted (hardened: requires dropping all capabilities, running as non-root, allowPrivilegeEscalation: false, and a seccomp profile). Note that restricted does not mandate a read-only root filesystem, so set readOnlyRootFilesystem yourself. Apply restricted to all production namespaces: kubectl label namespace production pod-security.kubernetes.io/enforce=restricted. Most "it worked in dev but not prod" security issues are caused by running baseline in dev and restricted in prod.
# Check what capabilities a running process actually has
grep Cap /proc/<pid>/status
# Then decode: capsh --decode=<hex_value>

# Find which syscall is failing and what capability it needs
# strace shows the syscall and the errno
strace -f ./myapp 2>&1 | grep -E 'EPERM|EACCES'
# Example output: bind(3, {sa_family=AF_INET, port=80}, 16) = -1 EACCES
# (a privileged-port bind fails with EACCES; many other capability checks return EPERM)

# Common fix: check if binary has file capabilities set
getcap /path/to/binary
# If empty, the binary has no file capabilities

# Add the needed file capability
sudo setcap cap_net_bind_service=+ep /path/to/binary
# Now the binary can bind port 80 as a non-root user

# For containers: check if the capability is in the allowed list
docker inspect <container> | grep -A 20 CapAdd
# If NET_BIND_SERVICE is not in CapAdd and ALL is in CapDrop, that is the problem
# Check for recent AVC (Access Vector Cache) denials
ausearch -m avc -ts recent
# Example denial message:
# type=AVC msg=audit(...): avc: denied { read } for pid=1234
#   comm="myapp" name="secret" dev="sda1" ino=12345
#   scontext=system_u:system_r:httpd_t:s0
#   tcontext=system_u:object_r:admin_home_t:s0
#   tclass=file
# Translation: Process with type httpd_t tried to read a file with type
# admin_home_t, and there is no allow rule for httpd_t -> admin_home_t : file { read }

# FIX OPTION 1: Change the file's context to something httpd can read
chcon -t httpd_sys_content_t /path/to/file
# Or permanently (survives restorecon):
semanage fcontext -a -t httpd_sys_content_t "/path/to/file(/.*)?"
restorecon -Rv /path/to/file

# FIX OPTION 2: Generate and install a custom policy module
ausearch -m avc -ts recent | audit2allow -M myfix
semodule -i myfix.pp
# REVIEW the generated policy first: cat myfix.te

# FIX OPTION 3: Enable a boolean (if one exists for this use case)
semanage boolean -l | grep httpd
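Reading raw AVC records gets tedious when there are dozens of them. The sketch below condenses each denial into the same "source -> target : class { perms }" form used in the translation above. It assumes the record is on a single line, as ausearch prints it, and the function name summarize_avc is made up for this example.

```sh
# Reads audit log lines on stdin; prints one summary line per AVC denial.
summarize_avc() {
  awk '/avc: +denied/ {
    perms = ""; src = ""; tgt = ""; cls = ""
    # The permission list is the only {...} group in the record.
    if (match($0, /\{[^}]*\}/)) perms = substr($0, RSTART, RLENGTH)
    for (i = 1; i <= NF; i++) {
      # Contexts look like user:role:type:level, so the type is field 3.
      if ($i ~ /^scontext=/) { split($i, a, ":"); src = a[3] }
      if ($i ~ /^tcontext=/) { split($i, b, ":"); tgt = b[3] }
      if ($i ~ /^tclass=/)   { cls = substr($i, 8) }
    }
    print src " -> " tgt " : " cls " " perms
  }'
}

# Usage:
#   ausearch -m avc -ts recent | summarize_avc | sort | uniq -c
```

Seeing the same source/target pair many times usually points to a mislabeled file (fix option 1) rather than a missing rule.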
# Check for AppArmor denials in kernel logs
dmesg | grep apparmor
# Example:
# apparmor="DENIED" operation="open" profile="/usr/sbin/nginx"
#   name="/etc/secret" pid=1234 comm="nginx"
# Translation: Nginx tried to open /etc/secret but its profile does not allow it

# Put the profile in complain mode to log all violations without blocking
aa-complain /etc/apparmor.d/usr.sbin.nginx
# Run the application, exercise all code paths
# Then update the profile from the logged violations:
aa-logprof
# Review the suggested changes, accept the ones that are legitimate
aa-enforce /etc/apparmor.d/usr.sbin.nginx   # Re-enforce after fixing
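The same condensing trick works for AppArmor log lines. This is a sketch that assumes the key="value" layout shown in the example above and that the values contain no spaces; the function name summarize_apparmor is hypothetical.

```sh
# Reads kernel log lines on stdin; prints "<profile>: <operation> <path>"
# for every AppArmor denial.
summarize_apparmor() {
  awk '
    # Strip the key= prefix and the surrounding double quotes from a field.
    function val(s) { sub(/^[^=]*=/, "", s); gsub(/"/, "", s); return s }
    /apparmor="DENIED"/ {
      op = ""; prof = ""; name = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^operation=/) op   = val($i)
        if ($i ~ /^profile=/)   prof = val($i)
        if ($i ~ /^name=/)      name = val($i)
      }
      print prof ": " op " " name
    }'
}

# Usage:
#   dmesg | summarize_apparmor | sort | uniq -c
```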
# Check for seccomp kills in kernel logs
dmesg | grep seccomp
# Example: audit: seccomp action=kill pid=1234 comm="myapp" syscall=101
# syscall=101 = ptrace on x86_64

# Run with strace to see what syscalls the application needs
strace -f ./myapp 2>&1 | head -100
# This shows every syscall the application makes
# Use this to build an allowlist for your seccomp profile

# Use audit mode to log without blocking (find all needed syscalls first)
# In Docker seccomp profile: "defaultAction": "SCMP_ACT_LOG"
# This lets the application run normally but logs every syscall
# Then review logs and build the allowlist from actual usage

# For Kubernetes: check if the container's seccomp profile is too restrictive
kubectl describe pod <pod-name>
# Look for "seccompProfile" in the security context
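Turning an strace log into a first-draft allowlist can be scripted. The sketch below emits a profile in the JSON layout Docker accepts (defaultAction plus a syscalls list). Treat it as a starting point under two assumptions: the strace run exercised every code path, and you will add the syscalls the container runtime itself needs before your program starts. The function name strace_to_seccomp_profile is hypothetical.

```sh
# Reads strace output on stdin; writes a deny-by-default seccomp profile to stdout.
strace_to_seccomp_profile() {
  # Extract the syscall name at the start of each line, de-duplicate,
  # and join the names as a quoted, comma-separated JSON list.
  names=$(sed -nE 's/^(\[pid +[0-9]+\] )?([a-z0-9_]+)\(.*/\2/p' | sort -u |
          awk '{ printf "%s\"%s\"", (NR > 1 ? ", " : ""), $0 }')
  printf '{\n  "defaultAction": "SCMP_ACT_ERRNO",\n  "syscalls": [\n    { "names": [%s], "action": "SCMP_ACT_ALLOW" }\n  ]\n}\n' "$names"
}

# Usage:
#   strace -f ./myapp 2>&1 | strace_to_seccomp_profile > profile.json
#   docker run --security-opt seccomp=profile.json myimage
```

Running the container once with "defaultAction": "SCMP_ACT_LOG" afterwards is a reasonable way to catch syscalls the trace missed.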
The debugging workflow for "container cannot do X": (1) Check seccomp: is the syscall blocked by the profile? dmesg | grep seccomp. (2) Check capabilities: docker inspect <container> | grep Cap. (3) Check DAC: plain file permissions. (4) Check LSM: ausearch -m avc for SELinux or dmesg | grep apparmor. The layers are checked in this order, and the FIRST denial wins, so start from the outermost layer (seccomp) and work inward.
Q: How do you drop privileges in a containerized application?
Answer: Multiple layers:
1. User namespace: Map container UID 0 to an unprivileged host UID. Remapping is enabled on the daemon (dockerd --userns-remap=default); note that --userns=host does the opposite and disables remapping for that container.
   docker run --user 1000:1000 myapp
2. Capabilities: Drop all, add only needed
   docker run --cap-drop=all --cap-add=net_bind_service myapp
3. No new privileges: Prevent setuid escalation
   docker run --security-opt=no-new-privileges:true myapp
4. Seccomp: Filter dangerous syscalls
5. Read-only rootfs: Prevent persistence
In application code:
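#include <stdlib.h>
#include <sys/capability.h>   // libcap; link with -lcap

// Drop capabilities programmatically after initialization
cap_t caps = cap_init();      // Empty capability set
if (caps == NULL)
    abort();
if (cap_set_proc(caps) != 0)  // Apply: now process has NO capabilities
    abort();                  // Never continue running with privileges still held
cap_free(caps);               // Free the capability structure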
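To check whether the user-namespace layer is really in effect, look at /proc/<pid>/uid_map for a container process. In the initial namespace the map is the identity mapping "0 0 4294967295", while a remapped container shows something like "0 100000 65536". The helper below is a sketch that only inspects that text; the function name uid_map_is_identity is hypothetical.

```sh
# Reads the contents of /proc/<pid>/uid_map on stdin.
# Succeeds (exit 0) when the map is the single identity mapping, which means
# UID 0 in the process is real root on the host (no remapping).
uid_map_is_identity() {
  awk 'NF == 3 { n++; if (!($1 == 0 && $2 == 0 && $3 == 4294967295)) bad = 1 }
       END { exit !(n == 1 && !bad) }'
}

# Usage (1234 is a placeholder PID of a process inside the container):
#   if uid_map_is_identity < /proc/1234/uid_map; then
#     echo "no user namespace remapping: container root is host root"
#   fi
```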
Q: What's the difference between SELinux and AppArmor?
Answer:
| Aspect | SELinux | AppArmor |
|---|---|---|
| Model | Type Enforcement (labels on everything) | Path-based profiles (filenames) |
| Complexity | Complex, fine-grained, steep learning curve | Simpler, easier to learn and debug |
| Default distro | RHEL, Fedora, CentOS | Ubuntu, SUSE, Debian |
| Policy model | Default-deny (must have explicit allow rule) | Default-allow for unconfined programs |
| Learning | audit2allow generates rules from denials | aa-genprof generates profiles interactively |
| Container support | MCS labels isolate containers from each other | Per-container profiles |
Key difference: SELinux labels objects (files, processes) with security contexts. Policy rules define allowed interactions between types. A file moved to a different directory keeps its SELinux label. AppArmor uses pathnames: profiles define what paths a program can access. A file moved to a different path might gain or lose protection depending on the profile rules.
When to use which:
SELinux for high-security environments needing fine-grained control (government, finance)
AppArmor when simplicity and rapid deployment are preferred
Both provide strong security when properly configured
Q: How does seccomp-bpf protect containers?
Answer: The problem: Containers share the host kernel. A container could exploit kernel vulnerabilities via syscalls. Every syscall is an entry point into the kernel, and historically, many kernel CVEs are triggered by specific syscall sequences.
Seccomp-bpf solution: Filter syscalls before they execute:
Container runtime installs a BPF filter at container start
Every syscall is checked against the filter at syscall entry, before the kernel runs the syscall's handler
Dangerous syscalls are blocked (e.g., ptrace, mount, kexec_load)
Docker default profile blocks:
kexec_load - Replace running kernel (game over if allowed)
mount - Mount filesystems (escape container filesystem isolation)
ptrace - Debug/trace processes (read other containers’ memory)
Performance: Very low overhead — the BPF filter runs in kernel space, evaluated in nanoseconds per syscall with no context switches. It is effectively free compared to the cost of the syscall itself.
Q: What is capability-based security and why is it better than root/non-root?
Answer:

Traditional problem:
UID 0 = all privileges (approximately 40 distinct powers)
Regular user = almost no privileges
Programs needing one privilege (bind port 80) got ALL of root’s power
A compromised web server running as root could load kernel modules
Capabilities solution: Break root into ~40 discrete privileges:
CAP_NET_BIND_SERVICE - Bind to ports below 1024
CAP_SYS_ADMIN - Various admin tasks (too broad, avoid this one)
CAP_SYS_MODULE - Load kernel modules
etc.
Benefits:
Least privilege: Grant only what is needed, nothing more
Defense in depth: Compromised process has limited blast radius
Container isolation: Different containers get different capability sets
Auditable: getpcaps shows exactly what a process can do
Example: Web server needs only CAP_NET_BIND_SERVICE, not full root:
setcap cap_net_bind_service=+ep /usr/sbin/nginx
# Now nginx can bind port 80 without running as root
# If nginx is compromised, the attacker CANNOT: load kernel modules,
# read /etc/shadow, mount filesystems, or do anything else that root can do
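To illustrate the "auditable" point, here is a minimal sketch that decodes the `CapEff` bitmask found in `/proc/<pid>/status` into capability names, roughly what `getpcaps` or `capsh --decode` report. The bit positions follow the kernel's `linux/capability.h` numbering; the sample status text and the function names are made up for illustration.

```python
# Capability names indexed by bit position (CAP_CHOWN = 0 ... CAP_CHECKPOINT_RESTORE = 40).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask):
    """Return the names of all capabilities whose bit is set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


# Hypothetical excerpt of /proc/<pid>/status for a process holding one capability.
sample_status = "Name:\tnginx\nCapEff:\t0000000000000400\n"
nginx_caps = decode_caps(parse_cap_eff(sample_status))

# A mask often seen for a default (non-privileged) Docker container; illustrative only.
docker_default_caps = decode_caps(0x00000000A80425FB)
```

On a live system the input would come from reading `/proc/<pid>/status` for the process under inspection.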
A container in your Kubernetes cluster was compromised. Walk through the security layers that should have limited the blast radius, how each layer restricts the attacker, and how you would investigate what happened.
Strong Answer:
I would work through the security layers from outermost to innermost to understand what the attacker could and could not do, then investigate what actually happened.
Seccomp (layer 1): If the RuntimeDefault seccomp profile was applied, the attacker cannot call ptrace (cannot debug other containers’ processes), mount (cannot mount the host filesystem), kexec_load (cannot replace the kernel), or init_module (cannot load kernel modules). This eliminates the most common container escape techniques. I would check kubectl get pod -o yaml to verify the seccomp profile was actually applied. If seccompProfile is not set and the kubelet is not configured to apply RuntimeDefault by default (--seccomp-default), the pod ran unconfined and no seccomp filter was active.
Capabilities (layer 2): If drop: ALL was set with only specific capabilities added back, the attacker cannot perform privileged operations even though they may be root inside the container. Without CAP_SYS_ADMIN, they cannot call mount() or setns() to access other namespaces. Without CAP_NET_RAW, they cannot sniff network traffic. I would check the pod spec for the capabilities section and cross-reference with the container runtime’s default capability list (Docker grants 14 by default if you do not specify).
Namespace isolation (layer 3): The PID namespace means the attacker sees only their container’s processes. The network namespace means they only see their container’s network stack (though they may be able to reach other pods via the pod network if NetworkPolicies are not in place). The mount namespace means /proc and /sys show the container’s view, not the host’s. User namespaces (if enabled) mean that root inside the container maps to an unprivileged UID on the host.
LSM (layer 4): SELinux MCS labels (if enabled) prevent the container from accessing files belonging to other containers. Even if the attacker breaks out of mount namespace isolation, the SELinux label mismatch blocks access. AppArmor profiles restrict file access to paths explicitly listed in the profile.
For investigation: I would start with the audit log (ausearch -m avc for SELinux denials, dmesg | grep seccomp for seccomp blocks). These logs tell me what the attacker TRIED to do that was blocked. Then I would examine the container’s filesystem (if not read-only) for dropped tools or modified files. kubectl logs and container runtime logs show the initial compromise vector. For network-level investigation, I would check Cilium/Calico flow logs to see what connections the compromised pod made — did it try to reach the metadata service? The Kubernetes API? Other pods?
Follow-up: What if the pod was running as privileged: true?

Follow-up Answer:
A privileged container effectively disables ALL security layers: no seccomp filter, all capabilities granted (including CAP_SYS_ADMIN), access to all host devices via /dev, and no LSM confinement. The attacker has essentially root access on the host. They can mount the host filesystem (mount /dev/sda1 /mnt), read any file, load kernel modules, attach to any namespace (nsenter -t 1 -m -u -i -n -p), and compromise every container on the node. This is why privileged: true should NEVER be used in production. The only legitimate use cases are system-level DaemonSets (CNI plugins, node monitoring agents) that genuinely need host access, and even those should be scrutinized for whether they can use specific capabilities instead.
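The pod-spec checks mentioned in this answer (seccomp profile, dropped capabilities, the privileged flag) can be sketched as a small audit function. It assumes the input is the `spec` object from `kubectl get pod -o json` loaded as a Python dict; the function name, the finding messages, and the two sample specs are hypothetical.

```python
def audit_container(pod_spec, container):
    """Return a list of findings for one container of a pod spec dict."""
    pod_sc = pod_spec.get("securityContext") or {}
    sc = container.get("securityContext") or {}
    findings = []

    if sc.get("privileged"):
        findings.append("privileged: true disables the other layers")

    # A container-level seccompProfile overrides the pod-level one.
    seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
    if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
        findings.append("no seccomp profile set in the spec")

    caps = sc.get("capabilities") or {}
    if "ALL" not in [c.upper() for c in caps.get("drop", [])]:
        findings.append("capabilities are not dropped (runtime defaults apply)")
    for cap in caps.get("add", []):
        findings.append(f"capability added back: {cap}")
    return findings


# Hypothetical specs for illustration.
hardened = {
    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [{
        "name": "web",
        "securityContext": {
            "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
        },
    }],
}
risky = {"containers": [{"name": "debug", "securityContext": {"privileged": True}}]}

hardened_findings = audit_container(hardened, hardened["containers"][0])
risky_findings = audit_container(risky, risky["containers"][0])
```

A spec that only inherits node-level defaults would still be flagged here, since the sketch looks at the pod spec alone and knows nothing about kubelet configuration.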
Design a comprehensive security policy for a multi-tenant Kubernetes cluster where different teams run their workloads on shared infrastructure. How do you prevent one team from affecting another?
Strong Answer:
Multi-tenancy on shared Kubernetes infrastructure requires isolation at every layer: compute, network, storage, and the Kubernetes API itself. I would implement the following:
Namespace-level isolation: Each team gets a dedicated Kubernetes namespace. Apply Pod Security Standards at restricted level: kubectl label namespace team-a pod-security.kubernetes.io/enforce=restricted. This forces all pods to run as non-root, drop all capabilities (only NET_BIND_SERVICE may be added back), disallow privilege escalation, and set a RuntimeDefault or Localhost seccomp profile. A read-only root filesystem is not part of the restricted standard, so enforce it separately with an admission policy if you require it. Teams that need exceptions go through a review process.
Network isolation: Apply default-deny NetworkPolicies in every namespace. By default, pods in team-a’s namespace cannot communicate with pods in team-b’s namespace. Specific cross-namespace communication is explicitly allowed via policy. Use Cilium for L7-aware policies (allow HTTP GET to the API but block POST) and DNS-aware policies (allow connections to api.example.com but not arbitrary IPs).
Resource isolation: ResourceQuotas per namespace cap total CPU, memory, and storage. LimitRanges set per-pod defaults and maximums. This prevents one team from consuming all cluster resources. For performance isolation, use dedicated node pools with taints/tolerations for latency-sensitive workloads, preventing noisy neighbors.
RBAC (API-level isolation): Each team gets a Kubernetes Role, bound with a RoleBinding, scoped to their namespace. They can create/delete pods and services in their namespace but cannot access other namespaces, nodes, or cluster-level resources. Use ClusterRoleBindings sparingly. Audit the effective permissions with kubectl auth can-i --list --as=team-a-user -n team-a.
Image security: Enforce signed images with admission controllers (Sigstore/cosign, OPA Gatekeeper). Block latest tag usage. Scan all images for CVEs before allowing deployment. Restrict image sources to approved registries only.
Runtime security: Deploy Falco or Tetragon as a DaemonSet for runtime threat detection. Alert on anomalous behavior: unexpected process execution (shell in a web server container), unexpected network connections (outbound to unknown IPs), filesystem modifications in read-only containers, privilege escalation attempts.
SELinux/AppArmor: With SELinux, each namespace’s pods get a unique MCS category via seLinuxOptions in the pod security context. Pod A in team-a (s0:c1,c2) cannot access files created by pod B in team-b (s0:c3,c4) even if both run as the same UID and share a persistent volume.
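The MCS comparison in the last item can be sketched as a set check: access passes the MCS layer only if the process's categories include every category on the file. This is a simplified model that assumes all levels use the same sensitivity (`s0`) and comma-separated categories with no ranges such as `c0.c1023`. The function names are hypothetical, and real SELinux also applies type enforcement on top of this check.

```python
def parse_mcs(level):
    """Split a level like 's0:c1,c2' into (sensitivity, set of categories)."""
    sensitivity, _, categories = level.partition(":")
    return sensitivity, frozenset(c for c in categories.split(",") if c)


def mcs_allows(process_level, file_level):
    """True if the process level dominates the file level (same sensitivity,
    process categories are a superset of the file's categories)."""
    p_sens, p_cats = parse_mcs(process_level)
    f_sens, f_cats = parse_mcs(file_level)
    return p_sens == f_sens and p_cats >= f_cats


team_a_pod = "s0:c1,c2"
team_b_file = "s0:c3,c4"

cross_team = mcs_allows(team_a_pod, team_b_file)   # different category pairs
same_team = mcs_allows(team_a_pod, "s0:c1,c2")     # matching categories
uncategorized = mcs_allows(team_a_pod, "s0")       # file carries no categories
```

Because the decision ignores UIDs entirely, two pods running as the same UID are still separated as long as their category sets differ.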
Follow-up: A developer argues that all these restrictions slow down their development workflow. How do you balance security with developer experience?

Follow-up Answer:
I would create a tiered environment approach. Development namespaces run baseline Pod Security Standards (not restricted), allowing developers to iterate quickly without fighting security restrictions. Staging namespaces run restricted with the same policies as production. CI/CD pipelines automatically deploy to staging and reject promotions to production if the pod spec violates restricted policies. This way, developers discover security issues in staging (where they can fix them at their pace) rather than in production (where it is an emergency). I would also invest in self-service tooling: a Helm chart library with pre-hardened pod security contexts, a policy-as-code repository where teams can request exceptions with justification, and clear documentation explaining WHY each restriction exists and what the secure alternative is. When developers understand that drop: ALL protects their service from being used as a lateral movement pivot after another team’s container is compromised, they become allies rather than adversaries.
Explain how Linux capabilities interact with user namespaces in rootless containers. Why can a process be root inside a container but unprivileged on the host, and what are the security boundaries?
Strong Answer:
User namespaces create a mapping between UIDs inside the namespace and UIDs outside. When a process has UID 0 inside a user namespace, it has full capabilities WITHIN that namespace, but those capabilities are scoped to the namespace’s resources only. The kernel checks ns_capable() which verifies that the process has the capability in the correct namespace for the operation being attempted.
Here is the mechanism: when a rootless container starts, it calls clone(CLONE_NEWUSER), which creates a new user namespace. A UID mapping like 0 100000 65536 is then written to the process’s uid_map file under /proc (rootless runtimes typically delegate this to the setuid newuidmap helper, because an unprivileged process may only map its own UID), meaning UID 0 inside maps to UID 100000 outside (an unprivileged user), and UIDs 1-65535 inside map to 100001-165535 outside. Inside the namespace, the process has all capabilities in its effective set. It can call mount() and perform other privileged operations, but ONLY on resources owned by the namespace.
The security boundary is the namespace’s resource scope. CAP_NET_ADMIN inside a user namespace lets you configure the network stack of network namespaces owned by that user namespace, but NOT the host’s network stack. CAP_SYS_ADMIN lets you mount filesystems with restrictions (only certain fs types like tmpfs, proc, sysfs are allowed), but not mount raw block devices. CAP_MKNOD does not carry over: creating character or block device nodes requires CAP_MKNOD in the initial user namespace, so mknod for real devices fails with EPERM inside the container, and the device cgroup (or device filtering in cgroups v2) is a second barrier for containers that do have device nodes.
The critical invariant: a user namespace cannot grant privileges that its creator did not have in the parent namespace. If the parent process had no capabilities in the parent namespace, the child’s capabilities (even though they are “all” inside the new namespace) cannot affect anything outside the new namespace’s scope. The bounding set in the parent namespace remains the hard ceiling.
Practical security implications for rootless containers: the container’s “root” can install packages, bind to port 80 inside the container’s network namespace, and manage processes — all without any privilege on the host. If the container is compromised, the attacker has UID 100000 on the host (an unprivileged user) and cannot read /etc/shadow, load kernel modules, or affect any other container or the host.
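The UID translation described above can be sketched as a lookup over `uid_map` entries, where each line has the form `inside_start outside_start count`. The function names are hypothetical; the mapping string is the one used in this answer.

```python
def parse_uid_map(text):
    """Parse uid_map content into a list of (inside_start, outside_start, count)."""
    entries = []
    for line in text.strip().splitlines():
        inside, outside, count = (int(field) for field in line.split())
        entries.append((inside, outside, count))
    return entries


def to_host_uid(entries, uid):
    """Translate a UID inside the namespace to the UID seen on the host.

    Returns None for unmapped UIDs (the kernel presents those as the
    overflow UID, typically 65534).
    """
    for inside, outside, count in entries:
        if inside <= uid < inside + count:
            return outside + (uid - inside)
    return None


mapping = parse_uid_map("0 100000 65536")

container_root = to_host_uid(mapping, 0)       # root inside the container
last_mapped = to_host_uid(mapping, 65535)      # highest UID covered by the range
unmapped = to_host_uid(mapping, 65536)         # outside the mapped range
```

The same arithmetic applies to `gid_map`.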
Follow-up: What are the limitations of rootless containers that prevent some workloads from running?

Follow-up Answer:
Several operations are either impossible or require workarounds in rootless containers. First, network: rootless containers cannot create veth pairs or configure bridge networking directly because those operations require real CAP_NET_ADMIN in the initial namespace. Rootless Docker uses slirp4netns (userspace TCP/IP stack) or pasta for networking, which adds ~10-20% network latency overhead compared to bridge networking. Second, storage: rootless containers cannot use some storage drivers (devicemapper, native overlay on kernels below 5.11). Overlayfs in a user namespace was only supported starting kernel 5.11 with the metacopy and userxattr mount options. Third, cgroups v2 delegation: the systemd cgroup driver supports rootless delegation, but cgroups v1 does not. This means rootless containers on cgroups v1 systems cannot set memory limits on sub-containers. Fourth, certain syscalls like mknod for real devices, mount for block devices, and setxattr for security labels are restricted even inside the user namespace for safety reasons.
You need to implement a seccomp profile for a new microservice. The default Docker profile is too permissive, and you want a minimal allowlist. Walk through your methodology for building and testing a production seccomp profile.
Strong Answer:
I would use a four-phase approach: discover, build, test, and monitor.
Phase 1 - Discover: Run the application with SCMP_ACT_LOG as the default action in the seccomp profile. This allows all syscalls but logs every one. Simultaneously, run the application’s full test suite and exercise all code paths (including error paths, graceful shutdown, log rotation, etc.). Collect the syscall audit logs: grep type=SECCOMP /var/log/audit/audit.log | grep -o 'syscall=[0-9]*' | sort -u gives the set of syscall numbers the application used (the number is carried in the syscall= field, not in the last field of the record; ausyscall translates numbers to names). Alternatively, use strace -f -c ./myapp for a summary, or OCI runtime tools like oci-seccomp-bpf-hook which automatically generate seccomp profiles from observed behavior.
Phase 2 — Build: Start with an empty allowlist and add only the syscalls discovered in Phase 1. For syscalls with argument-level sensitivity (like clone, which should not be called with CLONE_NEWUSER or CLONE_NEWNS flags from inside a container), add argument filters: SCMP_CMP_MASKED_EQ to check specific flag bits. For the default action, I prefer SCMP_ACT_ERRNO(EPERM) over SCMP_ACT_KILL during rollout because it returns an error rather than killing the process, making it easier to discover missing syscalls. Switch to SCMP_ACT_KILL_PROCESS once the profile is validated.
Phase 3 — Test: Deploy the profile in a staging environment with the application under realistic load (not just unit tests — integration tests, performance tests, chaos tests). Monitor for two things: application errors (EPERM in logs indicates a missing syscall in the allowlist) and application functionality (all features work correctly). Run for at least one full application lifecycle including startup, steady state, graceful shutdown, and log rotation. Do not forget to test container restart, OOM recovery, and signal handling.
Phase 4 - Monitor: In production, keep the default action at SCMP_ACT_LOG for the first week, which lets syscalls outside the allowlist through but logs them. Monitor for unexpected syscall attempts: these could be legitimate code paths you missed in testing or could be actual attack attempts. After one week of clean logs, switch to SCMP_ACT_KILL_PROCESS for full enforcement. Keep the monitoring active permanently: any seccomp log entry after enforcement indicates either a gap in the profile (a legitimate code path that was missed) or an attack attempt, and both need investigation.
A practical shortcut for most teams: start with Docker’s default profile, which blocks the ~50 most dangerous syscalls. Only build a custom minimal profile for security-critical services or services exposed to untrusted input. The effort of maintaining a minimal profile (updating it with every dependency change) is significant and not always worth the marginal security improvement over the default.
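The discover and build phases can be sketched end to end: extract syscall numbers from `type=SECCOMP` audit records, then emit an allowlist in the JSON layout that Docker and other OCI runtimes accept. Several things here are assumptions made for illustration: the syscall table is a tiny x86_64 excerpt, the sample log lines are made up, and the function names are hypothetical. The `clone` rule follows the argument-filter idea from Phase 2: with `SCMP_CMP_MASKED_EQ`, `value` is the mask and `valueTwo` the expected result, so the call is allowed only when neither namespace flag is set.

```python
import json
import re

# Assumption: small excerpt of the x86_64 syscall table; a real tool needs the full table.
X86_64_SYSCALLS = {0: "read", 1: "write", 3: "close", 9: "mmap",
                   56: "clone", 231: "exit_group", 257: "openat"}

CLONE_NEWNS = 0x00020000
CLONE_NEWUSER = 0x10000000


def observed_syscalls(audit_lines):
    """Return (known syscall names, unknown syscall numbers) seen in SECCOMP records."""
    names, unknown = set(), set()
    for line in audit_lines:
        if "type=SECCOMP" not in line:
            continue
        match = re.search(r"\bsyscall=(\d+)", line)
        if match:
            number = int(match.group(1))
            if number in X86_64_SYSCALLS:
                names.add(X86_64_SYSCALLS[number])
            else:
                unknown.add(number)
    return names, unknown


def build_profile(names):
    """Build a default-deny profile dict that allows only the given syscalls."""
    rules = [{"names": sorted(n for n in names if n != "clone"),
              "action": "SCMP_ACT_ALLOW"}]
    if "clone" in names:
        rules.append({
            "names": ["clone"],
            "action": "SCMP_ACT_ALLOW",
            # Allow clone only if (flags & mask) == 0, i.e. no new user/mount namespace.
            "args": [{"index": 0, "value": CLONE_NEWNS | CLONE_NEWUSER,
                      "valueTwo": 0, "op": "SCMP_CMP_MASKED_EQ"}],
        })
    return {"defaultAction": "SCMP_ACT_ERRNO",
            "architectures": ["SCMP_ARCH_X86_64"],
            "syscalls": rules}


# Hypothetical audit records for illustration.
sample_log = [
    'type=SECCOMP msg=audit(1700000000.1:10): pid=42 comm="myapp" arch=c000003e syscall=257 compat=0 code=0x7ffc0000',
    'type=SECCOMP msg=audit(1700000000.2:11): pid=42 comm="myapp" arch=c000003e syscall=56 compat=0 code=0x7ffc0000',
    'type=SYSCALL msg=audit(1700000000.3:12): arch=c000003e syscall=1 success=yes',
    'type=SECCOMP msg=audit(1700000000.4:13): pid=42 comm="myapp" arch=c000003e syscall=999 compat=0 code=0x7ffc0000',
]

names, unknown = observed_syscalls(sample_log)
profile = build_profile(names)
profile_json = json.dumps(profile, indent=2)
```

Unknown numbers are returned separately rather than silently dropped, so a gap in the syscall table shows up during the build phase instead of as an EPERM in staging.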
Follow-up: How do you handle seccomp profiles when the application uses dynamic languages (Python, Node.js) that may invoke different syscalls depending on which code path is taken?

Follow-up Answer:
Dynamic languages are harder to profile because their syscall surface depends on which modules are loaded, which Python C extensions are called, and even which JIT paths the runtime takes. My approach changes in two ways: First, the discovery phase must be longer and more thorough. I would run the application for multiple days in SCMP_ACT_LOG mode under production-like traffic, not just test traffic, to capture rare code paths. I would also parse the application’s dependency tree to identify C extensions (which make direct syscalls) and research their syscall requirements. Second, I would use a slightly broader allowlist than for a static binary — including syscalls that the runtime might use for garbage collection (madvise, mprotect), JIT compilation (mmap with PROT_EXEC), and dynamic module loading (openat, mmap). The profile for a Python application might allow 150 syscalls versus 50 for a Go static binary, but it is still significantly smaller than the full ~400 available, eliminating the most dangerous attack surface.