Linux security is multi-layered. Understanding these mechanisms is essential for infrastructure engineers building secure container platforms and debugging permission issues.
Interview Frequency: High (especially for container/cloud roles)
Key Topics: LSM framework, capabilities, seccomp-bpf, SELinux/AppArmor
Time to Master: 12-14 hours
┌─────────────────────────────────────────────────────────────────────────────┐
│                               CAPABILITY SETS                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Process capability sets:                                                   │
│                                                                             │
│  EFFECTIVE (CapEff)                                                         │
│  └── Capabilities the kernel actually checks                                │
│  └── This is what matters for permission checks                            │
│                                                                             │
│  PERMITTED (CapPrm)                                                         │
│  └── Maximum capabilities this process CAN have                            │
│  └── Can add caps from permitted to effective                              │
│  └── Once dropped from permitted, cannot be regained                       │
│                                                                             │
│  INHERITABLE (CapInh)                                                       │
│  └── Capabilities preserved across execve()                                 │
│  └── Combined with file capabilities on exec                                │
│                                                                             │
│  BOUNDING (CapBnd)                                                          │
│  └── Limit on capabilities that can ever be gained                         │
│  └── Dropping is irreversible                                               │
│                                                                             │
│  AMBIENT (CapAmb)                                                           │
│  └── Capabilities preserved across exec of unprivileged programs           │
│  └── Subset of permitted ∩ inheritable                                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
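How these sets combine when a process calls execve() is easier to see as arithmetic. The sketch below applies the transformation rules from the capabilities(7) man page to hex masks. The function name exec_caps and its argument order are made up for illustration, and it assumes a simplified world: any file carrying capability bits counts as privileged, while set-user-ID binaries and the special handling of root are ignored.

```shell
# exec_caps P_INH P_BND P_AMB F_INH F_PRM F_EFF
# P_* are the process masks before exec, F_* the file capability masks
# (hex, no 0x prefix). F_EFF is the file's single effective bit: 0 or 1.
exec_caps() (
    p_inh=$((0x$1)); p_bnd=$((0x$2)); p_amb=$((0x$3))
    f_inh=$((0x$4)); f_prm=$((0x$5)); f_eff=$6

    # Assumption: a file carrying any capability bits is "privileged",
    # which clears the ambient set.
    if [ $((f_inh | f_prm)) -ne 0 ]; then p_amb=0; fi

    # P'(permitted) = (P(inh) & F(inh)) | (F(prm) & P(bnd)) | P'(ambient)
    new_prm=$(( (p_inh & f_inh) | (f_prm & p_bnd) | p_amb ))

    # P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
    if [ "$f_eff" -eq 1 ]; then new_eff=$new_prm; else new_eff=$p_amb; fi

    printf 'CapPrm=%016x CapEff=%016x CapAmb=%016x\n' "$new_prm" "$new_eff" "$p_amb"
)

# Unprivileged process executing a binary marked cap_net_bind_service=+ep
# (bit 10 = 0x400): the capability lands in permitted and effective.
exec_caps 0 3fffffffff 0 0 400 1

# Process holding bit 10 as an ambient capability executing a plain binary:
# the ambient set carries it into permitted and effective.
exec_caps 400 3fffffffff 400 0 0 0
```

The second call shows why ambient capabilities exist: without them, an unprivileged binary with no file capabilities would come out of execve() with empty permitted and effective sets.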
    # View process capabilities
    grep Cap /proc/$$/status
    # CapInh: 0000000000000000
    # CapPrm: 0000003fffffffff
    # CapEff: 0000003fffffffff
    # CapBnd: 0000003fffffffff
    # CapAmb: 0000000000000000

    # Decode capability bits
    capsh --decode=0000003fffffffff

    # View capabilities of a process
    getpcaps $$

    # Set file capabilities (run as non-root but with cap)
    sudo setcap cap_net_bind_service=+ep /path/to/binary

    # View file capabilities
    getcap /path/to/binary

    # Run with specific capabilities
    capsh --caps="cap_net_bind_service+eip" -- -c "./my_server"

    # Drop all capabilities except specific ones
    capsh --drop=all --caps="cap_net_raw+eip" -- -c "./my_program"
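When capsh is not installed (minimal container images often lack it), the mask can be decoded by hand, because each bit position is one capability number from include/uapi/linux/capability.h. A minimal sketch; decode_caps is a hypothetical helper, and the name table covers capabilities 0 through 40 as found in recent kernels.

```shell
# Capability names indexed by bit number (bit 0 first).
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print one cap_<name> line per bit set in the mask.
decode_caps() (
    mask=$((0x$1))
    bit=0
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "cap_$name"
        fi
        bit=$((bit + 1))
    done
)

# Bit 10 only, as set by the setcap example above.
decode_caps 0000000000000400

# Decode the effective set of the current shell, if /proc is available.
if [ -r /proc/self/status ]; then
    decode_caps "$(awk '/^CapEff:/ {print $2}' /proc/self/status)"
fi
```

The sample mask 0000003fffffffff from the listing has bits 0 through 37 set, so it decodes to every capability up to cap_audit_read.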
    // include/linux/lsm_hooks.h (partial list; recent kernels generate these
    // declarations from include/linux/lsm_hook_defs.h)
    union security_list_options {
        // Credential operations
        int (*cred_prepare)(struct cred *new, const struct cred *old, gfp_t gfp);

        // File operations
        int (*file_permission)(struct file *file, int mask);
        int (*file_open)(struct file *file);
        int (*file_mprotect)(struct vm_area_struct *vma, unsigned long reqprot,
                             unsigned long prot);

        // Process operations (task_alloc replaced the older task_create hook)
        int (*task_alloc)(struct task_struct *task, unsigned long clone_flags);
        int (*task_kill)(struct task_struct *p, struct kernel_siginfo *info,
                         int sig, const struct cred *cred);

        // Socket operations
        int (*socket_create)(int family, int type, int protocol, int kern);
        int (*socket_bind)(struct socket *sock, struct sockaddr *address, int addrlen);
        int (*socket_connect)(struct socket *sock, struct sockaddr *address, int addrlen);

        // And many more...
    };
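To see which modules are registered against these hooks on a running system, the kernel exposes a comma-separated list in /sys/kernel/security/lsm. This assumes securityfs is mounted on the host. A small sketch, with split_lsm_list as an illustrative helper name:

```shell
# split_lsm_list LIST: print one LSM name per line from a comma-separated list.
split_lsm_list() {
    printf '%s\n' "$1" | tr ',' '\n'
}

if [ -r /sys/kernel/security/lsm ]; then
    # Live list from the running kernel.
    split_lsm_list "$(cat /sys/kernel/security/lsm)"
else
    # Fallback with a made-up sample so the sketch still runs elsewhere.
    split_lsm_list "capability,yama,apparmor"
fi
```

The capability module normally appears in this list alongside SELinux or AppArmor, which is why a capability check and a MAC check can both apply to the same operation.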
    # List all booleans
    getsebool -a

    # Common examples
    setsebool -P httpd_can_network_connect on
    setsebool -P container_manage_cgroup on

    # View what booleans affect a domain
    semanage boolean -l | grep httpd
    # Run with minimal capabilities
    docker run --cap-drop=all --cap-add=net_bind_service nginx

    # No new privileges (block setuid)
    docker run --security-opt=no-new-privileges:true myapp

    # Custom seccomp profile
    docker run --security-opt seccomp=profile.json myapp

    # Custom AppArmor profile
    docker run --security-opt apparmor=my-profile myapp

    # Read-only rootfs
    docker run --read-only --tmpfs /tmp myapp

    # Run as non-root
    docker run --user 1000:1000 myapp

    # Completely unprivileged (rootless Docker)
    # Docker daemon runs as non-root user
    # Check what capabilities a process has
    grep Cap /proc/<pid>/status

    # Check what capability is needed
    # Look for "Operation not permitted" in strace
    strace -f ./myapp 2>&1 | grep EPERM

    # Common: check if binary has file capabilities
    getcap /path/to/binary

    # Add file capability
    sudo setcap cap_net_bind_service=+ep /path/to/binary
    # Check for denials
    ausearch -m avc -ts recent

    # Example denial:
    # type=AVC msg=audit(...): avc: denied { read } for pid=1234
    #   comm="myapp" name="secret" dev="sda1" ino=12345
    #   scontext=system_u:system_r:httpd_t:s0
    #   tcontext=system_u:object_r:admin_home_t:s0
    #   tclass=file

    # What this means:
    # A process with type httpd_t tried to read a file with type admin_home_t
    # There's no allow rule for this

    # Generate and install policy to allow
    ausearch -m avc -ts recent | audit2allow -M myfix
    semodule -i myfix.pp

    # Or fix by changing file context
    # (chcon does not survive a relabel; use semanage fcontext + restorecon to persist)
    chcon -t httpd_sys_content_t /path/to/file
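Reading a denial comes down to four fields: the permission in braces, the type in scontext, the type in tcontext, and tclass. The sketch below pulls them out of a single record with sed. summarize_avc is a hypothetical helper, and the sample line is the denial from the listing joined onto one line, with its elided audit(...) timestamp left as is.

```shell
# summarize_avc LINE: print a one-line summary of an AVC denial record.
summarize_avc() (
    line=$1
    perms=$(printf '%s\n' "$line" | sed -n 's/.*{ \([^}]*\) }.*/\1/p')
    comm=$(printf '%s\n' "$line" | sed -n 's/.*comm="\([^"]*\)".*/\1/p')
    # The type is the third colon-separated field of each context.
    src=$(printf '%s\n' "$line" | sed -n 's/.*scontext=[^:]*:[^:]*:\([^: ]*\).*/\1/p')
    tgt=$(printf '%s\n' "$line" | sed -n 's/.*tcontext=[^:]*:[^:]*:\([^: ]*\).*/\1/p')
    class=$(printf '%s\n' "$line" | sed -n 's/.*tclass=\([a-z_0-9]*\).*/\1/p')
    echo "$comm ($src) denied { $perms } on $class labeled $tgt"
)

sample='type=AVC msg=audit(...): avc: denied { read } for pid=1234 comm="myapp" name="secret" dev="sda1" ino=12345 scontext=system_u:system_r:httpd_t:s0 tcontext=system_u:object_r:admin_home_t:s0 tclass=file'
summarize_avc "$sample"
```

A summary like this makes it easier to decide between the two fixes: if the target label is simply wrong for the file, relabel it; only write new policy when the access itself is legitimate.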
    # Check for seccomp kills (the kernel logs them as audit records of type 1326)
    dmesg | grep -E 'seccomp|type=1326'

    # Run with tracing to see what's blocked
    strace -f ./myapp 2>&1 | head -100

    # Use audit mode in seccomp profile to log without blocking
    # "defaultAction": "SCMP_ACT_LOG"
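As a sketch of the audit-mode approach, the snippet below writes a minimal profile whose only setting is the SCMP_ACT_LOG default action, then prints the docker command that would use it. The file location and the myapp image name are placeholders, and it assumes a kernel and libseccomp recent enough to support the log action (Linux 4.14 or later).

```shell
# Write a minimal audit-mode profile: every syscall is allowed but logged.
profile="${TMPDIR:-/tmp}/seccomp-log.json"
cat > "$profile" <<'EOF'
{
  "defaultAction": "SCMP_ACT_LOG"
}
EOF

# Print the command rather than running it, so the sketch works without Docker.
echo "docker run --security-opt seccomp=$profile myapp"
```

After exercising the application under this profile, the logged syscall numbers give a starting point for an allowlist. Treat that list as incomplete until error paths and rarely used features have been exercised too.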
Q: How do you drop privileges in a containerized application?
Answer:
Multiple layers:
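User namespace: Map container UID 0 to an unprivileged host UID. In Docker this is a daemon-level setting; passing --userns=host to docker run disables the remapping for that container, so leave it off.
    # Enable remapping on the daemon (or set "userns-remap" in /etc/docker/daemon.json)
    dockerd --userns-remap=default
    # Additionally run the container process as a non-root user
    docker run --user 1000:1000 myapp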
Capabilities: Drop all, add only needed
    docker run --cap-drop=all --cap-add=net_bind_service myapp
No new privileges: Prevent setuid escalation
    docker run --security-opt=no-new-privileges:true myapp
Seccomp: Filter dangerous syscalls
Read-only rootfs: Prevent persistence
In application code:
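    // Drop all capabilities programmatically (libcap; link with -lcap)
    #include <sys/capability.h>

    cap_t caps = cap_init();   // cap_init() returns an empty capability set
    cap_set_proc(caps);        // apply it to the calling process
    cap_free(caps);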
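The runtime layers above can be combined in a single invocation. The sketch below only prints the command, so it runs without Docker installed. hardened_run is a hypothetical wrapper, and the flag values are the ones used elsewhere in this guide.

```shell
# hardened_run IMAGE: print a docker run command that stacks the layers
# discussed above (capabilities, no-new-privileges, read-only rootfs, non-root).
hardened_run() {
    image=$1
    echo docker run \
        --cap-drop=all --cap-add=net_bind_service \
        --security-opt=no-new-privileges:true \
        --read-only --tmpfs /tmp \
        --user 1000:1000 \
        "$image"
}

hardened_run myapp
```

One caveat worth checking in your own setup: with a non-root --user, an added capability typically ends up in the bounding set rather than the effective set, so the binary may still need a file capability (as in the setcap example) before it can use it.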
Q: What's the difference between SELinux and AppArmor?
Answer:
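| Aspect     | SELinux                                         | AppArmor                                                    |
|------------|-------------------------------------------------|-------------------------------------------------------------|
| Model      | Type Enforcement (labels)                       | Path-based profiles                                         |
| Complexity | Complex, fine-grained                           | Simpler, easier to learn                                    |
| Default on | RHEL, Fedora, CentOS                            | Ubuntu, SUSE                                                |
| Policy     | Default deny: every access needs an allow rule  | Only profiled programs are confined; others run unconfined  |
| Learning   | audit2allow helps                               | aa-genprof helps                                            |
| Containers | Full support                                    | Full support                                                |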
Key difference: SELinux labels objects (files, processes) with security contexts, and policy rules define the allowed interactions between types. AppArmor works on pathnames: each profile defines which paths and capabilities a program can access.

When to use which:
SELinux for high-security environments needing fine-grained control
AppArmor when simplicity is preferred
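The label-versus-path distinction becomes concrete once you look at a label. An SELinux security context has the form user:role:type:level, and type enforcement rules are written against the third field. A small sketch that splits one apart (split_context is an illustrative name, not an SELinux tool):

```shell
# split_context CONTEXT: break an SELinux context into its four fields.
# Anything after the third colon (for example an MLS range with categories)
# stays together in the level field.
split_context() (
    IFS=: read -r user role type level <<EOF
$1
EOF
    echo "user=$user role=$role type=$type level=$level"
)

split_context system_u:system_r:httpd_t:s0
split_context system_u:object_r:admin_home_t:s0
```

On an SELinux host, ls -Z and ps -eZ print these contexts for files and processes. An AppArmor profile has no equivalent field, since it names the file paths directly.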
Q: How does seccomp-bpf protect containers?
Answer:
The problem: Containers share the host kernel. A container could exploit kernel vulnerabilities via syscalls.
Seccomp-bpf solution: Filter syscalls before they execute:
Container runtime installs BPF filter at container start
Every syscall is checked against filter
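Dangerous syscalls are blocked (e.g., kexec_load, mount, init_module)
Docker default profile is an allowlist. Syscalls it leaves out include:
kexec_load - Replace the running kernel
mount - Mount filesystems (allowed only when the container has CAP_SYS_ADMIN)
ptrace - Debug/trace processes (blocked only on kernels older than 4.8)
create_module - Load kernel modules
init_module - Load kernel modules
And several dozen more (the Docker documentation cites around 44 in total)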
Performance: Very low overhead. The BPF filter runs inside the kernel at syscall entry, so filtering adds no extra context switch.
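Whether a filter is actually installed on a process can be read from the Seccomp field of /proc/<pid>/status, documented in proc(5): 0 means disabled, 1 strict mode, 2 filter mode. A small sketch, with seccomp_mode_name as an illustrative helper:

```shell
# seccomp_mode_name N: translate the numeric Seccomp field into a word.
seccomp_mode_name() {
    case "$1" in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}

# Report the mode of the current shell when /proc is available.
if [ -r /proc/self/status ]; then
    mode=$(awk '/^Seccomp:/ {print $2}' /proc/self/status)
    echo "seccomp mode: $(seccomp_mode_name "$mode")"
fi
```

Run inside a container started with the default Docker profile, this is expected to report filter; on a plain host shell it usually reports disabled.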
Q: What is capability-based security and why is it better than root/non-root?
Answer:
Traditional problem:
UID 0 = all privileges
Regular user = almost no privileges
Programs needing one privilege (bind port 80) got all of root’s power
Capabilities solution: Break root into ~40 discrete privileges:
CAP_NET_BIND_SERVICE - Bind to ports < 1024
CAP_SYS_ADMIN - Broad catch-all for administrative operations (mount, swapon, and many more)
etc.
Benefits:
Least privilege: Grant only what’s needed
Defense in depth: Compromised process has limited power
Container isolation: Different containers get different caps
Example: Web server needs only CAP_NET_BIND_SERVICE, not full root:
    setcap cap_net_bind_service=+ep /usr/sbin/nginx
    # Now nginx can bind port 80 without running as root