Modern Operating System Features
Modern operating systems continue to evolve with new abstractions and optimizations. Understanding these features is essential for building high-performance systems and for doing well in senior-level interviews.
Key Topics: io_uring, eBPF, modern schedulers, memory tiering
Time to Master: 15-20 hours
io_uring: Modern Async I/O
io_uring (added in Linux 5.1) is an asynchronous I/O interface that provides high-performance, low-latency I/O by amortizing system calls across many operations or, in polled mode, avoiding them on the hot path. Think of traditional I/O like ordering food at a restaurant where you have to walk to the kitchen for every dish. io_uring is like having a conveyor belt between your table and the kitchen: you place orders on one belt, and finished dishes arrive on another, all without leaving your seat.
Why io_uring?
io_uring Architecture
io_uring Example
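The sketch below is a pure-Python model of the two-ring idea, not the liburing API: one ring carries submissions, the other carries completions, and a single "enter" call drains everything queued. The names `Ring` and `ToyKernel` and the fake result value are assumptions made for illustration. The completion ring is sized at twice the submission ring, mirroring the kernel's default.

```python
from dataclasses import dataclass, field


@dataclass
class Ring:
    """Single-producer/single-consumer ring with free-running indices."""
    entries: int                      # must be a power of two
    head: int = 0                     # advanced by the consumer
    tail: int = 0                     # advanced by the producer
    slots: list = field(default_factory=list)

    def __post_init__(self):
        if self.entries <= 0 or self.entries & (self.entries - 1):
            raise ValueError("entries must be a power of two")
        self.slots = [None] * self.entries
        self.mask = self.entries - 1  # index wrap-around via bit mask

    def push(self, item) -> bool:
        if self.tail - self.head == self.entries:   # ring is full
            return False
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:                  # ring is empty
            return None
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


class ToyKernel:
    """Stands in for the kernel side; counts enter() calls like syscalls."""

    def __init__(self, sq: Ring, cq: Ring):
        self.sq, self.cq, self.enter_calls = sq, cq, 0

    def enter(self) -> int:
        """Consume every queued submission and post one completion for each."""
        self.enter_calls += 1
        done = 0
        while (sqe := self.sq.pop()) is not None:
            op, user_data = sqe
            self.cq.push((user_data, len(op)))      # fake result for the model
            done += 1
        return done
```

Queueing eight `("read", i)` entries into `Ring(8)` and calling `enter()` once yields eight completions for a single simulated syscall. A ninth `push` before the drain returns `False`, which is the backpressure signal the application has to handle.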
io_uring Operations
eBPF: Programmable Kernel
eBPF (extended Berkeley Packet Filter) allows running sandboxed programs in the kernel without changing kernel code or loading modules. If the kernel is a fortified castle, eBPF is a safe, inspected drone that you can send inside the walls to observe and report back, without ever opening the gates to untrusted code. The verifier acts as the castle guard, ensuring every drone program terminates, never accesses invalid memory, and cannot crash the host.
eBPF Architecture
eBPF Use Cases
eBPF Example
eBPF Maps
Modern Schedulers
EEVDF (Earliest Eligible Virtual Deadline First)
Linux 6.6 replaced CFS with EEVDF as the scheduler for normal (non-realtime) tasks. The core insight: CFS was “fair on average” but could produce short-term latency spikes for tasks that needed quick bursts of CPU. EEVDF adds a deadline concept so that latency-sensitive tasks (like audio or UI rendering) get their time slice when they actually need it, not just eventually.
Scheduler Classes
Core Scheduling
To defend against cross-hyperthread side channels (Spectre-class attacks, L1TF, MDS), run only mutually trusting code together on SMT siblings. SMT (Simultaneous Multi-Threading, marketed by Intel as “Hyper-Threading”) shares execution resources (ALUs, caches, branch predictors) between the logical CPUs of one physical core. This sharing creates side-channel leakage: a malicious thread can probe the branch predictor or cache to extract secrets from its sibling. Core Scheduling solves this by ensuring that only threads belonging to the same trust domain (same “cookie”) run simultaneously on SMT siblings.
Memory Management Advances
Memory Tiering
MGLRU (Multi-Gen LRU)
Linux 6.1 introduced the Multi-Gen LRU (MGLRU) for better page reclaim.
Filesystem Innovations
SMR and ZNS Support
FUSE and Filesystem in Userspace
Security Features
Landlock
Landlock is a Linux Security Module for unprivileged sandboxing. Unlike seccomp (which filters syscalls) or namespaces (which isolate resources), Landlock lets a regular, unprivileged process voluntarily restrict its own filesystem access, similar to how a mobile app declares which folders it needs. This is defense-in-depth: even if an attacker exploits a bug in your application, Landlock limits what damage they can do. Practical tip: Landlock is especially useful for processing untrusted input (PDF rendering, image conversion). Sandbox the worker before it touches the data.
io_uring Security
Interview Questions
What is io_uring and why is it faster than traditional I/O?
- Shared memory rings: Submission and completion entries are not copied between user and kernel space
- Batched submissions: Multiple ops per syscall
- Polling mode: Zero syscalls possible (SQPOLL)
- Zero-copy paths: Pre-registered buffers avoid per-operation page pinning
- Async everything: File, network, timers all async
- Syscall overhead eliminated or amortized
- Better cache utilization
- Reported 4-10x IOPS improvement for NVMe (depends on workload and queue depth)
- High-performance servers
- Storage-intensive applications
- When syscall overhead matters
Explain eBPF and its use cases
- No kernel recompilation
- No kernel modules
- Safe: Verified before execution
- Fast: JIT compiled
- Observability: Trace syscalls, functions, performance
- Networking: XDP for fast packet processing, load balancing
- Security: Runtime threat detection, syscall filtering
- Profiling: CPU, memory, off-CPU analysis
- Write in C or bpftrace
- Compile to BPF bytecode
- Kernel verifier checks safety
- JIT compiles to native code
- Attach to hook points (kprobes, tracepoints, XDP)
What is XDP and when would you use it?
- XDP_DROP: Drop packet (fastest firewall)
- XDP_PASS: Continue to network stack
- XDP_TX: Send back out same interface
- XDP_REDIRECT: Send to different interface/CPU
- Millions of packets per second per core
- 10-100x faster than iptables for simple filtering
- DDoS mitigation (drop bad traffic early)
- Load balancing (Facebook’s Katran)
- Traffic filtering
- Packet modification
- Limited to packet processing
- Must handle raw packets
- Driver support required for best performance
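To make the XDP verdicts concrete, here is a user-space Python model of a drop filter. It is not a real XDP program (those are written in restricted C and loaded through the kernel), and `toy_xdp_filter` and its blocklist parameter are hypothetical. The numeric action values match `enum xdp_action` in `<linux/bpf.h>`, and the length checks stand in for the `data_end` bounds checks a real program must perform before touching packet bytes.

```python
import struct
from enum import IntEnum


class XdpAction(IntEnum):
    # Numeric values match enum xdp_action in <linux/bpf.h>.
    XDP_ABORTED = 0
    XDP_DROP = 1
    XDP_PASS = 2
    XDP_TX = 3
    XDP_REDIRECT = 4


ETH_HLEN = 14          # destination MAC + source MAC + EtherType
ETH_P_IP = 0x0800      # EtherType for IPv4
IPV4_MIN_HLEN = 20     # IPv4 header without options
IPPROTO_UDP = 17


def toy_xdp_filter(frame: bytes, blocked_src: set) -> XdpAction:
    """Drop IPv4/UDP frames whose source address is in blocked_src."""
    # Bounds check before reading the Ethernet header.
    if len(frame) < ETH_HLEN:
        return XdpAction.XDP_PASS
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XdpAction.XDP_PASS
    # Bounds check before reading the IPv4 header.
    if len(frame) < ETH_HLEN + IPV4_MIN_HLEN:
        return XdpAction.XDP_PASS
    protocol = frame[ETH_HLEN + 9]               # IPv4 protocol field
    src = frame[ETH_HLEN + 12:ETH_HLEN + 16]     # IPv4 source address
    if protocol == IPPROTO_UDP and src in blocked_src:
        return XdpAction.XDP_DROP
    return XdpAction.XDP_PASS
```

Anything the filter does not understand is passed up the stack rather than dropped. That is a common conservative default for filtering programs, chosen here as a design assumption rather than a requirement of XDP.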
Production Caveats and Patterns
The features in this chapter are powerful and dangerous. They are powerful because they unlock performance and observability that were impossible a few years ago; dangerous because they expose new attack surfaces, lifecycle complexity, and footguns that production teams keep stepping on.
Summary
Interview Deep-Dive
Your team is building a high-throughput storage service that needs to sustain 1M+ IOPS on NVMe drives. An engineer proposes replacing epoll with io_uring. Walk through why io_uring would help, what the migration risks are, and what security considerations you would raise.
- Why io_uring helps: The submission queue (SQ) and completion queue (CQ) are shared memory ring buffers between user space and kernel. You can batch 32 or 64 I/O operations into a single io_uring_enter() call, or in SQPOLL mode, eliminate submission syscalls on the hot path because a kernel thread polls the SQ autonomously. At 1M IOPS, SQPOLL means the kernel thread continuously drains SQEs without any user-to-kernel transition. The commonly cited result is a 4-10x IOPS improvement over epoll+read for NVMe workloads, which you should confirm on your own hardware.
- Pre-registered buffers and files: io_uring_register_buffers() and io_uring_register_files() let you pre-register memory regions and file descriptors. This skips per-operation validation (fget/fput for fd lookup, get_user_pages for buffer pinning) and enables true zero-copy paths.
- Error handling changes: With synchronous I/O, errors are returned inline. With io_uring, errors appear asynchronously in CQEs. Your entire error-handling architecture needs redesigning.
- Backpressure: If the application submits faster than the device can complete, the SQ fills up. You need to handle io_uring_get_sqe() returning NULL gracefully.
- Debugging difficulty: Async I/O is harder to trace. Standard strace shows the io_uring_enter() calls but not the individual operations queued in the ring. You need bpftrace or perf to observe the ring buffer activity.
- io_uring has been a major attack surface: Multiple CVEs in 2021-2023 (privilege escalation, use-after-free in registered buffers). Google has made io_uring unreachable to Android apps via seccomp and disabled it on ChromeOS and on its production servers. Some container runtimes block io_uring_setup via seccomp by default.
- SQPOLL runs a kernel thread with user-specified parameters: This thread runs at kernel privilege continuously. If you enable it, ensure the process is trusted and resource-limited via cgroups.
- Mitigation: Use IORING_REGISTER_RESTRICTIONS to limit which operations the ring can perform. In containerized environments, audit whether your seccomp profile allows io_uring_setup and io_uring_enter.
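The error-handling change mentioned above can be sketched as follows. In io_uring, each completion carries the `user_data` value from its submission and a `res` field that holds either a non-negative result or a negated errno. The Python model below only illustrates that dispatch pattern. The `Cqe` class, the `inflight` bookkeeping dictionary, and the choice of which errors count as retryable are assumptions for the example.

```python
import errno
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Cqe:
    user_data: int   # copied from the matching submission
    res: int         # >= 0: result (e.g. bytes transferred); < 0: -errno


def handle_completions(cqes, inflight):
    """Split completions into successes, retryable failures, and hard failures.

    `inflight` maps user_data -> request description (hypothetical bookkeeping).
    Entries are removed as their completions are processed.
    """
    ok, retry, failed = [], [], []
    for cqe in cqes:
        request = inflight.pop(cqe.user_data)
        if cqe.res >= 0:
            ok.append((request, cqe.res))
        elif -cqe.res in (errno.EAGAIN, errno.EINTR):
            retry.append(request)                 # assumed safe to resubmit
        else:
            failed.append((request, os.strerror(-cqe.res)))
    return ok, retry, failed
```

Because the error arrives after the submitting code has moved on, the `inflight` table is what ties a failure back to the request that caused it. Any entries left in the table after draining the CQ are operations still pending.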
Explain how eBPF verification works. What guarantees does the verifier provide, and what are its limitations?
- Termination: The program must provably terminate. Originally this meant no loops at all. Since Linux 5.3, bounded loops are allowed if the verifier can prove a finite upper bound on iterations (e.g., a loop with a counter that decrements to zero). The verifier tracks the possible range of every register and proves the loop bound statically.
- Memory safety: Every memory access must be within bounds. The verifier tracks the type and valid range of every register. If you have a pointer to an sk_buff data region, the verifier knows the bounds (ctx->data to ctx->data_end) and rejects any access that could go out of bounds. This is why you see those if ((void *)(eth + 1) > data_end) return XDP_PASS; checks everywhere: the verifier requires them.
- No invalid pointers: You cannot dereference a register that might be NULL without checking first. The verifier tracks NULL-ability through conditional branches.
- Stack safety: The eBPF stack is limited to 512 bytes. The verifier ensures no stack overflow.
- Helper function safety: eBPF programs can only call pre-approved kernel helper functions (bpf_map_lookup_elem, bpf_probe_read, etc.). Each helper has a type signature the verifier checks.
- Complexity limit: The verifier has a maximum instruction limit it will analyze (1 million verified instructions as of recent kernels). Complex programs with many branches can hit this limit and be rejected even if they are correct. This forces you to simplify control flow or split programs.
- Conservative analysis: The verifier over-approximates. It may reject a program that is actually safe because it cannot prove the safety statically. For example, a perfectly safe pointer arithmetic expression might be rejected because the verifier loses track of the value range through a complex series of operations.
- No floating point: eBPF has no floating-point support. All math is integer.
- Limited data structures: You cannot dynamically allocate memory inside an eBPF program. All data sharing goes through pre-defined BPF maps.
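The memory-safety rule can be illustrated with a deliberately tiny model. This is not how the real verifier is implemented (it tracks register types and value ranges across every branch). It is a sketch, with an invented two-instruction language, of the core idea that a packet load is accepted only if an earlier bounds check has proven those bytes lie before `data_end`.

```python
def toy_verify(program):
    """Return (accepted, reason) for a straight-line list of toy instructions.

    Hypothetical instruction forms:
      ("require", n)        models `if (data + n > data_end) return ...;`
                            so the next instructions may assume n bytes exist
      ("load", offset, size) models reading `size` bytes at data + offset
    """
    proven = 0   # bytes from `data` proven to lie before `data_end`
    for index, insn in enumerate(program):
        kind = insn[0]
        if kind == "require":
            proven = max(proven, insn[1])
        elif kind == "load":
            _, offset, size = insn
            if offset < 0 or offset + size > proven:
                return False, (
                    f"insn {index}: load of {size} bytes at offset "
                    f"{offset} not proven in bounds"
                )
        else:
            return False, f"insn {index}: unknown instruction {kind!r}"
    return True, "ok"
```

For example, `[("require", 14), ("load", 12, 2)]` is accepted because the EtherType read sits inside the 14 bytes already checked. The same load without the preceding check is rejected even if every real packet would be long enough, which is the conservative behavior described above.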
bpf() syscall. For concurrency: per-CPU maps (BPF_MAP_TYPE_PERCPU_HASH, BPF_MAP_TYPE_PERCPU_ARRAY) eliminate locking entirely because each CPU has its own copy, which is ideal for counters and statistics. Regular maps use RCU internally for reads and spinlocks for writes, so readers never block. Ring buffers (BPF_MAP_TYPE_RINGBUF, added in Linux 5.8) are the preferred way to stream events from kernel to user space because they support variable-length records and are more efficient than the older perf buffer approach. The gotcha is that if user space cannot drain the ring buffer fast enough, new events are dropped: the reservation in the BPF program fails, and nothing reports it unless the program counts those failures itself, so keep a drop counter and monitor it.
Linux 6.6 replaced CFS with EEVDF. What specific problem with CFS motivated this change, and how does EEVDF's deadline mechanism solve it?
CFS tracks each task's vruntime (the CPU time consumed, weighted by priority) and always picks the task with the lowest vruntime to run next. Over long periods, this is mathematically fair. The problem is short-term unfairness, which EEVDF measures explicitly as “lag.”
- The CFS problem: Imagine two tasks, A (interactive, short bursts) and B (CPU-bound, long bursts). Task A sleeps for 50ms, wakes up, and needs 2ms of CPU. Under CFS, A’s vruntime fell behind while it slept, so it gets picked next, which is good. But if 20 tasks all wake up simultaneously (common in event-driven servers), CFS picks them in vruntime order, and the last few tasks might wait 10-20ms even though they only need a 1ms burst. CFS has no notion of urgency or deadline; it only knows “who has consumed the least so far.”
- Latency spikes for short tasks: Audio processing, UI rendering, and network packet handling are highly sensitive to scheduling latency. CFS’s sched_latency parameter (default 6ms for 8 tasks) provides a target, but under load it is best-effort, not a guarantee. A task that needs CPU now has no way to express that urgency.
- Virtual deadline: Each task gets a virtual deadline calculated as vruntime + (time_slice / weight). This represents “by when should this task have received its fair share.”
- Eligibility: Lag is the service a task should have received minus the service it actually received. A task is “eligible” if its lag is greater than or equal to zero, meaning it has not exceeded its fair share. A task that has been running too much (negative lag) becomes ineligible temporarily.
- Selection: Among all eligible tasks, EEVDF picks the one with the earliest virtual deadline. This means a short-burst task that just woke up gets a near-term deadline and is scheduled quickly, while a long-running task that has been consuming CPU gets a farther-out deadline and yields.
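The selection rule can be sketched in a few lines of Python. This is a simplified model rather than the kernel's code. It uses the kernel's sign convention (lag is proportional to the weighted-average vruntime minus the task's own vruntime, and a task is eligible when that is non-negative) and the nice-0 weight of 1024. The `Task` fields and sample numbers are assumptions, and integer arithmetic avoids rounding issues in the eligibility test.

```python
from dataclasses import dataclass

NICE_0_WEIGHT = 1024   # weight the kernel assigns to a nice-0 task


@dataclass
class Task:
    name: str
    weight: int        # scheduling weight; larger means a bigger CPU share
    vruntime: int      # virtual time consumed so far (ns)
    slice_ns: int      # requested time slice (ns)

    def deadline(self) -> int:
        # Virtual deadline: vruntime plus the slice scaled by weight.
        return self.vruntime + self.slice_ns * NICE_0_WEIGHT // self.weight


def pick_next(tasks):
    """Among eligible tasks, return the one with the earliest virtual deadline."""
    total_weight = sum(t.weight for t in tasks)
    weighted_sum = sum(t.weight * t.vruntime for t in tasks)
    # Eligible iff vruntime <= weighted average vruntime (lag >= 0).
    # Cross-multiplied to stay in integers.
    eligible = [t for t in tasks if t.vruntime * total_weight <= weighted_sum]
    return min(eligible, key=lambda t: t.deadline())
```

With an `audio` task (vruntime 1,000,000, slice 1,000,000), a `batch` task (vruntime 900,000, slice 3,000,000), and a `hog` task (vruntime 5,000,000, slice 1,000,000), all at weight 1024, the model picks `audio`. A lowest-vruntime rule would have picked `batch`, but `audio` asked for a shorter slice and therefore has the earlier deadline. `hog` is ineligible because it is ahead of the average.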
How does Landlock differ from seccomp and namespaces as a sandboxing mechanism? When would you choose one over the others?
- Seccomp (syscall filtering): Operates at the syscall boundary. A seccomp-BPF filter inspects each syscall number and its arguments, and decides allow/deny/kill. It answers the question: “Which kernel APIs can this process invoke?” Seccomp cannot distinguish between files: if you allow open(), you allow opening any file. It is coarse-grained for filesystem access but excellent for reducing the kernel attack surface (blocking dangerous syscalls like ptrace, mount, kexec_load).
- Namespaces (resource isolation): Change what the process can see. A PID namespace makes a process think it is PID 1. A mount namespace gives it a different filesystem tree. A network namespace gives it a separate network stack. Namespaces answer: “What does the world look like to this process?” They are powerful but require root or CAP_SYS_ADMIN to create (except user namespaces), and they are coarse: you get a whole new view, not fine-grained access control.
- Landlock (filesystem access control): Operates at the VFS layer. An unprivileged process can voluntarily restrict its own filesystem access to specific directories and specific operations (read, write, execute, make_dir, etc.). It answers: “Which files and directories can this process access, and how?” The key differentiator is that Landlock is self-imposed and unprivileged: no root needed, no special setup.
- Processing untrusted input (PDF renderer, image converter): Use Landlock to restrict filesystem access to only the input/output directories, plus seccomp to block unnecessary syscalls. This is defense-in-depth: even if an attacker gets code execution through a parsing bug, they cannot read /etc/shadow (Landlock blocks it) and cannot call ptrace (seccomp blocks it).
- Container isolation (Docker, Kubernetes): Use all three. Namespaces provide the fundamental isolation (separate PID, mount, network views), seccomp profiles block dangerous syscalls, and newer container runtimes are adding Landlock for fine-grained filesystem restrictions within the container.
- Reducing attack surface of a network daemon: Seccomp is the primary tool. Block everything the daemon does not need. A web server does not need mount, reboot, kexec_load, or ptrace. Drop those with seccomp-BPF.
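To show what a Landlock-style policy expresses, here is a pure-Python model of deny-by-default, path-beneath rules. It does not call the Landlock syscalls. The class name, the three access bits, and the lexical path normalization are all assumptions for illustration, and a real ruleset is enforced by the kernel on resolved paths rather than on strings.

```python
import posixpath
from pathlib import PurePosixPath

READ, WRITE, EXECUTE = 1, 2, 4     # hypothetical access bits for this model


class ToyRuleset:
    """Deny-by-default policy: access is allowed only beneath ruled directories."""

    def __init__(self):
        self._rules = []           # list of (directory, allowed access bits)

    def allow_beneath(self, directory: str, access: int) -> None:
        self._rules.append((PurePosixPath(posixpath.normpath(directory)), access))

    def permits(self, path: str, requested: int) -> bool:
        # Collapse "." and ".." lexically so "/srv/in/../x" is judged as "/srv/x".
        target = PurePosixPath(posixpath.normpath(path))
        granted = 0
        for directory, access in self._rules:
            if target == directory or directory in target.parents:
                granted |= access   # rights from every matching ancestor combine
        return requested & ~granted == 0
```

A PDF worker modeled this way might call `allow_beneath("/srv/in", READ)` and `allow_beneath("/srv/out", READ | WRITE)`. After that, `permits("/etc/shadow", READ)` is `False` simply because no rule covers it, which is the property that makes the sandbox useful after a compromise.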
prctl(PR_SET_NO_NEW_PRIVS) first (prevents privilege escalation via setuid binaries), then applies a seccomp filter, then applies Landlock rules. Each layer catches what the others miss.
Follow-up: What is the performance cost of Landlock vs seccomp?
Seccomp-BPF runs a small BPF program on every syscall entry, which adds roughly 10-50ns per syscall depending on filter complexity. Landlock hooks into the VFS at specific operations (open, mkdir, etc.) and checks the rules only for filesystem operations, so its overhead is negligible for non-filesystem syscalls. For filesystem-heavy workloads, Landlock adds a small constant cost per VFS operation to walk the ruleset tree. In benchmarks, both are well under 1% overhead for typical workloads. That cost is small compared with the cost of not using them: a privilege escalation vulnerability.
io_uring vs epoll -- when does io_uring actually win, and where is epoll still the right answer?
- State the architectural difference. epoll is an event-readiness API: you ask the kernel to notify you when an FD is ready for I/O, then you do a syscall (read, write) to actually transfer data. That is up to two syscalls per I/O on the hot path. io_uring is a completion-based API: you submit the work (read this, write that) into a shared ring; the kernel performs the work asynchronously and posts a completion when done. With SQPOLL, zero syscalls per I/O.
- Identify where io_uring wins. High-fanout, high-IOPS workloads where syscall overhead dominates. Storage servers handling 500k+ IOPS on NVMe can see 4-10x improvement over epoll+pread. Network proxies handling millions of QPS can see 2-5x. Workloads that batch many I/Os at once (read 100 files in parallel) also benefit: io_uring lets you submit them all with one io_uring_enter() call, reducing context switches.
- Identify where epoll is fine or better. Single-threaded servers with moderate concurrency (under 10k QPS): the syscall savings are not measurable. Workloads where I/O is dominated by network round-trip time, not local syscall cost (HTTP servers waiting on DNS, proxies behind slow upstreams). Workloads where you need cross-language portability or run on older kernels (epoll is universal; io_uring is Linux 5.1+, with usable network support more like 5.5+).
- Identify where io_uring is wrong. Untrusted code execution — io_uring has had multiple privilege-escalation CVEs and is blocked in some environments (Android, recent ChromeOS). Workloads with very few concurrent operations (a CLI tool sequentially reading files): the ring setup cost exceeds the syscall savings. Code where async lifecycle management would be a major architectural lift you cannot afford.
- State the crossover. Roughly: queue depth 4-8 is the break-even point. Below that, epoll is comparable or better. Above that, io_uring’s batching and zero-syscall properties dominate. Use latency targets and IOPS goals to pick the threshold. Always benchmark on your actual hardware and workload, not on synthetic numbers.
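A rough way to reason about the syscall side of this comparison is to count calls. The model below is back-of-the-envelope arithmetic under stated assumptions: each epoll_wait returns a fixed number of ready descriptors, each io_uring_enter submits a fixed batch, and SQPOLL is treated as the ideal case where the poller thread never idles. It says nothing about where the real break-even point sits, which only a benchmark can show.

```python
import math


def syscalls_epoll(ops: int, ready_per_wait: int) -> int:
    """One epoll_wait per batch of ready FDs, plus one read/write per operation."""
    return math.ceil(ops / ready_per_wait) + ops


def syscalls_io_uring(ops: int, batch: int, sqpoll: bool = False) -> int:
    """One io_uring_enter per submitted batch; none in the idealized SQPOLL case."""
    return 0 if sqpoll else math.ceil(ops / batch)


def syscalls_per_op(total_syscalls: int, ops: int) -> float:
    return total_syscalls / ops
```

For 1,000 operations, a loop where each wait reports one ready descriptor costs 2,000 syscalls, while batches of 32 through io_uring cost 32. At a batch size of one the io_uring count is 1,000, which narrows the gap considerably and is consistent with the crossover point being at small queue depths.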
read, return); io_uring requires the buffer to outlive the operation, which conflicts with borrow-based lifetimes in Rust. tokio-uring uses owned buffers (Bytes) to side-step this. The result is a slightly different async API. Most apps stay on the epoll-backed Tokio for portability and simplicity; performance-critical apps reach for tokio-uring or glommio (an io_uring-native runtime).
- “io_uring is always faster than epoll.” Untrue at low queue depths and untrue for workloads where syscall cost is not the bottleneck.
- “io_uring replaces epoll entirely.” No — they coexist on every modern Linux system. epoll is older and more universally supported; io_uring is newer and more capable for specific workloads.
- “io_uring is just async I/O.” It is more general — io_uring supports network I/O, file I/O, opening/closing files, statx, and more. It is closer to “async syscall framework” than “async I/O.”
- Jens Axboe’s “Efficient IO with io_uring” paper (2019) — the canonical introduction by the author.
- Glauber Costa’s posts on Seastar and io_uring — design lessons from a real high-IOPS database.
- LWN.net articles “Ringing in a new asynchronous I/O API” and follow-ups — evolution of the API and the security history.
What does eBPF unlock for observability that older tools (strace, ftrace, kprobes alone) couldn't do?
- State what older tools could and could not do. strace traces syscalls one process at a time with high overhead (ptrace per call, often 10-100x slowdown). ftrace gives you static tracepoints with low overhead but limited filtering — you typically capture a lot and filter offline. kprobes/uprobes let you attach to arbitrary kernel/user functions, but the actions are limited to “log this event” or “emit a counter.” None of these support arbitrary in-kernel logic at the trace point.
- State what eBPF adds. A safe, JIT-compiled, in-kernel program runs at the trace point. It can read function arguments and return values, walk kernel data structures, aggregate into BPF maps (per-CPU counters, histograms), and emit only the records you actually want. The result: you can answer questions like “what is the latency distribution of this syscall, broken down by process and bucket?” with ~1% overhead — something that was previously impossible without instrumenting the application or running a sampling profiler.
- List the categories of new capability. (a) Performance analysis at production load: BCC’s biolatency shows block-I/O latency histograms across the entire fleet with very low overhead. (b) Security observability: Falco and Tetragon attach to syscalls and report anomalies in real time. (c) Custom networking: Cilium replaces iptables with eBPF for service-mesh-grade policy at line rate. (d) Programmatic profiling: bpftrace one-liners replace what used to require kernel modules.
- Identify the limits. eBPF programs have a 1M-instruction verifier limit, a 512-byte stack, no dynamic allocation, no floating point. They can only call helper functions, not arbitrary kernel functions. Some kernel data is invisible to BPF (encrypted memory regions, certain BPF-LSM hook contexts). The verifier rejects programs it cannot prove safe, so you sometimes have to refactor to help it.
- State the operational shift. eBPF lets you ship observability tools as small, declarative programs (BCC, bpftrace, libbpf-tools) that run anywhere a recent kernel runs. You no longer need to wait for the next kernel release to add a new metric. Production diagnosis that took weeks (kernel module, peer review, deploy, monitor) now takes minutes (write a bpftrace one-liner, run, learn).
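The in-kernel aggregation idea can be sketched in user-space Python. Instead of emitting every event, the probe updates a small table of power-of-two buckets per key, and only the table is read out. The class and method names are hypothetical, and the bucket layout is in the spirit of the log2 histograms that BCC and bpftrace print rather than a reproduction of either tool's exact output.

```python
from collections import defaultdict


def log2_bucket(value: int) -> int:
    """Index of the power-of-two bucket: 0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, ..."""
    return value.bit_length()


class LatencyHistogram:
    """Per-key log2 histogram, the kind of state a BPF map would hold."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # key -> bucket -> count

    def record(self, key: str, latency_us: int) -> None:
        self.counts[key][log2_bucket(latency_us)] += 1

    def rows(self, key: str):
        """Yield (low, high, count) per non-empty bucket, with inclusive bounds."""
        for bucket in sorted(self.counts[key]):
            low = 0 if bucket == 0 else 1 << (bucket - 1)
            high = 0 if bucket == 0 else (1 << bucket) - 1
            yield low, high, self.counts[key][bucket]
```

Recording a million latencies this way stores a few dozen counters per key, not a million events. That is why aggregating at the probe site keeps overhead low enough for production use.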
bcc and bpftrace toolkits are the de facto standard for Linux performance debugging. A specific example from Brendan Gregg's book BPF Performance Tools: diagnosing a “tail latency anomaly” in a microservice. Old approach: enable detailed JVM logging, capture for hours, re-deploy with extra metrics, repeat. New approach: a 10-line bpftrace script attached to JVM safepoint probes that aggregated stop-the-world pauses by stack trace, run for 5 minutes in production, root cause identified. The total time from “I noticed a problem” to “I have a fix” went from days to hours.
- “eBPF is just a faster strace.” Massively understates it. eBPF lets you run arbitrary safe code at trace points: aggregations, conditionals, data structure walks. strace is logging, eBPF is computation.
- “eBPF replaces all monitoring agents.” No — agents (Datadog, Prometheus exporters) handle metric shipping, dashboarding, alerting. eBPF is a data source those agents can use. They coexist.
- “eBPF is unsafe because it runs kernel code.” The verifier exists specifically to make it safe. The class of safety properties eBPF guarantees (no kernel crashes, no infinite loops, no out-of-bounds memory access) is provable — which is more than you can say for kernel modules.
- Brendan Gregg, BPF Performance Tools (2019) — comprehensive book on production performance analysis with eBPF.
- “Learning eBPF” by Liz Rice (O’Reilly, 2023) — accessible introduction with practical examples.
- ebpf.io — the canonical home for tooling, examples, and the eBPF Foundation’s resources.
Confidential computing (Intel SGX, AMD SEV-SNP, Arm CCA) -- explain the threat model, what it actually protects, and where the gaps are.
- State the threat model precisely. Confidential computing protects code and data from a privileged attacker on the same machine — the cloud provider, the host OS, the hypervisor, a malicious admin. The trust boundary is the CPU package itself: you trust the hardware vendor and the silicon, and you do not trust anything outside the CPU (RAM, disk, BIOS, hypervisor, host kernel). The use case is: “I want to run my workload in someone else’s cloud and not trust the cloud provider with my data.”
- Explain Intel SGX (Software Guard Extensions). Process-level enclaves: a process can carve out an enclave region, and only enclave code can read enclave memory. The CPU encrypts enclave memory with a hardware-derived key. Any access from outside the enclave (kernel, hypervisor, other processes) sees ciphertext. Limitation: protected enclave memory was historically capped at ~128 MB, with anything beyond paged in/out (slow). Ice Lake and later server CPUs raised this substantially (up to hundreds of GB per socket), but it still has limits.
- Explain AMD SEV-SNP / Intel TDX. VM-level confidential computing: an entire VM runs encrypted, with memory access from the hypervisor returning ciphertext. Larger blast radius (full VM, not just an enclave) but easier to retrofit existing applications, because you do not have to re-architect into “trusted enclave + untrusted host.” SEV-SNP adds attestation and integrity protection. TDX is Intel’s equivalent. Azure confidential VMs and GCP Confidential VMs are built on this layer, and AWS offers SEV-SNP on some EC2 instance types. Note that AWS Nitro Enclaves rely on Nitro hypervisor isolation rather than SEV-SNP or TDX, and that Azure DCsv3 / DCdsv3 instances are SGX-based.
- Explain Arm CCA (Confidential Compute Architecture). Newer (rolling out in 2024-2025), Realm-based: a VM-style isolation backed by Arm’s Realm Management Extension (RME). Conceptually similar to TDX/SEV-SNP but designed for Arm’s heterogeneous compute model. Used for data-center deployments and increasingly for edge/AI inference workloads.
- Identify the gaps. This is where senior judgment shows.
- Side channels. SGX has had repeated side-channel breaks: Foreshadow (2018), Plundervolt (2019), SGAxe / CrossTalk (2020). SEV-SNP and TDX have had cache-timing and speculative-execution attacks. The hardware vendors patch them, but new ones keep appearing. Confidential computing is robust against direct memory access and most direct attacks; it is fragile against microarchitectural side channels.
- Attestation trust chain. You must verify the attestation report comes from a genuine CPU running the firmware version you expect. This requires bootstrapping trust through the CPU vendor’s PKI (Intel Attestation Service, AMD’s KDS). If that PKI is compromised or the attestation library has bugs, the whole chain falls.
- Software bugs in enclaves. The enclave code itself can have bugs. Confidential computing protects from external attackers, not from a vulnerable implementation inside the enclave.
- I/O. Anything that leaves the enclave (network, disk) is not protected by these mechanisms. You need TLS for network, encrypted storage for disk, and careful design for any data crossing the boundary.
- “Confidential computing means encrypted data.” Encryption-at-rest exists everywhere. Confidential computing is encryption-during-use — data is protected even while the CPU is operating on it. That is a much harder property.
- “SGX is dead because of side-channel attacks.” SGX is deprecated for client (consumer) CPUs but still actively developed for server. SGX2 with a larger enclave size and TDX as a successor are alive. The narrative “dead” is overstated; the narrative “fragile against side channels” is accurate.
- “Confidential computing replaces trust in the cloud provider.” It reduces trust scope to the CPU vendor, not eliminates trust. You now trust Intel/AMD/Arm. For some threat models that is a meaningful improvement; for others (state-level adversaries who might pressure the silicon vendor) it is not.
- Confidential Computing Consortium whitepapers — ccc.io has architectural overviews and threat-model docs.
- Costan and Devadas, “Intel SGX Explained” (2016) — the most thorough public treatment of SGX internals.
- “AMD SEV-SNP: Strengthening VM Isolation with Integrity Protection and More” (AMD whitepaper, 2020) — canonical reference for SEV-SNP design.