Modern Operating System Features
Modern operating systems continue to evolve with new abstractions and optimizations. Understanding these cutting-edge features is essential for building high-performance systems and for senior-level interviews.
Interview Frequency: High for performance-critical roles
Key Topics: io_uring, eBPF, modern schedulers, memory tiering
Time to Master: 15-20 hours
io_uring: Modern Async I/O
io_uring (added in Linux 5.1) is a modern asynchronous I/O interface that delivers high-performance, low-latency I/O with minimal system call overhead.
Why io_uring?
┌─────────────────────────────────────────────────────────────────┐
│ I/O EVOLUTION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Blocking I/O Async I/O (aio) io_uring │
│ ──────────── ────────────── ──────── │
│ read() blocks Limited to O_DIRECT Full async │
│ One I/O at a time Complex API Ring buffers │
│ Thread per request Kernel thread pool Zero-copy │
│ Syscall per I/O Syscall per I/O Batched syscalls│
│ │
│ Performance (NVMe SSD, QD=1): │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Blocking read(): ~20,000 IOPS │ │
│ │ aio: ~100,000 IOPS │ │
│ │ io_uring: ~400,000+ IOPS │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
io_uring Architecture
┌─────────────────────────────────────────────────────────────────┐
│ io_uring RING BUFFERS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User Space Kernel Space │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Submission Queue (SQ) Completion Queue (CQ) │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ SQE1: read() │ │ CQE1: result=64 │ │ │
│ │ │ SQE2: write() │ │ CQE2: result=32 │ │ │
│ │ │ SQE3: sendmsg() │ │ (waiting...) │ │ │
│ │ │ (empty slots) │ │ │ │ │
│ │ └────────┬────────┘ └────────▲────────┘ │ │
│ │ │ │ │ │
│ └────────────│─────────────────────────│─────────────────┘ │
│ │ │ │
│ io_uring_enter() Results placed │
│ (optional, or poll) after completion │
│ │ │ │
│ ─────────────│─────────────────────────│──────────────────── │
│ │ │ │
│ Kernel: ▼ │ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Process SQEs │ │
│ │ • Execute I/O operations │ │
│ │ • Write CQEs when complete │ │
│ │ • Optional: polling mode (no syscalls!) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Key Benefits: │
│ • Shared memory: No copy between user/kernel │
│ • Batched: Submit many ops, one syscall (or none) │
│ • Linked: Chain dependent operations │
│ • Fixed buffers: Pre-registered for zero-copy │
│ │
└─────────────────────────────────────────────────────────────────┘
io_uring Example
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buffer[1024];

    // Initialize ring with 8 entries
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("test.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    // Get a submission queue entry
    sqe = io_uring_get_sqe(&ring);

    // Prepare a read operation
    io_uring_prep_read(sqe, fd, buffer, sizeof(buffer), 0);
    sqe->user_data = 1;  // Identifier returned with the completion

    // Submit (one syscall, regardless of how many SQEs are queued)
    io_uring_submit(&ring);

    // Wait for a completion
    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res > 0) {
        printf("Read %d bytes\n", cqe->res);
    }

    // Mark the completion as seen
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
// Build with: gcc example.c -luring
io_uring Operations
// Supported operations (partial list)
io_uring_prep_read() // Read from file
io_uring_prep_write() // Write to file
io_uring_prep_readv() // Vectored read
io_uring_prep_writev() // Vectored write
io_uring_prep_send() // Send on socket
io_uring_prep_recv() // Receive on socket
io_uring_prep_accept() // Accept connection
io_uring_prep_connect() // Connect socket
io_uring_prep_openat() // Open file
io_uring_prep_close() // Close file
io_uring_prep_statx() // Stat file
io_uring_prep_poll_add() // Poll for events
io_uring_prep_timeout() // Set timeout
io_uring_prep_link_timeout()// Linked timeout
// Advanced features
io_uring_register_buffers() // Pre-register for zero-copy
io_uring_register_files() // Pre-register file descriptors
IORING_SETUP_SQPOLL // Kernel polling (no syscalls!)
IOSQE_IO_LINK // Chain operations
eBPF: Programmable Kernel
eBPF (extended Berkeley Packet Filter) allows running sandboxed, verified programs inside the kernel without changing kernel source code or loading kernel modules.
eBPF Architecture
┌─────────────────────────────────────────────────────────────────┐
│ eBPF ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User Space │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ eBPF Program (C) │ │
│ │ │ clang/llvm │ │
│ │ ▼ │ │
│ │ eBPF Bytecode │ │
│ │ │ bpf() syscall │ │
│ └───────│─────────────────────────────────────────────────┘ │
│ │ │
│ ────────│──────────────────────────────────────────────────── │
│ ▼ │
│ Kernel Space │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Verifier │ │
│ │ • Checks safety │ │
│ │ • No loops (or bounded) │ │
│ │ • Valid memory access │ │
│ │ • Terminates │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ Verified │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ JIT Compiler │ │
│ │ • Compiles to native code │ │
│ │ • Near-native performance │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Attach to Hook Points │ │
│ │ • kprobes (any kernel function) │ │
│ │ • tracepoints (stable kernel events) │ │
│ │ • XDP (network packet processing) │ │
│ │ • tc (traffic control) │ │
│ │ • cgroup (resource control) │ │
│ │ • LSM (security) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
eBPF Use Cases
┌─────────────────────────────────────────────────────────────────┐
│ eBPF USE CASES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Category │ Examples │
│ ────────────────┼─────────────────────────────────────────────│
│ Observability │ • Performance monitoring │
│ │ • Distributed tracing (Cilium Hubble) │
│ │ • Custom metrics │
│ │ • Runtime debugging │
│ │
│ Networking │ • Load balancing (Cilium, Katran) │
│ │ • Packet filtering (XDP firewalls) │
│ │ • DDoS mitigation │
│ │ • Service mesh (Cilium) │
│ │
│ Security │ • Runtime security (Falco, Tetragon) │
│ │ • Syscall filtering │
│ │ • Container security │
│ │
│ Profiling │ • Continuous profiling (Parca, Pyroscope) │
│ │ • CPU flame graphs │
│ │ • Off-CPU analysis │
│ │
└─────────────────────────────────────────────────────────────────┘
eBPF Example
// Simple eBPF program to count syscalls
// (struct trace_event_raw_sys_enter normally comes from a generated vmlinux.h)
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 256);
    __type(key, __u32);   // syscall number
    __type(value, __u64); // count
} syscall_counts SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int trace_syscall(struct trace_event_raw_sys_enter *ctx) {
    __u32 syscall_id = ctx->id;
    __u64 *count = bpf_map_lookup_elem(&syscall_counts, &syscall_id);
    if (count) {
        __sync_fetch_and_add(count, 1);
    } else {
        __u64 init_val = 1;
        bpf_map_update_elem(&syscall_counts, &syscall_id, &init_val, BPF_ANY);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
eBPF Maps
// Map types for sharing data
// (struct stats / flow_key / flow_stats are placeholder types)

// Hash map (key-value store)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __type(value, struct stats);
} hash_map SEC(".maps");

// Array (indexed access)
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 256);
    __type(key, __u32);
    __type(value, __u64);
} array_map SEC(".maps");

// Per-CPU array (no locking needed)
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 256);
    __type(key, __u32);
    __type(value, __u64);
} percpu_map SEC(".maps");

// Ring buffer (efficient event streaming; max_entries is the buffer size)
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); // 256KB
} events SEC(".maps");

// LRU hash (automatic eviction of least-recently-used entries)
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 10000);
    __type(key, struct flow_key);
    __type(value, struct flow_stats);
} lru_map SEC(".maps");
Modern Schedulers
EEVDF (Earliest Eligible Virtual Deadline First)
In Linux 6.6+, EEVDF replaces CFS as the default scheduler:
┌─────────────────────────────────────────────────────────────────┐
│ EEVDF vs CFS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CFS (Completely Fair Scheduler) │
│ ──────────────────────────────── │
│ • Virtual runtime tracking │
│ • Red-black tree for task selection │
│ • Good average case, but: │
│ - Latency spikes possible │
│ - Unfair to short-running tasks │
│ │
│ EEVDF (Earliest Eligible Virtual Deadline First) │
│ ───────────────────────────────────────────────── │
│ • Each task has a virtual deadline │
│ • Eligible: Has runnable time slice │
│ • Pick task with earliest deadline │
│ • Better latency guarantees │
│ • Fairer to short tasks │
│ │
│ Key Concept: │
│ • lag = service_received - service_expected │
│ • Positive lag: Got more than fair share │
│ • Negative lag: Got less than fair share │
│ • Eligible: lag ≤ 0 (hasn't exceeded fair share) │
│ • Pick eligible task with earliest deadline │
│ │
└─────────────────────────────────────────────────────────────────┘
Scheduler Classes
┌─────────────────────────────────────────────────────────────────┐
│ LINUX SCHEDULER CLASSES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Priority │ Class │ Policy │ Use Case │
│ ──────────┼────────────────┼────────────────┼──────────────── │
│ Highest │ stop_sched │ (internal) │ Stop machine │
│ │ dl_sched │ SCHED_DEADLINE │ Hard real-time │
│ │ rt_sched │ SCHED_FIFO │ Soft real-time │
│ │ │ SCHED_RR │ │
│ │ fair_sched │ SCHED_NORMAL │ Regular tasks │
│ Lowest │ idle_sched │ SCHED_IDLE │ Very low prio │
│ │
│ SCHED_DEADLINE: │
│ • Earliest Deadline First (EDF) │
│ • Parameters: runtime, deadline, period │
│ • Guaranteed CPU time if schedulable │
│ • Used for hard real-time tasks │
│ │
│ Example: │
│ struct sched_attr attr = { │
│ .sched_policy = SCHED_DEADLINE, │
│ .sched_runtime = 10 * 1000 * 1000, // 10ms │
│ .sched_deadline = 30 * 1000 * 1000, // 30ms │
│ .sched_period = 30 * 1000 * 1000, // 30ms │
│ }; │
│ sched_setattr(0, &attr, 0); │
│ │
└─────────────────────────────────────────────────────────────────┘
Core Scheduling
For security (Spectre/Meltdown-class side channels), core scheduling ensures only mutually trusting tasks run together on SMT siblings:

# Enable core scheduling (debugfs knob)
echo 1 > /sys/kernel/debug/sched/core_sched_enabled

// In C (Linux 5.14+): create a core scheduling cookie.
// Only threads with the same cookie run on SMT siblings.
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_TGID, 0);

# The kernel then ensures:
# - SMT siblings only run tasks with the same cookie
# - Mitigates some cross-hyperthread side-channel attacks
Memory Management Advances
Memory Tiering
┌─────────────────────────────────────────────────────────────────┐
│ MEMORY TIERING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Modern systems have heterogeneous memory: │
│ │
│ Tier 0 (Fastest) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ HBM (High Bandwidth Memory) / Local DRAM │ │
│ │ ~100 ns latency │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ Page migration │
│ Tier 1 (Medium) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ CXL Memory / Remote NUMA │ │
│ │ ~200-500 ns latency │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ Page migration │
│ Tier 2 (Slowest) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Persistent Memory (PMEM) / Compressed Memory │ │
│ │ ~1000+ ns latency │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Kernel features: │
│ • kswapd: Demotes cold pages │
│ • NUMA balancing: Promotes hot pages │
│ • Memory tiering (Linux 5.15+): Explicit tier management │
│ │
│ # Check memory tiers │
│ cat /sys/devices/system/node/node*/memory_tier │
│ │
└─────────────────────────────────────────────────────────────────┘
MGLRU (Multi-Gen LRU)
Linux 6.1+ introduces the Multi-Generational LRU for better page reclaim:
┌─────────────────────────────────────────────────────────────────┐
│ MGLRU vs Classic LRU │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Classic LRU: │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Active List ◄──► Inactive List ──► Reclaim │ │
│ └────────────────────────────────────────────────────────┘ │
│ • Two lists only │
│ • Age information limited │
│ • Scanning overhead for large memory │
│ │
│ MGLRU: │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Gen 0 Gen 1 Gen 2 Gen 3 ──► Reclaim │ │
│ │ (oldest) (youngest) │ │
│ └────────────────────────────────────────────────────────┘ │
│ • Multiple generations (typically 4) │
│ • Better age tracking │
│ • Faster page aging │
│ • Reduces CPU overhead for scanning │
│ • Better for large memory systems │
│ │
│ Enable MGLRU: │
│ echo 1 > /sys/kernel/mm/lru_gen/enabled │
│ │
│ Benefits: │
│ • 10-20% better throughput under memory pressure │
│ • Lower tail latency │
│ • Better workload isolation │
│ │
└─────────────────────────────────────────────────────────────────┘
Filesystem Innovations
SMR and ZNS Support
┌─────────────────────────────────────────────────────────────────┐
│ ZONED STORAGE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Traditional SSD: │
│ ┌────┬────┬────┬────┬────┬────┬────┬────┐ │
│ │ R/W│ R/W│ R/W│ R/W│ R/W│ R/W│ R/W│ R/W│ Random access │
│ └────┴────┴────┴────┴────┴────┴────┴────┘ │
│ │
│ Zoned Namespace (ZNS) SSD / SMR HDD: │
│ ┌──────────────────────┬──────────────────────┐ │
│ │ Zone 1 │ Zone 2 │ │
│ │ ████████░░░░░░░░░░░░│░░░░░░░░░░░░░░░░░░░░│ │
│ │ ──────► Sequential │ Write pointer │ │
│ └──────────────────────┴──────────────────────┘ │
│ │
│ Rules: │
│ • Must write sequentially within a zone │
│ • Can only reset (erase) entire zone │
│ • Better SSD endurance and cost │
│ │
│ Filesystem support: │
│ • Btrfs: Native zone support │
│ • f2fs: Flash-Friendly FS with zone support │
│ • ZoneFS: Exposes zones as files │
│ │
│ # Check zone info │
│ cat /sys/block/nvme0n1/queue/zoned │
│ # Values: none, host-aware, host-managed │
│ │
└─────────────────────────────────────────────────────────────────┘
FUSE (Filesystem in Userspace)
┌─────────────────────────────────────────────────────────────────┐
│ FUSE ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User Space │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Application │ │
│ │ │ VFS operation (open, read, write) │ │
│ └─────│───────────────────────────────────────────────────┘ │
│ │ │
│ ──────│────────────────────────────────────────────────────── │
│ ▼ │
│ Kernel │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ VFS Layer │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ FUSE Kernel Module │ │
│ │ │ │ │
│ └─────│───────────────────────────────────────────────────┘ │
│ │ /dev/fuse │
│ ──────│────────────────────────────────────────────────────── │
│ ▼ │
│ User Space │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ FUSE Daemon (Your Filesystem) │ │
│ │ • libfuse │ │
│ │ • Implement operations │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Use cases: │
│ • sshfs: Mount remote filesystem over SSH │
│ • s3fs: Mount S3 bucket as filesystem │
│ • rclone: Mount cloud storage │
│ • Avoids kernel development for custom filesystems │
│ │
└─────────────────────────────────────────────────────────────────┘
Security Features
Landlock
A Linux Security Module (since Linux 5.13) for unprivileged sandboxing:

#include <fcntl.h>
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Create ruleset
struct landlock_ruleset_attr ruleset_attr = {
    .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
                         LANDLOCK_ACCESS_FS_WRITE_FILE,
};
int ruleset_fd = syscall(SYS_landlock_create_ruleset,
                         &ruleset_attr, sizeof(ruleset_attr), 0);

// Add rule: allow reads beneath /usr
struct landlock_path_beneath_attr path_beneath = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
    .parent_fd = open("/usr", O_PATH | O_CLOEXEC),
};
syscall(SYS_landlock_add_rule, ruleset_fd,
        LANDLOCK_RULE_PATH_BENEATH, &path_beneath, 0);

// Enforce ruleset (no_new_privs is required first)
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(SYS_landlock_restrict_self, ruleset_fd, 0);

// Now the process can only read files under /usr;
// all other handled file access fails with EACCES
io_uring Security
// io_uring can be restricted for security (IORING_REGISTER_RESTRICTIONS).
// The ring must be created with IORING_SETUP_R_DISABLED,
// restricted, then enabled.
struct io_uring_restriction res[] = {
    // Only allow specific operations
    { .opcode = IORING_RESTRICTION_SQE_OP,
      .sqe_op = IORING_OP_READ },
    { .opcode = IORING_RESTRICTION_SQE_OP,
      .sqe_op = IORING_OP_WRITE },
    // Require pre-registered (fixed) files only
    { .opcode = IORING_RESTRICTION_SQE_FLAGS_REQUIRED,
      .sqe_flags = IOSQE_FIXED_FILE },
};
io_uring_register_restrictions(ring, res, sizeof(res) / sizeof(res[0]));
io_uring_enable_rings(ring);

// Some systems disable io_uring entirely:
// sysctl kernel.io_uring_disabled=2
Interview Questions
What is io_uring and why is it faster than traditional I/O?
io_uring advantages:
- Shared memory rings: no copying between user and kernel space
- Batched submissions: multiple ops per syscall
- Polling mode: zero syscalls possible (SQPOLL)
- Zero-copy: pre-registered buffers
- Async everything: file, network, timers all async

Performance impact:
- Syscall overhead eliminated or amortized
- Better cache utilization
- 4-10x IOPS improvement for NVMe

When to use:
- High-performance servers
- Storage-intensive applications
- When syscall overhead matters
Explain eBPF and its use cases
eBPF (extended Berkeley Packet Filter) runs verified, sandboxed programs in the kernel:
- No kernel recompilation
- No kernel modules
- Safe: verified before execution
- Fast: JIT compiled

Use cases:
- Observability: trace syscalls, functions, performance
- Networking: XDP for fast packet processing, load balancing
- Security: runtime threat detection, syscall filtering
- Profiling: CPU, memory, off-CPU analysis

How it works:
- Write in C or bpftrace
- Compile to BPF bytecode
- Kernel verifier checks safety
- JIT compiles to native code
- Attach to hook points (kprobes, tracepoints, XDP)
What is XDP and when would you use it?
XDP (eXpress Data Path): eBPF programs that run at the NIC driver level, before sk_buff allocation.

Actions:
- XDP_DROP: drop packet (fastest firewall)
- XDP_PASS: continue to network stack
- XDP_TX: send back out same interface
- XDP_REDIRECT: send to different interface/CPU

Performance:
- Millions of packets per second per core
- 10-100x faster than iptables for simple filtering

Use cases:
- DDoS mitigation (drop bad traffic early)
- Load balancing (Facebook's Katran)
- Traffic filtering
- Packet modification

Limitations:
- Limited to packet processing
- Must handle raw packets
- Driver support required for best performance
Summary
┌─────────────────────────────────────────────────────────────────┐
│ MODERN OS FEATURES SUMMARY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ io_uring: │
│ • Ring buffer based async I/O │
│ • Minimal syscall overhead │
│ • Supports all I/O types │
│ │
│ eBPF: │
│ • Safe kernel programmability │
│ • Tracing, networking, security │
│ • XDP for fast packet processing │
│ │
│ Schedulers: │
│ • EEVDF: Better fairness and latency │
│ • SCHED_DEADLINE: Hard real-time │
│ • Core scheduling: SMT security │
│ │
│ Memory: │
│ • MGLRU: Better page reclaim │
│ • Memory tiering: CXL and PMEM │
│ │
│ Security: │
│ • Landlock: Unprivileged sandboxing │
│ • Improved io_uring restrictions │
│ │
└─────────────────────────────────────────────────────────────────┘