
Modern Operating System Features

Modern operating systems continue to evolve with new abstractions and optimizations. Understanding these cutting-edge features is essential for building high-performance systems and for standing out in senior-level interviews.
Interview Frequency: High for performance-critical roles
Key Topics: io_uring, eBPF, modern schedulers, memory tiering
Time to Master: 15-20 hours

io_uring: Modern Async I/O

io_uring (added in Linux 5.1) is a modern async I/O interface that delivers high-performance, low-latency I/O with minimal system call overhead: submissions can be batched, and in polling mode no syscalls are needed at all.

Why io_uring?

┌─────────────────────────────────────────────────────────────────┐
│                    I/O EVOLUTION                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Blocking I/O          Async I/O (aio)        io_uring        │
│   ────────────          ──────────────         ────────         │
│   read() blocks         Limited to O_DIRECT    Full async       │
│   One I/O at a time     Complex API            Ring buffers     │
│   Thread per request    Kernel thread pool     Zero-copy        │
│   Syscall per I/O       Syscall per I/O        Batched syscalls│
│                                                                  │
│   Performance (NVMe SSD, QD=1):                                 │
│   ┌──────────────────────────────────────────────────────────┐  │
│   │  Blocking read():     ~20,000 IOPS                       │  │
│   │  aio:                 ~100,000 IOPS                      │  │
│   │  io_uring:            ~400,000+ IOPS                     │  │
│   └──────────────────────────────────────────────────────────┘  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

io_uring Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    io_uring RING BUFFERS                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   User Space                   Kernel Space                     │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                                                         │   │
│   │   Submission Queue (SQ)      Completion Queue (CQ)     │   │
│   │   ┌─────────────────┐       ┌─────────────────┐        │   │
│   │   │ SQE1: read()    │       │ CQE1: result=64 │        │   │
│   │   │ SQE2: write()   │       │ CQE2: result=32 │        │   │
│   │   │ SQE3: sendmsg() │       │ (waiting...)    │        │   │
│   │   │ (empty slots)   │       │                 │        │   │
│   │   └────────┬────────┘       └────────▲────────┘        │   │
│   │            │                         │                 │   │
│   └────────────│─────────────────────────│─────────────────┘   │
│                │                         │                      │
│          io_uring_enter()          Results placed              │
│          (optional, or poll)       after completion            │
│                │                         │                      │
│   ─────────────│─────────────────────────│────────────────────  │
│                │                         │                      │
│   Kernel:      ▼                         │                      │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  Process SQEs                                            │   │
│   │  • Execute I/O operations                                │   │
│   │  • Write CQEs when complete                              │   │
│   │  • Optional: polling mode (no syscalls!)                 │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   Key Benefits:                                                  │
│   • Shared memory: No copy between user/kernel                 │
│   • Batched: Submit many ops, one syscall (or none)            │
│   • Linked: Chain dependent operations                          │
│   • Fixed buffers: Pre-registered for zero-copy                │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

io_uring Example

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buffer[1024];
    
    // Initialize ring with 8 entries
    io_uring_queue_init(8, &ring, 0);
    
    int fd = open("test.txt", O_RDONLY);
    
    // Get submission queue entry
    sqe = io_uring_get_sqe(&ring);
    
    // Prepare read operation
    io_uring_prep_read(sqe, fd, buffer, sizeof(buffer), 0);
    sqe->user_data = 1;  // Identifier for completion
    
    // Submit
    io_uring_submit(&ring);
    
    // Wait for completion
    io_uring_wait_cqe(&ring, &cqe);
    
    if (cqe->res > 0) {
        printf("Read %d bytes\n", cqe->res);
    }
    
    // Mark completion as seen
    io_uring_cqe_seen(&ring, cqe);
    
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}

io_uring Operations

// Supported operations (partial list)
io_uring_prep_read()        // Read from file
io_uring_prep_write()       // Write to file
io_uring_prep_readv()       // Vectored read
io_uring_prep_writev()      // Vectored write
io_uring_prep_send()        // Send on socket
io_uring_prep_recv()        // Receive on socket
io_uring_prep_accept()      // Accept connection
io_uring_prep_connect()     // Connect socket
io_uring_prep_openat()      // Open file
io_uring_prep_close()       // Close file
io_uring_prep_statx()       // Stat file
io_uring_prep_poll_add()    // Poll for events
io_uring_prep_timeout()     // Set timeout
io_uring_prep_link_timeout() // Linked timeout

// Advanced features
io_uring_register_buffers() // Pre-register for zero-copy
io_uring_register_files()   // Pre-register file descriptors
IORING_SETUP_SQPOLL         // Kernel polling (no syscalls!)
IOSQE_IO_LINK               // Chain operations
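
To see linking and batching in practice, here is a small sketch (file names are made up) that chains a read to a write with IOSQE_IO_LINK and submits both with a single io_uring_submit() call; the kernel starts the write only after the read completes successfully.

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];

    io_uring_queue_init(8, &ring, 0);

    int in_fd  = open("input.txt",  O_RDONLY);                // assumed to exist
    int out_fd = open("output.txt", O_WRONLY | O_CREAT, 0644);

    // First SQE: read; IOSQE_IO_LINK makes the next SQE wait for this one
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, in_fd, buf, sizeof(buf), 0);
    sqe->flags |= IOSQE_IO_LINK;

    // Second SQE: write the same buffer; only runs if the read succeeds.
    // Note the write length is fixed at submission time; a real program
    // would size it from the read result or use a known length.
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, out_fd, buf, sizeof(buf), 0);

    io_uring_submit(&ring);              // one syscall submits both operations

    for (int i = 0; i < 2; i++) {        // reap both completions
        io_uring_wait_cqe(&ring, &cqe);
        printf("completion %d: res=%d\n", i, cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    close(in_fd);
    close(out_fd);
    io_uring_queue_exit(&ring);
    return 0;
}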

eBPF: Programmable Kernel

eBPF (extended Berkeley Packet Filter) allows running sandboxed programs in the kernel without changing kernel code or loading modules.

eBPF Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    eBPF ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   User Space                                                     │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  eBPF Program (C)                                        │   │
│   │       │ clang/llvm                                       │   │
│   │       ▼                                                  │   │
│   │  eBPF Bytecode                                           │   │
│   │       │ bpf() syscall                                    │   │
│   └───────│─────────────────────────────────────────────────┘   │
│           │                                                      │
│   ────────│──────────────────────────────────────────────────── │
│           ▼                                                      │
│   Kernel Space                                                   │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  Verifier                                                │   │
│   │  • Checks safety                                         │   │
│   │  • No loops (or bounded)                                 │   │
│   │  • Valid memory access                                   │   │
│   │  • Terminates                                            │   │
│   └─────────────────────────────────────────────────────────┘   │
│           │ Verified                                             │
│           ▼                                                      │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  JIT Compiler                                            │   │
│   │  • Compiles to native code                               │   │
│   │  • Near-native performance                               │   │
│   └─────────────────────────────────────────────────────────┘   │
│           │                                                      │
│           ▼                                                      │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  Attach to Hook Points                                   │   │
│   │  • kprobes (any kernel function)                         │   │
│   │  • tracepoints (stable kernel events)                    │   │
│   │  • XDP (network packet processing)                       │   │
│   │  • tc (traffic control)                                  │   │
│   │  • cgroup (resource control)                             │   │
│   │  • LSM (security)                                        │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

eBPF Use Cases

┌─────────────────────────────────────────────────────────────────┐
│                    eBPF USE CASES                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Category        │ Examples                                     │
│   ────────────────┼─────────────────────────────────────────────│
│   Observability   │ • Performance monitoring                    │
│                   │ • Distributed tracing (Cilium Hubble)       │
│                   │ • Custom metrics                            │
│                   │ • Runtime debugging                          │
│                                                                  │
│   Networking      │ • Load balancing (Cilium, Katran)           │
│                   │ • Packet filtering (XDP firewalls)          │
│                   │ • DDoS mitigation                           │
│                   │ • Service mesh (Cilium)                     │
│                                                                  │
│   Security        │ • Runtime security (Falco, Tetragon)        │
│                   │ • Syscall filtering                         │
│                   │ • Container security                        │
│                                                                  │
│   Profiling       │ • Continuous profiling (Parca, Pyroscope)   │
│                   │ • CPU flame graphs                          │
│                   │ • Off-CPU analysis                          │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

eBPF Example

// Simple eBPF program to count syscalls
// (vmlinux.h is generated with bpftool and provides kernel types such as
//  struct trace_event_raw_sys_enter and the u32/u64 typedefs)
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 256);
    __type(key, u32);         // syscall number
    __type(value, u64);       // count
} syscall_counts SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int trace_syscall(struct trace_event_raw_sys_enter *ctx) {
    u32 syscall_id = ctx->id;
    u64 *count = bpf_map_lookup_elem(&syscall_counts, &syscall_id);
    
    if (count) {
        __sync_fetch_and_add(count, 1);
    } else {
        u64 init_val = 1;
        bpf_map_update_elem(&syscall_counts, &syscall_id, &init_val, BPF_ANY);
    }
    
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

eBPF Maps

// Map types for sharing data

// Hash map (key-value store)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);
    __type(value, struct stats);
} hash_map SEC(".maps");

// Array (indexed access)
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 256);
    __type(key, u32);
    __type(value, u64);
} array_map SEC(".maps");

// Per-CPU array (no locking needed)
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 256);
    __type(key, u32);
    __type(value, u64);
} percpu_map SEC(".maps");

// Ring buffer (efficient event streaming)
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // 256KB
} events SEC(".maps");

// LRU hash (automatic eviction)
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 10000);
    __type(key, struct flow_key);
    __type(value, struct flow_stats);
} lru_map SEC(".maps");
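
User-space programs read these maps through the bpf() syscall, which libbpf wraps. Below is a minimal sketch for dumping the syscall_counts map from the example above, assuming it was pinned at the hypothetical path /sys/fs/bpf/syscall_counts (for instance with bpftool map pin).

#include <bpf/bpf.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Hypothetical pin path; create it e.g. with: bpftool map pin id <ID> <path>
    int fd = bpf_obj_get("/sys/fs/bpf/syscall_counts");
    if (fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    uint32_t key, next_key;
    uint64_t count;

    // Walk the hash map: a NULL key returns the first key, then iterate
    int err = bpf_map_get_next_key(fd, NULL, &next_key);
    while (err == 0) {
        key = next_key;
        if (bpf_map_lookup_elem(fd, &key, &count) == 0)
            printf("syscall %u: %llu calls\n", key, (unsigned long long)count);
        err = bpf_map_get_next_key(fd, &key, &next_key);
    }
    return 0;
}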

Modern Schedulers

EEVDF (Earliest Eligible Virtual Deadline First)

Linux 6.6+ replaces CFS with EEVDF:
┌─────────────────────────────────────────────────────────────────┐
│                    EEVDF vs CFS                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   CFS (Completely Fair Scheduler)                               │
│   ────────────────────────────────                              │
│   • Virtual runtime tracking                                    │
│   • Red-black tree for task selection                           │
│   • Good average case, but:                                     │
│     - Latency spikes possible                                   │
│     - Unfair to short-running tasks                             │
│                                                                  │
│   EEVDF (Earliest Eligible Virtual Deadline First)              │
│   ─────────────────────────────────────────────────             │
│   • Each task has a virtual deadline                            │
│   • Eligible: Has runnable time slice                           │
│   • Pick task with earliest deadline                            │
│   • Better latency guarantees                                   │
│   • Fairer to short tasks                                       │
│                                                                  │
│   Key Concept:                                                   │
│   • lag = service_received - service_expected                   │
│   • Positive lag: Got more than fair share                      │
│   • Negative lag: Got less than fair share                      │
│   • Eligible: lag ≤ 0 (hasn't exceeded fair share)             │
│   • Pick eligible task with earliest deadline                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Scheduler Classes

┌─────────────────────────────────────────────────────────────────┐
│                    LINUX SCHEDULER CLASSES                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Priority  │ Class          │ Policy         │ Use Case        │
│   ──────────┼────────────────┼────────────────┼──────────────── │
│   Highest   │ stop_sched     │ (internal)     │ Stop machine    │
│             │ dl_sched       │ SCHED_DEADLINE │ Hard real-time  │
│             │ rt_sched       │ SCHED_FIFO     │ Soft real-time  │
│             │                │ SCHED_RR       │                 │
│             │ fair_sched     │ SCHED_NORMAL   │ Regular tasks   │
│   Lowest    │ idle_sched     │ SCHED_IDLE     │ Very low prio   │
│                                                                  │
│   SCHED_DEADLINE:                                                │
│   • Earliest Deadline First (EDF)                               │
│   • Parameters: runtime, deadline, period                       │
│   • Guaranteed CPU time if schedulable                          │
│   • Used for hard real-time tasks                               │
│                                                                  │
│   Example:                                                       │
│   struct sched_attr attr = {                                    │
│       .sched_policy = SCHED_DEADLINE,                           │
│       .sched_runtime = 10 * 1000 * 1000,   // 10ms              │
│       .sched_deadline = 30 * 1000 * 1000,  // 30ms              │
│       .sched_period = 30 * 1000 * 1000,    // 30ms              │
│   };                                                             │
│   sched_setattr(0, &attr, 0);                                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
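
The SCHED_DEADLINE fragment above becomes runnable roughly as follows. glibc does not wrap sched_setattr(), so the call goes through syscall(); this sketch assumes struct sched_attr is available from the <linux/sched/types.h> uapi header (otherwise define it as in the sched_setattr(2) man page) and that the process has CAP_SYS_NICE.

#define _GNU_SOURCE
#include <linux/sched.h>        /* SCHED_DEADLINE */
#include <linux/sched/types.h>  /* struct sched_attr (uapi) */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    struct sched_attr attr = {
        .size           = sizeof(attr),          /* mandatory */
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  = 10 * 1000 * 1000,      /* 10 ms of CPU time...      */
        .sched_deadline = 30 * 1000 * 1000,      /* ...finished within 30 ms  */
        .sched_period   = 30 * 1000 * 1000,      /* ...every 30 ms            */
    };

    /* pid 0 = calling thread, flags = 0 */
    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr");                 /* needs CAP_SYS_NICE / root */
        return 1;
    }
    puts("now running under SCHED_DEADLINE");
    return 0;
}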

Core Scheduling

For security (Spectre/Meltdown), run only trusted code together on SMT siblings:
# Enable core scheduling
echo 1 > /sys/kernel/debug/sched/core_sched_enabled

# Create a core-scheduling cookie (C, via prctl):
#   only threads sharing the same cookie may run on SMT siblings

prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, PIDTYPE_TGID, 0);

# Kernel ensures:
# - SMT siblings run tasks with same cookie
# - Mitigates some side-channel attacks

Memory Management Advances

Memory Tiering

┌─────────────────────────────────────────────────────────────────┐
│                    MEMORY TIERING                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Modern systems have heterogeneous memory:                     │
│                                                                  │
│   Tier 0 (Fastest)                                               │
│   ┌──────────────────────────────────────────────────────────┐  │
│   │  HBM (High Bandwidth Memory) / Local DRAM                │  │
│   │  ~100 ns latency                                         │  │
│   └──────────────────────────────────────────────────────────┘  │
│                              │                                   │
│                              ▼ Page migration                    │
│   Tier 1 (Medium)                                                │
│   ┌──────────────────────────────────────────────────────────┐  │
│   │  CXL Memory / Remote NUMA                                 │  │
│   │  ~200-500 ns latency                                     │  │
│   └──────────────────────────────────────────────────────────┘  │
│                              │                                   │
│                              ▼ Page migration                    │
│   Tier 2 (Slowest)                                               │
│   ┌──────────────────────────────────────────────────────────┐  │
│   │  Persistent Memory (PMEM) / Compressed Memory            │  │
│   │  ~1000+ ns latency                                       │  │
│   └──────────────────────────────────────────────────────────┘  │
│                                                                  │
│   Kernel features:                                               │
│   • kswapd: Demotes cold pages                                  │
│   • NUMA balancing: Promotes hot pages                          │
│   • Memory tiering (Linux 5.15+): Explicit tier management     │
│                                                                  │
│   # Check memory tiers                                           │
│   cat /sys/devices/system/node/node*/memory_tier                │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

MGLRU (Multi-Gen LRU)

Linux 6.1+ introduces Multi-Generational LRU for better page reclaim:
┌─────────────────────────────────────────────────────────────────┐
│                    MGLRU vs Classic LRU                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Classic LRU:                                                   │
│   ┌────────────────────────────────────────────────────────┐    │
│   │  Active List ◄──► Inactive List ──► Reclaim            │    │
│   └────────────────────────────────────────────────────────┘    │
│   • Two lists only                                               │
│   • Age information limited                                      │
│   • Scanning overhead for large memory                          │
│                                                                  │
│   MGLRU:                                                         │
│   ┌────────────────────────────────────────────────────────┐    │
│   │  Gen 0   Gen 1   Gen 2   Gen 3    ──►  Reclaim         │    │
│   │ (oldest)                 (youngest)                     │    │
│   └────────────────────────────────────────────────────────┘    │
│   • Multiple generations (typically 4)                          │
│   • Better age tracking                                          │
│   • Faster page aging                                            │
│   • Reduces CPU overhead for scanning                           │
│   • Better for large memory systems                              │
│                                                                  │
│   Enable MGLRU:                                                  │
│   echo 1 > /sys/kernel/mm/lru_gen/enabled                       │
│                                                                  │
│   Benefits:                                                      │
│   • 10-20% better throughput under memory pressure              │
│   • Lower tail latency                                          │
│   • Better workload isolation                                    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Filesystem Innovations

SMR and ZNS Support

┌─────────────────────────────────────────────────────────────────┐
│                    ZONED STORAGE                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Traditional SSD:                                               │
│   ┌────┬────┬────┬────┬────┬────┬────┬────┐                    │
│   │ R/W│ R/W│ R/W│ R/W│ R/W│ R/W│ R/W│ R/W│  Random access    │
│   └────┴────┴────┴────┴────┴────┴────┴────┘                    │
│                                                                  │
│   Zoned Namespace (ZNS) SSD / SMR HDD:                          │
│   ┌──────────────────────┬──────────────────────┐               │
│   │     Zone 1           │     Zone 2           │               │
│   │ ████████░░░░░░░░░░░░│░░░░░░░░░░░░░░░░░░░░│              │
│   │ ──────► Sequential   │ Write pointer        │               │
│   └──────────────────────┴──────────────────────┘               │
│                                                                  │
│   Rules:                                                         │
│   • Must write sequentially within a zone                       │
│   • Can only reset (erase) entire zone                          │
│   • Better SSD endurance and cost                               │
│                                                                  │
│   Filesystem support:                                            │
│   • Btrfs: Native zone support                                  │
│   • f2fs: Flash-Friendly FS with zone support                   │
│   • ZoneFS: Exposes zones as files                              │
│                                                                  │
│   # Check zone info                                              │
│   cat /sys/block/nvme0n1/queue/zoned                            │
│   # Values: none, host-aware, host-managed                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

FUSE and Filesystem in Userspace

┌─────────────────────────────────────────────────────────────────┐
│                    FUSE ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   User Space                                                     │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  Application                                             │   │
│   │     │ VFS operation (open, read, write)                 │   │
│   └─────│───────────────────────────────────────────────────┘   │
│         │                                                        │
│   ──────│────────────────────────────────────────────────────── │
│         ▼                                                        │
│   Kernel                                                         │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  VFS Layer                                               │   │
│   │     │                                                    │   │
│   │     ▼                                                    │   │
│   │  FUSE Kernel Module                                      │   │
│   │     │                                                    │   │
│   └─────│───────────────────────────────────────────────────┘   │
│         │ /dev/fuse                                              │
│   ──────│────────────────────────────────────────────────────── │
│         ▼                                                        │
│   User Space                                                     │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  FUSE Daemon (Your Filesystem)                           │   │
│   │  • libfuse                                               │   │
│   │  • Implement operations                                  │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   Use cases:                                                     │
│   • sshfs: Mount remote filesystem over SSH                    │
│   • s3fs: Mount S3 bucket as filesystem                        │
│   • rclone: Mount cloud storage                                 │
│   • Avoids kernel development for custom filesystems           │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
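
A toy FUSE filesystem only needs a few callbacks. The sketch below is modeled on libfuse's hello example and assumes libfuse 3 (build with: gcc hello_fs.c $(pkg-config fuse3 --cflags --libs)); it exposes a single read-only file named hello at the mount point.

#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static const char *msg = "hello from user space\n";

/* Report "/" as a directory and "/hello" as a small read-only file */
static int hello_getattr(const char *path, struct stat *st,
                         struct fuse_file_info *fi) {
    (void)fi;
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, "/hello") == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = strlen(msg);
    } else {
        return -ENOENT;
    }
    return 0;
}

/* The root directory contains exactly one entry */
static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t off, struct fuse_file_info *fi,
                         enum fuse_readdir_flags flags) {
    (void)off; (void)fi; (void)flags;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0, 0);
    filler(buf, "..", NULL, 0, 0);
    filler(buf, "hello", NULL, 0, 0);
    return 0;
}

/* Serve reads out of the in-memory string */
static int hello_read(const char *path, char *buf, size_t size, off_t off,
                      struct fuse_file_info *fi) {
    (void)fi;
    size_t len = strlen(msg);
    if (strcmp(path, "/hello") != 0)
        return -ENOENT;
    if ((size_t)off >= len)
        return 0;
    if (off + size > len)
        size = len - off;
    memcpy(buf, msg + off, size);
    return (int)size;
}

static const struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .read    = hello_read,
};

int main(int argc, char *argv[]) {
    /* e.g. ./hello_fs /tmp/mnt ; then: cat /tmp/mnt/hello */
    return fuse_main(argc, argv, &hello_ops, NULL);
}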

Security Features

Landlock

Landlock (Linux 5.13+) is a Linux Security Module that lets unprivileged processes sandbox themselves:
#include <fcntl.h>
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Create ruleset
struct landlock_ruleset_attr ruleset_attr = {
    .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
                         LANDLOCK_ACCESS_FS_WRITE_FILE,
};

int ruleset_fd = syscall(SYS_landlock_create_ruleset,
                         &ruleset_attr, sizeof(ruleset_attr), 0);

// Add rule: Allow read from /usr
struct landlock_path_beneath_attr path_beneath = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
    .parent_fd = open("/usr", O_PATH),
};

syscall(SYS_landlock_add_rule, ruleset_fd,
        LANDLOCK_RULE_PATH_BENEATH, &path_beneath, 0);

// Enforce ruleset
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(SYS_landlock_restrict_self, ruleset_fd, 0);

// The process can now open files for reading only beneath /usr;
// other opens covered by the handled access rights fail with EACCES

io_uring Security

// io_uring can be restricted for security
// (the ring must be created with IORING_SETUP_R_DISABLED so that
//  restrictions are registered before it is enabled)

// IORING_REGISTER_RESTRICTIONS
struct io_uring_restriction res[] = {
    // Only allow specific operations
    { .opcode = IORING_RESTRICTION_SQE_OP,
      .sqe_op = IORING_OP_READ },
    { .opcode = IORING_RESTRICTION_SQE_OP,
      .sqe_op = IORING_OP_WRITE },
    // Require fixed files only
    { .opcode = IORING_RESTRICTION_SQE_FLAGS_REQUIRED,
      .sqe_flags = IOSQE_FIXED_FILE },
};

io_uring_register_restrictions(ring, res, sizeof(res) / sizeof(res[0]));
io_uring_enable_rings(ring);  // the ring starts accepting submissions now

// Some systems disable io_uring entirely
// sysctl kernel.io_uring_disabled=2

Interview Questions

io_uring advantages:
  1. Shared memory rings: No copying between user/kernel
  2. Batched submissions: Multiple ops per syscall
  3. Polling mode: Zero syscalls possible (SQPOLL)
  4. Zero-copy: Pre-registered buffers
  5. Async everything: File, network, timers all async
Performance gains:
  • Syscall overhead eliminated or amortized
  • Better cache utilization
  • 4-10x IOPS improvement for NVMe
When to use:
  • High-performance servers
  • Storage-intensive applications
  • When syscall overhead matters
eBPF (extended Berkeley Packet Filter): Run verified, sandboxed programs in the kernel:
  • No kernel recompilation
  • No kernel modules
  • Safe: Verified before execution
  • Fast: JIT compiled
Use cases:
  1. Observability: Trace syscalls, functions, performance
  2. Networking: XDP for fast packet processing, load balancing
  3. Security: Runtime threat detection, syscall filtering
  4. Profiling: CPU, memory, off-CPU analysis
How it works:
  • Write in C or bpftrace
  • Compile to BPF bytecode
  • Kernel verifier checks safety
  • JIT compiles to native code
  • Attach to hook points (kprobes, tracepoints, XDP)
XDP (eXpress Data Path): eBPF programs that run at the NIC driver level, before sk_buff allocation (see the sketch after this list).
Actions:
  • XDP_DROP: Drop packet (fastest firewall)
  • XDP_PASS: Continue to network stack
  • XDP_TX: Send back out same interface
  • XDP_REDIRECT: Send to different interface/CPU
Performance:
  • Millions of packets per second per core
  • 10-100x faster than iptables for simple filtering
Use cases:
  • DDoS mitigation (drop bad traffic early)
  • Load balancing (Facebook’s Katran)
  • Traffic filtering
  • Packet modification
Limitations:
  • Limited to packet processing
  • Must handle raw packets
  • Driver support required for best performance
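
To make the XDP actions concrete, here is a hypothetical filter that drops UDP packets addressed to port 9999 and passes everything else; the interface and port are made-up values, and the program would typically be attached with something like: ip link set dev eth0 xdpgeneric obj filter.o sec xdp

// Hypothetical XDP filter: drop UDP packets to port 9999, pass the rest
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int drop_udp_9999(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Every access must be bounds-checked or the verifier rejects the program
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;                         // truncated frame
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;                         // not IPv4

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    struct udphdr *udp = (void *)ip + ip->ihl * 4;
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    if (udp->dest == bpf_htons(9999))
        return XDP_DROP;                         // dropped before sk_buff allocation

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";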

Summary

┌─────────────────────────────────────────────────────────────────┐
│              MODERN OS FEATURES SUMMARY                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  io_uring:                                                       │
│  • Ring buffer based async I/O                                  │
│  • Minimal syscall overhead                                     │
│  • Supports all I/O types                                       │
│                                                                  │
│  eBPF:                                                           │
│  • Safe kernel programmability                                  │
│  • Tracing, networking, security                                │
│  • XDP for fast packet processing                               │
│                                                                  │
│  Schedulers:                                                     │
│  • EEVDF: Better fairness and latency                          │
│  • SCHED_DEADLINE: Hard real-time                               │
│  • Core scheduling: SMT security                                │
│                                                                  │
│  Memory:                                                         │
│  • MGLRU: Better page reclaim                                   │
│  • Memory tiering: CXL and PMEM                                 │
│                                                                  │
│  Security:                                                       │
│  • Landlock: Unprivileged sandboxing                           │
│  • Improved io_uring restrictions                               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘