Linux I/O Subsystem - Block layer, schedulers, and the journey from VFS to disk

I/O Subsystem

The Linux I/O subsystem handles all storage operations. Understanding the block layer, I/O schedulers, and modern async I/O with io_uring is essential for building and debugging high-performance systems.
Prerequisites: Filesystem fundamentals, system calls
Interview Focus: I/O schedulers, async I/O, io_uring
Time to Master: 4-5 hours

Block Layer Architecture

Linux I/O Stack

bio and request Structures

The bio Structure

// Simplified view; exact fields and their order vary by kernel version
struct bio {
    struct block_device *bi_bdev;     // Target device
    unsigned int        bi_opf;       // Operation (REQ_OP_*) and flags
    unsigned short      bi_vcnt;      // Number of bio_vecs in use
    unsigned short      bi_max_vecs;  // Capacity of bi_io_vec
    atomic_t            __bi_cnt;     // Reference count
    struct bio_vec      *bi_io_vec;   // Vector of page segments
    struct bvec_iter    bi_iter;      // bi_iter.bi_sector: start sector (512-byte units)
                                      // bi_iter.bi_size: remaining bytes
    bio_end_io_t        *bi_end_io;   // Completion callback
    void                *bi_private;  // Owner's private data
};

struct bio_vec {
    struct page *bv_page;    // Page containing data
    unsigned int bv_len;     // Length of data
    unsigned int bv_offset;  // Offset within page
};

bio Lifecycle

┌─────────────────────────────────────────────────────────────────────┐
│                        BIO LIFECYCLE                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. Allocate bio                                                     │
│     bio = bio_alloc(GFP_KERNEL, nr_pages);                          │
│                                                                      │
│  2. Set up bio                                                       │
│     bio->bi_bdev = block_device;                                    │
│     bio->bi_iter.bi_sector = start_sector;                          │
│     bio->bi_opf = REQ_OP_READ;                                      │
│     bio->bi_end_io = my_completion_handler;                         │
│                                                                      │
│  3. Add pages                                                        │
│     bio_add_page(bio, page, len, offset);                           │
│                                                                      │
│  4. Submit bio                                                       │
│     submit_bio(bio);                                                │
│         │                                                            │
│         ├─▶ Block layer merges with other bios                      │
│         ├─▶ I/O scheduler reorders                                  │
│         └─▶ Driver dispatches to hardware                           │
│                                                                      │
│  5. Completion (interrupt context)                                   │
│     bi_end_io(bio);                                                 │
│         └─▶ bio_put(bio);                                           │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
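The relationship between a bio's segments and the sector range it covers can be modeled in user space. The sketch below is not kernel code: `model_bio`, `model_bio_vec`, and both helper functions are hypothetical stand-ins that mirror only the fields discussed above, and it assumes every segment length is a multiple of 512 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u  /* block-layer sector numbers count 512-byte units */

/* Hypothetical user-space stand-ins for struct bio_vec and struct bio. */
struct model_bio_vec {
    unsigned int len;     /* bytes of data in this segment (like bv_len) */
    unsigned int offset;  /* byte offset within the page (like bv_offset) */
};

struct model_bio {
    uint64_t start_sector;             /* like bi_iter.bi_sector */
    unsigned short vcnt;               /* like bi_vcnt */
    const struct model_bio_vec *vecs;  /* like bi_io_vec */
};

/* Total payload in bytes: the sum of all segment lengths. */
static uint64_t model_bio_bytes(const struct model_bio *bio)
{
    assert(bio != NULL);
    uint64_t total = 0;
    for (unsigned short i = 0; i < bio->vcnt; i++)
        total += bio->vecs[i].len;
    return total;
}

/* First sector past the end of the I/O (exclusive bound). */
static uint64_t model_bio_end_sector(const struct model_bio *bio)
{
    return bio->start_sector + model_bio_bytes(bio) / SECTOR_SIZE;
}
```

With two full 4096-byte segments starting at sector 1000, `model_bio_bytes()` gives 8192 and `model_bio_end_sector()` gives 1016, which is the kind of adjacency check the block layer relies on when it decides whether two bios can be merged.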

I/O Schedulers

Multi-Queue Architecture (blk-mq)

┌─────────────────────────────────────────────────────────────────────┐
│                    MULTI-QUEUE BLOCK LAYER (blk-mq)                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Applications submitting I/O (multiple threads)                      │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐                   │
│  │ Thread 1│ │ Thread 2│ │ Thread 3│ │ Thread 4│                   │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘                   │
│       │          │          │          │                            │
│       ▼          ▼          ▼          ▼                            │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │              Software Staging Queues (per CPU)               │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐        │  │
│  │  │ CPU 0    │ │ CPU 1    │ │ CPU 2    │ │ CPU 3    │        │  │
│  │  │ sw queue │ │ sw queue │ │ sw queue │ │ sw queue │        │  │
│  │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘        │  │
│  └───────│────────────│────────────│────────────│───────────────┘  │
│          │            │            │            │                   │
│          └──────┬─────┴─────┬──────┴─────┬──────┘                  │
│                 │           │            │                          │
│                 ▼           ▼            ▼                          │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                Hardware Dispatch Queues                       │  │
│  │  ┌─────────────────┐  ┌─────────────────┐                    │  │
│  │  │  hw queue 0     │  │  hw queue 1     │  ← maps to NVMe   │  │
│  │  │  (NVMe sq 0)    │  │  (NVMe sq 1)    │    submission     │  │
│  │  └────────┬────────┘  └────────┬────────┘    queues         │  │
│  └───────────│─────────────────────│────────────────────────────┘  │
│              ▼                     ▼                                │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                        NVMe SSD                                 ││
│  │           (multiple submission/completion queues)               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
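The fan-in from per-CPU software queues to hardware queues in the diagram can be illustrated with a toy mapping. This is a deliberately simplified sketch: the real kernel builds its CPU-to-queue map from topology and interrupt affinity, whereas the hypothetical `model_cpu_to_hwq()` below just assumes a modulo assignment.

```c
#include <assert.h>

/* Simplified, hypothetical mapping: CPU c feeds hardware queue c % nr_hw_queues.
 * The kernel's actual mapping also considers CPU topology and IRQ affinity. */
static unsigned int model_cpu_to_hwq(unsigned int cpu, unsigned int nr_hw_queues)
{
    assert(nr_hw_queues > 0);
    return cpu % nr_hw_queues;
}

/* Count how many per-CPU software queues feed one hardware queue. */
static unsigned int model_swq_per_hwq(unsigned int nr_cpus,
                                      unsigned int nr_hw_queues,
                                      unsigned int hwq)
{
    unsigned int count = 0;
    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
        if (model_cpu_to_hwq(cpu, nr_hw_queues) == hwq)
            count++;
    }
    return count;
}
```

For the 4-CPU, 2-queue layout drawn above, each hardware queue is shared by two software queues; with 4 hardware queues every CPU gets a private one, which is the contention-free case NVMe is designed for.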

Scheduler Comparison

mq-deadline
Best for: Latency-sensitive workloads, databases
Characteristics:
  • Read and write request deadlines
  • Read priority over writes (reads typically blocking)
  • Batch dispatch for efficiency
Configuration:
echo mq-deadline > /sys/block/sda/queue/scheduler

# Read the deadline tunables (milliseconds); echo a new value into the file to tune
cat /sys/block/sda/queue/iosched/read_expire    # 500
cat /sys/block/sda/queue/iosched/write_expire   # 5000

# Batch size
cat /sys/block/sda/queue/iosched/fifo_batch     # 16
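Reading `/sys/block/<dev>/queue/scheduler` returns every available scheduler on one line, with the active one in square brackets (for example `[mq-deadline] kyber bfq none`). A program that needs the active scheduler has to pick out the bracketed name. The helper below is a minimal sketch of that parsing step; `active_scheduler()` is a hypothetical name, and the caller is assumed to have already read the line from sysfs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the bracketed (active) scheduler name from a sysfs line into out.
 * Returns 0 on success, -1 if no bracketed name is found or out is too small. */
static int active_scheduler(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (open == NULL || close == NULL)
        return -1;

    size_t n = (size_t)(close - open - 1);  /* characters between the brackets */
    if (n + 1 > outlen)                     /* need room for the terminator */
        return -1;

    memcpy(out, open + 1, n);
    out[n] = '\0';
    return 0;
}
```

Passing the line `"[mq-deadline] kyber bfq none\n"` fills the buffer with `mq-deadline`; a line without brackets is reported as an error rather than guessed at.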

Asynchronous I/O

Traditional AIO (libaio)

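#include <libaio.h>   // link with -laio

// Assumes fd was opened with O_DIRECT and buffer is a block-aligned 4096-byte region
io_context_t ctx = 0;   // must be zero before io_setup()
struct iocb cb;
struct iocb *cbs[1] = {&cb};
struct io_event events[1];

// Initialize a context that can hold up to 128 in-flight requests
io_setup(128, &ctx);

// Prepare a 4096-byte read at offset 0
io_prep_pread(&cb, fd, buffer, 4096, 0);

// Submit (returns the number of iocbs accepted, or -errno)
io_submit(ctx, 1, cbs);

// Wait for completion (at least 1 event, at most 1, no timeout)
int n = io_getevents(ctx, 1, 1, events, NULL);
// events[0].res holds the byte count, or a negative errno on failure

// Cleanup
io_destroy(ctx);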
Limitations of libaio:
  • Truly asynchronous only with O_DIRECT; buffered I/O can block inside io_submit()
  • Limited to block I/O
  • At least one system call to submit and another to reap completions

io_uring: Modern Async I/O

io_uring Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         io_uring ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  User Space                     Kernel Space                         │
│  ┌────────────────────────────┐ ┌────────────────────────────────┐  │
│  │                            │ │                                │  │
│  │    Application             │ │                                │  │
│  │                            │ │                                │  │
│  │    ┌──────────────────┐    │ │                                │  │
│  │    │                  │    │ │                                │  │
│  │    │  SQE Pool        │    │ │    ┌──────────────────────┐   │  │
│  │    │  (submissions)   │────┼─┼───▶│   Submission Queue   │   │  │
│  │    │                  │    │ │    │   (SQ) Ring Buffer   │   │  │
│  │    └──────────────────┘    │ │    └──────────┬───────────┘   │  │
│  │                            │ │               │                │  │
│  │                            │ │               ▼                │  │
│  │                            │ │    ┌──────────────────────┐   │  │
│  │                            │ │    │   Kernel Processing  │   │  │
│  │                            │ │    │   - Syscall handler  │   │  │
│  │                            │ │    │   - Async workers    │   │  │
│  │                            │ │    └──────────┬───────────┘   │  │
│  │                            │ │               │                │  │
│  │    ┌──────────────────┐    │ │               ▼                │  │
│  │    │                  │◀───┼─┼────┌──────────────────────┐   │  │
│  │    │  CQE Pool        │    │ │    │   Completion Queue   │   │  │
│  │    │  (completions)   │    │ │    │   (CQ) Ring Buffer   │   │  │
│  │    │                  │    │ │    └──────────────────────┘   │  │
│  │    └──────────────────┘    │ │                                │  │
│  │                            │ │                                │  │
│  └────────────────────────────┘ └────────────────────────────────┘  │
│                                                                      │
│  Zero-copy: SQ and CQ are mmap'd shared memory                      │
│  No syscall needed for submit (SQPOLL mode)                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
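The SQ and CQ are single-producer, single-consumer rings: one side only advances the tail, the other only advances the head. The following user-space sketch shows that pattern with C11 atomics. It is not the io_uring ABI; `spsc_ring`, `ring_push()`, and `ring_pop()` are hypothetical names, and the 8-entry size is arbitrary.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

struct spsc_ring {
    _Atomic uint32_t head;              /* advanced only by the consumer */
    _Atomic uint32_t tail;              /* advanced only by the producer */
    uint64_t entries[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
static bool ring_push(struct spsc_ring *r, uint64_t value)
{
    assert(r != NULL);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;
    r->entries[tail & RING_MASK] = value;
    /* Release: the entry must be visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_pop(struct spsc_ring *r, uint64_t *value)
{
    assert(r != NULL && value != NULL);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *value = r->entries[head & RING_MASK];
    /* Release: the slot is free for reuse only after it has been read. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, no lock is needed; correctness rests on the release/acquire pairing. In io_uring the application is the producer on the SQ and the consumer on the CQ, with the kernel in the opposite roles.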

io_uring Example

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH 32
#define READ_SIZE 4096  // not BLOCK_SIZE: <linux/fs.h> defines that name and may be pulled in

int main(void) {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buffer[READ_SIZE];

    // Initialize io_uring (liburing returns -errno rather than setting errno)
    int ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    // Open file
    int fd = open("test.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        io_uring_queue_exit(&ring);
        return 1;
    }

    // Get a submission queue entry (NULL would mean the SQ is full)
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "submission queue full\n");
        close(fd);
        io_uring_queue_exit(&ring);
        return 1;
    }

    // Prepare a read operation
    io_uring_prep_read(sqe, fd, buffer, READ_SIZE, 0);

    // Set user data for identifying this request
    io_uring_sqe_set_data(sqe, (void*)123);

    // Submit the request
    io_uring_submit(&ring);

    // Wait for completion
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
        close(fd);
        io_uring_queue_exit(&ring);
        return 1;
    }

    // Check result (cqe->res is a byte count or a negative errno)
    if (cqe->res < 0) {
        printf("Read failed: %s\n", strerror(-cqe->res));
    } else {
        printf("Read %d bytes\n", cqe->res);
    }

    // Mark CQE as consumed
    io_uring_cqe_seen(&ring, cqe);

    // Cleanup
    close(fd);
    io_uring_queue_exit(&ring);

    return 0;
}

io_uring Advanced Features

// Registered files (avoid fd lookup per operation)
int fds[10];
io_uring_register_files(&ring, fds, 10);
// Use fixed file index instead of fd
sqe->flags |= IOSQE_FIXED_FILE;
sqe->fd = 0;  // Index into registered array

// Registered buffers (avoid page pinning per operation)
struct iovec iovecs[10];
io_uring_register_buffers(&ring, iovecs, 10);
io_uring_prep_read_fixed(sqe, fd, buffer, len, offset, buf_index);

// SQPOLL: Kernel polls SQ, no submit syscall needed
struct io_uring_params params = {
    .flags = IORING_SETUP_SQPOLL,
    .sq_thread_idle = 2000,  // ms before thread sleeps
};
io_uring_queue_init_params(QUEUE_DEPTH, &ring, &params);

// Linked operations (dependent I/Os)
sqe1 = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe1, fd, buf1, len, 0);
sqe1->flags |= IOSQE_IO_LINK;  // Next SQE depends on this

sqe2 = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe2, fd, buf2, len, 0);  // Only runs if read succeeds

io_uring Supported Operations

Category | Operations
File I/O | read, write, readv, writev, fsync, sync_file_range
Network  | accept, connect, send, recv, sendmsg, recvmsg
Other    | openat, close, statx, poll, timeout, cancel
Advanced | splice, tee, provide_buffers, multishot accept

Direct I/O and O_DIRECT

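#define _GNU_SOURCE          // exposes O_DIRECT in <fcntl.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>        // BLKSSZGET

// Open with O_DIRECT
int fd = open("/path/to/file", O_RDWR | O_DIRECT);

// Buffer address must be aligned; 4096 satisfies both 512-byte and 4K-sector devices
void *buffer;
posix_memalign(&buffer, 4096, 4096);

// Offset and size must be aligned to the logical block size
pread(fd, buffer, 4096, 0);  // Read 4KB at offset 0

// Get required alignment (BLKSSZGET needs a block device fd, not a regular file)
int logical_block_size;
int bdev = open("/dev/sda", O_RDONLY);
ioctl(bdev, BLKSSZGET, &logical_block_size);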
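A misaligned O_DIRECT request typically fails with EINVAL, so it helps to validate the three alignment conditions (buffer address, length, file offset) before issuing I/O. The helpers below are a minimal sketch with hypothetical names; they assume the block size is a power of two, and the exact requirement still depends on the device, filesystem, and kernel version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when x is a non-zero power of two. */
static bool is_pow2(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* True when the buffer address, the length, and the file offset are all
 * multiples of block_size, the usual precondition for O_DIRECT I/O. */
static bool odirect_aligned(const void *buf, size_t len, uint64_t offset,
                            size_t block_size)
{
    assert(is_pow2(block_size));
    size_t mask = block_size - 1;
    return ((uintptr_t)buf & mask) == 0 &&
           (len & mask) == 0 &&
           (offset & mask) == 0;
}

/* Round len up to the next multiple of block_size. */
static size_t round_up_block(size_t len, size_t block_size)
{
    assert(is_pow2(block_size));
    return (len + block_size - 1) & ~(block_size - 1);
}
```

A caller that wants to read 4097 bytes from a 4K-sector device would round the length up to 8192 with `round_up_block()`, allocate that much with `posix_memalign()`, and discard the padding after the read.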

When to Use O_DIRECT

Use O_DIRECT               | Don’t Use O_DIRECT
Database engines           | General applications
Application manages cache  | Benefit from page cache
Predictable latency needed | Small, random reads
Very large sequential I/O  | Typical file access

I/O Profiling and Debugging

blktrace and blkparse

# Start tracing
blktrace -d /dev/sda -o - | blkparse -i -

# Output interpretation:
#   8,0    3        1     0.000000000  1234  Q  WS 1000 + 8 [myapp]
#   │      │        │     │            │     │  │  │      │  │
#   │      │        │     │            │     │  │  │      │  └─ Process
#   │      │        │     │            │     │  │  │      └─ Length (sectors)
#   │      │        │     │            │     │  │  └─ Start sector
#   │      │        │     │            │     │  └─ Write, Sync
#   │      │        │     │            │     └─ Action (Q=queued, C=complete)
#   │      │        │     │            └─ PID
#   │      │        │     └─ Timestamp
#   │      │        └─ Sequence
#   │      └─ CPU
#   └─ Major,minor

# Generate I/O stats
blktrace -d /dev/sda -w 10 -o trace
blkparse -i trace.blktrace.0 -d trace.bin
btt -i trace.bin

# Actions:
# Q = Queued
# G = Get request
# I = Inserted
# D = Dispatched
# C = Completed
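When post-processing blkparse text output, each event line in the default format shown above can be split into fields with a single `sscanf()` call. The sketch below uses a hypothetical `blk_event` struct and `parse_blkparse_line()` helper; it only handles lines of the `sector + length [name]` shape and rejects everything else (plug/unplug events, summary lines).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical container for one blkparse event line. */
struct blk_event {
    unsigned int major, minor, cpu;
    unsigned long long seq;
    double timestamp;            /* seconds since trace start */
    unsigned int pid;
    char action[4];              /* Q, G, I, D, C, ... */
    char rwbs[8];                /* e.g. "WS" = write, sync */
    unsigned long long sector;   /* start sector */
    unsigned int nr_sectors;     /* length in 512-byte sectors */
    char comm[32];               /* text inside the brackets */
};

/* Parses a line such as
 *   "8,0 3 1 0.000000000 1234 Q WS 1000 + 8 [myapp]"
 * Returns 0 on success, -1 if the line does not match this shape. */
static int parse_blkparse_line(const char *line, struct blk_event *ev)
{
    assert(line != NULL && ev != NULL);
    int n = sscanf(line, "%u,%u %u %llu %lf %u %3s %7s %llu + %u [%31[^]]]",
                   &ev->major, &ev->minor, &ev->cpu, &ev->seq,
                   &ev->timestamp, &ev->pid, ev->action, ev->rwbs,
                   &ev->sector, &ev->nr_sectors, ev->comm);
    return n == 11 ? 0 : -1;
}
```

For the sample line in the interpretation guide, the parsed event has `sector` 1000 and `nr_sectors` 8, that is, a 4096-byte synchronous write queued by PID 1234.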

BPF-based Tools

# I/O latency histogram
sudo biolatency-bpfcc 10

# Per-I/O trace with latency
sudo biosnoop-bpfcc

# I/O size histogram per process
sudo bitesize-bpfcc

# I/O by process
sudo biotop-bpfcc

# Slow file reads/writes (over 10 ms) for one process
sudo fileslower-bpfcc 10 -p $(pidof myapp)

iostat Analysis

# Extended statistics
iostat -xz 1

# Key metrics:
# r/s, w/s     - Reads/writes per second
# rkB/s, wkB/s - Throughput
# await        - Average I/O time in ms, queue wait plus service (split into r_await/w_await in newer sysstat)
# svctm        - Service time (deprecated, removed in newer)
# %util        - Device utilization

# Example output:
# Device  r/s    w/s   rkB/s   wkB/s  await  %util
# sda    100.0  50.0  1024.0   512.0   2.5   15.0

# Watch for:
# - High await with low %util = queue congestion
# - %util = 100% does not prove saturation on devices that serve requests in parallel (SSD, NVMe, RAID)
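The await and %util columns are derived from cumulative kernel counters (exposed in /proc/diskstats) sampled at two points in time. The sketch below shows the arithmetic under simplifying assumptions: `disk_sample` is a hypothetical struct holding only the counters needed, discard and flush I/O are ignored, and the real iostat implementation may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of the cumulative counters for one device. */
struct disk_sample {
    unsigned long long reads, writes;      /* completed I/Os */
    unsigned long long read_ms, write_ms;  /* total time spent on them, ms */
    unsigned long long io_ticks_ms;        /* time with I/O in flight, ms */
};

/* Average time per completed I/O between two samples, in milliseconds.
 * Includes time spent queued, which is why it rises under congestion. */
static double avg_await_ms(const struct disk_sample *prev,
                           const struct disk_sample *cur)
{
    assert(prev != NULL && cur != NULL);
    unsigned long long ios = (cur->reads - prev->reads) +
                             (cur->writes - prev->writes);
    if (ios == 0)
        return 0.0;
    unsigned long long ms = (cur->read_ms - prev->read_ms) +
                            (cur->write_ms - prev->write_ms);
    return (double)ms / (double)ios;
}

/* Share of the interval during which at least one I/O was in flight. */
static double util_percent(const struct disk_sample *prev,
                           const struct disk_sample *cur,
                           unsigned long long interval_ms)
{
    assert(prev != NULL && cur != NULL && interval_ms > 0);
    return 100.0 * (double)(cur->io_ticks_ms - prev->io_ticks_ms) /
           (double)interval_ms;
}
```

Feeding in a one-second interval with 100 reads, 50 writes, 375 ms of accumulated I/O time, and 150 ms of busy time reproduces the example row: await 2.5 ms and %util 15. The busy-time definition also explains why 100% does not imply saturation: one in-flight request counts the same as sixty-four.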

Interview Deep Dives

Complete flow:
  1. Application: write(fd, buf, len)
  2. VFS Layer:
    • Find inode from fd
    • Call filesystem’s write_iter
  3. Page Cache (buffered write):
    • Find/create page in cache
    • Copy data from user space
    • Mark page dirty
    • Return to application (write “complete”)
  4. Writeback (background or sync):
    • Writeback worker (a kworker flusher thread; pdflush in kernels before 2.6.32) wakes
    • Allocates bio for dirty pages
    • Submits bio to block layer
  5. Block Layer:
    • bio enters request queue
    • Scheduler may merge/reorder
    • Dispatch to driver
  6. Device Driver:
    • Translate to device commands
    • DMA data to device
  7. Hardware:
    • Device writes to persistent storage
    • Interrupt on completion
  8. Completion:
    • Driver handles interrupt
    • bio completion callback
    • Page marked clean
sync():
  • Triggers writeback for ALL dirty data
  • On Linux, waits for the writes to complete (POSIX allows it to return earlier)
  • System-wide operation
fsync(fd):
  • Flushes data AND metadata for specific file
  • Waits for completion
  • Does not guarantee that a new file’s directory entry is durable; fsync the parent directory as well
fdatasync(fd):
  • Flushes data for specific file
  • Only flushes metadata if required for data retrieval
  • Skips non-essential metadata (atime, mtime)
Performance comparison:
Append 1 byte to file:
- fdatasync: Write data + update size = 2 writes
- fsync: Write data + update all metadata = 2+ writes

Update 1 byte in existing file:
- fdatasync: Just write data = 1 write
- fsync: Write data + update mtime = 2 writes
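Making a newly created file survive a crash takes two flushes: `fsync()` on the file for its data and inode, and `fsync()` on the parent directory for the directory entry (the fsync(2) manual page calls this out explicitly). The function below is a minimal sketch of that sequence; `durable_create()` is a hypothetical helper, and it treats a short write as failure instead of retrying.

```c
#define _POSIX_C_SOURCE 200809L  /* for openat() and O_DIRECTORY */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates dir/name, writes len bytes, then flushes both the file and the
 * directory entry. Returns 0 on success, -1 on any failure. */
static int durable_create(const char *dir, const char *name,
                          const void *data, size_t len)
{
    assert(dir != NULL && name != NULL && data != NULL);

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        close(dfd);
        return -1;
    }

    int rc = -1;
    ssize_t written = write(fd, data, len);
    if (written == (ssize_t)len &&
        fsync(fd) == 0 &&      /* file data and inode metadata */
        fsync(dfd) == 0)       /* directory entry for the new name */
        rc = 0;

    close(fd);
    close(dfd);
    return rc;
}
```

Databases and package managers usually combine this with write-to-temp-then-rename so that readers never observe a half-written file; the directory fsync is what makes the rename itself durable.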
Use io_uring when:
  1. High I/O rate: Thousands of IOPS
    • Syscall overhead becomes significant
    • Batching amortizes overhead
  2. Network servers: Accept/read/write patterns
    • Single interface for all I/O
    • Async accept with multishot
  3. Low latency requirements:
    • SQPOLL avoids syscall entirely
    • Registered files/buffers reduce overhead
  4. Mixed I/O workloads:
    • File + network in same ring
    • Unified completion handling
Don’t use io_uring for:
  • Simple applications (complexity not worth it)
  • Few I/O operations (no benefit)
  • Portability required (Linux-specific)
Systematic approach:
  1. Identify the bottleneck:
    iostat -xz 1  # Check device metrics
    # High await + low %util = queueing issue
    # High %util = device saturation
    
  2. Check for throttling:
    # I/O cgroup throttling
    cat /sys/fs/cgroup/<path>/io.stat
    cat /sys/fs/cgroup/<path>/io.pressure
    
  3. Profile I/O patterns:
    sudo biosnoop-bpfcc -d sda  # Per-I/O latency
    sudo bitesize-bpfcc         # I/O size distribution
    
  4. Check scheduler:
    cat /sys/block/sda/queue/scheduler
    # Try different scheduler
    echo mq-deadline > /sys/block/sda/queue/scheduler
    
  5. Application level:
    strace -e trace=read,write,fsync -p <pid>
    # Look for sync patterns
    

NVMe Specifics

# NVMe device information
nvme list
nvme id-ctrl /dev/nvme0

# Queue depth and namespaces
nvme list-ns /dev/nvme0
cat /sys/block/nvme0n1/queue/nr_requests

# Per-device I/O counters (same fields as /proc/diskstats)
cat /sys/block/nvme0n1/stat

# Temperature and health
nvme smart-log /dev/nvme0

NVMe vs SATA SSD

Aspect         | SATA SSD    | NVMe SSD
Interface      | SATA (AHCI) | PCIe
Queue depth    | 32          | 65535 per queue
Queues         | 1           | Up to 65535
Latency        | ~100μs      | ~10-20μs
Throughput     | ~550 MB/s   | ~3000+ MB/s
Best scheduler | mq-deadline | none

Interview Deep-Dive

Strong Answer:
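  • Traditional I/O requires one syscall per operation: each read() or write() call costs 200-500 CPU cycles for the user-kernel transition alone. For a database doing 100K IOPS, that is 20-50 million cycles per second, roughly 1-2% of a 3 GHz core before counting the kernel work behind each call; speculative-execution mitigations can raise the transition cost considerably.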
  • io_uring eliminates this by using two memory-mapped ring buffers shared between user space and the kernel. The Submission Queue (SQ) is where user space writes Submission Queue Entries (SQEs) describing I/O operations. The Completion Queue (CQ) is where the kernel writes Completion Queue Entries (CQEs) with results. Both are lock-free single-producer single-consumer rings, so no locks are needed in typical operation; each side publishes its head or tail index with release/acquire memory ordering.
  • In normal mode, user space fills SQEs, then calls io_uring_enter() to notify the kernel. The kernel processes the SQEs and writes CQEs. This batches multiple operations per syscall, amortizing the transition cost. Submitting 32 operations requires only 1 syscall instead of 32.
  • SQPOLL mode eliminates even that single syscall. The kernel spawns a dedicated polling thread (io_sq_thread) that continuously polls the SQ for new entries. When user space writes an SQE, the kernel thread picks it up without any syscall at all. The kernel thread sleeps after sq_thread_idle milliseconds of inactivity and sets IORING_SQ_NEED_WAKEUP in the SQ ring flags; user space must then call io_uring_enter() with IORING_ENTER_SQ_WAKEUP to wake it (io_uring_submit() in liburing performs this check). This achieves zero-syscall I/O submission for sustained workloads.
  • Additional optimizations: registered files (io_uring_register_files()) avoid per-operation file descriptor lookup, and registered buffers (io_uring_register_buffers()) avoid per-operation page pinning. Combined, these can reduce per-I/O overhead to under 100 cycles.
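The effect of batching on transition overhead is simple arithmetic, worked through in the sketch below. The 500-cycle figure is the upper bound quoted above; the 3 GHz clock is an assumption, both function names are hypothetical, and the result covers only the user-kernel transition, not the kernel work done per request.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per second spent on user-kernel transitions when ops_per_sec
 * operations are submitted in batches of `batch` per syscall. */
static uint64_t transition_cycles_per_sec(uint64_t ops_per_sec,
                                          uint64_t batch,
                                          uint64_t cycles_per_syscall)
{
    assert(batch > 0);
    uint64_t syscalls = (ops_per_sec + batch - 1) / batch;  /* round up */
    return syscalls * cycles_per_syscall;
}

/* Share of one core in basis points (1 bp = 0.01%), rounded down. */
static uint64_t core_share_bp(uint64_t cycles_per_sec, uint64_t core_hz)
{
    assert(core_hz > 0);
    return cycles_per_sec * 10000 / core_hz;
}
```

At 100K IOPS with one syscall per operation, the transitions cost 50,000,000 cycles per second, or 166 basis points (about 1.7%) of a 3 GHz core. Batching 32 operations per `io_uring_enter()` cuts that to 3,125 syscalls and 1,562,500 cycles per second, about 0.05% of the core.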
Follow-up: What are the security implications of SQPOLL mode, and why does it require elevated privileges?
Follow-up Answer:
  • Before Linux 5.11, SQPOLL required root (CAP_SYS_ADMIN) and registered (fixed) files. Kernel 5.11 dropped the fixed-file requirement (advertised through the IORING_FEAT_SQPOLL_NONFIXED feature flag) and allowed SQPOLL with CAP_SYS_NICE; from 5.13 no special privilege is needed. The restriction existed because the kernel thread runs on behalf of the user process and continuously consumes CPU cycles even when the process is not actively submitting I/O. A malicious user could create many SQPOLL io_uring instances to consume CPU resources in the kernel, effectively creating a kernel-space denial of service. Additionally, the kernel thread runs with the credentials of the creating process, so careful accounting is needed to charge CPU time correctly to the right cgroup. The poller can be pinned to a specific CPU with IORING_SETUP_SQ_AFF and sq_thread_cpu, and several rings can share one poller thread through IORING_SETUP_ATTACH_WQ.
Strong Answer:
  • For NVMe-backed database servers, my recommendation is none (no scheduler), with the caveat that workload testing should validate this.
  • mq-deadline maintains separate read and write queues with deadline guarantees: reads default to 500ms deadline, writes to 5000ms. It prioritizes reads over writes because reads are typically in the synchronous path. This is valuable for HDD where seeking is expensive and reordering requests by sector can save milliseconds. But NVMe devices have no seek time, so reordering adds latency without reducing device-side cost.
  • kyber uses a token-based system with target latencies for reads and writes. When I/O latency exceeds the target, kyber reduces the number of in-flight requests (throttles) to reduce queueing. This is useful for shared environments where multiple workloads compete for NVMe bandwidth. However, for a dedicated database server, the database’s own I/O scheduler (InnoDB’s adaptive flushing, PostgreSQL’s bgwriter) already manages I/O prioritization.
  • none passes I/O requests directly to the NVMe device with no kernel-side reordering or scheduling. NVMe devices have internal schedulers optimized for their flash topology (channel interleaving, die-level parallelism), and the device’s 64K queue depth per submission queue means it can handle massive parallelism. Adding a kernel scheduler on top adds latency (microseconds per request for lock acquisition and queue insertion) without improving throughput.
  • The exception: if multiple containers share the NVMe with different priority classes, I would use kyber or mq-deadline with I/O cgroup limits to prevent noisy neighbors.
Follow-up: How does the blk-mq multi-queue architecture map to NVMe submission queues?
Follow-up Answer:
  • NVMe devices expose multiple hardware submission/completion queue pairs (typically one per CPU core). The blk-mq layer creates per-CPU software staging queues that map to these hardware queues. When a thread submits I/O, the request enters the software queue for the thread’s current CPU, is optionally processed by the I/O scheduler, and then dispatched to the corresponding hardware submission queue. When the device offers at least one queue pair per CPU, each CPU feeds its own hardware queue and cross-CPU lock contention disappears; with fewer hardware queues than CPUs, several software queues share one hardware queue. Related tunables are /sys/block/nvme0n1/queue/nr_requests (per-queue depth) and interrupt affinity (which CPUs handle completion interrupts). For optimal performance, the completion interrupt for a hardware queue should be handled by the same CPU that submitted the I/O, keeping the data cache warm.
Strong Answer:
  • First, I would capture the I/O latency distribution with sudo biolatency-bpfcc -D 10 (grouped by device) to confirm the bimodal pattern: most I/Os under 1ms with a tail at 50-100ms. Then sudo biosnoop-bpfcc -d nvme0n1 to see individual slow I/Os with their PID, operation type, sector, and size.
  • Common causes of periodic I/O spikes on SSDs: First, garbage collection: SSD firmware periodically relocates live data and erases blocks that hold stale pages, which can stall writes for 10-100ms. This manifests as periodic write latency spikes regardless of host activity. Check nvme smart-log /dev/nvme0 for wear and health indicators such as percentage_used. Second, journal commits: ext4/XFS periodically commit journal transactions (ext4 defaults to every 5 seconds), which issues synchronous writes that can queue behind other I/O. On ext4, check with bpftrace -e 'kprobe:jbd2_journal_commit_transaction { printf("%s\n", comm); }'. Third, filesystem metadata operations: sync, fsync, or flusher threads writing dirty pages can cause queue depth spikes.
  • At the block layer, I would check queue depth using bpftrace to trace block_rq_issue and block_rq_complete events, computing the instantaneous queue depth. If the spike correlates with high queue depth, the device is saturated. If the spike happens at low queue depth, the device itself is stalling (firmware GC, thermal throttling, or defective NAND).
  • I would also check the I/O scheduler: cat /sys/block/nvme0n1/queue/scheduler — if it is not none, try switching to rule out scheduler-induced delays. And check I/O cgroup throttling: cat /sys/fs/cgroup/<path>/io.stat for the relevant device.
Follow-up: How would you distinguish between device-side stalls and kernel-side queueing delays?
Follow-up Answer:
  • I would trace both block_rq_issue (when the kernel dispatches the request to the driver) and block_rq_complete (when the device signals completion). The delta between issue and complete is purely device-side latency. Separately, I would trace block_rq_insert (when the request enters the scheduler queue) and block_rq_issue — this delta is kernel scheduler queueing time. If the device-side delta shows spikes, the SSD is stalling. If the scheduler delta shows spikes, the kernel is holding requests in the queue (possibly throttled by cgroup I/O limits or the scheduler’s admission control).
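Once the three timestamps for a request have been collected (for example by a bpftrace script keyed on the request), attributing a slow I/O is a matter of comparing two deltas. The sketch below shows that classification; `rq_times`, `stall_kind`, and `classify_stall()` are hypothetical names, and the threshold is whatever latency the investigation treats as slow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-request timestamps, in nanoseconds. */
struct rq_times {
    uint64_t insert_ns;    /* block_rq_insert: entered the scheduler queue */
    uint64_t issue_ns;     /* block_rq_issue: handed to the driver */
    uint64_t complete_ns;  /* block_rq_complete: device signalled completion */
};

enum stall_kind { STALL_NONE, STALL_QUEUE, STALL_DEVICE };

/* Time spent waiting in the kernel before dispatch. */
static uint64_t queue_ns(const struct rq_times *t)
{
    return t->issue_ns - t->insert_ns;
}

/* Time spent in the driver and device after dispatch. */
static uint64_t device_ns(const struct rq_times *t)
{
    return t->complete_ns - t->issue_ns;
}

/* Attributes a slow request to whichever phase dominated. */
static enum stall_kind classify_stall(const struct rq_times *t,
                                      uint64_t threshold_ns)
{
    assert(t != NULL);
    assert(t->insert_ns <= t->issue_ns && t->issue_ns <= t->complete_ns);
    uint64_t q = queue_ns(t);
    uint64_t d = device_ns(t);
    if (q < threshold_ns && d < threshold_ns)
        return STALL_NONE;
    return d >= q ? STALL_DEVICE : STALL_QUEUE;
}
```

With a 10 ms threshold, a request issued after 0.1 ms that takes 60 ms to complete is classified as a device stall, while one that waits 50 ms before issue and then completes in 0.2 ms points at kernel-side queueing.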
