# I/O Subsystem

> Master Linux I/O internals: block layer, schedulers, and io_uring for high-performance applications

<Frame>
  <img src="https://mintcdn.com/devweeekends/1GcDwVN8SzYRbJg1/images/courses/linux-internals/io-subsystem-concept.svg?fit=max&auto=format&n=1GcDwVN8SzYRbJg1&q=85&s=905809384cb61c9d4dd3d4dba8f696d4" alt="Linux I/O Subsystem - Block layer, schedulers, and the journey from VFS to disk" width="1080" height="1080" data-path="images/courses/linux-internals/io-subsystem-concept.svg" />
</Frame>

The Linux I/O subsystem handles all storage operations. Understanding the block layer, I/O schedulers, and modern async I/O with io\_uring is essential for building and debugging high-performance systems.

<Info>
  **Prerequisites**: Filesystem fundamentals, system calls\
  **Interview Focus**: I/O schedulers, async I/O, io\_uring\
  **Time to Master**: 4-5 hours
</Info>

***

## Block Layer Architecture

<img src="https://mintcdn.com/devweeekends/1GcDwVN8SzYRbJg1/images/courses/linux-io-stack.svg?fit=max&auto=format&n=1GcDwVN8SzYRbJg1&q=85&s=a9e9345afe377eb3c3fbd25cbfcf62ec" alt="Linux I/O Stack" width="1080" height="1080" data-path="images/courses/linux-io-stack.svg" />

***

## bio and request Structures

### The bio Structure

```c theme={null}
struct bio {
    struct block_device *bi_bdev;    // Target device
    unsigned int        bi_opf;       // Operation and flags
    unsigned short      bi_vcnt;      // Number of bio_vecs
    unsigned short      bi_max_vecs;  // Max bio_vecs
    atomic_t            bi_cnt;       // Reference count
    struct bio_vec      *bi_io_vec;   // Vector of pages
    struct bvec_iter    bi_iter;      // Iterator: bi_iter.bi_sector = start sector,
                                      //           bi_iter.bi_size   = remaining bytes
    bio_end_io_t        *bi_end_io;   // Completion callback
    void                *bi_private;  // Private data
};

struct bio_vec {
    struct page *bv_page;    // Page containing data
    unsigned int bv_len;     // Length of data
    unsigned int bv_offset;  // Offset within page
};
```
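To make the relationship between a bio and its `bio_vec` segments concrete, here is a minimal user-space sketch. It is not kernel code: `struct seg`, `bio_total_bytes`, and `bio_total_sectors` are hypothetical names that only mirror `bv_len` and `bv_offset`, and the 512-byte sector unit is the one the block layer uses for `bi_sector`.

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SIZE 512u  /* the block layer addresses storage in 512-byte sectors */

/* User-space stand-in for struct bio_vec (hypothetical; no struct page here). */
struct seg {
    unsigned int len;     /* bytes in this segment, like bv_len */
    unsigned int offset;  /* offset inside the page, like bv_offset */
};

/* Total payload of a bio: the sum of its segment lengths. */
unsigned int bio_total_bytes(const struct seg *segs, size_t n)
{
    unsigned int total = 0;
    for (size_t i = 0; i < n; i++)
        total += segs[i].len;
    return total;
}

/* Number of 512-byte sectors covered; the payload must be sector-aligned. */
unsigned int bio_total_sectors(const struct seg *segs, size_t n)
{
    unsigned int bytes = bio_total_bytes(segs, n);
    assert(bytes % SECTOR_SIZE == 0);
    return bytes / SECTOR_SIZE;
}
```

A bio built from one full 4096-byte page plus two 2048-byte half pages carries 8192 bytes, which is 16 sectors starting at `bi_iter.bi_sector`.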

### bio Lifecycle

```
┌─────────────────────────────────────────────────────────────────────┐
│                        BIO LIFECYCLE                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. Allocate bio                                                     │
│     bio = bio_alloc(GFP_KERNEL, nr_pages);                          │
│                                                                      │
│  2. Set up bio                                                       │
│     bio->bi_bdev = block_device;                                    │
│     bio->bi_iter.bi_sector = start_sector;                          │
│     bio->bi_opf = REQ_OP_READ;                                      │
│     bio->bi_end_io = my_completion_handler;                         │
│                                                                      │
│  3. Add pages                                                        │
│     bio_add_page(bio, page, len, offset);                           │
│                                                                      │
│  4. Submit bio                                                       │
│     submit_bio(bio);                                                │
│         │                                                            │
│         ├─▶ Block layer merges with other bios                      │
│         ├─▶ I/O scheduler reorders                                  │
│         └─▶ Driver dispatches to hardware                           │
│                                                                      │
│  5. Completion (interrupt context)                                   │
│     bi_end_io(bio);                                                 │
│         └─▶ bio_put(bio);                                           │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

***

## I/O Schedulers

### Multi-Queue Architecture (blk-mq)

```
┌─────────────────────────────────────────────────────────────────────┐
│                    MULTI-QUEUE BLOCK LAYER (blk-mq)                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Applications submitting I/O (multiple threads)                      │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐                   │
│  │ Thread 1│ │ Thread 2│ │ Thread 3│ │ Thread 4│                   │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘                   │
│       │          │          │          │                            │
│       ▼          ▼          ▼          ▼                            │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │              Software Staging Queues (per CPU)               │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐        │  │
│  │  │ CPU 0    │ │ CPU 1    │ │ CPU 2    │ │ CPU 3    │        │  │
│  │  │ sw queue │ │ sw queue │ │ sw queue │ │ sw queue │        │  │
│  │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘        │  │
│  └───────│────────────│────────────│────────────│───────────────┘  │
│          │            │            │            │                   │
│          └──────┬─────┴─────┬──────┴─────┬──────┘                  │
│                 │           │            │                          │
│                 ▼           ▼            ▼                          │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                Hardware Dispatch Queues                       │  │
│  │  ┌─────────────────┐  ┌─────────────────┐                    │  │
│  │  │  hw queue 0     │  │  hw queue 1     │  ← maps to NVMe   │  │
│  │  │  (NVMe sq 0)    │  │  (NVMe sq 1)    │    submission     │  │
│  │  └────────┬────────┘  └────────┬────────┘    queues         │  │
│  └───────────│─────────────────────│────────────────────────────┘  │
│              ▼                     ▼                                │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                        NVMe SSD                                 ││
│  │           (multiple submission/completion queues)               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
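The fan-in from per-CPU software queues to a smaller number of hardware queues can be sketched with a simple round-robin mapping. This is an assumption made for illustration: the kernel's real mapping is topology-aware and driver-specific, and `hwq_for_cpu` and `cpus_sharing_queue` are hypothetical helper names.

```c
#include <assert.h>

/* Hypothetical mapping: CPU n feeds hardware queue n mod nr_hw_queues. */
unsigned int hwq_for_cpu(unsigned int cpu, unsigned int nr_hw_queues)
{
    assert(nr_hw_queues > 0);
    return cpu % nr_hw_queues;
}

/* How many CPUs end up feeding one hardware queue under that mapping. */
unsigned int cpus_sharing_queue(unsigned int nr_cpus,
                                unsigned int nr_hw_queues,
                                unsigned int hwq)
{
    unsigned int count = 0;
    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
        if (hwq_for_cpu(cpu, nr_hw_queues) == hwq)
            count++;
    return count;
}
```

With the 4 CPUs and 2 hardware queues drawn above, each hardware queue is fed by 2 software queues. When the device exposes at least one hardware queue per CPU, every CPU gets a queue of its own.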

### Scheduler Comparison

<Tabs>
  <Tab title="mq-deadline">
    **Best for**: Latency-sensitive workloads, databases

    **Characteristics**:

    * Read and write request deadlines
    * Read priority over writes (reads typically blocking)
    * Batch dispatch for efficiency

    **Configuration**:

    ```bash theme={null}
    echo mq-deadline > /sys/block/sda/queue/scheduler

    # Tune deadlines (milliseconds)
    cat /sys/block/sda/queue/iosched/read_expire    # 500
    cat /sys/block/sda/queue/iosched/write_expire   # 5000

    # Batch size
    cat /sys/block/sda/queue/iosched/fifo_batch     # 16
    ```
  </Tab>

  <Tab title="bfq">
    **Best for**: Interactive desktops, mixed workloads, fairness

    **Characteristics**:

    * Budget Fair Queueing (per-process fairness)
    * Low latency guarantee for interactive processes
    * I/O cgroups integration
    * Higher CPU overhead

    **Configuration**:

    ```bash theme={null}
    echo bfq > /sys/block/sda/queue/scheduler

    # Low-latency heuristics for interactive processes
    cat /sys/block/sda/queue/iosched/low_latency  # 1 = enabled

    # Idle time spent waiting for the next request from the same queue
    cat /sys/block/sda/queue/iosched/slice_idle  # 8 (ms)
    ```
  </Tab>

  <Tab title="kyber">
    **Best for**: High-IOPS NVMe devices, scale-out workloads

    **Characteristics**:

    * Token-based throttling
    * Separate read/write queues
    * Low CPU overhead
    * Target latency-based

    **Configuration**:

    ```bash theme={null}
    echo kyber > /sys/block/nvme0n1/queue/scheduler

    # Target latencies (nanoseconds)
    cat /sys/block/nvme0n1/queue/iosched/read_lat_nsec   # 2000000  (2 ms)
    cat /sys/block/nvme0n1/queue/iosched/write_lat_nsec  # 10000000 (10 ms)
    ```
  </Tab>

  <Tab title="none">
    **Best for**: NVMe devices with internal scheduling

    **Characteristics**:

    * No software scheduling
    * Lowest CPU overhead
    * Relies on device-level scheduling
    * Best for high-end SSDs/NVMe

    **Configuration**:

    ```bash theme={null}
    echo none > /sys/block/nvme0n1/queue/scheduler

    # Verify
    cat /sys/block/nvme0n1/queue/scheduler
    # [none] mq-deadline kyber bfq
    ```
  </Tab>
</Tabs>
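The deadline idea behind mq-deadline (reads are preferred, but a request that has waited past its deadline must not starve) can be sketched as a small decision function. This is a deliberately simplified model, not the kernel's algorithm: the real scheduler also dispatches in sector order and in batches. `struct pending` and `pick_next` are hypothetical names, and the two expiry constants are the defaults shown in the mq-deadline tab.

```c
#include <assert.h>
#include <stdbool.h>

#define READ_EXPIRE_MS  500   /* default read_expire */
#define WRITE_EXPIRE_MS 5000  /* default write_expire */

static_assert(READ_EXPIRE_MS < WRITE_EXPIRE_MS, "reads expire sooner than writes");

/* Oldest queued request of one direction (hypothetical model). */
struct pending {
    bool present;              /* is anything queued in this direction? */
    unsigned long arrival_ms;  /* when the oldest request was queued */
};

/* Returns 'R', 'W', or 0 when nothing is queued. */
char pick_next(unsigned long now_ms, struct pending rd, struct pending wr)
{
    if (!rd.present && !wr.present)
        return 0;
    if (!rd.present)
        return 'W';
    if (!wr.present)
        return 'R';

    unsigned long rd_deadline = rd.arrival_ms + READ_EXPIRE_MS;
    unsigned long wr_deadline = wr.arrival_ms + WRITE_EXPIRE_MS;
    bool rd_expired = now_ms >= rd_deadline;
    bool wr_expired = now_ms >= wr_deadline;

    /* An expired write wins unless a read expired even earlier. */
    if (wr_expired && (!rd_expired || wr_deadline < rd_deadline))
        return 'W';
    return 'R';  /* default: reads go first */
}
```

In this model a fresh read beats a write that has been queued for several seconds, until the write crosses its 5000 ms deadline.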

***

## Asynchronous I/O

### Traditional AIO (libaio)

```c theme={null}
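#include <libaio.h>

// Assumes fd was opened with O_DIRECT and buffer is block-aligned
io_context_t ctx = 0;            // Must be zero before io_setup()
struct iocb cb;
struct iocb *cbs[1] = {&cb};
struct io_event events[1];

// Initialize a context for up to 128 in-flight requests
int ret = io_setup(128, &ctx);   // libaio returns -errno on failure
if (ret < 0) { /* handle error */ }

// Prepare read: 4096 bytes at offset 0
io_prep_pread(&cb, fd, buffer, 4096, 0);

// Submit; returns the number of iocbs accepted or -errno
ret = io_submit(ctx, 1, cbs);

// Wait for completion (min 1, max 1 event, no timeout)
int n = io_getevents(ctx, 1, 1, events, NULL);
// events[0].res holds the bytes transferred or -errno

// Cleanup
io_destroy(ctx);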
```

**Limitations of libaio**:

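* Truly asynchronous only with O\_DIRECT; buffered I/O can block inside `io_submit()`
* Designed for file and block I/O, not as a general interface for sockets
* At least one system call to submit and another to reap completions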

***

## io\_uring: Modern Async I/O

### io\_uring Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                         io_uring ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  User Space                     Kernel Space                         │
│  ┌────────────────────────────┐ ┌────────────────────────────────┐  │
│  │                            │ │                                │  │
│  │    Application             │ │                                │  │
│  │                            │ │                                │  │
│  │    ┌──────────────────┐    │ │                                │  │
│  │    │                  │    │ │                                │  │
│  │    │  SQE Pool        │    │ │    ┌──────────────────────┐   │  │
│  │    │  (submissions)   │────┼─┼───▶│   Submission Queue   │   │  │
│  │    │                  │    │ │    │   (SQ) Ring Buffer   │   │  │
│  │    └──────────────────┘    │ │    └──────────┬───────────┘   │  │
│  │                            │ │               │                │  │
│  │                            │ │               ▼                │  │
│  │                            │ │    ┌──────────────────────┐   │  │
│  │                            │ │    │   Kernel Processing  │   │  │
│  │                            │ │    │   - Syscall handler  │   │  │
│  │                            │ │    │   - Async workers    │   │  │
│  │                            │ │    └──────────┬───────────┘   │  │
│  │                            │ │               │                │  │
│  │    ┌──────────────────┐    │ │               ▼                │  │
│  │    │                  │◀───┼─┼────┌──────────────────────┐   │  │
│  │    │  CQE Pool        │    │ │    │   Completion Queue   │   │  │
│  │    │  (completions)   │    │ │    │   (CQ) Ring Buffer   │   │  │
│  │    │                  │    │ │    └──────────────────────┘   │  │
│  │    └──────────────────┘    │ │                                │  │
│  │                            │ │                                │  │
│  └────────────────────────────┘ └────────────────────────────────┘  │
│                                                                      │
│  Zero-copy: SQ and CQ are mmap'd shared memory                      │
│  No syscall needed for submit (SQPOLL mode)                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
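The SQ and CQ are both single-producer, single-consumer rings indexed by free-running head and tail counters. The following is a minimal user-space sketch of that pattern using C11 atomics. It is not the io\_uring ABI: `struct ring`, `ring_push`, `ring_pop`, and the 8-entry size are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                  /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1)

static_assert((RING_ENTRIES & RING_MASK) == 0, "ring size must be a power of two");

struct ring {
    _Atomic uint32_t head;               /* advanced only by the consumer */
    _Atomic uint32_t tail;               /* advanced only by the producer */
    uint64_t entries[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
bool ring_push(struct ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;
    r->entries[tail & RING_MASK] = value;
    /* Release: the entry must be visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool ring_pop(struct ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->entries[head & RING_MASK];
    /* Release: the slot is free for reuse only after it has been read. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, no lock is needed; the acquire/release pairs are the only synchronization. In io\_uring the application is the producer for the SQ and the consumer for the CQ, and the kernel takes the opposite roles.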

### io\_uring Example

```c theme={null}
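#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH 32
#define READ_SIZE 4096  // Not BLOCK_SIZE: <linux/fs.h> already defines that name

int main(void) {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buffer[READ_SIZE];
    int status = 1;

    // Initialize io_uring (liburing returns -errno instead of setting errno)
    int ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    // Open file
    int fd = open("test.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        goto out_ring;
    }

    // Get a submission queue entry (NULL means the SQ ring is full)
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "submission queue full\n");
        goto out_fd;
    }

    // Prepare a read operation
    io_uring_prep_read(sqe, fd, buffer, READ_SIZE, 0);

    // Set user data for identifying this request
    io_uring_sqe_set_data(sqe, (void*)123);

    // Submit the request (returns the number of SQEs submitted or -errno)
    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
        goto out_fd;
    }

    // Wait for completion
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
        goto out_fd;
    }

    // Check result: cqe->res is the byte count or -errno
    if (cqe->res < 0) {
        printf("Read failed: %s\n", strerror(-cqe->res));
    } else {
        printf("Read %d bytes\n", cqe->res);
        status = 0;
    }

    // Mark CQE as consumed
    io_uring_cqe_seen(&ring, cqe);

    // Cleanup
out_fd:
    close(fd);
out_ring:
    io_uring_queue_exit(&ring);
    return status;
}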
```

### io\_uring Advanced Features

```c theme={null}
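// Registered files (avoid fd lookup per operation)
int fds[10];  // Fill with open file descriptors before registering
io_uring_register_files(&ring, fds, 10);
// Pass the index into the registered array instead of a real fd.
// Set the flag after io_uring_prep_*(), which resets sqe->flags.
io_uring_prep_read(sqe, 0, buffer, len, offset);  // 0 = index into fds[]
sqe->flags |= IOSQE_FIXED_FILE;

// Registered buffers (avoid page pinning per operation)
struct iovec iovecs[10];  // Fill iov_base/iov_len before registering
io_uring_register_buffers(&ring, iovecs, 10);
// buffer must lie inside iovecs[buf_index]
io_uring_prep_read_fixed(sqe, fd, buffer, len, offset, buf_index);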

// SQPOLL: Kernel polls SQ, no submit syscall needed
struct io_uring_params params = {
    .flags = IORING_SETUP_SQPOLL,
    .sq_thread_idle = 2000,  // ms before thread sleeps
};
io_uring_queue_init_params(QUEUE_DEPTH, &ring, &params);

// Linked operations (dependent I/Os)
sqe1 = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe1, fd, buf1, len, 0);
sqe1->flags |= IOSQE_IO_LINK;  // Next SQE depends on this

sqe2 = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe2, fd, buf2, len, 0);  // Only runs if read succeeds
```

### io\_uring Supported Operations

| Category     | Operations                                           |
| ------------ | ---------------------------------------------------- |
| **File I/O** | read, write, readv, writev, fsync, sync\_file\_range |
| **Network**  | accept, connect, send, recv, sendmsg, recvmsg        |
| **Other**    | openat, close, statx, poll, timeout, cancel          |
| **Advanced** | splice, tee, provide\_buffers, multishot accept      |

***

## Direct I/O and O\_DIRECT

```c theme={null}
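#define _GNU_SOURCE            // Required for O_DIRECT
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>          // BLKSSZGET

// Open with O_DIRECT
int fd = open("/path/to/file", O_RDWR | O_DIRECT);

// Buffer must be aligned to the logical block size
void *buffer;
posix_memalign(&buffer, 4096, 4096);  // 4096 covers 512-byte and 4K-sector devices

// Offset and size must also be multiples of the logical block size
pread(fd, buffer, 4096, 0);  // Read 4KB at offset 0

// Query the logical block size. BLKSSZGET only works on a block device fd
// (e.g. /dev/sda); it fails on a regular file.
int logical_block_size;
ioctl(fd, BLKSSZGET, &logical_block_size);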
```
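The three alignment rules (buffer address, length, and file offset) can be checked up front before issuing a direct I/O. A minimal sketch, assuming the block size is a power of two; `direct_io_aligned` and `round_up_to_block` are hypothetical helper names, and some filesystems may impose stricter rules than the logical block size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* True when buffer address, length, and offset are all multiples of block_size. */
bool direct_io_aligned(const void *buf, size_t len, off_t offset,
                       unsigned int block_size)
{
    /* block_size must be a non-zero power of two for the mask trick below. */
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    uintmax_t mask = (uintmax_t)block_size - 1;
    return ((uintmax_t)(uintptr_t)buf & mask) == 0 &&
           ((uintmax_t)len & mask) == 0 &&
           ((uintmax_t)offset & mask) == 0;
}

/* Round a length up to the next multiple of block_size. */
size_t round_up_to_block(size_t len, unsigned int block_size)
{
    size_t mask = (size_t)block_size - 1;
    return (len + mask) & ~mask;
}
```

A 1000-byte record therefore has to be padded to 1024 bytes on a 512-byte-sector device and to 4096 bytes on a 4K-sector device before it can be written with O\_DIRECT.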

### When to Use O\_DIRECT

| Use O\_DIRECT              | Don't Use O\_DIRECT     |
| -------------------------- | ----------------------- |
| Database engines           | General applications    |
| Application manages cache  | Benefit from page cache |
| Predictable latency needed | Small, random reads     |
| Very large sequential I/O  | Typical file access     |

***

## I/O Profiling and Debugging

### blktrace and blkparse

```bash theme={null}
# Start tracing
blktrace -d /dev/sda -o - | blkparse -i -

# Output interpretation:
#   8,0    3        1     0.000000000  1234  Q  WS 1000 + 8 [myapp]
#   │      │        │     │            │     │  │  │      │  │
#   │      │        │     │            │     │  │  │      │  └─ Process
#   │      │        │     │            │     │  │  │      └─ Length (sectors)
#   │      │        │     │            │     │  │  └─ Start sector
#   │      │        │     │            │     │  └─ Write, Sync
#   │      │        │     │            │     └─ Action (Q=queued, C=complete)
#   │      │        │     │            └─ PID
#   │      │        │     └─ Timestamp
#   │      │        └─ Sequence
#   │      └─ CPU
#   └─ Major,minor

# Generate I/O stats
blktrace -d /dev/sda -w 10 -o trace
blkparse -i trace.blktrace.0 -d trace.bin
btt -i trace.bin

# Actions:
# Q = Queued
# G = Get request
# I = Inserted
# D = Dispatched
# C = Completed
```
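For ad-hoc analysis it can help to pull the fields out of the default `blkparse` text format shown above. The sketch below parses one such line with `sscanf`; `struct blk_event`, `parse_blkparse_line`, and `q2c_seconds` are hypothetical names, and lines in other formats (messages, plug/unplug events) are simply rejected.

```c
#include <assert.h>
#include <stdio.h>

/* Fields of one default-format blkparse line (hypothetical struct). */
struct blk_event {
    unsigned int major, minor, cpu;
    unsigned long seq;
    double ts;                    /* seconds since trace start */
    int pid;
    char action;                  /* Q, G, I, D, C, ... */
    char rwbs[8];                 /* e.g. "WS" = write, sync */
    unsigned long long sector;
    unsigned int nsectors;
    char comm[32];                /* process name, or error code on C lines */
};

/* Returns 0 on success, -1 if the line does not match the expected layout. */
int parse_blkparse_line(const char *line, struct blk_event *ev)
{
    assert(line != NULL && ev != NULL);
    int n = sscanf(line, " %u,%u %u %lu %lf %d %c %7s %llu + %u [%31[^]]]",
                   &ev->major, &ev->minor, &ev->cpu, &ev->seq, &ev->ts,
                   &ev->pid, &ev->action, ev->rwbs, &ev->sector,
                   &ev->nsectors, ev->comm);
    return n == 11 ? 0 : -1;
}

/* Queue-to-complete latency for a matching Q and C event pair. */
double q2c_seconds(const struct blk_event *q, const struct blk_event *c)
{
    return c->ts - q->ts;
}
```

Matching a `Q` event to the `C` event with the same start sector gives the same queue-to-complete figure that `btt` reports as Q2C.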

### BPF-based Tools

```bash theme={null}
# I/O latency histogram
sudo biolatency-bpfcc 10

# Per-I/O trace with latency, PID, and sector
sudo biosnoop-bpfcc

# I/O size histogram per process
sudo bitesize-bpfcc

# I/O by process (top-like view)
sudo biotop-bpfcc

# File reads/writes slower than 10 ms for one process
sudo fileslower-bpfcc 10 -p $(pidof myapp)
```

### iostat Analysis

```bash theme={null}
# Extended statistics
iostat -xz 1

# Key metrics:
# r/s, w/s     - Reads/writes per second
# rkB/s, wkB/s - Throughput
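# await        - Average time per I/O in ms, queue wait included
#                (reported as r_await / w_await by current sysstat)
# svctm        - Service time (deprecated, removed in newer sysstat releases)
# %util        - Percentage of time the device had I/O in flight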

# Example output:
# Device  r/s    w/s   rkB/s   wkB/s  await  %util
# sda    100.0  50.0  1024.0   512.0   2.5   15.0

# Watch for:
# - High await with low %util = queue congestion
# - %util = 100% does not imply saturation on devices that serve
#   requests in parallel (SSD, NVMe, RAID)
```
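Both `await` and `%util` are derived from deltas of cumulative kernel counters between two samples. The sketch below shows the arithmetic under the assumption that the counters have already been read (for example from `/proc/diskstats`); `struct disk_sample`, `await_ms`, and `util_pct` are hypothetical names.

```c
#include <assert.h>

/* Snapshot of cumulative per-device counters (hypothetical struct). */
struct disk_sample {
    unsigned long long ios;          /* reads + writes completed */
    unsigned long long ticks_ms;     /* summed ms requests spent queued + in service */
    unsigned long long io_ticks_ms;  /* wall-clock ms with at least one I/O in flight */
};

/* Average time per completed I/O over the interval, in milliseconds. */
double await_ms(const struct disk_sample *prev, const struct disk_sample *cur)
{
    unsigned long long d_ios = cur->ios - prev->ios;
    if (d_ios == 0)
        return 0.0;
    return (double)(cur->ticks_ms - prev->ticks_ms) / (double)d_ios;
}

/* Share of the interval during which the device was busy, in percent. */
double util_pct(const struct disk_sample *prev, const struct disk_sample *cur,
                unsigned long long interval_ms)
{
    assert(interval_ms > 0);
    return 100.0 * (double)(cur->io_ticks_ms - prev->io_ticks_ms)
                 / (double)interval_ms;
}
```

For the sample row above (150 I/Os in one second), 375 ms of summed request time gives an `await` of 2.5 ms, while 150 ms of busy time gives 15% utilization. The summed request time exceeds the busy time because requests overlap, which is why `%util` alone says little about saturation on parallel devices.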

***

## Interview Deep Dives

<AccordionGroup>
  <Accordion title="Q: Explain the journey of a write() call from application to disk" icon="question">
    **Complete flow**:

    1. **Application**: `write(fd, buf, len)`

    2. **VFS Layer**:
       * Resolve fd to its `struct file`, then to the inode
       * Call filesystem's write\_iter

    3. **Page Cache** (buffered write):
       * Find/create page in cache
       * Copy data from user space
       * Mark page dirty
       * Return to application (write "complete")

    4. **Writeback** (background or sync):
       * Per-device flusher thread wakes (a `kworker` today; `pdflush` in kernels before 2.6.32)
       * Allocates bio for dirty pages
       * Submits bio to block layer

    5. **Block Layer**:
       * bio enters request queue
       * Scheduler may merge/reorder
       * Dispatch to driver

    6. **Device Driver**:
       * Translate to device commands
       * DMA data to device

    7. **Hardware**:
       * Device writes to persistent storage
       * Interrupt on completion

    8. **Completion**:
       * Driver handles interrupt
       * bio completion callback
       * Page marked clean
  </Accordion>

  <Accordion title="Q: What's the difference between sync, fsync, and fdatasync?" icon="question">
    **sync()**:

    * Triggers writeback for ALL dirty data
    * On Linux, waits for the writes to complete (POSIX allows it to return earlier)
    * System-wide operation

    **fsync(fd)**:

    * Flushes data AND metadata for specific file
    * Waits for completion
    * Does not guarantee that a new file's directory entry is durable; `fsync()` the parent directory as well

    **fdatasync(fd)**:

    * Flushes data for specific file
    * Only flushes metadata if required for data retrieval
    * Skips non-essential metadata (atime, mtime)

    **Performance comparison**:

    ```
    Append 1 byte to file:
    - fdatasync: Write data + update size = 2 writes
    - fsync: Write data + update all metadata = 2+ writes

    Update 1 byte in existing file:
    - fdatasync: Just write data = 1 write
    - fsync: Write data + update mtime = 2 writes
    ```
  </Accordion>

  <Accordion title="Q: When would you use io_uring over regular syscalls?" icon="question">
    **Use io\_uring when**:

    1. **High I/O rate**: Thousands of IOPS
       * Syscall overhead becomes significant
       * Batching amortizes overhead

    2. **Network servers**: Accept/read/write patterns
       * Single interface for all I/O
       * Async accept with multishot

    3. **Low latency requirements**:
       * SQPOLL avoids syscall entirely
       * Registered files/buffers reduce overhead

    4. **Mixed I/O workloads**:
       * File + network in same ring
       * Unified completion handling

    **Don't use io\_uring for**:

    * Simple applications (complexity not worth it)
    * Few I/O operations (no benefit)
    * Portability required (Linux-specific)
  </Accordion>

  <Accordion title="Q: How would you debug slow disk I/O?" icon="question">
    **Systematic approach**:

    1. **Identify the bottleneck**:
       ```bash theme={null}
       iostat -xz 1  # Check device metrics
       # High await + low %util = queueing issue
       # High %util = device saturation
       ```

    2. **Check for throttling**:
       ```bash theme={null}
       # I/O cgroup throttling
       cat /sys/fs/cgroup/<path>/io.stat
       cat /sys/fs/cgroup/<path>/io.pressure
       ```

    3. **Profile I/O patterns**:
       ```bash theme={null}
       sudo biosnoop-bpfcc -d sda  # Per-I/O latency
       sudo bitesize-bpfcc         # I/O size distribution
       ```

    4. **Check scheduler**:
       ```bash theme={null}
       cat /sys/block/sda/queue/scheduler
       # Try different scheduler
       echo mq-deadline > /sys/block/sda/queue/scheduler
       ```

    5. **Application level**:
       ```bash theme={null}
       strace -e trace=read,write,fsync -p <pid>
       # Look for sync patterns
       ```
  </Accordion>
</AccordionGroup>
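The `fsync` discussion above implies a two-step pattern when creating a file that must survive a crash: sync the file, then sync its parent directory. A minimal sketch using only POSIX calls; `write_durably` is a hypothetical helper name, and error handling is reduced to a single return code.

```c
#define _POSIX_C_SOURCE 200809L  /* for openat() and O_DIRECTORY */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Create dir/name with the given contents and make data + dir entry durable. */
int write_durably(const char *dir, const char *name, const void *data, size_t len)
{
    assert(dir != NULL && name != NULL);
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        close(dfd);
        return -1;
    }

    const char *p = data;
    size_t left = len;
    int rc = 0;
    while (left > 0) {               /* write() may be partial */
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            rc = -1;
            break;
        }
        p += n;
        left -= (size_t)n;
    }

    if (rc == 0 && fsync(fd) < 0)    /* file data and inode metadata */
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    if (rc == 0 && fsync(dfd) < 0)   /* directory entry for the new file */
        rc = -1;
    close(dfd);
    return rc;
}
```

Calling `write_durably("/var/lib/myapp", "state.json", buf, len)` (a hypothetical path) returns 0 only after both the file and its directory entry have been flushed. Replacing the first `fsync()` with `fdatasync()` would skip non-essential metadata such as mtime.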

***

## NVMe Specifics

```bash theme={null}
# NVMe device information
nvme list
nvme id-ctrl /dev/nvme0

# Queue depth and namespaces
nvme list-ns /dev/nvme0
cat /sys/block/nvme0n1/queue/nr_requests

# Per-device I/O counters (same fields as /proc/diskstats)
cat /sys/block/nvme0n1/stat

# Temperature and health
nvme smart-log /dev/nvme0
```

### NVMe vs SATA SSD

| Aspect         | SATA SSD    | NVMe SSD        |
| -------------- | ----------- | --------------- |
| Interface      | SATA (AHCI) | PCIe            |
| Queue depth    | 32          | 65535 per queue |
| Queues         | 1           | Up to 65535     |
| Latency        | \~100μs     | \~10-20μs       |
| Throughput     | \~550 MB/s  | \~3000+ MB/s    |
| Best scheduler | mq-deadline | none            |

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="Explain how io_uring achieves lower syscall overhead than traditional read/write. Walk through the shared memory ring buffer design and explain SQPOLL mode." icon="message">
    **Strong Answer:**

    * Traditional I/O requires one syscall per operation: each `read()` or `write()` call costs roughly 200-500 CPU cycles for the user-kernel transition alone, and more with speculative-execution mitigations enabled. At 100K IOPS that is 20-50 million cycles per second (roughly 0.7-1.7% of one 3 GHz core) before any per-call work such as fd lookup and argument validation, and the cost grows linearly with the I/O rate.
    * io\_uring eliminates this by using two memory-mapped ring buffers shared between user space and the kernel. The Submission Queue (SQ) is where user space writes Submission Queue Entries (SQEs) describing I/O operations. The Completion Queue (CQ) is where the kernel writes Completion Queue Entries (CQEs) with results. Both are lock-free single-producer single-consumer rings, so no locks are needed in the common case; the two sides coordinate only through head and tail indices published with memory barriers.
    * In normal mode, user space fills SQEs, then calls `io_uring_enter()` to notify the kernel. The kernel processes the SQEs and writes CQEs. This batches multiple operations per syscall, amortizing the transition cost. Submitting 32 operations requires only 1 syscall instead of 32.
    * SQPOLL mode eliminates even that single syscall. The kernel spawns a dedicated polling thread (`io_sq_thread`) that continuously polls the SQ for new entries. When user space writes an SQE, the kernel thread picks it up without any syscall at all. After `sq_thread_idle` milliseconds of inactivity the thread goes to sleep and sets `IORING_SQ_NEED_WAKEUP` in the SQ ring flags; user space then wakes it by calling `io_uring_enter()` with `IORING_ENTER_SQ_WAKEUP`. For sustained workloads this achieves zero-syscall I/O submission.
    * Additional optimizations: registered files (`io_uring_register_files()`) avoid per-operation file descriptor lookup, and registered buffers (`io_uring_register_buffers()`) avoid per-operation page pinning. Combined, these can reduce per-I/O overhead to under 100 cycles.

    **Follow-up:** What are the security implications of SQPOLL mode, and why does it require elevated privileges?

    **Follow-up Answer:**

    * Before Linux 5.11, SQPOLL required root privileges (`CAP_SYS_ADMIN`) and registered files. Linux 5.11 relaxed this to `CAP_SYS_NICE` and dropped the registered-file requirement (advertised by `IORING_FEAT_SQPOLL_NONFIXED`), and since 5.13 no special privilege is needed. The restriction existed because the kernel thread runs on behalf of the user process and keeps consuming CPU cycles while it polls, even when the process is not submitting I/O. A malicious user could create many SQPOLL io\_uring instances to burn CPU in the kernel, effectively a kernel-space denial of service. Additionally, the kernel thread runs with the credentials of the creating process, so careful accounting is needed to charge its CPU time to the right cgroup. The poller can be pinned to a specific CPU with `IORING_SETUP_SQ_AFF` and `sq_thread_cpu`.
  </Accordion>

  <Accordion title="You are choosing an I/O scheduler for a fleet of NVMe-backed database servers. Compare mq-deadline, kyber, and 'none', and explain your recommendation." icon="message">
    **Strong Answer:**

    * For NVMe-backed database servers, my recommendation is `none` (no scheduler), with the caveat that workload testing should validate this.
    * `mq-deadline` maintains separate read and write queues with deadline guarantees: reads default to 500ms deadline, writes to 5000ms. It prioritizes reads over writes because reads are typically in the synchronous path. This is valuable for HDD where seeking is expensive and reordering requests by sector can save milliseconds. But NVMe devices have no seek time, so reordering adds latency without reducing device-side cost.
    * `kyber` uses a token-based system with target latencies for reads and writes. When I/O latency exceeds the target, kyber reduces the number of in-flight requests (throttles) to reduce queueing. This is useful for shared environments where multiple workloads compete for NVMe bandwidth. However, for a dedicated database server, the database's own I/O scheduler (InnoDB's adaptive flushing, PostgreSQL's bgwriter) already manages I/O prioritization.
    * `none` passes I/O requests directly to the NVMe device with no kernel-side reordering or scheduling. NVMe devices have internal schedulers optimized for their flash topology (channel interleaving, die-level parallelism), and the device's 64K queue depth per submission queue means it can handle massive parallelism. Adding a kernel scheduler on top adds latency (microseconds per request for lock acquisition and queue insertion) without improving throughput.
    * The exception: if multiple containers share the NVMe with different priority classes, I would use `kyber` or `mq-deadline` with I/O cgroup limits to prevent noisy neighbors.

    **Follow-up:** How does the blk-mq multi-queue architecture map to NVMe submission queues?

    **Follow-up Answer:**

    * NVMe devices expose multiple hardware submission/completion queue pairs (typically one per CPU core). The blk-mq layer creates per-CPU software staging queues that map to these hardware queues. When a thread submits I/O, the request enters the software queue for the thread's current CPU, is optionally processed by the I/O scheduler, and is then dispatched to the corresponding hardware submission queue. This per-CPU design avoids cross-CPU lock contention: when the device offers at least as many hardware queues as CPUs, each CPU's software queue feeds its own hardware queue; otherwise several CPUs share one. The resulting mapping can be inspected under `/sys/block/nvme0n1/mq/<n>/cpu_list`, `/sys/block/nvme0n1/queue/nr_requests` sets the per-queue depth, and IRQ affinity determines which CPUs handle completion interrupts. For optimal performance, the completion interrupt for a hardware queue should be handled by the same CPU that submitted the I/O, keeping the data cache warm.
  </Accordion>

  <Accordion title="A production service is experiencing periodic I/O latency spikes of 50-100ms on an SSD that normally serves requests in under 1ms. How would you investigate this at the block layer level?" icon="message">
    **Strong Answer:**

    * First, I would capture the I/O latency distribution with `sudo biolatency-bpfcc -D 10` (grouped by device) to confirm the bimodal pattern: most I/Os under 1ms with a tail at 50-100ms. Then `sudo biosnoop-bpfcc -d nvme0n1` to see individual slow I/Os with their PID, operation type, sector, and size.
    * Common causes of periodic I/O spikes on SSDs: First, garbage collection -- SSD firmware periodically relocates live pages and erases blocks full of stale data, which can stall writes for 10-100ms. This manifests as periodic write latency spikes that do not track host activity. `nvme smart-log /dev/nvme0` shows wear and spare-capacity indicators that help judge how hard the device is working. Second, journal commits -- ext4 commits its jbd2 journal every 5 seconds by default (XFS flushes its log on its own schedule), which issues synchronous writes that can queue behind other I/O. On ext4, check with `bpftrace -e 'kprobe:jbd2_journal_commit_transaction { printf("%s\n", comm); }'`. Third, writeback bursts -- `sync`, `fsync`, or flusher threads writing dirty pages can cause queue depth spikes.
    * At the block layer, I would check queue depth using `bpftrace` to trace `block_rq_issue` and `block_rq_complete` events, computing the instantaneous queue depth. If the spike correlates with high queue depth, the device is saturated. If the spike happens at low queue depth, the device itself is stalling (firmware GC, thermal throttling, or defective NAND).
    * I would also check the I/O scheduler: `cat /sys/block/nvme0n1/queue/scheduler` -- if it is not `none`, try switching to rule out scheduler-induced delays. And check I/O cgroup throttling: `cat /sys/fs/cgroup/<path>/io.stat` for the relevant device.

    **Follow-up:** How would you distinguish between device-side stalls and kernel-side queueing delays?

    **Follow-up Answer:**

    * I would trace both `block_rq_issue` (when the kernel dispatches the request to the driver) and `block_rq_complete` (when the device signals completion). The delta between issue and complete is purely device-side latency. Separately, I would trace `block_rq_insert` (when the request enters the scheduler queue) and `block_rq_issue` -- this delta is kernel scheduler queueing time. If the device-side delta shows spikes, the SSD is stalling. If the scheduler delta shows spikes, the kernel is holding requests in the queue (possibly throttled by cgroup I/O limits or the scheduler's admission control).
  </Accordion>
</AccordionGroup>
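The insert/issue/complete split described in the last answer reduces to two subtractions per request. A minimal sketch, assuming the three timestamps have already been collected per request (for example from the `block_rq_insert`, `block_rq_issue`, and `block_rq_complete` tracepoints); `struct rq_times`, `classify_stall`, and the threshold parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Timestamps of one request in nanoseconds (hypothetical struct). */
struct rq_times {
    uint64_t insert_ns;    /* request entered the scheduler queue */
    uint64_t issue_ns;     /* request handed to the driver */
    uint64_t complete_ns;  /* device signalled completion */
};

enum stall_source { STALL_NONE, STALL_KERNEL_QUEUE, STALL_DEVICE };

/* Time spent waiting in the kernel before dispatch. */
uint64_t queue_ns(const struct rq_times *t)
{
    assert(t->insert_ns <= t->issue_ns);
    return t->issue_ns - t->insert_ns;
}

/* Time spent inside the device. */
uint64_t device_ns(const struct rq_times *t)
{
    assert(t->issue_ns <= t->complete_ns);
    return t->complete_ns - t->issue_ns;
}

/* Attribute a slow request to whichever side contributed more latency. */
enum stall_source classify_stall(const struct rq_times *t, uint64_t threshold_ns)
{
    uint64_t q = queue_ns(t);
    uint64_t d = device_ns(t);
    if (q + d < threshold_ns)
        return STALL_NONE;
    return d >= q ? STALL_DEVICE : STALL_KERNEL_QUEUE;
}
```

With a 50 ms threshold, a request that waits 0.2 ms in the queue and 80 ms in the device is attributed to the device, while one that waits 60 ms before dispatch and completes in 0.5 ms points at kernel-side queueing or throttling. Requests that bypass the scheduler may have no insert event, in which case the issue timestamp can stand in for both.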

***

Next: [Networking Stack →](/courses/linux-internals/networking-stack)
