I/O Subsystem

The Linux I/O subsystem handles all storage operations. Understanding the block layer, I/O schedulers, and modern async I/O with io_uring is essential for building and debugging high-performance systems.

Prerequisites: Filesystem fundamentals, system calls
Interview Focus: I/O schedulers, async I/O, io_uring
Time to Master: 4-5 hours

Block Layer Architecture

bio and request Structures

The bio Structure

// Simplified view; field order and some types differ across kernel versions
struct bio {
    struct block_device *bi_bdev;    // Target device
    unsigned int        bi_opf;       // Operation and flags
    unsigned short      bi_vcnt;      // Number of bio_vecs
    unsigned short      bi_max_vecs;  // Max bio_vecs
    atomic_t            __bi_cnt;     // Reference count
    struct bio_vec      *bi_io_vec;   // Vector of pages
    struct bvec_iter    bi_iter;      // Iterator: bi_iter.bi_sector = start sector,
                                      //           bi_iter.bi_size = remaining bytes
    bio_end_io_t        *bi_end_io;   // Completion callback
    void                *bi_private;  // Private data
};

struct bio_vec {
    struct page *bv_page;    // Page containing data
    unsigned int bv_len;     // Length of data
    unsigned int bv_offset;  // Offset within page
};
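
To make the relationship between a bio and its bio_vec array concrete, here is a user-space sketch. The model_bio and model_bio_vec types and both helper functions are hypothetical simplifications, not kernel definitions. The sketch only shows that the payload size is the sum of the segment lengths and that the I/O covers a contiguous sector range on the device, even though the memory segments are scattered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SECTOR_SIZE 512u  /* the block layer addresses storage in 512-byte sectors */

/* Hypothetical stand-in for struct bio_vec: one contiguous chunk inside a page. */
struct model_bio_vec {
    void        *page;    /* opaque page handle (unused in this model) */
    unsigned int len;     /* bytes in this segment */
    unsigned int offset;  /* byte offset within the page */
};

/* Hypothetical stand-in for struct bio: a start sector plus a segment array. */
struct model_bio {
    uint64_t                    start_sector;
    unsigned short              vcnt;   /* number of segments in use */
    const struct model_bio_vec *vecs;
};

/* Total payload: the sum of all segment lengths. */
uint64_t model_bio_bytes(const struct model_bio *bio)
{
    assert(bio != NULL);
    uint64_t total = 0;
    for (unsigned short i = 0; i < bio->vcnt; i++)
        total += bio->vecs[i].len;
    return total;
}

/* First sector past the end of the I/O, rounding a partial sector up. */
uint64_t model_bio_end_sector(const struct model_bio *bio)
{
    uint64_t bytes = model_bio_bytes(bio);
    return bio->start_sector + (bytes + MODEL_SECTOR_SIZE - 1) / MODEL_SECTOR_SIZE;
}
```

For example, a bio starting at sector 1000 with one 4096-byte segment and one 512-byte segment carries 4608 bytes and ends just before sector 1009.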

bio Lifecycle

┌─────────────────────────────────────────────────────────────────────┐
│                        BIO LIFECYCLE                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. Allocate bio                                                     │
│     bio = bio_alloc(GFP_KERNEL, nr_pages);                          │
│                                                                      │
│  2. Set up bio                                                       │
│     bio->bi_bdev = block_device;                                    │
│     bio->bi_iter.bi_sector = start_sector;                          │
│     bio->bi_opf = REQ_OP_READ;                                      │
│     bio->bi_end_io = my_completion_handler;                         │
│                                                                      │
│  3. Add pages                                                        │
│     bio_add_page(bio, page, len, offset);                           │
│                                                                      │
│  4. Submit bio                                                       │
│     submit_bio(bio);                                                │
│         │                                                            │
│         ├─▶ Block layer merges with other bios                      │
│         ├─▶ I/O scheduler reorders                                  │
│         └─▶ Driver dispatches to hardware                           │
│                                                                      │
│  5. Completion (interrupt context)                                   │
│     bi_end_io(bio);                                                 │
│         └─▶ bio_put(bio);                                           │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

I/O Schedulers

Multi-Queue Architecture (blk-mq)

┌─────────────────────────────────────────────────────────────────────┐
│                    MULTI-QUEUE BLOCK LAYER (blk-mq)                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Applications submitting I/O (multiple threads)                      │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐                   │
│  │ Thread 1│ │ Thread 2│ │ Thread 3│ │ Thread 4│                   │
│  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘                   │
│       │          │          │          │                            │
│       ▼          ▼          ▼          ▼                            │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │              Software Staging Queues (per CPU)               │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐        │  │
│  │  │ CPU 0    │ │ CPU 1    │ │ CPU 2    │ │ CPU 3    │        │  │
│  │  │ sw queue │ │ sw queue │ │ sw queue │ │ sw queue │        │  │
│  │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘        │  │
│  └───────│────────────│────────────│────────────│───────────────┘  │
│          │            │            │            │                   │
│          └──────┬─────┴─────┬──────┴─────┬──────┘                  │
│                 │           │            │                          │
│                 ▼           ▼            ▼                          │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                Hardware Dispatch Queues                       │  │
│  │  ┌─────────────────┐  ┌─────────────────┐                    │  │
│  │  │  hw queue 0     │  │  hw queue 1     │  ← maps to NVMe   │  │
│  │  │  (NVMe sq 0)    │  │  (NVMe sq 1)    │    submission     │  │
│  │  └────────┬────────┘  └────────┬────────┘    queues         │  │
│  └───────────│─────────────────────│────────────────────────────┘  │
│              ▼                     ▼                                │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                        NVMe SSD                                 ││
│  │           (multiple submission/completion queues)               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
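
The fan-in from per-CPU software queues to hardware queues can be sketched as a mapping function. This is a deliberately naive round-robin model with hypothetical names; the kernel's real mapping is chosen by the driver and takes CPU topology and interrupt affinity into account.

```c
#include <assert.h>

/* Hypothetical simplification: spread CPUs across hardware queues round-robin. */
unsigned int model_hwq_for_cpu(unsigned int cpu, unsigned int nr_hw_queues)
{
    assert(nr_hw_queues > 0);
    return cpu % nr_hw_queues;
}

/* Count how many per-CPU software queues feed one hardware queue. */
unsigned int model_cpus_on_hwq(unsigned int nr_cpus, unsigned int nr_hw_queues,
                               unsigned int hwq)
{
    unsigned int count = 0;
    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
        if (model_hwq_for_cpu(cpu, nr_hw_queues) == hwq)
            count++;
    return count;
}
```

With 4 CPUs and 2 hardware queues, as in the diagram, each hardware queue is fed by two software queues. When the device exposes at least as many hardware queues as there are CPUs, each CPU can get a queue of its own and submission needs no cross-CPU locking.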

Scheduler Comparison

Available schedulers: mq-deadline, bfq, kyber, none

mq-deadline

Best for: Latency-sensitive workloads, databases

Characteristics:

Read and write request deadlines
Read priority over writes (reads typically blocking)
Batch dispatch for efficiency

Configuration:

echo mq-deadline > /sys/block/sda/queue/scheduler

# Inspect deadlines (milliseconds); echo a new value into the file to tune
cat /sys/block/sda/queue/iosched/read_expire    # 500
cat /sys/block/sda/queue/iosched/write_expire   # 5000

# Batch size
cat /sys/block/sda/queue/iosched/fifo_batch     # 16
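
The core dispatch decision of a deadline scheduler can be sketched in a few lines. This is a simplified model with hypothetical names, not the kernel's mq-deadline code: it keeps one request list instead of separate read and write queues and ignores batching (fifo_batch) and read preference.

```c
#include <assert.h>
#include <stddef.h>

struct model_req {
    unsigned long sector;       /* start sector of the request */
    unsigned long deadline_ms;  /* submit time + read_expire or write_expire */
};

/*
 * Pick the index of the next request to dispatch.
 *  1. If the request with the earliest deadline has expired, serve it first.
 *  2. Otherwise continue in ascending sector order from head_sector,
 *     wrapping to the lowest sector when nothing lies ahead.
 */
size_t model_deadline_pick(const struct model_req *reqs, size_t n,
                           unsigned long now_ms, unsigned long head_sector)
{
    assert(reqs != NULL && n > 0);

    size_t oldest = 0, ahead = n, lowest = 0;
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].deadline_ms < reqs[oldest].deadline_ms)
            oldest = i;
        if (reqs[i].sector < reqs[lowest].sector)
            lowest = i;
        if (reqs[i].sector >= head_sector &&
            (ahead == n || reqs[i].sector < reqs[ahead].sector))
            ahead = i;
    }

    if (reqs[oldest].deadline_ms <= now_ms)
        return oldest;                   /* deadline expired: avoid starvation */
    return ahead != n ? ahead : lowest;  /* otherwise keep sector order */
}
```

The two branches show the trade-off the tunables control: larger expire values favor sector-ordered throughput, smaller ones bound worst-case latency.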

bfq

Best for: Interactive desktops, mixed workloads, fairness

Characteristics:

Budget Fair Queueing (per-process fairness)
Low latency guarantee for interactive processes
I/O cgroups integration
Higher CPU overhead

Configuration:

echo bfq > /sys/block/sda/queue/scheduler

# Check low-latency heuristics for interactive and soft real-time tasks
cat /sys/block/sda/queue/iosched/low_latency  # 1 = enabled

# Tune slice duration
cat /sys/block/sda/queue/iosched/slice_idle  # 8 (ms)

kyber

Best for: High-IOPS NVMe devices, scale-out workloads

Characteristics:

Token-based throttling
Separate read/write queues
Low CPU overhead
Target latency-based

Configuration:

echo kyber > /sys/block/nvme0n1/queue/scheduler

# Target latencies (nanoseconds: 2 ms for reads, 10 ms for writes)
cat /sys/block/nvme0n1/queue/iosched/read_lat_nsec   # 2000000
cat /sys/block/nvme0n1/queue/iosched/write_lat_nsec  # 10000000

none

Best for: NVMe devices with internal scheduling

Characteristics:

No software scheduling
Lowest CPU overhead
Relies on device-level scheduling
Best for high-end SSDs/NVMe

Configuration:

echo none > /sys/block/nvme0n1/queue/scheduler

# Verify
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq
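
The bracketed entry in the scheduler file marks the active scheduler. A minimal C sketch of extracting it from such a line follows; model_active_scheduler is a hypothetical helper name, and reading the sysfs file itself is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) scheduler name from a line such as
 * "[none] mq-deadline kyber bfq" into out.
 * Returns 0 on success, -1 if no bracketed name is found or out is too small.
 */
int model_active_scheduler(const char *line, char *out, size_t out_len)
{
    assert(line != NULL && out != NULL && out_len > 0);

    const char *lb = strchr(line, '[');
    if (lb == NULL)
        return -1;
    const char *rb = strchr(lb + 1, ']');
    if (rb == NULL)
        return -1;

    size_t len = (size_t)(rb - lb - 1);
    if (len == 0 || len >= out_len)
        return -1;

    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}
```

A monitoring agent could call this after reading /sys/block/<dev>/queue/scheduler to confirm that a configuration change actually took effect.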

Asynchronous I/O

Traditional AIO (libaio)

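#define _GNU_SOURCE           // exposes O_DIRECT
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>

io_context_t ctx = 0;         // must be zeroed before io_setup(), else EINVAL
struct iocb cb;
struct iocb *cbs[1] = {&cb};
struct io_event events[1];

// Open with O_DIRECT and allocate an aligned buffer
int fd = open("/path/to/file", O_RDONLY | O_DIRECT);
void *buffer;
posix_memalign(&buffer, 4096, 4096);

// Initialize (libaio calls return a negative errno on failure)
io_setup(128, &ctx);

// Prepare read
io_prep_pread(&cb, fd, buffer, 4096, 0);

// Submit (returns the number of iocbs accepted)
io_submit(ctx, 1, cbs);

// Wait for completion; events[0].res holds bytes read or -errno
int n = io_getevents(ctx, 1, 1, events, NULL);

// Cleanup
io_destroy(ctx);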

Limitations of libaio:

Asynchronous only with O_DIRECT (buffered I/O can block inside io_submit)
Limited to block I/O
System call per submit/complete

io_uring: Modern Async I/O

io_uring Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         io_uring ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  User Space                     Kernel Space                         │
│  ┌────────────────────────────┐ ┌────────────────────────────────┐  │
│  │                            │ │                                │  │
│  │    Application             │ │                                │  │
│  │                            │ │                                │  │
│  │    ┌──────────────────┐    │ │                                │  │
│  │    │                  │    │ │                                │  │
│  │    │  SQE Pool        │    │ │    ┌──────────────────────┐   │  │
│  │    │  (submissions)   │────┼─┼───▶│   Submission Queue   │   │  │
│  │    │                  │    │ │    │   (SQ) Ring Buffer   │   │  │
│  │    └──────────────────┘    │ │    └──────────┬───────────┘   │  │
│  │                            │ │               │                │  │
│  │                            │ │               ▼                │  │
│  │                            │ │    ┌──────────────────────┐   │  │
│  │                            │ │    │   Kernel Processing  │   │  │
│  │                            │ │    │   - Syscall handler  │   │  │
│  │                            │ │    │   - Async workers    │   │  │
│  │                            │ │    └──────────┬───────────┘   │  │
│  │                            │ │               │                │  │
│  │    ┌──────────────────┐    │ │               ▼                │  │
│  │    │                  │◀───┼─┼────┌──────────────────────┐   │  │
│  │    │  CQE Pool        │    │ │    │   Completion Queue   │   │  │
│  │    │  (completions)   │    │ │    │   (CQ) Ring Buffer   │   │  │
│  │    │                  │    │ │    └──────────────────────┘   │  │
│  │    └──────────────────┘    │ │                                │  │
│  │                            │ │                                │  │
│  └────────────────────────────┘ └────────────────────────────────┘  │
│                                                                      │
│  Zero-copy: SQ and CQ are mmap'd shared memory                      │
│  No syscall needed for submit (SQPOLL mode)                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
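
The shared rings work because each side owns exactly one index: the producer advances the tail and the consumer advances the head. The following user-space sketch models that single-producer, single-consumer protocol with C11 atomics. All names are hypothetical, and the model omits details of the real io_uring layout, such as the SQ index array and the separate SQE array.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RING_ENTRIES 8u                      /* must be a power of two */
#define MODEL_RING_MASK    (MODEL_RING_ENTRIES - 1)

struct model_ring {
    _Atomic uint32_t head;                 /* advanced by the consumer */
    _Atomic uint32_t tail;                 /* advanced by the producer */
    uint64_t entries[MODEL_RING_ENTRIES];  /* payload slots */
};

/* Producer side: returns false when the ring is full. */
bool model_ring_push(struct model_ring *r, uint64_t value)
{
    assert(r != NULL);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == MODEL_RING_ENTRIES)
        return false;
    r->entries[tail & MODEL_RING_MASK] = value;
    /* Release: the entry must be visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool model_ring_pop(struct model_ring *r, uint64_t *out)
{
    assert(r != NULL && out != NULL);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->entries[head & MODEL_RING_MASK];
    /* Release: the slot may be reused only after it has been read. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

For the submission queue the application is the producer and the kernel the consumer; for the completion queue the roles are reversed. Because neither side ever writes the other's index, no lock is needed, and with SQPOLL no system call is needed either.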

io_uring Example

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH 32
#define BLOCK_SIZE 4096

int main() {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    
    // Initialize io_uring (liburing returns -errno rather than setting errno)
    int ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }
    
    // Open file
    int fd = open("test.txt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        io_uring_queue_exit(&ring);
        return 1;
    }
    char buffer[BLOCK_SIZE];
    
    // Get a submission queue entry
    sqe = io_uring_get_sqe(&ring);
    
    // Prepare a read operation
    io_uring_prep_read(sqe, fd, buffer, BLOCK_SIZE, 0);
    
    // Set user data for identifying this request
    io_uring_sqe_set_data(sqe, (void*)123);
    
    // Submit the request
    io_uring_submit(&ring);
    
    // Wait for completion
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
        close(fd);
        io_uring_queue_exit(&ring);
        return 1;
    }
    
    // Check result
    if (cqe->res < 0) {
        printf("Read failed: %s\n", strerror(-cqe->res));
    } else {
        printf("Read %d bytes\n", cqe->res);
    }
    
    // Mark CQE as consumed
    io_uring_cqe_seen(&ring, cqe);
    
    // Cleanup
    close(fd);
    io_uring_queue_exit(&ring);
    
    return 0;
}

io_uring Advanced Features

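// Registered files (avoid fd lookup per operation)
int fds[10];  // fill with open file descriptors before registering
io_uring_register_files(&ring, fds, 10);
// Use fixed file index instead of fd
// (set the flag after io_uring_prep_*(), which resets sqe->flags)
sqe->flags |= IOSQE_FIXED_FILE;
sqe->fd = 0;  // Index into registered array

// Registered buffers (avoid page pinning per operation)
struct iovec iovecs[10];  // each entry must describe an allocated buffer
io_uring_register_buffers(&ring, iovecs, 10);
// buffer must lie inside iovecs[buf_index]
io_uring_prep_read_fixed(sqe, fd, buffer, len, offset, buf_index);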

// SQPOLL: Kernel polls SQ, no submit syscall needed
struct io_uring_params params = {
    .flags = IORING_SETUP_SQPOLL,
    .sq_thread_idle = 2000,  // ms before thread sleeps
};
io_uring_queue_init_params(QUEUE_DEPTH, &ring, &params);

// Linked operations (dependent I/Os)
sqe1 = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe1, fd, buf1, len, 0);
sqe1->flags |= IOSQE_IO_LINK;  // Next SQE depends on this

sqe2 = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe2, fd, buf2, len, 0);  // Runs only if the read succeeds; otherwise completes with -ECANCELED
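
The semantics of linked operations can be modeled without a kernel. The sketch below uses hypothetical names and is not liburing code. It computes what each completion would report for a sequence of entries, assuming that a failure cancels the rest of its chain. It ignores the additional rule that a short read or write also breaks a chain.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_sqe {
    int  result;  /* what the operation would return if it ran (<0 = -errno) */
    bool link;    /* models IOSQE_IO_LINK: the next entry depends on this one */
};

/*
 * Fill cqe_res[i] with the completion result of each entry.
 * Once an entry in a chain fails, the remaining entries of that chain
 * complete with -ECANCELED without running.
 */
void model_run_links(const struct model_sqe *sqes, size_t n, int *cqe_res)
{
    assert(sqes != NULL && cqe_res != NULL);
    bool in_chain = false;      /* previous entry linked to this one */
    bool chain_failed = false;  /* an earlier entry of this chain failed */

    for (size_t i = 0; i < n; i++) {
        if (!in_chain)
            chain_failed = false;  /* a new chain (or lone request) starts */

        if (chain_failed) {
            cqe_res[i] = -ECANCELED;
        } else {
            cqe_res[i] = sqes[i].result;
            if (sqes[i].result < 0)
                chain_failed = true;
        }
        in_chain = sqes[i].link;
    }
}
```

Every entry still produces a completion, so the application must check each CQE result, including those of operations that never ran.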

io_uring Supported Operations

Category	Operations
File I/O	read, write, readv, writev, fsync, sync_file_range
Network	accept, connect, send, recv, sendmsg, recvmsg
Other	openat, close, statx, poll, timeout, cancel
Advanced	splice, tee, provide_buffers, multishot accept

Direct I/O and O_DIRECT

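#define _GNU_SOURCE      // exposes O_DIRECT
#include <fcntl.h>
#include <linux/fs.h>    // BLKSSZGET
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

// Open with O_DIRECT
int fd = open("/path/to/file", O_RDWR | O_DIRECT);

// Buffer must be aligned to the logical block size
// (4096 covers both 512-byte and 4K-sector devices)
void *buffer;
posix_memalign(&buffer, 4096, 4096);

// Offset and size must be aligned to logical block size
pread(fd, buffer, 4096, 0);  // Read 4KB at offset 0

// Get required alignment. BLKSSZGET takes an int * and works on a block
// device, not a regular file; bdev_fd is a placeholder for an open fd
// on the underlying device (e.g. /dev/sda).
int logical_block_size;
ioctl(bdev_fd, BLKSSZGET, &logical_block_size);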
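
Misaligned O_DIRECT requests typically fail with EINVAL, so it can help to validate them before issuing the I/O. A minimal sketch, with hypothetical helper names, assuming the block size is a power of two obtained as shown above:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * True when the buffer address, file offset, and length are all multiples
 * of block_size (which must be a power of two).
 */
bool model_dio_aligned(const void *buf, off_t offset, size_t len,
                       size_t block_size)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    size_t mask = block_size - 1;
    return ((uintptr_t)buf & mask) == 0 &&
           ((uint64_t)offset & mask) == 0 &&
           (len & mask) == 0;
}

/* Allocate a buffer aligned for O_DIRECT; returns NULL on failure. */
void *model_dio_alloc(size_t len, size_t block_size)
{
    void *buf = NULL;
    if (posix_memalign(&buf, block_size, len) != 0)
        return NULL;
    return buf;
}
```

All three conditions matter: an aligned buffer with an unaligned offset or length is rejected just the same.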

When to Use O_DIRECT

Use O_DIRECT	Don’t Use O_DIRECT
Database engines	General applications
Application manages cache	Benefit from page cache
Predictable latency needed	Small, random reads
Very large sequential I/O	Typical file access

I/O Profiling and Debugging

blktrace and blkparse

# Start tracing
blktrace -d /dev/sda -o - | blkparse -i -

# Output interpretation:
#   8,0    3        1     0.000000000  1234  Q  WS 1000 + 8 [myapp]
#   │      │        │     │            │     │  │  │      │  │
#   │      │        │     │            │     │  │  │      │  └─ Process
#   │      │        │     │            │     │  │  │      └─ Length (sectors)
#   │      │        │     │            │     │  │  └─ Start sector
#   │      │        │     │            │     │  └─ Write, Sync
#   │      │        │     │            │     └─ Action (Q=queued, C=complete)
#   │      │        │     │            └─ PID
#   │      │        │     └─ Timestamp
#   │      │        └─ Sequence
#   │      └─ CPU
#   └─ Major,minor

# Generate I/O stats
blktrace -d /dev/sda -w 10 -o trace
blkparse -i trace.blktrace.0 -d trace.bin
btt -i trace.bin

# Actions:
# Q = Queued
# G = Get request
# I = Inserted
# D = Dispatched
# C = Completed
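
For ad-hoc analysis, the default blkparse line format shown above can be parsed with sscanf. This is a sketch with hypothetical names that handles only lines of the form "sector + length [name]"; other event types that blkparse prints in a different layout are reported as parse failures.

```c
#include <assert.h>
#include <stdio.h>

struct model_blk_event {
    int  major, minor, cpu;
    unsigned long seq;
    double timestamp;
    int  pid;
    char action[8];   /* Q, G, I, D, C, ... */
    char rwbs[8];     /* e.g. "WS" = write, sync */
    unsigned long long sector;
    unsigned int nr_sectors;
    char comm[32];    /* text inside the trailing [..] */
};

/* Parse one line of blkparse's default output; returns 0 on success, -1 otherwise. */
int model_parse_blkparse(const char *line, struct model_blk_event *ev)
{
    assert(line != NULL && ev != NULL);
    int n = sscanf(line, " %d,%d %d %lu %lf %d %7s %7s %llu + %u [%31[^]]]",
                   &ev->major, &ev->minor, &ev->cpu, &ev->seq, &ev->timestamp,
                   &ev->pid, ev->action, ev->rwbs, &ev->sector,
                   &ev->nr_sectors, ev->comm);
    return n == 11 ? 0 : -1;
}
```

Pairing the timestamps of a request's D and C events for the same sector gives its device service time, and Q to C gives its total latency, which is what btt computes in bulk.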

BPF-based Tools

# I/O latency histogram
sudo biolatency-bpfcc 10

# Per-I/O trace with latency (one line per request)
sudo biosnoop-bpfcc

# I/O size distribution per process
sudo bitesize-bpfcc

# I/O by process
sudo biotop-bpfcc

# File reads/writes slower than 10 ms for one process
sudo fileslower-bpfcc 10 -p $(pidof myapp)

iostat Analysis

# Extended statistics
iostat -xz 1

# Key metrics:
# r/s, w/s     - Reads/writes per second
# rkB/s, wkB/s - Throughput
# await        - Average I/O time (includes queue)
# svctm        - Service time (unreliable; removed in newer sysstat releases)
# %util        - Device utilization

# Example output:
# Device  r/s    w/s   rkB/s   wkB/s  await  %util
# sda    100.0  50.0  1024.0   512.0   2.5   15.0

# Watch for:
# - High await with low %util = queue congestion
# - %util = 100% does not prove saturation on devices that serve requests in parallel (SSD, NVMe, RAID)
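
These metrics are derived from deltas of kernel counters in /proc/diskstats. The sketch below shows the arithmetic for await and %util. The struct and function names are hypothetical, and only the counters needed for these two metrics are modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subset of the per-device counters in /proc/diskstats. */
struct model_diskstats {
    uint64_t reads_completed;
    uint64_t writes_completed;
    uint64_t read_ms;      /* total time spent on reads, queueing included */
    uint64_t write_ms;     /* total time spent on writes, queueing included */
    uint64_t io_ticks_ms;  /* wall-clock time with at least one I/O in flight */
};

/* Average time per completed I/O over the interval, in milliseconds. */
double model_await_ms(const struct model_diskstats *prev,
                      const struct model_diskstats *cur)
{
    assert(prev != NULL && cur != NULL);
    uint64_t ios = (cur->reads_completed - prev->reads_completed) +
                   (cur->writes_completed - prev->writes_completed);
    uint64_t ms  = (cur->read_ms - prev->read_ms) +
                   (cur->write_ms - prev->write_ms);
    return ios ? (double)ms / (double)ios : 0.0;
}

/* Percentage of the interval during which the device had I/O in flight. */
double model_util_pct(const struct model_diskstats *prev,
                      const struct model_diskstats *cur,
                      uint64_t interval_ms)
{
    assert(prev != NULL && cur != NULL && interval_ms > 0);
    return 100.0 * (double)(cur->io_ticks_ms - prev->io_ticks_ms) /
           (double)interval_ms;
}
```

Because io_ticks_ms counts time with at least one request in flight, regardless of how many, a device that handles many requests concurrently can show 100% while still having headroom.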

Interview Deep Dives

Q: Explain the journey of a write() call from application to disk

Complete flow:

1. Application: write(fd, buf, len)
2. VFS Layer:
   - Look up the struct file (and its inode) from fd
   - Call filesystem’s write_iter
3. Page Cache (buffered write):
   - Find/create page in cache
   - Copy data from user space
   - Mark page dirty
   - Return to application (write “complete”)
4. Writeback (background or sync):
   - Writeback worker wakes (a kworker flush thread; pdflush on kernels before 2.6.32)
   - Allocates bio for dirty pages
   - Submits bio to block layer
5. Block Layer:
   - bio enters request queue
   - Scheduler may merge/reorder
   - Dispatch to driver
6. Device Driver:
   - Translate to device commands
   - DMA data to device
7. Hardware:
   - Device writes to persistent storage
   - Interrupt on completion
8. Completion:
   - Driver handles interrupt
   - bio completion callback
   - Page marked clean

Q: What's the difference between sync, fsync, and fdatasync?

sync():

Triggers writeback for ALL dirty data
On Linux, waits until the writes complete (POSIX allows returning earlier)
System-wide operation

fsync(fd):

Flushes data AND metadata for specific file
Waits for completion
Does not guarantee that a new file’s directory entry is durable; fsync the parent directory as well

fdatasync(fd):

Flushes data for specific file
Only flushes metadata if required for data retrieval
Skips non-essential metadata (atime, mtime)

Performance comparison:

Append 1 byte to file:
- fdatasync: Write data + update size = 2 writes
- fsync: Write data + update all metadata = 2+ writes

Update 1 byte in existing file:
- fdatasync: Just write data = 1 write
- fsync: Write data + update mtime = 2 writes
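
Creating a file durably therefore takes two fsync calls: one on the file and one on its parent directory. A minimal sketch, assuming a hypothetical helper named model_durable_create; it does not retry on EINTR and does not write to a temporary name first, which production code usually would.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Create dir/name with the given contents and make both the data and the
 * new directory entry durable. Returns 0 on success, -1 on error.
 */
int model_durable_create(const char *dir, const char *name,
                         const void *data, size_t len)
{
    assert(dir != NULL && name != NULL && (data != NULL || len == 0));

    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = data;
    while (len > 0) {                        /* write() may be partial */
        ssize_t w = write(fd, p, len);
        if (w < 0) { close(fd); return -1; }
        p += w;
        len -= (size_t)w;
    }
    if (fsync(fd) < 0) { close(fd); return -1; }  /* file data + inode metadata */
    if (close(fd) < 0)
        return -1;

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);                     /* the directory entry itself */
    close(dfd);
    return rc;
}
```

Replacing the first fsync with fdatasync would not be safe here in general reasoning terms only if metadata beyond the file size mattered; for a newly created file the directory fsync is the step most often forgotten.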

Q: When would you use io_uring over regular syscalls?

Use io_uring when:

High I/O rate: Thousands of IOPS
- Syscall overhead becomes significant
- Batching amortizes overhead
Network servers: Accept/read/write patterns
- Single interface for all I/O
- Async accept with multishot
Low latency requirements:
- SQPOLL avoids syscall entirely
- Registered files/buffers reduce overhead
Mixed I/O workloads:
- File + network in same ring
- Unified completion handling

Don’t use io_uring for:

Simple applications (complexity not worth it)
Few I/O operations (no benefit)
Portability required (Linux-specific)

Q: How would you debug slow disk I/O?

Systematic approach:

Identify the bottleneck:

iostat -xz 1  # Check device metrics
# High await + low %util = queueing issue
# High %util = device saturation

Check for throttling:

# I/O cgroup (v2) limits, statistics, and pressure
cat /sys/fs/cgroup/<path>/io.max
cat /sys/fs/cgroup/<path>/io.stat
cat /sys/fs/cgroup/<path>/io.pressure

Profile I/O patterns:

sudo biosnoop-bpfcc -d sda  # Per-I/O latency
sudo bitesize-bpfcc         # I/O size distribution

Check scheduler:

cat /sys/block/sda/queue/scheduler
# Try different scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler

Application level:

strace -e trace=read,write,pread64,pwrite64,fsync,fdatasync -p <pid>
# Look for sync patterns

NVMe Specifics

# NVMe device information
nvme list
nvme id-ctrl /dev/nvme0

# Queue depth and namespaces
nvme list-ns /dev/nvme0
cat /sys/block/nvme0n1/queue/nr_requests

# Per-device I/O counters (same fields as /proc/diskstats)
cat /sys/block/nvme0n1/stat

# Temperature and health
nvme smart-log /dev/nvme0

NVMe vs SATA SSD

Aspect	SATA SSD	NVMe SSD
Interface	SATA (AHCI)	PCIe
Queue depth	32	65535 per queue
Queues	1	Up to 65535
Latency	~100μs	~10-20μs
Throughput	~550 MB/s	~3000+ MB/s
Best scheduler	mq-deadline	none
