I/O Systems

The I/O subsystem bridges the gap between software and hardware devices. Understanding I/O is crucial for performance optimization and is frequently tested in systems interviews.
Interview Frequency: Medium-High
Key Topics: DMA, interrupt handling, I/O scheduling, io_uring
Time to Master: 8-10 hours

I/O Hardware Basics

Device Controllers

[Figure: I/O Hardware Architecture]

Device Registers

Every device controller has registers accessible by the CPU:
Register   Purpose
Status     Device state (busy, error, ready)
Command    Operation to perform
Data-in    Data from device to CPU
Data-out   Data from CPU to device
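
From driver code, such a register block is often modeled as a struct of volatile fields. A minimal sketch, assuming a hypothetical memory-mapped layout (the field offsets and the ready bit are invented for illustration):

#include <stdint.h>

/* Hypothetical memory-mapped register block; real layouts are device-specific */
struct dev_regs {
    volatile uint32_t status;    /* device state: busy, error, ready */
    volatile uint32_t command;   /* operation to perform */
    volatile uint32_t data_in;   /* device -> CPU */
    volatile uint32_t data_out;  /* CPU -> device */
};

#define DEV_READY 0x1            /* assumed ready bit */

/* Poll status, then fetch one word from the device */
uint32_t read_when_ready(struct dev_regs *dev) {
    while (!(dev->status & DEV_READY))
        ;                        /* spin until the device is ready */
    return dev->data_in;
}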

I/O Addressing

Special CPU instructions for I/O:
; x86 port I/O
IN  AL, 0x60    ; Read from keyboard port
OUT 0x60, AL    ; Write to keyboard port
  • Uses a separate I/O address space (port-mapped I/O)
  • Requires special instructions (IN/OUT)
  • Common on x86

I/O Methods

Programmed I/O (Polling)

CPU actively polls device status:
// inb()/outb() come from <sys/io.h> on x86 Linux (requires ioperm());
// note that unlike the asm OUT mnemonic, outb() takes (value, port).
// Port numbers and the ready bit are illustrative placeholders.
#define STATUS_PORT 0x1F7
#define DATA_PORT   0x1F0
#define READY_BIT   0x08

void write_data_polling(char *buffer, int size) {
    for (int i = 0; i < size; i++) {
        // Busy-wait until the device reports ready (wastes CPU)
        while ((inb(STATUS_PORT) & READY_BIT) == 0)
            ;

        // Send one byte
        outb(buffer[i], DATA_PORT);
    }
}
Pros: Simple, no interrupt overhead
Cons: Wastes CPU cycles, doesn’t scale

Interrupt-Driven I/O

Device signals CPU when ready:
// Simplified kernel-style pseudocode: sleep()/wakeup() stand in for the
// kernel's real wait-queue primitives; COMMAND_PORT, ACK, and READ_CMD are
// placeholder constants like those above
static char buffer[BUF_SIZE];
static int  buffer_index;

// Interrupt handler: runs when the device raises its IRQ
void disk_interrupt_handler(void) {
    // Read the completed data byte
    char data = inb(DATA_PORT);

    // Store it for the waiting process
    buffer[buffer_index++] = data;

    // Acknowledge the interrupt so the device can raise the next one
    outb(ACK, COMMAND_PORT);

    // Wake up the waiting process
    wakeup(waiting_process);
}

// Initiate read
void start_read(void) {
    outb(READ_CMD, COMMAND_PORT);
    // Process sleeps; the CPU does other work until the interrupt fires
    sleep(current_process);
}

Interrupt Lifecycle: Step-by-Step

What exactly happens when you press a key or a packet arrives?
  1. Device Signal: Network card asserts the IRQ line on the bus.
  2. Controller: Interrupt Controller (APIC) prioritizes it and signals the CPU.
  3. CPU Context Save:
    • CPU finishes current instruction.
    • Saves EFLAGS, CS, EIP (Program Counter) to the kernel stack.
    • Switches to kernel mode (Ring 0).
  4. Vector Lookup: CPU reads the Interrupt Descriptor Table (IDT) using the IRQ number.
  5. Execution (Top Half): Jumps to the registered ISR (Interrupt Service Routine).
    • Ack: ISR acknowledges the interrupt to the APIC so it can send more.
    • Schedule: Schedules a “Bottom Half” (SoftIRQ/Tasklet) for heavy processing.
  6. Context Restore: CPU executes IRET, restoring registers and resuming the interrupted process.

Block Layer Subsystem

Between the File System (VFS) and the Device Driver lies the Block Layer, which queues, merges, and schedules requests to optimize I/O performance.

The bio Structure

The basic unit of block I/O in Linux is the struct bio. It represents an in-flight read/write request.
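
A simplified sketch of its key fields (the real structure lives in include/linux/blk_types.h and has many more):

/* Simplified view of struct bio; field names match modern kernels */
struct bio {
    struct block_device *bi_bdev;     /* target block device */
    struct bvec_iter     bi_iter;     /* current sector and bytes remaining */
    unsigned short       bi_vcnt;     /* number of bio_vec segments */
    struct bio_vec      *bi_io_vec;   /* scatter-gather list of data pages */
    bio_end_io_t        *bi_end_io;   /* completion callback */
    void                *bi_private;  /* owner-private data */
};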

Request Merging (I/O Scheduling)

Disks hate random seeking. The block layer acts as a traffic controller:
  1. Merging: If App A reads sector 100-107 and App B reads 108-115, merge them into one request (100-115).
  2. Sorting: Reorder requests to minimize disk head movement (Elevator Algorithm).
Incoming Requests: 100, 5000, 101, 102, 5001
Merged & Sorted:   [100-102], [5000-5001]
Result: 2 Disk Operations instead of 5.
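
The same idea in a toy user-space form (not kernel code): sort the incoming sector numbers, then absorb contiguous runs into single operations.

#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int reqs[] = {100, 5000, 101, 102, 5001};
    int n = sizeof reqs / sizeof reqs[0];
    qsort(reqs, n, sizeof reqs[0], cmp);    /* elevator-style sorting */

    int ops = 0;
    for (int i = 0; i < n; i++) {
        int start = reqs[i];
        while (i + 1 < n && reqs[i + 1] == reqs[i] + 1)
            i++;                            /* merge contiguous sectors */
        printf("op %d: sectors %d-%d\n", ++ops, start, reqs[i]);
    }
    return 0;                               /* prints 2 ops: 100-102, 5000-5001 */
}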

Direct Memory Access (DMA)

Without DMA, the CPU must copy every byte (PIO). With DMA, the CPU delegates.

Cycle Stealing & Arbitration

  • The Problem: CPU and DMA Controller share the same system bus. They can’t use it simultaneously.
  • Arbitration: The DMA controller requests the bus. The CPU “pauses” (stalls) for a few cycles to grant access.
  • Cycle Stealing: DMA steals potential CPU instruction fetch cycles to transfer data.
Wait, doesn’t stalling the CPU hurt performance? Yes, but it’s much cheaper than the CPU executing a loop to copy data byte-by-byte.

[Figure: DMA Architecture]
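
In driver terms, "delegating" means programming a few controller registers and getting out of the way. A hedged sketch against a hypothetical DMA controller (the register layout and control bits are invented for illustration):

#include <stdint.h>

struct dma_regs {
    volatile uint64_t src;      /* bus address to read from (device FIFO) */
    volatile uint64_t dst;      /* physical address to write to (RAM buffer) */
    volatile uint32_t count;    /* bytes to transfer */
    volatile uint32_t ctrl;     /* bit 0 = start, bit 1 = interrupt on done */
};

void dma_start_read(struct dma_regs *dma, uint64_t dev_addr,
                    uint64_t phys_buf, uint32_t nbytes) {
    dma->src   = dev_addr;
    dma->dst   = phys_buf;
    dma->count = nbytes;
    dma->ctrl  = 0x3;           /* kick off transfer; IRQ fires on completion */
    /* The CPU now returns to other work; no per-byte copy loop */
}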

Character vs Block Devices

Aspect          Character Device        Block Device
Access          Byte stream             Fixed-size blocks
Buffering       No kernel buffering     Uses buffer cache
Random access   Not supported           Supported
Examples        Keyboard, serial port   Disk, SSD
Major/Minor     Yes                     Yes
# Check device types
ls -la /dev/
crw-rw----  1 root tty  4,  0  # c = character (tty0)
brw-rw----  1 root disk 8,  0  # b = block (sda)

Interview Deep Dive Questions

Q: Compare programmed I/O (PIO) with DMA. When is each appropriate?

Answer:

PIO (Programmed I/O):
for each byte:
  CPU: check device ready
  CPU: read byte from device register
  CPU: write byte to memory
  • CPU executes every transfer instruction
  • CPU utilization: 100% during transfer
  • Good for: Small transfers, simple devices
DMA (Direct Memory Access):
CPU: configure DMA controller
    - source address (device)
    - destination address (memory)
    - byte count
    - direction
CPU: start transfer
CPU: ...do other work...
DMA: transfers data directly (bus master)
DMA: interrupt CPU when complete
Advantages of DMA:
  1. CPU efficiency: CPU free during transfer
  2. Higher throughput: DMA controller optimized for transfers
  3. Lower latency: No per-byte CPU overhead
DMA considerations:
  • Needs contiguous physical memory (or scatter-gather DMA)
  • Cache coherency: CPU cache may have stale data
  • Setup overhead: Small transfers may be slower than PIO
Q: Why is io_uring so much faster than traditional syscall-based I/O?

Answer:

Traditional read() syscall:
1. User calls read()
2. Trap to kernel (mode switch)
3. Kernel validates parameters
4. Kernel initiates I/O
5. Kernel copies data to user buffer
6. Return to user (mode switch)

Cost: 2 mode switches per operation
io_uring:
1. User writes SQE to shared memory (no syscall)
2. User updates tail pointer (no syscall)
3. [If needed] io_uring_enter() to notify kernel
4. Kernel processes submission queue
5. Kernel writes CQE to shared memory
6. User reads CQE (no syscall)

Cost: 0-1 syscall per BATCH of operations
Speed advantages:
  1. Batching: 1 syscall for 1000 operations
  2. SQPOLL mode: Kernel thread polls SQ, zero user syscalls
  3. Zero-copy: Shared memory rings, no copy in/out
  4. Registered buffers: Pre-registered, skip validation
  5. Registered files: File descriptors pre-validated
Benchmarks: 10-100x better IOPS for small I/O workloads
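
A minimal liburing sketch of the flow above (assumes liburing is installed; build with -luring; the file path is arbitrary):

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    char buf[4096];

    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0)
        return 1;

    /* Fill an SQE in shared memory -- no syscall yet */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    io_uring_submit(&ring);            /* one syscall for the whole batch */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);    /* reap the completion */
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
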
Q: Walk through exactly what happens from a device raising an interrupt to the user process receiving its data.

Answer:
1. HARDWARE: Device asserts interrupt line
   └─► Interrupt controller (APIC) receives signal

2. INTERRUPT CONTROLLER:
   └─► Determines interrupt vector (IRQ number)
   └─► Signals CPU

3. CPU:
   └─► Finishes current instruction
   └─► Saves context (registers, flags) to stack
   └─► Disables interrupts (for this level)
   └─► Looks up handler in IDT (Interrupt Descriptor Table)
   └─► Jumps to handler address

4. KERNEL INTERRUPT HANDLER (Top Half):
   └─► Acknowledge interrupt to device
   └─► Quick processing (clear device state)
   └─► Schedule bottom half for heavy work
   └─► Return from interrupt (iret)

5. KERNEL BOTTOM HALF (softirq/tasklet/workqueue):
   └─► Process data
   └─► Copy to user buffer
   └─► Wake up waiting process

6. SCHEDULER:
   └─► User process becomes runnable
   └─► Eventually scheduled

7. USER PROCESS:
   └─► read() returns with data
Top half vs Bottom half:
  • Top half: Runs with interrupts disabled, must be fast
  • Bottom half: Can sleep, do heavy processing
Modern optimizations:
  • NAPI (networking): Disable interrupts, switch to polling under load
  • Threaded IRQs: Handler runs in kernel thread context
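
In Linux driver code, the top-half/bottom-half split looks roughly like this (a hedged kernel-module sketch; "mydev" and the setup function are illustrative):

#include <linux/interrupt.h>
#include <linux/workqueue.h>

static struct work_struct heavy_work;

/* Bottom half: process context, may sleep and do heavy processing */
static void bottom_half(struct work_struct *w)
{
    /* parse data, copy to buffers, wake readers, ... */
}

/* Top half: interrupt context, must be fast */
static irqreturn_t top_half(int irq, void *dev_id)
{
    /* acknowledge the device, then defer the heavy lifting */
    schedule_work(&heavy_work);
    return IRQ_HANDLED;
}

static int mydev_setup(int irq, void *dev)
{
    INIT_WORK(&heavy_work, bottom_half);
    return request_irq(irq, top_half, IRQF_SHARED, "mydev", dev);
}
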
Q: Design a logging system that sustains 100K+ writes per second without blocking the application.

Answer:

Requirements:
  • Write-heavy (100K+ logs/sec)
  • Durability (survive crashes)
  • Low latency (don’t block application)
Design:
Application Threads
         │
         ▼
┌─────────────────────────────────────┐
│     Lock-Free Ring Buffer           │
│  (one per thread, thread-local)     │
│  [log][log][log][...][log]          │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│        Aggregator Thread            │
│  - Collects from all ring buffers   │
│  - Batches logs                     │
│  - Compresses (optional)            │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│        Async Writer Thread          │
│  - Uses io_uring for batched writes │
│  - O_APPEND for atomicity           │
│  - fdatasync() periodically         │
└─────────────────────────────────────┘
Key optimizations:
  1. Thread-local buffers: No contention between threads
  2. Ring buffers: Fixed memory, no allocation (see the sketch after this list)
  3. Batching: Combine small logs into larger writes
  4. io_uring: Async, batched I/O submission
  5. O_APPEND: Kernel handles concurrent appends atomically
  6. Periodic fsync: Balance durability vs performance
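
For items 1-2, a minimal single-producer/single-consumer ring buffer sketch using C11 atomics (sizes and names are illustrative):

#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define RING_SIZE 1024              /* must be a power of two */
#define LOG_LEN   256

struct spsc_ring {
    char entries[RING_SIZE][LOG_LEN];
    _Atomic size_t head;            /* advanced by the producer */
    _Atomic size_t tail;            /* advanced by the consumer */
};

/* Producer (application thread): never blocks, fails when full */
bool ring_push(struct spsc_ring *r, const char *log) {
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == RING_SIZE)
        return false;               /* full: caller drops or retries */
    strncpy(r->entries[h % RING_SIZE], log, LOG_LEN - 1);
    r->entries[h % RING_SIZE][LOG_LEN - 1] = '\0';
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}

/* Consumer (aggregator thread) */
bool ring_pop(struct spsc_ring *r, char *out) {
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t == h)
        return false;               /* empty */
    memcpy(out, r->entries[t % RING_SIZE], LOG_LEN);
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}
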
Durability levels:
// Level 1: OS buffer only (fastest, least durable)
write(fd, log, len);

// Level 2: fdatasync (data only, not metadata)
write(fd, log, len);
fdatasync(fd);  // Every N writes

// Level 3: fsync (data + metadata)
write(fd, log, len);
fsync(fd);  // Slowest but safest
Q: What does O_DIRECT do, and when should you use it?

Answer:

Normal I/O path:
Application ←→ Page Cache ←→ Disk

read():
1. Check page cache
2. If miss, read from disk to cache
3. Copy from cache to user buffer

write():
1. Copy to page cache
2. Mark page dirty
3. Background writeback to disk
O_DIRECT I/O path:
Application ←→ Disk (bypass page cache)

read()/write():
1. Transfer directly between user buffer and disk
2. No caching, no copy
Requirements for O_DIRECT:
  • Buffer must be aligned (typically 512 bytes or 4KB)
  • Offset must be aligned
  • Size must be aligned
When to use O_DIRECT:

Good use cases:
  • Databases (manage their own buffer pool)
  • Video streaming (data is never reread, so caching it wastes memory)
  • Large sequential scans (would pollute the cache)
  • When you need predictable latency
  • When you need predictable latency
Bad use cases:
  • General file access
  • Small random reads (cache helps)
  • When kernel caching is beneficial
Example (PostgreSQL):
// PostgreSQL uses shared_buffers (its own cache);
// enabling O_DIRECT avoids double-caching in the kernel page cache
int fd = open(filename, O_RDWR | O_DIRECT);

// O_DIRECT requires an aligned buffer
void *buf;
posix_memalign(&buf, 4096, 8192);
pread(fd, buf, 8192, 0);

Practice Exercises

1. I/O Scheduler Comparison: Use fio to benchmark random vs sequential I/O under different schedulers.
2. io_uring Program: Write a file-copy program using io_uring and compare its performance with cp.
3. DMA Simulation: Implement a simulation of DMA controller behavior with interrupt generation.
4. Simple Driver: Write a character device driver that logs all read/write operations.

Key Takeaways

DMA Frees the CPU

Device transfers data directly. CPU only sets up and handles completion.

io_uring is Revolutionary

Batched, zero-copy async I/O. Essential for high-performance systems.

Buffering Matters

Double buffering enables overlapping I/O and processing.
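
A sketch of the pattern (using synchronous read() for brevity; a real implementation would issue the next read asynchronously, e.g. via io_uring, so it truly overlaps with processing; process() is a hypothetical consumer):

#include <unistd.h>

#define BUF_SZ 4096

void process(const char *data, ssize_t n);       /* hypothetical consumer */

void copy_loop(int fd) {
    char bufs[2][BUF_SZ];
    int cur = 0;

    ssize_t n = read(fd, bufs[cur], BUF_SZ);     /* prime the first buffer */
    while (n > 0) {
        ssize_t next = read(fd, bufs[1 - cur], BUF_SZ);  /* fill the other buffer */
        process(bufs[cur], n);                   /* ...while this one is consumed */
        cur = 1 - cur;                           /* swap roles */
        n = next;
    }
}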

Scheduler Choice Matters

Use none/kyber for NVMe, mq-deadline for HDDs.

Next: Synchronization