I/O Systems & Modern I/O

The goal of a modern I/O subsystem is to move data between devices and memory as fast as possible while minimizing CPU involvement. This chapter covers everything from basic DMA to the cutting-edge io_uring.

1. Traditional I/O Mechanisms

  • Programmed I/O (PIO): The CPU manually moves every byte between the device and memory.
    • Verdict: Horribly inefficient; wastes CPU cycles.
  • Interrupt-Driven I/O: The CPU tells the device to perform a task and goes back to other work. The device raises an interrupt when finished.
    • Verdict: Better, but each interrupt still incurs mode-switch and handler overhead, which adds up at high I/O rates.
  • Direct Memory Access (DMA): The CPU gives the device a pointer to a memory region. The device moves the data itself and interrupts the CPU only once the entire transfer is complete.

2. Modern High-Performance I/O: io_uring

io_uring is the biggest revolution in Linux I/O in a decade. It addresses the performance bottlenecks of both epoll and the older Linux AIO interface (io_submit), which in practice worked well only for O_DIRECT file I/O.

The Architecture

io_uring uses two ring buffers shared between user space and the kernel:
  1. Submission Queue (SQ): User space writes I/O requests (e.g., “Read 4KB from FD 5”) and increments a tail pointer.
  2. Completion Queue (CQ): The kernel writes results (e.g., “Read completed, 4096 bytes”), and user space reads them.
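
The sketch below shows that round trip using the liburing helper library (assumed available; link with -luring, Linux 5.1+). The file path and buffer size are placeholders.

    /* Minimal io_uring read: one submission, one completion.
       Build: gcc demo.c -luring */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    int main(void) {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);                  /* 8-entry SQ and CQ */

        int fd = open("/etc/hostname", O_RDONLY);          /* placeholder file */
        char buf[4096];

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);  /* "read 4KB from fd" */
        io_uring_submit(&ring);                            /* one syscall for the whole batch */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);                    /* block until a completion arrives */
        printf("read completed: %d bytes\n", cqe->res);    /* res = byte count or -errno */
        io_uring_cqe_seen(&ring, cqe);                     /* mark the CQE as consumed */

        io_uring_queue_exit(&ring);
        return 0;
    }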

Why it’s Faster

  • Minimal System Calls: Because the rings live in shared memory, the application can queue thousands of I/O requests and submit them all with a single io_uring_enter() call; with SQPOLL (below), submission needs no syscall at all.
  • SQPOLL (Submission Polling): The kernel runs a dedicated thread that constantly polls the SQ for new entries, so the application submits I/O by simply writing to shared memory. The trade-off is that the polling thread occupies a CPU core while it is active.
  • Fixed Buffers: The application can pre-register memory regions with the kernel, which pins and maps them once. This eliminates the per-operation cost of pinning and translating user pages.
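
A sketch of how both options are enabled with liburing follows; the queue depth, idle timeout, and buffer sizes are illustrative, and note that SQPOLL required elevated privileges before Linux 5.11.

    /* Enabling SQPOLL and registering fixed buffers (liburing sketch). */
    #include <liburing.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>

    int setup_ring(struct io_uring *ring, struct iovec *iovs, unsigned nbufs) {
        struct io_uring_params p;
        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL;     /* kernel thread polls the SQ */
        p.sq_thread_idle = 2000;           /* ms of idleness before the poller sleeps */
        if (io_uring_queue_init_params(64, ring, &p) < 0)
            return -1;

        /* Pre-register buffers: the kernel pins and maps them once,
           not once per operation. */
        for (unsigned i = 0; i < nbufs; i++) {
            iovs[i].iov_base = aligned_alloc(4096, 4096);
            iovs[i].iov_len  = 4096;
        }
        return io_uring_register_buffers(ring, iovs, nbufs);
    }

    /* Submissions then use the _fixed variants with a buffer index, e.g.
       io_uring_prep_read_fixed(sqe, fd, iovs[i].iov_base, 4096, offset, i); */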

3. userfaultfd: User-Space Paging

Normally, when a page fault occurs, the kernel handles it (by loading the page from disk or swap). userfaultfd allows an application to handle its own page faults.
  • Flow (sketched in code after this list):
    1. Application registers a memory range with userfaultfd.
    2. When a thread accesses a missing page, the kernel suspends it instead of resolving the fault itself.
    3. A “manager” thread receives a message over a file descriptor.
    4. The manager fetches the data (e.g., from the network or a custom compressed store).
    5. The manager copies the data into place (the UFFDIO_COPY ioctl), and the kernel resumes the original thread.
  • Use Cases: post-copy live migration of virtual machines, distributed shared memory, and custom garbage collectors.
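
The condensed sketch below walks through that flow (Linux-specific; error handling trimmed, 4KB pages assumed, and the “fetched” data is a placeholder fill pattern; unprivileged use may require vm.unprivileged_userfaultfd=1).

    /* userfaultfd sketch: a manager thread resolves faults for one region.
       Build: gcc uffd_demo.c -pthread */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define PAGE 4096                                /* assumes 4KB pages */
    static int uffd;

    static void *manager(void *arg) {
        static char page[PAGE];                      /* staging buffer */
        for (;;) {
            struct uffd_msg msg;
            if (read(uffd, &msg, sizeof(msg)) <= 0)  /* 3. fault message arrives */
                break;
            if (msg.event != UFFD_EVENT_PAGEFAULT)
                continue;
            memset(page, 'A', PAGE);                 /* 4. "fetch" the data (placeholder) */
            struct uffdio_copy cp = {
                .dst = msg.arg.pagefault.address & ~(unsigned long)(PAGE - 1),
                .src = (unsigned long)page,
                .len = PAGE,
            };
            ioctl(uffd, UFFDIO_COPY, &cp);           /* 5. install the page, wake the thread */
        }
        return NULL;
    }

    int main(void) {
        uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);               /* handshake with the kernel */

        char *region = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)region, .len = PAGE },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);          /* 1. register the range */

        pthread_t t;
        pthread_create(&t, NULL, manager, NULL);

        printf("first byte: %c\n", region[0]);       /* 2. faults, then resumes with 'A' */
        return 0;
    }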

4. Zero-Copy: sendfile and splice

Moving data from a disk to a network socket with read() + write() usually involves: Disk → kernel page cache → user buffer → socket buffer → NIC. (4 copies, 2 system calls.)
  • sendfile(): Tells the kernel to move data directly from the page cache to the socket buffer in a single system call (sketched below); with scatter-gather DMA even the remaining CPU copy disappears, leaving only the two DMA transfers.
  • splice(): Moves data between two file descriptors using a pipe as an intermediary; no data is copied, only references to the underlying pages.
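
Here is a sketch of the sendfile() loop a simple file server might use; sock_fd is assumed to be an already-connected socket.

    /* Zero-copy file-to-socket transfer with sendfile() (Linux). */
    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Send an entire file over a connected socket; returns 0 on success. */
    int send_file(int sock_fd, const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }

        off_t offset = 0;                    /* the kernel advances this for us */
        while (offset < st.st_size) {
            /* Data moves page cache -> socket without touching user space. */
            ssize_t n = sendfile(sock_fd, fd, &offset, st.st_size - offset);
            if (n <= 0) { close(fd); return -1; }
        }
        close(fd);
        return 0;
    }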

5. Direct Access (DAX)

For Persistent Memory (like Optane), the kernel can map the physical storage directly into the application’s address space.
  • No Page Cache: The application reads and writes the hardware directly with ordinary load/store (MOV) instructions, bypassing the entire OS storage stack.
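
A sketch of what that looks like in practice follows; it assumes a pre-sized file on a filesystem mounted with -o dax (fsdax mode), and the path is a placeholder. MAP_SYNC makes the kernel refuse the mapping unless it is genuinely direct.

    /* Direct access to persistent memory via a DAX-mounted file (Linux 4.15+). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        int fd = open("/mnt/pmem/data", O_RDWR);   /* placeholder; file must exist */
        size_t len = 1 << 20;

        /* MAP_SYNC fails unless the mapping bypasses the page cache. */
        char *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (pmem == MAP_FAILED) return 1;

        /* Ordinary stores go to the persistent medium; no read()/write() calls.
           Durability still requires CPU cache flushes (e.g., clwb, or libpmem). */
        strcpy(pmem, "hello, pmem");

        munmap(pmem, len);
        return 0;
    }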

6. I/O Patterns and Anti-patterns

The same kernel mechanisms can be used well or poorly. This section gives you a mental catalog.

6.1 Good Patterns

  • Batch and coalesce I/O:
    • Use readv/writev or sendmsg to group many small buffers into one system call (see the writev sketch after this list).
    • For files, read and write in page-sized or larger chunks (4KB, 64KB) when possible.
  • Asynchronous multiplexing:
    • Use epoll/kqueue or io_uring to handle many sockets with few threads.
    • Keep each thread mostly busy doing useful work instead of blocking.
  • Zero-copy where it matters:
    • For proxies and file servers, prefer sendfile/splice/tee over manual read + write loops.
    • For very high throughput, design around io_uring with fixed buffers.
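
As an example of the first pattern, here is a writev() sketch that sends a header, separator, and body in one syscall instead of three; the framing is illustrative.

    /* Coalescing small writes into one system call with writev(). */
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Three logical buffers, one syscall (instead of three write() calls). */
    ssize_t send_response(int fd, const char *hdr, const char *body) {
        const char *sep = "\r\n\r\n";          /* illustrative framing */
        struct iovec iov[3] = {
            { .iov_base = (void *)hdr,  .iov_len = strlen(hdr)  },
            { .iov_base = (void *)sep,  .iov_len = strlen(sep)  },
            { .iov_base = (void *)body, .iov_len = strlen(body) },
        };
        return writev(fd, iov, 3);             /* kernel gathers all three buffers */
    }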

6.2 Anti-patterns

  • Tiny synchronous reads in a loop:
    • Example: reading 1 byte at a time from a socket in blocking mode.
    • Leads to thousands of syscalls and context switches; TCP small-packet overhead dominates.
  • One thread per connection:
    • Works for tens or hundreds of connections, collapses at thousands.
    • Spends more time context switching than doing work; thrashes the L1/L2 caches.
  • Unbounded queues:
    • Producer threads enqueue work to a queue feeding I/O workers, but the queue is unbounded.
    • Under load, latency explodes and the process may OOM before backpressure kicks in.
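
The cure for the last anti-pattern is a bounded queue whose full state blocks producers, so backpressure propagates instead of memory growing without limit. A minimal pthread sketch (capacity is illustrative):

    /* Bounded work queue: a full queue blocks producers (backpressure). */
    #include <pthread.h>

    #define CAP 1024                               /* illustrative capacity */

    struct queue {
        void *items[CAP];
        int head, tail, count;
        pthread_mutex_t mu;
        pthread_cond_t not_full, not_empty;
    };

    void queue_init(struct queue *q) {
        q->head = q->tail = q->count = 0;
        pthread_mutex_init(&q->mu, NULL);
        pthread_cond_init(&q->not_full, NULL);
        pthread_cond_init(&q->not_empty, NULL);
    }

    void enqueue(struct queue *q, void *item) {
        pthread_mutex_lock(&q->mu);
        while (q->count == CAP)                        /* full: the producer waits */
            pthread_cond_wait(&q->not_full, &q->mu);   /* <-- backpressure point */
        q->items[q->tail] = item;
        q->tail = (q->tail + 1) % CAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->mu);
    }

    void *dequeue(struct queue *q) {
        pthread_mutex_lock(&q->mu);
        while (q->count == 0)                          /* empty: the consumer waits */
            pthread_cond_wait(&q->not_empty, &q->mu);
        void *item = q->items[q->head];
        q->head = (q->head + 1) % CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->mu);
        return item;
    }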

6.3 Practical Checklist

When building an I/O-heavy system, ask:
  • Are we making more syscalls than necessary?
  • Are we copying data more than twice between user and kernel space?
  • Do we have bounded queues and clear backpressure behavior?
  • Are we using the right primitive (epoll vs io_uring vs blocking I/O) for our latency/throughput goals?
You can tie this back to other chapters: scheduling affects how many worker threads you can afford, and virtual memory affects page cache behavior for file I/O.

Summary for Senior Engineers

  • DMA is the baseline for modern I/O.
  • io_uring is currently the most practical kernel interface for reaching millions of IOPS (I/O Operations Per Second) on modern NVMe drives.
  • userfaultfd allows you to build custom memory management policies outside the kernel.
  • Zero-copy techniques are essential for building high-performance proxy servers and file servers.
Next: System Call Internals & vDSO