I/O Systems & Modern I/O
The goal of a modern I/O subsystem is to move data between devices and memory as fast as possible while minimizing CPU involvement. This chapter covers everything from basic DMA to the cutting-edge io_uring.
1. Traditional I/O Mechanisms
- Programmed I/O (PIO): The CPU manually moves every byte between the device and memory.
- Verdict: Horribly inefficient; wastes CPU cycles.
- Interrupt-Driven I/O: The CPU tells the device to perform a task and goes back to other work. The device raises an interrupt when finished.
- Verdict: Better, but interrupts have high context-switch overhead.
- Direct Memory Access (DMA): The CPU gives the device a pointer to a memory region. The device moves the data itself and interrupts the CPU only once the entire transfer is complete.
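To make the contrast concrete, here is a minimal sketch of programmed I/O in C. The register addresses and the READY bit are hypothetical, but the busy-wait loop is exactly the CPU work that DMA eliminates.

```c
/* Hypothetical programmed I/O: the CPU polls a status register and copies
 * every word out of a data register itself. Addresses and the READY bit are
 * invented for illustration; real drivers get them from the device spec. */
#include <stdint.h>
#include <stddef.h>

#define DEV_STATUS ((volatile uint32_t *)0xFEDC0000u)  /* hypothetical MMIO */
#define DEV_DATA   ((volatile uint32_t *)0xFEDC0004u)
#define STATUS_READY 0x1u

static void pio_read(uint32_t *dst, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        while ((*DEV_STATUS & STATUS_READY) == 0)
            ;                        /* busy-wait: pure wasted CPU cycles */
        dst[i] = *DEV_DATA;          /* the CPU moves each word by hand */
    }
}
```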
2. Modern High-Performance I/O: io_uring
io_uring is the biggest revolution in Linux I/O in a decade. It addresses the performance bottlenecks of epoll and the older Linux aio (asynchronous I/O) interface.
The Architecture
io_uring uses two Ring Buffers shared between user-space and the kernel.
- Submission Queue (SQ): User-space writes I/O requests (e.g., “Read 4KB from FD 5”) and increments a tail pointer.
- Completion Queue (CQ): The kernel writes results (e.g., “Read completed, 4096 bytes”) and user space reads them.
Why it’s Faster
- Fewer System Calls: Because the rings live in shared memory, the application can queue thousands of I/O requests and submit the whole batch with a single io_uring_enter() call.
- SQPOLL (Submission Polling): The kernel can run a dedicated thread that constantly polls the SQ for new entries. Submission then needs no syscalls at all, at the cost of one kernel thread busy-polling a core.
- Fixed Buffers: The application can pre-register memory regions with the kernel. This eliminates the need for the kernel to “map” and “unmap” memory pages for every I/O operation.
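Here is a minimal read using liburing, the thin user-space wrapper around the raw rings. The sketch assumes liburing is installed and a file named data.bin exists; error handling is omitted.

```c
/* A single 4 KB read through io_uring, using liburing. */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);          /* SQ/CQ rings with 8 entries */

    int fd = open("data.bin", O_RDONLY);
    char buf[4096];

    /* Fill one submission queue entry: "read 4 KB from fd at offset 0". */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    io_uring_submit(&ring);                    /* one syscall submits the batch */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);            /* reap the completion */
    printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```

Build with -luring. Many SQEs can be queued before a single io_uring_submit(), which is where the batching win comes from.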
3. userfaultfd: User-Space Paging
Normally, when a page fault occurs, the kernel handles it (by loading from disk or swap). userfaultfd allows an Application to handle its own page faults.
- Flow:
  - The application registers a memory range with userfaultfd.
  - When a thread accesses a missing page in that range, it blocks.
  - A “manager” thread receives a fault message over the userfaultfd file descriptor.
  - The manager fetches the data (e.g., from the network or a custom compressed store).
  - The manager “copies” the data into the page, and the kernel resumes the original thread.
- Use Case: Live migration of Virtual Machines, distributed shared memory, and garbage collection.
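A condensed sketch of that flow, assuming 4 KB pages; error handling is omitted, and on recent kernels unprivileged use may require the vm.unprivileged_userfaultfd sysctl or elevated privileges.

```c
/* Condensed userfaultfd demo: one registered page, one fault, one UFFDIO_COPY. */
#define _GNU_SOURCE
#include <linux/userfaultfd.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <pthread.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

static char *region;            /* the userfaultfd-registered mapping */

/* Worker: its first access to the missing page blocks in the kernel
 * until the "manager" resolves the fault with UFFDIO_COPY. */
static void *touch(void *arg)
{
    (void)arg;
    printf("worker read: %c\n", region[0]);
    return NULL;
}

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);

    /* 1. Create the userfaultfd and handshake with the kernel. */
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    /* 2. Register an anonymous mapping: missing-page faults come to us. */
    region = mmap(NULL, page, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)region, .len = page },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    pthread_t worker;
    pthread_create(&worker, NULL, touch, NULL);   /* this thread will fault */

    /* 3. Manager side: wait for the fault event (normally in its own thread). */
    struct uffd_msg msg;
    read(uffd, &msg, sizeof(msg));

    /* 4. Resolve it: copy data into the page; the kernel wakes the worker. */
    static char src[4096] = "X";                  /* e.g., fetched over the network */
    struct uffdio_copy copy = {
        .dst = msg.arg.pagefault.address & ~(unsigned long)(page - 1),
        .src = (unsigned long)src,
        .len = page,
    };
    ioctl(uffd, UFFDIO_COPY, &copy);

    pthread_join(worker, NULL);
    return 0;
}
```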
4. Zero-Copy: sendfile and splice
Moving data from a Disk to a Network socket with a read() + write() loop usually involves:
Disk → Kernel Buffer → User Buffer → Socket Buffer → NIC Buffer. (4 copies, 2 system calls).
- sendfile(): Tells the kernel to move data directly from the File Cache to the Socket Buffer. (2 copies, 1 system call).
- splice(): Moves data between two file descriptors using a Pipe as an intermediary, without copying the data; only the pointers to the pages are moved.
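A small sketch of the sendfile() path in a file server; out_sock is assumed to be an already-connected TCP socket, and the helper name is only illustrative.

```c
/* Zero-copy file-to-socket transfer with sendfile(). */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t send_whole_file(int out_sock, const char *path)
{
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    fstat(in_fd, &st);

    off_t offset = 0;
    ssize_t total = 0;
    while (offset < st.st_size) {
        /* The kernel moves pages from the page cache to the socket; the data
         * never enters user space and no user buffer is needed. */
        ssize_t n = sendfile(out_sock, in_fd, &offset, st.st_size - offset);
        if (n <= 0)
            break;
        total += n;
    }
    close(in_fd);
    return total;
}
```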
5. Direct Access (DAX)
For Persistent Memory (like Optane), the kernel can map the physical storage directly into the application’s address space.
- No Page Cache: The application reads and writes directly to the hardware using MOV instructions, bypassing the entire OS storage stack.
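From the application’s point of view, DAX is just mmap() on a file that lives on a DAX-mounted filesystem. A sketch (the /mnt/pmem path is an assumption; error handling is omitted):

```c
/* DAX from the application's side: an ordinary mmap(), but loads and stores
 * go straight to persistent memory with no page cache in between.
 * The path assumes a filesystem mounted with -o dax on a pmem device. */
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1 << 20;                              /* 1 MB region */
    int fd = open("/mnt/pmem/log.bin", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, len);                                /* make the file big enough */

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    memcpy(p, "hello, pmem", 12);     /* plain stores, no read()/write() path */

    /* Durability still needs CPU cache flushes (CLWB/CLFLUSHOPT), usually via
     * a library such as libpmem; msync() is the portable, conservative option. */
    msync(p, len, MS_SYNC);

    munmap(p, len);
    close(fd);
    return 0;
}
```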
6. I/O Patterns and Anti-patterns
The same kernel mechanisms can be used well or poorly. This section gives you a mental catalog.
6.1 Good Patterns
- Batch and coalesce I/O (see the writev sketch after this list):
  - Use readv/writev or sendmsg to group many small buffers into one system call.
  - For files, read and write in page-sized or larger chunks (4KB, 64KB) when possible.
- Asynchronous multiplexing:
  - Use epoll/kqueue or io_uring to handle many sockets with few threads.
  - Keep each thread mostly busy doing useful work instead of blocking.
- Zero-copy where it matters:
  - For proxies and file servers, prefer sendfile/splice/tee over manual read+write loops.
  - For very high throughput, design around io_uring with fixed buffers.
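The writev sketch referenced in the batching bullet above: three small buffers, one system call. The HTTP strings are only illustrative payload.

```c
/* Batching several small buffers into one system call with writev().
 * Compare with calling write() three times: same bytes on the wire,
 * one third of the syscalls. */
#include <sys/uio.h>
#include <string.h>

ssize_t send_response(int fd)
{
    const char *status  = "HTTP/1.1 200 OK\r\n";
    const char *headers = "Content-Length: 5\r\n\r\n";
    const char *body    = "hello";

    struct iovec iov[3] = {
        { .iov_base = (void *)status,  .iov_len = strlen(status)  },
        { .iov_base = (void *)headers, .iov_len = strlen(headers) },
        { .iov_base = (void *)body,    .iov_len = strlen(body)    },
    };

    /* One syscall (and typically one TCP segment) instead of three writes
     * that may each produce a small packet. */
    return writev(fd, iov, 3);
}
```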
6.2 Anti-patterns
- Tiny synchronous reads in a loop:
- Example: reading 1 byte at a time from a socket in blocking mode.
- Leads to thousands of syscalls and context switches; TCP small-packet overhead dominates.
- One thread per connection:
- Works for tens or hundreds of connections, collapses at thousands.
- Spends more time context switching than doing work; blows L1/L2 caches.
- Unbounded queues:
- Producer threads enqueue work to a queue feeding I/O workers, but the queue is unbounded.
- Under load, latency explodes and the process may OOM before backpressure kicks in.
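For the last anti-pattern, the fix is a bounded queue that blocks producers when it fills up. A minimal pthread-based sketch (capacity and names are illustrative):

```c
/* A bounded work queue that applies backpressure: when the queue is full,
 * producers block instead of growing memory without limit. */
#include <pthread.h>

#define QUEUE_CAP 1024

struct work_queue {
    void           *items[QUEUE_CAP];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_full, not_empty;
};

/* Producer side: blocks when the queue is full -- this is the backpressure. */
void queue_push(struct work_queue *q, void *item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side (I/O worker): wakes a blocked producer after draining a slot. */
void *queue_pop(struct work_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}
```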
6.3 Practical Checklist
When building an I/O-heavy system, ask:
- Are we making more syscalls than necessary?
- Are we copying data more than twice between user and kernel space?
- Do we have bounded queues and clear backpressure behavior?
- Are we using the right primitive (epoll vs io_uring vs blocking I/O) for our latency/throughput goals?
Summary for Senior Engineers
- DMA is the baseline for modern I/O.
- io_uring is the practical way to reach millions of IOPS (I/O Operations Per Second) on modern NVMe drives.
- userfaultfd allows you to build custom memory management policies outside the kernel.
- Zero-Copy is mandatory for building high-performance proxy servers or file servers.