I/O Systems & Modern I/O
The goal of a modern I/O subsystem is to move data between devices and memory as fast as possible while minimizing CPU involvement. Think of the evolution like package delivery: PIO is you walking to the warehouse for every package; interrupt-driven I/O is the warehouse calling you when a package is ready; DMA is the warehouse delivering directly to your door; and io_uring is the warehouse having a key to your house and stocking your shelves while you sleep.
This chapter covers everything from basic DMA to the cutting-edge io_uring.
Interview Frequency: High for systems and backend roles
Key Topics: DMA, io_uring, zero-copy, userfaultfd, sendfile/splice
Time to Master: 8-12 hours
1. Traditional I/O Mechanisms
Understanding these three generations is essential because production systems often use all three simultaneously: PIO for low-speed control plane operations, interrupts for moderate traffic, and DMA for bulk data transfers.
- Programmed I/O (PIO): The CPU manually moves every byte between the device and memory using in/out instructions (port I/O) or MOV to MMIO addresses. For every byte transferred, the CPU executes an instruction.
- Verdict: Horribly inefficient; wastes CPU cycles. A 1GB transfer requires billions of CPU instructions just for data movement. Still used today for initial device configuration (writing a few control registers).
- Interrupt-Driven I/O: The CPU tells the device to perform a task and goes back to other work. The device raises an interrupt when finished. This freed the CPU from busy-waiting.
- Verdict: Better, but each interrupt costs ~1-5 microseconds of overhead (save/restore registers, jump to handler, acknowledge interrupt controller). At 100K interrupts per second (a busy NIC), that is already 10-50% of a core spent handling interrupts instead of doing useful work. At higher rates the CPU can be overwhelmed entirely, a condition known as an interrupt storm.
- Direct Memory Access (DMA): The CPU gives the device a pointer to a memory region and a transfer size. The device’s DMA engine moves the data directly to/from RAM and interrupts the CPU only once the entire transfer is complete.
- Verdict: The baseline for all modern high-performance I/O. The CPU sets up the transfer in microseconds, then is free to do other work while gigabytes of data flow between the device and memory.
2. Modern High-Performance I/O: io_uring
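io_uring (introduced in Linux 5.1, 2019) is the biggest revolution in Linux I/O in a decade. It addresses the performance bottlenecks of epoll and of the older asynchronous I/O interfaces: Linux native aio (io_submit), which in practice was only truly asynchronous for O_DIRECT file I/O, and glibc's POSIX AIO, which is emulated with user-space threads. Both were awkward to use and rarely adopted.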
The core problem io_uring solves: every syscall (read, write, sendmsg) requires a user-to-kernel transition that costs ~100-200 nanoseconds. When you are doing millions of IOPS on an NVMe drive, those transitions become the bottleneck — not the hardware.
The Architecture
io_uring uses two Ring Buffers in shared memory (mmap’d) between user-space and the kernel. This shared memory is the key insight: neither side needs to make a syscall to communicate with the other.
- Submission Queue (SQ): User-space writes I/O requests (Submission Queue Entries, or SQEs — e.g., “Read 4KB from FD 5 into buffer at 0x1234”) and increments a tail pointer. No syscall needed.
- Completion Queue (CQ): The kernel writes results (Completion Queue Entries, or CQEs — e.g., “Read completed, 4096 bytes, status=0”) and the user-space reads them by advancing its head pointer.
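The head/tail protocol described above can be sketched as a toy model. This is not the real io_uring ABI: the `Ring` class, the dictionary-shaped entries, and the `toy_kernel` function are all hypothetical stand-ins, and a real implementation also needs memory barriers around the index updates. The sketch only shows how a producer publishes by advancing a tail and a consumer reaps by advancing a head.

```python
from dataclasses import dataclass, field


@dataclass
class Ring:
    """Toy single-producer/single-consumer ring; size must be a power of two."""
    size: int
    head: int = 0  # advanced only by the consumer
    tail: int = 0  # advanced only by the producer
    slots: list = field(default_factory=list)

    def __post_init__(self):
        assert self.size > 0 and self.size & (self.size - 1) == 0
        self.slots = [None] * self.size

    def push(self, entry) -> bool:
        if self.tail - self.head == self.size:  # ring is full
            return False
        self.slots[self.tail & (self.size - 1)] = entry
        self.tail += 1  # publishing the entry is just an index update
        return True

    def pop(self):
        if self.head == self.tail:  # ring is empty
            return None
        entry = self.slots[self.head & (self.size - 1)]
        self.head += 1
        return entry


def toy_kernel(sq: Ring, cq: Ring) -> None:
    """Stand-in for the kernel side: drain the SQ and post one CQE per SQE."""
    while (sqe := sq.pop()) is not None:
        # Pretend every read fully succeeds: result = requested length.
        cq.push({"user_data": sqe["user_data"], "res": sqe["len"]})


# By default the real CQ is sized at twice the SQ; the toy mirrors that.
sq, cq = Ring(4), Ring(8)
sq.push({"op": "read", "fd": 5, "len": 4096, "user_data": 1})
sq.push({"op": "read", "fd": 5, "len": 512, "user_data": 2})
toy_kernel(sq, cq)
completions = [cq.pop(), cq.pop()]
```

The `user_data` field is what lets the application match a completion back to the request that produced it, since completions are not guaranteed to arrive in submission order.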
Why it is Faster
- Zero System Calls: By using shared memory rings, the application can submit thousands of I/O requests without a single io_uring_enter() syscall (when SQPOLL is enabled). The syscall overhead that dominated epoll-based servers is eliminated.
- SQPOLL (Submission Polling): The kernel can run a dedicated polling thread that constantly checks the SQ for new entries. This means the application just writes to memory, and the kernel picks it up within microseconds: no syscall, no interrupt, no context switch.
- Fixed Buffers: The application can pre-register memory regions with the kernel using IORING_REGISTER_BUFFERS. This eliminates the per-I/O cost of get_user_pages() (pinning user memory for DMA), a significant overhead when doing millions of small I/Os.
- Linked SQEs: You can chain operations (e.g., “read from file, then send to socket”) so the kernel executes them back-to-back without returning to user space between steps.
io_uring is not just for disk I/O. It supports network operations (accept, connect, send, recv), timers, and even file operations (open, close, stat). Modern high-performance systems (such as TigerBeetle, or Seastar through its io_uring backend) can run their entire I/O path on io_uring, replacing both epoll and traditional threading models.
3. userfaultfd: User-Space Paging
Normally, when a page fault occurs, the kernel handles it (by loading from disk or swap). userfaultfd allows an application to handle its own page faults. Think of it like a restaurant that lets you bring your own ingredients: the kitchen (kernel) provides the equipment (fault handling framework), but the food (page contents) comes from wherever you want.
- Flow:
- The application creates a userfaultfd object with the userfaultfd() syscall and registers a memory range on it (the UFFDIO_REGISTER ioctl), telling the kernel “I will handle faults in this region.”
- When any thread accesses a missing page in that region, the kernel suspends that thread (it does not kill it or return an error).
- A “manager” thread receives a notification over a file descriptor (pollable, so it works with epoll and io_uring).
- The manager fetches the page contents from wherever it wants: the network, a compressed store, a remote machine, a custom deduplication layer.
- The manager writes the data into the page using UFFDIO_COPY, and the kernel transparently resumes the faulting thread as if the page had always been there.
- Use Case: Live migration of Virtual Machines (QEMU uses this — migrate the VM first, then lazily fetch pages from the source host as the guest accesses them), distributed shared memory, and lazy decompression of snapshots.
userfaultfd is also used by some garbage collectors (notably the concurrent compacting collector in the Android Runtime, ART) to implement concurrent compaction: the GC moves objects while the application is running, and userfaultfd intercepts accesses to pages that are still being compacted so they can be fixed up transparently.
4. Zero-Copy: sendfile and splice
Moving data from a disk to a network socket the naive way involves a surprising number of copies:
Disk -> Kernel Page Cache -> User Buffer (via read()) -> Kernel Socket Buffer (via write()) -> NIC DMA Buffer. That is 4 data copies (2 DMA transfers plus 2 CPU copies) and 2 syscalls, each with a user/kernel transition in and out, for data the application never even looks at.
- sendfile(): Tells the kernel to move data directly from the file page cache to the socket buffer, bypassing user space entirely. Result: 1 syscall and 1 CPU copy (page cache to socket buffer), or no CPU copy at all when the NIC supports scatter-gather DMA. This is what Nginx, Apache, and Kafka use to push file data to sockets.
- splice(): Moves data between two file descriptors using a pipe as an intermediary, without copying the data; only the page reference counts are adjusted. This is more flexible than sendfile because it works between any two file descriptors (not just file-to-socket), making it ideal for proxy servers that forward data between sockets.
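As a minimal sketch of the sendfile path, the example below serves a temporary file over a local socket pair. It relies on Python's `socket.sendfile()`, which uses `os.sendfile()` where the platform supports it and silently falls back to a read/send loop otherwise, so zero-copy behaviour is an assumption about the host rather than a guarantee. The helper names `serve_file` and `demo` are illustrative only.

```python
import os
import socket
import tempfile
import threading


def serve_file(path: str, sock: socket.socket) -> int:
    """Send a whole file over a connected stream socket; returns bytes sent."""
    with open(path, "rb") as f:
        # No user-space buffer is involved when os.sendfile() is available.
        return sock.sendfile(f)


def demo():
    payload = os.urandom(64 * 1024)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)

    left, right = socket.socketpair()
    received = bytearray()

    def reader():
        # Drain the peer until EOF so the sender never blocks on a full buffer.
        while chunk := right.recv(65536):
            received.extend(chunk)

    t = threading.Thread(target=reader)
    t.start()
    try:
        sent = serve_file(tmp.name, left)
    finally:
        left.close()  # signals EOF to the reader
        t.join()
        right.close()
        os.unlink(tmp.name)
    return sent, bytes(received), payload
```

In a real server the destination would be a client TCP socket rather than a socket pair; the call site stays the same.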
sendfile, splice, or io_uring with fixed buffers.
5. Direct Access (DAX)
For Persistent Memory (like Intel Optane), the kernel can map the physical storage directly into the application’s address space. This is the ultimate evolution of I/O: there is no I/O at all. Storage is memory.
- No Page Cache: The application reads and writes directly to the hardware using MOV instructions, bypassing the entire OS storage stack (VFS, page cache, block layer, driver). Latency drops from microseconds (NVMe) to nanoseconds (memory bus).
- Persistence: Unlike regular RAM, the data survives power loss. But this introduces new challenges: CPU caches are volatile, so you must explicitly flush cache lines (CLFLUSH, CLWB) and use memory fences to ensure data reaches persistent media before you consider it “committed.”
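The programming model (plain stores into a mapping, then an explicit flush as the commit point) can be approximated on an ordinary file. This is only an analogy: the sketch below maps a regular file through the page cache, and `mmap.flush()` issues msync rather than CLWB plus a fence, so it does not exercise DAX itself. The `store_record` helper is hypothetical.

```python
import mmap


def store_record(path: str, offset: int, record: bytes) -> None:
    """Write a record with memory stores and treat the flush as the commit point."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            # A slice assignment is a memory store; no write() syscall is made.
            m[offset:offset + len(record)] = record
            # Until the flush returns, the record must not be considered committed.
            m.flush()
```

On real persistent memory the flush step is where cache-line write-back instructions and fences would go; the ordering discipline (store, flush, then acknowledge) is the part that carries over.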
5.5 Caveats and Common Pitfalls
6. I/O Patterns and Anti-patterns
The same kernel mechanisms can be used well or poorly. This section gives you a mental catalog.
6.1 Good Patterns
- Batch and coalesce I/O:
- Use readv/writev or sendmsg to group many small buffers into one system call.
- For files, read and write in page-sized or larger chunks (4KB, 64KB) when possible.
- Asynchronous multiplexing:
- Use epoll/kqueue or io_uring to handle many sockets with few threads.
- Keep each thread mostly busy doing useful work instead of blocking.
- Zero-copy where it matters:
- For proxies and file servers, prefer sendfile/splice/tee over manual read+write loops.
- For very high throughput, design around io_uring with fixed buffers.
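The batching pattern can be illustrated with `os.writev`, which hands several buffers to the kernel in one system call. A pipe stands in for the socket here, and the `send_batched` helper and the three buffers are made up for the example.

```python
import os


def send_batched(fd: int, parts: list) -> int:
    """Write several small buffers with a single writev() system call."""
    return os.writev(fd, parts)


r, w = os.pipe()
header, body, trailer = b"HDR:", b"payload", b"\n"

# One syscall instead of three separate os.write() calls.
written = send_batched(w, [header, body, trailer])
os.close(w)

result = os.read(r, 1024)
os.close(r)
```

On a real socket `writev` can return a short count, so production code would loop over the remaining buffers; the pipe write here is small enough to complete in one call.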
6.2 Anti-patterns
- Tiny synchronous reads in a loop:
- Example: reading 1 byte at a time from a socket in blocking mode.
- Leads to thousands of syscalls and user/kernel transitions; per-call overhead dwarfs the actual data movement.
- One thread per connection:
- Works for tens or hundreds of connections, collapses at thousands.
- Spends more time context switching than doing work; blows L1/L2 caches.
- Unbounded queues:
- Producer threads enqueue work to a queue feeding I/O workers, but the queue is unbounded.
- Under load, latency explodes and the process may OOM before backpressure kicks in.
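A bounded queue is the simplest fix for the last anti-pattern. In this sketch the producer blocks as soon as the queue holds `depth` items, which is the backpressure signal. The `run_pipeline` function and its list-appending worker are illustrative placeholders for real I/O workers.

```python
import queue
import threading


def run_pipeline(n_items: int, depth: int):
    """Feed n_items through a bounded queue; returns (processed items, max depth seen)."""
    q = queue.Queue(maxsize=depth)  # bounded: put() blocks when the queue is full
    done = []
    max_seen = 0

    def worker():
        # None is the shutdown sentinel.
        while (item := q.get()) is not None:
            done.append(item)

    t = threading.Thread(target=worker)
    t.start()
    for i in range(n_items):
        q.put(i)  # the producer stalls here instead of growing memory without limit
        max_seen = max(max_seen, q.qsize())
    q.put(None)
    t.join()
    return done, max_seen
```

Where stalling the producer is not acceptable, `put_nowait()` raises `queue.Full` instead, which gives the caller a place to shed load or return an error upstream.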
6.3 Practical Checklist
When building an I/O-heavy system, ask:
- Are we making more syscalls than necessary?
- Are we copying data more than twice between user and kernel space?
- Do we have bounded queues and clear backpressure behavior?
- Are we using the right primitive (epoll vs io_uring vs blocking I/O) for our latency/throughput goals?
Summary for Senior Engineers
- DMA is the baseline for modern I/O. If your driver is not using DMA, it is not a production driver.
- io_uring is the most practical way to reach millions of IOPS on modern NVMe drives through the kernel I/O stack (kernel-bypass frameworks such as SPDK are the main alternative). The syscall overhead of read()/write() becomes the bottleneck long before the hardware does. A senior engineer would say: “If we are doing more than 100K IOPS, we should be benchmarking io_uring against our current epoll approach.”
- userfaultfd allows you to build custom memory management policies outside the kernel. It is a niche but powerful tool for VM live migration, lazy loading, and garbage collection.
- Zero-Copy (sendfile, splice, io_uring fixed buffers) is mandatory for building high-performance proxy servers or file servers. The difference between copying and not copying is often 2-3x throughput.
- The I/O hierarchy: PIO (legacy) -> Interrupt-driven (moderate load) -> DMA (bulk data) -> io_uring (millions of IOPS) -> DAX (nanosecond storage). Know where your workload fits and use the right mechanism.
Interview Deep-Dive
Compare select / poll / epoll / io_uring -- performance and API tradeoffs
Strong Answer Framework:
- Frame the historical arc. select (1983) is the original BSD multiplexer, later standardized by POSIX. poll (System V) removed the fd-set size limit. epoll (Linux 2.5.45, 2002) introduced O(active) notifications. io_uring (Linux 5.1, 2019) eliminated the syscall-per-I/O entirely. Each generation responded to a real bottleneck of its predecessor.
- Compare on three axes: scalability, API ergonomics, syscall cost. select is O(n) per call, capped at 1024 fds, copies the fd_set into the kernel every time. poll is O(n) per call, no fd cap. epoll is O(1) per ready fd via interest list registration; supports edge-triggered semantics for further wakeup reduction. io_uring removes the “syscall to learn what is ready” step entirely: you submit operations and reap results from shared memory rings.
- Anchor with numbers. A read() syscall is ~100-200 ns user/kernel transition on x86_64 with mitigations. At 1M IOPS, that is 100-200ms of CPU per second per core wasted on transitions alone. io_uring with SQPOLL removes that cost from the application thread (a kernel polling thread pays for it instead).
- Make a workload-specific recommendation. For a few hundred connections: epoll is fine and easier to reason about. For millions of IOPS on NVMe or millions of network connections: io_uring with fixed buffers and registered files. For portable code (BSD/macOS): kqueue.
- Acknowledge the cost. io_uring is rapidly evolving and has had a string of CVEs (mostly in unprivileged use). Many container runtimes (Docker default seccomp) block it. Plan for the operational reality.

epoll-based proxying with io_uring for parts of their data plane. Their published benchmarks showed roughly 50 percent reduction in CPU usage at the same request rate. In 2023, Google reported that it had disabled or restricted io_uring on its production servers, ChromeOS, and Android after a large share of the kernel exploits submitted to its bug bounty program targeted io_uring. It is a reminder that performance advantages come with security trade-offs that need explicit acceptance.

Senior Follow-up 1: “At what connection count does epoll start losing to io_uring and why?”
There is no single number, but rule of thumb: if you exceed ~200K syscalls per second, the syscall floor becomes the dominant cost. That happens around 100K active connections doing chatty work, or much earlier on storage I/O (50K IOPS hits 100K syscalls if you split read+write). Below that, the engineering complexity of io_uring rarely pays for itself.

Senior Follow-up 2: “What is edge-triggered vs level-triggered, and what bug does each enable?”
Level-triggered: notification fires while data remains. Edge-triggered: fires only on transition (no data to data). Edge-triggered scales better but enables a classic bug: you read once, do not drain, and never get woken up again because no transition occurred. Always loop until EAGAIN with edge-triggered fds. With level-triggered, you can be lazy but pay in extra wakeups.

Senior Follow-up 3: “How does io_uring handle backpressure?”
The SQ has a fixed depth set at io_uring_setup. If the SQ is full, liburing's io_uring_get_sqe() returns NULL and you must submit what is queued before adding more. The CQ also has a fixed depth. If you do not reap completions fast enough, kernels with IORING_FEAT_NODROP (5.5+) hold the overflowed completions in a kernel-side backlog and fail further submissions with -EBUSY until you reap; older kernels dropped the completions and only incremented an overflow counter. Sizing the rings is application work; defaults are rarely right for high-throughput servers.

Common Wrong Answers:
- “epoll and io_uring are basically the same, just different APIs.” No. epoll notifies you that I/O is ready; you still pay a syscall to do the I/O. io_uring has the kernel do the I/O on your behalf. The mental model is different: notification vs delegation.
- “Always use the latest, so always use io_uring.” io_uring adds complexity, has a moving security surface, is Linux-only, and is missing or incomplete on older kernels. For most apps, epoll/kqueue is the right answer for the next several years.
- “select is fine; the 1024 limit is per-process.” The 1024 limit is FD_SETSIZE and applies to the bitmap, not the process fd table. You can have 100K open fds but select cannot watch any beyond fd 1023.
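The “loop until EAGAIN” rule for edge-triggered descriptors can be sketched as follows. In Python, EAGAIN on a non-blocking socket surfaces as `BlockingIOError`. The `drain` and `demo` helpers are illustrative, and `demo` assumes a Linux host because `select.epoll` does not exist elsewhere.

```python
import select
import socket


def drain(sock: socket.socket, chunk: int = 4096) -> bytes:
    """Read from a non-blocking socket until it would block (EAGAIN) or hits EOF."""
    data = bytearray()
    while True:
        try:
            part = sock.recv(chunk)
        except BlockingIOError:  # EAGAIN/EWOULDBLOCK: the kernel buffer is empty
            break
        if not part:  # peer closed the connection
            break
        data.extend(part)
    return bytes(data)


def demo() -> bytes:
    """Register one fd edge-triggered, wait for a single event, then drain fully."""
    a, b = socket.socketpair()
    b.setblocking(False)
    ep = select.epoll()
    ep.register(b.fileno(), select.EPOLLIN | select.EPOLLET)
    a.sendall(b"x" * 10000)
    events = ep.poll(1.0)
    # Reading only one 1024-byte chunk here would strand the remaining bytes:
    # with EPOLLET no further event fires until new data arrives.
    out = drain(b, chunk=1024) if events else b""
    ep.close()
    a.close()
    b.close()
    return out
```

With level-triggered registration (no `EPOLLET`), a single `recv` per wakeup would still be correct, at the price of one extra `epoll_wait` round trip per chunk.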
Further Reading:
- “Efficient I/O event notification” (the libev documentation comparing select/poll/epoll/kqueue), Marc Lehmann
- “What is io_uring?” (kernel.org docs) and the lwn.net series on io_uring evolution
- Jens Axboe’s “Faster IO through io_uring” presentation (FOSDEM 2020)
Walk through a 1MB read() -- every layer it touches from userland to disk
Strong Answer Framework:
- Set the boundary conditions. Buffered read of 1MB from a regular file on ext4, NVMe, no O_DIRECT, file not in page cache. The user buffer is 1MB heap, possibly straddling page boundaries.
- Userland to syscall. The libc read() wrapper is a thin layer over the read(2) syscall. On Linux x86_64 the syscall number is __NR_read = 0. The instruction is syscall, costing ~100-200ns of mode switch (more with retpoline mitigations).
- VFS dispatch. ksys_read -> vfs_read -> file’s f_op->read_iter (for ext4: ext4_file_read_iter). VFS validates fd, takes a reference on struct file, checks permissions, and resolves the position.
- Page cache lookup. generic_file_read_iter walks the file’s address_space radix tree (an XArray since Linux 4.20) page by page (256 pages for 1MB). For each page: hit means copy_to_user from cached page; miss means trigger readahead.
- Readahead. The kernel detects sequential access and schedules ahead-of-need reads. For 1MB fresh, expect a single ~128KB readahead bio plus on-demand reads for the rest.
- Block layer. Each cache miss creates a bio describing source disk sectors and target page-cache pages. submit_bio hands it to blk-mq, which places it on the per-CPU software queue, applies the I/O scheduler (mq-deadline by default for SATA, “none” common for NVMe), and dispatches to a hardware queue.
- Driver. nvme_queue_rq builds a 64-byte NVMe command, writes it into the SQ ring buffer, and rings the doorbell (single MMIO write).
- Hardware. NVMe controller DMAs sectors directly into page-cache pages over PCIe (no CPU copy). On completion, posts to the CQ, raises an MSI-X interrupt.
- Completion path. ISR runs, calls bio_endio, marks pages up-to-date. The original waiting task is woken.
- copy_to_user. Once pages are present, kernel copies from page cache to user buffer. This is the one mandatory copy in buffered I/O. Returns to userland with bytes_read.

read()+write() to sendfile() to kTLS+sendfile() on FreeBSD for video delivery. Each layer they removed (the user-space copy, then the encryption hop) shaved double-digit percentages off CPU per gigabit. The exact same path analysis applies to a read(): every copy and every context switch is a knob you may eventually need to remove.

Senior Follow-up 1: “What changes if O_DIRECT is set?”
Page cache is bypassed entirely. The user buffer must be aligned to the device logical block size (commonly 512 bytes or 4KB). The bio points directly at user pages (pinned via get_user_pages). No readahead. No copy_to_user (data is DMAed straight into user memory). Failure modes get harsher: misalignment returns EINVAL, partial reads are common, retries are your problem.

Senior Follow-up 2: “Where is the latency budget spent on a cold 1MB read from NVMe?”
Roughly: ~150ns syscall, ~1us VFS+block layer, ~80us NVMe command latency (typical p50 for random read), ~100us NAND access for first page, then sequential reads pipeline at near line rate. The tail is dominated by NAND access, not software. On a 7GB/s drive, 1MB transfer is ~140us of pure data motion.

Senior Follow-up 3: “How would io_uring change this trace?”
No syscall in the hot path with SQPOLL. Submission entry written directly to the shared SQ ring; kernel polling thread picks it up; bio submitted; completion posted to CQ ring; userland reaps with no syscall. With IORING_REGISTER_BUFFERS, user pages are pre-pinned so get_user_pages is amortized to setup time.

Common Wrong Answers:
- “It just calls read and the OS handles it.” That is the abstraction; the question asks for the machinery. Be specific about VFS, page cache, blk-mq, driver.
- “It always reads from disk.” No. If pages are cached, it never touches the device. This is why benchmarking I/O without flushing the page cache produces fictional numbers.
- “There is one DMA copy.” For buffered I/O there are TWO data motions: DMA into page cache, then CPU copy_to_user into the user buffer.
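The round numbers quoted in this walkthrough follow from simple arithmetic, reproduced below as a sanity check. The 4KB page size, 128KB readahead window, 100-200ns syscall cost, and 7GB/s drive are the figures assumed in the text, not measurements.

```python
PAGE_SIZE = 4096                  # bytes, typical x86_64 page
READ_SIZE = 1 << 20               # the 1MB read() under discussion
READAHEAD = 128 * 1024            # readahead window assumed in the text

# Page-cache pages touched by the read, and readahead-sized chunks it spans.
pages_per_read = READ_SIZE // PAGE_SIZE
readahead_chunks = READ_SIZE // READAHEAD

# CPU time spent purely on user/kernel transitions at 1M syscalls per second.
syscall_ns = (100, 200)
syscalls_per_sec = 1_000_000
cpu_ms_per_sec = tuple(ns * syscalls_per_sec / 1e6 for ns in syscall_ns)

# Pure data-motion time for 1MB (decimal) on a 7GB/s drive, in microseconds.
transfer_us = 1e6 / 7e9 * 1e6
```

The last figure is why software overhead of about a microsecond barely registers for one large read, while the same overhead dominates when the workload is millions of 4KB reads.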
Further Reading:
- Robert Love, Linux Kernel Development, ch. 13 (VFS) and ch. 14 (Block Layer)
- “What every programmer should know about memory”, Ulrich Drepper (memory hierarchy interactions matter for I/O)
- Brendan Gregg, Systems Performance: the I/O chapters and iolatency/biosnoop walkthroughs
When does O_DIRECT help, when does it hurt?
Strong Answer Framework:
- State the mechanism first. O_DIRECT bypasses the page cache; reads/writes go between user buffers and the block device via DMA, with alignment constraints (typically 4KB or device logical block size).
- Helps when you have a smarter cache than the kernel. Databases (Postgres, MySQL/InnoDB, Oracle) maintain their own buffer pool with workload-aware replacement policies (clock-pro, ARC, custom). Letting the kernel page cache duplicate every cached page wastes RAM proportional to the buffer pool size.
- Helps for “read it once, throw it away” patterns. A backup tool or full-table scan reading a 100GB file once would otherwise evict useful pages from the cache for the rest of the system. O_DIRECT (or posix_fadvise(POSIX_FADV_DONTNEED)) prevents this cache pollution.
- Hurts for small, random, latency-sensitive reads. No readahead, no caching. Every miss is a real disk hit. A 4KB read in a hot path goes from “page cache hit at 100ns” to “NVMe read at 50us”, a 500x slowdown.
- Hurts when alignment is fragile. Apps that build buffers via malloc get arbitrary alignment. O_DIRECT requires posix_memalign or equivalent. A misaligned pwrite returns EINVAL and looks like a bug, not a constraint.
- Does NOT guarantee durability. Common misconception: O_DIRECT is not fsync. The drive’s volatile write cache may still hold data. You still need explicit fsync/fdatasync for durability barriers, plus the drive must respect cache flush commands (most do; some cheap consumer SSDs lie).

O_DIRECT semantics across filesystems. Andres Freund’s findings shaped Postgres 12’s WAL handling. Around the same time, MySQL’s innodb_flush_method=O_DIRECT_NO_FSYNC was reconsidered after engineers realized “no fsync” assumed the drive honored the FUA flag, which not all drives did. The takeaway: O_DIRECT is necessary but not sufficient for durability.

Senior Follow-up 1: “How do you decide alignment requirements at runtime?”
ioctl(fd, BLKSSZGET, &sector_size) returns the device logical block size. BLKPBSZGET returns the physical block size. Allocate buffers with posix_memalign(&buf, sector_size, len) and ensure len and offset are multiples. Many apps just hardcode 4KB, which works on common hardware (including 4Kn drives) but can break on unusual configurations with larger block sizes; hardcoding 512 bytes is what breaks on 4Kn drives.

Senior Follow-up 2: “What is the difference between O_DIRECT and O_SYNC?”
O_DIRECT controls how I/O happens (bypass cache). O_SYNC controls when the syscall returns (after data is durable). They are orthogonal and often combined for WAL: O_DIRECT | O_DSYNC says “no kernel cache and do not return until durable.” Note O_DSYNC is weaker than O_SYNC: it elides metadata flushes that are not needed to read the data back.

Senior Follow-up 3: “Why does Linux have posix_fadvise(POSIX_FADV_DONTNEED) if we have O_DIRECT?”
O_DIRECT is all-or-nothing per file. fadvise lets you keep the page cache for normal reads but evict specific ranges after sequential scans. Useful for cron jobs, backups, and data migrations that want to leave the cache as they found it. It is also far simpler than alignment gymnastics.

Common Wrong Answers:
- “O_DIRECT makes I/O faster.” It can; it can also be 100x slower. It removes a cache layer; whether that helps depends on whether the cache was useful.
- “O_DIRECT writes are durable.” No. Durability requires the drive to flush its write cache, which fsync does and O_DIRECT does not.
- “Use O_DIRECT for all database files.” Indexes, temp files, sort spills, and small-config tables often benefit from page cache. Postgres has historically avoided O_DIRECT for exactly this reason; MySQL/InnoDB embraces it for the buffer pool but not the binlog. The right answer is per-file-class, not per-database.
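The alignment constraint is easier to see in code. The sketch below attempts an O_DIRECT read into a page-aligned buffer (an anonymous mmap, standing in for posix_memalign) and returns None when the platform or filesystem refuses O_DIRECT. The `read_direct` helper and the hardcoded 4096-byte block size are assumptions for illustration; a careful program would query the device as described in the follow-up above.

```python
import mmap
import os


def read_direct(path: str, length: int, block: int = 4096):
    """Read `length` bytes from offset 0 with O_DIRECT; None if unsupported here."""
    if not hasattr(os, "O_DIRECT"):  # not available on every platform
        return None
    # The request size must be a multiple of the block size, so round up.
    size = -(-length // block) * block
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError:  # some filesystems reject O_DIRECT at open time
        return None
    buf = mmap.mmap(-1, size)  # anonymous mappings are page-aligned
    try:
        n = os.readv(fd, [buf])  # data lands in buf without touching the page cache
        return buf[:min(n, length)]
    except OSError:  # for example EINVAL when alignment rules are stricter
        return None
    finally:
        buf.close()
        os.close(fd)
```

Swapping the mmap buffer for an ordinary `bytearray` is a quick way to reproduce the EINVAL failure mode on filesystems that enforce alignment, since its backing memory has no alignment guarantee.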
Further Reading:
- Linus Torvalds, the famous LKML thread on O_DIRECT (“the right answer is not O_DIRECT, it is…”): read for the philosophy debate
- PostgreSQL Wiki, “Fsync Errors”: the canonical write-up of fsync-gate
- “Files are hard”: Dan Luu’s survey of crash-consistency papers including O_DIRECT interactions
What is zero-copy and walk through how sendfile() avoids data copies vs read+write
Strong Answer Framework:
- Define zero-copy precisely. Zero-copy means the data never crosses the user/kernel boundary as a CPU memory copy. The CPU may still touch metadata; the bulk data flows kernel-to-kernel or device-to-device via DMA.
- Establish the baseline cost. Naive read()+write() from disk to socket: DMA disk -> page cache (1), copy page cache -> user buffer (2), copy user buffer -> socket buffer (3), DMA socket buffer -> NIC (4). Four data motions and two syscalls (four user/kernel transitions) per read/write pair.
- sendfile(out_fd, in_fd, ...) collapses the user roundtrip. Data path: DMA disk -> page cache (1), copy page cache -> socket buffer (2), DMA socket buffer -> NIC (3). With NIC scatter-gather, the middle copy disappears and the NIC DMAs directly from page cache. True zero CPU copies in the bulk path.
- splice() generalizes via pipes. splice moves pages between any two fds using a kernel pipe as the intermediate, by reference (page pointers, not copies). Useful for proxies (socket-to-socket), unlike sendfile which only goes file-to-socket.
- io_uring with fixed buffers achieves the same result for general I/O: pre-registered user pages stay pinned, no per-I/O get_user_pages, kernel can DMA directly to/from them.
- The catch: zero-copy excludes transformation. If you need to compress, encrypt in user space, or modify headers, you must touch the bytes. kTLS (Linux 4.13+) brings TLS encryption into the kernel so sendfile can serve HTTPS without breaking the zero-copy chain.

sendfile end to end; brokers serve segments to consumers without ever touching the bytes in user space. Their original 2011 paper documents this as a primary reason Kafka can saturate NICs at low CPU. Nginx, HAProxy, and the Linux kernel’s NFS server are all sendfile-first. Conversely, Envoy initially could not use sendfile because of L7 inspection; they invest in user-space zero-copy via fixed buffers and io_uring instead.

Senior Follow-up 1: “What breaks sendfile?”
TLS without kTLS (encryption needs the bytes). Compression (same). Application-level inspection or rewriting (any L7 proxy). Filesystems that do not implement the splice/sendfile path (some FUSE filesystems). When sendfile is unavailable, fall back to splice via a pipe or read+write with at least a user-side mmap to skip one copy.

Senior Follow-up 2: “How do you measure whether you are actually zero-copy?”
perf trace -e read,write,sendfile,splice shows syscall mix. bpftrace on vfs_read/vfs_write reveals copy paths. The smoking gun is high system CPU on a network-bound workload; that is copy_to_user/copy_from_user you should not be doing. Tools like bcc’s funcslower against __copy_user_enhanced_fast_string confirm where copies happen.

Senior Follow-up 3: “Does zero-copy help receive paths?”
Yes, but it is harder. MSG_ZEROCOPY (Linux 4.14+) lets send() reference user pages directly. Receive-side zero-copy (SO_ZEROCOPY is send-only; receive uses mmap ring with AF_XDP or io_uring’s recv with fixed buffers) still has alignment and lifetime caveats. Most “zero-copy networking” wins are on the send side.

Common Wrong Answers:
- “sendfile does not copy at all.” It still copies between page cache and socket buffer unless the NIC supports scatter-gather DMA. The “zero” is from the userland perspective.
- “Zero-copy is only relevant for tiny servers.” It is the difference between saturating a 100Gbps NIC at one CPU vs four. Critical for CDNs, video servers, message brokers.
- “mmap plus write is zero-copy.” It eliminates the user-side read copy, but write from the mmap region still copies into the socket buffer. Helpful, not zero.
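The splice path (file to pipe to socket, by page reference) can be sketched with `os.splice`, which Python exposes on Linux from version 3.10. The `splice_file_to_socket` helper is hypothetical, handles a single chunk that fits in the pipe, and returns None where splice is unavailable or rejected for the given descriptors.

```python
import os
import socket


def splice_file_to_socket(path: str, sock: socket.socket, count: int):
    """Forward up to `count` bytes file -> pipe -> socket; None if splice is unusable."""
    if not hasattr(os, "splice"):  # needs Linux and Python 3.10+
        return None
    pipe_r, pipe_w = os.pipe()
    fd = os.open(path, os.O_RDONLY)
    try:
        # Step 1: attach file pages to the pipe (count should fit the pipe buffer).
        moved = os.splice(fd, pipe_w, count, offset_src=0)
        # Step 2: hand the same pages from the pipe to the socket.
        sent = 0
        while sent < moved:
            n = os.splice(pipe_r, sock.fileno(), moved - sent)
            if n == 0:
                break
            sent += n
        return sent
    except OSError:  # descriptor types that do not support splice
        return None
    finally:
        for d in (fd, pipe_r, pipe_w):
            os.close(d)
```

A socket-to-socket proxy uses the same two steps with a socket as the source and omits `offset_src`; a production version would loop over chunks and keep the pipe open across calls.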
Further Reading:
- “Efficient data transfer through zero copy”: IBM developerWorks classic article
- Jens Axboe’s “Zero-copy networking with io_uring”: recent state of the art
- Kafka 2011 paper, “Kafka: a Distributed Messaging System for Log Processing”: the canonical sendfile case study