Kernel Networking Stack

The networking subsystem is one of the most complex and performance-critical parts of the Linux kernel. Understanding how packets flow from the Network Interface Card (NIC) through kernel layers to application sockets is essential for building high-performance networked systems.
Mastery Level: Senior Systems Engineer
Key Internals: sk_buff, NAPI, RSS/RPS/RFS, XDP, TCP congestion control, Netfilter
Prerequisites: Interrupts, Virtual Memory

1. The Network Stack Architecture

1.1 Layer Overview

The Linux network stack follows the OSI model but implements it in a Linux-specific way:
┌──────────────────────────────────────────────────────┐
│         Application Layer (User Space)               │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐ │
│  │  Web Server  │  │   Database   │  │    DNS     │ │
│  └──────┬───────┘  └──────┬───────┘  └─────┬──────┘ │
│         │                 │                 │        │
│    Socket API (read/write/send/recv)        │        │
└─────────┼─────────────────┼─────────────────┼────────┘
          │                 │                 │
          ↓ Syscall         ↓                 ↓
┌──────────────────────────────────────────────────────┐
│                  Kernel Space                         │
│  ┌───────────────────────────────────────────────┐   │
│  │  Socket Layer (struct socket, struct sock)    │   │
│  │  - File descriptor management                 │   │
│  │  - Socket buffer management                   │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Transport Layer (TCP/UDP/SCTP)               │   │
│  │  - Segmentation / Reassembly                  │   │
│  │  - Flow control, Congestion control           │   │
│  │  - Port management                            │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Network Layer (IPv4/IPv6)                    │   │
│  │  - Routing decisions (FIB lookup)             │   │
│  │  - Fragmentation / Reassembly                 │   │
│  │  - Netfilter hooks (firewall)                 │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Link Layer (Ethernet, WiFi)                  │   │
│  │  - ARP resolution                             │   │
│  │  - MAC address handling                       │   │
│  │  - Queueing disciplines (tc)                  │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Network Device (struct net_device)           │   │
│  │  - Driver interface                           │   │
│  │  - Ring buffer management                     │   │
│  └─────────────────┬─────────────────────────────┘   │
└────────────────────┼─────────────────────────────────┘
                     ↓ DMA
          ┌──────────────────────┐
          │  Network Interface   │
          │  Card (NIC)          │
          │  - RX/TX Ring Buffers│
          │  - Checksumming      │
          │  - TSO/GSO offload   │
          └──────────────────────┘

              Physical Network

2. The Core Data Structure: sk_buff

The struct sk_buff (socket buffer) is the heart of the Linux networking stack. It represents a network packet as it travels through the kernel.

2.1 sk_buff Structure

// Simplified from include/linux/skbuff.h
struct sk_buff {
    // Linked list management
    struct sk_buff *next;
    struct sk_buff *prev;

    // Socket association
    struct sock *sk;

    // Timestamps
    ktime_t tstamp;

    // Network device
    struct net_device *dev;

    // Data pointers (THE KEY TO UNDERSTANDING NETWORKING)
    unsigned char *head;      // Start of allocated buffer
    unsigned char *data;      // Start of valid data
    sk_buff_data_t tail;      // End of valid data
    sk_buff_data_t end;       // End of allocated buffer

    // Buffer size information
    unsigned int len;         // Length of actual data
    unsigned int data_len;    // Length of fragmented data

    // Protocol headers (updated as packet moves up/down stack)
    __u16 transport_header;   // Offset to transport header (TCP/UDP)
    __u16 network_header;     // Offset to network header (IP)
    __u16 mac_header;         // Offset to MAC header (Ethernet)

    // Metadata
    __u32 priority;
    __u8 ip_summed;          // Checksum status
    __u8 cloned:1;           // Is this a clone?
    __u8 nohdr:1;

    // Reference counting
    refcount_t users;

    // Fragmented data (for large packets)
    unsigned int truesize;   // Total allocated size
    atomic_t dataref;

    // Shared info (struct skb_shared_info) lives at the end of the data
    // buffer and is reached via skb_shinfo(skb) rather than a pointer field
};

2.2 sk_buff Memory Layout

Understanding the memory layout is crucial for understanding zero-copy optimizations:
Packet Reception (growing from head toward tail):
─────────────────────────────────────────────────────

┌─────────────────────────────────────────────────────┐
│                                                       │
│  head                                         end    │
│   ↓                                            ↓     │
│   ┌──────────┬───────────┬──────────┬─────────┐     │
│   │ headroom │  Eth Hdr  │  IP Hdr  │ TCP Hdr │ ... │
│   │ (unused) │  (14 B)   │  (20 B)  │ (20 B)  │ ... │
│   └──────────┴───────────┴──────────┴─────────┘     │
│              ↑                                 ↑     │
│             data                             tail    │
│                                                       │
│   <-headroom->  <----- len (data length) ---->       │
│   <-------------- truesize (total alloc) ----------> │
└─────────────────────────────────────────────────────┘

Header Pointers:
  mac_header     → Points to Ethernet header
  network_header → Points to IP header
  transport_header → Points to TCP header

Key Operations:
  skb_push()  - Decrease data pointer (add header)
  skb_pull()  - Increase data pointer (remove header)
  skb_put()   - Increase tail pointer (add data at end)
  skb_trim()  - Decrease tail pointer (remove data at end)
Example: Adding Ethernet Header:
// Before: data points to IP header
┌────────────────────────────────┐
│ headroom │ IP Hdr │ TCP Hdr │  │
└──────────┴────────┴─────────┴──┘
           ↑ data

// Add Ethernet header
unsigned char *eth_hdr = skb_push(skb, ETH_HLEN);  // ETH_HLEN = 14

// After: data now points to Ethernet header
┌────────────────────────────────┐
│ headrm │Eth│ IP Hdr │ TCP Hdr │ │
└────────┴───┴────────┴─────────┴─┘
         ↑ data
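
A minimal sketch of the transmit-side counterpart (build_tx_skb() is a hypothetical helper, not a kernel function): reserve headroom up front with skb_reserve(), append the payload with skb_put(), and let each lower layer later prepend its header with skb_push():
struct sk_buff *build_tx_skb(struct net_device *dev,
                             const void *payload, unsigned int len)
{
    // Allocate room for the payload plus worst-case protocol headers
    struct sk_buff *skb = alloc_skb(len + MAX_TCP_HEADER, GFP_ATOMIC);
    if (!skb)
        return NULL;

    skb_reserve(skb, MAX_TCP_HEADER);         // create headroom (moves data and tail)
    memcpy(skb_put(skb, len), payload, len);  // append payload, advancing tail
    skb->dev = dev;
    return skb;  // TCP/IP/Ethernet headers are added later via skb_push()
}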

2.3 Zero-Copy Mechanisms

Problem: Copying large packets is expensive (memory-bandwidth limited).

Solution 1: skb_clone() - Clone the sk_buff structure, share the data buffer
struct sk_buff *clone = skb_clone(original_skb, GFP_ATOMIC);

Original:           Clone:
┌──────────┐       ┌──────────┐
│ sk_buff  │       │ sk_buff  │
│  struct  │       │  struct  │
│          │       │          │
│ head ────┼──┐    │ head ────┼──┐
│ data ────┼──┤    │ data ────┼──┤   Point to the SAME buffer
│ tail     │  │    │ tail     │  │
└──────────┘  │    └──────────┘  │
              └─────────┬────────┘
                        ↓
             ┌─────────────────────┐
             │  Shared Data Buffer │
             │  (refcount = 2)     │
             └─────────────────────┘

Use case: Packet sniffing (tcpdump)
- Clone packet for sniffer
- Original continues up the stack
- No data copy!
Solution 2: Fragmented Data (skb_frag) - For large packets
struct skb_shared_info {
    unsigned char nr_frags;  // Number of fragments
    skb_frag_t frags[MAX_SKB_FRAGS];  // Fragment array
};

typedef struct skb_frag_struct {
    struct page *page;       // Points to physical page
    __u16 page_offset;       // Offset within page
    __u16 size;              // Fragment size
} skb_frag_t;

Layout for 9000-byte packet (Jumbo frame):
┌──────────────────────────────────────────────┐
│ sk_buff                                      │
│  head → [headers: 54 bytes]                  │
│  data → [first chunk of data]                │
└──┬───────────────────────────────────────────┘

   └→ skb_shared_info
      ├→ frag[0] → page A (1500 bytes)
      ├→ frag[1] → page B (1500 bytes)
      ├→ frag[2] → page C (1500 bytes)
      └→ frag[3] → page D (remainder)

NIC DMAs directly into these pages (zero-copy receive!)
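
A hedged sketch of how such a fragmented skb is walked; skb_shinfo(), skb_headlen(), and skb_frag_size() are real helpers, while the surrounding function is illustrative:
#include <linux/skbuff.h>
#include <linux/printk.h>

// Illustrative: report the linear part and each paged fragment of an skb
static void walk_skb_frags(const struct sk_buff *skb)
{
    struct skb_shared_info *shinfo = skb_shinfo(skb);
    int i;

    // Linear part: bytes between skb->data and the tail pointer
    pr_info("linear part: %u bytes\n", skb_headlen(skb));

    // Paged fragments: data the NIC DMAed straight into pages
    for (i = 0; i < shinfo->nr_frags; i++)
        pr_info("frag %d: %u bytes\n", i, skb_frag_size(&shinfo->frags[i]));
}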

2.4 sk_buff Operations

#include <linux/skbuff.h>

// Add Ethernet header (14 bytes)
struct ethhdr *eth = (struct ethhdr *)skb_push(skb, sizeof(struct ethhdr));
eth->h_proto = htons(ETH_P_IP);
memcpy(eth->h_dest, dst_mac, ETH_ALEN);
memcpy(eth->h_source, src_mac, ETH_ALEN);

// Remove Ethernet header when moving up stack
skb_pull(skb, sizeof(struct ethhdr));

// Access network header
struct iphdr *iph = ip_hdr(skb);  // Macro: (struct iphdr *)skb_network_header(skb)

// Access transport header
struct tcphdr *tcph = tcp_hdr(skb);

3. Packet Reception: From Wire to Socket

3.1 The Legacy Interrupt-Driven Model (Pre-NAPI)

Old Approach (before 2.5 kernel):
1. Packet arrives at NIC
   └→ NIC DMAs packet to ring buffer
   └→ NIC raises hardware IRQ

2. CPU receives interrupt
   └→ Context switch (save current task)
   └→ Jump to interrupt handler

3. Interrupt handler (top half)
   └→ Disable interrupts (critical section)
   └→ Allocate sk_buff
   └→ Copy packet data from ring buffer to sk_buff
   └→ Queue sk_buff to network stack
   └→ Re-enable interrupts
   └→ Return from interrupt

4. Network stack processes packet (bottom half)
   └→ IP layer processing
   └→ TCP layer processing
   └→ Socket layer delivery

Problem: At 1Gbps+ speeds, 100,000+ interrupts/sec
Result: 100% CPU time in interrupt handling (interrupt storm)

3.2 NAPI: New API (Polling + Interrupts)

Solution: Hybrid polling/interrupt model
Receive Flow with NAPI:
────────────────────────

Phase 1: Low Traffic (Interrupt Mode)
┌─────────────────────────────────────────┐
│ Packet arrives                          │
│  ↓                                      │
│ NIC raises IRQ                          │
│  ↓                                      │
│ Driver IRQ handler:                     │
│  • Disable NIC interrupts               │
│  • Schedule NAPI poll (add to poll_list)│
│  • Return (very fast!)                  │
│  ↓                                      │
│ Softirq NET_RX_SOFTIRQ triggers         │
│  ↓                                      │
│ net_rx_action():                        │
│  • Call driver->poll()                  │
│  • Process up to budget packets (64)    │
│  • If ring empty: re-enable IRQ         │
└─────────────────────────────────────────┘

Phase 2: High Traffic (Polling Mode)
┌─────────────────────────────────────────┐
│ poll() processes 64 packets             │
│  ↓                                      │
│ More packets in ring buffer?            │
│  • YES: Stay in polling mode            │
│  • Keep calling poll() until ring empty │
│  • No interrupts needed!                │
│  ↓                                      │
│ Eventually ring empties                 │
│  • Re-enable NIC interrupts             │
│  • Wait for next packet                 │
└─────────────────────────────────────────┘
Kernel Code (simplified from net/core/dev.c):
// Driver's NAPI poll function
static int my_driver_poll(struct napi_struct *napi, int budget) {
    struct my_adapter *adapter = container_of(napi, struct my_adapter, napi);
    int work_done = 0;

    while (work_done < budget) {
        // Check if ring buffer has packets
        if (ring_buffer_empty(adapter))
            break;

        // Fetch packet from ring buffer
        struct sk_buff *skb = fetch_packet_from_ring(adapter);

        // Set metadata
        skb->dev = adapter->netdev;
        skb->protocol = eth_type_trans(skb, adapter->netdev);

        // Pass to network stack
        netif_receive_skb(skb);

        work_done++;
    }

    // If we processed less than budget, ring is empty
    if (work_done < budget) {
        napi_complete(napi);  // Exit polling mode
        enable_nic_interrupts(adapter);  // Re-enable NIC interrupts (device-specific)
    }

    return work_done;
}

// IRQ handler (top half)
static irqreturn_t my_driver_irq_handler(int irq, void *data) {
    struct my_adapter *adapter = data;

    // Disable NIC interrupts
    disable_nic_interrupts(adapter);

    // Schedule NAPI polling
    napi_schedule(&adapter->napi);

    return IRQ_HANDLED;
}

// Network core: softirq handler
static void net_rx_action(struct softirq_action *h) {
    struct list_head *poll_list = this_cpu_ptr(&softnet_data.poll_list);
    int budget = netdev_budget;  // Default: 300
    unsigned long time_limit = jiffies + usecs_to_jiffies(netdev_budget_usecs);

    while (!list_empty(poll_list)) {
        struct napi_struct *napi = list_first_entry(poll_list, ...);

        int work = napi->poll(napi, budget);
        budget -= work;

        if (budget <= 0 || time_after(jiffies, time_limit))
            break;  // Yield CPU
    }
}
Benefits:
  1. Low latency under low load: Interrupts still used
  2. High throughput under high load: Polling avoids interrupt overhead
  3. Fairness: Budget limits per-device processing
  4. CPU efficiency: No interrupt storm

3.3 Receive Packet Steering (RPS/RFS)

Problem: Single NIC queue → all packets processed on one CPU core

Solution: Distribute packet processing across multiple CPUs

RSS (Hardware)

Receive Side Scaling
  • NIC has multiple RX queues
  • NIC hashes packet (IP + port)
  • Distributes to different queues
  • Each queue has own IRQ → CPU core
NIC
┌────────────────────┐
│  Hash(pkt) % 4     │
│   ↓   ↓   ↓   ↓    │
│  Q0  Q1  Q2  Q3    │
└──┼───┼───┼───┼────┘
   │   │   │   └──IRQ3→ CPU3
   │   │   └──────IRQ2→ CPU2
   │   └──────────IRQ1→ CPU1
   └──────────────IRQ0→ CPU0
Pros: Hardware acceleration
Cons: Requires multi-queue NIC

RPS (Software)

Receive Packet Steering
  • Software-based RSS
  • CPU that receives IRQ hashes packet
  • Enqueues to target CPU’s backlog
  • Target CPU processes packet
CPU0 (IRQ handler)
  │ Receive packet
  │ Hash → CPU2

  Enqueue to CPU2 backlog

      CPU2 processes packet
Pros: Works with single-queue NIC
Cons: Extra CPU for steering
RFS (Receive Flow Steering): Extension of RPS
Goal: Process packet on CPU where application is running
Benefit: CPU cache locality (hot cache = faster processing)

Flow:
1. Application recv() on CPU3
   └→ Kernel records: Flow X → CPU3

2. Packet for Flow X arrives on CPU0
   └→ RPS checks flow table
   └→ Steers to CPU3 (where app is!)

Result: Packet data already in CPU3's cache when app reads it
Configuration:
# Enable RPS (software steering)
echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus  # Use CPUs 0-3

# Set RPS flow entries
echo 4096 > /proc/sys/net/core/rps_sock_flow_entries

# Set per-queue flow entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

4. XDP: eXpress Data Path

XDP allows running eBPF programs before sk_buff allocation, at the earliest possible point in packet processing.

4.1 XDP Architecture

Packet Flow with XDP:
─────────────────────

Without XDP:
┌───────┐   ┌──────────┐   ┌──────┐   ┌─────┐   ┌──────┐
│  NIC  │ → │ allocate │ → │  IP  │ → │ TCP │ → │ App  │
│  DMA  │   │  sk_buff │   │layer │   │layer│   │      │
└───────┘   └──────────┘   └──────┘   └─────┘   └──────┘
  ~500ns      ~200ns         ~100ns     ~100ns     user

With XDP:
┌───────┐   ┌──────────┐
│  NIC  │ → │   XDP    │ ──→ [XDP_DROP] (discard, fastest)
│  DMA  │   │  eBPF    │ ──→ [XDP_TX] (bounce back)
└───────┘   │ program  │ ──→ [XDP_REDIRECT] (other NIC/CPU)
  ~500ns    └──────────┘ ──→ [XDP_PASS] (continue to stack)
             ~100ns              ↓
                          ┌──────────┐
                          │ allocate │
                          │  sk_buff │
                          └──────────┘

                           Normal stack...

4.2 XDP Program Example

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>   // SEC() macro
#include <bpf/bpf_endian.h>    // bpf_htons() / bpf_htonl()

// Drop all packets from specific IP
SEC("xdp")
int xdp_drop_ip(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Bounds check (required by BPF verifier)
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // Only process IP packets
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // Drop packets from 192.168.1.100
    __u32 blocked_ip = bpf_htonl(0xC0A80164);  // 192.168.1.100, converted to network byte order
    if (ip->saddr == blocked_ip) {
        return XDP_DROP;  // Discard immediately!
    }

    return XDP_PASS;  // Continue to network stack
}

char _license[] SEC("license") = "GPL";
Compile and Load:
# Compile XDP program
clang -O2 -target bpf -c xdp_drop.c -o xdp_drop.o

# Load into kernel
ip link set dev eth0 xdp obj xdp_drop.o sec xdp

# Verify
ip link show eth0
# ... xdp ...

# Remove XDP program
ip link set dev eth0 xdp off

4.3 XDP Actions

// Fastest packet drop (DDoS mitigation)
if (is_attack_packet(ctx)) {
    return XDP_DROP;  // ~10 million pps possible
}
Use Cases:
  • DDoS mitigation (drop attack traffic before stack)
  • Invalid packet filtering
  • Rate limiting at wire speed
Performance: ~50ns per packet (vs ~10µs for iptables DROP)
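
As a sketch of the monitoring use case, the following hypothetical program counts packets per source IPv4 address in a BPF map that user space can read later (map name, sizes, and the libbpf-style map definition are illustrative assumptions):
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);     // source IPv4 address
    __type(value, __u64);   // packet count
} pkt_counts SEC(".maps");

SEC("xdp")
int xdp_count_sources(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    __u32 saddr = ip->saddr;
    __u64 one = 1;
    __u64 *count = bpf_map_lookup_elem(&pkt_counts, &saddr);
    if (count)
        __sync_fetch_and_add(count, 1);                        // atomic increment
    else
        bpf_map_update_elem(&pkt_counts, &saddr, &one, BPF_ANY);

    return XDP_PASS;  // monitoring only; a rate limiter would return XDP_DROP here
}

char _license[] SEC("license") = "GPL";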

4.4 AF_XDP: Zero-Copy to User Space

AF_XDP allows user-space programs to receive packets directly from NIC DMA buffer (bypassing kernel stack entirely).
#include <linux/if_xdp.h>
#include <bpf/xsk.h>

#define BATCH_SIZE 64   // descriptors handled per loop iteration

struct xsk_socket_info {
    struct xsk_ring_cons rx;    // RX ring (kernel → user)
    struct xsk_ring_prod tx;    // TX ring (user → kernel)
    struct xsk_ring_prod fq;    // Fill queue (user provides buffers)
    struct xsk_ring_cons cq;    // Completion queue (kernel returns buffers)
    struct xsk_socket *xsk;
};

int main() {
    // Create XDP socket; create_xsk_socket() and allocate_buffer() below are
    // illustrative helpers (UMEM registration and socket setup omitted)
    struct xsk_socket_info *xsk = create_xsk_socket("eth0", 0);

    // Main loop
    while (1) {
        // Fill queue with empty buffers
        unsigned int idx_fq;
        if (xsk_ring_prod__reserve(&xsk->fq, BATCH_SIZE, &idx_fq) == BATCH_SIZE) {
            for (int i = 0; i < BATCH_SIZE; i++) {
                *xsk_ring_prod__fill_addr(&xsk->fq, idx_fq++) = allocate_buffer();
            }
            xsk_ring_prod__submit(&xsk->fq, BATCH_SIZE);
        }

        // Receive packets
        unsigned int idx_rx;
        unsigned int rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
        for (int i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);

            // Access packet data in the mapped UMEM area (zero-copy!);
            // 'umem' is the UMEM buffer region set up in create_xsk_socket()
            void *pkt = xsk_umem__get_data(umem, desc->addr);
            unsigned int len = desc->len;

            // Process packet...
            process_packet(pkt, len);
        }
        xsk_ring_cons__release(&xsk->rx, rcvd);
    }

    return 0;
}
Performance: 20+ million packets per second per core (vs ~1-2M with regular sockets)

5. The TCP/IP Stack

5.1 IP Layer Processing

// Simplified from net/ipv4/ip_input.c

int ip_rcv(struct sk_buff *skb, struct net_device *dev) {
    struct iphdr *iph;

    // 1. Validate packet
    if (skb->len < sizeof(struct iphdr))
        goto drop;

    iph = ip_hdr(skb);

    // 2. Checksum verification (if not offloaded to NIC)
    if (ip_fast_csum((u8 *)iph, iph->ihl) != 0)
        goto csum_error;

    // 3. Validate header
    if (iph->ihl < 5 || iph->version != 4)
        goto drop;

    // 4. Netfilter hook: PREROUTING
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
                   ip_rcv_finish);

csum_error:
drop:
    kfree_skb(skb);
    return NET_RX_DROP;
}

static int ip_rcv_finish(struct sk_buff *skb) {
    struct iphdr *iph = ip_hdr(skb);

    // 5. Route lookup (where should this packet go?)
    if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, skb->dev) != 0) {
        kfree_skb(skb);
        return NET_RX_DROP;
    }

    // 6. Deliver locally or forward
    return dst_input(skb);  // Calls ip_local_deliver() or ip_forward()
}

int ip_local_deliver(struct sk_buff *skb) {
    struct iphdr *iph = ip_hdr(skb);

    // 7. IP fragmentation reassembly
    if (iph->frag_off & htons(IP_MF | IP_OFFSET)) {
        skb = ip_defrag(skb);
        if (!skb)
            return 0;
    }

    // 8. Netfilter hook: LOCAL_IN
    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                   ip_local_deliver_finish);
}

static int ip_local_deliver_finish(struct sk_buff *skb) {
    struct iphdr *iph = ip_hdr(skb);

    // 9. Demultiplex to transport layer
    int protocol = iph->protocol;

    switch (protocol) {
        case IPPROTO_TCP:
            tcp_v4_rcv(skb);
            break;
        case IPPROTO_UDP:
            udp_rcv(skb);
            break;
        case IPPROTO_ICMP:
            icmp_rcv(skb);
            break;
        default:
            // Unknown protocol
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PROT_UNREACH, 0);
            kfree_skb(skb);
    }

    return 0;
}
Routing Table Lookup (FIB - Forwarding Information Base):
Route lookup for destination IP:

1. Check routing cache (fast path)
   └→ Cache hit? Return cached route

2. FIB lookup (trie-based structure)
   ┌─────────────────────────────┐
   │ Longest Prefix Match (LPM)  │
   │                             │
   │ 192.168.1.0/24 → eth0       │
   │ 10.0.0.0/8 → tun0           │
   │ 0.0.0.0/0 → eth0 (default)  │
   └─────────────────────────────┘

3. Policy routing (multiple routing tables)
   └→ Check fib rules, select table

4. Result: dst_entry structure
   ┌──────────────────────────┐
   │ output_dev = eth0        │
   │ gateway = 192.168.1.1    │
   │ output_fn = ip_finish()  │
   └──────────────────────────┘
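
A user-space way to watch the FIB at work (a hedged sketch: connect() on a UDP socket makes the kernel run its route lookup without sending a packet, and getsockname() then reveals the source address it selected):
#include <arpa/inet.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(53) };
    inet_pton(AF_INET, "8.8.8.8", &dst.sin_addr);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    connect(fd, (struct sockaddr *)&dst, sizeof(dst));   // triggers the FIB lookup

    struct sockaddr_in src;
    socklen_t len = sizeof(src);
    getsockname(fd, (struct sockaddr *)&src, &len);      // source chosen by the route

    char buf[INET_ADDRSTRLEN];
    printf("route to 8.8.8.8 sources from %s\n",
           inet_ntop(AF_INET, &src.sin_addr, buf, sizeof(buf)));
    close(fd);
    return 0;
}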

5.2 TCP Layer: The Fast Path

TCP processing has two paths:

Fast Path

Conditions:
  • In-order segment
  • No flags (except ACK)
  • Window not full
  • No urgent data
  • Checksum OK
Processing:
// Simplified
if (tcp_fast_path_check(sk, skb)) {
    // Fast path!
    memcpy(user_buffer, skb->data, skb->len);
    tcp_send_ack(sk);
    sk->sk_data_ready(sk);     // wake up the waiting reader
    return;
}
Performance: ~1µs per packet

Slow Path

Triggers:
  • Out-of-order segment
  • Retransmission
  • Window probing
  • Options (SACK, timestamps)
  • Connection management (SYN, FIN)
Processing:
// Complex state machine
tcp_validate_incoming(sk, skb);
tcp_ack(sk, skb);  // Process ACK
tcp_data_queue(sk, skb);  // Queue OOO
tcp_send_delayed_ack(sk);
// ... many more checks ...
Performance: ~5-10µs per packet
TCP Receive Processing (simplified from net/ipv4/tcp_input.c):
int tcp_v4_rcv(struct sk_buff *skb) {
    struct tcphdr *th;
    struct sock *sk;

    // 1. Get TCP header
    th = tcp_hdr(skb);

    // 2. Find socket (demultiplexing)
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
    if (!sk)
        goto no_tcp_socket;  // Send RST

    // 3. Process TCP state machine
    tcp_v4_do_rcv(sk, skb);

    return 0;
}

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) {
    // Check if established connection
    if (sk->sk_state == TCP_ESTABLISHED) {
        // Try fast path
        if (tcp_rcv_established(sk, skb) == 0)
            return 0;
    }

    // Slow path (connection management)
    return tcp_rcv_state_process(sk, skb);
}

int tcp_rcv_established(struct sock *sk, struct sk_buff *skb) {
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcphdr *th = tcp_hdr(skb);

    // FAST PATH CHECK
    if (tp->rcv_nxt == ntohl(th->seq) &&  // In-order
        tp->rcv_wnd &&                     // Window open
        !th->syn && !th->fin && !th->rst)  // No special flags
    {
        int len = skb->len - th->doff * 4;

        // Strip the TCP header and queue the payload on the socket receive
        // buffer (the real code may copy straight to a waiting reader)
        __skb_pull(skb, th->doff * 4);
        __skb_queue_tail(&sk->sk_receive_queue, skb);
        tp->rcv_nxt += len;

        // Send ACK
        tcp_send_ack(sk);

        // Wake up waiting process
        sk->sk_data_ready(sk);

        return 0;  // Fast path success!
    }

    // Fall through to slow path...
    return tcp_slow_path(sk, skb);
}

5.3 TCP Congestion Control

Linux supports pluggable congestion control algorithms:
// Simplified CUBIC algorithm

// Congestion window growth function
W(t) = C * (t - K)³ + W_max

Where:
  t = Time since last congestion event
  K = Cube root of (W_max × (1 − β) / C)
  W_max = Window size before congestion
  C = Scaling constant
  β = Multiplicative decrease factor (0.7)

Behavior:
  Slow start → Exponential growth
  Congestion avoidance → Cubic growth
  Loss detected → W = W * β (reduce 30%)
  Recovery → Cubic growth toward W_max

┌─────────────────────────────────────┐
│  Congestion Window                  │
│      ↑                              │
│ W_max│     ╱╲                       │
│      │    ╱  ╲   ← Cubic curve      │
│      │   ╱    ╲╱                    │
│      │  ╱                           │
│      │ ╱                            │
│      └──────────────────→ Time      │
│         ↑ Loss                      │
└─────────────────────────────────────┘
Pros: Good for high-bandwidth networks
Cons: Can be aggressive on lossy links
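
A user-space sketch of the growth function above (illustrative only, not the kernel's fixed-point implementation in net/ipv4/tcp_cubic.c); C = 0.4 and β = 0.7 match the values used in the text:
#include <math.h>
#include <stdio.h>

#define CUBIC_C    0.4   // scaling constant
#define CUBIC_BETA 0.7   // fraction of the window kept after a loss

// Congestion window (in segments) t seconds after the last loss event
static double cubic_cwnd(double t, double w_max) {
    double k = cbrt(w_max * (1.0 - CUBIC_BETA) / CUBIC_C);
    return CUBIC_C * pow(t - k, 3) + w_max;
}

int main(void) {
    double w_max = 100.0;  // window size before the loss
    for (double t = 0; t <= 10.0; t += 2.0)
        printf("t=%4.1fs  cwnd=%6.1f\n", t, cubic_cwnd(t, w_max));
    return 0;   // build with: cc cubic.c -lm
}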

6. Socket Layer & System Calls

6.1 Socket Creation

// User space
int sockfd = socket(AF_INET, SOCK_STREAM, 0);

// Kernel: net/socket.c
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol) {
    // 1. Allocate socket structure
    struct socket *sock = sock_alloc();

    // 2. Create protocol-specific socket
    // For TCP: calls inet_create() → tcp_prot.init()
    sock->ops = &inet_stream_ops;  // TCP operations

    // 3. Allocate struct sock (protocol control block)
    struct sock *sk = sk_alloc(family, GFP_KERNEL, &tcp_prot);
    sock->sk = sk;

    // 4. Initialize TCP-specific state
    tcp_init_sock(sk);

    // 5. Create file descriptor
    int fd = sock_map_fd(sock, O_CLOEXEC);

    return fd;
}
Data Structures:
File Descriptor Layer:
┌──────────────────┐
│ struct file      │  (Generic file operations)
│  ├─ f_op         │  → socket_file_ops
│  └─ private_data │  → points to struct socket
└────────┬─────────┘

Socket Layer:
┌──────────────────┐
│ struct socket    │  (BSD socket API)
│  ├─ type         │  (SOCK_STREAM, SOCK_DGRAM)
│  ├─ ops          │  → inet_stream_ops (send, recv, bind, etc.)
│  └─ sk           │  → points to struct sock
└────────┬─────────┘

Protocol Layer:
┌──────────────────┐
│ struct sock      │  (Protocol control block)
│  ├─ sk_state     │  (TCP_ESTABLISHED, TCP_LISTEN, etc.)
│  ├─ sk_prot      │  → tcp_prot (protocol operations)
│  ├─ sk_receive_queue   │  (Received data)
│  ├─ sk_write_queue     │  (Data to send)
│  └─ ...          │
└────────┬─────────┘

Protocol-Specific:
┌──────────────────┐
│ struct tcp_sock  │  (TCP-specific state)
│  ├─ rcv_nxt      │  (Next expected sequence)
│  ├─ snd_una      │  (Oldest unacked byte)
│  ├─ srtt         │  (Smoothed RTT)
│  ├─ snd_cwnd     │  (Congestion window)
│  └─ ...          │
└──────────────────┘
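
From user space, much of this per-connection state is visible through the TCP_INFO socket option; a minimal sketch (assumes fd is a connected TCP socket):
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

// Print a few fields of struct tcp_info, which mirrors parts of struct tcp_sock
static void dump_tcp_state(int fd) {
    struct tcp_info info;
    socklen_t len = sizeof(info);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
        printf("state=%u rtt=%u us cwnd=%u retrans=%u\n",
               info.tcpi_state, info.tcpi_rtt,
               info.tcpi_snd_cwnd, info.tcpi_total_retrans);
}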

6.2 send() and recv() Internals

// User space
ssize_t n = send(sockfd, buffer, length, flags);

// Kernel: net/socket.c → tcp_sendmsg() (heavily simplified)
int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) {
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    int flags = msg->msg_flags;
    int mss_now = tcp_current_mss(sk);
    long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
    int copied = 0;

    // 1. Check socket state
    if (sk->sk_state != TCP_ESTABLISHED)
        return -ENOTCONN;

    // 2. Wait for send buffer space if needed
    while (size > 0) {
        // Check if send buffer full
        if (sk_stream_memory_free(sk) <= 0) {
            if (flags & MSG_DONTWAIT)
                return -EAGAIN;

            // Block waiting for buffer space
            sk_stream_wait_memory(sk, &timeo);
        }

        // 3. Allocate sk_buff
        skb = sk_stream_alloc_skb(sk, min(size, mss_now), GFP_KERNEL);

        // 4. Copy data from user space
        int copy = min_t(int, size, skb_availroom(skb));
        if (skb_add_data_nocache(sk, skb, &msg->msg_iter, copy))
            return -EFAULT;  // fault while copying from user space

        // 5. Add to write queue
        skb_entail(sk, skb);

        copied += copy;
        size -= copy;

        // 6. Push data if:
        // - PSH flag set
        // - Queue getting large
        // - No more data
        if ((flags & MSG_MORE) == 0 || size == 0)
            tcp_push(sk, flags, mss_now, TCP_NAGLE_PUSH);
    }

    return copied;
}

// 7. Actually transmit
void tcp_push(struct sock *sk, int flags, int mss_now, int nonagle) {
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;

    while ((skb = tcp_send_head(sk)) != NULL) {
        // Check congestion window
        if (tcp_snd_wnd_test(tp, skb, mss_now) &&
            tcp_cwnd_test(tp, skb)) {

            // Transmit!
            tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
        } else {
            break;  // Window full, wait for ACK
        }
    }
}

6.3 Zero-Copy Techniques

sendfile()

// Traditional copy (4 copies!)
fd_in = open("file.txt", O_RDONLY);
fd_out = socket(...);

char buf[4096];
while ((n = read(fd_in, buf, sizeof(buf))) > 0) {
    write(fd_out, buf, n);
}

// Copies:
// 1. DMA: Disk → Kernel buffer
// 2. CPU: Kernel buffer → User buffer (read)
// 3. CPU: User buffer → Socket buffer (write)
// 4. DMA: Socket buffer → NIC

// Zero-copy sendfile()
sendfile(fd_out, fd_in, NULL, file_size);

// Copies:
// 1. DMA: Disk → Kernel buffer
// 2. DMA: Kernel buffer → NIC (if supported)
// Or just 1 CPU copy if DMA-to-DMA not available

// Kernel implementation
ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
                    size_t count) {
    // Use splice internally
    // Transfers pages directly to socket
    return splice_direct_to_actor(in_file, &sd,
                                  direct_splice_actor);
}
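
A hedged usage sketch of serving a whole file this way (client_fd is assumed to be an accepted TCP connection; error handling abbreviated):
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

static int serve_file(int client_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    while (offset < st.st_size) {
        // Kernel moves file pages to the socket; no user-space buffer involved
        ssize_t sent = sendfile(client_fd, file_fd, &offset, st.st_size - offset);
        if (sent <= 0)
            break;  // error or peer closed the connection
    }

    close(file_fd);
    return offset == st.st_size ? 0 : -1;
}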

MSG_ZEROCOPY

// Modern zero-copy send
int fd = socket(AF_INET, SOCK_STREAM, 0);

// Enable zero-copy
int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY,
           &one, sizeof(one));

// Send with zero-copy flag
char buf[65536];
send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

// Kernel behavior:
// - Increments refcount on user pages
// - Passes page pointers to NIC
// - User buffer MUST NOT be modified until...

// Wait for completion notification
struct msghdr msg = {};
struct sock_extended_err *serr;
char control[100];

msg.msg_control = control;
msg.msg_controllen = sizeof(control);

recvmsg(fd, &msg, MSG_ERRQUEUE);
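
// Hedged sketch: parse the completion. For MSG_ZEROCOPY the error-queue
// message carries a sock_extended_err (from <linux/errqueue.h>) with
// ee_origin == SO_EE_ORIGIN_ZEROCOPY; ee_info..ee_data is the range of
// zerocopy send() calls whose buffers the kernel has released.
struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
if (cm) {
    serr = (struct sock_extended_err *)CMSG_DATA(cm);
    if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
        printf("completed zerocopy sends %u..%u\n", serr->ee_info, serr->ee_data);
}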

// Now safe to reuse buffer

// Benefits:
// - No CPU copy of payload data
// - Reduced cache pollution
// - Higher throughput

// Caveats:
// - Only beneficial for large sends (>10KB)
// - Buffer must stay valid until ACK
// - More complex error handling

7. Netfilter & Packet Filtering

7.1 Netfilter Hook Points

Packet Flow through Netfilter:
───────────────────────────────

                    Incoming Packet

                    ┌──────────┐
                    │   NIC    │
                    └─────┬────┘

           ┌──────────────────────────┐
           │ 1. NF_INET_PRE_ROUTING   │
           │    (raw, mangle, nat)    │
           └─────────┬────────────────┘
                     ↓
              Routing Decision
                 ↙        ↘
          Local             Forward
            ↓                  ↓
  ┌────────────────┐  ┌────────────────┐
  │ 2. NF_INET_    │  │ 3. NF_INET_    │
  │    LOCAL_IN    │  │    FORWARD     │
  │ (mangle,filter)│  │ (mangle,filter)│
  └───────┬────────┘  └────────┬───────┘
          ↓                    │
    Local Process              │
          ↓                    │
  ┌────────────────┐           │
  │ 4. NF_INET_    │           │
  │    LOCAL_OUT   │           │
  │ (raw, mangle,  │           │
  │  nat, filter)  │           │
  └───────┬────────┘           │
          ↓                    │
    Routing Decision           │
          ↓                    ↓
  ┌─────────────────────────────────┐
  │ 5. NF_INET_POST_ROUTING         │
  │    (mangle, nat)                │
  └───────────────┬─────────────────┘
                  ↓
            ┌──────────┐
            │   NIC    │
            └──────────┘
                  ↓
            Outgoing Packet
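
A minimal kernel-module sketch of attaching code at one of these hooks (assumes the nf_register_net_hook() API of kernels >= 4.13; the drop-ICMP policy is purely illustrative):
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <net/net_namespace.h>

static unsigned int drop_icmp_hook(void *priv, struct sk_buff *skb,
                                   const struct nf_hook_state *state)
{
    struct iphdr *iph = ip_hdr(skb);

    if (iph->protocol == IPPROTO_ICMP)
        return NF_DROP;      // silently drop ICMP
    return NF_ACCEPT;        // everything else continues through the stack
}

static struct nf_hook_ops icmp_ops = {
    .hook     = drop_icmp_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,   // hook 1 in the diagram above
    .priority = NF_IP_PRI_FIRST,
};

static int __init hook_init(void)
{
    return nf_register_net_hook(&init_net, &icmp_ops);
}

static void __exit hook_exit(void)
{
    nf_unregister_net_hook(&init_net, &icmp_ops);
}

module_init(hook_init);
module_exit(hook_exit);
MODULE_LICENSE("GPL");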

7.2 Connection Tracking (conntrack)

// Conntrack tracks connection state

// TCP connection tracking states:
enum ip_conntrack_status {
    IPS_EXPECTED        = (1 << 0),  // Expected connection
    IPS_SEEN_REPLY      = (1 << 1),  // Seen reply direction
    IPS_ASSURED         = (1 << 2),  // Fully established
    IPS_CONFIRMED       = (1 << 3),  // In conntrack table
    IPS_SRC_NAT         = (1 << 4),  // Source NAT applied
    IPS_DST_NAT         = (1 << 5),  // Destination NAT applied
};

// Conntrack table entry
struct nf_conn {
    struct nf_conntrack ct_general;

    // Tuple: uniquely identifies connection
    struct nf_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX];
    // [0] = ORIGINAL direction
    // [1] = REPLY direction

    // Connection state
    unsigned long status;

    // Protocol-specific data
    union nf_conntrack_proto proto;

    // Timeout
    unsigned long timeout;
};

// Example: Track TCP connection
// Client 192.168.1.100:45000 → Server 8.8.8.8:80

// ORIGINAL tuple:
//   src: 192.168.1.100:45000
//   dst: 8.8.8.8:80
//   proto: TCP

// REPLY tuple:
//   src: 8.8.8.8:80
//   dst: 192.168.1.100:45000
//   proto: TCP

// This allows matching packets in both directions!
Performance Impact:
# View conntrack table
conntrack -L

# View statistics
conntrack -S
# cpu=0       found=1234 invalid=5 ignore=0 insert=567 ...

# Max connections
sysctl net.netfilter.nf_conntrack_max
# 65536

# Increase limit
sysctl -w net.netfilter.nf_conntrack_max=1048576

# Bypass conntrack for high-traffic flows (e.g., load balancer)
iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK

7.3 iptables Performance

# Rules are evaluated sequentially (O(n))
# Bad: 10,000 rules = slow!

iptables -A INPUT -s 1.2.3.4 -j DROP
iptables -A INPUT -s 1.2.3.5 -j DROP
# ... 9,998 more rules ...

# Better: Use ipset (hash table, O(1) lookup)
ipset create blocklist hash:ip
ipset add blocklist 1.2.3.4
ipset add blocklist 1.2.3.5
# ... add thousands ...

iptables -A INPUT -m set --match-set blocklist src -j DROP
# One rule, fast hash lookup!

# Modern alternative: nftables (native sets, no ipset needed)
nft add table inet filter
nft add chain inet filter input { type filter hook input priority 0\; }
nft add set inet filter blocklist { type ipv4_addr\; }
nft add element inet filter blocklist { 1.2.3.4, 1.2.3.5 }
nft add rule inet filter input ip saddr @blocklist drop

# nftables evaluates rules in a small bytecode VM (similar in spirit to BPF)
# and scales much better for large rule sets

8. Network Buffers & Memory Management

8.1 Socket Buffers

// Each socket has send and receive buffers

// View socket buffer sizes
getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);

// Set larger buffers (important for high-BDP networks)
int buf_size = 4 * 1024 * 1024;  // 4 MB
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &buf_size, sizeof(buf_size));
setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf_size, sizeof(buf_size));

// System-wide defaults
sysctl net.core.rmem_default  # Default receive buffer
sysctl net.core.wmem_default  # Default send buffer
sysctl net.core.rmem_max      # Max receive buffer
sysctl net.core.wmem_max      # Max send buffer

// TCP-specific (min, default, max)
sysctl net.ipv4.tcp_rmem  # 4096 131072 6291456
sysctl net.ipv4.tcp_wmem  # 4096 16384 4194304
Buffer Sizing for High Bandwidth-Delay Product:
BDP = Bandwidth × RTT

Example: 10 Gbps link, 100ms RTT
BDP = (10 × 10⁹ bits/sec) × (0.1 sec) / 8 bits/byte
    = 125 MB

Buffer should be >= BDP to fully utilize link!

# Configure
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"  # 128 MB max
sysctl -w net.core.rmem_max=134217728
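
A quick helper mirroring the BDP arithmetic above (values illustrative):
#include <stdio.h>

int main(void) {
    double bandwidth_bps = 10e9;   // 10 Gbit/s
    double rtt_s = 0.1;            // 100 ms
    double bdp_bytes = bandwidth_bps * rtt_s / 8.0;

    printf("BDP = %.0f MB -> socket buffers should be at least this large\n",
           bdp_bytes / 1e6);
    return 0;
}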

8.2 TCP Autotuning

// Modern Linux auto-tunes TCP buffers (default: enabled)

// Kernel dynamically adjusts buffer size based on:
// 1. Measured RTT
// 2. Receive rate
// 3. Available memory

// Algorithm (simplified from tcp_input.c):
void tcp_rcv_space_adjust(struct sock *sk) {
    struct tcp_sock *tp = tcp_sk(sk);
    int time, space;

    // Measure receive rate
    time = tcp_stamp_us_delta(tp->tcp_mstamp, tp->rcvq_space.time);
    space = 2 * (tp->copied_seq - tp->rcvq_space.seq);

    if (time > 0) {
        int rcvbuf = tp->rcvq_space.space;
        int new_rcvbuf = space / time;  // Bytes per unit time

        // Increase buffer if receiving faster
        if (new_rcvbuf > rcvbuf) {
            new_rcvbuf = min(new_rcvbuf, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
            sk->sk_rcvbuf = min(new_rcvbuf, sysctl_rmem_max);
        }
    }
}

// Disable autotuning (if you want manual control)
echo 0 > /proc/sys/net/ipv4/tcp_moderate_rcvbuf

9. Performance Monitoring & Debugging

9.1 Essential Tools

# View all TCP sockets
ss -tan

# Show TCP info (congestion control, RTT, etc.)
ss -ti

# Example output:
# ESTAB 0 0    192.168.1.100:45000  8.8.8.8:443
#  cubic wscale:7,7 rto:204 rtt:3.5/2 ato:40 mss:1448
#  cwnd:10 bytes_acked:12345 segs_out:100 segs_in:95

# Filter by state
ss -tan state established

# Show processes
ss -tap

# Watch in real-time
watch -n1 'ss -ti dst 8.8.8.8'

9.2 Tracing with BPF

# Trace TCP retransmissions with bpftrace

# tcp_retransmit.bt
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $inet_sock = (struct inet_sock *)$sk;

    printf("TCP retransmit: %s:%d -> %s:%d\n",
           ntop(AF_INET, $inet_sock->inet_saddr),
           $inet_sock->inet_sport,
           ntop(AF_INET, $inet_sock->inet_daddr),
           $inet_sock->inet_dport);
}
'

# Trace packet drops
bpftrace -e '
tracepoint:skb:kfree_skb {
    @drops[args->location] = count();
}

interval:s:5 {
    print(@drops);
    clear(@drops);
}
'

10. Interview Questions & Answers

Q: What is an sk_buff, and why do headroom and tailroom matter?

sk_buff: The fundamental packet data structure in Linux networking.

Memory Layout:
┌─────────────────────────────────────────────┐
│ head  data          tail  end              │
│  ↓     ↓             ↓     ↓                │
│  ├─────┼─────────────┼─────┤                │
│  │ HR  │ Valid Data  │ TR  │                │
│  └─────┴─────────────┴─────┘                │
│<-headroom-> <-len-> <-tailroom->             │
└─────────────────────────────────────────────┘
Why Headroom Matters: As a packet moves down the network stack (from app to wire), each layer adds a header:
  • Application data
  • +20 bytes TCP header (skb_push)
  • +20 bytes IP header (skb_push)
  • +14 bytes Ethernet header (skb_push)
Without headroom, each layer would need to reallocate and copy the entire packet. With headroom, we just move the data pointer backwards.

Why Tailroom Matters:
  • For adding trailers (less common)
  • For TSO/GSO (TCP Segmentation Offload): Kernel builds large packets, NIC splits them
Performance Impact: Zero-copy header addition vs expensive realloc/memcpy.

Q: What problem does NAPI solve, and how does it work?

Problem with Old Interrupt Model:
  • Each packet → hardware interrupt
  • At 1 Gbps (1.5M packets/sec), CPU spends 100% time handling interrupts
  • This is “interrupt storm” or “receive livelock”
NAPI Solution (New API):

Low Traffic (interrupt mode):
  1. Packet arrives → IRQ
  2. Driver disables NIC interrupts
  3. Schedules NAPI poll
  4. Returns immediately from IRQ
High Traffic (polling mode):
  5. Softirq calls driver’s poll() function
  6. Poll processes up to budget packets (default 64)
  7. If more packets remain, stay in polling mode
  8. If ring buffer empty, re-enable interrupts
Benefits:
  • Low latency (low load): Interrupts still used
  • High throughput (high load): Polling avoids interrupt overhead
  • Fairness: Budget prevents one NIC from starving others
  • Adaptive: Automatically switches modes
Key Insight: Interrupts tell us “work is available”, then we switch to polling to batch the work.

Q: What is XDP and why is it so fast?

XDP (eXpress Data Path): Runs eBPF programs at the earliest possible point in packet processing.

Traditional Path:
NIC → DMA → Driver → Allocate sk_buff → Protocol Stack → ...
     ~500ns           ~200ns + overhead

XDP Path:
NIC → DMA → Driver → XDP Program (eBPF) → Decision
     ~500ns          ~100ns
Why So Fast:
  1. No sk_buff allocation: Operating directly on DMA buffer
  2. No cache misses: Data still in L1 cache from DMA
  3. No context switches: Runs in softirq context
  4. Early drop: Can discard packets before any processing
  5. JIT compiled: eBPF → native machine code
Actions:
  • XDP_DROP: Discard (DDoS mitigation at 10M+ pps)
  • XDP_TX: Bounce back same interface (L2 load balancer)
  • XDP_REDIRECT: Send to different NIC or CPU
  • XDP_PASS: Continue to normal stack
Use Cases:
  • DDoS mitigation
  • Load balancing
  • Packet filtering
  • Network monitoring
Limitation: Modifying packet headers is more manual than in the full stack (checksums must be recomputed by the program).

Q: What is the difference between the TCP fast path and the slow path?

Fast Path (common case optimization):

Conditions:
  • TCP connection is ESTABLISHED
  • Packet arrives in-order (seq == rcv_nxt)
  • No flags except ACK
  • Receive window not full
  • No urgent data
  • Checksum valid
Processing:
if (fast_path_conditions) {
    memcpy(user_buffer, packet_data, len);  // Direct copy
    rcv_nxt += len;
    send_ack();
    wake_up_application();
    return;  // Done in ~1µs!
}
Slow Path (handles complex cases):

Triggers:
  • Out-of-order segment (requires reassembly)
  • Retransmission (update RTO, congestion window)
  • Connection management (SYN, FIN, RST)
  • Options processing (SACK, timestamps, window scaling)
  • Zero window probing
Processing:
// Complex state machine
validate_sequence_numbers();
check_for_duplicate_acks();
update_rtt_estimates();
process_sack_blocks();
reorder_out_of_order_segments();
update_congestion_window();
// ... many more checks ...
// ~5-10µs
Impact: Fast path handles 90%+ of packets in established bulk-data transfers. Slow path ensures correctness for edge cases.

Optimization: Keep connections in fast path by:
  • Avoiding packet loss (good network)
  • Using large enough buffers (avoid window full)
  • Minimizing out-of-order delivery (good QoS)

Q: How does Linux spread packet processing across multiple CPUs (RSS, RPS, RFS)?

Problem: Single NIC queue → all packets processed on one CPU → bottleneck

Solution 1: RSS (Receive Side Scaling) - Hardware
  • NIC has multiple RX queues (e.g., 8 queues)
  • NIC computes hash: hash(src_ip, dst_ip, src_port, dst_port) % num_queues
  • Each queue has dedicated IRQ mapped to specific CPU
  • Result: Packets distributed across CPUs in hardware
Pros: No CPU overhead, very fast
Cons: Requires multi-queue NIC

Solution 2: RPS (Receive Packet Steering) - Software
  • Single queue NIC
  • CPU receiving IRQ computes hash
  • Enqueues packet to target CPU’s backlog
  • Target CPU processes packet
Pros: Works with any NIC
Cons: Extra CPU overhead for steering

Solution 3: RFS (Receive Flow Steering) - Locality optimization
  • Extension of RPS
  • Tracks which CPU application is running on
  • Steers packets to that specific CPU
  • Result: Packet data in cache when application reads it
Example:
Without RFS:
  Packet → CPU0 (processes) → CPU2 (app blocked on recv)
  → Cache miss when app reads data

With RFS:
  Packet → CPU2 (processes + app runs here)
  → Data in cache, very fast!
Configuration:
# RSS (hardware, automatic if NIC supports)
ethtool -l eth0  # Show queue count

# RPS
echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus  # CPUs 0-3

# RFS
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

Q: What is connection tracking, and when can it become a bottleneck?

Connection Tracking: Kernel subsystem that tracks the state of all connections (TCP, UDP, ICMP).

Purpose:
  • Enable stateful firewall rules
  • NAT (must remember translations)
  • Connection-based filtering
How It Works:
Client initiates connection:
  192.168.1.100:45000 → 8.8.8.8:80

Conntrack creates entry:
  ORIGINAL: 192.168.1.100:45000 → 8.8.8.8:80 [NEW]
  REPLY:    8.8.8.8:80 → 192.168.1.100:45000

Subsequent packets match entry (both directions!)

States: NEW → ESTABLISHED → FIN_WAIT → CLOSE → destroyed
Performance Issues:
  1. Hash table lookup: O(1) but still overhead on every packet
  2. Global lock: (Older kernels) serializes all conntrack operations
  3. Memory: Each connection consumes memory (~300 bytes)
  4. Hash collisions: Degrade to O(n) lookup
Symptoms:
# Table full
dmesg | grep conntrack
# nf_conntrack: table full, dropping packet

# High CPU in conntrack
perf top
#   12.34%  [kernel]  [k] nf_conntrack_in
Solutions:
# 1. Increase table size
sysctl -w net.netfilter.nf_conntrack_max=1048576
sysctl -w net.netfilter.nf_conntrack_buckets=262144

# 2. Decrease timeout for short-lived connections
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

# 3. Bypass conntrack for specific traffic (e.g., load balancer)
iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK
# WARNING: Breaks stateful rules for this traffic!

# 4. Use connection-less alternatives
# - eBPF/XDP for filtering (bypasses conntrack entirely)
# - Stateless firewall rules where possible
When to bypass: High-traffic stateless services (load balancers, DNS servers, CDN edges).

Q: What zero-copy techniques does Linux offer for sending data?

Problem: Traditional send/receive involves multiple memory copies.

Traditional Path (4 copies!):
Sending a file:

1. read(file_fd, buffer, size)
   Disk → [Kernel buffer] → [User buffer]
        DMA copy          CPU copy

2. write(socket_fd, buffer, size)
   [User buffer] → [Socket buffer] → NIC
   CPU copy         DMA copy

Total: 2 DMA + 2 CPU copies
Zero-Copy Techniques:

1. sendfile():
sendfile(socket_fd, file_fd, NULL, file_size);

// Kernel path:
Disk → [Kernel buffer] ────→ NIC
      DMA copy        DMA copy (if NIC supports)
                   or page remapping

// Avoids user-space copies entirely!
// Best for static file serving (nginx, Apache)
2. splice():
// Move data between FDs via a pipe (zero-copy)
int pipe_fd[2];
pipe(pipe_fd);
splice(file_fd, NULL, pipe_fd[1], NULL, size, 0);
splice(pipe_fd[0], NULL, socket_fd, NULL, size, 0);

// Kernel manipulates page tables, no memcpy
3. MSG_ZEROCOPY:
send(socket_fd, buffer, size, MSG_ZEROCOPY);

// Kernel:
// 1. Pin user pages in memory (increment refcount)
// 2. NIC DMAs directly from user buffer
// 3. After NIC finishes, kernel notifies app via error queue
// 4. App can now reuse buffer

// Benefit: No copy for large sends
// Caveat: Buffer must stay valid, async notification needed
4. mmap() + write():
void *data = mmap(NULL, size, PROT_READ, MAP_SHARED, file_fd, 0);
write(socket_fd, data, size);

// May avoid one copy if kernel is smart
// But less efficient than sendfile()
Performance Impact:
Method          Copies   Use Case
Traditional     4        Small data, flexibility needed
sendfile()      1-2      Static file serving
splice()        0        Piping data between FDs
MSG_ZEROCOPY    0        Large sends (>10KB)
When to use:
  • sendfile(): Web server serving files
  • splice(): Proxy/gateway (socket → socket)
  • MSG_ZEROCOPY: Bulk data transfer, streaming

Q: Compare CUBIC and BBR congestion control. When would you choose each?

Problem: Send too fast → network congestion → packet loss. Send too slow → underutilize bandwidth.

Goal: Find the optimal sending rate (maximize throughput, minimize loss).

CUBIC (Linux default):

Algorithm:
  • Maintains congestion window (cwnd) = max packets in flight
  • On loss: cwnd = cwnd × β (reduce by 30%)
  • Recovery: Grow cwnd using cubic function
cwnd(t) = C × (t - K)³ + W_max

Where:
  W_max = window before loss
  K = time to reach W_max
  C = scaling constant
Behavior:
cwnd

  │     ╱╲
  │    ╱  ╲    ← Cubic growth
  │   ╱    ╲╱
  │  ╱
  │ ╱
  └──────────→ time
      ↑ loss (reduce 30%)
Pros:
  • Aggressive growth after loss (good for high-bandwidth links)
  • Fair to other CUBIC flows
  • Simple, well-tested
Cons:
  • Treats loss as congestion signal (wrong for wireless)
  • Can cause bufferbloat (fills queues)
  • Slow convergence on very high BDP links

BBR (Bottleneck Bandwidth and RTT):

Philosophy: Model the network, don’t react to loss.

Measures:
  • BtlBw (bottleneck bandwidth): Max delivery rate observed
  • RTprop (round-trip propagation time): Min RTT observed
Pacing Rate = BtlBw × gain

Phases:
  1. STARTUP: Exponential growth to find BtlBw (like slow start)
  2. DRAIN: Drain queues created during startup
  3. PROBE_BW: Oscillate pacing rate around BtlBw (main phase)
  4. PROBE_RTT: Periodically reduce cwnd to re-measure RTprop
Key Insight: Packet loss doesn’t mean congestion!
  • Wireless networks drop packets due to RF interference
  • BBR ignores loss, focuses on measured bandwidth
Pros:
  • Higher throughput on lossy links (wireless, satellite)
  • Lower latency (doesn’t fill buffers)
  • Better on bufferbloat-prone networks
Cons:
  • Can be unfair to CUBIC flows (more aggressive)
  • Requires accurate RTT measurement
  • More complex

When to Use:
Scenario                                Best Choice
Data center (low latency, rare loss)    CUBIC
Internet (bufferbloat common)           BBR
Wireless/satellite (lossy)              BBR
Mixed traffic                           BBR (lower latency helps all)
Configuration:
# Check current
sysctl net.ipv4.tcp_congestion_control

# Change to BBR
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Per-connection (from app)
setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3);

Summary

Key Takeaways:
  1. sk_buff: Central data structure. Understanding headroom/tailroom is key to zero-copy optimizations.
  2. NAPI: Hybrid interrupt/polling model solves interrupt storm problem at high packet rates.
  3. XDP: Fastest packet processing path. Process/drop packets before sk_buff allocation using eBPF.
  4. RSS/RPS/RFS: Distribute packet processing across CPUs for scalability. RFS optimizes for cache locality.
  5. TCP Fast Path: Handles common case (in-order delivery) with minimal overhead. Slow path handles edge cases.
  6. Congestion Control: CUBIC (default) vs BBR (better on lossy/bufferbloat links). Understand trade-offs.
  7. Zero-Copy: sendfile(), splice(), MSG_ZEROCOPY eliminate expensive memory copies for large transfers.
  8. Conntrack: Essential for stateful firewalls but can be bottleneck. Bypass for high-traffic stateless services.
Performance Checklist:
  • Enable multi-queue NIC and RSS
  • Tune socket buffers for high BDP
  • Use XDP for packet filtering
  • Enable BBR for internet traffic
  • Increase conntrack table for high connection count
  • Use zero-copy for large data transfers

Next: File Systems