Kernel Networking Stack

The networking subsystem is one of the most complex and performance-critical parts of the Linux kernel. Understanding how packets flow from the Network Interface Card (NIC) through kernel layers to application sockets is essential for building high-performance networked systems.

Mastery Level: Senior Systems Engineer
Key Internals: sk_buff, NAPI, RSS/RPS/RFS, XDP, TCP congestion control, Netfilter
Prerequisites: Interrupts, Virtual Memory

1. The Network Stack Architecture

1.1 Layer Overview

The Linux network stack is layered roughly along the lines of the OSI and TCP/IP models, but each layer is implemented with Linux-specific structures:

┌──────────────────────────────────────────────────────┐
│         Application Layer (User Space)               │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐ │
│  │  Web Server  │  │   Database   │  │    DNS     │ │
│  └──────┬───────┘  └──────┬───────┘  └─────┬──────┘ │
│         │                 │                 │        │
│    Socket API (read/write/send/recv)        │        │
└─────────┼─────────────────┼─────────────────┼────────┘
          │                 │                 │
          ↓ Syscall         ↓                 ↓
┌──────────────────────────────────────────────────────┐
│                  Kernel Space                         │
│  ┌───────────────────────────────────────────────┐   │
│  │  Socket Layer (struct socket, struct sock)    │   │
│  │  - File descriptor management                 │   │
│  │  - Socket buffer management                   │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Transport Layer (TCP/UDP/SCTP)               │   │
│  │  - Segmentation / Reassembly                  │   │
│  │  - Flow control, Congestion control           │   │
│  │  - Port management                            │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Network Layer (IPv4/IPv6)                    │   │
│  │  - Routing decisions (FIB lookup)             │   │
│  │  - Fragmentation / Reassembly                 │   │
│  │  - Netfilter hooks (firewall)                 │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Link Layer (Ethernet, WiFi)                  │   │
│  │  - ARP resolution                             │   │
│  │  - MAC address handling                       │   │
│  │  - Queueing disciplines (tc)                  │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Network Device (struct net_device)           │   │
│  │  - Driver interface                           │   │
│  │  - Ring buffer management                     │   │
│  └─────────────────┬─────────────────────────────┘   │
└────────────────────┼─────────────────────────────────┘
                     ↓ DMA
          ┌──────────────────────┐
          │  Network Interface   │
          │  Card (NIC)          │
          │  - RX/TX Ring Buffers│
          │  - Checksumming      │
          │  - TSO/LRO offload   │
          └──────────────────────┘
                     ↕
              Physical Network

2. The Core Data Structure: sk_buff

The struct sk_buff (socket buffer) is the heart of the Linux networking stack. It represents a network packet as it travels through the kernel.

2.1 sk_buff Structure

// Simplified from include/linux/skbuff.h
struct sk_buff {
    // Linked list management
    struct sk_buff *next;
    struct sk_buff *prev;

    // Socket association
    struct sock *sk;

    // Timestamps
    ktime_t tstamp;

    // Network device
    struct net_device *dev;

    // Data pointers (THE KEY TO UNDERSTANDING NETWORKING)
    unsigned char *head;      // Start of allocated buffer
    unsigned char *data;      // Start of valid data
    sk_buff_data_t tail;      // End of valid data
    sk_buff_data_t end;       // End of allocated buffer

    // Buffer size information
    unsigned int len;         // Total data length (linear part + paged fragments)
    unsigned int data_len;    // Bytes held in paged fragments only

    // Protocol headers (updated as packet moves up/down stack)
    __u16 transport_header;   // Offset to transport header (TCP/UDP)
    __u16 network_header;     // Offset to network header (IP)
    __u16 mac_header;         // Offset to MAC header (Ethernet)

    // Metadata
    __u32 priority;
    __u8 ip_summed;          // Checksum status
    __u8 cloned:1;           // Is this a clone?
    __u8 nohdr:1;

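    // Reference counting (of the sk_buff struct itself)
    refcount_t users;

    // Memory accounting
    unsigned int truesize;   // Total memory charged: data buffer + struct overhead

    // Note: struct skb_shared_info is not a member of sk_buff. It sits at the
    // end of the data buffer (at skb->end) and is reached with skb_shinfo(skb).
    // It holds the fragment array and dataref, the data buffer's refcount.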
};

2.2 sk_buff Memory Layout

Understanding the memory layout is crucial for understanding zero-copy optimizations:

Packet Reception (growing from head toward tail):
─────────────────────────────────────────────────────

┌─────────────────────────────────────────────────────┐
│                                                       │
│  head                                         end    │
│   ↓                                            ↓     │
│   ┌──────────┬───────────┬──────────┬─────────┐     │
│   │ headroom │  Eth Hdr  │  IP Hdr  │ TCP Hdr │ ... │
│   │ (unused) │  (14 B)   │  (20 B)  │ (20 B)  │ ... │
│   └──────────┴───────────┴──────────┴─────────┘     │
│              ↑                                 ↑     │
│             data                             tail    │
│                                                       │
│   <-headroom->  <----- len (data length) ---->       │
│   <-------------- truesize (total alloc) ----------> │
└─────────────────────────────────────────────────────┘

Header Pointers:
  mac_header     → Points to Ethernet header
  network_header → Points to IP header
  transport_header → Points to TCP header

Key Operations:
  skb_push()  - Decrease data pointer (add header)
  skb_pull()  - Increase data pointer (remove header)
  skb_put()   - Increase tail pointer (add data at end)
  skb_trim()  - Decrease tail pointer (remove data at end)

Example: Adding Ethernet Header:

// Before: data points to IP header
┌────────────────────────────────┐
│ headroom │ IP Hdr │ TCP Hdr │  │
└──────────┴────────┴─────────┴──┘
           ↑ data

// Add Ethernet header
unsigned char *eth_hdr = skb_push(skb, ETH_HLEN);  // ETH_HLEN = 14

// After: data now points to Ethernet header
┌────────────────────────────────┐
│ headrm │Eth│ IP Hdr │ TCP Hdr │ │
└────────┴───┴────────┴─────────┴─┘
         ↑ data
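
The four pointer operations can be tried out in user space. The following is a minimal sketch, not kernel code: `struct toy_skb` and the `toy_*` functions are hypothetical names that only mimic how `skb_put()`, `skb_push()`, `skb_pull()`, and `skb_trim()` move `data` and `tail` inside a fixed buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model of the sk_buff data pointers (illustrative only). */
struct toy_skb {
    unsigned char buf[256];              /* backing storage */
    unsigned char *head, *data, *tail, *end;
    unsigned int len;                    /* bytes between data and tail */
};

/* Start with an empty buffer and `headroom` free bytes in front. */
void toy_init(struct toy_skb *skb, unsigned int headroom)
{
    assert(headroom <= sizeof(skb->buf));
    skb->head = skb->buf;
    skb->end = skb->buf + sizeof(skb->buf);
    skb->data = skb->tail = skb->head + headroom;
    skb->len = 0;
}

/* Like skb_put(): extend the data area at the tail. */
unsigned char *toy_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;
    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Like skb_push(): prepend a header by moving data toward head. */
unsigned char *toy_push(struct toy_skb *skb, unsigned int n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Like skb_pull(): strip a header by moving data toward tail. */
unsigned char *toy_pull(struct toy_skb *skb, unsigned int n)
{
    assert(skb->len >= n);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}

/* Like skb_trim(): cut the data area down to new_len bytes. */
void toy_trim(struct toy_skb *skb, unsigned int new_len)
{
    if (new_len < skb->len) {
        skb->len = new_len;
        skb->tail = skb->data + new_len;
    }
}

unsigned int toy_headroom(const struct toy_skb *skb)
{
    return (unsigned int)(skb->data - skb->head);
}
```

Reserving 64 bytes of headroom, putting 40 bytes of IP and TCP headers, and then pushing 14 bytes leaves `len` at 54 and the headroom at 50, which matches the "Adding Ethernet Header" picture.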

2.3 Zero-Copy Mechanisms

Problem: Copying large packets is expensive (memory bandwidth limited).

Solution 1: skb_clone() - Clone sk_buff structure, share data

struct sk_buff *clone = skb_clone(original_skb, GFP_ATOMIC);

Original:        Clone:
┌──────────┐    ┌──────────┐
│ sk_buff  │    │ sk_buff  │
│  struct  │    │  struct  │
│          │    │          │
│ head ────┼───>│ head ────┼──┐
│ data ────┼───>│ data ────┼──┤  Points to SAME buffer
│ tail     │    │ tail     │  │
└──────────┘    └──────────┘  │
                              ↓
                    ┌────────────────────┐
                    │  Shared Data Buffer │
                    │  (refcount = 2)     │
                    └────────────────────┘

Use case: Packet sniffing (tcpdump)
- Clone packet for sniffer
- Original continues up the stack
- No data copy!
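
The sharing scheme can be modelled in a few lines of user-space C. This is a sketch under stated assumptions: `struct pkt`, `struct shared_buf`, and the `pkt_*` functions are hypothetical, and the counter is a plain `int` where the kernel uses atomic operations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Payload bytes plus a reference count, shared between descriptors. */
struct shared_buf {
    int dataref;
    unsigned char bytes[];
};

/* Small per-user descriptor, playing the role of struct sk_buff. */
struct pkt {
    struct shared_buf *buf;
    size_t len;
};

struct pkt *pkt_alloc(const void *payload, size_t len)
{
    assert(payload != NULL);
    struct pkt *p = malloc(sizeof(*p));
    if (!p)
        return NULL;
    p->buf = malloc(sizeof(*p->buf) + len);
    if (!p->buf) {
        free(p);
        return NULL;
    }
    p->buf->dataref = 1;
    memcpy(p->buf->bytes, payload, len);
    p->len = len;
    return p;
}

/* Like skb_clone(): new descriptor, same payload, refcount bumped. */
struct pkt *pkt_clone(const struct pkt *orig)
{
    struct pkt *c = malloc(sizeof(*c));
    if (!c)
        return NULL;
    *c = *orig;
    c->buf->dataref++;
    return c;
}

/* Like kfree_skb(): drop the descriptor, free payload on the last reference. */
void pkt_free(struct pkt *p)
{
    if (!p)
        return;
    assert(p->buf->dataref > 0);
    if (--p->buf->dataref == 0)
        free(p->buf);
    free(p);
}
```

Cloning costs one small allocation regardless of payload size; the payload is released only when the last descriptor pointing at it is freed.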

Solution 2: Fragmented Data (skb_frag) - For large packets

struct skb_shared_info {
    unsigned char nr_frags;  // Number of fragments
    skb_frag_t frags[MAX_SKB_FRAGS];  // Fragment array
};

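// Layout used by older kernels. Newer kernels define skb_frag_t differently,
// so code should go through the skb_frag_*() accessors.
typedef struct skb_frag_struct {
    struct page *page;       // Points to physical page
    __u16 page_offset;       // Offset within page
    __u16 size;              // Fragment size
} skb_frag_t;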

Layout for 9000-byte packet (Jumbo frame):
┌──────────────────────────────────────────────┐
│ sk_buff                                      │
│  head → [headers: 54 bytes]                  │
│  data → [first chunk of data]                │
└──┬───────────────────────────────────────────┘
   │
   └→ skb_shared_info
      ├→ frag[0] → page A (1500 bytes)
      ├→ frag[1] → page B (1500 bytes)
      ├→ frag[2] → page C (1500 bytes)
      └→ frag[3] → page D (remainder)

NIC DMAs directly into these pages (zero-copy receive!)

2.4 sk_buff Operations

The snippets below cover, in order: header manipulation, memory management, and data access.

#include <linux/skbuff.h>

// Add Ethernet header (14 bytes)
struct ethhdr *eth = (struct ethhdr *)skb_push(skb, sizeof(struct ethhdr));
eth->h_proto = htons(ETH_P_IP);
memcpy(eth->h_dest, dst_mac, ETH_ALEN);
memcpy(eth->h_source, src_mac, ETH_ALEN);

// Remove Ethernet header when moving up stack
skb_pull(skb, sizeof(struct ethhdr));

// Access network header
struct iphdr *iph = ip_hdr(skb);  // Macro: (struct iphdr *)skb_network_header(skb)

// Access transport header
struct tcphdr *tcph = tcp_hdr(skb);

// Allocate new sk_buff
struct sk_buff *new_skb = alloc_skb(size, GFP_KERNEL);
if (!new_skb)
    return -ENOMEM;

// Clone sk_buff (share data)
struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

// Copy sk_buff (duplicate data)
struct sk_buff *copy = skb_copy(skb, GFP_ATOMIC);

// Increase reference count
skb_get(skb);

// Decrease reference count (free if reaches 0)
kfree_skb(skb);

// Reallocate headroom: returns a new skb and leaves the original untouched
struct sk_buff *bigger = skb_realloc_headroom(skb, new_headroom);
if (bigger) {
    consume_skb(skb);  // Release the original once the replacement exists
    skb = bigger;
}

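// Total length (linear + paged) vs. bytes in the linear area only
unsigned int total_len = skb->len;
unsigned int linear_len = skb_headlen(skb);  // skb->len - skb->data_len

// Check whether the skb carries paged (non-linear) data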
if (skb_is_nonlinear(skb)) {
    // Data is fragmented
    skb_frag_t *frag;
    int i;

    for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
        frag = &skb_shinfo(skb)->frags[i];
        // Process fragment
        void *frag_addr = skb_frag_address(frag);
        unsigned int frag_size = skb_frag_size(frag);
    }
}

// Linearize fragmented skb (expensive!)
if (skb_linearize(skb) != 0) {
    // Failed to linearize
    kfree_skb(skb);
    return -ENOMEM;
}

3. Packet Reception: From Wire to Socket

3.1 The Legacy Interrupt-Driven Model (Pre-NAPI)

Old Approach (before 2.5 kernel):

1. Packet arrives at NIC
   └→ NIC DMAs packet to ring buffer
   └→ NIC raises hardware IRQ

2. CPU receives interrupt
   └→ Save the interrupted task's registers (not a full context switch)
   └→ Jump to interrupt handler

3. Interrupt handler (top half)
   └→ Disable interrupts (critical section)
   └→ Allocate sk_buff
   └→ Copy packet data from ring buffer to sk_buff
   └→ Queue sk_buff to network stack
   └→ Re-enable interrupts
   └→ Return from interrupt

4. Network stack processes packet (bottom half)
   └→ IP layer processing
   └→ TCP layer processing
   └→ Socket layer delivery

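Problem: At 1 Gbps, minimum-size frames can arrive at about 1.49 million packets/sec, each raising its own interrupt
Result: The CPU spends nearly all its time in interrupt handling (interrupt storm, also called receive livelock)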
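
To put a number on the interrupt load, the worst-case packet rate of an Ethernet link can be computed from the frame size. The sketch below assumes standard Ethernet framing, where every frame occupies an extra 20 bytes on the wire (8 bytes of preamble and start delimiter plus a 12-byte inter-frame gap); the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Preamble + start-of-frame delimiter (8) and inter-frame gap (12). */
#define ETH_WIRE_OVERHEAD 20u

/* Upper bound on frames per second for a given link speed and frame size.
 * frame_bytes includes the Ethernet header and FCS (minimum 64). */
uint64_t max_packets_per_sec(uint64_t link_bps, uint32_t frame_bytes)
{
    assert(frame_bytes >= 64);
    return link_bps / ((uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD) * 8u);
}
```

With one interrupt per packet, `max_packets_per_sec(1000000000ULL, 64)` is the interrupt rate a 1 Gbps NIC could impose with minimum-size frames, while full-size 1518-byte frames yield a far lower rate.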

3.2 NAPI: New API (Polling + Interrupts)

Solution: Hybrid polling/interrupt model

Receive Flow with NAPI:
────────────────────────

Phase 1: Low Traffic (Interrupt Mode)
┌─────────────────────────────────────────┐
│ Packet arrives                          │
│  ↓                                      │
│ NIC raises IRQ                          │
│  ↓                                      │
│ Driver IRQ handler:                     │
│  • Disable NIC interrupts               │
│  • Schedule NAPI poll (add to poll_list)│
│  • Return (very fast!)                  │
│  ↓                                      │
│ Softirq NET_RX_SOFTIRQ triggers         │
│  ↓                                      │
│ net_rx_action():                        │
│  • Call driver->poll()                  │
│  • Process up to budget packets (64)    │
│  • If ring empty: re-enable IRQ         │
└─────────────────────────────────────────┘

Phase 2: High Traffic (Polling Mode)
┌─────────────────────────────────────────┐
│ poll() processes 64 packets             │
│  ↓                                      │
│ More packets in ring buffer?            │
│  • YES: Stay in polling mode            │
│  • Keep calling poll() until ring empty │
│  • No interrupts needed!                │
│  ↓                                      │
│ Eventually ring empties                 │
│  • Re-enable NIC interrupts             │
│  • Wait for next packet                 │
└─────────────────────────────────────────┘

Kernel Code (an illustrative driver plus net_rx_action(), simplified from net/core/dev.c):

// Driver's NAPI poll function
static int my_driver_poll(struct napi_struct *napi, int budget) {
    struct my_adapter *adapter = container_of(napi, struct my_adapter, napi);
    int work_done = 0;

    while (work_done < budget) {
        // Check if ring buffer has packets
        if (ring_buffer_empty(adapter))
            break;

        // Fetch packet from ring buffer
        struct sk_buff *skb = fetch_packet_from_ring(adapter);

        // Set metadata
        skb->dev = adapter->netdev;
        skb->protocol = eth_type_trans(skb, adapter->netdev);

        // Pass to network stack
        netif_receive_skb(skb);

        work_done++;
    }

    // If we processed less than budget, ring is empty
    if (work_done < budget && napi_complete_done(napi, work_done))
        enable_nic_interrupts(adapter);  // Exit polling mode; counterpart of disable_nic_interrupts()

    return work_done;
}

// IRQ handler (top half)
static irqreturn_t my_driver_irq_handler(int irq, void *data) {
    struct my_adapter *adapter = data;

    // Disable NIC interrupts
    disable_nic_interrupts(adapter);

    // Schedule NAPI polling
    napi_schedule(&adapter->napi);

    return IRQ_HANDLED;
}

// Network core: softirq handler
static void net_rx_action(struct softirq_action *h) {
    struct list_head *poll_list = this_cpu_ptr(&softnet_data.poll_list);
    int budget = netdev_budget;  // Total packets per softirq run (default: 300)
    unsigned long time_limit = jiffies + usecs_to_jiffies(netdev_budget_usecs);

    while (!list_empty(poll_list)) {
        struct napi_struct *napi = list_first_entry(poll_list, ...);

        // Each device gets at most its own weight (typically 64) per call
        int work = napi->poll(napi, napi->weight);
        budget -= work;

        if (budget <= 0 || time_after(jiffies, time_limit))
            break;  // Yield CPU
    }
}

Benefits:

Low latency under low load: Interrupts still used
High throughput under high load: Polling avoids interrupt overhead
Fairness: Budget limits per-device processing
CPU efficiency: No interrupt storm
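
The switch between interrupt mode and polling mode can be simulated without hardware. This is a toy model, not driver code: `struct toy_nic`, `toy_poll()`, and `toy_service_burst()` are hypothetical names, and the "ring" is just a counter of pending packets.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_nic {
    int pending;        /* packets waiting in the RX ring */
    bool irq_enabled;   /* whether the device may raise an interrupt */
};

/* Same shape as a driver poll(): consume up to `budget` packets and
 * re-enable the interrupt only if the ring drained before the budget ran out. */
int toy_poll(struct toy_nic *nic, int budget)
{
    int work_done = 0;

    assert(budget > 0);
    while (work_done < budget && nic->pending > 0) {
        nic->pending--;
        work_done++;
    }
    if (work_done < budget)
        nic->irq_enabled = true;
    return work_done;
}

/* One interrupt, then repeated polls until the device leaves polling mode.
 * Returns how many poll() calls were needed. */
int toy_service_burst(struct toy_nic *nic, int budget)
{
    int polls = 0;

    nic->irq_enabled = false;   /* the IRQ handler masks the device */
    do {
        toy_poll(nic, budget);
        polls++;
    } while (!nic->irq_enabled);
    return polls;
}
```

A burst of 200 packets with a budget of 64 is handled by a single interrupt followed by four polls; without NAPI the same burst would have raised 200 interrupts.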

3.3 Receive Packet Steering (RPS/RFS)

Problem: Single NIC queue → all packets processed on one CPU core

Solution: Distribute packet processing across multiple CPUs

RSS (Hardware)

Receive Side Scaling

NIC has multiple RX queues
NIC hashes packet (IP + port)
Distributes to different queues
Each queue has own IRQ → CPU core

NIC
┌────────────────────┐
│  Hash(pkt) % 4     │
│   ↓   ↓   ↓   ↓    │
│  Q0  Q1  Q2  Q3    │
└──┼───┼───┼───┼────┘
   │   │   │   └──IRQ3→ CPU3
   │   │   └──────IRQ2→ CPU2
   │   └──────────IRQ1→ CPU1
   └──────────────IRQ0→ CPU0

Pros: Hardware acceleration
Cons: Requires multi-queue NIC

RPS (Software)

Receive Packet Steering

Software-based RSS
CPU that receives IRQ hashes packet
Enqueues to target CPU’s backlog
Target CPU processes packet

CPU0 (IRQ handler)
  │ Receive packet
  │ Hash → CPU2
  ↓
  Enqueue to CPU2 backlog
          ↓
      CPU2 processes packet

Pros: Works with single-queue NIC
Cons: Extra CPU work for hashing and for handing packets to another CPU

RFS (Receive Flow Steering): Extension of RPS

Goal: Process packet on CPU where application is running
Benefit: CPU cache locality (hot cache = faster processing)

Flow:
1. Application recv() on CPU3
   └→ Kernel records: Flow X → CPU3

2. Packet for Flow X arrives on CPU0
   └→ RPS checks flow table
   └→ Steers to CPU3 (where app is!)

Result: Packet data already in CPU3's cache when app reads it

Configuration:

# Enable RPS (software steering)
echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus  # Use CPUs 0-3

# Enable RFS: size the global socket flow table
echo 4096 > /proc/sys/net/core/rps_sock_flow_entries

# Set per-queue flow entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
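
The value written to rps_cpus is a hexadecimal CPU bitmask, so "f" means CPUs 0-3. The sketch below shows how such a mask is built and how a flow hash could select one CPU from it. Both function names are hypothetical, and the modulo-based selection is a simplification assumed for illustration; the kernel scales the hash into its own per-queue CPU map instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build the bitmask for a list of CPU numbers (bit N set = CPU N allowed). */
uint64_t rps_mask_from_cpus(const unsigned int *cpus, size_t n)
{
    uint64_t mask = 0;

    for (size_t i = 0; i < n; i++) {
        assert(cpus[i] < 64);
        mask |= (uint64_t)1 << cpus[i];
    }
    return mask;
}

/* Map a flow hash onto one of the CPUs enabled in the mask.
 * Returns -1 if the mask is empty. Same hash -> same CPU, so all
 * packets of a flow are processed in order on one core. */
int rps_pick_cpu(uint64_t mask, uint32_t flow_hash)
{
    unsigned int enabled = 0;

    for (unsigned int cpu = 0; cpu < 64; cpu++)
        if (mask & ((uint64_t)1 << cpu))
            enabled++;
    if (enabled == 0)
        return -1;

    unsigned int target = flow_hash % enabled;
    for (unsigned int cpu = 0; cpu < 64; cpu++) {
        if (mask & ((uint64_t)1 << cpu)) {
            if (target == 0)
                return (int)cpu;
            target--;
        }
    }
    return -1;
}
```

For example, the CPU list {0, 1, 2, 3} produces the mask 0xf, and the list {1, 3} produces 0xa.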

4. XDP: eXpress Data Path

XDP allows running eBPF programs before sk_buff allocation, at the earliest possible point in packet processing.

4.1 XDP Architecture

Packet Flow with XDP:
─────────────────────

Without XDP:
┌───────┐   ┌──────────┐   ┌──────┐   ┌─────┐   ┌──────┐
│  NIC  │ → │ allocate │ → │  IP  │ → │ TCP │ → │ App  │
│  DMA  │   │  sk_buff │   │layer │   │layer│   │      │
└───────┘   └──────────┘   └──────┘   └─────┘   └──────┘
  ~500ns      ~200ns         ~100ns     ~100ns     user

With XDP:
┌───────┐   ┌──────────┐
│  NIC  │ → │   XDP    │ ──→ [XDP_DROP] (discard, fastest)
│  DMA  │   │  eBPF    │ ──→ [XDP_TX] (bounce back)
└───────┘   │ program  │ ──→ [XDP_REDIRECT] (other NIC/CPU)
  ~500ns    └──────────┘ ──→ [XDP_PASS] (continue to stack)
             ~100ns              ↓
                          ┌──────────┐
                          │ allocate │
                          │  sk_buff │
                          └──────────┘
                                ↓
                           Normal stack...

4.2 XDP Program Example

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>   // SEC()
#include <bpf/bpf_endian.h>    // bpf_htons(), bpf_htonl()

// Drop all packets from specific IP
SEC("xdp")
int xdp_drop_ip(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Bounds check (required by BPF verifier)
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // Only process IP packets
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // Drop packets from 192.168.1.100
    // 0xC0A80164 is the host-order value; ip->saddr is stored in network
    // (big-endian) order, so convert before comparing
    __u32 blocked_ip = bpf_htonl(0xC0A80164);
    if (ip->saddr == blocked_ip) {
        return XDP_DROP;  // Discard immediately!
    }

    return XDP_PASS;  // Continue to network stack
}

char _license[] SEC("license") = "GPL";

Compile and Load:

# Compile XDP program
clang -O2 -target bpf -c xdp_drop.c -o xdp_drop.o

# Load into kernel
ip link set dev eth0 xdp obj xdp_drop.o sec xdp

# Verify
ip link show eth0
# ... xdp ...

# Remove XDP program
ip link set dev eth0 xdp off

4.3 XDP Actions

The four actions, in the order shown below: XDP_DROP, XDP_TX, XDP_REDIRECT, XDP_PASS.

// Fastest packet drop (DDoS mitigation)
if (is_attack_packet(ctx)) {
    return XDP_DROP;  // ~10 million pps possible
}

Use Cases:

DDoS mitigation (drop attack traffic before stack)
Invalid packet filtering
Rate limiting at wire speed

Performance: ~50ns per packet (vs ~10µs for iptables DROP)

// Bounce packet back out same interface
if (is_icmp_echo_request(ctx)) {
    // Swap MAC addresses
    swap_mac_addresses(ctx);
    return XDP_TX;  // Send back!
}

Use Cases:

Load balancer (reflect to backend)
ICMP responder
Packet generator

Performance: ~100ns round-trip

// Send packet to different interface or CPU
int target_ifindex = 3;  // eth1
return bpf_redirect(target_ifindex, 0);

Use Cases:

Multi-interface load balancing
Bridge/router acceleration
AF_XDP (redirect to user space)

Performance: ~200ns to redirect

// Continue normal kernel processing
// Maybe just collect statistics
__sync_fetch_and_add(&stats.packets, 1);
return XDP_PASS;

Use Cases:

Monitoring (count packets)
Conditional processing
Complex protocols (need full stack)

4.4 AF_XDP: Zero-Copy to User Space

AF_XDP lets user-space programs receive packets directly from the NIC's DMA buffers, bypassing the kernel's protocol stack.

#include <linux/if_xdp.h>
#include <xdp/xsk.h>    // libxdp; older libbpf releases shipped this as <bpf/xsk.h>

struct xsk_socket_info {
    struct xsk_ring_cons rx;    // RX ring (kernel → user)
    struct xsk_ring_prod tx;    // TX ring (user → kernel)
    struct xsk_ring_prod fq;    // Fill queue (user provides buffers)
    struct xsk_ring_cons cq;    // Completion queue (kernel returns buffers)
    struct xsk_socket *xsk;
    void *umem_area;            // Start of the UMEM region registered with the socket
};

// create_xsk_socket(), allocate_buffer(), process_packet() and BATCH_SIZE
// are application-defined helpers, not library functions.

int main() {
    // Create XDP socket
    struct xsk_socket_info *xsk = create_xsk_socket("eth0", 0);

    // Main loop
    while (1) {
        // Fill queue with empty buffers
        __u32 idx_fq = 0;
        if (xsk_ring_prod__reserve(&xsk->fq, BATCH_SIZE, &idx_fq) == BATCH_SIZE) {
            for (int i = 0; i < BATCH_SIZE; i++) {
                *xsk_ring_prod__fill_addr(&xsk->fq, idx_fq++) = allocate_buffer();
            }
            xsk_ring_prod__submit(&xsk->fq, BATCH_SIZE);
        }

        // Receive packets
        __u32 idx_rx = 0;
        unsigned int rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
        for (unsigned int i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);

            // Access packet data (zero-copy!)
            void *pkt = xsk_umem__get_data(xsk->umem_area, desc->addr);
            unsigned int len = desc->len;

            // Process packet...
            process_packet(pkt, len);
        }
        xsk_ring_cons__release(&xsk->rx, rcvd);
    }

    return 0;
}

Performance: 20+ million packets per second per core (vs ~1-2M with regular sockets)

5. The TCP/IP Stack

5.1 IP Layer Processing

// Simplified from net/ipv4/ip_input.c

int ip_rcv(struct sk_buff *skb, struct net_device *dev) {
    struct iphdr *iph;

    // 1. Validate packet
    if (skb->len < sizeof(struct iphdr))
        goto drop;

    iph = ip_hdr(skb);

    // 2. Validate header fields before trusting ihl
    if (iph->ihl < 5 || iph->version != 4)
        goto drop;

    // 3. Verify the header checksum (ihl counts 32-bit words)
    if (ip_fast_csum((u8 *)iph, iph->ihl) != 0)
        goto drop;

    // 4. Netfilter hook: PREROUTING
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
                   ip_rcv_finish);

drop:
    kfree_skb(skb);
    return NET_RX_DROP;
}

static int ip_rcv_finish(struct sk_buff *skb) {
    struct iphdr *iph = ip_hdr(skb);

    // 5. Route lookup (where should this packet go?)
    if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, skb->dev) != 0) {
        kfree_skb(skb);
        return NET_RX_DROP;
    }

    // 6. Deliver locally or forward
    return dst_input(skb);  // Calls ip_local_deliver() or ip_forward()
}

int ip_local_deliver(struct sk_buff *skb) {
    struct iphdr *iph = ip_hdr(skb);

    // 7. IP fragmentation reassembly
    if (iph->frag_off & htons(IP_MF | IP_OFFSET)) {
        skb = ip_defrag(skb);
        if (!skb)
            return 0;
    }

    // 8. Netfilter hook: LOCAL_IN
    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                   ip_local_deliver_finish);
}

static int ip_local_deliver_finish(struct sk_buff *skb) {
    struct iphdr *iph = ip_hdr(skb);

    // 9. Demultiplex to transport layer
    int protocol = iph->protocol;

    switch (protocol) {
        case IPPROTO_TCP:
            tcp_v4_rcv(skb);
            break;
        case IPPROTO_UDP:
            udp_rcv(skb);
            break;
        case IPPROTO_ICMP:
            icmp_rcv(skb);
            break;
        default:
            // Unknown protocol
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PROT_UNREACH, 0);
            kfree_skb(skb);
    }

    return 0;
}
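
The header checksum that `ip_fast_csum()` verifies is the Internet checksum from RFC 1071: a one's-complement sum over the header taken as 16-bit words. Below is a portable user-space sketch; the function name is hypothetical and the kernel's version is architecture-optimized, so this only illustrates the arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Internet checksum over an IPv4 header of ihl_words 32-bit words.
 * - Over a header whose checksum field is zeroed, returns the value to store.
 * - Over a received header (checksum field included), returns 0 if valid. */
uint16_t ipv4_header_checksum(const uint8_t *hdr, size_t ihl_words)
{
    uint32_t sum = 0;

    assert(ihl_words >= 5 && ihl_words <= 15);

    /* Add the header as big-endian 16-bit words. */
    for (size_t i = 0; i < ihl_words * 4; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A receiver runs the same sum over the header as it arrived; a non-zero result means the header was corrupted and the packet is dropped.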

Routing Table Lookup (FIB - Forwarding Information Base):

Route lookup for destination IP:

1. Check for a cached route (fast path)
   └→ Per-socket or per-nexthop cached dst_entry; the global IPv4
      routing cache was removed in Linux 3.6

2. Policy routing (multiple routing tables)
   └→ Check fib rules, select table

3. FIB lookup in the selected table (trie-based structure)
   ┌─────────────────────────────┐
   │ Longest Prefix Match (LPM)  │
   │                             │
   │ 192.168.1.0/24 → eth0       │
   │ 10.0.0.0/8 → tun0           │
   │ 0.0.0.0/0 → eth0 (default)  │
   └─────────────────────────────┘

4. Result: dst_entry structure
   ┌──────────────────────────┐
   │ output_dev = eth0        │
   │ gateway = 192.168.1.1    │
   │ output_fn = ip_output()  │
   └──────────────────────────┘
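
Longest prefix match itself is simple to state in code. The sketch below uses a linear scan over a small table purely for clarity; the kernel uses a compressed trie to get the same answer quickly, and `struct toy_route` and `lpm_lookup()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_route {
    uint32_t prefix;     /* network address, host byte order */
    unsigned int plen;   /* prefix length, 0..32 */
    const char *dev;     /* output device name */
};

static uint32_t prefix_mask(unsigned int plen)
{
    assert(plen <= 32);
    /* A shift by 32 is undefined in C, so /0 is handled separately. */
    return plen == 0 ? 0 : ~(uint32_t)0 << (32 - plen);
}

/* Return the matching route with the longest prefix, or NULL if none match. */
const struct toy_route *lpm_lookup(const struct toy_route *table, size_t n,
                                   uint32_t dst)
{
    const struct toy_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        uint32_t mask = prefix_mask(table[i].plen);

        if ((dst & mask) == (table[i].prefix & mask) &&
            (best == NULL || table[i].plen > best->plen))
            best = &table[i];
    }
    return best;
}
```

With the three routes from the box above, 192.168.1.5 matches both the /24 and the default route and the /24 wins; 8.8.8.8 matches only the default route.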

5.2 TCP Layer: The Fast Path

TCP processing has two paths:

Fast Path

Conditions:

In-order segment
No flags (except ACK)
Window not full
No urgent data
Checksum OK

Processing:

// Simplified
if (tcp_fast_path_check(sk, skb)) {
    // Fast path!
    memcpy(user_buffer, skb->data, skb->len);
    tcp_send_ack(sk);
    wake_up_process(sk->sk_sleep);
    return;
}

Performance: ~1µs per packet

Slow Path

Triggers:

Out-of-order segment
Retransmission
Window probing
Options (SACK, timestamps)
Connection management (SYN, FIN)

Processing:

// Complex state machine
tcp_validate_incoming(sk, skb);
tcp_ack(sk, skb);  // Process ACK
tcp_data_queue(sk, skb);  // Queue OOO
tcp_send_delayed_ack(sk);
// ... many more checks ...

Performance: ~5-10µs per packet

TCP Receive Processing (simplified from net/ipv4/tcp_ipv4.c and net/ipv4/tcp_input.c):

int tcp_v4_rcv(struct sk_buff *skb) {
    struct tcphdr *th;
    struct sock *sk;

    // 1. Get TCP header
    th = tcp_hdr(skb);

    // 2. Find socket (demultiplexing)
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
    if (!sk)
        goto no_tcp_socket;  // Send RST

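    // 3. Process TCP state machine
    tcp_v4_do_rcv(sk, skb);

    return 0;

no_tcp_socket:
    // No matching socket: answer with RST and drop the segment
    tcp_v4_send_reset(NULL, skb);
    kfree_skb(skb);
    return 0;
}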

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) {
    // Check if established connection
    if (sk->sk_state == TCP_ESTABLISHED) {
        // Try fast path
        if (tcp_rcv_established(sk, skb) == 0)
            return 0;
    }

    // Slow path (connection management)
    return tcp_rcv_state_process(sk, skb);
}

int tcp_rcv_established(struct sock *sk, struct sk_buff *skb) {
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcphdr *th = tcp_hdr(skb);

    // FAST PATH CHECK
    if (tp->rcv_nxt == ntohl(th->seq) &&  // In-order
        tp->rcv_wnd &&                     // Window open
        !th->syn && !th->fin && !th->rst)  // No special flags
    {
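        int hdr_len = th->doff * 4;
        int len = skb->len - hdr_len;

        // Strip the TCP header and queue the payload on the receive queue
        __skb_pull(skb, hdr_len);
        __skb_queue_tail(&sk->sk_receive_queue, skb);
        tp->rcv_nxt += len;

        // Send ACK
        tcp_send_ack(sk);

        // Wake up waiting process
        sk->sk_data_ready(sk);

        // The skb now belongs to the receive queue, so it is not freed here;
        // recv() frees it after copying the data to user space
        return 0;  // Fast path success!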
    }

    // Fall through to slow path...
    return tcp_slow_path(sk, skb);
}

5.3 TCP Congestion Control

Linux supports pluggable congestion control algorithms:

The three parts below cover CUBIC (the default), BBR (a newer, model-based algorithm), and configuration.

// Simplified CUBIC algorithm

// Congestion window growth function
W(t) = C * (t - K)³ + W_max

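Where:
  t = Time since last congestion event
  K = Cube root of (W_max * (1 - β) / C), the time needed to climb back to W_max
  W_max = Window size before congestion
  C = Scaling constant (0.4 in RFC 8312)
  β = Multiplicative decrease factor (0.7)

Check: W(0) = W_max - C * K³ = β * W_max, the window right after a loss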

Behavior:
  Slow start → Exponential growth
  Congestion avoidance → Cubic growth
  Loss detected → W = W * β (reduce 30%)
  Recovery → Cubic growth toward W_max

┌─────────────────────────────────────┐
│  Congestion Window                  │
│      ↑                              │
│ W_max│     ╱╲                       │
│      │    ╱  ╲   ← Cubic curve      │
│      │   ╱    ╲╱                    │
│      │  ╱                           │
│      │ ╱                            │
│      └──────────────────→ Time      │
│         ↑ Loss                      │
└─────────────────────────────────────┘

Pros: Good for high-bandwidth networks
Cons: Interprets every loss as congestion, so it backs off needlessly on lossy links
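
The CUBIC growth function is easy to evaluate numerically. The sketch below assumes the RFC 8312 constants (C = 0.4, β = 0.7) and the RFC's definition K = cbrt(W_max * (1 - β) / C); the function names are hypothetical, windows are in segments, and time is in seconds. A small Newton iteration stands in for `cbrt()` so the example does not need the math library.

```c
#include <assert.h>

#define CUBIC_C    0.4   /* scaling constant, per RFC 8312 */
#define CUBIC_BETA 0.7   /* multiplicative decrease factor */

/* Cube root of a non-negative number by Newton's method. */
static double cube_root(double x)
{
    assert(x >= 0.0);
    if (x == 0.0)
        return 0.0;

    double r = x > 1.0 ? x : 1.0;   /* start at or above the root */
    for (int i = 0; i < 200; i++)
        r = (2.0 * r + x / (r * r)) / 3.0;
    return r;
}

/* K: time needed to grow from beta * W_max back to W_max. */
double cubic_k(double w_max)
{
    return cube_root(w_max * (1.0 - CUBIC_BETA) / CUBIC_C);
}

/* W(t) = C * (t - K)^3 + W_max, with t measured from the last loss. */
double cubic_window(double w_max, double t)
{
    double d = t - cubic_k(w_max);
    return CUBIC_C * d * d * d + w_max;
}
```

For W_max = 100 segments, K is about 4.2 seconds: the window restarts at 70, flattens out as it approaches 100 around t = K, and then accelerates again to probe for more bandwidth.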

// BBR (Bottleneck Bandwidth and RTT)
// Focus: Model the network, not react to loss

BBR Phases:
1. STARTUP: Exponential growth to find bandwidth
2. DRAIN: Drain queues built during startup
3. PROBE_BW: Oscillate around optimal bandwidth
4. PROBE_RTT: Periodically probe for minimum RTT

Key Insight: Loss ≠ Congestion
  Wireless networks drop packets due to interference
  BBR ignores loss, focuses on measured bandwidth and RTT

Pacing Rate = BtlBw × Gain
  BtlBw = Estimated bottleneck bandwidth
  Gain = Multiplicative factor (varies by phase)

┌─────────────────────────────────────┐
│  Rate                               │
│    ↑                                │
│    │    ╱‾‾‾╲  ╱‾‾‾╲  ╱‾‾‾╲         │
│ BW │   ╱     ╲╱     ╲╱     ╲        │
│    │  ╱  PROBE_BW (cycles)  ╲       │
│    │ ╱                       ╲      │
│    └──────────────────────────→ t   │
└─────────────────────────────────────┘

Pros: Better on high-latency/lossy links
Cons: Can be unfair to CUBIC flows

# View available algorithms
sysctl net.ipv4.tcp_available_congestion_control
# cubic reno bbr

# Get current default
sysctl net.ipv4.tcp_congestion_control
# cubic

# Change to BBR
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

# Per-socket (from a C application; needs <netinet/tcp.h> and <string.h>)
const char *cc = "bbr";
setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc));

# View TCP statistics
ss -ti  # Show TCP info including congestion control

# Example output:
# cubic wscale:7,7 rto:204 rtt:3.5/2 cwnd:10

6. Socket Layer & System Calls

6.1 Socket Creation

// User space
int sockfd = socket(AF_INET, SOCK_STREAM, 0);

// Kernel: net/socket.c
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol) {
    // 1. Allocate socket structure
    struct socket *sock = sock_alloc();

    // 2. Create protocol-specific socket
    // For TCP: calls inet_create() → tcp_prot.init()
    sock->ops = &inet_stream_ops;  // TCP operations

    // 3. Allocate struct sock (protocol control block)
    struct sock *sk = sk_alloc(family, GFP_KERNEL, &tcp_prot);
    sock->sk = sk;

    // 4. Initialize TCP-specific state
    tcp_init_sock(sk);

    // 5. Create file descriptor; SOCK_CLOEXEC and SOCK_NONBLOCK arrive
    //    as extra bits in the type argument
    int fd = sock_map_fd(sock, type & (O_CLOEXEC | O_NONBLOCK));

    return fd;
}

Data Structures:

File Descriptor Layer:
┌──────────────────┐
│ struct file      │  (Generic file operations)
│  ├─ f_op         │  → socket_file_ops
│  └─ private_data │  → points to struct socket
└────────┬─────────┘
         ↓
Socket Layer:
┌──────────────────┐
│ struct socket    │  (BSD socket API)
│  ├─ type         │  (SOCK_STREAM, SOCK_DGRAM)
│  ├─ ops          │  → inet_stream_ops (send, recv, bind, etc.)
│  └─ sk           │  → points to struct sock
└────────┬─────────┘
         ↓
Protocol Layer:
┌──────────────────┐
│ struct sock      │  (Protocol control block)
│  ├─ sk_state     │  (TCP_ESTABLISHED, TCP_LISTEN, etc.)
│  ├─ sk_prot      │  → tcp_prot (protocol operations)
│  ├─ sk_receive_queue   │  (Received data)
│  ├─ sk_write_queue     │  (Data to send)
│  └─ ...          │
└────────┬─────────┘
         ↓
Protocol-Specific:
┌──────────────────┐
│ struct tcp_sock  │  (TCP-specific state)
│  ├─ rcv_nxt      │  (Next expected sequence)
│  ├─ snd_una      │  (Oldest unacked byte)
│  ├─ srtt         │  (Smoothed RTT)
│  ├─ snd_cwnd     │  (Congestion window)
│  └─ ...          │
└──────────────────┘

6.2 send() and recv() Internals

The send() path is shown first, followed by the recv() path.

// User space
ssize_t n = send(sockfd, buffer, length, flags);

// Kernel: net/socket.c → tcp_sendmsg() in net/ipv4/tcp.c
// (heavily simplified; locking and memory accounting omitted)
int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) {
    struct sk_buff *skb;
    int flags = msg->msg_flags;
    long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
    int mss_now = tcp_current_mss(sk);
    int copied = 0;

    // 1. Check socket state
    if (sk->sk_state != TCP_ESTABLISHED)
        return -ENOTCONN;

    // 2. Wait for send buffer space if needed
    while (size > 0) {
        // Check if send buffer full
        if (!sk_stream_memory_free(sk)) {
            if (flags & MSG_DONTWAIT)
                return copied ? copied : -EAGAIN;

            // Block waiting for buffer space
            if (sk_stream_wait_memory(sk, &timeo))
                return copied ? copied : -EAGAIN;
        }

        // 3. Allocate sk_buff
        skb = sk_stream_alloc_skb(sk, min_t(size_t, size, mss_now), GFP_KERNEL);
        if (!skb)
            return copied ? copied : -ENOMEM;

        // 4. Copy data from user space
        int copy = min_t(int, size, skb_availroom(skb));
        if (skb_add_data_nocache(sk, skb, &msg->msg_iter, copy)) {
            kfree_skb(skb);
            return copied ? copied : -EFAULT;
        }

        // 5. Add to write queue
        skb_entail(sk, skb);

        copied += copy;
        size -= copy;

        // 6. Push data if:
        // - The caller did not set MSG_MORE
        // - No more data
        if ((flags & MSG_MORE) == 0 || size == 0)
            tcp_push(sk, flags, mss_now, TCP_NAGLE_PUSH);
    }

    return copied;
}

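// 7. Actually transmit. In the kernel, tcp_push() reaches this loop via
//    __tcp_push_pending_frames() → tcp_write_xmit(); it is inlined here.
void tcp_push(struct sock *sk, int flags, int mss_now, int nonagle) {
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;

    while ((skb = tcp_send_head(sk)) != NULL) {
        // Check the receiver's window and the congestion window
        if (tcp_snd_wnd_test(tp, skb, mss_now) &&
            tcp_cwnd_test(tp, skb)) {

            // Transmit!
            tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);

            // Advance the send head so the loop moves on to the next skb
            tcp_event_new_data_sent(sk, skb);
        } else {
            break;  // Window full, wait for ACK
        }
    }
}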

recv()

// User space
ssize_t n = recv(sockfd, buffer, length, flags);

// Kernel: net/ipv4/tcp.c
// Simplified sketch: locking, urgent data, and MSG_PEEK are omitted.
int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
                int nonblock, int flags, int *addr_len) {
    struct tcp_sock *tp = tcp_sk(sk);
    long timeo = sock_rcvtimeo(sk, nonblock);
    int copied = 0;

    // 1. Wait for data
    while (len > 0) {
        struct sk_buff *skb;

        // Check receive queue
        skb = skb_peek(&sk->sk_receive_queue);

        if (!skb) {
            // Queue drained: return what was copied rather than block again
            if (copied)
                break;

            // No data available
            if (nonblock) {
                copied = -EAGAIN;
                break;
            }

            // Block waiting for data
            sk_wait_data(sk, &timeo);
            continue;
        }

        // 2. Copy data to user space
        int chunk = min_t(int, len, skb->len);

        if (skb_copy_datagram_msg(skb, 0, msg, chunk)) {
            if (!copied)
                copied = -EFAULT;
            break;
        }

        copied += chunk;
        len -= chunk;

        // 3. Advance the "copied to user" sequence number.
        //    rcv_nxt was already advanced when the segment was queued.
        tp->copied_seq += chunk;

        // 4. Remove from queue if fully consumed
        if (chunk == skb->len) {
            __skb_unlink(skb, &sk->sk_receive_queue);
            __kfree_skb(skb);
        } else {
            // Partial read, adjust skb
            __skb_pull(skb, chunk);
        }

        // 5. Update receive-buffer autotuning
        tcp_rcv_space_adjust(sk);
    }

    // 6. Cleanup: tcp_cleanup_rbuf() reopens the advertised window and
    //    sends an ACK if needed (e.g. when rcv_nxt - rcv_wup exceeds rcv_mss)
    if (copied > 0)
        tcp_cleanup_rbuf(sk, copied);
    return copied;
}
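
Because tcp_sendmsg() returns the number of bytes it managed to queue and tcp_recvmsg() returns whatever is available, user-space code that needs an exact byte count has to loop. The following is a minimal sketch for blocking sockets; send_all() and recv_all() are hypothetical helper names, not part of the socket API.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send exactly len bytes, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is set by send()). */
static int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = send(fd, p, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive exactly len bytes.
 * Returns 0 on success, -1 on error or if the peer closed early. */
static int recv_all(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n == 0)
            return -1;              /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

On a non-blocking socket these helpers would return -1 with errno set to EAGAIN when the buffer is full or empty; a real event loop would wait for readiness with poll() or epoll instead of failing.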

6.3 Zero-Copy Techniques

sendfile()

// Traditional copy (4 copies!)
int fd_in = open("file.txt", O_RDONLY);
int fd_out = socket(...);

char buf[4096];
ssize_t n;
while ((n = read(fd_in, buf, sizeof(buf))) > 0) {
    write(fd_out, buf, n);  // a robust version must handle short writes
}

// Copies:
// 1. DMA: Disk → Kernel buffer (page cache)
// 2. CPU: Kernel buffer → User buffer (read)
// 3. CPU: User buffer → Socket buffer (write)
// 4. DMA: Socket buffer → NIC

// Zero-copy sendfile()
// May transfer fewer bytes than requested; check the return value.
ssize_t sent = sendfile(fd_out, fd_in, NULL, file_size);

// Copies:
// 1. DMA: Disk → Kernel buffer (page cache)
// 2. DMA: Kernel buffer → NIC (if the NIC supports scatter-gather DMA)
// Without scatter-gather support, one CPU copy into the socket buffer remains

// Kernel implementation (simplified; in_file and sd are set up by omitted code)
ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
                    size_t count) {
    // Use splice internally
    // Transfers page-cache pages directly to the socket
    return splice_direct_to_actor(in_file, &sd,
                                  direct_splice_actor);
}

MSG_ZEROCOPY

// Modern zero-copy send (Linux 4.14+ for TCP)
int fd = socket(AF_INET, SOCK_STREAM, 0);

// Enable zero-copy
int one = 1;
if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY,
               &one, sizeof(one)) < 0)
    perror("setsockopt SO_ZEROCOPY");

// Send with zero-copy flag
char buf[65536];
send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

// Kernel behavior:
// - Pins the user pages (takes a reference on each)
// - Passes page pointers to NIC
// - User buffer MUST NOT be modified until the completion arrives

// Read the completion notification from the socket error queue.
// Error-queue reads never block: recvmsg() fails with EAGAIN until a
// notification is queued, so wait for POLLERR with poll() first.
struct msghdr msg = {};
char control[100];

msg.msg_control = control;
msg.msg_controllen = sizeof(control);

recvmsg(fd, &msg, MSG_ERRQUEUE);
// The control message carries a struct sock_extended_err whose
// ee_info..ee_data range identifies the completed send() calls.

// Now safe to reuse buffer

// Benefits:
// - No CPU copy of payload data
// - Reduced cache pollution
// - Higher throughput

// Caveats:
// - Only beneficial for large sends (roughly >10KB)
// - Buffer must stay unmodified until the completion notification
// - More complex error handling

7. Netfilter & Packet Filtering

7.1 Netfilter Hook Points

Packet Flow through Netfilter:
───────────────────────────────

                    Incoming Packet
                          ↓
                    ┌──────────┐
                    │   NIC    │
                    └─────┬────┘
                          ↓
           ┌──────────────────────────┐
           │  NF_INET_PRE_ROUTING     │ ← Hook 1
           │  (raw, mangle, nat)      │
           └─────────┬────────────────┘
                     ↓
              Routing Decision
                 ↙        ↘
         Local          Forward
           ↓                ↓
  ┌────────────────┐  ┌────────────────┐
  │ NF_INET_       │  │ NF_INET_       │
  │ LOCAL_IN       │  │ FORWARD        │ ← Hook 2 (left), Hook 3 (right)
  │ (mangle,filter)│  │ (mangle,filter)│
  └───────┬────────┘  └────────┬───────┘
          ↓                    ↓
    Local Process      Routing Decision
          ↓                    ↓
  ┌────────────────┐  ┌────────────────┐
  │ NF_INET_       │  │ NF_INET_       │
  │ LOCAL_OUT      │  │ POST_ROUTING   │ ← Hook 4 (left), Hook 5 (right)
  │ (raw,mangle,   │  │ (mangle, nat)  │
  │  nat, filter)  │  └────────┬───────┘
  └───────┬────────┘           ↓
          ↓              ┌──────────┐
    Routing Decision     │   NIC    │
          ↓              └──────────┘
  ┌────────────────┐          ↓
  │ NF_INET_       │    Outgoing Packet
  │ POST_ROUTING   │ ← Hook 5
  │ (mangle, nat)  │
  └───────┬────────┘
          ↓
    ┌──────────┐
    │   NIC    │
    └──────────┘
          ↓
    Outgoing Packet

7.2 Connection Tracking (conntrack)

// Conntrack tracks connection state

// Conntrack status bits (apply to every tracked protocol, not only TCP):
enum ip_conntrack_status {
    IPS_EXPECTED        = (1 << 0),  // Expected connection
    IPS_SEEN_REPLY      = (1 << 1),  // Seen reply direction
    IPS_ASSURED         = (1 << 2),  // Established; not dropped early under pressure
    IPS_CONFIRMED       = (1 << 3),  // In conntrack table
    IPS_SRC_NAT         = (1 << 4),  // Source NAT applied
    IPS_DST_NAT         = (1 << 5),  // Destination NAT applied
};

// Conntrack table entry
struct nf_conn {
    struct nf_conntrack ct_general;

    // Tuple: uniquely identifies connection
    struct nf_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX];
    // [0] = ORIGINAL direction
    // [1] = REPLY direction

    // Connection state
    unsigned long status;

    // Protocol-specific data
    union nf_conntrack_proto proto;

    // Timeout
    unsigned long timeout;
};

// Example: Track TCP connection
// Client 192.168.1.100:45000 → Server 8.8.8.8:80

// ORIGINAL tuple:
//   src: 192.168.1.100:45000
//   dst: 8.8.8.8:80
//   proto: TCP

// REPLY tuple:
//   src: 8.8.8.8:80
//   dst: 192.168.1.100:45000
//   proto: TCP

// This allows matching packets in both directions!
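
The two-tuple idea can be sketched in plain C. The struct below is a hypothetical, simplified stand-in for struct nf_conntrack_tuple (IPv4 only, host byte order), and ct_invert() and ct_direction() are illustrative helpers, not kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a conntrack tuple (illustration only) */
struct ct_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;                 /* e.g. 6 for TCP */
};

/* The REPLY tuple is the ORIGINAL tuple with source and destination swapped */
static struct ct_tuple ct_invert(struct ct_tuple t)
{
    struct ct_tuple r = {
        .src_ip = t.dst_ip,     .dst_ip = t.src_ip,
        .src_port = t.dst_port, .dst_port = t.src_port,
        .proto = t.proto,
    };
    return r;
}

static bool ct_equal(const struct ct_tuple *a, const struct ct_tuple *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

/* Returns 0 if pkt travels in the ORIGINAL direction, 1 for REPLY,
 * and -1 if it belongs to a different connection. */
static int ct_direction(const struct ct_tuple *orig, const struct ct_tuple *pkt)
{
    struct ct_tuple reply = ct_invert(*orig);

    if (ct_equal(orig, pkt))
        return 0;
    if (ct_equal(&reply, pkt))
        return 1;
    return -1;
}
```

With NAT the stored REPLY tuple is no longer a plain inversion of the ORIGINAL one, which is why the kernel keeps both tuples instead of deriving one from the other.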

Performance Impact:

# View conntrack table
conntrack -L

# View statistics
conntrack -S
# cpu=0       found=1234 invalid=5 ignore=0 insert=567 ...

# Max connections
sysctl net.netfilter.nf_conntrack_max
# 65536

# Increase limit
sysctl -w net.netfilter.nf_conntrack_max=1048576

# Bypass conntrack for high-traffic flows (e.g., load balancer)
# (-j CT --notrack replaces the older -j NOTRACK target)
iptables -t raw -A PREROUTING -p tcp --dport 80 -j CT --notrack
iptables -t raw -A OUTPUT -p tcp --sport 80 -j CT --notrack

7.3 iptables Performance

# Rules are evaluated sequentially (O(n))
# Bad: 10,000 rules = slow!

iptables -A INPUT -s 1.2.3.4 -j DROP
iptables -A INPUT -s 1.2.3.5 -j DROP
# ... 9,998 more rules ...

# Better: Use ipset (hash table, O(1) lookup)
ipset create blocklist hash:ip
ipset add blocklist 1.2.3.4
ipset add blocklist 1.2.3.5
# ... add thousands ...

iptables -A INPUT -m set --match-set blocklist src -j DROP
# One rule, fast hash lookup!

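# Modern alternative: nftables
nft add table inet filter
nft add chain inet filter input { type filter hook input priority 0\; }
nft add set inet filter blocklist { type ipv4_addr\; }
nft add element inet filter blocklist { 1.2.3.4, 1.2.3.5 }
nft add rule inet filter input ip saddr @blocklist drop

# nftables compiles rules to bytecode for an in-kernel VM (similar in spirit
# to BPF) and has native sets and maps, so large match lists need one rule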

8. Network Buffers & Memory Management

8.1 Socket Buffers

// Each socket has send and receive buffers

// View socket buffer sizes
int rcvbuf, sndbuf;
socklen_t len = sizeof(rcvbuf);
getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
len = sizeof(sndbuf);
getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);

// Set larger buffers (important for high-BDP networks)
// Note: the kernel doubles the requested value to cover bookkeeping overhead,
// caps the request at rmem_max/wmem_max, and stops autotuning a buffer
// that was set explicitly.
int buf_size = 4 * 1024 * 1024;  // 4 MB
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &buf_size, sizeof(buf_size));
setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf_size, sizeof(buf_size));

# System-wide defaults
sysctl net.core.rmem_default  # Default receive buffer
sysctl net.core.wmem_default  # Default send buffer
sysctl net.core.rmem_max      # Max receive buffer
sysctl net.core.wmem_max      # Max send buffer

# TCP-specific (min, default, max)
sysctl net.ipv4.tcp_rmem  # 4096 131072 6291456
sysctl net.ipv4.tcp_wmem  # 4096 16384 4194304

Buffer Sizing for High Bandwidth-Delay Product:

BDP = Bandwidth × RTT

Example: 10 Gbps link, 100ms RTT
BDP = (10 × 10⁹ bits/sec) × (0.1 sec) / 8 bits/byte
    = 125 MB

Buffer should be >= BDP to fully utilize link!

# Configure
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"  # 128 MB max
sysctl -w net.core.rmem_max=134217728
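
The BDP arithmetic above is easy to script when sizing buffers for several links. A minimal sketch; bdp_bytes() is a hypothetical helper name, and it takes the RTT in microseconds to stay in integer arithmetic.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes.
 * bandwidth_bps: link rate in bits per second
 * rtt_us:        round-trip time in microseconds */
static uint64_t bdp_bytes(uint64_t bandwidth_bps, uint64_t rtt_us)
{
    /* bits/s -> bytes/s first, then scale by the RTT in seconds */
    return bandwidth_bps / 8 * rtt_us / 1000000;
}
```

For the 10 Gbps, 100 ms example, bdp_bytes(10000000000ULL, 100000) gives 125000000 bytes (125 MB), which fits under the 128 MB maximum configured above.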

8.2 TCP Autotuning

// Modern Linux auto-tunes TCP buffers (default: enabled)

// Kernel dynamically adjusts buffer size based on:
// 1. Measured RTT
// 2. Receive rate
// 3. Available memory

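// Algorithm (simplified from tcp_input.c; details vary by kernel version):
void tcp_rcv_space_adjust(struct sock *sk) {
    struct tcp_sock *tp = tcp_sk(sk);
    int time, copied;

    // Run at most once per measured RTT
    time = tcp_stamp_us_delta(tp->tcp_mstamp, tp->rcvq_space.time);
    if (time < (tp->rcv_rtt_est.rtt_us >> 3) || tp->rcv_rtt_est.rtt_us == 0)
        return;

    // Bytes the application consumed during the last RTT
    copied = tp->copied_seq - tp->rcvq_space.seq;

    // Grow the buffer if the application read more than in earlier rounds
    if (copied > tp->rcvq_space.space) {
        // Leave room for the sender to double its rate in the next RTT;
        // autotuning is bounded by tcp_rmem[2], not by rmem_max
        int rcvbuf = min(2 * copied, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);

        if (rcvbuf > sk->sk_rcvbuf)
            sk->sk_rcvbuf = rcvbuf;
        tp->rcvq_space.space = copied;
    }

    // Start a new measurement window
    tp->rcvq_space.seq = tp->copied_seq;
    tp->rcvq_space.time = tp->tcp_mstamp;
}

# Disable autotuning (if you want manual control)
echo 0 > /proc/sys/net/ipv4/tcp_moderate_rcvbuf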

9. Performance Monitoring & Debugging

9.1 Essential Tools

ss (socket statistics)

# View all TCP sockets
ss -tan

# Show TCP info (congestion control, RTT, etc.)
ss -ti

# Example output:
# ESTAB 0 0    192.168.1.100:45000  8.8.8.8:443
#  cubic wscale:7,7 rto:204 rtt:3.5/2 ato:40 mss:1448
#  cwnd:10 bytes_acked:12345 segs_out:100 segs_in:95

# Filter by state
ss -tan state established

# Show processes
ss -tap

# Watch in real-time
watch -n1 'ss -ti dst 8.8.8.8'

netstat

# Network statistics
netstat -s

# TCP stats
netstat -st | grep -i retrans
#     1234 segments retransmitted

# UDP stats
netstat -su

# Interface statistics
netstat -i

ethtool

# NIC statistics
ethtool -S eth0

# Example output (counter names are driver-specific):
#      rx_packets: 123456789
#      tx_packets: 98765432
#      rx_bytes: 567890123456
#      rx_errors: 0
#      rx_dropped: 123  ← Often a sign of ring buffer overflows
#      tx_dropped: 0

# Ring buffer size
ethtool -g eth0

# Increase ring buffer
ethtool -G eth0 rx 4096 tx 4096

# Offload features
ethtool -k eth0
#   tcp-segmentation-offload: on
#   generic-receive-offload: on

/proc & /sys

# TCP statistics
cat /proc/net/snmp | grep Tcp:
# Tcp: ... OutSegs RetransSegs ...

# Socket memory usage
cat /proc/net/sockstat
# TCP: inuse 123 orphan 0 tw 45 alloc 678 mem 89

# TCP sockets visible in the network namespace of <pid>
cat /proc/<pid>/net/tcp

# Network namespace
ls /proc/<pid>/ns/net

9.2 Tracing with BPF

# Trace TCP retransmissions with bpftrace (IPv4 sockets only)
# Struct access needs BTF or kernel headers on the traced machine.
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;

    // skc_num (local port) is in host byte order;
    // skc_dport is big-endian and must be swapped
    printf("TCP retransmit: %s:%d -> %s:%d\n",
           ntop($sk->__sk_common.skc_rcv_saddr),
           $sk->__sk_common.skc_num,
           ntop($sk->__sk_common.skc_daddr),
           bswap($sk->__sk_common.skc_dport));
}
'

# Trace packet drops, grouped by the kernel function that freed the skb
bpftrace -e '
tracepoint:skb:kfree_skb {
    @drops[ksym(args->location)] = count();
}

interval:s:5 {
    print(@drops);
    clear(@drops);
}
'

10. Interview Questions & Answers

Q1: Explain the sk_buff structure and why headroom/tailroom matter.

sk_buff: The fundamental packet data structure in Linux networking.

Memory Layout:

┌─────────────────────────────────────────────┐
│ head  data          tail  end              │
│  ↓     ↓             ↓     ↓                │
│  ├─────┼─────────────┼─────┤                │
│  │ HR  │ Valid Data  │ TR  │                │
│  └─────┴─────────────┴─────┘                │
│<-headroom-> <-len-> <-tailroom->             │
└─────────────────────────────────────────────┘

Why Headroom Matters: As a packet moves down the network stack (from app to wire), each layer adds a header:

Application data
+20 bytes TCP header (skb_push)
+20 bytes IP header (skb_push)
+14 bytes Ethernet header (skb_push)

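Without headroom, each layer would need to reallocate and copy the entire packet. With headroom, we just move the data pointer backwards.

Why Tailroom Matters:

For adding trailers (less common)
For appending data in place with skb_put(), e.g. padding a short frame or coalescing a small write into the last queued skb. Large TSO/GSO packets (kernel builds large packets, NIC splits them) mostly carry their payload in page fragments rather than in tailroom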

Performance Impact: Zero-copy header addition vs expensive realloc/memcpy.
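
The pointer arithmetic behind headroom and tailroom can be modelled in user space. This is a toy sketch, not the kernel's struct sk_buff: the toy_* names are hypothetical, and the helpers only mirror what skb_reserve(), skb_put(), and skb_push() do to the data and tail pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of the four sk_buff pointers: head <= data <= tail <= end */
struct toy_skb {
    uint8_t *head, *data, *tail, *end;
};

/* Set up an empty buffer and reserve headroom, like skb_reserve() */
static void toy_init(struct toy_skb *s, uint8_t *buf, size_t size,
                     size_t headroom)
{
    s->head = buf;
    s->end = buf + size;
    s->data = s->tail = buf + headroom;
}

static size_t toy_headroom(const struct toy_skb *s) { return (size_t)(s->data - s->head); }
static size_t toy_tailroom(const struct toy_skb *s) { return (size_t)(s->end - s->tail); }
static size_t toy_len(const struct toy_skb *s)      { return (size_t)(s->tail - s->data); }

/* Like skb_put(): extend the data area at the tail. NULL if out of tailroom. */
static uint8_t *toy_put(struct toy_skb *s, size_t len)
{
    uint8_t *old_tail = s->tail;

    if (toy_tailroom(s) < len)
        return NULL;
    s->tail += len;
    return old_tail;
}

/* Like skb_push(): prepend a header by moving data backwards.
 * NULL if out of headroom. No payload bytes are copied. */
static uint8_t *toy_push(struct toy_skb *s, size_t len)
{
    if (toy_headroom(s) < len)
        return NULL;
    s->data -= len;
    return s->data;
}
```

Reserving 54 bytes of headroom (20 TCP + 20 IP + 14 Ethernet) lets all three headers be prepended with three pointer decrements, after which the headroom is exactly used up.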

Q2: How does NAPI improve packet processing performance?

Problem with Old Interrupt Model:

Each packet → hardware interrupt
At 1 Gbps with minimum-size frames (~1.5M packets/sec), the CPU can spend nearly all of its time handling interrupts
This is known as an “interrupt storm” or “receive livelock”

NAPI Solution (New API):

Low Traffic (interrupt mode):

1. Packet arrives → IRQ
2. Driver disables NIC interrupts
3. Schedules NAPI poll
4. Returns immediately from IRQ

High Traffic (polling mode):

5. Softirq calls driver’s poll() function
6. Poll processes up to budget packets (default 64)
7. If more packets remain, stay in polling mode
8. If ring buffer empty, re-enable interrupts

Benefits:

Low latency (low load): Interrupts still used
High throughput (high load): Polling avoids interrupt overhead
Fairness: Budget prevents one NIC from starving others
Adaptive: Automatically switches modes

Key Insight: Interrupts tell us “work is available”, then we switch to polling to batch the work.
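
The interrupt-then-poll handoff can be simulated with a small state machine. This is a toy user-space model with hypothetical names (toy_nic, toy_irq, toy_poll); real drivers use napi_schedule() and napi_complete_done() for the same transitions.

```c
#include <stdbool.h>

/* Toy model of a NIC RX ring and its NAPI state */
struct toy_nic {
    int  pending;         /* packets waiting in the RX ring */
    bool irq_enabled;     /* may the NIC raise RX interrupts? */
    bool poll_scheduled;  /* is a poll pass queued in softirq? */
};

/* Hard-IRQ handler: mask further RX interrupts and schedule polling */
static void toy_irq(struct toy_nic *nic)
{
    nic->irq_enabled = false;
    nic->poll_scheduled = true;
}

/* Softirq poll: process at most budget packets, return how many were done.
 * Using less than the full budget means the ring is empty, so the driver
 * leaves polling mode and re-enables interrupts. */
static int toy_poll(struct toy_nic *nic, int budget)
{
    int done = 0;

    while (done < budget && nic->pending > 0) {
        nic->pending--;   /* stand-in for building and delivering one skb */
        done++;
    }

    if (done < budget) {
        nic->poll_scheduled = false;
        nic->irq_enabled = true;
    }
    return done;
}
```

With 150 packets queued and a budget of 64, one interrupt leads to three poll passes (64, 64, 22); only the last pass re-enables interrupts.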

Q3: What is XDP and how does it achieve such high performance?

XDP (eXpress Data Path): Runs eBPF programs at the earliest possible point in packet processing.

Traditional Path:

NIC → DMA → Driver → Allocate sk_buff → Protocol Stack → ...
         ↑
     ~500ns
                    ↑
                 ~200ns + overhead

XDP Path:

NIC → DMA → Driver → XDP Program (eBPF) → Decision
         ↑           ↑
     ~500ns      ~100ns

Why So Fast:

No sk_buff allocation: Operating directly on DMA buffer
Fewer cache misses: The program touches the packet right after DMA, while the buffer is still cache-hot (on platforms that DMA into the CPU cache)
No context switches: Runs in softirq context
Early drop: Can discard packets before any processing
JIT compiled: eBPF → native machine code

Actions:

XDP_DROP: Discard (DDoS mitigation at 10M+ pps)
XDP_TX: Bounce back same interface (L2 load balancer)
XDP_REDIRECT: Send to different NIC or CPU
XDP_PASS: Continue to normal stack

Use Cases:

DDoS mitigation
Load balancing
Packet filtering
Network monitoring

Limitation: No stack helpers at this layer. A program that rewrites headers must bounds-check every access and recompute checksums itself.

Q4: Explain TCP Fast Path vs Slow Path.

Fast Path (common case optimization):

Conditions:

TCP connection is ESTABLISHED
Packet arrives in-order (seq == rcv_nxt)
No flags except ACK (PSH is ignored by the check)
Receive window not full
No urgent data
Checksum valid

Processing:

if (fast_path_conditions) {
    memcpy(user_buffer, packet_data, len);  // Direct copy
    rcv_nxt += len;
    send_ack();
    wake_up_application();
    return;  // Done in ~1µs!
}

Slow Path (handles complex cases):

Triggers:

Out-of-order segment (requires reassembly)
Retransmission (update RTO, congestion window)
Connection management (SYN, FIN, RST)
Options processing beyond timestamps (e.g. SACK blocks)
Zero window probing

Processing:

// Complex state machine
validate_sequence_numbers();
check_for_duplicate_acks();
update_rtt_estimates();
process_sack_blocks();
reorder_out_of_order_segments();
update_congestion_window();
// ... many more checks ...
// ~5-10µs

Impact: Fast path handles 90%+ of packets in established bulk-data transfers. Slow path ensures correctness for edge cases.

Optimization: Keep connections in fast path by:

Avoiding packet loss (good network)
Using large enough buffers (avoid window full)
Minimizing out-of-order delivery (good QoS)

Q5: How does RSS/RPS/RFS distribute packet processing across CPU cores?

Problem: Single NIC queue → all packets processed on one CPU → bottleneck

Solution 1: RSS (Receive Side Scaling) - Hardware

NIC has multiple RX queues (e.g., 8 queues)
NIC hashes (src_ip, dst_ip, src_port, dst_port), typically with a Toeplitz hash, and uses the result to look up the queue in an indirection table
Each queue has dedicated IRQ mapped to specific CPU
Result: Packets distributed across CPUs in hardware

Pros: No CPU overhead, very fast
Cons: Requires multi-queue NIC

Solution 2: RPS (Receive Packet Steering) - Software

Single queue NIC
CPU receiving IRQ computes hash
Enqueues packet to target CPU’s backlog
Target CPU processes packet

Pros: Works with any NIC
Cons: Extra CPU overhead for steering

Solution 3: RFS (Receive Flow Steering) - Locality optimization

Extension of RPS
Tracks which CPU application is running on
Steers packets to that specific CPU
Result: Packet data in cache when application reads it

Example:

Without RFS:
  Packet → CPU0 (processes) → CPU2 (app blocked on recv)
  → Cache miss when app reads data

With RFS:
  Packet → CPU2 (processes + app runs here)
  → Data in cache, very fast!

Configuration:

# RSS (hardware, automatic if NIC supports)
ethtool -l eth0  # Show queue count

# RPS
echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus  # CPUs 0-3

# RFS
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
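
The property all three schemes rely on is that every packet of a flow hashes to the same queue or CPU. A minimal sketch of that mapping follows. The hash here is an FNV-1a style stand-in chosen for brevity (an assumption, not what NICs use; real RSS hardware typically applies a keyed Toeplitz hash), and struct flow, flow_hash(), and pick_queue() are hypothetical names.

```c
#include <stdint.h>

#define RETA_SIZE 128   /* indirection table size, illustrative value */

struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Stand-in flow hash: FNV-1a over the 4-tuple, byte by byte */
static uint32_t flow_hash(const struct flow *f)
{
    uint32_t words[3] = {
        f->src_ip,
        f->dst_ip,
        ((uint32_t)f->src_port << 16) | f->dst_port,
    };
    uint32_t h = 2166136261u;

    for (int i = 0; i < 3; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xffu;
            h *= 16777619u;
        }
    }
    return h;
}

/* RSS-style lookup: the hash indexes an indirection table whose
 * entries are RX queue numbers */
static unsigned pick_queue(const struct flow *f, const uint8_t reta[RETA_SIZE])
{
    return reta[flow_hash(f) % RETA_SIZE];
}
```

The indirection table is what lets an administrator rebalance load by rewriting table entries without changing the hash function.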

Q6: What is connection tracking (conntrack) and why can it be a bottleneck?

Connection Tracking: Kernel subsystem that tracks state of all connections (TCP, UDP, ICMP).

Purpose:

Enable stateful firewall rules
NAT (must remember translations)
Connection-based filtering

How It Works:

Client initiates connection:
  192.168.1.100:45000 → 8.8.8.8:80

Conntrack creates entry:
  ORIGINAL: 192.168.1.100:45000 → 8.8.8.8:80 [NEW]
  REPLY:    8.8.8.8:80 → 192.168.1.100:45000

Subsequent packets match entry (both directions!)

States: NEW → ESTABLISHED → FIN_WAIT → CLOSE → destroyed

Performance Issues:

Hash table lookup: O(1) but still overhead on every packet
Global lock: (Older kernels) serializes all conntrack operations
Memory: Each connection consumes memory (~300 bytes)
Hash collisions: Degrade to O(n) lookup

Symptoms:

# Table full
dmesg | grep conntrack
# nf_conntrack: table full, dropping packet

# High CPU in conntrack
perf top
#   12.34%  [kernel]  [k] nf_conntrack_in

Solutions:

# 1. Increase table size
sysctl -w net.netfilter.nf_conntrack_max=1048576
sysctl -w net.netfilter.nf_conntrack_buckets=262144

# 2. Decrease timeout for short-lived connections
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

# 3. Bypass conntrack for specific traffic (e.g., load balancer)
# (-j CT --notrack replaces the older -j NOTRACK target)
iptables -t raw -A PREROUTING -p tcp --dport 80 -j CT --notrack
iptables -t raw -A OUTPUT -p tcp --sport 80 -j CT --notrack
# WARNING: Breaks stateful rules for this traffic!

# 4. Use connection-less alternatives
# - eBPF/XDP for filtering (bypasses conntrack entirely)
# - Stateless firewall rules where possible

When to bypass: High-traffic stateless services (load balancers, DNS servers, CDN edges).

Q7: Explain zero-copy networking techniques (sendfile, MSG_ZEROCOPY, splice).

Problem: Traditional send/receive involves multiple memory copies.

Traditional Path (4 copies!):

Sending a file:

1. read(file_fd, buffer, size)
   Disk → [Kernel buffer] → [User buffer]
        DMA copy          CPU copy

2. write(socket_fd, buffer, size)
   [User buffer] → [Socket buffer] → NIC
   CPU copy         DMA copy

Total: 2 DMA + 2 CPU copies

Zero-Copy Techniques:

1. sendfile():

sendfile(socket_fd, file_fd, NULL, file_size);

// Kernel path:
Disk → [Kernel buffer] ────→ NIC
      DMA copy        DMA copy (if NIC supports)
                   or page remapping

// Avoids user-space copies entirely!
// Best for static file serving (nginx, Apache)

2. splice():

// Move data between FDs via pipe (zero-copy)
splice(file_fd, NULL, pipe_fd[1], NULL, size, 0);
splice(pipe_fd[0], NULL, socket_fd, NULL, size, 0);

// Kernel passes page references through the pipe buffer; no payload memcpy

3. MSG_ZEROCOPY:

send(socket_fd, buffer, size, MSG_ZEROCOPY);

// Kernel:
// 1. Pin user pages in memory (increment refcount)
// 2. NIC DMAs directly from user buffer
// 3. After NIC finishes, kernel notifies app via error queue
// 4. App can now reuse buffer

// Benefit: No copy for large sends
// Caveat: Buffer must stay valid, async notification needed

4. mmap() + write():

void *data = mmap(NULL, size, PROT_READ, MAP_SHARED, file_fd, 0);
write(socket_fd, data, size);

// May avoid one copy if kernel is smart
// But less efficient than sendfile()

Performance Impact:

Method	CPU copies (DMA transfers not counted)	Use Case
Traditional	2	Small data, flexibility needed
sendfile()	0-1	Static file serving
splice()	0	Piping data between FDs
MSG_ZEROCOPY	0	Large sends (>10KB)

When to use:

sendfile(): Web server serving files
splice(): Proxy/gateway (socket → socket)
MSG_ZEROCOPY: Bulk data transfer, streaming

Q8: How does TCP congestion control work? Compare CUBIC vs BBR.

Problem: Send too fast → network congestion → packet loss. Send too slow → underutilize bandwidth.

Goal: Find optimal sending rate (maximize throughput, minimize loss).

CUBIC (Linux default):

Algorithm:

Maintains congestion window (cwnd) = max packets in flight
On loss: cwnd = cwnd × β, with β = 0.7 (reduce by 30%)
Recovery: Grow cwnd using cubic function

cwnd(t) = C × (t - K)³ + W_max

Where:
  W_max = window before loss
  K = time to grow back to W_max after a loss, K = ∛(W_max × (1 − β) / C)
  C = scaling constant
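
A worked example makes the shape of the curve concrete. It assumes the constants from RFC 8312 (C = 0.4, β = 0.7) and a hypothetical window of W_max = 100 segments at the time of loss:

```latex
K = \sqrt[3]{\frac{W_{\max}\,(1-\beta)}{C}}
  = \sqrt[3]{\frac{100 \times 0.3}{0.4}}
  = \sqrt[3]{75} \approx 4.2\ \text{s}

W(0) = C\,(0-K)^3 + W_{\max} = -0.4 \times 75 + 100 = 70 = \beta\,W_{\max}

W(K) = C\,(K-K)^3 + W_{\max} = 100 = W_{\max}
```

The window therefore restarts at 70 segments right after the loss, flattens out as it approaches 100 segments about 4.2 seconds later, and grows faster again beyond that point while probing for more bandwidth.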

Behavior:

cwnd
  ↑
  │     ╱╲
  │    ╱  ╲    ← Cubic growth
  │   ╱    ╲╱
  │  ╱
  │ ╱
  └──────────→ time
      ↑ loss (reduce 30%)

Pros:

Aggressive growth after loss (good for high-bandwidth links)
Fair to other CUBIC flows
Simple, well-tested

Cons:

Treats loss as congestion signal (wrong for wireless)
Can cause bufferbloat (fills queues)
Slow convergence on very high BDP links

BBR (Bottleneck Bandwidth and RTT):

Philosophy: Model the network, don’t react to loss.

Measures:

BtlBw (bottleneck bandwidth): Max delivery rate observed
RTprop (round-trip propagation time): Min RTT observed

Pacing Rate = BtlBw × gain

Phases:

STARTUP: Exponential growth to find BtlBw (like slow start)
DRAIN: Drain queues created during startup
PROBE_BW: Oscillate pacing rate around BtlBw (main phase)
PROBE_RTT: Periodically reduce cwnd to re-measure RTprop

Key Insight: Packet loss doesn’t always mean congestion!

Wireless networks drop packets due to RF interference
BBR (v1) does not use loss as its primary congestion signal; it focuses on measured bandwidth and RTT

Pros:

Higher throughput on lossy links (wireless, satellite)
Lower latency (doesn’t fill buffers)
Better on bufferbloat-prone networks

Cons:

Can be unfair to CUBIC flows (more aggressive)
Requires accurate RTT measurement
More complex

When to Use:

Scenario	Best Choice
Data center (low latency, rare loss)	CUBIC
Internet (bufferbloat common)	BBR
Wireless/satellite (lossy)	BBR
Mixed traffic	BBR (lower latency helps all)

Configuration:

# Check current
sysctl net.ipv4.tcp_congestion_control

# Change to BBR
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Per-connection (from application code, in C)
setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3);
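
The per-socket setting can also be read back, which is useful for confirming that the requested algorithm is actually loaded. A minimal sketch; get_cc() and set_cc() are hypothetical wrapper names around the TCP_CONGESTION socket option.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Copy the socket's congestion control name into name (size bytes).
 * Returns 0 on success, -1 on error. */
static int get_cc(int fd, char *name, socklen_t size)
{
    socklen_t len = size;

    if (size == 0)
        return -1;
    memset(name, 0, size);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len) < 0)
        return -1;
    name[size - 1] = '\0';   /* guarantee termination */
    return 0;
}

/* Select a congestion control algorithm for this socket only.
 * Fails (ENOENT) if the algorithm is not available on this kernel. */
static int set_cc(int fd, const char *name)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                      name, (socklen_t)strlen(name));
}
```

A 16-byte buffer is enough for get_cc(), since the kernel limits algorithm names to 16 bytes including the terminator. Calling set_cc(fd, "bbr") only succeeds when the tcp_bbr module is built in or loaded.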

Summary

Key Takeaways:

sk_buff: Central data structure. Understanding headroom/tailroom is key to zero-copy optimizations.
NAPI: Hybrid interrupt/polling model solves interrupt storm problem at high packet rates.
XDP: Fastest packet processing path. Process/drop packets before sk_buff allocation using eBPF.
RSS/RPS/RFS: Distribute packet processing across CPUs for scalability. RFS optimizes for cache locality.
TCP Fast Path: Handles common case (in-order delivery) with minimal overhead. Slow path handles edge cases.
Congestion Control: CUBIC (default) vs BBR (better on lossy/bufferbloat links). Understand trade-offs.
Zero-Copy: sendfile(), splice(), MSG_ZEROCOPY eliminate expensive memory copies for large transfers.
Conntrack: Essential for stateful firewalls but can be bottleneck. Bypass for high-traffic stateless services.

Performance Checklist:

Enable multi-queue NIC and RSS
Tune socket buffers for high BDP
Use XDP for packet filtering
Enable BBR for internet traffic
Increase conntrack table for high connection count
Use zero-copy for large data transfers
