Kernel Networking Stack

The networking subsystem is one of the most complex and performance-critical parts of the Linux kernel. Understanding how packets flow from the Network Interface Card (NIC) through kernel layers to application sockets is essential for building high-performance networked systems.
Mastery Level: Senior Systems Engineer
Key Internals: sk_buff, NAPI, RSS/RPS/RFS, XDP, TCP congestion control, Netfilter
Prerequisites: Interrupts, Virtual Memory

1. The Network Stack Architecture

1.1 Layer Overview

The Linux network stack is layered along the lines of the TCP/IP model (a condensed form of the OSI model), with Linux-specific structures at each layer:
┌──────────────────────────────────────────────────────┐
│         Application Layer (User Space)               │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────┐ │
│  │  Web Server  │  │   Database   │  │    DNS     │ │
│  └──────┬───────┘  └──────┬───────┘  └─────┬──────┘ │
│         │                 │                 │        │
│    Socket API (read/write/send/recv)        │        │
└─────────┼─────────────────┼─────────────────┼────────┘
          │                 │                 │
          ↓ Syscall         ↓                 ↓
┌──────────────────────────────────────────────────────┐
│                  Kernel Space                         │
│  ┌───────────────────────────────────────────────┐   │
│  │  Socket Layer (struct socket, struct sock)    │   │
│  │  - File descriptor management                 │   │
│  │  - Socket buffer management                   │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Transport Layer (TCP/UDP/SCTP)               │   │
│  │  - Segmentation / Reassembly                  │   │
│  │  - Flow control, Congestion control           │   │
│  │  - Port management                            │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Network Layer (IPv4/IPv6)                    │   │
│  │  - Routing decisions (FIB lookup)             │   │
│  │  - Fragmentation / Reassembly                 │   │
│  │  - Netfilter hooks (firewall)                 │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Link Layer (Ethernet, WiFi)                  │   │
│  │  - ARP resolution                             │   │
│  │  - MAC address handling                       │   │
│  │  - Queueing disciplines (tc)                  │   │
│  └─────────────────┬─────────────────────────────┘   │
│                    ↓                                  │
│  ┌───────────────────────────────────────────────┐   │
│  │  Network Device (struct net_device)           │   │
│  │  - Driver interface                           │   │
│  │  - Ring buffer management                     │   │
│  └─────────────────┬─────────────────────────────┘   │
└────────────────────┼─────────────────────────────────┘
                     ↓ DMA
          ┌──────────────────────┐
          │  Network Interface   │
          │  Card (NIC)          │
          │  - RX/TX Ring Buffers│
          │  - Checksumming      │
          │  - TSO/LRO offload   │
          └──────────────────────┘

              Physical Network

2. The Core Data Structure: sk_buff

The struct sk_buff (socket buffer) is the heart of the Linux networking stack. It represents a network packet as it travels through the kernel.

Think of an sk_buff like a shipping envelope with adjustable flaps. As a packet moves down the stack (application to wire), each layer adds a header by pulling the front flap forward. As it moves up (wire to application), each layer strips a header by pushing the flap back. The clever part: the packet data never moves in memory; only the pointers change. This is a large part of what keeps Linux networking fast even at millions of packets per second.

2.1 sk_buff Structure

// Simplified from include/linux/skbuff.h
struct sk_buff {
    // Linked list management
    struct sk_buff *next;
    struct sk_buff *prev;

    // Socket association
    struct sock *sk;

    // Timestamps
    ktime_t tstamp;

    // Network device
    struct net_device *dev;

    // Data pointers (THE KEY TO UNDERSTANDING NETWORKING)
    unsigned char *head;      // Start of allocated buffer
    unsigned char *data;      // Start of valid data
    sk_buff_data_t tail;      // End of valid data
    sk_buff_data_t end;       // End of allocated buffer

    // Buffer size information
    unsigned int len;         // Total data length (linear area + page fragments)
    unsigned int data_len;    // Length held in page fragments only (0 if linear)

    // Protocol headers (updated as packet moves up/down stack)
    __u16 transport_header;   // Offset to transport header (TCP/UDP)
    __u16 network_header;     // Offset to network header (IP)
    __u16 mac_header;         // Offset to MAC header (Ethernet)

    // Metadata
    __u32 priority;
    __u8 ip_summed;          // Checksum status
    __u8 cloned:1;           // Is this a clone?
    __u8 nohdr:1;

    // Reference counting
    refcount_t users;

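    // Memory accounting
    unsigned int truesize;   // Total memory charged: data buffer + sk_buff overhead

    // Shared info (struct skb_shared_info) is not a member of sk_buff: it sits
    // at the end of the data buffer and is reached with skb_shinfo(skb).
    // It holds dataref (the buffer's reference count) and the frags[] array.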
};

2.2 sk_buff Memory Layout

Understanding the memory layout is crucial for understanding zero-copy optimizations:
Packet Reception (growing from head toward tail):
─────────────────────────────────────────────────────

┌─────────────────────────────────────────────────────┐
│                                                       │
│  head                                         end    │
│   ↓                                            ↓     │
│   ┌──────────┬───────────┬──────────┬─────────┐     │
│   │ headroom │  Eth Hdr  │  IP Hdr  │ TCP Hdr │ ... │
│   │ (unused) │  (14 B)   │  (20 B)  │ (20 B)  │ ... │
│   └──────────┴───────────┴──────────┴─────────┘     │
│              ↑                                 ↑     │
│             data                             tail    │
│                                                       │
│   <-headroom->  <----- len (data length) ---->       │
│   <-------------- truesize (total alloc) ----------> │
└─────────────────────────────────────────────────────┘

Header Pointers:
  mac_header     → Points to Ethernet header
  network_header → Points to IP header
  transport_header → Points to TCP header

Key Operations:
  skb_push()  - Decrease data pointer (add header)
  skb_pull()  - Increase data pointer (remove header)
  skb_put()   - Increase tail pointer (add data at end)
  skb_trim()  - Decrease tail pointer (remove data at end)
Example: Adding Ethernet Header:
// Before: data points to IP header
┌────────────────────────────────┐
│ headroom │ IP Hdr │ TCP Hdr │  │
└──────────┴────────┴─────────┴──┘
           ↑ data

// Add Ethernet header
unsigned char *eth_hdr = skb_push(skb, ETH_HLEN);  // ETH_HLEN = 14

// After: data now points to Ethernet header
┌────────────────────────────────┐
│ headrm │Eth│ IP Hdr │ TCP Hdr │ │
└────────┴───┴────────┴─────────┴─┘
         ↑ data
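
The pointer arithmetic behind these operations can be tried out in user space. The sketch below is a toy model, not kernel code: `struct mini_skb` and the `mini_skb_*` functions are hypothetical names that mimic `skb_reserve()`, `skb_put()`, `skb_push()`, and `skb_pull()` on a fixed 256-byte buffer. One deliberate difference: the real helpers treat overrun as a fatal bug, while this model returns NULL.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for the four sk_buff data pointers. */
struct mini_skb {
    unsigned char buf[256];
    unsigned char *head;   /* start of allocated buffer */
    unsigned char *data;   /* start of valid data */
    unsigned char *tail;   /* end of valid data */
    unsigned char *end;    /* end of allocated buffer */
    unsigned int len;      /* bytes between data and tail */
};

/* Empty buffer with `headroom` bytes reserved in front (like skb_reserve). */
static void mini_skb_init(struct mini_skb *skb, size_t headroom) {
    skb->head = skb->buf;
    skb->end = skb->buf + sizeof(skb->buf);
    skb->data = skb->tail = skb->head + headroom;
    skb->len = 0;
}

/* Append n bytes at the tail (like skb_put). Returns the start of the new area. */
static unsigned char *mini_skb_put(struct mini_skb *skb, size_t n) {
    if ((size_t)(skb->end - skb->tail) < n)
        return NULL;                      /* not enough tailroom */
    unsigned char *old_tail = skb->tail;
    skb->tail += n;
    skb->len += (unsigned int)n;
    return old_tail;
}

/* Prepend n bytes by moving data backwards into the headroom (like skb_push). */
static unsigned char *mini_skb_push(struct mini_skb *skb, size_t n) {
    if ((size_t)(skb->data - skb->head) < n)
        return NULL;                      /* not enough headroom */
    skb->data -= n;
    skb->len += (unsigned int)n;
    return skb->data;
}

/* Strip n bytes from the front by moving data forwards (like skb_pull). */
static unsigned char *mini_skb_pull(struct mini_skb *skb, size_t n) {
    if (n > skb->len)
        return NULL;                      /* cannot strip more than is there */
    skb->data += n;
    skb->len -= (unsigned int)n;
    return skb->data;
}
```

Reserving 64 bytes of headroom, putting a 100-byte payload, then pushing 20 + 20 + 14 bytes of headers leaves `data` 10 bytes past `head` with `len` at 154, and the payload bytes never move.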

2.3 Zero-Copy Mechanisms

Problem: Copying large packets is expensive because it is limited by memory bandwidth. On a 100 Gbps NIC, the CPU would spend nearly all its time in memcpy() if every packet required a full copy. Zero-copy techniques are what make high-speed networking possible on commodity hardware.
Solution 1: skb_clone() - Clone the sk_buff structure, share the data
struct sk_buff *clone = skb_clone(original_skb, GFP_ATOMIC);

Original:        Clone:
┌──────────┐    ┌──────────┐
│ sk_buff  │    │ sk_buff  │
│  struct  │    │  struct  │
│          │    │          │
│ head ────┼───>│ head ────┼──┐
│ data ────┼───>│ data ────┼──┤  Points to SAME buffer
│ tail     │    │ tail     │  │
└──────────┘    └──────────┘  │

                    ┌────────────────────┐
                    │  Shared Data Buffer │
                    │  (refcount = 2)     │
                    └────────────────────┘

Use case: Packet sniffing (tcpdump)
- Clone packet for sniffer
- Original continues up the stack
- No data copy!
Solution 2: Fragmented Data (skb_frag) - For large packets
struct skb_shared_info {
    unsigned char nr_frags;  // Number of fragments
    skb_frag_t frags[MAX_SKB_FRAGS];  // Fragment array
};

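// Layout used by older kernels; recent kernels define skb_frag_t
// in terms of struct bio_vec, but the three fields play the same role
typedef struct skb_frag_struct {
    struct page *page;       // Page holding this fragment's data
    __u16 page_offset;       // Offset within page
    __u16 size;              // Fragment size
} skb_frag_t;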

Layout for 9000-byte packet (Jumbo frame):
┌──────────────────────────────────────────────┐
│ sk_buff                                      │
│  head → [headers: 54 bytes]                  │
│  data → [first chunk of data]                │
└──┬───────────────────────────────────────────┘

   └→ skb_shared_info
      ├→ frag[0] → page A (1500 bytes)
      ├→ frag[1] → page B (1500 bytes)
      ├→ frag[2] → page C (1500 bytes)
      └→ frag[3] → page D (remainder)

NIC DMAs directly into these pages (zero-copy receive!)

2.4 sk_buff Operations

#include <linux/skbuff.h>

// Add Ethernet header (14 bytes)
struct ethhdr *eth = (struct ethhdr *)skb_push(skb, sizeof(struct ethhdr));
eth->h_proto = htons(ETH_P_IP);
memcpy(eth->h_dest, dst_mac, ETH_ALEN);
memcpy(eth->h_source, src_mac, ETH_ALEN);

// Remove Ethernet header when moving up stack
skb_pull(skb, sizeof(struct ethhdr));

// Access network header
struct iphdr *iph = ip_hdr(skb);  // Inline helper: (struct iphdr *)skb_network_header(skb)

// Access transport header
struct tcphdr *tcph = tcp_hdr(skb);

3. Packet Reception: From Wire to Socket

3.1 The Legacy Interrupt-Driven Model (Pre-NAPI)

Old Approach (before 2.5 kernel):
1. Packet arrives at NIC
   └→ NIC DMAs packet to ring buffer
   └→ NIC raises hardware IRQ

2. CPU receives interrupt
   └→ Interrupted task's registers saved (not a full context switch)
   └→ Jump to interrupt handler

3. Interrupt handler (top half)
   └→ Disable interrupts (critical section)
   └→ Allocate sk_buff
   └→ Copy packet data from ring buffer to sk_buff
   └→ Queue sk_buff to network stack
   └→ Re-enable interrupts
   └→ Return from interrupt

4. Network stack processes packet (bottom half)
   └→ IP layer processing
   └→ TCP layer processing
   └→ Socket layer delivery

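Problem: One interrupt per packet. At 1 Gbps with minimum-size (64-byte) frames
         that is up to ~1.48 million interrupts/sec
Result: The CPU spends nearly all its time in interrupt handling and makes no
        progress on anything else (receive livelock)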
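
To see how fast interrupts can arrive in the worst case, the packet rate can be derived from the link speed. The sketch below assumes standard Ethernet framing, where every frame costs an extra 20 bytes on the wire (7-byte preamble, 1-byte start delimiter, 12-byte inter-frame gap). The function names are illustrative only.

```c
#include <stdint.h>

/* Bytes on the wire per frame beyond the frame itself:
 * 7 (preamble) + 1 (start-of-frame delimiter) + 12 (inter-frame gap). */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Maximum frames per second for a link speed in bits/s and a frame size
 * in bytes (frame size includes the 4-byte FCS, e.g. 64 for a minimum frame). */
static uint64_t max_frames_per_sec(uint64_t link_bps, uint32_t frame_bytes) {
    uint64_t bits_per_frame = (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;
    return link_bps / bits_per_frame;
}

/* Time budget per frame in nanoseconds at line rate (rounded down). */
static uint64_t ns_per_frame(uint64_t link_bps, uint32_t frame_bytes) {
    uint64_t bits_per_frame = (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;
    return bits_per_frame * 1000000000ull / link_bps;
}
```

For 1 Gbps and 64-byte frames this gives 1,488,095 frames per second, or 672 ns per frame. With one interrupt per frame, that budget has to cover interrupt entry, the handler, and all protocol processing.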

3.2 NAPI: New API (Polling + Interrupts)

Solution: Hybrid polling/interrupt model. The analogy: imagine a doorbell that rings every time a letter arrives. If you get one letter per hour, the doorbell is helpful. If you get 1,000 letters per second, you would never leave the door. NAPI’s approach: after the first ring, disable the doorbell and check the mailbox in batches until it is empty, then re-enable the doorbell.
Receive Flow with NAPI:
────────────────────────

Phase 1: Low Traffic (Interrupt Mode)
┌─────────────────────────────────────────┐
│ Packet arrives                          │
│  ↓                                      │
│ NIC raises IRQ                          │
│  ↓                                      │
│ Driver IRQ handler:                     │
│  • Disable NIC interrupts               │
│  • Schedule NAPI poll (add to poll_list)│
│  • Return (very fast!)                  │
│  ↓                                      │
│ Softirq NET_RX_SOFTIRQ triggers         │
│  ↓                                      │
│ net_rx_action():                        │
│  • Call driver->poll()                  │
│  • Process up to budget packets (64)    │
│  • If ring empty: re-enable IRQ         │
└─────────────────────────────────────────┘

Phase 2: High Traffic (Polling Mode)
┌─────────────────────────────────────────┐
│ poll() processes 64 packets             │
│  ↓                                      │
│ More packets in ring buffer?            │
│  • YES: Stay in polling mode            │
│  • Keep calling poll() until ring empty │
│  • No interrupts needed!                │
│  ↓                                      │
│ Eventually ring empties                 │
│  • Re-enable NIC interrupts             │
│  • Wait for next packet                 │
└─────────────────────────────────────────┘
Kernel Code (simplified from net/core/dev.c):
// Driver's NAPI poll function
static int my_driver_poll(struct napi_struct *napi, int budget) {
    struct my_adapter *adapter = container_of(napi, struct my_adapter, napi);
    int work_done = 0;

    while (work_done < budget) {
        // Check if ring buffer has packets
        if (ring_buffer_empty(adapter))
            break;

        // Fetch packet from ring buffer
        struct sk_buff *skb = fetch_packet_from_ring(adapter);

        // Set metadata
        skb->dev = adapter->netdev;
        skb->protocol = eth_type_trans(skb, adapter->netdev);

        // Pass to network stack
        netif_receive_skb(skb);

        work_done++;
    }

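    // If we processed less than budget, the ring is empty
    if (work_done < budget) {
        // Exit polling mode; returns false if another poll was requested
        if (napi_complete_done(napi, work_done))
            enable_nic_interrupts(adapter);  // Driver-specific counterpart of disable_nic_interrupts()
    }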

    return work_done;
}

// IRQ handler (top half)
static irqreturn_t my_driver_irq_handler(int irq, void *data) {
    struct my_adapter *adapter = data;

    // Disable NIC interrupts
    disable_nic_interrupts(adapter);

    // Schedule NAPI polling
    napi_schedule(&adapter->napi);

    return IRQ_HANDLED;
}

// Network core: softirq handler
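static void net_rx_action(struct softirq_action *h) {
    struct list_head *poll_list = this_cpu_ptr(&softnet_data.poll_list);
    int budget = netdev_budget;  // Total packets per softirq run, default: 300
    unsigned long time_limit = jiffies + usecs_to_jiffies(netdev_budget_usecs);

    while (!list_empty(poll_list)) {
        struct napi_struct *napi = list_first_entry(poll_list, ...);

        // Each device may process at most its own weight (typically 64) per call
        int work = napi->poll(napi, napi->weight);
        budget -= work;

        // (Elided: a NAPI instance that completed is removed from poll_list;
        //  one that used its full weight is moved to the tail.)

        if (budget <= 0 || time_after(jiffies, time_limit))
            break;  // Yield CPU; remaining work is picked up by the next softirq
    }
}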
Benefits:
  1. Low latency under low load: Interrupts still used
  2. High throughput under high load: Polling avoids interrupt overhead
  3. Fairness: Budget limits per-device processing
  4. CPU efficiency: No interrupt storm
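
The interrupt-to-polling handoff can be modelled in a few lines of user-space C. This is a toy simulation under simplifying assumptions (one device, packets counted rather than stored); `struct sim_nic` and the `sim_*` functions are hypothetical names, not kernel APIs.

```c
#include <stdbool.h>

/* Hypothetical model of a NIC driven NAPI-style. */
struct sim_nic {
    int pending;        /* packets waiting in the RX ring */
    bool irq_enabled;   /* is the device interrupt unmasked? */
    int irq_count;      /* interrupts actually taken */
};

/* Packets arrive. An interrupt fires only if interrupts are unmasked;
 * the handler masks them and leaves the work to polling. */
static void sim_rx(struct sim_nic *nic, int packets) {
    nic->pending += packets;
    if (nic->irq_enabled) {
        nic->irq_enabled = false;   /* handler masks further interrupts */
        nic->irq_count++;
    }
}

/* One poll call: process up to `budget` packets. Using less than the full
 * budget means the ring is drained, so interrupts are unmasked again. */
static int sim_poll(struct sim_nic *nic, int budget) {
    int work = nic->pending < budget ? nic->pending : budget;
    nic->pending -= work;
    if (work < budget)
        nic->irq_enabled = true;
    return work;
}
```

With a burst of 300 packets and a budget of 64, the model takes a single interrupt and needs five poll calls (64 + 64 + 64 + 64 + 44) before interrupts are unmasked again.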

3.3 Receive Packet Steering (RPS/RFS)

Problem: A single NIC queue means all packets are processed on one CPU core. On a 10 Gbps link pushing small packets, a single core can become 100% saturated while the other 31 cores of a 32-core machine sit idle.
Solution: Distribute packet processing across multiple CPUs. There are three mechanisms, each solving a different part of the problem:

RSS (Hardware)

Receive Side Scaling
  • NIC has multiple RX queues
  • NIC hashes packet (IP + port)
  • Distributes to different queues
  • Each queue has own IRQ → CPU core
NIC
┌────────────────────┐
│  Hash(pkt) % 4     │
│   ↓   ↓   ↓   ↓    │
│  Q0  Q1  Q2  Q3    │
└──┼───┼───┼───┼────┘
   │   │   │   └──IRQ3→ CPU3
   │   │   └──────IRQ2→ CPU2
   │   └──────────IRQ1→ CPU1
   └──────────────IRQ0→ CPU0
Pros: Hardware acceleration
Cons: Requires multi-queue NIC

RPS (Software)

Receive Packet Steering
  • Software-based RSS
  • CPU that receives IRQ hashes packet
  • Enqueues to target CPU’s backlog
  • Target CPU processes packet
CPU0 (IRQ handler)
  │ Receive packet
  │ Hash → CPU2

  Enqueue to CPU2 backlog

      CPU2 processes packet
Pros: Works with single-queue NIC
Cons: Extra CPU work for steering (hashing plus an inter-processor interrupt)
RFS (Receive Flow Steering): Extension of RPS
Goal: Process packet on CPU where application is running
Benefit: CPU cache locality (hot cache = faster processing)

Flow:
1. Application recv() on CPU3
   └→ Kernel records: Flow X → CPU3

2. Packet for Flow X arrives on CPU0
   └→ RPS checks flow table
   └→ Steers to CPU3 (where app is!)

Result: Packet data already in CPU3's cache when app reads it
Configuration:
# Enable RPS (software steering)
echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus  # Use CPUs 0-3

# Enable RFS: size of the global socket flow table
echo 4096 > /proc/sys/net/core/rps_sock_flow_entries

# Per-queue flow table size (with N queues, typically rps_sock_flow_entries / N)
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
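
The value written to rps_cpus is a hexadecimal CPU bitmask, where bit N selects CPU N. A small helper can build it from a list of CPU ids. This is a sketch limited to CPUs 0-31, because larger masks use a comma-separated multi-word format that is not handled here. The function names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Build an RPS bitmask from a list of CPU ids.
 * Only CPUs 0-31 are handled in this sketch; other ids are ignored. */
static uint32_t rps_cpu_mask(const int *cpus, int count) {
    uint32_t mask = 0;
    for (int i = 0; i < count; i++) {
        if (cpus[i] >= 0 && cpus[i] < 32)
            mask |= (uint32_t)1 << cpus[i];
    }
    return mask;
}

/* Format the mask as the hex string to write into rps_cpus.
 * Returns the number of characters snprintf would have written. */
static int rps_mask_to_hex(uint32_t mask, char *out, size_t out_len) {
    return snprintf(out, out_len, "%x", (unsigned int)mask);
}
```

CPUs 0-3 give the mask `f` used above; CPUs 1, 3, and 5 give `2a`.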

4. XDP: eXpress Data Path

XDP allows running eBPF programs before sk_buff allocation, at the earliest possible point in packet processing.

4.1 XDP Architecture

Packet Flow with XDP:
─────────────────────

Without XDP:
┌───────┐   ┌──────────┐   ┌──────┐   ┌─────┐   ┌──────┐
│  NIC  │ → │ allocate │ → │  IP  │ → │ TCP │ → │ App  │
│  DMA  │   │  sk_buff │   │layer │   │layer│   │      │
└───────┘   └──────────┘   └──────┘   └─────┘   └──────┘
  ~500ns      ~200ns         ~100ns     ~100ns     user

With XDP:
┌───────┐   ┌──────────┐
│  NIC  │ → │   XDP    │ ──→ [XDP_DROP] (discard, fastest)
│  DMA  │   │  eBPF    │ ──→ [XDP_TX] (bounce back)
└───────┘   │ program  │ ──→ [XDP_REDIRECT] (other NIC/CPU)
  ~500ns    └──────────┘ ──→ [XDP_PASS] (continue to stack)
             ~100ns              ↓
                          ┌──────────┐
                          │ allocate │
                          │  sk_buff │
                          └──────────┘

                           Normal stack...

4.2 XDP Program Example

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>   // SEC() macro
#include <bpf/bpf_endian.h>    // bpf_htons(), bpf_htonl()

// Drop all packets from specific IP
SEC("xdp")
int xdp_drop_ip(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Bounds check (required by BPF verifier)
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // Only process IPv4 packets
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // Drop packets from 192.168.1.100.
    // ip->saddr is in network byte order, so convert the constant to match.
    __u32 blocked_ip = bpf_htonl(0xC0A80164);  // 192.168.1.100
    if (ip->saddr == blocked_ip) {
        return XDP_DROP;  // Discard immediately!
    }

    return XDP_PASS;  // Continue to network stack
}

char _license[] SEC("license") = "GPL";
Compile and Load:
# Compile XDP program
clang -O2 -target bpf -c xdp_drop.c -o xdp_drop.o

# Load into kernel
ip link set dev eth0 xdp obj xdp_drop.o sec xdp

# Verify
ip link show eth0
# ... xdp ...

# Remove XDP program
ip link set dev eth0 xdp off
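
The blocked address in the XDP program is compared against ip->saddr, which holds the address in network byte order. A constant such as 0xC0A80164 is the address as a host-order number, so on a little-endian machine the two only match after conversion. The user-space sketch below shows the conversion with the standard htonl(); `ipv4_net_order` is an illustrative name.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Pack four dotted-quad octets into the 32-bit value that appears in an
 * IPv4 header field such as saddr (network byte order). */
static uint32_t ipv4_net_order(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    uint32_t host = ((uint32_t)a << 24) | ((uint32_t)b << 16) |
                    ((uint32_t)c << 8) | (uint32_t)d;
    return htonl(host);   /* no-op on big-endian, byte swap on little-endian */
}
```

The result equals what inet_pton() stores for "192.168.1.100", and its first byte in memory is 192 regardless of the host's endianness.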

4.3 XDP Actions

// Fastest packet drop (DDoS mitigation)
if (is_attack_packet(ctx)) {
    return XDP_DROP;  // ~10 million pps possible
}
Use Cases:
  • DDoS mitigation (drop attack traffic before stack)
  • Invalid packet filtering
  • Rate limiting at wire speed
Performance: ~50ns per packet (vs ~10µs for iptables DROP)

4.4 AF_XDP: Zero-Copy to User Space

AF_XDP lets user-space programs receive packets directly from the NIC's DMA buffers, bypassing the kernel network stack. True zero-copy requires driver support; otherwise the kernel falls back to copy mode.
#include <linux/if_xdp.h>
#include <xdp/xsk.h>   // From libxdp; older libbpf releases shipped this as <bpf/xsk.h>

#define BATCH_SIZE 64

struct xsk_socket_info {
    struct xsk_ring_cons rx;    // RX ring (kernel → user)
    struct xsk_ring_prod tx;    // TX ring (user → kernel)
    struct xsk_ring_prod fq;    // Fill queue (user provides buffers)
    struct xsk_ring_cons cq;    // Completion queue (kernel returns buffers)
    struct xsk_socket *xsk;
    void *umem_area;            // Start of the UMEM region shared with the kernel
};

// Application-defined helpers, not part of the library:
// create_xsk_socket() sets up the UMEM, rings, and socket;
// allocate_buffer() returns the UMEM offset of a free frame.
struct xsk_socket_info *create_xsk_socket(const char *ifname, int queue_id);
__u64 allocate_buffer(void);
void process_packet(void *pkt, unsigned int len);

int main() {
    // Create XDP socket bound to queue 0 of eth0
    struct xsk_socket_info *xsk = create_xsk_socket("eth0", 0);

    // Main loop
    while (1) {
        // Fill queue with empty buffers
        __u32 idx_fq = 0;
        if (xsk_ring_prod__reserve(&xsk->fq, BATCH_SIZE, &idx_fq) == BATCH_SIZE) {
            for (int i = 0; i < BATCH_SIZE; i++) {
                *xsk_ring_prod__fill_addr(&xsk->fq, idx_fq++) = allocate_buffer();
            }
            xsk_ring_prod__submit(&xsk->fq, BATCH_SIZE);
        }

        // Receive packets
        __u32 idx_rx = 0;
        __u32 rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
        for (__u32 i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);

            // Access packet data in place (no copy into a separate buffer)
            void *pkt = xsk_umem__get_data(xsk->umem_area, desc->addr);
            unsigned int len = desc->len;

            // Process packet...
            process_packet(pkt, len);
        }
        if (rcvd)
            xsk_ring_cons__release(&xsk->rx, rcvd);
    }

    return 0;
}
Performance: 20+ million packets per second per core (vs ~1-2M with regular sockets)

5. The TCP/IP Stack

5.1 IP Layer Processing

// Simplified from net/ipv4/ip_input.c

int ip_rcv(struct sk_buff *skb, struct net_device *dev) {
    struct iphdr *iph;

    // 1. Validate packet
    if (skb->len < sizeof(struct iphdr))
        goto drop;

    iph = ip_hdr(skb);

    // 2. Validate header fields (ihl is used by the checksum below)
    if (iph->ihl < 5 || iph->version != 4)
        goto drop;

    // 3. Verify the IP header checksum
    if (ip_fast_csum((u8 *)iph, iph->ihl) != 0)
        goto drop;

    // 4. Netfilter hook: PREROUTING
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
                   ip_rcv_finish);

drop:
    kfree_skb(skb);
    return NET_RX_DROP;
}

static int ip_rcv_finish(struct sk_buff *skb) {
    struct iphdr *iph = ip_hdr(skb);

    // 5. Route lookup (where should this packet go?)
    if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, skb->dev) != 0) {
        kfree_skb(skb);  // No route: drop
        return NET_RX_DROP;
    }

    // 6. Deliver locally or forward
    return dst_input(skb);  // Calls ip_local_deliver() or ip_forward()
}

int ip_local_deliver(struct sk_buff *skb) {
    struct iphdr *iph = ip_hdr(skb);

    // 7. IP fragmentation reassembly
    if (iph->frag_off & htons(IP_MF | IP_OFFSET)) {
        skb = ip_defrag(skb);
        if (!skb)
            return 0;
    }

    // 8. Netfilter hook: LOCAL_IN
    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                   ip_local_deliver_finish);
}

static int ip_local_deliver_finish(struct sk_buff *skb) {
    struct iphdr *iph = ip_hdr(skb);

    // 9. Demultiplex to transport layer
    int protocol = iph->protocol;

    switch (protocol) {
        case IPPROTO_TCP:
            tcp_v4_rcv(skb);
            break;
        case IPPROTO_UDP:
            udp_rcv(skb);
            break;
        case IPPROTO_ICMP:
            icmp_rcv(skb);
            break;
        default:
            // Unknown protocol
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PROT_UNREACH, 0);
            kfree_skb(skb);
    }

    return 0;
}
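
The header checks at the start of IP input (minimum length, version, header length, checksum) can be reproduced in user space. The sketch below works on a raw byte buffer and uses the standard Internet checksum, a ones'-complement sum of 16-bit words. It is a plain portable loop rather than the kernel's optimized ip_fast_csum(), and the function names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement checksum over `len` bytes, read as big-endian 16-bit words.
 * Run over a header that includes a correct checksum field, it returns 0. */
static uint16_t ipv4_header_checksum(const uint8_t *hdr, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Returns 1 if the buffer starts with a plausible IPv4 header, else 0. */
static int ipv4_header_ok(const uint8_t *pkt, size_t len) {
    if (len < 20)
        return 0;                           /* shorter than a minimal header */
    unsigned int version = pkt[0] >> 4;
    unsigned int ihl = pkt[0] & 0x0Fu;      /* header length in 32-bit words */
    if (version != 4 || ihl < 5)
        return 0;
    if (len < (size_t)ihl * 4)
        return 0;                           /* options run past the buffer */
    return ipv4_header_checksum(pkt, (size_t)ihl * 4) == 0;
}
```

Validating ihl before the checksum matters: the checksum covers ihl * 4 bytes, so an unchecked ihl would make it read the wrong span.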
Routing Table Lookup (FIB - Forwarding Information Base):
Route lookup for destination IP:

1. Check for a cached route (fast path)
   └→ Kernels before 3.6 kept a global routing cache; newer kernels
      instead cache results per next hop and per socket

2. FIB lookup (trie-based structure)
   ┌─────────────────────────────┐
   │ Longest Prefix Match (LPM)  │
   │                             │
   │ 192.168.1.0/24 → eth0       │
   │ 10.0.0.0/8 → tun0           │
   │ 0.0.0.0/0 → eth0 (default)  │
   └─────────────────────────────┘

3. Policy routing (multiple routing tables)
   └→ Check fib rules, select table

4. Result: dst_entry structure
   ┌──────────────────────────┐
   │ output_dev = eth0        │
   │ gateway = 192.168.1.1    │
   │ output_fn = ip_finish()  │
   └──────────────────────────┘
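
Longest prefix match itself is easy to demonstrate. The sketch below uses a linear scan over a small table instead of the trie the kernel uses, so it shows the matching rule rather than the data structure. Addresses are plain host-order integers, and `struct route` and `lpm_lookup` are illustrative names.

```c
#include <stddef.h>
#include <stdint.h>

/* One routing entry: prefix/len → output device (addresses in host order). */
struct route {
    uint32_t prefix;
    uint8_t len;          /* prefix length, 0..32 */
    const char *dev;
};

/* Netmask for a prefix length; len 0 is special-cased because shifting
 * a 32-bit value by 32 is undefined in C. */
static uint32_t prefix_mask(uint8_t len) {
    return len == 0 ? 0u : 0xFFFFFFFFu << (32 - len);
}

/* Return the matching entry with the longest prefix, or NULL if none match. */
static const struct route *lpm_lookup(const struct route *table, size_t n,
                                      uint32_t dst) {
    const struct route *best = NULL;
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = prefix_mask(table[i].len);
        if ((dst & mask) != (table[i].prefix & mask))
            continue;                       /* prefix does not cover dst */
        if (best == NULL || table[i].len > best->len)
            best = &table[i];               /* more specific match wins */
    }
    return best;
}
```

With the three routes from the diagram, 192.168.1.77 selects the /24 entry, 10.1.2.3 selects the /8 entry, and any other address falls through to the /0 default.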

5.2 TCP Layer: The Fast Path

TCP processing has two paths:

Fast Path

Conditions:
  • In-order segment
  • No flags (except ACK)
  • Window not full
  • No urgent data
  • Checksum OK
Processing:
// Simplified
if (tcp_fast_path_check(sk, skb)) {
    // Fast path!
    memcpy(user_buffer, skb->data, skb->len);
    tcp_send_ack(sk);
    wake_up_process(sk->sk_sleep);
    return;
}
Performance: ~1µs per packet

Slow Path

Triggers:
  • Out-of-order segment
  • Retransmission
  • Window probing
  • Unexpected options (e.g. SACK blocks; a lone timestamp option stays on the fast path)
  • Connection management (SYN, FIN)
Processing:
// Complex state machine
tcp_validate_incoming(sk, skb);
tcp_ack(sk, skb);  // Process ACK
tcp_data_queue(sk, skb);  // Queue OOO
tcp_send_delayed_ack(sk);
// ... many more checks ...
Performance: ~5-10µs per packet
TCP Receive Processing (simplified from net/ipv4/tcp_input.c):
int tcp_v4_rcv(struct sk_buff *skb) {
    struct tcphdr *th;
    struct sock *sk;

    // 1. Get TCP header
    th = tcp_hdr(skb);

    // 2. Find socket (demultiplexing)
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
    if (!sk) {
        // No matching socket: the kernel answers with a RST
        // (tcp_v4_send_reset()) and drops the segment
        kfree_skb(skb);
        return 0;
    }

    // 3. Process TCP state machine
    tcp_v4_do_rcv(sk, skb);

    return 0;
}

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) {
    // Established connections: tcp_rcv_established() handles both its
    // fast path and the established-state slow path
    if (sk->sk_state == TCP_ESTABLISHED) {
        tcp_rcv_established(sk, skb);
        return 0;
    }

    // All other states (SYN_SENT, FIN_WAIT1, ...): connection management
    return tcp_rcv_state_process(sk, skb);
}

int tcp_rcv_established(struct sock *sk, struct sk_buff *skb) {
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcphdr *th = tcp_hdr(skb);

    // FAST PATH CHECK
    if (tp->rcv_nxt == ntohl(th->seq) &&  // In-order
        tp->rcv_wnd &&                     // Window open
        !th->syn && !th->fin && !th->rst)  // No special flags
    {
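        int len = skb->len - th->doff * 4;

        // Strip the TCP header and queue the skb on the socket receive queue.
        // No payload copy happens here; data is copied to user space later,
        // when the application calls recv().
        __skb_pull(skb, th->doff * 4);
        __skb_queue_tail(&sk->sk_receive_queue, skb);
        tp->rcv_nxt += len;

        // Send ACK
        tcp_send_ack(sk);

        // Wake up waiting process
        sk->sk_data_ready(sk);

        return 0;  // Fast path success! (the queue now owns the skb)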
    }

    // Fall through to slow path...
    return tcp_slow_path(sk, skb);
}

5.3 TCP Congestion Control

Linux supports pluggable congestion control algorithms:
// Simplified CUBIC algorithm

// Congestion window growth function
W(t) = C * (t - K)³ + W_max

Where:
  t = Time since last congestion event
  K = Cube root of (W_max * (1 - β) / C), the time at which W(t) returns to W_max
  W_max = Window size before congestion
  C = Scaling constant
  β = Multiplicative decrease factor (0.7)

Behavior:
  Slow start → Exponential growth
  Congestion avoidance → Cubic growth
  Loss detected → W = W * β (reduce 30%)
  Recovery → Cubic growth toward W_max

┌─────────────────────────────────────┐
│  Congestion Window                  │
│      ↑                              │
│ W_max│     ╱╲                       │
│      │    ╱  ╲   ← Cubic curve      │
│      │   ╱    ╲╱                    │
│      │  ╱                           │
│      │ ╱                            │
│      └──────────────────→ Time      │
│         ↑ Loss                      │
└─────────────────────────────────────┘
Pros: Good for high-bandwidth networks
Cons: Can be aggressive on lossy links
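
The window function can be evaluated directly to check its shape. The sketch below is floating-point user-space code, not the kernel's fixed-point implementation. It assumes K = cube root of (W_max * (1 - β) / C), so that the window starts at β * W_max right after a loss and returns to W_max at t = K. A small Newton iteration stands in for the math library's cube root to keep the block dependency-free. All names are illustrative.

```c
/* Cube root of a non-negative value by Newton iteration. */
static double cube_root(double x) {
    if (x <= 0.0)
        return 0.0;                         /* negative inputs not supported */
    double r = x > 1.0 ? x : 1.0;           /* any positive start converges */
    for (int i = 0; i < 200; i++)
        r = (2.0 * r + x / (r * r)) / 3.0;
    return r;
}

/* K: time (seconds) for the window to grow back to w_max after a loss. */
static double cubic_k(double w_max, double beta, double c) {
    return cube_root(w_max * (1.0 - beta) / c);
}

/* W(t) = C * (t - K)^3 + W_max, with t in seconds since the last loss. */
static double cubic_window(double t, double w_max, double beta, double c) {
    double d = t - cubic_k(w_max, beta, c);
    return c * d * d * d + w_max;
}
```

For W_max = 100 segments, β = 0.7, and C = 0.4, K is about 4.22 s. The window is 70 at t = 0, flattens out around 100 at t = K, then accelerates again and reaches 130 at t = 2K.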

6. Socket Layer & System Calls

6.1 Socket Creation

// User space
int sockfd = socket(AF_INET, SOCK_STREAM, 0);

// Kernel: net/socket.c
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol) {
    // 1. Allocate socket structure
    struct socket *sock = sock_alloc();

    // 2. Create protocol-specific socket
    // For TCP: calls inet_create() → tcp_prot.init()
    sock->ops = &inet_stream_ops;  // TCP operations

    // 3. Allocate struct sock (protocol control block)
    struct sock *sk = sk_alloc(family, GFP_KERNEL, &tcp_prot);
    sock->sk = sk;

    // 4. Initialize TCP-specific state
    tcp_init_sock(sk);

    // 5. Create file descriptor
    int fd = sock_map_fd(sock, O_CLOEXEC);

    return fd;
}
Data Structures:
File Descriptor Layer:
┌──────────────────┐
│ struct file      │  (Generic file operations)
│  ├─ f_op         │  → socket_file_ops
│  └─ private_data │  → points to struct socket
└────────┬─────────┘

Socket Layer:
┌──────────────────┐
│ struct socket    │  (BSD socket API)
│  ├─ type         │  (SOCK_STREAM, SOCK_DGRAM)
│  ├─ ops          │  → inet_stream_ops (send, recv, bind, etc.)
│  └─ sk           │  → points to struct sock
└────────┬─────────┘

Protocol Layer:
┌──────────────────┐
│ struct sock      │  (Protocol control block)
│  ├─ sk_state     │  (TCP_ESTABLISHED, TCP_LISTEN, etc.)
│  ├─ sk_prot      │  → tcp_prot (protocol operations)
│  ├─ sk_receive_queue   │  (Received data)
│  ├─ sk_write_queue     │  (Data to send)
│  └─ ...          │
└────────┬─────────┘

Protocol-Specific:
┌──────────────────┐
│ struct tcp_sock  │  (TCP-specific state)
│  ├─ rcv_nxt      │  (Next expected sequence)
│  ├─ snd_una      │  (Oldest unacked byte)
│  ├─ srtt         │  (Smoothed RTT)
│  ├─ snd_cwnd     │  (Congestion window)
│  └─ ...          │
└──────────────────┘

6.2 send() and recv() Internals

// User space
ssize_t n = send(sockfd, buffer, length, flags);

// Kernel: net/socket.c → tcp_sendmsg()
int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) {
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    int flags = msg->msg_flags;
    long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
    int mss_now = tcp_current_mss(sk);
    int copied = 0;

    // 1. Check socket state
    if (sk->sk_state != TCP_ESTABLISHED)
        return -ENOTCONN;

    // 2. Wait for send buffer space if needed
    while (size > 0) {
        // Check if send buffer full
        if (!sk_stream_memory_free(sk)) {
            if (flags & MSG_DONTWAIT)
                return copied ? copied : -EAGAIN;

            // Block waiting for buffer space
            sk_stream_wait_memory(sk, &timeo);
        }

        // 3. Allocate sk_buff
        skb = sk_stream_alloc_skb(sk, min_t(int, size, mss_now), GFP_KERNEL);

        // 4. Copy data from user space into the skb's linear area
        int copy = min_t(int, size, skb_availroom(skb));
        if (skb_add_data_nocache(sk, skb, &msg->msg_iter, copy)) {
            kfree_skb(skb);
            return copied ? copied : -EFAULT;
        }

        // 5. Add to write queue
        skb_entail(sk, skb);

        copied += copy;
        size -= copy;

        // 6. Push data if:
        // - PSH flag set
        // - Queue getting large
        // - No more data
        if ((flags & MSG_MORE) == 0 || size == 0)
            tcp_push(sk, flags, mss_now, TCP_NAGLE_PUSH);
    }

    return copied;
}

// 7. Actually transmit
void tcp_push(struct sock *sk, int flags, int mss_now, int nonagle) {
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;

    while ((skb = tcp_send_head(sk)) != NULL) {
        // Check congestion window
        if (tcp_snd_wnd_test(tp, skb, mss_now) &&
            tcp_cwnd_test(tp, skb)) {

            // Transmit!
            tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
        } else {
            break;  // Window full, wait for ACK
        }
    }
}

6.3 Zero-Copy Techniques

sendfile()

// Traditional copy (4 copies!)
fd_in = open("file.txt", O_RDONLY);
fd_out = socket(...);

char buf[4096];
while ((n = read(fd_in, buf, sizeof(buf))) > 0) {
    write(fd_out, buf, n);
}

// Copies:
// 1. DMA: Disk → Kernel buffer
// 2. CPU: Kernel buffer → User buffer (read)
// 3. CPU: User buffer → Socket buffer (write)
// 4. DMA: Socket buffer → NIC

// Zero-copy sendfile()
sendfile(fd_out, fd_in, NULL, file_size);

// Copies:
// 1. DMA: Disk → Kernel buffer
// 2. DMA: Kernel buffer → NIC (if supported)
// Or just 1 CPU copy if DMA-to-DMA not available

// Kernel implementation
ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
                    size_t count) {
    // Use splice internally
    // Transfers pages directly to socket
    return splice_direct_to_actor(in_file, &sd,
                                  direct_splice_actor);
}

MSG_ZEROCOPY

// Modern zero-copy send
int fd = socket(AF_INET, SOCK_STREAM, 0);

// Enable zero-copy
int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY,
           &one, sizeof(one));

// Send with zero-copy flag
char buf[65536];
send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

// Kernel behavior:
// - Increments refcount on user pages
// - Passes page pointers to NIC
// - User buffer MUST NOT be modified until...

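// Wait for completion notification on the socket error queue.
// Error-queue reads never block, so wait for POLLERR with poll()/epoll first.
struct msghdr msg = {};
struct sock_extended_err *serr;
char control[100];

msg.msg_control = control;
msg.msg_controllen = sizeof(control);

recvmsg(fd, &msg, MSG_ERRQUEUE);
// The control message carries a struct sock_extended_err with
// ee_origin == SO_EE_ORIGIN_ZEROCOPY; ee_info..ee_data is the range of
// completed send calls.

// Now safe to reuse the buffers of those sends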

// Benefits:
// - No CPU copy of payload data
// - Reduced cache pollution
// - Higher throughput

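// Caveats:
// - Only beneficial for large sends (roughly >10KB)
// - Buffer must stay unmodified until the completion notification arrives
// - Kernel may fall back to copying and flags it with SO_EE_CODE_ZEROCOPY_COPIED
// - More complex error handling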

7. Netfilter & Packet Filtering

7.1 Netfilter Hook Points

Packet Flow through Netfilter:
───────────────────────────────

                    Incoming Packet
                          ↓
                    ┌──────────┐
                    │   NIC    │
                    └─────┬────┘
                          ↓
           ┌──────────────────────────┐
           │  NF_INET_PRE_ROUTING     │ ← Hook 1
           │  (raw, mangle, nat)      │
           └─────────┬────────────────┘
                     ↓
              Routing Decision
                 ↙        ↘
         Local          Forward
           ↓                ↓
  ┌────────────────┐  ┌────────────────┐
  │ NF_INET_       │  │ NF_INET_       │
  │ LOCAL_IN       │  │ FORWARD        │ ← Hook 2 (LOCAL_IN), Hook 3 (FORWARD)
  │ (mangle,filter)│  │ (mangle,filter)│
  └───────┬────────┘  └────────┬───────┘
          ↓                    ↓
    Local Process      Routing Decision
          ↓                    ↓
  ┌────────────────┐  ┌────────────────┐
  │ NF_INET_       │  │ NF_INET_       │
  │ LOCAL_OUT      │  │ POST_ROUTING   │ ← Hook 4 (LOCAL_OUT), Hook 5 (POST_ROUTING)
  │ (raw,mangle,   │  │ (mangle, nat)  │
  │  nat, filter)  │  └────────┬───────┘
  └───────┬────────┘           ↓
          ↓              ┌──────────┐
    Routing Decision     │   NIC    │
          ↓              └──────────┘
  ┌────────────────┐          ↓
  │ NF_INET_       │    Outgoing Packet
  │ POST_ROUTING   │ ← Hook 5
  │ (mangle, nat)  │
  └───────┬────────┘
          ↓
    ┌──────────┐
    │   NIC    │
    └──────────┘
          ↓
    Outgoing Packet

7.2 Connection Tracking (conntrack)

// Conntrack tracks connection state

// Conntrack status bits (set on every tracked flow, not only TCP):
enum ip_conntrack_status {
    IPS_EXPECTED        = (1 << 0),  // Expected connection
    IPS_SEEN_REPLY      = (1 << 1),  // Seen reply direction
    IPS_ASSURED         = (1 << 2),  // Fully established
    IPS_CONFIRMED       = (1 << 3),  // In conntrack table
    IPS_SRC_NAT         = (1 << 4),  // Source NAT applied
    IPS_DST_NAT         = (1 << 5),  // Destination NAT applied
};

// Conntrack table entry
struct nf_conn {
    struct nf_conntrack ct_general;

    // Tuple: uniquely identifies connection
    struct nf_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX];
    // [0] = ORIGINAL direction
    // [1] = REPLY direction

    // Connection state
    unsigned long status;

    // Protocol-specific data
    union nf_conntrack_proto proto;

    // Timeout
    unsigned long timeout;
};

// Example: Track TCP connection
// Client 192.168.1.100:45000 → Server 8.8.8.8:80

// ORIGINAL tuple:
//   src: 192.168.1.100:45000
//   dst: 8.8.8.8:80
//   proto: TCP

// REPLY tuple:
//   src: 8.8.8.8:80
//   dst: 192.168.1.100:45000
//   proto: TCP

// This allows matching packets in both directions!
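The two-direction match can be sketched in a few lines of user-space C. `struct flow_tuple` is a hypothetical, simplified stand-in for the kernel's `struct nf_conntrack_tuple`, and the sketch assumes no NAT (with NAT the stored REPLY tuple is not a plain inversion of ORIGINAL).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct nf_conntrack_tuple. */
struct flow_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* REPLY tuple = ORIGINAL tuple with source and destination swapped. */
struct flow_tuple invert_tuple(struct flow_tuple t)
{
    struct flow_tuple r = t;
    r.src_ip = t.dst_ip;     r.dst_ip = t.src_ip;
    r.src_port = t.dst_port; r.dst_port = t.src_port;
    return r;
}

bool tuple_equal(struct flow_tuple a, struct flow_tuple b)
{
    return a.src_ip == b.src_ip && a.dst_ip == b.dst_ip &&
           a.src_port == b.src_port && a.dst_port == b.dst_port &&
           a.proto == b.proto;
}

/* A packet belongs to the connection if it matches either direction. */
bool packet_matches_conn(struct flow_tuple orig, struct flow_tuple pkt)
{
    return tuple_equal(orig, pkt) || tuple_equal(invert_tuple(orig), pkt);
}
```

The kernel reaches the same result differently: it hashes both tuples into the conntrack table, so a lookup on a packet from either direction lands on the same `nf_conn`.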
Performance Impact:
# View conntrack table
conntrack -L

# View statistics
conntrack -S
# cpu=0       found=1234 invalid=5 ignore=0 insert=567 ...

# Max connections
sysctl net.netfilter.nf_conntrack_max
# 65536

# Increase limit
sysctl -w net.netfilter.nf_conntrack_max=1048576

# Bypass conntrack for high-traffic flows (e.g., load balancer)
iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK

7.3 iptables Performance

# Rules are evaluated sequentially (O(n))
# Bad: 10,000 rules = slow!

iptables -A INPUT -s 1.2.3.4 -j DROP
iptables -A INPUT -s 1.2.3.5 -j DROP
# ... 9,998 more rules ...

# Better: Use ipset (hash table, O(1) lookup)
ipset create blocklist hash:ip
ipset add blocklist 1.2.3.4
ipset add blocklist 1.2.3.5
# ... add thousands ...

iptables -A INPUT -m set --match-set blocklist src -j DROP
# One rule, fast hash lookup!

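# Modern alternative: nftables with a native set
# (ipset sets are not visible to nft, so the set is created here)
nft add table inet filter
nft add chain inet filter input { type filter hook input priority 0\; }
nft add set inet filter blocklist { type ipv4_addr\; }
nft add element inet filter blocklist { 1.2.3.4, 1.2.3.5 }
nft add rule inet filter input ip saddr @blocklist drop

# nftables compiles rules to bytecode for an in-kernel VM (similar in spirit
# to BPF); native sets and maps replace long linear rule chains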

8. Network Buffers & Memory Management

8.1 Socket Buffers

// Each socket has send and receive buffers

// View socket buffer sizes
int rcvbuf, sndbuf;
socklen_t len = sizeof(int);
getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);

// Set larger buffers (important for high-BDP networks)
// Note: the kernel doubles the requested value to cover bookkeeping
// overhead, caps it at net.core.rmem_max / wmem_max, and stops
// autotuning that buffer once it has been set explicitly.
int buf_size = 4 * 1024 * 1024;  // 4 MB
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &buf_size, sizeof(buf_size));
setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf_size, sizeof(buf_size));

# System-wide defaults (shell)
sysctl net.core.rmem_default  # Default receive buffer
sysctl net.core.wmem_default  # Default send buffer
sysctl net.core.rmem_max      # Max receive buffer (limit for SO_RCVBUF)
sysctl net.core.wmem_max      # Max send buffer (limit for SO_SNDBUF)

# TCP-specific (min, default, max)
sysctl net.ipv4.tcp_rmem  # 4096 131072 6291456
sysctl net.ipv4.tcp_wmem  # 4096 16384 4194304
Buffer Sizing for High Bandwidth-Delay Product:
BDP = Bandwidth × RTT

Example: 10 Gbps link, 100ms RTT
BDP = (10 × 10⁹ bits/sec) × (0.1 sec) / 8 bits/byte
    = 125 MB

Buffer should be >= BDP to fully utilize link!

# Configure
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"  # 128 MB max
sysctl -w net.core.rmem_max=134217728
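The BDP arithmetic above is easy to get wrong by a factor of 8 or 1000, so a small helper can be handy. `bdp_bytes` is a hypothetical name; the sketch takes bandwidth in bits per second and RTT in microseconds.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes.
 * bandwidth_bps: link bandwidth in bits per second
 * rtt_us:        round-trip time in microseconds */
uint64_t bdp_bytes(uint64_t bandwidth_bps, uint64_t rtt_us)
{
    /* Convert to bytes/sec first to keep the intermediate product small,
     * then scale by the RTT expressed in seconds. */
    return (bandwidth_bps / 8) * rtt_us / 1000000ULL;
}
```

For the example above, `bdp_bytes(10000000000ULL, 100000)` gives 125000000 bytes (125 MB).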

8.2 TCP Autotuning

// Modern Linux auto-tunes TCP buffers (default: enabled)

// Kernel dynamically adjusts buffer size based on:
// 1. Measured RTT
// 2. Receive rate
// 3. Available memory

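// Algorithm (simplified from tcp_rcv_space_adjust() in tcp_input.c):
void tcp_rcv_space_adjust(struct sock *sk) {
    struct tcp_sock *tp = tcp_sk(sk);
    int time, copied;

    // Re-evaluate about once per measured receive RTT (rtt_us is stored << 3)
    time = tcp_stamp_us_delta(tp->tcp_mstamp, tp->rcvq_space.time);
    if (time < (tp->rcv_rtt_est.rtt_us >> 3) || tp->rcv_rtt_est.rtt_us == 0)
        return;

    // Bytes the application consumed during this interval
    copied = tp->copied_seq - tp->rcvq_space.seq;

    // Grow the buffer if the app read more than in the previous interval
    if (copied > tp->rcvq_space.space) {
        int rcvwin = (copied << 1) + 16 * tp->advmss;  // 2x headroom
        int rcvbuf = min(tcp_space_from_win(sk, rcvwin),
                         sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);

        // Capped by the tcp_rmem max, not by net.core.rmem_max
        if (rcvbuf > sk->sk_rcvbuf)
            sk->sk_rcvbuf = rcvbuf;
        tp->rcvq_space.space = copied;
    }

    // Start a new measurement interval
    tp->rcvq_space.seq = tp->copied_seq;
    tp->rcvq_space.time = tp->tcp_mstamp;
}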

// Disable autotuning (if you want manual control)
echo 0 > /proc/sys/net/ipv4/tcp_moderate_rcvbuf

8.5 Production Caveats and Common Pitfalls

The kernel network stack ships with reasonable defaults for a generic workload. The moment your traffic profile diverges from “generic,” those defaults become silent throughput killers. Below are the four traps that bite senior engineers most often in production, paired with the patterns that fix them.
Pitfall 1: TCP socket buffer tuning sized for the wrong bandwidth-delay product
The send and receive buffers cap the in-flight bytes a TCP connection can hold. If the buffer is smaller than your bandwidth-delay product (BDP = bandwidth times RTT), the sender blocks waiting for ACKs even though the link has spare capacity. Engineers see CPU idle, network idle, and conclude “the network is fine”, but throughput is stuck at a fraction of line rate.
Concrete numbers: a 10 Gbps cross-region link with 80 ms RTT has a BDP of 100 MB. The default tcp_rmem max on most distros is 6 MB. A single TCP connection on that link is hard-capped at roughly 600 Mbps regardless of how much CPU and bandwidth you have. This caused a 7-hour throughput regression at a CDN we worked with after a data-center migration changed the RTT from 5 ms to 65 ms.
Equally bad in the other direction: oversized buffers create bufferbloat. Packets sit in send buffers for seconds, killing latency and breaking ACK clocking.
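The cap in Pitfall 1 follows from window / RTT. A minimal sketch of that arithmetic, assuming the whole buffer is usable as window (in practice part of it goes to kernel overhead, so the real cap is lower); `max_throughput_bps` is a hypothetical helper name.

```c
#include <stdint.h>

/* Upper bound on single-connection throughput in bits per second,
 * given the window (buffer) size in bytes and the RTT in microseconds. */
uint64_t max_throughput_bps(uint64_t window_bytes, uint64_t rtt_us)
{
    if (rtt_us == 0)
        return 0;   /* undefined: avoid dividing by zero */
    /* One window of data can be delivered per round trip. */
    return window_bytes * 8ULL * 1000000ULL / rtt_us;
}
```

With the 6 MB default, `max_throughput_bps(6291456, 80000)` is about 629 Mbps, while at 5 ms RTT the same buffer allows roughly 10 Gbps. That is why a migration that only changes RTT can surface the limit.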
Solution: size buffers to BDP, then trust autotuning
Set the max value of tcp_rmem/tcp_wmem to roughly 2x your worst-case BDP. Leave the min and default values modest; TCP autotuning will grow each connection’s buffer up to the max as needed. Do not set the default to the max: that wastes memory on idle connections and makes the kernel slower to recover under memory pressure.
# 10 Gbps x 100ms RTT BDP = ~125 MB. Set max to 256 MB.
sysctl -w net.core.rmem_max=268435456
sysctl -w net.core.wmem_max=268435456
sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"
# Keep autotuning enabled (it is by default)
sysctl -w net.ipv4.tcp_moderate_rcvbuf=1
Verify with ss -tim: the cwnd and rcv_space fields show what TCP is actually using. If cwnd plateaus while wscale is non-zero and the application has data to send, the buffer is your cap.
Pitfall 2: TIME_WAIT exhaustion under high connection churn
TCP requires the side that closes first (the one that sends the first FIN) to hold the socket in TIME_WAIT for 2 times the maximum segment lifetime (60 seconds on Linux). On a service that opens many short-lived outbound connections (a metrics shipper, a service-to-service HTTP client without keep-alive, an old-school PHP-FPM worker) you can run out of ephemeral ports in under a minute. The symptom: connect() starts returning EADDRNOTAVAIL even though the network is fine.
The folk remedy is SO_REUSEADDR, but that only lets the listening side rebind to a port still in TIME_WAIT. It does nothing for the outbound case. The other tempting knob, net.ipv4.tcp_tw_recycle, was removed in kernel 4.12 because it broke connectivity through NATs. Setting tcp_tw_reuse=1 helps, but only for outbound connections and only when TCP timestamps are enabled on both ends.
Solution: connection pooling, then knobs as a backstop
The right fix is architectural: keep a pool of long-lived connections instead of opening one per request. HTTP/1.1 keep-alive, HTTP/2 multiplexing, gRPC channels, and database connection pools all exist for this reason. A pool of 100 reused connections handles the same load as 100,000 ephemeral ones without ever touching TIME_WAIT.
If you cannot fix the application, the layered backstop is:
# Allow reusing TIME_WAIT sockets for outbound when safe
sysctl -w net.ipv4.tcp_tw_reuse=1
# Widen the ephemeral port range from the default 32768-60999
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# Shorten FIN_WAIT_2 timeout (be careful, this affects half-closed connections)
sysctl -w net.ipv4.tcp_fin_timeout=30
Monitor with ss -s — the timewait count should be a small fraction of available ephemeral ports. Going above 50 percent means you are one traffic spike away from an outage.
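To see how close a service is to that limit, the sustainable connection rate can be estimated from the port range. This is a rough sketch under two assumptions: all connections go to one destination IP and port, and every closed connection spends the full TIME_WAIT period holding its port. `max_conn_rate_per_dest` is a hypothetical helper name.

```c
#include <stdint.h>

/* Sustainable new connections per second to ONE destination (ip:port)
 * before the ephemeral range is full of TIME_WAIT sockets. */
uint32_t max_conn_rate_per_dest(uint32_t port_lo, uint32_t port_hi,
                                uint32_t time_wait_sec)
{
    uint32_t ports = port_hi - port_lo + 1;  /* range is inclusive */
    return ports / time_wait_sec;
}
```

The default range 32768-60999 gives about 470 connections per second per destination; widening it to 1024-65535 raises that to about 1075.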
Pitfall 3: MTU mismatches cause silent path-MTU-discovery failures
The classic scenario: your app works perfectly on a LAN with 1500-byte MTU, then you deploy through a VPN, GRE tunnel, or VXLAN overlay that adds 50-100 bytes of encapsulation. The encapsulated path now has an effective MTU of 1400 or so. Your sender uses 1500-byte packets, the intermediate router sees that the packet exceeds the next-hop MTU, sends back an ICMP “Fragmentation Needed” message, and the sender shrinks its segments. That is path-MTU discovery (PMTUD) working correctly.
Now the bug: somewhere between you and the destination, a firewall is dropping ICMP “Fragmentation Needed” messages because “ICMP is dangerous.” Your sender never receives the hint. It keeps sending 1500-byte packets that get black-holed. Connections hang during the first large transfer (a TLS handshake with a big certificate, an HTTP response with a few KB of headers). Small connections work, large ones do not. The pcap on the sender shows packets going out, the pcap on the receiver shows nothing arriving, and there is no error anywhere.
Solution: lower MSS on the affected path, or enable PMTU black-hole detection
The clean fix is tcp_mtu_probing, which lets TCP detect a black hole heuristically (it notices retransmission patterns consistent with PMTUD failure and probes with smaller segments).
# Enable MTU black-hole detection
sysctl -w net.ipv4.tcp_mtu_probing=1
# Starting MSS for probing once a black hole is suspected
sysctl -w net.ipv4.tcp_base_mss=1024
For known-bad paths (a VPN you control), clamp MSS at the iptables/nftables layer:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu
To reproduce in testing: tracepath -n example.com reports the MTU at each hop. If it stops advancing partway through, you have a black hole.
Diagnostic incident: this exact failure mode took down outbound API calls for several hours at a fintech we know in 2022, after a network team enabled stricter ICMP filtering on a transit provider.
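When clamping by hand instead of using --clamp-mss-to-pmtu, the target MSS is the link MTU minus the encapsulation overhead and the inner IP and TCP headers. A minimal sketch, assuming IPv4 and no IP options; `tcp_mss_for_path` is a hypothetical helper name.

```c
/* Largest TCP payload per segment that fits the path without fragmentation.
 * link_mtu:        MTU of the underlying link (e.g. 1500)
 * encap_overhead:  bytes added by the tunnel (e.g. 50 for VXLAN over IPv4) */
int tcp_mss_for_path(int link_mtu, int encap_overhead)
{
    const int ip_hdr = 20;   /* IPv4 header without options */
    const int tcp_hdr = 20;  /* TCP header without options */
    return link_mtu - encap_overhead - ip_hdr - tcp_hdr;
}
```

A plain 1500-byte path gives an MSS of 1460; the same path under VXLAN gives 1410. TCP options such as timestamps are carried inside the MSS, which is why ss often shows a smaller per-segment payload.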
Pitfall 4: epoll edge-triggered vs level-triggered semantics (ET requires draining or you stall)
epoll has two trigger modes. Level-triggered (LT, the default) reports a file descriptor as ready every time you call epoll_wait if it has data buffered. Edge-triggered (ET) only reports a transition: you get one notification when data arrives, and you do not get another until new data arrives, no matter how much is still buffered. ET is faster (fewer syscalls per event) and is what high-performance servers like nginx, HAProxy, and Envoy use.
The trap: in ET mode, if your handler reads only some of the available data (say, you read 4 KB but 8 KB arrived), you will not get another notification until more data shows up. The remaining 4 KB sits in the socket buffer indefinitely. The connection appears stuck. Worse, the bug is timing-dependent and load-dependent: under light load, every read happens to drain the buffer and you never see it. Under heavy load with bursty senders, half your connections stall. This pattern caused a 4-hour partial outage at a streaming service in 2019 when a code path read a fixed chunk size instead of looping.
Solution: in ET mode, always loop until EAGAIN
The contract for edge-triggered is non-negotiable: every time you receive an event, drain the descriptor until read/recv returns EAGAIN or EWOULDBLOCK. Same on the write side: write until EAGAIN, then wait for the next EPOLLOUT event.
// Edge-triggered read loop (correct)
while (1) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) {
        process(buf, n);
        continue;
    }
    if (n == 0) {
        // Peer closed the connection
        close(fd);
        break;
    }
    if (errno == EINTR) {
        // Interrupted by a signal -- retry the read
        continue;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        // Buffer drained -- wait for next epoll event
        break;
    }
    // Real error
    handle_error(errno);
    break;
}
If you cannot guarantee draining (for example, you want fairness across many connections and refuse to spin on one), use level-triggered mode instead. The performance delta between ET and LT for typical web workloads is under 5 percent. The correctness delta when you get ET wrong is “everything hangs occasionally.”
Stronger pattern: prefer EPOLLONESHOT. The fd is reported once, then disarmed until you explicitly re-arm it with epoll_ctl(EPOLL_CTL_MOD). This forces you to handle the read-to-completion cycle and re-enable, eliminating both lost-wakeup bugs and concurrent-access races in multithreaded servers.
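A minimal sketch of the EPOLLONESHOT cycle follows. The helper names (`watch_oneshot`, `rearm_oneshot`, `handle_and_rearm`) are hypothetical, and the sketch uses level-triggered one-shot mode, so data left unread is reported again as soon as the descriptor is re-armed.

```c
#include <sys/epoll.h>
#include <unistd.h>

static int set_oneshot(int epfd, int op, int fd)
{
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLONESHOT;  /* report once, then disarm */
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev);
}

/* Register fd: it will be reported at most once until re-armed. */
int watch_oneshot(int epfd, int fd)
{
    return set_oneshot(epfd, EPOLL_CTL_ADD, fd);
}

/* Re-enable reporting after the handler has finished with fd. */
int rearm_oneshot(int epfd, int fd)
{
    return set_oneshot(epfd, EPOLL_CTL_MOD, fd);
}

/* Handle one event: read up to len bytes, then re-arm.
 * Returns the read() result, or -1 if re-arming failed. */
ssize_t handle_and_rearm(int epfd, int fd, char *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n > 0 && rearm_oneshot(epfd, fd) != 0)
        return -1;
    return n;
}
```

Because only one thread can hold the event between the report and the re-arm, two workers never read the same descriptor concurrently.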

9. Performance Monitoring & Debugging

9.1 Essential Tools

# View all TCP sockets
ss -tan

# Show TCP info (congestion control, RTT, etc.)
ss -ti

# Example output:
# ESTAB 0 0    192.168.1.100:45000  8.8.8.8:443
#  cubic wscale:7,7 rto:204 rtt:3.5/2 ato:40 mss:1448
#  cwnd:10 bytes_acked:12345 segs_out:100 segs_in:95

# Filter by state
ss -tan state established

# Show processes
ss -tap

# Watch in real-time
watch -n1 'ss -ti dst 8.8.8.8'

9.2 Tracing with BPF

# Trace TCP retransmissions with bpftrace (IPv4 shown; needs kernel BTF or headers)

# tcp_retransmit.bt
bpftrace -e '
kprobe:tcp_retransmit_skb {
    $sk = (struct sock *)arg0;
    $lport = $sk->__sk_common.skc_num;      // local port, host byte order
    $dport = $sk->__sk_common.skc_dport;    // remote port, network byte order
    $dport = ($dport >> 8) | (($dport << 8) & 0xff00);

    printf("TCP retransmit: %s:%d -> %s:%d\n",
           ntop($sk->__sk_common.skc_rcv_saddr), $lport,
           ntop($sk->__sk_common.skc_daddr), $dport);
}
'

# Trace packet drops
bpftrace -e '
tracepoint:skb:kfree_skb {
    @drops[args->location] = count();
}

interval:s:5 {
    print(@drops);
    clear(@drops);
}
'

10. Interview Questions & Answers

sk_buff: The fundamental packet data structure in Linux networking.
Memory Layout:
┌─────────────────────────────────────────────┐
│ head  data          tail  end              │
│  ↓     ↓             ↓     ↓                │
│  ├─────┼─────────────┼─────┤                │
│  │ HR  │ Valid Data  │ TR  │                │
│  └─────┴─────────────┴─────┘                │
│<-headroom-> <-len-> <-tailroom->             │
└─────────────────────────────────────────────┘
Why Headroom Matters: As a packet moves down the network stack (from app to wire), each layer adds a header:
  • Application data
  • +20 bytes TCP header (skb_push)
  • +20 bytes IP header (skb_push)
  • +14 bytes Ethernet header (skb_push)
Without headroom, each layer would need to reallocate and copy the entire packet. With headroom, we just move the data pointer backwards.
Why Tailroom Matters:
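  • For appending payload with skb_put() and for adding trailers or padding (less common)
  • Large TSO/GSO packets (kernel builds one big packet, NIC or stack splits it) mostly carry payload in page fragments rather than tailroom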
Performance Impact: Zero-copy header addition vs expensive realloc/memcpy.
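The pointer arithmetic can be modelled in user space. `struct mini_skb` and its helpers are hypothetical stand-ins that mimic what skb_reserve(), skb_put() and skb_push() do to the four pointers; they are not the kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the four sk_buff pointers. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
};

void mini_skb_init(struct mini_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
}

size_t mini_skb_headroom(const struct mini_skb *skb) { return (size_t)(skb->data - skb->head); }
size_t mini_skb_len(const struct mini_skb *skb)      { return (size_t)(skb->tail - skb->data); }
size_t mini_skb_tailroom(const struct mini_skb *skb) { return (size_t)(skb->end - skb->tail); }

/* Like skb_reserve(): set aside headroom before any data is added. */
void mini_skb_reserve(struct mini_skb *skb, size_t len)
{
    assert(mini_skb_len(skb) == 0 && mini_skb_tailroom(skb) >= len);
    skb->data += len;
    skb->tail += len;
}

/* Like skb_put(): append len bytes at the tail; returns the new area. */
unsigned char *mini_skb_put(struct mini_skb *skb, size_t len)
{
    unsigned char *area = skb->tail;
    assert(mini_skb_tailroom(skb) >= len);
    skb->tail += len;
    return area;
}

/* Like skb_push(): prepend len bytes by moving data backwards. No copy. */
unsigned char *mini_skb_push(struct mini_skb *skb, size_t len)
{
    assert(mini_skb_headroom(skb) >= len);
    skb->data -= len;
    return skb->data;
}
```

Reserving 54 bytes, putting a 100-byte payload, then pushing 20 + 20 + 14 bytes of headers leaves zero headroom and a 154-byte frame, and the payload bytes never move.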
Problem with Old Interrupt Model:
  • Each packet → hardware interrupt
  • At 1 Gbps with minimum-size frames (about 1.5M packets/sec), the CPU can spend nearly 100% of its time handling interrupts
  • This is an “interrupt storm” or “receive livelock”
NAPI Solution (New API):
Low Traffic (interrupt mode):
  1. Packet arrives → IRQ
  2. Driver disables NIC interrupts
  3. Schedules NAPI poll
  4. Returns immediately from IRQ
High Traffic (polling mode):
  5. Softirq calls driver’s poll() function
  6. Poll processes up to budget packets (default 64)
  7. If more packets remain, stay in polling mode
  8. If ring buffer empty, re-enable interrupts
Benefits:
  • Low latency (low load): Interrupts still used
  • High throughput (high load): Polling avoids interrupt overhead
  • Fairness: Budget prevents one NIC from starving others
  • Adaptive: Automatically switches modes
Key Insight: Interrupts tell us “work is available”, then we switch to polling to batch the work.
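The mode switch can be illustrated with a toy model. `struct rx_queue`, `rx_irq` and `rx_poll` are hypothetical names for this sketch; a real driver does the same thing with napi_schedule() in its interrupt handler and napi_complete_done() in its poll function.

```c
#include <stdbool.h>

/* Hypothetical model of one NIC receive queue. */
struct rx_queue {
    int  pending;      /* packets waiting in the ring buffer */
    bool irq_enabled;  /* whether the NIC may raise interrupts */
};

/* Hard IRQ handler: mask the interrupt and hand off to polling. */
void rx_irq(struct rx_queue *q)
{
    q->irq_enabled = false;
}

/* One poll pass. Processes at most `budget` packets and returns how many. */
int rx_poll(struct rx_queue *q, int budget)
{
    int done = 0;

    while (done < budget && q->pending > 0) {
        q->pending--;   /* stand-in for handing one packet to the stack */
        done++;
    }
    if (done < budget)
        q->irq_enabled = true;  /* ring drained: back to interrupt mode */
    return done;
}
```

With 100 packets queued and a budget of 64, the first pass processes 64 and stays in polling mode; the second processes 36 and re-enables the interrupt.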
XDP (eXpress Data Path): Runs eBPF programs at the earliest possible point in packet processing.
Traditional Path:
NIC → DMA → Driver → Allocate sk_buff → Protocol Stack → ...
         ↑           ↑
     ~500ns      ~200ns + overhead
XDP Path:
NIC → DMA → Driver → XDP Program (eBPF) → Decision
         ↑           ↑
     ~500ns      ~100ns
Why So Fast:
  1. No sk_buff allocation: Operating directly on DMA buffer
  2. Cache-friendly: Packet data is touched right after the driver receives it, while it is still cache-hot
  3. No context switches: Runs in softirq context
  4. Early drop: Can discard packets before any processing
  5. JIT compiled: eBPF → native machine code
Actions:
  • XDP_DROP: Discard (DDoS mitigation at 10M+ pps)
  • XDP_TX: Bounce back same interface (L2 load balancer)
  • XDP_REDIRECT: Send to different NIC or CPU
  • XDP_PASS: Continue to normal stack
Use Cases:
  • DDoS mitigation
  • Load balancing
  • Packet filtering
  • Network monitoring
Limitation: No sk_buff metadata or stack services are available; header rewrites must be done by hand, including recomputing checksums.
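The shape of a typical XDP drop filter can be shown in plain user-space C. This sketch is not a loadable eBPF program: `enum verdict` stands in for XDP_DROP / XDP_PASS, and `data` / `data_end` mirror the fields of the same name in `struct xdp_md`. The explicit bounds check before every read is what the eBPF verifier demands of real programs.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for XDP_DROP and XDP_PASS. */
enum verdict { VERDICT_DROP, VERDICT_PASS };

/* Drop IPv4 packets whose source address equals blocked_saddr
 * (network byte order); pass everything else. */
enum verdict filter_packet(const uint8_t *data, const uint8_t *data_end,
                           uint32_t blocked_saddr)
{
    const size_t eth_len = 14;   /* Ethernet header */
    const size_t ip_min = 20;    /* IPv4 header without options */
    uint16_t ethertype;
    uint32_t saddr;

    /* Bounds check before touching any header byte. */
    if ((size_t)(data_end - data) < eth_len + ip_min)
        return VERDICT_PASS;

    memcpy(&ethertype, data + 12, sizeof(ethertype));
    if (ethertype != htons(0x0800))  /* not IPv4 */
        return VERDICT_PASS;

    /* Source address sits 12 bytes into the IPv4 header. */
    memcpy(&saddr, data + eth_len + 12, sizeof(saddr));
    return saddr == blocked_saddr ? VERDICT_DROP : VERDICT_PASS;
}
```

Everything the filter needs comes from raw bytes at fixed offsets, which is why the decision can be made before any sk_buff exists.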
Fast Path (common case optimization):
Conditions:
  • TCP connection is ESTABLISHED
  • Packet arrives in-order (seq == rcv_nxt)
  • No flags except ACK
  • Receive window not full
  • No urgent data
  • Checksum valid
Processing:
if (fast_path_conditions) {
    memcpy(user_buffer, packet_data, len);  // Direct copy
    rcv_nxt += len;
    send_ack();
    wake_up_application();
    return;  // Done in ~1µs!
}
Slow Path (handles complex cases):
Triggers:
  • Out-of-order segment (requires reassembly)
  • Retransmission (update RTO, congestion window)
  • Connection management (SYN, FIN, RST)
  • Options processing (e.g. SACK blocks; a lone timestamp option stays on the fast path)
  • Zero window probing
Processing:
// Complex state machine
validate_sequence_numbers();
check_for_duplicate_acks();
update_rtt_estimates();
process_sack_blocks();
reorder_out_of_order_segments();
update_congestion_window();
// ... many more checks ...
// ~5-10µs
Impact: Fast path handles 90%+ of packets in established bulk-data transfers. Slow path ensures correctness for edge cases.
Optimization: Keep connections in fast path by:
  • Avoiding packet loss (good network)
  • Using large enough buffers (avoid window full)
  • Minimizing out-of-order delivery (good QoS)
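The fast-path conditions amount to one cheap predicate. The sketch below is a simplified illustration with hypothetical types (`struct conn`, `struct seg`); the kernel implements the check as “header prediction” by comparing precomputed header bits. One assumption worth flagging: PSH is tolerated here alongside ACK, which matches the kernel masking PSH out of its prediction flags.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAG_FIN 0x01
#define FLAG_SYN 0x02
#define FLAG_RST 0x04
#define FLAG_PSH 0x08
#define FLAG_ACK 0x10
#define FLAG_URG 0x20

/* Hypothetical, simplified connection and segment state. */
struct conn { bool established; uint32_t rcv_nxt; uint32_t rcv_wnd; };
struct seg  { uint32_t seq; uint8_t flags; bool checksum_ok; };

/* True when the segment can take the fast path. */
bool takes_fast_path(const struct conn *c, const struct seg *s)
{
    return c->established
        && s->checksum_ok
        && s->seq == c->rcv_nxt                  /* in order */
        && (s->flags & ~FLAG_PSH) == FLAG_ACK    /* ACK only (PSH ignored) */
        && c->rcv_wnd > 0;                       /* window not full */
}
```

Any segment that fails the predicate simply falls through to the full state machine, so correctness never depends on the shortcut.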
Problem: Single NIC queue → all packets processed on one CPU → bottleneck
Solution 1: RSS (Receive Side Scaling) - Hardware
  • NIC has multiple RX queues (e.g., 8 queues)
  • NIC computes hash: hash(src_ip, dst_ip, src_port, dst_port) % num_queues
  • Each queue has dedicated IRQ mapped to specific CPU
  • Result: Packets distributed across CPUs in hardware
Pros: No CPU overhead, very fast
Cons: Requires multi-queue NIC
Solution 2: RPS (Receive Packet Steering) - Software
  • Single queue NIC
  • CPU receiving IRQ computes hash
  • Enqueues packet to target CPU’s backlog
  • Target CPU processes packet
Pros: Works with any NIC
Cons: Extra CPU overhead for steering
Solution 3: RFS (Receive Flow Steering) - Locality optimization
  • Extension of RPS
  • Tracks which CPU application is running on
  • Steers packets to that specific CPU
  • Result: Packet data in cache when application reads it
Example:
Without RFS:
  Packet → CPU0 (processes) → CPU2 (app blocked on recv)
  → Cache miss when app reads data

With RFS:
  Packet → CPU2 (processes + app runs here)
  → Data in cache, very fast!
Configuration:
# RSS (hardware, automatic if NIC supports)
ethtool -l eth0  # Show queue count

# RPS
echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus  # CPUs 0-3

# RFS
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
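The property RSS and RPS rely on is that the same flow always hashes to the same queue or CPU, which keeps a connection's packets in order. A minimal sketch of that idea follows; it uses FNV-1a as a stand-in hash (real NICs typically use a Toeplitz hash plus an indirection table), and `flow_hash` / `rx_queue_for_flow` are hypothetical names.

```c
#include <stdint.h>

/* FNV-1a over the 4-tuple: a stand-in for the NIC's real hash function. */
uint32_t flow_hash(uint32_t src_ip, uint32_t dst_ip,
                   uint16_t src_port, uint16_t dst_port)
{
    uint32_t words[3] = {
        src_ip, dst_ip, ((uint32_t)src_port << 16) | dst_port
    };
    uint32_t h = 2166136261u;             /* FNV offset basis */

    for (int i = 0; i < 3; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xffu;
            h *= 16777619u;               /* FNV prime */
        }
    }
    return h;
}

/* Map a flow to one of num_queues receive queues. */
unsigned rx_queue_for_flow(uint32_t src_ip, uint32_t dst_ip,
                           uint16_t src_port, uint16_t dst_port,
                           unsigned num_queues)
{
    return flow_hash(src_ip, dst_ip, src_port, dst_port) % num_queues;
}
```

Different flows spread across queues only statistically, so a single heavy flow still lands on one CPU.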
Connection Tracking: Kernel subsystem that tracks state of all connections (TCP, UDP, ICMP).
Purpose:
  • Enable stateful firewall rules
  • NAT (must remember translations)
  • Connection-based filtering
How It Works:
Client initiates connection:
  192.168.1.100:45000 → 8.8.8.8:80

Conntrack creates entry:
  ORIGINAL: 192.168.1.100:45000 → 8.8.8.8:80 [NEW]
  REPLY:    8.8.8.8:80 → 192.168.1.100:45000

Subsequent packets match entry (both directions!)

TCP states (as shown by conntrack -L):
  SYN_SENT → SYN_RECV → ESTABLISHED → FIN_WAIT → ... → TIME_WAIT → CLOSE → destroyed
Performance Issues:
  1. Hash table lookup: O(1) but still overhead on every packet
  2. Global lock: (Older kernels) serializes all conntrack operations
  3. Memory: Each connection consumes memory (~300 bytes)
  4. Hash collisions: Degrade to O(n) lookup
Symptoms:
# Table full
dmesg | grep conntrack
# nf_conntrack: table full, dropping packet

# High CPU in conntrack
perf top
#   12.34%  [kernel]  [k] nf_conntrack_in
Solutions:
# 1. Increase table size
sysctl -w net.netfilter.nf_conntrack_max=1048576
sysctl -w net.netfilter.nf_conntrack_buckets=262144

# 2. Decrease timeout for short-lived connections
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

# 3. Bypass conntrack for specific traffic (e.g., load balancer)
iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK
# WARNING: Breaks stateful rules for this traffic!

# 4. Use connection-less alternatives
# - eBPF/XDP for filtering (bypasses conntrack entirely)
# - Stateless firewall rules where possible
When to bypass: High-traffic stateless services (load balancers, DNS servers, CDN edges).
Problem: Traditional send/receive involves multiple memory copies.
Traditional Path (4 copies!):
Sending a file:

1. read(file_fd, buffer, size)
   Disk → [Kernel buffer] → [User buffer]
        DMA copy          CPU copy

2. write(socket_fd, buffer, size)
   [User buffer] → [Socket buffer] → NIC
   CPU copy         DMA copy

Total: 2 DMA + 2 CPU copies
Zero-Copy Techniques:
1. sendfile():
sendfile(socket_fd, file_fd, NULL, file_size);

// Kernel path:
Disk → [Kernel buffer] ────→ NIC
      DMA copy        DMA copy (if NIC supports)
                   or page remapping

// Avoids user-space copies entirely!
// Best for static file serving (nginx, Apache)
2. splice():
// Move data between FDs via pipe (zero-copy)
splice(file_fd, NULL, pipe_fd[1], NULL, size, 0);
splice(pipe_fd[0], NULL, socket_fd, NULL, size, 0);

// Kernel moves page references through the pipe buffer, no memcpy
3. MSG_ZEROCOPY:
send(socket_fd, buffer, size, MSG_ZEROCOPY);

// Kernel:
// 1. Pin user pages in memory (increment refcount)
// 2. NIC DMAs directly from user buffer
// 3. After NIC finishes, kernel notifies app via error queue
// 4. App can now reuse buffer

// Benefit: No copy for large sends
// Caveat: Buffer must stay valid, async notification needed
4. mmap() + write():
void *data = mmap(NULL, size, PROT_READ, MAP_SHARED, file_fd, 0);
write(socket_fd, data, size);

// May avoid one copy if kernel is smart
// But less efficient than sendfile()
Performance Impact:
| Method | CPU copies of payload | Use Case |
|---|---|---|
| Traditional | 2 | Small data, flexibility needed |
| sendfile() | 0-1 | Static file serving |
| splice() | 0 | Piping data between FDs |
| MSG_ZEROCOPY | 0 | Large sends (>10KB) |
When to use:
  • sendfile(): Web server serving files
  • splice(): Proxy/gateway (socket → socket)
  • MSG_ZEROCOPY: Bulk data transfer, streaming
Problem: Send too fast → network congestion → packet loss. Send too slow → underutilize bandwidth.
Goal: Find optimal sending rate (maximize throughput, minimize loss).
CUBIC (Linux default):
Algorithm:
  • Maintains congestion window (cwnd) = max packets in flight
  • On loss: cwnd = cwnd × β, with β = 0.7 (reduce by 30%)
  • Recovery: Grow cwnd using cubic function
cwnd(t) = C × (t - K)³ + W_max

Where:
  W_max = window before loss
  K = time to reach W_max
  C = scaling constant
Behavior:
cwnd

  │     ╱╲
  │    ╱  ╲    ← Cubic growth
  │   ╱    ╲╱
  │  ╱
  │ ╱
  └──────────→ time
      ↑ loss (reduce 30%)
Pros:
  • Aggressive growth after loss (good for high-bandwidth links)
  • Fair to other CUBIC flows
  • Simple, well-tested
Cons:
  • Treats loss as congestion signal (wrong for wireless)
  • Can cause bufferbloat (fills queues)
  • Slow convergence on very high BDP links
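The growth function becomes concrete once K is written out. The block below restates the formula and works one example, assuming the constants from the CUBIC specification (β = 0.7 to match the 30% reduction above, C = 0.4), with windows in segments and time in seconds.

```latex
W(t) = C\,(t - K)^3 + W_{\max}, \qquad
K = \sqrt[3]{\frac{W_{\max}\,(1 - \beta)}{C}}

% Check at t = 0, just after the loss:
W(0) = C\,(-K)^3 + W_{\max}
     = -W_{\max}\,(1 - \beta) + W_{\max}
     = \beta\,W_{\max}

% Worked example: \beta = 0.7,\; C = 0.4,\; W_{\max} = 100 \text{ segments}
K = \sqrt[3]{\frac{100 \times 0.3}{0.4}} = \sqrt[3]{75} \approx 4.22\ \text{s},
\qquad W(0) = 70
```

So after a loss at 100 segments the window restarts at 70, climbs quickly at first, flattens as it nears 100 around t ≈ 4.2 s, and only then starts probing beyond the old maximum.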

BBR (Bottleneck Bandwidth and RTT):
Philosophy: Model the network, don’t react to loss.
Measures:
  • BtlBw (bottleneck bandwidth): Max delivery rate observed
  • RTprop (round-trip propagation time): Min RTT observed
Pacing Rate = BtlBw × gain
Phases:
  1. STARTUP: Exponential growth to find BtlBw (like slow start)
  2. DRAIN: Drain queues created during startup
  3. PROBE_BW: Oscillate pacing rate around BtlBw (main phase)
  4. PROBE_RTT: Periodically reduce cwnd to re-measure RTprop
Key Insight: Packet loss doesn’t always mean congestion!
  • Wireless networks drop packets due to RF interference
  • BBR (v1) does not use loss as its primary signal; it focuses on measured bandwidth and RTT
Pros:
  • Higher throughput on lossy links (wireless, satellite)
  • Lower latency (doesn’t fill buffers)
  • Better on bufferbloat-prone networks
Cons:
  • Can be unfair to CUBIC flows (more aggressive)
  • Requires accurate RTT measurement
  • More complex
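How the two measurements turn into a sending rate can be sketched briefly. The gain values below are the BBRv1 PROBE_BW cycle (1.25, 0.75, then six phases at 1.0) and the BBRv1 cwnd gain of 2; the function names are hypothetical, and gains are expressed in percent to stay in integer arithmetic.

```c
#include <stdint.h>

/* BBRv1 PROBE_BW pacing gains in percent: probe above the estimate,
 * drain below it, then cruise at it for six phases. */
static const unsigned probe_bw_gain_pct[8] = {
    125, 75, 100, 100, 100, 100, 100, 100
};

/* Pacing rate for a given PROBE_BW phase: BtlBw x gain. */
uint64_t bbr_pacing_rate_bps(uint64_t btlbw_bps, unsigned phase)
{
    return btlbw_bps * probe_bw_gain_pct[phase % 8] / 100;
}

/* In-flight cap in bytes: cwnd gain (2) times the estimated BDP,
 * where BDP = BtlBw x RTprop. rtprop_us is in microseconds. */
uint64_t bbr_inflight_cap_bytes(uint64_t btlbw_bps, uint64_t rtprop_us)
{
    uint64_t bdp = (btlbw_bps / 8) * rtprop_us / 1000000ULL;
    return 2 * bdp;
}
```

The gains average to 1.0 over a full cycle, so the flow sends at the estimated bottleneck rate on average while still detecting newly available bandwidth.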

When to Use:
| Scenario | Best Choice |
|---|---|
| Data center (low latency, rare loss) | CUBIC |
| Internet (bufferbloat common) | BBR |
| Wireless/satellite (lossy) | BBR |
| Mixed traffic | BBR (lower latency helps all) |
Configuration:
# Check current
sysctl net.ipv4.tcp_congestion_control

# Change to BBR
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Per-connection (from app)
setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3);

Summary

Key Takeaways:
  1. sk_buff: Central data structure. Understanding headroom/tailroom is key to zero-copy optimizations.
  2. NAPI: Hybrid interrupt/polling model solves interrupt storm problem at high packet rates.
  3. XDP: Fastest packet processing path. Process/drop packets before sk_buff allocation using eBPF.
  4. RSS/RPS/RFS: Distribute packet processing across CPUs for scalability. RFS optimizes for cache locality.
  5. TCP Fast Path: Handles common case (in-order delivery) with minimal overhead. Slow path handles edge cases.
  6. Congestion Control: CUBIC (default) vs BBR (better on lossy/bufferbloat links). Understand trade-offs.
  7. Zero-Copy: sendfile(), splice(), MSG_ZEROCOPY eliminate expensive memory copies for large transfers.
  8. Conntrack: Essential for stateful firewalls but can be bottleneck. Bypass for high-traffic stateless services.
Performance Checklist:
  • Enable multi-queue NIC and RSS
  • Tune socket buffers for high BDP
  • Use XDP for packet filtering
  • Enable BBR for internet traffic
  • Increase conntrack table for high connection count
  • Use zero-copy for large data transfers


Interview Deep-Dive

Strong Answer:
The way I approach this is bottom-up, from hardware to application, checking for drops at each layer. Every layer in the network stack has its own counters, and the trick is finding where packets disappear.
  • NIC level: Start with ethtool -S eth0 | grep -i drop and ethtool -S eth0 | grep -i error. Look for rx_dropped, rx_missed_errors, and rx_fifo_errors. If the NIC is dropping, the ring buffer is full: packets arrive faster than the CPU drains them. Fix: increase ring buffer size with ethtool -G eth0 rx 4096, enable RSS (multiple hardware queues), or raise the NAPI budget.
  • Softirq/NAPI level: Check /proc/net/softnet_stat. Each line represents a CPU and the values are hexadecimal. Column 1 is total packets processed, column 2 is drops (backlog overflow), column 3 is time_squeeze (NAPI budget exhausted before all packets processed). If column 2 is non-zero, increase net.core.netdev_max_backlog. If column 3 is non-zero, increase net.core.netdev_budget or enable RPS to spread load across more CPUs.
  • Socket buffer level: Check ss -tm for per-socket memory usage (ss -s only prints summary counts). If the TCP receive buffer is full because the application is not calling recv() fast enough, the kernel drops incoming segments. Check netstat -s | grep "pruned" for TCP pruning events. Fix: increase socket buffers (net.core.rmem_max, net.ipv4.tcp_rmem) or fix the slow application.
  • Conntrack table: If using iptables/nftables, check conntrack -C for the current count and compare with net.netfilter.nf_conntrack_max. A full conntrack table drops new connections, and the only trace is an “nf_conntrack: table full, dropping packet” line in the kernel log. This is a classic production issue for high-connection-count services (100K+ concurrent connections). Fix: increase the max or bypass conntrack for stateless services with -j NOTRACK.
  • Application level: If none of the above show drops, the application is the bottleneck. Check with ss -tlnp whether the listen backlog is overflowing: for listening sockets Recv-Q is the current accept-queue length and Send-Q is the backlog limit, so Recv-Q sitting at Send-Q means connections are being dropped. Increase net.core.somaxconn and the application’s listen backlog.
In my experience, 80% of production packet drops are either conntrack table exhaustion or softirq budget starvation. Both are easy to miss: no application error logs, and no alerts unless you explicitly monitor those counters.
Follow-up: How does XDP help in this scenario compared to iptables?
XDP processes packets before sk_buff allocation, at the driver level. For dropping known-bad traffic (DDoS mitigation, IP blacklists), XDP can drop at 10-100x the rate of iptables because it avoids the entire network stack overhead: no sk_buff allocation, no conntrack lookup, no routing decision. Facebook’s Katran load balancer uses XDP to handle millions of packets per second per core. The trade-off is that XDP programs only see raw packet bytes (no parsed TCP state, no connection tracking), so they are limited to stateless or simple stateful decisions. For complex firewall rules that need connection state, you still need conntrack/netfilter.
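Monitoring the softirq counters mentioned above mostly comes down to parsing hexadecimal columns. A minimal sketch, assuming only the first three columns matter (later columns vary by kernel version); `struct softnet_cpu` and `parse_softnet_line` are hypothetical names.

```c
#include <stdio.h>

/* First three columns of one /proc/net/softnet_stat line (one per CPU). */
struct softnet_cpu {
    unsigned processed;     /* packets processed */
    unsigned dropped;       /* dropped because the backlog was full */
    unsigned time_squeeze;  /* polls cut short by budget or time limit */
};

/* Parse one line. Values in the file are hexadecimal.
 * Returns 0 on success, -1 if the line does not have three columns. */
int parse_softnet_line(const char *line, struct softnet_cpu *out)
{
    int fields = sscanf(line, "%x %x %x",
                        &out->processed, &out->dropped, &out->time_squeeze);
    return fields == 3 ? 0 : -1;
}
```

Alerting on any increase in the second or third column per CPU catches backlog overflow and budget starvation before they show up as application latency.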
Strong Answer:CUBIC and BBR take fundamentally different approaches to congestion control:
  • CUBIC (default on Linux): Loss-based. CUBIC increases the congestion window following a cubic function until it detects packet loss, then backs off. The key insight is that packet loss signals network congestion. CUBIC is aggressive at probing for bandwidth (the cubic growth curve approaches the previous maximum quickly after a loss event) and conservative after loss.
  • BBR (Bottleneck Bandwidth and RTT): Model-based. BBR continuously estimates two parameters: bottleneck bandwidth (max throughput the path can sustain) and minimum RTT. It then paces sending to match bottleneck bandwidth while keeping in-flight data to roughly bandwidth x RTT. BBR does not treat loss as a congestion signal — it treats it as noise.
When to switch to BBR:
  • Bufferbloat networks: On paths with large buffers (common in ISPs, cellular networks), CUBIC fills the buffers until loss occurs, adding hundreds of milliseconds of queuing delay. BBR avoids filling buffers by targeting the bottleneck rate, resulting in dramatically lower latency (often 10-50x reduction in queuing delay).
  • Lossy links (wireless, satellite): CUBIC interprets random packet loss (wireless interference) as congestion and backs off unnecessarily. BBR ignores occasional loss and maintains throughput.
  • Long-distance, high-BDP paths: BBR converges to fair share faster than CUBIC on high bandwidth-delay product links.
What could go wrong:
  • Fairness with CUBIC flows: BBR can be unfair when competing with CUBIC flows on the same bottleneck link. BBR tends to grab more bandwidth because it does not back off on loss. BBRv2 and BBRv3 address this somewhat, but fairness remains a concern in shared environments.
  • RTT measurement sensitivity: BBR’s performance depends on accurate minimum RTT estimation. If your service runs behind a load balancer that adds variable latency, BBR may over-estimate min_RTT and underperform.
  • Retransmission behavior: BBR can sometimes cause higher retransmission rates than CUBIC because it probes aggressively and does not immediately reduce sending on loss. Monitor netstat -s | grep retrans after switching.
My recommendation: enable BBR for internet-facing services (CDNs, APIs serving mobile clients) where bufferbloat is the primary latency enemy. Keep CUBIC for data center east-west traffic where links are reliable and bufferbloat is not a factor. You can set it per-connection with setsockopt(TCP_CONGESTION, "bbr"), so you do not need a system-wide switch.
Follow-up: How would you measure whether switching to BBR actually improved your service’s performance?
Run an A/B test measuring three metrics: p50/p95/p99 TCP RTT (from ss -ti or TCP tracepoints), retransmission rate (nstat TcpRetransSegs), and application-level latency. Run both algorithms simultaneously on different server pools serving the same traffic, rather than comparing before and after, so that time-of-day effects cancel out. If BBR shows lower RTT and equal or lower retransmit rate, it is a win. If retransmits spike, investigate whether the path has a strict policer (some ISPs police by drop rate, which confuses BBR).
Strong Answer: The sk_buff (socket buffer) is the central data structure representing a network packet as it traverses the Linux kernel network stack. Every packet, whether incoming or outgoing, is wrapped in an sk_buff from the moment it enters the kernel until it leaves.
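The per-connection switch mentioned in the recommendation can be sketched with Python's socket module. This assumes a Linux host where `socket.TCP_CONGESTION` is defined; the helper names are illustrative, and setting "bbr" only succeeds if the tcp_bbr module is loaded and the algorithm is permitted for the calling process, so the helper reports failure instead of raising.

```python
import socket


def get_congestion_control(sock):
    """Return the socket's congestion control name, or None if unsupported."""
    opt = getattr(socket, "TCP_CONGESTION", None)  # Linux-only constant
    if opt is None:
        return None
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, opt, 16)
    except OSError:
        return None
    return raw.split(b"\x00", 1)[0].decode()


def try_set_congestion_control(sock, name):
    """Try to select a congestion control algorithm; return True on success."""
    opt = getattr(socket, "TCP_CONGESTION", None)
    if opt is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, name.encode())
        return True
    except OSError:
        return False


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    before = get_congestion_control(s)
    switched = try_set_congestion_control(s, "bbr")
    print(f"default={before} switched_to_bbr={switched} now={get_congestion_control(s)}")
```

Setting the option on the listening socket before accept() is a common way to have accepted connections inherit it, which keeps the A/B split a matter of application configuration rather than a sysctl change.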
  • Key design elements: An sk_buff contains a head pointer (start of allocated buffer), data pointer (start of current layer’s header), tail pointer (end of data), and end pointer (end of allocated buffer). The space between head and data is “headroom” — reserved for prepending headers as the packet moves down the stack (e.g., adding an IP header, then an Ethernet header). The space between tail and end is “tailroom” for appending data. This design means that adding or removing headers is a pointer adjustment, not a memory copy.
  • Metadata: sk_buff also carries extensive metadata: timestamp, hash value (for RSS), mark (for iptables), VLAN tag, checksum offload status, GRO/GSO information, and pointers to the associated socket and network device. This metadata is what makes protocol processing efficient — each layer annotates the sk_buff rather than parsing from scratch.
  • Cloning: When a packet needs to go to multiple destinations (multicast, tapping), the kernel clones the sk_buff — creating a new metadata structure that points to the same data buffer. This avoids copying the packet payload.
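The four-pointer layout described above can be modeled in a few lines. This is a toy Python model of the offsets only, not the kernel struct; the method names mirror the kernel helpers skb_reserve, skb_put, skb_push, and skb_pull, and the buffer size and header lengths in the usage are assumptions chosen for illustration.

```python
class ToySkb:
    """Toy model of sk_buff's head/data/tail/end offsets."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.head = 0      # start of allocated buffer
        self.data = 0      # start of the current layer's header
        self.tail = 0      # end of packet data
        self.end = size    # end of allocated buffer

    def headroom(self) -> int:
        return self.data - self.head

    def tailroom(self) -> int:
        return self.end - self.tail

    def __len__(self) -> int:
        return self.tail - self.data

    def reserve(self, n: int) -> None:
        """Like skb_reserve: open up headroom in a still-empty buffer."""
        if len(self) != 0 or n > self.tailroom():
            raise ValueError("reserve needs an empty buffer with enough room")
        self.data += n
        self.tail += n

    def put(self, payload: bytes) -> None:
        """Like skb_put: append data at the tail."""
        if len(payload) > self.tailroom():
            raise ValueError("not enough tailroom")
        self.buf[self.tail:self.tail + len(payload)] = payload
        self.tail += len(payload)

    def push(self, header: bytes) -> None:
        """Like skb_push: prepend a header by moving data backwards."""
        if len(header) > self.headroom():
            raise ValueError("not enough headroom")
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header

    def pull(self, n: int) -> None:
        """Like skb_pull: strip n bytes from the front (receive path)."""
        if n > len(self):
            raise ValueError("pull past end of data")
        self.data += n

    def packet(self) -> bytes:
        return bytes(self.buf[self.data:self.tail])


skb = ToySkb(128)
skb.reserve(34)              # room for a 20-byte IP and a 14-byte Ethernet header
skb.put(b"payload")          # transport payload goes in first
skb.push(b"I" * 20)          # IP layer prepends its header
skb.push(b"E" * 14)          # link layer prepends its header
print(len(skb), skb.headroom(), skb.tailroom())
```

Note that the payload bytes are written once; each layer only moves the data offset and fills in its own header, which is the "pointer adjustment, not a memory copy" property.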
For redesigning at 100Gbps:
  • The problem: At 100Gbps with 64-byte packets, you need to process 148 million packets per second. Allocating and freeing an sk_buff per packet is impossibly expensive — each allocation involves slab allocator calls, cache-line bouncing, and metadata initialization.
  • Batch processing: Modern approaches (GRO — Generic Receive Offload) coalesce multiple packets into a single sk_buff before handing to upper layers, reducing per-packet overhead by 10-60x.
  • XDP’s approach: Bypass sk_buff entirely. XDP uses a minimal xdp_buff structure (exposed to BPF programs as the xdp_md context) that is little more than pointers into the DMA buffer. No allocation, no metadata bloat. This is why XDP can process packets at line rate on 100Gbps NICs.
  • AF_XDP: Provides a zero-copy path from NIC DMA buffers directly to user-space ring buffers, completely bypassing the sk_buff-based stack. Used by high-frequency trading firms and DPDK-like workloads.
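The 148 million packets per second figure follows from Ethernet framing arithmetic. A quick check, assuming standard Ethernet overhead of 20 bytes per frame on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap):

```python
ETH_WIRE_OVERHEAD_BYTES = 20  # preamble (7) + SFD (1) + inter-frame gap (12)


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frame rate on a link for a given frame size."""
    bits_on_wire = (frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


pps = packets_per_second(100e9, 64)
budget_ns = 1e9 / pps  # time available per packet on a single core
print(f"{pps / 1e6:.1f} Mpps, {budget_ns:.2f} ns per packet")
```

A per-packet budget of a few nanoseconds is less than a single cache miss to main memory, which is why per-packet allocation cannot work at this rate without batching or spreading load across many cores.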
If I were redesigning from scratch, I would make sk_buff a thin wrapper with a fixed-size inline metadata area (avoiding pointer chasing), support batch allocation/deallocation natively (like the page allocator’s bulk APIs), and make the XDP fast path the primary path rather than an afterthought.
Follow-up: What is GRO and how does it reduce per-packet overhead?
GRO (Generic Receive Offload) coalesces multiple incoming packets belonging to the same flow into a single large sk_buff before passing it up the stack. For example, if 10 TCP segments of 1500 bytes arrive for the same connection, GRO merges them into one 15000-byte sk_buff. The upper layers (TCP) then process one “packet” instead of ten. This reduces per-packet processing overhead (header parsing, socket lookup, lock acquisition) by an order of magnitude. The hardware equivalent is LRO (Large Receive Offload), which does the same thing in the NIC firmware. GRO is preferred because it is smarter about which packets can be safely coalesced (it respects TCP timestamps, ECN bits, etc.) while LRO can sometimes break things by coalescing packets that should not be merged.
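The coalescing idea in the GRO answer can be shown with a toy model. This sketch merges only back-to-back, in-order segments of the same flow and caps the merged size at 65535 bytes; real GRO tracks several flows at once and checks many more header fields, so treat the types and the cap here as assumptions for illustration.

```python
from dataclasses import dataclass

MAX_MERGED_BYTES = 65535  # assumed cap for this sketch


@dataclass
class Segment:
    flow: tuple   # (src_ip, src_port, dst_ip, dst_port)
    seq: int      # sequence number of the first payload byte
    length: int   # payload bytes


def coalesce(segments):
    """Merge consecutive in-order segments of the same flow."""
    merged = []
    for seg in segments:
        last = merged[-1] if merged else None
        contiguous = (
            last is not None
            and last.flow == seg.flow
            and last.seq + last.length == seg.seq
            and last.length + seg.length <= MAX_MERGED_BYTES
        )
        if contiguous:
            last.length += seg.length   # grow the existing super-segment
        else:
            merged.append(Segment(seg.flow, seg.seq, seg.length))
    return merged


flow = ("10.0.0.1", 40000, "10.0.0.2", 443)
burst = [Segment(flow, 1000 + i * 1500, 1500) for i in range(10)]
out = coalesce(burst)
print(len(out), out[0].length)
```

The upper layers then pay their per-packet costs once per merged segment rather than once per wire packet.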
Strong Answer Framework:
  1. SYN arrives at the listener. The packet hits the NIC, goes through softirq, IP, and lands in tcp_v4_rcv. The kernel does an __inet_lookup_listener to find a LISTEN-state socket bound to the destination IP and port. If found, it allocates a request socket (struct request_sock) — a lightweight half-open structure — and adds it to the SYN queue (accept_queue->syn_queue). The full socket is not allocated yet.
  2. SYN-ACK is sent back. The kernel constructs a SYN-ACK with its initial sequence number and the negotiated TCP options (MSS, window scale, SACK permitted, timestamps). It also chooses a starting sk_rcv_saddr if the listener was bound to INADDR_ANY.
  3. ACK arrives, completing the handshake. The kernel matches the ACK to the request socket, allocates a full struct sock via tcp_v4_syn_recv_sock, transitions it to ESTABLISHED, and moves it from the SYN queue to the accept queue (accept_queue->rskq_accept_head). Only at this point does the application’s accept() syscall return a new file descriptor.
  4. The listening socket has two queues. tcp_max_syn_backlog caps the SYN queue (half-open). The accept() queue (full-open, waiting to be picked up by the application) is capped at min(somaxconn, application_backlog). If the accept queue overflows because the app is slow to call accept(), the kernel drops the third ACK and the connection silently fails — the client thinks it succeeded, but the server has no socket.
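The accept-queue cap in step 4 and its failure mode under bursts can be sketched as a small simulation. The arrival and accept rates below are hypothetical inputs, and the one-millisecond time step is a simplification; the point is only to show how min(somaxconn, backlog) bounds the queue and how drops accumulate once the application falls behind.

```python
def effective_accept_backlog(somaxconn: int, listen_backlog: int) -> int:
    """The accept queue limit is the smaller of the sysctl and the listen() argument."""
    return min(somaxconn, listen_backlog)


def simulate_burst(arrivals_per_ms: int, burst_ms: int,
                   accepts_per_ms: int, backlog: int) -> int:
    """Return how many completed handshakes overflow the accept queue.

    Each millisecond the application first drains up to accepts_per_ms
    connections, then arrivals_per_ms new connections try to enter.
    """
    queue = 0
    dropped = 0
    for _ in range(burst_ms):
        queue = max(0, queue - accepts_per_ms)
        queue += arrivals_per_ms
        if queue > backlog:
            dropped += queue - backlog
            queue = backlog
    return dropped


small = effective_accept_backlog(somaxconn=128, listen_backlog=4096)
large = effective_accept_backlog(somaxconn=16384, listen_backlog=16384)
# 25 connections/ms for 200 ms is a 5K-connection burst; 10 accepts/ms is assumed.
print(small, simulate_burst(25, 200, 10, small))
print(large, simulate_burst(25, 200, 10, large))
```

Raising only the listen() argument has no effect while somaxconn is lower, which is why both values need to change together.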
Real-World Example: In 2016, a major SaaS provider’s API gateway started failing under bursty load with no errors in the application. netstat -s | grep -i "listen" showed thousands of ListenOverflows and ListenDrops per second. The cause: their accept queue was sized at the historical default of 128 (SOMAXCONN), but they were getting bursts of 5K connections in 200 ms. The fix was raising net.core.somaxconn to 16384 and updating the application’s listen() backlog to match. They also enabled tcp_abort_on_overflow=1 so that overflows became visible RSTs instead of silent drops, restoring deterministic error behavior.
Senior follow-up 1: How does TCP fast open (TFO) change the handshake, and what is the security tradeoff?
TFO lets the client send data in the SYN packet itself. On a first connection, the server returns a TFO cookie in the SYN-ACK; on subsequent connections, the client includes that cookie in the SYN with up to ~1460 bytes of data, and the server processes the data immediately rather than after the third ACK. This saves one full RTT for short transactions (DNS, small HTTP requests). The security tradeoff: an attacker who steals a cookie can amplify SYN-flood attacks against arbitrary destinations using the victim’s identity, and TFO state has to be carefully aged. Linux disables TFO server-side by default for this reason and requires explicit opt-in per socket.
Senior follow-up 2: What are SYN cookies, and what do you lose by enabling them?
SYN cookies are a defense against SYN floods. Instead of allocating a request socket and storing it in the SYN queue, the server encodes the connection state into the initial sequence number of the SYN-ACK using a cryptographic hash of the four-tuple plus a secret. The server keeps no per-connection memory at all until the third ACK arrives, at which point it validates the cookie in the ACK and reconstructs the connection. The tradeoff: TCP options that are usually negotiated in the SYN/SYN-ACK (window scale, SACK permitted, timestamps) cannot be fully preserved through the cookie, so connections established under SYN flood may have suboptimal performance. Linux only activates SYN cookies when the SYN queue is full (tcp_syncookies=1, default), so normal traffic is unaffected.
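A toy version of the cookie construction helps show why no per-connection memory is needed. This sketch follows the classic layout from Bernstein's description (5 bits of time counter, 3 bits of MSS index, 24 bits of keyed hash) and uses HMAC-SHA256 purely for convenience; it is not Linux's actual bit layout or hash, and the secret, MSS table, and max_age are assumptions.

```python
import hashlib
import hmac
import struct

MSS_TABLE = [536, 1300, 1440, 1460]   # assumed bins for this sketch
SECRET = b"demo-secret"               # hypothetical; real systems use random, rotated keys


def _mac24(four_tuple, counter: int) -> int:
    """24-bit keyed hash over the connection identity and the time counter."""
    msg = repr(four_tuple).encode() + struct.pack("!Q", counter)
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big")


def make_cookie(four_tuple, mss: int, counter: int) -> int:
    """Build the 32-bit value the server would use as its initial sequence number."""
    mss_idx = max((i for i, v in enumerate(MSS_TABLE) if v <= mss), default=0)
    return ((counter % 32) << 27) | (mss_idx << 24) | _mac24(four_tuple, counter)


def check_cookie(four_tuple, cookie: int, now_counter: int, max_age: int = 2):
    """Validate a cookie recovered from the ACK (ack number minus 1).

    Returns the MSS bin if the cookie is fresh and authentic, else None.
    """
    t = cookie >> 27
    mss_idx = (cookie >> 24) & 0x7
    mac = cookie & 0xFFFFFF
    if mss_idx >= len(MSS_TABLE):
        return None
    for age in range(max_age + 1):
        counter = now_counter - age
        if counter >= 0 and counter % 32 == t and _mac24(four_tuple, counter) == mac:
            return MSS_TABLE[mss_idx]
    return None


conn = ("198.51.100.7", 51000, "203.0.113.10", 443)
isn = make_cookie(conn, mss=1460, counter=100)
print(check_cookie(conn, isn, now_counter=101))   # fresh cookie: MSS bin recovered
print(check_cookie(conn, isn, now_counter=105))   # too old: rejected
```

Everything the server needs to rebuild the connection is either in the returning ACK itself (the four-tuple) or packed into those 32 bits, which is also why there is so little room for negotiated options.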
Senior follow-up 3: Why does the listen backlog need both a SYN queue and an accept queue separately?
Because they protect against different failures. The SYN queue absorbs bursts of half-open connections during the handshake; it must be large enough to hold every in-flight handshake even under flood. The accept queue absorbs the gap between connections being ready and the application calling accept(); it must be large enough to hide application stalls (GC pauses, slow startup). Sizing them together would force a tradeoff: raise the limit to handle floods, and you also raise the time the app can stall before connections fail. Splitting them lets you size each for its purpose. Linux merged the two limits historically (pre-2.2) and split them precisely because operators kept hitting one cap or the other.
Common Wrong Answers:
  • “Three packets, that is the handshake.” Correct in the abstract, but it dodges every implementation detail an interviewer cares about: the two queues, the request socket, the cookie path, where the backlog parameters apply.
  • “The connection is fully open after the SYN-ACK.” No. The server moves to SYN_RECEIVED after sending SYN-ACK and only to ESTABLISHED on the third ACK. Treating SYN-ACK as completion misses where overflow drops happen.
  • “accept() blocks until a SYN arrives.” accept() blocks until a fully-established socket lands in the accept queue. SYNs alone never wake accept().
Further Reading:
  • Linux source: net/ipv4/tcp_input.c (tcp_conn_request), net/ipv4/inet_connection_sock.c (inet_csk_accept)
  • Cloudflare blog: “SYN packet handling in the wild” (2018) — production breakdown of every queue and counter
  • “TCP/IP Illustrated, Volume 2: The Implementation” by Wright and Stevens, chapters 28-29
Strong Answer Framework:
  1. Establish baseline and current rate. Pull nstat -az TcpRetransSegs TcpOutSegs over a 10-second window. The retransmit ratio is TcpRetransSegs / TcpOutSegs. Healthy data centers see under 0.01 percent; over the public internet, 0.1-0.5 percent is normal; above 1 percent is a real problem. Compare against your historical baseline — “10x normal” matters a lot more than the absolute number.
  2. Slice by connection. ss -tin shows per-socket retransmit counts (retrans: field) and the inferred congestion state. Sort by retransmits to find the worst offenders. Are retransmits concentrated on one peer, one subnet, one congestion control variant? That tells you whether it is a path problem or a host problem.
  3. Use tracepoints to see why. bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @[args->skaddr] = count(); }' counts retransmits per socket in real time. Even better, attach to tcp_retransmit_skb and dump the cause — RTO timeout vs fast retransmit vs tail loss probe. The mix matters: lots of fast retransmits suggest reordering or random loss; lots of RTO-driven retransmits suggest sustained congestion or routing flaps.
  4. Capture and read. tcpdump -i any -w trace.pcap 'tcp and host suspicious_peer', then load it in Wireshark with the “Expert” view. The cleanest signals are duplicate ACKs (loss in the forward direction), SACK blocks (out-of-order arrival, often from ECMP rehashing), and zero-window updates (receiver overloaded).
  5. Tune with intent, not by spraying knobs. If the path is lossy and you are stuck on CUBIC, evaluate BBR. If retransmissions cluster on RTO timeouts of exactly 200ms, you are hitting the minimum RTO; lower it (per route with the rto_min option of ip route, or with the net.ipv4.tcp_rto_min_us sysctl on recent kernels), but only after confirming the path RTT is well below it. If you see persistent reordering due to multipath, enable tcp_recovery (RACK), which uses time-based loss detection rather than dup-ACK counting.
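Step 1's ratio is simple enough to script. The sketch below takes two snapshots of the counters (for example, parsed from nstat output 10 seconds apart) and classifies the result using the thresholds quoted in step 1; the label for the range between "normal" and "problem" is an assumption added to fill the gap, and the snapshot values are made up.

```python
def retransmit_ratio(before: dict, after: dict) -> float:
    """Fraction of segments sent in the window that were retransmissions."""
    sent = after["TcpOutSegs"] - before["TcpOutSegs"]
    retrans = after["TcpRetransSegs"] - before["TcpRetransSegs"]
    return retrans / sent if sent > 0 else 0.0


def classify(ratio: float) -> str:
    """Map a ratio to the rough bands used in step 1."""
    pct = ratio * 100
    if pct > 1.0:
        return "problem"
    if pct > 0.5:
        return "elevated"            # assumed label for the 0.5-1 percent gap
    if pct >= 0.01:
        return "normal for internet paths, high for a data center"
    return "healthy data center"


# Hypothetical counter snapshots taken 10 seconds apart.
t0 = {"TcpOutSegs": 1_000_000, "TcpRetransSegs": 100}
t1 = {"TcpOutSegs": 2_000_000, "TcpRetransSegs": 12_100}

ratio = retransmit_ratio(t0, t1)
print(f"{ratio * 100:.2f}% -> {classify(ratio)}")
```

Storing the ratio per window, not just the label, is what makes the "10x normal" comparison against a historical baseline possible.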
Real-World Example: In 2019, Slack documented a multi-hour latency incident where retransmits on east-west traffic between US-East AZs jumped 20x. Root cause was an AWS networking change that introduced asymmetric ECMP rehashing, causing TCP segments to take different paths and arrive out of order. Fast retransmits fired aggressively because three duplicate ACKs were a normal occurrence. Mitigation was raising net.ipv4.tcp_reordering to tolerate more out-of-order segments before declaring loss, and enabling RACK for time-based recovery. The longer-term fix came from the cloud provider’s side, but the host-level tuning carried them through.
Senior follow-up 1: How do you distinguish a forward-path loss from a reverse-path ACK loss using only host-side counters?
Forward-path loss shows as duplicate ACKs from the receiver and triggers fast retransmit. Reverse-path ACK loss shows as RTO-driven retransmits with few preceding duplicate ACKs: the sender retransmits because the RTO fired, but the data actually arrived; the ACKs were just lost. Counter signature: in nstat, TcpExtTCPFastRetrans increases for forward loss, while TcpExtTCPTimeouts increases (without much fast retransmit) for ACK loss. ACK loss is rarer but worth recognizing because the cure (turn on selective ACK with tcp_sack=1, which is default but worth verifying) is different from forward-loss tuning.
Senior follow-up 2: Why might tcp_low_latency actually hurt latency on some workloads?
tcp_low_latency (deprecated in modern kernels, but still relevant for older systems) disables prequeue processing: packets are processed in softirq context immediately rather than queued for the receiving process to drain. For a single connection on a single CPU it sounds like a win. The catch: under high softirq load, tying processing to the receive interrupt can starve other work and increase tail latency systemwide. Modern kernels removed the prequeue entirely, so the knob is a no-op. The lesson: latency knobs that look local often have systemic effects, which is why most “tcp_low_latency”-style flags get deprecated once the kernel team has data.
Senior follow-up 3: How would you use eBPF to attribute retransmissions to specific application code paths?
Attach a kprobe on tcp_retransmit_skb to capture the socket and the kernel stack. To find which userspace thread owns the socket, do not rely on bpf_get_current_pid_tgid and bpf_get_current_comm inside that probe alone: retransmits often fire from timer or softirq context, where the current task is whatever happened to be running on that CPU. Record the owner when the connection is created (for example, in a probe on connect or accept, keyed by socket pointer) and look it up at retransmit time. For request-attribution, correlate the socket’s source port with a userspace ringbuffer that the application writes (request_id, source_port) tuples into when it opens a connection. This lets you say “retransmissions are concentrated on the search service’s outbound connections to the recommendations service from 14:02-14:07,” which is what you actually need to fix the problem. Tools like tcpretrans from BCC and retsnoop automate variants of this.
Common Wrong Answers:
  • “Bump the TCP window.” A bigger window does not help retransmits — it makes them worse if the path is lossy because more in-flight segments mean more losses per RTO event.
  • “Switch to UDP.” UDP does not have retransmits, but it also does not have ordering, congestion control, or reliability — you have just moved the problem out of the kernel into your application code, which is rarely simpler.
  • “Enable jumbo frames.” Larger MSS reduces per-packet overhead but increases the cost of each lost packet (you retransmit 9000 bytes instead of 1500). It is a throughput optimization, not a retransmit fix.
Further Reading:
  • Brendan Gregg, “TCP Retransmits” tools and methodology — the BCC tcpretrans writeup
  • Linux source: net/ipv4/tcp_input.c (tcp_fastretrans_alert), net/ipv4/tcp_recovery.c (RACK)
  • “Making Linux TCP Fast” (Cheng and Cardwell, Netdev 1.2, 2016): covers BBR alongside a clear tour of Linux TCP loss detection and recovery
Strong Answer Framework:
  1. Define the threat model. A SYN flood sends SYN packets with spoofed source addresses faster than the server can respond or evict half-open state. Without defenses, the SYN queue fills, legitimate clients cannot establish connections, and you have a denial of service. The defining property is statelessness on the attacker’s side — the attacker burns very little CPU per SYN.
  2. First line: large SYN backlogs and fast eviction. tcp_max_syn_backlog controls how many half-opens can sit in the queue. Raising it from its default (as low as 128 on small-memory machines; the kernel scales it with RAM) to 65536 buys you headroom. tcp_synack_retries controls how many times the kernel retries the SYN-ACK. With a 1-second initial RTO that doubles on each retry, lowering this from 5 to 2 cuts the time half-opens stay in the queue from ~63 seconds to ~7 seconds, multiplying effective capacity.
  3. Second line: SYN cookies. When the SYN queue is full, the kernel switches to cookie mode. It encodes the four-tuple, MSS, and a time counter into the initial sequence number using a keyed hash (SipHash in current kernels, SHA-1 in older ones), sends the SYN-ACK with that ISN, and forgets the connection entirely. When the third ACK arrives, the kernel validates the ack number minus 1 against its hash and, if valid, reconstructs the connection state. The server is now stateless too, which neutralizes the asymmetry.
  4. Third line: drop earlier in the stack. SYN cookies still process every SYN through softirq and the TCP stack — expensive at packet-flood rates. XDP programs run in the driver before sk_buff allocation and can drop spoofed traffic at 10-100x the packet rate. Cloudflare and Meta both built their DDoS mitigation around this principle: identify obvious-bad traffic in XDP, let everything else through to the normal stack with cookies as a backstop.
  5. Fourth line: upstream filtering. Anti-spoofing filters at ISPs (BCP38) and scrubbing centers handle the volumetric portion. By the time the traffic reaches your edge, it should be substantially cleaner. This is non-negotiable for any internet-facing service above small scale.
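The effect of tcp_synack_retries in the first line of defense is a short geometric sum. The sketch below assumes a 1-second initial retransmission timeout that doubles after each SYN-ACK; a different initial RTO scales the totals proportionally, and the "capacity gain" framing is a simplification that ignores everything except queue residency time.

```python
def half_open_lifetime_s(synack_retries: int, initial_rto_s: float = 1.0) -> float:
    """Seconds an unanswered half-open entry survives before being dropped.

    The kernel waits rto, 2*rto, 4*rto, ... after the initial SYN-ACK and
    each retry, so the total is rto * (2**(retries + 1) - 1).
    """
    return sum(initial_rto_s * 2 ** i for i in range(synack_retries + 1))


default_lifetime = half_open_lifetime_s(5)
tuned_lifetime = half_open_lifetime_s(2)
print(f"retries=5: {default_lifetime:.0f}s, retries=2: {tuned_lifetime:.0f}s, "
      f"capacity gain: {default_lifetime / tuned_lifetime:.0f}x")
```

For a fixed queue size, the spoofed-SYN rate the queue can absorb is inversely proportional to this lifetime, which is why the retry count matters as much as the queue length.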
Real-World Example: In February 2020, AWS mitigated a 2.3 Tbps DDoS attack — the largest publicly disclosed at the time — much of which was reflective UDP traffic, but with a substantial SYN-flood component aimed at TCP services. The mitigation stack involved BGP flowspec filtering at network edges, Shield Advanced rate limits, and host-level SYN cookies as a final backstop. The public writeup credits multi-layer defense: no single mechanism handled the entire flood; each layer reduced the rate the next layer had to handle.
Senior follow-up 1: What information does a SYN cookie lose, and why does it matter?
The cookie encodes a hash over the source/destination addresses and ports, a coarse MSS class (an index into a small table of common MSS values), and a time counter. It has no room for the full TCP options block: window scaling, SACK permitted, and the timestamp option do not fit in the sequence number. Linux recovers window scale and SACK only when the client sent the timestamp option, by packing them into the low bits of the timestamp value it echoes; otherwise the connection comes up without window scaling or SACK, which caps throughput on high-BDP connections. For an attack, this is fine: you are protecting against denial-of-service, not optimizing performance. For a misconfigured cookie-always-on system handling legitimate traffic, you can leave a large share of throughput on the table without realizing it. This is why tcp_syncookies=1 (cookies only when queue is full) is correct, while tcp_syncookies=2 (always cookies) is not.
Senior follow-up 2: How do XDP-based DDoS mitigations decide what to drop without state?
They use stateless heuristics computed per-packet: drop SYNs with mismatched TCP flags, drop packets with source IPs in known spoofed ranges (bogons, your own subnets), drop packets that fail simple geographic or rate-limit fingerprints. For state that does fit, XDP can use eBPF maps: a per-source-IP token bucket implemented as a BPF_MAP_TYPE_LRU_HASH lets you rate-limit cheaply. What XDP cannot do is track multi-packet TCP state, so the line you draw is “stateless filtering in XDP, stateful logic in the kernel TCP stack with cookies.” Cloudflare’s open-sourced bpf-tools and Meta’s Katran are real implementations of this design.
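The per-source token bucket with LRU eviction can be modeled outside the kernel to see the logic. This is a Python stand-in for what an XDP program would do with a BPF_MAP_TYPE_LRU_HASH, with an OrderedDict playing the role of the map; the class name, rate, burst, and table size are assumptions, and a real implementation would live in restricted C with integer arithmetic.

```python
from collections import OrderedDict


class LruTokenBuckets:
    """Per-source-IP token buckets with bounded memory via LRU eviction."""

    def __init__(self, rate_per_s: float, burst: float, max_entries: int):
        self.rate = rate_per_s
        self.burst = burst
        self.max_entries = max_entries
        self.buckets = OrderedDict()   # src_ip -> (tokens, last_seen_s)

    def allow(self, src_ip: str, now_s: float) -> bool:
        """Return True to pass the packet, False to drop it."""
        # Unknown sources start with a full bucket.
        tokens, last = self.buckets.pop(src_ip, (self.burst, now_s))
        # Refill for the time elapsed since this source was last seen.
        tokens = min(self.burst, tokens + (now_s - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        # Re-inserting marks the entry as most recently used.
        self.buckets[src_ip] = (tokens, now_s)
        if len(self.buckets) > self.max_entries:
            self.buckets.popitem(last=False)   # evict least recently used
        return allowed


limiter = LruTokenBuckets(rate_per_s=1.0, burst=2.0, max_entries=2)
decisions = [limiter.allow("192.0.2.1", 0.0) for _ in range(3)]
print(decisions, limiter.allow("192.0.2.1", 1.0))
```

The LRU bound is what keeps a spoofed-source flood from exhausting memory: new sources push out idle ones instead of growing the table.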
Senior follow-up 3: What is the difference between tcp_syncookies=1 and tcp_syncookies=2, and which would you ever set to 2?
Mode 1 enables cookies as a fallback when the SYN queue overflows; normal traffic uses the regular handshake path. Mode 2 forces cookies on every SYN regardless of queue state. The only time mode 2 is reasonable is on a host that has been chronically under attack and where you have measured that the regular path’s SYN queue keeps filling faster than it drains; in practice, that means the attack is sustained enough that you are paying the cookie cost continuously anyway. Even then, you have probably already moved DDoS handling to a separate scrubbing layer and the host should not see the volume. I have not deployed mode 2 in production and would treat anyone who has as either a niche operator or someone who has not noticed the throughput cost.
Common Wrong Answers:
  • “Just enable SYN cookies, problem solved.” Cookies are necessary but insufficient. They consume CPU per SYN and lose TCP options. At packet-flood scale you need filtering before the kernel even allocates the sk_buff.
  • “Increase the backlog to a huge number.” A larger queue defers the failure but does not prevent it. If the attack rate exceeds your eviction rate, any finite queue overflows.
  • “Block UDP and you stop SYN floods.” SYN floods are TCP. You are confusing reflection attacks (often UDP) with floods (any protocol).
Further Reading:
  • Cloudflare blog: “SYN packet handling in the wild” (Marek Majkowski) and “How to drop 10 million packets per second” — the canonical XDP-DDoS writeup
  • DJ Bernstein’s original SYN cookie page (cr.yp.to/syncookies.html) — still the clearest explanation of the cryptographic construction
  • Linux source: net/ipv4/syncookies.c, particularly cookie_v4_init_sequence
  • “DDoS Attack Mitigation” (Cardwell, Cheng, Yang) — Google’s writeup on integrating SYN protection with congestion control
