The networking subsystem is one of the most complex and performance-critical parts of the Linux kernel. Understanding how packets flow from the Network Interface Card (NIC) through kernel layers to application sockets is essential for building high-performance networked systems.
The struct sk_buff (socket buffer) is the heart of the Linux networking stack. It represents a network packet as it travels through the kernel.

Think of an sk_buff like a shipping envelope with adjustable flaps. As a packet moves down the stack (application to wire), each layer adds a header by extending the front of the valid data into reserved headroom (skb_push). As it moves up (wire to application), each layer strips a header by advancing the data pointer past it (skb_pull). The clever part: the payload never moves in memory; only the pointers change. This is a large part of what keeps Linux networking fast even at millions of packets per second.
```c
// Simplified from include/linux/skbuff.h
struct sk_buff {
    // Linked list management
    struct sk_buff *next;
    struct sk_buff *prev;

    // Socket association
    struct sock *sk;

    // Timestamps
    ktime_t tstamp;

    // Network device
    struct net_device *dev;

    // Data pointers (THE KEY TO UNDERSTANDING NETWORKING)
    unsigned char *head;     // Start of allocated buffer
    unsigned char *data;     // Start of valid data
    sk_buff_data_t tail;     // End of valid data
    sk_buff_data_t end;      // End of allocated buffer

    // Buffer size information
    unsigned int len;        // Total data length (linear + fragments)
    unsigned int data_len;   // Length of fragmented (non-linear) data only

    // Protocol headers (updated as packet moves up/down stack)
    __u16 transport_header;  // Offset to transport header (TCP/UDP)
    __u16 network_header;    // Offset to network header (IP)
    __u16 mac_header;        // Offset to MAC header (Ethernet)

    // Metadata
    __u32 priority;
    __u8 ip_summed;          // Checksum status
    __u8 cloned:1;           // Is this a clone?
    __u8 nohdr:1;

    // Reference counting
    refcount_t users;

    // Fragmented data (for large packets)
    unsigned int truesize;   // Total allocated size

    // Shared info (at end of buffer). Shown as members for illustration:
    // in the real kernel, dataref lives inside struct skb_shared_info,
    // which sits at skb->end and is reached with skb_shinfo(skb).
    atomic_t dataref;
    struct skb_shared_info *shinfo;
};
```
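To make the pointer arithmetic concrete, here is a minimal userspace sketch of the four-pointer layout. It is not kernel code: `toy_skb` and the `toy_*` helpers are hypothetical names that only mimic what `skb_reserve()`, `skb_put()`, `skb_push()`, and `skb_pull()` do to `data` and `tail`, and the 256-byte buffer is an assumption chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the sk_buff data pointers (userspace illustration only).
 * No bounds checks are done here; the real helpers guard against overruns. */
struct toy_skb {
    unsigned char buf[256];
    unsigned char *head;  /* start of allocated buffer */
    unsigned char *data;  /* start of valid data */
    unsigned char *tail;  /* end of valid data */
    unsigned char *end;   /* end of allocated buffer */
};

static void toy_init(struct toy_skb *s)
{
    s->head = s->data = s->tail = s->buf;
    s->end = s->buf + sizeof(s->buf);
}

/* Like skb_reserve(): leave headroom before any data is added. */
static void toy_reserve(struct toy_skb *s, size_t n)
{
    s->data += n;
    s->tail += n;
}

/* Like skb_put(): grow valid data at the tail, return where to write. */
static unsigned char *toy_put(struct toy_skb *s, size_t n)
{
    unsigned char *old_tail = s->tail;
    s->tail += n;
    return old_tail;
}

/* Append payload bytes at the tail. */
static void toy_append(struct toy_skb *s, const void *src, size_t n)
{
    memcpy(toy_put(s, n), src, n);
}

/* Like skb_push(): prepend a header by moving data back into headroom. */
static unsigned char *toy_push(struct toy_skb *s, size_t n)
{
    s->data -= n;
    return s->data;
}

/* Like skb_pull(): strip a header by moving data forward. */
static unsigned char *toy_pull(struct toy_skb *s, size_t n)
{
    s->data += n;
    return s->data;
}

static size_t toy_len(const struct toy_skb *s)      { return (size_t)(s->tail - s->data); }
static size_t toy_headroom(const struct toy_skb *s) { return (size_t)(s->data - s->head); }
```

Reserving 54 bytes, appending a 5-byte payload, and then pushing 20 + 20 + 14 bytes of headers uses up the headroom exactly, while the payload bytes stay at the same address throughout.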
Problem: Copying large packets is expensive (memory bandwidth limited). On a 100 Gbps NIC, the CPU would spend all its time in memcpy() if every packet required a full copy. Zero-copy techniques are what make high-speed networking possible on commodity hardware.

Solution 1: skb_clone() - Clone sk_buff structure, share data
```c
struct sk_buff *clone = skb_clone(original_skb, GFP_ATOMIC);
```

```
Original:        Clone:
┌──────────┐    ┌──────────┐
│ sk_buff  │    │ sk_buff  │
│ struct   │    │ struct   │
│          │    │          │
│ head ────┼───>│ head ────┼──┐
│ data ────┼───>│ data ────┼──┤ Points to SAME buffer
│ tail     │    │ tail     │  │
└──────────┘    └──────────┘  │
                              ↓
              ┌────────────────────┐
              │ Shared Data Buffer │
              │ (refcount = 2)     │
              └────────────────────┘

Use case: Packet sniffing (tcpdump)
- Clone packet for sniffer
- Original continues up the stack
- No data copy!
```
Solution 2: Fragmented Data (skb_frag) - For large packets
```c
struct skb_shared_info {
    unsigned char nr_frags;           // Number of fragments
    skb_frag_t frags[MAX_SKB_FRAGS];  // Fragment array
};

typedef struct skb_frag_struct {
    struct page *page;   // Points to physical page
    __u16 page_offset;   // Offset within page
    __u16 size;          // Fragment size
} skb_frag_t;
```

```
Layout for 9000-byte packet (Jumbo frame):
┌──────────────────────────────────────────────┐
│ sk_buff                                      │
│   head → [headers: 54 bytes]                 │
│   data → [first chunk of data]               │
└──┬───────────────────────────────────────────┘
   │
   └→ skb_shared_info
        ├→ frag[0] → page A (1500 bytes)
        ├→ frag[1] → page B (1500 bytes)
        ├→ frag[2] → page C (1500 bytes)
        └→ frag[3] → page D (remainder)

NIC DMAs directly into these pages (zero-copy receive!)
```
```c
// skb->len is the TOTAL length (linear + fragments);
// skb_headlen() gives only the linear part (skb->len - skb->data_len)
unsigned int total_len = skb->len;
unsigned int linear_len = skb_headlen(skb);

// Check if data is linear
if (skb_is_nonlinear(skb)) {
    // Data is fragmented: walk the page fragments
    int i;

    for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
        skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

        // Process fragment
        void *frag_addr = skb_frag_address(frag);
        unsigned int frag_size = skb_frag_size(frag);
    }
}

// Linearize fragmented skb (expensive: may allocate and copy)
if (skb_linearize(skb) != 0) {
    // Failed to linearize
    kfree_skb(skb);
    return -ENOMEM;
}
```
```
1. Packet arrives at NIC
   └→ NIC DMAs packet to ring buffer
   └→ NIC raises hardware IRQ

2. CPU receives interrupt
   └→ Context switch (save current task)
   └→ Jump to interrupt handler

3. Interrupt handler (top half)
   └→ Disable interrupts (critical section)
   └→ Allocate sk_buff
   └→ Copy packet data from ring buffer to sk_buff
   └→ Queue sk_buff to network stack
   └→ Re-enable interrupts
   └→ Return from interrupt

4. Network stack processes packet (bottom half)
   └→ IP layer processing
   └→ TCP layer processing
   └→ Socket layer delivery

Problem: At 1Gbps+ speeds, 100,000+ interrupts/sec
Result: 100% CPU time in interrupt handling (interrupt storm)
```
Solution: Hybrid polling/interrupt model. The analogy: imagine a doorbell that rings every time a letter arrives. If you get one letter per hour, the doorbell is helpful. If you get 1,000 letters per second, you would never leave the door. NAPI’s approach: after the first ring, disable the doorbell and check the mailbox in batches until it is empty, then re-enable the doorbell.
```
Receive Flow with NAPI:
────────────────────────

Phase 1: Low Traffic (Interrupt Mode)
┌──────────────────────────────────────────┐
│ Packet arrives                           │
│      ↓                                   │
│ NIC raises IRQ                           │
│      ↓                                   │
│ Driver IRQ handler:                      │
│   • Disable NIC interrupts               │
│   • Schedule NAPI poll (add to poll_list)│
│   • Return (very fast!)                  │
│      ↓                                   │
│ Softirq NET_RX_SOFTIRQ triggers          │
│      ↓                                   │
│ net_rx_action():                         │
│   • Call driver->poll()                  │
│   • Process up to budget packets (64)    │
│   • If ring empty: re-enable IRQ         │
└──────────────────────────────────────────┘

Phase 2: High Traffic (Polling Mode)
┌──────────────────────────────────────────┐
│ poll() processes 64 packets              │
│      ↓                                   │
│ More packets in ring buffer?             │
│   • YES: Stay in polling mode            │
│   • Keep calling poll() until ring empty │
│   • No interrupts needed!                │
│      ↓                                   │
│ Eventually ring empties                  │
│   • Re-enable NIC interrupts             │
│   • Wait for next packet                 │
└──────────────────────────────────────────┘
```
Kernel Code (simplified from net/core/dev.c):
```c
// Driver's NAPI poll function.
// ring_buffer_empty(), fetch_packet_from_ring() and the *_nic_interrupts()
// helpers are driver-specific placeholders.
static int my_driver_poll(struct napi_struct *napi, int budget)
{
    struct my_adapter *adapter = container_of(napi, struct my_adapter, napi);
    int work_done = 0;

    while (work_done < budget) {
        struct sk_buff *skb;

        // Check if ring buffer has packets
        if (ring_buffer_empty(adapter))
            break;

        // Fetch packet from ring buffer
        skb = fetch_packet_from_ring(adapter);

        // Set metadata (eth_type_trans() also sets skb->dev)
        skb->dev = adapter->netdev;
        skb->protocol = eth_type_trans(skb, adapter->netdev);

        // Pass to network stack
        netif_receive_skb(skb);
        work_done++;
    }

    // If we processed less than budget, ring is empty
    if (work_done < budget) {
        // Exit polling mode, then undo what the IRQ handler masked
        if (napi_complete_done(napi, work_done))
            enable_nic_interrupts(adapter);
    }

    return work_done;
}

// IRQ handler (top half)
static irqreturn_t my_driver_irq_handler(int irq, void *data)
{
    struct my_adapter *adapter = data;

    // Disable NIC interrupts
    disable_nic_interrupts(adapter);

    // Schedule NAPI polling
    napi_schedule(&adapter->napi);

    return IRQ_HANDLED;
}

// Network core: softirq handler
static void net_rx_action(struct softirq_action *h)
{
    struct list_head *poll_list = this_cpu_ptr(&softnet_data.poll_list);
    int budget = netdev_budget;  // Default: 300
    // netdev_budget_usecs is in microseconds, so convert before adding to jiffies
    unsigned long time_limit = jiffies + usecs_to_jiffies(netdev_budget_usecs);

    while (!list_empty(poll_list)) {
        struct napi_struct *napi = list_first_entry(poll_list, ...);
        // Each device polls with its own weight (default 64), not the global budget
        int work = napi->poll(napi, napi->weight);

        budget -= work;
        if (budget <= 0 || time_after(jiffies, time_limit))
            break;  // Yield CPU
    }
}
```
Benefits:
Low latency under low load: Interrupts still used
High throughput under high load: Polling avoids interrupt overhead
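The interrupt-then-poll cycle can be simulated in a few lines of userspace C. This is only a sketch under stated assumptions: `toy_nic`, `toy_irq()`, `toy_poll()`, and `toy_receive_burst()` are hypothetical names, a plain counter stands in for the RX ring, and the whole burst is assumed to be in the ring before polling starts.

```c
#include <stdbool.h>

/* Toy NIC state: a packet counter stands in for the RX ring. */
struct toy_nic {
    int  ring_pending;  /* packets waiting in the ring */
    bool irq_enabled;   /* true while in interrupt mode */
    int  irq_count;     /* hardware interrupts taken so far */
};

/* IRQ handler: mask further interrupts and hand off to polling. */
static void toy_irq(struct toy_nic *nic)
{
    nic->irq_enabled = false;
    nic->irq_count++;
}

/* poll(): process at most `budget` packets. A short poll means the
 * ring is empty, so interrupts are re-enabled. */
static int toy_poll(struct toy_nic *nic, int budget)
{
    int work_done = 0;

    while (work_done < budget && nic->ring_pending > 0) {
        nic->ring_pending--;
        work_done++;
    }
    if (work_done < budget)
        nic->irq_enabled = true;
    return work_done;
}

/* Deliver a burst, take one interrupt if unmasked, then poll until
 * interrupts are back on. Returns the number of poll() calls. */
static int toy_receive_burst(struct toy_nic *nic, int packets, int budget)
{
    int polls = 0;

    nic->ring_pending += packets;
    if (nic->irq_enabled)
        toy_irq(nic);
    while (!nic->irq_enabled) {
        toy_poll(nic, budget);
        polls++;
    }
    return polls;
}
```

With a budget of 64, a burst of 200 packets costs one interrupt and four poll calls (64 + 64 + 64 + 8) in this model, instead of 200 interrupts.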
Problem: Single NIC queue means all packets processed on one CPU core. On a 10 Gbps link pushing small packets, a single core can become 100% saturated while the other 31 cores sit idle.

Solution: Distribute packet processing across multiple CPUs. There are three levels of this, each solving a different part of the problem:
Pros: Works with single-queue NIC
Cons: Extra CPU for steering
RFS (Receive Flow Steering): Extension of RPS
```
Goal: Process packet on CPU where application is running
Benefit: CPU cache locality (hot cache = faster processing)

Flow:
1. Application recv() on CPU3
   └→ Kernel records: Flow X → CPU3
2. Packet for Flow X arrives on CPU0
   └→ RPS checks flow table
   └→ Steers to CPU3 (where app is!)

Result: Packet data already in CPU3's cache when app reads it
```
Configuration:
```shell
# Enable RPS (software steering)
echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus  # Use CPUs 0-3

# Set RPS flow entries
echo 4096 > /proc/sys/net/core/rps_sock_flow_entries

# Set per-queue flow entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
```
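The value written to rps_cpus is a hexadecimal CPU bitmask: bit N set means CPU N may receive steered packets. A small sketch of how such a mask could be built follows; `rps_cpu_mask()` and `rps_mask_to_hex()` are hypothetical helper names, and the sketch assumes at most 32 CPUs (larger systems use comma-separated 32-bit groups, which are not handled here).

```c
#include <stdint.h>
#include <stdio.h>

/* Build a bitmask with one bit per CPU index (indices outside 0-31 are ignored). */
static uint32_t rps_cpu_mask(const int *cpus, int n)
{
    uint32_t mask = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (cpus[i] >= 0 && cpus[i] < 32)
            mask |= (uint32_t)1 << cpus[i];
    }
    return mask;
}

/* Format the mask the way it would be echoed into rps_cpus. */
static void rps_mask_to_hex(uint32_t mask, char *out, size_t out_len)
{
    snprintf(out, out_len, "%x", (unsigned int)mask);
}
```

CPUs 0-3 give the mask `f` used above; CPUs 4-7 would give `f0`.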
```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>  // SEC()
#include <bpf/bpf_endian.h>   // bpf_htons(), bpf_htonl()

// Drop all packets from specific IP
SEC("xdp")
int xdp_drop_ip(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Bounds check (required by BPF verifier)
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // Only process IP packets
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // Drop packets from 192.168.1.100.
    // ip->saddr is in network byte order, so convert the constant too.
    __u32 blocked_ip = bpf_htonl(0xC0A80164);  // 192.168.1.100
    if (ip->saddr == blocked_ip) {
        return XDP_DROP;  // Discard immediately!
    }

    return XDP_PASS;  // Continue to network stack
}

char _license[] SEC("license") = "GPL";
```
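A common mistake in filters like this is comparing `ip->saddr`, which is stored in network byte order, against a constant written in host order. The userspace sketch below shows one way to sanity-check such a constant; `ipv4_to_network_order()` is a hypothetical helper name, built on the standard `inet_pton()` call.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Parse dotted-quad text into the 32-bit value as it appears in an
 * IPv4 header (network byte order). Returns 0 on success, -1 on error. */
static int ipv4_to_network_order(const char *dotted, uint32_t *out)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, dotted, &addr) != 1)
        return -1;
    *out = addr.s_addr;
    return 0;
}
```

For 192.168.1.100 the result equals `htonl(0xC0A80164)`. On a little-endian machine that is not the same 32-bit value as the literal `0xC0A80164`, which is why the conversion matters.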
Compile and Load:
```shell
# Compile XDP program
clang -O2 -target bpf -c xdp_drop.c -o xdp_drop.o

# Load into kernel
ip link set dev eth0 xdp obj xdp_drop.o sec xdp

# Verify
ip link show eth0
# ... xdp ...

# Remove XDP program
ip link set dev eth0 xdp off
```
```c
// Complex state machine
tcp_validate_incoming(sk, skb);
tcp_ack(sk, skb);          // Process ACK
tcp_data_queue(sk, skb);   // Queue OOO
tcp_send_delayed_ack(sk);
// ... many more checks ...
```
Performance: ~5-10µs per packet
TCP Receive Processing (simplified from net/ipv4/tcp_input.c):
```c
int tcp_v4_rcv(struct sk_buff *skb)
{
    struct tcphdr *th;
    struct sock *sk;

    // 1. Get TCP header
    th = tcp_hdr(skb);

    // 2. Find socket (demultiplexing)
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
    if (!sk)
        goto no_tcp_socket;  // Send RST

    // 3. Process TCP state machine
    tcp_v4_do_rcv(sk, skb);
    return 0;
}

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    // Check if established connection
    if (sk->sk_state == TCP_ESTABLISHED) {
        // Try fast path
        if (tcp_rcv_established(sk, skb) == 0)
            return 0;
    }

    // Slow path (connection management)
    return tcp_rcv_state_process(sk, skb);
}

int tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcphdr *th = tcp_hdr(skb);

    // FAST PATH CHECK
    if (tp->rcv_nxt == ntohl(th->seq) &&    // In-order
        tp->rcv_wnd &&                      // Window open
        !th->syn && !th->fin && !th->rst)   // No special flags
    {
        int len = skb->len - th->doff * 4;

        // Copy data to socket receive buffer
        if (!skb_copy_datagram_msg(skb, th->doff * 4,
                                   &sk->sk_receive_queue, len)) {
            tp->rcv_nxt += len;

            // Send ACK
            tcp_send_ack(sk);

            // Wake up waiting process
            sk->sk_data_ready(sk);

            kfree_skb(skb);
            return 0;  // Fast path success!
        }
    }

    // Fall through to slow path...
    return tcp_slow_path(sk, skb);
}
```
```c
// Traditional copy (4 copies!)
fd_in = open("file.txt", O_RDONLY);
fd_out = socket(...);
char buf[4096];
while ((n = read(fd_in, buf, sizeof(buf))) > 0) {
    write(fd_out, buf, n);
}
// Copies:
// 1. DMA: Disk → Kernel buffer
// 2. CPU: Kernel buffer → User buffer (read)
// 3. CPU: User buffer → Socket buffer (write)
// 4. DMA: Socket buffer → NIC

// Zero-copy sendfile()
sendfile(fd_out, fd_in, NULL, file_size);
// Copies:
// 1. DMA: Disk → Kernel buffer
// 2. DMA: Kernel buffer → NIC (if supported)
// Or just 1 CPU copy if DMA-to-DMA not available

// Kernel implementation
ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, size_t count)
{
    // Use splice internally
    // Transfers pages directly to socket
    return splice_direct_to_actor(in_file, &sd, direct_splice_actor);
}
```
MSG_ZEROCOPY
```c
// Modern zero-copy send
int fd = socket(AF_INET, SOCK_STREAM, 0);

// Enable zero-copy
int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

// Send with zero-copy flag
char buf[65536];
send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

// Kernel behavior:
// - Increments refcount on user pages
// - Passes page pointers to NIC
// - User buffer MUST NOT be modified until...

// Wait for completion notification
struct msghdr msg = {};
struct sock_extended_err *serr;
char control[100];
msg.msg_control = control;
msg.msg_controllen = sizeof(control);
recvmsg(fd, &msg, MSG_ERRQUEUE);
// Now safe to reuse buffer

// Benefits:
// - No CPU copy of payload data
// - Reduced cache pollution
// - Higher throughput

// Caveats:
// - Only beneficial for large sends (>10KB)
// - Buffer must stay valid until ACK
// - More complex error handling
```
The kernel network stack ships with reasonable defaults for a generic workload. The moment your traffic profile diverges from “generic,” those defaults become silent throughput killers. Below are the four traps that bite senior engineers most often in production, paired with the patterns that fix them.
Pitfall 1: TCP socket buffer tuning sized for the wrong bandwidth-delay product

The send and receive buffers cap the in-flight bytes a TCP connection can hold. If the buffer is smaller than your bandwidth-delay product (BDP = bandwidth times RTT), the sender blocks waiting for ACKs even though the link has spare capacity. Engineers see CPU idle, network idle, and conclude “the network is fine”, but throughput is stuck at a fraction of line rate.

Concrete numbers: a 10 Gbps cross-region link with 80 ms RTT has a BDP of 100 MB. The default tcp_rmem max on most distros is 6 MB. A single TCP connection on that link is hard-capped at roughly 600 Mbps regardless of how much CPU and bandwidth you have. This caused a 7-hour throughput regression at a CDN we worked with after a data-center migration changed the RTT from 5 ms to 65 ms.

Equally bad in the other direction: oversized buffers create bufferbloat. Packets sit in send buffers for seconds, killing latency and breaking ACK clocking.
Solution: size buffers to BDP, then trust autotuning

Set the max value of tcp_rmem/tcp_wmem to roughly 2x your worst-case BDP. Leave the min and default values modest; TCP autotuning will grow each connection’s buffer up to the max as needed. Do not set the default to the max; that wastes memory on idle connections and makes the kernel slower to recover under memory pressure.
```shell
# 10 Gbps x 100ms RTT BDP = ~125 MB. Set max to 256 MB.
sysctl -w net.core.rmem_max=268435456
sysctl -w net.core.wmem_max=268435456
sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"

# Keep autotuning enabled (it is by default)
sysctl -w net.ipv4.tcp_moderate_rcvbuf=1
```
Verify with ss -tim — the cwnd and rcv_space fields show what TCP actually negotiated. If cwnd plateaus while wscale is non-zero and the application has data to send, the buffer is your cap.
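The BDP arithmetic used in this pitfall is easy to get wrong by a factor of 8 (bits versus bytes), so it can help to write it down once. The helpers below are a sketch with hypothetical names (`bdp_bytes()`, `max_throughput_bps()`); they use decimal units (1 MB = 1,000,000 bytes) and ignore protocol overhead.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes: (bits per second / 8) * RTT. */
static uint64_t bdp_bytes(uint64_t bandwidth_bps, uint64_t rtt_ms)
{
    return bandwidth_bps / 8 * rtt_ms / 1000;
}

/* Upper bound on single-connection throughput when the window is
 * capped by a buffer: at most one buffer's worth of data per RTT. */
static uint64_t max_throughput_bps(uint64_t buffer_bytes, uint64_t rtt_ms)
{
    return buffer_bytes * 8 * 1000 / rtt_ms;
}
```

`bdp_bytes(10000000000, 80)` gives 100,000,000 bytes (100 MB), and `max_throughput_bps(6000000, 80)` gives 600,000,000 bits per second, matching the 600 Mbps cap described above.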
Pitfall 2: TIME_WAIT exhaustion under high connection churn

TCP requires the side that closes the connection first (the active closer, which sends the first FIN) to hold the socket in TIME_WAIT for 2 times the maximum segment lifetime (60 seconds on Linux). On a service that opens many short-lived outbound connections (a metrics shipper, a service-to-service HTTP client without keep-alive, an old-school PHP-FPM worker) you can run out of ephemeral ports in under a minute. The symptom: connect() starts returning EADDRNOTAVAIL even though the network is fine.

The folk remedy is SO_REUSEADDR, but that only lets the listening side rebind to a port still in TIME_WAIT. It does nothing for the outbound case. The other tempting knob, net.ipv4.tcp_tw_recycle, was removed in kernel 4.12 because it broke connectivity through NATs. Setting tcp_tw_reuse=1 helps, but only for outbound connections and only when TCP timestamps are enabled on both ends.
Solution: connection pooling, then knobs as a backstop

The right fix is architectural: keep a pool of long-lived connections instead of opening one per request. HTTP/1.1 keep-alive, HTTP/2 multiplexing, gRPC channels, and database connection pools all exist for this reason. A pool of 100 reused connections handles the same load as 100,000 ephemeral ones without ever touching TIME_WAIT.

If you cannot fix the application, the layered backstop is:
```shell
# Allow reusing TIME_WAIT sockets for outbound when safe
sysctl -w net.ipv4.tcp_tw_reuse=1

# Widen the ephemeral port range from the default 32768-60999
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Shorten FIN_WAIT_2 timeout (be careful, this affects half-closed connections)
sysctl -w net.ipv4.tcp_fin_timeout=30
```
Monitor with ss -s — the timewait count should be a small fraction of available ephemeral ports. Going above 50 percent means you are one traffic spike away from an outage.
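A back-of-the-envelope check makes the exhaustion point concrete: with every closed connection parked in TIME_WAIT for 60 seconds, the sustainable rate of new outbound connections to a single destination IP and port is roughly the port count divided by 60. The helper names below are hypothetical, and the model assumes no tcp_tw_reuse and a single source IP.

```c
/* Number of ports in an inclusive ephemeral range. */
static unsigned int ephemeral_ports(unsigned int low, unsigned int high)
{
    return high - low + 1;
}

/* Sustainable new connections per second to one (dst IP, dst port)
 * when each closed connection holds its port for time_wait_secs. */
static unsigned int max_conn_rate(unsigned int ports, unsigned int time_wait_secs)
{
    return ports / time_wait_secs;
}
```

The default range 32768-60999 holds 28,232 ports, or about 470 new connections per second in this model; widening to 1024-65535 raises that to about 1,075 per second. Either number is easy to exceed without pooling.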
Pitfall 3: MTU mismatches cause silent path-MTU-discovery failures

The classic scenario: your app works perfectly on a LAN with 1500-byte MTU, then you deploy through a VPN, GRE tunnel, or VXLAN overlay that adds 50-100 bytes of encapsulation. The encapsulated path now has an effective MTU of 1400 or so. Your sender uses 1500-byte packets, the intermediate router sees that the packet exceeds the next-hop MTU, sends back an ICMP “Fragmentation Needed” message, and the sender shrinks its segments. That is path-MTU discovery (PMTUD) working correctly.

Now the bug: somewhere between you and the destination, a firewall is dropping ICMP “Fragmentation Needed” messages because “ICMP is dangerous.” Your sender never receives the hint. It keeps sending 1500-byte packets that get black-holed. Connections hang during the first large transfer (a TLS handshake with a big certificate, an HTTP response with a few KB of headers). Small connections work, large ones do not. The pcap on the sender shows packets going out, the pcap on the receiver shows nothing arriving, and there is no error anywhere.
Solution: lower MSS on the affected path, or enable PMTU black-hole detection

The clean fix is tcp_mtu_probing, which lets TCP detect a black hole heuristically (it notices retransmission patterns consistent with PMTUD failure and probes with smaller segments).
```shell
# Enable MTU black-hole detection
sysctl -w net.ipv4.tcp_mtu_probing=1

# Set the floor -- TCP will probe down to this size
sysctl -w net.ipv4.tcp_base_mss=1024
```
For known-bad paths (a VPN you control), clamp MSS at the iptables/nftables layer:
```shell
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu
```
To reproduce in testing: tracepath -n example.com reports the MTU at each hop. If it stops advancing partway through, you have a black hole. Diagnostic incident: this exact failure mode took down outbound API calls for several hours at a fintech we know in 2022, after a network team enabled stricter ICMP filtering on a transit provider.
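When choosing a fixed MSS clamp, the arithmetic is simply the link MTU minus encapsulation overhead minus the IPv4 and TCP headers. The sketch below uses a hypothetical helper name and assumes IPv4 with no IP or TCP options (20 bytes each); the 50-byte overhead in the usage note is an example figure for VXLAN over IPv4, so measure your own tunnel.

```c
/* Largest TCP payload per segment that fits the path without fragmentation,
 * assuming a 20-byte IPv4 header and a 20-byte TCP header (no options). */
static int tcp_mss_for_path(int link_mtu, int encap_overhead)
{
    const int ipv4_header = 20;
    const int tcp_header = 20;

    return link_mtu - encap_overhead - ipv4_header - tcp_header;
}
```

A plain 1500-byte Ethernet path gives an MSS of 1460. With 50 bytes of encapsulation the safe MSS drops to 1410, which is the kind of value you would pass to a fixed `--set-mss` clamp.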
Pitfall 4: epoll edge-triggered vs level-triggered semantics: ET requires draining or you stall

epoll has two trigger modes. Level-triggered (LT, the default) reports a file descriptor as ready every time you call epoll_wait if it has data buffered. Edge-triggered (ET) only reports a transition: you get one notification when data arrives, and you do not get another until new data arrives, no matter how much unread data is still buffered. ET is faster (fewer syscalls per event) and is what high-performance servers like nginx, HAProxy, and Envoy use.

The trap: in ET mode, if your handler reads only some of the available data (say, you read 4 KB but 8 KB arrived), you will not get another notification until more data shows up. The remaining 4 KB sits in the socket buffer indefinitely. The connection appears stuck. Worse, the bug is timing-dependent and load-dependent: under light load, every read happens to drain the buffer and you never see it. Under heavy load with bursty senders, half your connections stall. This pattern caused a 4-hour partial outage at a streaming service in 2019 when a code path read a fixed chunk size instead of looping.
Solution: in ET mode, always loop until EAGAIN

The contract for edge-triggered is non-negotiable: every time you receive an event, drain the descriptor until read/recv returns EAGAIN or EWOULDBLOCK. Same on the write side: write until EAGAIN, then re-arm.
```c
// Edge-triggered read loop (correct)
while (1) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) {
        process(buf, n);
        continue;
    }
    if (n == 0) {
        // Peer closed the connection
        close(fd);
        break;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        // Buffer drained -- wait for next epoll event
        break;
    }
    // Real error
    handle_error(errno);
    break;
}
```
If you cannot guarantee draining (for example, you want fairness across many connections and refuse to spin on one), use level-triggered mode instead. The performance delta between ET and LT for typical web workloads is under 5 percent. The correctness delta when you get ET wrong is “everything hangs occasionally.”

Stronger pattern: prefer EPOLLONESHOT. The fd is reported once, then disarmed until you explicitly re-arm it with epoll_ctl(EPOLL_CTL_MOD). This forces you to handle the read-to-completion cycle and re-enable, eliminating both lost-wakeup bugs and concurrent-access races in multithreaded servers.
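The drain-then-re-arm cycle can be sketched with three small helpers. The function names (`set_nonblocking()`, `drain_fd()`, `rearm_oneshot()`) are hypothetical; the calls they wrap (`fcntl`, `read`, `epoll_ctl`) are the standard Linux APIs. The sketch discards the bytes it reads, where a real handler would process them.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Edge-triggered epoll only makes sense on non-blocking descriptors. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Read until EAGAIN (or EOF). Returns total bytes drained, or -1 on a
 * real error. The data is discarded here; a real handler would use it. */
static ssize_t drain_fd(int fd)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;                 /* peer closed */
        if (errno == EINTR)
            continue;                     /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                 /* fully drained */
        return -1;
    }
}

/* Re-arm a descriptor that was registered with EPOLLONESHOT. */
static int rearm_oneshot(int epfd, int fd)
{
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
        .data.fd = fd,
    };

    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```

A worker would call `drain_fd()` when the event fires and `rearm_oneshot()` only after it has finished with the descriptor, so no second thread can be woken for the same fd in between.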
Q1: Explain the sk_buff structure and why headroom/tailroom matter.
sk_buff: The fundamental packet data structure in Linux networking.

Memory Layout:

```
┌─────────────────────────────────────────────┐
│ head      data            tail      end     │
│  ↓         ↓               ↓         ↓      │
│  ├─────────┼───────────────┼─────────┤      │
│  │   HR    │  Valid Data   │   TR    │      │
│  └─────────┴───────────────┴─────────┘      │
│  <-headroom-> <---len---> <-tailroom->      │
└─────────────────────────────────────────────┘
```
Why Headroom Matters:
As a packet moves down the network stack (from app to wire), each layer adds a header:
Application data
+20 bytes TCP header (skb_push)
+20 bytes IP header (skb_push)
+14 bytes Ethernet header (skb_push)
Without headroom, each layer would need to reallocate and copy the entire packet. With headroom, we just move the data pointer backwards.

Why Tailroom Matters:
For adding trailers (less common)
For TSO/GSO (TCP Segmentation Offload): Kernel builds large packets, NIC splits them
Performance Impact: Zero-copy header addition vs expensive realloc/memcpy.
Q2: How does NAPI improve packet processing performance?
Problem with Old Interrupt Model:
Each packet → hardware interrupt
At 1 Gbps with minimum-size 64-byte frames (about 1.5M packets/sec), the CPU can spend essentially all of its time handling interrupts
This is “interrupt storm” or “receive livelock”
NAPI Solution (New API):

Low Traffic (interrupt mode):
1. Packet arrives → IRQ
2. Driver disables NIC interrupts
3. Schedules NAPI poll
4. Returns immediately from IRQ

High Traffic (polling mode):
5. Softirq calls driver’s poll() function
6. Poll processes up to budget packets (default 64)
7. If more packets remain, stay in polling mode
8. If ring buffer empty, re-enable interrupts

Benefits:
Low latency (low load): Interrupts still used
High throughput (high load): Polling avoids interrupt overhead
Fairness: Budget prevents one NIC from starving others
Adaptive: Automatically switches modes
Key Insight: Interrupts tell us “work is available”, then we switch to polling to batch the work.
Q3: What is XDP and how does it achieve such high performance?
XDP (eXpress Data Path): Runs eBPF programs at the earliest possible point in packet processing.

Traditional Path:

```c
// Complex state machine
validate_sequence_numbers();
check_for_duplicate_acks();
update_rtt_estimates();
process_sack_blocks();
reorder_out_of_order_segments();
update_congestion_window();
// ... many more checks ...
// ~5-10µs
```

Impact: Fast path handles 90%+ of packets in established bulk-data transfers. Slow path ensures correctness for edge cases.

Optimization: Keep connections in fast path by:
Avoiding packet loss (good network)
Using large enough buffers (avoid window full)
Minimizing out-of-order delivery (good QoS)
Q5: How does RSS/RPS/RFS distribute packet processing across CPU cores?
Problem: Single NIC queue → all packets processed on one CPU → bottleneck

Solution 1: RSS (Receive Side Scaling) - Hardware
NIC has multiple RX queues (e.g., 8 queues)
NIC computes hash: hash(src_ip, dst_ip, src_port, dst_port) % num_queues
Each queue has dedicated IRQ mapped to specific CPU
Result: Packets distributed across CPUs in hardware
Pros: No CPU overhead, very fast
Cons: Requires multi-queue NIC

Solution 2: RPS (Receive Packet Steering) - Software
Single queue NIC
CPU receiving IRQ computes hash
Enqueues packet to target CPU’s backlog
Target CPU processes packet
Pros: Works with any NIC
Cons: Extra CPU overhead for steering

Solution 3: RFS (Receive Flow Steering) - Locality optimization
Extension of RPS
Tracks which CPU application is running on
Steers packets to that specific CPU
Result: Packet data in cache when application reads it
Example:
```
Without RFS:
  Packet → CPU0 (processes) → CPU2 (app blocked on recv)
  → Cache miss when app reads data

With RFS:
  Packet → CPU2 (processes + app runs here)
  → Data in cache, very fast!
```
Configuration:
```shell
# RSS (hardware, automatic if NIC supports)
ethtool -l eth0  # Show queue count

# RPS
echo "f" > /sys/class/net/eth0/queues/rx-0/rps_cpus  # CPUs 0-3

# RFS
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
```
Q6: What is connection tracking (conntrack) and why can it be a bottleneck?
Connection Tracking: Kernel subsystem that tracks state of all connections (TCP, UDP, ICMP).

Purpose:
Enable stateful firewall rules
NAT (must remember translations)
Connection-based filtering
How It Works:
```
Client initiates connection:
  192.168.1.100:45000 → 8.8.8.8:80

Conntrack creates entry:
  ORIGINAL: 192.168.1.100:45000 → 8.8.8.8:80 [NEW]
  REPLY:    8.8.8.8:80 → 192.168.1.100:45000

Subsequent packets match entry (both directions!)

States: NEW → ESTABLISHED → FIN_WAIT → CLOSE → destroyed
```
Performance Issues:
Hash table lookup: O(1) but still overhead on every packet
Global lock: (Older kernels) serializes all conntrack operations
Memory: Each connection consumes memory (~300 bytes)
Hash collisions: Degrade to O(n) lookup
Symptoms:
```shell
# Table full
dmesg | grep conntrack
# nf_conntrack: table full, dropping packet

# High CPU in conntrack
perf top
# 12.34% [kernel] [k] nf_conntrack_in
```
Solutions:
```shell
# 1. Increase table size
sysctl -w net.netfilter.nf_conntrack_max=1048576
sysctl -w net.netfilter.nf_conntrack_buckets=262144

# 2. Decrease timeout for short-lived connections
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30

# 3. Bypass conntrack for specific traffic (e.g., load balancer)
iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK
# WARNING: Breaks stateful rules for this traffic!

# 4. Use connection-less alternatives
# - eBPF/XDP for filtering (bypasses conntrack entirely)
# - Stateless firewall rules where possible
```
When to bypass: High-traffic stateless services (load balancers, DNS servers, CDN edges).
Problem: Traditional send/receive involves multiple memory copies.

Traditional Path (4 copies!):

```
Sending a file:

1. read(file_fd, buffer, size)
   Disk → [Kernel buffer] → [User buffer]
       DMA copy          CPU copy

2. write(socket_fd, buffer, size)
   [User buffer] → [Socket buffer] → NIC
              CPU copy           DMA copy

Total: 2 DMA + 2 CPU copies
```
Zero-Copy Techniques:

1. sendfile():
```c
sendfile(socket_fd, file_fd, NULL, file_size);

// Kernel path:
//   Disk → [Kernel buffer] ────→ NIC
//       DMA copy        DMA copy (if NIC supports)
//                       or page remapping
// Avoids user-space copies entirely!
// Best for static file serving (nginx, Apache)
```
2. splice():
```c
// Move data between FDs via pipe (zero-copy)
splice(file_fd, NULL, pipe_fd[1], NULL, size, 0);
splice(pipe_fd[0], NULL, socket_fd, NULL, size, 0);
// Kernel passes page references through the pipe, no memcpy
```
3. MSG_ZEROCOPY:
```c
send(socket_fd, buffer, size, MSG_ZEROCOPY);

// Kernel:
// 1. Pin user pages in memory (increment refcount)
// 2. NIC DMAs directly from user buffer
// 3. After NIC finishes, kernel notifies app via error queue
// 4. App can now reuse buffer

// Benefit: No copy for large sends
// Caveat: Buffer must stay valid, async notification needed
```
4. mmap() + write():
```c
void *data = mmap(NULL, size, PROT_READ, MAP_SHARED, file_fd, 0);
write(socket_fd, data, size);
// May avoid one copy if kernel is smart
// But less efficient than sendfile()
```
Performance Impact:
| Method | Copies | Use Case |
|---|---|---|
| Traditional | 4 | Small data, flexibility needed |
| sendfile() | 1-2 | Static file serving |
| splice() | 0 | Piping data between FDs |
| MSG_ZEROCOPY | 0 | Large sends (>10KB) |
When to use:
sendfile(): Web server serving files
splice(): Proxy/gateway (socket → socket)
MSG_ZEROCOPY: Bulk data transfer, streaming
Q8: How does TCP congestion control work? Compare CUBIC vs BBR.
Problem: Send too fast → network congestion → packet loss. Send too slow → underutilize bandwidth.

Goal: Find optimal sending rate (maximize throughput, minimize loss).

CUBIC (Linux default):

Algorithm:
Maintains congestion window (cwnd) = max packets in flight
On loss: cwnd = cwnd × β (reduce by 30%)
Recovery: Grow cwnd using cubic function
```
cwnd(t) = C × (t - K)³ + W_max

Where:
  W_max = window before loss
  K     = time to reach W_max
  C     = scaling constant
```
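The curve is easier to reason about with numbers plugged in. The sketch below evaluates the window function using K = cbrt(W_max × (1 − β) / C) as defined in RFC 8312, where the standard constants are C = 0.4 and β = 0.7. The function names are hypothetical, the window is in segments and time in seconds, and a small Newton iteration replaces `cbrt()` so the example does not need libm.

```c
/* Cube root by Newton's method for x >= 0 (avoids linking libm).
 * Starting at max(x, 1) keeps the iterate above the root, so it
 * converges monotonically. */
static double cube_root(double x)
{
    double r = x > 1.0 ? x : 1.0;
    int i;

    if (x == 0.0)
        return 0.0;
    for (i = 0; i < 200; i++)
        r = (2.0 * r + x / (r * r)) / 3.0;
    return r;
}

/* K: time (seconds) for the window to climb back to W_max after a loss. */
static double cubic_k(double w_max, double beta, double c)
{
    return cube_root(w_max * (1.0 - beta) / c);
}

/* W(t) = C * (t - K)^3 + W_max, with t measured from the loss event. */
static double cubic_window(double t, double w_max, double beta, double c)
{
    double d = t - cubic_k(w_max, beta, c);

    return c * d * d * d + w_max;
}
```

With W_max = 100 segments, the window restarts at 70 (the 30% reduction), flattens out as it approaches 100 after roughly 4.2 seconds, and then accelerates again to probe for more bandwidth.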
A production service is dropping packets under high load. You suspect the kernel network stack is the bottleneck, not the application. Walk me through your debugging methodology from NIC to socket.
Strong Answer:

The way I approach this is bottom-up, from hardware to application, checking for drops at each layer. Every layer in the network stack has its own counters, and the trick is finding where packets disappear.
NIC level: Start with ethtool -S eth0 | grep -i drop and ethtool -S eth0 | grep -i error. Look for rx_dropped, rx_missed_errors, and rx_fifo_errors. If the NIC is dropping, the ring buffer is full — packets arrive faster than the CPU drains them. Fix: increase ring buffer size with ethtool -G eth0 rx 4096, enable RSS (multiple hardware queues), or move to NAPI polling with a higher budget.
Softirq/NAPI level: Check /proc/net/softnet_stat. Each line represents a CPU. Column 1 is total packets processed, column 2 is drops (backlog overflow), column 3 is time_squeeze (NAPI budget exhausted before all packets processed). If column 2 is non-zero, increase net.core.netdev_max_backlog. If column 3 is non-zero, increase net.core.netdev_budget or enable RPS to spread load across more CPUs.
Socket buffer level: Check ss -s for socket memory statistics. If the TCP receive buffer is full because the application is not calling recv() fast enough, the kernel drops incoming segments. Check netstat -s | grep "pruned" for TCP pruning events. Fix: increase socket buffers (net.core.rmem_max, net.ipv4.tcp_rmem) or fix the slow application.
Conntrack table: If using iptables/nftables, check conntrack -C for the current count and compare with net.netfilter.nf_conntrack_max. A full conntrack table silently drops new connections. This is a classic production issue for high-connection-count services (100K+ concurrent connections). Fix: increase the max or bypass conntrack for stateless services with -j NOTRACK.
Application level: If none of the above show drops, the application is the bottleneck. Check with ss -tlnp if the listen backlog is overflowing (Recv-Q exceeding the backlog value). Increase net.core.somaxconn and the application’s listen backlog.
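The softnet_stat check is easy to automate because each line is a row of hexadecimal counters, one row per CPU. The sketch below parses only the first three columns described above; `softnet_counters` and `parse_softnet_line()` are hypothetical names, and later columns (which vary by kernel version) are ignored.

```c
#include <stdio.h>

/* First three columns of one /proc/net/softnet_stat row. */
struct softnet_counters {
    unsigned int processed;     /* column 1: packets processed */
    unsigned int dropped;       /* column 2: backlog overflow drops */
    unsigned int time_squeeze;  /* column 3: budget or time exhausted */
};

/* Parse one row of hex counters. Returns 0 on success, -1 on a short row. */
static int parse_softnet_line(const char *line, struct softnet_counters *out)
{
    int matched = sscanf(line, "%x %x %x",
                         &out->processed, &out->dropped, &out->time_squeeze);

    return matched == 3 ? 0 : -1;
}
```

A monitoring loop would read the file with `fgets()`, call this once per line, and alert when `dropped` or `time_squeeze` increases between samples.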
In my experience, 80% of production packet drops are either conntrack table exhaustion or softirq budget starvation. Both are silent: no error logs, no alerts unless you explicitly monitor those counters.

Follow-up: How does XDP help in this scenario compared to iptables?

XDP processes packets before sk_buff allocation, at the driver level. For dropping known-bad traffic (DDoS mitigation, IP blacklists), XDP can drop at 10-100x the rate of iptables because it avoids the entire network stack overhead: no sk_buff allocation, no conntrack lookup, no routing decision. Facebook’s Katran load balancer uses XDP to handle millions of packets per second per core. The trade-off is that XDP programs only see raw packet bytes (no parsed TCP state, no connection tracking), so they are limited to stateless or simple stateful decisions. For complex firewall rules that need connection state, you still need conntrack/netfilter.
Explain the difference between TCP BBR and CUBIC congestion control. When would you switch a production service from CUBIC to BBR, and what could go wrong?
Strong Answer: CUBIC and BBR take fundamentally different approaches to congestion control:
CUBIC (default on Linux): Loss-based. CUBIC increases the congestion window following a cubic function until it detects packet loss, then backs off. The key insight is that packet loss signals network congestion. CUBIC is aggressive at probing for bandwidth (the cubic growth curve approaches the previous maximum quickly after a loss event) and conservative after loss.
BBR (Bottleneck Bandwidth and RTT): Model-based. BBR continuously estimates two parameters: bottleneck bandwidth (max throughput the path can sustain) and minimum RTT. It then paces sending to match bottleneck bandwidth while keeping in-flight data to roughly bandwidth x RTT. BBR does not treat loss as a congestion signal — it treats it as noise.
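To make the two models concrete, here is a minimal sketch of the quantities each one revolves around. The CUBIC window function and its constants (C = 0.4, beta = 0.7) follow RFC 8312; the Linux implementation uses fixed-point arithmetic and differs in detail. The function names are chosen for illustration only.

```python
def cubic_window(t: float, w_max: float, c: float = 0.4, beta: float = 0.7) -> float:
    """CUBIC congestion window (in segments) t seconds after a loss event.

    w_max is the window size just before the loss. K is the time at which
    the curve returns to w_max.
    """
    k = (w_max * (1.0 - beta) / c) ** (1.0 / 3.0)
    return c * (t - k) ** 3 + w_max


def bdp_bytes(bottleneck_bits_per_s: float, min_rtt_s: float) -> float:
    """Bandwidth-delay product: the in-flight data BBR aims to keep on the path."""
    return bottleneck_bits_per_s * min_rtt_s / 8.0
```

Right after a loss (t = 0) the CUBIC window is beta times w_max, and it climbs back to w_max at t = K before probing beyond it. For a 100 Mbit/s bottleneck with a 40 ms minimum RTT, bdp_bytes gives about 500 KB, which is roughly what BBR tries to keep in flight instead of filling the bottleneck buffer.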
When to switch to BBR:
Bufferbloat networks: On paths with large buffers (common in ISPs, cellular networks), CUBIC fills the buffers until loss occurs, adding hundreds of milliseconds of queuing delay. BBR avoids filling buffers by targeting the bottleneck rate, resulting in dramatically lower latency (often 10-50x reduction in queuing delay).
Lossy links (wireless, satellite): CUBIC interprets random packet loss (wireless interference) as congestion and backs off unnecessarily. BBR ignores occasional loss and maintains throughput.
Long-distance, high-BDP paths: BBR converges to fair share faster than CUBIC on high bandwidth-delay product links.
What could go wrong:
Fairness with CUBIC flows: BBR can be unfair when competing with CUBIC flows on the same bottleneck link. BBR tends to grab more bandwidth because it does not back off on loss. BBRv2 and BBRv3 address this somewhat, but fairness remains a concern in shared environments.
RTT measurement sensitivity: BBR’s performance depends on accurate minimum RTT estimation. If your service runs behind a load balancer that adds variable latency, BBR may over-estimate min_RTT and underperform.
Retransmission behavior: BBR can sometimes cause higher retransmission rates than CUBIC because it probes aggressively and does not immediately reduce sending on loss. Monitor netstat -s | grep retrans after switching.
My recommendation: enable BBR for internet-facing services (CDNs, APIs serving mobile clients) where bufferbloat is the primary latency enemy. Keep CUBIC for data center east-west traffic where links are reliable and bufferbloat is not a factor. You can set it per connection with setsockopt(IPPROTO_TCP, TCP_CONGESTION, "bbr"), so you do not need a system-wide switch.
Follow-up: How would you measure whether switching to BBR actually improved your service’s performance?
Run an A/B test that measures three metrics: p50/p95/p99 TCP RTT (from ss -ti or TCP tracepoints), retransmission rate (nstat TcpRetransSegs), and application-level latency. Run both algorithms simultaneously on different server pools serving the same traffic, rather than comparing before and after, so that time-of-day effects cancel out. If BBR shows lower RTT and equal or lower retransmit rate, it is a win. If retransmits spike, investigate whether the path has a strict policer (some ISPs police by drop rate, which confuses BBR).
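The per-connection switch can be sketched from Python as follows, assuming a Linux host. The helper name is hypothetical. The numeric fallback 13 is the Linux value of TCP_CONGESTION for Python builds that do not export the constant. Unprivileged processes can only pick algorithms listed in net.ipv4.tcp_allowed_congestion_control, so the helper reads the setting back rather than assuming the request succeeded.

```python
import socket
from typing import Optional

# Linux value of TCP_CONGESTION; used only if this Python build lacks the constant.
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)


def set_congestion_control(sock: socket.socket, name: str) -> Optional[str]:
    """Request a congestion control algorithm for one socket.

    Returns the algorithm actually in effect afterwards, or None if the
    platform does not support the option.
    """
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, name.encode())
    except OSError:
        # Algorithm not loaded or not permitted: the socket keeps its current one.
        pass
    try:
        raw = sock.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, 16)
    except OSError:
        return None
    # The kernel returns a NUL-padded name.
    return raw.split(b"\x00", 1)[0].decode(errors="replace")
```

Calling set_congestion_control(sock, "bbr") before connect() and logging the returned name makes it easy to tell, per connection, which pool of an A/B test a socket actually landed in.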
What is the sk_buff structure and why is it designed the way it is? What would happen if you needed to redesign it for modern 100Gbps NICs?
Strong Answer: The sk_buff (socket buffer) is the central data structure representing a network packet as it traverses the Linux kernel network stack. Every packet, whether incoming or outgoing, is wrapped in an sk_buff from the moment it enters the kernel until it leaves.
Key design elements: An sk_buff contains a head pointer (start of allocated buffer), data pointer (start of current layer’s header), tail pointer (end of data), and end pointer (end of allocated buffer). The space between head and data is “headroom” — reserved for prepending headers as the packet moves down the stack (e.g., adding an IP header, then an Ethernet header). The space between tail and end is “tailroom” for appending data. This design means that adding or removing headers is a pointer adjustment, not a memory copy.
Metadata: sk_buff also carries extensive metadata: timestamp, hash value (for RSS), mark (for iptables), VLAN tag, checksum offload status, GRO/GSO information, and pointers to the associated socket and network device. This metadata is what makes protocol processing efficient — each layer annotates the sk_buff rather than parsing from scratch.
Cloning: When a packet needs to go to multiple destinations (multicast, tapping), the kernel clones the sk_buff — creating a new metadata structure that points to the same data buffer. This avoids copying the packet payload.
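The head/data/tail/end design is easier to see in a toy model. The class below is a userspace illustration only, not kernel code: it mimics what skb_reserve(), skb_put(), skb_push(), and skb_pull() do to the offsets, with the class and method names chosen for this sketch.

```python
class SkbModel:
    """Toy model of sk_buff's head/data/tail/end offsets over one buffer."""

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)
        self.head = 0      # start of allocated buffer
        self.data = 0      # start of valid data
        self.tail = 0      # end of valid data
        self.end = size    # end of allocated buffer

    def headroom(self) -> int:
        return self.data - self.head

    def tailroom(self) -> int:
        return self.end - self.tail

    def length(self) -> int:
        return self.tail - self.data

    def reserve(self, n: int) -> None:
        """Like skb_reserve(): set aside headroom in an empty buffer."""
        if self.length() != 0 or n > self.tailroom():
            raise ValueError("reserve needs an empty buffer with enough room")
        self.data += n
        self.tail += n

    def put(self, payload: bytes) -> None:
        """Like skb_put(): append bytes at the tail."""
        if len(payload) > self.tailroom():
            raise ValueError("not enough tailroom")
        self.buf[self.tail:self.tail + len(payload)] = payload
        self.tail += len(payload)

    def push(self, header: bytes) -> None:
        """Like skb_push(): prepend a header by moving data toward head."""
        if len(header) > self.headroom():
            raise ValueError("not enough headroom")
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header

    def pull(self, n: int) -> None:
        """Like skb_pull(): strip n bytes from the front; no bytes are copied."""
        if n > self.length():
            raise ValueError("pull beyond data")
        self.data += n

    def payload(self) -> bytes:
        return bytes(self.buf[self.data:self.tail])
```

On transmit, a sender would reserve 34 bytes, put the payload, then push a 20-byte IP header and a 14-byte Ethernet header. On receive, each layer calls pull with its header size. The payload bytes are written once and never moved.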
For redesigning at 100Gbps:
The problem: At 100Gbps with 64-byte packets, you need to process 148 million packets per second. Allocating and freeing an sk_buff per packet is impossibly expensive — each allocation involves slab allocator calls, cache-line bouncing, and metadata initialization.
Batch processing: Modern approaches (GRO — Generic Receive Offload) coalesce multiple packets into a single sk_buff before handing to upper layers, reducing per-packet overhead by 10-60x.
XDP’s approach: Bypass sk_buff entirely. XDP uses a minimal xdp_buff structure (exposed to BPF programs as xdp_md) that is little more than pointers into the DMA buffer. No allocation, no metadata bloat. This is why XDP can process packets at line rate on 100Gbps NICs.
AF_XDP: Provides a zero-copy path from NIC DMA buffers directly to user-space ring buffers, completely bypassing the sk_buff-based stack. Used by high-frequency trading firms and DPDK-like workloads.
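The 148 million packets per second figure above follows from Ethernet framing arithmetic. A minimal worked example: every frame on the wire also carries an 8-byte preamble plus start-of-frame delimiter and a 12-byte inter-frame gap. The constant and function names are chosen for illustration.

```python
LINK_BITS_PER_S = 100e9       # 100 Gbps
MIN_FRAME_BYTES = 64          # minimum Ethernet frame
PREAMBLE_SFD_BYTES = 8        # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12     # mandatory idle time between frames


def max_packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Line-rate packet count for back-to-back frames of a given size."""
    wire_bytes = frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES
    return link_bps / (wire_bytes * 8)


pps = max_packets_per_second(LINK_BITS_PER_S, MIN_FRAME_BYTES)
budget_ns = 1e9 / pps  # time available per packet on a single core
```

This gives roughly 148.8 million packets per second, or about 6.7 ns per packet if one core had to handle them all, which is less than the cost of a single cache miss. That budget is the reason per-packet allocation does not scale.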
If I were redesigning from scratch, I would make sk_buff a thin wrapper with a fixed-size inline metadata area (avoiding pointer chasing), support batch allocation/deallocation natively (like the page allocator’s bulk APIs), and make the XDP fast path the primary path rather than an afterthought.
Follow-up: What is GRO and how does it reduce per-packet overhead?
GRO (Generic Receive Offload) coalesces multiple incoming packets belonging to the same flow into a single large sk_buff before passing it up the stack. For example, if 10 TCP segments of 1500 bytes arrive for the same connection, GRO merges them into one 15000-byte sk_buff. The upper layers (TCP) then process one “packet” instead of ten. This reduces per-packet processing overhead (header parsing, socket lookup, lock acquisition) by an order of magnitude. The hardware equivalent is LRO (Large Receive Offload), which does the same thing in the NIC firmware. GRO is preferred because it is smarter about which packets can be safely coalesced (it respects TCP timestamps, ECN bits, etc.) while LRO can sometimes break things by coalescing packets that should not be merged.
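The coalescing idea can be illustrated with a toy model. This is not how the kernel implements GRO (which tracks several flows per NAPI instance and checks many more header fields). The sketch only merges a segment into the previous one when both belong to the same flow and their sequence numbers are contiguous. The Segment type and coalesce function are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    flow: Tuple[str, int, str, int]  # (src ip, src port, dst ip, dst port)
    seq: int                         # sequence number of the first payload byte
    payload: bytes


def coalesce(segments: List[Segment]) -> List[Segment]:
    """Merge back-to-back in-order segments of the same flow."""
    merged: List[Segment] = []
    for seg in segments:
        if merged:
            last = merged[-1]
            contiguous = last.seq + len(last.payload) == seg.seq
            if last.flow == seg.flow and contiguous:
                merged[-1] = Segment(last.flow, last.seq, last.payload + seg.payload)
                continue
        merged.append(seg)
    return merged
```

Ten contiguous 1000-byte segments of one flow collapse into a single 10000-byte segment, so the layers above see one unit of work instead of ten. A gap in the sequence space stops the merge, which mirrors why reordered or lossy flows benefit less from GRO.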
Walk me through the TCP three-way handshake from the kernel's perspective. What happens at each step, what data structures change, and where can it go wrong?
Strong Answer Framework:
SYN arrives at the listener. The packet hits the NIC, goes through softirq, IP, and lands in tcp_v4_rcv. The kernel does an __inet_lookup_listener to find a LISTEN-state socket bound to the destination IP and port. If found, it allocates a request socket (struct request_sock), a lightweight half-open structure, and adds it to the listener’s SYN queue (tracked in icsk_accept_queue; since kernel 4.4 the request sockets themselves live in the established-connections hash table and the SYN queue is a counter rather than a separate list). The full socket is not allocated yet.
SYN-ACK is sent back. The kernel constructs a SYN-ACK with its initial sequence number and the negotiated TCP options (MSS, window scale, SACK permitted, timestamps). It also chooses a starting sk_rcv_saddr if the listener was bound to INADDR_ANY.
ACK arrives, completing the handshake. The kernel matches the ACK to the request socket, allocates a full struct sock via tcp_v4_syn_recv_sock, transitions it to ESTABLISHED, and moves it from the SYN queue to the accept queue (accept_queue->rskq_accept_head). Only at this point does the application’s accept() syscall return a new file descriptor.
The listening socket has two queues. tcp_max_syn_backlog caps the SYN queue (half-open). The accept() queue (full-open, waiting to be picked up by the application) is capped at min(somaxconn, application_backlog). If the accept queue overflows because the app is slow to call accept(), the kernel drops the third ACK and the connection silently fails: the client thinks it succeeded, but the server has no socket.
Real-World Example: In 2016, a major SaaS provider’s API gateway started failing under bursty load with no errors in the application. netstat -s | grep -i "listen" showed thousands of ListenOverflows and ListenDrops per second. The cause: their accept queue was sized at the historical default of 128 (SOMAXCONN), but they were getting bursts of 5K connections in 200 ms. The fix was raising net.core.somaxconn to 16384 and updating the application’s listen() backlog to match. They also enabled tcp_abort_on_overflow=1 so that overflows became visible RSTs instead of silent drops, restoring deterministic error behavior.
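The counters behind netstat -s come from /proc/net/netstat, which stores each group as a header line of names followed by a line of values. The sketch below parses that layout and computes the effective accept queue limit described above. The function names and the sample text are illustrative assumptions.

```python
from typing import Dict

# Illustrative sample; the real file has many more counters per group.
SAMPLE_NETSTAT = """\
TcpExt: SyncookiesSent ListenOverflows ListenDrops
TcpExt: 0 5120 5133
IpExt: InOctets OutOctets
IpExt: 1000 2000
"""


def parse_proc_net_netstat(text: str) -> Dict[str, int]:
    """Parse name/value line pairs into a flat {GroupName: value} dict."""
    counters: Dict[str, int] = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: "Group: name1 name2 ..." then "Group: v1 v2 ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        group, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            counters[group + name] = int(number)
    return counters


def accept_queue_limit(listen_backlog: int, somaxconn: int) -> int:
    """The kernel silently caps the listen() backlog at net.core.somaxconn."""
    return min(listen_backlog, somaxconn)
```

With the historical somaxconn of 128, an application that calls listen() with a backlog of 16384 still gets an accept queue of 128. This is why both the sysctl and the application setting had to change in the example above.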
Senior follow-up 1: How does TCP fast open (TFO) change the handshake, and what is the security tradeoff?
TFO lets the client send data in the SYN packet itself. On a first connection, the server returns a TFO cookie in the SYN-ACK; on subsequent connections, the client includes that cookie in the SYN with up to ~1460 bytes of data, and the server processes the data immediately rather than after the third ACK. This saves one full RTT for short transactions (DNS, small HTTP requests). The security tradeoff: an attacker who steals a cookie can amplify SYN-flood attacks against arbitrary destinations using the victim’s identity, and TFO state has to be carefully aged. Linux disables TFO server-side by default for this reason and requires explicit opt-in per socket.
Senior follow-up 2: What are SYN cookies, and what do you lose by enabling them?
SYN cookies are a defense against SYN floods. Instead of allocating a request socket and storing it in the SYN queue, the server encodes the connection state into the initial sequence number of the SYN-ACK using a cryptographic hash of the four-tuple plus a secret. The server keeps no per-connection memory at all until the third ACK arrives, at which point it validates the cookie in the ACK and reconstructs the connection. The tradeoff: TCP options that are usually negotiated in the SYN/SYN-ACK (window scale, SACK permitted, timestamps) cannot be fully preserved through the cookie, so connections established under SYN flood may have suboptimal performance. Linux only activates SYN cookies when the SYN queue is full (tcp_syncookies=1, default), so normal traffic is unaffected.
Senior follow-up 3: Why does the listen backlog need both a SYN queue and an accept queue separately?
Because they protect against different failures. The SYN queue absorbs bursts of half-open connections during the handshake, so it must be large enough to hold every in-flight handshake even under flood. The accept queue absorbs the gap between connections being ready and the application calling accept(), so it must be large enough to hide application stalls (GC pauses, slow startup). Sizing them together would force a tradeoff: raise the limit to handle floods, and you also raise the time the app can stall before connections fail. Splitting them lets you size each for its purpose. Linux merged the two limits historically (pre-2.2) and split them precisely because operators kept hitting one cap or the other.
Common Wrong Answers:
“Three packets, that is the handshake.” Correct in the abstract, but it dodges every implementation detail an interviewer cares about: the two queues, the request socket, the cookie path, where the backlog parameters apply.
“The connection is fully open after the SYN-ACK.” No. The server moves to SYN_RECEIVED after sending SYN-ACK and only to ESTABLISHED on the third ACK. Treating SYN-ACK as completion misses where overflow drops happen.
“accept() blocks until a SYN arrives.” accept() blocks until a fully-established socket lands in the accept queue. SYNs alone never wake accept().
Further Reading:
Linux source: net/ipv4/tcp_input.c (tcp_conn_request), net/ipv4/inet_connection_sock.c (inet_csk_accept)
Cloudflare blog: “SYN packet handling in the wild” (2018) — production breakdown of every queue and counter
“TCP/IP Illustrated, Volume 2: The Implementation” by Wright and Stevens, chapters 28-29
A production service is showing 10x its normal TCP retransmission rate. What tools do you reach for, what knobs do you tune, and how do you tell if it is the network's fault or yours?
Strong Answer Framework:
Establish baseline and current rate. Pull nstat -az TcpRetransSegs TcpOutSegs over a 10-second window. The retransmit ratio is TcpRetransSegs / TcpOutSegs. Healthy data centers see under 0.01 percent; over the public internet, 0.1-0.5 percent is normal; above 1 percent is a real problem. Compare against your historical baseline — “10x normal” matters a lot more than the absolute number.
Slice by connection. ss -tin shows per-socket retransmit counts (retrans: field) and the inferred congestion state. Sort by retransmits to find the worst offenders. Are retransmits concentrated on one peer, one subnet, one congestion control variant? That tells you whether it is a path problem or a host problem.
Use tracepoints to see why. bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @[args->skaddr] = count(); }' counts retransmits per socket in real time. Even better, attach to tcp_retransmit_skb and dump the cause: RTO timeout vs fast retransmit vs tail loss probe. The mix matters: lots of fast retransmits suggest reordering or random loss; lots of RTO-driven retransmits suggest sustained congestion or routing flaps.
Capture and read. tcpdump -i any -w trace.pcap 'tcp and host suspicious_peer', then load it in Wireshark with the “Expert” view. The cleanest signals are duplicate ACKs (loss in the forward direction), SACK blocks (out-of-order arrival, often from ECMP rehashing), and zero-window updates (receiver overloaded).
Tune with intent, not by spraying knobs. If the path is lossy and you are stuck on CUBIC, evaluate BBR. If retransmissions cluster on RTO timeouts of exactly 200ms, lower the minimum RTO (the rto_min route attribute set via ip route, or the net.ipv4.tcp_rto_min_us sysctl on recent kernels), but only after confirming the path RTT is well below it. If you see persistent reordering due to multipath, make sure net.ipv4.tcp_recovery (RACK) is enabled, as it is by default on modern kernels; RACK uses time-based loss detection rather than dup-ACK counting.
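The first step of the framework above, computing the retransmit ratio over a window, can be sketched as follows. The inputs are two snapshots of the cumulative TcpRetransSegs and TcpOutSegs counters. The function names are illustrative, and the thresholds are the rough guideline figures quoted in the text rather than hard limits.

```python
from typing import Dict


def retransmit_ratio(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Retransmitted segments divided by sent segments between two snapshots."""
    sent = after["TcpOutSegs"] - before["TcpOutSegs"]
    retrans = after["TcpRetransSegs"] - before["TcpRetransSegs"]
    if sent <= 0:
        return 0.0  # idle window: nothing was sent, so no meaningful ratio
    return retrans / sent


def classify(ratio: float) -> str:
    """Map a ratio onto the guideline bands from the text."""
    percent = ratio * 100.0
    if percent > 1.0:
        return "problem"
    if percent < 0.01:
        return "healthy even for a data center"
    return "compare against your baseline"
```

Two snapshots ten seconds apart with 200,000 segments sent and 2,500 retransmitted give a ratio of 1.25 percent, which lands in the "problem" band. The more important comparison is still against the same service's historical ratio.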
Real-World Example: In 2019, Slack documented a multi-hour latency incident where retransmits on east-west traffic between US-East AZs jumped 20x. Root cause was an AWS networking change that introduced asymmetric ECMP rehashing, causing TCP segments to take different paths and arrive out of order. Fast retransmits fired aggressively because three duplicate ACKs were a normal occurrence. Mitigation was raising net.ipv4.tcp_reordering to tolerate more out-of-order segments before declaring loss, and enabling RACK for time-based recovery. The longer-term fix came from the cloud provider’s side, but the host-level tuning carried them through.
Senior follow-up 1: How do you distinguish a forward-path loss from a reverse-path ACK loss using only host-side counters?
Forward-path loss shows as duplicate ACKs from the receiver and triggers fast retransmit. Reverse-path ACK loss shows as RTO-driven retransmits with few preceding duplicate ACKs: the data actually arrived, the ACKs were lost, and the sender retransmits only because its timer fired. Counter signature: in nstat, TcpExtTCPFastRetrans increases for forward loss, while TcpExtTCPTimeouts increases (without much fast retransmit) for ACK loss. ACK loss is rarer but worth recognizing because the cure (turn on selective ACK with tcp_sack=1, which is default but worth verifying) is different from forward-loss tuning.
Senior follow-up 2: Why might tcp_low_latency actually hurt latency on some workloads?
tcp_low_latency (deprecated in modern kernels, but still relevant for older systems) disables prequeue processing: packets are processed in softirq context immediately rather than queued for the receiving process to drain. For a single connection on a single CPU it sounds like a win. The catch: under high softirq load, tying processing to the receive interrupt can starve other work and increase tail latency systemwide. Modern kernels removed the prequeue entirely, so the knob is a no-op. The lesson: latency knobs that look local often have systemic effects, which is why most “tcp_low_latency”-style flags get deprecated once the kernel team has data.
Senior follow-up 3: How would you use eBPF to attribute retransmissions to specific application code paths?
Attach a kprobe on tcp_retransmit_skb to capture the socket and the kernel stack. Be careful with bpf_get_current_pid_tgid and bpf_get_current_comm here: retransmits usually fire from timer or softirq context, so the current task is often not the thread that owns the socket. A more reliable approach is to record the owning PID in a BPF map keyed by socket pointer at connect() or accept() time and look it up in the retransmit probe. For request-attribution, correlate the socket’s source port with a userspace ringbuffer that the application writes (request_id, source_port) tuples into when it opens a connection. This lets you say “retransmissions are concentrated on the search service’s outbound connections to the recommendations service from 14:02-14:07,” which is what you actually need to fix the problem. Tools like tcpretrans from BCC automate variants of this.
Common Wrong Answers:
“Bump the TCP window.” A bigger window does not help retransmits — it makes them worse if the path is lossy because more in-flight segments mean more losses per RTO event.
“Switch to UDP.” UDP does not have retransmits, but it also does not have ordering, congestion control, or reliability — you have just moved the problem out of the kernel into your application code, which is rarely simpler.
“Enable jumbo frames.” Larger MSS reduces per-packet overhead but increases the cost of each lost packet (you retransmit 9000 bytes instead of 1500). It is a throughput optimization, not a retransmit fix.
Further Reading:
Brendan Gregg, “TCP Retransmits” tools and methodology — the BCC tcpretrans writeup
Linux source: net/ipv4/tcp_input.c (tcp_fastretrans_alert), net/ipv4/tcp_recovery.c (RACK)
“Making Linux TCP Fast” (Yuchung Cheng and Neal Cardwell, Netdev 1.2, 2016): a clear tour of Linux TCP loss detection and recovery from two of the BBR authors
How does Linux handle SYN floods? Walk me through SYN cookies, the alternatives, and what you would deploy in 2026.
Strong Answer Framework:
Define the threat model. A SYN flood sends SYN packets with spoofed source addresses faster than the server can respond or evict half-open state. Without defenses, the SYN queue fills, legitimate clients cannot establish connections, and you have a denial of service. The defining property is statelessness on the attacker’s side — the attacker burns very little CPU per SYN.
First line: large SYN backlogs and fast eviction. tcp_max_syn_backlog controls how many half-opens can sit in the queue. Raising it from a small default (128 on low-memory machines; the kernel scales it up with RAM) to 65536 buys you headroom. tcp_synack_retries controls how many times the kernel retries the SYN-ACK. With a 1-second initial RTO that doubles on each retry, lowering this from 5 to 2 cuts the time half-opens stay in the queue from ~63 seconds to ~7 seconds, multiplying effective capacity.
Second line: SYN cookies. When the SYN queue is full, the kernel switches to cookie mode. It encodes the four-tuple, MSS, and a time counter into the initial sequence number using a keyed hash (SipHash in current kernels, SHA-1 in older ones), sends the SYN-ACK with that ISN, and forgets the connection entirely. When the third ACK arrives, the kernel validates the ack number minus 1 against its hash and, if valid, reconstructs the connection state. The server is now stateless too, which neutralizes the asymmetry.
Third line: drop earlier in the stack. SYN cookies still process every SYN through softirq and the TCP stack — expensive at packet-flood rates. XDP programs run in the driver before sk_buff allocation and can drop spoofed traffic at 10-100x the packet rate. Cloudflare and Meta both built their DDoS mitigation around this principle: identify obvious-bad traffic in XDP, let everything else through to the normal stack with cookies as a backstop.
Fourth line: upstream filtering. Anti-spoofing filters at ISPs (BCP38) and scrubbing centers handle the volumetric portion. By the time the traffic reaches your edge, it should be substantially cleaner. This is non-negotiable for any internet-facing service above small scale.
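The effect of tcp_synack_retries on half-open lifetime is simple arithmetic. The sketch below assumes a 1-second initial retransmission timeout that doubles after every attempt and never reaches the kernel's maximum RTO. The function name is chosen for illustration.

```python
def half_open_lifetime_s(synack_retries: int, initial_rto_s: float = 1.0) -> float:
    """Seconds a half-open entry survives before the kernel gives up.

    The original SYN-ACK waits one RTO, and each of the retries waits
    twice as long as the attempt before it. The entry is dropped when
    the timer of the last retry expires.
    """
    return sum(initial_rto_s * 2 ** attempt for attempt in range(synack_retries + 1))
```

Under these assumptions five retries keep an entry for 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds and two retries for 1 + 2 + 4 = 7 seconds. The same queue therefore turns over about nine times faster, which is where the capacity gain comes from.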
Real-World Example: In February 2020, AWS mitigated a 2.3 Tbps DDoS attack — the largest publicly disclosed at the time — much of which was reflective UDP traffic, but with a substantial SYN-flood component aimed at TCP services. The mitigation stack involved BGP flowspec filtering at network edges, Shield Advanced rate limits, and host-level SYN cookies as a final backstop. The public writeup credits multi-layer defense: no single mechanism handled the entire flood; each layer reduced the rate the next layer had to handle.
Senior follow-up 1: What information does a SYN cookie lose, and why does it matter?
The cookie encodes the source/destination addresses, ports, an MSS class (3 bits, so up to 8 bins, in Bernstein’s original design; current Linux uses a 4-entry table), and a timestamp counter. It does not encode the full TCP options block: window scaling, SACK permitted, and the timestamp option do not fit in the sequence number. When the client supports TCP timestamps, Linux recovers window scale and SACK by stashing them in the low bits of the timestamp value it sends; without timestamps, the connection comes up with no window scaling and no SACK, which caps the window at 64 KB and limits throughput on high-BDP connections. For an attack, this is fine: you are protecting against denial-of-service, not optimizing performance. For a misconfigured cookie-always-on system handling legitimate traffic, you can leave 50 percent of throughput on the table without realizing it. This is why tcp_syncookies=1 (cookies only when queue is full) is correct, while tcp_syncookies=2 (always cookies) is not.
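A toy version of the cookie construction shows where the information loss comes from. The layout below follows Bernstein's description (5 bits of time counter, 3 bits of MSS index, 24 bits of keyed hash) and is not the exact Linux layout. HMAC-SHA256 is swapped in because it is in the standard library, and the MSS table and function names are illustrative assumptions.

```python
import hashlib
import hmac
from typing import Optional, Tuple

FourTuple = Tuple[str, int, str, int]
MSS_TABLE = [536, 1300, 1440, 1460]  # illustrative small table of MSS classes


def _mac24(secret: bytes, four_tuple: FourTuple, counter: int) -> int:
    """24-bit keyed hash over the connection identity and the time counter."""
    message = repr((four_tuple, counter)).encode()
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big")


def make_cookie(secret: bytes, four_tuple: FourTuple, client_mss: int, counter: int) -> int:
    """Build a 32-bit ISN: [5 bits counter | 3 bits MSS index | 24 bits MAC]."""
    fitting = [i for i, mss in enumerate(MSS_TABLE) if mss <= client_mss]
    index = fitting[-1] if fitting else 0  # round the MSS down to a table entry
    return ((counter % 32) << 27) | (index << 24) | _mac24(secret, four_tuple, counter)


def check_cookie(secret: bytes, four_tuple: FourTuple, cookie: int,
                 now_counter: int, max_age: int = 2) -> Optional[int]:
    """Validate a cookie (the ACK number minus 1) and return the recovered MSS."""
    counter_bits = cookie >> 27
    index = (cookie >> 24) & 0x7
    mac = cookie & 0xFFFFFF
    for age in range(max_age):
        counter = now_counter - age
        if counter < 0:
            break
        if counter % 32 == counter_bits and _mac24(secret, four_tuple, counter) == mac:
            return MSS_TABLE[index] if index < len(MSS_TABLE) else None
    return None
```

A client that advertised an MSS of 1400 gets 1300 back after validation, because only the table index survives the round trip. Window scale and SACK have no bits at all in this layout, which is exactly the loss described above.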
Senior follow-up 2: How do XDP-based DDoS mitigations decide what to drop without state?
They use stateless heuristics computed per-packet: drop SYNs with mismatched TCP flags, drop packets with source IPs in known spoofed ranges (bogons, your own subnets), drop packets that fail simple geographic or rate-limit fingerprints. For state that does fit, XDP can use eBPF maps: a per-source-IP token bucket implemented as a BPF_MAP_TYPE_LRU_HASH lets you rate-limit cheaply. What XDP cannot do is track multi-packet TCP state, so the line you draw is “stateless filtering in XDP, stateful logic in the kernel TCP stack with cookies.” Cloudflare’s open-sourced bpf-tools and Meta’s Katran are real implementations of this design.
Senior follow-up 3: What is the difference between tcp_syncookies=1 and tcp_syncookies=2, and which would you ever set to 2?
Mode 1 enables cookies as a fallback when the SYN queue overflows; normal traffic uses the regular handshake path. Mode 2 forces cookies on every SYN regardless of queue state. The only time mode 2 is reasonable is on a host that has been chronically under attack and where you have measured that the regular path’s SYN queue keeps filling faster than it drains. In practice, that means the attack is sustained enough that you are paying the cookie cost continuously anyway. Even then, you have probably already moved DDoS handling to a separate scrubbing layer and the host should not see the volume. I have not deployed mode 2 in production and would treat anyone who has as either a niche operator or someone who has not noticed the throughput cost.
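The per-source-IP token bucket can be modeled in userspace to show the logic an XDP program would run against an LRU map. This is a sketch only: the class name is hypothetical, an OrderedDict stands in for BPF_MAP_TYPE_LRU_HASH, and real BPF LRU eviction is approximate rather than strictly least-recently-used.

```python
from collections import OrderedDict
from typing import Tuple


class LruTokenBuckets:
    """Per-key token buckets with bounded memory and LRU eviction."""

    def __init__(self, max_entries: int, rate_per_s: float, burst: float) -> None:
        self.max_entries = max_entries
        self.rate = rate_per_s
        self.burst = float(burst)
        self.buckets: "OrderedDict[str, Tuple[float, float]]" = OrderedDict()

    def allow(self, src_ip: str, now: float) -> bool:
        """Return True to pass the packet, False to drop it."""
        # Unknown sources start with a full bucket.
        tokens, last = self.buckets.pop(src_ip, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        # Re-inserting moves the key to the most-recently-used position.
        self.buckets[src_ip] = (tokens, now)
        if len(self.buckets) > self.max_entries:
            self.buckets.popitem(last=False)  # evict the least recently used key
        return allowed
```

The bounded map is the important design choice: a spoofed flood uses a new source address for nearly every packet, so an unbounded table would become the attack target itself. With LRU eviction the memory cost stays fixed no matter how many distinct sources appear.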
Common Wrong Answers:
“Just enable SYN cookies, problem solved.” Cookies are necessary but insufficient. They consume CPU per SYN and lose TCP options. At packet-flood scale you need filtering before the kernel even allocates the sk_buff.
“Increase the backlog to a huge number.” A larger queue defers the failure but does not prevent it. If the attack rate exceeds your eviction rate, any finite queue overflows.
“Block UDP and you stop SYN floods.” SYN floods are TCP. You are confusing reflection attacks (often UDP) with floods (any protocol).
Further Reading:
Cloudflare blog: “SYN packet handling in the wild” (Marek Majkowski) and “How to drop 10 million packets per second” — the canonical XDP-DDoS writeup
DJ Bernstein’s original SYN cookie page (cr.yp.to/syncookies.html) — still the clearest explanation of the cryptographic construction
Linux source: net/ipv4/syncookies.c, particularly cookie_v4_init_sequence
“DDoS Attack Mitigation” (Cardwell, Cheng, Yang) — Google’s writeup on integrating SYN protection with congestion control