Kernel Networking Stack
The networking subsystem is one of the most complex and performance-critical parts of the Linux kernel. Understanding how packets flow from the Network Interface Card (NIC) through kernel layers to application sockets is essential for building high-performance networked systems.
Mastery Level: Senior Systems Engineer
Key Internals: sk_buff, NAPI, RSS/RPS/RFS, XDP, TCP congestion control, Netfilter
Prerequisites: Interrupts, Virtual Memory
1. The Network Stack Architecture
1.1 Layer Overview
The Linux network stack follows the OSI model but implements it in a Linux-specific way.
2. The Core Data Structure: sk_buff
The struct sk_buff (socket buffer) is the heart of the Linux networking stack. It represents a network packet as it travels through the kernel.
2.1 sk_buff Structure
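The real definition lives in include/linux/skbuff.h and spans dozens of fields; the fragment below is a heavily abridged, illustrative sketch (the struct name and field selection are mine, and details vary across kernel versions), intended only to show the pointers the following subsections refer to.

```c
#include <linux/types.h>

struct sock;
struct net_device;

/* Heavily abridged sketch of struct sk_buff -- see include/linux/skbuff.h
 * for the real definition. */
struct sk_buff_sketch {
    struct sk_buff_sketch *next, *prev;  /* queue linkage */
    struct sock *sk;                     /* owning socket, if any */
    struct net_device *dev;              /* ingress/egress device */

    unsigned int len;                    /* bytes of packet data */
    __u16 transport_header;              /* offset of TCP/UDP header */
    __u16 network_header;                /* offset of IP header */
    __u16 mac_header;                    /* offset of Ethernet header */

    unsigned char *head;                 /* start of the allocated buffer */
    unsigned char *data;                 /* start of valid data (headroom = head..data) */
    unsigned char *tail;                 /* end of valid data (tailroom = tail..end) */
    unsigned char *end;                  /* end of the allocated buffer */
};
```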
2.2 sk_buff Memory Layout
Understanding the memory layout is essential for reasoning about zero-copy optimizations.
2.3 Zero-Copy Mechanisms
Problem: Copying large packets is expensive (memory-bandwidth limited).
Solution 1: skb_clone() - clone the sk_buff structure while sharing the underlying packet data.
2.4 sk_buff Operations
The helpers fall into three groups; a short usage sketch follows after the list.
- Header Manipulation
- Memory Management
- Data Access
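As a rough illustration of the header-manipulation and memory-management helpers, the sketch below builds an outgoing packet by reserving headroom first and then pushing headers in front of the payload. The function name is hypothetical; the helpers (alloc_skb, skb_reserve, skb_put_data, skb_push) are real kernel APIs, but a real driver would add error handling and actually fill in the headers.

```c
#include <linux/skbuff.h>
#include <linux/if_ether.h>
#include <linux/ip.h>

/* Illustrative only: reserve headroom up front, append the payload,
 * then prepend headers by moving the data pointer backwards. */
static struct sk_buff *build_example_skb(const void *payload, unsigned int len)
{
    unsigned int headroom = ETH_HLEN + sizeof(struct iphdr);
    struct sk_buff *skb = alloc_skb(headroom + len, GFP_KERNEL);

    if (!skb)
        return NULL;

    skb_reserve(skb, headroom);           /* creates headroom: moves data and tail forward */
    skb_put_data(skb, payload, len);      /* appends payload: advances tail */
    skb_push(skb, sizeof(struct iphdr));  /* prepends IP header space: moves data back */
    skb_push(skb, ETH_HLEN);              /* prepends Ethernet header space */

    return skb;                           /* caller fills the headers and transmits */
}
```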
3. Packet Reception: From Wire to Socket
3.1 The Legacy Interrupt-Driven Model (Pre-NAPI)
Old approach (before the 2.5 kernel): every received packet raised a hardware interrupt, which collapses under high packet rates (receive livelock).
3.2 NAPI: New API (Polling + Interrupts)
Solution: Hybrid polling/interrupt model (a minimal poll-handler sketch follows after this list).
- Low latency under low load: Interrupts still used
- High throughput under high load: Polling avoids interrupt overhead
- Fairness: Budget limits per-device processing
- CPU efficiency: No interrupt storm
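A minimal sketch of the driver side of NAPI. The device context struct (my_dev) and the my_* ring/IRQ helpers are stubs standing in for real hardware access; the NAPI calls themselves (napi_schedule, napi_complete_done) are the genuine kernel API.

```c
#include <linux/netdevice.h>
#include <linux/interrupt.h>

/* Hypothetical device context; a real driver keeps RX ring state here. */
struct my_dev {
    struct napi_struct napi;
    struct net_device *netdev;
};

/* Stubs standing in for register writes and ring handling. */
static void my_disable_rx_irq(struct my_dev *dev) { }
static void my_enable_rx_irq(struct my_dev *dev) { }
static bool my_rx_ring_has_packet(struct my_dev *dev) { return false; }
static void my_process_one_packet(struct my_dev *dev) { /* build skb, napi_gro_receive() */ }

/* IRQ handler: mask further RX interrupts and defer work to NAPI. */
static irqreturn_t my_isr(int irq, void *data)
{
    struct my_dev *dev = data;

    my_disable_rx_irq(dev);
    napi_schedule(&dev->napi);    /* the NET_RX softirq will invoke my_poll() */
    return IRQ_HANDLED;
}

/* Poll function: process up to 'budget' packets in softirq context. */
static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_dev *dev = container_of(napi, struct my_dev, napi);
    int done = 0;

    while (done < budget && my_rx_ring_has_packet(dev)) {
        my_process_one_packet(dev);
        done++;
    }

    /* Ring drained within budget: leave polling mode, re-arm the IRQ. */
    if (done < budget && napi_complete_done(napi, done))
        my_enable_rx_irq(dev);

    return done;                  /* returning 'budget' keeps the device in polling mode */
}
```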
3.3 Receive Packet Steering (RPS/RFS)
Problem: Single NIC queue → all packets processed on one CPU core.
Solution: Distribute packet processing across multiple CPUs.
RSS (Hardware)
Receive Side Scaling
Pros: Hardware acceleration
Cons: Requires multi-queue NIC
- NIC has multiple RX queues
- NIC hashes packet (IP + port)
- Distributes to different queues
- Each queue has own IRQ → CPU core
RPS (Software)
Receive Packet Steering
Pros: Works with single-queue NIC
Cons: Extra CPU for steering
- Software-based RSS
- CPU that receives IRQ hashes packet
- Enqueues to target CPU’s backlog
- Target CPU processes the packet (a toy flow-hash sketch follows after this list)
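To make the steering step concrete, here is a toy user-space sketch of the idea: hash the 4-tuple and map every packet of a flow to the same CPU/queue. The hash function is made up for illustration; the kernel uses its own flow hash, and RSS uses a Toeplitz hash computed in NIC hardware.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy flow hash: mix the 4-tuple and pick a target CPU/queue.
 * Illustrative only -- not the kernel's actual hash. */
static unsigned int pick_cpu(uint32_t saddr, uint32_t daddr,
                             uint16_t sport, uint16_t dport,
                             unsigned int ncpus)
{
    uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);

    h ^= h >> 16;
    h *= 0x45d9f3b;      /* arbitrary mixing constant */
    h ^= h >> 16;
    return h % ncpus;    /* all packets of one flow land on the same CPU */
}

int main(void)
{
    /* Example flow: 192.0.2.1:40000 -> 198.51.100.7:443 on an 8-CPU box. */
    printf("steered to CPU %u\n",
           pick_cpu(0xC0000201, 0xC6336407, 40000, 443, 8));
    return 0;
}
```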
4. XDP: eXpress Data Path
XDP allows running eBPF programs before sk_buff allocation, at the earliest possible point in packet processing.
4.1 XDP Architecture
4.2 XDP Program Example
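A minimal sketch of an XDP program that drops UDP packets destined to an arbitrary example port (7777) and passes everything else; it assumes libbpf-style annotations and compilation with clang's BPF target, and ignores IP options for simplicity.

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_udp_7777(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Every access must be bounds-checked or the verifier rejects the program. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    /* Assumes no IP options (header is exactly sizeof(struct iphdr)). */
    struct udphdr *udp = (void *)(ip + 1);
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    if (udp->dest == bpf_htons(7777))   /* 7777 is an arbitrary example port */
        return XDP_DROP;

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

The compiled object can typically be attached with iproute2 (ip link set dev <iface> xdp obj <file> sec xdp) or via libbpf; interface and object names depend on your setup.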
4.3 XDP Actions
- XDP_DROP
- XDP_TX
- XDP_REDIRECT
- XDP_PASS
Use cases:
- DDoS mitigation (drop attack traffic before the stack)
- Invalid packet filtering
- Rate limiting at wire speed
4.4 AF_XDP: Zero-Copy to User Space
AF_XDP allows user-space programs to receive packets directly from the NIC DMA buffers, bypassing the kernel stack entirely.
5. The TCP/IP Stack
5.1 IP Layer Processing
5.2 TCP Layer: The Fast Path
TCP processing has two paths:
Fast Path
Performance: ~1 µs per packet
Conditions:
- In-order segment
- No flags (except ACK)
- Window not full
- No urgent data
- Checksum OK
Slow Path
Performance: ~5-10 µs per packet
Triggers:
- Out-of-order segment
- Retransmission
- Window probing
- Options (SACK, timestamps)
- Connection management (SYN, FIN)
5.3 TCP Congestion Control
Linux supports pluggable congestion control algorithms (a per-socket selection sketch follows after this list):
- CUBIC (Default)
- BBR (Modern)
- Configuration
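As a concrete illustration of the pluggable design: the algorithm can be selected system-wide (net.ipv4.tcp_congestion_control) or per socket with the TCP_CONGESTION socket option. The sketch below takes the per-socket route and assumes the tcp_bbr module is available; error handling is trimmed to the essentials.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Request BBR for this socket; fails cleanly if tcp_bbr is not loaded. */
    const char algo[] = "bbr";
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) < 0)
        perror("setsockopt(TCP_CONGESTION)");

    /* Read back what is actually in effect. */
    char in_use[16] = {0};
    socklen_t len = sizeof(in_use);
    getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, in_use, &len);
    printf("congestion control in use: %s\n", in_use);

    close(fd);
    return 0;
}
```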
6. Socket Layer & System Calls
6.1 Socket Creation
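A minimal user-space sketch of the socket()/bind()/listen()/accept() sequence that exercises this layer; the port (8080) and backlog (128) are arbitrary example values.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    /* socket() allocates the kernel-side socket objects and returns a
     * file descriptor backed by the socket file operations. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); exit(1); }

    struct sockaddr_in addr = {
        .sin_family      = AF_INET,
        .sin_port        = htons(8080),        /* arbitrary example port */
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); exit(1); }
    if (listen(fd, 128) < 0) { perror("listen"); exit(1); }    /* 128 = example backlog */

    int conn = accept(fd, NULL, NULL);          /* blocks until a peer connects */
    if (conn >= 0)
        close(conn);
    close(fd);
    return 0;
}
```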
6.2 send() and recv() Internals
- send()
- recv()
6.3 Zero-Copy Techniques
sendfile()
MSG_ZEROCOPY
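A combined sketch of both techniques, assuming an already-connected TCP socket (sock_fd), an open file (file_fd), and a kernel/headers recent enough to expose SO_ZEROCOPY and MSG_ZEROCOPY (4.14+). Completion handling on MSG_ERRQUEUE, which real MSG_ZEROCOPY users must perform before reusing the buffer, is omitted.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/sendfile.h>

/* sendfile(): file pages flow from the page cache to the socket without
 * ever being copied into user space. */
ssize_t send_file_zero_copy(int sock_fd, int file_fd, size_t count)
{
    off_t offset = 0;
    return sendfile(sock_fd, file_fd, &offset, count);
}

/* MSG_ZEROCOPY: user pages are pinned and handed to the NIC; the buffer
 * must stay untouched until a completion arrives on MSG_ERRQUEUE. */
ssize_t send_buf_zero_copy(int sock_fd, const void *buf, size_t len)
{
    int one = 1;
    setsockopt(sock_fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
    return send(sock_fd, buf, len, MSG_ZEROCOPY);
}
```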
7. Netfilter & Packet Filtering
7.1 Netfilter Hook Points
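A minimal sketch of how a kernel module registers a hook at one of these points (NF_INET_PRE_ROUTING here) and counts IPv4 packets. The module and counter names are invented for the example; the nf_register_net_hook() API and the hook function signature are the real kernel interface.

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>
#include <linux/atomic.h>
#include <net/net_namespace.h>

static atomic_t pkt_count = ATOMIC_INIT(0);   /* example-only counter */

/* Called for every IPv4 packet as it enters IP processing. */
static unsigned int count_hook(void *priv, struct sk_buff *skb,
                               const struct nf_hook_state *state)
{
    atomic_inc(&pkt_count);
    return NF_ACCEPT;            /* let the packet continue through the stack */
}

static struct nf_hook_ops count_ops = {
    .hook     = count_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,   /* earliest IPv4 hook point */
    .priority = NF_IP_PRI_FIRST,
};

static int __init count_init(void)
{
    return nf_register_net_hook(&init_net, &count_ops);
}

static void __exit count_exit(void)
{
    nf_unregister_net_hook(&init_net, &count_ops);
    pr_info("saw %d packets\n", atomic_read(&pkt_count));
}

module_init(count_init);
module_exit(count_exit);
MODULE_LICENSE("GPL");
```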
7.2 Connection Tracking (conntrack)
7.3 iptables Performance
8. Network Buffers & Memory Management
8.1 Socket Buffers
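As a small illustration, per-socket limits can be inspected and overridden with SO_RCVBUF/SO_SNDBUF. The sketch below uses a throwaway socket; note that explicitly setting SO_RCVBUF opts that socket out of TCP receive autotuning, the kernel doubles the requested value to leave room for bookkeeping overhead, and the request is capped by net.core.rmem_max for unprivileged processes.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

static void show_and_bump_rcvbuf(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    /* Current effective receive buffer (kernel reports the doubled value). */
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len);
    printf("SO_RCVBUF before: %d bytes\n", size);

    /* Request 1 MiB; this pins the size and disables autotuning for the socket. */
    int want = 1 << 20;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));

    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len);
    printf("SO_RCVBUF after:  %d bytes\n", size);
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    show_and_bump_rcvbuf(fd);
    close(fd);
    return 0;
}
```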
8.2 TCP Autotuning
9. Performance Monitoring & Debugging
9.1 Essential Tools
- ss (socket statistics)
- netstat
- ethtool
- /proc & /sys
9.2 Tracing with BPF
10. Interview Questions & Answers
Q1: Explain the sk_buff structure and why headroom/tailroom matter.
sk_buff: The fundamental packet data structure in Linux networking.
Memory Layout: the head, data, tail, and end pointers divide one linear buffer into headroom, packet data, and tailroom.
Why Headroom Matters:
As a packet moves down the network stack (from app to wire), each layer adds a header:
- Application data
- +20 bytes TCP header (skb_push)
- +20 bytes IP header (skb_push)
- +14 bytes Ethernet header (skb_push)
Each skb_push() moves the data pointer backwards into the pre-reserved headroom, so headers are prepended without copying the payload.
Why Tailroom Matters:
- For adding trailers (less common)
- For TSO/GSO (TCP Segmentation Offload): Kernel builds large packets, NIC splits them
Q2: How does NAPI improve packet processing performance?
Problem with Old Interrupt Model:
- Each packet → hardware interrupt
- At 1 Gbps with minimum-size frames (~1.5M packets/sec), the CPU spends nearly 100% of its time handling interrupts
- This is “interrupt storm” or “receive livelock”
How NAPI works:
1. Packet arrives → IRQ
2. Driver disables NIC interrupts
3. Driver schedules NAPI poll
4. Driver returns immediately from the IRQ
5. The NET_RX softirq later calls the driver's poll function
6. Poll function processes up to budget packets (default 64)
7. If more packets remain, stay in polling mode
8. If ring buffer empty, re-enable interrupts
Benefits:
- Low latency (low load): Interrupts still used
- High throughput (high load): Polling avoids interrupt overhead
- Fairness: Budget prevents one NIC from starving others
- Adaptive: Automatically switches modes
Q3: What is XDP and how does it achieve such high performance?
XDP (eXpress Data Path): runs eBPF programs at the earliest possible point in packet processing, in the driver before an sk_buff is allocated.
Traditional Path: NIC → DMA → sk_buff allocation → driver → full protocol stack → socket.
XDP Path: NIC → DMA → eBPF program verdict (drop, transmit, redirect, or pass to the stack).
Why So Fast:
- No sk_buff allocation: Operating directly on DMA buffer
- No cache misses: Data still in L1 cache from DMA
- No context switches: Runs in softirq context
- Early drop: Can discard packets before any processing
- JIT compiled: eBPF → native machine code
- XDP_DROP: Discard (DDoS mitigation at 10M+ pps)
- XDP_TX: Bounce back out the same interface (L2 load balancer)
- XDP_REDIRECT: Send to a different NIC or CPU
- XDP_PASS: Continue to the normal stack
Use cases:
- DDoS mitigation
- Load balancing
- Packet filtering
- Network monitoring
Q4: Explain TCP Fast Path vs Slow Path.
Fast Path (common case optimization). Conditions:
- TCP connection is ESTABLISHED
- Packet arrives in-order (seq == rcv_nxt)
- No flags except ACK
- Receive window not full
- No urgent data
- Checksum valid
Slow Path (handles complex cases). Triggers:
- Out-of-order segment (requires reassembly)
- Retransmission (update RTO, congestion window)
- Connection management (SYN, FIN, RST)
- Options processing (SACK, timestamps, window scaling)
- Zero window probing
Impact: Fast path handles 90%+ of packets in established bulk-data transfers. Slow path ensures correctness for edge cases.
Optimization: Keep connections in the fast path by:
- Avoiding packet loss (good network)
- Using large enough buffers (avoid window full)
- Minimizing out-of-order delivery (good QoS)
Q5: How does RSS/RPS/RFS distribute packet processing across CPU cores?
Problem: Single NIC queue → all packets processed on one CPU → bottleneck.
Solution 1: RSS (Receive Side Scaling) - hardware:
- NIC has multiple RX queues (e.g., 8 queues)
- NIC computes hash:
hash(src_ip, dst_ip, src_port, dst_port) % num_queues
- Each queue has a dedicated IRQ mapped to a specific CPU
- Result: Packets distributed across CPUs in hardware
Solution 2: RPS (Receive Packet Steering) - software:
- Single-queue NIC
- CPU receiving IRQ computes hash
- Enqueues packet to target CPU’s backlog
- Target CPU processes packet
Solution 3: RFS (Receive Flow Steering) - flow-to-CPU affinity:
- Extension of RPS
- Tracks which CPU application is running on
- Steers packets to that specific CPU
- Result: Packet data in cache when application reads it
Q6: What is connection tracking (conntrack) and why can it be a bottleneck?
Connection Tracking: kernel subsystem that tracks the state of all connections (TCP, UDP, ICMP).
Purpose:
- Enable stateful firewall rules
- NAT (must remember translations)
- Connection-based filtering
Performance Issues:
- Hash table lookup: O(1) but still overhead on every packet
- Global lock: (older kernels) serializes all conntrack operations
- Memory: each connection consumes memory (~300 bytes)
- Hash collisions: degrade to O(n) lookup
Solutions:
- Increase the conntrack table size (nf_conntrack_max) for high connection counts
- Bypass tracking with NOTRACK rules for stateless traffic
When to bypass: high-traffic stateless services (load balancers, DNS servers, CDN edges).
Q7: Explain zero-copy networking techniques (sendfile, MSG_ZEROCOPY, splice).
Problem: Traditional send/receive involves multiple memory copies.
Traditional path (4 copies): DMA disk → page cache, copy page cache → user buffer, copy user buffer → socket buffer, DMA socket buffer → NIC.
Zero-Copy Techniques: 1. sendfile(), 2. splice(), 3. MSG_ZEROCOPY, 4. mmap() + write().
Performance Impact: eliminating the CPU copies saves memory bandwidth and CPU cycles, which matters most for large transfers.
When to use:
| Method | Copies | Use Case |
|---|---|---|
| Traditional | 4 | Small data, flexibility needed |
| sendfile() | 1-2 | Static file serving |
| splice() | 0 | Piping data between FDs |
| MSG_ZEROCOPY | 0 | Large sends (>10KB) |
- sendfile(): Web server serving files
- splice(): Proxy/gateway (socket → socket)
- MSG_ZEROCOPY: Bulk data transfer, streaming
Q8: How does TCP congestion control work? Compare CUBIC vs BBR.
Problem: Send too fast → network congestion → packet loss. Send too slow → underutilized bandwidth.
Goal: Find the optimal sending rate (maximize throughput, minimize loss).
CUBIC (Linux default):
Algorithm:
- Maintains congestion window (cwnd) = max packets in flight
- On loss: cwnd = cwnd × β (reduce by 30%)
- Recovery: Grow cwnd using cubic function
Behavior:
- Aggressive growth after loss (good for high-bandwidth links)
Pros:
- Fair to other CUBIC flows
- Simple, well-tested
Cons:
- Treats loss as a congestion signal (wrong for wireless)
- Can cause bufferbloat (fills queues)
- Slow convergence on very high BDP links
BBR (Bottleneck Bandwidth and RTT):
Philosophy: model the network, don't react to loss.
Measures:
- BtlBw (bottleneck bandwidth): Max delivery rate observed
- RTprop (round-trip propagation time): Min RTT observed
Phases:
- STARTUP: Exponential growth to find BtlBw (like slow start)
- DRAIN: Drain queues created during startup
- PROBE_BW: Oscillate pacing rate around BtlBw (main phase)
- PROBE_RTT: Periodically reduce cwnd to re-measure RTprop
Why this matters on lossy links:
- Wireless networks drop packets due to RF interference
- BBR ignores loss, focuses on measured bandwidth
Pros:
- Higher throughput on lossy links (wireless, satellite)
- Lower latency (doesn’t fill buffers)
- Better on bufferbloat-prone networks
Cons:
- Can be unfair to CUBIC flows (more aggressive)
- Requires accurate RTT measurement
- More complex
When to Use:
| Scenario | Best Choice |
|---|---|
| Data center (low latency, rare loss) | CUBIC |
| Internet (bufferbloat common) | BBR |
| Wireless/satellite (lossy) | BBR |
| Mixed traffic | BBR (lower latency helps all) |
Summary
Key Takeaways:
- sk_buff: Central data structure. Understanding headroom/tailroom is key to zero-copy optimizations.
- NAPI: Hybrid interrupt/polling model solves interrupt storm problem at high packet rates.
- XDP: Fastest packet processing path. Process/drop packets before sk_buff allocation using eBPF.
- RSS/RPS/RFS: Distribute packet processing across CPUs for scalability. RFS optimizes for cache locality.
- TCP Fast Path: Handles common case (in-order delivery) with minimal overhead. Slow path handles edge cases.
- Congestion Control: CUBIC (default) vs BBR (better on lossy/bufferbloat links). Understand trade-offs.
- Zero-Copy: sendfile(), splice(), MSG_ZEROCOPY eliminate expensive memory copies for large transfers.
- Conntrack: Essential for stateful firewalls but can be bottleneck. Bypass for high-traffic stateless services.
Tuning checklist:
- Enable multi-queue NIC and RSS
- Tune socket buffers for high BDP
- Use XDP for packet filtering
- Enable BBR for internet traffic
- Increase conntrack table for high connection count
- Use zero-copy for large data transfers
Next: File Systems →