Linux Networking Stack - Packet journey through the kernel

Network Stack

The Linux network stack is one of the most performance-critical subsystems. Understanding it deeply is essential for infrastructure roles at companies like Cloudflare, AWS, and Meta.
Prerequisites: System calls, basic networking concepts
Interview Focus: Packet path, socket buffers, TCP tuning, XDP
Time to Master: 5-6 hours

Network Stack Architecture

Why Layers?

The Linux network stack is organized into layers. But why? The problem: Networking is complex. A single monolithic “send packet” function would need to handle:
  • Application data formatting
  • Connection management (TCP state machine)
  • Routing decisions
  • Hardware-specific transmission
The solution: Separation of concerns. Each layer handles one responsibility:
  • Application layer: What data to send
  • Transport layer (TCP/UDP): How to deliver it (reliably with TCP, best-effort with UDP)
  • Network layer (IP): Where to send it
  • Link layer: Which physical interface
  • Driver: Hardware-specific details
Benefits:
  • Modularity: An application can swap TCP for UDP by changing the socket type; the layers below are untouched
  • Reusability: Same IP layer works for TCP, UDP, ICMP
  • Testability: Can test each layer independently
┌─────────────────────────────────────────────────────────────────────┐
│                    LINUX NETWORK STACK                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  User Space                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Application                                                     ││
│  │  socket(), bind(), listen(), accept(), read(), write()          ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ════════════════════════════│═══════════════════════════════════   │
│              System Call Interface                                   │
│  ════════════════════════════│═══════════════════════════════════   │
│                              ▼                                       │
│  Kernel Space                                                        │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                      Socket Layer                                ││
│  │           struct socket, struct sock, sk_buff queues            ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Transport Layer (L4)                           ││
│  │              TCP (tcp_prot), UDP (udp_prot)                     ││
│  │        Congestion control, retransmission, flow control         ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Network Layer (L3)                             ││
│  │              IP (routing, fragmentation)                        ││
│  │              Netfilter/iptables hooks                           ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Link Layer (L2)                                ││
│  │              ARP, bridging, VLAN                                ││
│  │              tc (traffic control)                               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Device Driver                                  ││
│  │              NAPI, ring buffers, DMA                            ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Hardware (NIC)                                 ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
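
The modularity point can be seen from user space. The following is a minimal sketch, assuming a Linux host; the helper names `open_inet_socket` and `socket_type_of` are hypothetical and not part of any library. The only thing the application changes to move between TCP and UDP is the `type` argument to `socket()`.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: open an IPv4 socket of the given type.
 * SOCK_STREAM selects TCP, SOCK_DGRAM selects UDP. Returns fd or -1. */
int open_inet_socket(int type)
{
    assert(type == SOCK_STREAM || type == SOCK_DGRAM);
    return socket(AF_INET, type, 0);
}

/* Ask the kernel which socket type backs this descriptor (-1 on error). */
int socket_type_of(int fd)
{
    int type = -1;
    socklen_t len = sizeof(type);
    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
        return -1;
    return type;
}
```

Both calls go through the same socket layer and the same IP layer; only the transport protocol attached to the `struct sock` differs.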

Packet Receive Path

Linux Packet Flow
┌─────────────────────────────────────────────────────────────────────┐
│                    PACKET RECEIVE PATH                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. HARDWARE LAYER                                                   │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  NIC receives packet → DMA to ring buffer → Raise IRQ           ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  2. DRIVER LAYER                                                     │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  IRQ Handler (hard IRQ context)                                 ││
│  │  └─ Acknowledge IRQ                                             ││
│  │  └─ Schedule NAPI poll (disable IRQs for this queue)           ││
│  │                                                                  ││
│  │  NAPI Poll (soft IRQ context)                                   ││
│  │  └─ Allocate sk_buff                                            ││
│  │  └─ DMA sync / copy packet data                                 ││
│  │  └─ Set skb metadata (protocol, len, dev)                       ││
│  │  └─ Call napi_gro_receive() or netif_receive_skb()              ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  3. XDP (eXpress Data Path) - Optional                              │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Runs inside NAPI poll, before sk_buff allocation (native mode) ││
│  │  └─ XDP_DROP: Drop immediately                                  ││
│  │  └─ XDP_PASS: Continue to stack                                 ││
│  │  └─ XDP_TX: Transmit back out same interface                   ││
│  │  └─ XDP_REDIRECT: Send to another interface/CPU                ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  4. GRO (Generic Receive Offload)                                   │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Coalesce multiple packets into one large sk_buff               ││
│  │  Reduces per-packet overhead in stack                           ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  5. NETFILTER HOOKS                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  PREROUTING → INPUT (or FORWARD) → Local delivery              ││
│  │  └─ iptables/nftables rules                                     ││
│  │  └─ Connection tracking (conntrack)                             ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  6. IP LAYER                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  ip_rcv() → ip_rcv_finish()                                    ││
│  │  └─ Routing decision (local vs forward)                         ││
│  │  └─ ip_local_deliver() for local packets                       ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  7. TRANSPORT LAYER                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  tcp_v4_rcv() or udp_rcv()                                     ││
│  │  └─ Find socket (hash lookup)                                   ││
│  │  └─ TCP: Process state machine, ACK, etc.                      ││
│  │  └─ Enqueue to socket receive buffer                           ││
│  │  └─ Wake waiting process                                        ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  8. APPLICATION                                                      │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  read()/recv() copies data to user space                       ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
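
The hand-off between the IRQ handler and the NAPI poll in step 2 can be modeled in a few lines. This is a toy user-space sketch, not driver code: `struct rx_ring`, `rx_irq`, and `rx_poll` are hypothetical names, and the ring holds counters instead of packets. It shows the budget rule: poll up to `budget` packets, and only re-enable the interrupt once the ring has drained.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 256

/* Toy model of one RX queue: the "NIC" advances head, the poll loop advances tail. */
struct rx_ring {
    unsigned int head;      /* next slot the NIC will fill */
    unsigned int tail;      /* next slot the poll loop will read */
    bool irq_enabled;       /* false while polling owns the queue */
};

/* Hard-IRQ part: mask this queue's interrupt and hand over to polling. */
void rx_irq(struct rx_ring *r)
{
    r->irq_enabled = false;
}

/* Soft-IRQ part: process up to `budget` packets and return the work done.
 * If the ring drained before the budget ran out, re-enable the interrupt. */
int rx_poll(struct rx_ring *r, int budget)
{
    assert(budget > 0);
    int done = 0;
    while (done < budget && r->tail != r->head) {
        r->tail = (r->tail + 1) % RING_SIZE;   /* "deliver" one packet */
        done++;
    }
    if (done < budget)
        r->irq_enabled = true;   /* ring empty: back to interrupt mode */
    return done;
}
```

With 100 packets pending and a budget of 64, the first poll returns 64 and leaves the interrupt masked; the second returns 36 and re-enables it. Under sustained load the queue stays in polling mode, which is how interrupt storms are avoided.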

Socket Structures

Socket Hierarchy

// User-visible socket
struct socket {
    socket_state            state;      // SS_UNCONNECTED, SS_CONNECTED, ...
    short                   type;       // SOCK_STREAM, SOCK_DGRAM, ...
    unsigned long           flags;      
    struct socket_wq        *wq;        // Wait queue
    struct file             *file;      // Associated file
    struct sock             *sk;        // Internal socket
    const struct proto_ops  *ops;       // Protocol operations
};

// Protocol-specific socket
struct sock {
    struct sock_common      __sk_common;
    socket_lock_t           sk_lock;     // Lock
    struct sk_buff_head     sk_receive_queue;   // Receive buffer
    struct sk_buff_head     sk_write_queue;     // Send buffer
    int                     sk_rcvbuf;   // Receive buffer size
    int                     sk_sndbuf;   // Send buffer size
    int                     sk_wmem_queued;     // Send queue memory
    int                     sk_rmem_alloc;      // Receive memory
    // ... many more fields
};

// TCP-specific socket extends sock
struct tcp_sock {
    struct inet_connection_sock inet_conn;
    
    // TCP state
    u32 snd_una;        // First unacked sequence
    u32 snd_nxt;        // Next sequence to send
    u32 rcv_nxt;        // Expected next sequence
    u32 rcv_wnd;        // Receive window
    
    // Congestion control
    u32 snd_cwnd;       // Congestion window
    u32 snd_ssthresh;   // Slow start threshold
    
    // RTT estimation
    u32 srtt_us;        // Smoothed RTT
    u32 rttvar_us;      // RTT variance
    // ... many more
};
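
Some of these kernel fields are reachable from user space through socket options. The sketch below assumes Linux and uses a hypothetical helper name, `set_and_read_rcvbuf`. It sets `SO_RCVBUF`, which feeds `sk_rcvbuf`, and reads back the value the kernel actually stored.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Request a receive buffer size and return what the kernel stored,
 * or -1 on error. On Linux the stored value is normally double the
 * request (room for bookkeeping overhead), capped by net.core.rmem_max. */
int set_and_read_rcvbuf(int fd, int requested)
{
    assert(requested > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0)
        return -1;

    int actual = 0;
    socklen_t len = sizeof(actual);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
        return -1;
    return actual;
}
```

Setting `SO_RCVBUF` explicitly also turns off receive-buffer auto-tuning for that socket, so it is usually better left alone unless measurements say otherwise.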

sk_buff Structure

Efficient Packet Handling

Every packet in the Linux network stack is represented by an sk_buff (socket buffer). But why is it designed this way? The problem: As a packet moves through layers, each layer adds/removes headers:
  • Application → TCP adds TCP header
  • TCP → IP adds IP header
  • IP → Ethernet adds Ethernet header
Naive approach: Copy data at each layer. A 1500-byte packet copied at each of 4 layers means 6000 bytes copied! The solution: sk_buff uses headroom and tailroom:
  • Allocate extra space at the beginning (headroom) and end (tailroom)
  • Adding a header? Just move the data pointer backward into headroom
  • Removing a header? Move the data pointer forward
  • No copying needed!
Real-world impact: Headers are added and removed without copying the payload. This is one reason a 10Gbps NIC can sustain millions of packets/second: per-layer header handling is pointer arithmetic, not a memory copy.
┌─────────────────────────────────────────────────────────────────────┐
│                        sk_buff STRUCTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  struct sk_buff {                                                    │
│      /* Packet data pointers */                                     │
│      unsigned char *head;     ──────────┐                           │
│      unsigned char *data;     ──────────│──┐                        │
│      unsigned char *tail;     ──────────│──│──┐                     │
│      unsigned char *end;      ──────────│──│──│──┐                  │
│                                         │  │  │  │                   │
│      /* Buffer layout: */               │  │  │  │                   │
│      ┌──────────────────────────────────▼──│──│──│──────────────┐   │
│      │ headroom │◀─────── data ─────────▶│  │  │  │ tailroom    │   │
│      │          │  HEADERS  │  PAYLOAD   │  │  │  │             │   │
│      └──────────┴───────────┴────────────┴──▼──▼──▼─────────────┘   │
│                  │                            │                      │
│                  │◀────────── len ───────────▶│                     │
│                                                                      │
│      /* Important fields */                                         │
│      unsigned int len;           // Data length                     │
│      unsigned int data_len;      // Paged data length               │
│      __be16 protocol;            // L3 protocol (ETH_P_IP)         │
│      __u8 pkt_type;              // PACKET_HOST, PACKET_BROADCAST  │
│      struct net_device *dev;     // Device                          │
│      struct sock *sk;            // Associated socket               │
│                                                                      │
│      /* Header offsets from head; read via accessor helpers */      │
│      __u16 transport_header;     // tcp_hdr(skb), udp_hdr(skb)      │
│      __u16 network_header;       // ip_hdr(skb), ipv6_hdr(skb)      │
│      __u16 mac_header;           // eth_hdr(skb)                    │
│  }                                                                   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

sk_buff Operations

// Allocate sk_buff
struct sk_buff *skb = alloc_skb(size, GFP_KERNEL);

// Reserve headroom for headers
skb_reserve(skb, header_len);

// Add data at tail
unsigned char *data = skb_put(skb, data_len);
memcpy(data, payload, data_len);

// Add header at head
unsigned char *header = skb_push(skb, header_len);

// Remove data from head (processing headers)
skb_pull(skb, header_len);

// Clone (share data, new metadata)
struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);

// Copy (full copy)
struct sk_buff *skb3 = skb_copy(skb, GFP_ATOMIC);
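
The pointer arithmetic behind these calls can be tried outside the kernel. The following is a toy user-space model, not the kernel implementation: `struct toy_skb` and the `toy_*` functions are hypothetical names that mirror `skb_reserve()`, `skb_put()`, `skb_push()`, and `skb_pull()` on a plain heap buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model of the four sk_buff data pointers. */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
    size_t len;
};

int toy_alloc(struct toy_skb *s, size_t size)
{
    s->head = malloc(size);
    if (!s->head)
        return -1;
    s->data = s->tail = s->head;
    s->end = s->head + size;
    s->len = 0;
    return 0;
}

/* Like skb_reserve(): open headroom on an empty buffer. */
void toy_reserve(struct toy_skb *s, size_t n)
{
    assert(s->len == 0 && n <= (size_t)(s->end - s->tail));
    s->data += n;
    s->tail += n;
}

/* Like skb_put(): extend the data area at the tail, return the old tail. */
unsigned char *toy_put(struct toy_skb *s, size_t n)
{
    assert(n <= (size_t)(s->end - s->tail));
    unsigned char *old_tail = s->tail;
    s->tail += n;
    s->len += n;
    return old_tail;
}

/* Like skb_push(): prepend a header by moving data back into headroom. */
unsigned char *toy_push(struct toy_skb *s, size_t n)
{
    assert(n <= (size_t)(s->data - s->head));
    s->data -= n;
    s->len += n;
    return s->data;
}

/* Like skb_pull(): strip a header by moving data forward. */
unsigned char *toy_pull(struct toy_skb *s, size_t n)
{
    assert(n <= s->len);
    s->data += n;
    s->len -= n;
    return s->data;
}
```

Reserving 54 bytes (14 Ethernet + 20 IP + 20 TCP), putting a 1000-byte payload, and then pushing the three headers leaves `data == head` and `len == 1054`, with the payload bytes never moved.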

TCP Connection Lifecycle

Why TIME_WAIT Exists

Before we look at the TCP state machine, let’s understand one of its most misunderstood states: TIME_WAIT. The problem: After closing a connection, what if:
  • The final ACK gets lost? The other side will retransmit FIN
  • Delayed packets from the old connection arrive after we reuse the port?
The solution: The side that closes first stays in TIME_WAIT for 2×MSL (Maximum Segment Lifetime). RFC 793 sets MSL to 2 minutes, but Linux hard-codes the whole TIME_WAIT period to 60 seconds (TCP_TIMEWAIT_LEN). The wait exists to:
  1. Ensure clean shutdown: If final ACK is lost, we can resend it
  2. Prevent packet confusion: Old packets expire before port is reused
Why it’s annoying: A busy Linux server that actively closes 1000 connections/second accumulates about 60,000 TIME_WAIT sockets (1000/s × 60 s). Mitigations:
  • SO_REUSEADDR: Allows bind() to a local port that still has connections in TIME_WAIT
  • tcp_tw_reuse: Reuse TIME_WAIT for outgoing connections
  • Connection pooling: Keep connections alive instead of closing
┌─────────────────────────────────────────────────────────────────────┐
│                    TCP STATE MACHINE                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Client                              Server                          │
│  ──────                              ──────                          │
│                                                                      │
│  CLOSED                              CLOSED                          │
│     │                                   │                            │
│     │ socket(), bind()                  │ socket(), bind(), listen() │
│     │                                   ▼                            │
│     │                                LISTEN                          │
│     │                                   │                            │
│     │ connect()                         │                            │
│     ├─────── SYN ─────────────────────▶│                            │
│     │                                   │                            │
│  SYN_SENT                            │ (SYN received)              │
│     │                                   │                            │
│     │◀────── SYN+ACK ───────────────────┤                            │
│     │                                   │                            │
│     │         (ACK SYN)              SYN_RCVD                       │
│     ├─────── ACK ──────────────────────▶│                            │
│     │                                   │                            │
│  ESTABLISHED                         ESTABLISHED                    │
│     │                                   │                            │
│     │◀════════ DATA TRANSFER ══════════▶│                            │
│     │                                   │                            │
│     │ close()                           │                            │
│     ├─────── FIN ──────────────────────▶│                            │
│     │                                   │                            │
│  FIN_WAIT_1                          CLOSE_WAIT                     │
│     │                                   │                            │
│     │◀────── ACK ───────────────────────┤                            │
│     │                                   │ close()                   │
│  FIN_WAIT_2                             │                            │
│     │                                   │                            │
│     │◀────── FIN ───────────────────────┤                            │
│     │                                   │                            │
│     ├─────── ACK ──────────────────────▶│                            │
│     │                                   │                            │
│  TIME_WAIT                           LAST_ACK                       │
│     │                                   │                            │
│     │ (2*MSL timeout)                   │                            │
│     ▼                                   ▼                            │
│  CLOSED                              CLOSED                          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
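
Of the TIME_WAIT mitigations listed before the diagram, SO_REUSEADDR is the one applied in application code. Here is a minimal sketch, assuming an IPv4 loopback listener; `make_listener` is a hypothetical helper name. The option has to be set before `bind()`.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a loopback TCP listener with SO_REUSEADDR
 * set before bind(), so a restart is not blocked by TIME_WAIT leftovers
 * on the same port. Pass port 0 to let the kernel pick a free port.
 * Returns the listening fd, or -1 on error. */
int make_listener(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
        goto fail;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;
    if (listen(fd, SOMAXCONN) < 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```

This helps a restarting server rebind its port. It does not shorten TIME_WAIT itself, and it does not help a client that is running out of ephemeral ports; that case is what tcp_tw_reuse and connection pooling address.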

Observing TCP States

# View connection states
ss -tan state established
ss -tan state time-wait
ss -tan state syn-recv

# Count by state
ss -Htan | awk '{print $1}' | sort | uniq -c   # -H drops the header line

# Detailed socket info
ss -ti dst 10.0.0.1:443

# Socket memory usage
ss -tm | grep -A1 ESTAB

# Watch for SYN floods
watch -n1 'ss -Htan state syn-recv | wc -l'
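
The same state information is exposed in `/proc/net/tcp`, where the fourth column (`st`) is a hex state code. The following sketch decodes one line of that file; the function names `tcp_state_code` and `tcp_state_name` are hypothetical, and the code-to-name mapping follows the kernel's TCP state enumeration.

```c
#include <assert.h>
#include <stdio.h>

/* Names for the hex codes in the "st" column of /proc/net/tcp. */
static const char *const tcp_state_names[] = {
    [0x01] = "ESTABLISHED", [0x02] = "SYN_SENT",   [0x03] = "SYN_RECV",
    [0x04] = "FIN_WAIT1",   [0x05] = "FIN_WAIT2",  [0x06] = "TIME_WAIT",
    [0x07] = "CLOSE",       [0x08] = "CLOSE_WAIT", [0x09] = "LAST_ACK",
    [0x0A] = "LISTEN",      [0x0B] = "CLOSING",
};

/* Extract the state code from one data line of /proc/net/tcp.
 * Layout: "sl: local:port remote:port st ...". Returns -1 on mismatch
 * (for example, when handed the header line). */
int tcp_state_code(const char *line)
{
    assert(line != NULL);
    unsigned int st = 0;
    if (sscanf(line, "%*s %*s %*s %x", &st) != 1)
        return -1;
    return (int)st;
}

/* Map a state code to its name, or "UNKNOWN" if out of range. */
const char *tcp_state_name(int code)
{
    int n = (int)(sizeof(tcp_state_names) / sizeof(tcp_state_names[0]));
    if (code <= 0 || code >= n)
        return "UNKNOWN";
    return tcp_state_names[code];
}
```

A line whose fourth column is `0A` decodes to LISTEN, and `06` to TIME_WAIT. `ss` gets the same data over netlink, which scales better than parsing this file on hosts with many sockets.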

TCP Tuning Parameters

Buffer Sizes

# Socket buffer defaults and limits
cat /proc/sys/net/core/rmem_default    # Default receive buffer
cat /proc/sys/net/core/rmem_max        # Max receive buffer
cat /proc/sys/net/core/wmem_default    # Default send buffer
cat /proc/sys/net/core/wmem_max        # Max send buffer

# TCP-specific auto-tuning
cat /proc/sys/net/ipv4/tcp_rmem
# min  default  max
# 4096 131072   6291456

cat /proc/sys/net/ipv4/tcp_wmem
# 4096 16384    4194304

# Tune for high-bandwidth paths
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
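
The 16 MiB maximum above is not arbitrary. A common rule of thumb, not stated in the commands themselves, is that the buffer should cover the path's bandwidth-delay product (BDP): bandwidth multiplied by round-trip time. A small sketch of that arithmetic, with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth-delay product: bytes that must be in flight to keep a path full.
 * bits_per_sec is the path bandwidth, rtt_ms the round-trip time. */
uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
    assert(rtt_ms > 0);
    return bits_per_sec / 8 * rtt_ms / 1000;
}

/* Does a buffer maximum (e.g. the third tcp_rmem field) cover the BDP? */
int buffer_covers_path(uint64_t max_buf_bytes, uint64_t bits_per_sec,
                       uint64_t rtt_ms)
{
    return max_buf_bytes >= bdp_bytes(bits_per_sec, rtt_ms);
}
```

For a 1 Gbps path with 100 ms RTT, `bdp_bytes` gives 12,500,000 bytes: the default maximum of 6291456 falls short, while 16777216 covers it. This is a lower bound only, since part of the socket buffer is consumed by kernel bookkeeping rather than payload.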

Connection Parameters

# Backlog queue sizes
cat /proc/sys/net/core/somaxconn          # listen() backlog limit
cat /proc/sys/net/ipv4/tcp_max_syn_backlog # SYN queue size

# TIME_WAIT handling
cat /proc/sys/net/ipv4/tcp_tw_reuse       # Reuse TIME_WAIT for outgoing
cat /proc/sys/net/ipv4/tcp_fin_timeout    # FIN_WAIT_2 timeout

# Keepalive
cat /proc/sys/net/ipv4/tcp_keepalive_time  # Idle before keepalive
cat /proc/sys/net/ipv4/tcp_keepalive_intvl # Interval between probes
cat /proc/sys/net/ipv4/tcp_keepalive_probes # Probes before drop
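
The three keepalive sysctls are system-wide defaults, and they only apply to sockets that enable SO_KEEPALIVE. An application can override them per socket. A minimal sketch, assuming Linux (the `TCP_KEEP*` options are Linux-specific); `enable_keepalive` is a hypothetical helper name.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on one socket and override the three system-wide
 * defaults for it. idle_s and interval_s are in seconds.
 * Returns 0 on success, -1 on the first failing setsockopt(). */
int enable_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    assert(idle_s > 0 && interval_s > 0 && probes > 0);

    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    /* TCP_KEEPIDLE overrides tcp_keepalive_time */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    /* TCP_KEEPINTVL overrides tcp_keepalive_intvl */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) < 0)
        return -1;
    /* TCP_KEEPCNT overrides tcp_keepalive_probes */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
        return -1;
    return 0;
}
```

With `enable_keepalive(fd, 60, 10, 5)`, a dead peer is detected roughly 60 + 10 × 5 = 110 seconds after the connection goes idle.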

Congestion Control

# Available algorithms
cat /proc/sys/net/ipv4/tcp_available_congestion_control
# reno cubic bbr

# Current algorithm
cat /proc/sys/net/ipv4/tcp_congestion_control

# Switch to BBR
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Per-socket selection (C code in the application, not a shell command)
setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3);
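
The per-socket call is easier to use safely with error handling and a read-back. The following sketch assumes Linux; the wrapper names `get_congestion_control` and `set_congestion_control` are hypothetical. Switching can fail when the requested algorithm is not loaded or not permitted for the caller, so the return value should be checked.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Read the congestion control algorithm attached to a TCP socket.
 * `name` receives a NUL-terminated string. Returns 0, or -1 on error. */
int get_congestion_control(int fd, char *name, size_t size)
{
    assert(name != NULL && size > 1);
    socklen_t len = (socklen_t)(size - 1);   /* keep room for the NUL */
    memset(name, 0, size);
    return getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len) < 0 ? -1 : 0;
}

/* Try to switch one socket to `algo` (e.g. "bbr"). Returns -1 when the
 * kernel rejects it, for instance because the module is not available. */
int set_congestion_control(int fd, const char *algo)
{
    assert(algo != NULL);
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo,
                      (socklen_t)strlen(algo)) < 0 ? -1 : 0;
}
```

A reasonable pattern is to attempt the switch, fall back silently on failure, and log whatever `get_congestion_control` reports afterwards.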

Netfilter and iptables

Netfilter Hooks

┌─────────────────────────────────────────────────────────────────────┐
│                    NETFILTER HOOK POINTS                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                         ┌───────────────┐                           │
│                         │  PREROUTING   │                           │
│   Incoming              │   (DNAT)      │                           │
│   Packet ─────────────▶│               │                           │
│                         └───────┬───────┘                           │
│                                 │                                    │
│                         ┌───────▼───────┐                           │
│                         │   Routing     │                           │
│                         │   Decision    │                           │
│                         └───┬───────┬───┘                           │
│                 Local?      │       │      Forward?                 │
│              ┌──────────────┘       └──────────────┐               │
│              │                                      │               │
│      ┌───────▼───────┐                      ┌───────▼───────┐      │
│      │    INPUT      │                      │   FORWARD     │      │
│      │ (filter local)│                      │ (filter fwd)  │      │
│      └───────┬───────┘                      └───────┬───────┘      │
│              │                                      │               │
│      ┌───────▼───────┐                              │               │
│      │ Local Process │                              │               │
│      └───────┬───────┘                              │               │
│              │                                      │               │
│      ┌───────▼───────┐                              │               │
│      │   OUTPUT      │                              │               │
│      │ (filter out)  │                              │               │
│      └───────┬───────┘                              │               │
│              │                                      │               │
│              └──────────────┬──────────────────────┘               │
│                             │                                       │
│                     ┌───────▼───────┐                              │
│                     │  POSTROUTING  │                              │
│                     │    (SNAT)     │                              │
│                     └───────┬───────┘                              │
│                             │                                       │
│                             ▼                                       │
│                      Outgoing Packet                                │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
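
The hook order in the diagram reduces to a small decision function. This is an illustrative model only: the enum names are hypothetical shorthand (the kernel's own constants are `NF_INET_PRE_ROUTING` and so on), and `nf_hook_path` is not a kernel API.

```c
#include <assert.h>
#include <stddef.h>

enum nf_hook { PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING };
enum pkt_origin { FROM_WIRE, FROM_LOCAL_PROCESS };

/* Fill `path` with the hooks a packet traverses, in order.
 * `for_local_host` only matters for packets arriving from the wire.
 * Returns the number of hooks written (at most 3). */
size_t nf_hook_path(enum pkt_origin origin, int for_local_host,
                    enum nf_hook path[3])
{
    assert(path != NULL);
    size_t n = 0;
    if (origin == FROM_LOCAL_PROCESS) {
        path[n++] = OUTPUT;
        path[n++] = POSTROUTING;         /* SNAT happens here */
    } else {
        path[n++] = PREROUTING;          /* DNAT happens here */
        if (for_local_host) {
            path[n++] = INPUT;           /* delivered to a local socket */
        } else {
            path[n++] = FORWARD;
            path[n++] = POSTROUTING;     /* SNAT happens here */
        }
    }
    return n;
}
```

Every packet crosses POSTROUTING at most once: forwarded and locally generated traffic merge before it, and locally delivered traffic never reaches it.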

Connection Tracking

# View connection tracking table
conntrack -L

# Example output:
# tcp  6 300 ESTABLISHED src=10.0.0.1 dst=10.0.0.2 sport=12345 dport=80 ...

# Connection tracking stats
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Tune for high connection counts
sysctl -w net.netfilter.nf_conntrack_max=1000000
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600

XDP (eXpress Data Path)

XDP Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         XDP ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Packet arrives at NIC                                               │
│         │                                                            │
│         ▼                                                            │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                     XDP Hook Point                               ││
│  │                 (Before sk_buff creation)                        ││
│  │                                                                  ││
│  │  BPF Program executes:                                          ││
│  │  ┌────────────────────────────────────────────────────────────┐ ││
│  │  │ SEC("xdp")                                                 │ ││
│  │  │ int xdp_prog(struct xdp_md *ctx) {                        │ ││
│  │  │     void *data = (void*)(long)ctx->data;                  │ ││
│  │  │     void *data_end = (void*)(long)ctx->data_end;          │ ││
│  │  │     struct ethhdr *eth = data;                            │ ││
│  │  │                                                            │ ││
│  │  │     if (eth + 1 > data_end)                               │ ││
│  │  │         return XDP_DROP;                                   │ ││
│  │  │                                                            │ ││
│  │  │     // Process packet...                                   │ ││
│  │  │     return XDP_PASS;                                       │ ││
│  │  │ }                                                          │ ││
│  │  └────────────────────────────────────────────────────────────┘ ││
│  │                                                                  ││
│  │  Return values:                                                  ││
│  │  ├─ XDP_DROP:    Drop packet (no further processing)           ││
│  │  ├─ XDP_PASS:    Continue to network stack                     ││
│  │  ├─ XDP_TX:      Transmit back out same interface             ││
│  │  ├─ XDP_REDIRECT: Send to different interface/CPU              ││
│  │  └─ XDP_ABORTED: Error, drop with trace                        ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│                              ▼                                       │
│                      Normal network stack                            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

XDP Modes

Mode    | Description      | Performance | Requirements
--------|------------------|-------------|-------------------
Native  | Driver support   | Best        | NIC driver support
Generic | In network stack | Good        | Any NIC
Offload | On NIC hardware  | Extreme     | SmartNIC

XDP Use Cases

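// DDoS mitigation - drop SYN floods
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define SYN_LIMIT 100   // Illustrative threshold, not a recommendation

// Per-source SYN counters. Counts are cumulative in this sketch; a real
// rate limiter also needs a time window (e.g. user space aging the map).
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);
    __type(value, __u64);
} syn_count SEC(".maps");

SEC("xdp")
int xdp_ddos(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Every header must be bounds-checked before it is read
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_DROP;

    if (iph->protocol != IPPROTO_TCP)
        return XDP_PASS;

    // ihl counts 32-bit words; anything below 5 is malformed
    if (iph->ihl < 5)
        return XDP_DROP;

    struct tcphdr *tcp = (void *)iph + (iph->ihl * 4);
    if ((void *)(tcp + 1) > data_end)
        return XDP_DROP;

    // Check SYN flag
    if (tcp->syn && !tcp->ack) {
        // Check against allowlist or rate limit
        __u32 src_ip = iph->saddr;
        __u64 one = 1;
        __u64 *count = bpf_map_lookup_elem(&syn_count, &src_ip);
        if (!count) {
            // First SYN from this source: start counting
            bpf_map_update_elem(&syn_count, &src_ip, &one, BPF_NOEXIST);
        } else {
            __sync_fetch_and_add(count, 1);
            if (*count > SYN_LIMIT)
                return XDP_DROP;
        }
    }

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";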
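
The parsing logic of an XDP program like the one above can be exercised in user space before it is loaded into the kernel. The sketch below is a plain C stand-in, not BPF code: `classify_frame` and `enum verdict` are hypothetical names, and headers are read by byte offset so that no kernel headers are needed. It follows the same rule the verifier enforces: prove a byte is inside the packet before reading it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum verdict { V_PASS, V_DROP, V_SYN };

/* Classify a raw Ethernet frame. V_SYN marks a bare TCP SYN
 * (SYN set, ACK clear), the case a SYN-flood filter would count. */
enum verdict classify_frame(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);
    const size_t eth_len = 14;

    if (len < eth_len)
        return V_DROP;
    if (pkt[12] != 0x08 || pkt[13] != 0x00)      /* EtherType is not IPv4 */
        return V_PASS;

    if (len < eth_len + 20)                      /* minimum IPv4 header */
        return V_DROP;
    const uint8_t *ip = pkt + eth_len;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;     /* header length in bytes */
    if (ihl < 20)
        return V_DROP;
    if (ip[9] != 6)                              /* protocol is not TCP */
        return V_PASS;

    if (len < eth_len + ihl + 20)                /* minimum TCP header */
        return V_DROP;
    uint8_t flags = ip[ihl + 13];                /* TCP flags byte */
    int syn = flags & 0x02;
    int ack = flags & 0x10;
    return (syn && !ack) ? V_SYN : V_PASS;
}
```

Feeding it hand-built frames (a bare SYN, a SYN+ACK, a truncated packet, an ARP frame) gives a quick regression test for the bounds checks, which are the part most often rejected by the BPF verifier.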

Network Performance Tuning

RSS and RPS

# RSS: Receive Side Scaling (hardware-based)
# Distributes packets across multiple CPU queues
ethtool -l eth0          # Show queue count
ethtool -L eth0 combined 8  # Set queue count

# Check RSS hash
ethtool -x eth0

# RPS: Receive Packet Steering (software-based)
# Distributes to CPUs when hardware RSS unavailable
echo "ff" > /sys/class/net/eth0/queues/rx-0/rps_cpus
# "ff" = CPUs 0-7

# RFS: Receive Flow Steering
# Steers packets to CPU where application is running
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
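
The value written to `rps_cpus` is a hex bitmask in which bit N enables CPU N. A small sketch of how such a mask is built; `rps_mask` is a hypothetical helper, and it is limited to CPUs 0-31 because larger masks are written as comma-separated 32-bit groups.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Build the hex bitmask string for the inclusive CPU range [first, last].
 * Handles CPUs 0-31 only. Returns the snprintf() result. */
int rps_mask(unsigned first, unsigned last, char *out, size_t size)
{
    assert(out != NULL && first <= last && last < 32);
    uint32_t mask = 0;
    for (unsigned cpu = first; cpu <= last; cpu++)
        mask |= (uint32_t)1 << cpu;
    return snprintf(out, size, "%x", (unsigned)mask);
}
```

`rps_mask(0, 7, ...)` produces `ff`, matching the example above, and `rps_mask(4, 7, ...)` produces `f0`, which keeps packet processing off CPUs 0-3.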

Interrupt Coalescing

# View current settings
ethtool -c eth0

# Tune for latency (less coalescing)
ethtool -C eth0 rx-usecs 10 rx-frames 5

# Tune for throughput (more coalescing)
ethtool -C eth0 rx-usecs 100 rx-frames 64

# Adaptive coalescing
ethtool -C eth0 adaptive-rx on adaptive-tx on

TCP Offloads

# View offload status
ethtool -k eth0

# Key offloads:
# - rx-checksumming: Hardware RX checksum
# - tx-checksumming: Hardware TX checksum
# - tcp-segmentation-offload: TSO
# - generic-receive-offload: GRO
# - large-receive-offload: LRO

# Disable (for debugging)
ethtool -K eth0 gro off

# Enable (for performance)
ethtool -K eth0 tso on gso on gro on

Debugging Network Issues

Packet Drops

# Check interface statistics
ip -s link show eth0
# Watch for: RX dropped, RX overruns

# Detailed drop reasons
ethtool -S eth0 | grep -i drop

# System-wide drops
cat /proc/net/softnet_stat
# Columns (hex, one row per CPU): processed, dropped, time_squeeze, ...

# Socket buffer overflows
ss -nmp | grep -E "rmem|wmem"

# Per-protocol drops
netstat -s | grep -i drop
nstat -az | grep -i drop
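
The `/proc/net/softnet_stat` columns are hexadecimal, which makes them easy to misread by eye. A short parsing sketch for the first three columns; `struct softnet_row` and `parse_softnet_line` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

struct softnet_row {
    unsigned int processed;     /* packets handled by this CPU */
    unsigned int dropped;       /* dropped because the backlog queue was full */
    unsigned int time_squeeze;  /* poll loop ran out of budget or time */
};

/* Parse one line of /proc/net/softnet_stat (one line per CPU).
 * Returns 0 on success, -1 if the first three columns are missing. */
int parse_softnet_line(const char *line, struct softnet_row *row)
{
    assert(line != NULL && row != NULL);
    if (sscanf(line, "%x %x %x",
               &row->processed, &row->dropped, &row->time_squeeze) != 3)
        return -1;
    return 0;
}
```

The counters are cumulative since boot, so the useful signal is the difference between two samples: a `dropped` or `time_squeeze` value that keeps growing on one CPU points at that CPU falling behind.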

Tracing Tools

# Trace TCP state changes
sudo tcpstates-bpfcc

# Trace TCP retransmits
sudo tcpretrans-bpfcc

# Trace TCP connections
sudo tcpconnect-bpfcc   # Outgoing
sudo tcpaccept-bpfcc    # Incoming

# Trace packet drops
sudo dropwatch -l kas

# Trace specific flow
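sudo bpftrace -e '
kprobe:tcp_rcv_established {
    $sk = (struct sock *)arg0;
    // sk_dport is a macro for __sk_common.skc_dport (network byte order)
    $dport = $sk->__sk_common.skc_dport;
    // Swap bytes to host order, then match remote port 8080
    if ((($dport >> 8) | (($dport << 8) & 0xff00)) == 8080) {
        @recv = count();
    }
}
'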

Interview Deep Dives

Complete flow of an HTTPS request:
  1. DNS Resolution:
    • Application calls getaddrinfo()
    • Check /etc/hosts, then resolver
    • DNS query → response with IP
  2. Socket Creation:
    • socket(AF_INET, SOCK_STREAM, 0)
    • Allocate struct socket, struct sock
  3. TCP Handshake:
    • connect() triggers SYN
    • Build sk_buff, add TCP/IP headers
    • Route lookup, ARP if needed
    • Queue to driver, DMA to NIC
    • Wait for SYN+ACK
    • Send ACK, enter ESTABLISHED
  4. TLS Handshake:
    • ClientHello (ciphersuites, SNI)
    • ServerHello + Certificate
    • Key exchange (ECDHE)
    • Finished messages
  5. HTTP Request:
    • Send HTTP/1.1 or HTTP/2 request
    • TCP handles segmentation, retransmit
    • Data copied to send buffer, then NIC
  6. Response Path:
    • NIC receives, DMA to memory
    • NAPI poll, allocate sk_buff
    • IP/TCP processing
    • Copy to socket receive buffer
    • Application read()
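
The phases above can be observed from user space with curl's `--write-out` timers, which are cumulative. The sketch below turns them into per-phase durations. `phases` is a hypothetical helper name, and example.com in the usage line is a placeholder.

```shell
# phases: read one line "namelookup connect appconnect starttransfer total"
# (cumulative seconds, as printed by curl --write-out) and print the time
# spent in each phase in milliseconds.
phases() {
    awk '{
        printf "dns=%.1fms tcp=%.1fms tls=%.1fms wait=%.1fms body=%.1fms\n",
            $1 * 1000, ($2 - $1) * 1000, ($3 - $2) * 1000,
            ($4 - $3) * 1000, ($5 - $4) * 1000
    }'
}

# Usage:
#   curl -so /dev/null \
#     -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n' \
#     https://example.com/ | phases
```

Here `tcp` roughly covers the three-way handshake, `tls` the TLS handshake, and `wait` the time between sending the request and receiving the first response byte.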
Systematic approach:
  1. Measure where latency occurs:
    # Application to kernel
    strace -T -e network ./app
    
    # Kernel processing
    sudo perf trace -e 'net:*' ./app
    
    # Network path
    mtr target.com
    
  2. Check for retransmits:
    ss -ti dst target.com:443
    # Look for: retrans, rto
    
    sudo tcpretrans-bpfcc
    
  3. Check for buffer issues:
    ss -tmi dst target.com
    # -m shows skmem (r = bytes queued, rb = receive buffer limit); -i shows rcv_space
    
    cat /proc/net/sockstat
    # Check memory pressure
    
  4. Check for CPU issues:
    cat /proc/net/softnet_stat
    # Column 3 (time_squeeze, hex) growing = softirq budget exhausted on that CPU
    
    mpstat -P ALL 1
    # Check for high %soft on individual CPUs
    
  5. Check network path:
    tcpdump -i eth0 host target.com -w capture.pcap
    # Analyze in Wireshark for delays
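
For step 2, the retransmit counters can be pulled out of `ss -ti` output with a small filter. This is a sketch under assumptions: `ss_retrans` is a hypothetical helper name, it relies on the default column layout (peer address in the fifth column), and on `ss` printing `retrans:<in flight>/<total>` only for sockets that have retransmitted.

```shell
# ss_retrans: read `ss -ti` output on stdin and print the peer address and
# total retransmit count for every socket that has retransmitted.
ss_retrans() {
    awk '
        /^[A-Za-z]/ { peer = $5; next }         # socket line: remember the peer
        match($0, /retrans:[0-9]+\/[0-9]+/) {   # info line with a retrans field
            split(substr($0, RSTART + 8, RLENGTH - 8), n, "/")
            print peer, "retrans_total=" n[2]
        }
    '
}

# Usage:
#   ss -ti | ss_retrans
```

Sockets without a `retrans:` field are omitted, so an empty result means no connection in the snapshot has retransmitted.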
    
Key mechanisms:
  1. Sequence numbers: Every byte numbered, receiver knows what’s missing
  2. Acknowledgments: Receiver ACKs what it received
  3. Retransmission:
    • Timeout-based (RTO expires)
    • Fast retransmit (3 duplicate ACKs)
  4. Flow control:
    • Receive window advertised
    • Sender won’t exceed receiver buffer
  5. Congestion control:
    • AIMD (Additive Increase, Multiplicative Decrease)
    • Slow start, congestion avoidance
    • Modern: BBR, CUBIC
  6. Checksums: Detect corruption
  7. Connection state: Three-way handshake, four-way close
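
The interaction between slow start, additive increase, and multiplicative decrease is easier to see with numbers. The following is a toy Reno-style model, not what the kernel implements: `aimd_trace` is a hypothetical helper, the window is counted in whole segments, and each `ack` event stands for one loss-free round trip.

```shell
# aimd_trace ssthresh event...: print cwnd (in segments) after each event.
# "ack"  = one loss-free round trip
# "loss" = loss detected via duplicate ACKs
aimd_trace() {
    ssthresh=$1
    shift
    cwnd=1
    trace=""
    for event in "$@"; do
        case $event in
            ack)
                if [ "$cwnd" -lt "$ssthresh" ]; then
                    cwnd=$((cwnd * 2))            # slow start: double per RTT
                    if [ "$cwnd" -gt "$ssthresh" ]; then cwnd=$ssthresh; fi
                else
                    cwnd=$((cwnd + 1))            # congestion avoidance: +1 per RTT
                fi
                ;;
            loss)
                ssthresh=$((cwnd / 2))            # multiplicative decrease
                if [ "$ssthresh" -lt 2 ]; then ssthresh=2; fi
                cwnd=$ssthresh
                ;;
        esac
        trace="${trace:+$trace }$cwnd"
    done
    echo "$trace"
}

aimd_trace 16 ack ack ack ack ack loss ack
```

With an initial threshold of 16 this prints `2 4 8 16 17 8 9`: exponential growth up to the threshold, one step of linear growth, a halving on loss, then linear growth again.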

Quick Reference

# Socket statistics
ss -s              # Summary
ss -tan            # TCP sockets
ss -uan            # UDP sockets
ss -ltn            # Listening TCP

# Interface statistics
ip -s link         # Interface stats
ethtool -S eth0    # Detailed stats

# TCP tuning files
/proc/sys/net/ipv4/tcp_*
/proc/sys/net/core/*

# Tracing
tcpdump -i any -nn port 80
sudo tcpretrans-bpfcc
sudo tcpconnect-bpfcc

Interview Deep-Dive

Strong Answer:
  • Packet drops at 40% average CPU can happen because network processing is concentrated on specific CPUs rather than evenly distributed. The key diagnostic is cat /proc/net/softnet_stat, where each line represents a CPU and the values are hexadecimal. Column 1 is processed packets, column 2 is dropped packets (due to a full backlog), and column 3 is time_squeeze events (the softirq ran out of budget or time with packets still pending). If drops are concentrated on one or two CPUs, the likely cause is RSS (Receive Side Scaling) or IRQ affinity misconfiguration: packets are not being distributed across cores.
  • At the NIC level: ethtool -S eth0 | grep -i drop shows hardware-level drops. rx_missed_errors means the NIC’s ring buffer overflowed because the driver did not process packets fast enough. Fix: increase ring buffer size with ethtool -G eth0 rx 4096 or enable interrupt coalescing to batch processing.
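  • At the socket level: ss -nmi shows per-socket buffer utilization. The -m flag prints skmem:(r...,rb...,...,d...) and the -i flag prints rcv_space. If r (rmem_alloc) approaches rb (the receive buffer limit), or d (drops) is non-zero, the socket buffer is filling up because the application is not reading fast enough. Fix: speed up the reader, or increase net.core.rmem_max and net.ipv4.tcp_rmem.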
  • At the conntrack level: cat /proc/sys/net/netfilter/nf_conntrack_count vs nf_conntrack_max. If the connection tracking table is full, new connections are dropped with nf_conntrack: table full, dropping packet in dmesg. Fix: increase nf_conntrack_max or reduce timeout values.
  • At the application level: if the server uses accept() in a single thread, the listen backlog (somaxconn) could overflow. ss -tlnp shows the Send-Q (backlog limit) and Recv-Q (current backlog). Fix: increase net.core.somaxconn and use SO_REUSEPORT for multi-threaded accept.
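
The conntrack check above is a simple ratio, sketched here as a helper that can be dropped into a monitoring script. `conntrack_usage` is a hypothetical name; the default paths are the ones named in the answer and exist only when the conntrack module is loaded.

```shell
# conntrack_usage [count_file] [max_file]: print connection tracking table
# utilization as a percentage, followed by the raw values.
conntrack_usage() {
    count=$(cat "${1:-/proc/sys/net/netfilter/nf_conntrack_count}")
    max=$(cat "${2:-/proc/sys/net/netfilter/nf_conntrack_max}")
    echo "$((count * 100 / max))% ($count/$max)"
}

# Usage:
#   conntrack_usage
```

Alerting well before 100% leaves time to raise nf_conntrack_max or shorten timeouts before new connections start being dropped.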
Follow-up: How does NAPI prevent interrupt storms and what is the trade-off?
Follow-up Answer:
  • Without NAPI, the NIC raises one hardware interrupt per received packet. At 1 million packets per second, that is 1 million interrupts per second; at 1-2 microseconds each, interrupt handling alone consumes one to two full CPU cores. NAPI (New API) switches to a polling model: the first packet triggers an interrupt, the driver disables further interrupts for that queue and schedules a NAPI poll in softirq context. The poll function processes up to a budget (default 64) of packets per invocation without taking further interrupts. When the poll finds no more packets, it re-enables interrupts. The trade-off is latency versus throughput: under low load, NAPI adds a small delay because the first packet triggers an interrupt but subsequent ones wait for the poll cycle. Under high load, NAPI amortizes interrupt overhead across many packets. Busy polling (net.core.busy_poll and net.core.busy_read) can reduce latency further: a blocking read or poll/select call spins on the NIC queue in kernel context on behalf of the application instead of sleeping, which avoids the interrupt and softirq scheduling delay at the cost of burning CPU.
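
The interrupt arithmetic in this answer can be checked directly. The sketch below is a simplification that assumes a fixed cost per interrupt and exactly one interrupt per batch of packets. `irq_cpu_pct` is a hypothetical helper name.

```shell
# irq_cpu_pct pps cost_ns batch: percentage of one core spent on interrupt
# handling if one interrupt is taken per `batch` packets.
irq_cpu_pct() {
    awk -v pps="$1" -v cost="$2" -v batch="$3" \
        'BEGIN { printf "%.1f\n", pps / batch * cost / 1e9 * 100 }'
}

irq_cpu_pct 1000000 1000 1    # one interrupt per packet, 1 us each
irq_cpu_pct 1000000 1000 64   # one interrupt per full NAPI budget of 64
```

Under these assumptions the first call prints `100.0` (a whole core) and the second prints `1.6`, which is the amortization NAPI provides under sustained load.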
Strong Answer:
  • CUBIC is a loss-based congestion control algorithm: it increases the congestion window following a cubic function and reduces it when packet loss is detected. The assumption is that packet loss signals congestion. This works well when the only cause of loss is buffer overflow, but on modern networks with deep buffers (bufferbloat), CUBIC fills the buffers before detecting loss, causing high latency. On lossy links (wireless, long-haul), CUBIC misinterprets non-congestion loss as congestion and unnecessarily reduces throughput.
  • BBR (Bottleneck Bandwidth and RTT) is a model-based algorithm that estimates the bottleneck bandwidth and minimum RTT, paces sending at the estimated bandwidth, and caps the data in flight at a small multiple of the bandwidth-delay product (bandwidth * RTT). BBR v1 does not treat loss as a congestion signal; it maintains a model of the path’s capacity instead. It periodically probes for more bandwidth (ProbeBW phase) and re-measures the minimum RTT (ProbeRTT phase).
  • I would deploy BBR for: long-distance paths with high bandwidth-delay product (CDN to end users), paths with non-congestion loss (mobile networks), and any scenario where bufferbloat causes high latency under load. Google reported that BBR improved YouTube network throughput by about 4% on average globally, and by more than 14% in some countries.
  • Risks: BBR v1 has fairness issues when competing with CUBIC flows: it can be overly aggressive and starve CUBIC connections. BBR v2 (still evolving) addresses this by also reacting to loss and ECN signals. BBR also requires accurate RTT measurement, so it may behave poorly behind TCP proxies that terminate connections. In shared hosting environments, deploying BBR on some servers but not others can create unfairness. I would test with A/B deployment and monitor both throughput and fairness metrics before fleet-wide rollout.
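
The bandwidth-delay product that BBR models is also the amount of data that must be in flight to fill the path, so socket buffers need to be at least that large. A quick sketch of the arithmetic follows. `bdp_bytes` is a hypothetical helper name, and the 1 Gbit/s at 50 ms figures are illustrative.

```shell
# bdp_bytes mbps rtt_ms: bandwidth-delay product in bytes.
# mbps * 1e6 bit/s * (rtt_ms / 1000) s / 8 bit/byte = mbps * rtt_ms * 125
bdp_bytes() {
    echo "$(( $1 * $2 * 125 ))"
}

bdp_bytes 1000 50
```

This prints `6250000`, about 6 MB. If the maximum in net.ipv4.tcp_rmem or net.ipv4.tcp_wmem is smaller than that, no congestion control algorithm can fill such a path.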
Follow-up: How does the kernel implement per-socket congestion control selection?
Follow-up Answer:
  • The kernel allows different TCP connections to use different congestion control algorithms. The default is set via net.ipv4.tcp_congestion_control, but individual sockets can override it with setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3). Unprivileged processes can only choose algorithms listed in net.ipv4.tcp_allowed_congestion_control. Internally, each connection’s inet_connection_sock (embedded in tcp_sock) has an icsk_ca_ops pointer to a tcp_congestion_ops structure containing function pointers for cong_avoid(), ssthresh(), undo_cwnd(), etc. When the TCP state machine processes ACKs or detects loss, it calls through these function pointers. Congestion control implementations register via tcp_register_congestion_control() and are typically built as kernel modules (tcp_bbr.ko). This pluggable design allows A/B testing at the application level.
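
Before an application selects an algorithm per socket, it is worth checking that the kernel has it registered. The sketch below reads the space-separated list the kernel exposes. `cc_available` is a hypothetical helper name.

```shell
# cc_available name [file]: succeed if the algorithm appears in the
# space-separated tcp_available_congestion_control list.
cc_available() {
    for algo in $(cat "${2:-/proc/sys/net/ipv4/tcp_available_congestion_control}"); do
        if [ "$algo" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# Usage:
#   cc_available bbr || sudo modprobe tcp_bbr
#   ss -ti    # the info line of each socket starts with the algorithm in use
```

The same function works against /proc/sys/net/ipv4/tcp_allowed_congestion_control (passed as the second argument) to check what unprivileged processes may select.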
Strong Answer:
  • The key insight is that for same-host pod-to-pod traffic, packets currently traverse: application -> socket layer -> TCP -> IP -> netfilter -> bridge -> veth -> IP -> TCP -> socket layer -> application. Most of this processing is wasted because the source and destination are on the same machine.
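  • Using eBPF at the sockops and sk_msg program types, I can short-circuit this. First, a BPF_PROG_TYPE_SOCK_OPS program attached to the cgroup receives socket events. When a connection becomes established (the BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB and BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB callbacks) and the peer IP belongs to a local pod (checked via a BPF hash map of local pod IPs), the program adds the socket to a BPF_MAP_TYPE_SOCKHASH map with bpf_sock_hash_update(). The TCP handshake itself still goes through the normal stack; only subsequent data is short-circuited.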
  • Then, a BPF_PROG_TYPE_SK_MSG program is attached to the sockhash map. When data is written to a socket in the map, the sk_msg program redirects the data directly from the source socket’s send buffer to the destination socket’s receive buffer using bpf_msg_redirect_hash(). The data never enters the TCP/IP stack, never gets encapsulated in IP headers, never traverses netfilter, and never crosses a veth pair.
  • Cilium has used this sockmap technique to accelerate same-host traffic. The performance improvement can be dramatic, on the order of latency dropping from 50-100 microseconds (full stack traversal) to 5-10 microseconds (socket-to-socket redirect) and CPU overhead dropping by 60-80%, though the exact numbers depend on the workload and kernel version.
Follow-up: What are the observability implications of this short-circuiting, and how do you maintain visibility into the traffic?
Follow-up Answer:
  • The major implication is that traditional network observability tools (tcpdump, iptables counters, conntrack, kprobes on TCP functions) will not see this traffic because it never enters the network stack. This is a significant operational concern. To maintain visibility, the eBPF programs themselves must implement observability: the sk_msg program can update per-connection byte counters in BPF maps, emit events to a ring buffer for detailed tracing, and maintain latency histograms. Cilium addresses this with its Hubble observability layer, which consumes flow events emitted by the eBPF datapath and provides a flow log comparable to what tcpdump would show. The key design principle is that when you bypass the kernel’s built-in observability, you must replace it with eBPF-based equivalents.
