Linux Networking Stack - Packet journey through the kernel

Network Stack

The Linux network stack is one of the most performance-critical subsystems. Understanding it deeply is essential for infrastructure roles at companies like Cloudflare, AWS, and Meta.
Prerequisites: System calls, basic networking concepts
Interview Focus: Packet path, socket buffers, TCP tuning, XDP
Time to Master: 5-6 hours

Network Stack Architecture

Why Layers?

The Linux network stack is organized into layers. But why? The problem: Networking is complex. A single monolithic “send packet” function would need to handle:
  • Application data formatting
  • Connection management (TCP state machine)
  • Routing decisions
  • Hardware-specific transmission
The solution: Separation of concerns. Each layer handles one responsibility:
  • Application layer: What data to send
  • Transport layer (TCP/UDP): How to deliver it reliably
  • Network layer (IP): Where to send it
  • Link layer: Which physical interface
  • Driver: Hardware-specific details
Benefits:
  • Modularity: Can swap TCP for UDP without changing the application (see the sketch below)
  • Reusability: Same IP layer works for TCP, UDP, ICMP
  • Testability: Can test each layer independently
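
To make the modularity point concrete, here is a minimal user-space sketch (the loopback address and port 9000 are arbitrary examples): the same socket API is used for both transports, and only the type argument decides whether the kernel wires the socket to tcp_prot or udp_prot.

/* Sketch: the same socket API selects different transport layers. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(9000) };
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

    /* SOCK_STREAM -> TCP (tcp_prot), SOCK_DGRAM -> UDP (udp_prot) */
    int tcp_fd = socket(AF_INET, SOCK_STREAM, 0);
    int udp_fd = socket(AF_INET, SOCK_DGRAM, 0);

    /* Same call, different transport behaviour underneath:
     * connect() on TCP starts the three-way handshake,
     * on UDP it merely records the default destination. */
    if (connect(tcp_fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
        perror("tcp connect");
    if (connect(udp_fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
        perror("udp connect");

    close(tcp_fd);
    close(udp_fd);
    return 0;
}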
┌─────────────────────────────────────────────────────────────────────┐
│                    LINUX NETWORK STACK                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  User Space                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Application                                                     ││
│  │  socket(), bind(), listen(), accept(), read(), write()          ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ════════════════════════════│═══════════════════════════════════   │
│              System Call Interface                                   │
│  ════════════════════════════│═══════════════════════════════════   │
│                              ▼                                       │
│  Kernel Space                                                        │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                      Socket Layer                                ││
│  │           struct socket, struct sock, sk_buff queues            ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Transport Layer (L4)                           ││
│  │              TCP (tcp_prot), UDP (udp_prot)                     ││
│  │        Congestion control, retransmission, flow control         ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Network Layer (L3)                             ││
│  │              IP (routing, fragmentation)                        ││
│  │              Netfilter/iptables hooks                           ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Link Layer (L2)                                ││
│  │              ARP, bridging, VLAN                                ││
│  │              tc (traffic control)                               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Device Driver                                  ││
│  │              NAPI, ring buffers, DMA                            ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Hardware (NIC)                                 ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Packet Receive Path

Linux Packet Flow
┌─────────────────────────────────────────────────────────────────────┐
│                    PACKET RECEIVE PATH                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. HARDWARE LAYER                                                   │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  NIC receives packet → DMA to ring buffer → Raise IRQ           ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  2. DRIVER LAYER                                                     │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  IRQ Handler (hard IRQ context)                                 ││
│  │  └─ Acknowledge IRQ                                             ││
│  │  └─ Schedule NAPI poll (disable IRQs for this queue)           ││
│  │                                                                  ││
│  │  NAPI Poll (soft IRQ context)                                   ││
│  │  └─ Allocate sk_buff                                            ││
│  │  └─ DMA sync / copy packet data                                 ││
│  │  └─ Set skb metadata (protocol, len, dev)                       ││
│  │  └─ Call napi_gro_receive() or netif_receive_skb()              ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  3. XDP (eXpress Data Path) - Optional                              │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Before sk_buff allocation!                                     ││
│  │  └─ XDP_DROP: Drop immediately                                  ││
│  │  └─ XDP_PASS: Continue to stack                                 ││
│  │  └─ XDP_TX: Transmit back out same interface                   ││
│  │  └─ XDP_REDIRECT: Send to another interface/CPU                ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  4. GRO (Generic Receive Offload)                                   │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Coalesce multiple packets into one large sk_buff               ││
│  │  Reduces per-packet overhead in stack                           ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  5. NETFILTER HOOKS                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  PREROUTING → INPUT (or FORWARD) → Local delivery              ││
│  │  └─ iptables/nftables rules                                     ││
│  │  └─ Connection tracking (conntrack)                             ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  6. IP LAYER                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  ip_rcv() → ip_rcv_finish()                                    ││
│  │  └─ Routing decision (local vs forward)                         ││
│  │  └─ ip_local_deliver() for local packets                       ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  7. TRANSPORT LAYER                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  tcp_v4_rcv() or udp_rcv()                                     ││
│  │  └─ Find socket (hash lookup)                                   ││
│  │  └─ TCP: Process state machine, ACK, etc.                      ││
│  │  └─ Enqueue to socket receive buffer                           ││
│  │  └─ Wake waiting process                                        ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  8. APPLICATION                                                      │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  read()/recv() copies data to user space                       ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Socket Structures

Socket Hierarchy

// User-visible socket
struct socket {
    socket_state            state;      // SS_UNCONNECTED, SS_CONNECTED, ...
    short                   type;       // SOCK_STREAM, SOCK_DGRAM, ...
    unsigned long           flags;      
    struct socket_wq        *wq;        // Wait queue
    struct file             *file;      // Associated file
    struct sock             *sk;        // Internal socket
    const struct proto_ops  *ops;       // Protocol operations
};

// Protocol-specific socket
struct sock {
    struct sock_common      __sk_common;
    socket_lock_t           sk_lock;     // Lock
    struct sk_buff_head     sk_receive_queue;   // Receive buffer
    struct sk_buff_head     sk_write_queue;     // Send buffer
    int                     sk_rcvbuf;   // Receive buffer size
    int                     sk_sndbuf;   // Send buffer size
    int                     sk_wmem_queued;     // Send queue memory
    int                     sk_rmem_alloc;      // Receive memory
    // ... many more fields
};

// TCP-specific socket extends sock
struct tcp_sock {
    struct inet_connection_sock inet_conn;
    
    // TCP state
    u32 snd_una;        // First unacked sequence
    u32 snd_nxt;        // Next sequence to send
    u32 rcv_nxt;        // Expected next sequence
    u32 rcv_wnd;        // Receive window
    
    // Congestion control
    u32 snd_cwnd;       // Congestion window
    u32 snd_ssthresh;   // Slow start threshold
    
    // RTT estimation
    u32 srtt_us;        // Smoothed RTT
    u32 rttvar_us;      // RTT variance
    // ... many more
};

sk_buff Structure

Efficient Packet Handling

Every packet in the Linux network stack is represented by an sk_buff (socket buffer). But why is it designed this way? The problem: As a packet moves through layers, each layer adds/removes headers:
  • Application → TCP adds TCP header
  • TCP → IP adds IP header
  • IP → Ethernet adds Ethernet header
Naive approach: copy the data at each layer. A 1500-byte packet crossing 4 layers would mean 6000 bytes of copying! The solution: sk_buff uses headroom and tailroom:
  • Allocate extra space at the beginning (headroom) and end (tailroom)
  • Adding a header? Just move the data pointer backward into headroom
  • Removing a header? Move the data pointer forward
  • No copying needed!
Real-world impact: Zero-copy packet processing. A 10Gbps NIC can handle millions of packets/second because headers are added/removed by pointer manipulation, not memory copies.
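
The following toy model (not kernel code; every name in it is illustrative) shows the pointer arithmetic behind that claim: reserving headroom up front lets each layer prepend its header by moving the data pointer backward, never moving the payload.

#include <assert.h>
#include <string.h>

/* Toy model of an sk_buff's pointer layout (illustrative only). */
struct toy_skb {
    unsigned char buf[2048];
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

static void toy_init(struct toy_skb *skb, unsigned int headroom)
{
    skb->head = skb->buf;
    skb->end  = skb->buf + sizeof(skb->buf);
    skb->data = skb->tail = skb->head + headroom;   /* like skb_reserve() */
    skb->len  = 0;
}

/* Like skb_put(): extend the data area at the tail. */
static unsigned char *toy_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *p = skb->tail;
    assert(skb->tail + n <= skb->end);
    skb->tail += n;
    skb->len  += n;
    return p;
}

/* Like skb_push(): prepend a header by moving data backward into headroom. */
static unsigned char *toy_push(struct toy_skb *skb, unsigned int n)
{
    assert(skb->data - n >= skb->head);
    skb->data -= n;
    skb->len  += n;
    return skb->data;            /* header goes here; the payload never moves */
}

int main(void)
{
    struct toy_skb skb;
    toy_init(&skb, 128);                         /* room for all headers */
    memcpy(toy_put(&skb, 5), "hello", 5);        /* application payload  */
    memset(toy_push(&skb, 20), 0, 20);           /* "TCP header"         */
    memset(toy_push(&skb, 20), 0, 20);           /* "IP header"          */
    return 0;
}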
┌─────────────────────────────────────────────────────────────────────┐
│                        sk_buff STRUCTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  struct sk_buff {                                                    │
│      /* Packet data pointers */                                     │
│      unsigned char *head;     ──────────┐                           │
│      unsigned char *data;     ──────────│──┐                        │
│      unsigned char *tail;     ──────────│──│──┐                     │
│      unsigned char *end;      ──────────│──│──│──┐                  │
│                                         │  │  │  │                   │
│      /* Buffer layout: */               │  │  │  │                   │
│      ┌──────────────────────────────────▼──│──│──│──────────────┐   │
│      │ headroom │◀─────── data ─────────▶│  │  │  │ tailroom    │   │
│      │          │  HEADERS  │  PAYLOAD   │  │  │  │             │   │
│      └──────────┴───────────┴────────────┴──▼──▼──▼─────────────┘   │
│                  │                            │                      │
│                  │◀────────── len ───────────▶│                     │
│                                                                      │
│      /* Important fields */                                         │
│      unsigned int len;           // Data length                     │
│      unsigned int data_len;      // Paged data length               │
│      __u16 protocol;             // L3 protocol (ETH_P_IP)         │
│      __u8 pkt_type;              // PACKET_HOST, PACKET_BROADCAST  │
│      struct net_device *dev;     // Device                          │
│      struct sock *sk;            // Associated socket               │
│                                                                      │
│      /* Protocol headers (union) */                                 │
│      struct {                                                       │
│          struct iphdr *iph;                                         │
│          struct ipv6hdr *ipv6h;                                     │
│      } network_header;                                              │
│      struct {                                                       │
│          struct tcphdr *th;                                         │
│          struct udphdr *uh;                                         │
│      } transport_header;                                            │
│  }                                                                   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

sk_buff Operations

// Allocate sk_buff
struct sk_buff *skb = alloc_skb(size, GFP_KERNEL);

// Reserve headroom for headers
skb_reserve(skb, header_len);

// Add data at tail
unsigned char *data = skb_put(skb, data_len);
memcpy(data, payload, data_len);

// Add header at head
unsigned char *header = skb_push(skb, header_len);

// Remove data from head (processing headers)
skb_pull(skb, header_len);

// Clone (share data, new metadata)
struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);

// Copy (full copy)
struct sk_buff *skb3 = skb_copy(skb, GFP_ATOMIC);

TCP Connection Lifecycle

Why TIME_WAIT Exists

Before we look at the TCP state machine, let’s understand one of its most misunderstood states: TIME_WAIT. The problem: After closing a connection, what if:
  • The final ACK gets lost? The other side will retransmit FIN
  • Delayed packets from the old connection arrive after we reuse the port?
The solution: the TIME_WAIT state waits 2×MSL (twice the Maximum Segment Lifetime; Linux hard-codes this wait to 60 seconds) to:
  1. Ensure clean shutdown: If the final ACK is lost, we can resend it
  2. Prevent packet confusion: Old packets expire before the port is reused
Why it's annoying: A busy server closing 1,000 connections/second accumulates roughly 60,000 sockets in TIME_WAIT! Mitigations:
  • SO_REUSEADDR: Allows binding to a port with lingering TIME_WAIT sockets (see the sketch below)
  • tcp_tw_reuse: Reuse TIME_WAIT sockets for outgoing connections
  • Connection pooling: Keep connections alive instead of closing them
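
A minimal sketch of the SO_REUSEADDR mitigation (port 8080 and the wildcard address are arbitrary examples): without the option, bind() fails with EADDRINUSE while earlier connections on the same port linger in TIME_WAIT.

#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Allow bind() to succeed even if earlier sockets on this
     * port are still in TIME_WAIT. */
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port   = htons(8080),          /* example port */
        .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
    };
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("bind");
    return 0;
}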
┌─────────────────────────────────────────────────────────────────────┐
│                    TCP STATE MACHINE                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Client                              Server                          │
│  ──────                              ──────                          │
│                                                                      │
│  CLOSED                              CLOSED                          │
│     │                                   │                            │
│     │ socket(), bind()                  │ socket(), bind(), listen() │
│     │                                   ▼                            │
│     │                                LISTEN                          │
│     │                                   │                            │
│     │ connect()                         │                            │
│     ├─────── SYN ─────────────────────▶│                            │
│     │                                   │                            │
│  SYN_SENT                            │ (SYN received)              │
│     │                                   │                            │
│     │◀────── SYN+ACK ───────────────────┤                            │
│     │                                   │                            │
│     │         (ACK SYN)              SYN_RCVD                       │
│     ├─────── ACK ──────────────────────▶│                            │
│     │                                   │                            │
│  ESTABLISHED                         ESTABLISHED                    │
│     │                                   │                            │
│     │◀════════ DATA TRANSFER ══════════▶│                            │
│     │                                   │                            │
│     │ close()                           │                            │
│     ├─────── FIN ──────────────────────▶│                            │
│     │                                   │                            │
│  FIN_WAIT_1                          CLOSE_WAIT                     │
│     │                                   │                            │
│     │◀────── ACK ───────────────────────┤                            │
│     │                                   │ close()                   │
│  FIN_WAIT_2                             │                            │
│     │                                   │                            │
│     │◀────── FIN ───────────────────────┤                            │
│     │                                   │                            │
│     ├─────── ACK ──────────────────────▶│                            │
│     │                                   │                            │
│  TIME_WAIT                           LAST_ACK                       │
│     │                                   │                            │
│     │ (2*MSL timeout)                   │                            │
│     ▼                                   ▼                            │
│  CLOSED                              CLOSED                          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Observing TCP States

# View connection states
ss -tan state established
ss -tan state time-wait
ss -tan state syn-recv

# Count by state
ss -tan | awk '{print $1}' | sort | uniq -c

# Detailed socket info
ss -ti dst 10.0.0.1:443

# Socket memory usage
ss -tm | grep -A1 ESTAB

# Watch for SYN floods
watch -n1 'ss -tan state syn-recv | wc -l'

TCP Tuning Parameters

Buffer Sizes

# Socket buffer defaults and limits
cat /proc/sys/net/core/rmem_default    # Default receive buffer
cat /proc/sys/net/core/rmem_max        # Max receive buffer
cat /proc/sys/net/core/wmem_default    # Default send buffer
cat /proc/sys/net/core/wmem_max        # Max send buffer

# TCP-specific auto-tuning
cat /proc/sys/net/ipv4/tcp_rmem
# min  default  max
# 4096 131072   6291456

cat /proc/sys/net/ipv4/tcp_wmem
# 4096 16384    4194304

# Tune for high-bandwidth paths
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
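
These sysctls set system-wide defaults and limits; a process can also size buffers per socket. A hedged sketch follows; note that setting SO_RCVBUF/SO_SNDBUF explicitly disables the kernel's receive-buffer autotuning for that socket, and the kernel roughly doubles the requested value to account for bookkeeping overhead.

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Request a 4 MiB receive buffer; the kernel stores roughly double
     * this (capped by net.core.rmem_max) and turns off receive-buffer
     * autotuning for this socket. */
    int rcv = 4 * 1024 * 1024;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv, sizeof(rcv));

    /* Read back the effective value the kernel chose. */
    socklen_t len = sizeof(rcv);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv, &len);
    printf("effective SO_RCVBUF: %d bytes\n", rcv);
    return 0;
}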

Connection Parameters

# Backlog queue sizes
cat /proc/sys/net/core/somaxconn          # listen() backlog limit
cat /proc/sys/net/ipv4/tcp_max_syn_backlog # SYN queue size

# TIME_WAIT handling
cat /proc/sys/net/ipv4/tcp_tw_reuse       # Reuse TIME_WAIT for outgoing
cat /proc/sys/net/ipv4/tcp_fin_timeout    # FIN_WAIT_2 timeout

# Keepalive
cat /proc/sys/net/ipv4/tcp_keepalive_time  # Idle before keepalive
cat /proc/sys/net/ipv4/tcp_keepalive_intvl # Interval between probes
cat /proc/sys/net/ipv4/tcp_keepalive_probes # Probes before drop
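
The same keepalive knobs exist per socket via setsockopt(). A sketch, with example values overriding the system-wide defaults above:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Probe after 60 s of idle time, every 10 s, and drop the connection
 * after 5 unanswered probes (example values). */
static void enable_keepalive(int fd)
{
    int on = 1, idle = 60, intvl = 10, cnt = 5;

    setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,    sizeof(on));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    enable_keepalive(fd);
    return 0;
}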

Congestion Control

# Available algorithms
cat /proc/sys/net/ipv4/tcp_available_congestion_control
# reno cubic bbr

# Current algorithm
cat /proc/sys/net/ipv4/tcp_congestion_control

# Switch to BBR
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Per-socket selection
setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3);
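
Expanding the setsockopt() one-liner above into a compilable sketch that also reads back which algorithm the kernel actually applied (useful because the request fails if the module is not available):

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Ask for BBR on this socket only. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", strlen("bbr")) < 0)
        perror("setsockopt TCP_CONGESTION");

    /* Read back the algorithm actually in use. */
    char name[16] = {0};
    socklen_t len = sizeof(name);
    getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len);
    printf("congestion control: %s\n", name);
    return 0;
}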

Netfilter and iptables

Netfilter Hooks

┌─────────────────────────────────────────────────────────────────────┐
│                    NETFILTER HOOK POINTS                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                         ┌───────────────┐                           │
│                         │  PREROUTING   │                           │
│   Incoming              │   (DNAT)      │                           │
│   Packet ─────────────▶│               │                           │
│                         └───────┬───────┘                           │
│                                 │                                    │
│                         ┌───────▼───────┐                           │
│                         │   Routing     │                           │
│                         │   Decision    │                           │
│                         └───┬───────┬───┘                           │
│                 Local?      │       │      Forward?                 │
│              ┌──────────────┘       └──────────────┐               │
│              │                                      │               │
│      ┌───────▼───────┐                      ┌───────▼───────┐      │
│      │    INPUT      │                      │   FORWARD     │      │
│      │ (filter local)│                      │ (filter fwd)  │      │
│      └───────┬───────┘                      └───────┬───────┘      │
│              │                                      │               │
│      ┌───────▼───────┐                              │               │
│      │ Local Process │                              │               │
│      └───────┬───────┘                              │               │
│              │                                      │               │
│      ┌───────▼───────┐                              │              │
│      │   OUTPUT      │                              │              │
│      │ (filter out)  │                              │              │
│      └───────┬───────┘                              │              │
│              │                                      │               │
│              └──────────────┬──────────────────────┘               │
│                             │                                       │
│                     ┌───────▼───────┐                              │
│                     │  POSTROUTING  │                              │
│                     │    (SNAT)     │                              │
│                     └───────┬───────┘                              │
│                             │                                       │
│                             ▼                                       │
│                      Outgoing Packet                                │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
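
The same hook points are available to kernel modules. A hedged sketch of registering a PREROUTING hook with nf_register_net_hook() (module boilerplate only; treat it as an outline, not a production module):

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <net/net_namespace.h>

/* Runs for every IPv4 packet at PREROUTING; returning NF_DROP here
 * would discard the packet before the routing decision. */
static unsigned int my_hook(void *priv, struct sk_buff *skb,
                            const struct nf_hook_state *state)
{
    const struct iphdr *iph = ip_hdr(skb);

    if (iph && iph->protocol == IPPROTO_ICMP)
        pr_info_ratelimited("ICMP from %pI4\n", &iph->saddr);

    return NF_ACCEPT;   /* let the packet continue through the stack */
}

static struct nf_hook_ops my_ops = {
    .hook     = my_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};

static int __init my_init(void)
{
    return nf_register_net_hook(&init_net, &my_ops);
}

static void __exit my_exit(void)
{
    nf_unregister_net_hook(&init_net, &my_ops);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");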

Connection Tracking

# View connection tracking table
conntrack -L

# Example output:
# tcp  6 300 ESTABLISHED src=10.0.0.1 dst=10.0.0.2 sport=12345 dport=80 ...

# Connection tracking stats
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Tune for high connection counts
sysctl -w net.netfilter.nf_conntrack_max=1000000
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600

XDP (eXpress Data Path)

XDP Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         XDP ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Packet arrives at NIC                                               │
│         │                                                            │
│         ▼                                                            │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                     XDP Hook Point                               ││
│  │                 (Before sk_buff creation)                        ││
│  │                                                                  ││
│  │  BPF Program executes:                                          ││
│  │  ┌────────────────────────────────────────────────────────────┐ ││
│  │  │ SEC("xdp")                                                 │ ││
│  │  │ int xdp_prog(struct xdp_md *ctx) {                        │ ││
│  │  │     void *data = (void*)(long)ctx->data;                  │ ││
│  │  │     void *data_end = (void*)(long)ctx->data_end;          │ ││
│  │  │     struct ethhdr *eth = data;                            │ ││
│  │  │                                                            │ ││
│  │  │     if (eth + 1 > data_end)                               │ ││
│  │  │         return XDP_DROP;                                   │ ││
│  │  │                                                            │ ││
│  │  │     // Process packet...                                   │ ││
│  │  │     return XDP_PASS;                                       │ ││
│  │  │ }                                                          │ ││
│  │  └────────────────────────────────────────────────────────────┘ ││
│  │                                                                  ││
│  │  Return values:                                                  ││
│  │  ├─ XDP_DROP:    Drop packet (no further processing)           ││
│  │  ├─ XDP_PASS:    Continue to network stack                     ││
│  │  ├─ XDP_TX:      Transmit back out same interface             ││
│  │  ├─ XDP_REDIRECT: Send to different interface/CPU              ││
│  │  └─ XDP_ABORTED: Error, drop with trace                        ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│                              ▼                                       │
│                      Normal network stack                            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

XDP Modes

Mode      Description                              Performance   Requirements
Native    Hook runs in the NIC driver              Best          NIC driver support
Generic   Fallback hook in the core network stack  Good          Any NIC
Offload   Program runs on the NIC itself           Highest       SmartNIC
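
From user space, XDP programs are usually loaded and attached with libbpf. A sketch assuming libbpf >= 1.0, an object file xdp_prog.o built from the SEC("xdp") program shown earlier, and eth0 as the target interface (all of these names are examples):

#include <stdio.h>
#include <net/if.h>
#include <linux/if_link.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

int main(void)
{
    struct bpf_object *obj = bpf_object__open_file("xdp_prog.o", NULL);
    if (!obj || bpf_object__load(obj)) {
        fprintf(stderr, "failed to open/load BPF object\n");
        return 1;
    }

    struct bpf_program *prog = bpf_object__find_program_by_name(obj, "xdp_prog");
    if (!prog)
        return 1;

    int prog_fd = bpf_program__fd(prog);
    int ifindex = if_nametoindex("eth0");

    /* XDP_FLAGS_DRV_MODE requests native mode; use XDP_FLAGS_SKB_MODE
     * to fall back to generic XDP on any NIC. */
    if (bpf_xdp_attach(ifindex, prog_fd, XDP_FLAGS_DRV_MODE, NULL)) {
        fprintf(stderr, "attach failed\n");
        return 1;
    }
    return 0;
}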

XDP Use Cases

// DDoS mitigation - drop SYN floods
// Sketch: syn_count is assumed to be maintained elsewhere (another BPF
// program or user space); SYN_LIMIT is an example threshold.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define SYN_LIMIT 1000

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);     // source IPv4 address
    __type(value, __u64);   // SYNs seen from that address
} syn_count SEC(".maps");

char LICENSE[] SEC("license") = "GPL";

SEC("xdp")
int xdp_ddos(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = data + sizeof(*eth);
    if ((void *)(iph + 1) > data_end)
        return XDP_DROP;

    if (iph->protocol != IPPROTO_TCP)
        return XDP_PASS;

    struct tcphdr *tcp = (void *)iph + iph->ihl * 4;
    if ((void *)(tcp + 1) > data_end)
        return XDP_DROP;

    // Check SYN flag
    if (tcp->syn && !tcp->ack) {
        // Check against allowlist or rate limit (map maintained elsewhere)
        __u32 src_ip = iph->saddr;
        __u64 *count = bpf_map_lookup_elem(&syn_count, &src_ip);
        if (count && *count > SYN_LIMIT)
            return XDP_DROP;
    }

    return XDP_PASS;
}

Network Performance Tuning

RSS and RPS

# RSS: Receive Side Scaling (hardware-based)
# Distributes packets across multiple CPU queues
ethtool -l eth0          # Show queue count
ethtool -L eth0 combined 8  # Set queue count

# Check RSS hash
ethtool -x eth0

# RPS: Receive Packet Steering (software-based)
# Distributes to CPUs when hardware RSS unavailable
echo "ff" > /sys/class/net/eth0/queues/rx-0/rps_cpus
# "ff" = CPUs 0-7

# RFS: Receive Flow Steering
# Steers packets to CPU where application is running
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

Interrupt Coalescing

# View current settings
ethtool -c eth0

# Tune for latency (less coalescing)
ethtool -C eth0 rx-usecs 10 rx-frames 5

# Tune for throughput (more coalescing)
ethtool -C eth0 rx-usecs 100 rx-frames 64

# Adaptive coalescing
ethtool -C eth0 adaptive-rx on adaptive-tx on

TCP Offloads

# View offload status
ethtool -k eth0

# Key offloads:
# - rx-checksumming: Hardware RX checksum
# - tx-checksumming: Hardware TX checksum
# - tcp-segmentation-offload: TSO
# - generic-receive-offload: GRO
# - large-receive-offload: LRO

# Disable (for debugging)
ethtool -K eth0 gro off

# Enable (for performance)
ethtool -K eth0 tso on gso on gro on

Debugging Network Issues

Packet Drops

# Check interface statistics
ip -s link show eth0
# Watch for: RX dropped, RX overruns

# Detailed drop reasons
ethtool -S eth0 | grep -i drop

# System-wide drops
cat /proc/net/softnet_stat
# Columns: processed, dropped, time_squeeze, ...

# Socket buffer overflows (in skmem: rb/tb are buffer limits, d counts drops)
ss -ntmp

# Per-protocol drops
netstat -s | grep -i drop
nstat -az | grep -i drop

Tracing Tools

# Trace TCP state changes
sudo tcpstates-bpfcc

# Trace TCP retransmits
sudo tcpretrans-bpfcc

# Trace TCP connections
sudo tcpconnect-bpfcc   # Outgoing
sudo tcpaccept-bpfcc    # Incoming

# Trace packet drops
sudo dropwatch -l kas

# Trace a specific flow: count packets on connections whose remote (peer)
# port is 8080 (requires kernel BTF/headers for the struct sock cast)
sudo bpftrace -e '
kprobe:tcp_rcv_established
/ ((struct sock *)arg0)->__sk_common.skc_dport == 0x901f /
{
    // skc_dport is stored in network byte order: 8080 == 0x1f90,
    // which reads as 0x901f on a little-endian host
    @recv = count();
}'

Interview Deep Dives

What happens, end to end, when a client fetches an HTTPS URL?

Complete flow (a minimal user-space sketch follows the list):
  1. DNS Resolution:
    • Application calls getaddrinfo()
    • Check /etc/hosts, then resolver
    • DNS query → response with IP
  2. Socket Creation:
    • socket(AF_INET, SOCK_STREAM, 0)
    • Allocate struct socket, struct sock
  3. TCP Handshake:
    • connect() triggers SYN
    • Build sk_buff, add TCP/IP headers
    • Route lookup, ARP if needed
    • Queue to driver, DMA to NIC
    • Wait for SYN+ACK
    • Send ACK, enter ESTABLISHED
  4. TLS Handshake:
    • ClientHello (ciphersuites, SNI)
    • ServerHello + Certificate
    • Key exchange (ECDHE)
    • Finished messages
  5. HTTP Request:
    • Send HTTP/1.1 or HTTP/2 request
    • TCP handles segmentation, retransmit
    • Data copied to send buffer, then NIC
  6. Response Path:
    • NIC receives, DMA to memory
    • NAPI poll, allocate sk_buff
    • IP/TCP processing
    • Copy to socket receive buffer
    • Application read()
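
A user-space view of that flow (TLS, step 4, is omitted; plain HTTP on port 80, with example.com as an example host):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    /* 1. DNS resolution (checks /etc/hosts, then the resolver). */
    struct addrinfo hints = { .ai_family = AF_INET, .ai_socktype = SOCK_STREAM };
    struct addrinfo *res;
    if (getaddrinfo("example.com", "80", &hints, &res) != 0)
        return 1;

    /* 2. Socket creation: kernel allocates struct socket + struct sock. */
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);

    /* 3. TCP handshake: connect() sends the SYN and blocks until the
     *    SYN+ACK arrives and the final ACK is queued. */
    if (connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }

    /* 5. Request: bytes are copied into the socket send buffer and
     *    segmented/retransmitted by TCP as needed. */
    const char *req = "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n";
    write(fd, req, strlen(req));

    /* 6. Response path: read() copies data that the receive path queued
     *    on sk_receive_queue up to user space. */
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("read %zd bytes\n", n);

    freeaddrinfo(res);
    close(fd);
    return 0;
}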
How would you debug high network latency between two hosts?

Systematic approach:
  1. Measure where latency occurs:
    # Application to kernel
    strace -T -e network ./app
    
    # Kernel processing
    sudo perf trace -e 'net:*' ./app
    
    # Network path
    mtr target.com
    
  2. Check for retransmits:
    ss -ti dst target.com:443
    # Look for: retrans, rto
    
    sudo tcpretrans-bpfcc
    
  3. Check for buffer issues:
    ss -tim dst target.com
    # Check rcv_space and the skmem counters (rb, d)
    
    cat /proc/net/sockstat
    # Check memory pressure
    
  4. Check for CPU issues:
    cat /proc/net/softnet_stat
    # Non-zero in column 3 = CPU overload
    
    mpstat -P ALL 1
    # Check for ksoftirqd CPU usage
    
  5. Check network path:
    tcpdump -i eth0 host target.com -w capture.pcap
    # Analyze in Wireshark for delays
    
How does TCP provide reliable delivery on top of an unreliable network?

Key mechanisms (a TCP_INFO sketch follows the list):
  1. Sequence numbers: Every byte numbered, receiver knows what’s missing
  2. Acknowledgments: Receiver ACKs what it received
  3. Retransmission:
    • Timeout-based (RTO expires)
    • Fast retransmit (3 duplicate ACKs)
  4. Flow control:
    • Receive window advertised
    • Sender won’t exceed receiver buffer
  5. Congestion control:
    • AIMD (Additive Increase, Multiplicative Decrease)
    • Slow start, congestion avoidance
    • Modern: BBR, CUBIC
  6. Checksums: Detect corruption
  7. Connection state: Three-way handshake, four-way close
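
Several of these mechanisms can be observed per socket through the TCP_INFO socket option. A sketch that prints a few of the fields mentioned above for an already-connected TCP socket (the caller supplies the fd):

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Print reliability/congestion statistics for a connected TCP socket. */
static void print_tcp_info(int fd)
{
    struct tcp_info ti;
    socklen_t len = sizeof(ti);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
        perror("getsockopt TCP_INFO");
        return;
    }
    printf("srtt: %u us  cwnd: %u segs  ssthresh: %u  total retrans: %u\n",
           ti.tcpi_rtt, ti.tcpi_snd_cwnd, ti.tcpi_snd_ssthresh,
           ti.tcpi_total_retrans);
}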

Quick Reference

# Socket statistics
ss -s              # Summary
ss -tan            # TCP sockets
ss -uan            # UDP sockets
ss -ltn            # Listening TCP

# Interface statistics
ip -s link         # Interface stats
ethtool -S eth0    # Detailed stats

# TCP tuning files
/proc/sys/net/ipv4/tcp_*
/proc/sys/net/core/*

# Tracing
tcpdump -i any -nn port 80
sudo tcpretrans-bpfcc
sudo tcpconnect-bpfcc

Next: Tracing & Profiling →