Linux Networking Stack - Packet journey through the kernel

Network Stack

The Linux network stack is one of the most performance-critical subsystems. Understanding it deeply is essential for infrastructure roles at companies like Cloudflare, AWS, and Meta.

Prerequisites: System calls, basic networking concepts
Interview Focus: Packet path, socket buffers, TCP tuning, XDP
Time to Master: 5-6 hours

Network Stack Architecture

Why Layers?

The Linux network stack is organized into layers. But why? The problem: Networking is complex. A single monolithic “send packet” function would need to handle:

Application data formatting
Connection management (TCP state machine)
Routing decisions
Hardware-specific transmission

The solution: Separation of concerns. Each layer handles one responsibility:

Application layer: What data to send
Transport layer (TCP/UDP): How to deliver it (reliable byte stream with TCP, best-effort datagrams with UDP)
Network layer (IP): Where to send it
Link layer: Which physical interface
Driver: Hardware-specific details

Benefits:

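Modularity: Layers can change independently; for example, the same application code runs unchanged over Ethernet or Wi-Fi, or with a different congestion control algorithm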
Reusability: Same IP layer works for TCP, UDP, ICMP
Testability: Can test each layer independently

┌─────────────────────────────────────────────────────────────────────┐
│                    LINUX NETWORK STACK                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  User Space                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Application                                                     ││
│  │  socket(), bind(), listen(), accept(), read(), write()          ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ════════════════════════════│═══════════════════════════════════   │
│              System Call Interface                                   │
│  ════════════════════════════│═══════════════════════════════════   │
│                              ▼                                       │
│  Kernel Space                                                        │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                      Socket Layer                                ││
│  │           struct socket, struct sock, sk_buff queues            ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Transport Layer (L4)                           ││
│  │              TCP (tcp_prot), UDP (udp_prot)                     ││
│  │        Congestion control, retransmission, flow control         ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Network Layer (L3)                             ││
│  │              IP (routing, fragmentation)                        ││
│  │              Netfilter/iptables hooks                           ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Link Layer (L2)                                ││
│  │              ARP, bridging, VLAN                                ││
│  │              tc (traffic control)                               ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Device Driver                                  ││
│  │              NAPI, ring buffers, DMA                            ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                   Hardware (NIC)                                 ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Packet Receive Path

┌─────────────────────────────────────────────────────────────────────┐
│                    PACKET RECEIVE PATH                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. HARDWARE LAYER                                                   │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  NIC receives packet → DMA to ring buffer → Raise IRQ           ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  2. DRIVER LAYER                                                     │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  IRQ Handler (hard IRQ context)                                 ││
│  │  └─ Acknowledge IRQ                                             ││
│  │  └─ Schedule NAPI poll (disable IRQs for this queue)           ││
│  │                                                                  ││
│  │  NAPI Poll (soft IRQ context)                                   ││
│  │  └─ Allocate sk_buff                                            ││
│  │  └─ DMA sync / copy packet data                                 ││
│  │  └─ Set skb metadata (protocol, len, dev)                       ││
│  │  └─ Call napi_gro_receive() or netif_receive_skb()              ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  3. XDP (optional; runs in driver poll, before sk_buff allocation)  │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Before sk_buff allocation!                                     ││
│  │  └─ XDP_DROP: Drop immediately                                  ││
│  │  └─ XDP_PASS: Continue to stack                                 ││
│  │  └─ XDP_TX: Transmit back out same interface                   ││
│  │  └─ XDP_REDIRECT: Send to another interface/CPU                ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  4. GRO (Generic Receive Offload)                                   │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Coalesce multiple packets into one large sk_buff               ││
│  │  Reduces per-packet overhead in stack                           ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  5. NETFILTER HOOKS (called from inside the IP layer functions)     │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  PREROUTING → INPUT (or FORWARD) → Local delivery              ││
│  │  └─ iptables/nftables rules                                     ││
│  │  └─ Connection tracking (conntrack)                             ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  6. IP LAYER                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  ip_rcv() → ip_rcv_finish()                                    ││
│  │  └─ Routing decision (local vs forward)                         ││
│  │  └─ ip_local_deliver() for local packets                       ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  7. TRANSPORT LAYER                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  tcp_v4_rcv() or udp_rcv()                                     ││
│  │  └─ Find socket (hash lookup)                                   ││
│  │  └─ TCP: Process state machine, ACK, etc.                      ││
│  │  └─ Enqueue to socket receive buffer                           ││
│  │  └─ Wake waiting process                                        ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│  8. APPLICATION                                                      │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  read()/recv() copies data to user space                       ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
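
The hand-off between the hard IRQ handler and the NAPI poll in step 2 can be sketched as a small user-space model. Everything here (`model_rx_queue`, `model_irq_handler`, `model_napi_poll`) is a hypothetical stand-in, not kernel code; it only shows the budget rule: a poll pass handles at most `budget` packets, and interrupts are re-enabled only once the ring has been drained.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one RX queue; not a kernel structure. */
struct model_rx_queue {
    int  pending;      /* packets the NIC has DMA'd into the ring */
    bool irq_enabled;  /* whether the queue may raise an interrupt */
};

/* Hard IRQ handler: do almost nothing, mask the IRQ and defer to polling. */
void model_irq_handler(struct model_rx_queue *q)
{
    q->irq_enabled = false;
}

/* One poll pass in softirq context. Processes at most `budget` packets and
 * returns how many it handled. If the ring was drained before the budget
 * ran out, polling stops and the interrupt is re-enabled. */
int model_napi_poll(struct model_rx_queue *q, int budget)
{
    assert(budget > 0);
    int done = 0;
    while (done < budget && q->pending > 0) {
        q->pending--;   /* stands in for: build skb, hand it to the stack */
        done++;
    }
    if (done < budget)
        q->irq_enabled = true;   /* ring empty: back to interrupt mode */
    return done;
}
```

With 100 packets pending and a budget of 64, the first pass handles 64 and stays in polling mode; the second handles the remaining 36 and re-enables the interrupt. Under sustained load the queue never leaves polling mode, which is how NAPI avoids interrupt storms.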

Socket Structures

Socket Hierarchy

// User-visible socket
struct socket {
    socket_state            state;      // SS_UNCONNECTED, SS_CONNECTED, ...
    short                   type;       // SOCK_STREAM, SOCK_DGRAM, ...
    unsigned long           flags;      
    struct socket_wq        *wq;        // Wait queue
    struct file             *file;      // Associated file
    struct sock             *sk;        // Internal socket
    const struct proto_ops  *ops;       // Protocol operations
};

// Protocol-specific socket
struct sock {
    struct sock_common      __sk_common;
    socket_lock_t           sk_lock;     // Lock
    struct sk_buff_head     sk_receive_queue;   // Receive buffer
    struct sk_buff_head     sk_write_queue;     // Send buffer
    int                     sk_rcvbuf;   // Receive buffer size
    int                     sk_sndbuf;   // Send buffer size
    int                     sk_wmem_queued;     // Send queue memory
    int                     sk_rmem_alloc;      // Receive memory
    // ... many more fields
};

// TCP-specific socket extends sock
struct tcp_sock {
    struct inet_connection_sock inet_conn;
    
    // TCP state
    u32 snd_una;        // First unacked sequence
    u32 snd_nxt;        // Next sequence to send
    u32 rcv_nxt;        // Expected next sequence
    u32 rcv_wnd;        // Receive window
    
    // Congestion control
    u32 snd_cwnd;       // Congestion window
    u32 snd_ssthresh;   // Slow start threshold
    
    // RTT estimation
    u32 srtt_us;        // Smoothed RTT
    u32 rttvar_us;      // RTT variance
    // ... many more
};
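
The "extends" relationship above is plain C struct embedding: the more specific structure places the generic one as its first member, so a pointer to one is also a valid pointer to the other. A minimal user-space sketch of that idea follows; the `model_*` types and functions are hypothetical, heavily simplified stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures. */
struct model_sock {
    int sk_rcvbuf;              /* receive buffer limit in bytes */
    int sk_sndbuf;              /* send buffer limit in bytes */
};

struct model_inet_connection_sock {
    struct model_sock sk;       /* generic part must come first */
};

struct model_tcp_sock {
    struct model_inet_connection_sock inet_conn;  /* must come first */
    uint32_t snd_una;           /* first unacknowledged sequence number */
    uint32_t snd_nxt;           /* next sequence number to send */
};

/* Downcast from the generic socket to the TCP view of the same object.
 * Valid only because the generic part sits at offset 0. */
struct model_tcp_sock *model_tcp_sk(struct model_sock *sk)
{
    assert(offsetof(struct model_tcp_sock, inet_conn) == 0);
    assert(offsetof(struct model_inet_connection_sock, sk) == 0);
    return (struct model_tcp_sock *)sk;
}

/* Bytes sent but not yet acknowledged; unsigned math handles wraparound. */
uint32_t model_bytes_in_flight(const struct model_tcp_sock *tp)
{
    return tp->snd_nxt - tp->snd_una;
}
```

Generic code can therefore pass around a socket pointer without knowing the protocol, while TCP code recovers its own fields with a cast.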

sk_buff Structure

Efficient Packet Handling

Every packet in the Linux network stack is represented by an sk_buff (socket buffer). But why is it designed this way? The problem: As a packet moves through layers, each layer adds/removes headers:

Application → TCP adds TCP header
TCP → IP adds IP header
IP → Ethernet adds Ethernet header

Naive approach: Copy the data at each layer. Copying a 1500-byte packet once at each of 4 layers means 6000 bytes copied! The solution: sk_buff uses headroom and tailroom:

Allocate extra space at the beginning (headroom) and end (tailroom)
Adding a header? Just move the data pointer backward into headroom
Removing a header? Move the data pointer forward
No copying needed!

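Real-world impact: Headers cost no copies. A 10 Gbps NIC can deliver millions of packets per second (about 14.88 million at minimum frame size), and the stack keeps up partly because headers are added and removed by pointer manipulation rather than memory copies. The payload is still copied once between user space and the kernel on ordinary send()/recv() calls, so this is not end-to-end zero-copy.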
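
The pointer arithmetic is easier to see in a toy version. The sketch below is a user-space model, not the kernel implementation: `model_skb` and its helper functions are hypothetical names that mirror the behaviour of `skb_reserve()`, `skb_put()`, `skb_push()` and `skb_pull()` on a fixed 256-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of the four sk_buff pointers. */
struct model_skb {
    unsigned char  buf[256];
    unsigned char *head;   /* start of the allocation */
    unsigned char *data;   /* start of valid packet bytes */
    unsigned char *tail;   /* end of valid packet bytes */
    unsigned char *end;    /* end of the allocation */
};

void model_skb_init(struct model_skb *skb)
{
    memset(skb->buf, 0, sizeof(skb->buf));
    skb->head = skb->data = skb->tail = skb->buf;
    skb->end = skb->buf + sizeof(skb->buf);
}

/* Leave `len` bytes of headroom; only valid while the buffer is empty. */
void model_skb_reserve(struct model_skb *skb, size_t len)
{
    assert(skb->data == skb->tail && (size_t)(skb->end - skb->tail) >= len);
    skb->data += len;
    skb->tail += len;
}

/* Append `len` bytes at the tail; returns where the caller should write. */
unsigned char *model_skb_put(struct model_skb *skb, size_t len)
{
    assert((size_t)(skb->end - skb->tail) >= len);
    unsigned char *old_tail = skb->tail;
    skb->tail += len;
    return old_tail;
}

/* Prepend a header by moving `data` back into the headroom. */
unsigned char *model_skb_push(struct model_skb *skb, size_t len)
{
    assert((size_t)(skb->data - skb->head) >= len);
    skb->data -= len;
    return skb->data;
}

/* Strip a header by moving `data` forward. */
unsigned char *model_skb_pull(struct model_skb *skb, size_t len)
{
    assert((size_t)(skb->tail - skb->data) >= len);
    skb->data += len;
    return skb->data;
}

/* Number of valid bytes currently between `data` and `tail`. */
size_t model_skb_len(const struct model_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}
```

Reserving 54 bytes (14 Ethernet + 20 IP + 20 TCP), putting a 5-byte payload, and then pushing the three headers leaves `data` back at `head` with a length of 59. The payload bytes never move.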

┌─────────────────────────────────────────────────────────────────────┐
│                        sk_buff STRUCTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  struct sk_buff {                                                    │
│      /* Packet data pointers */                                     │
│      unsigned char *head;     ──────────┐                           │
│      unsigned char *data;     ──────────│──┐                        │
│      unsigned char *tail;     ──────────│──│──┐                     │
│      unsigned char *end;      ──────────│──│──│──┐                  │
│                                         │  │  │  │                   │
│      /* Buffer layout: */               │  │  │  │                   │
│      ┌──────────────────────────────────▼──│──│──│──────────────┐   │
│      │ headroom │◀─────── data ─────────▶│  │  │  │ tailroom    │   │
│      │          │  HEADERS  │  PAYLOAD   │  │  │  │             │   │
│      └──────────┴───────────┴────────────┴──▼──▼──▼─────────────┘   │
│                  │                            │                      │
│                  │◀────────── len ───────────▶│                     │
│                                                                      │
│      /* Important fields */                                         │
│      unsigned int len;           // Data length                     │
│      unsigned int data_len;      // Paged data length               │
│      __be16 protocol;            // L3 protocol (ETH_P_IP)          │
│      __u8 pkt_type;              // PACKET_HOST, PACKET_BROADCAST  │
│      struct net_device *dev;     // Device                          │
│      struct sock *sk;            // Associated socket               │
│                                                                      │
│      /* Header offsets from skb->head (modern kernels) */           │
│      __u16 transport_header;     // skb_transport_header()          │
│      __u16 network_header;       // skb_network_header()            │
│      __u16 mac_header;           // skb_mac_header()                │
│      /* Accessors: ip_hdr(), ipv6_hdr(), tcp_hdr(), udp_hdr() */    │
│  }                                                                   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

sk_buff Operations

// Allocate an sk_buff big enough for headers plus payload
struct sk_buff *skb = alloc_skb(header_len + data_len, GFP_KERNEL);
if (!skb)
    return -ENOMEM;

// Reserve headroom for headers
skb_reserve(skb, header_len);

// Add data at tail
unsigned char *data = skb_put(skb, data_len);
memcpy(data, payload, data_len);

// Add header at head
unsigned char *header = skb_push(skb, header_len);

// Remove data from head (processing headers)
skb_pull(skb, header_len);

// Clone (share data, new metadata)
struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);

// Copy (full copy)
struct sk_buff *skb3 = skb_copy(skb, GFP_ATOMIC);

TCP Connection Lifecycle

Why TIME_WAIT Exists

Before we look at the TCP state machine, let’s understand one of its most misunderstood states: TIME_WAIT. The problem: After closing a connection, what if:

The final ACK gets lost? The other side will retransmit FIN
Delayed packets from the old connection arrive after we reuse the port?

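The solution: The side that closes first enters TIME_WAIT and waits 2×MSL (Maximum Segment Lifetime). RFC 793 suggests an MSL of 2 minutes; Linux instead hard-codes the whole TIME_WAIT period to 60 seconds (TCP_TIMEWAIT_LEN). The wait exists to: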

Ensure clean shutdown: If final ACK is lost, we can resend it
Prevent packet confusion: Old packets expire before port is reused

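Why it’s annoying: A busy server that actively closes 1000 connections/second accumulates about 60,000 TIME_WAIT sockets on Linux (1000/s × 60 s). Mitigations:

SO_REUSEADDR: Allows bind() on a local port that still has connections in TIME_WAIT (for example, when restarting a server)
tcp_tw_reuse: Reuse TIME_WAIT sockets for new outgoing connections (requires TCP timestamps)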
Connection pooling: Keep connections alive instead of closing
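
The arithmetic behind the TIME_WAIT pile-up is worth making explicit. The sketch below assumes a 60-second TIME_WAIT period, as on Linux, and uses 32768-60999 as the ephemeral port range, which is a common default for net.ipv4.ip_local_port_range; both function names are made up for this illustration.

```c
#include <assert.h>

/* Steady-state TIME_WAIT population: sockets enter at `closes_per_sec`
 * and each one lingers for `tw_seconds`. */
long time_wait_sockets(long closes_per_sec, long tw_seconds)
{
    assert(closes_per_sec >= 0 && tw_seconds >= 0);
    return closes_per_sec * tw_seconds;
}

/* Upper bound on new outgoing connections per second to ONE destination
 * ip:port when every closed connection parks its local port in TIME_WAIT. */
long max_outgoing_rate(long port_lo, long port_hi, long tw_seconds)
{
    assert(port_hi >= port_lo && tw_seconds > 0);
    return (port_hi - port_lo + 1) / tw_seconds;
}
```

`time_wait_sockets(1000, 60)` gives 60000. The second function shows why clients and proxies suffer more than servers: with 28,232 ephemeral ports and a 60-second wait, a client that always closes first can sustain only about 470 new connections per second to a single destination before it runs out of local ports.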

┌─────────────────────────────────────────────────────────────────────┐
│                    TCP STATE MACHINE                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Client                              Server                          │
│  ──────                              ──────                          │
│                                                                      │
│  CLOSED                              CLOSED                          │
│     │                                   │                            │
│     │ socket(), bind()                  │ socket(), bind(), listen() │
│     │                                   ▼                            │
│     │                                LISTEN                          │
│     │                                   │                            │
│     │ connect()                         │                            │
│     ├─────── SYN ─────────────────────▶│                            │
│     │                                   │                            │
│  SYN_SENT                            │ (SYN received)              │
│     │                                   │                            │
│     │◀────── SYN+ACK ───────────────────┤                            │
│     │                                   │                            │
│     │         (ACK SYN)              SYN_RCVD                       │
│     ├─────── ACK ──────────────────────▶│                            │
│     │                                   │                            │
│  ESTABLISHED                         ESTABLISHED                    │
│     │                                   │                            │
│     │◀════════ DATA TRANSFER ══════════▶│                            │
│     │                                   │                            │
│     │ close()                           │                            │
│     ├─────── FIN ──────────────────────▶│                            │
│     │                                   │                            │
│  FIN_WAIT_1                          CLOSE_WAIT                     │
│     │                                   │                            │
│     │◀────── ACK ───────────────────────┤                            │
│     │                                   │ close()                   │
│  FIN_WAIT_2                             │                            │
│     │                                   │                            │
│     │◀────── FIN ───────────────────────┤                            │
│     │                                   │                            │
│     ├─────── ACK ──────────────────────▶│                            │
│     │                                   │                            │
│  TIME_WAIT                           LAST_ACK                       │
│     │                                   │                            │
│     │ (2*MSL timeout)                   │                            │
│     ▼                                   ▼                            │
│  CLOSED                              CLOSED                          │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
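
The diagram can also be written down as a transition function. This is a sketch of only the transitions drawn above: simultaneous open and close, the CLOSING state, RST handling and timeouts other than 2×MSL are left out, and the `ST_`/`EV_` names are invented for the example.

```c
#include <assert.h>

enum tcp_state {
    ST_CLOSED, ST_LISTEN, ST_SYN_SENT, ST_SYN_RCVD, ST_ESTABLISHED,
    ST_FIN_WAIT_1, ST_FIN_WAIT_2, ST_CLOSE_WAIT, ST_LAST_ACK, ST_TIME_WAIT
};

enum tcp_event {
    EV_LISTEN, EV_CONNECT, EV_CLOSE,                     /* application calls */
    EV_RCV_SYN, EV_RCV_SYN_ACK, EV_RCV_ACK, EV_RCV_FIN,  /* incoming segments */
    EV_TIMEOUT_2MSL                                      /* TIME_WAIT timer */
};

/* Return the state that follows `s` when event `e` occurs. */
enum tcp_state tcp_next_state(enum tcp_state s, enum tcp_event e)
{
    assert(s >= ST_CLOSED && s <= ST_TIME_WAIT);
    switch (s) {
    case ST_CLOSED:
        if (e == EV_LISTEN)      return ST_LISTEN;
        if (e == EV_CONNECT)     return ST_SYN_SENT;     /* sends SYN */
        break;
    case ST_LISTEN:
        if (e == EV_RCV_SYN)     return ST_SYN_RCVD;     /* sends SYN+ACK */
        break;
    case ST_SYN_SENT:
        if (e == EV_RCV_SYN_ACK) return ST_ESTABLISHED;  /* sends ACK */
        break;
    case ST_SYN_RCVD:
        if (e == EV_RCV_ACK)     return ST_ESTABLISHED;
        break;
    case ST_ESTABLISHED:
        if (e == EV_CLOSE)       return ST_FIN_WAIT_1;   /* active close: sends FIN */
        if (e == EV_RCV_FIN)     return ST_CLOSE_WAIT;   /* passive close: sends ACK */
        break;
    case ST_FIN_WAIT_1:
        if (e == EV_RCV_ACK)     return ST_FIN_WAIT_2;
        break;
    case ST_FIN_WAIT_2:
        if (e == EV_RCV_FIN)     return ST_TIME_WAIT;    /* sends final ACK */
        break;
    case ST_CLOSE_WAIT:
        if (e == EV_CLOSE)       return ST_LAST_ACK;     /* sends FIN */
        break;
    case ST_LAST_ACK:
        if (e == EV_RCV_ACK)     return ST_CLOSED;
        break;
    case ST_TIME_WAIT:
        if (e == EV_TIMEOUT_2MSL) return ST_CLOSED;
        break;
    }
    return s;  /* events not drawn in the diagram leave the state unchanged */
}
```

Note that only the side that calls close() first passes through TIME_WAIT; the passive closer goes from LAST_ACK straight to CLOSED.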

Observing TCP States

# View connection states
ss -tan state established
ss -tan state time-wait
ss -tan state syn-recv

# Count by state (NR>1 skips the header line)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c

# Detailed socket info
ss -ti dst 10.0.0.1:443

# Socket memory usage
ss -tm | grep -A1 ESTAB

# Watch for SYN floods (-H suppresses the header so it is not counted)
watch -n1 'ss -Htan state syn-recv | wc -l'

TCP Tuning Parameters

Buffer Sizes

# Socket buffer defaults and limits
cat /proc/sys/net/core/rmem_default    # Default receive buffer
cat /proc/sys/net/core/rmem_max        # Max receive buffer
cat /proc/sys/net/core/wmem_default    # Default send buffer
cat /proc/sys/net/core/wmem_max        # Max send buffer

# TCP-specific auto-tuning
cat /proc/sys/net/ipv4/tcp_rmem
# min  default  max
# 4096 131072   6291456

cat /proc/sys/net/ipv4/tcp_wmem
# 4096 16384    4194304

# Tune for high-bandwidth, high-latency paths (raise the maximum to 16 MiB)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
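
How large should the maximum be? The usual starting point is the bandwidth-delay product (BDP): the number of bytes that must be in flight to keep the path full. The helpers below are a small sketch of that calculation with made-up function names. They treat the whole buffer as usable window, which is optimistic, because the kernel reserves part of each socket buffer for its own bookkeeping.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth-delay product: bytes in flight needed to keep the path full. */
uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
    return bits_per_sec / 8 * rtt_ms / 1000;
}

/* Highest rate (bits/s) one connection can sustain with `window_bytes`
 * in flight over a path with round-trip time `rtt_ms`. */
uint64_t max_rate_bps(uint64_t window_bytes, uint64_t rtt_ms)
{
    assert(rtt_ms > 0);
    return window_bytes * 8 * 1000 / rtt_ms;
}
```

For a 1 Gbps path with 100 ms RTT, `bdp_bytes(1000000000, 100)` is 12,500,000 bytes, so a 16 MiB maximum is enough. For 10 Gbps at the same RTT the BDP is 125,000,000 bytes, and a 16 MiB window caps a single flow at about 1.34 Gbps (`max_rate_bps(16777216, 100)`).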

Connection Parameters

# Backlog queue sizes
cat /proc/sys/net/core/somaxconn          # listen() backlog limit
cat /proc/sys/net/ipv4/tcp_max_syn_backlog # SYN queue size

# TIME_WAIT handling
cat /proc/sys/net/ipv4/tcp_tw_reuse       # Reuse TIME_WAIT for outgoing
cat /proc/sys/net/ipv4/tcp_fin_timeout    # FIN_WAIT_2 timeout

# Keepalive
cat /proc/sys/net/ipv4/tcp_keepalive_time  # Idle before keepalive
cat /proc/sys/net/ipv4/tcp_keepalive_intvl # Interval between probes
cat /proc/sys/net/ipv4/tcp_keepalive_probes # Probes before drop

Congestion Control

# Available algorithms
cat /proc/sys/net/ipv4/tcp_available_congestion_control
# Example output: reno cubic bbr (bbr is listed only when the tcp_bbr module is loaded or built in)

# Current algorithm
cat /proc/sys/net/ipv4/tcp_congestion_control

# Switch to BBR
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Per-socket selection (C, from application code; the length excludes the terminating NUL)
setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", strlen("bbr"));
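
A slightly fuller version of the per-socket call, as a sketch for Linux with glibc headers. The wrapper names and `CC_NAME_MAX` are made up for this example; the value 16 matches `TCP_CA_NAME_MAX` in the kernel sources.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Longest algorithm name the kernel accepts, including the terminator. */
#define CC_NAME_MAX 16

/* Read the congestion control algorithm of a TCP socket into `name`.
 * Returns 0 on success, -1 with errno set on failure. */
int get_congestion_control(int fd, char name[CC_NAME_MAX])
{
    socklen_t len = CC_NAME_MAX;
    memset(name, 0, CC_NAME_MAX);
    return getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len);
}

/* Select an algorithm for this socket only. Returns -1 with errno set if
 * the algorithm is not available or the caller may not use it. */
int set_congestion_control(int fd, const char *name)
{
    assert(name != NULL && strlen(name) < CC_NAME_MAX);
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name));
}
```

Reading the value back after setting it is a cheap way to confirm the change took effect, since unprivileged processes may be limited to the algorithms listed in tcp_allowed_congestion_control.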

Netfilter and iptables

Netfilter Hooks

┌─────────────────────────────────────────────────────────────────────┐
│                    NETFILTER HOOK POINTS                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                         ┌───────────────┐                           │
│                         │  PREROUTING   │                           │
│   Incoming              │   (DNAT)      │                           │
│   Packet ─────────────▶│               │                           │
│                         └───────┬───────┘                           │
│                                 │                                    │
│                         ┌───────▼───────┐                           │
│                         │   Routing     │                           │
│                         │   Decision    │                           │
│                         └───┬───────┬───┘                           │
│                 Local?      │       │      Forward?                 │
│              ┌──────────────┘       └──────────────┐               │
│              │                                      │               │
│      ┌───────▼───────┐                      ┌───────▼───────┐      │
│      │    INPUT      │                      │   FORWARD     │      │
│      │ (filter local)│                      │ (filter fwd)  │      │
│      └───────┬───────┘                      └───────┬───────┘      │
│              │                                      │               │
│      ┌───────▼───────┐                              │               │
│      │ Local Process │                              │               │
│      └───────┬───────┘                              │               │
│              │                                      │               │
│      ┌───────▼───────┐                              │               │
│      │   OUTPUT      │                              │               │
│      │ (filter out)  │                              │               │
│      └───────┬───────┘                              │               │
│              │                                      │               │
│              └──────────────┬──────────────────────┘               │
│                             │                                       │
│                     ┌───────▼───────┐                              │
│                     │  POSTROUTING  │                              │
│                     │    (SNAT)     │                              │
│                     └───────┬───────┘                              │
│                             │                                       │
│                             ▼                                       │
│                      Outgoing Packet                                │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
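
The three paths through the diagram can be summarised in a few lines of code. This is an illustrative sketch only: the `HOOK_` enum and `nf_hooks_traversed()` are invented here and are not the kernel's hook numbering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hook names as used by iptables chains; illustrative, not the kernel enum. */
enum nf_hook {
    HOOK_PREROUTING, HOOK_INPUT, HOOK_FORWARD, HOOK_OUTPUT, HOOK_POSTROUTING
};

/* Fill `out` with the hooks a packet traverses, in order; return the count.
 *   locally_generated: the packet was created by a process on this host
 *   for_this_host:     routing decided the destination is a local address
 *                      (ignored for locally generated packets in this sketch) */
size_t nf_hooks_traversed(bool locally_generated, bool for_this_host,
                          enum nf_hook out[3])
{
    assert(out != NULL);
    size_t n = 0;
    if (locally_generated) {
        out[n++] = HOOK_OUTPUT;
        out[n++] = HOOK_POSTROUTING;
        return n;
    }
    out[n++] = HOOK_PREROUTING;
    if (for_this_host) {
        out[n++] = HOOK_INPUT;
    } else {
        out[n++] = HOOK_FORWARD;
        out[n++] = HOOK_POSTROUTING;
    }
    return n;
}
```

A forwarded packet never touches INPUT or OUTPUT, which is why firewall rules for a router belong in FORWARD, and why DNAT has to happen in PREROUTING, before the routing decision.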

Connection Tracking

# View connection tracking table
conntrack -L

# Example output:
# tcp  6 300 ESTABLISHED src=10.0.0.1 dst=10.0.0.2 sport=12345 dport=80 ...

# Connection tracking stats
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Tune for high connection counts
sysctl -w net.netfilter.nf_conntrack_max=1000000
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600

XDP (eXpress Data Path)

XDP Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         XDP ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Packet arrives at NIC                                               │
│         │                                                            │
│         ▼                                                            │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                     XDP Hook Point                               ││
│  │                 (Before sk_buff creation)                        ││
│  │                                                                  ││
│  │  BPF Program executes:                                          ││
│  │  ┌────────────────────────────────────────────────────────────┐ ││
│  │  │ SEC("xdp")                                                 │ ││
│  │  │ int xdp_prog(struct xdp_md *ctx) {                        │ ││
│  │  │     void *data = (void*)(long)ctx->data;                  │ ││
│  │  │     void *data_end = (void*)(long)ctx->data_end;          │ ││
│  │  │     struct ethhdr *eth = data;                            │ ││
│  │  │                                                            │ ││
│  │  │     if ((void *)(eth + 1) > data_end)                      │ ││
│  │  │         return XDP_DROP;                                   │ ││
│  │  │                                                            │ ││
│  │  │     // Process packet...                                   │ ││
│  │  │     return XDP_PASS;                                       │ ││
│  │  │ }                                                          │ ││
│  │  └────────────────────────────────────────────────────────────┘ ││
│  │                                                                  ││
│  │  Return values:                                                  ││
│  │  ├─ XDP_DROP:    Drop packet (no further processing)           ││
│  │  ├─ XDP_PASS:    Continue to network stack                     ││
│  │  ├─ XDP_TX:      Transmit back out same interface             ││
│  │  ├─ XDP_REDIRECT: Send to different interface/CPU              ││
│  │  └─ XDP_ABORTED: Error, drop with trace                        ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                              │                                       │
│                              ▼                                       │
│                      Normal network stack                            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

XDP Modes

Mode	Description	Performance	Requirements
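Native	Runs in the NIC driver, before sk_buff allocation	High	NIC driver with XDP support
Generic	Runs in the network stack, after sk_buff allocation	Lowest of the three	Any NIC
Offload	Runs on the NIC hardware	Highest	SmartNIC with BPF offload support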

XDP Use Cases

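// DDoS mitigation sketch: drop SYNs from sources that exceed a threshold
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define SYN_LIMIT 100   // Illustrative threshold; tune for real traffic

// Per-source SYN counter. Counts are never reset here; a real filter
// would age entries or count per time window.
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);
    __type(value, __u64);
} syn_count SEC(".maps");

SEC("xdp")
int xdp_ddos(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Every header access must be bounds-checked for the verifier
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_DROP;

    if (iph->protocol != IPPROTO_TCP)
        return XDP_PASS;

    if (iph->ihl < 5)   // Malformed: header shorter than 20 bytes
        return XDP_DROP;

    struct tcphdr *tcp = (void *)iph + (iph->ihl * 4);
    if ((void *)(tcp + 1) > data_end)
        return XDP_DROP;

    // Count bare SYNs per source; an allowlist check could go here too
    if (tcp->syn && !tcp->ack) {
        __u32 src_ip = iph->saddr;
        __u64 one = 1;
        __u64 *count = bpf_map_lookup_elem(&syn_count, &src_ip);
        if (!count) {
            bpf_map_update_elem(&syn_count, &src_ip, &one, BPF_ANY);
        } else {
            __sync_fetch_and_add(count, 1);
            if (*count > SYN_LIMIT)
                return XDP_DROP;
        }
    }

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";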

Network Performance Tuning

RSS and RPS

# RSS: Receive Side Scaling (hardware-based)
# Distributes packets across multiple CPU queues
ethtool -l eth0          # Show queue count
ethtool -L eth0 combined 8  # Set queue count

# Check RSS hash
ethtool -x eth0

# RPS: Receive Packet Steering (software-based)
# Distributes to CPUs when hardware RSS unavailable
echo "ff" > /sys/class/net/eth0/queues/rx-0/rps_cpus
# "ff" = CPUs 0-7

# RFS: Receive Flow Steering
# Steers packets to CPU where application is running
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 4096 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
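
The value written to rps_cpus is a hexadecimal CPU bitmask, with CPU 0 as the lowest bit. The helper below is a sketch for building one from a CPU range; the function names are made up, and it is limited to the first 32 CPUs because larger masks are written as comma-separated 32-bit groups.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bitmask with bits first_cpu..last_cpu set (CPU 0 is the lowest bit). */
uint32_t rps_cpu_mask(unsigned first_cpu, unsigned last_cpu)
{
    assert(first_cpu <= last_cpu && last_cpu < 32);
    uint32_t mask = 0;
    for (unsigned cpu = first_cpu; cpu <= last_cpu; cpu++)
        mask |= (uint32_t)1 << cpu;
    return mask;
}

/* Format the mask as the hex string to write into rps_cpus. */
void rps_mask_string(uint32_t mask, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%x", (unsigned)mask);
}
```

CPUs 0-7 give `ff`, matching the example above, while CPUs 4-7 give `f0`. The second form is useful when the lower cores are reserved for interrupt handling.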

Interrupt Coalescing

# View current settings
ethtool -c eth0

# Tune for latency (less coalescing)
ethtool -C eth0 rx-usecs 10 rx-frames 5

# Tune for throughput (more coalescing)
ethtool -C eth0 rx-usecs 100 rx-frames 64

# Adaptive coalescing
ethtool -C eth0 adaptive-rx on adaptive-tx on

TCP Offloads

# View offload status
ethtool -k eth0

# Key offloads:
# - rx-checksumming: Hardware RX checksum
# - tx-checksumming: Hardware TX checksum
# - tcp-segmentation-offload: TSO
# - generic-receive-offload: GRO
# - large-receive-offload: LRO

# Disable (for debugging)
ethtool -K eth0 gro off

# Enable (for performance)
ethtool -K eth0 tso on gso on gro on

Debugging Network Issues

Packet Drops

# Check interface statistics
ip -s link show eth0
# Watch for: RX dropped, RX overruns

# Detailed drop reasons
ethtool -S eth0 | grep -i drop

# System-wide drops
cat /proc/net/softnet_stat
# One row per CPU, values in hex. Columns: processed, dropped, time_squeeze, ...

# Socket buffer usage and per-socket drops
ss -ntmp | grep skmem
# skmem:(r<rmem_alloc>,rb<rcvbuf>,t<wmem_alloc>,tb<sndbuf>,...,d<drops>)

# Per-protocol drops
netstat -s | grep -i drop
nstat -az | grep -i drop
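
Because /proc/net/softnet_stat is printed in hexadecimal with no header, it is easy to misread by eye. The parser below is a minimal sketch for the first three columns of one row; `struct softnet_row` and `parse_softnet_row()` are names invented for this example, and the remaining columns are ignored because they vary between kernel versions.

```c
#include <assert.h>
#include <stdio.h>

/* First three columns of one /proc/net/softnet_stat row (one row per CPU). */
struct softnet_row {
    unsigned int processed;     /* packets handled by this CPU */
    unsigned int dropped;       /* dropped because the backlog queue was full */
    unsigned int time_squeeze;  /* polls cut short by the budget or time limit */
};

/* Parse one row. All columns are hexadecimal. Returns 0 on success, -1 if
 * the line does not start with three hex numbers. */
int parse_softnet_row(const char *line, struct softnet_row *row)
{
    assert(line != NULL && row != NULL);
    int n = sscanf(line, "%x %x %x",
                   &row->processed, &row->dropped, &row->time_squeeze);
    return n == 3 ? 0 : -1;
}
```

A growing `dropped` count points at net.core.netdev_max_backlog, while a growing `time_squeeze` count points at net.core.netdev_budget or at a CPU that is simply saturated with softirq work.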

Tracing Tools

# Trace TCP state changes
sudo tcpstates-bpfcc

# Trace TCP retransmits
sudo tcpretrans-bpfcc

# Trace TCP connections
sudo tcpconnect-bpfcc   # Outgoing
sudo tcpaccept-bpfcc    # Incoming

# Trace packet drops
sudo dropwatch -l kas

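# Trace specific flow (remote port 8080); needs BTF or kernel headers for struct sock
sudo bpftrace -e '
kprobe:tcp_rcv_established {
    $sk = (struct sock *)arg0;
    // skc_dport is the remote port in network byte order; swap to host order
    $dport = $sk->__sk_common.skc_dport;
    $dport = ($dport >> 8) | (($dport << 8) & 0xff00);
    if ($dport == 8080) { @recv = count(); }
}
'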

Interview Deep Dives

Q: Explain what happens when you connect to google.com:443

Complete flow:

1. DNS Resolution:
- Application calls getaddrinfo()
- Check /etc/hosts, then the resolver (order set by /etc/nsswitch.conf)
- DNS query → response with IP
2. Socket Creation:
- socket(AF_INET, SOCK_STREAM, 0)
- Allocate struct socket, struct sock
3. TCP Handshake:
- connect() triggers SYN
- Build sk_buff, add TCP/IP headers
- Route lookup, ARP if needed
- Queue to driver, DMA to NIC
- Wait for SYN+ACK
- Send ACK, enter ESTABLISHED
4. TLS Handshake:
- ClientHello (ciphersuites, SNI)
- ServerHello + Certificate
- Key exchange (ECDHE)
- Finished messages
5. HTTP Request:
- Send HTTP/1.1 or HTTP/2 request
- TCP handles segmentation, retransmit
- Data copied into the socket send buffer, then DMA'd to the NIC
6. Response Path:
- NIC receives, DMA to memory
- NAPI poll, allocate sk_buff
- IP/TCP processing
- Queue sk_buff on the socket receive queue
- Application read() copies the data to user space

Q: How would you debug high network latency?

Systematic approach:

Measure where latency occurs:

# Application to kernel
strace -T -e network ./app

# Kernel processing
sudo perf trace -e 'net:*' ./app

# Network path
mtr target.com

Check for retransmits:

ss -ti dst target.com:443
# Look for: retrans, rto

sudo tcpretrans-bpfcc

Check for buffer issues:

ss -tmi dst target.com
# Check rcv_space (from -i) and skmem (from -m)

cat /proc/net/sockstat
# Check memory pressure

Check for CPU issues:

cat /proc/net/softnet_stat
# Growing value in column 3 (time_squeeze, hex) = softirq ran out of budget or time with packets still pending

mpstat -P ALL 1
# Check the %soft column; also look for busy ksoftirqd threads in top

Check network path:

tcpdump -i eth0 host target.com -w capture.pcap
# Analyze in Wireshark for delays

Q: How does TCP achieve reliability?

Key mechanisms:

Sequence numbers: Every byte numbered, receiver knows what’s missing
Acknowledgments: Receiver ACKs what it received
Retransmission:
- Timeout-based (RTO expires)
- Fast retransmit (3 duplicate ACKs)
Flow control:
- Receive window advertised
- Sender won’t exceed receiver buffer
Congestion control:
- AIMD (Additive Increase, Multiplicative Decrease)
- Slow start, congestion avoidance
- Modern: BBR, CUBIC
Checksums: Detect corruption
Connection state: Three-way handshake, four-way close
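
The retransmission timeout deserves a closer look, since it ties back to the srtt_us and rttvar_us fields shown earlier. The sketch below follows the update rules from RFC 6298 (alpha = 1/8, beta = 1/4, RTO = SRTT + 4 × RTTVAR) in floating point for readability. It ignores clock granularity, and `struct rto_estimator` and `rto_update()` are names invented for this example; the kernel uses scaled integer arithmetic instead.

```c
#include <assert.h>
#include <stdbool.h>

struct rto_estimator {
    bool   has_sample;  /* false until the first RTT measurement */
    double srtt_ms;     /* smoothed RTT */
    double rttvar_ms;   /* RTT variation */
};

/* Feed one RTT measurement and return the new retransmission timeout in ms.
 * `min_rto_ms` is the floor applied to the result. */
double rto_update(struct rto_estimator *e, double rtt_ms, double min_rto_ms)
{
    assert(rtt_ms >= 0.0);
    if (!e->has_sample) {
        /* First sample: SRTT = R, RTTVAR = R/2 */
        e->srtt_ms = rtt_ms;
        e->rttvar_ms = rtt_ms / 2.0;
        e->has_sample = true;
    } else {
        double err = e->srtt_ms - rtt_ms;
        if (err < 0.0)
            err = -err;
        /* RTTVAR is updated first, using the old SRTT */
        e->rttvar_ms = 0.75 * e->rttvar_ms + 0.25 * err;
        e->srtt_ms   = 0.875 * e->srtt_ms + 0.125 * rtt_ms;
    }
    double rto = e->srtt_ms + 4.0 * e->rttvar_ms;
    return rto < min_rto_ms ? min_rto_ms : rto;
}
```

With a steady 100 ms RTT and a 200 ms floor (the Linux minimum), the timeout starts at 300 ms and shrinks to 250 ms on the second sample as the variance estimate decays. RFC 6298 itself recommends a 1-second floor, which the last parameter lets you model.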

Quick Reference

# Socket statistics
ss -s              # Summary
ss -tan            # TCP sockets
ss -uan            # UDP sockets
ss -ltn            # Listening TCP

# Interface statistics
ip -s link         # Interface stats
ethtool -S eth0    # Detailed stats

# TCP tuning files
/proc/sys/net/ipv4/tcp_*
/proc/sys/net/core/*

# Tracing
tcpdump -i any -nn port 80
sudo tcpretrans-bpfcc
sudo tcpconnect-bpfcc
