Networking Subsystem
The network subsystem is one of the most complex parts of an operating system. It implements the protocols that enable all network communication. Understanding it is essential for debugging network issues and optimizing application performance.

Interview Frequency: High for systems/infrastructure roles
Key Topics: Network stack, sockets, TCP/IP internals, performance tuning
Time to Master: 15-20 hours
Network Stack Overview
┌─────────────────────────────────────────────────────────────────┐
│ LINUX NETWORK STACK │
├─────────────────────────────────────────────────────────────────┤
│ │
│ User Space │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Application (nginx, curl, browser) │ │
│ │ │ │ │
│ │ │ send()/recv(), read()/write() │ │
│ │ ▼ │ │
│ │ Socket API (POSIX) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ System call │
│ ════════════════════════════════════════════════════════════ │
│ ▼ │
│ Kernel Space │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Socket Layer │ │
│ │ • struct socket, struct sock │ │
│ │ • Protocol-independent interface │ │
│ └───────────────────────────┬─────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Transport Layer (L4) │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ TCP │ │ UDP │ │ │
│ │ │ • Connection │ │ • Connectionless│ │ │
│ │ │ • Reliable │ │ • Best-effort │ │ │
│ │ │ • Ordered │ │ • Unordered │ │ │
│ │ └─────────────────┘ └─────────────────┘ │ │
│ └───────────────────────────┬─────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Network Layer (L3) │ │
│ │ • IP routing │ │
│ │ • Fragmentation │ │
│ │ • Netfilter (iptables, nftables) │ │
│ └───────────────────────────┬─────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Link Layer (L2) │ │
│ │ • ARP │ │
│ │ • Device drivers │ │
│ │ • Traffic control (tc) │ │
│ └───────────────────────────┬─────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Network Device │ │
│ │ • NIC driver │ │
│ │ • DMA, ring buffers │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ════════════════════════════════════════════════════════════ │
│ ▼ │
│ Hardware │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ NIC (Network Interface Card) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Socket API
Socket Data Structures
// User-space view
int sockfd = socket(AF_INET, SOCK_STREAM, 0);
// Kernel representation
struct socket {
    socket_state            state;   // SS_FREE, SS_CONNECTED, etc.
    short                   type;    // SOCK_STREAM, SOCK_DGRAM
    unsigned long           flags;
    struct file             *file;   // VFS integration
    struct sock             *sk;     // Protocol-specific state
    const struct proto_ops  *ops;    // Protocol operations
};

// Protocol-specific socket (sock)
struct sock {
    struct sock_common      __sk_common;

    // Receive/send queues
    struct sk_buff_head     sk_receive_queue;
    struct sk_buff_head     sk_write_queue;

    // Buffer sizes
    int                     sk_rcvbuf;
    int                     sk_sndbuf;

    // Callbacks
    void (*sk_data_ready)(struct sock *sk);
    void (*sk_write_space)(struct sock *sk);

    // ... many more fields
};
Socket Buffer (sk_buff)
┌─────────────────────────────────────────────────────────────────┐
│ SK_BUFF STRUCTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ struct sk_buff { │
│ // Navigation │
│ struct sk_buff *next, *prev; │
│ struct sock *sk; // Owning socket │
│ struct net_device *dev; // Network device │
│ │
│ // Data pointers │
│ unsigned char *head; // Start of buffer │
│ unsigned char *data; // Start of data │
│ unsigned char *tail; // End of data │
│ unsigned char *end; // End of buffer │
│ │
│ // Protocol headers │
│ unsigned char *mac_header; │
│ unsigned char *network_header; │
│ unsigned char *transport_header; │
│ }; │
│ │
│ Memory layout: │
│ ┌─────┬──────┬──────────┬──────┬──────┬─────────────┬─────┐ │
│ │head │ room │ MAC │ IP │ TCP │ Payload │tail │ │
│ │space│ │ header │header│header│ │space│ │
│ └─────┴──────┴──────────┴──────┴──────┴─────────────┴─────┘ │
│ ↑ ↑ ↑ ↑ ↑ │
│ head mac network transport end │
│ │
│ Headroom: Reserved for adding headers (transmit) │
│ Tailroom: Reserved for trailers │
│ │
└─────────────────────────────────────────────────────────────────┘
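To make the headroom mechanics concrete, here is a minimal kernel-style sketch of building a transmit buffer. Only alloc_skb, skb_reserve, skb_put_data, and skb_push are real kernel API; build_tx_skb and the 128-byte headroom figure are assumptions for illustration.

#include <linux/skbuff.h>

/* Hypothetical transmit-path helper (illustrative only). */
struct sk_buff *build_tx_skb(const void *payload, unsigned int len)
{
    unsigned int headroom = 128;   /* assumed enough for all headers */
    struct sk_buff *skb = alloc_skb(headroom + len, GFP_KERNEL);
    if (!skb)
        return NULL;

    skb_reserve(skb, headroom);        /* advance data/tail: creates headroom */
    skb_put_data(skb, payload, len);   /* copy payload in, tail advances */

    /* Each lower layer would then prepend its header with skb_push(),
     * moving 'data' back into the reserved headroom, e.g.:
     *   struct tcphdr *th = skb_push(skb, sizeof(*th));
     * No reallocation or payload copy is ever needed. */
    return skb;
}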
Socket Lifecycle
// Server side
int server_fd = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in addr = {
    .sin_family = AF_INET,
    .sin_port = htons(8080),
    .sin_addr.s_addr = INADDR_ANY
};
bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
listen(server_fd, SOMAXCONN);

while (1) {
    int client_fd = accept(server_fd, NULL, NULL);
    // Handle connection on client_fd
    close(client_fd);
}

// What happens in kernel:
// socket() → Allocate struct socket, struct sock
// bind()   → Associate with local address, add to hash table
// listen() → Set up accept queue, change state to LISTEN
// accept() → Wait on accept queue, create new socket for client
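For completeness, a minimal client-side counterpart (error handling omitted; the loopback address and port are placeholders):

// Client side
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(8080)
    };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    // connect() drives the 3-way handshake: SYN → SYN+ACK → ACK
    connect(fd, (struct sockaddr *)&addr, sizeof(addr));

    const char *msg = "hello";
    send(fd, msg, strlen(msg), 0);   // data is queued in sk_write_queue
    close(fd);                       // sends FIN, socket enters FIN_WAIT_1
    return 0;
}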
TCP Implementation
TCP State Machine
┌─────────────────────────────────────────────────────────────────┐
│ TCP STATE MACHINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Client Server │
│ ──────── ───────── │
│ CLOSED CLOSED │
│ │ │ │
│ │ │ socket(), bind(), │
│ │ │ listen() │
│ │ ▼ │
│ │ LISTEN │
│ │ │ │
│ │ connect() ──────── SYN ──────────►│ │
│ │ │ │
│ ▼ ▼ │
│ SYN_SENT ◄──────── SYN+ACK ────────── SYN_RCVD │
│ │ │ │
│ │ ──────────── ACK ────────────────►│ │
│ ▼ ▼ │
│ ESTABLISHED ◄────────────────────► ESTABLISHED │
│ │ │ │
│ │ ──────────── │ │
│ │ Data exchange │ │
│ │ ◄─────────── │ │
│ │ │ │
│ │ close() ────── FIN ─────────────►│ │
│ ▼ │ │
│ FIN_WAIT_1 │ │
│ │ ◄───────────── ACK ───────────── │ │
│ ▼ ▼ │
│ FIN_WAIT_2 CLOSE_WAIT │
│ │ │ close() │
│ │ ◄───────────── FIN ────────────── │ │
│ ▼ ▼ │
│ TIME_WAIT LAST_ACK │
│ │ ──────────── ACK ────────────────►│ │
│ │ (2*MSL wait) ▼ │
│ ▼ CLOSED │
│ CLOSED │
│ │
└─────────────────────────────────────────────────────────────────┘
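These states are visible from user space: the kernel reports them in struct tcp_info via getsockopt(). A small sketch (the TCP_* state constants come from <netinet/tcp.h>):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Returns the kernel's state for a TCP socket, matching the diagram:
 * TCP_ESTABLISHED, TCP_FIN_WAIT1, TCP_TIME_WAIT, ... (or -1 on error). */
int tcp_state(int fd)
{
    struct tcp_info ti;
    socklen_t len = sizeof(ti);

    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
        return -1;
    return ti.tcpi_state;
}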
TCP Congestion Control
┌─────────────────────────────────────────────────────────────────┐
│ TCP CONGESTION CONTROL │
├─────────────────────────────────────────────────────────────────┤
│ │
│ CWND │
│ (segments) │
│ │ │
│ 64 │ ╭────────╮ │
│ │ ╭──╯ ╰───── │
│ 32 │ ╭──╯ │
│ │ ╭──╯ Congestion │
│ 16 │ ╭──╯ Avoidance │
│ │ ╭──╯ (linear) │
│ 8 │ ssthresh ╭──╯ │
│ │ ─ ─ ─ ─╯ │
│ 4 │ ╭──╯ │
│ │ ╭──╯ Slow Start │
│ 2 │ ╭──╯ (exponential) │
│ │╯ │
│ 1 ├───────────────────────────────────────────────────► Time │
│ │
│ Slow Start: CWND doubles each RTT until ssthresh │
│ Congestion Avoidance: CWND increases by 1 each RTT │
│ Loss detected: ssthresh = CWND/2, CWND = 1 (or ssthresh) │
│ │
│ Modern algorithms: │
│ • CUBIC: Default in Linux, optimized for high BDP │
│ • BBR: Google's, models bandwidth and RTT │
│ • Reno: Classic, simple AIMD │
│ • Vegas: RTT-based congestion detection │
│ │
└─────────────────────────────────────────────────────────────────┘
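The window and buffer sizes these algorithms can reach are bounded by the bandwidth-delay product (BDP): the amount of data that must be in flight to keep a path full. A worked example, assuming a 1 Gbit/s link with 50 ms RTT:

#include <stdio.h>

int main(void)
{
    double bandwidth_bps = 1e9;    /* assumed: 1 Gbit/s link */
    double rtt_s = 0.050;          /* assumed: 50 ms round-trip time */

    /* BDP = bandwidth * RTT, converted from bits to bytes */
    double bdp_bytes = bandwidth_bps / 8.0 * rtt_s;
    printf("BDP: %.2f MB\n", bdp_bytes / (1024.0 * 1024.0));   /* ~5.96 MB */

    /* Socket buffers (tcp_rmem/tcp_wmem max) must be at least this
     * large, or the window can never grow enough to fill the pipe. */
    return 0;
}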
TCP Options and Features
// View TCP congestion control algorithm
sysctl net.ipv4.tcp_congestion_control
// bbr, cubic, reno
// Key TCP parameters
sysctl net.ipv4.tcp_window_scaling // Enable window scaling
sysctl net.ipv4.tcp_sack // Selective ACK
sysctl net.ipv4.tcp_timestamps // RTT measurement
sysctl net.ipv4.tcp_fastopen // TFO (data in SYN)
sysctl net.ipv4.tcp_ecn // Explicit Congestion Notification
// Socket options
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); // Disable Nagle
setsockopt(fd, IPPROTO_TCP, TCP_CORK, &one, sizeof(one)); // Cork mode
setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one)); // Immediate ACK
setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof(one)); // Keepalive
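Applications can also read per-connection state back: the same counters that `ss -ti` shows are exposed through TCP_INFO and TCP_CONGESTION. A sketch for a connected socket fd:

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void dump_tcp_internals(int fd)
{
    struct tcp_info ti;
    socklen_t len = sizeof(ti);

    // Congestion window, smoothed RTT (us), lifetime retransmits
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
        printf("cwnd=%u rtt=%uus retrans=%u\n",
               ti.tcpi_snd_cwnd, ti.tcpi_rtt, ti.tcpi_total_retrans);

    // Congestion control algorithm in use, e.g. "cubic" or "bbr"
    char cc[16] = {0};
    socklen_t cclen = sizeof(cc);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, &cclen) == 0)
        printf("congestion control: %s\n", cc);
}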
Network Device and Drivers
Packet Reception (NAPI)
┌─────────────────────────────────────────────────────────────────┐
│ PACKET RECEPTION (NAPI) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Traditional (Interrupt per packet): │
│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │
│ │IRQ │ │IRQ │ │IRQ │ │IRQ │ ← Too many interrupts! │
│ └────┘ └────┘ └────┘ └────┘ │
│ │
│ NAPI (New API): │
│ ┌────┐ │
│ │IRQ │ ─── Disable IRQ, poll packets ─── Enable IRQ │
│ └────┘ ─────────────────────────────────────► │
│ ← Process many packets per interrupt │
│ │
│ Flow: │
│ 1. Packet arrives → NIC writes to ring buffer via DMA │
│ 2. NIC raises interrupt (if enabled) │
│ 3. Driver disables NIC interrupts │
│ 4. Driver schedules NAPI poll │
│ 5. Softirq runs napi_poll() │
│ → Process up to budget packets │
│ → For each: allocate sk_buff, pass up stack │
│ 6. If more packets, reschedule; else enable interrupts │
│ │
│ Ring Buffer: │
│ ┌────┬────┬────┬────┬────┬────┬────┬────┐ │
│ │ DX │ D │ D │ D │ │ │ │ │ │
│ └────┴────┴────┴────┴────┴────┴────┴────┘ │
│ ↑ ↑ ↑ │
│ Producer Consumer Size │
│ (NIC) (Driver) │
│ │
└─────────────────────────────────────────────────────────────────┘
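The driver side of this flow follows a standard pattern. A skeleton sketch (my_priv, my_rx_one, and the IRQ mask helpers are placeholders, and netif_napi_add() registration is omitted; real drivers differ in detail):

#include <linux/netdevice.h>
#include <linux/interrupt.h>

/* Hypothetical driver-private state for this sketch. */
struct my_priv {
    struct napi_struct napi;
    /* ring buffer pointers, device registers, ... */
};

/* Placeholder hardware accessors (real drivers poke NIC registers). */
static void my_disable_rx_irq(struct my_priv *priv) { }
static void my_enable_rx_irq(struct my_priv *priv) { }
static bool my_rx_one(struct my_priv *priv) { return false; }

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    struct my_priv *priv = dev_id;

    /* Steps 3-4: mask RX interrupts, defer the work to softirq */
    my_disable_rx_irq(priv);
    napi_schedule(&priv->napi);
    return IRQ_HANDLED;
}

static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_priv *priv = container_of(napi, struct my_priv, napi);
    int work_done = 0;

    /* Step 5: drain up to 'budget' packets from the RX ring */
    while (work_done < budget && my_rx_one(priv))
        work_done++;

    /* Step 6: ring empty before budget exhausted → stop polling and
     * re-enable interrupts; otherwise NAPI will poll again */
    if (work_done < budget && napi_complete_done(napi, work_done))
        my_enable_rx_irq(priv);

    return work_done;
}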
Packet Transmission
┌─────────────────────────────────────────────────────────────────┐
│ PACKET TRANSMISSION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Application: send(sockfd, data, len, 0) │
│ │ │
│ ▼ │
│ Socket Layer: Copy to kernel, create sk_buff │
│ │ │
│ ▼ │
│ TCP: Add TCP header, queue in sk_write_queue │
│ Segment if needed, calculate checksum │
│ │ │
│ ▼ │
│ IP: Add IP header, route lookup │
│ Fragment if MTU exceeded │
│ │ │
│ ▼ │
│ Netfilter: iptables rules (OUTPUT, POSTROUTING chains) │
│ │ │
│ ▼ │
│ Neighbor (ARP): Resolve MAC address │
│ │ │
│ ▼ │
│ Device: Add Ethernet header │
│ Queue in qdisc (traffic control) │
│ │ │
│ ▼ │
│ Driver: DMA to NIC ring buffer │
│ Trigger transmission │
│ │ │
│ ▼ │
│ NIC: Transmit on wire │
│ │
└─────────────────────────────────────────────────────────────────┘
Netfilter and iptables
Netfilter Hooks
┌─────────────────────────────────────────────────────────────────┐
│ NETFILTER HOOKS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ ────────────│ PREROUTING │──────────── │
│ │ └──────┬──────┘ │ │
│ │ │ │ │
│ │ ┌──────▼──────┐ │ │
│ │ │ Routing │ │ │
│ │ │ Decision │ │ │
│ │ └──────┬──────┘ │ │
│ │ │ │ │
│ │ Local? ╱╲ Forward? │ │
│ │ ◄──────╱ ╲──────► │ │
│ │ ╱ ╲ │ │
│ │ ┌─────────▼─┐ ┌─▼─────────┐ │ │
│ │ │ INPUT │ │ FORWARD │ │ │
│ │ └─────┬─────┘ └─────┬─────┘ │ │
│ │ │ │ │ │
│ │ ┌─────▼─────┐ │ │ │
│ │ │ Local │ │ │ │
│ │ │ Process │ │ │ │
│ │ └─────┬─────┘ │ │ │
│ │ │ │ │ │
│ │ ┌─────▼─────┐ │ │ │
│ │ │ OUTPUT │ │ │ │
│ │ └─────┬─────┘ │ │ │
│ │ │ │ │ │
│ │ ┌─────▼──────────────▼─────┐ │ │
│ │ │ POSTROUTING │ │ │
│ └────┤ ├─────┘ │
│ └──────────────────────────┘ │
│ │ │
│ ▼ │
│ Network Out │
│ │
│ Tables at each hook: │
│ • raw: Connection tracking bypass │
│ • mangle: Packet modification │
│ • nat: Network Address Translation │
│ • filter: Accept/drop decisions │
│ │
└─────────────────────────────────────────────────────────────────┘
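Kernel modules can attach to the same hooks that iptables uses. A minimal sketch that counts packets at PREROUTING (assumes the per-namespace registration API of modern kernels; the counter is deliberately not SMP-safe, for brevity):

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>
#include <net/net_namespace.h>

static unsigned long pkt_count;

/* Runs at the same PREROUTING hook the raw/mangle/nat tables use. */
static unsigned int count_hook(void *priv, struct sk_buff *skb,
                               const struct nf_hook_state *state)
{
    pkt_count++;
    return NF_ACCEPT;    /* NF_DROP would silently discard the packet */
}

static struct nf_hook_ops count_ops = {
    .hook     = count_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};

static int __init count_init(void)
{
    return nf_register_net_hook(&init_net, &count_ops);
}

static void __exit count_exit(void)
{
    nf_unregister_net_hook(&init_net, &count_ops);
    pr_info("saw %lu packets\n", pkt_count);
}

module_init(count_init);
module_exit(count_exit);
MODULE_LICENSE("GPL");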
iptables Examples
# View all rules
iptables -L -n -v
# Allow established connections
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Allow SSH
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# Drop all other incoming
iptables -A INPUT -j DROP
# NAT (masquerade)
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# Port forwarding
iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination 192.168.1.100:8080
iptables -t nat -A POSTROUTING -j MASQUERADE
High-Performance Networking
Kernel Bypass Techniques
┌─────────────────────────────────────────────────────────────────┐
│ KERNEL BYPASS OPTIONS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Standard Path: Kernel Bypass: │
│ Application Application │
│ │ │ │
│ ▼ │ │
│ System Call │ │
│ │ │ │
│ ▼ │ │
│ Socket Layer │ │
│ │ │ mmap to userspace │
│ ▼ │ │
│ TCP/IP Stack │ │
│ │ │ │
│ ▼ │ │
│ Network Driver │ │
│ │ ▼ │
│ ▼ ┌───────────┐ │
│ ┌───────┐ │DPDK/AF_XDP│ │
│ │ NIC │ │ Userspace │ │
│ └───────┘ │ Driver │ │
│ └─────┬─────┘ │
│ │ │
│ ┌─────▼─────┐ │
│ │ NIC │ │
│ └───────────┘ │
│ │
│ Techniques: │
│ • DPDK: Full userspace networking │
│ • AF_XDP: XDP sockets for userspace │
│ • io_uring: Async I/O with zero-copy │
│ • SR-IOV: Hardware NIC virtualization │
│ │
└─────────────────────────────────────────────────────────────────┘
XDP (eXpress Data Path)
// XDP runs before sk_buff allocation
// Decisions: XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    // Verifier requires explicit bounds checks before each access
    struct ethhdr *eth = data;
    if ((void *)eth + sizeof(*eth) > data_end)
        return XDP_PASS;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = data + sizeof(*eth);
    if ((void *)ip + sizeof(*ip) > data_end)
        return XDP_PASS;

    // Drop ICMP packets
    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;

    return XDP_PASS;
}
char _license[] SEC("license") = "GPL";
# Load XDP program
ip link set dev eth0 xdpgeneric obj xdp_drop.o sec xdp
# View XDP stats
ip link show dev eth0
# Remove XDP program
ip link set dev eth0 xdpgeneric off
Network Tuning
Key Parameters
# Increase socket buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Connection tracking table size
sysctl -w net.netfilter.nf_conntrack_max=1000000
# TIME_WAIT tuning
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=30
# SYN flood protection
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# Enable BBR congestion control
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
# Increase local port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# Increase connection queue
sysctl -w net.core.somaxconn=65535
# Cap the number of TIME_WAIT sockets
sysctl -w net.ipv4.tcp_max_tw_buckets=1440000
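Per-socket requests interact with the net.core limits above: a setsockopt(SO_RCVBUF) request is silently clamped to rmem_max, and the kernel doubles the accepted value to account for its own bookkeeping overhead (see socket(7)). A small demonstration sketch:

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int req = 32 * 1024 * 1024;   /* 32 MB: above the rmem_max default */

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));

    int got;
    socklen_t len = sizeof(got);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);

    /* 'got' is roughly 2 * min(req, rmem_max): raising rmem_max is a
     * prerequisite for large per-socket buffers. */
    printf("requested %d, effective %d\n", req, got);
    close(fd);
    return 0;
}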
NIC Tuning
# Ring buffer size
ethtool -g eth0 # View current
ethtool -G eth0 rx 4096 # Set RX ring size
# Interrupt coalescing
ethtool -c eth0 # View
ethtool -C eth0 rx-usecs 100 rx-frames 64
# Offload features
ethtool -k eth0 # View offloads
ethtool -K eth0 tso on # Enable TCP segmentation offload
ethtool -K eth0 gso on # Generic segmentation offload
ethtool -K eth0 gro on # Generic receive offload
ethtool -K eth0 tx-checksum-ipv4 on
# Multi-queue / RSS
ethtool -l eth0 # View channels
ethtool -L eth0 combined 8 # Set to 8 channels
# CPU affinity for interrupts
# Set IRQ affinity to specific CPUs
echo 2 > /proc/irq/24/smp_affinity
Debugging Network Issues
Common Tools
# Socket statistics
ss -tunapl # TCP/UDP, all, numeric, listening, processes
ss -s # Summary statistics
ss -t state time-wait # TIME_WAIT sockets
# Network statistics
netstat -s # Protocol statistics
nstat -az # Detailed counters
# Connection tracking
conntrack -L # List tracked connections
conntrack -S # Statistics
# Packet capture
tcpdump -i eth0 -nn host 192.168.1.1
tcpdump -i any port 443 -w capture.pcap
# Network path
traceroute 8.8.8.8
mtr 8.8.8.8 # Continuous traceroute
# DNS
dig example.com
host example.com
# TCP connection test
telnet host 80
nc -zv host 80
Performance Analysis
# Monitor network throughput
sar -n DEV 1 # Device stats every second
sar -n SOCK 1 # Socket stats
# Per-process / per-connection network I/O
iftop # Bandwidth per connection
nethogs # Bandwidth per process
# Socket buffer usage
cat /proc/net/sockstat
cat /proc/net/sockstat6
# TCP metrics
ss -ti # TCP internal info
# Shows: cwnd, rtt, retransmits, etc.
# Dropped packets
cat /proc/net/softnet_stat
# Column 1: packets processed
# Column 2: drops (queue full)
# Column 3: time squeeze (ran out of CPU)
Interview Questions
What happens when you type 'curl google.com'?
Answer (network stack focus):
1. DNS Resolution:
   - Check /etc/hosts, then /etc/resolv.conf
   - Send UDP query to DNS server
   - Receive IP address
2. TCP Connection:
   - Create socket: socket(AF_INET, SOCK_STREAM, 0)
   - connect() initiates 3-way handshake
   - SYN → SYN+ACK → ACK
3. HTTP Request:
   - send() data to socket
   - Kernel: socket → TCP (segment) → IP (route) → Device
   - NIC transmits frames
4. Response:
   - NIC receives, DMA to ring buffer
   - NAPI poll, create sk_buff
   - IP → TCP (reassemble) → Socket queue
   - Application recv()
5. Connection Close:
   - FIN → ACK, FIN → ACK
   - TIME_WAIT for 2*MSL
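The resolution and connection steps condense into a few library and system calls; a sketch (error handling omitted, and HTTPS/TLS ignored):

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    // DNS resolution: consults /etc/hosts and the resolvers in
    // /etc/resolv.conf, typically via a UDP query on port 53
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM };
    struct addrinfo *res;
    getaddrinfo("google.com", "80", &hints, &res);

    // TCP connection: connect() performs the 3-way handshake
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    connect(fd, res->ai_addr, res->ai_addrlen);

    // HTTP request: data flows socket → TCP → IP → driver → NIC
    const char *req = "GET / HTTP/1.1\r\nHost: google.com\r\n\r\n";
    send(fd, req, strlen(req), 0);

    char buf[4096];
    recv(fd, buf, sizeof(buf), 0);   // response climbed the RX stack

    close(fd);                       // FIN/ACK exchange, then TIME_WAIT
    freeaddrinfo(res);
    return 0;
}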
Explain TCP vs UDP
TCP:
- Connection-oriented (3-way handshake)
- Reliable (ACKs, retransmission)
- Ordered delivery (sequence numbers)
- Flow control (sliding window)
- Congestion control (CUBIC, BBR)
- Use: HTTP, SSH, databases
UDP:
- Connectionless
- Unreliable (no ACKs)
- Unordered
- No flow/congestion control
- Lower latency, less overhead
- Use: DNS, gaming, streaming, VoIP
Why does TIME_WAIT exist?
Two reasons:
- Ensure final ACK arrives: If the last ACK is lost, the peer will retransmit FIN. We need to be around to ACK it.
- Prevent old packets: Old delayed packets from this connection shouldn’t be interpreted as new connection data. 2*MSL ensures all old packets expire.
Mitigations:
- tcp_tw_reuse: Reuse TIME_WAIT sockets for outgoing connections
- SO_REUSEADDR: Allows bind() to a port with sockets still in TIME_WAIT (see the sketch below)
- This matters because short-lived connections at scale can exhaust ephemeral ports
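A sketch of the SO_REUSEADDR mitigation for servers; the option must be set before bind():

#include <netinet/in.h>
#include <sys/socket.h>

/* Allow bind() to succeed even while old connections on this
 * port linger in TIME_WAIT (e.g. after a server restart). */
int listen_on(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(port),
        .sin_addr.s_addr = INADDR_ANY,
    };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;
}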
How would you debug slow network performance?
Systematic approach:
1. Check basics: ping, traceroute. Is the network reachable at all?
2. Bandwidth: iperf3 to test raw throughput
3. Latency: ping, ss -ti for RTT. Is latency high?
4. Packet loss: ping, mtr. Are packets being dropped?
5. TCP internals: ss -ti to check cwnd and retransmits
   - Small cwnd = congestion or high latency
   - Retransmits = loss
6. System limits:
   - cat /proc/net/sockstat: socket memory
   - sysctl net.ipv4.tcp_*: buffer sizes
   - ulimit -n: file descriptors
7. NIC issues:
   - ethtool -S eth0: driver statistics
   - Check ring buffer drops
Summary
┌─────────────────────────────────────────────────────────────────┐
│ NETWORKING SUBSYSTEM SUMMARY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Key Concepts: │
│ • Socket API: Interface between apps and kernel │
│ • sk_buff: Kernel's packet representation │
│ • TCP state machine: SYN→ESTABLISHED→FIN→TIME_WAIT │
│ • Congestion control: CUBIC (default), BBR (Google) │
│ │
│ Performance: │
│ • NAPI: Batch interrupt processing │
│ • XDP: Pre-stack packet processing with eBPF │
│ • DPDK/AF_XDP: Kernel bypass for high performance │
│ • Tune: Buffer sizes, congestion control, offloads │
│ │
│ Debugging: │
│ • ss: Socket statistics │
│ • tcpdump: Packet capture │
│ • netstat -s: Protocol statistics │
│ • ethtool: NIC configuration and stats │
│ │
└─────────────────────────────────────────────────────────────────┘