Networking Protocols for System Design

Why Networking Matters

Every distributed system communicates over networks. Understanding networking is crucial for:
  • Latency optimization - Where does delay come from?
  • Protocol selection - HTTP vs WebSocket vs gRPC
  • Debugging issues - Why is my API slow?
  • Security design - TLS, firewalls, VPNs
Think of the network as the road system connecting buildings in a city. TCP is like certified mail: guaranteed delivery, in order, with a receipt. UDP is like shouting across the street: fast, with no guarantee they heard you, but good enough for many situations. HTTP is like a form you fill out at a government office window: you submit a request, wait, and get a response. WebSockets are like a phone call: once connected, both sides talk freely. You need to understand networking for system design because the road system is often the bottleneck. The fastest database in the world does little good if the network between your service and the database adds 200ms of latency to every query.
Practical tip: When debugging slow API calls, always measure where time is actually spent. Use the decomposition: DNS lookup + TCP handshake + TLS handshake + time-to-first-byte + transfer time. In most system design interviews, the latency bottleneck is not the protocol — it is geography (speed of light through fiber) or serialization. A senior engineer asks “where are the servers relative to the users?” before optimizing the protocol.

The OSI Model (Simplified)

┌────────────────────────────────────────────────────────────────┐
│                      OSI Model (7 Layers)                       │
├─────────┬──────────────────────────────────────────────────────┤
│ Layer 7 │ Application  │ HTTP, HTTPS, WebSocket, gRPC, DNS    │
├─────────┼──────────────────────────────────────────────────────┤
│ Layer 6 │ Presentation │ SSL/TLS, Encryption, Compression     │
├─────────┼──────────────────────────────────────────────────────┤
│ Layer 5 │ Session      │ Session management, Authentication   │
├─────────┼──────────────────────────────────────────────────────┤
│ Layer 4 │ Transport    │ TCP, UDP                              │
├─────────┼──────────────────────────────────────────────────────┤
│ Layer 3 │ Network      │ IP, Routing                           │
├─────────┼──────────────────────────────────────────────────────┤
│ Layer 2 │ Data Link    │ Ethernet, MAC addresses              │
├─────────┼──────────────────────────────────────────────────────┤
│ Layer 1 │ Physical     │ Cables, Radio waves                  │
└─────────┴──────────────────────────────────────────────────────┘

For system design, focus on Layers 4 and 7

DNS (Domain Name System)

DNS translates human-readable domain names to IP addresses.

DNS Resolution Flow

User types: www.example.com


    ┌──────────────────┐
    │  Browser Cache   │ ← Check local cache first
    └────────┬─────────┘
             │ Cache miss

    ┌──────────────────┐
    │    OS Cache      │ ← Check OS DNS cache
    └────────┬─────────┘
             │ Cache miss

    ┌──────────────────┐
    │ Recursive DNS    │ ← ISP's DNS resolver
    │    Resolver      │
    └────────┬─────────┘

   ┌─────────┴─────────┐
   ▼                   ▼
┌──────┐         ┌──────────┐         ┌──────────┐
│ Root │────────►│   TLD    │────────►│Authorit- │
│Server│         │(.com,.io)│         │ative DNS │
└──────┘         └──────────┘         └──────────┘
   │                  │                    │
   │ "Ask .com TLD"   │ "Ask ns.example"  │ "IP: 93.184.216.34"
   └──────────────────┴────────────────────┘
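
The referral chain in the diagram can be sketched as a toy lookup. This is a simulation over hard-coded, hypothetical zone data (the `ROOT`, `TLD`, and `AUTHORITATIVE` tables and the `resolve` function are illustrative names), not a real DNS client:

```python
# Toy model of iterative resolution: each tier only knows who to ask next.
# All data below is hypothetical and mirrors the diagram, not live DNS.
ROOT = {"com": "tld-com"}                          # root -> TLD server
TLD = {"tld-com": {"example.com": "ns.example"}}   # TLD -> authoritative server
AUTHORITATIVE = {"ns.example": {"www.example.com": "93.184.216.34"}}


def resolve(hostname: str) -> tuple[str, list[str]]:
    """Return (ip, referral_trail) for hostname using the toy tables."""
    labels = hostname.split(".")
    tld_server = ROOT[labels[-1]]                  # "Ask .com TLD"
    zone = ".".join(labels[-2:])                   # e.g. example.com
    auth_server = TLD[tld_server][zone]            # "Ask ns.example"
    ip = AUTHORITATIVE[auth_server][hostname]      # final answer
    return ip, ["root", tld_server, auth_server]


ip, trail = resolve("www.example.com")
print(ip, trail)
```

In a real lookup the recursive resolver performs these three queries on the client's behalf and caches each answer, so most requests never reach the root servers.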

DNS Record Types

| Record | Purpose | Example |
|--------|---------|---------|
| A | Maps domain to IPv4 | example.com → 93.184.216.34 |
| AAAA | Maps domain to IPv6 | example.com → 2606:2800:220:1:... |
| CNAME | Alias to another domain | www.example.com → example.com |
| MX | Mail server | example.com → mail.example.com |
| TXT | Text data (verification) | SPF, DKIM records |
| NS | Nameserver | example.com → ns1.provider.com |

DNS in System Design

┌─────────────────────────────────────────────────────────────────┐
│                    DNS-Based Load Balancing                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   DNS Query: api.example.com                                    │
│        │                                                        │
│        ▼                                                        │
│   ┌──────────────────────────────────────┐                     │
│   │         DNS Round Robin               │                     │
│   │   Returns different IPs each time     │                     │
│   └──────────────────────────────────────┘                     │
│        │                                                        │
│   ┌────┴────┬────────────────┐                                 │
│   ▼         ▼                ▼                                 │
│ Server 1  Server 2        Server 3                             │
│ 1.1.1.1   2.2.2.2         3.3.3.3                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Pros: Simple, no extra infrastructure
Cons: No health checks, TTL caching delays failover
DNS TTL Matters: Low TTL (60s) = faster failover, more DNS queries. High TTL (3600s) = better caching, slower failover. Common: 300s (5 min)
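
The failover trade-off can be made concrete with a small TTL cache. This is a minimal sketch, not how any particular resolver is implemented; `TtlDnsCache` and the injectable `clock` parameter are hypothetical names chosen so that expiry can be shown without sleeping:

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlDnsCache:
    """Minimal TTL cache: entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the demo can control time
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, name: str, ip: str) -> None:
        self._entries[name] = (ip, self.clock() + self.ttl)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        ip, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[name]  # expired: caller must re-resolve
            return None
        return ip


# Fake clock: a cached IP keeps being served until the TTL has passed,
# even if that server went down one second after it was cached.
now = [0.0]
cache = TtlDnsCache(ttl_seconds=300, clock=lambda: now[0])
cache.put("api.example.com", "1.1.1.1")
now[0] = 299.0
still_cached = cache.get("api.example.com")
now[0] = 300.0
after_expiry = cache.get("api.example.com")
```

With a 300s TTL, clients can keep hitting a dead server for up to five minutes, which is the slow-failover cost listed under Cons.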

TCP vs UDP

TCP (Transmission Control Protocol)

┌─────────────────────────────────────────────────────────────────┐
│                    TCP Three-Way Handshake                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│     Client                              Server                  │
│        │                                   │                    │
│        │──────── SYN (seq=x) ─────────────►│                   │
│        │                                   │                    │
│        │◄────── SYN-ACK (seq=y, ack=x+1) ──│                   │
│        │                                   │                    │
│        │──────── ACK (ack=y+1) ────────────►│                   │
│        │                                   │                    │
│        │◄═══════ Connection Established ═══►│                   │
│        │                                   │                    │
│        │←───────── Data Transfer ──────────→│                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
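
The sequence and acknowledgment numbers in the handshake follow a simple rule: each side acknowledges the other's initial sequence number plus one. A small sketch of that bookkeeping (`three_way_handshake` and the `Segment` tuple are hypothetical names, and 32-bit wraparound is ignored):

```python
from typing import List, Tuple

Segment = Tuple[str, str, int, int]  # (sender, flags, seq, ack); ack 0 means "not set"


def three_way_handshake(client_isn: int, server_isn: int) -> List[Segment]:
    """Return the three segments exchanged, with seq/ack values as in the diagram."""
    return [
        ("client", "SYN", client_isn, 0),
        ("server", "SYN-ACK", server_isn, client_isn + 1),
        # The client's SYN consumed one sequence number, so its ACK carries seq = x + 1.
        ("client", "ACK", client_isn + 1, server_isn + 1),
    ]


for sender, flags, seq, ack in three_way_handshake(client_isn=1000, server_isn=5000):
    print(f"{sender:>6} {flags:<8} seq={seq} ack={ack}")
```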

TCP vs UDP Comparison

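| Feature | TCP | UDP |
|---------|-----|-----|
| Connection | Connection-oriented | Connectionless |
| Reliability | Guaranteed delivery | Best effort |
| Ordering | In-order delivery | No ordering |
| Speed | Slower (handshake, acknowledgments, retransmissions) | Faster (no connection setup or delivery tracking) |
| Use Case | HTTP, databases | Live video/audio, gaming, DNS |
| Header Size | 20-60 bytes | 8 bytes |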

When to Use What

Use TCP

  • Web applications (HTTP/HTTPS)
  • File transfers
  • Database connections
  • Email (SMTP, IMAP)
  • When data integrity matters

Use UDP

  • Live video/audio streaming
  • Online gaming
  • DNS queries
  • IoT sensors
  • When speed > reliability
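
To show what "connectionless" means in practice, here is a minimal sketch of a UDP exchange over loopback using Python's standard `socket` module. The payload and the 2-second timeout are arbitrary choices for the example:

```python
import socket

# Receiver: bind to an ephemeral port on loopback.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)  # UDP gives no delivery guarantee, so never block forever
address = receiver.getsockname()

# Sender: no connect(), no handshake; the first packet already carries data.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"sensor-reading:21.5", address)

payload, _ = receiver.recvfrom(1024)
print(payload)

sender.close()
receiver.close()
```

On a real network the datagram could be lost, duplicated, or reordered, and the application would have to detect that itself. That is the work TCP does for you.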

HTTP/HTTPS

HTTP Request/Response

┌─────────────────────────────────────────────────────────────────┐
│                    HTTP Request                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  GET /api/users/123 HTTP/1.1                                   │
│  Host: api.example.com                                          │
│  Authorization: Bearer eyJhbGciOiJIUzI1...                     │
│  Content-Type: application/json                                 │
│  Accept: application/json                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    HTTP Response                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  HTTP/1.1 200 OK                                                │
│  Content-Type: application/json                                 │
│  Cache-Control: max-age=3600                                    │
│  X-RateLimit-Remaining: 99                                      │
│                                                                 │
│  {                                                              │
│    "id": 123,                                                   │
│    "name": "John Doe",                                          │
│    "email": "john@example.com"                                  │
│  }                                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
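
On the wire, the response above is plain text: a status line, header lines, a blank line, then the body. A minimal sketch of splitting it apart (`parse_http_response` is a hypothetical helper for illustration; real clients should use an HTTP library):

```python
from typing import Dict, Tuple


def parse_http_response(raw: str) -> Tuple[int, Dict[str, str], str]:
    """Split a raw HTTP/1.1 response into (status_code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")   # blank line separates headers from body
    status_line, *header_lines = head.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(code), headers, body


raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "Cache-Control: max-age=3600\r\n"
    "\r\n"
    '{"id": 123, "name": "John Doe"}'
)
status, headers, body = parse_http_response(raw)
```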

HTTP Methods

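| Method | Purpose | Idempotent | Safe |
|--------|---------|------------|------|
| GET | Retrieve resource | Yes | Yes |
| POST | Create resource | No | No |
| PUT | Replace resource | Yes | No |
| PATCH | Partial update | No | No |
| DELETE | Remove resource | Yes | No |
| HEAD | Get headers only | Yes | Yes |
| OPTIONS | Get allowed methods | Yes | Yes |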

HTTP Status Codes

┌───────────────────────────────────────────────────────────────┐
│                    HTTP Status Codes                          │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  1xx Informational  │  100 Continue, 101 Switching Protocols │
│  ───────────────────┼───────────────────────────────────────│
│  2xx Success        │  200 OK                                │
│                     │  201 Created                           │
│                     │  204 No Content                        │
│  ───────────────────┼───────────────────────────────────────│
│  3xx Redirection    │  301 Moved Permanently                 │
│                     │  302 Found (temporary)                 │
│                     │  304 Not Modified                      │
│  ───────────────────┼───────────────────────────────────────│
│  4xx Client Error   │  400 Bad Request                       │
│                     │  401 Unauthorized                      │
│                     │  403 Forbidden                         │
│                     │  404 Not Found                         │
│                     │  429 Too Many Requests                 │
│  ───────────────────┼───────────────────────────────────────│
│  5xx Server Error   │  500 Internal Server Error             │
│                     │  502 Bad Gateway                       │
│                     │  503 Service Unavailable               │
│                     │  504 Gateway Timeout                   │
│                                                               │
└───────────────────────────────────────────────────────────────┘
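
Idempotency and status codes come together when a client decides whether to retry. The policy below is a hypothetical sketch, not a standard: it assumes 429, 502, 503, and 504 are transient, and it only retries methods marked idempotent in the HTTP Methods table:

```python
IDEMPOTENT_METHODS = {"GET", "PUT", "DELETE", "HEAD", "OPTIONS"}  # from the methods table
TRANSIENT_STATUSES = {429, 502, 503, 504}  # assumption: treated as worth retrying


def should_retry(method: str, status: int) -> bool:
    """Hypothetical client policy: retry transient failures of idempotent requests only."""
    return method.upper() in IDEMPOTENT_METHODS and status in TRANSIENT_STATUSES


print(should_retry("GET", 503), should_retry("POST", 503), should_retry("GET", 404))
```

Retrying a POST blindly can create the resource twice, which is why non-idempotent calls usually need an idempotency key before they can be retried safely.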

HTTP/1.1 vs HTTP/2 vs HTTP/3

HTTP/1.1                    HTTP/2                     HTTP/3
┌─────────────────┐        ┌─────────────────┐       ┌─────────────────┐
│ Request 1       │        │ ┌───┬───┬───┐   │       │ ┌───┬───┬───┐   │
│ ─────────────── │        │ │ 1 │ 2 │ 3 │   │       │ │ 1 │ 2 │ 3 │   │
│ Response 1      │        │ └───┴───┴───┘   │       │ └───┴───┴───┘   │
│ ═══════════════ │        │  Multiplexed    │       │  Over QUIC     │
│ Request 2       │        │  Binary frames  │       │  (UDP based)    │
│ ─────────────── │        │  on single TCP  │       │  0-RTT resume   │
│ Response 2      │        │                 │       │                 │
│ ═══════════════ │        │ Server Push     │       │ No head-of-line │
│ (Sequential)    │        │ Header Compress │       │  blocking       │
└─────────────────┘        └─────────────────┘       └─────────────────┘

6 connections max          Single connection         Faster, resilient
Head-of-line blocking      per origin                to network changes

HTTPS/TLS Handshake

┌─────────────────────────────────────────────────────────────────┐
│                    TLS 1.3 Handshake                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Client                                         Server          │
│     │                                              │            │
│     │──── ClientHello + Key Share ────────────────►│           │
│     │     (Supported ciphers, client random)       │            │
│     │                                              │            │
│     │◄─── ServerHello + Key Share + Certificate ──│           │
│     │     (Selected cipher, server random, cert)   │            │
│     │                                              │            │
│     │     [Both compute shared secret]             │            │
│     │                                              │            │
│     │◄════════ Encrypted Application Data ════════►│           │
│     │                                              │            │
│                                                                 │
│  TLS 1.3: 1-RTT handshake (down from 2-RTT in TLS 1.2)        │
│  0-RTT resumption for returning clients                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
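
A client can insist on the 1-RTT handshake by refusing older protocol versions. A minimal sketch using Python's standard `ssl` module (it assumes the interpreter is linked against an OpenSSL build with TLS 1.3 support):

```python
import ssl

# Client-side context with certificate and hostname verification enabled by default.
context = ssl.create_default_context()
# Refuse anything older than TLS 1.3, so every successful connection uses the 1-RTT handshake.
context.minimum_version = ssl.TLSVersion.TLSv1_3

print(context.minimum_version.name)
```

The trade-off is compatibility: servers that only speak TLS 1.2 will fail the handshake instead of falling back.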

WebSockets

WebSocket vs HTTP

HTTP (Request-Response)              WebSocket (Bidirectional)

Client          Server               Client          Server
   │──── GET ─────►│                    │──── Upgrade ──►│
   │◄─── 200 ──────│                    │◄─── 101 ───────│
   │               │                    │                │
   │──── GET ─────►│                    │◄══════════════►│
   │◄─── 200 ──────│                    │  Full-duplex   │
   │               │                    │  connection    │
   │ (Poll again)  │                    │                │
   │──── GET ─────►│                    │◄══════════════►│
   │◄─── 200 ──────│                    │                │

Each poll = full HTTP      Single persistent connection
request/response overhead  Low latency, real-time

WebSocket Use Cases

Real-time Chat

WhatsApp, Slack, Discord - instant message delivery

Live Updates

Stock prices, sports scores, notifications

Gaming

Multiplayer games, real-time player positions

Collaboration

Google Docs, Figma - live editing

WebSocket Scaling Challenge

┌─────────────────────────────────────────────────────────────────┐
│                WebSocket Scaling with Pub/Sub                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│     User A                                       User B         │
│        │                                            │           │
│        │ WS                                      WS │           │
│        ▼                                            ▼           │
│   ┌─────────┐                                  ┌─────────┐     │
│   │Server 1 │                                  │Server 2 │     │
│   └────┬────┘                                  └────┬────┘     │
│        │                                            │           │
│        └──────────────┬─────────────────────────────┘           │
│                       │                                         │
│                ┌──────▼──────┐                                 │
│                │    Redis    │  Pub/Sub for cross-server       │
│                │   Pub/Sub   │  message broadcasting           │
│                └─────────────┘                                 │
│                                                                 │
│  Problem: User A on Server 1 messages User B on Server 2       │
│  Solution: Publish to Redis, all servers subscribe             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

WebSocket Implementation

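A sketch of a WebSocket server with connection management and Redis pub/sub fan-out (Python, FastAPI):
import asyncio
import json
from typing import Dict, Set, Optional, Any
from dataclasses import dataclass, field
from datetime import datetime
import uuid
# The standalone aioredis package was merged into redis-py; the alias keeps
# the aioredis.* names used below working
import redis.asyncio as aioredis
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from contextlib import asynccontextmanager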

@dataclass
class Connection:
    """Represents a WebSocket connection"""
    websocket: WebSocket
    user_id: str
    connection_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    connected_at: datetime = field(default_factory=datetime.utcnow)
    subscriptions: Set[str] = field(default_factory=set)
    metadata: Dict[str, Any] = field(default_factory=dict)

class ConnectionManager:
    """
    Production WebSocket connection manager with Redis pub/sub
    for multi-server deployments
    """
    
    def __init__(self, redis_url: str = "redis://localhost"):
        self.redis_url = redis_url
        self.connections: Dict[str, Connection] = {}
        self.user_connections: Dict[str, Set[str]] = {}
        self.room_connections: Dict[str, Set[str]] = {}
        self.redis: Optional[aioredis.Redis] = None
        self.pubsub: Optional[aioredis.client.PubSub] = None
    
    async def initialize(self) -> None:
        """Initialize Redis connection for pub/sub"""
        self.redis = await aioredis.from_url(self.redis_url)
        self.pubsub = self.redis.pubsub()
        
        # Subscribe to one channel before listening: listen() only yields
        # while at least one subscription exists, so with none the listener
        # task would end (or fail) immediately.
        await self.pubsub.subscribe("presence")
        
        # Start listening for broadcast messages; keep a reference so the
        # task is not garbage-collected while it runs
        self._listener_task = asyncio.create_task(self._redis_listener())
    
    async def connect(
        self, 
        websocket: WebSocket, 
        user_id: str,
        metadata: Optional[Dict] = None
    ) -> Connection:
        """Accept and register a new WebSocket connection"""
        await websocket.accept()
        
        connection = Connection(
            websocket=websocket,
            user_id=user_id,
            metadata=metadata or {}
        )
        
        # Register connection
        self.connections[connection.connection_id] = connection
        
        if user_id not in self.user_connections:
            self.user_connections[user_id] = set()
            # First local connection for this user: subscribe so direct
            # messages published by any server reach this one
            await self.pubsub.subscribe(f"user:{user_id}")
        self.user_connections[user_id].add(connection.connection_id)
        
        # Notify user came online
        await self._publish_presence(user_id, "online")
        
        return connection
    
    async def disconnect(self, connection: Connection) -> None:
        """Clean up a disconnected WebSocket"""
        conn_id = connection.connection_id
        user_id = connection.user_id
        
        # Remove from rooms (iterate over a copy: leave_room mutates the set)
        for room in list(connection.subscriptions):
            await self.leave_room(connection, room)
        
        # Remove from user connections
        if user_id in self.user_connections:
            self.user_connections[user_id].discard(conn_id)
            
            # If no more connections, user is offline
            if not self.user_connections[user_id]:
                del self.user_connections[user_id]
                await self.pubsub.unsubscribe(f"user:{user_id}")
                await self._publish_presence(user_id, "offline")
        
        # Remove connection (pop tolerates a repeated disconnect)
        self.connections.pop(conn_id, None)
    
    async def join_room(self, connection: Connection, room: str) -> None:
        """Subscribe connection to a room"""
        connection.subscriptions.add(room)
        
        if room not in self.room_connections:
            self.room_connections[room] = set()
            # Subscribe to Redis channel for this room
            await self.pubsub.subscribe(f"room:{room}")
        
        self.room_connections[room].add(connection.connection_id)
    
    async def leave_room(self, connection: Connection, room: str) -> None:
        """Unsubscribe connection from a room"""
        connection.subscriptions.discard(room)
        
        if room in self.room_connections:
            self.room_connections[room].discard(connection.connection_id)
            # Last local member left: drop the room and its Redis subscription
            if not self.room_connections[room]:
                del self.room_connections[room]
                await self.pubsub.unsubscribe(f"room:{room}")
    
    async def send_to_user(self, user_id: str, message: Dict) -> int:
        """Send message to all connections of a user, on any server.
        
        Returns the number of servers subscribed to the user's channel.
        """
        # Publish only. Every server holding a connection for this user
        # (including this one) is subscribed to the channel and delivers
        # locally in _redis_listener. Also sending locally here would
        # deliver the message twice.
        return await self.redis.publish(
            f"user:{user_id}",
            json.dumps(message)
        )
    
    async def broadcast_to_room(self, room: str, message: Dict) -> int:
        """Broadcast message to all connections in a room, on any server.
        
        Returns the number of servers subscribed to the room's channel.
        """
        # Publish only, for the same reason as send_to_user: local members
        # receive the message through this server's own subscription.
        return await self.redis.publish(
            f"room:{room}",
            json.dumps(message)
        )
    
    async def _redis_listener(self) -> None:
        """Listen for messages from other servers"""
        async for message in self.pubsub.listen():
            if message['type'] != 'message':
                continue
            
            channel = message['channel'].decode()
            data = json.loads(message['data'])
            
            if channel.startswith('room:'):
                room = channel.split(':', 1)[1]
                await self._local_broadcast_room(room, data)
            elif channel.startswith('user:'):
                user_id = channel.split(':', 1)[1]
                await self._local_send_user(user_id, data)
    
    async def _local_broadcast_room(self, room: str, message: Dict) -> None:
        """Broadcast to local room connections only"""
        # Iterate over a copy: the set can change while we await a send
        for conn_id in list(self.room_connections.get(room, ())):
            if conn_id in self.connections:
                try:
                    await self.connections[conn_id].websocket.send_json(message)
                except Exception:
                    # Best effort: a broken socket is cleaned up by its own handler
                    pass
    
    async def _local_send_user(self, user_id: str, message: Dict) -> None:
        """Send to local user connections only"""
        # Iterate over a copy: the set can change while we await a send
        for conn_id in list(self.user_connections.get(user_id, ())):
            if conn_id in self.connections:
                try:
                    await self.connections[conn_id].websocket.send_json(message)
                except Exception:
                    # Best effort: a broken socket is cleaned up by its own handler
                    pass
    
    async def _publish_presence(self, user_id: str, status: str) -> None:
        """Publish user presence change"""
        await self.redis.publish(
            "presence",
            json.dumps({"user_id": user_id, "status": status})
        )

# FastAPI WebSocket endpoint
manager = ConnectionManager()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Connect to Redis before the app starts accepting WebSockets
    await manager.initialize()
    yield

app = FastAPI(lifespan=lifespan)

@app.websocket("/ws/{user_id}")
async def websocket_endpoint(websocket: WebSocket, user_id: str):
    connection = await manager.connect(websocket, user_id)
    
    try:
        while True:
            data = await websocket.receive_json()
            
            # Handle different message types
            msg_type = data.get("type")
            
            if msg_type == "join_room":
                await manager.join_room(connection, data["room"])
                
            elif msg_type == "leave_room":
                await manager.leave_room(connection, data["room"])
                
            elif msg_type == "room_message":
                await manager.broadcast_to_room(
                    data["room"],
                    {
                        "type": "message",
                        "room": data["room"],
                        "from": user_id,
                        "content": data["content"],
                        "timestamp": datetime.utcnow().isoformat()
                    }
                )
                
            elif msg_type == "direct_message":
                await manager.send_to_user(
                    data["to"],
                    {
                        "type": "direct_message",
                        "from": user_id,
                        "content": data["content"],
                        "timestamp": datetime.utcnow().isoformat()
                    }
                )
                
    except WebSocketDisconnect:
        pass
    finally:
        # Clean up on every exit path, not only a clean disconnect
        await manager.disconnect(connection)

gRPC

gRPC vs REST

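| Feature | REST | gRPC |
|---------|------|------|
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/2 |
| Payload | JSON (text) | Protobuf (binary) |
| Contract | OpenAPI (optional) | .proto files (required) |
| Streaming | Limited | Bidirectional streaming |
| Browser | Native support | Requires gRPC-Web |
| Code Gen | Optional | Built-in |
| Speed | Slower (text parsing, larger payloads) | Typically faster (compact binary encoding); the gain depends on payload and workload |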

gRPC Communication Patterns

┌─────────────────────────────────────────────────────────────────┐
│                    gRPC Patterns                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Unary (Request-Response)                                   │
│     Client ────── Request ──────► Server                       │
│     Client ◄───── Response ─────── Server                       │
│                                                                 │
│  2. Server Streaming                                            │
│     Client ────── Request ──────► Server                       │
│     Client ◄═══ Stream of data ══ Server                       │
│                                                                 │
│  3. Client Streaming                                            │
│     Client ═══ Stream of data ══► Server                       │
│     Client ◄───── Response ─────── Server                       │
│                                                                 │
│  4. Bidirectional Streaming                                     │
│     Client ◄═══════════════════► Server                        │
│             Both stream freely                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
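
The four patterns differ only in whether each side sends one message or a stream. The sketch below uses plain Python functions and generators as stand-ins for that shape. It does not use the grpc library, and the function names and integer payloads are hypothetical; real gRPC services are generated from .proto definitions:

```python
from typing import Iterable, Iterator


def unary(request: int) -> int:
    """1. Unary: one request in, one response out."""
    return request * 2


def server_streaming(request: int) -> Iterator[int]:
    """2. Server streaming: one request in, a stream of responses out."""
    for i in range(request):
        yield i


def client_streaming(requests: Iterable[int]) -> int:
    """3. Client streaming: a stream of requests in, one response out."""
    return sum(requests)


def bidirectional(requests: Iterable[int]) -> Iterator[int]:
    """4. Bidirectional: respond to each message as it arrives."""
    for value in requests:
        yield value * 2


print(unary(21), list(server_streaming(3)), client_streaming([1, 2, 3]), list(bidirectional([1, 2])))
```

In real bidirectional streaming the two streams are independent, so the server is not obliged to answer every message one-for-one as this simplified stand-in does.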

When to Use gRPC

Use gRPC

  • Microservices communication
  • Low latency requirements
  • Strong typing needed
  • Streaming data
  • Internal services

Avoid gRPC

  • Public APIs (browser clients)
  • Simple CRUD operations
  • Team unfamiliar with Protobuf
  • Debugging ease is priority

Long Polling vs SSE vs WebSocket

┌───────────────────────────────────────────────────────────────────────┐
│                    Real-Time Communication Options                     │
├───────────────────┬───────────────────┬───────────────────────────────┤
│    Long Polling   │       SSE         │        WebSocket              │
├───────────────────┼───────────────────┼───────────────────────────────┤
│                   │                   │                               │
│  Client ── GET ──►│  Client ── GET ──►│  Client ── Upgrade ──►        │
│  Server holds...  │  Server streams   │  Bidirectional                │
│  Until data ready │  event: update    │                               │
│  Client ◄─ data ─ │  data: {...}      │  Client ◄═══════════► Server │
│  Repeat           │  data: {...}      │                               │
│                   │  (one direction)  │                               │
│                   │                   │                               │
├───────────────────┼───────────────────┼───────────────────────────────┤
│ Many TCP connects │ One TCP, server→  │ One TCP, both ways            │
│ HTTP compatible   │ HTTP compatible   │ Different protocol            │
│ Simple            │ Auto-reconnect    │ Most flexible                 │
│ Higher latency    │ Medium latency    │ Lowest latency                │
│                   │                   │                               │
│ Use: Fallback     │ Use: Notifications│ Use: Chat, gaming             │
│ for legacy        │ Live feeds        │ Collaboration                 │
└───────────────────┴───────────────────┴───────────────────────────────┘
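
The SSE column shows the wire format: optional `event:` line, a `data:` line, and a blank line to end the message. A minimal sketch of an encoder (`format_sse` is a hypothetical helper, and it assumes the payload serializes to single-line JSON):

```python
import json
from typing import Any, Optional


def format_sse(data: Any, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Events message; a blank line terminates the event."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


message = format_sse({"price": 101.5}, event="update")
print(message, end="")
```

A server would write strings like this to a long-lived response with `Content-Type: text/event-stream`, and the browser's `EventSource` reconnects on its own if the stream drops.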

Network Latency Budget

Where Time Goes

┌─────────────────────────────────────────────────────────────────┐
│                Request Latency Breakdown                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DNS Lookup:           1-50ms     (cached: <1ms)               │
│  TCP Handshake:        1 RTT     (50-150ms cross-continent)   │
│  TLS Handshake:        1-2 RTT   (50-300ms)                    │
│  Request Transfer:     Varies    (size / bandwidth)            │
│  Server Processing:    Varies    (your code)                   │
│  Response Transfer:    Varies    (size / bandwidth)            │
│                                                                 │
│  Example: USA → Europe API call                                 │
│  ─────────────────────────────────                              │
│  DNS:         5ms   (cached)                                   │
│  TCP:        75ms   (1 RTT)                                    │
│  TLS:       150ms   (2 RTT)                                    │
│  Request:    10ms   (small payload)                            │
│  Server:     50ms   (DB query + processing)                    │
│  Response:   20ms   (JSON response)                            │
│  ─────────────────────────────────                              │
│  TOTAL:     310ms                                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
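
The breakdown above is simple addition, so it is easy to model and to see which term dominates. A minimal sketch using the example's figures as defaults (`request_latency_ms` and its parameters are hypothetical names; transfer times are taken as fixed values, as in the example):

```python
def request_latency_ms(
    rtt_ms: float,
    dns_ms: float = 5,
    tls_round_trips: int = 2,
    request_ms: float = 10,
    server_ms: float = 50,
    response_ms: float = 20,
    reuse_connection: bool = False,
) -> float:
    """Sum the phases from the breakdown; a reused connection skips DNS, TCP, and TLS."""
    setup_ms = 0 if reuse_connection else dns_ms + rtt_ms + tls_round_trips * rtt_ms
    return setup_ms + request_ms + server_ms + response_ms


first_request = request_latency_ms(rtt_ms=75)                      # new connection, 2-RTT TLS
with_tls13 = request_latency_ms(rtt_ms=75, tls_round_trips=1)      # 1-RTT TLS handshake
kept_alive = request_latency_ms(rtt_ms=75, reuse_connection=True)  # connection already open
print(first_request, with_tls13, kept_alive)
```

Under these assumptions connection setup accounts for 230ms of the 310ms total, which is why connection reuse and fewer round trips matter more than shaving server time.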

Optimization Strategies

| Optimization | Latency Saved | Trade-off |
|--------------|---------------|-----------|
| CDN | 100-200ms | Cost, cache invalidation |
| Keep-Alive | 150-300ms | Connection limits |
| HTTP/2 | Variable | Server support needed |
| Compression | 10-50ms | CPU overhead |
| Edge Computing | 100-200ms | Complexity |
| DNS Prefetch | 50ms | Additional requests |
Interview Tip: When discussing latency, mention geographic distribution. “Users in Singapore accessing servers in US-East will have ~200ms RTT just from physics.”
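
The physics floor can be estimated directly. This sketch assumes light travels through fiber at roughly two thirds of its vacuum speed (about 200 km per millisecond) and uses an approximate great-circle distance of 15,500 km between Singapore and US-East; both figures are assumptions for illustration:

```python
FIBER_KM_PER_MS = 200.0  # light in fiber travels at roughly 2/3 of c, about 200 km per ms


def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over a straight fiber path of this length."""
    return 2 * distance_km / FIBER_KM_PER_MS


# Assumption: ~15,500 km great-circle distance between Singapore and US-East (Virginia).
print(min_rtt_ms(15_500))
```

Real cable routes are longer than the great circle and add routing and queuing delay, so observed RTTs sit above this bound.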

Key Takeaways

| Concept | Remember |
|---------|----------|
| DNS | First hop, cache TTLs matter, can be used for load balancing |
| TCP vs UDP | TCP = reliable, UDP = fast; choose based on use case |
| HTTP/2 | Multiplexing, server push, header compression |
| WebSocket | Real-time bidirectional, needs pub/sub for scaling |
| gRPC | Fast binary protocol, great for microservices |
| Latency | Minimize RTTs, use CDNs, keep connections alive |

Interview Deep-Dive Questions

What the interviewer is really testing: Whether you understand the full request lifecycle from browser to server and back (DNS, TCP, TLS, HTTP) and can systematically identify optimization opportunities at each layer.

Strong Answer:
  • The 3-second first-page load breaks down into sequential network costs that only apply to the first request. Subsequent pages are fast because connections are reused and resources are cached. Let me walk through each phase.
  • DNS resolution: if the browser has no cache entry for your domain, the lookup walks the resolver chain (browser cache, OS cache, ISP resolver, root servers, TLD servers, authoritative server). This can take 50-200ms depending on geography and cache state. Fix: use DNS prefetching (the dns-prefetch resource hint, a link element with rel="dns-prefetch") and set reasonable TTLs on DNS records. 300-600 seconds is typical: too low means frequent lookups, too high means slow failover.
  • TCP handshake: one round trip (SYN, SYN-ACK, ACK). For a user with a 100ms round-trip time to the server, that costs 100ms. Fix: use a CDN so the TCP connection terminates at a nearby edge node rather than the origin server. Consider TCP Fast Open (TFO), which allows data in the SYN packet on subsequent connections.
  • TLS handshake: TLS 1.2 requires two additional round trips (200ms for a 100ms RTT user). TLS 1.3 reduces this to one round trip, and 0-RTT resumption eliminates it entirely for returning visitors. Fix: upgrade to TLS 1.3, enable session resumption, use OCSP stapling to avoid the client making a separate request to check certificate revocation.
  • HTTP request and response: the actual data transfer. If the page requires multiple resources (HTML, CSS, JavaScript, images), HTTP/1.1 handles one request at a time per connection (browsers typically open up to 6 parallel connections per host, but that is still a bottleneck). HTTP/2 multiplexes all requests over a single connection, eliminating head-of-line blocking at the HTTP layer. HTTP/3 (QUIC) goes further by eliminating head-of-line blocking at the transport layer.
  • Server processing: Time-to-first-byte (TTFB) depends on server-side processing. If the server needs to query a database, render a template, and assemble the response, that adds latency. Fix: server-side caching, precomputed pages for common routes, edge-side rendering.
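  • Total optimization: DNS (50ms saved with prefetch) + TCP (100ms saved with CDN) + TLS (100ms saved with TLS 1.3) + HTTP multiplexing (200ms saved with HTTP/2) + caching (500ms saved with a CDN cache hit) removes roughly 950ms of fixed overhead. Bringing the 3-second load down to under 500ms also depends on the CDN cache hit skipping the origin round trip and server processing entirely, so treat these figures as illustrative rather than strictly additive.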
  • Example: Cloudflare’s performance measurements show that switching from TLS 1.2 to TLS 1.3 saves one full RTT per new connection. For users in Australia connecting to US servers (200ms RTT), that is a 200ms improvement on every first visit. Combined with their CDN edge nodes in Sydney, the TCP+TLS cost drops from 600ms to under 50ms.
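
The connection-setup arithmetic in this answer can be written down directly. The sketch below only counts round trips (one for TCP, two for TLS 1.2, one for TLS 1.3, zero for 0-RTT resumption, as stated above); the 10ms edge RTT is an assumed value for a nearby CDN node.

```python
# Round trips spent before the first HTTP request can be sent.
TCP_HANDSHAKE_RTTS = 1
TLS_HANDSHAKE_RTTS = {"1.2": 2, "1.3": 1, "1.3-0rtt": 0}


def setup_cost_ms(rtt_ms: float, tls_version: str) -> float:
    """Connection setup time: TCP handshake plus TLS handshake."""
    return rtt_ms * (TCP_HANDSHAKE_RTTS + TLS_HANDSHAKE_RTTS[tls_version])


if __name__ == "__main__":
    ORIGIN_RTT_MS = 200  # Australia to US origin, from the example above
    EDGE_RTT_MS = 10     # assumed RTT to a nearby CDN edge node
    for version in ("1.2", "1.3", "1.3-0rtt"):
        print(f"TLS {version}: origin {setup_cost_ms(ORIGIN_RTT_MS, version)} ms, "
              f"edge {setup_cost_ms(EDGE_RTT_MS, version)} ms")
```

With a 200ms RTT, TCP plus TLS 1.2 costs 600ms, matching the figure in the example, while the same handshakes against a nearby edge cost tens of milliseconds.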
Follow-up: The site uses HTTP/2 and a CDN, but mobile users in India still report slow loads. Desktop users in the same region are fine. What networking factors specific to mobile could explain this?

Mobile networks add several latency sources: (1) Radio resource allocation: on LTE/5G, the device must negotiate a radio channel before any data can flow, adding 50-100ms. (2) Higher RTTs on cellular networks: typical LTE RTT is 30-50ms even to nearby towers, versus 5-10ms for wired broadband. (3) TCP slow start interacts badly with mobile: high RTT means the congestion window grows slowly, so large resources take many round trips to fully transfer. (4) Packet loss on mobile is higher, causing TCP retransmissions. Fix: aggressive resource compression, smaller initial page payloads (aim for under 14KB to fit in the first TCP congestion window), lazy loading of non-critical resources, and consider QUIC/HTTP3, which handles packet loss better than TCP because it avoids head-of-line blocking across streams.
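
To see why the 14KB target matters, here is a minimal model of slow start. It is an idealized sketch: it assumes a 1460-byte segment size, an initial window of 10 segments, no packet loss, and a window that doubles every round trip. Real stacks deviate from all of these.

```python
import math

MSS_BYTES = 1460            # assumed maximum segment size
INITIAL_CWND_SEGMENTS = 10  # common initial congestion window


def round_trips_to_send(payload_bytes: int) -> int:
    """Round trips needed to deliver a payload under idealized slow start."""
    segments_left = math.ceil(payload_bytes / MSS_BYTES)
    cwnd = INITIAL_CWND_SEGMENTS
    rounds = 0
    while segments_left > 0:
        segments_left -= cwnd  # send one full window per round trip
        cwnd *= 2              # window doubles while in slow start
        rounds += 1
    return rounds


if __name__ == "__main__":
    for size in (14_000, 100_000, 1_000_000):
        print(f"{size} bytes: {round_trips_to_send(size)} round trip(s)")
```

In this model a 14KB page fits in one round trip, while a 100KB page needs three. On a high-RTT mobile link each extra round trip is directly visible to the user.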
What the interviewer is really testing: Whether you can make nuanced protocol decisions based on actual requirements rather than hype, and whether you understand the operational implications beyond raw performance.

Strong Answer:
  • gRPC is not universally better than REST — it is better for specific use cases, and it comes with operational costs that are easy to underestimate. The decision should be driven by concrete pain points, not benchmarks.
  • When gRPC makes sense: (1) High-throughput internal service communication where the protobuf binary encoding saves significant bandwidth (a 1KB JSON payload might be 300 bytes in protobuf — at millions of requests per second, that bandwidth savings is real). (2) Strict API contracts are needed — protobuf schemas enforce types at compile time, catching breaking changes before deployment. (3) You need streaming (server-streaming, client-streaming, or bidirectional streaming) — gRPC has first-class streaming support, while REST over HTTP/1.1 does not. (4) Latency-sensitive internal paths where JSON parsing overhead matters (protobuf deserialization is 2-10x faster than JSON parsing).
  • When REST is still the right choice: (1) Public-facing APIs — browsers do not natively support gRPC (you need gRPC-Web or a proxy), and developer experience with REST is far more accessible. (2) Services that are called infrequently — the performance difference is negligible at low volume. (3) Teams without protobuf experience — the learning curve is real and affects velocity.
  • Operational costs people underestimate: (1) Debugging is harder — binary protobuf payloads are not human-readable in packet captures or logs. You need tools like grpcurl or Postman’s gRPC support. With REST, you can curl an endpoint and read the JSON response. (2) Load balancing is more complex — gRPC uses HTTP/2 with long-lived connections. A standard L4 load balancer will route all requests from one connection to one backend. You need L7 (application-layer) load balancing that understands HTTP/2 frames, or client-side load balancing. (3) Schema evolution requires discipline — adding a field to a protobuf message is backward-compatible, but removing or renumbering a field is a breaking change that can cause silent data corruption. (4) Monitoring and tracing middleware needs to understand gRPC status codes (which are different from HTTP status codes).
  • My recommendation for the team: introduce gRPC selectively on the highest-traffic internal paths first. Keep REST for public APIs and low-volume internal services. Run both protocols through the same service mesh so you get consistent observability regardless of protocol.
  • Example: Google uses gRPC internally for almost all service-to-service communication (it was built for this purpose), but their public APIs (Maps, Gmail, etc.) offer REST endpoints because developer adoption matters more than protocol efficiency for external consumers. Internally, they report that gRPC’s streaming support was a bigger factor than raw performance in their adoption decision.
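
The bandwidth argument comes down to text versus binary encoding. The sketch below is not protobuf: it uses Python's standard struct module as a stand-in for a schema-based binary format, and the order record and its field layout are hypothetical. It only shows why a fixed schema lets the wire format drop field names and decimal digits.

```python
import json
import struct

# Hypothetical order record, used only for illustration.
order = {
    "order_id": 123456789,
    "user_id": 987654321,
    "quantity": 3,
    "price_cents": 4599,
}

# Text encoding: field names and decimal digits travel with every message.
json_bytes = json.dumps(order).encode("utf-8")

# Binary encoding with a fixed layout known to both sides:
# two unsigned 64-bit ints followed by two unsigned 32-bit ints.
ORDER_LAYOUT = struct.Struct("<QQII")
binary_bytes = ORDER_LAYOUT.pack(
    order["order_id"], order["user_id"], order["quantity"], order["price_cents"]
)

# The receiver needs the same layout to decode, which is the "strict contract".
decoded = ORDER_LAYOUT.unpack(binary_bytes)

if __name__ == "__main__":
    print(f"JSON: {len(json_bytes)} bytes, binary: {len(binary_bytes)} bytes")
```

The binary form is smaller but unreadable without the schema, which is the same trade-off behind the debugging cost listed above.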
Follow-up: Your team adopts gRPC for the critical path between the API Gateway and the Order Service. Requests are being unevenly distributed — one Order Service instance is getting 80% of traffic while three others are idle. What is happening?

This is the classic gRPC load balancing problem. gRPC uses HTTP/2, which multiplexes all requests over a single long-lived TCP connection. If the API Gateway opens one connection through an L4 load balancer, the load balancer assigns that connection to one backend, and all subsequent requests flow through that same connection. Solutions: (1) Use L7 load balancing (Envoy, nginx with gRPC support) that can distribute individual gRPC requests across backends, not just connections. (2) Use client-side load balancing where the API Gateway maintains connections to all backends and round-robins requests itself (gRPC libraries support this natively with name resolvers). (3) If using Kubernetes, use a service mesh like Istio, which handles per-request load balancing transparently.
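
The difference between balancing connections and balancing requests can be shown with a small simulation. This is a toy model, not a real load balancer: the backend names are hypothetical, and the L4 case simply pins each connection to a backend for its lifetime.

```python
from collections import Counter
from itertools import cycle

BACKENDS = ["order-1", "order-2", "order-3", "order-4"]  # hypothetical instances


def connection_level(num_requests: int, num_connections: int = 1) -> Counter:
    """L4 behaviour: each connection is pinned to one backend for its lifetime."""
    pinned = [BACKENDS[i % len(BACKENDS)] for i in range(num_connections)]
    counts = Counter({backend: 0 for backend in BACKENDS})
    for i in range(num_requests):
        # Requests are multiplexed over the existing connections.
        counts[pinned[i % num_connections]] += 1
    return counts


def request_level(num_requests: int) -> Counter:
    """L7 or client-side behaviour: every request picks the next backend."""
    counts = Counter({backend: 0 for backend in BACKENDS})
    next_backend = cycle(BACKENDS)
    for _ in range(num_requests):
        counts[next(next_backend)] += 1
    return counts


if __name__ == "__main__":
    print("L4, one connection:", dict(connection_level(1000)))
    print("L7, per request:   ", dict(request_level(1000)))
```

With a single multiplexed connection, every request lands on one instance. Per-request balancing spreads the same load evenly across all four.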
What the interviewer is really testing: Whether you understand networking fundamentals at a practical level and can connect low-level protocol behavior to high-level system design decisions.

Strong Answer:
  • The three-way handshake establishes a TCP connection: (1) Client sends SYN with an initial sequence number. (2) Server responds with SYN-ACK, acknowledging the client’s sequence number and providing its own. (3) Client sends ACK, acknowledging the server’s sequence number. The connection is now established and data can flow.
  • This takes one round trip (the SYN goes out, SYN-ACK comes back, ACK goes out with or before the first data packet). For system design, this means every new TCP connection costs at minimum one RTT before any application data is exchanged.
  • Why this matters for system design: (1) Connection pooling is critical for microservices. If Service A calls Service B 1000 times per second and opens a new connection each time, you are paying 1000 handshakes per second. At 1ms RTT within a datacenter, that is tolerable but wasteful. At 100ms RTT across regions, that is 100 seconds of cumulative handshake time per second of operation — a disaster. Connection pools reuse established connections, amortizing the handshake cost. (2) TCP slow start means even after the handshake, the connection starts with a small congestion window (typically 10 segments, or ~14KB). It takes multiple round trips to ramp up to full throughput. This is why large file downloads are slow at the beginning and why serving a 100KB response over a fresh connection takes longer than serving it over a warm connection. (3) Keep-alive connections (HTTP keep-alive, gRPC persistent connections) avoid repeated handshakes. The trade-off is that each open connection consumes memory on both client and server (kernel buffers, file descriptors). A server with 100K idle keep-alive connections can consume significant memory. (4) CDNs and edge proxies work partly by terminating TCP connections close to the user. The handshake RTT between the user and the CDN edge is 5ms instead of 150ms to the origin. The CDN can maintain a warm, pre-established connection pool to the origin.
  • UDP skips the handshake entirely, which is why DNS uses UDP for small queries (the entire query and response fit in one round trip), and why QUIC (the transport under HTTP/3) uses UDP with its own connection establishment that can be done in 0-RTT for returning visitors.
  • Example: When Cloudflare analyzed their traffic, they found that 40% of the latency for typical web requests was TCP and TLS handshake overhead. By moving their edge nodes closer to users and enabling TLS 1.3 with 0-RTT, they eliminated most of this overhead for repeat visitors. This is a direct system design implication of the three-way handshake cost.
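
The connection pooling arithmetic from this answer can be checked in a few lines. This sketch counts only the one-RTT TCP handshake per new connection; the pool size of 20 is a hypothetical value, and TLS would add further round trips on top.

```python
def cumulative_handshake_s(new_connections: int, rtt_ms: float) -> float:
    """Total time spent in TCP handshakes, at one RTT per new connection."""
    return new_connections * rtt_ms / 1000


CALLS_PER_SECOND = 1000
POOL_SIZE = 20  # hypothetical number of pooled connections

if __name__ == "__main__":
    # Without pooling, every call opens a fresh connection.
    print("same datacenter:", cumulative_handshake_s(CALLS_PER_SECOND, rtt_ms=1), "s")
    print("cross-region:   ", cumulative_handshake_s(CALLS_PER_SECOND, rtt_ms=100), "s")
    # With pooling, handshakes are paid once per pooled connection, then reused.
    print("pooled, one-off:", cumulative_handshake_s(POOL_SIZE, rtt_ms=100), "s")
```

The pooled cost is paid once at warm-up, while the unpooled cost recurs every second of operation.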
Follow-up: A service behind your load balancer is running out of ephemeral ports and you see thousands of connections in TIME_WAIT state. What is happening and how do you fix it?

TIME_WAIT is a TCP state where a closed connection lingers for 2x the maximum segment lifetime (Linux uses a fixed 60 seconds) to ensure delayed packets from the old connection do not corrupt a new connection on the same port. If a service opens and closes many short-lived connections rapidly (e.g., a microservice making thousands of HTTP calls per second without connection pooling), it exhausts the ephemeral port range (28,232 ports with the Linux default range of 32768-60999). Fixes: (1) Use connection pooling. This is the primary fix: reuse connections instead of opening new ones. (2) Increase the ephemeral port range via net.ipv4.ip_local_port_range. (3) Enable net.ipv4.tcp_tw_reuse to allow reusing TIME_WAIT sockets for new outbound connections (safe for client-initiated connections). (4) Never enable tcp_tw_recycle: it breaks clients behind NAT and was removed in Linux 4.12. The root cause is almost always missing connection pooling.
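
A quick estimate shows how low the ceiling is without pooling. The sketch assumes the Linux default port range of 32768-60999 and a 60-second TIME_WAIT, and it models the worst case where every connection goes to the same destination IP and port.

```python
EPHEMERAL_PORT_RANGE = (32768, 60999)  # Linux default ip_local_port_range
TIME_WAIT_SECONDS = 60


def ephemeral_port_count(port_range=EPHEMERAL_PORT_RANGE) -> int:
    """Number of usable source ports in an inclusive range."""
    low, high = port_range
    return high - low + 1


def max_new_connections_per_second(
    port_range=EPHEMERAL_PORT_RANGE, time_wait_s=TIME_WAIT_SECONDS
) -> float:
    """Sustainable rate of short-lived connections to one destination.

    Each closed connection holds its source port for the TIME_WAIT period,
    so the steady-state rate is ports divided by that period.
    """
    return ephemeral_port_count(port_range) / time_wait_s


if __name__ == "__main__":
    print(ephemeral_port_count(), "ports")
    print(f"{max_new_connections_per_second():.0f} new connections/s by default")
    print(f"{max_new_connections_per_second((1024, 65535)):.0f} with a widened range")
```

Even a widened range only roughly doubles the ceiling, which is why connection pooling is the primary fix and sysctl tuning is secondary.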