Why Networking Matters
Every distributed system communicates over networks. Understanding networking is crucial for:
- Latency optimization - Where does delay come from?
- Protocol selection - HTTP vs WebSocket vs gRPC
- Debugging issues - Why is my API slow?
- Security design - TLS, firewalls, VPNs
The OSI Model (Simplified)
DNS (Domain Name System)
DNS translates human-readable domain names to IP addresses.
DNS Resolution Flow
DNS Record Types
| Record | Purpose | Example |
|---|---|---|
| A | Maps domain to IPv4 | example.com → 93.184.216.34 |
| AAAA | Maps domain to IPv6 | example.com → 2606:2800:220:1:... |
| CNAME | Alias to another domain | www.example.com → example.com |
| MX | Mail server | example.com → mail.example.com |
| TXT | Text data (verification) | SPF, DKIM records |
| NS | Nameserver | example.com → ns1.provider.com |
DNS in System Design
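One DNS behavior that recurs in system design is TTL-based caching: a cached answer avoids a lookup, but it also delays failover until the TTL expires. The following is a minimal sketch of that trade-off, not a real resolver. The `TTLDnsCache` class, its `resolve` callback, and the injectable `clock` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TTLDnsCache:
    """Caches hostname -> address answers until their TTL expires."""

    def __init__(self, resolve: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve      # hypothetical upstream: returns (address, ttl_seconds)
        self._clock = clock          # injectable so examples can control time
        self._entries: Dict[str, Tuple[str, float]] = {}
        self.upstream_lookups = 0

    def lookup(self, hostname: str) -> str:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]         # still fresh: no upstream round trip
        address, ttl = self._resolve(hostname)
        self.upstream_lookups += 1
        self._entries[hostname] = (address, now + ttl)
        return address
```

With a 300-second TTL, a changed record only becomes visible to this cache once 300 seconds have passed, which is the "slow failover" side of choosing a high TTL.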
TCP vs UDP
TCP (Transmission Control Protocol)
TCP vs UDP Comparison
| Feature | TCP | UDP |
|---|---|---|
| Connection | Connection-oriented | Connectionless |
| Reliability | Guaranteed delivery | Best effort |
| Ordering | In-order delivery | No ordering |
| Speed | Slower (overhead) | Faster |
| Use Case | HTTP, databases | Video streaming, gaming, DNS |
| Header Size | 20-60 bytes | 8 bytes |
When to Use What
Use TCP
- Web applications (HTTP/HTTPS)
- File transfers
- Database connections
- Email (SMTP, IMAP)
- When data integrity matters
Use UDP
- Live video/audio streaming
- Online gaming
- DNS queries
- IoT sensors
- When speed > reliability
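To make the connection-oriented versus connectionless distinction concrete, here is a small sketch that sends the same payload over loopback with each protocol using Python's standard `socket` module. The function names are illustrative, and the single-process loopback setup is an assumption chosen to keep the example self-contained.

```python
import socket


def udp_send_receive(payload: bytes) -> bytes:
    """Send one datagram over loopback: no handshake, no delivery guarantee."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
        receiver.settimeout(2.0)             # UDP may drop packets; never block forever
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(65535)
        return data


def tcp_send_receive(payload: bytes) -> bytes:
    """Send bytes over loopback TCP: connecting performs the handshake first."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        listener.settimeout(2.0)
        with socket.create_connection(listener.getsockname(), timeout=2.0) as client:
            conn, _addr = listener.accept()
            with conn:
                conn.settimeout(2.0)
                client.sendall(payload)
                client.shutdown(socket.SHUT_WR)   # signal end of stream
                chunks = []
                while True:
                    chunk = conn.recv(4096)       # TCP is a byte stream: read until EOF
                    if not chunk:
                        break
                    chunks.append(chunk)
                return b"".join(chunks)
```

The UDP path is a single `sendto`, while the TCP path needs a listener, a connection, and an explicit end-of-stream signal. That extra machinery is what buys ordering and reliability.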
HTTP/HTTPS
HTTP Request/Response
HTTP Methods
| Method | Purpose | Idempotent | Safe |
|---|---|---|---|
| GET | Retrieve resource | Yes | Yes |
| POST | Create resource | No | No |
| PUT | Replace resource | Yes | No |
| PATCH | Partial update | No | No |
| DELETE | Remove resource | Yes | No |
| HEAD | Get headers only | Yes | Yes |
| OPTIONS | Get allowed methods | Yes | Yes |
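The idempotency column can be illustrated with a toy in-memory store. This is a sketch of the semantics only, not an HTTP server; `InMemoryStore` and its method names are hypothetical.

```python
from typing import Dict


class InMemoryStore:
    """Toy resource store used to contrast PUT, POST, and DELETE semantics."""

    def __init__(self) -> None:
        self.items: Dict[int, dict] = {}
        self._next_id = 1

    def post(self, body: dict) -> int:
        # Not idempotent: every call creates a new resource with a new id.
        item_id = self._next_id
        self._next_id += 1
        self.items[item_id] = dict(body)
        return item_id

    def put(self, item_id: int, body: dict) -> None:
        # Idempotent: repeating the call leaves the same final state.
        self.items[item_id] = dict(body)

    def delete(self, item_id: int) -> None:
        # Idempotent: deleting an already-deleted resource changes nothing.
        self.items.pop(item_id, None)
```

This is why a client can safely retry a timed-out PUT or DELETE, while a retried POST may create a duplicate unless the API adds its own deduplication.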
HTTP Status Codes
HTTP/1.1 vs HTTP/2 vs HTTP/3
HTTPS/TLS Handshake
WebSockets
WebSocket vs HTTP
WebSocket Use Cases
- Real-time Chat
- Live Updates
- Gaming
- Collaboration
WebSocket Scaling Challenge
WebSocket Implementation
Production-ready WebSocket server with connection management:
- Python
- JavaScript
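As a rough illustration of the connection-management part, the sketch below tracks connections per room and fans a message out to them. It is transport-agnostic and is not a complete WebSocket server: `Connection`, `ConnectionManager`, and the room concept are assumptions made for this example, and a real server would plug in its WebSocket library's connection object.

```python
import asyncio
from typing import Dict, Protocol, Set


class Connection(Protocol):
    """Anything with an async send(); a stand-in for a real WebSocket object."""
    async def send(self, message: str) -> None: ...


class ConnectionManager:
    """Tracks open connections per room and fans messages out to them."""

    def __init__(self) -> None:
        self._rooms: Dict[str, Set[Connection]] = {}

    def register(self, room: str, conn: Connection) -> None:
        self._rooms.setdefault(room, set()).add(conn)

    def unregister(self, room: str, conn: Connection) -> None:
        members = self._rooms.get(room, set())
        members.discard(conn)
        if not members:
            self._rooms.pop(room, None)

    def count(self, room: str) -> int:
        return len(self._rooms.get(room, ()))

    async def broadcast(self, room: str, message: str) -> int:
        """Send to every member; drop connections whose send() fails."""
        members = list(self._rooms.get(room, ()))
        results = await asyncio.gather(
            *(conn.send(message) for conn in members), return_exceptions=True
        )
        delivered = 0
        for conn, result in zip(members, results):
            if isinstance(result, Exception):
                self.unregister(room, conn)   # treat a failed send as a dead socket
            else:
                delivered += 1
        return delivered
```

Because this state lives in one process, it only reaches clients connected to that process. Reaching clients on other servers is where a shared pub/sub layer comes in.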
gRPC
gRPC vs REST
| Feature | REST | gRPC |
|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/2 |
| Payload | JSON (text) | Protobuf (binary) |
| Contract | OpenAPI (optional) | .proto files (required) |
| Streaming | Limited | Bidirectional streaming |
| Browser | Native support | Requires gRPC-Web |
| Code Gen | Optional | Built-in |
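| Speed | Slower (text parsing, larger payloads) | Often faster; Protobuf deserialization is roughly 2-10x quicker than JSON parsing, workload-dependent |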
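The payload row is easier to picture with bytes in hand. The sketch below hand-encodes a tiny message using the Protobuf wire format (base-128 varints and length-delimited fields) and compares it with the equivalent JSON. It is not the `protobuf` library, and the message layout (field 1 as `user_id`, field 2 as `name`) is a hypothetical schema chosen for illustration.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (Protobuf wire format)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow: set the continuation bit
        else:
            out.append(byte)
            return bytes(out)


def encode_uint_field(field_number: int, value: int) -> bytes:
    """Encode one varint field: tag = (field_number << 3) | wire type 0."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one length-delimited field (wire type 2): tag, length, UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


# Hypothetical message: field 1 = user_id (uint), field 2 = name (string).
binary = encode_uint_field(1, 150) + encode_string_field(2, "alice")
text = json.dumps({"user_id": 150, "name": "alice"}).encode("utf-8")
print(len(binary), len(text))
```

The binary form omits field names and quoting entirely, which is where most of the size difference comes from. The exact ratio depends heavily on the message shape.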
gRPC Communication Patterns
When to Use gRPC
Use gRPC
- Microservices communication
- Low latency requirements
- Strong typing needed
- Streaming data
- Internal services
Avoid gRPC
- Public APIs (browser clients)
- Simple CRUD operations
- Team unfamiliar with Protobuf
- Debugging ease is priority
Long Polling vs SSE vs WebSocket
Network Latency Budget
Where Time Goes
Optimization Strategies
| Optimization | Latency Saved | Trade-off |
|---|---|---|
| CDN | 100-200ms | Cost, cache invalidation |
| Keep-Alive | 150-300ms | Connection limits |
| HTTP/2 | Variable | Server support needed |
| Compression | 10-50ms | CPU overhead |
| Edge Computing | 100-200ms | Complexity |
| DNS Prefetch | 50ms | Additional requests |
Key Takeaways
| Concept | Remember |
|---|---|
| DNS | First hop, cache TTLs matter, can be used for load balancing |
| TCP vs UDP | TCP = reliable, UDP = fast; choose based on use case |
| HTTP/2 | Multiplexing, header compression, server push (now largely unused; Chrome has removed support) |
| WebSocket | Real-time bidirectional, needs pub/sub for scaling |
| gRPC | Fast binary protocol, great for microservices |
| Latency | Minimize RTTs, use CDNs, keep connections alive |
Interview Deep-Dive Questions
Q1: A user reports that your web application takes 3 seconds to load the first page, but subsequent pages are fast. Walk me through every network-level factor contributing to that first-page latency and how you would reduce it.
- The 3-second first-page load breaks down into sequential network costs that only apply to the first request. Subsequent pages are fast because connections are reused and resources are cached. Let me walk through each phase.
- DNS resolution: if the browser has no cache entry for your domain, it goes through the recursive resolver chain (browser cache, OS cache, ISP resolver, root servers, TLD servers, authoritative server). This can take 50-200ms depending on geography and cache state. Fix: use DNS prefetching (`dns-prefetch` link header) and set reasonable TTLs on DNS records (300-600 seconds is typical; too low means frequent lookups, too high means slow failover).
- TCP handshake: one round trip (SYN, SYN-ACK, ACK). For a user 100ms away from the server, that is 100ms. Fix: use a CDN so the TCP connection terminates at a nearby edge node rather than the origin server. Consider TCP Fast Open (TFO), which allows data in the SYN packet on subsequent connections.
- TLS handshake: TLS 1.2 requires two additional round trips (200ms for a 100ms RTT user). TLS 1.3 reduces this to one round trip, and 0-RTT resumption eliminates it entirely for returning visitors. Fix: upgrade to TLS 1.3, enable session resumption, use OCSP stapling to avoid the client making a separate request to check certificate revocation.
- HTTP request and response: the actual data transfer. If the page requires multiple resources (HTML, CSS, JavaScript, images), HTTP/1.1 loads them sequentially per connection (browsers open 6 parallel connections, but that is still a bottleneck). HTTP/2 multiplexes all requests over a single connection, eliminating head-of-line blocking at the HTTP layer. HTTP/3 (QUIC) goes further by eliminating head-of-line blocking at the transport layer.
- Server processing: Time-to-first-byte (TTFB) depends on server-side processing. If the server needs to query a database, render a template, and assemble the response, that adds latency. Fix: server-side caching, precomputed pages for common routes, edge-side rendering.
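- Total optimization: DNS (50ms saved with prefetch) + TCP (100ms saved with CDN) + TLS (100ms saved with TLS 1.3) + HTTP multiplexing (200ms saved with HTTP/2) + server-side caching (500ms saved with CDN cache hit) adds up to roughly 950ms for a single connection and request. These are illustrative per-phase figures; because a real first-page load repeats connection setup across several origins and resources, the combined effect can bring the 3-second load down to under 500ms.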
- Example: Cloudflare’s performance measurements show that switching from TLS 1.2 to TLS 1.3 saves one full RTT per new connection. For users in Australia connecting to US servers (200ms RTT), that is a 200ms improvement on every first visit. Combined with their CDN edge nodes in Sydney, the TCP+TLS cost drops from 600ms to under 50ms.
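The round-trip arithmetic in this answer can be written down as a small estimator. This is a back-of-the-envelope sketch that counts only handshake round trips and an optional DNS cost; the function name and the `"1.3-0rtt"` label are assumptions introduced here.

```python
# Round trips needed before the first HTTP request byte can be sent.
TCP_HANDSHAKE_RTTS = 1
TLS_HANDSHAKE_RTTS = {"1.2": 2, "1.3": 1, "1.3-0rtt": 0}


def connection_setup_ms(rtt_ms: float, tls: str = "1.3", dns_ms: float = 0.0) -> float:
    """Estimate DNS + TCP + TLS setup time for a brand-new HTTPS connection."""
    round_trips = TCP_HANDSHAKE_RTTS + TLS_HANDSHAKE_RTTS[tls]
    return dns_ms + round_trips * rtt_ms
```

For the 200ms RTT case above, `connection_setup_ms(200, "1.2")` gives 600ms and `connection_setup_ms(200, "1.3")` gives 400ms, while a nearby edge at an assumed 10ms RTT gives `connection_setup_ms(10, "1.3")` of 20ms.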
Q2: Your microservices architecture currently uses REST for all inter-service communication. The team is evaluating gRPC. When would you recommend the switch, and what are the operational costs people underestimate?
- gRPC is not universally better than REST — it is better for specific use cases, and it comes with operational costs that are easy to underestimate. The decision should be driven by concrete pain points, not benchmarks.
- When gRPC makes sense: (1) High-throughput internal service communication where the protobuf binary encoding saves significant bandwidth (a 1KB JSON payload might be 300 bytes in protobuf — at millions of requests per second, that bandwidth savings is real). (2) Strict API contracts are needed — protobuf schemas enforce types at compile time, catching breaking changes before deployment. (3) You need streaming (server-streaming, client-streaming, or bidirectional streaming) — gRPC has first-class streaming support, while REST over HTTP/1.1 does not. (4) Latency-sensitive internal paths where JSON parsing overhead matters (protobuf deserialization is 2-10x faster than JSON parsing).
- When REST is still the right choice: (1) Public-facing APIs — browsers do not natively support gRPC (you need gRPC-Web or a proxy), and developer experience with REST is far more accessible. (2) Services that are called infrequently — the performance difference is negligible at low volume. (3) Teams without protobuf experience — the learning curve is real and affects velocity.
- Operational costs people underestimate: (1) Debugging is harder: binary protobuf payloads are not human-readable in packet captures or logs. You need tools like `grpcurl` or Postman’s gRPC support. With REST, you can `curl` an endpoint and read the JSON response. (2) Load balancing is more complex: gRPC uses HTTP/2 with long-lived connections. A standard L4 load balancer will route all requests from one connection to one backend. You need L7 (application-layer) load balancing that understands HTTP/2 frames, or client-side load balancing. (3) Schema evolution requires discipline: adding a field to a protobuf message is backward-compatible, but removing or renumbering a field is a breaking change that can cause silent data corruption. (4) Monitoring and tracing middleware needs to understand gRPC status codes (which are different from HTTP status codes).
- My recommendation for the team: introduce gRPC selectively on the highest-traffic internal paths first. Keep REST for public APIs and low-volume internal services. Run both protocols through the same service mesh so you get consistent observability regardless of protocol.
- Example: Google uses gRPC internally for almost all service-to-service communication (it was built for this purpose), but their public APIs (Maps, Gmail, etc.) offer REST endpoints because developer adoption matters more than protocol efficiency for external consumers. Internally, they report that gRPC’s streaming support was a bigger factor than raw performance in their adoption decision.
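The load-balancing point in this answer can be shown with a toy simulation. It compares picking a backend once per connection (L4-style) with picking one per request (L7-style), using simple round robin. The function names and the round-robin policy are assumptions for illustration, not a model of any particular load balancer.

```python
from collections import Counter
from itertools import cycle
from typing import List


def balance_per_connection(backends: List[str], connections: int,
                           requests_per_connection: int) -> Counter:
    """L4-style: pick a backend once per connection; all its requests follow."""
    picker = cycle(backends)
    load: Counter = Counter({b: 0 for b in backends})
    for _ in range(connections):
        backend = next(picker)
        load[backend] += requests_per_connection
    return load


def balance_per_request(backends: List[str], connections: int,
                        requests_per_connection: int) -> Counter:
    """L7-style: pick a backend for every request, regardless of connection."""
    picker = cycle(backends)
    load: Counter = Counter({b: 0 for b in backends})
    for _ in range(connections * requests_per_connection):
        load[next(picker)] += 1
    return load
```

With one long-lived connection carrying 900 requests across three backends, the per-connection policy sends all 900 to a single backend, while the per-request policy sends 300 to each.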
Q3: Explain the TCP three-way handshake to me. Then tell me why it matters for system design, not just for a networking exam.
- The three-way handshake establishes a TCP connection: (1) Client sends SYN with an initial sequence number. (2) Server responds with SYN-ACK, acknowledging the client’s sequence number and providing its own. (3) Client sends ACK, acknowledging the server’s sequence number. The connection is now established and data can flow.
- This takes one round trip (the SYN goes out, SYN-ACK comes back, ACK goes out with or before the first data packet). For system design, this means every new TCP connection costs at minimum one RTT before any application data is exchanged.
- Why this matters for system design: (1) Connection pooling is critical for microservices. If Service A calls Service B 1000 times per second and opens a new connection each time, you are paying 1000 handshakes per second. At 1ms RTT within a datacenter, that is tolerable but wasteful. At 100ms RTT across regions, that is 100 seconds of cumulative handshake time per second of operation — a disaster. Connection pools reuse established connections, amortizing the handshake cost. (2) TCP slow start means even after the handshake, the connection starts with a small congestion window (typically 10 segments, or ~14KB). It takes multiple round trips to ramp up to full throughput. This is why large file downloads are slow at the beginning and why serving a 100KB response over a fresh connection takes longer than serving it over a warm connection. (3) Keep-alive connections (HTTP keep-alive, gRPC persistent connections) avoid repeated handshakes. The trade-off is that each open connection consumes memory on both client and server (kernel buffers, file descriptors). A server with 100K idle keep-alive connections can consume significant memory. (4) CDNs and edge proxies work partly by terminating TCP connections close to the user. The handshake RTT between the user and the CDN edge is 5ms instead of 150ms to the origin. The CDN can maintain a warm, pre-established connection pool to the origin.
- UDP skips the handshake entirely, which is why DNS uses UDP for small queries (the entire query and response fit in one round trip), and why QUIC (the transport under HTTP/3) uses UDP with its own connection establishment that can be done in 0-RTT for returning visitors.
- Example: When Cloudflare analyzed their traffic, they found that 40% of the latency for typical web requests was TCP and TLS handshake overhead. By moving their edge nodes closer to users and enabling TLS 1.3 with 0-RTT, they eliminated most of this overhead for repeat visitors. This is a direct system design implication of the three-way handshake cost.
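Two of the numbers in this answer can be reproduced with a few lines of arithmetic. The sketch below assumes an idealized slow start (the window doubles every round trip, no loss, no delayed ACKs) and a 1460-byte segment size, which is a common value on Ethernet but is an assumption here. The function names are illustrative.

```python
import math


def handshake_seconds_per_second(new_connections_per_s: int, rtt_ms: float) -> float:
    """Cumulative time spent in TCP handshakes for each second of traffic."""
    return new_connections_per_s * rtt_ms / 1000.0


def slow_start_round_trips(response_bytes: int, initial_cwnd_segments: int = 10,
                           mss_bytes: int = 1460) -> int:
    """Round trips to deliver a response if cwnd doubles every RTT (no loss)."""
    segments_left = math.ceil(response_bytes / mss_bytes)
    cwnd = initial_cwnd_segments
    round_trips = 0
    while segments_left > 0:
        segments_left -= cwnd   # send one full window per round trip
        cwnd *= 2               # idealized slow start: window doubles each RTT
        round_trips += 1
    return round_trips
```

Under these assumptions, 1000 new connections per second at 100ms RTT cost 100 seconds of cumulative handshake time, a 14,000-byte response fits in the first window, and a 100,000-byte response needs three round trips on a fresh connection.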
`net.ipv4.ip_local_port_range`. (3) Enable `net.ipv4.tcp_tw_reuse` to allow reusing TIME_WAIT sockets for new outbound connections (safe for client-initiated connections). (4) Never enable `tcp_tw_recycle`: it breaks NAT and was removed in Linux 4.12. The root cause is almost always missing connection pooling.