Part XVIII — Networking
Chapter 25: Networking for Engineers
25.1 DNS
Analogy: DNS is like the phone book of the internet — you look up a name and get a number (IP address). Just like you do not memorize every phone number, your browser does not memorize every IP address. It looks up the name in a series of “phone books” (DNS servers), each one more authoritative than the last, until it finds the number it needs. And just like phone books get outdated, DNS records have a TTL that controls how long the “listing” is trusted before you need to look it up again.

DNS translates domain names to IP addresses. The foundation of how clients find your services. Key record types: A (domain to IPv4 address), AAAA (domain to IPv6), CNAME (domain to another domain, alias), MX (mail server), TXT (verification, SPF, DKIM), SRV (service discovery with port and priority), NS (nameserver delegation). How DNS resolution works: Browser cache, then OS cache, then ISP recursive resolver, then root nameservers, then TLD nameservers (.com, .org), then authoritative nameserver, and finally the IP address is returned and cached at each layer. Each step adds latency. TTL (Time To Live) controls how long each layer caches the answer.
Recursive vs Iterative Resolution
| Aspect | Recursive Resolution | Iterative Resolution |
|---|---|---|
| Who does the work | The recursive resolver does all the chasing on behalf of the client | The client (or resolver) queries each server in turn |
| Flow | Client asks resolver once; resolver contacts root, TLD, and authoritative servers, then returns the final answer | Client asks root, gets a referral to TLD; asks TLD, gets a referral to authoritative; asks authoritative, gets the answer |
| Typical usage | Client to ISP/corporate resolver | Resolver to root/TLD/authoritative servers |
| Caching benefit | Resolver caches intermediate results for all its clients | Each intermediate answer can still be cached |
| Load on client | Low (single request) | Higher (multiple round-trips) |
In practice, clients use recursive resolution: your machine sends a single query to a resolver (8.8.8.8, 1.1.1.1, or your ISP) and that resolver performs iterative lookups on your behalf.
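The referral-chasing loop a resolver runs on your behalf can be sketched as follows. This is a minimal illustration, not a real DNS client: the `nameservers` map and its `referral`/`answer` fields are toy stand-ins for real DNS servers and their responses.

```python
def resolve_iteratively(name, nameservers):
    """Sketch of iterative resolution: follow referrals from
    root -> TLD -> authoritative until an answer appears."""
    server = "root"
    path = []
    while True:
        path.append(server)
        response = nameservers[server](name)  # one UDP query in real life
        if "answer" in response:
            return response["answer"], path
        server = response["referral"]         # no answer: follow the referral

# Toy hierarchy: root refers to .com, .com refers to example.com's nameserver.
nameservers = {
    "root":            lambda n: {"referral": "com-tld"},
    "com-tld":         lambda n: {"referral": "ns1.example.com"},
    "ns1.example.com": lambda n: {"answer": "93.184.216.34"},
}

ip, path = resolve_iteratively("example.com", nameservers)
assert ip == "93.184.216.34"
assert path == ["root", "com-tld", "ns1.example.com"]
```

A recursive resolver runs exactly this loop once, then caches each intermediate referral so the next client asking for any `.com` name skips the root query entirely.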
TTL Implications for Deployments
TTL trade-offs: Low TTL (30-60 seconds): fast failover, more DNS queries, higher DNS costs. High TTL (hours): fewer queries, slower failover. For services that need fast failover, use low TTLs. For stable services, higher TTLs reduce lookup time and DNS provider load.

DNS-based load balancing: Weighted routing (send 80% to region A, 20% to region B). Geolocation routing (route users to the nearest region). Latency-based routing (route to the region with lowest measured latency). Health-check-based failover (remove unhealthy IPs from DNS responses). All major cloud DNS services support these (Route 53, Cloud DNS, Azure DNS).

Debugging DNS: dig example.com (Linux/Mac) or nslookup example.com (Windows) to query DNS records. dig +trace example.com to see the full resolution chain. When “it works on my machine but not in production,” DNS caching is often the culprit.
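To make the TTL trade-off concrete, here is a minimal sketch of the caching behavior every resolver layer implements. The `TTLCache` class and its injectable clock are illustrative, not a real library API.

```python
import time

class TTLCache:
    """A resolver-style cache: answers are trusted only until their
    TTL expires, then must be fetched again from upstream."""

    def __init__(self, clock=time.monotonic):
        self._store = {}     # name -> (answer, expiry timestamp)
        self._clock = clock  # injectable for deterministic testing

    def put(self, name, answer, ttl_seconds):
        self._store[name] = (answer, self._clock() + ttl_seconds)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None           # cache miss: must query upstream
        answer, expiry = entry
        if self._clock() >= expiry:
            del self._store[name] # TTL expired: listing no longer trusted
            return None
        return answer

# A low TTL means a failover (new IP) propagates quickly:
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.put("api.example.com", "203.0.113.10", ttl_seconds=30)
assert cache.get("api.example.com") == "203.0.113.10"
now[0] = 31.0  # 31 seconds later: the 30s record has expired
assert cache.get("api.example.com") is None
```

This is also why “it works on my machine” DNS bugs happen: each layer (browser, OS, resolver) runs its own copy of this cache, and they expire at different moments.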
Interview: Users in Asia report your service is slow but users in the US are fine. How do you investigate?
Follow-up: We set up a CDN but API calls are still slow. The API response time is 50ms at the server but 400ms for Asia users.
25.2 HTTP, HTTPS, TLS
HTTP/1.1 limitations: One request-response at a time per TCP connection. Browsers work around this by opening 6-8 parallel connections per domain — wasteful (each needs TCP + TLS handshake). Head-of-line blocking: if one request is slow, all subsequent requests on that connection wait.

HTTP/2 improvements: Multiplexes multiple streams over a single TCP connection, solving HTTP-level HOL blocking. Binary framing (more efficient than text). Header compression (HPACK). The remaining problem: TCP-level HOL blocking — a single lost packet stalls ALL streams because TCP guarantees in-order delivery.

HTTP/3 (QUIC): Built on UDP instead of TCP. Each stream is independent — a lost packet in one stream does not block others. Faster connection setup (0-RTT resumption). Built-in TLS 1.3. When it matters most: Mobile/lossy networks where packet loss is common. For server-to-server on reliable networks, HTTP/2 is sufficient.

HTTP/2 vs HTTP/3 (QUIC): Concrete Comparison
| Feature | HTTP/2 | HTTP/3 (QUIC) |
|---|---|---|
| Transport | TCP | UDP (QUIC protocol) |
| Head-of-line blocking | TCP-level HOL remains (one lost packet stalls all streams) | Eliminated (streams are independent at the transport layer) |
| Connection setup | TCP handshake + TLS handshake (2-3 RTTs) | 1 RTT for new connections, 0-RTT for resumed connections |
| Packet loss impact | 2% packet loss can degrade throughput by 30-40% | 2% packet loss degrades only affected streams; others are unimpacted |
| Connection migration | Breaks on IP change (e.g., Wi-Fi to cellular) | Survives IP changes using connection IDs instead of IP tuples |
| Multiplexing | Application-level multiplexing over single TCP stream | True independent streams at transport level |
| Encryption | TLS 1.2 or 1.3 on top of TCP | TLS 1.3 built into the protocol (always encrypted) |
CORS (Cross-Origin Resource Sharing): Browsers block requests from frontend.com to api.example.com by default (same-origin policy). CORS headers tell the browser which cross-origin requests are allowed. Common mistakes: Access-Control-Allow-Origin: * with credentials (browsers reject this), not handling preflight OPTIONS requests, allowing all origins in production.
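The server-side logic behind those headers can be sketched in a few lines. The function names here are hypothetical, not a framework API; the header names are the standard CORS response headers.

```python
def cors_headers(request_origin, allowed_origins, allow_credentials=False):
    """Build CORS response headers for one request.

    Browsers reject 'Access-Control-Allow-Origin: *' when credentials
    are allowed, so a credentialed API must echo back a specific origin."""
    if request_origin not in allowed_origins:
        return {}  # no CORS headers -> the browser blocks the response
    headers = {
        "Access-Control-Allow-Origin": request_origin,
        "Vary": "Origin",  # responses differ by origin, so caches must too
    }
    if allow_credentials:
        headers["Access-Control-Allow-Credentials"] = "true"
    return headers

def preflight_headers(request_origin, allowed_origins):
    """Preflight OPTIONS requests must be answered, or the real
    request never fires."""
    headers = cors_headers(request_origin, allowed_origins)
    if headers:
        headers["Access-Control-Allow-Methods"] = "GET, POST, PUT, DELETE"
        headers["Access-Control-Allow-Headers"] = "Content-Type, Authorization"
    return headers

allowed = {"https://frontend.com"}
assert cors_headers("https://frontend.com", allowed,
                    allow_credentials=True)["Access-Control-Allow-Origin"] == "https://frontend.com"
assert cors_headers("https://evil.com", allowed) == {}
```

Note the `Vary: Origin` header: without it, a CDN could cache a response for one origin and serve it to another.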
25.3 Load Balancers
Distribute traffic across server instances. The foundation of horizontal scaling. Layer 4 (TCP): Routes based on IP/port. Does not inspect request content. Very fast. Cannot route on URL path or headers. Use for: non-HTTP protocols, maximum performance. Layer 7 (HTTP): Inspects requests. Routes on URL path (/api to backend, /static to CDN), headers, cookies. Performs SSL termination, response caching, compression. Slower but much more flexible.
L4 vs L7 Load Balancer Comparison
| Aspect | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Operates on | TCP/UDP packets (IP + port) | HTTP requests (URL, headers, cookies, body) |
| Speed | Faster (no deep packet inspection) | Slower (must parse application protocol) |
| SSL/TLS | Passes through (or terminates) | Terminates and re-encrypts (offloads from backends) |
| Routing intelligence | IP hash, round-robin, least connections | Path-based, header-based, cookie-based, content-aware |
| Visibility | Cannot see request content | Full request/response visibility |
| Use cases | Database connections (MySQL, PostgreSQL), non-HTTP protocols (gRPC passthrough, MQTT), extremely high-throughput scenarios | API gateways, microservice routing, A/B testing, canary deployments |
| Connection handling | One-to-one: client connection maps to backend connection | Can multiplex: many client requests across fewer backend connections |
| Examples | AWS NLB, HAProxy (TCP mode), MetalLB | AWS ALB, NGINX, HAProxy (HTTP mode), Envoy, Traefik |
Health checks: The load balancer periodically probes an endpoint such as /health on each server. Failed servers are removed from rotation. Configure: check interval, healthy/unhealthy thresholds, timeout.
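The threshold logic matters: a single failed probe should not eject a server, and a single passing probe should not re-add one. A minimal sketch of that state machine (class and field names are illustrative):

```python
class HealthTracker:
    """Track one backend's health-check state. A server is removed only
    after `unhealthy_threshold` consecutive failures, and re-added only
    after `healthy_threshold` consecutive successes."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.in_rotation = True
        self._passes = 0  # consecutive passing checks
        self._fails = 0   # consecutive failing checks

    def record(self, check_passed):
        if check_passed:
            self._passes += 1
            self._fails = 0
            if self._passes >= self.healthy_threshold:
                self.in_rotation = True
        else:
            self._fails += 1
            self._passes = 0
            if self._fails >= self.unhealthy_threshold:
                self.in_rotation = False

server = HealthTracker()
server.record(False); server.record(False)
assert server.in_rotation          # 2 failures: still below threshold of 3
server.record(False)
assert not server.in_rotation      # 3 consecutive failures: removed
server.record(True); server.record(True)
assert server.in_rotation          # 2 consecutive passes: back in rotation
```

The thresholds trade detection speed against flapping: low thresholds react fast but eject servers on transient blips.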
25.4 Reverse Proxies and API Gateways
Reverse proxy (NGINX, HAProxy, Envoy): SSL termination, static file serving, compression, request buffering, connection pooling to backends, load balancing. API Gateway (Kong, AWS API Gateway, Azure APIM): Reverse proxy + API features: authentication (JWT validation, API keys), rate limiting, request/response transformation, versioning routing, analytics, developer portal. What belongs where:
- Gateway: TLS termination, authentication (token validation), rate limiting, request logging, correlation ID injection, CORS
- Application: Authorization (business rules), input validation, business logic, database access
- Anti-pattern: The God Gateway — business logic in the gateway couples all services and creates a deployment bottleneck
25.5 Service Discovery
How services find each other in a dynamic environment where instances start, stop, and move constantly. Client-side discovery: The client queries a service registry (Consul, Eureka) to get a list of available instances, then chooses one (using round-robin, random, or least-connections). The client handles load balancing. Simpler infrastructure, but every client needs a discovery library. Server-side discovery: The client sends requests to a load balancer or proxy, which queries the registry and routes to a healthy instance. The client does not need to know about the registry. Kubernetes Services work this way —http://order-service:8080 resolves via kube-dns to a ClusterIP that load-balances across pods.
Client-Side vs Server-Side Discovery
| Aspect | Client-Side Discovery | Server-Side Discovery |
|---|---|---|
| How it works | Client queries registry directly, gets instance list, picks one | Client calls a load balancer/proxy; it queries the registry and routes |
| Load balancing | Done by the client (needs LB logic in every service) | Done by the proxy/LB (centralized) |
| Client complexity | Higher (needs discovery SDK/library) | Lower (just call a stable endpoint) |
| Infrastructure | Simpler (no extra proxy hop) | Requires a load balancer or service proxy |
| Network hops | Fewer (client connects directly to instance) | Extra hop through the proxy |
| Language support | Need SDK per language (Java, Go, Python, etc.) | Language-agnostic (any HTTP client works) |
| Examples | Netflix Eureka + Ribbon, HashiCorp Consul (client mode), gRPC client-side LB | Kubernetes Services + kube-proxy, AWS ALB + ECS service discovery, Consul Connect with Envoy |
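A client-side discovery client boils down to “fetch the instance list, then load-balance locally.” A minimal sketch, where the in-memory `registry` dict stands in for a real registry like Consul or Eureka (a real client would also cache and refresh the list):

```python
import itertools

class ClientSideDiscovery:
    """Sketch of client-side discovery: the client resolves a service
    name to instances and round-robins across them itself."""

    def __init__(self, registry):
        self._registry = registry  # service name -> list of "host:port"
        self._cursors = {}         # per-service round-robin state

    def pick(self, service):
        instances = self._registry[service]
        # Lazily create a round-robin cursor per service. Note: cycle()
        # snapshots the list, so a real client must rebuild the cursor
        # when the registry's instance list changes.
        cursor = self._cursors.setdefault(service, itertools.cycle(instances))
        return next(cursor)

registry = {"order-service": ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]}
client = ClientSideDiscovery(registry)
picks = [client.pick("order-service") for _ in range(4)]
assert picks == ["10.0.0.1:8080", "10.0.0.2:8080",
                 "10.0.0.3:8080", "10.0.0.1:8080"]
```

This is exactly the logic that must be re-implemented (or shipped as an SDK) in every client language, which is the main argument for server-side discovery.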
25.6 WebSockets and Real-Time Communication
WebSocket provides full-duplex communication over a single TCP connection. Unlike HTTP (request-response), WebSocket keeps the connection open for bidirectional messaging. When to use: Chat applications, live dashboards, collaborative editing, real-time notifications, gaming, live sports scores — any scenario where the server needs to push data to clients without polling.

Alternatives: Server-Sent Events (SSE) for one-way server-to-client streaming (simpler, works over HTTP, auto-reconnects). Long polling (client makes a request, server holds it until data is available — simple but inefficient). For most “real-time” dashboards that update every few seconds, SSE is simpler and sufficient. WebSocket is needed for true bidirectional, low-latency communication.

WebSockets vs SSE vs Long-Polling
| Aspect | WebSocket | Server-Sent Events (SSE) | Long-Polling |
|---|---|---|---|
| Direction | Full-duplex (bidirectional) | Server to client only | Simulated server push (client re-requests) |
| Protocol | Upgrades from HTTP to WS | Standard HTTP (text/event-stream) | Standard HTTP |
| Connection | Persistent, single TCP connection | Persistent HTTP connection | Repeated HTTP connections |
| Latency | Lowest (messages sent instantly in either direction) | Low (server pushes immediately) | Higher (each “push” requires a new request round-trip) |
| Auto-reconnect | Must implement manually | Built-in browser auto-reconnect | Built-in (client re-polls) |
| Binary data | Supported | Text only (Base64 encode for binary) | Supported |
| Browser support | All modern browsers | All modern browsers (not IE) | Universal |
| Proxy/firewall friendly | Can be blocked (non-HTTP after upgrade) | Excellent (plain HTTP) | Excellent (plain HTTP) |
| Scalability | Harder (stateful connections, need pub/sub) | Easier (HTTP infra, stateless reconnect) | Easiest to implement, worst at scale (high request volume) |
| Best for | Chat, gaming, collaborative editing, bidirectional streams | Live dashboards, notifications, news feeds, stock tickers | Legacy systems, simple notifications, low-frequency updates |
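Part of SSE's appeal is how simple the wire format is: plain text over HTTP. A sketch of formatting one `text/event-stream` message (field names `id:`, `event:`, `data:`, and `retry:` are the standard SSE fields; the function itself is illustrative):

```python
def sse_event(data, event=None, event_id=None, retry_ms=None):
    """Format one Server-Sent Events message: optional id/event/retry
    fields, one 'data:' line per line of payload, blank line terminator."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")       # lets clients resume after reconnect
    if event is not None:
        lines.append(f"event: {event}")       # named event type
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")    # client reconnect delay hint
    for chunk in str(data).splitlines() or [""]:
        lines.append(f"data: {chunk}")        # multi-line data = repeated data: lines
    return "\n".join(lines) + "\n\n"          # blank line ends the event

assert sse_event("price=101.5", event="tick", event_id="42") == \
    "id: 42\nevent: tick\ndata: price=101.5\n\n"
```

A server streams a sequence of these over a kept-open HTTP response; the browser's built-in `EventSource` handles parsing, reconnection, and resuming from the last seen `id`.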
Interview: Your service needs to handle WebSocket connections from 1M concurrent users. Walk me through the architecture.
user_id -> gateway_server in Redis or a similar fast store. When a targeted message needs to reach one user, look up their gateway and route directly instead of broadcasting to all gateways. This reduces fan-out dramatically for unicast messages.

Step 4 — Handling reconnections gracefully. Mobile users disconnect and reconnect constantly. Assign each connection a session ID. On reconnect, the gateway checks for buffered messages (stored briefly in Redis with a short TTL) and replays them. This prevents message loss during network blips without requiring persistent message storage.

Step 5 — Monitoring and autoscaling. Track connections per gateway, message throughput, memory usage, and fan-out ratio. Set autoscaling policies on connection count (not CPU, which will be misleadingly low). Alert on connection imbalance across gateways.

Common mistakes: Trying to use HTTP long-polling at this scale (connection overhead is 10x worse). Putting WebSocket servers behind an L7 ALB (adds latency and breaks upgrade headers if misconfigured). Ignoring the thundering herd problem — if a gateway crashes, 100K users reconnect simultaneously and can overwhelm other gateways. Mitigate with exponential backoff with jitter on the client side.

Part XIX — Deployment and Release Engineering
Deployment is the most dangerous thing you do regularly. Every outage postmortem starts with “we deployed…” The goal of deployment engineering is to make deployments boring — so routine and safe that nobody worries about them. The path to boring deployments: small changes, automated testing, gradual rollout, automated rollback, and the discipline to separate deployment (putting code on servers) from release (exposing code to users).

Chapter 26: Deployment Strategies
26.1 Rolling Deployment
Gradually replace old instances with new. Both versions run during rollout. Requires backward-compatible changes.

Interview: You deploy a new version and error rates spike to 15%. Walk me through your response.
Follow-up: The rollback fails because the deployment included a database migration that cannot be reversed. What now?
26.2 Blue-Green
Analogy: Blue-green deployment is like having two identical stages in a theater — you rehearse on one while the audience watches the other, then swap. The audience (your users) never sees the stagehands moving props around. If the new show has a problem mid-performance, you instantly swing the spotlight back to the original stage where the old show is still ready to go. The cost? You need two full stages (double the infrastructure), and both stages need to work with the same backstage systems (your database).

Two environments (Blue = current production, Green = new version). Deploy to Green, run smoke tests, switch traffic at the load balancer. Instant rollback: switch traffic back to Blue. The hard part — database migrations in blue-green: Both Blue and Green must work with the same database during the cutover. If Green requires a new column that Blue does not write to, or if Green removes a column that Blue still reads, the cutover breaks one of them. The pattern for safe blue-green with database migrations:
- Before cutover: Run expand-only migrations (add columns, add tables). Both Blue and Green can work with the expanded schema.
- Deploy Green: Green writes to both old and new columns. Green reads from new columns with fallback to old.
- Cutover: Switch traffic from Blue to Green.
- After cutover (days later): Run contract migrations (remove old columns, drop old tables) once Blue is no longer needed.
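The "Green writes to both" step looks like this in application code. A minimal sketch under stated assumptions: the column names (`name` vs. `first_name`/`last_name`) and the dict-as-database are hypothetical stand-ins for a real schema and ORM.

```python
def save_user(db, user_id, full_name):
    """Green's write path during cutover: write BOTH the old column
    (which Blue still reads) and the new columns (Green's target schema)."""
    first, _, last = full_name.partition(" ")
    db[user_id] = {
        "name": full_name,      # old column: keeps Blue working
        "first_name": first,    # new columns: expand-phase additions
        "last_name": last,
    }

def load_user_name(db, user_id):
    """Green's read path: prefer the new columns, fall back to the old
    one for rows written by Blue before the cutover."""
    row = db[user_id]
    if row.get("first_name"):
        return f"{row['first_name']} {row['last_name']}".strip()
    return row["name"]

db = {}
save_user(db, 1, "Ada Lovelace")
assert db[1]["name"] == "Ada Lovelace"          # Blue-compatible write
assert load_user_name(db, 1) == "Ada Lovelace"  # Green read path

# A legacy row written by Blue (old column only) still reads correctly:
db[2] = {"name": "Grace Hopper"}
assert load_user_name(db, 2) == "Grace Hopper"
```

Only after Blue is retired (the contract phase) can the dual-write and fallback code be deleted along with the old column.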
26.3 Canary
Route a small percentage of traffic to the new version. Monitor. Gradually increase. Catches issues under real load with limited blast radius. Canary rollout stages: 1% then monitor 5 minutes, then 5% then monitor 10 minutes, then 25% then monitor 15 minutes, then 50% then monitor 15 minutes, then 100%. Each stage compares canary metrics against baseline.

Automated rollback criteria: Roll back if any of: error rate > baseline + 1%, p99 latency > baseline x 1.5, business metric (orders/minute) drops > 5%, any Sev1 alert fires. Tools like Argo Rollouts and Flagger automate this: they compare canary pods against baseline pods using Prometheus metrics and automatically promote or roll back.

What makes canary better than blue-green: Canary catches issues that only appear under real production traffic patterns (specific user agents, geographic regions, data shapes). Blue-green catches issues in smoke tests, which are limited. Canary exposes fewer users to the risk (1% vs 100% during cutover).

Interview: You deploy a new version and 2% of users report errors. 98% are fine. The canary metrics look green. What could be wrong?
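The automated rollback criteria above translate directly into an analysis function like the one canary controllers evaluate each stage. A sketch with hypothetical metric field names; real tools pull these from Prometheus queries.

```python
def should_rollback(canary, baseline):
    """Evaluate canary metrics against baseline per the stated criteria.
    `canary`/`baseline` are snapshots with error_rate (fraction),
    p99_ms, orders_per_min, and sev1_alert (bool)."""
    if canary["error_rate"] > baseline["error_rate"] + 0.01:
        return True  # error rate exceeds baseline by more than 1 point
    if canary["p99_ms"] > baseline["p99_ms"] * 1.5:
        return True  # p99 latency regressed past 1.5x baseline
    if canary["orders_per_min"] < baseline["orders_per_min"] * 0.95:
        return True  # business metric dropped more than 5%
    return canary["sev1_alert"]  # any Sev1 alert is an instant rollback

baseline = {"error_rate": 0.002, "p99_ms": 200,
            "orders_per_min": 1000, "sev1_alert": False}
healthy = {"error_rate": 0.004, "p99_ms": 240,
           "orders_per_min": 990, "sev1_alert": False}
slow = dict(healthy, p99_ms=350)

assert not should_rollback(healthy, baseline)  # small drift: promote
assert should_rollback(slow, baseline)         # p99 350ms > 200ms * 1.5
```

Note the comparison is against a live baseline pool, not a fixed number: comparing canary pods against baseline pods of the old version controls for time-of-day traffic shifts.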
Deployment Strategy Selection Guide
| Factor | Rolling | Blue-Green | Canary |
|---|---|---|---|
| Risk level | Medium (both versions run, partial exposure) | Low (full smoke test before cutover, instant rollback) | Lowest (tiny % of traffic exposed initially) |
| Rollback speed | Slow (must re-deploy old version to remaining instances) | Instant (switch LB back to Blue) | Fast (route all traffic back to baseline) |
| Infrastructure cost | Low (no extra environments needed) | High (double the infrastructure during cutover) | Medium (baseline + small canary pool) |
| Complexity | Low (built into most orchestrators) | Medium (LB switching, environment management) | High (metrics comparison, automated promotion logic) |
| Downtime | Zero (if changes are backward-compatible) | Zero (traffic switch is atomic) | Zero (gradual shift) |
| Database migrations | Tricky (old and new code run simultaneously) | Hard (both environments share the DB) | Hard (canary and baseline share the DB) |
| Catches production-only bugs | Partially (issues surface as instances update) | No (smoke tests only, not real traffic) | Yes (real traffic at small scale) |
| Best for | Routine, low-risk changes; small teams; cost-sensitive environments | Major releases; compliance-heavy environments needing pre-cutover validation | High-traffic systems; changes with unknown risk; data-sensitive services |
| Team size to operate | Small | Medium (need to manage two environments) | Medium-Large (need observability and automation) |
| Tools | Kubernetes Deployment (default), ECS rolling update | Custom LB scripts, AWS Elastic Beanstalk swap, Kubernetes with Argo | Argo Rollouts, Flagger, Istio traffic splitting, LaunchDarkly |
Interview: Design a deployment strategy for a system that processes financial transactions. Downtime costs $100K/minute.
26.4 Feature Flags
Decouple deployment from release. Deploy hidden behind a flag. Enable for specific users/percentages. Instant disable without rollback. Flag types: Release flags (hide incomplete features until ready — short-lived, remove after launch). Experiment flags (A/B testing — measure impact, remove after decision). Ops flags (kill switches to disable features under load — long-lived). Permission flags (enable features for specific customers or tiers — long-lived).

The feature flag lifecycle: Create, then test in dev, then enable for internal users, then canary to 5%, then gradual rollout, then 100%, then remove the flag and dead code. The critical step most teams skip: removing flags after rollout. Stale flags accumulate as technical debt — code becomes littered with branching logic for flags that are always on. Set a cleanup deadline when creating every flag.

Flag evaluation architecture: Server-side evaluation (flag service returns the result — simpler, no SDK needed, but adds a network call). Client-side evaluation with cached rules (SDK downloads rules, evaluates locally — faster, works offline, but rules can be stale). For latency-sensitive paths, use client-side with a streaming update channel.

Feature Flag Best Practices
| Practice | Why It Matters |
|---|---|
| Set an expiry date on every release flag | Prevents stale flags from accumulating. Add a CI lint that warns on flags past their expiry. |
| Limit active flags per service | More than 10-15 active flags in one service creates a combinatorial testing nightmare. Track the count. |
| Always have a kill switch | Every new feature should be wrapped in a flag that can be turned off instantly — no deploy needed. |
| Flag cleanup sprints | Schedule regular cleanup (every 2-4 weeks). Remove flags that are 100% rolled out. Delete the dead code path. |
| Test both paths | Every flag creates two code paths. Unit tests must cover flag-on AND flag-off. CI should run tests with both states. |
| Avoid flag dependencies | Flag A should not depend on Flag B being enabled. If they do, document it and consider merging them. |
| Default to off for release flags | New release flags should default to disabled. This ensures a deploy without explicit enablement is a no-op. |
| Centralized flag dashboard | One place to see all active flags, their state, owner, expiry date, and percentage rollout. LaunchDarkly, Unleash, and Flagsmith provide this. |
| Audit log on flag changes | Every flag toggle should be logged with who, when, and why. This is essential for incident investigation. |
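Percentage rollouts hinge on one property: a given user must get the same answer on every request, or the feature flickers on and off for them. The standard approach is hashing, sketched here (the function is illustrative; real SDKs work the same way but add targeting rules):

```python
import hashlib

def flag_enabled(flag_name, user_id, rollout_percent):
    """Deterministic percentage rollout: hash (flag, user) into a 0-99
    bucket. The same user always lands in the same bucket for a given
    flag, and different flags bucket users independently."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Stable per user: same answer on every request at a given percentage.
assert flag_enabled("new-checkout", "user-42", 25) == \
       flag_enabled("new-checkout", "user-42", 25)

# 0% is always off and 100% is always on, for any user.
assert not flag_enabled("new-checkout", "user-42", 0)
assert flag_enabled("new-checkout", "user-42", 100)
```

Because buckets are stable, raising the rollout from 5% to 25% only adds users; nobody who already had the feature loses it mid-rollout.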
26.5 Graceful Shutdown and Connection Draining
Critical for zero-downtime deployments. When an instance is being replaced, it must finish in-flight requests before shutting down. The shutdown sequence:
- Instance receives SIGTERM.
- Stop accepting new requests (mark as not ready — fail health checks or deregister from service discovery).
- Load balancer stops routing new traffic (health check fails or deregistration propagates — this takes a few seconds).
- Drain in-flight requests — wait for all currently processing requests to complete (drain period).
- Close resources — close database connections, flush logs and metrics, finish writing to message queues.
- Timeout guard — if draining has not completed within the grace period, log the remaining requests and exit.
- Process exits cleanly (exit code 0).
- If the process has not exited, the orchestrator sends SIGKILL (non-catchable, immediate termination).
Concrete Graceful Shutdown Sequence
In Kubernetes, set terminationGracePeriodSeconds (default 30s — increase for long-running requests). Use a preStop hook to add a small delay (5-10 seconds) so the load balancer has time to deregister the pod before it stops accepting traffic. Your application must handle SIGTERM and stop accepting new connections while completing in-flight work.
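The application side of that sequence can be sketched as follows. This is a minimal illustration, not a production server: the class, its fields, and the direct handler call (simulating SIGTERM delivery) are all hypothetical.

```python
import signal
import threading

class GracefulServer:
    """Sketch of graceful shutdown: on SIGTERM, fail readiness checks
    (so the LB deregisters us), wait for in-flight work to drain,
    and never wait past the grace period."""

    def __init__(self):
        self.ready = True            # what a /ready endpoint would report
        self.in_flight = 0           # currently processing requests
        self._drained = threading.Event()
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.ready = False           # stop accepting new work immediately
        if self.in_flight == 0:
            self._drained.set()      # nothing to drain

    def request_done(self):
        self.in_flight -= 1
        if not self.ready and self.in_flight == 0:
            self._drained.set()      # last in-flight request finished

    def wait_for_drain(self, grace_seconds=30):
        # Timeout guard: returns False if draining exceeded the grace period.
        return self._drained.wait(timeout=grace_seconds)

server = GracefulServer()
server.in_flight = 1
server._on_sigterm(signal.SIGTERM, None)       # simulate SIGTERM delivery
assert not server.ready                        # readiness now fails
server.request_done()                          # in-flight request completes
assert server.wait_for_drain(grace_seconds=1)  # drained before the timeout
```

The grace period here must be shorter than the orchestrator's (e.g., terminationGracePeriodSeconds), so the process can log the stuck requests and exit before SIGKILL arrives.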
26.6 Database Migration Safety
Never make breaking schema changes in the same deploy that changes application code. Use expand-and-contract. Test on production-sized data.

26.7 CI/CD Pipeline Design
A well-designed CI/CD pipeline is the foundation of reliable software delivery. It automates the path from code commit to production deployment. CI pipeline stages: Lint (catch style and syntax issues instantly). Unit tests (fast, run on every commit). Build (compile, bundle, create artifacts). Integration tests (run against real dependencies). Security scanning (dependency vulnerabilities, static analysis). Artifact publishing (Docker image, npm package, JAR).

CD pipeline stages: Deploy to staging (automatic on merge to main). Smoke tests (verify deployment health). Deploy to production (manual approval or automatic based on confidence). Post-deployment verification (health checks, error rate monitoring). Automatic rollback on failure.

Pipeline principles: Keep pipelines fast (under 10 minutes for CI, under 30 minutes for full deploy). Fail fast (run the quickest checks first). Make pipelines reproducible (same commit always produces the same artifact). Cache dependencies aggressively (npm install should not download the internet every run). Pipeline-as-code (Jenkinsfile, GitHub Actions YAML — versioned alongside application code).

CI/CD Pipeline Best Practices
| Practice | Details |
|---|---|
| Fast feedback first | Order stages by speed: lint (seconds) then unit tests (1-2 min) then build (2-3 min) then integration tests (5-10 min). A developer should know about a lint failure in under 60 seconds, not after a 10-minute build. |
| Parallel stages | Run independent stages concurrently. Lint, unit tests, and security scans can all run at the same time. Only sequential dependencies (build must finish before integration tests) should be serialized. |
| Artifact promotion | Build the artifact (Docker image, binary) exactly once. Promote the same artifact from staging to production. Never rebuild for production — rebuilds can produce different results (floating dependency versions, different build environments). |
| Immutable artifacts | Tag every artifact with the git SHA. myapp:abc123def — not myapp:latest. This guarantees you can always trace production back to a specific commit. |
| Environment parity | Staging should mirror production as closely as possible: same OS, same runtime version, same resource limits. Differences between staging and production are a top source of “works in staging, breaks in prod.” |
| Pipeline as code | Store pipeline definitions (GitHub Actions YAML, Jenkinsfile, .gitlab-ci.yml) in the same repo as the application. Changes to the pipeline go through the same PR review as application code. |
| Secrets management | Never hardcode secrets in pipeline files. Use the CI platform’s secret store (GitHub Secrets, GitLab CI Variables). Rotate secrets regularly. Audit access. |
| Flaky test quarantine | A flaky test that fails 5% of the time wastes enormous developer time. Quarantine flaky tests to a non-blocking stage, fix them, then move them back. Never let flaky tests erode trust in the pipeline. |
| Deployment windows | Avoid deploying on Fridays, before holidays, or during peak traffic. Automate this with deployment freeze windows in your CD tool. |
| Rollback automation | The deploy pipeline should include a one-click (or automated) rollback. If post-deployment health checks fail, the previous artifact is automatically re-deployed. |
Curated Resources
Networking Deep Dives
- Cloudflare Learning Center — Arguably the best free resource for understanding DNS, CDN, DDoS protection, SSL/TLS, and networking fundamentals. Each topic gets a clear, illustrated explanation with real-world context. Start with “What is DNS?” and “What is a CDN?” then explore DDoS attack types and mitigation strategies.
- Julia Evans’ Networking Zines — Julia Evans creates visual, hand-drawn explanations of networking concepts that make complex topics click instantly. Her zines on DNS, HTTP, TCP, and networking tools (dig, curl, tcpdump) are some of the best learning materials in existence. The visual format encodes information differently than text and helps concepts stick. Highly recommended for both beginners and experienced engineers who want to solidify mental models.
- QUIC Protocol and HTTP/3 — Cloudflare’s Explanation — Cloudflare’s writeup on HTTP/3 and the QUIC protocol is the clearest explanation of why HTTP/3 moved from TCP to UDP, how QUIC eliminates head-of-line blocking, and what 0-RTT connection resumption means in practice. For the formal specification, see RFC 9000 (QUIC Transport Protocol) and RFC 9114 (HTTP/3).
- AWS Well-Architected Framework — Networking Pillar — AWS’s opinionated guide to networking architecture in the cloud. Covers VPC design, subnet strategies, load balancing, DNS, CDN (CloudFront), and hybrid connectivity. Even if you do not use AWS, the architectural patterns (public/private subnet separation, NAT gateways, transit gateways) apply universally.
Deployment and Release Engineering
- Netflix Tech Blog — Deployment and Delivery — Netflix has published extensively on their deployment infrastructure, including Spinnaker (their open-source continuous delivery platform), Kayenta (automated canary analysis), and their philosophy on progressive delivery. Key posts to read: “Automated Canary Analysis at Netflix with Kayenta” and “Full Cycle Developers at Netflix.” These are not theoretical — they describe systems handling 250M+ subscribers.
- Google SRE Book — Chapter on Release Engineering — Free online. Google’s chapter on release engineering describes how they manage deployments across a codebase with billions of lines of code and tens of thousands of engineers. Covers hermetic builds, release branches, cherry-picks, and the philosophy that release engineering is a distinct engineering discipline, not a side task for developers.
- Charity Majors’ Blog on Progressive Delivery — Charity Majors (co-founder of Honeycomb, former infrastructure engineer at Facebook and Parse) writes some of the most incisive content on observability, deployment, and engineering culture. Her posts on testing in production, progressive delivery, and the relationship between deploy frequency and reliability are essential reading. She challenges conventional wisdom with data and experience.
- LaunchDarkly Blog — Feature Flag Best Practices — LaunchDarkly is the leading feature flag platform, and their blog is a comprehensive resource on feature flag lifecycle management, progressive delivery patterns, experimentation, and the organizational practices that make feature flags sustainable rather than technical debt. Particularly valuable: their guides on flag cleanup, testing strategies for flagged code, and the distinction between release flags, experiment flags, and operational flags.