Building production-ready Go applications requires more than just functional code. This chapter covers the “last mile” that separates a working prototype from a service you can deploy with confidence: structured logging, configuration management, graceful shutdown, health checks, metrics, containerization, and security hardening. These are the concerns that distinguish senior engineers from junior ones — and they are where Go truly shines as an operations-friendly language.
Graceful shutdown is the difference between “we deployed with zero downtime” and “some requests got 502 errors during the deploy.” When your service receives SIGTERM (which Kubernetes sends before killing a pod), it needs to stop accepting new requests, wait for in-flight requests to complete, close database connections cleanly, and then exit. Here is the standard pattern:
    package main

    import (
        "context"
        "errors"
        "fmt"
        "log/slog"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    // LoadConfig, initDatabase, initRedis, and setupRoutes are defined elsewhere in the service.
    func main() {
        cfg := LoadConfig()

        // Initialize dependencies. They are closed explicitly, in order, at the
        // end of main (not with defer), so each one is closed exactly once.
        db := initDatabase(cfg.Database)
        cache := initRedis(cfg.Redis)

        // Create server
        srv := &http.Server{
            Addr:         fmt.Sprintf("%s:%d", cfg.Server.Host, cfg.Server.Port),
            Handler:      setupRoutes(db, cache),
            ReadTimeout:  cfg.Server.ReadTimeout,
            WriteTimeout: cfg.Server.WriteTimeout,
        }

        // Start server in goroutine
        go func() {
            slog.Info("Server starting", "addr", srv.Addr)
            // ListenAndServe returns http.ErrServerClosed once Shutdown is called.
            if err := srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
                slog.Error("Server error", "error", err)
                os.Exit(1)
            }
        }()

        // Wait for interrupt signal
        quit := make(chan os.Signal, 1)
        signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
        <-quit
        slog.Info("Shutting down server...")

        // Graceful shutdown with timeout
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()

        // Shutdown order matters!
        // 1. Stop accepting new requests and wait for in-flight ones
        if err := srv.Shutdown(ctx); err != nil {
            slog.Error("Server shutdown error", "error", err)
        }

        // 2. Close database connections
        if err := db.Close(); err != nil {
            slog.Error("Database close error", "error", err)
        }

        // 3. Close cache connections
        if err := cache.Close(); err != nil {
            slog.Error("Cache close error", "error", err)
        }

        slog.Info("Server stopped")
    }
Design the graceful shutdown sequence for a Go service that has an HTTP server, a database connection pool, a Redis client, and background worker goroutines. What is the correct order and why?
Strong Answer:
The shutdown order must be the reverse of the dependency order. You shut down consumers before producers, and application-level resources before infrastructure-level resources.
Step 1: Stop accepting new HTTP connections by calling srv.Shutdown(ctx). This closes the listeners (so new connection attempts are refused rather than answered), closes idle connections, and waits for in-flight requests to complete (up to the context timeout). This is the first step because you want to stop new work from arriving.
Step 2: Signal background workers to stop. Cancel their context or close their input channel. Wait for them to finish (with a timeout). This is second because workers might have in-progress database writes that need to complete.
Step 3: Close the database connection pool with db.Close(). This prevents new queries from starting, waits for queries already running on the server to finish, and then closes all connections. This must happen after workers finish because workers might be using database connections.
Step 4: Close the Redis client. Same reasoning — it must outlive anything that uses it.
Step 5: Flush logs and metrics. This is last because you want to capture logs from all the shutdown steps above.
The timeout is critical. Each step should have a deadline. If a step hangs (a worker is stuck on a blocking call), the overall shutdown context should expire and force exit. A typical total shutdown timeout is 30 seconds.
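In Go, the shutdown signal comes from signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM). Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds (default 30s), then sends SIGKILL. Your shutdown must complete within that window, so set the application’s own shutdown timeout a few seconds shorter than the grace period, or raise the grace period to match.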
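The ordering described in this answer can be sketched as a list of named steps run under one shared deadline. This is a minimal illustration, not a definitive implementation: the step type, runShutdown, waitWorkers, and demo are hypothetical names, and the HTTP, database, and Redis steps are stand-in closures where a real service would call srv.Shutdown(ctx), db.Close(), and the Redis client's Close.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// step is one named stage of the shutdown sequence (hypothetical helper type).
type step struct {
	name string
	fn   func(ctx context.Context) error
}

// runShutdown executes steps in order under one shared deadline and
// returns the names of the steps that completed without error.
func runShutdown(timeout time.Duration, steps []step) []string {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var done []string
	for _, s := range steps {
		if err := s.fn(ctx); err != nil {
			fmt.Printf("shutdown step %q failed: %v\n", s.name, err)
			continue // keep going so later resources are still released
		}
		done = append(done, s.name)
	}
	return done
}

// waitWorkers cancels the workers' context and waits for them to exit,
// giving up when the shutdown deadline expires.
func waitWorkers(ctx context.Context, stop context.CancelFunc, wg *sync.WaitGroup) error {
	stop()
	finished := make(chan struct{})
	go func() {
		wg.Wait()
		close(finished)
	}()
	select {
	case <-finished:
		return nil
	case <-ctx.Done():
		return errors.New("workers did not stop before the deadline")
	}
}

// demo wires stand-in resources together in the order from the answer above.
func demo() []string {
	workerCtx, stopWorkers := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-workerCtx.Done() // a worker that exits promptly when cancelled
	}()

	return runShutdown(2*time.Second, []step{
		{"http", func(ctx context.Context) error { return nil }}, // stand-in for srv.Shutdown(ctx)
		{"workers", func(ctx context.Context) error { return waitWorkers(ctx, stopWorkers, &wg) }},
		{"database", func(ctx context.Context) error { return nil }}, // stand-in for db.Close()
		{"redis", func(ctx context.Context) error { return nil }},    // stand-in for cache.Close()
	})
}

func main() {
	fmt.Println(demo())
}
```

A failed step is logged and skipped rather than aborting the sequence, on the assumption that closing the remaining resources is still worthwhile when one stage times out.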
Follow-up: What is the difference between liveness and readiness probes in Kubernetes, and how do you implement them in Go?
Liveness answers “is the process alive and not deadlocked?” A simple 200 OK response is sufficient. If liveness fails, Kubernetes restarts the pod. Readiness answers “can this instance handle traffic?” It should check all dependencies: database reachable, Redis connected, background workers running. If readiness fails, Kubernetes removes the pod from the service’s endpoints but does NOT restart it. This is the correct behavior when, for example, the database is temporarily down, because restarting the pod would not help. In Go, liveness is a handler that always returns 200. Readiness is a handler that pings the database and Redis with a short timeout (2-3 seconds). During graceful shutdown, set readiness to fail first (so the load balancer stops sending traffic), then proceed with shutdown.
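A minimal sketch of the two probe handlers follows. The names livez, readyz, pinger, shuttingDown, fakeDep, and demo are illustrative assumptions. *sql.DB satisfies the pinger interface through its PingContext method; a Redis client would likely need a small adapter. The demo uses fake dependencies and in-memory requests so it runs without a database.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// pinger is the only capability readiness needs from a dependency.
type pinger interface {
	PingContext(ctx context.Context) error
}

// shuttingDown is set at the start of graceful shutdown so readiness fails first.
var shuttingDown atomic.Bool

// livez always returns 200: it only proves the process can serve a request.
func livez(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readyz returns 200 only when the service is not draining and every
// dependency answers a ping within a short timeout.
func readyz(deps map[string]pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for name, dep := range deps {
			if err := dep.PingContext(ctx); err != nil {
				http.Error(w, name+" unavailable", http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// fakeDep is a stand-in dependency used only for the demo.
type fakeDep struct{ err error }

func (f fakeDep) PingContext(context.Context) error { return f.err }

// status runs a handler against an in-memory request and returns the status code.
func status(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

// demo returns the status codes for: liveness, ready, dependency down, draining.
func demo() []int {
	shuttingDown.Store(false)
	live := status(http.HandlerFunc(livez))
	ready := status(readyz(map[string]pinger{"database": fakeDep{}}))
	down := status(readyz(map[string]pinger{"database": fakeDep{err: errors.New("down")}}))
	shuttingDown.Store(true) // first action of graceful shutdown
	draining := status(readyz(map[string]pinger{"database": fakeDep{}}))
	return []int{live, ready, down, draining}
}

func main() {
	fmt.Println(demo())
}
```

In a real service the two handlers would be registered on paths of your choosing and referenced from the pod's livenessProbe and readinessProbe.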
You are setting up structured logging for a Go microservice. Compare slog (standard library) versus zap (uber). What are the trade-offs, and how do you ensure request IDs appear in every log line?
Strong Answer:
slog (Go 1.21+) is the standard library’s structured logging package. Advantages: zero dependencies, covered by the Go 1 compatibility promise, good enough performance for most services, and a standard slog.Handler interface that allows pluggable backends. Disadvantages: slower than zap for very high-throughput logging, fewer features out of the box.
zap (from Uber) is a high-performance structured logger. Advantages: roughly 3-5x faster than slog for JSON encoding in zap’s own published benchmarks (it uses a zero-allocation encoder), has both a typed logger (zap.Logger) and a sugared logger (zap.SugaredLogger) for convenience, and has extensive middleware integrations. Disadvantages: external dependency, more complex API, and if your service is not CPU-bound on logging, the performance difference is irrelevant.
My recommendation: use slog for new projects unless profiling shows logging is a bottleneck (rare). Use zap if you are in a team that already uses it or if you genuinely need the maximum throughput (services logging 100K+ lines per second).
For request IDs in every log line: create a logging middleware that generates or extracts a request ID, creates a child logger with the request ID as a field (slog.With("request_id", id)), and stores it in the context using context.WithValue. Every function that logs retrieves the logger from the context. This way, every log line for a request automatically includes the request ID without any manual effort at each log call site.
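The middleware described above might look like the following sketch. The X-Request-ID header name is a common convention assumed here, and requestLogger, loggerFrom, newRequestID, and loggedRequestID are hypothetical names. The demo function sends one in-memory request and reads the request_id field back out of the emitted JSON log line.

```go
package main

import (
	"bytes"
	"context"
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
)

// ctxKey is an unexported type so no other package can collide with this context key.
type ctxKey struct{}

// loggerFrom returns the request-scoped logger, or the default logger if none is set.
func loggerFrom(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(ctxKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}

// newRequestID generates a random 16-character hex ID.
func newRequestID() string {
	b := make([]byte, 8)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// requestLogger extracts or generates a request ID, attaches it to a child
// logger, and stores that logger in the request context.
func requestLogger(base *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID") // assumed header name
		if id == "" {
			id = newRequestID()
		}
		logger := base.With("request_id", id)
		ctx := context.WithValue(r.Context(), ctxKey{}, logger)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// loggedRequestID runs one request through the middleware and returns the
// request_id field found in the resulting log line.
func loggedRequestID() string {
	var buf bytes.Buffer
	base := slog.New(slog.NewJSONHandler(&buf, nil))

	handler := requestLogger(base, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// The call site never mentions the request ID; it comes from the context logger.
		loggerFrom(r.Context()).Info("order created", "order_id", 42)
	}))

	req := httptest.NewRequest(http.MethodGet, "/orders", nil)
	req.Header.Set("X-Request-ID", "abc-123")
	handler.ServeHTTP(httptest.NewRecorder(), req)

	var line map[string]any
	if err := json.Unmarshal(buf.Bytes(), &line); err != nil {
		return ""
	}
	id, _ := line["request_id"].(string)
	return id
}

func main() {
	fmt.Println(loggedRequestID())
}
```

Falling back to slog.Default() in loggerFrom keeps code paths that run outside a request (startup, background jobs) from panicking on a missing logger.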
Follow-up: How do you handle log levels in production versus development, and what should you log at each level?
In development, use DEBUG level with a human-readable text format (colorized, with source file/line). In production, use INFO level with JSON format (for log aggregation systems like ELK, Datadog, Splunk).
DEBUG: detailed internal state, variable values, loop iterations; enable only during active investigation.
INFO: request received/completed, service started/stopped, configuration loaded, significant state changes.
WARN: recoverable errors, degraded performance, deprecated usage, retry attempts.
ERROR: failures that affect the user, such as failed requests, database errors, and unrecoverable states. Never use ERROR for expected conditions (like 404 Not Found).
The key principle: if someone pages you at 3 AM, ERROR logs should tell them what went wrong. If they need to investigate deeper, INFO and DEBUG logs provide the trail. In production, dynamically changing log levels without restart (via an admin endpoint or environment variable reload) is invaluable for debugging live issues.
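One way to get both the per-environment format and a runtime-adjustable level with slog is a shared slog.LevelVar, sketched below. The newLogger function, the "development" environment string, and demo are assumptions for illustration. Note that slog's built-in text handler is not colorized; that would need a third-party handler.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log/slog"
)

// level is shared by the handler and can be changed while the service runs,
// for example from an admin endpoint. Its zero value is INFO.
var level = new(slog.LevelVar)

// newLogger picks the format and starting level from the environment name.
func newLogger(env string, w io.Writer) *slog.Logger {
	opts := &slog.HandlerOptions{Level: level}
	if env == "development" {
		level.Set(slog.LevelDebug)
		opts.AddSource = true // include source file and line
		return slog.New(slog.NewTextHandler(w, opts))
	}
	level.Set(slog.LevelInfo)
	return slog.New(slog.NewJSONHandler(w, opts))
}

// demo returns the number of bytes logged before and after raising verbosity.
func demo() (before, after int) {
	var buf bytes.Buffer
	logger := newLogger("production", &buf)

	logger.Debug("cache miss", "key", "user:1") // suppressed at INFO
	before = buf.Len()

	level.Set(slog.LevelDebug) // what an admin endpoint handler would do
	logger.Debug("cache miss", "key", "user:1")
	after = buf.Len()
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println("bytes logged before:", before, "after:", after)
}
```

Because the handler holds a reference to the LevelVar rather than a fixed level, the change takes effect on the next log call without rebuilding the logger.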
You are containerizing a Go service with Docker. Walk me through the multi-stage Dockerfile, explain each decision, and describe how you would optimize the image size.
Strong Answer:
Stage 1 (builder): Use golang:1.21-alpine as the build image. Copy go.mod and go.sum first, then RUN go mod download — this caches dependencies as a Docker layer so they are not re-downloaded on every code change. Then copy the source code and build with CGO_ENABLED=0 GOOS=linux go build -ldflags="-w -s" -o /app/server ./cmd/server. The CGO_ENABLED=0 ensures a statically linked binary with no C dependencies. The -ldflags="-w -s" strips debug info for smaller binaries.
Stage 2 (runtime): Use alpine:3.18 (about 5MB) or scratch (0 bytes) as the runtime image. Copy only the compiled binary from the builder stage. Add ca-certificates if making HTTPS calls and tzdata for timezone support. Create a non-root user and run the binary as that user.
Image size optimization: the final image is typically 10-20MB (alpine + binary) versus 800MB+ if you used the golang image as the runtime. With scratch, it can be under 10MB, but you lose a shell for debugging, DNS resolution from libc (though Go’s pure-Go resolver works), and the ability to exec into the container.
Additional optimizations: use .dockerignore to exclude tests, docs, and development files from the build context (speeds up docker build). Inject version info via build args and ldflags. Add a HEALTHCHECK instruction so Docker (and Docker Compose) can monitor the container’s health natively; Kubernetes ignores this instruction and relies on its own probes.
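Version injection through ldflags relies on the linker's -X flag, which overwrites a package-level string variable at build time. A minimal sketch, assuming the variable is named version and lives in package main:

```go
package main

import "fmt"

// version is overridden at build time, for example:
//
//	go build -ldflags="-X main.version=1.4.2" -o /app/server ./cmd/server
//
// Without the flag it keeps the default below.
var version = "dev"

// versionString formats the value for logs or a version endpoint.
func versionString() string {
	return "server " + version
}

func main() {
	fmt.Println(versionString())
}
```

The -X flag only works on string variables, so a build timestamp or commit hash has to be passed as a string as well.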
Security: never run as root in production. Use USER appuser in the Dockerfile. Do not include secrets in the image — pass them via environment variables or secret management at runtime.
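Putting the stages described in this answer together, a Dockerfile might look like the sketch below. The image tags and build command come from the answer; the VERSION build arg, the main.version variable, the appuser UID, port 8080, and the /livez path are assumptions for illustration.

```dockerfile
# Stage 1: build a statically linked binary.
FROM golang:1.21-alpine AS builder
WORKDIR /src

# Copy module files first so the dependency layer is cached across code changes.
COPY go.mod go.sum ./
RUN go mod download

COPY . .
# VERSION is a hypothetical build arg; main.version is an assumed string variable.
ARG VERSION=dev
RUN CGO_ENABLED=0 GOOS=linux go build \
    -ldflags="-w -s -X main.version=${VERSION}" \
    -o /app/server ./cmd/server

# Stage 2: minimal runtime image containing only what the binary needs.
FROM alpine:3.18
RUN apk add --no-cache ca-certificates tzdata \
    && adduser -D -u 10001 appuser
COPY --from=builder /app/server /app/server

# Run as a non-root user.
USER appuser
EXPOSE 8080

# Used by Docker and Docker Compose; the port and path are assumptions.
HEALTHCHECK --interval=30s --timeout=3s \
    CMD wget -q --spider http://127.0.0.1:8080/livez || exit 1

ENTRYPOINT ["/app/server"]
```

Switching the runtime stage to scratch would mean copying the CA bundle and zoneinfo from the builder and dropping the wget-based health check, since scratch has no shell or utilities.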
Follow-up: In Kubernetes, what resource requests and limits would you set for a Go service, and what happens if you set them wrong?
CPU requests should match your typical usage (say 100m for a light service, 500m for compute-heavy). Memory requests should match the steady-state heap plus the Go runtime overhead (128Mi to 512Mi for most services). Limits should be 2-3x the requests to handle bursts. If memory limits are too low, the OOM killer terminates the pod with no warning. If CPU limits are too low, the pod gets CPU throttled, causing latency spikes. A critical Go-specific detail: Go’s garbage collector is CPU-intensive, and CPU throttling can cause GC pauses to stretch from milliseconds to seconds. Some teams set no CPU limit at all (only requests) and rely on cluster autoscaling, because CPU throttling harms Go services disproportionately. For memory, always set a limit, but set it high enough to account for GC overhead: with the default GOGC=100, the heap can grow to roughly 2x the live data before a GC cycle completes.
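One way to make the Go runtime aware of the container's memory limit is the soft memory limit added in Go 1.19 (the GOMEMLIMIT environment variable, or debug.SetMemoryLimit in code). The sketch below is an illustration under assumptions, not a recommendation from the original answer: softLimit is a hypothetical helper, and the 512Mi limit and 10% headroom are example values.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// softLimit returns the soft memory limit to hand to the Go runtime:
// the container limit minus a percentage of headroom for non-heap memory.
func softLimit(containerLimitBytes, headroomPercent int64) int64 {
	return containerLimitBytes * (100 - headroomPercent) / 100
}

func main() {
	const containerLimit = 512 << 20 // assumed: a 512Mi Kubernetes memory limit

	limit := softLimit(containerLimit, 10)
	debug.SetMemoryLimit(limit) // same effect as setting GOMEMLIMIT

	// GOMAXPROCS(0) reads the current value without changing it. Comparing it
	// with the pod's CPU limit shows whether the runtime is scheduling more
	// threads than the container is allowed to run.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println("soft memory limit (bytes):", limit)
}
```

The limit is soft: the garbage collector works harder as the heap approaches it, but the runtime can still exceed it, so the Kubernetes memory limit remains the hard backstop.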