Performance Optimization in Go
Go is designed for performance, but writing fast Go code requires understanding the runtime, memory model, and profiling tools. This chapter covers practical techniques for optimizing Go applications. The cardinal rule of performance optimization: measure before you optimize. The Go toolchain ships with profiling tools built in: pprof for CPU and memory profiles, benchmarks with testing.B, and the trace tool for visualizing goroutine scheduling. Use them. The bottleneck is almost never where you think it is.
Profiling with pprof
Go’s built-in profiler helps identify performance bottlenecks. Think of pprof as a doctor’s diagnostic toolkit for your program: CPU profiling is like a heart monitor (showing where your program spends its time), memory profiling is like a blood test (revealing what is consuming resources), and goroutine profiling is like an X-ray (showing what is stuck and where).
CPU Profiling
Memory Profiling
HTTP pprof Server
- `http://localhost:6060/debug/pprof/profile`: CPU profile
- `http://localhost:6060/debug/pprof/heap`: Memory profile
- `http://localhost:6060/debug/pprof/goroutine`: Goroutine stacks
- `http://localhost:6060/debug/pprof/block`: Blocking profile
- `http://localhost:6060/debug/pprof/mutex`: Mutex contention
Analyzing Profiles
Benchmarking
Writing Benchmarks
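A benchmark is a function taking `*testing.B` that runs the code under test `b.N` times. The sketch below compares two hypothetical integer-formatting helpers (`formatSprintf` and `formatItoa` are assumptions chosen for illustration). Benchmarks normally live in a `_test.go` file and run with `go test -bench=.`; here `testing.Benchmark` drives them from `main` so the example is self-contained.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// formatSprintf and formatItoa are two hypothetical implementations to compare.
func formatSprintf(n int) string { return fmt.Sprintf("%d", n) }
func formatItoa(n int) string    { return strconv.Itoa(n) }

// sink is package-level so the compiler cannot discard the results.
var sink string

func BenchmarkSprintf(b *testing.B) {
	b.ReportAllocs() // also record allocations per operation
	for i := 0; i < b.N; i++ {
		sink = formatSprintf(i)
	}
}

func BenchmarkItoa(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = formatItoa(i)
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of `go test`.
	r1 := testing.Benchmark(BenchmarkSprintf)
	r2 := testing.Benchmark(BenchmarkItoa)
	fmt.Println("Sprintf:", r1.String(), r1.MemString())
	fmt.Println("Itoa:   ", r2.String(), r2.MemString())
}
```

Absolute numbers depend on the machine, so compare results from the same host and run each benchmark several times before drawing conclusions.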
Memory Benchmarks
Running Benchmarks
Memory Optimization
Understanding Escape Analysis
Escape analysis is the compiler’s decision about whether a variable lives on the stack (fast, automatic cleanup) or the heap (slower, requires garbage collection). Think of it as the compiler asking: “Can this variable’s lifetime be guaranteed to end when the function returns?” If yes, it stays on the stack. If any reference to it escapes the function (via a pointer return, a closure capture, or passing to an interface), it must live on the heap where the garbage collector can manage it.
Reducing Allocations
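One way to reduce allocations is to keep values from escaping in the first place. The sketch below contrasts returning a pointer with returning a value and measures heap allocations using `testing.AllocsPerRun`. The `point` type and helper names are illustrative assumptions, and `//go:noinline` is used only so the compiler cannot optimize the difference away in this small program.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// Package-level sinks keep the compiler from discarding the calls.
var (
	ptrSink *point
	valSink point
)

// newPointPtr returns a pointer to a local variable, so the value must
// outlive the call and is allocated on the heap.
//
//go:noinline
func newPointPtr(x, y int) *point {
	p := point{x, y}
	return &p
}

// newPointVal returns a copy, so nothing escapes and no heap allocation
// is needed.
//
//go:noinline
func newPointVal(x, y int) point {
	return point{x, y}
}

// allocsPerCall reports the average number of heap allocations per call of f.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(1000, f)
}

func main() {
	ptr := allocsPerCall(func() { ptrSink = newPointPtr(1, 2) })
	val := allocsPerCall(func() { valSink = newPointVal(1, 2) })
	fmt.Printf("pointer return: %.0f allocs/op\n", ptr)
	fmt.Printf("value return:   %.0f allocs/op\n", val)
}
```

Building the same file with `go build -gcflags="-m"` prints the compiler's own escape decisions, which is the quickest way to confirm why a given allocation happens.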
String Concatenation
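Repeated `+=` on a string copies the accumulated contents on every iteration, while `strings.Builder` appends into one growing buffer. The sketch below shows both forms; the function names are hypothetical and the comparison is illustrative rather than a measured result.

```go
package main

import (
	"fmt"
	"strings"
)

// joinNaive builds the result with +=, which copies the whole string on
// every iteration (quadratic work in the total length).
func joinNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// joinBuilder appends into a single buffer sized up front.
func joinBuilder(parts []string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}

	var sb strings.Builder
	sb.Grow(total) // reserve the final size so appends do not reallocate
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

func main() {
	parts := []string{"go", "-", "is", "-", "fast"}
	fmt.Println(joinBuilder(parts))                      // go-is-fast
	fmt.Println(joinNaive(parts) == joinBuilder(parts)) // true
}
```

For a handful of short strings the difference is negligible; the builder pays off in loops that concatenate many pieces.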
Struct Field Alignment
CPUs read memory in aligned chunks (typically 8 bytes on 64-bit systems). When struct fields are not aligned to their natural boundaries, the compiler inserts padding bytes. By ordering fields from largest to smallest, you minimize wasted padding. This can matter significantly when you have millions of structs in memory.
Concurrency Optimization
Goroutine Pool
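A goroutine pool bounds concurrency by running a fixed number of workers that pull jobs from a channel, instead of starting one goroutine per job. The following is one common shape of that pattern, written as a sketch; `runPool` and its signature are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool processes jobs with a fixed number of worker goroutines and
// returns the results in input order. workers must be at least 1.
func runPool(workers int, jobs []int, fn func(int) int) []int {
	results := make([]int, len(jobs))
	indexes := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range indexes {
				// Each index is handed to exactly one worker, so writes
				// to results never overlap.
				results[i] = fn(jobs[i])
			}
		}()
	}

	for i := range jobs {
		indexes <- i
	}
	close(indexes) // lets the workers' range loops finish
	wg.Wait()
	return results
}

func main() {
	squares := runPool(4, []int{1, 2, 3, 4, 5}, func(n int) int { return n * n })
	fmt.Println(squares) // [1 4 9 16 25]
}
```

Sizing the pool near the number of CPU cores suits CPU-bound work; I/O-bound work usually tolerates a larger pool.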
Reducing Lock Contention
Atomic Operations
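For a simple shared counter, an atomic add avoids taking a mutex on every increment. The sketch below uses `sync/atomic`; the function name `countConcurrently` is a hypothetical helper for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countConcurrently increments a shared counter from several goroutines
// using atomic adds instead of a mutex.
func countConcurrently(goroutines, perGoroutine int) int64 {
	var counter int64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				atomic.AddInt64(&counter, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&counter)
}

func main() {
	fmt.Println(countConcurrently(8, 10_000)) // 80000
}
```

Atomics fit single-word state such as counters and flags. Once several fields must change together, a mutex is usually the clearer choice.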
I/O Optimization
Buffered I/O
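Buffered I/O batches many small writes into fewer large ones, which reduces the number of system calls when the destination is a file or socket. The sketch below makes the effect visible with a hypothetical `countingWriter` that stands in for such a destination and counts `Write` calls.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter counts Write calls. It stands in for a file or socket,
// where each Write would be a system call.
type countingWriter struct{ calls int }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	return len(p), nil
}

// writeLines writes n short lines to w.
func writeLines(w io.Writer, n int) error {
	for i := 0; i < n; i++ {
		if _, err := fmt.Fprintf(w, "line %d\n", i); err != nil {
			return err
		}
	}
	return nil
}

// compare returns the number of underlying Write calls without and with bufio.
func compare(n int) (unbuffered, buffered int, err error) {
	var direct countingWriter
	if err = writeLines(&direct, n); err != nil {
		return 0, 0, err
	}

	var viaBuf countingWriter
	bw := bufio.NewWriter(&viaBuf) // default buffer size is 4096 bytes
	if err = writeLines(bw, n); err != nil {
		return 0, 0, err
	}
	// Flush is required: without it the buffered tail is never written.
	if err = bw.Flush(); err != nil {
		return 0, 0, err
	}
	return direct.calls, viaBuf.calls, nil
}

func main() {
	unbuffered, buffered, err := compare(1000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("unbuffered writes: %d, buffered writes: %d\n", unbuffered, buffered)
}
```

The same idea applies to reads through `bufio.Reader` or `bufio.Scanner`.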
Connection Pooling
JSON Optimization
Standard Library Tips
Using Faster JSON Libraries
Code Generation
Compiler Optimizations
Inlining
Bounds Check Elimination
Common Anti-Patterns
Defer in Hot Loops
Interface Conversions
Profiling Checklist
- Identify hotspots with CPU profiling
- Check memory allocations with heap profiling
- Find goroutine leaks with goroutine profiling
- Detect lock contention with mutex profiling
- Analyze blocking with block profiling
Interview Questions
How do you identify memory leaks in Go?
- Use the `pprof` heap profile to see allocations
- Check goroutine count over time (`runtime.NumGoroutine()`)
- Monitor process memory with external tools
- Look for growing maps, slices, or channels
- Check for goroutines blocked forever
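The goroutine-count check in the list above can be sketched with `runtime.NumGoroutine()`. The helpers `startBlocked` and `goroutineDelta` are hypothetical names; the blocked goroutines simulate a leak and are released at the end so the program exits cleanly.

```go
package main

import (
	"fmt"
	"runtime"
)

// startBlocked launches n goroutines that block until release is closed.
// If release is never closed, they are leaked.
func startBlocked(n int, release <-chan struct{}) {
	for i := 0; i < n; i++ {
		go func() { <-release }()
	}
}

// goroutineDelta reports how many goroutines were added while f ran.
func goroutineDelta(f func()) int {
	before := runtime.NumGoroutine()
	f()
	return runtime.NumGoroutine() - before
}

func main() {
	release := make(chan struct{})
	delta := goroutineDelta(func() { startBlocked(100, release) })
	fmt.Println("goroutines added:", delta)
	close(release) // unblock them; a real leak never reaches this point
}
```

In a running service, exporting this count as a metric and watching for steady growth under constant load is a common way to spot leaks early.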
What's escape analysis and why does it matter?
Escape analysis determines whether a variable can stay on the stack or must escape to the heap. Stack allocation is faster and doesn’t require garbage collection. Use `go build -gcflags="-m"` to see escape decisions.
How do you reduce GC pressure?
- Reduce allocations (pre-allocate slices, use sync.Pool)
- Avoid creating many short-lived objects
- Use value types instead of pointers when possible
- Batch operations to amortize allocation cost
- Consider `GOGC` tuning for specific workloads
When should you use sync.Pool?
Use sync.Pool for:
- Frequently allocated/deallocated objects
- Objects with predictable lifecycle
- Buffers, temporary structs, connection wrappers
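A minimal sketch of the buffer use case, assuming a pooled `bytes.Buffer`; the names `bufPool` and `greet` are illustrative. The important details are resetting the buffer before use and copying the result out before the buffer goes back into the pool.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers. New is called when the pool is empty.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// greet builds a string using a pooled buffer.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // clear whatever the previous user left behind
	defer bufPool.Put(buf) // return the buffer for reuse

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result stays valid after Put
}

func main() {
	fmt.Println(greet("gopher")) // hello, gopher
	fmt.Println(greet("pool"))   // hello, pool
}
```

Whether a pool actually helps depends on allocation rate, so it is worth confirming the gain with a benchmark before adopting it.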
Summary
| Technique | When to Use |
|---|---|
| CPU Profiling | Identify slow functions |
| Memory Profiling | Find allocation hotspots |
| Benchmarking | Measure and compare performance |
| sync.Pool | Reduce GC pressure for temp objects |
| Sharding | Reduce lock contention |
| Buffered I/O | Reduce system calls |
| Pre-allocation | Avoid slice/map growth |
| Atomic Operations | Simple concurrent counters |
Interview Deep-Dive
Walk me through how you would diagnose a Go service that is using 10x more memory than expected. What tools do you use and in what order?
Strong Answer:
- Step 1: Check `runtime.MemStats` to understand the breakdown: `HeapAlloc` (bytes of allocated heap objects) and `HeapSys` (heap memory obtained from the OS). Also check the goroutine count with `runtime.NumGoroutine()`, which is a separate function rather than a `MemStats` field. If the count is abnormally high (thousands when you expect hundreds), you likely have a goroutine leak rather than an oversized data structure.
- Step 2: Take a heap profile using pprof. If the service exposes the pprof HTTP endpoint: `go tool pprof http://localhost:6060/debug/pprof/heap`. The default shows `inuse_space`, memory currently allocated and not freed. Switch to `alloc_space` (for example with `-sample_index=alloc_space`) to see cumulative allocations, which helps find functions that allocate heavily even if the memory is later freed.
- Step 3: Use the pprof interactive mode: `top10` to see which functions hold the most memory, `list functionName` to see line-level allocations, and the web interface (`-http=:8080`) for flame graphs.
- Common findings: a slice that grows unboundedly (like an in-memory cache without eviction), a sub-slice of a large slice keeping the entire backing array alive, goroutine leaks (each goroutine holds 2KB+ of stack), closures capturing large objects and preventing garbage collection, and `sync.Pool` misuse where objects are not being returned.
- Step 4: If the heap profile looks clean but RSS is high, compare `HeapReleased` with `HeapSys`. Go’s garbage collector frees memory logically but may not return it to the OS immediately. Use `debug.FreeOSMemory()` in a test to force it, or set `MADV_DONTNEED` via `GODEBUG=madvdontneed=1` (the default on Linux since Go 1.16).
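The first and last steps above can be sketched as a small snapshot helper built on `runtime.ReadMemStats` and `runtime.NumGoroutine`. The `memSnapshot` type and `takeSnapshot` function are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// memSnapshot holds the handful of numbers used in the first triage step.
type memSnapshot struct {
	HeapAlloc    uint64 // bytes of allocated heap objects
	HeapSys      uint64 // bytes of heap memory obtained from the OS
	HeapReleased uint64 // bytes of heap memory returned to the OS
	Goroutines   int    // current number of goroutines
}

// takeSnapshot reads the runtime's memory statistics and goroutine count.
func takeSnapshot() memSnapshot {
	var m runtime.MemStats
	// ReadMemStats briefly stops the world, so avoid calling it in hot paths.
	runtime.ReadMemStats(&m)
	return memSnapshot{
		HeapAlloc:    m.HeapAlloc,
		HeapSys:      m.HeapSys,
		HeapReleased: m.HeapReleased,
		Goroutines:   runtime.NumGoroutine(),
	}
}

func main() {
	s := takeSnapshot()
	fmt.Printf("HeapAlloc=%d KiB HeapSys=%d KiB HeapReleased=%d KiB goroutines=%d\n",
		s.HeapAlloc/1024, s.HeapSys/1024, s.HeapReleased/1024, s.Goroutines)
}
```

Logging such a snapshot periodically gives a cheap trend line; the heap profile then explains where the growth comes from.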
Use `go build -gcflags="-m"` to see escape decisions. Variables escape when: their address is returned from a function, they are assigned to an interface, they are captured by a closure that outlives the function, or the compiler cannot prove they do not escape. The optimization strategy: when a hot function is allocating heavily on the heap, check escape analysis output. Sometimes a small refactoring (returning a value instead of a pointer, avoiding an interface conversion, or pre-sizing a slice) keeps the allocation on the stack and removes that source of GC pressure. But do not optimize blindly: profile first to find the actual hot paths.
Explain struct field alignment in Go. When does it matter, and how much memory can you save?
Strong Answer:
- CPUs read memory in aligned chunks (typically 8 bytes on 64-bit systems). When a struct field is not naturally aligned, the compiler inserts padding bytes. By ordering fields from largest to smallest, you minimize padding.
- Concrete example: `struct { a bool; b int64; c bool }` is 24 bytes (1 + 7 padding + 8 + 1 + 7 padding). Reordered as `struct { b int64; a bool; c bool }`, it is 16 bytes (8 + 1 + 1 + 6 padding). That is 33% less memory per struct.
- When it matters: if you have a slice of 10 million of these structs, the difference is 80MB (240MB vs 160MB). For hot structs in high-throughput code, struct alignment is a meaningful optimization. For config structs or one-off objects, readability matters more than layout.
- Use the `fieldalignment` analyzer from `golang.org/x/tools` to detect suboptimal layouts automatically. It is a separate analyzer rather than a built-in `go vet` flag. Some teams include this in CI.
- A related optimization: for boolean flags, consider using a single `uint8` with bitwise operations instead of multiple `bool` fields, each of which can waste up to 7 bytes of padding when it sits next to an 8-byte-aligned field.
`sync.Pool` is a cache of temporary objects that can be reused to avoid repeated allocation and deallocation. You put objects back when done, and get them later instead of allocating new ones. The garbage collector may drop pooled objects (since Go 1.13 an unused object typically survives one GC cycle in a victim cache and is freed on the next), so it is not a general-purpose cache; it exists specifically to reduce allocation rate. The main gotcha: you MUST reset the object’s state before putting it back. If you put a `bytes.Buffer` back without calling `Reset()`, the next user gets stale data. Second gotcha: pooled items can be collected during GC, so do not rely on objects being there; always provide a `New` function. Third: pools are per-P (per logical processor), so there is some overhead in the cross-P stealing mechanism. Use pools for buffers, temporary structs, and frequently allocated objects in hot paths. Do not use them for objects with complex initialization, long-lived objects, or objects where identity matters.
Your Go service handles 50,000 requests per second and you need to reduce p99 latency from 200ms to 50ms. What is your approach?
Strong Answer:
- First, measure: take a CPU profile during peak traffic. The flame graph shows where time is spent. At 50K RPS, even a 1ms optimization saves 50 CPU-seconds per second.
- Common latency sources at this scale: GC pauses (check with `GODEBUG=gctrace=1`), lock contention (take a mutex profile), database queries (add query timing logs), and serialization (JSON encoding/decoding).
- If GC is the bottleneck: reduce allocation rate. Pre-allocate slices, use `sync.Pool` for temporary buffers, use `strings.Builder` instead of string concatenation, and consider using `GOGC` to tune GC frequency (higher GOGC means fewer GC cycles but more memory usage).
- If lock contention is the bottleneck: shard your data structures. Instead of one `sync.RWMutex` protecting a single map, use 256 shards each with their own mutex. With evenly distributed keys, this cuts contention on any single lock by up to roughly 256x. Use `atomic` operations for simple counters instead of mutex-protected variables.
- If database queries are the bottleneck: add caching (Redis) for hot reads, use connection pooling properly, batch writes, and consider read replicas.
- If JSON serialization is the bottleneck: switch to a faster JSON library (`json-iterator/go` or `easyjson` with code generation), or switch the internal protocol to protobuf for service-to-service calls.
- The overall approach: profile, identify the top bottleneck, fix it, re-profile. Repeat. Never optimize based on assumptions.
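The sharding idea from the lock-contention step can be sketched as follows. `ShardedCounter` is a hypothetical type, FNV-1a is just one reasonable choice of hash, and a plain `sync.Mutex` is used per shard for brevity.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 256

type shard struct {
	mu sync.Mutex
	m  map[string]int
}

// ShardedCounter spreads keys over independent shards so that goroutines
// touching different keys rarely contend on the same mutex.
type ShardedCounter struct {
	shards [shardCount]shard
}

func NewShardedCounter() *ShardedCounter {
	c := &ShardedCounter{}
	for i := range c.shards {
		c.shards[i].m = make(map[string]int)
	}
	return c
}

// shardFor picks a shard by hashing the key.
func (c *ShardedCounter) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return &c.shards[h.Sum32()%shardCount]
}

func (c *ShardedCounter) Inc(key string) {
	s := c.shardFor(key)
	s.mu.Lock()
	s.m[key]++
	s.mu.Unlock()
}

func (c *ShardedCounter) Get(key string) int {
	s := c.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.m[key]
}

func main() {
	c := NewShardedCounter()
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Inc(fmt.Sprintf("key-%d", i%10))
			}
		}()
	}
	wg.Wait()
	fmt.Println(c.Get("key-0")) // 800
}
```

The benefit depends on key distribution: if most traffic hits one hot key, every request still lands on the same shard.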
GOGC controls the GC target: it sets the percentage of new heap allocations relative to live heap before triggering a GC. Default is 100, meaning GC triggers when new allocations equal the live heap size. Setting GOGC=200 means GC triggers less frequently (when new allocations are 2x the live heap), reducing CPU spent on GC at the cost of a larger peak heap (roughly 3x the live heap instead of 2x). Setting GOGC=50 triggers more frequently, using less memory but more CPU. For latency-sensitive services, a higher GOGC can reduce the frequency of GC pauses. Go 1.19 also introduced GOMEMLIMIT, which sets a soft memory limit: the runtime can use memory freely while below the limit and runs the GC more aggressively as total memory approaches it.
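Both settings can also be applied at run time through `runtime/debug`, which mirrors the `GOGC` and `GOMEMLIMIT` environment variables. The sketch below assumes Go 1.19 or later; the helper `tuneGC` and the chosen values (200 and 1 GiB) are illustrative, not recommendations.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// tuneGC applies a GC percent and a soft memory limit, returning the
// previous settings so the caller can restore them.
func tuneGC(gcPercent int, memLimitBytes int64) (prevPercent int, prevLimit int64) {
	prevPercent = debug.SetGCPercent(gcPercent)     // same knob as GOGC
	prevLimit = debug.SetMemoryLimit(memLimitBytes) // same knob as GOMEMLIMIT (Go 1.19+)
	return prevPercent, prevLimit
}

func main() {
	// Hypothetical values: collect less often, but stay under about 1 GiB.
	prevPercent, prevLimit := tuneGC(200, 1<<30)
	fmt.Printf("previous GOGC=%d, previous memory limit=%d bytes\n", prevPercent, prevLimit)

	// Restore the earlier settings.
	tuneGC(prevPercent, prevLimit)
}
```

Pairing a higher GC percent with a memory limit is a common way to trade memory for fewer collections while keeping a ceiling on total usage.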