Performance Optimization in Go

Go is designed for performance, but writing fast Go code requires understanding the runtime, memory model, and profiling tools. This chapter covers practical techniques for optimizing Go applications. The cardinal rule of performance optimization: measure before you optimize. The Go toolchain provides world-class profiling tools built right in — pprof for CPU and memory profiles, benchmarks with testing.B, and the trace tool for visualizing goroutine scheduling. Use them. The bottleneck is almost never where you think it is.

Profiling with pprof

Go’s built-in profiler helps identify performance bottlenecks. Think of pprof as a doctor’s diagnostic toolkit for your program: CPU profiling is like a heart monitor (showing where your program spends its time), memory profiling is like a blood test (revealing what is consuming resources), and goroutine profiling is like an X-ray (showing what is stuck and where).

CPU Profiling

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // Create CPU profile
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()
    
    // Your application code
    runApplication()
}

Memory Profiling

func main() {
    // At the end of your program
    f, err := os.Create("mem.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    
    runtime.GC() // Get up-to-date statistics
    if err := pprof.WriteHeapProfile(f); err != nil {
        log.Fatal(err)
    }
}

HTTP pprof Server

import (
    "log"
    "net/http"
    _ "net/http/pprof" // Registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // pprof endpoints automatically registered
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    
    // Your application
    runApplication()
}
Access profiles at:
  • http://localhost:6060/debug/pprof/profile - CPU profile
  • http://localhost:6060/debug/pprof/heap - Memory profile
  • http://localhost:6060/debug/pprof/goroutine - Goroutine stacks
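  • http://localhost:6060/debug/pprof/block - Blocking profile (empty unless runtime.SetBlockProfileRate has been called)
  • http://localhost:6060/debug/pprof/mutex - Mutex contention (empty unless runtime.SetMutexProfileFraction has been called)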
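
The block and mutex profiles only collect samples once sampling is switched on in the runtime. The following is a minimal sketch of doing that at startup; the helper name enableContentionProfiling and the chosen rates are illustrative assumptions, not recommendations from the original text.

```go
package main

import (
	"fmt"
	"runtime"
)

// enableContentionProfiling (hypothetical helper) turns on block and mutex
// sampling and returns the previous mutex fraction so it can be restored.
func enableContentionProfiling(blockRateNs, mutexFraction int) (prevMutex int) {
	// Aim to sample one blocking event per blockRateNs nanoseconds spent blocked.
	runtime.SetBlockProfileRate(blockRateNs)
	// Report on average 1/mutexFraction of mutex contention events.
	return runtime.SetMutexProfileFraction(mutexFraction)
}

func main() {
	prev := enableContentionProfiling(10_000, 100)
	fmt.Println("previous mutex fraction:", prev)
	// ... start the pprof HTTP server and the application here ...
}
```

Sampling has a cost, so lower rates (larger values) are usually a reasonable starting point for production services.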

Analyzing Profiles

# Interactive analysis
go tool pprof cpu.prof

# Top functions by CPU
(pprof) top10

# View function details
(pprof) list myFunction

# Generate flame graph
go tool pprof -http=:8080 cpu.prof

# Compare profiles
go tool pprof -base old.prof new.prof

Benchmarking

Writing Benchmarks

func BenchmarkFibonacci(b *testing.B) {
    for i := 0; i < b.N; i++ {
        Fibonacci(20)
    }
}

func BenchmarkFibonacciParallel(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            Fibonacci(20)
        }
    })
}

// Benchmark with different inputs
func BenchmarkSort(b *testing.B) {
    sizes := []int{100, 1000, 10000}
    for _, size := range sizes {
        b.Run(fmt.Sprintf("size-%d", size), func(b *testing.B) {
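            data := generateRandomSlice(size)
            buf := make([]int, size)
            b.ResetTimer() // Don't count setup
            for i := 0; i < b.N; i++ {
                copy(buf, data) // Restore unsorted input; otherwise later iterations sort sorted data
                sort.Ints(buf)
            }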
        })
    }
}

Memory Benchmarks

var sink []byte // Package-level sink keeps the allocation from being optimized onto the stack

func BenchmarkAllocations(b *testing.B) {
    b.ReportAllocs() // Report memory allocations
    for i := 0; i < b.N; i++ {
        sink = make([]byte, 1024)
    }
}

Running Benchmarks

# Run all benchmarks
go test -bench=.

# Run specific benchmark
go test -bench=BenchmarkFibonacci

# Include memory stats
go test -bench=. -benchmem

# Run for specific duration
go test -bench=. -benchtime=5s

# Compare benchmarks
go install golang.org/x/perf/cmd/benchstat@latest
go test -bench=. -count=10 > old.txt
# Make changes
go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt

Memory Optimization

Understanding Escape Analysis

Escape analysis is the compiler’s decision about whether a variable lives on the stack (fast, automatic cleanup) or the heap (slower, requires garbage collection). Think of it as the compiler asking: “Can this variable’s lifetime be guaranteed to end when the function returns?” If yes, it stays on the stack. If any reference to it escapes the function (via a pointer return, a closure capture, or passing to an interface), it must live on the heap where the garbage collector can manage it.
// Stack allocation (fast, no GC pressure)
func stackAlloc() int {
    x := 42  // Stays on stack -- value is copied on return
    return x
}

// Heap allocation (slower, adds GC pressure)
func heapAlloc() *int {
    x := 42   // Escapes to heap -- compiler cannot guarantee lifetime
    return &x // Pointer outlives the function, so x must survive on the heap
}

// Check escape analysis decisions:
// go build -gcflags="-m" ./...
// Look for "escapes to heap" in the output
Pitfall: Accidental Heap Escapes. Converting a local value to an interface{} parameter often forces a heap allocation, because the compiler generally cannot prove what the callee does with the boxed value. This is why fmt.Println(x) typically makes x escape: Println accepts ...interface{}. In hot paths, this matters. fmt.Fprintf and fmt.Sprintf take the same interface{} arguments, so switching to them does not help; prefer the strconv functions (for example strconv.AppendInt) or avoid fmt entirely in performance-critical code.
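
To make the pitfall above concrete, here is a small sketch that compares formatting through fmt with appending through strconv, using testing.AllocsPerRun to count allocations. The function names formatWithFmt and appendID are hypothetical, and the exact allocation counts depend on the Go version, so none are asserted here.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// formatWithFmt builds "id=<n>" through fmt; n is passed as an interface value.
func formatWithFmt(n int) string {
	return fmt.Sprintf("id=%d", n)
}

// appendID writes "id=<n>" into dst using strconv, with no interface conversion.
func appendID(dst []byte, n int) []byte {
	dst = append(dst, "id="...)
	return strconv.AppendInt(dst, int64(n), 10)
}

func main() {
	buf := make([]byte, 0, 32) // scratch buffer reused across calls

	fmtAllocs := testing.AllocsPerRun(1000, func() { _ = formatWithFmt(123456) })
	appendAllocs := testing.AllocsPerRun(1000, func() { buf = appendID(buf[:0], 123456) })

	fmt.Printf("fmt.Sprintf:       %.0f allocs/op\n", fmtAllocs)
	fmt.Printf("strconv.AppendInt: %.0f allocs/op\n", appendAllocs)
}
```

The append-style version only stays allocation-free while the output fits in the reused buffer, so size the buffer for the largest expected value.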

Reducing Allocations

// ❌ Bad: Allocates on each call
func processItems(items []Item) []Result {
    results := make([]Result, 0)  // Allocates, may grow
    for _, item := range items {
        results = append(results, process(item))
    }
    return results
}

// ✅ Good: Pre-allocate
func processItems(items []Item) []Result {
    results := make([]Result, 0, len(items))  // Pre-allocate capacity
    for _, item := range items {
        results = append(results, process(item))
    }
    return results
}

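// ✅ Sometimes better: reuse scratch space with sync.Pool
// Store a pointer to the slice: putting a bare slice into the pool
// allocates when the slice header is boxed into interface{}.
var resultPool = sync.Pool{
    New: func() interface{} {
        s := make([]Result, 0, 100)
        return &s
    },
}

func processItems(items []Item) []Result {
    sp := resultPool.Get().(*[]Result)
    results := (*sp)[:0]  // Reset length, keep capacity
    defer func() {
        *sp = results     // Keep any capacity gained by append
        resultPool.Put(sp)
    }()
    
    for _, item := range items {
        results = append(results, process(item))
    }
    
    // The pooled slice must not leave this function, so copy the output.
    // This copy allocates too: pooling only pays off when the scratch
    // space is consumed locally. Benchmark before adopting it.
    out := make([]Result, len(results))
    copy(out, results)
    return out
}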

String Concatenation

// ❌ Bad: O(n²) bytes copied (each += allocates a new string and copies the old one)
func buildString(parts []string) string {
    result := ""
    for _, part := range parts {
        result += part  // Creates new string each time
    }
    return result
}

// ✅ Good: strings.Builder
func buildString(parts []string) string {
    var builder strings.Builder
    builder.Grow(estimateSize(parts))  // Pre-allocate
    for _, part := range parts {
        builder.WriteString(part)
    }
    return builder.String()
}

// ✅ Also good for simple cases
func buildString(parts []string) string {
    return strings.Join(parts, "")
}
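
The strings.Builder version above calls estimateSize, which the text does not define. One plausible sketch, assuming the goal is simply the total byte length of all parts, looks like this:

```go
package main

import (
	"fmt"
	"strings"
)

// estimateSize (assumed helper) returns the exact number of bytes
// needed to hold the concatenation of parts.
func estimateSize(parts []string) int {
	n := 0
	for _, p := range parts {
		n += len(p)
	}
	return n
}

// buildString concatenates parts after growing the builder once up front.
func buildString(parts []string) string {
	var b strings.Builder
	b.Grow(estimateSize(parts))
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	fmt.Println(buildString([]string{"go", "-", "fast"})) // prints "go-fast"
}
```

Because the size is exact, the builder never has to regrow; a cheaper rough estimate would also work, at the cost of an occasional extra allocation.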

Struct Field Alignment

CPUs read memory in aligned chunks (typically 8 bytes on 64-bit systems). When struct fields are not aligned to their natural boundaries, the compiler inserts padding bytes. By ordering fields from largest to smallest, you minimize wasted padding — this can matter significantly when you have millions of structs in memory.
// Bad: 24 bytes (with padding)
type BadStruct struct {
    a bool    // 1 byte + 7 padding (to align the next int64)
    b int64   // 8 bytes
    c bool    // 1 byte + 7 padding (to round up struct size)
}

// Good: 16 bytes (minimal padding)
type GoodStruct struct {
    b int64   // 8 bytes
    a bool    // 1 byte
    c bool    // 1 byte + 6 padding (only at the end)
}

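// Check with the fieldalignment analyzer from golang.org/x/tools (it is not a go vet flag):
// go install golang.org/x/tools/go/analysis/passes/fieldalignment/cmd/fieldalignment@latest
// fieldalignment ./...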
When struct alignment matters: If you have a slice of 10 million structs, the difference between 24 bytes and 16 bytes per struct is 80MB of memory. For hot structs in high-throughput code, this optimization is worth applying. For config structs or one-off objects, readability matters more than layout.
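
The sizes quoted above can be checked directly with unsafe.Sizeof. This sketch redeclares the two structs from the example; the helper names badSize and goodSize are illustrative, and the numbers assume a 64-bit platform such as amd64 or arm64.

```go
package main

import (
	"fmt"
	"unsafe"
)

// BadStruct interleaves small and large fields, which forces padding.
type BadStruct struct {
	a bool
	b int64
	c bool
}

// GoodStruct places the 8-byte field first so the bools share one slot.
type GoodStruct struct {
	b int64
	a bool
	c bool
}

func badSize() uintptr  { return unsafe.Sizeof(BadStruct{}) }
func goodSize() uintptr { return unsafe.Sizeof(GoodStruct{}) }

func main() {
	// On 64-bit platforms this prints: 24 16
	fmt.Println(badSize(), goodSize())

	const n = 10_000_000
	saved := (badSize() - goodSize()) * n
	fmt.Printf("saved for %d structs: %d MB\n", n, saved/1_000_000)
}
```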

Concurrency Optimization

Goroutine Pool

type Pool struct {
    work chan func()
    wg   sync.WaitGroup
}

func NewPool(size int) *Pool {
    p := &Pool{
        work: make(chan func(), size*2),
    }
    
    for i := 0; i < size; i++ {
        go p.worker()
    }
    
    return p
}

func (p *Pool) worker() {
    for fn := range p.work {
        fn()
        p.wg.Done()
    }
}

func (p *Pool) Submit(fn func()) {
    p.wg.Add(1)
    p.work <- fn
}

func (p *Pool) Wait() {
    p.wg.Wait()
}

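// Close stops the workers. Call it once, after the last Submit:
// submitting to a closed pool panics.
func (p *Pool) Close() {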
    close(p.work)
}

Reducing Lock Contention

// ❌ Bad: Single lock, high contention
type Cache struct {
    mu    sync.RWMutex
    items map[string]interface{}
}

// ✅ Good: Sharded cache
type ShardedCache struct {
    shards    [256]*shard
    shardMask uint8
}

type shard struct {
    mu    sync.RWMutex
    items map[string]interface{}
}

func NewShardedCache() *ShardedCache {
    c := &ShardedCache{shardMask: 255}
    for i := range c.shards {
        c.shards[i] = &shard{items: make(map[string]interface{})}
    }
    return c
}

func (c *ShardedCache) getShard(key string) *shard {
    hash := fnv32(key)
    return c.shards[hash&uint32(c.shardMask)]
}

func (c *ShardedCache) Get(key string) (interface{}, bool) {
    shard := c.getShard(key)
    shard.mu.RLock()
    defer shard.mu.RUnlock()
    val, ok := shard.items[key]
    return val, ok
}

func (c *ShardedCache) Set(key string, value interface{}) {
    shard := c.getShard(key)
    shard.mu.Lock()
    defer shard.mu.Unlock()
    shard.items[key] = value
}
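
The getShard method above relies on an fnv32 function that is not shown. A minimal sketch, assuming it is meant to be the 32-bit FNV-1a hash, is given below; the shardIndex helper is hypothetical and only mirrors the masking done in getShard.

```go
package main

import "fmt"

const (
	fnvOffset32 = 2166136261 // FNV-1a 32-bit offset basis
	fnvPrime32  = 16777619   // FNV 32-bit prime
)

// fnv32 computes the 32-bit FNV-1a hash of key without allocating.
func fnv32(key string) uint32 {
	h := uint32(fnvOffset32)
	for i := 0; i < len(key); i++ {
		h ^= uint32(key[i])
		h *= fnvPrime32
	}
	return h
}

// shardIndex maps a key to one of 256 shards by masking the low 8 bits.
func shardIndex(key string) uint32 {
	return fnv32(key) & 255
}

func main() {
	for _, k := range []string{"user:1", "user:2", "session:abc"} {
		fmt.Printf("%-12s -> shard %d\n", k, shardIndex(k))
	}
}
```

The standard library's hash/fnv package produces the same values; the hand-rolled loop is used here only because it avoids converting the key to a byte slice on every lookup.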

Atomic Operations

import "sync/atomic"

// ❌ Mutex for simple counters
type Counter struct {
    mu    sync.Mutex
    value int64
}

func (c *Counter) Increment() {
    c.mu.Lock()
    c.value++
    c.mu.Unlock()
}

// ✅ Atomic for simple counters
type AtomicCounter struct {
    value atomic.Int64
}

func (c *AtomicCounter) Increment() {
    c.value.Add(1)
}

func (c *AtomicCounter) Value() int64 {
    return c.value.Load()
}

I/O Optimization

Buffered I/O

// ❌ Bad: Unbuffered writes
func writeLines(filename string, lines []string) error {
    f, err := os.Create(filename)
    if err != nil {
        return err
    }
    defer f.Close()
    
    for _, line := range lines {
        f.WriteString(line + "\n")  // Many small writes
    }
    return nil
}

// ✅ Good: Buffered writes
func writeLines(filename string, lines []string) error {
    f, err := os.Create(filename)
    if err != nil {
        return err
    }
    defer f.Close()
    
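    w := bufio.NewWriter(f)
    for _, line := range lines {
        // bufio.Writer errors are sticky: once a write fails, later writes
        // and Flush return the same error, so checking Flush is enough.
        w.WriteString(line)
        w.WriteByte('\n')
    }
    return w.Flush() // A deferred Flush would silently drop this error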
}

Connection Pooling

// HTTP client with connection pooling
var httpClient = &http.Client{
    Timeout: 30 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     90 * time.Second,
        DisableCompression:  false,
    },
}

// Reuse client across requests
func fetchURL(url string) ([]byte, error) {
    resp, err := httpClient.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}

JSON Optimization

Standard Library Tips

// Pre-allocate buffer for encoding
var bufferPool = sync.Pool{
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}

func encodeJSON(v interface{}) ([]byte, error) {
    buf := bufferPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer bufferPool.Put(buf)
    
    encoder := json.NewEncoder(buf)
    if err := encoder.Encode(v); err != nil {
        return nil, err
    }
    
    result := make([]byte, buf.Len())
    copy(result, buf.Bytes())
    return result, nil
}

Using Faster JSON Libraries

import jsoniter "github.com/json-iterator/go"

var json = jsoniter.ConfigCompatibleWithStandardLibrary

// Drop-in replacement for encoding/json
func parseJSON(data []byte) (*User, error) {
    var user User
    if err := json.Unmarshal(data, &user); err != nil {
        return nil, err
    }
    return &user, nil
}

Code Generation

//go:generate easyjson -all user.go

// easyjson generates fast marshaling code
type User struct {
    ID   int    `json:"id"`
    Name string `json:"name"`
}

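// The generated code provides marshaling methods along these lines
// (shown as comments because the bodies live in the generated file):
//   func (u *User) MarshalJSON() ([]byte, error)
//   func (u *User) UnmarshalJSON(data []byte) error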

Compiler Optimizations

Inlining

// Small functions get inlined automatically
func add(a, b int) int {
    return a + b
}

// Check inlining decisions
// go build -gcflags="-m" ./...

// Prevent inlining (use sparingly, e.g. in benchmarks).
// Go has no directive that forces inlining.
//go:noinline
func doNotInline() {}

Bounds Check Elimination

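// ✅ Already optimized: the compiler proves 0 <= i < len(s)
func sum(s []int) int {
    total := 0
    for i := 0; i < len(s); i++ {
        total += s[i]  // No bounds check
    }
    return total
}

// ❌ Bounds check on each access: four independent index expressions
func sum4(s []int) int {
    return s[0] + s[1] + s[2] + s[3]
}

// ✅ BCE with hint
func sum4(s []int) int {
    _ = s[3]  // Hint: one check up front (panics early if len(s) < 4)
    return s[0] + s[1] + s[2] + s[3]  // No further bounds checks
}

// Verify with: go build -gcflags="-d=ssa/check_bce/debug=1" ./...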

// ✅ Range loop (compiler optimizes)
func sum(s []int) int {
    total := 0
    for _, v := range s {
        total += v  // Optimized
    }
    return total
}

Common Anti-Patterns

Defer in Hot Loops

// ❌ Bad: defer overhead in loop
func processFiles(files []string) error {
    for _, file := range files {
        f, err := os.Open(file)
        if err != nil {
            return err
        }
        defer f.Close()  // Deferred until function returns, not loop iteration!
        // Also: all files stay open!
        process(f)
    }
    return nil
}

// ✅ Good: Close explicitly or use helper
func processFiles(files []string) error {
    for _, file := range files {
        if err := processFile(file); err != nil {
            return err
        }
    }
    return nil
}

func processFile(file string) error {
    f, err := os.Open(file)
    if err != nil {
        return err
    }
    defer f.Close()  // Now correctly scoped
    return process(f)
}

Interface Conversions

// ❌ Bad: Interface conversion in hot path
func processItems(items []interface{}) {
    for _, item := range items {
        if s, ok := item.(string); ok {  // Type assertion overhead
            processString(s)
        }
    }
}

// ✅ Good: Use concrete types or generics
func processStrings(items []string) {
    for _, item := range items {
        processString(item)
    }
}
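
The "concrete types or generics" advice can be illustrated with a type-parameterized function. This is a sketch only: the Number constraint and the Sum function are illustrative names, and generics require Go 1.18 or later.

```go
package main

import "fmt"

// Number is an illustrative constraint covering a few numeric types.
type Number interface {
	~int | ~int64 | ~float64
}

// Sum adds up items of any Number type. Elements are handled as their
// concrete type, so no interface boxing or type assertion is needed.
func Sum[T Number](items []T) T {
	var total T
	for _, v := range items {
		total += v
	}
	return total
}

func main() {
	fmt.Println(Sum([]int{1, 2, 3}))      // prints 6
	fmt.Println(Sum([]float64{0.5, 1.5})) // prints 2
}
```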

Profiling Checklist

  1. Identify hotspots with CPU profiling
  2. Check memory allocations with heap profiling
  3. Find goroutine leaks with goroutine profiling
  4. Detect lock contention with mutex profiling
  5. Analyze blocking with block profiling
# Comprehensive profiling
go test -bench=. -cpuprofile=cpu.prof -memprofile=mem.prof -blockprofile=block.prof

# Trace for detailed analysis
go test -trace=trace.out
go tool trace trace.out

Interview Questions

Finding a memory leak:
  1. Use pprof heap profile to see allocations
  2. Check goroutine count over time (runtime.NumGoroutine())
  3. Monitor process memory with external tools
  4. Look for growing maps, slices, or channels
  5. Check for goroutines blocked forever
Escape analysis determines if a variable can stay on the stack or must escape to the heap. Stack allocation is faster and doesn’t require garbage collection. Use go build -gcflags="-m" to see escape decisions.
Reducing GC pressure:
  • Reduce allocations (pre-allocate slices, use sync.Pool)
  • Avoid creating many short-lived objects
  • Use value types instead of pointers when possible
  • Batch operations to amortize allocation cost
  • Consider GOGC tuning for specific workloads
Use sync.Pool for:
  • Frequently allocated/deallocated objects
  • Objects with predictable lifecycle
  • Buffers, temporary structs, connection wrappers
Don’t use for long-lived objects or when object state matters.

Summary

Technique | When to Use
CPU Profiling | Identify slow functions
Memory Profiling | Find allocation hotspots
Benchmarking | Measure and compare performance
sync.Pool | Reduce GC pressure for temp objects
Sharding | Reduce lock contention
Buffered I/O | Reduce system calls
Pre-allocation | Avoid slice/map growth
Atomic Operations | Simple concurrent counters

Interview Deep-Dive

Strong Answer:
  • Step 1: Check runtime.MemStats to understand the breakdown: HeapAlloc (bytes of allocated heap objects) and HeapSys (heap memory obtained from the OS). Check the goroutine count separately with runtime.NumGoroutine(), which is not part of MemStats. If the goroutine count is abnormally high (thousands when you expect hundreds), the growth is probably a goroutine leak rather than a heap data problem.
  • Step 2: Take a heap profile using pprof. If the service exposes the pprof HTTP endpoint: go tool pprof http://localhost:6060/debug/pprof/heap. The default shows inuse_space — memory currently allocated and not freed. Switch to alloc_space to see cumulative allocations (helps find functions that allocate heavily even if memory is later freed).
  • Step 3: Use the pprof interactive mode: top10 to see which functions hold the most memory, list functionName to see line-level allocations, and the web interface (-http=:8080) for flame graphs.
  • Common findings: a slice that grows unboundedly (like an in-memory cache without eviction), a sub-slice of a large slice keeping the entire backing array alive, goroutine leaks (each goroutine holds 2KB+ of stack), closures capturing large objects and preventing garbage collection, and sync.Pool misuse where objects are not being returned.
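  • Step 4: If heap profile looks clean but RSS is high, check HeapReleased vs HeapSys. Go’s garbage collector frees memory logically but may not return it to the OS immediately. Use debug.FreeOSMemory() in a test to force it. On Linux, Go 1.12 through 1.15 released memory with MADV_FREE, which can leave RSS looking inflated until the kernel reclaims the pages; GODEBUG=madvdontneed=1 switches those versions to MADV_DONTNEED, which is the default again since Go 1.16.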
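
As a rough illustration of Steps 1 and 4, the counters mentioned above can be read in-process. The memSnapshot type and takeSnapshot function are hypothetical names for this sketch.

```go
package main

import (
	"fmt"
	"runtime"
)

// memSnapshot holds the few runtime counters used for a first triage.
type memSnapshot struct {
	HeapAlloc    uint64 // bytes of allocated heap objects
	HeapSys      uint64 // bytes of heap memory obtained from the OS
	HeapReleased uint64 // bytes of heap memory returned to the OS
	Goroutines   int    // current number of goroutines
}

// takeSnapshot reads the counters. ReadMemStats briefly stops the world,
// so it should not be called in a tight loop.
func takeSnapshot() memSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return memSnapshot{
		HeapAlloc:    m.HeapAlloc,
		HeapSys:      m.HeapSys,
		HeapReleased: m.HeapReleased,
		Goroutines:   runtime.NumGoroutine(),
	}
}

func main() {
	s := takeSnapshot()
	fmt.Printf("heap alloc=%d KiB, sys=%d KiB, released=%d KiB, goroutines=%d\n",
		s.HeapAlloc/1024, s.HeapSys/1024, s.HeapReleased/1024, s.Goroutines)
}
```

Logging a snapshot like this at intervals gives a quick view of whether the heap, the memory retained from the OS, or the goroutine count is the thing that keeps growing.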
Follow-up: What is escape analysis, and how does it affect your optimization strategy?
Escape analysis is the compiler’s determination of whether a variable can live on the stack (cheap, automatic cleanup) or must escape to the heap (requires garbage collection). Run go build -gcflags="-m" to see escape decisions. Variables escape when: their address is returned from a function, they are assigned to an interface, they are captured by a closure that outlives the function, or the compiler cannot prove they do not escape. The optimization strategy: when a hot function is allocating heavily on the heap, check escape analysis output. Sometimes a small refactoring (returning a value instead of a pointer, avoiding an interface conversion, or pre-sizing a slice) keeps the allocation on the stack and eliminates GC pressure entirely. But do not optimize blindly: profile first to find the actual hot paths.
Strong Answer:
  • CPUs read memory in aligned chunks (typically 8 bytes on 64-bit systems). When a struct field is not naturally aligned, the compiler inserts padding bytes. By ordering fields from largest to smallest, you minimize padding.
  • Concrete example: struct { a bool; b int64; c bool } is 24 bytes (1 + 7 padding + 8 + 1 + 7 padding). Reordered as struct { b int64; a bool; c bool } it is 16 bytes (8 + 1 + 1 + 6 padding). That is 33% less memory per struct.
  • When it matters: if you have a slice of 10 million of these structs, the difference is 80MB (240MB vs 160MB). For hot structs in high-throughput code, struct alignment is a meaningful optimization. For config structs or one-off objects, readability matters more than layout.
  • Use the fieldalignment analyzer from golang.org/x/tools to detect suboptimal layouts automatically. It is installed as a separate fieldalignment command and is not a go vet flag. Some teams include this in CI.
  • A related optimization: adjacent bool fields already pack into consecutive bytes, but a bool placed between 8-byte fields costs a full 8-byte slot. For many boolean flags, consider using a single uint8 with bitwise operations instead of multiple bool fields.
Follow-up: What is sync.Pool, and how does it reduce GC pressure? What are the gotchas?
sync.Pool is a cache of temporary objects that can be reused to avoid repeated allocation and deallocation. You put objects back when done, and get them later instead of allocating new ones. The garbage collector empties the pool (since Go 1.13 an idle object survives one GC cycle in a victim cache and is dropped by the next), so it is not a general-purpose cache; it exists specifically to reduce allocation rate. The main gotcha: you MUST reset the object’s state before it is reused. If you put a bytes.Buffer back without calling Reset(), the next user gets stale data. Second gotcha: pooled items can be collected at any time, so do not rely on objects being there; always provide a New function, because without one Get returns nil when the pool is empty. Third: pools are per-P (per logical processor), so there is some overhead in the cross-P stealing mechanism. Use pools for buffers, temporary structs, and frequently allocated objects in hot paths. Do not use them for objects with complex initialization, long-lived objects, or objects where identity matters.
Strong Answer:
  • First, measure: take a CPU profile during peak traffic. The flame graph shows where time is spent. At 50K RPS, even a 1ms optimization saves 50 CPU-seconds per second.
  • Common latency sources at this scale: GC pauses (check with GODEBUG=gctrace=1), lock contention (take a mutex profile), database queries (add query timing logs), and serialization (JSON encoding/decoding).
  • If GC is the bottleneck: reduce allocation rate. Pre-allocate slices, use sync.Pool for temporary buffers, use strings.Builder instead of string concatenation, and consider using GOGC to tune GC frequency (higher GOGC means fewer GC cycles but more memory usage).
  • If lock contention is the bottleneck: shard your data structures. Instead of one sync.RWMutex protecting a single map, use 256 shards each with their own mutex. With evenly distributed keys, two goroutines then compete for the same lock roughly 1/256 as often; hot keys reduce that benefit. Use atomic operations for simple counters instead of mutex-protected variables.
  • If database queries are the bottleneck: add caching (Redis) for hot reads, use connection pooling properly, batch writes, and consider read replicas.
  • If JSON serialization is the bottleneck: switch to a faster JSON library (json-iterator/go or easyjson with code generation), or switch the internal protocol to protobuf for service-to-service calls.
  • The overall approach: profile, identify the top bottleneck, fix it, re-profile. Repeat. Never optimize based on assumptions.
Follow-up: How does Go’s garbage collector work, and what is GOGC?
Go uses a concurrent, tri-color mark-and-sweep garbage collector. It runs concurrently with your application for most of the cycle, with brief stop-the-world pauses only when marking starts and at mark termination. GOGC controls the GC target: it sets how much new heap may be allocated, as a percentage of the live heap, before the next GC is triggered. Default is 100, meaning GC triggers when new allocations equal the live heap size, so the heap peaks at roughly 2x the live data. Setting GOGC=200 means GC triggers less frequently (when new allocations are 2x the live heap), reducing CPU spent on GC but letting the heap peak at roughly 3x the live data instead of 2x. Setting GOGC=50 triggers more frequently, using less memory but more CPU. For latency-sensitive services, a higher GOGC can reduce the frequency of GC pauses. Go 1.19 also introduced GOMEMLIMIT, which sets a soft memory limit: the runtime collects more aggressively as total memory approaches the limit, which makes a high GOGC safer to run with.
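
The GOGC arithmetic above can be written down as a small calculation. The heapGoal function below is a simplified approximation introduced for illustration (the real pacer also accounts for stacks and globals since Go 1.18); the debug.SetGCPercent and debug.SetMemoryLimit calls are the programmatic counterparts of GOGC and GOMEMLIMIT, and the 1 GiB limit is an arbitrary example value.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// heapGoal approximates the heap size at which the next GC cycle
// triggers: the live heap plus gogc percent of it.
func heapGoal(liveBytes uint64, gogc int) uint64 {
	return liveBytes + liveBytes*uint64(gogc)/100
}

func main() {
	const live = 100 << 20 // assume 100 MiB of live heap
	for _, g := range []int{50, 100, 200} {
		fmt.Printf("GOGC=%d -> heap goal ~%d MiB\n", g, heapGoal(live, g)>>20)
	}

	// Same effect as starting the process with GOGC=200.
	prevPercent := debug.SetGCPercent(200)
	defer debug.SetGCPercent(prevPercent)

	// Same effect as GOMEMLIMIT=1GiB (requires Go 1.19 or later).
	prevLimit := debug.SetMemoryLimit(1 << 30)
	defer debug.SetMemoryLimit(prevLimit)

	fmt.Println("previous GC percent:", prevPercent)
}
```

For 100 MiB of live data the loop reports goals of about 150, 200, and 300 MiB, which is the memory-for-CPU trade described above.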