> ## Documentation Index
> Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Performance Optimization

> Squeeze every cycle out of your C code

# Performance Optimization

Learn how to write C code that runs at maximum speed. This guide covers CPU architecture awareness, memory hierarchy optimization, and profiling techniques.

The single most important lesson in performance engineering: **measure before you optimize**. Intuition about where time goes is wrong more often than it is right. A function you think is slow might already be optimized by the compiler, while the bottleneck hides in a memory access pattern you never questioned. Profile first, optimize second, measure again third.

***

## Understanding Performance

<Accordion title="Where Time Goes">
  Modern programs spend most time waiting. Think of your CPU as a chef who can chop vegetables at superhuman speed but has to walk to the warehouse every time they need an ingredient that is not on the counter:

  * **Register access**: 1 cycle (the cutting board in front of you)
  * **L1 cache hit**: 3-4 cycles (the spice rack on the counter)
  * **L2 cache hit**: 10-12 cycles (the pantry in the next room)
  * **L3 cache hit**: 30-40 cycles (the storage closet down the hall)
  * **Memory access (cache miss)**: 100-300 cycles (driving to the warehouse)
  * **Branch misprediction**: 15-20 cycles penalty (starting to chop onions, then realizing you needed garlic)
  * **System calls**: Thousands of cycles (calling the supplier on the phone)
  * **Disk I/O**: Millions of cycles (ordering from another country)

  The key insight: **optimize memory access patterns first**. Most "slow" C code is not compute-bound -- it is memory-bound. The CPU is literally sitting idle, waiting for data to arrive from RAM.

  **Practical tip**: Before touching any code, run `perf stat ./your_program` and look at the IPC (instructions per cycle). An IPC below 1.0 almost always means you are memory-bound, not compute-bound.
</Accordion>

***

## Profiling Tools

<Tabs>
  <Tab title="perf">
    ```bash theme={null}
    # Record performance data
    perf record -g ./my_program

    # Analyze results
    perf report

    # Statistical overview
    perf stat ./my_program

    # Sample output:
    # Performance counter stats for './my_program':
    #       1,234.56 msec task-clock
    #    4,567,890,123 cycles
    #    3,456,789,012 instructions      # 0.76 IPC
    #       12,345,678 cache-references
    #          123,456 cache-misses      # 1.0% of all cache refs
    ```
  </Tab>

  <Tab title="gprof">
    ```bash theme={null}
    # Compile with profiling
    gcc -pg -O2 program.c -o program

    # Run the program (generates gmon.out)
    ./program

    # Analyze
    gprof program gmon.out > analysis.txt

    # Flat profile shows where time is spent
    # Call graph shows who calls what
    ```
  </Tab>

  <Tab title="Valgrind">
    ```bash theme={null}
    # Cache simulation
    valgrind --tool=cachegrind ./program

    # Annotate source with cache info
    cg_annotate cachegrind.out.*

    # Shows:
    # - Cache hit rates
    # - Branch prediction rates  
    # - Per-line cache behavior
    ```
  </Tab>
</Tabs>

***

## Memory Hierarchy Optimization

```c theme={null}
// BAD: Column-major access (cache unfriendly)
// Each iteration jumps to a completely different cache line.
// For a 1000x1000 matrix, every single access is a cache miss.
void matrix_sum_bad(int matrix[N][N]) {
    int sum = 0;
    for (int j = 0; j < N; j++) {        // Column first
        for (int i = 0; i < N; i++) {    // Row second
            sum += matrix[i][j];          // Jumps N*sizeof(int) bytes each access
        }
    }
}

// GOOD: Row-major access (cache friendly)
// C stores 2D arrays in row-major order: row 0, then row 1, etc.
// Sequential access means one cache miss loads ~16 ints at once
// (64-byte cache line / 4-byte int = 16 consecutive ints for free).
void matrix_sum_good(int matrix[N][N]) {
    int sum = 0;
    for (int i = 0; i < N; i++) {        // Row first
        for (int j = 0; j < N; j++) {    // Column second
            sum += matrix[i][j];          // Sequential access: next address in memory
        }
    }
}
// Performance difference: 10-50x on large matrices!
// Why so dramatic? Because the bad version's cache miss rate approaches 100%
// while the good version's approaches ~6% (one miss per 16 elements).

// BETTER: Blocking for cache locality
// The idea: instead of processing the whole matrix row by row, process it in
// small tiles (blocks) that fit entirely in L1 cache. This means all three
// matrices A, B, C have their active block resident in cache simultaneously.
// A typical L1 data cache is 32-48KB, so a 64x64 block of doubles = 32KB.
#define BLOCK 64

void matrix_multiply_blocked(double *A, double *B, double *C, int n) {
    for (int i = 0; i < n; i += BLOCK) {
        for (int j = 0; j < n; j += BLOCK) {
            for (int k = 0; k < n; k += BLOCK) {
                // Multiply block
                for (int ii = i; ii < i + BLOCK && ii < n; ii++) {
                    for (int jj = j; jj < j + BLOCK && jj < n; jj++) {
                        double sum = C[ii*n + jj];
                        for (int kk = k; kk < k + BLOCK && kk < n; kk++) {
                            sum += A[ii*n + kk] * B[kk*n + jj];
                        }
                        C[ii*n + jj] = sum;
                    }
                }
            }
        }
    }
}
```

***

## Data Structure Layout

The way you organize your structs determines how the CPU cache behaves. Think of it like organizing a library: if every book about physics also has a chapter of recipes bound into it, you waste shelf space carrying recipes you never read when you are doing physics research.

```c theme={null}
// BAD: Poor cache utilization (Array of Structs)
// Each particle is 60 bytes. When updating positions, we load ALL 60 bytes
// into cache but only use x, y, z, vx, vy, vz (24 bytes). That is 60% wasted bandwidth.
struct Particle_AoS {
    float x, y, z;        // Position (12 bytes)
    float vx, vy, vz;     // Velocity (12 bytes)
    float mass;            // 4 bytes
    int type;              // 4 bytes
    char name[32];        // Only used for debugging -- but loaded into cache every time!
};

struct Particle_AoS particles[1000000];

// When updating positions, we load name[] into cache too
void update_positions_aos(void) {
    for (int i = 0; i < 1000000; i++) {
        particles[i].x += particles[i].vx;  // Lots of cache misses
        particles[i].y += particles[i].vy;
        particles[i].z += particles[i].vz;
    }
}

// GOOD: Cache-friendly (Struct of Arrays)
struct ParticleSystem {
    float *x, *y, *z;     // Positions
    float *vx, *vy, *vz;  // Velocities
    float *mass;
    int *type;
    size_t count;
};

// Sequential access, no wasted cache space
void update_positions_soa(struct ParticleSystem *ps) {
    for (size_t i = 0; i < ps->count; i++) {
        ps->x[i] += ps->vx[i];  // Sequential access
        ps->y[i] += ps->vy[i];
        ps->z[i] += ps->vz[i];
    }
}
// Often 2-4x faster for large datasets

// BEST: Hybrid approach - group hot data together
// This is what game engines actually do. Separate data into "hot" fields
// (accessed every frame) and "cold" fields (accessed only for UI or debugging).
// You get cache efficiency without the awkward SoA API.
struct ParticleHot {
    float x, y, z;
    float vx, vy, vz;
};  // 24 bytes -- fits ~2.6 particles per cache line

struct ParticleCold {
    float mass;
    int type;
    char name[32];
};  // 40 bytes -- only loaded when you actually need mass/type/name

struct ParticleHot hot_data[1000000];   // Updated every frame
struct ParticleCold cold_data[1000000]; // Rarely accessed
// Practical tip: index both arrays with the same ID. hot_data[i] and
// cold_data[i] always refer to the same logical particle.
```

***

## Branch Prediction

Modern CPUs predict which way an `if` statement will go and start executing that path *before* they know the answer. If the prediction is wrong, they throw away 15-20 cycles of wasted work and restart. When data is random, the predictor is wrong \~50% of the time, which means every other iteration pays a 15-20 cycle penalty.

```c theme={null}
// BAD: Unpredictable branches
// With random data, the branch predictor is correct only ~50% of the time.
// On a tight loop processing millions of elements, this costs real seconds.
void process_bad(int *data, int n) {
    for (int i = 0; i < n; i++) {
        if (data[i] > 0) {    // 50% taken - unpredictable
            positive_count++;
        } else {
            negative_count++;
        }
    }
}

// GOOD: Branchless code
void process_branchless(int *data, int n) {
    for (int i = 0; i < n; i++) {
        int is_positive = data[i] > 0;
        positive_count += is_positive;
        negative_count += !is_positive;
    }
}

// BETTER: Sort data first if possible
// After sorting, branches become predictable
qsort(data, n, sizeof(int), compare);
process_bad(data, n);  // Now branches are predictable!

// Branchless min/max
int branchless_min(int a, int b) {
    return b ^ ((a ^ b) & -(a < b));
}

int branchless_max(int a, int b) {
    return a ^ ((a ^ b) & -(a < b));
}

// Branchless absolute value
int branchless_abs(int x) {
    int mask = x >> 31;
    return (x ^ mask) - mask;
}

// Conditional move using compiler hints
int conditional_value(int condition, int true_val, int false_val) {
    // GCC might generate CMOV instruction
    return condition ? true_val : false_val;
}
```

***

## SIMD Vectorization

SIMD (Single Instruction, Multiple Data) lets the CPU perform the same operation on multiple data elements in a single instruction. Think of it as the difference between washing dishes one at a time versus loading a dishwasher: same total work, but the parallel approach is dramatically faster.

**Before writing manual SIMD**: check if the compiler does it for you. Compile with `-O2 -march=native` and add `-fopt-info-vec-optimized` (GCC) or `-Rpass=loop-vectorize` (Clang) to see what the compiler auto-vectorizes. Manual SIMD is only worth it when the compiler cannot figure out the pattern on its own.

```c theme={null}
#include <immintrin.h>  // For AVX/SSE intrinsics

// Scalar version -- processes one float per iteration
void add_arrays_scalar(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// SIMD version (AVX - 8 floats at once)
// AVX registers are 256 bits wide = 8 x 32-bit floats.
// One _mm256_add_ps instruction does the work of 8 scalar additions.
void add_arrays_simd(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   // Load 8 floats from a ("u" = unaligned)
        __m256 vb = _mm256_loadu_ps(&b[i]);   // Load 8 floats from b
        __m256 vc = _mm256_add_ps(va, vb);    // Add all 8 pairs simultaneously
        _mm256_storeu_ps(&c[i], vc);           // Store 8 results to c
    }
    
    // Handle remainder (when n is not a multiple of 8)
    // This scalar tail is necessary for correctness but processes very few elements.
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Dot product with SIMD
float dot_product_simd(float *a, float *b, int n) {
    __m256 sum = _mm256_setzero_ps();
    int i;
    
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        sum = _mm256_fmadd_ps(va, vb, sum);  // Fused multiply-add
    }
    
    // Horizontal sum
    __m128 hi = _mm256_extractf128_ps(sum, 1);
    __m128 lo = _mm256_castps256_ps128(sum);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    
    float result = _mm_cvtss_f32(lo);
    
    // Handle remainder
    for (; i < n; i++) {
        result += a[i] * b[i];
    }
    
    return result;
}
```

***

## Compiler Optimization Hints

```c theme={null}
// Alignment hints
void process_aligned(float * __restrict__ __attribute__((aligned(32))) data, int n) {
    // Compiler knows data is 32-byte aligned
    for (int i = 0; i < n; i++) {
        data[i] *= 2.0f;
    }
}

// Restrict pointers - no aliasing
void vector_add(float * __restrict__ a, 
                float * __restrict__ b,
                float * __restrict__ c, 
                int n) {
    // Compiler knows a, b, c don't overlap
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Branch prediction hints
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process_with_hints(int *data, int n) {
    for (int i = 0; i < n; i++) {
        if (unlikely(data[i] < 0)) {
            return -1;  // Error path - rarely taken
        }
        // Normal processing...
    }
    return 0;
}

// Force inlining
static inline __attribute__((always_inline)) 
int fast_abs(int x) {
    int mask = x >> 31;
    return (x ^ mask) - mask;
}

// Prevent inlining (for debugging)
__attribute__((noinline))
void debug_function(void) {
    // ...
}

// Hot/cold function hints
__attribute__((hot))
void frequently_called(void) {
    // Compiler optimizes more aggressively
}

__attribute__((cold))
void error_handler(void) {
    // Compiler optimizes for size
}

// Prefetch hints
void process_with_prefetch(float *data, int n) {
    for (int i = 0; i < n; i++) {
        // Prefetch data we'll need 64 iterations later
        __builtin_prefetch(&data[i + 64], 0, 3);
        data[i] = sqrt(data[i]);
    }
}
```

***

## Memory Allocation Optimization

`malloc` is not free. Each call involves locking a global mutex (in most implementations), searching a free list, potentially requesting memory from the OS via `mmap` or `sbrk`, and updating bookkeeping metadata. In a tight loop, this overhead dominates your computation time.

```c theme={null}
// Avoid frequent malloc/free
// BAD: Allocate in loop
// Each iteration: lock mutex -> search free list -> possibly syscall -> unlock mutex
// For 1 million iterations, that is 1 million lock/unlock cycles and free list scans.
void process_bad(int n) {
    for (int i = 0; i < n; i++) {
        char *buf = malloc(1024);  // ~50-200ns per call on modern systems
        do_work(buf);
        free(buf);
    }
}

// GOOD: Allocate once
void process_good(int n) {
    char *buf = malloc(1024);
    for (int i = 0; i < n; i++) {
        do_work(buf);
    }
    free(buf);
}

// BETTER: Arena allocator for bulk operations
// An arena is like a notepad: you write sequentially from front to back,
// and when you are done with everything, you rip off the whole page at once.
// No per-allocation bookkeeping, no fragmentation, no free-list searching.
typedef struct {
    char *memory;
    size_t used;
    size_t capacity;
} Arena;

void *arena_alloc(Arena *a, size_t size) {
    size = (size + 7) & ~7;  // Round up to 8-byte alignment for safety
    if (a->used + size > a->capacity) return NULL;
    void *ptr = a->memory + a->used;
    a->used += size;          // "Allocation" is just a pointer bump: ~1-2ns
    return ptr;
}

void arena_reset(Arena *a) {
    a->used = 0;  // O(1) "free" everything -- no per-object cleanup needed
    // Perfect for request-scoped work: allocate during request, reset at end
}

// Stack allocation for small buffers
void process_stack(int n) {
    for (int i = 0; i < n; i++) {
        char buf[1024];  // Stack allocation - very fast
        do_work(buf);
    }
}

// Avoid false sharing in multithreaded code
// False sharing happens when two threads modify different variables that live
// on the same cache line (64 bytes). The CPU invalidates the entire cache line
// whenever either thread writes, causing both threads to constantly reload data
// they did not change. This can make multithreaded code SLOWER than single-threaded.
struct __attribute__((aligned(64))) ThreadData {
    long counter;
    char padding[56];  // Pad to 64 bytes so each counter gets its own cache line
};

struct ThreadData counters[NUM_THREADS];
// Without padding: 4 threads sharing a cache line can run 10-50x slower
// than expected due to constant cache-line ping-ponging between CPU cores.
```

***

## Loop Optimizations

```c theme={null}
// Loop unrolling
void sum_unrolled(int *arr, int n, long *result) {
    long sum = 0;
    int i;
    
    // Process 4 elements per iteration
    for (i = 0; i <= n - 4; i += 4) {
        sum += arr[i];
        sum += arr[i + 1];
        sum += arr[i + 2];
        sum += arr[i + 3];
    }
    
    // Handle remainder
    for (; i < n; i++) {
        sum += arr[i];
    }
    
    *result = sum;
}

// Loop fusion - combine loops
// BAD: Two loops, two cache passes
void process_separate(float *a, float *b, int n) {
    for (int i = 0; i < n; i++) a[i] *= 2;
    for (int i = 0; i < n; i++) b[i] *= 3;
}

// GOOD: One loop, one cache pass
void process_fused(float *a, float *b, int n) {
    for (int i = 0; i < n; i++) {
        a[i] *= 2;
        b[i] *= 3;
    }
}

// Loop interchange for cache locality
// BAD: Column-major access
void init_bad(int m[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            m[i][j] = 0;
}

// GOOD: Row-major access
void init_good(int m[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = 0;
}

// Loop invariant motion (manual)
// BAD: Recompute every iteration
for (int i = 0; i < n; i++) {
    result[i] = data[i] * sin(angle) * scale;
}

// GOOD: Compute once
double factor = sin(angle) * scale;
for (int i = 0; i < n; i++) {
    result[i] = data[i] * factor;
}
```

***

## String and Memory Operations

```c theme={null}
// Use optimized library functions
#include <string.h>

// memcpy is highly optimized - use it
void copy_data(char *dst, char *src, size_t n) {
    memcpy(dst, src, n);  // Uses SIMD, non-temporal stores, etc.
}

// For known small sizes, inline
static inline void copy_16(char *dst, char *src) {
    __m128i v = _mm_loadu_si128((const __m128i*)src);
    _mm_storeu_si128((__m128i*)dst, v);
}

// Avoid strlen in loops
// BAD
for (int i = 0; i < strlen(str); i++) {  // O(n²)!
    process(str[i]);
}

// GOOD
size_t len = strlen(str);
for (size_t i = 0; i < len; i++) {
    process(str[i]);
}

// Or even better
for (char *p = str; *p; p++) {
    process(*p);
}

// Use memset for initialization
int arr[1000];
memset(arr, 0, sizeof(arr));  // Faster than loop
```

***

## Micro-benchmarking

```c theme={null}
#include <time.h>

// High-resolution timing
static inline uint64_t rdtsc(void) {
    unsigned int lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

// Or use clock_gettime
double get_time(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// Benchmark function
void benchmark(const char *name, void (*func)(void), int iterations) {
    // Warmup
    for (int i = 0; i < 100; i++) func();
    
    double start = get_time();
    for (int i = 0; i < iterations; i++) {
        func();
    }
    double end = get_time();
    
    double per_call = (end - start) / iterations * 1e9;
    printf("%s: %.2f ns per call\n", name, per_call);
}

// Prevent dead code elimination
// The compiler is smart enough to remove computations whose results are never used.
// These functions use inline assembly to "trick" the compiler into thinking
// the value is needed, without actually doing any work at runtime.
void escape(void *p) {
    __asm__ volatile("" : : "g"(p) : "memory");
    // Tells compiler: "p is used by something you cannot see, so do not optimize it away"
}

void clobber(void) {
    __asm__ volatile("" : : : "memory");
    // Tells compiler: "memory may have changed, so reload anything you cached"
    // Use this between benchmark iterations to prevent the compiler from
    // hoisting work out of your measurement loop.
}

// Practical tip: if your benchmark shows impossibly fast results (< 1ns per call),
// the compiler probably optimized away your entire computation. Use escape/clobber
// on your inputs and outputs to prevent this.
```

***

## Checklist

<CardGroup cols={2}>
  <Card title="Profile First" icon="chart-line">
    Never optimize without measuring
  </Card>

  <Card title="Cache Friendly" icon="memory">
    Sequential access, avoid cache misses
  </Card>

  <Card title="Reduce Branches" icon="code-branch">
    Predictable or branchless code
  </Card>

  <Card title="Vectorize" icon="bolt">
    Use SIMD for data-parallel work
  </Card>

  <Card title="Reduce Allocations" icon="layer-group">
    Pool, arena, or stack allocate
  </Card>

  <Card title="Use restrict" icon="code">
    Help compiler optimize pointer code
  </Card>
</CardGroup>

***

## Next Up

<Card title="Embedded Systems" icon="arrow-right" href="/courses/c-programming/embedded">
  Write C for resource-constrained systems
</Card>
