Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Performance Optimization

Learn how to write C code that runs at maximum speed. This guide covers CPU architecture awareness, memory hierarchy optimization, and profiling techniques. The single most important lesson in performance engineering: measure before you optimize. Intuition about where time goes is wrong more often than it is right. A function you think is slow might already be optimized by the compiler, while the bottleneck hides in a memory access pattern you never questioned. Profile first, optimize second, measure again third.

Understanding Performance

Modern programs spend most time waiting. Think of your CPU as a chef who can chop vegetables at superhuman speed but has to walk to the warehouse every time they need an ingredient that is not on the counter:
  • Register access: 1 cycle (the cutting board in front of you)
  • L1 cache hit: 3-4 cycles (the spice rack on the counter)
  • L2 cache hit: 10-12 cycles (the pantry in the next room)
  • L3 cache hit: 30-40 cycles (the storage closet down the hall)
  • Memory access (cache miss): 100-300 cycles (driving to the warehouse)
  • Branch misprediction: 15-20 cycles penalty (starting to chop onions, then realizing you needed garlic)
  • System calls: Thousands of cycles (calling the supplier on the phone)
  • Disk I/O: Millions of cycles (ordering from another country)
The key insight: optimize memory access patterns first. Most “slow” C code is not compute-bound — it is memory-bound. The CPU is literally sitting idle, waiting for data to arrive from RAM.Practical tip: Before touching any code, run perf stat ./your_program and look at the IPC (instructions per cycle). An IPC below 1.0 almost always means you are memory-bound, not compute-bound.

Profiling Tools

# Record performance data
perf record -g ./my_program

# Analyze results
perf report

# Statistical overview
perf stat ./my_program

# Sample output:
# Performance counter stats for './my_program':
#       1,234.56 msec task-clock
#    4,567,890,123 cycles
#    3,456,789,012 instructions      # 0.76 IPC
#       12,345,678 cache-references
#          123,456 cache-misses      # 1.0% of all cache refs

Memory Hierarchy Optimization

// BAD: Column-major access (cache unfriendly)
// Each iteration jumps to a completely different cache line.
// For a 1000x1000 matrix, every single access is a cache miss.
void matrix_sum_bad(int matrix[N][N]) {
    int sum = 0;
    for (int j = 0; j < N; j++) {        // Column first
        for (int i = 0; i < N; i++) {    // Row second
            sum += matrix[i][j];          // Jumps N*sizeof(int) bytes each access
        }
    }
}

// GOOD: Row-major access (cache friendly)
// C stores 2D arrays in row-major order: row 0, then row 1, etc.
// Sequential access means one cache miss loads ~16 ints at once
// (64-byte cache line / 4-byte int = 16 consecutive ints for free).
void matrix_sum_good(int matrix[N][N]) {
    int sum = 0;
    for (int i = 0; i < N; i++) {        // Row first
        for (int j = 0; j < N; j++) {    // Column second
            sum += matrix[i][j];          // Sequential access: next address in memory
        }
    }
}
// Performance difference: 10-50x on large matrices!
// Why so dramatic? Because the bad version's cache miss rate approaches 100%
// while the good version's approaches ~6% (one miss per 16 elements).

// BETTER: Blocking for cache locality
// The idea: instead of processing the whole matrix row by row, process it in
// small tiles (blocks) that fit entirely in L1 cache. This means all three
// matrices A, B, C have their active block resident in cache simultaneously.
// A typical L1 data cache is 32-48KB, so a 64x64 block of doubles = 32KB.
#define BLOCK 64

void matrix_multiply_blocked(double *A, double *B, double *C, int n) {
    for (int i = 0; i < n; i += BLOCK) {
        for (int j = 0; j < n; j += BLOCK) {
            for (int k = 0; k < n; k += BLOCK) {
                // Multiply block
                for (int ii = i; ii < i + BLOCK && ii < n; ii++) {
                    for (int jj = j; jj < j + BLOCK && jj < n; jj++) {
                        double sum = C[ii*n + jj];
                        for (int kk = k; kk < k + BLOCK && kk < n; kk++) {
                            sum += A[ii*n + kk] * B[kk*n + jj];
                        }
                        C[ii*n + jj] = sum;
                    }
                }
            }
        }
    }
}

Data Structure Layout

The way you organize your structs determines how the CPU cache behaves. Think of it like organizing a library: if every book about physics also has a chapter of recipes bound into it, you waste shelf space carrying recipes you never read when you are doing physics research.
// BAD: Poor cache utilization (Array of Structs)
// Each particle is 60 bytes. When updating positions, we load ALL 60 bytes
// into cache but only use x, y, z, vx, vy, vz (24 bytes). That is 60% wasted bandwidth.
struct Particle_AoS {
    float x, y, z;        // Position (12 bytes)
    float vx, vy, vz;     // Velocity (12 bytes)
    float mass;            // 4 bytes
    int type;              // 4 bytes
    char name[32];        // Only used for debugging -- but loaded into cache every time!
};

struct Particle_AoS particles[1000000];

// When updating positions, we load name[] into cache too
void update_positions_aos(void) {
    for (int i = 0; i < 1000000; i++) {
        particles[i].x += particles[i].vx;  // Lots of cache misses
        particles[i].y += particles[i].vy;
        particles[i].z += particles[i].vz;
    }
}

// GOOD: Cache-friendly (Struct of Arrays)
struct ParticleSystem {
    float *x, *y, *z;     // Positions
    float *vx, *vy, *vz;  // Velocities
    float *mass;
    int *type;
    size_t count;
};

// Sequential access, no wasted cache space
void update_positions_soa(struct ParticleSystem *ps) {
    for (size_t i = 0; i < ps->count; i++) {
        ps->x[i] += ps->vx[i];  // Sequential access
        ps->y[i] += ps->vy[i];
        ps->z[i] += ps->vz[i];
    }
}
// Often 2-4x faster for large datasets

// BEST: Hybrid approach - group hot data together
// This is what game engines actually do. Separate data into "hot" fields
// (accessed every frame) and "cold" fields (accessed only for UI or debugging).
// You get cache efficiency without the awkward SoA API.
struct ParticleHot {
    float x, y, z;
    float vx, vy, vz;
};  // 24 bytes -- fits ~2.6 particles per cache line

struct ParticleCold {
    float mass;
    int type;
    char name[32];
};  // 40 bytes -- only loaded when you actually need mass/type/name

struct ParticleHot hot_data[1000000];   // Updated every frame
struct ParticleCold cold_data[1000000]; // Rarely accessed
// Practical tip: index both arrays with the same ID. hot_data[i] and
// cold_data[i] always refer to the same logical particle.

Branch Prediction

Modern CPUs predict which way an if statement will go and start executing that path before they know the answer. If the prediction is wrong, they throw away 15-20 cycles of wasted work and restart. When data is random, the predictor is wrong ~50% of the time, which means every other iteration pays a 15-20 cycle penalty.
// BAD: Unpredictable branches
// With random data, the branch predictor is correct only ~50% of the time.
// On a tight loop processing millions of elements, this costs real seconds.
void process_bad(int *data, int n) {
    for (int i = 0; i < n; i++) {
        if (data[i] > 0) {    // 50% taken - unpredictable
            positive_count++;
        } else {
            negative_count++;
        }
    }
}

// GOOD: Branchless code
void process_branchless(int *data, int n) {
    for (int i = 0; i < n; i++) {
        int is_positive = data[i] > 0;
        positive_count += is_positive;
        negative_count += !is_positive;
    }
}

// BETTER: Sort data first if possible
// After sorting, branches become predictable
qsort(data, n, sizeof(int), compare);
process_bad(data, n);  // Now branches are predictable!

// Branchless min/max
int branchless_min(int a, int b) {
    return b ^ ((a ^ b) & -(a < b));
}

int branchless_max(int a, int b) {
    return a ^ ((a ^ b) & -(a < b));
}

// Branchless absolute value
int branchless_abs(int x) {
    int mask = x >> 31;
    return (x ^ mask) - mask;
}

// Conditional move using compiler hints
int conditional_value(int condition, int true_val, int false_val) {
    // GCC might generate CMOV instruction
    return condition ? true_val : false_val;
}

SIMD Vectorization

SIMD (Single Instruction, Multiple Data) lets the CPU perform the same operation on multiple data elements in a single instruction. Think of it as the difference between washing dishes one at a time versus loading a dishwasher: same total work, but the parallel approach is dramatically faster. Before writing manual SIMD: check if the compiler does it for you. Compile with -O2 -march=native and add -fopt-info-vec-optimized (GCC) or -Rpass=loop-vectorize (Clang) to see what the compiler auto-vectorizes. Manual SIMD is only worth it when the compiler cannot figure out the pattern on its own.
#include <immintrin.h>  // For AVX/SSE intrinsics

// Scalar version -- processes one float per iteration
void add_arrays_scalar(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// SIMD version (AVX - 8 floats at once)
// AVX registers are 256 bits wide = 8 x 32-bit floats.
// One _mm256_add_ps instruction does the work of 8 scalar additions.
void add_arrays_simd(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   // Load 8 floats from a ("u" = unaligned)
        __m256 vb = _mm256_loadu_ps(&b[i]);   // Load 8 floats from b
        __m256 vc = _mm256_add_ps(va, vb);    // Add all 8 pairs simultaneously
        _mm256_storeu_ps(&c[i], vc);           // Store 8 results to c
    }
    
    // Handle remainder (when n is not a multiple of 8)
    // This scalar tail is necessary for correctness but processes very few elements.
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Dot product with SIMD
float dot_product_simd(float *a, float *b, int n) {
    __m256 sum = _mm256_setzero_ps();
    int i;
    
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        sum = _mm256_fmadd_ps(va, vb, sum);  // Fused multiply-add
    }
    
    // Horizontal sum
    __m128 hi = _mm256_extractf128_ps(sum, 1);
    __m128 lo = _mm256_castps256_ps128(sum);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    
    float result = _mm_cvtss_f32(lo);
    
    // Handle remainder
    for (; i < n; i++) {
        result += a[i] * b[i];
    }
    
    return result;
}

Compiler Optimization Hints

// Alignment hints
void process_aligned(float * __restrict__ __attribute__((aligned(32))) data, int n) {
    // Compiler knows data is 32-byte aligned
    for (int i = 0; i < n; i++) {
        data[i] *= 2.0f;
    }
}

// Restrict pointers - no aliasing
void vector_add(float * __restrict__ a, 
                float * __restrict__ b,
                float * __restrict__ c, 
                int n) {
    // Compiler knows a, b, c don't overlap
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Branch prediction hints
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process_with_hints(int *data, int n) {
    for (int i = 0; i < n; i++) {
        if (unlikely(data[i] < 0)) {
            return -1;  // Error path - rarely taken
        }
        // Normal processing...
    }
    return 0;
}

// Force inlining
static inline __attribute__((always_inline)) 
int fast_abs(int x) {
    int mask = x >> 31;
    return (x ^ mask) - mask;
}

// Prevent inlining (for debugging)
__attribute__((noinline))
void debug_function(void) {
    // ...
}

// Hot/cold function hints
__attribute__((hot))
void frequently_called(void) {
    // Compiler optimizes more aggressively
}

__attribute__((cold))
void error_handler(void) {
    // Compiler optimizes for size
}

// Prefetch hints
void process_with_prefetch(float *data, int n) {
    for (int i = 0; i < n; i++) {
        // Prefetch data we'll need 64 iterations later
        __builtin_prefetch(&data[i + 64], 0, 3);
        data[i] = sqrt(data[i]);
    }
}

Memory Allocation Optimization

malloc is not free. Each call involves locking a global mutex (in most implementations), searching a free list, potentially requesting memory from the OS via mmap or sbrk, and updating bookkeeping metadata. In a tight loop, this overhead dominates your computation time.
// Avoid frequent malloc/free
// BAD: Allocate in loop
// Each iteration: lock mutex -> search free list -> possibly syscall -> unlock mutex
// For 1 million iterations, that is 1 million lock/unlock cycles and free list scans.
void process_bad(int n) {
    for (int i = 0; i < n; i++) {
        char *buf = malloc(1024);  // ~50-200ns per call on modern systems
        do_work(buf);
        free(buf);
    }
}

// GOOD: Allocate once
void process_good(int n) {
    char *buf = malloc(1024);
    for (int i = 0; i < n; i++) {
        do_work(buf);
    }
    free(buf);
}

// BETTER: Arena allocator for bulk operations
// An arena is like a notepad: you write sequentially from front to back,
// and when you are done with everything, you rip off the whole page at once.
// No per-allocation bookkeeping, no fragmentation, no free-list searching.
typedef struct {
    char *memory;
    size_t used;
    size_t capacity;
} Arena;

void *arena_alloc(Arena *a, size_t size) {
    size = (size + 7) & ~7;  // Round up to 8-byte alignment for safety
    if (a->used + size > a->capacity) return NULL;
    void *ptr = a->memory + a->used;
    a->used += size;          // "Allocation" is just a pointer bump: ~1-2ns
    return ptr;
}

void arena_reset(Arena *a) {
    a->used = 0;  // O(1) "free" everything -- no per-object cleanup needed
    // Perfect for request-scoped work: allocate during request, reset at end
}

// Stack allocation for small buffers
void process_stack(int n) {
    for (int i = 0; i < n; i++) {
        char buf[1024];  // Stack allocation - very fast
        do_work(buf);
    }
}

// Avoid false sharing in multithreaded code
// False sharing happens when two threads modify different variables that live
// on the same cache line (64 bytes). The CPU invalidates the entire cache line
// whenever either thread writes, causing both threads to constantly reload data
// they did not change. This can make multithreaded code SLOWER than single-threaded.
struct __attribute__((aligned(64))) ThreadData {
    long counter;
    char padding[56];  // Pad to 64 bytes so each counter gets its own cache line
};

struct ThreadData counters[NUM_THREADS];
// Without padding: 4 threads sharing a cache line can run 10-50x slower
// than expected due to constant cache-line ping-ponging between CPU cores.

Loop Optimizations

// Loop unrolling
void sum_unrolled(int *arr, int n, long *result) {
    long sum = 0;
    int i;
    
    // Process 4 elements per iteration
    for (i = 0; i <= n - 4; i += 4) {
        sum += arr[i];
        sum += arr[i + 1];
        sum += arr[i + 2];
        sum += arr[i + 3];
    }
    
    // Handle remainder
    for (; i < n; i++) {
        sum += arr[i];
    }
    
    *result = sum;
}

// Loop fusion - combine loops
// BAD: Two loops, two cache passes
void process_separate(float *a, float *b, int n) {
    for (int i = 0; i < n; i++) a[i] *= 2;
    for (int i = 0; i < n; i++) b[i] *= 3;
}

// GOOD: One loop, one cache pass
void process_fused(float *a, float *b, int n) {
    for (int i = 0; i < n; i++) {
        a[i] *= 2;
        b[i] *= 3;
    }
}

// Loop interchange for cache locality
// BAD: Column-major access
void init_bad(int m[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            m[i][j] = 0;
}

// GOOD: Row-major access
void init_good(int m[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = 0;
}

// Loop invariant motion (manual)
// BAD: Recompute every iteration
for (int i = 0; i < n; i++) {
    result[i] = data[i] * sin(angle) * scale;
}

// GOOD: Compute once
double factor = sin(angle) * scale;
for (int i = 0; i < n; i++) {
    result[i] = data[i] * factor;
}

String and Memory Operations

// Use optimized library functions
#include <string.h>

// memcpy is highly optimized - use it
void copy_data(char *dst, char *src, size_t n) {
    memcpy(dst, src, n);  // Uses SIMD, non-temporal stores, etc.
}

// For known small sizes, inline
static inline void copy_16(char *dst, char *src) {
    __m128i v = _mm_loadu_si128((const __m128i*)src);
    _mm_storeu_si128((__m128i*)dst, v);
}

// Avoid strlen in loops
// BAD
for (int i = 0; i < strlen(str); i++) {  // O(n²)!
    process(str[i]);
}

// GOOD
size_t len = strlen(str);
for (size_t i = 0; i < len; i++) {
    process(str[i]);
}

// Or even better
for (char *p = str; *p; p++) {
    process(*p);
}

// Use memset for initialization
int arr[1000];
memset(arr, 0, sizeof(arr));  // Faster than loop

Micro-benchmarking

#include <time.h>

// High-resolution timing
static inline uint64_t rdtsc(void) {
    unsigned int lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

// Or use clock_gettime
double get_time(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// Benchmark function
void benchmark(const char *name, void (*func)(void), int iterations) {
    // Warmup
    for (int i = 0; i < 100; i++) func();
    
    double start = get_time();
    for (int i = 0; i < iterations; i++) {
        func();
    }
    double end = get_time();
    
    double per_call = (end - start) / iterations * 1e9;
    printf("%s: %.2f ns per call\n", name, per_call);
}

// Prevent dead code elimination
// The compiler is smart enough to remove computations whose results are never used.
// These functions use inline assembly to "trick" the compiler into thinking
// the value is needed, without actually doing any work at runtime.
void escape(void *p) {
    __asm__ volatile("" : : "g"(p) : "memory");
    // Tells compiler: "p is used by something you cannot see, so do not optimize it away"
}

void clobber(void) {
    __asm__ volatile("" : : : "memory");
    // Tells compiler: "memory may have changed, so reload anything you cached"
    // Use this between benchmark iterations to prevent the compiler from
    // hoisting work out of your measurement loop.
}

// Practical tip: if your benchmark shows impossibly fast results (< 1ns per call),
// the compiler probably optimized away your entire computation. Use escape/clobber
// on your inputs and outputs to prevent this.

Checklist

Profile First

Never optimize without measuring

Cache Friendly

Sequential access, avoid cache misses

Reduce Branches

Predictable or branchless code

Vectorize

Use SIMD for data-parallel work

Reduce Allocations

Pool, arena, or stack allocate

Use restrict

Help compiler optimize pointer code

Next Up

Embedded Systems

Write C for resource-constrained systems