> ## Documentation Index > Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt > Use this file to discover all available pages before exploring further. # Performance Optimization > Squeeze every cycle out of your C code # Performance Optimization Learn how to write C code that runs at maximum speed. This guide covers CPU architecture awareness, memory hierarchy optimization, and profiling techniques. The single most important lesson in performance engineering: **measure before you optimize**. Intuition about where time goes is wrong more often than it is right. A function you think is slow might already be optimized by the compiler, while the bottleneck hides in a memory access pattern you never questioned. Profile first, optimize second, measure again third. *** ## Understanding Performance Modern programs spend most time waiting. Think of your CPU as a chef who can chop vegetables at superhuman speed but has to walk to the warehouse every time they need an ingredient that is not on the counter: * **Register access**: 1 cycle (the cutting board in front of you) * **L1 cache hit**: 3-4 cycles (the spice rack on the counter) * **L2 cache hit**: 10-12 cycles (the pantry in the next room) * **L3 cache hit**: 30-40 cycles (the storage closet down the hall) * **Memory access (cache miss)**: 100-300 cycles (driving to the warehouse) * **Branch misprediction**: 15-20 cycles penalty (starting to chop onions, then realizing you needed garlic) * **System calls**: Thousands of cycles (calling the supplier on the phone) * **Disk I/O**: Millions of cycles (ordering from another country) The key insight: **optimize memory access patterns first**. Most "slow" C code is not compute-bound -- it is memory-bound. The CPU is literally sitting idle, waiting for data to arrive from RAM. **Practical tip**: Before touching any code, run `perf stat ./your_program` and look at the IPC (instructions per cycle). An IPC below 1.0 almost always means you are memory-bound, not compute-bound. *** ## Profiling Tools ```bash theme={null} # Record performance data perf record -g ./my_program # Analyze results perf report # Statistical overview perf stat ./my_program # Sample output: # Performance counter stats for './my_program': # 1,234.56 msec task-clock # 4,567,890,123 cycles # 3,456,789,012 instructions # 0.76 IPC # 12,345,678 cache-references # 123,456 cache-misses # 1.0% of all cache refs ``` ```bash theme={null} # Compile with profiling gcc -pg -O2 program.c -o program # Run the program (generates gmon.out) ./program # Analyze gprof program gmon.out > analysis.txt # Flat profile shows where time is spent # Call graph shows who calls what ``` ```bash theme={null} # Cache simulation valgrind --tool=cachegrind ./program # Annotate source with cache info cg_annotate cachegrind.out.* # Shows: # - Cache hit rates # - Branch prediction rates # - Per-line cache behavior ``` *** ## Memory Hierarchy Optimization ```c theme={null} // BAD: Column-major access (cache unfriendly) // Each iteration jumps to a completely different cache line. // For a 1000x1000 matrix, every single access is a cache miss. void matrix_sum_bad(int matrix[N][N]) { int sum = 0; for (int j = 0; j < N; j++) { // Column first for (int i = 0; i < N; i++) { // Row second sum += matrix[i][j]; // Jumps N*sizeof(int) bytes each access } } } // GOOD: Row-major access (cache friendly) // C stores 2D arrays in row-major order: row 0, then row 1, etc. // Sequential access means one cache miss loads ~16 ints at once // (64-byte cache line / 4-byte int = 16 consecutive ints for free). void matrix_sum_good(int matrix[N][N]) { int sum = 0; for (int i = 0; i < N; i++) { // Row first for (int j = 0; j < N; j++) { // Column second sum += matrix[i][j]; // Sequential access: next address in memory } } } // Performance difference: 10-50x on large matrices! // Why so dramatic? Because the bad version's cache miss rate approaches 100% // while the good version's approaches ~6% (one miss per 16 elements). // BETTER: Blocking for cache locality // The idea: instead of processing the whole matrix row by row, process it in // small tiles (blocks) that fit entirely in L1 cache. This means all three // matrices A, B, C have their active block resident in cache simultaneously. // A typical L1 data cache is 32-48KB, so a 64x64 block of doubles = 32KB. #define BLOCK 64 void matrix_multiply_blocked(double *A, double *B, double *C, int n) { for (int i = 0; i < n; i += BLOCK) { for (int j = 0; j < n; j += BLOCK) { for (int k = 0; k < n; k += BLOCK) { // Multiply block for (int ii = i; ii < i + BLOCK && ii < n; ii++) { for (int jj = j; jj < j + BLOCK && jj < n; jj++) { double sum = C[ii*n + jj]; for (int kk = k; kk < k + BLOCK && kk < n; kk++) { sum += A[ii*n + kk] * B[kk*n + jj]; } C[ii*n + jj] = sum; } } } } } } ``` *** ## Data Structure Layout The way you organize your structs determines how the CPU cache behaves. Think of it like organizing a library: if every book about physics also has a chapter of recipes bound into it, you waste shelf space carrying recipes you never read when you are doing physics research. ```c theme={null} // BAD: Poor cache utilization (Array of Structs) // Each particle is 60 bytes. When updating positions, we load ALL 60 bytes // into cache but only use x, y, z, vx, vy, vz (24 bytes). That is 60% wasted bandwidth. struct Particle_AoS { float x, y, z; // Position (12 bytes) float vx, vy, vz; // Velocity (12 bytes) float mass; // 4 bytes int type; // 4 bytes char name[32]; // Only used for debugging -- but loaded into cache every time! }; struct Particle_AoS particles[1000000]; // When updating positions, we load name[] into cache too void update_positions_aos(void) { for (int i = 0; i < 1000000; i++) { particles[i].x += particles[i].vx; // Lots of cache misses particles[i].y += particles[i].vy; particles[i].z += particles[i].vz; } } // GOOD: Cache-friendly (Struct of Arrays) struct ParticleSystem { float *x, *y, *z; // Positions float *vx, *vy, *vz; // Velocities float *mass; int *type; size_t count; }; // Sequential access, no wasted cache space void update_positions_soa(struct ParticleSystem *ps) { for (size_t i = 0; i < ps->count; i++) { ps->x[i] += ps->vx[i]; // Sequential access ps->y[i] += ps->vy[i]; ps->z[i] += ps->vz[i]; } } // Often 2-4x faster for large datasets // BEST: Hybrid approach - group hot data together // This is what game engines actually do. Separate data into "hot" fields // (accessed every frame) and "cold" fields (accessed only for UI or debugging). // You get cache efficiency without the awkward SoA API. struct ParticleHot { float x, y, z; float vx, vy, vz; }; // 24 bytes -- fits ~2.6 particles per cache line struct ParticleCold { float mass; int type; char name[32]; }; // 40 bytes -- only loaded when you actually need mass/type/name struct ParticleHot hot_data[1000000]; // Updated every frame struct ParticleCold cold_data[1000000]; // Rarely accessed // Practical tip: index both arrays with the same ID. hot_data[i] and // cold_data[i] always refer to the same logical particle. ``` *** ## Branch Prediction Modern CPUs predict which way an `if` statement will go and start executing that path *before* they know the answer. If the prediction is wrong, they throw away 15-20 cycles of wasted work and restart. When data is random, the predictor is wrong \~50% of the time, which means every other iteration pays a 15-20 cycle penalty. ```c theme={null} // BAD: Unpredictable branches // With random data, the branch predictor is correct only ~50% of the time. // On a tight loop processing millions of elements, this costs real seconds. void process_bad(int *data, int n) { for (int i = 0; i < n; i++) { if (data[i] > 0) { // 50% taken - unpredictable positive_count++; } else { negative_count++; } } } // GOOD: Branchless code void process_branchless(int *data, int n) { for (int i = 0; i < n; i++) { int is_positive = data[i] > 0; positive_count += is_positive; negative_count += !is_positive; } } // BETTER: Sort data first if possible // After sorting, branches become predictable qsort(data, n, sizeof(int), compare); process_bad(data, n); // Now branches are predictable! // Branchless min/max int branchless_min(int a, int b) { return b ^ ((a ^ b) & -(a < b)); } int branchless_max(int a, int b) { return a ^ ((a ^ b) & -(a < b)); } // Branchless absolute value int branchless_abs(int x) { int mask = x >> 31; return (x ^ mask) - mask; } // Conditional move using compiler hints int conditional_value(int condition, int true_val, int false_val) { // GCC might generate CMOV instruction return condition ? true_val : false_val; } ``` *** ## SIMD Vectorization SIMD (Single Instruction, Multiple Data) lets the CPU perform the same operation on multiple data elements in a single instruction. Think of it as the difference between washing dishes one at a time versus loading a dishwasher: same total work, but the parallel approach is dramatically faster. **Before writing manual SIMD**: check if the compiler does it for you. Compile with `-O2 -march=native` and add `-fopt-info-vec-optimized` (GCC) or `-Rpass=loop-vectorize` (Clang) to see what the compiler auto-vectorizes. Manual SIMD is only worth it when the compiler cannot figure out the pattern on its own. ```c theme={null} #include // For AVX/SSE intrinsics // Scalar version -- processes one float per iteration void add_arrays_scalar(float *a, float *b, float *c, int n) { for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } } // SIMD version (AVX - 8 floats at once) // AVX registers are 256 bits wide = 8 x 32-bit floats. // One _mm256_add_ps instruction does the work of 8 scalar additions. void add_arrays_simd(float *a, float *b, float *c, int n) { int i; for (i = 0; i <= n - 8; i += 8) { __m256 va = _mm256_loadu_ps(&a[i]); // Load 8 floats from a ("u" = unaligned) __m256 vb = _mm256_loadu_ps(&b[i]); // Load 8 floats from b __m256 vc = _mm256_add_ps(va, vb); // Add all 8 pairs simultaneously _mm256_storeu_ps(&c[i], vc); // Store 8 results to c } // Handle remainder (when n is not a multiple of 8) // This scalar tail is necessary for correctness but processes very few elements. for (; i < n; i++) { c[i] = a[i] + b[i]; } } // Dot product with SIMD float dot_product_simd(float *a, float *b, int n) { __m256 sum = _mm256_setzero_ps(); int i; for (i = 0; i <= n - 8; i += 8) { __m256 va = _mm256_loadu_ps(&a[i]); __m256 vb = _mm256_loadu_ps(&b[i]); sum = _mm256_fmadd_ps(va, vb, sum); // Fused multiply-add } // Horizontal sum __m128 hi = _mm256_extractf128_ps(sum, 1); __m128 lo = _mm256_castps256_ps128(sum); lo = _mm_add_ps(lo, hi); lo = _mm_hadd_ps(lo, lo); lo = _mm_hadd_ps(lo, lo); float result = _mm_cvtss_f32(lo); // Handle remainder for (; i < n; i++) { result += a[i] * b[i]; } return result; } ``` *** ## Compiler Optimization Hints ```c theme={null} // Alignment hints void process_aligned(float * __restrict__ __attribute__((aligned(32))) data, int n) { // Compiler knows data is 32-byte aligned for (int i = 0; i < n; i++) { data[i] *= 2.0f; } } // Restrict pointers - no aliasing void vector_add(float * __restrict__ a, float * __restrict__ b, float * __restrict__ c, int n) { // Compiler knows a, b, c don't overlap for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; } } // Branch prediction hints #define likely(x) __builtin_expect(!!(x), 1) #define unlikely(x) __builtin_expect(!!(x), 0) int process_with_hints(int *data, int n) { for (int i = 0; i < n; i++) { if (unlikely(data[i] < 0)) { return -1; // Error path - rarely taken } // Normal processing... } return 0; } // Force inlining static inline __attribute__((always_inline)) int fast_abs(int x) { int mask = x >> 31; return (x ^ mask) - mask; } // Prevent inlining (for debugging) __attribute__((noinline)) void debug_function(void) { // ... } // Hot/cold function hints __attribute__((hot)) void frequently_called(void) { // Compiler optimizes more aggressively } __attribute__((cold)) void error_handler(void) { // Compiler optimizes for size } // Prefetch hints void process_with_prefetch(float *data, int n) { for (int i = 0; i < n; i++) { // Prefetch data we'll need 64 iterations later __builtin_prefetch(&data[i + 64], 0, 3); data[i] = sqrt(data[i]); } } ``` *** ## Memory Allocation Optimization `malloc` is not free. Each call involves locking a global mutex (in most implementations), searching a free list, potentially requesting memory from the OS via `mmap` or `sbrk`, and updating bookkeeping metadata. In a tight loop, this overhead dominates your computation time. ```c theme={null} // Avoid frequent malloc/free // BAD: Allocate in loop // Each iteration: lock mutex -> search free list -> possibly syscall -> unlock mutex // For 1 million iterations, that is 1 million lock/unlock cycles and free list scans. void process_bad(int n) { for (int i = 0; i < n; i++) { char *buf = malloc(1024); // ~50-200ns per call on modern systems do_work(buf); free(buf); } } // GOOD: Allocate once void process_good(int n) { char *buf = malloc(1024); for (int i = 0; i < n; i++) { do_work(buf); } free(buf); } // BETTER: Arena allocator for bulk operations // An arena is like a notepad: you write sequentially from front to back, // and when you are done with everything, you rip off the whole page at once. // No per-allocation bookkeeping, no fragmentation, no free-list searching. typedef struct { char *memory; size_t used; size_t capacity; } Arena; void *arena_alloc(Arena *a, size_t size) { size = (size + 7) & ~7; // Round up to 8-byte alignment for safety if (a->used + size > a->capacity) return NULL; void *ptr = a->memory + a->used; a->used += size; // "Allocation" is just a pointer bump: ~1-2ns return ptr; } void arena_reset(Arena *a) { a->used = 0; // O(1) "free" everything -- no per-object cleanup needed // Perfect for request-scoped work: allocate during request, reset at end } // Stack allocation for small buffers void process_stack(int n) { for (int i = 0; i < n; i++) { char buf[1024]; // Stack allocation - very fast do_work(buf); } } // Avoid false sharing in multithreaded code // False sharing happens when two threads modify different variables that live // on the same cache line (64 bytes). The CPU invalidates the entire cache line // whenever either thread writes, causing both threads to constantly reload data // they did not change. This can make multithreaded code SLOWER than single-threaded. struct __attribute__((aligned(64))) ThreadData { long counter; char padding[56]; // Pad to 64 bytes so each counter gets its own cache line }; struct ThreadData counters[NUM_THREADS]; // Without padding: 4 threads sharing a cache line can run 10-50x slower // than expected due to constant cache-line ping-ponging between CPU cores. ``` *** ## Loop Optimizations ```c theme={null} // Loop unrolling void sum_unrolled(int *arr, int n, long *result) { long sum = 0; int i; // Process 4 elements per iteration for (i = 0; i <= n - 4; i += 4) { sum += arr[i]; sum += arr[i + 1]; sum += arr[i + 2]; sum += arr[i + 3]; } // Handle remainder for (; i < n; i++) { sum += arr[i]; } *result = sum; } // Loop fusion - combine loops // BAD: Two loops, two cache passes void process_separate(float *a, float *b, int n) { for (int i = 0; i < n; i++) a[i] *= 2; for (int i = 0; i < n; i++) b[i] *= 3; } // GOOD: One loop, one cache pass void process_fused(float *a, float *b, int n) { for (int i = 0; i < n; i++) { a[i] *= 2; b[i] *= 3; } } // Loop interchange for cache locality // BAD: Column-major access void init_bad(int m[N][N]) { for (int j = 0; j < N; j++) for (int i = 0; i < N; i++) m[i][j] = 0; } // GOOD: Row-major access void init_good(int m[N][N]) { for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) m[i][j] = 0; } // Loop invariant motion (manual) // BAD: Recompute every iteration for (int i = 0; i < n; i++) { result[i] = data[i] * sin(angle) * scale; } // GOOD: Compute once double factor = sin(angle) * scale; for (int i = 0; i < n; i++) { result[i] = data[i] * factor; } ``` *** ## String and Memory Operations ```c theme={null} // Use optimized library functions #include // memcpy is highly optimized - use it void copy_data(char *dst, char *src, size_t n) { memcpy(dst, src, n); // Uses SIMD, non-temporal stores, etc. } // For known small sizes, inline static inline void copy_16(char *dst, char *src) { __m128i v = _mm_loadu_si128((const __m128i*)src); _mm_storeu_si128((__m128i*)dst, v); } // Avoid strlen in loops // BAD for (int i = 0; i < strlen(str); i++) { // O(n²)! process(str[i]); } // GOOD size_t len = strlen(str); for (size_t i = 0; i < len; i++) { process(str[i]); } // Or even better for (char *p = str; *p; p++) { process(*p); } // Use memset for initialization int arr[1000]; memset(arr, 0, sizeof(arr)); // Faster than loop ``` *** ## Micro-benchmarking ```c theme={null} #include // High-resolution timing static inline uint64_t rdtsc(void) { unsigned int lo, hi; __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi)); return ((uint64_t)hi << 32) | lo; } // Or use clock_gettime double get_time(void) { struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts); return ts.tv_sec + ts.tv_nsec * 1e-9; } // Benchmark function void benchmark(const char *name, void (*func)(void), int iterations) { // Warmup for (int i = 0; i < 100; i++) func(); double start = get_time(); for (int i = 0; i < iterations; i++) { func(); } double end = get_time(); double per_call = (end - start) / iterations * 1e9; printf("%s: %.2f ns per call\n", name, per_call); } // Prevent dead code elimination // The compiler is smart enough to remove computations whose results are never used. // These functions use inline assembly to "trick" the compiler into thinking // the value is needed, without actually doing any work at runtime. void escape(void *p) { __asm__ volatile("" : : "g"(p) : "memory"); // Tells compiler: "p is used by something you cannot see, so do not optimize it away" } void clobber(void) { __asm__ volatile("" : : : "memory"); // Tells compiler: "memory may have changed, so reload anything you cached" // Use this between benchmark iterations to prevent the compiler from // hoisting work out of your measurement loop. } // Practical tip: if your benchmark shows impossibly fast results (< 1ns per call), // the compiler probably optimized away your entire computation. Use escape/clobber // on your inputs and outputs to prevent this. ``` *** ## Checklist Never optimize without measuring Sequential access, avoid cache misses Predictable or branchless code Use SIMD for data-parallel work Pool, arena, or stack allocate Help compiler optimize pointer code *** ## Next Up Write C for resource-constrained systems