Learn how to write C code that runs at maximum speed. This guide covers CPU architecture awareness, memory hierarchy optimization, and profiling techniques.The single most important lesson in performance engineering: measure before you optimize. Intuition about where time goes is wrong more often than it is right. A function you think is slow might already be optimized by the compiler, while the bottleneck hides in a memory access pattern you never questioned. Profile first, optimize second, measure again third.
Modern programs spend most time waiting. Think of your CPU as a chef who can chop vegetables at superhuman speed but has to walk to the warehouse every time they need an ingredient that is not on the counter:
Register access: 1 cycle (the cutting board in front of you)
L1 cache hit: 3-4 cycles (the spice rack on the counter)
L2 cache hit: 10-12 cycles (the pantry in the next room)
L3 cache hit: 30-40 cycles (the storage closet down the hall)
Memory access (cache miss): 100-300 cycles (driving to the warehouse)
Branch misprediction: 15-20 cycles penalty (starting to chop onions, then realizing you needed garlic)
System calls: Thousands of cycles (calling the supplier on the phone)
Disk I/O: Millions of cycles (ordering from another country)
The key insight: optimize memory access patterns first. Most “slow” C code is not compute-bound — it is memory-bound. The CPU is literally sitting idle, waiting for data to arrive from RAM.Practical tip: Before touching any code, run perf stat ./your_program and look at the IPC (instructions per cycle). An IPC below 1.0 almost always means you are memory-bound, not compute-bound.
# Record performance dataperf record -g ./my_program# Analyze resultsperf report# Statistical overviewperf stat ./my_program# Sample output:# Performance counter stats for './my_program':# 1,234.56 msec task-clock# 4,567,890,123 cycles# 3,456,789,012 instructions # 0.76 IPC# 12,345,678 cache-references# 123,456 cache-misses # 1.0% of all cache refs
# Compile with profilinggcc -pg -O2 program.c -o program# Run the program (generates gmon.out)./program# Analyzegprof program gmon.out > analysis.txt# Flat profile shows where time is spent# Call graph shows who calls what
// BAD: Column-major access (cache unfriendly)// Each iteration jumps to a completely different cache line.// For a 1000x1000 matrix, every single access is a cache miss.void matrix_sum_bad(int matrix[N][N]) { int sum = 0; for (int j = 0; j < N; j++) { // Column first for (int i = 0; i < N; i++) { // Row second sum += matrix[i][j]; // Jumps N*sizeof(int) bytes each access } }}// GOOD: Row-major access (cache friendly)// C stores 2D arrays in row-major order: row 0, then row 1, etc.// Sequential access means one cache miss loads ~16 ints at once// (64-byte cache line / 4-byte int = 16 consecutive ints for free).void matrix_sum_good(int matrix[N][N]) { int sum = 0; for (int i = 0; i < N; i++) { // Row first for (int j = 0; j < N; j++) { // Column second sum += matrix[i][j]; // Sequential access: next address in memory } }}// Performance difference: 10-50x on large matrices!// Why so dramatic? Because the bad version's cache miss rate approaches 100%// while the good version's approaches ~6% (one miss per 16 elements).// BETTER: Blocking for cache locality// The idea: instead of processing the whole matrix row by row, process it in// small tiles (blocks) that fit entirely in L1 cache. This means all three// matrices A, B, C have their active block resident in cache simultaneously.// A typical L1 data cache is 32-48KB, so a 64x64 block of doubles = 32KB.#define BLOCK 64void matrix_multiply_blocked(double *A, double *B, double *C, int n) { for (int i = 0; i < n; i += BLOCK) { for (int j = 0; j < n; j += BLOCK) { for (int k = 0; k < n; k += BLOCK) { // Multiply block for (int ii = i; ii < i + BLOCK && ii < n; ii++) { for (int jj = j; jj < j + BLOCK && jj < n; jj++) { double sum = C[ii*n + jj]; for (int kk = k; kk < k + BLOCK && kk < n; kk++) { sum += A[ii*n + kk] * B[kk*n + jj]; } C[ii*n + jj] = sum; } } } } }}
The way you organize your structs determines how the CPU cache behaves. Think of it like organizing a library: if every book about physics also has a chapter of recipes bound into it, you waste shelf space carrying recipes you never read when you are doing physics research.
// BAD: Poor cache utilization (Array of Structs)// Each particle is 60 bytes. When updating positions, we load ALL 60 bytes// into cache but only use x, y, z, vx, vy, vz (24 bytes). That is 60% wasted bandwidth.struct Particle_AoS { float x, y, z; // Position (12 bytes) float vx, vy, vz; // Velocity (12 bytes) float mass; // 4 bytes int type; // 4 bytes char name[32]; // Only used for debugging -- but loaded into cache every time!};struct Particle_AoS particles[1000000];// When updating positions, we load name[] into cache toovoid update_positions_aos(void) { for (int i = 0; i < 1000000; i++) { particles[i].x += particles[i].vx; // Lots of cache misses particles[i].y += particles[i].vy; particles[i].z += particles[i].vz; }}// GOOD: Cache-friendly (Struct of Arrays)struct ParticleSystem { float *x, *y, *z; // Positions float *vx, *vy, *vz; // Velocities float *mass; int *type; size_t count;};// Sequential access, no wasted cache spacevoid update_positions_soa(struct ParticleSystem *ps) { for (size_t i = 0; i < ps->count; i++) { ps->x[i] += ps->vx[i]; // Sequential access ps->y[i] += ps->vy[i]; ps->z[i] += ps->vz[i]; }}// Often 2-4x faster for large datasets// BEST: Hybrid approach - group hot data together// This is what game engines actually do. Separate data into "hot" fields// (accessed every frame) and "cold" fields (accessed only for UI or debugging).// You get cache efficiency without the awkward SoA API.struct ParticleHot { float x, y, z; float vx, vy, vz;}; // 24 bytes -- fits ~2.6 particles per cache linestruct ParticleCold { float mass; int type; char name[32];}; // 40 bytes -- only loaded when you actually need mass/type/namestruct ParticleHot hot_data[1000000]; // Updated every framestruct ParticleCold cold_data[1000000]; // Rarely accessed// Practical tip: index both arrays with the same ID. hot_data[i] and// cold_data[i] always refer to the same logical particle.
Modern CPUs predict which way an if statement will go and start executing that path before they know the answer. If the prediction is wrong, they throw away 15-20 cycles of wasted work and restart. When data is random, the predictor is wrong ~50% of the time, which means every other iteration pays a 15-20 cycle penalty.
// BAD: Unpredictable branches// With random data, the branch predictor is correct only ~50% of the time.// On a tight loop processing millions of elements, this costs real seconds.void process_bad(int *data, int n) { for (int i = 0; i < n; i++) { if (data[i] > 0) { // 50% taken - unpredictable positive_count++; } else { negative_count++; } }}// GOOD: Branchless codevoid process_branchless(int *data, int n) { for (int i = 0; i < n; i++) { int is_positive = data[i] > 0; positive_count += is_positive; negative_count += !is_positive; }}// BETTER: Sort data first if possible// After sorting, branches become predictableqsort(data, n, sizeof(int), compare);process_bad(data, n); // Now branches are predictable!// Branchless min/maxint branchless_min(int a, int b) { return b ^ ((a ^ b) & -(a < b));}int branchless_max(int a, int b) { return a ^ ((a ^ b) & -(a < b));}// Branchless absolute valueint branchless_abs(int x) { int mask = x >> 31; return (x ^ mask) - mask;}// Conditional move using compiler hintsint conditional_value(int condition, int true_val, int false_val) { // GCC might generate CMOV instruction return condition ? true_val : false_val;}
SIMD (Single Instruction, Multiple Data) lets the CPU perform the same operation on multiple data elements in a single instruction. Think of it as the difference between washing dishes one at a time versus loading a dishwasher: same total work, but the parallel approach is dramatically faster.Before writing manual SIMD: check if the compiler does it for you. Compile with -O2 -march=native and add -fopt-info-vec-optimized (GCC) or -Rpass=loop-vectorize (Clang) to see what the compiler auto-vectorizes. Manual SIMD is only worth it when the compiler cannot figure out the pattern on its own.
#include <immintrin.h> // For AVX/SSE intrinsics// Scalar version -- processes one float per iterationvoid add_arrays_scalar(float *a, float *b, float *c, int n) { for (int i = 0; i < n; i++) { c[i] = a[i] + b[i]; }}// SIMD version (AVX - 8 floats at once)// AVX registers are 256 bits wide = 8 x 32-bit floats.// One _mm256_add_ps instruction does the work of 8 scalar additions.void add_arrays_simd(float *a, float *b, float *c, int n) { int i; for (i = 0; i <= n - 8; i += 8) { __m256 va = _mm256_loadu_ps(&a[i]); // Load 8 floats from a ("u" = unaligned) __m256 vb = _mm256_loadu_ps(&b[i]); // Load 8 floats from b __m256 vc = _mm256_add_ps(va, vb); // Add all 8 pairs simultaneously _mm256_storeu_ps(&c[i], vc); // Store 8 results to c } // Handle remainder (when n is not a multiple of 8) // This scalar tail is necessary for correctness but processes very few elements. for (; i < n; i++) { c[i] = a[i] + b[i]; }}// Dot product with SIMDfloat dot_product_simd(float *a, float *b, int n) { __m256 sum = _mm256_setzero_ps(); int i; for (i = 0; i <= n - 8; i += 8) { __m256 va = _mm256_loadu_ps(&a[i]); __m256 vb = _mm256_loadu_ps(&b[i]); sum = _mm256_fmadd_ps(va, vb, sum); // Fused multiply-add } // Horizontal sum __m128 hi = _mm256_extractf128_ps(sum, 1); __m128 lo = _mm256_castps256_ps128(sum); lo = _mm_add_ps(lo, hi); lo = _mm_hadd_ps(lo, lo); lo = _mm_hadd_ps(lo, lo); float result = _mm_cvtss_f32(lo); // Handle remainder for (; i < n; i++) { result += a[i] * b[i]; } return result;}
malloc is not free. Each call involves locking a global mutex (in most implementations), searching a free list, potentially requesting memory from the OS via mmap or sbrk, and updating bookkeeping metadata. In a tight loop, this overhead dominates your computation time.
// Avoid frequent malloc/free// BAD: Allocate in loop// Each iteration: lock mutex -> search free list -> possibly syscall -> unlock mutex// For 1 million iterations, that is 1 million lock/unlock cycles and free list scans.void process_bad(int n) { for (int i = 0; i < n; i++) { char *buf = malloc(1024); // ~50-200ns per call on modern systems do_work(buf); free(buf); }}// GOOD: Allocate oncevoid process_good(int n) { char *buf = malloc(1024); for (int i = 0; i < n; i++) { do_work(buf); } free(buf);}// BETTER: Arena allocator for bulk operations// An arena is like a notepad: you write sequentially from front to back,// and when you are done with everything, you rip off the whole page at once.// No per-allocation bookkeeping, no fragmentation, no free-list searching.typedef struct { char *memory; size_t used; size_t capacity;} Arena;void *arena_alloc(Arena *a, size_t size) { size = (size + 7) & ~7; // Round up to 8-byte alignment for safety if (a->used + size > a->capacity) return NULL; void *ptr = a->memory + a->used; a->used += size; // "Allocation" is just a pointer bump: ~1-2ns return ptr;}void arena_reset(Arena *a) { a->used = 0; // O(1) "free" everything -- no per-object cleanup needed // Perfect for request-scoped work: allocate during request, reset at end}// Stack allocation for small buffersvoid process_stack(int n) { for (int i = 0; i < n; i++) { char buf[1024]; // Stack allocation - very fast do_work(buf); }}// Avoid false sharing in multithreaded code// False sharing happens when two threads modify different variables that live// on the same cache line (64 bytes). The CPU invalidates the entire cache line// whenever either thread writes, causing both threads to constantly reload data// they did not change. This can make multithreaded code SLOWER than single-threaded.struct __attribute__((aligned(64))) ThreadData { long counter; char padding[56]; // Pad to 64 bytes so each counter gets its own cache line};struct ThreadData counters[NUM_THREADS];// Without padding: 4 threads sharing a cache line can run 10-50x slower// than expected due to constant cache-line ping-ponging between CPU cores.
// Loop unrollingvoid sum_unrolled(int *arr, int n, long *result) { long sum = 0; int i; // Process 4 elements per iteration for (i = 0; i <= n - 4; i += 4) { sum += arr[i]; sum += arr[i + 1]; sum += arr[i + 2]; sum += arr[i + 3]; } // Handle remainder for (; i < n; i++) { sum += arr[i]; } *result = sum;}// Loop fusion - combine loops// BAD: Two loops, two cache passesvoid process_separate(float *a, float *b, int n) { for (int i = 0; i < n; i++) a[i] *= 2; for (int i = 0; i < n; i++) b[i] *= 3;}// GOOD: One loop, one cache passvoid process_fused(float *a, float *b, int n) { for (int i = 0; i < n; i++) { a[i] *= 2; b[i] *= 3; }}// Loop interchange for cache locality// BAD: Column-major accessvoid init_bad(int m[N][N]) { for (int j = 0; j < N; j++) for (int i = 0; i < N; i++) m[i][j] = 0;}// GOOD: Row-major accessvoid init_good(int m[N][N]) { for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) m[i][j] = 0;}// Loop invariant motion (manual)// BAD: Recompute every iterationfor (int i = 0; i < n; i++) { result[i] = data[i] * sin(angle) * scale;}// GOOD: Compute oncedouble factor = sin(angle) * scale;for (int i = 0; i < n; i++) { result[i] = data[i] * factor;}
#include <time.h>// High-resolution timingstatic inline uint64_t rdtsc(void) { unsigned int lo, hi; __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi)); return ((uint64_t)hi << 32) | lo;}// Or use clock_gettimedouble get_time(void) { struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts); return ts.tv_sec + ts.tv_nsec * 1e-9;}// Benchmark functionvoid benchmark(const char *name, void (*func)(void), int iterations) { // Warmup for (int i = 0; i < 100; i++) func(); double start = get_time(); for (int i = 0; i < iterations; i++) { func(); } double end = get_time(); double per_call = (end - start) / iterations * 1e9; printf("%s: %.2f ns per call\n", name, per_call);}// Prevent dead code elimination// The compiler is smart enough to remove computations whose results are never used.// These functions use inline assembly to "trick" the compiler into thinking// the value is needed, without actually doing any work at runtime.void escape(void *p) { __asm__ volatile("" : : "g"(p) : "memory"); // Tells compiler: "p is used by something you cannot see, so do not optimize it away"}void clobber(void) { __asm__ volatile("" : : : "memory"); // Tells compiler: "memory may have changed, so reload anything you cached" // Use this between benchmark iterations to prevent the compiler from // hoisting work out of your measurement loop.}// Practical tip: if your benchmark shows impossibly fast results (< 1ns per call),// the compiler probably optimized away your entire computation. Use escape/clobber// on your inputs and outputs to prevent this.