Performance Optimization
Learn how to write C code that runs at maximum speed. This guide covers CPU architecture awareness, memory hierarchy optimization, and profiling techniques. The single most important lesson in performance engineering: measure before you optimize. Intuition about where time goes is wrong more often than it is right. A function you think is slow might already be optimized by the compiler, while the bottleneck hides in a memory access pattern you never questioned. Profile first, optimize second, measure again third.Understanding Performance
Where Time Goes
Where Time Goes
Modern programs spend most time waiting. Think of your CPU as a chef who can chop vegetables at superhuman speed but has to walk to the warehouse every time they need an ingredient that is not on the counter:
- Register access: 1 cycle (the cutting board in front of you)
- L1 cache hit: 3-4 cycles (the spice rack on the counter)
- L2 cache hit: 10-12 cycles (the pantry in the next room)
- L3 cache hit: 30-40 cycles (the storage closet down the hall)
- Memory access (cache miss): 100-300 cycles (driving to the warehouse)
- Branch misprediction: 15-20 cycles penalty (starting to chop onions, then realizing you needed garlic)
- System calls: Thousands of cycles (calling the supplier on the phone)
- Disk I/O: Millions of cycles (ordering from another country)
perf stat ./your_program and look at the IPC (instructions per cycle). An IPC below 1.0 almost always means you are memory-bound, not compute-bound.Profiling Tools
- perf
- gprof
- Valgrind
Memory Hierarchy Optimization
Data Structure Layout
The way you organize your structs determines how the CPU cache behaves. Think of it like organizing a library: if every book about physics also has a chapter of recipes bound into it, you waste shelf space carrying recipes you never read when you are doing physics research.Branch Prediction
Modern CPUs predict which way anif statement will go and start executing that path before they know the answer. If the prediction is wrong, they throw away 15-20 cycles of wasted work and restart. When data is random, the predictor is wrong ~50% of the time, which means every other iteration pays a 15-20 cycle penalty.
SIMD Vectorization
SIMD (Single Instruction, Multiple Data) lets the CPU perform the same operation on multiple data elements in a single instruction. Think of it as the difference between washing dishes one at a time versus loading a dishwasher: same total work, but the parallel approach is dramatically faster. Before writing manual SIMD: check if the compiler does it for you. Compile with-O2 -march=native and add -fopt-info-vec-optimized (GCC) or -Rpass=loop-vectorize (Clang) to see what the compiler auto-vectorizes. Manual SIMD is only worth it when the compiler cannot figure out the pattern on its own.
Compiler Optimization Hints
Memory Allocation Optimization
malloc is not free. Each call involves locking a global mutex (in most implementations), searching a free list, potentially requesting memory from the OS via mmap or sbrk, and updating bookkeeping metadata. In a tight loop, this overhead dominates your computation time.
Loop Optimizations
String and Memory Operations
Micro-benchmarking
Checklist
Profile First
Never optimize without measuring
Cache Friendly
Sequential access, avoid cache misses
Reduce Branches
Predictable or branchless code
Vectorize
Use SIMD for data-parallel work
Reduce Allocations
Pool, arena, or stack allocate
Use restrict
Help compiler optimize pointer code
Next Up
Embedded Systems
Write C for resource-constrained systems