Observability & Performance Engineering
A “Senior” engineer doesn’t just use top. They use the hardware’s own diagnostic features to find bottlenecks at the cycle level. This chapter covers the internals of kernel and hardware observability.
1. Hardware Performance Monitoring Units (PMU)
Every modern CPU has a Performance Monitoring Unit (PMU): a set of dedicated counter registers that count specific hardware events in hardware, with negligible overhead while counting.
Common Events
- Cycles: Total CPU clock cycles.
- Instructions: Total instructions retired (the “committed” state).
- Cache Misses: L1/L2/LLC misses.
- Branch Mispredictions: Critical for pipeline performance.
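Raw event counts are usually most useful as ratios. The following is a minimal sketch of how the events above combine into derived metrics. The `CounterSnapshot` container and the numbers in it are hypothetical and chosen only for illustration; they are not output from any real tool.

```python
from dataclasses import dataclass


@dataclass
class CounterSnapshot:
    """Raw PMU totals for one run (hypothetical container, not a perf API)."""
    cycles: int
    instructions: int
    cache_misses: int
    branches: int
    branch_misses: int


def derived_metrics(c: CounterSnapshot) -> dict[str, float]:
    """Turn raw event counts into the ratios people actually compare."""
    return {
        # Instructions retired per clock cycle.
        "ipc": c.instructions / c.cycles,
        # Cache misses per 1,000 retired instructions.
        "cache_mpki": 1000 * c.cache_misses / c.instructions,
        # Fraction of branches that were mispredicted.
        "branch_miss_rate": c.branch_misses / c.branches,
    }


# Made-up numbers, purely for illustration.
snapshot = CounterSnapshot(
    cycles=4_000_000,
    instructions=2_000_000,
    cache_misses=10_000,
    branches=400_000,
    branch_misses=8_000,
)
metrics = derived_metrics(snapshot)
print(metrics)
```

A low IPC combined with a high miss rate suggests the core is stalled rather than doing useful work, which is a signal that raw CPU utilization cannot show.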
Sampling vs. Counting
- Counting: The CPU simply increments a counter. At the end of the run, you see the total (e.g., “1 million cache misses”).
- Sampling: The CPU is configured to raise an interrupt every N events (e.g., every 1000 cache misses). The kernel then records the Instruction Pointer (RIP on x86-64) at each interrupt. This allows perf to build a histogram of where in the code the misses occur.
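The difference between the two modes can be sketched with a small simulation. This is not how perf is implemented; the event stream, the function names `parse` and `render`, and the 90/10 split are assumptions made up for the example.

```python
import random
from collections import Counter

SAMPLE_PERIOD = 100  # simulated "interrupt" every N events


def miss_locations(n_events, seed=42):
    """Yield the (hypothetical) function in which each cache miss occurs."""
    rng = random.Random(seed)
    for _ in range(n_events):
        # Assumed workload: about 90% of misses in parse(), 10% in render().
        yield "parse" if rng.random() < 0.9 else "render"


def profile(events, period):
    """Return (total event count, histogram of sampled locations)."""
    total = 0
    histogram = Counter()
    for location in events:
        total += 1                 # counting mode: just increment
        if total % period == 0:    # sampling mode: counter overflowed
            histogram[location] += 1
    return total, histogram


total, histogram = profile(miss_locations(10_000), SAMPLE_PERIOD)
print("counting mode total:", total)
for name, samples in histogram.most_common():
    # Each sample stands in for roughly SAMPLE_PERIOD real events.
    print(f"{name}: {samples} samples (~{samples * SAMPLE_PERIOD} misses)")
```

Counting yields one exact number; sampling trades exactness for attribution, since each sample represents about `SAMPLE_PERIOD` events at the recorded location.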
2. LBR (Last Branch Record)
Stack traces are expensive to capture, and many optimized binaries don’t have “Frame Pointers” (RBP is used as a general-purpose register).
- LBR is a hardware feature that keeps a ring buffer of the last 8-32 branches taken by the CPU.
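- Benefit: It allows perf to reconstruct the call graph and hot paths with near-zero overhead, without needing frame pointers or debug-info-based unwinding. Resolving addresses to function names still requires a symbol table, and the recoverable stack depth is limited by the size of the LBR buffer.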
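As a rough mental model, the LBR behaves like a fixed-size ring buffer of branch records. The sketch below is a software simulation under that assumption; the `LastBranchRecord` class, the depth of 16, and the function names are hypothetical, and real hardware records addresses rather than names.

```python
from collections import deque

LBR_DEPTH = 16  # assumed depth; real depth depends on the CPU model


class LastBranchRecord:
    """Toy model of a hardware ring buffer of recent branches."""

    def __init__(self, depth=LBR_DEPTH):
        # deque(maxlen=...) silently drops the oldest entry when full,
        # which mirrors how a ring buffer overwrites old records.
        self.entries = deque(maxlen=depth)

    def record(self, kind, src, dst):
        """Record one taken branch: kind is 'call' or 'ret'."""
        self.entries.append((kind, src, dst))

    def call_stack(self):
        """Replay the surviving records to rebuild the active call chain."""
        stack = []
        for kind, _src, dst in self.entries:
            if kind == "call":
                stack.append(dst)
            elif kind == "ret" and stack:
                stack.pop()
        return stack


def trace(lbr):
    # main -> parse -> lex (returns) -> emit
    lbr.record("call", "main", "parse")
    lbr.record("call", "parse", "lex")
    lbr.record("ret", "lex", "parse")
    lbr.record("call", "parse", "emit")


deep = LastBranchRecord()
trace(deep)
print(deep.call_stack())      # full chain below main

shallow = LastBranchRecord(depth=2)
trace(shallow)
print(shallow.call_stack())   # older calls were overwritten
```

The second run shows the main limitation: once the buffer wraps, the oldest callers are gone, so deep call chains come back truncated.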
3. ftrace: The Function Tracer
ftrace is the built-in Linux kernel tracer. It can trace nearly every function call inside the kernel; functions marked notrace or inlined by the compiler are not visible to it.
The Magic: Dynamic Patching
How does ftrace avoid overhead when not in use?
- Compile Time: The compiler inserts a call to mcount or __fentry__ at the start of every traceable function.
- Boot Time: The kernel finds all these call sites and replaces each one with a single 5-byte NOP instruction (on x86).
- Activation: When you start tracing, the kernel dynamically patches the NOPs of the selected functions into CALL instructions to the tracer.
- Result: Near-zero overhead when disabled, and high-precision tracing when enabled.
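The patching steps above can be sketched at the byte level. This is a simplified illustration on a `bytearray`, not kernel code: the addresses are offsets in a toy address space, the helper names are hypothetical, and the real kernel must also keep other CPUs from executing a half-written instruction, which this sketch ignores.

```python
import struct

NOP5 = bytes.fromhex("0f1f440000")  # 5-byte x86-64 NOP
CALL_OPCODE = 0xE8                  # x86 CALL rel32


def make_call(site, target):
    """Encode CALL rel32; the offset is relative to the next instruction."""
    rel = target - (site + 5)
    return bytes([CALL_OPCODE]) + struct.pack("<i", rel)


def enable_tracing(text, site, tracer_addr):
    """Patch the NOP at `site` into a call to the tracer."""
    if text[site:site + 5] != NOP5:
        raise ValueError("expected a 5-byte NOP at the patch site")
    text[site:site + 5] = make_call(site, tracer_addr)


def disable_tracing(text, site):
    """Restore the NOP so the function runs without tracing."""
    text[site:site + 5] = NOP5


# Toy function: NOP placeholder, then push rbp; mov rbp, rsp; ret.
body = bytes.fromhex("554889e5c3")
text = bytearray(NOP5 + body)
TRACER_ADDR = 0x100  # hypothetical tracer location in the toy address space

enable_tracing(text, 0, TRACER_ADDR)
print(text[:5].hex())   # e8fb000000
disable_tracing(text, 0)
print(text[:5].hex())   # 0f1f440000
```

Because the NOP and the CALL are both five bytes long, the rest of the function never moves, which is what makes patching in place possible.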
4. kprobes and uprobes
Dynamic instrumentation allows you to “hook” almost any instruction in a running system (the kernel excludes a few sensitive regions, such as the probe machinery itself).
- kprobes: Hooks into kernel code.
- uprobes: Hooks into user-space code (shared libraries, binaries).
- How it works: The kernel replaces the first byte of the target instruction with a breakpoint (e.g., INT3 on x86). When hit, the CPU traps to the kernel, which runs your probe handler (or eBPF program), executes the saved original instruction out of line (by single-stepping or emulating it), and then resumes normal execution.
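The breakpoint mechanism can be illustrated with a toy interpreter. This is only an analogy under simplifying assumptions: `ToyCPU`, its two-opcode instruction set, and the handler signature are all hypothetical, and nothing here touches the real kprobes API.

```python
BREAKPOINT = "int3"


class ToyCPU:
    """Tiny accumulator machine used to mimic breakpoint-based probes."""

    def __init__(self, program):
        self.program = list(program)  # list of (opcode, operand)
        self.acc = 0
        self.probes = {}              # pc -> (saved instruction, handler)

    def register_probe(self, pc, handler):
        # Save the original instruction and overwrite it with a breakpoint.
        self.probes[pc] = (self.program[pc], handler)
        self.program[pc] = (BREAKPOINT, None)

    def unregister_probe(self, pc):
        saved, _handler = self.probes.pop(pc)
        self.program[pc] = saved

    def _execute(self, instr):
        op, arg = instr
        if op == "add":
            self.acc += arg
        elif op == "mul":
            self.acc *= arg
        else:
            raise ValueError(f"unknown opcode: {op}")

    def run(self):
        self.acc = 0
        for pc, instr in enumerate(self.program):
            if instr[0] == BREAKPOINT:
                saved, handler = self.probes[pc]
                handler(pc, self.acc)  # "trap": run the probe handler
                instr = saved          # then run the displaced instruction
            self._execute(instr)
        return self.acc


program = [("add", 2), ("mul", 5), ("add", 1)]
cpu = ToyCPU(program)
baseline = cpu.run()

hits = []
cpu.register_probe(1, lambda pc, acc: hits.append((pc, acc)))
probed = cpu.run()
cpu.unregister_probe(1)

print(baseline, probed, hits)
```

The probed run produces the same result as the baseline, which is the key property: the handler observes state at the probe point without changing what the program computes.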
5. Flame Graphs
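The de facto standard for visualizing profiling data.
- X-axis: The sampled stacks, with sibling frames sorted alphabetically so that identical frames merge. Width is proportional to the share of samples; left-to-right order does not represent the passage of time.
- Y-axis: Stack depth.
- Goal: Find the “plateaus”: wide boxes, especially along the top edge, that indicate where the CPU is spending the most time.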
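Before rendering, stack samples are typically collapsed into a "folded" form: one line per unique stack, frames joined by semicolons, followed by a count. The sketch below shows that folding step and how box widths fall out of it. The sample stacks are made up, and `fold` and `frame_widths` are hypothetical helper names, not part of any flame graph tool.

```python
from collections import Counter


def fold(samples):
    """Collapse stack samples (root frame first) into folded-stack counts."""
    folded = Counter()
    for stack in samples:
        folded[";".join(stack)] += 1
    return folded


def frame_widths(folded, depth):
    """Width of each box at one depth, as a fraction of all samples.

    Boxes are keyed by their full ancestry so that only frames with the
    same callers merge, and they are returned in alphabetical order.
    """
    total = sum(folded.values())
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        if depth < len(frames):
            widths[";".join(frames[: depth + 1])] += count
    return {prefix: widths[prefix] / total for prefix in sorted(widths)}


# Ten made-up samples, root frame first.
samples = (
    [["main", "parse", "lex"]] * 5
    + [["main", "parse"]] * 3
    + [["main", "render"]] * 2
)
folded = fold(samples)
for stack, count in sorted(folded.items()):
    print(stack, count)

print(frame_widths(folded, 1))
```

Here `parse` is 80% wide at depth 1, but only 30% of samples have it as the top frame; the remaining 50% sit in `lex`, which is the plateau worth investigating.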
6. USE Method vs. RED Method
- USE Method (Brendan Gregg): For Resources (CPU, Memory, Disk).
  - Utilization: How busy is the resource?
  - Saturation: Is there a backlog of work?
  - Errors: Are there any hardware/software errors?
- RED Method: For Services (Web servers, APIs).
  - Rate: Requests per second.
  - Errors: Failed requests.
  - Duration: Response time (latency).
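Both methods reduce to a handful of numbers per resource or service. The following sketch computes them from raw observations; the `Request` record, the choice of p95 for duration, the use of mean queue depth for saturation, and all input values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def red_metrics(requests, window_s):
    """Rate, Errors, Duration for one service over one time window."""
    if not requests:
        raise ValueError("need at least one request")
    n = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    rank = (95 * n + 99) // 100  # nearest-rank p95, integer ceiling
    return {
        "rate_rps": n / window_s,
        "error_ratio": sum(not r.ok for r in requests) / n,
        "p95_ms": latencies[rank - 1],
    }


def use_metrics(busy_s, window_s, queue_depth_samples, error_count):
    """Utilization, Saturation, Errors for one resource over one window."""
    return {
        "utilization": busy_s / window_s,
        # Saturation approximated here as the mean observed queue depth.
        "saturation": sum(queue_depth_samples) / len(queue_depth_samples),
        "errors": error_count,
    }


# Made-up data: 20 requests in a 10-second window, every tenth one fails.
requests = [Request(latency_ms=10.0 * i, ok=(i % 10 != 0)) for i in range(1, 21)]
red = red_metrics(requests, window_s=10.0)
use = use_metrics(busy_s=7.5, window_s=10.0,
                  queue_depth_samples=[0, 2, 4, 2], error_count=0)
print(red)
print(use)
```

A percentile is used for duration rather than the mean because a few slow requests can dominate user experience while barely moving the average.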
7. Perf + eBPF Triage Playbook
When a production system is slow, use this systematic approach:
Step 1: Identify the Bottleneck Type
Step 2: CPU-Bound Triage
Step 3: Memory-Bound Triage
Step 4: I/O-Bound Triage
Step 5: Off-CPU Analysis
The CPU might not be the bottleneck. What is the process waiting for?
Quick Reference Card
| Symptom | First Command | Deep Dive |
|---|---|---|
| High CPU | perf top | perf record -g + flame graph |
| High load, low CPU | offcputime | Check I/O and locks |
| Memory OOM | slabtop, vmstat | perf mem record |
| Slow disk | biolatency | bpftrace block tracepoints |
| Slow network | tcpconnlat, tcpretrans | ss -i, netstat -s |
| Lock contention | perf lock | lockstat, mutexsnoop |
Summary for Senior Engineers
- Don’t trust top CPU %: A high CPU % may hide cycles spent “stalled” waiting for memory (cache misses). Use perf stat to check IPC (Instructions Per Cycle).
- Use ftrace to understand the “flow” of a specific kernel subsystem (e.g., the scheduler or network stack).
- LBR is the secret to high-fidelity profiling on production systems with stripped binaries.
- Off-CPU Analysis: Performance is not just about what the CPU is doing, but what it’s waiting for (I/O, locks). Use eBPF to trace “off-cpu” time.
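The idea behind off-CPU analysis can be sketched as simple bookkeeping over scheduler events: record when a thread leaves the CPU and charge the elapsed time to whatever it was waiting for when it returns. The event tuples and reason labels below are hypothetical; real eBPF tools attribute blocked time to stack traces rather than to a ready-made label.

```python
from collections import defaultdict


def off_cpu_time(events):
    """Sum blocked time per reason.

    `events` is an iterable of (timestamp_us, kind, reason) tuples for one
    thread, where kind is "off" (descheduled) or "on" (scheduled back in).
    """
    blocked = defaultdict(int)
    went_off = None
    for ts, kind, reason in events:
        if kind == "off":
            went_off = (ts, reason)
        elif kind == "on" and went_off is not None:
            start, why = went_off
            blocked[why] += ts - start
            went_off = None
    return dict(blocked)


# Made-up timeline for a single thread, timestamps in microseconds.
events = [
    (0, "on", None),
    (100, "off", "disk_read"),
    (900, "on", None),
    (1000, "off", "mutex"),
    (1200, "on", None),
    (1300, "off", "disk_read"),
    (2000, "on", None),
]
blocked = off_cpu_time(events)
wall_us = events[-1][0] - events[0][0]
on_cpu_us = wall_us - sum(blocked.values())
print(blocked)
print(f"on-CPU: {on_cpu_us} us of {wall_us} us")
```

In this timeline the thread is on-CPU for only 300 of 2000 microseconds, so an on-CPU profile would describe just 15% of its wall-clock time.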