Observability & Performance Engineering

A “Senior” engineer doesn’t just use top. They use the hardware’s own diagnostic features to find bottlenecks at the cycle level. This chapter covers the internals of kernel and hardware observability.

1. Hardware Performance Monitoring Units (PMU)

Every modern CPU includes a Performance Monitoring Unit (PMU): a set of dedicated counter registers that track specific hardware events with essentially zero software overhead in counting mode.

Common Events

  • Cycles: Total CPU clock cycles.
  • Instructions: Total instructions retired (the “committed” state).
  • Cache Misses: L1/L2/LLC misses.
  • Branch Mispredictions: Critical for pipeline performance.

Sampling vs. Counting

  • Counting: The CPU simply increments a counter. At the end of the run, you see the total (e.g., “1 million cache misses”).
  • Sampling: The CPU is configured to raise an interrupt every N events (e.g., every 1000 cache misses). The kernel then records the Instruction Pointer (RIP) at each interrupt. This allows perf to create a Histogram of where the misses are happening in the code. Both modes are demonstrated below.
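
Both modes map directly onto perf commands. A minimal sketch (./myapp stands in for your workload; exact event names vary by CPU):

# Counting: one total per event, printed at the end of the run
perf stat -e cycles,instructions,cache-misses ./myapp

# Sampling: interrupt every 1000 cache misses, recording RIP and stack
sudo perf record -e cache-misses -c 1000 -g ./myapp
perf report      # histogram of where the misses cluster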

2. LBR (Last Branch Record)

Stack traces are expensive to capture, and many optimized binaries don’t have “Frame Pointers” (RBP is used as a general-purpose register).
  • LBR is a hardware feature that keeps a ring buffer of the last 8-32 taken branches (the depth depends on the microarchitecture).
  • Benefit: It allows perf to reconstruct the Call Graph and Hot Paths with near-zero overhead, without frame pointers or DWARF unwind info; see the example below. (Symbols are still needed to turn addresses into function names.)
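
On CPUs that expose LBR, a single perf flag selects it (<PID> stands in for your target process):

# Capture call graphs from the LBR ring buffer instead of unwinding stacks
sudo perf record --call-graph lbr -p <PID> -- sleep 30
perf report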

3. ftrace: The Function Tracer

ftrace is the built-in Linux kernel tracer. It can trace every single function call inside the kernel.

The Magic: Dynamic Patching

How does ftrace avoid overhead when not in use?
  1. Compile Time: The compiler inserts a call to a profiling stub (mcount or __fentry__) at the start of every traceable function.
  2. Boot Time: The kernel locates all of these call sites and overwrites each one with a 5-byte NOP instruction.
  3. Activation: When you start tracing, the kernel dynamically patches those NOPs back into CALL instructions targeting the tracer.
  • Result: Near-zero overhead when disabled (just a NOP in the pipeline) and high-precision tracing when enabled; the tracefs session below shows it in action.
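
No extra tooling is required; ftrace is driven directly through tracefs. A minimal session, run as root (older kernels mount it at /sys/kernel/debug/tracing):

cd /sys/kernel/tracing
echo function_graph > current_tracer   # trace entry/exit of kernel functions
echo 'tcp_*' > set_ftrace_filter       # restrict to the TCP subsystem
cat trace | head -50                   # read the captured call flow
echo nop > current_tracer              # stop tracing; the NOPs are restored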

4. kprobes and uprobes

Dynamic instrumentation allows you to “hook” any instruction in a running system.
  • kprobes: Hooks into kernel code.
  • uprobes: Hooks into user-space code (shared libraries, binaries).
  • How it works: The kernel replaces the target instruction with a Breakpoint (e.g., INT3 on x86). When hit, the CPU traps to the kernel, which runs your probe handler (or eBPF program), then single-steps or emulates the displaced original instruction and resumes execution.
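
bpftrace turns both probe types into one-liners. A sketch (do_sys_openat2 exists on kernels 5.6+; the libc path is an assumption for a glibc x86-64 distro):

# kprobe: count file opens per process (kernel-side hook)
sudo bpftrace -e 'kprobe:do_sys_openat2 { @opens[comm] = count(); }'

# uprobe: sum bytes requested from malloc() per process (user-space hook)
sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc { @bytes[comm] = sum(arg0); }'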

5. Flame Graphs

The industry standard for visualizing performance.
  • X-axis: Stack frames sorted alphabetically (width = % of total samples; the axis is population, not the passage of time).
  • Y-axis: Stack depth.
  • Goal: Find the “plateaus”—wide boxes that indicate where the CPU is spending the most time.
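
Generating one takes three steps with Brendan Gregg's FlameGraph scripts; this is the whole-system variant of the per-process pipeline in the playbook below (-F 99 samples at 99 Hz to avoid lockstep with timer ticks):

git clone https://github.com/brendangregg/FlameGraph
sudo perf record -F 99 -g -a -- sleep 30     # sample all CPUs for 30 seconds
sudo perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flame.svg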

6. USE Method vs. RED Method

  • USE Method (Brendan Gregg): For Resources (CPU, Memory, Disk).
    • Utilization: How busy is the resource?
    • Saturation: Is there a backlog of work?
    • Errors: Are there any hardware/software errors?
  • RED Method: For Services (Web servers, APIs).
    • Rate: Requests per second.
    • Errors: Failed requests.
    • Duration: Response time (latency).
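
Applied to a resource like the CPU, USE maps directly onto standard tools. A sketch:

# USE checklist for the CPU resource
mpstat -P ALL 1                      # Utilization: per-CPU busy %
vmstat 1                             # Saturation: run-queue ("r") above CPU count = backlog
dmesg -T | grep -i 'machine check'   # Errors: hardware machine-check events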

7. Perf + eBPF Triage Playbook

When a production system is slow, use this systematic approach:

Step 1: Identify the Bottleneck Type

# Quick health check
uptime                    # Load average (is it CPU-bound?)
free -h                   # Memory pressure?
iostat -x 1               # Disk utilization?
sar -n DEV 1              # Network saturation?

# PSI (Pressure Stall Information) - Linux 4.20+
cat /proc/pressure/cpu    # CPU stall time
cat /proc/pressure/memory # Memory stall time
cat /proc/pressure/io     # I/O stall time
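
The PSI output format is worth knowing (the numbers here are illustrative):

# some avg10=1.25 avg60=0.80 avg300=0.50 total=12345678
#   "some" = % of time at least one task was stalled on the resource
#   "full" = % of time all non-idle tasks were stalled (shown for memory and io)
#   avg10/avg60/avg300 = running averages over 10s/60s/300s; total = cumulative stall time in µs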

Step 2: CPU-Bound Triage

# 1. Identify hot processes
top -b -n 1 | head -20

# 2. Profile the hot process (sample for 30s)
sudo perf record -g -p <PID> -- sleep 30

# 3. Analyze the profile (entries are sorted by overhead per symbol by default)
perf report --stdio

# 4. Generate flame graph (stackcollapse-perf.pl / flamegraph.pl are from the FlameGraph repo)
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# 5. Check IPC (Instructions Per Cycle) - are we stalled?
perf stat -p <PID> -- sleep 10
# IPC < 1 = likely memory-bound, IPC > 2 = CPU-efficient

Step 3: Memory-Bound Triage

# 1. Check for memory pressure
vmstat 1
# Look at: si/so (swap), wa (I/O wait), free

# 2. Profile cache misses
perf stat -e cache-misses,cache-references,LLC-load-misses -p <PID> -- sleep 10

# 3. Find functions causing cache misses
perf record -e cache-misses -g -p <PID> -- sleep 10
perf report

# 4. eBPF: Track page faults
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'

Step 4: I/O-Bound Triage

# 1. Check disk latency distribution
sudo biolatency      # from bcc-tools

# 2. Identify slow I/O by file
sudo opensnoop       # See which files are being opened
sudo fileslower 10   # Files taking >10ms

# 3. eBPF: Trace read/write latency by process
sudo bpftrace -e '
  tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
  tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
  }
'

Step 5: Off-CPU Analysis

The CPU might not be the bottleneck—what is the process waiting for?
# eBPF: See why a process is sleeping
sudo offcputime -p <PID> 10    # from bcc-tools

# Visualize as a flame graph (off-cpu flame graph)
sudo offcputime -p <PID> -f 30 > out.offcpu
flamegraph.pl --color=io < out.offcpu > offcpu.svg

Quick Reference Card

Symptom               First Command             Deep Dive
High CPU              perf top                  perf record -g + flame graph
High load, low CPU    offcputime                Check I/O and locks
Memory OOM            slabtop, vmstat           perf mem record
Slow disk             biolatency                bpftrace block tracepoints
Slow network          tcpconnlat, tcpretrans    ss -i, netstat -s
Lock contention       perf lock                 lockstat, mutexsnoop

Summary for Senior Engineers

  • Don’t trust top CPU %: High CPU % could be “stalled” waiting for memory (Cache Misses). Use perf stat to check IPC (Instructions Per Cycle).
  • Use ftrace to understand the “flow” of a specific kernel subsystem (e.g., the scheduler or network stack).
  • LBR is the secret to high-fidelity profiling on production binaries built without frame pointers, with no recompilation required.
  • Off-CPU Analysis: Performance is not just about what the CPU is doing, but what it’s waiting for (I/O, locks). Use eBPF to trace “off-cpu” time.
Next: Modern I/O: io_uring & Userfaultfd