Observability & Performance Engineering
A senior engineer does not just use top. They use the hardware’s own diagnostic features to find bottlenecks at the cycle level. The difference is like diagnosing a car problem by “it sounds funny” versus plugging into the OBD-II port and reading the exact sensor data. This chapter covers the internals of kernel and hardware observability: the tools that turn vague performance complaints into precise, actionable data.
Key Topics: PMU, perf, ftrace, eBPF, Flame Graphs, USE/RED methods
Time to Master: 15-20 hours of hands-on practice
1. Hardware Performance Monitoring Units (PMU)
Every modern CPU has a Performance Monitoring Unit (PMU): a set of dedicated counter registers that count specific hardware events in hardware, with no per-event software cost. Think of the PMU as the speedometer and tachometer built into the CPU itself: once programmed, the counters tick along with execution, and in counting mode reading them is nearly free.
Common Events
- Cycles: Total CPU clock cycles. The fundamental measure of “how long.”
- Instructions: Total instructions retired (committed to architectural state). The ratio of Instructions/Cycles (IPC) is one of the most important performance metrics — IPC less than 1.0 usually means the CPU is stalling on memory.
- Cache Misses: L1/L2/LLC misses. Each LLC miss costs 100+ cycles (a trip to DRAM), so even a small miss rate can dominate runtime.
- Branch Mispredictions: Critical for pipeline performance. Each misprediction costs 15-20 cycles of pipeline flush.
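To make the IPC and miss-ratio arithmetic concrete, here is a minimal Python sketch. It assumes counter totals exported as CSV with `perf stat -x,` (count in the first field, event name in the third). The sample text and the helper names `parse_counters` and `summarize` are illustrative assumptions, not output from a real run.

```python
import csv
import io

# Hypothetical counter totals in `perf stat -x,` layout: value,unit,event,...
SAMPLE = """\
4000000000,,cycles,1000000000,100.00,,
2000000000,,instructions,1000000000,100.00,0.50,insn per cycle
50000000,,cache-references,1000000000,100.00,,
10000000,,cache-misses,1000000000,100.00,20.00,of all cache refs
"""

def parse_counters(text):
    """Return {event_name: count} from `perf stat -x,` style CSV text."""
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or not row[0].isdigit():
            continue  # skips blank rows and "<not counted>" style entries
        counts[row[2]] = int(row[0])
    return counts

def summarize(counts):
    """Derive IPC and the cache miss ratio from raw counts."""
    return {
        "ipc": counts["instructions"] / counts["cycles"],
        "cache_miss_ratio": counts["cache-misses"] / counts["cache-references"],
    }

summary = summarize(parse_counters(SAMPLE))
print(summary)
```

With these made-up numbers the IPC is 0.5 and one in five cache references misses, which is the kind of profile that points toward memory stalls rather than raw compute.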
Sampling vs. Counting
- Counting: The CPU simply increments a counter. At the end of the run, you see the total (e.g., “1 million cache misses”). Fast, lightweight, but tells you nothing about where in the code the events occurred.
- Sampling: The CPU is configured to raise an interrupt every N events (e.g., every 1000 cache misses). The kernel then records the Instruction Pointer (RIP) at each interrupt. This allows perf to create a histogram of where the misses are happening in the code. The tradeoff: sampling adds a small overhead from the interrupts and can miss very short-lived hot spots.
Start with counting (perf stat) to identify which type of bottleneck you have (CPU-bound, memory-bound, branch-bound), then switch to sampling (perf record) to find where in the code it occurs. Jumping straight to sampling is like using a microscope before you know which slide to examine.
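The difference between the two modes can be sketched with a toy simulation in Python. This is not how the PMU is programmed. It only models the idea that counting keeps a total while sampling records the instruction pointer of every N-th event. The addresses and the event stream are hypothetical.

```python
from collections import Counter

def count_events(events):
    """Counting mode: only the total survives."""
    return len(events)

def sample_events(events, period):
    """Sampling mode: record the instruction pointer of every `period`-th event."""
    histogram = Counter()
    for index, ip in enumerate(events, start=1):
        if index % period == 0:  # counter overflow, modeled as an interrupt
            histogram[ip] += 1
    return histogram

# Hypothetical miss stream: address 0x401000 causes 90% of the misses.
events = (["0x401000"] * 9 + ["0x402000"]) * 1000

total = count_events(events)
hist = sample_events(events, period=97)
print(total, hist.most_common())
```

The period of 97 is chosen on purpose: a period that divides evenly into the loop length (100 here) would land on the same address every time and give a misleading histogram.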
2. LBR (Last Branch Record)
Stack traces are expensive to capture, and many optimized binaries do not have “Frame Pointers” (RBP is used as a general-purpose register for performance — -fomit-frame-pointer is the default in most compilers).
- LBR is a hardware feature that keeps a ring buffer of the last 8-32 branches taken by the CPU. Think of it as a dashcam for the instruction stream — it continuously records the last N turns the CPU took through the code.
- Benefit: It allows perf to reconstruct the Call Graph and Hot Paths with near-zero overhead and without needing debug symbols or frame pointers. This makes it invaluable for profiling production binaries that were compiled with full optimizations.
Use perf record --call-graph lbr instead of --call-graph dwarf on production systems. LBR is hardware-assisted and adds negligible overhead, while DWARF unwinding can slow things down by 5-10% and requires debug info. On newer Intel CPUs, LBR has been superseded by Architectural LBR (Arch LBR) with deeper buffers (up to 32 entries).
3. ftrace: The Function Tracer
ftrace is the built-in Linux kernel tracer. It can trace every single function call inside the kernel. Think of it as inserting a print statement at the entry and exit of every kernel function — except it does it at runtime without recompilation.
The Magic: Dynamic Patching
How does ftrace avoid overhead when not in use? This is one of the most elegant tricks in the kernel.
- Compile Time: The compiler inserts a call to a profiling stub (mcount or __fentry__) at the start of every traceable function. At this stage it is just a placeholder.
- Boot Time: The kernel finds all these call sites (there are tens of thousands) and replaces each one with a single 5-byte NOP instruction (on x86). Now every function has a dormant hook that costs almost nothing.
- Activation: When you start tracing, the kernel dynamically patches those NOPs with CALL instructions to the tracer using live code patching (stop-machine, or text-poke-bp on newer kernels).
- Result: Effectively zero overhead when disabled, and high-precision tracing when enabled.
Use the function_graph tracer. It shows you the full call tree with timing.
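As a minimal sketch of how the function_graph tracer is switched on, the Python below builds the ordered list of tracefs writes and applies them. It assumes tracefs is mounted at /sys/kernel/tracing and that the kernel was built with the function graph tracer. The helper names (`function_graph_plan`, `teardown_plan`, `apply`) are hypothetical, and `apply` needs root.

```python
from pathlib import Path

TRACEFS = Path("/sys/kernel/tracing")  # older systems: /sys/kernel/debug/tracing

def function_graph_plan(function_name):
    """Ordered (file, value) writes that trace one kernel function's call tree."""
    return [
        ("tracing_on", "0"),                    # pause while reconfiguring
        ("set_graph_function", function_name),  # limit the graph to this entry point
        ("current_tracer", "function_graph"),   # selecting a tracer triggers the patching
        ("tracing_on", "1"),                    # start recording
    ]

def teardown_plan():
    """Writes that stop tracing and restore the dormant state."""
    return [
        ("tracing_on", "0"),
        ("current_tracer", "nop"),
        ("set_graph_function", ""),             # an empty write clears the filter
    ]

def apply(plan, root=TRACEFS):
    """Perform the writes. Requires root and a mounted tracefs."""
    for name, value in plan:
        (root / name).write_text(value + "\n")

if __name__ == "__main__":
    for name, value in function_graph_plan("vfs_read"):
        print(f"echo {value!r} > {TRACEFS / name}")
```

After applying the plan, the recorded call tree is read from the `trace` file in the same directory. Running `teardown_plan()` afterwards matters on a shared machine, because an active function_graph tracer adds overhead to every traced call.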
4. kprobes and uprobes
Dynamic instrumentation allows you to “hook” any instruction in a running system — like placing a wiretap on any function call without rebuilding the software.
- kprobes: Hooks into kernel code. You can probe any kernel function or even a specific instruction offset within a function.
- uprobes: Hooks into user-space code (shared libraries, binaries). This lets you trace application-level events (e.g., “every time malloc is called with size > 1MB”) from the kernel with eBPF.
- How it works: The kernel replaces the target instruction with a Breakpoint (e.g., INT3 on x86). When hit, the CPU traps to the kernel, executes your probe handler (or eBPF program), and then single-steps the original instruction before resuming normal execution.
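uprobes are how tools like bpftrace can trace function calls in user-space programs without modifying the application. They attach to native symbols, so compiled languages such as C, C++, Rust, and Go can be probed directly, while for managed runtimes (Java, Python) you typically probe the runtime's own native functions or its USDT probes. For example, to count HTTP requests in a Go server: bpftrace -e 'uprobe:/path/to/server:"net/http.(*ServeMux).ServeHTTP" { @requests = count(); }' (the symbol is quoted because it contains characters that bpftrace would otherwise try to parse).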
5. Flame Graphs
The industry standard for visualizing performance, invented by Brendan Gregg. A flame graph turns thousands of stack trace samples into a single interactive SVG that you can scan in seconds.
- X-axis: Alphabetical list of functions (width = % of total samples). The x-axis is not a timeline. A wide box means the function (or its children) accounts for a large fraction of the profile.
- Y-axis: Stack depth. The bottom is the entry point (main or the kernel entry), and the top is the leaf function actually running on the CPU.
- Goal: Find the “plateaus”: wide, flat boxes at the top of the graph. These are the functions where the CPU is spending the most time doing actual work (as opposed to calling deeper functions).
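The data behind a flame graph is simple. The Python sketch below collapses stack samples into the "folded" text form (`frame;frame;frame count`) that flamegraph.pl reads, and also counts how often each function was the leaf, which is what a plateau represents. The stacks and function names are made up for illustration.

```python
from collections import Counter

def fold(samples):
    """Collapse stack samples (root frame first) into folded-stack counts."""
    return Counter(";".join(stack) for stack in samples)

def to_folded_text(folded):
    """Render 'frame;frame;frame count' lines, sorted alphabetically by stack."""
    return "\n".join(f"{stack} {count}" for stack, count in sorted(folded.items()))

def self_time(folded):
    """Samples in which each function was the leaf, i.e. actually on-CPU."""
    leaves = Counter()
    for stack, count in folded.items():
        leaves[stack.rsplit(";", 1)[-1]] += count
    return leaves

# Hypothetical samples captured by a profiler.
samples = [
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "compress"],
    ["main", "handle_request", "parse_json"],
    ["main", "idle"],
]

folded = fold(samples)
print(to_folded_text(folded))
print(self_time(folded).most_common())
```

Here `handle_request` would be the widest box in the middle of the graph, but it never appears as a leaf. The plateau is `parse_json`, with three of the five samples.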
6. USE Method vs. RED Method
These are systematic frameworks that prevent you from randomly poking at metrics. Without a framework, debugging performance feels like searching a dark room by bumping into furniture.
- USE Method (Brendan Gregg): For Resources (CPU, Memory, Disk, Network interfaces). For each resource, ask three questions:
  - Utilization: How busy is the resource? (e.g., CPU at 85%)
  - Saturation: Is there a backlog of work? (e.g., CPU run queue length > number of cores)
  - Errors: Are there any hardware/software errors? (e.g., disk I/O errors, ECC corrections)
- RED Method (Tom Wilkie): For Services (Web servers, APIs, microservices). For each service, measure:
  - Rate: Requests per second.
  - Errors: Failed requests per second.
  - Duration: Response time distribution (p50, p95, p99, not just the average).
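A small Python sketch of the RED calculation shows why the duration distribution matters more than the mean. It uses a nearest-rank percentile, and the request data and the `red_metrics` helper are illustrative assumptions rather than the output of any particular monitoring system.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile (integer p in 0..100) of an already sorted list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def red_metrics(requests, window_seconds):
    """requests: list of (duration_ms, ok) tuples observed in the window."""
    durations = sorted(duration for duration, _ in requests)
    failures = sum(1 for _, ok in requests if not ok)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": failures / window_seconds,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
        "mean_ms": sum(durations) / len(durations),
    }

# Hypothetical 10-second window: 98 fast requests and 2 slow failures.
window = [(10, True)] * 98 + [(900, False)] * 2
metrics = red_metrics(window, window_seconds=10)
print(metrics)
```

In this made-up window the mean is about 28 ms and p95 is still 10 ms, yet p99 is 900 ms. Only the tail percentile reveals the two requests that users would actually complain about.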
7. Perf + eBPF Triage Playbook
When a production system is slow, use this systematic approach:
Step 1: Identify the Bottleneck Type
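As a rough sketch of this first step, the Python below turns two snapshots of the aggregate `cpu` line from /proc/stat into a user/system/iowait/idle breakdown and makes a first guess at the bottleneck type. The thresholds in `classify` and the sample tick counts are illustrative assumptions, not recommended values.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counts."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))

def breakdown(before, after):
    """Percentage of time spent in each state between two snapshots."""
    delta = {key: after[key] - before[key] for key in FIELDS}
    total = sum(delta.values()) or 1
    return {key: 100.0 * value / total for key, value in delta.items()}

def classify(pct, busy_threshold=70.0, iowait_threshold=20.0):
    """Rough first guess only; both thresholds are assumptions for illustration."""
    if pct["iowait"] >= iowait_threshold:
        return "io-bound"
    if pct["user"] + pct["nice"] + pct["system"] >= busy_threshold:
        return "cpu-bound"  # next: check IPC to separate compute from memory stalls
    return "off-cpu"        # next: look at locks, downstream calls, scheduling

# Hypothetical snapshots taken a few seconds apart.
before = parse_cpu_line("cpu  1000 0 500 8000 500 0 0 0 0 0")
after = parse_cpu_line("cpu  1200 0 600 8300 900 0 0 0 0 0")

pct = breakdown(before, after)
print(pct, classify(pct))
```

On a live system the two lines would come from reading /proc/stat twice. The classification only decides which of the following steps to start with; it does not replace them.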
Step 2: CPU-Bound Triage
Step 3: Memory-Bound Triage
Step 4: I/O-Bound Triage
Step 5: Off-CPU Analysis
The CPU might not be the bottleneck. What is the process waiting for?
Quick Reference Card
| Symptom | First Command | Deep Dive |
|---|---|---|
| High CPU | perf top | perf record -g + flame graph |
| High load, low CPU | offcputime | Check I/O and locks |
| Memory OOM | slabtop, vmstat | perf mem record |
| Slow disk | biolatency | bpftrace block tracepoints |
| Slow network | tcpconnlat, tcpretrans | ss -i, netstat -s |
| Lock contention | perf lock | lockstat, mutexsnoop |
Summary for Senior Engineers
- Do not trust top CPU %: High CPU % could be “stalled” waiting for memory (Cache Misses). A process showing 100% CPU might be doing useful work at IPC=3, or it might be thrashing the cache at IPC=0.3. Use perf stat to check IPC and distinguish between the two.
- Use ftrace to understand the “flow” of a specific kernel subsystem. It is the fastest way to learn what the scheduler, VFS, or network stack actually does without reading thousands of lines of source code.
- LBR is the secret to high-fidelity profiling on production systems with stripped binaries. It gives you call graphs with near-zero overhead and no need for frame pointers or debug info.
- Off-CPU Analysis: Performance is not just about what the CPU is doing, but what it is waiting for (I/O, locks, network). If your process is slow but CPU utilization is low, the bottleneck is off-CPU. Use eBPF offcputime to trace exactly what the process is blocking on.
- Start with the framework: USE for resources, RED for services, perf stat for the first 10 seconds of any investigation. A structured approach beats intuition-driven debugging every time.
Caveats and Common Pitfalls
Performance tools lie in subtle ways. They lie when you sample without context, when you trace at the wrong layer, when overhead changes the thing you are measuring, and when visualizations hide the data they were built on. The traps below cause more wrong conclusions in production than any other category of debugging mistake.
Interview Deep-Dive
Production p99 latency spiked from 50ms to 500ms 20 minutes ago. CPU is at 40%. Walk me through the tools you reach for, in order, and why.
- Confirm the symptom and scope. First five minutes: is this real? Check the dashboard for the spike. Is it across all instances or just some? Is it correlated with a deploy, a traffic spike, or a downstream incident? If only some instances are affected, that points at hardware/local issues. If all are affected, it is environmental or systemic.
- Reach for the USE method on each instance. uptime (load), vmstat 1 (CPU breakdown: user/sys/iowait/idle), iostat -x 1 (disk: look at await and util), sar -n DEV 1 (network), ss -tin (TCP connection states and retransmits). Most “slow” issues show up in this 30-second drill.
- Check PSI immediately. cat /proc/pressure/{cpu,memory,io}. If io full avg10 is high, you are I/O bound regardless of what iostat shows. PSI measures stall time, which is what the user actually feels.
- If CPU is 40% but latency is up, suspect off-CPU. That ratio screams “threads are waiting.” Run bpftrace -e 'kprobe:finish_task_switch { @[kstack] = count(); }' for 10 seconds to see what is causing context switches. Or offcputime -f -p <PID> 30 > out; flamegraph.pl --color=io < out > offcpu.svg for a flame graph of where threads sleep (-f emits the folded stacks that flamegraph.pl expects).
- Check downstream dependencies. tcpconnlat shows TCP connect latencies, and tcptracer shows which connections are being opened and closed. If one downstream’s p99 went from 5ms to 200ms, you found your culprit. Use the application’s own tracing (Jaeger, Zipkin, OpenTelemetry) to confirm which span is slow.
- Check for new contention. perf lock record for kernel locks. mutexsnoop (eBPF) for application mutexes. New code can introduce a global mutex that does not show up under low load but serializes under high load.
- Check for GC, JIT, or runtime pauses. Language-specific: JVM GC logs, Go runtime traces, Python’s gc module. A long stop-the-world pause does not show up in syscall traces but kills tail latency.
- Form and test a hypothesis. Never just “look at metrics until you see something weird.” State a hypothesis (“a downstream RPC is slow”), run the targeted tool that proves or disproves it, iterate. Random poking wastes hours.
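The PSI check in the list above is easy to script. The Python sketch below parses the text format of a /proc/pressure file and flags a stall. The sample values and the 10 percent threshold are assumptions chosen for illustration.

```python
def parse_psi(text):
    """Parse one /proc/pressure/<resource> file into nested dicts of floats."""
    result = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        result[kind] = {key: float(value) for key, value in (p.split("=") for p in pairs)}
    return result

def stalled(psi, kind="full", window="avg10", threshold=10.0):
    """True if tasks were stalled for more than `threshold` percent of the window."""
    return psi.get(kind, {}).get(window, 0.0) > threshold

# Hypothetical contents of /proc/pressure/io during an incident.
SAMPLE_IO = """\
some avg10=38.20 avg60=21.05 avg300=7.11 total=184729301
full avg10=31.75 avg60=17.90 avg300=5.98 total=150023876
"""

io_pressure = parse_psi(SAMPLE_IO)
print(io_pressure["full"], stalled(io_pressure))
```

Comparing avg10 with avg300 also gives a quick sense of timing: a high avg10 with a low avg300, as in this sample, suggests the stall started recently.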
safepoint_synchronize. The fix was a JVM flag change (G1 region size). Reference: Twitter’s engineering blog post on GC tuning. The lesson: on-CPU flame graphs are insufficient on their own; always generate off-CPU graphs for tail-latency investigations.
offcputime shows your threads are blocked on futex_wait for 200ms? What does that mean and how do you investigate further?” futex_wait is the kernel side of pthread mutex / condvar / semaphore waits. To find which mutex, you need user-space stack traces: enable frame pointers or use perf with DWARF. Then you can see which application-level lock is contended.
hdrhistogram style), or per-request traces. eBPF with histogram aggregation: bpftrace -e 'tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; } tracepoint:syscalls:sys_exit_read /@start[tid]/ { @us = hist((nsecs - @start[tid]) / 1000); delete(@start[tid]); }' to see the latency distribution of read() calls.
- “Restart the service and see if it helps.” Sometimes works, never teaches you anything, and if the cause is environmental (slow downstream, network issue), it does not help. Senior engineers diagnose before mitigating.
- “Just look at the application logs.” Logs are great for known issues but useless for performance pathologies. A slow request typically logs as a successful request that just took longer. You need profiling tools.
- Brendan Gregg, “Systems Performance” (2nd ed) — the bible of Linux performance analysis.
- Dean & Barroso, “The Tail at Scale” (CACM 2013) — canonical paper on tail latency.
- Netflix Tech Blog, “Linux Performance Analysis in 60,000 Milliseconds.”
A long-running daemon shows steadily increasing memory over weeks. RSS grows ~50MB per day. Find the leak.
- Confirm it is a leak, not just cache growth. Check cat /proc/<pid>/smaps_rollup for PSS over time. If anonymous memory grows but file-backed shrinks, it is application allocation. If the kernel’s slab is growing, it is a kernel-side leak (rare but real).
- Identify the allocator. Is the daemon C/C++ (malloc/jemalloc), Go (runtime allocator), Java (heap), Python (PyMalloc + reference counting)? Each has its own profiling tools. For C/C++ with glibc malloc, mtrace() with the MALLOC_TRACE env var. For jemalloc, MALLOC_CONF=prof:true. For Go, pprof heap profiles. For JVM, NMT (Native Memory Tracking) or heap dumps. For Python, the tracemalloc module.
- For C/C++, use AddressSanitizer or Valgrind in a test environment. ASan catches leaks at exit; Valgrind finds them but is 10-50x slower (not usable in production but fine for repro). LeakSanitizer is part of ASan and can report on a still-running process if you install a signal handler (for example on SIGUSR1) that calls __lsan_do_recoverable_leak_check().
- In production, capture heap profiles periodically. With jemalloc, set MALLOC_CONF=prof:true,prof_active:false,lg_prof_sample:19 and toggle profiling via mallctl. Take a heap profile at startup and another after a week of growth. Compare with jeprof --base=baseline.heap /path/to/binary current.heap, which shows allocation sites that grew, not just totals. Comparing the top sites by allocation count and by bytes tells you whether you have many small leaks or fewer large ones.
- Trace allocations with eBPF. bpftrace -e 'uprobe:/path/to/libc.so.6:malloc /pid == <PID>/ { @bytes[ustack] = sum(arg0); }' aggregates requested bytes by user-space stack (malloc lives in libc unless the binary is statically linked, and a uprobe needs ustack rather than kstack to show the caller). Run for 10 minutes, see which call paths allocate most. For long-lived allocations, also instrument free and look for stacks that allocate but never appear in free.
- Common leak patterns. Caches without bounds, event listeners that register but never deregister, pool objects that grow without compaction, error paths that allocate but skip cleanup, third-party libraries with known leaks. Look for these patterns in code review of recently-changed modules.
- If allocator is fine but RSS still grows, suspect fragmentation. glibc malloc has known fragmentation issues with workloads that allocate many sizes. Switching to jemalloc or tcmalloc often fixes “leaks” that were really fragmentation. The MALLOC_ARENA_MAX=2 env var can help with glibc multithreaded fragmentation.
/proc/slabinfo and slabtop show kernel slab cache growth. If dentry or a specific cache is growing without bound, it is a kernel issue. Use perf record -e kmem:kmalloc -g to capture allocation stacks. Often a buggy module or a userspace pattern that creates many short-lived dentries.
- “Restart the service nightly to work around the leak.” This is a workaround, not a fix. Tells the interviewer you do not know how to find the actual leak. State it as last resort after describing the diagnostic process.
- “It must be the language runtime; nothing we can do.” Even when the leak is in the runtime (rare), you can usually work around it with allocator changes, GC tuning, or process recycling. Saying “nothing we can do” is a senior-level red flag.
- jemalloc documentation, “Heap Profiling” section.
- Brendan Gregg, “Memory Leak (and Growth) Flame Graphs” blog post.
- Go pprof documentation — runtime/pprof package, very well documented.
CPU is at 100%, but request throughput dropped by 50%. What is going on?
- High CPU + low throughput is a classic ‘spinning’ or ‘cache thrashing’ pattern. The CPU is busy but not doing useful work. Three big buckets: lock contention (threads spinning on a mutex), cache thrashing (CPU waiting on memory), or pathological GC/JIT activity.
- Check IPC first. perf stat -p <PID> -- sleep 10. IPC less than 0.5 means the CPU is mostly stalled, usually on memory. IPC near 1.0 with high CPU is normal. IPC less than 0.3 is a severe stall, almost always cache misses or branch mispredictions.
- Lock contention check. perf lock record -p <PID> -- sleep 10 then perf lock report. Or bpftrace -e 'kprobe:mutex_lock { @[kstack] = count(); }'. If 60% of context switches are on a single mutex, you found it.
- Cache miss check. perf stat -e cycles,instructions,cache-misses,cache-references,LLC-load-misses -p <PID>. LLC miss rate above 10% of references is severe. perf record -e LLC-load-misses -g shows which functions cause the misses.
- GC/JIT check. Java: enable GC logging, look at pause times and frequency. Go: GODEBUG=gctrace=1 shows GC overhead. Python: GC pauses are usually short but reference cycle collection can be expensive. If GC eats 30% of CPU, throughput drops accordingly.
- Spinning/busy-loop check. perf top -p <PID>: if a tight loop in user code dominates, you have a busy-wait. Often introduced accidentally: while (!ready) {} instead of pthread_cond_wait, or a poll loop without sleep.
- Throughput/latency dependency. If CPU is 100% and throughput is down while the number of in-flight requests stays roughly the same, latency must be way up (Little’s law). Confirm with the application’s own metrics. This is the same problem as scenario 1 (the latency spike), just framed differently.
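The Little's law step can be checked with simple arithmetic. Little's law states L = λW, where L is the number of requests in flight, λ is throughput, and W is the average time in the system. The Python sketch below rearranges it to W = L / λ. The figures (64 in-flight requests, 1600 and 800 requests per second) are hypothetical.

```python
def latency_seconds(in_flight, throughput_per_s):
    """Little's law rearranged: W = L / lambda."""
    return in_flight / throughput_per_s

# Hypothetical service with a fixed pool of 64 in-flight requests.
healthy = latency_seconds(in_flight=64, throughput_per_s=1600)
degraded = latency_seconds(in_flight=64, throughput_per_s=800)

print(f"healthy:  {healthy * 1000:.0f} ms")
print(f"degraded: {degraded * 1000:.0f} ms")
```

With concurrency held constant, halving throughput doubles average latency (40 ms to 80 ms here). If measured latency rose by much more than that, the number of in-flight requests grew as well, which usually means a queue is building somewhere.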
_raw_spin_lock dominating, traced to a recently-merged feature that took a global mutex per transaction. Under high concurrency, the mutex serialized all work onto one CPU at a time, despite N cores being available. Fix: replace the global mutex with a per-shard mutex. CPU dropped to 30%, throughput tripled. Reference: pattern documented in “Lock contention in production” by various teams; specific incident details vary by company.
cpupower frequency-info and /proc/cpuinfo show current frequency. If actual frequency is well below max under high load, the CPU is throttling: thermal, power-cap, or P-state misconfiguration. turbostat (Intel) shows per-core frequency and C-states.
perf stat will show high cache misses despite high CPU. Solution: pin work to physical cores only, or disable SMT for that workload (echo 0 > /sys/devices/system/cpu/cpu<N>/online for sibling cores).
PTHREAD_PRIO_INHERIT to mitigate.
- “Add more CPUs.” If the bottleneck is a global mutex, more CPUs makes it worse (more contention). Always understand the bottleneck before scaling.
- “The CPU is at 100% so the system is fully utilized — nothing to fix.” CPU at 100% with low throughput is a smell, not a healthy state. Useful work per CPU-second is the real metric.
- Intel “Top-down Microarchitecture Analysis Method” — systematic approach to analyzing CPU stalls.
- Brendan Gregg, “CPU Flame Graphs” — foundational for understanding CPU profiling.
- “What every programmer should know about memory” by Ulrich Drepper — canonical reference on cache effects.
Interview Deep-Dive (Original)
A production web server's p99 latency spiked from 50ms to 500ms, but CPU utilization is only 30%. Walk me through your systematic investigation.
- Low CPU with high latency is the classic off-CPU problem: threads are waiting, not computing. Candidates: I/O, lock contention, sleep/futex waits, or scheduling latency.
- Step 1: Check I/O. Run iostat -x 1 for disk (look at await and %util), and ss -ti for network (retransmissions, RTT). If disk await is spiking, that is your culprit.
- Step 2: If I/O is clean, check lock contention with perf lock record or bpftrace on futex tracepoints.
- Step 3: Off-CPU analysis. offcputime -p <PID> 10 shows where threads sleep and for how long. Generate an off-CPU flame graph; the widest bars reveal the bottleneck. Common findings: waiting on database responses, downstream RPCs, or a global mutex.
- Step 4: Check scheduling latency with runqlat -p <PID>. If run queue latency is high despite low average CPU, the work is bursty.
- Most common root cause: a downstream dependency became slower, and threads spend 400ms extra waiting on network responses. The OS tools lead you to the right diagnosis.
You see an IPC of 0.3 for your application. The CPU supports 4+ IPC. What does this tell you and how do you investigate?
- IPC of 0.3 means the pipeline is almost entirely stalled. The most common cause is memory-bound execution: the CPU waits for data from L2, L3, or DRAM on nearly every instruction.
- Confirm with perf stat -e cycles,instructions,cache-misses,cache-references,LLC-load-misses. If LLC miss rate is high (more than 5% of memory references), the application is thrashing DRAM. Each LLC miss costs 60-100ns of pipeline stall.
- Identify the functions causing misses: perf record -e LLC-load-misses -g. Common culprits: traversing large data structures with poor locality (linked lists, hash tables with pointer chasing), random access across a large address space, or working set exceeding L3.
- Other possible causes: branch mispredictions (check branch-misses; if above 5%, the pipeline spends many cycles flushing), or instruction cache misses (large code footprint, common in JIT-compiled code).
- Fixes: for memory-bound code, restructure data for cache locality (AoS to SoA, cache-oblivious algorithms, software prefetch). For branch-heavy code, convert to branchless operations. For large code footprint, use profile-guided optimization (PGO).
Explain the USE method for performance analysis. Apply it to diagnose why a Kubernetes pod is being OOM-killed intermittently.
- USE (Brendan Gregg): for each resource (CPU, memory, disk, network), check Utilization (how busy), Saturation (is there a backlog), and Errors (error events).
- For the OOM-killed pod, focus on memory. Utilization: check memory.current in the cgroup. Is usage consistently near the limit? Saturation: check memory.pressure (PSI) for time spent stalled on reclaim. Errors: check memory.events for oom_kill count.
- Investigation branches: steady climb over time means memory leak (profile with jemalloc, pprof, or JVM NMT). Sudden spikes mean transient load (large query, burst of requests). If page cache dominates (check memory.stat, file vs anon), heavy file I/O is consuming the cgroup’s allowance.
- Common mistake: setting limits too close to steady-state. Memory is not constant: GC, temp allocations, and kernel buffers cause spikes. Set limits at 1.5-2x steady-state and use memory.high at 1.2x for backpressure before the hard limit.
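The memory checks above can be combined into one small report. This Python sketch assumes cgroup v2, where memory.events is a flat list of `key value` lines, and applies the 1.5-2x and 1.2x multipliers from the last bullet. The sample numbers and the `headroom_report` helper are hypothetical.

```python
MiB = 1024 * 1024

def parse_flat_keyed(text):
    """Parse 'key value' lines, the format of cgroup v2 memory.events."""
    return {key: int(value) for key, value in (line.split() for line in text.strip().splitlines())}

def headroom_report(current_bytes, limit_bytes, events, steady_state_bytes):
    """Utilization and error counts plus limits derived from steady-state usage."""
    return {
        "utilization_pct": 100.0 * current_bytes / limit_bytes,
        "oom_kills": events.get("oom_kill", 0),
        # Range of 1.5x to 2x steady state, in integer arithmetic.
        "suggested_limit_bytes": (steady_state_bytes * 3 // 2, steady_state_bytes * 2),
        # Backpressure threshold at 1.2x steady state.
        "suggested_high_bytes": steady_state_bytes * 12 // 10,
    }

# Hypothetical contents of memory.events for the pod's cgroup.
SAMPLE_EVENTS = """\
low 0
high 12
max 341
oom 3
oom_kill 3
"""

events = parse_flat_keyed(SAMPLE_EVENTS)
report = headroom_report(
    current_bytes=480 * MiB,
    limit_bytes=512 * MiB,
    events=events,
    steady_state_bytes=400 * MiB,
)
print(report)
```

In this example the pod sits at about 94% of a 512 MiB limit with a 400 MiB steady state, so ordinary spikes are enough to trigger the three recorded OOM kills. The suggested range of 600 to 800 MiB leaves room for them.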