Observability & Performance Engineering

A senior engineer does not just use top. They use the hardware’s own diagnostic features to find bottlenecks at the cycle level. The difference is like diagnosing a car problem by “it sounds funny” versus plugging into the OBD-II port and reading the exact sensor data. This chapter covers the internals of kernel and hardware observability — the tools that turn vague performance complaints into precise, actionable data.
Interview Frequency: High for systems, SRE, and performance roles
Key Topics: PMU, perf, ftrace, eBPF, Flame Graphs, USE/RED methods
Time to Master: 15-20 hours of hands-on practice

1. Hardware Performance Monitoring Units (PMU)

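Every modern CPU includes a Performance Monitoring Unit (PMU): a small set of dedicated hardware counters that count specific events while the program runs at full speed. Think of the PMU as the speedometer and tachometer built into the CPU itself. Once programmed, the counters increment in hardware with negligible overhead. The costs that remain are reading them and, because only a handful of counters exist per core, time-multiplexing when you request more events than the hardware can count at once.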

Common Events

  • Cycles: Total CPU clock cycles. The fundamental measure of “how long.”
  • Instructions: Total instructions retired (committed to architectural state). The ratio of Instructions/Cycles (IPC) is one of the most important performance metrics — IPC less than 1.0 usually means the CPU is stalling on memory.
  • Cache Misses: L1/L2/LLC misses. Each LLC miss costs 100+ cycles (a trip to DRAM), so even a small miss rate can dominate runtime.
  • Branch Mispredictions: Critical for pipeline performance. Each misprediction costs 15-20 cycles of pipeline flush.
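
As a small worked example of the ratios above, the sketch below derives IPC and a cache miss rate from raw counter totals. The counter values are hypothetical (not measured on any real system), and the 1.0 threshold is simply the rule of thumb quoted in the list.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per clock cycle."""
    return instructions / cycles


def miss_rate(misses: int, references: int) -> float:
    """Fraction of cache references that missed."""
    return misses / references


# Hypothetical totals, shaped like what a counting run would report.
counters = {
    "cycles": 12_000_000_000,
    "instructions": 4_800_000_000,
    "cache-references": 900_000_000,
    "cache-misses": 135_000_000,
}

workload_ipc = ipc(counters["instructions"], counters["cycles"])
workload_miss_rate = miss_rate(counters["cache-misses"], counters["cache-references"])

print(f"IPC = {workload_ipc:.2f}")
print(f"cache miss rate = {workload_miss_rate:.1%}")
if workload_ipc < 1.0:
    # Rule of thumb from the text: low IPC usually points at memory stalls.
    print("IPC below 1.0: likely stalling, look at memory behaviour next")
```

The point of the arithmetic is that neither raw number means much alone; the ratio is what tells you which direction to investigate.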

Sampling vs. Counting

  • Counting: The CPU simply increments a counter. At the end of the run, you see the total (e.g., “1 million cache misses”). Fast, lightweight, but tells you nothing about where in the code the events occurred.
  • Sampling: The CPU is configured to raise an interrupt every N events (e.g., every 1000 cache misses). The kernel then records the Instruction Pointer (RIP) at each interrupt. This allows perf to create a Histogram of where the misses are happening in the code. The tradeoff: sampling adds a small overhead from the interrupts and can miss very short-lived hot spots.
Practical tip: Always start with counting (perf stat) to identify which type of bottleneck you have (CPU-bound, memory-bound, branch-bound), then switch to sampling (perf record) to find where in the code. Jumping straight to sampling is like using a microscope before you know which slide to examine.
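
The difference between the two modes can be sketched with a toy event stream. This is a simulation only: the function names and event counts are hypothetical, and real hardware records an instruction pointer rather than a function name.

```python
from collections import Counter

# Hypothetical event stream: each entry names the function that was
# executing when one cache miss occurred.
events = ["parse"] * 7000 + ["hash_lookup"] * 2500 + ["log"] * 500

# Counting mode: one total, no attribution.
total_misses = len(events)

# Sampling mode: record the location of every Nth event only.
SAMPLE_PERIOD = 100
samples = Counter(
    events[i] for i in range(SAMPLE_PERIOD - 1, len(events), SAMPLE_PERIOD)
)

print("counting:", total_misses, "misses in total")
for func, n in samples.most_common():
    share = n / sum(samples.values())
    print(f"sampling: {func:12s} {share:.0%}")
```

With only 100 recorded samples out of 10,000 events, the histogram still recovers the proportions of the full stream, which is why sampling is affordable.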

2. LBR (Last Branch Record)

Stack traces are expensive to capture, and many optimized binaries do not keep frame pointers: RBP is freed up as a general-purpose register, and GCC and Clang typically omit frame pointers by default on x86-64 when optimizing (-fomit-frame-pointer).
  • LBR is a hardware feature that keeps a ring buffer of the last 8-32 branches taken by the CPU. Think of it as a dashcam for the instruction stream — it continuously records the last N turns the CPU took through the code.
  • Benefit: It allows perf to reconstruct the Call Graph and Hot Paths with near-zero overhead and without needing debug symbols or frame pointers. This makes it invaluable for profiling production binaries that were compiled with full optimizations.
Practical tip: On Intel CPUs that support it, consider perf record --call-graph lbr instead of --call-graph dwarf on production systems. LBR is hardware-assisted and adds negligible overhead, while DWARF unwinding copies a chunk of the stack on every sample, can slow things down by 5-10%, and requires debug info. On newer Intel CPUs, the model-specific LBR interface has been superseded by Architectural LBR (Arch LBR), which standardizes the registers and lets software discover the supported buffer depth through CPUID.
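
The ring-buffer behaviour can be pictured with a toy model. This is a sketch of the data structure only, not of the hardware interface: the class name, addresses, and the depth of 32 are illustrative assumptions.

```python
from collections import deque

LBR_DEPTH = 32  # assumed depth; the real value depends on the CPU model


class BranchRecorder:
    """Toy model of a last-branch ring buffer: keeps only the newest entries."""

    def __init__(self, depth: int = LBR_DEPTH) -> None:
        self.entries = deque(maxlen=depth)

    def record(self, from_addr: int, to_addr: int) -> None:
        # When the buffer is full, appending silently drops the oldest entry.
        self.entries.append((from_addr, to_addr))

    def snapshot(self) -> list:
        """What a profiler would read at sample time, oldest branch first."""
        return list(self.entries)


lbr = BranchRecorder()
for i in range(100):  # 100 hypothetical taken branches
    lbr.record(0x1000 + i, 0x2000 + i)

trace = lbr.snapshot()
print(len(trace), hex(trace[0][0]), hex(trace[-1][0]))
```

Recording costs nothing extra per branch in this model, and the profiler only pays when it takes a snapshot. The fixed depth is also why very deep call stacks come back truncated.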

3. ftrace: The Function Tracer

ftrace is the built-in Linux kernel tracer. It can trace every single function call inside the kernel. Think of it as inserting a print statement at the entry and exit of every kernel function — except it does it at runtime without recompilation.

The Magic: Dynamic Patching

How does ftrace avoid overhead when not in use? This is one of the most elegant tricks in the kernel.
  1. Compile Time: The kernel is built with -pg (plus -mfentry on x86), so the compiler inserts a call to mcount or __fentry__ at the start of every traceable function. At this stage it is a real call, not yet a no-op.
  2. Boot Time: The kernel finds all these call sites (there are tens of thousands) and overwrites each one with a 5-byte NOP instruction. Now every function has a dormant hook that costs almost nothing.
  3. Activation: When you start tracing, the kernel patches the selected NOPs back into CALL instructions that target the tracer. Older kernels did this under stop_machine; newer x86 kernels use breakpoint-based live patching (text_poke_bp).
  • Result: Near-zero overhead when disabled, and high-precision tracing when enabled.
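
The byte-level effect of the patching steps can be sketched in user space. This is a toy model on a bytearray, not kernel code: the addresses and helper names are hypothetical, while the encodings (the 5-byte x86-64 NOP and CALL rel32 with opcode 0xE8) are the standard ones.

```python
import struct

NOP5 = bytes([0x0F, 0x1F, 0x44, 0x00, 0x00])  # x86-64 5-byte NOP
CALL_OPCODE = 0xE8                             # CALL rel32


def make_call(site_addr: int, target_addr: int) -> bytes:
    """Encode CALL rel32; the displacement is relative to the next instruction."""
    rel32 = target_addr - (site_addr + 5)
    return bytes([CALL_OPCODE]) + struct.pack("<i", rel32)


# Hypothetical addresses for one function entry and the tracer trampoline.
text_base = 0x1000
tracer_addr = 0x9000
# Dormant hook followed by a stand-in function prologue.
text = bytearray(NOP5 + b"\x55\x48\x89\xe5")


def enable_tracing(code: bytearray, base: int, target: int) -> None:
    assert bytes(code[:5]) == NOP5, "site must hold the dormant NOP"
    code[:5] = make_call(base, target)


def disable_tracing(code: bytearray) -> None:
    code[:5] = NOP5


enable_tracing(text, text_base, tracer_addr)
print("enabled: ", text[:5].hex())
disable_tracing(text)
print("disabled:", text[:5].hex())
```

Because the NOP and the CALL are both exactly five bytes, the hook can be switched on and off without moving any other instruction. The real kernel additionally has to make the swap safe while other CPUs may be executing those bytes, which this sketch ignores.
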
Practical tip: The fastest way to understand a kernel code path you have never read is to use ftrace’s function_graph tracer. It shows you the full call tree with timing:
# Trace only the block I/O subsystem for 5 seconds
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 'blk_*' > /sys/kernel/debug/tracing/set_ftrace_filter
echo 1 > /sys/kernel/debug/tracing/tracing_on
sleep 5
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace

4. kprobes and uprobes

Dynamic instrumentation allows you to “hook” any instruction in a running system — like placing a wiretap on any function call without rebuilding the software.
  • kprobes: Hooks into kernel code. You can probe any kernel function or even a specific instruction offset within a function.
  • uprobes: Hooks into user-space code (shared libraries, binaries). This lets you trace application-level events (e.g., “every time malloc is called with size > 1MB”) from the kernel with eBPF.
  • How it works: The kernel replaces the target instruction with a Breakpoint (e.g., INT3 on x86). When hit, the CPU traps to the kernel, executes your probe handler (or eBPF program), and then single-steps the original instruction before resuming normal execution.
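Practical tip: uprobes are how tools like bpftrace can trace functions in compiled binaries and shared libraries without modifying the application. Natively compiled code such as Go or C exposes symbols you can probe directly; for JIT-compiled or interpreted runtimes (Java, Python) you usually probe the runtime's own native functions or its USDT probes instead. For example, to count HTTP requests in a Go server (the symbol is quoted because it contains special characters): bpftrace -e 'uprobe:/path/to/server:"net/http.(*ServeMux).ServeHTTP" { @requests = count(); }'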

5. Flame Graphs

The industry standard for visualizing performance, invented by Brendan Gregg. A flame graph turns thousands of stack trace samples into a single interactive SVG that you can scan in seconds.
  • X-axis: The sample population, with sibling frames sorted alphabetically so that identical stacks merge. It is not a timeline. Width = % of total samples: a wide box means the function (or its children) accounts for a large fraction of the profile.
  • Y-axis: Stack depth. The bottom is the entry point (main or the kernel entry), and the top is the leaf function actually running on the CPU.
  • Goal: Find the “plateaus” — wide, flat boxes at the top of the graph. These are the functions where the CPU is spending the most time doing actual work (as opposed to calling deeper functions).
Practical tip: Flame graphs are most useful when you do not have a hypothesis. When someone says “the service is slow,” generate a flame graph first. The wide plateaus will immediately tell you whether the problem is in your code, a library, the kernel, or garbage collection. Do not skip this step and jump into code review.
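
How stack samples turn into box widths can be sketched in a few lines. The stacks below are hypothetical, and the width calculation is a simplification that assumes no recursion; it is not the actual flamegraph.pl implementation.

```python
from collections import Counter

# Hypothetical sampled stacks, root first, leaf last.
samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request", "parse_json"),
    ("main", "gc"),
]

# Step 1: collapse identical stacks into "a;b;c count" lines (folded format).
folded = Counter(";".join(stack) for stack in samples)
for line, count in sorted(folded.items()):
    print(line, count)

# Step 2: a frame's width is the share of samples whose stack contains it;
# its plateau (self time) is the share of samples where it is the leaf.
total = len(samples)
inclusive = Counter(frame for stack in samples for frame in set(stack))
leaf = Counter(stack[-1] for stack in samples)

for frame in ("main", "handle_request", "parse_json"):
    print(f"{frame:15s} width={inclusive[frame] / total:.0%} "
          f"self={leaf[frame] / total:.0%}")
```

Here handle_request is wide but has no self time, so it is a caller, not a culprit. parse_json is the plateau: it is where the samples actually land.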

6. USE Method vs. RED Method

These are systematic frameworks that prevent you from randomly poking at metrics. Without a framework, debugging performance feels like searching a dark room by bumping into furniture.
  • USE Method (Brendan Gregg): For Resources (CPU, Memory, Disk, Network interfaces). For each resource, ask three questions:
    • Utilization: How busy is the resource? (e.g., CPU at 85%)
    • Saturation: Is there a backlog of work? (e.g., CPU run queue length > number of cores)
    • Errors: Are there any hardware/software errors? (e.g., disk I/O errors, ECC corrections)
  • RED Method (Tom Wilkie): For Services (Web servers, APIs, microservices). For each service, measure:
    • Rate: Requests per second.
    • Errors: Failed requests per second.
    • Duration: Response time distribution (p50, p95, p99 — not just the average).
A senior engineer would say: “USE for infrastructure, RED for applications. Walk the USE checklist across every resource before blaming application code, because many ‘slow app’ tickets turn out to be saturated disks or memory pressure.”
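
To make the RED metrics concrete, the sketch below computes them from a small request log. The log is made up, the 10-second window is an assumption, and the percentile uses the simple nearest-rank definition.

```python
# Hypothetical request log over a 10-second window: (duration_ms, succeeded).
WINDOW_SECONDS = 10
requests = (
    [(12, True)] * 90 + [(40, True)] * 6 + [(250, False)] * 3 + [(900, True)]
)


def percentile(sorted_values: list, p: int) -> int:
    """Nearest-rank percentile using integer arithmetic (p in 1..100)."""
    rank = (p * len(sorted_values) + 99) // 100  # ceil(p * n / 100)
    return sorted_values[max(rank, 1) - 1]


durations = sorted(d for d, _ in requests)

rate = len(requests) / WINDOW_SECONDS                                   # Rate
error_rate = sum(1 for _, ok in requests if not ok) / WINDOW_SECONDS    # Errors
p50, p95, p99 = (percentile(durations, p) for p in (50, 95, 99))        # Duration
mean = sum(durations) / len(durations)

print(f"rate={rate}/s errors={error_rate}/s")
print(f"p50={p50}ms p95={p95}ms p99={p99}ms mean={mean}ms")
```

The mean here sits well above the median yet far below p99, which is why the Duration bullet asks for a distribution and not a single average.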

7. Perf + eBPF Triage Playbook

When a production system is slow, use this systematic approach:

Step 1: Identify the Bottleneck Type

# Quick health check -- run these FIRST, before reaching for advanced tools.
# This 30-second drill identifies the bottleneck type in most cases.
uptime                    # Load average: if >> num_cpus, system is overloaded
free -h                   # Memory: low "available" = memory pressure
iostat -x 1 5             # Disk: %util near 100% = I/O bottleneck (5 one-second samples, then exit)
sar -n DEV 1 5            # Network: check rxkB/s, txkB/s against NIC capacity

# PSI (Pressure Stall Information) - Linux 4.20+
# PSI tells you what % of time tasks are STALLED waiting for a resource.
# This is more actionable than utilization alone.
cat /proc/pressure/cpu    # "some" > 0 means some tasks waited for CPU
cat /proc/pressure/memory # "full" > 0 means ALL tasks stalled on memory
cat /proc/pressure/io     # "full" > 0 means ALL tasks stalled on I/O
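
For scripting around PSI, a pressure file can be parsed with a few lines of Python. This is a minimal sketch: the line layout follows the kernel's documented format (avg10/avg60/avg300 percentages plus a cumulative total in microseconds), but the sample numbers are made up and parse_psi is a hypothetical helper name.

```python
def parse_psi(text: str) -> dict:
    """Parse the contents of a /proc/pressure/<resource> file."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()          # kind is "some" or "full"
        values = dict(field.split("=") for field in fields)
        result[kind] = {
            "avg10": float(values["avg10"]),
            "avg60": float(values["avg60"]),
            "avg300": float(values["avg300"]),
            "total_us": int(values["total"]),
        }
    return result


# Sample text in the documented layout; the numbers are illustrative only.
sample = """\
some avg10=4.25 avg60=1.10 avg300=0.35 total=18234567
full avg10=1.50 avg60=0.40 avg300=0.10 total=5123456
"""

psi = parse_psi(sample)
print("some avg10:", psi["some"]["avg10"])
print("full avg10:", psi["full"]["avg10"])
```

On a real system the input would come from reading a file such as /proc/pressure/io instead of the sample string.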

Step 2: CPU-Bound Triage

# 1. Identify which process is consuming the most CPU
top -b -n 1 | head -20

# 2. Profile it -- sample stacks for 30 seconds (perf samples by frequency; the default is about 4000 Hz on recent versions)
sudo perf record -g -p <PID> -- sleep 30

# 3. See which functions are hottest (text-based report)
perf report --stdio --sort=overhead,symbol

# 4. Generate a flame graph -- the single most useful visualization
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# 5. Check IPC to determine if CPU is COMPUTING or STALLING
perf stat -p <PID> -- sleep 10
# IPC < 1.0 = CPU is mostly waiting for memory (cache misses, TLB misses)
# IPC 1.0-2.0 = typical application workload
# IPC > 2.0 = CPU-efficient, instruction-level parallelism is high
# This single number changes your entire debugging direction.

Step 3: Memory-Bound Triage

# 1. Check for memory pressure
vmstat 1
# Look at: si/so (swap), wa (I/O wait), free

# 2. Profile cache misses
perf stat -e cache-misses,cache-references,LLC-load-misses -p <PID> -- sleep 10

# 3. Find functions causing cache misses
perf record -e cache-misses -g -p <PID> -- sleep 10
perf report

# 4. eBPF: Track page faults
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'

Step 4: I/O-Bound Triage

# 1. Check disk latency distribution
sudo biolatency      # from bcc-tools

# 2. Identify slow I/O by file
sudo opensnoop       # See which files are being opened
sudo fileslower 10   # Files taking >10ms

# 3. eBPF: Trace read() latency by process (repeat the probe pair with sys_enter_write/sys_exit_write for writes)
sudo bpftrace -e '
  tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
  tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
  }
'
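
The histogram printed by the one-liner above groups values into power-of-two ranges. The sketch below shows that bucketing idea on hypothetical latencies; it is similar in spirit to hist(), not a reimplementation of bpftrace's exact output.

```python
from collections import Counter


def log2_bucket(value: int) -> tuple:
    """Return the [low, high) power-of-two range containing a non-negative value."""
    if value == 0:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, low * 2)


# Hypothetical read() latencies in microseconds.
latencies_us = [3, 5, 6, 7, 9, 12, 14, 15, 130, 4100]

hist = Counter(log2_bucket(v) for v in latencies_us)
for (low, high), count in sorted(hist.items()):
    print(f"[{low}, {high})".ljust(14), "@" * count)
```

Power-of-two buckets keep the table short while still exposing outliers: the single 4100 µs read gets its own row instead of vanishing into an average.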

Step 5: Off-CPU Analysis

The CPU might not be the bottleneck—what is the process waiting for?
# eBPF: See why a process is sleeping
sudo offcputime -p <PID> 10    # from bcc-tools

# Visualize as a flame graph (off-cpu flame graph)
sudo offcputime -p <PID> -f 30 > out.offcpu
flamegraph.pl --color=io < out.offcpu > offcpu.svg

Quick Reference Card

Symptom | First Command | Deep Dive
High CPU | perf top | perf record -g + flame graph
High load, low CPU | offcputime | Check I/O and locks
Memory OOM | slabtop, vmstat | perf mem record
Slow disk | biolatency | bpftrace block tracepoints
Slow network | tcpconnlat, tcpretrans | ss -i, netstat -s
Lock contention | perf lock | lockstat, mutexsnoop

Summary for Senior Engineers

  • Do not trust top CPU %: High CPU % could be “stalled” waiting for memory (Cache Misses). A process showing 100% CPU might be doing useful work at IPC=3, or it might be thrashing the cache at IPC=0.3. Use perf stat to check IPC and distinguish between the two.
  • Use ftrace to understand the “flow” of a specific kernel subsystem. It is the fastest way to learn what the scheduler, VFS, or network stack actually does without reading thousands of lines of source code.
  • LBR is the secret to high-fidelity profiling on production systems with stripped binaries. It gives you call graphs with near-zero overhead and no need for frame pointers or debug info.
  • Off-CPU Analysis: Performance is not just about what the CPU is doing, but what it is waiting for (I/O, locks, network). If your process is slow but CPU utilization is low, the bottleneck is off-CPU. Use eBPF offcputime to trace exactly what the process is blocking on.
  • Start with the framework: USE for resources, RED for services, perf stat for the first 10 seconds of any investigation. A structured approach beats intuition-driven debugging every time.

Caveats and Common Pitfalls

Performance tools lie in subtle ways. They lie when you sample without context, when you trace at the wrong layer, when overhead changes the thing you are measuring, and when visualizations hide the data they were built on. The traps below cause more wrong conclusions in production than any other category of debugging mistake.
Performance investigation traps that produce confident wrong answers:
  1. perf record without -g loses callstacks. A flame graph generated from a profile without callstacks shows only the leaf functions — it looks like a thin, fragmented graph with no clear hot paths. Engineers see this and conclude “the workload is spread out” when really they have just lost the parent call context. Always use -g (or --call-graph dwarf / --call-graph lbr) for any profile you intend to analyze. The overhead is real but tolerable for diagnosis.
  2. ftrace overhead in production can collapse the system you are trying to debug. ftrace has near-zero idle overhead, but enabling high-frequency tracepoints (every syscall, every context switch, every block I/O) on a busy system can add 5-30% CPU overhead, and the trace ring buffer can fill in seconds, dropping events. Worse, on workloads with hot syscall paths (high QPS web servers), function_graph tracing of the syscall path adds work to every traced function entry and exit and can turn a healthy system into a slow one. Always start with the smallest filter, and prefer eBPF for production tracing: it can filter and aggregate in the kernel, so only summaries cross into user space.
  3. blktrace and iostat measure different layers and tell different stories. iostat shows aggregate per-device metrics from the block layer (average latency, %util, queue depth). Its await covers a request's whole life from being queued to completion, so it includes time spent in the I/O scheduler as well as in the device. blktrace shows individual I/O events at multiple stages (queued, dispatched, completed). A high iostat await could mean the disk is slow or that the scheduler is queuing; only blktrace distinguishes them. NVMe with the none scheduler is mostly straight-through; with mq-deadline or kyber the scheduler can dominate latency under contention.
  4. Flame graphs lie if sampling is biased. If the sampling rate matches a periodic component of the workload (for example, sampling at 100 Hz while a timer tick or audio buffer refill fires at 100 Hz), every sample lands at the same phase of the cycle and the profile shows false hot paths. This is why perf record -F 99 is the conventional choice: 99 Hz drifts against 100 Hz activity instead of running in lockstep with it. Worse, perf record only samples on-CPU work: if your bottleneck is off-CPU (waiting on locks, I/O, network), it will not appear in a CPU flame graph at all. Always pair on-CPU and off-CPU flame graphs, and vary the sampling frequency (97, 99, 101 Hz) if you suspect aliasing.
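
The effect of sampling in lockstep with a periodic workload can be shown with a toy simulation. The workload is hypothetical: a handler that runs for 1 ms at the start of every 10 ms period, so its true share of CPU time is 10%.

```python
def on_cpu_function(t_us: int) -> str:
    """Hypothetical workload: a 1 ms timer handler every 10 ms (100 Hz)."""
    return "timer_handler" if t_us % 10_000 < 1_000 else "real_work"


def sampled_share(freq_hz: int, seconds: int = 10) -> float:
    """Fraction of samples that land in timer_handler at a given sampling rate."""
    n = freq_hz * seconds
    hits = sum(
        on_cpu_function(k * 1_000_000 // freq_hz) == "timer_handler"
        for k in range(n)
    )
    return hits / n


for hz in (100, 99, 97, 101):
    print(f"{hz} Hz: timer_handler = {sampled_share(hz):.1%}")
```

At exactly 100 Hz every sample hits the same phase of the period, so the handler appears to consume all of the CPU. At the slightly offset rates the samples sweep across the period and the estimate settles close to the true 10%.
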
Solutions and patterns:
  • Default to LBR call stacks for production profiling on Intel hardware that supports them. perf record --call-graph lbr uses Last Branch Record, a hardware feature with near-zero overhead and no need for frame pointers or DWARF unwinding. The catches: LBR holds at most 32 entries on most parts, so very deep call stacks are truncated, and it is often not exposed inside virtual machines. For typical workloads on bare metal, this is fine.
  • Use eBPF (bpftrace, bcc) for production observability. eBPF programs run in the kernel only after a verifier has checked them for safety, so a malformed program is rejected at load time instead of crashing the kernel. Overhead is typically low (often under 1%) when events are infrequent or aggregated in the kernel, but it scales with event rate, so be careful with probes on very hot paths. The bcc tools (offcputime, biolatency, runqlat, tcpconnlat) are battle-tested in production at Netflix and Meta.
  • Use the right tool for the right layer. For application-level: strace -c (syscall counts), perf stat (PMU counters), language-specific profilers. For kernel: ftrace (function tracing), perf record (sampling), eBPF (production tracing). For storage: biolatency, blktrace, nvme list and smartctl for hardware. For network: tcpdump, ss -i, tcptracer. Mismatching layers is the most common mistake.
  • Generate both on-CPU and off-CPU flame graphs. On-CPU shows where you spend CPU; off-CPU shows where you wait. A service at 30% CPU with high p99 latency is almost certainly off-CPU bound — generate the off-CPU graph first or you will waste hours looking in the wrong place.
  • When in doubt, validate the tool against ground truth. If iostat reports 95% util but dd if=/dev/sda of=/dev/null reads at full bandwidth, iostat is misleading you (often a queue-depth artifact on NVMe). Tools are heuristics; ground-truth benchmarks reveal when they are wrong.

Interview Deep-Dive

Strong Answer Framework:
  1. Confirm the symptom and scope. First five minutes: is this real? Check the dashboard for the spike. Is it across all instances or just some? Is it correlated with a deploy, a traffic spike, or a downstream incident? If only some instances are affected, that points at hardware/local issues. If all are affected, it is environmental or systemic.
  2. Reach for the USE method on each instance. uptime (load), vmstat 1 (CPU breakdown — user/sys/iowait/idle), iostat -x 1 (disk — look at await and util), sar -n DEV 1 (network), ss -tin (TCP connection states and retransmits). 90% of “slow” issues show up in this 30-second drill.
  3. Check PSI immediately. cat /proc/pressure/{cpu,memory,io}. If io full avg10 is high, you are I/O bound regardless of what iostat shows. PSI measures stall time, which is what the user actually feels.
  4. If CPU is 40% but latency is up, suspect off-CPU. That ratio screams “threads are waiting.” Run bpftrace -e 'kprobe:finish_task_switch { @[kstack] = count(); }' for 10 seconds to see what is causing context switches. Or offcputime -p <PID> 30 > out; flamegraph.pl --color=io < out > offcpu.svg for a flame graph of where threads sleep.
  5. Check downstream dependencies. tcpconnlat shows TCP connect latencies, and tcptracer shows which connections are being opened and closed. If one downstream’s p99 went from 5ms to 200ms, you found your culprit. Use the application’s own tracing (Jaeger, Zipkin, OpenTelemetry) to confirm which span is slow.
  6. Check for new contention. perf lock record for kernel locks. mutexsnoop (eBPF) for application mutexes. New code can introduce a global mutex that does not show up under low load but serializes under high load.
  7. Check for GC, JIT, or runtime pauses. Language-specific: JVM GC logs, Go runtime traces, Python’s gc module. A long stop-the-world pause does not show up in syscall traces but kills tail latency.
  8. Form and test a hypothesis. Never just “look at metrics until you see something weird.” State a hypothesis (“a downstream RPC is slow”), run the targeted tool that proves or disproves it, iterate. Random poking wastes hours.
Real-World Example: In 2016, Twitter had a famous incident where p99 latency spiked due to a JVM GC change. The on-CPU flame graph showed nothing unusual; the off-CPU flame graph showed long pauses in safepoint_synchronize. The fix was a JVM flag change (G1 region size). Reference: Twitter’s engineering blog post on GC tuning. The lesson: on-CPU flame graphs are insufficient on their own — always generate off-CPU graphs for tail-latency investigations.
Senior follow-up 1: “What if offcputime shows your threads are blocked on futex_wait for 200ms? What does that mean and how do you investigate further?” futex_wait is the kernel side of pthread mutex / condvar / semaphore waits. To find which mutex, you need user-space stack traces — enable frame pointers or use perf with DWARF. Then you can see which application-level lock is contended.
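Senior follow-up 2: “How would you investigate if the latency is bursty: not a constant 500ms, but spikes every few seconds?” Bursty means averages hide it, so look at percentiles per second (HdrHistogram style) or at per-request traces. With eBPF, time each call between entry and exit and aggregate into a histogram: bpftrace -e 'tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; } tracepoint:syscalls:sys_exit_read /@start[tid]/ { @us = hist((nsecs - @start[tid]) / 1000); delete(@start[tid]); }'. Note that bpftrace's elapsed builtin is the time since bpftrace started, not per-call latency, so it cannot be used for this directly.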
Senior follow-up 3: “What is the difference between latency and tail latency, and why does tail latency dominate user experience in distributed systems?” Median latency is the typical case; tail latency (p99, p99.9) describes the slowest 1% or 0.1% of requests. In a fan-out architecture (one user request triggers 100 backend calls), the user-facing latency is roughly that of the slowest of those 100 calls. If each call independently has a 1% chance of being slow, 1 - 0.99^100 ≈ 63% of user requests hit at least one slow call. This is “The Tail at Scale” (Dean & Barroso, 2013), the canonical paper on this phenomenon.
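
The fan-out arithmetic behind the 63% figure is short enough to check directly. The sketch assumes each backend call is slow independently with the same probability, which is a simplification of real systems.

```python
def p_any_slow(fanout: int, p_slow: float) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1 - (1 - p_slow) ** fanout


for fanout in (1, 10, 100):
    share = p_any_slow(fanout, 0.01)
    print(f"fan-out {fanout:3d}: {share:.1%} of requests see a slow call")
```

The same per-backend tail that is invisible at fan-out 1 becomes the common case at fan-out 100, which is why p99 of the backends matters more than their median.
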
Common Wrong Answers:
  1. “Restart the service and see if it helps.” Sometimes works, never teaches you anything, and if the cause is environmental (slow downstream, network issue), it does not help. Senior engineers diagnose before mitigating.
  2. “Just look at the application logs.” Logs are great for known issues but useless for performance pathologies. A slow request typically logs as a successful request that just took longer. You need profiling tools.
Further Reading:
  • Brendan Gregg, “Systems Performance” (2nd ed) — the bible of Linux performance analysis.
  • Dean & Barroso, “The Tail at Scale” (CACM 2013) — canonical paper on tail latency.
  • Netflix Tech Blog, “Linux Performance Analysis in 60,000 Milliseconds.”
Strong Answer Framework:
  1. Confirm it is a leak, not just cache growth. Check cat /proc/<pid>/smaps_rollup for PSS over time. If anonymous memory grows but file-backed shrinks, it is application allocation. If the kernel’s slab is growing, it is a kernel-side leak (rare but real).
  2. Identify the allocator. Is the daemon C/C++ (malloc/jemalloc), Go (runtime allocator), Java (heap), Python (PyMalloc + reference counting)? Each has its own profiling tools. For C/C++ with glibc malloc, MALLOC_TRACE env var or mtrace. For jemalloc, MALLOC_CONF=prof:true. For Go, pprof heap profiles. For JVM, NMT (Native Memory Tracking) or heap dumps. For Python, tracemalloc module.
  3. For C/C++, use AddressSanitizer or Valgrind in a test environment. ASan's built-in LeakSanitizer reports leaks at exit; Valgrind finds them too but is 10-50x slower (not usable in production but fine for repro). For a long-running process that never exits, LeakSanitizer can also report mid-run if the program calls __lsan_do_recoverable_leak_check(), for example from a debug hook you add yourself.
  4. In production, capture heap profiles periodically. With jemalloc (built with profiling support), set MALLOC_CONF=prof:true,prof_active:false,lg_prof_sample:19 and toggle profiling via mallctl. Take a heap profile at startup and another after a week of growth. Compare them with jeprof --base=baseline.heap /path/to/binary current.heap, which shows the allocation sites that grew, not just the total. Comparing the top sites by allocation count against the top sites by bytes tells you whether you have many small leaks or a few large ones.
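  5. Trace allocations with eBPF. bpftrace -p <PID> -e 'uprobe:/path/to/libc.so.6:malloc { @bytes[ustack] = sum(arg0); }' aggregates requested bytes by user-space stack. Use ustack, not kstack, because the interesting frames are in the application; malloc lives in libc unless the binary is statically linked, and the libc path varies by distribution. Run for 10 minutes, see which call paths allocate most. For long-lived allocations, also instrument free and look for stacks that allocate but never appear in free.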
  6. Common leak patterns. Caches without bounds, event listeners that register but never deregister, pool objects that grow without compaction, error paths that allocate but skip cleanup, third-party libraries with known leaks. Look for these patterns in code review of recently-changed modules.
  7. If allocator is fine but RSS still grows, suspect fragmentation. glibc malloc has known fragmentation issues with workloads that allocate many sizes. Switching to jemalloc or tcmalloc often fixes “leaks” that were really fragmentation. MALLOC_ARENA_MAX=2 env var can help with glibc multithreaded fragmentation.
Real-World Example: In 2014, Facebook engineers found a slow leak in HHVM’s JIT cache that took several days to manifest as RSS growth. They used jemalloc’s heap profiling to find the leak was in opcode metadata not being freed when functions were redefined. Reference: Facebook engineering blog “Profiling Memory Allocations” (around 2014). The lesson: long-running services need built-in observability for memory growth, not just OOM reaction.
Senior follow-up 1: “What is the difference between RSS, PSS, and USS, and which should you alert on?” RSS = pages mapped (counts shared pages multiple times). PSS = proportional share of shared pages (1/n of each shared page where n is users). USS = pages unique to this process. Alert on PSS for accurate per-process accounting; USS understates true cost; RSS overstates if many processes share libraries.
Senior follow-up 2: “How does jemalloc’s heap profiling actually work?” jemalloc samples allocations at a configurable interval (e.g., every 512KB allocated). At each sample, it captures the call stack. The aggregation across samples gives a statistical profile of where memory is being held. The overhead is tiny (a few percent) and the profile quality is excellent for finding leaks.
Senior follow-up 3: “What if the leak is in a kernel module loaded on the system, not in your daemon?” Check /proc/slabinfo and slabtop — shows kernel slab cache growth. If dentry or a specific cache is growing without bound, it is a kernel issue. Use perf record -e kmem:kmalloc -g to capture allocation stacks. Often a buggy module or a userspace pattern that creates many short-lived dentries.
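
The RSS, PSS, and USS definitions from follow-up 1 can be made concrete with a small worked example. The mapping sizes and sharing counts are hypothetical.

```python
# Hypothetical mappings for one process:
# (resident_kib, number_of_processes_sharing_those_pages).
mappings = [
    (40_000, 1),   # private heap
    (8_000, 1),    # private stack and data
    (12_000, 4),   # libraries shared by 4 processes
    (6_000, 2),    # shared memory segment used by 2 processes
]

rss = sum(kib for kib, _ in mappings)                       # shared pages counted in full
pss = sum(kib / sharers for kib, sharers in mappings)       # shared pages split 1/n
uss = sum(kib for kib, sharers in mappings if sharers == 1) # private pages only

print(f"RSS = {rss} KiB, PSS = {pss:.0f} KiB, USS = {uss} KiB")
```

The ordering USS ≤ PSS ≤ RSS always holds. Summing PSS across all processes gives the memory actually in use, while summing RSS double-counts every shared page.
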
Common Wrong Answers:
  1. “Restart the service nightly to work around the leak.” This is a workaround, not a fix. Tells the interviewer you do not know how to find the actual leak. State it as last resort after describing the diagnostic process.
  2. “It must be the language runtime; nothing we can do.” Even when the leak is in the runtime (rare), you can usually work around it with allocator changes, GC tuning, or process recycling. Saying “nothing we can do” is a senior-level red flag.
Further Reading:
  • jemalloc documentation, “Heap Profiling” section.
  • Brendan Gregg, “Memory Leak (and Growth) Flame Graphs” blog post.
  • Go pprof documentation — runtime/pprof package, very well documented.
Strong Answer Framework:
  1. High CPU + low throughput is a classic ‘spinning’ or ‘cache thrashing’ pattern. The CPU is busy but not doing useful work. Three big buckets: lock contention (threads spinning on a mutex), cache thrashing (CPU waiting on memory), or pathological GC/JIT activity.
  2. Check IPC first. perf stat -p <PID> -- sleep 10. IPC less than 0.5 means the CPU is mostly stalled, usually on memory. IPC near 1.0 with high CPU is normal. IPC less than 0.3 is severe stall — almost always cache misses or branch mispredictions.
  3. Lock contention check. perf lock record -p <PID> -- sleep 10 then perf lock report. Or bpftrace -e 'kprobe:mutex_lock { @[kstack] = count(); }'. If 60% of context switches are on a single mutex, you found it.
  4. Cache miss check. perf stat -e cycles,instructions,cache-misses,cache-references,LLC-load-misses -p <PID>. LLC miss rate above 10% of references is severe. perf record -e LLC-load-misses -g shows which functions cause the misses.
  5. GC/JIT check. Java: enable GC logging, look at pause times and frequency. Go: GODEBUG=gctrace=1 shows GC overhead. Python: GC pauses are usually short but reference cycle collection can be expensive. If GC eats 30% of CPU, throughput drops accordingly.
  6. Spinning/busy-loop check. perf top -p <PID> — if a tight loop in user code dominates, you have a busy-wait. Often introduced accidentally: while (!ready) {} instead of pthread_cond_wait, or a poll loop without sleep.
  7. Throughput/latency dependency. If CPU is 100% and throughput is down while the number of in-flight requests stays the same, latency must be way up (Little’s law: in-flight = throughput × latency). Confirm with the application’s own metrics. This is the same problem as scenario 1 (the latency spike), just framed differently.
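
The Little's law relationship in step 7 can be checked with simple arithmetic. The numbers are hypothetical, and the sketch assumes the number of in-flight requests is held constant (for example by a fixed-size worker pool).

```python
def latency_seconds(in_flight: float, throughput_rps: float) -> float:
    """Little's law, L = lambda * W, rearranged to W = L / lambda."""
    return in_flight / throughput_rps


IN_FLIGHT = 200  # hypothetical concurrency, assumed constant

healthy = latency_seconds(IN_FLIGHT, 4_000)    # 4000 req/s before the regression
degraded = latency_seconds(IN_FLIGHT, 1_000)   # 1000 req/s after it

print(f"healthy:  {healthy * 1000:.0f} ms")
print(f"degraded: {degraded * 1000:.0f} ms ({degraded / healthy:.0f}x)")
```

With concurrency fixed, a 4x drop in throughput is by definition a 4x rise in average latency, so the two symptoms are one problem seen from different dashboards.
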
Real-World Example: In 2019, a large fintech had a payment-processing service hit 100% CPU with throughput cut in half. Profiling showed _raw_spin_lock dominating — traced to a recently-merged feature that took a global mutex per transaction. Under high concurrency, the mutex serialized all work onto one CPU at a time, despite N cores being available. Fix: replace the global mutex with a per-shard mutex. CPU dropped to 30%, throughput tripled. Reference: pattern documented in “Lock contention in production” by various teams; specific incident details vary by company.
Senior follow-up 1: “How do you detect CPU thermal throttling versus real load?” cpupower frequency-info and /proc/cpuinfo show the current frequency. If the actual frequency is well below max under high load, the CPU is throttling: thermal limits, a power cap, or P-state misconfiguration. turbostat (Intel) shows per-core frequency and C-states.
Senior follow-up 2: “Could 100% CPU + low throughput be a hyperthread pathology?” Yes — two threads on sibling hyperthreads share L1/L2 cache and execution units. If both are cache-heavy, they thrash each other. perf stat will show high cache misses despite high CPU. Solution: pin work to physical cores only, or disable SMT for that workload (echo 0 > /sys/devices/system/cpu/cpu<N>/online for sibling cores).
Senior follow-up 3: “What is ‘priority inversion’ at the application level and how does it appear in this scenario?” Low-priority thread holds a lock; high-priority thread blocks waiting; medium-priority threads run and prevent the low-priority thread from making progress. The high-priority thread is starved. Looks like 100% CPU on low+medium with high-priority work stalled. Linux supports priority inheritance for futexes via PTHREAD_PRIO_INHERIT to mitigate.
Common Wrong Answers:
  1. “Add more CPUs.” If the bottleneck is a global mutex, more CPUs makes it worse (more contention). Always understand the bottleneck before scaling.
  2. “The CPU is at 100% so the system is fully utilized — nothing to fix.” CPU at 100% with low throughput is a smell, not a healthy state. Useful work per CPU-second is the real metric.
Further Reading:
  • Intel “Top-down Microarchitecture Analysis Method” — systematic approach to analyzing CPU stalls.
  • Brendan Gregg, “CPU Flame Graphs” — foundational for understanding CPU profiling.
  • “What every programmer should know about memory” by Ulrich Drepper — canonical reference on cache effects.

Interview Deep-Dive (Original)

Strong Answer:
  • Low CPU with high latency is the classic off-CPU problem — threads are waiting, not computing. Candidates: I/O, lock contention, sleep/futex waits, or scheduling latency.
  • Step 1: Check I/O. Run iostat -x 1 for disk (look at await and %util), and ss -ti for network (retransmissions, RTT). If disk await is spiking, that is your culprit.
  • Step 2: If I/O is clean, check lock contention with perf lock record or bpftrace on futex tracepoints.
  • Step 3: Off-CPU analysis. offcputime -p <PID> 10 shows where threads sleep and for how long. Generate an off-CPU flame graph — the widest bars reveal the bottleneck. Common findings: waiting on database responses, downstream RPCs, or a global mutex.
  • Step 4: Check scheduling latency with runqlat -p <PID>. If run queue latency is high despite low average CPU, the work is bursty.
  • Most common root cause: a downstream dependency became slower, and threads spend 400ms extra waiting on network responses. The OS tools lead you to the right diagnosis.
Follow-up: What is the difference between on-CPU and off-CPU flame graphs? On-CPU flame graphs sample the instruction pointer while the CPU executes your code, so they show where CPU time is spent. Off-CPU flame graphs record stack traces when a thread goes to sleep and the duration until it wakes, so they show where waiting time is spent. In practice, I generate both. Sometimes the problem is a combination: 60% on-CPU doing expensive serialization, 40% off-CPU waiting for a database. You need both views.
Strong Answer:
  • IPC of 0.3 means the pipeline is almost entirely stalled. The most common cause is memory-bound execution — the CPU waits for data from L2, L3, or DRAM on nearly every instruction.
  • Confirm with perf stat -e cycles,instructions,cache-misses,cache-references,LLC-load-misses. If LLC miss rate is high (more than 5% of memory references), the application is thrashing DRAM. Each LLC miss costs 60-100ns of pipeline stall.
  • Identify the functions causing misses: perf record -e LLC-load-misses -g. Common culprits: traversing large data structures with poor locality (linked lists, hash tables with pointer chasing), random access across a large address space, or working set exceeding L3.
  • Other possible causes: branch mispredictions (check branch-misses — if above 5%, the pipeline spends many cycles flushing), or instruction cache misses (large code footprint, common in JIT-compiled code).
  • Fixes: for memory-bound code, restructure data for cache locality (AoS to SoA, cache-oblivious algorithms, software prefetch). For branch-heavy code, convert to branchless operations. For large code footprint, use profile-guided optimization (PGO).
Follow-up: How do hardware prefetchers work, and when do they fail? Hardware prefetchers detect access patterns and preload cache lines. The stride prefetcher detects sequential or strided access. They fail when patterns are unpredictable: pointer chasing in linked lists, random hash table lookups, and data-dependent access patterns. When prefetchers fail, every access is a full cache miss. This is why data structure choice has such a profound impact: an array the prefetcher can predict can outperform a pointer-based structure by 10-50x, even with the same asymptotic complexity.
Strong Answer:
  • USE (Brendan Gregg): for each resource (CPU, memory, disk, network), check Utilization (how busy), Saturation (is there a backlog), and Errors (error events).
  • For the OOM-killed pod, focus on memory. Utilization: check memory.current in the cgroup. Is usage consistently near the limit? Saturation: check memory.pressure (PSI) for time spent stalled on reclaim. Errors: check memory.events for oom_kill count.
  • Investigation branches: steady climb over time means memory leak (profile with jemalloc, pprof, or JVM NMT). Sudden spikes mean transient load (large query, burst of requests). If page cache dominates (compare file vs anon in memory.stat), heavy file I/O is consuming the cgroup’s allowance.
  • Common mistake: setting limits too close to steady-state. Memory is not constant — GC, temp allocations, and kernel buffers cause spikes. Set limits at 1.5-2x steady-state and use memory.high at 1.2x for backpressure before the hard limit.
Follow-up: What is PSI and how does it differ from utilization metrics? PSI measures the percentage of time tasks are stalled due to lack of a resource. You can be at 95% memory utilization with zero PSI stalls if the remaining 5% plus reclaimable cache suffices. Conversely, 70% utilization can have high PSI if the working set exceeds cache and every allocation triggers expensive direct reclaim. PSI is a better signal for actual performance impact, and it is what userspace managers such as systemd-oomd and Meta's oomd watch (per cgroup, on cgroup v2) to act before the kernel OOM killer does.