Tracing & Profiling
Production debugging requires deep observability skills. This module covers the essential tracing and profiling tools used at infrastructure and observability companies.
Prerequisites: Process fundamentals, system calls, basic eBPF
Interview Focus: perf, bpftrace, flame graphs, production debugging
Companies: This is THE differentiating skill for infra roles
Observability Stack Overview
┌─────────────────────────────────────────────────────────────────────┐
│ LINUX OBSERVABILITY TOOLS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Application Level │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ Logging (journald, syslog) │ Metrics (Prometheus, StatsD) ││
│ │ APM (Datadog, New Relic) │ Distributed Tracing (Jaeger) ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │
│ System Call Interface │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ strace, ltrace, perf trace ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │
│ Kernel Tracing Infrastructure │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ ││
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ││
│ │ │ kprobes │ │ tracepoints │ │ USDT │ ││
│ │ │ (dynamic) │ │ (static) │ │ (user) │ ││
│ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ││
│ │ │ │ │ ││
│ │ └────────────────┼────────────────┘ ││
│ │ ▼ ││
│ │ ┌─────────────────────────────────────────────────────────┐ ││
│ │ │ eBPF Runtime │ ││
│ │ │ (Verifier, JIT Compiler, Maps, Helpers) │ ││
│ │ └─────────────────────────────────────────────────────────┘ ││
│ │ │ ││
│ │ ┌───────────┬───────────┼───────────┬───────────┐ ││
│ │ ▼ ▼ ▼ ▼ ▼ ││
│ │ ┌──────┐ ┌────────┐ ┌──────────┐ ┌─────┐ ┌────────┐ ││
│ │ │ perf │ │ ftrace │ │ bpftrace │ │ BCC │ │ libbpf │ ││
│ │ └──────┘ └────────┘ └──────────┘ └─────┘ └────────┘ ││
│ │ ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │
│ Hardware Performance Counters (PMU) │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ CPU cycles, cache misses, branch mispredictions, IPC ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │
└─────────────────────────────────────────────────────────────────────┘
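All of these front-ends sit on the same kernel hooks, so a useful first step is discovering which probes exist on your machine. A quick sketch using bpftrace's probe listing (probe names vary by kernel version; the mysqld path is just an example of a USDT-instrumented binary):
# Dynamic kprobes matching a pattern
sudo bpftrace -l 'kprobe:tcp_*' | head
# Static kernel tracepoints
sudo bpftrace -l 'tracepoint:syscalls:sys_enter_*' | head
# USDT probes embedded in a user binary (example path)
sudo bpftrace -l 'usdt:/usr/sbin/mysqld:*'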
perf - The Swiss Army Knife
Basic Usage
# Record CPU samples
perf record -g ./myapp
perf report
# Record specific events
perf record -e cycles,cache-misses -g ./myapp
# Profile running process
perf record -g -p $(pidof myapp) -- sleep 30
# System-wide profiling
perf record -a -g -- sleep 10
perf stat - High-Level Stats
# Basic counters
perf stat ./myapp
# Example output:
# Performance counter stats for './myapp':
#
# 5,024.53 msec task-clock # 0.998 CPUs utilized
# 125 context-switches # 24.880/sec
# 5 cpu-migrations # 0.995/sec
# 15,432 page-faults # 3.071 K/sec
# 15,234,567,890 cycles # 3.032 GHz
# 12,345,678,901 instructions # 0.81 insn per cycle
# 987,654,321 branches # 196.602 M/sec
# 12,345,678 branch-misses # 1.25% of all branches
# Detailed counters
perf stat -d ./myapp # Adds L1-dcache and LLC stats
perf stat -dd ./myapp # Adds icache and TLB detail
perf stat -ddd ./myapp # Maximum detail
# Specific events
perf list # List available events
perf stat -e 'cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses' ./myapp
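perf stat prints to stderr, and -x emits machine-readable CSV with the counter value in the first field, which makes ratio calculations scriptable. A minimal sketch (assumes both counters are reported; values show as <not counted> if the PMU runs out of slots):
perf stat -x, -e cache-references,cache-misses ./myapp 2> stat.csv
awk -F, '/cache-references/ { refs = $1 }
         /cache-misses/     { miss = $1 }
         END { if (refs > 0) printf "cache-miss ratio: %.2f%%\n", 100 * miss / refs }' stat.csv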
perf Sampling Modes
CPU Profiling
# On-CPU time (default)
perf record -g ./myapp
# Specific CPU (or all CPUs)
perf record -C 0,1,2,3 -a -- sleep 30
# Higher frequency for short programs
perf record -F 999 -g ./myapp # 999 Hz
# Generate flame graph
perf record -g ./myapp
perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > flame.svg
Off-CPU Analysis
# Off-CPU time (what's blocking?)
perf record -e sched:sched_switch -a -- sleep 30
# Better: use BCC offcputime
sudo offcputime-bpfcc -f -p $(pidof myapp) 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg
Memory Profiling
# Page faults
perf record -e page-faults -g ./myapp
# Cache misses
perf record -e cache-misses -g ./myapp
# Memory allocation
perf probe -x /lib64/libc.so.6 malloc
perf record -e probe_libc:malloc -g ./myapp
perf trace - System Call Tracing
# Like strace, but faster
perf trace ./myapp
# Specific syscalls
perf trace -e openat,read,write ./myapp
# With timing
perf trace --duration 0.1 ./myapp # Show syscalls > 100μs
# System-wide
perf trace -a -- sleep 10
ftrace - Kernel Function Tracer
Basic ftrace
# Setup (newer kernels also expose tracefs directly at /sys/kernel/tracing)
cd /sys/kernel/debug/tracing
# List available tracers
cat available_tracers
# nop function function_graph
# Enable function tracer
echo function > current_tracer
# Filter specific functions
echo 'tcp_*' > set_ftrace_filter
# Start/stop tracing
echo 1 > tracing_on
# ... run workload ...
echo 0 > tracing_on
# Read trace
cat trace
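ftrace state is global and persists after you exit, so wrap sessions so the tracer is always restored. A sketch (assumes debugfs is mounted at the usual location):
#!/bin/bash
# Trace tcp_* kernel functions for 10s, restoring defaults on exit
cd /sys/kernel/debug/tracing || exit 1
trap 'echo 0 > tracing_on; echo nop > current_tracer; echo > set_ftrace_filter' EXIT
echo 'tcp_*' > set_ftrace_filter
echo function > current_tracer
echo 1 > tracing_on
sleep 10
echo 0 > tracing_on
cp trace /tmp/tcp_trace.txt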
Function Graph Tracer
# Enable function_graph for call trees
echo function_graph > current_tracer
# Filter to specific functions
echo schedule > set_graph_function
# Read trace (shows call hierarchy with timing)
cat trace
# Example output:
# 4) | schedule() {
# 4) 0.523 us | rcu_note_context_switch();
# 4) | __schedule() {
# 4) 0.150 us | update_rq_clock();
# 4) | pick_next_task_fair() {
# 4) 0.098 us | update_curr();
# 4) 0.412 us | }
# 4) 2.518 us | }
# 4) 3.201 us | }
trace-cmd Frontend
# Record function graph
trace-cmd record -p function_graph -g schedule
# Record events
trace-cmd record -e sched:sched_switch -e sched:sched_wakeup
# Report
trace-cmd report
# Convert for visualization
trace-cmd report --cpu 0 -t | less
bpftrace - High-Level Tracing
One-Liners
# Count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Trace open() calls
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
# Syscall latency histogram
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'
# Read size distribution
bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = hist(args->ret); }'
# Block I/O latency
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
@us = hist((nsecs - @start[args->dev, args->sector]) / 1000);
delete(@start[args->dev, args->sector]); }'
Practical Scripts
# TCP connect latency
bpftrace -e '
kprobe:tcp_v4_connect {
@start[tid] = nsecs;
}
kretprobe:tcp_v4_connect /@start[tid]/ {
@connect_lat_us = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
# Trace process execution
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
printf("%s -> %s\n", comm, str(args->filename));
}'
# Memory allocation tracking
bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
@bytes[comm] = sum(arg0);
}
interval:s:5 {
print(@bytes);
clear(@bytes);
}'
# Off-CPU time per process (the stack visible at switch-in belongs to the
# thread being switched OUT, so per-stack attribution is misleading in a
# one-liner; use offcputime-bpfcc above for proper off-CPU stacks)
bpftrace -e '
tracepoint:sched:sched_switch {
@start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
@off_cpu_ns[args->next_comm] = sum(nsecs - @start[args->next_pid]);
delete(@start[args->next_pid]);
}'
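One-liners get unwieldy past a few probes; bpftrace also runs script files. A sketch of a saved script (hypothetical file readlat.bt) showing BEGIN/END housekeeping:
#!/usr/bin/env bpftrace
// readlat.bt: read() latency histogram per process
BEGIN { printf("Tracing read() latency... hit Ctrl-C to end.\n"); }
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
  @read_us[comm] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}
END { clear(@start); }
# Run it:
sudo bpftrace readlat.bt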
bpftrace Built-in Variables
Variable        Description
pid             Process ID
tid             Thread ID
comm            Process name
nsecs           Nanosecond timestamp
kstack          Kernel stack trace
ustack          User stack trace
arg0-argN       Function arguments (kprobes/uprobes)
retval          Return value (kretprobe/uretprobe)
args->field     Tracepoint arguments
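These compose naturally; for example, summing bytes returned by vfs_read per process uses comm plus retval (a minimal sketch):
# Bytes read per process via kretprobe return values
sudo bpftrace -e 'kretprobe:vfs_read /retval > 0/ { @bytes[comm] = sum(retval); }'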
Flame Graphs
Generating Flame Graphs
# CPU flame graph with perf
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
perf script > out.perf # don't redirect onto perf.data: perf script reads it
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > cpu_flame.svg
# Off-CPU flame graph
sudo offcputime-bpfcc -f -p $(pidof myapp) 30 > off.stacks
flamegraph.pl --color=io --title="Off-CPU Time" off.stacks > offcpu_flame.svg
# Memory flame graph (memleak output is not folded; stackcount -f emits
# folded stacks that flamegraph.pl can consume)
sudo stackcount-bpfcc -f -p $(pidof myapp) -D 30 c:malloc > mem.stacks
flamegraph.pl --color=mem --title="malloc() Stacks" mem.stacks > mem_flame.svg
# Differential flame graph (compare before/after)
difffolded.pl before.folded after.folded | flamegraph.pl > diff_flame.svg
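The record → script → collapse pipeline repeats so often that a small wrapper pays off. A sketch (assumes the FlameGraph scripts from https://github.com/brendangregg/FlameGraph are on $PATH):
# Usage: flame <pid> <seconds> <out.svg>
flame() {
  local pid=$1 secs=$2 out=$3
  perf record -F 99 -g -p "$pid" -- sleep "$secs" || return 1
  perf script | stackcollapse-perf.pl | flamegraph.pl > "$out"
  echo "wrote $out"
}
flame "$(pidof myapp)" 30 cpu.svg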
Reading Flame Graphs
┌─────────────────────────────────────────────────────────────────────┐
│ READING FLAME GRAPHS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ main │ │
│ │ 100% │ │
│ └───────────────────────┬──────────────────────────────┘ │
│ ┌───────────────────┴───────────────────┐ │
│ ▼ ▼ │
│ ┌────────────────────────────────┐ ┌────────────────────────┐ │
│ │ process_request │ │ handle_connection │ │
│ │ 60% │ │ 40% │ │
│ └───────────────┬────────────────┘ └────────────────────────┘ │
│ ┌─────────┴─────────┐ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ parse_json │ │ query_database│ │
│ │ 25% │ │ 35% │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Key insights: │
│ ───────────── │
│ • Width = Time spent in that function │
│ • Stack depth shows call hierarchy │
│ • Look for wide boxes = optimization targets │
│ • Flat tops = leaf functions doing work │
│ • Compare before/after with differential flame graphs │
│ │
└─────────────────────────────────────────────────────────────────────┘
Production Debugging Scenarios
Scenario 1: High CPU Usage
# Step 1: Identify the process
top -c
htop -p $(pidof myapp)
# Step 2: Quick perf analysis
perf top -p $(pidof myapp)
# Step 3: Detailed recording
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
perf report
# Step 4: Flame graph for visualization
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg
# Step 5: If it's kernel CPU
perf record -F 99 -g -a -- sleep 10 # System-wide
# Look for kernel functions in flame graph
Scenario 2: Application Latency Spikes
# Step 1: Check if it's syscall latency
sudo perf trace --duration 10 -p $(pidof myapp) # syscalls taking > 10 ms
# Step 2: Find slow operations
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter /pid == $1/ { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
$lat = (nsecs - @start[tid]) / 1000000;
if ($lat > 10) {
printf("%s slow syscall %d: %d ms\n", comm, args->id, $lat);
}
delete(@start[tid]);
}' $(pidof myapp)
# Step 3: Off-CPU analysis
sudo offcputime-bpfcc -f -p $(pidof myapp) 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg
# Step 4: Check for lock contention (these lock tracepoints need Linux 5.19+)
sudo bpftrace -e '
tracepoint:lock:contention_begin { @start[tid] = nsecs; }
tracepoint:lock:contention_end /@start[tid]/ {
@contention_ns = hist(nsecs - @start[tid]);
delete(@start[tid]);
}'
Scenario 3: Memory Issues
# Step 1: Check memory maps
pmap -x $(pidof myapp)
cat /proc/$(pidof myapp)/smaps_rollup
# Step 2: Track allocations
sudo memleak-bpfcc -p $(pidof myapp) 30
# Step 3: Page fault analysis
perf record -e page-faults -g -p $(pidof myapp) -- sleep 30
perf report
# Step 4: Check for memory fragmentation
cat /proc/buddyinfo
cat /proc/pagetypeinfo
# Step 5: OOM analysis
dmesg | grep -i oom
journalctl -k | grep -i "out of memory"
Scenario 4: I/O Bottleneck
# Step 1: Overall I/O stats
iostat -xz 1
# Step 2: Per-process I/O
iotop
sudo biotop-bpfcc
# Step 3: I/O latency distribution
sudo biolatency-bpfcc 10
# Step 4: Slow I/O operations
sudo biosnoop-bpfcc
# Step 5: File-level analysis
sudo fileslower-bpfcc -p $(pidof myapp) 10
# Step 6: Check what files
sudo opensnoop-bpfcc -p $(pidof myapp)
Essential BCC Tools

CPU & Scheduling
Tool         Purpose
execsnoop    Trace new process execution
exitsnoop    Trace process exits
threadsnoop  Trace thread creation
offcputime   Off-CPU time by stack
runqlat      Run queue latency
cpudist      On-CPU time distribution

File I/O
Tool         Purpose
opensnoop    Trace file opens
filelife     Trace file lifespan
fileslower   Trace slow file I/O
filetop      Top files by I/O
vfsstat      VFS operation stats
cachestat    Page cache hit ratio

Block I/O
Tool         Purpose
biolatency   Block I/O latency histogram
biosnoop     Trace block I/O operations
biotop       Top processes by I/O
bitesize     I/O size distribution

Network
Tool         Purpose
tcpconnect   Trace outgoing TCP connections
tcpaccept    Trace incoming TCP connections
tcpretrans   Trace TCP retransmissions
tcpstates    Trace TCP state changes
tcpsynbl     SYN backlog stats

Memory
Tool         Purpose
memleak      Trace memory leaks
oomkill      Trace OOM killer
slabratetop  Slab allocation rates
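On Debian/Ubuntu these install with a -bpfcc suffix (plain names on other distros). A typical first-pass triage using the tables above might look like:
# Who is execing what, system-wide?
sudo execsnoop-bpfcc
# Run queue latency histogram: one 10-second sample
sudo runqlat-bpfcc 10 1
# Page cache hit ratio, refreshed every second
sudo cachestat-bpfcc 1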
USE Method (Utilization, Saturation, Errors)
For each resource, check utilization, saturation, and errors:
# CPU
mpstat -P ALL 1 # Utilization
vmstat 1 # Saturation (run queue)
dmesg | grep -i error # Errors
# Memory
free -m # Utilization
vmstat 1 # Saturation (si/so)
dmesg | grep -i oom # Errors
# Disk
iostat -xz 1 # Utilization (%util)
iostat -xz 1 # Saturation (avgqu-sz)
smartctl -H /dev/sda # Errors
# Network
ip -s link # Utilization
ss -s # Saturation (overflows)
netstat -s | grep error # Errors
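Bundling the per-resource checks into one script gives a quick triage sweep (a sketch; all commands are from the checklist above):
#!/bin/bash
# One-shot USE-method sweep
echo "== CPU ==";     mpstat -P ALL 1 1
echo "== Memory ==";  free -m
echo "== Disk ==";    iostat -xz 1 2 | tail -n 20
echo "== Network =="; ip -s link; ss -s
echo "== Errors ==";  dmesg --level=err,crit | tail -n 20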
RED Method (Rate, Errors, Duration)
For each service, check request rate, error rate, and duration:
# Rate: Requests per second (connection counts are a rough proxy)
ss -s | grep -i estab
# Errors: Error rate
tail -f /var/log/app.log | grep -i error
# Duration: Request latency
# Application metrics or:
sudo bpftrace -e 'usdt:./app:request_start { @start[tid] = nsecs; }
usdt:./app:request_end /@start[tid]/ { @lat = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Generate Flame Graphs
# CPU flame graph
perf record -F 99 -g -p PID -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg
# Off-CPU flame graph
sudo offcputime-bpfcc -f -p PID 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg
Check for Known Issues
# Kernel warnings
dmesg -T | tail -100
# System logs
journalctl -p err -b
# Hardware issues
mcelog --client
Interview Tips
Q: How would you debug a production performance issue?
A: Start broad with the USE method to find the saturated resource, then narrow: perf top / perf record for on-CPU time, off-CPU analysis for blocking, flame graphs to visualize, and the scenario playbooks above for CPU, latency, memory, and I/O.
Q: What's the overhead of tracing tools?
A: Sampling profilers (perf at 99 Hz) cost roughly low single-digit percent and are production-safe. eBPF tools add a small per-event cost that scales with event rate, so be careful with high-frequency probes (e.g. every sched_switch). strace uses ptrace and can slow a syscall-heavy process by orders of magnitude; prefer perf trace.
Next: Kernel Debugging →