Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Linux Tracing & Profiling - perf, ftrace, bpftrace, and observability tools

Tracing & Profiling

Production debugging requires deep observability skills. This module covers the essential tracing and profiling tools used at infrastructure and observability companies.
Prerequisites: Process fundamentals, system calls, basic eBPF
Interview Focus: perf, bpftrace, flame graphs, production debugging
Companies: This is THE differentiating skill for infra roles

Observability Stack Overview

┌─────────────────────────────────────────────────────────────────────┐
│                  LINUX OBSERVABILITY TOOLS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Application Level                                                   │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Logging (journald, syslog)  │  Metrics (Prometheus, StatsD)   ││
│  │  APM (Datadog, New Relic)    │  Distributed Tracing (Jaeger)   ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  System Call Interface                                               │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  strace, ltrace, perf trace                                     ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Kernel Tracing Infrastructure                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                                                                  ││
│  │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            ││
│  │   │   kprobes   │  │ tracepoints │  │    USDT    │            ││
│  │   │ (dynamic)   │  │  (static)   │  │  (user)    │            ││
│  │   └──────┬──────┘  └──────┬──────┘  └──────┬──────┘            ││
│  │          │                │                │                    ││
│  │          └────────────────┼────────────────┘                    ││
│  │                           ▼                                      ││
│  │   ┌─────────────────────────────────────────────────────────┐  ││
│  │   │                    eBPF Runtime                         │  ││
│  │   │  (Verifier, JIT Compiler, Maps, Helpers)               │  ││
│  │   └─────────────────────────────────────────────────────────┘  ││
│  │                           │                                      ││
│  │   ┌───────────┬───────────┼───────────┬───────────┐            ││
│  │   ▼           ▼           ▼           ▼           ▼            ││
│  │ ┌─────┐   ┌─────┐   ┌─────────┐   ┌─────┐   ┌─────┐           ││
│  │ │perf │   │ftrace│   │bpftrace│   │ BCC │   │libbpf│           ││
│  │ └─────┘   └─────┘   └─────────┘   └─────┘   └─────┘           ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Hardware Performance Counters (PMU)                                 │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  CPU cycles, cache misses, branch mispredictions, IPC          ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

perf - The Swiss Army Knife

Basic Usage

# Record CPU samples
perf record -g ./myapp
perf report

# Record specific events
perf record -e cycles,cache-misses -g ./myapp

# Profile running process
perf record -g -p $(pidof myapp) -- sleep 30

# System-wide profiling
perf record -a -g -- sleep 10

perf stat - High-Level Stats

# Basic counters
perf stat ./myapp

# Example output:
#  Performance counter stats for './myapp':
#
#          5,024.53 msec task-clock               #    0.998 CPUs utilized
#               125      context-switches         #   24.880/sec
#                 5      cpu-migrations           #    0.995/sec
#            15,432      page-faults              #    3.071 K/sec
#    15,234,567,890      cycles                   #    3.032 GHz
#    12,345,678,901      instructions             #    0.81  insn per cycle
#       987,654,321      branches                 #  196.602 M/sec
#        12,345,678      branch-misses            #    1.25% of all branches

# Detailed counters
perf stat -d ./myapp   # L1 cache stats
perf stat -dd ./myapp  # L2 cache stats
perf stat -ddd ./myapp # L3 cache stats

# Specific events
perf list  # List available events
perf stat -e 'cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses' ./myapp

perf Sampling Modes

# On-CPU time (default)
perf record -g ./myapp

# Specific CPU (or all CPUs)
perf record -C 0,1,2,3 -a -- sleep 30

# Higher frequency for short programs
perf record -F 999 -g ./myapp  # 999 Hz

# Generate flame graph
perf record -g ./myapp
perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > flame.svg

perf trace - System Call Tracing

# Like strace, but faster
perf trace ./myapp

# Specific syscalls
perf trace -e open,read,write ./myapp

# With timing
perf trace --duration 0.1 ./myapp  # Show syscalls > 100μs

# System-wide
perf trace -a -- sleep 10

ftrace - Kernel Function Tracer

Basic ftrace

# Setup
cd /sys/kernel/debug/tracing

# List available tracers
cat available_tracers
# nop function function_graph

# Enable function tracer
echo function > current_tracer

# Filter specific functions
echo 'tcp_*' > set_ftrace_filter

# Start/stop tracing
echo 1 > tracing_on
# ... run workload ...
echo 0 > tracing_on

# Read trace
cat trace

Function Graph Tracer

# Enable function_graph for call trees
echo function_graph > current_tracer

# Filter to specific functions
echo schedule > set_graph_function

# Read trace (shows call hierarchy with timing)
cat trace

# Example output:
#  4)               |  schedule() {
#  4)   0.523 us    |    rcu_note_context_switch();
#  4)               |    __schedule() {
#  4)   0.150 us    |      update_rq_clock();
#  4)               |      pick_next_task_fair() {
#  4)   0.098 us    |        update_curr();
#  4)   0.412 us    |      }
#  4)   2.518 us    |    }
#  4)   3.201 us    |  }

trace-cmd Frontend

# Record function graph
trace-cmd record -p function_graph -g schedule

# Record events
trace-cmd record -e sched:sched_switch -e sched:sched_wakeup

# Report
trace-cmd report

# Convert for visualization
trace-cmd report --cpu 0 -t | less

bpftrace - High-Level Tracing

One-Liners

# Count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Trace open() calls
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

# Syscall latency histogram
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
             tracepoint:raw_syscalls:sys_exit /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'

# Read size distribution
bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = hist(args->ret); }'

# Block I/O latency
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
             tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { 
                 @us = hist((nsecs - @start[args->dev, args->sector]) / 1000);
                 delete(@start[args->dev, args->sector]); }'

Practical Scripts

# TCP connect latency
bpftrace -e '
kprobe:tcp_v4_connect {
    @start[tid] = nsecs;
}
kretprobe:tcp_v4_connect /@start[tid]/ {
    @connect_lat_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

# Trace process execution
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%s -> %s\n", comm, str(args->filename));
}'

# Memory allocation tracking
bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
    @bytes[comm] = sum(arg0);
}
interval:s:5 {
    print(@bytes);
    clear(@bytes);
}'

# Off-CPU time by kernel stack
bpftrace -e '
tracepoint:sched:sched_switch {
    @start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
    @off_cpu[kstack] = sum(nsecs - @start[args->next_pid]);
    delete(@start[args->next_pid]);
}'

bpftrace Built-in Variables

VariableDescription
pidProcess ID
tidThread ID
commProcess name
nsecsNanosecond timestamp
kstackKernel stack trace
ustackUser stack trace
arg0-argNFunction arguments
retvalReturn value (kretprobe)
args->fieldTracepoint arguments

Flame Graphs

Generating Flame Graphs

# CPU flame graph with perf
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
perf script > perf.data
stackcollapse-perf.pl perf.data > perf.folded
flamegraph.pl perf.folded > cpu_flame.svg

# Off-CPU flame graph
sudo offcputime-bpfcc -p $(pidof myapp) -f 30 > off.stacks
flamegraph.pl --color=io --title="Off-CPU Time" off.stacks > offcpu_flame.svg

# Memory flame graph
sudo memleak-bpfcc -p $(pidof myapp) 30 > mem.stacks
flamegraph.pl --color=mem --title="Memory" mem.stacks > mem_flame.svg

# Differential flame graph (compare before/after)
difffolded.pl before.folded after.folded | flamegraph.pl > diff_flame.svg

Reading Flame Graphs

┌─────────────────────────────────────────────────────────────────────┐
│                    READING FLAME GRAPHS                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│           ┌──────────────────────────────────────────────────────┐  │
│           │                     main                              │  │
│           │                     100%                              │  │
│           └───────────────────────┬──────────────────────────────┘  │
│               ┌───────────────────┴───────────────────┐             │
│               ▼                                       ▼             │
│  ┌────────────────────────────────┐    ┌────────────────────────┐  │
│  │        process_request         │    │    handle_connection   │  │
│  │             60%                │    │          40%           │  │
│  └───────────────┬────────────────┘    └────────────────────────┘  │
│        ┌─────────┴─────────┐                                        │
│        ▼                   ▼                                        │
│  ┌──────────────┐   ┌──────────────┐                               │
│  │  parse_json  │   │ query_database│                               │
│  │     25%      │   │     35%      │                               │
│  └──────────────┘   └──────────────┘                               │
│                                                                      │
│  Key insights:                                                       │
│  ─────────────                                                      │
│  • Width = Time spent in that function                              │
│  • Stack depth shows call hierarchy                                 │
│  • Look for wide boxes = optimization targets                       │
│  • Flat tops = leaf functions doing work                            │
│  • Compare before/after with differential flame graphs              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Production Debugging Scenarios

Scenario 1: High CPU Usage

# Step 1: Identify the process
top -c
htop -p $(pidof myapp)

# Step 2: Quick perf analysis
perf top -p $(pidof myapp)

# Step 3: Detailed recording
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
perf report

# Step 4: Flame graph for visualization
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Step 5: If it's kernel CPU
perf record -F 99 -g -a -- sleep 10  # System-wide
# Look for kernel functions in flame graph

Scenario 2: Application Latency Spikes

# Step 1: Check if it's syscall latency
sudo perf trace --duration 10 -p $(pidof myapp)

# Step 2: Find slow operations
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter /pid == $1/ { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000000;
    if ($lat > 10) {
        printf("%s slow syscall %d: %d ms\n", comm, args->id, $lat);
    }
    delete(@start[tid]);
}' $(pidof myapp)

# Step 3: Off-CPU analysis
sudo offcputime-bpfcc -p $(pidof myapp) 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg

# Step 4: Check for lock contention
sudo bpftrace -e '
tracepoint:lock:contention_begin { @start[tid] = nsecs; }
tracepoint:lock:contention_end /@start[tid]/ {
    @contention_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'

Scenario 3: Memory Issues

# Step 1: Check memory maps
pmap -x $(pidof myapp)
cat /proc/$(pidof myapp)/smaps_rollup

# Step 2: Track allocations
sudo memleak-bpfcc -p $(pidof myapp) 30

# Step 3: Page fault analysis
perf record -e page-faults -g -p $(pidof myapp) -- sleep 30
perf report

# Step 4: Check for memory fragmentation
cat /proc/buddyinfo
cat /proc/pagetypeinfo

# Step 5: OOM analysis
dmesg | grep -i oom
journalctl -k | grep -i "out of memory"

Scenario 4: I/O Bottleneck

# Step 1: Overall I/O stats
iostat -xz 1

# Step 2: Per-process I/O
iotop
sudo biotop-bpfcc

# Step 3: I/O latency distribution
sudo biolatency-bpfcc 10

# Step 4: Slow I/O operations
sudo biosnoop-bpfcc

# Step 5: File-level analysis
sudo fileslower-bpfcc 10 -p $(pidof myapp)

# Step 6: Check what files
sudo opensnoop-bpfcc -p $(pidof myapp)

BCC Tools Reference

Process/Thread Tools

ToolPurpose
execsnoopTrace new process execution
exitsnoopTrace process exits
threadsnoopTrace thread creation
offcputimeOff-CPU time by stack
runqlatRun queue latency
cpudistOn-CPU time distribution

File System Tools

ToolPurpose
opensnoopTrace file opens
filelifeTrace file lifespan
fileslowerTrace slow file I/O
filetopTop files by I/O
vfsstatVFS operation stats
cachestatPage cache hit ratio

Block I/O Tools

ToolPurpose
biolatencyBlock I/O latency histogram
biosnoopTrace block I/O operations
biotopTop processes by I/O
bitesizeI/O size distribution

Network Tools

ToolPurpose
tcpconnectTrace outgoing TCP connections
tcpacceptTrace incoming TCP connections
tcpretransTrace TCP retransmissions
tcpstatesTrace TCP state changes
tcpsynblSYN backlog stats

Memory Tools

ToolPurpose
memleakTrace memory leaks
oomkillTrace OOM killer
slabratetopSlab allocation rates

Performance Monitoring Checklist

1

USE Method (Utilization, Saturation, Errors)

For each resource:
# CPU
mpstat -P ALL 1          # Utilization
vmstat 1                  # Saturation (run queue)
dmesg | grep -i error     # Errors

# Memory
free -m                   # Utilization
vmstat 1                  # Saturation (si/so)
dmesg | grep -i oom       # Errors

# Disk
iostat -xz 1              # Utilization (%util)
iostat -xz 1              # Saturation (avgqu-sz)
smartctl -H /dev/sda      # Errors

# Network
ip -s link                # Utilization
ss -s                     # Saturation (overflows)
netstat -s | grep error   # Errors
2

RED Method (Rate, Errors, Duration)

For each service:
# Rate: Requests per second
ss -s | grep established

# Errors: Error rate
tail -f /var/log/app.log | grep -i error

# Duration: Request latency
# Application metrics or:
sudo bpftrace -e 'usdt:./app:request_start { @start[tid] = nsecs; }
                  usdt:./app:request_end   { @lat = hist(nsecs - @start[tid]); }'
3

Generate Flame Graphs

# CPU flame graph
perf record -F 99 -g -p PID -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Off-CPU flame graph
sudo offcputime-bpfcc -f -p PID 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg
4

Check for Known Issues

# Kernel warnings
dmesg -T | tail -100

# System logs
journalctl -p err -b

# Hardware issues
mcelog --client

Interview Tips

Framework answer:
  1. Characterize: What is the symptom? Latency? Throughput? Error rate?
  2. Triage with USE/RED:
    • Check resource utilization (CPU, memory, disk, network)
    • Check for saturation (queues, waits)
    • Check error rates
  3. Narrow down:
    • Is it application code? (perf, flame graphs)
    • Is it system calls? (strace, perf trace)
    • Is it kernel? (perf, bpftrace)
    • Is it hardware? (perf stat, mcelog)
  4. Deep dive:
    • CPU: perf record + flame graphs
    • Blocking: offcputime + off-CPU flame graphs
    • I/O: biolatency, biosnoop
    • Network: tcpretrans, tcpstates
  5. Validate fix: Compare before/after metrics
Tool overheads:
ToolOverheadNotes
perf stat~0%Just reads PMU
perf record2-10%Depends on frequency
strace100x+Ptrace per syscall
perf trace5-20%Much faster than strace
bpftrace1-5%Depends on events
ftrace1-10%Depends on filters
Production guidelines:
  • Always sample, don’t trace every event
  • Use low frequencies (99 Hz, not 9999 Hz)
  • Filter to specific processes
  • Prefer tracepoints over kprobes
  • Time-bound your tracing

Interview Deep-Dive

Strong Answer:
  • Step 1: Characterize the problem. Use perf stat -p <pid> -- sleep 30 to get high-level counters: CPU cycles, instructions, cache misses, context switches. If IPC (instructions per cycle) drops during spikes, it suggests cache misses or memory stalls. If context switches spike, something is preempting the process.
  • Step 2: CPU profiling. perf record -F 99 -g -p <pid> -- sleep 30 captures on-CPU stack samples at 99Hz. Generate a flame graph: perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg. Look for wide boxes that indicate hot functions. For Go, frame pointers are enabled by default since Go 1.21, so perf can unwind Go stacks correctly.
  • Step 3: Off-CPU analysis. If the CPU flame graph does not explain the latency, the process is spending time blocked. sudo offcputime-bpfcc -p <pid> -f 30 > off.stacks and flamegraph.pl --color=io off.stacks > offcpu.svg shows where the process is sleeping. Common Go-specific causes: garbage collection (look for runtime.gcBgMarkWorker in the off-CPU graph), channel blocking, or mutex contention.
  • Step 4: Targeted tracing. If GC is suspected, sudo bpftrace -e 'uprobe:/path/to/binary:runtime.gcStart { @gc_start[tid] = nsecs; } uretprobe:/path/to/binary:runtime.gcStart /@gc_start[tid]/ { @gc_duration = hist(nsecs - @gc_start[tid]); delete(@gc_start[tid]); }' measures GC pause duration directly.
  • Step 5: Syscall analysis. sudo perf trace --duration 10 -p <pid> shows syscalls taking longer than 10ms. Sort by duration to find the slowest operations. Correlate timestamps with the latency spike times from application metrics.
Follow-up: What is the overhead of each tool you mentioned, and how would you justify using them on a production system serving real traffic?Follow-up Answer:
  • perf stat: near zero overhead — it just reads hardware PMU counters. perf record at 99Hz: approximately 1-3% CPU overhead for stack sampling, proportional to the sampling rate. 99Hz is specifically chosen because it avoids aliasing with common periodic events (which tend to be round numbers). offcputime-bpfcc: 1-5% overhead because it hooks sched_switch tracepoints which fire on every context switch. bpftrace with targeted uprobe: varies, but a single uprobe on an infrequent function (GC runs seconds apart) adds negligible overhead. perf trace: 5-20% overhead because it traces every syscall. I would start with the lowest-overhead tools (perf stat, perf record) and only escalate to higher-overhead tools (offcputime, perf trace) if the lighter tools do not provide the answer. I would always time-bound tracing sessions (30 seconds to 5 minutes) and monitor the process’s latency during tracing to ensure I am not worsening the problem I am investigating.
Strong Answer:
  • On-CPU profiling (e.g., perf record) samples the call stack at regular intervals while the thread is executing on a CPU. It tells you where the process spends its compute time. If a function is CPU-bound (tight loops, computation), it dominates the flame graph.
  • Off-CPU profiling tracks the time a thread spends not running: blocked on I/O, waiting on a lock, sleeping on a futex, or descheduled by the scheduler. It records the stack trace at the point where the thread went to sleep and measures how long it slept.
  • An on-CPU flame graph completely misses the performance problem when latency is caused by blocking. For example: a web service has 50ms request latency but spends only 2ms of CPU time per request. The remaining 48ms is off-CPU time — waiting for database query responses (network I/O), waiting for disk reads, or waiting for a mutex held by another thread. The on-CPU flame graph shows 2ms of processing and reveals nothing about the 48ms of waiting.
  • The canonical diagnostic approach is to generate both flame graphs. If the on-CPU graph explains the latency (CPU time approximately equals wall time), optimize the hot functions. If there is a gap between CPU time and wall time, the off-CPU graph reveals what the process is waiting on. The off-CPU stacks typically show blocking syscalls (futex, epoll_wait, read, io_submit) and the application-level function that initiated the blocking call.
Follow-up: How would you create a combined “wall clock” flame graph that shows both on-CPU and off-CPU time in a single visualization?Follow-up Answer:
  • A wall clock flame graph requires capturing both on-CPU samples and off-CPU events, then merging them weighted by time. One approach: record on-CPU with perf record -F 99 -g and off-CPU with offcputime-bpfcc -f. Convert both to folded stack format, then concatenate them with appropriate scaling (on-CPU samples represent 1/99th of a second each, off-CPU entries represent actual microseconds). Feed the combined folded stacks to flamegraph.pl --color=chain which uses different colors for on-CPU (warm colors) and off-CPU (cool colors). This gives a single flame graph where the width represents wall clock time: wide warm boxes are CPU-bound functions, wide cool boxes are blocking functions. Brendan Gregg calls this a “hot/cold flame graph” and it is the gold standard for understanding end-to-end request latency.
Strong Answer:
  • strace uses the ptrace system call, which works by making the traced process stop at every syscall entry and exit. The mechanism: the tracer calls ptrace(PTRACE_SYSCALL, pid), the kernel sets a flag in the tracee’s task_struct, and on every syscall entry/exit, the kernel stops the tracee, sends SIGTRAP to the tracer, the tracer reads the syscall information via ptrace(PTRACE_GETREGS), decides to continue, and calls ptrace(PTRACE_SYSCALL) again.
  • This means every single syscall involves: stopping the tracee, scheduling the tracer, reading registers, deciding to continue, scheduling the tracee back. That is at least 2 context switches per syscall (tracer wakes, tracee wakes) plus signal delivery overhead. Each context switch costs 2-10 microseconds, so strace adds 10-20 microseconds per syscall. For a process making 10,000 syscalls per second, that is 100-200 milliseconds of overhead per second, easily causing 2-10x slowdown.
  • eBPF-based tracing (bpftrace, perf trace) attaches a BPF program directly to the syscall tracepoint. The BPF program runs in kernel context on the same CPU as the traced process, with no context switches, no signal delivery, and no user-space tracer involvement. The BPF program executes in 50-200 nanoseconds per syscall — 100x less overhead than ptrace.
  • In production, strace can turn a healthy service into an unresponsive one. I have seen production incidents caused by attaching strace to a busy process. For production debugging, always use perf trace (5-20% overhead) or bpftrace with tracepoints (1-5% overhead).
Follow-up: Are there any scenarios where strace is still the right tool despite its overhead?Follow-up Answer:
  • Yes, strace is still valuable for: debugging startup failures (the process is not yet running, so overhead does not matter), analyzing one-shot commands (like why curl fails to connect), reproducing bugs in development environments, and when eBPF is not available (old kernels, restricted environments without CAP_BPF). strace’s output format is also more human-readable than raw perf trace output, showing decoded flag names, file paths, and error strings. For quick local debugging where you can afford the overhead, strace’s simplicity is valuable. The rule: strace for development, eBPF for production.

Next: Kernel Debugging →