Linux Tracing & Profiling - perf, ftrace, bpftrace, and observability tools

Tracing & Profiling

Production debugging requires deep observability skills. This module covers the essential tracing and profiling tools used at infrastructure and observability companies.
Prerequisites: Process fundamentals, system calls, basic eBPF
Interview Focus: perf, bpftrace, flame graphs, production debugging
Companies: This is THE differentiating skill for infra roles

Observability Stack Overview

┌─────────────────────────────────────────────────────────────────────┐
│                  LINUX OBSERVABILITY TOOLS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Application Level                                                   │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Logging (journald, syslog)  │  Metrics (Prometheus, StatsD)   ││
│  │  APM (Datadog, New Relic)    │  Distributed Tracing (Jaeger)   ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  System Call Interface                                               │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  strace, ltrace, perf trace                                     ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Kernel Tracing Infrastructure                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                                                                  ││
│  │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            ││
│  │   │   kprobes   │  │ tracepoints │  │    USDT     │            ││
│  │   │ (dynamic)   │  │  (static)   │  │   (user)    │            ││
│  │   └──────┬──────┘  └──────┬──────┘  └──────┬──────┘            ││
│  │          │                │                │                    ││
│  │          └────────────────┼────────────────┘                    ││
│  │                           ▼                                      ││
│  │   ┌─────────────────────────────────────────────────────────┐  ││
│  │   │                    eBPF Runtime                         │  ││
│  │   │  (Verifier, JIT Compiler, Maps, Helpers)               │  ││
│  │   └─────────────────────────────────────────────────────────┘  ││
│  │                           │                                      ││
│  │   ┌───────────┬───────────┼───────────┬───────────┐            ││
│  │   ▼           ▼           ▼           ▼           ▼            ││
│  │ ┌─────┐   ┌──────┐   ┌────────┐   ┌─────┐   ┌──────┐           ││
│  │ │perf │   │ftrace│   │bpftrace│   │ BCC │   │libbpf│           ││
│  │ └─────┘   └──────┘   └────────┘   └─────┘   └──────┘           ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Hardware Performance Counters (PMU)                                 │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  CPU cycles, cache misses, branch mispredictions, IPC          ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
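
Before reaching for any of these tools, check what the host actually supports. A quick sanity check (paths assume debugfs is mounted in the usual place and that bpftool is installed):

# What tracing front-ends and kernel features are available?
perf --version
bpftrace --version
sudo ls /sys/kernel/debug/tracing/                        # ftrace interface
grep -E 'CONFIG_BPF_SYSCALL|CONFIG_FTRACE' /boot/config-$(uname -r)
sudo bpftool feature probe | head -20                     # detailed eBPF feature support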

perf - The Swiss Army Knife

Basic Usage

# Record CPU samples
perf record -g ./myapp
perf report

# Record specific events
perf record -e cycles,cache-misses -g ./myapp

# Profile running process
perf record -g -p $(pidof myapp) -- sleep 30

# System-wide profiling
perf record -a -g -- sleep 10
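
The -g default uses frame-pointer unwinding; binaries built without frame pointers produce truncated or broken stacks. DWARF or LBR unwinding (both standard perf options) are the usual workarounds:

# Frame pointers missing? Use DWARF unwinding (larger perf.data, more overhead)
perf record -g --call-graph dwarf ./myapp

# Or Last Branch Record on supported Intel CPUs (cheap, limited stack depth)
perf record -g --call-graph lbr ./myapp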

perf stat - High-Level Stats

# Basic counters
perf stat ./myapp

# Example output:
#  Performance counter stats for './myapp':
#
#          5,024.53 msec task-clock               #    0.998 CPUs utilized
#               125      context-switches         #   24.880/sec
#                 5      cpu-migrations           #    0.995/sec
#            15,432      page-faults              #    3.071 K/sec
#    15,234,567,890      cycles                   #    3.032 GHz
#    12,345,678,901      instructions             #    0.81  insn per cycle
#       987,654,321      branches                 #  196.602 M/sec
#        12,345,678      branch-misses            #    1.25% of all branches

# Detailed counters (each extra -d adds more cache/TLB events)
perf stat -d ./myapp   # adds L1-dcache and last-level cache stats
perf stat -dd ./myapp  # also adds L1-icache and TLB stats
perf stat -ddd ./myapp # adds even more low-level events

# Specific events
perf list  # List available events
perf stat -e 'cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses' ./myapp
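
Two more perf stat modes worth knowing: repeated runs for stable numbers, and interval printing to watch counters drift over time.

# Run the workload 5 times and report mean and stddev
perf stat -r 5 ./myapp

# Print counters every second while attached to a running process
perf stat -I 1000 -p $(pidof myapp)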

perf Sampling Modes

# On-CPU time (default)
perf record -g ./myapp

# Specific CPU (or all CPUs)
perf record -C 0,1,2,3 -a -- sleep 30

# Higher frequency for short programs
perf record -F 999 -g ./myapp  # 999 Hz

# Generate flame graph
perf record -g ./myapp
perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > flame.svg

perf trace - System Call Tracing

# Like strace, but faster
perf trace ./myapp

# Specific syscalls
perf trace -e open,read,write ./myapp

# With timing
perf trace --duration 0.1 ./myapp  # Show syscalls > 100μs

# System-wide
perf trace -a -- sleep 10
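
perf trace can also aggregate instead of printing every call, which is usually what you want on a busy host:

# Per-syscall summary (counts, total/avg/max time), like strace -c but far cheaper
perf trace -s ./myapp

# Summary for a running process; stop with Ctrl-C
sudo perf trace -s -p $(pidof myapp)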

ftrace - Kernel Function Tracer

Basic ftrace

# Setup
cd /sys/kernel/debug/tracing

# List available tracers
cat available_tracers
# nop function function_graph

# Enable function tracer
echo function > current_tracer

# Filter specific functions
echo 'tcp_*' > set_ftrace_filter

# Start/stop tracing
echo 1 > tracing_on
# ... run workload ...
echo 0 > tracing_on

# Read trace
cat trace
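
ftrace can also trace static events directly and filter by PID without switching tracers. A minimal sketch, still assuming the working directory is /sys/kernel/debug/tracing:

# Limit the function tracer to one process
echo $(pidof myapp) > set_ftrace_pid

# Or enable individual static tracepoints (no tracer required)
echo $(pidof myapp) > set_event_pid
echo 1 > events/sched/sched_switch/enable
cat trace_pipe            # live stream; Ctrl-C to stop

# Reset when done
echo 0 > events/enable
echo nop > current_tracer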

Function Graph Tracer

# Enable function_graph for call trees
echo function_graph > current_tracer

# Filter to specific functions
echo schedule > set_graph_function

# Read trace (shows call hierarchy with timing)
cat trace

# Example output:
#  4)               |  schedule() {
#  4)   0.523 us    |    rcu_note_context_switch();
#  4)               |    __schedule() {
#  4)   0.150 us    |      update_rq_clock();
#  4)               |      pick_next_task_fair() {
#  4)   0.098 us    |        update_curr();
#  4)   0.412 us    |      }
#  4)   2.518 us    |    }
#  4)   3.201 us    |  }

trace-cmd Frontend

# Record function graph
trace-cmd record -p function_graph -g schedule

# Record events
trace-cmd record -e sched:sched_switch -e sched:sched_wakeup

# Report
trace-cmd report

# Convert for visualization
trace-cmd report --cpu 0 -t | less
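
trace-cmd can also scope a recording to a single process, and the resulting trace.dat opens in kernelshark for a timeline view (if kernelshark is installed):

# Record only one process's scheduler events
trace-cmd record -P $(pidof myapp) -e sched

# Browse the recording graphically
kernelshark trace.dat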

bpftrace - High-Level Tracing

One-Liners

# Count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Trace open() calls
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

# Syscall latency histogram
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
             tracepoint:raw_syscalls:sys_exit /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'

# Read size distribution
bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = hist(args->ret); }'

# Block I/O latency
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
             tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { 
                 @us = hist((nsecs - @start[args->dev, args->sector]) / 1000);
                 delete(@start[args->dev, args->sector]); }'

Practical Scripts

# TCP connect latency
bpftrace -e '
kprobe:tcp_v4_connect {
    @start[tid] = nsecs;
}
kretprobe:tcp_v4_connect /@start[tid]/ {
    @connect_lat_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

# Trace process execution
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%s -> %s\n", comm, str(args->filename));
}'

# Memory allocation tracking
bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
    @bytes[comm] = sum(arg0);
}
interval:s:5 {
    print(@bytes);
    clear(@bytes);
}'

# Off-CPU time by kernel stack (rough one-liner)
# Note: kstack is captured when the blocked task is switched back IN, so it shows
# the stack of whichever task is switching out at that moment, not where the
# blocked task slept. Use offcputime-bpfcc for properly attributed off-CPU stacks.
bpftrace -e '
tracepoint:sched:sched_switch {
    @start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
    @off_cpu[kstack] = sum(nsecs - @start[args->next_pid]);
    delete(@start[args->next_pid]);
}'

bpftrace Built-in Variables

Variable       Description
pid            Process ID
tid            Thread ID
comm           Process name
nsecs          Nanosecond timestamp
kstack         Kernel stack trace
ustack         User stack trace
arg0-argN      Function arguments
retval         Return value (kretprobe)
args->field    Tracepoint arguments
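
A short example combining several of these built-ins — counting truncated kernel stacks and PIDs for reads issued by one process (the process name is a placeholder):

# Which kernel paths lead to vfs_read for a given process?
sudo bpftrace -e 'kprobe:vfs_read /comm == "myapp"/ { @stacks[kstack(5), pid] = count(); }'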

Flame Graphs

Generating Flame Graphs

# CPU flame graph with perf
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
perf script > out.perf           # don't clobber perf.data (the raw recording)
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > cpu_flame.svg

# Off-CPU flame graph
sudo offcputime-bpfcc -p $(pidof myapp) -f 30 > off.stacks
flamegraph.pl --color=io --title="Off-CPU Time" off.stacks > offcpu_flame.svg

# Memory flame graph (allocation stacks)
# memleak-bpfcc output is not in folded format; stackcount's -f output feeds flamegraph.pl directly
sudo stackcount-bpfcc -f -U -p $(pidof myapp) -D 30 c:malloc > mem.stacks
flamegraph.pl --color=mem --title="malloc() stacks" mem.stacks > mem_flame.svg

# Differential flame graph (compare before/after)
difffolded.pl before.folded after.folded | flamegraph.pl > diff_flame.svg

Reading Flame Graphs

┌─────────────────────────────────────────────────────────────────────┐
│                    READING FLAME GRAPHS                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│           ┌──────────────────────────────────────────────────────┐  │
│           │                     main                              │  │
│           │                     100%                              │  │
│           └───────────────────────┬──────────────────────────────┘  │
│               ┌───────────────────┴───────────────────┐             │
│               ▼                                       ▼             │
│  ┌────────────────────────────────┐    ┌────────────────────────┐  │
│  │        process_request         │    │    handle_connection   │  │
│  │             60%                │    │          40%           │  │
│  └───────────────┬────────────────┘    └────────────────────────┘  │
│        ┌─────────┴─────────┐                                        │
│        ▼                   ▼                                        │
│  ┌──────────────┐   ┌──────────────┐                               │
│  │  parse_json  │   │query_database│                               │
│  │     25%      │   │     35%      │                               │
│  └──────────────┘   └──────────────┘                               │
│                                                                      │
│  Key insights:                                                       │
│  ─────────────                                                      │
│  • Width = Time spent in that function                              │
│  • Stack depth shows call hierarchy                                 │
│  • Look for wide boxes = optimization targets                       │
│  • Flat tops = leaf functions doing work                            │
│  • Compare before/after with differential flame graphs              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Production Debugging Scenarios

Scenario 1: High CPU Usage

# Step 1: Identify the process
top -c
htop -p $(pidof myapp)

# Step 2: Quick perf analysis
perf top -p $(pidof myapp)

# Step 3: Detailed recording
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
perf report

# Step 4: Flame graph for visualization
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Step 5: If it's kernel CPU
perf record -F 99 -g -a -- sleep 10  # System-wide
# Look for kernel functions in flame graph
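
To quickly tell user from kernel CPU (and double-check step 5), pidstat from the sysstat package gives the split, and perf event modifiers let you sample each side separately:

# %usr vs %system breakdown, per second
pidstat -u -p $(pidof myapp) 1

# Sample only kernel-space or only user-space cycles
perf record -F 99 -g -e cycles:k -p $(pidof myapp) -- sleep 10   # kernel only
perf record -F 99 -g -e cycles:u -p $(pidof myapp) -- sleep 10   # user only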

Scenario 2: Application Latency Spikes

# Step 1: Check if it's syscall latency
sudo perf trace --duration 10 -p $(pidof myapp)

# Step 2: Find slow operations
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter /pid == $1/ { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000000;
    if ($lat > 10) {
        printf("%s slow syscall %d: %d ms\n", comm, args->id, $lat);
    }
    delete(@start[tid]);
}' $(pidof myapp)

# Step 3: Off-CPU analysis
sudo offcputime-bpfcc -p $(pidof myapp) 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg

# Step 4: Check for kernel lock contention
# (requires the lock:contention_* tracepoints, available on recent kernels)
sudo bpftrace -e '
tracepoint:lock:contention_begin { @start[tid] = nsecs; }
tracepoint:lock:contention_end /@start[tid]/ {
    @contention_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'
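
The lock:contention_* tracepoints only cover kernel locks. For user-space lock contention (e.g. contended pthread mutexes), time spent in futex() is a rough proxy:

# Who is sleeping in futex(), and for how long?
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_futex { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_futex /@start[tid]/ {
    @futex_wait_us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'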

Scenario 3: Memory Issues

# Step 1: Check memory maps
pmap -x $(pidof myapp)
cat /proc/$(pidof myapp)/smaps_rollup

# Step 2: Track allocations
sudo memleak-bpfcc -p $(pidof myapp) 30

# Step 3: Page fault analysis
perf record -e page-faults -g -p $(pidof myapp) -- sleep 30
perf report

# Step 4: Check for memory fragmentation
cat /proc/buddyinfo
cat /proc/pagetypeinfo

# Step 5: OOM analysis
dmesg | grep -i oom
journalctl -k | grep -i "out of memory"
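
To confirm a suspected leak before breaking out heavier tools, a simple RSS watch loop is often enough:

# Sample resident set size every 10 seconds; steady growth suggests a leak
while true; do
    printf '%s ' "$(date +%T)"
    grep VmRSS /proc/$(pidof myapp)/status
    sleep 10
done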

Scenario 4: I/O Bottleneck

# Step 1: Overall I/O stats
iostat -xz 1

# Step 2: Per-process I/O
iotop
sudo biotop-bpfcc

# Step 3: I/O latency distribution
sudo biolatency-bpfcc 10

# Step 4: Slow I/O operations
sudo biosnoop-bpfcc

# Step 5: File-level analysis
sudo fileslower-bpfcc 10 -p $(pidof myapp)

# Step 6: Check what files
sudo opensnoop-bpfcc -p $(pidof myapp)
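
If the latency shows up at the filesystem layer rather than the block layer, BCC ships per-filesystem variants; the device name below is a placeholder:

# Filesystem-level operations slower than 10 ms
sudo ext4slower-bpfcc 10
sudo xfsslower-bpfcc 10

# Check the I/O scheduler and queue depth for the device in question
cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/nr_requests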

BCC Tools Reference

Process/Thread Tools

Tool           Purpose
execsnoop      Trace new process execution
exitsnoop      Trace process exits
threadsnoop    Trace thread creation
offcputime     Off-CPU time by stack
runqlat        Run queue latency
cpudist        On-CPU time distribution

File System Tools

Tool           Purpose
opensnoop      Trace file opens
filelife       Trace file lifespan
fileslower     Trace slow file I/O
filetop        Top files by I/O
vfsstat        VFS operation stats
cachestat      Page cache hit ratio

Block I/O Tools

Tool           Purpose
biolatency     Block I/O latency histogram
biosnoop       Trace block I/O operations
biotop         Top processes by I/O
bitesize       I/O size distribution

Network Tools

Tool           Purpose
tcpconnect     Trace outgoing TCP connections
tcpaccept      Trace incoming TCP connections
tcpretrans     Trace TCP retransmissions
tcpstates      Trace TCP state changes
tcpsynbl       SYN backlog stats

Memory Tools

Tool           Purpose
memleak        Trace memory leaks
oomkill        Trace OOM killer
slabratetop    Slab allocation rates
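
On Debian/Ubuntu these tools carry a -bpfcc suffix (as used in the scenarios above); other distros typically install them unsuffixed under /usr/share/bcc/tools. Two representative invocations:

# New processes, including ones whose exec() failed
sudo execsnoop-bpfcc -x

# Outgoing TCP connections from one process, with timestamps
sudo tcpconnect-bpfcc -t -p $(pidof myapp)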

Performance Monitoring Checklist

1. USE Method (Utilization, Saturation, Errors)

For each resource:
# CPU
mpstat -P ALL 1          # Utilization
vmstat 1                  # Saturation (run queue)
dmesg | grep -i error     # Errors

# Memory
free -m                   # Utilization
vmstat 1                  # Saturation (si/so)
dmesg | grep -i oom       # Errors

# Disk
iostat -xz 1              # Utilization (%util)
iostat -xz 1              # Saturation (avgqu-sz)
smartctl -H /dev/sda      # Errors

# Network
ip -s link                # Utilization
ss -s                     # Saturation (overflows)
netstat -s | grep error   # Errors

2. RED Method (Rate, Errors, Duration)

For each service:
# Rate: requests per second (use application metrics; connection counts are only a rough proxy)
ss -s | grep -i estab

# Errors: Error rate
tail -f /var/log/app.log | grep -i error

# Duration: Request latency
# Application metrics or:
sudo bpftrace -e 'usdt:./app:request_start { @start[tid] = nsecs; }
                  usdt:./app:request_end /@start[tid]/ { @lat = hist(nsecs - @start[tid]); delete(@start[tid]); }'

3. Generate Flame Graphs

# CPU flame graph
perf record -F 99 -g -p PID -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Off-CPU flame graph
sudo offcputime-bpfcc -f -p PID 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg

4. Check for Known Issues

# Kernel warnings
dmesg -T | tail -100

# System logs
journalctl -p err -b

# Hardware issues
mcelog --client

Interview Tips

Framework answer:
  1. Characterize: What is the symptom? Latency? Throughput? Error rate?
  2. Triage with USE/RED:
    • Check resource utilization (CPU, memory, disk, network)
    • Check for saturation (queues, waits)
    • Check error rates
  3. Narrow down:
    • Is it application code? (perf, flame graphs)
    • Is it system calls? (strace, perf trace)
    • Is it kernel? (perf, bpftrace)
    • Is it hardware? (perf stat, mcelog)
  4. Deep dive:
    • CPU: perf record + flame graphs
    • Blocking: offcputime + off-CPU flame graphs
    • I/O: biolatency, biosnoop
    • Network: tcpretrans, tcpstates
  5. Validate fix: Compare before/after metrics
Tool overheads:
Tool          Overhead    Notes
perf stat     ~0%         Just reads PMU
perf record   2-10%       Depends on frequency
strace        100x+       ptrace per syscall
perf trace    5-20%       Much faster than strace
bpftrace      1-5%        Depends on events
ftrace        1-10%       Depends on filters
Production guidelines:
  • Always sample, don’t trace every event
  • Use low frequencies (99 Hz, not 9999 Hz)
  • Filter to specific processes
  • Prefer tracepoints over kprobes
  • Time-bound your tracing (see the sketch below)
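
A concrete guardrail for the last two guidelines — a sketch that filters to one PID and hard-caps the runtime (the openat probe is just an example):

# Hard time-bound any tracer with timeout(1), on top of PID filtering
sudo timeout -s INT 30 bpftrace -e '
    tracepoint:syscalls:sys_enter_openat /pid == $1/ {
        @files[str(args->filename)] = count();
    }' $(pidof myapp)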

Next: Kernel Debugging →