Tracing & Profiling

Production debugging requires deep observability skills. This module covers the essential tracing and profiling tools used at infrastructure and observability companies.

Prerequisites: Process fundamentals, system calls, basic eBPF
Interview Focus: perf, bpftrace, flame graphs, production debugging
Companies: This is THE differentiating skill for infra roles

Observability Stack Overview

┌─────────────────────────────────────────────────────────────────────┐
│                  LINUX OBSERVABILITY TOOLS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Application Level                                                   │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Logging (journald, syslog)  │  Metrics (Prometheus, StatsD)   ││
│  │  APM (Datadog, New Relic)    │  Distributed Tracing (Jaeger)   ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  System Call Interface                                               │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  strace, ltrace, perf trace                                     ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Kernel Tracing Infrastructure                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                                                                  ││
│  │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            ││
│  │   │   kprobes   │  │ tracepoints │  │    USDT    │            ││
│  │   │ (dynamic)   │  │  (static)   │  │  (user)    │            ││
│  │   └──────┬──────┘  └──────┬──────┘  └──────┬──────┘            ││
│  │          │                │                │                    ││
│  │          └────────────────┼────────────────┘                    ││
│  │                           ▼                                      ││
│  │   ┌─────────────────────────────────────────────────────────┐  ││
│  │   │                    eBPF Runtime                         │  ││
│  │   │  (Verifier, JIT Compiler, Maps, Helpers)               │  ││
│  │   └─────────────────────────────────────────────────────────┘  ││
│  │                           │                                      ││
│  │   ┌───────────┬───────────┼───────────┬───────────┐            ││
│  │   ▼           ▼           ▼           ▼           ▼            ││
│  │ ┌─────┐   ┌─────┐   ┌─────────┐   ┌─────┐   ┌─────┐           ││
│  │ │perf │   │ftrace│   │bpftrace│   │ BCC │   │libbpf│           ││
│  │ └─────┘   └─────┘   └─────────┘   └─────┘   └─────┘           ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Hardware Performance Counters (PMU)                                 │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  CPU cycles, cache misses, branch mispredictions, IPC          ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

perf - The Swiss Army Knife

Basic Usage

# Record CPU samples
perf record -g ./myapp
perf report

# Record specific events
perf record -e cycles,cache-misses -g ./myapp

# Profile running process
perf record -g -p $(pidof myapp) -- sleep 30

# System-wide profiling
perf record -a -g -- sleep 10

perf stat - High-Level Stats

# Basic counters
perf stat ./myapp

# Example output:
#  Performance counter stats for './myapp':
#
#          5,024.53 msec task-clock               #    0.998 CPUs utilized
#               125      context-switches         #   24.880/sec
#                 5      cpu-migrations           #    0.995/sec
#            15,432      page-faults              #    3.071 K/sec
#    15,234,567,890      cycles                   #    3.032 GHz
#    12,345,678,901      instructions             #    0.81  insn per cycle
#       987,654,321      branches                 #  196.602 M/sec
#        12,345,678      branch-misses            #    1.25% of all branches

# Detailed counters
perf stat -d ./myapp   # L1 cache stats
perf stat -dd ./myapp  # L2 cache stats
perf stat -ddd ./myapp # L3 cache stats

# Specific events
perf list  # List available events
perf stat -e 'cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses' ./myapp

perf Sampling Modes

CPU Profiling
Off-CPU Analysis
Memory Profiling

# On-CPU time (default)
perf record -g ./myapp

# Specific CPU (or all CPUs)
perf record -C 0,1,2,3 -a -- sleep 30

# Higher frequency for short programs
perf record -F 999 -g ./myapp  # 999 Hz

# Generate flame graph
perf record -g ./myapp
perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl out.folded > flame.svg

# Off-CPU time (what's blocking?)
perf record -e sched:sched_switch -a -- sleep 30

# Better: use BCC offcputime
sudo offcputime-bpfcc -p $(pidof myapp) 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg

# Page faults
perf record -e page-faults -g ./myapp

# Cache misses
perf record -e cache-misses -g ./myapp

# Memory allocation
perf probe -x /lib64/libc.so.6 malloc
perf record -e probe_libc:malloc -g ./myapp

perf trace - System Call Tracing

# Like strace, but faster
perf trace ./myapp

# Specific syscalls
perf trace -e open,read,write ./myapp

# With timing
perf trace --duration 0.1 ./myapp  # Show syscalls > 100μs

# System-wide
perf trace -a -- sleep 10

ftrace - Kernel Function Tracer

Basic ftrace

# Setup
cd /sys/kernel/debug/tracing

# List available tracers
cat available_tracers
# nop function function_graph

# Enable function tracer
echo function > current_tracer

# Filter specific functions
echo 'tcp_*' > set_ftrace_filter

# Start/stop tracing
echo 1 > tracing_on
# ... run workload ...
echo 0 > tracing_on

# Read trace
cat trace

Function Graph Tracer

# Enable function_graph for call trees
echo function_graph > current_tracer

# Filter to specific functions
echo schedule > set_graph_function

# Read trace (shows call hierarchy with timing)
cat trace

# Example output:
#  4)               |  schedule() {
#  4)   0.523 us    |    rcu_note_context_switch();
#  4)               |    __schedule() {
#  4)   0.150 us    |      update_rq_clock();
#  4)               |      pick_next_task_fair() {
#  4)   0.098 us    |        update_curr();
#  4)   0.412 us    |      }
#  4)   2.518 us    |    }
#  4)   3.201 us    |  }

trace-cmd Frontend

# Record function graph
trace-cmd record -p function_graph -g schedule

# Record events
trace-cmd record -e sched:sched_switch -e sched:sched_wakeup

# Report
trace-cmd report

# Convert for visualization
trace-cmd report --cpu 0 -t | less

bpftrace - High-Level Tracing

One-Liners

# Count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Trace open() calls
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

# Syscall latency histogram
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
             tracepoint:raw_syscalls:sys_exit /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'

# Read size distribution
bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = hist(args->ret); }'

# Block I/O latency
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
             tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { 
                 @us = hist((nsecs - @start[args->dev, args->sector]) / 1000);
                 delete(@start[args->dev, args->sector]); }'

Practical Scripts

# TCP connect latency
bpftrace -e '
kprobe:tcp_v4_connect {
    @start[tid] = nsecs;
}
kretprobe:tcp_v4_connect /@start[tid]/ {
    @connect_lat_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

# Trace process execution
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%s -> %s\n", comm, str(args->filename));
}'

# Memory allocation tracking
bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
    @bytes[comm] = sum(arg0);
}
interval:s:5 {
    print(@bytes);
    clear(@bytes);
}'

# Off-CPU time by kernel stack
bpftrace -e '
tracepoint:sched:sched_switch {
    @start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /@start[args->next_pid]/ {
    @off_cpu[kstack] = sum(nsecs - @start[args->next_pid]);
    delete(@start[args->next_pid]);
}'

bpftrace Built-in Variables

Variable	Description
`pid`	Process ID
`tid`	Thread ID
`comm`	Process name
`nsecs`	Nanosecond timestamp
`kstack`	Kernel stack trace
`ustack`	User stack trace
`arg0-argN`	Function arguments
`retval`	Return value (kretprobe)
`args->field`	Tracepoint arguments

Flame Graphs

Generating Flame Graphs

# CPU flame graph with perf
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
perf script > perf.data
stackcollapse-perf.pl perf.data > perf.folded
flamegraph.pl perf.folded > cpu_flame.svg

# Off-CPU flame graph
sudo offcputime-bpfcc -p $(pidof myapp) -f 30 > off.stacks
flamegraph.pl --color=io --title="Off-CPU Time" off.stacks > offcpu_flame.svg

# Memory flame graph
sudo memleak-bpfcc -p $(pidof myapp) 30 > mem.stacks
flamegraph.pl --color=mem --title="Memory" mem.stacks > mem_flame.svg

# Differential flame graph (compare before/after)
difffolded.pl before.folded after.folded | flamegraph.pl > diff_flame.svg

Reading Flame Graphs

┌─────────────────────────────────────────────────────────────────────┐
│                    READING FLAME GRAPHS                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│           ┌──────────────────────────────────────────────────────┐  │
│           │                     main                              │  │
│           │                     100%                              │  │
│           └───────────────────────┬──────────────────────────────┘  │
│               ┌───────────────────┴───────────────────┐             │
│               ▼                                       ▼             │
│  ┌────────────────────────────────┐    ┌────────────────────────┐  │
│  │        process_request         │    │    handle_connection   │  │
│  │             60%                │    │          40%           │  │
│  └───────────────┬────────────────┘    └────────────────────────┘  │
│        ┌─────────┴─────────┐                                        │
│        ▼                   ▼                                        │
│  ┌──────────────┐   ┌──────────────┐                               │
│  │  parse_json  │   │ query_database│                               │
│  │     25%      │   │     35%      │                               │
│  └──────────────┘   └──────────────┘                               │
│                                                                      │
│  Key insights:                                                       │
│  ─────────────                                                      │
│  • Width = Time spent in that function                              │
│  • Stack depth shows call hierarchy                                 │
│  • Look for wide boxes = optimization targets                       │
│  • Flat tops = leaf functions doing work                            │
│  • Compare before/after with differential flame graphs              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Production Debugging Scenarios

Scenario 1: High CPU Usage

# Step 1: Identify the process
top -c
htop -p $(pidof myapp)

# Step 2: Quick perf analysis
perf top -p $(pidof myapp)

# Step 3: Detailed recording
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
perf report

# Step 4: Flame graph for visualization
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Step 5: If it's kernel CPU
perf record -F 99 -g -a -- sleep 10  # System-wide
# Look for kernel functions in flame graph

Scenario 2: Application Latency Spikes

# Step 1: Check if it's syscall latency
sudo perf trace --duration 10 -p $(pidof myapp)

# Step 2: Find slow operations
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter /pid == $1/ { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000000;
    if ($lat > 10) {
        printf("%s slow syscall %d: %d ms\n", comm, args->id, $lat);
    }
    delete(@start[tid]);
}' $(pidof myapp)

# Step 3: Off-CPU analysis
sudo offcputime-bpfcc -p $(pidof myapp) 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg

# Step 4: Check for lock contention
sudo bpftrace -e '
tracepoint:lock:contention_begin { @start[tid] = nsecs; }
tracepoint:lock:contention_end /@start[tid]/ {
    @contention_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'

Scenario 3: Memory Issues

# Step 1: Check memory maps
pmap -x $(pidof myapp)
cat /proc/$(pidof myapp)/smaps_rollup

# Step 2: Track allocations
sudo memleak-bpfcc -p $(pidof myapp) 30

# Step 3: Page fault analysis
perf record -e page-faults -g -p $(pidof myapp) -- sleep 30
perf report

# Step 4: Check for memory fragmentation
cat /proc/buddyinfo
cat /proc/pagetypeinfo

# Step 5: OOM analysis
dmesg | grep -i oom
journalctl -k | grep -i "out of memory"

Scenario 4: I/O Bottleneck

# Step 1: Overall I/O stats
iostat -xz 1

# Step 2: Per-process I/O
iotop
sudo biotop-bpfcc

# Step 3: I/O latency distribution
sudo biolatency-bpfcc 10

# Step 4: Slow I/O operations
sudo biosnoop-bpfcc

# Step 5: File-level analysis
sudo fileslower-bpfcc 10 -p $(pidof myapp)

# Step 6: Check what files
sudo opensnoop-bpfcc -p $(pidof myapp)

BCC Tools Reference

Process/Thread Tools

Tool	Purpose
`execsnoop`	Trace new process execution
`exitsnoop`	Trace process exits
`threadsnoop`	Trace thread creation
`offcputime`	Off-CPU time by stack
`runqlat`	Run queue latency
`cpudist`	On-CPU time distribution

File System Tools

Tool	Purpose
`opensnoop`	Trace file opens
`filelife`	Trace file lifespan
`fileslower`	Trace slow file I/O
`filetop`	Top files by I/O
`vfsstat`	VFS operation stats
`cachestat`	Page cache hit ratio

Block I/O Tools

Tool	Purpose
`biolatency`	Block I/O latency histogram
`biosnoop`	Trace block I/O operations
`biotop`	Top processes by I/O
`bitesize`	I/O size distribution

Network Tools

Tool	Purpose
`tcpconnect`	Trace outgoing TCP connections
`tcpaccept`	Trace incoming TCP connections
`tcpretrans`	Trace TCP retransmissions
`tcpstates`	Trace TCP state changes
`tcpsynbl`	SYN backlog stats

Memory Tools

Tool	Purpose
`memleak`	Trace memory leaks
`oomkill`	Trace OOM killer
`slabratetop`	Slab allocation rates

Performance Monitoring Checklist

USE Method (Utilization, Saturation, Errors)

For each resource:

# CPU
mpstat -P ALL 1          # Utilization
vmstat 1                  # Saturation (run queue)
dmesg | grep -i error     # Errors

# Memory
free -m                   # Utilization
vmstat 1                  # Saturation (si/so)
dmesg | grep -i oom       # Errors

# Disk
iostat -xz 1              # Utilization (%util)
iostat -xz 1              # Saturation (avgqu-sz)
smartctl -H /dev/sda      # Errors

# Network
ip -s link                # Utilization
ss -s                     # Saturation (overflows)
netstat -s | grep error   # Errors

RED Method (Rate, Errors, Duration)

For each service:

# Rate: Requests per second
ss -s | grep established

# Errors: Error rate
tail -f /var/log/app.log | grep -i error

# Duration: Request latency
# Application metrics or:
sudo bpftrace -e 'usdt:./app:request_start { @start[tid] = nsecs; }
                  usdt:./app:request_end   { @lat = hist(nsecs - @start[tid]); }'

Generate Flame Graphs

# CPU flame graph
perf record -F 99 -g -p PID -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Off-CPU flame graph
sudo offcputime-bpfcc -f -p PID 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg

Check for Known Issues

# Kernel warnings
dmesg -T | tail -100

# System logs
journalctl -p err -b

# Hardware issues
mcelog --client

Interview Tips

Q: How would you debug a production performance issue?

Framework answer:

Characterize: What is the symptom? Latency? Throughput? Error rate?
Triage with USE/RED:
- Check resource utilization (CPU, memory, disk, network)
- Check for saturation (queues, waits)
- Check error rates
Narrow down:
- Is it application code? (perf, flame graphs)
- Is it system calls? (strace, perf trace)
- Is it kernel? (perf, bpftrace)
- Is it hardware? (perf stat, mcelog)
Deep dive:
- CPU: perf record + flame graphs
- Blocking: offcputime + off-CPU flame graphs
- I/O: biolatency, biosnoop
- Network: tcpretrans, tcpstates
Validate fix: Compare before/after metrics

Q: What's the overhead of tracing tools?

Tool overheads:

Tool	Overhead	Notes
perf stat	~0%	Just reads PMU
perf record	2-10%	Depends on frequency
strace	100x+	Ptrace per syscall
perf trace	5-20%	Much faster than strace
bpftrace	1-5%	Depends on events
ftrace	1-10%	Depends on filters

Production guidelines:

Always sample, don’t trace every event
Use low frequencies (99 Hz, not 9999 Hz)
Filter to specific processes
Prefer tracepoints over kprobes
Time-bound your tracing

Next: Kernel Debugging →

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​Tracing & Profiling

​Observability Stack Overview

​perf - The Swiss Army Knife

​Basic Usage

​perf stat - High-Level Stats

​perf Sampling Modes

​perf trace - System Call Tracing

​ftrace - Kernel Function Tracer

​Basic ftrace

​Function Graph Tracer

​trace-cmd Frontend

​bpftrace - High-Level Tracing

​One-Liners

​Practical Scripts

​bpftrace Built-in Variables

​Flame Graphs

​Generating Flame Graphs

​Reading Flame Graphs

​Production Debugging Scenarios

​Scenario 1: High CPU Usage

​Scenario 2: Application Latency Spikes

​Scenario 3: Memory Issues

​Scenario 4: I/O Bottleneck

​BCC Tools Reference

​Process/Thread Tools

​File System Tools

​Block I/O Tools

​Network Tools

​Memory Tools

​Performance Monitoring Checklist

​Interview Tips

Tracing & Profiling

Observability Stack Overview

perf - The Swiss Army Knife

Basic Usage

perf stat - High-Level Stats

perf Sampling Modes

perf trace - System Call Tracing

ftrace - Kernel Function Tracer

Basic ftrace

Function Graph Tracer

trace-cmd Frontend

bpftrace - High-Level Tracing

One-Liners

Practical Scripts

bpftrace Built-in Variables

Flame Graphs

Generating Flame Graphs

Reading Flame Graphs

Production Debugging Scenarios

Scenario 1: High CPU Usage

Scenario 2: Application Latency Spikes

Scenario 3: Memory Issues

Scenario 4: I/O Bottleneck

BCC Tools Reference

Process/Thread Tools

File System Tools

Block I/O Tools

Network Tools

Memory Tools

Performance Monitoring Checklist

Interview Tips