# Tracing & Profiling

> Master Linux observability with perf, ftrace, bpftrace, and flame graphs for production debugging

<Frame>
  <img src="https://mintcdn.com/devweeekends/1GcDwVN8SzYRbJg1/images/courses/linux-internals/tracing-profiling-concept.svg?fit=max&auto=format&n=1GcDwVN8SzYRbJg1&q=85&s=a5c78d2cc77d8cf394f0ffd93e1333e5" alt="Linux Tracing & Profiling - perf, ftrace, bpftrace, and observability tools" width="1080" height="1080" data-path="images/courses/linux-internals/tracing-profiling-concept.svg" />
</Frame>

Production debugging requires deep observability skills. This module covers the essential tracing and profiling tools used at infrastructure and observability companies.

<Info>
  **Prerequisites**: Process fundamentals, system calls, basic eBPF\
  **Interview Focus**: perf, bpftrace, flame graphs, production debugging\
  **Companies**: This is THE differentiating skill for infra roles
</Info>

***

## Observability Stack Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                  LINUX OBSERVABILITY TOOLS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Application Level                                                   │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  Logging (journald, syslog)  │  Metrics (Prometheus, StatsD)   ││
│  │  APM (Datadog, New Relic)    │  Distributed Tracing (Jaeger)   ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  System Call Interface                                               │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  strace, ltrace, perf trace                                     ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Kernel Tracing Infrastructure                                       │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │                                                                  ││
│  │   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            ││
│  │   │   kprobes   │  │ tracepoints │  │    USDT    │            ││
│  │   │ (dynamic)   │  │  (static)   │  │  (user)    │            ││
│  │   └──────┬──────┘  └──────┬──────┘  └──────┬──────┘            ││
│  │          │                │                │                    ││
│  │          └────────────────┼────────────────┘                    ││
│  │                           ▼                                      ││
│  │   ┌─────────────────────────────────────────────────────────┐  ││
│  │   │                    eBPF Runtime                         │  ││
│  │   │  (Verifier, JIT Compiler, Maps, Helpers)               │  ││
│  │   └─────────────────────────────────────────────────────────┘  ││
│  │                           │                                      ││
│  │   ┌───────────┬───────────┼───────────┬───────────┐            ││
│  │   ▼           ▼           ▼           ▼           ▼            ││
│  │ ┌─────┐   ┌─────┐   ┌─────────┐   ┌─────┐   ┌─────┐           ││
│  │ │perf │   │ftrace│   │bpftrace│   │ BCC │   │libbpf│           ││
│  │ └─────┘   └─────┘   └─────────┘   └─────┘   └─────┘           ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Hardware Performance Counters (PMU)                                 │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  CPU cycles, cache misses, branch mispredictions, IPC          ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

***

## perf - The Swiss Army Knife

### Basic Usage

```bash theme={null}
# Record CPU samples
perf record -g ./myapp
perf report

# Record specific events
perf record -e cycles,cache-misses -g ./myapp

# Profile running process
perf record -g -p $(pidof myapp) -- sleep 30

# System-wide profiling
perf record -a -g -- sleep 10
```

### perf stat - High-Level Stats

```bash theme={null}
# Basic counters
perf stat ./myapp

# Example output:
#  Performance counter stats for './myapp':
#
#          5,024.53 msec task-clock               #    0.998 CPUs utilized
#               125      context-switches         #   24.880/sec
#                 5      cpu-migrations           #    0.995/sec
#            15,432      page-faults              #    3.071 K/sec
#    15,234,567,890      cycles                   #    3.032 GHz
#    12,345,678,901      instructions             #    0.81  insn per cycle
#       987,654,321      branches                 #  196.602 M/sec
#        12,345,678      branch-misses            #    1.25% of all branches

# Detailed counters (each extra -d adds more events)
perf stat -d ./myapp   # L1 data cache and last-level cache (LLC) events
perf stat -dd ./myapp  # Adds dTLB and iTLB events
perf stat -ddd ./myapp # Adds prefetch events

# Specific events
perf list  # List available events
perf stat -e 'cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses' ./myapp
```
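
The derived columns in the sample output are plain ratios of the raw counters. As a minimal sketch, the following recomputes the IPC and branch miss rate from the numbers shown above; the variable names are illustrative and not part of perf.

```bash
# Raw counters copied from the sample `perf stat` output above.
cycles=15234567890
instructions=12345678901
branches=987654321
branch_misses=12345678

# IPC = instructions / cycles; miss rate = branch-misses / branches.
ipc=$(awk -v i="$instructions" -v c="$cycles" 'BEGIN { printf "%.2f", i / c }')
miss_pct=$(awk -v m="$branch_misses" -v b="$branches" 'BEGIN { printf "%.2f", 100 * m / b }')

echo "IPC: $ipc"                       # IPC: 0.81
echo "branch miss rate: $miss_pct%"    # branch miss rate: 1.25%
```

An IPC below 1 on a modern superscalar core often points at stalls (cache misses, memory latency) rather than raw compute, which is why the counter is worth checking before recording a profile.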

### perf Sampling Modes

<Tabs>
  <Tab title="CPU Profiling">
    ```bash theme={null}
    # On-CPU time (default)
    perf record -g ./myapp

    # Specific CPUs only (omit -C to sample every CPU)
    perf record -C 0,1,2,3 -a -- sleep 30

    # Higher frequency for short programs
    perf record -F 999 -g ./myapp  # 999 Hz

    # Generate flame graph
    perf record -g ./myapp
    perf script > out.perf
    stackcollapse-perf.pl out.perf > out.folded
    flamegraph.pl out.folded > flame.svg
    ```
  </Tab>

  <Tab title="Off-CPU Analysis">
    ```bash theme={null}
    # Off-CPU time (what's blocking?); -g records the stack at each switch
    perf record -e sched:sched_switch -g -a -- sleep 30

    # Better: use BCC offcputime (-f emits the folded format flamegraph.pl reads)
    sudo offcputime-bpfcc -f -p $(pidof myapp) 30 > off.stacks
    flamegraph.pl --color=io off.stacks > offcpu.svg
    ```
  </Tab>

  <Tab title="Memory Profiling">
    ```bash theme={null}
    # Page faults
    perf record -e page-faults -g ./myapp

    # Cache misses
    perf record -e cache-misses -g ./myapp

    # Memory allocation
    perf probe -x /lib64/libc.so.6 malloc
    perf record -e probe_libc:malloc -g ./myapp
    ```
  </Tab>
</Tabs>

### perf trace - System Call Tracing

```bash theme={null}
# Like strace, but faster
perf trace ./myapp

# Specific syscalls
perf trace -e open,read,write ./myapp

# With timing
perf trace --duration 0.1 ./myapp  # Show syscalls > 100μs

# System-wide
perf trace -a -- sleep 10
```

***

## ftrace - Kernel Function Tracer

### Basic ftrace

```bash theme={null}
# Setup
cd /sys/kernel/debug/tracing

# List available tracers
cat available_tracers
# nop function function_graph

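# Filter specific functions first (tracing_on defaults to 1, so an
# unfiltered function tracer starts recording every kernel function)
echo 'tcp_*' > set_ftrace_filter

# Enable function tracer
echo function > current_tracer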

# Start/stop tracing
echo 1 > tracing_on
# ... run workload ...
echo 0 > tracing_on

# Read trace
cat trace
```
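
On a production host it helps to wrap these steps so the tracer is always switched off again. The following is a minimal sketch, not a hardened tool: `trace_functions` and the `TRACEFS` override are names invented for this example, and the function must run as root against a real tracefs mount.

```bash
# TRACEFS is an assumption of this sketch: it defaults to the path used above
# and can be pointed elsewhere (for example /sys/kernel/tracing).
TRACEFS=${TRACEFS:-/sys/kernel/debug/tracing}

# trace_functions PATTERN SECONDS
# Traces kernel functions matching PATTERN for SECONDS, prints the trace,
# then restores the default (nop) tracer and clears the filter.
trace_functions() {
    pattern=$1
    seconds=$2
    echo "$pattern" > "$TRACEFS/set_ftrace_filter"   # filter before enabling
    echo function   > "$TRACEFS/current_tracer"
    echo 1          > "$TRACEFS/tracing_on"
    sleep "$seconds"
    echo 0          > "$TRACEFS/tracing_on"
    cat "$TRACEFS/trace"
    # Restore defaults so no tracer keeps running after the capture.
    echo nop > "$TRACEFS/current_tracer"
    echo     > "$TRACEFS/set_ftrace_filter"
}

# Usage (as root): trace_functions 'tcp_*' 5 > tcp_trace.txt
```

Writing the filter before selecting the tracer keeps the window in which every kernel function is traced as small as possible.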

### Function Graph Tracer

```bash theme={null}
# Enable function_graph for call trees
echo function_graph > current_tracer

# Filter to specific functions
echo schedule > set_graph_function

# Read trace (shows call hierarchy with timing)
cat trace

# Example output:
#  4)               |  schedule() {
#  4)   0.523 us    |    rcu_note_context_switch();
#  4)               |    __schedule() {
#  4)   0.150 us    |      update_rq_clock();
#  4)               |      pick_next_task_fair() {
#  4)   0.098 us    |        update_curr();
#  4)   0.412 us    |      }
#  4)   2.518 us    |    }
#  4)   3.201 us    |  }
```
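
When a function_graph trace gets long, ranking leaf calls by duration is a quick first pass. A minimal sketch, assuming the trace was saved in the format shown above (without the leading `#`, which is only a comment marker here); the temporary file stands in for a saved copy of `trace`.

```bash
# Stand-in for a saved trace; lines copied from the example output above.
trace_file=$(mktemp)
cat > "$trace_file" <<'EOF'
 4)               |  schedule() {
 4)   0.523 us    |    rcu_note_context_switch();
 4)               |    __schedule() {
 4)   0.150 us    |      update_rq_clock();
 4)               |      pick_next_task_fair() {
 4)   0.098 us    |        update_curr();
 4)   0.412 us    |      }
 4)   2.518 us    |    }
 4)   3.201 us    |  }
EOF

# Leaf calls end in "();" and carry their own duration in field 2.
# Closing-brace lines (totals for nested calls) are skipped.
slowest=$(awk '$3 == "us" && /\(\);$/ { sub(/\(\);$/, "", $NF); print $2, $NF }' "$trace_file" | sort -rn)

echo "$slowest"
rm -f "$trace_file"
```

Real traces may prefix long durations with markers such as `+` or `!`, which shifts the fields; this sketch does not handle those lines.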

### trace-cmd Frontend

```bash theme={null}
# Record function graph
trace-cmd record -p function_graph -g schedule

# Record events
trace-cmd record -e sched:sched_switch -e sched:sched_wakeup

# Report
trace-cmd report

# Narrow the report to CPU 0 and print full timestamps
trace-cmd report --cpu 0 -t | less
```

***

## bpftrace - High-Level Tracing

### One-Liners

```bash theme={null}
# Count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Trace open() calls
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

# Syscall latency histogram
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
             tracepoint:raw_syscalls:sys_exit /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'

# Read size distribution
bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = hist(args->ret); }'

# Block I/O latency
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
             tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { 
                 @us = hist((nsecs - @start[args->dev, args->sector]) / 1000);
                 delete(@start[args->dev, args->sector]); }'
```

### Practical Scripts

```bash theme={null}
# TCP connect latency
bpftrace -e '
kprobe:tcp_v4_connect {
    @start[tid] = nsecs;
}
kretprobe:tcp_v4_connect /@start[tid]/ {
    @connect_lat_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

# Trace process execution
bpftrace -e '
tracepoint:syscalls:sys_enter_execve {
    printf("%s -> %s\n", comm, str(args->filename));
}'

# Memory allocation tracking
bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
    @bytes[comm] = sum(arg0);
}
interval:s:5 {
    print(@bytes);
    clear(@bytes);
}'

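# Off-CPU time by kernel stack
# sched_switch fires in the context of the outgoing task, so save its stack
# at switch-out and charge the blocked time to that stack at switch-in.
bpftrace -e '
tracepoint:sched:sched_switch {
    if (args->prev_pid != 0) {
        @start[args->prev_pid] = nsecs;
        @stack[args->prev_pid] = kstack;
    }
    $start = @start[args->next_pid];
    if ($start != 0) {
        @off_cpu_ns[@stack[args->next_pid]] = sum(nsecs - $start);
        delete(@start[args->next_pid]);
        delete(@stack[args->next_pid]);
    }
}'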
```

### bpftrace Built-in Variables

| Variable      | Description              |
| ------------- | ------------------------ |
| `pid`         | Process ID               |
| `tid`         | Thread ID                |
| `comm`        | Process name             |
| `nsecs`       | Nanosecond timestamp     |
| `kstack`      | Kernel stack trace       |
| `ustack`      | User stack trace         |
| `arg0-argN`   | Function arguments       |
| `retval`      | Return value (kretprobe) |
| `args->field` | Tracepoint arguments     |

***

## Flame Graphs

### Generating Flame Graphs

```bash theme={null}
# CPU flame graph with perf
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
# Write to a new file: redirecting into perf.data would truncate the recording
perf script > out.perf
stackcollapse-perf.pl out.perf > perf.folded
flamegraph.pl perf.folded > cpu_flame.svg

# Off-CPU flame graph
sudo offcputime-bpfcc -p $(pidof myapp) -f 30 > off.stacks
flamegraph.pl --color=io --title="Off-CPU Time" off.stacks > offcpu_flame.svg

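# Memory flame graph (malloc() call counts by user stack)
# memleak's report is not in folded format, so use stackcount -f instead
sudo stackcount-bpfcc -f -U -D 30 -p $(pidof myapp) c:malloc > mem.stacks
flamegraph.pl --color=mem --title="malloc() Calls" --countname=calls mem.stacks > mem_flame.svg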

# Differential flame graph (compare before/after)
difffolded.pl before.folded after.folded | flamegraph.pl > diff_flame.svg
```

### Reading Flame Graphs

```
┌─────────────────────────────────────────────────────────────────────┐
│                    READING FLAME GRAPHS                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│           ┌──────────────────────────────────────────────────────┐  │
│           │                     main                              │  │
│           │                     100%                              │  │
│           └───────────────────────┬──────────────────────────────┘  │
│               ┌───────────────────┴───────────────────┐             │
│               ▼                                       ▼             │
│  ┌────────────────────────────────┐    ┌────────────────────────┐  │
│  │        process_request         │    │    handle_connection   │  │
│  │             60%                │    │          40%           │  │
│  └───────────────┬────────────────┘    └────────────────────────┘  │
│        ┌─────────┴─────────┐                                        │
│        ▼                   ▼                                        │
│  ┌──────────────┐   ┌──────────────┐                               │
│  │  parse_json  │   │ query_database│                               │
│  │     25%      │   │     35%      │                               │
│  └──────────────┘   └──────────────┘                               │
│                                                                      │
│  Key insights:                                                       │
│  ─────────────                                                      │
│  • Width = Time spent in that function                              │
│  • Stack depth shows call hierarchy                                 │
│  • Look for wide boxes = optimization targets                       │
│  • Flat tops = leaf functions doing work                            │
│  • Compare before/after with differential flame graphs              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
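
The widths in the diagram follow directly from the folded-stack format that `stackcollapse-perf.pl` produces: one line per unique stack, frames joined by `;`, then a space and a count. As a rough sketch, the following recomputes each frame's width from folded input; the three stacks and their counts are made up to match the diagram.

```bash
# Folded stacks matching the diagram above (counts are illustrative samples).
folded=$(mktemp)
cat > "$folded" <<'EOF'
main;process_request;parse_json 25
main;process_request;query_database 35
main;handle_connection 40
EOF

# A frame's width is the sum of the counts of every stack that contains it.
widths=$(awk '
{
    count = $NF
    total += count
    n = split($1, frames, ";")
    for (i = 1; i <= n; i++) width[frames[i]] += count
}
END {
    for (f in width) printf "%d%% %s\n", 100 * width[f] / total, f
}' "$folded" | sort -rn)

echo "$widths"
rm -f "$folded"
```

`process_request` comes out at 60% because it appears in two stacks (25 + 35), even though none of those samples were taken in `process_request` itself. The sketch assumes frame names without spaces and no recursion; real C++ or Java stacks need more careful parsing.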

***

## Production Debugging Scenarios

### Scenario 1: High CPU Usage

```bash theme={null}
# Step 1: Identify the process
top -c
htop -p $(pidof myapp)

# Step 2: Quick perf analysis
perf top -p $(pidof myapp)

# Step 3: Detailed recording
perf record -F 99 -g -p $(pidof myapp) -- sleep 30
perf report

# Step 4: Flame graph for visualization
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Step 5: If it's kernel CPU
perf record -F 99 -g -a -- sleep 10  # System-wide
# Look for kernel functions in flame graph
```

### Scenario 2: Application Latency Spikes

```bash theme={null}
# Step 1: Check if it's syscall latency
sudo perf trace --duration 10 -p $(pidof myapp)

# Step 2: Find slow operations
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter /pid == $1/ { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000000;
    if ($lat > 10) {
        printf("%s slow syscall %d: %d ms\n", comm, args->id, $lat);
    }
    delete(@start[tid]);
}' $(pidof myapp)

# Step 3: Off-CPU analysis
sudo offcputime-bpfcc -f -p $(pidof myapp) 30 > off.stacks
flamegraph.pl --color=io off.stacks > offcpu.svg

# Step 4: Check for kernel lock contention (lock:contention_* tracepoints need Linux 5.19+)
sudo bpftrace -e '
tracepoint:lock:contention_begin { @start[tid] = nsecs; }
tracepoint:lock:contention_end /@start[tid]/ {
    @contention_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'
```

### Scenario 3: Memory Issues

```bash theme={null}
# Step 1: Check memory maps
pmap -x $(pidof myapp)
cat /proc/$(pidof myapp)/smaps_rollup

# Step 2: Track allocations
sudo memleak-bpfcc -p $(pidof myapp) 30

# Step 3: Page fault analysis
perf record -e page-faults -g -p $(pidof myapp) -- sleep 30
perf report

# Step 4: Check for memory fragmentation
cat /proc/buddyinfo
cat /proc/pagetypeinfo

# Step 5: OOM analysis
dmesg | grep -i oom
journalctl -k | grep -i "out of memory"
```

### Scenario 4: I/O Bottleneck

```bash theme={null}
# Step 1: Overall I/O stats
iostat -xz 1

# Step 2: Per-process I/O
iotop
sudo biotop-bpfcc

# Step 3: I/O latency distribution
sudo biolatency-bpfcc 10

# Step 4: Slow I/O operations
sudo biosnoop-bpfcc

# Step 5: File-level analysis
sudo fileslower-bpfcc 10 -p $(pidof myapp)

# Step 6: Check what files
sudo opensnoop-bpfcc -p $(pidof myapp)
```

***

## BCC Tools Reference

### Process/Thread Tools

| Tool          | Purpose                     |
| ------------- | --------------------------- |
| `execsnoop`   | Trace new process execution |
| `exitsnoop`   | Trace process exits         |
| `threadsnoop` | Trace thread creation       |
| `offcputime`  | Off-CPU time by stack       |
| `runqlat`     | Run queue latency           |
| `cpudist`     | On-CPU time distribution    |

### File System Tools

| Tool         | Purpose              |
| ------------ | -------------------- |
| `opensnoop`  | Trace file opens     |
| `filelife`   | Trace file lifespan  |
| `fileslower` | Trace slow file I/O  |
| `filetop`    | Top files by I/O     |
| `vfsstat`    | VFS operation stats  |
| `cachestat`  | Page cache hit ratio |

### Block I/O Tools

| Tool         | Purpose                     |
| ------------ | --------------------------- |
| `biolatency` | Block I/O latency histogram |
| `biosnoop`   | Trace block I/O operations  |
| `biotop`     | Top processes by I/O        |
| `bitesize`   | I/O size distribution       |

### Network Tools

| Tool         | Purpose                        |
| ------------ | ------------------------------ |
| `tcpconnect` | Trace outgoing TCP connections |
| `tcpaccept`  | Trace incoming TCP connections |
| `tcpretrans` | Trace TCP retransmissions      |
| `tcpstates`  | Trace TCP state changes        |
| `tcpsynbl`   | SYN backlog stats              |

### Memory Tools

| Tool          | Purpose               |
| ------------- | --------------------- |
| `memleak`     | Trace memory leaks    |
| `oomkill`     | Trace OOM killer      |
| `slabratetop` | Slab allocation rates |

***

## Performance Monitoring Checklist

<Steps>
  <Step title="USE Method (Utilization, Saturation, Errors)">
    For each resource:

    ```bash theme={null}
    # CPU
    mpstat -P ALL 1          # Utilization
    vmstat 1                  # Saturation (run queue)
    dmesg | grep -i error     # Errors

    # Memory
    free -m                   # Utilization
    vmstat 1                  # Saturation (si/so)
    dmesg | grep -i oom       # Errors

    # Disk
    iostat -xz 1              # Utilization (%util)
    iostat -xz 1              # Saturation (aqu-sz; avgqu-sz in older sysstat)
    smartctl -H /dev/sda      # Errors

    # Network
    ip -s link                # Utilization
    netstat -s | grep -iE 'overflow|drop'   # Saturation (listen queue overflows, drops)
    netstat -s | grep -i error   # Errors
    ```
  </Step>

  <Step title="RED Method (Rate, Errors, Duration)">
    For each service:

    ```bash theme={null}
    # Rate: Requests per second
    # Prefer application metrics; established connections are only a rough proxy
    ss -s | grep estab

    # Errors: Error rate
    tail -f /var/log/app.log | grep -i error

    # Duration: Request latency
    # Application metrics or:
    sudo bpftrace -e 'usdt:./app:request_start { @start[tid] = nsecs; }
                      usdt:./app:request_end /@start[tid]/ { @lat = hist(nsecs - @start[tid]); delete(@start[tid]); }'
    ```
  </Step>

  <Step title="Generate Flame Graphs">
    ```bash theme={null}
    # CPU flame graph
    perf record -F 99 -g -p PID -- sleep 30
    perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

    # Off-CPU flame graph
    sudo offcputime-bpfcc -f -p PID 30 > off.stacks
    flamegraph.pl --color=io off.stacks > offcpu.svg
    ```
  </Step>

  <Step title="Check for Known Issues">
    ```bash theme={null}
    # Kernel warnings
    dmesg -T | tail -100

    # System logs
    journalctl -p err -b

    # Hardware issues
    mcelog --client
    ```
  </Step>
</Steps>

***

## Interview Tips

<AccordionGroup>
  <Accordion title="Q: How would you debug a production performance issue?" icon="question">
    **Framework answer**:

    1. **Characterize**: What is the symptom? Latency? Throughput? Error rate?

    2. **Triage with USE/RED**:
       * Check resource utilization (CPU, memory, disk, network)
       * Check for saturation (queues, waits)
       * Check error rates

    3. **Narrow down**:
       * Is it application code? (perf, flame graphs)
       * Is it system calls? (strace, perf trace)
       * Is it kernel? (perf, bpftrace)
       * Is it hardware? (perf stat, mcelog)

    4. **Deep dive**:
       * CPU: perf record + flame graphs
       * Blocking: offcputime + off-CPU flame graphs
       * I/O: biolatency, biosnoop
       * Network: tcpretrans, tcpstates

    5. **Validate fix**: Compare before/after metrics
  </Accordion>

  <Accordion title="Q: What's the overhead of tracing tools?" icon="question">
    **Tool overheads**:

    | Tool        | Overhead | Notes                   |
    | ----------- | -------- | ----------------------- |
    | perf stat   | \~0%     | Just reads PMU          |
    | perf record | 2-10%    | Depends on frequency    |
    | strace      | 100x+    | Ptrace per syscall      |
    | perf trace  | 5-20%    | Much faster than strace |
    | bpftrace    | 1-5%     | Depends on events       |
    | ftrace      | 1-10%    | Depends on filters      |

    **Production guidelines**:

    * Always sample, don't trace every event
    * Use low frequencies (99 Hz, not 9999 Hz)
    * Filter to specific processes
    * Prefer tracepoints over kprobes
    * Time-bound your tracing
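
    One way to make time-bounding a habit is a small wrapper around `timeout`. This is a sketch under assumptions: `run_bounded` is a name invented here, and SIGINT is chosen because tools such as bpftrace print their maps when interrupted.

    ```bash
    # run_bounded SECONDS COMMAND [ARGS...]
    # Runs a tracing command and interrupts it after SECONDS.
    run_bounded() {
        limit=$1
        shift
        timeout -s INT "$limit" "$@"
    }

    # Hypothetical usage (needs root and bpftrace installed):
    #   run_bounded 30 bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

    # Harmless demonstration: a command that finishes within the bound
    # returns its own exit status.
    run_bounded 5 true
    echo "exit status: $?"
    ```

    When the limit is reached, GNU `timeout` exits with status 124, which makes it easy to tell a bounded capture from a tracer that failed on its own.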
  </Accordion>
</AccordionGroup>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="You need to profile a production Go service that is experiencing periodic latency spikes, but you cannot attach a debugger or restart the process. Walk through your complete investigation methodology using only kernel tracing tools." icon="message">
    **Strong Answer:**

    * Step 1: Characterize the problem. Use `perf stat -p <pid> -- sleep 30` to get high-level counters: CPU cycles, instructions, cache misses, context switches. If IPC (instructions per cycle) drops during spikes, it suggests cache misses or memory stalls. If context switches spike, something is preempting the process.
    * Step 2: CPU profiling. `perf record -F 99 -g -p <pid> -- sleep 30` captures on-CPU stack samples at 99Hz. Generate a flame graph: `perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg`. Look for wide boxes that indicate hot functions. Go has compiled with frame pointers by default on amd64 since Go 1.7 (and on arm64 since Go 1.12), so `perf` can unwind Go stacks with its default frame-pointer method.
    * Step 3: Off-CPU analysis. If the CPU flame graph does not explain the latency, the process is spending time blocked. `sudo offcputime-bpfcc -p <pid> -f 30 > off.stacks` and `flamegraph.pl --color=io off.stacks > offcpu.svg` shows where the process is sleeping. Common Go-specific causes: garbage collection (look for `runtime.gcBgMarkWorker` in the off-CPU graph), channel blocking, or mutex contention.
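    * Step 4: Targeted tracing. If GC is suspected, timestamp the start of each GC cycle with an entry uprobe and correlate it with the spike times: `sudo bpftrace -e 'uprobe:/path/to/binary:runtime.gcStart { time("%H:%M:%S "); printf("GC cycle started\n"); @gc_cycles = count(); }'`. Avoid `uretprobe` on Go binaries: it rewrites the return address on the stack, which conflicts with the Go runtime's stack copying and can crash the process. Note too that `runtime.gcStart` only begins a cycle, so timing its return would not measure the pause itself.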
    * Step 5: Syscall analysis. `sudo perf trace --duration 10 -p <pid>` shows syscalls taking longer than 10ms. Sort by duration to find the slowest operations. Correlate timestamps with the latency spike times from application metrics.

    **Follow-up:** What is the overhead of each tool you mentioned, and how would you justify using them on a production system serving real traffic?

    **Follow-up Answer:**

    * `perf stat`: near-zero overhead, because it just reads hardware PMU counters.
    * `perf record` at 99Hz: approximately 1-3% CPU overhead for stack sampling, proportional to the sampling rate. 99Hz is specifically chosen because it avoids aliasing with common periodic events (which tend to be round numbers).
    * `offcputime-bpfcc`: 1-5% overhead because it instruments the scheduler's context-switch path, which fires on every context switch.
    * `bpftrace` with a targeted uprobe: varies, but a single uprobe on an infrequent function (GC runs seconds apart) adds negligible overhead.
    * `perf trace`: 5-20% overhead because it traces every syscall.
    * I would start with the lowest-overhead tools (perf stat, perf record) and only escalate to higher-overhead tools (offcputime, perf trace) if the lighter tools do not provide the answer. I would always time-bound tracing sessions (30 seconds to 5 minutes) and monitor the process's latency during tracing to ensure I am not worsening the problem I am investigating.
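
    The aliasing argument for 99Hz can be illustrated with a toy simulation. This is not a measurement: it assumes a hypothetical task that wakes every 10 ms (100Hz) and runs for the first 1 ms of each period, and it counts how many samples taken over one second land inside that busy window.

    ```bash
    # hits_at HZ: sample at HZ for one second and count samples that fall
    # inside the task's 1 ms busy window (true on-CPU share: 10%).
    hits_at() {
        awk -v hz="$1" 'BEGIN {
            hits = 0
            for (i = 0; i < hz; i++) {
                t_us = int(i * 1000000 / hz)       # time of sample i in microseconds
                if (t_us % 10000 < 1000) hits++    # inside the 1 ms busy window?
            }
            printf "%d/%d", hits, hz
        }'
    }

    echo "100 Hz: $(hits_at 100) samples hit the task"   # 100/100
    echo " 99 Hz: $(hits_at 99) samples hit the task"    # 10/99
    ```

    Sampling in lockstep at 100Hz reports the task as 100% busy (or 0%, with a different phase), while 99Hz drifts across the period and lands close to the true 10%.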
  </Accordion>

  <Accordion title="Explain the difference between on-CPU and off-CPU profiling. When would an on-CPU flame graph completely miss the performance problem?" icon="message">
    **Strong Answer:**

    * On-CPU profiling (e.g., `perf record`) samples the call stack at regular intervals while the thread is executing on a CPU. It tells you where the process spends its compute time. If a function is CPU-bound (tight loops, computation), it dominates the flame graph.
    * Off-CPU profiling tracks the time a thread spends not running: blocked on I/O, waiting on a lock, sleeping on a futex, or descheduled by the scheduler. It records the stack trace at the point where the thread went to sleep and measures how long it slept.
    * An on-CPU flame graph completely misses the performance problem when latency is caused by blocking. For example: a web service has 50ms request latency but spends only 2ms of CPU time per request. The remaining 48ms is off-CPU time -- waiting for database query responses (network I/O), waiting for disk reads, or waiting for a mutex held by another thread. The on-CPU flame graph shows 2ms of processing and reveals nothing about the 48ms of waiting.
    * The canonical diagnostic approach is to generate both flame graphs. If the on-CPU graph explains the latency (CPU time approximately equals wall time), optimize the hot functions. If there is a gap between CPU time and wall time, the off-CPU graph reveals what the process is waiting on. The off-CPU stacks typically show blocking syscalls (`futex`, `epoll_wait`, `read`, `io_submit`) and the application-level function that initiated the blocking call.

    **Follow-up:** How would you create a combined "wall clock" flame graph that shows both on-CPU and off-CPU time in a single visualization?

    **Follow-up Answer:**

    * A wall clock flame graph requires capturing both on-CPU samples and off-CPU events, then merging them weighted by time. One approach: record on-CPU with `perf record -F 99 -g` and off-CPU with `offcputime-bpfcc -f`. Convert both to folded stack format and put them on a common unit before concatenating: each on-CPU sample at 99Hz represents 1/99th of a second (about 10,101 microseconds), while off-CPU entries are already in microseconds. `flamegraph.pl` has no palette that colors on-CPU and off-CPU stacks differently in one graph (`--color=chain` is meant for wakeup chain graphs), so one way to keep them apart is to prefix each stack with a synthetic root frame such as `on-cpu` or `off-cpu` before merging. The result is a single flame graph where the width represents wall clock time: wide towers under `on-cpu` are CPU-bound functions, wide towers under `off-cpu` are blocking functions. Brendan Gregg calls the combined view a "hot/cold flame graph"; it is one of the most direct ways to understand end-to-end request latency.
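
    The scaling step can be sketched with awk. The two input stacks below are invented for illustration, and the `on-cpu` / `off-cpu` root frames are an assumption of this sketch (a simple way to keep the two kinds of time apart), not something either tool emits.

    ```bash
    # Illustrative inputs in folded format (stack, space, value).
    # $on:  values are sample counts from `perf record -F 99`
    # $off: values are microseconds, as printed by `offcputime -f`
    on=$(mktemp); off=$(mktemp)
    printf '%s\n' 'main;process_request;parse_json 99' > "$on"
    printf '%s\n' 'main;process_request;query_database 2000000' > "$off"

    # One sample at 99 Hz stands for 1/99 s, so convert counts to microseconds.
    # Prefix each stack with a synthetic root frame so the two kinds stay apart.
    merged=$(
        awk -v hz=99 '{ v = $NF; sub(/ [^ ]+$/, ""); printf "on-cpu;%s %d\n", $0, v * 1000000 / hz }' "$on"
        awk '{ v = $NF; sub(/ [^ ]+$/, ""); printf "off-cpu;%s %d\n", $0, v }' "$off"
    )

    echo "$merged"
    rm -f "$on" "$off"
    ```

    Here 99 samples become 1,000,000 microseconds of on-CPU time next to 2,000,000 microseconds of off-CPU time, so the merged output can be piped to `flamegraph.pl --countname=us` with widths that add up to wall clock time.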
  </Accordion>

  <Accordion title="A colleague says 'strace is good enough for production debugging.' Explain why this is dangerous and describe the kernel mechanisms that make strace so slow compared to eBPF-based tracing." icon="message">
    **Strong Answer:**

    * strace uses the ptrace system call, which works by making the traced process stop at every syscall entry and exit. The mechanism: the tracer calls `ptrace(PTRACE_SYSCALL, pid)`, the kernel sets a flag in the tracee's `task_struct`, and on every syscall entry/exit the kernel stops the tracee and wakes the tracer, which is blocked in `waitpid` and sees a SIGTRAP stop status. The tracer reads the syscall information via `ptrace(PTRACE_GETREGS)`, decides to continue, and calls `ptrace(PTRACE_SYSCALL)` again.
    * This means every single syscall involves: stopping the tracee, scheduling the tracer, reading registers, deciding to continue, scheduling the tracee back. That is at least 2 context switches at each stop (tracer wakes, tracee wakes) plus signal delivery overhead, and a syscall has two stops (entry and exit). With each context switch costing 2-10 microseconds, strace adds on the order of 10-20 microseconds per syscall. For a process making 10,000 syscalls per second, that is 100-200 milliseconds of overhead per second (a 10-20% slowdown). At 100,000 syscalls per second the added time alone exceeds one second per second, which is how syscall-heavy workloads end up 2-10x slower or worse.
    * eBPF-based tracing (bpftrace) attaches a BPF program directly to the syscall tracepoint. The BPF program runs in kernel context on the same CPU as the traced process, with no context switches, no signal delivery, and no user-space tracer on the syscall path. The BPF program executes in 50-200 nanoseconds per syscall, roughly 100x less overhead than ptrace. `perf trace` hooks the same tracepoints and streams events through the perf ring buffer, so it also avoids stopping the tracee.
    * In production, strace can turn a healthy service into an unresponsive one. I have seen production incidents caused by attaching strace to a busy process. For production debugging, always use `perf trace` (5-20% overhead) or bpftrace with tracepoints (1-5% overhead).
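
    The per-syscall estimate above turns into wall-clock overhead by simple multiplication. A small worked sketch, using the 10-20 microsecond figure from the answer and three illustrative syscall rates (`overhead_ms` is a helper invented for this example):

    ```bash
    # overhead_ms CALLS_PER_SECOND COST_US
    # Added tracing time per second of wall clock, in milliseconds.
    overhead_ms() {
        calls_per_s=$1
        cost_us=$2
        echo $(( calls_per_s * cost_us / 1000 ))
    }

    for rate in 1000 10000 100000; do
        echo "$rate syscalls/s: $(overhead_ms "$rate" 10)-$(overhead_ms "$rate" 20) ms of tracing overhead per second"
    done
    ```

    At 1,000 syscalls per second the cost is 10-20 ms per second and barely noticeable; at 100,000 it is 1,000-2,000 ms per second, meaning the traced thread cannot sustain its original rate at all.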

    **Follow-up:** Are there any scenarios where strace is still the right tool despite its overhead?

    **Follow-up Answer:**

    * Yes, strace is still valuable for: debugging startup failures (the process is not yet running, so overhead does not matter), analyzing one-shot commands (like why `curl` fails to connect), reproducing bugs in development environments, and when eBPF is not available (old kernels, restricted environments without CAP\_BPF). strace's output format is also more human-readable than raw perf trace output, showing decoded flag names, file paths, and error strings. For quick local debugging where you can afford the overhead, strace's simplicity is valuable. The rule: strace for development, eBPF for production.
  </Accordion>
</AccordionGroup>

***

Next: [Kernel Debugging →](/courses/linux-internals/kernel-debugging)
