Observability & Performance Engineering
1. Hardware Performance Monitoring Units (PMU)
Common Events
Sampling vs. Counting
2. LBR (Last Branch Record)
3. ftrace: The Function Tracer
The Magic: Dynamic Patching
4. kprobes and uprobes
5. Flame Graphs
6. USE Method vs. RED Method
7. Perf + eBPF Triage Playbook
Step 1: Identify the Bottleneck Type
Step 2: CPU-Bound Triage
Step 3: Memory-Bound Triage
Step 4: I/O-Bound Triage
Step 5: Off-CPU Analysis
Quick Reference Card
Summary for Senior Engineers

Observability & Performance Engineering

A “Senior” engineer doesn’t just use top. They use the hardware’s own diagnostic features to find bottlenecks at the cycle level. This chapter covers the internals of kernel and hardware observability.

1. Hardware Performance Monitoring Units (PMU)

Every modern CPU has a Performance Monitoring Unit (PMU): a set of dedicated counter registers that count specific hardware events. The hardware does the counting itself, so in pure counting mode the overhead is negligible.

Common Events

Cycles: Total CPU clock cycles.
Instructions: Total instructions retired (the “committed” state).
Cache Misses: L1/L2/LLC misses.
Branch Mispredictions: Critical for pipeline performance.

Sampling vs. Counting

Counting: The CPU simply increments a counter. At the end of the run, you see the total (e.g., “1 million cache misses”).
Sampling: The CPU is configured to raise an interrupt every N events (e.g., every 1000 cache misses). The kernel then records the Instruction Pointer (RIP) at each interrupt. This allows perf to create a Histogram of where the misses are happening in the code.
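
The difference between the two modes can be sketched with a toy model. This is not how perf is implemented; the addresses, the event stream, and the function name `count_and_sample` are made up for illustration. It only shows that counting yields one total while sampling yields a histogram of instruction pointers.

```python
from collections import Counter

def count_and_sample(event_ips, period):
    """Toy PMU model. event_ips lists the instruction pointer at which each
    hardware event (e.g. a cache miss) occurred, in program order."""
    total = 0              # counting mode: a single running total
    samples = Counter()    # sampling mode: histogram of sampled IPs
    for ip in event_ips:
        total += 1
        if total % period == 0:   # counter "overflows" every `period` events
            samples[ip] += 1      # the interrupt handler records the IP
    return total, samples

# Hypothetical workload: 9000 misses in one function, then 1000 in another.
events = [0x401000] * 9000 + [0x402000] * 1000
total, samples = count_and_sample(events, period=1000)

for ip, hits in samples.most_common():
    print(f"{ip:#x}: {hits} samples")
```

Counting reports only `total`; the sampled histogram attributes roughly 90% of the misses to the first address, which is the information a profiler needs.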

2. LBR (Last Branch Record)

Stack traces are expensive to capture, and many optimized binaries don’t have “Frame Pointers” (RBP is used as a general-purpose register).

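LBR is a hardware feature that keeps a ring buffer of the most recently taken branches (typically 8 to 32 entries, depending on the microarchitecture).
Benefit: It allows perf to reconstruct the Call Graph and Hot Paths with near-zero overhead and without frame pointers or DWARF unwind data. Symbols are still needed to turn addresses into function names, and call chains deeper than the LBR are truncated.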
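
The ring-buffer behaviour can be modelled in a few lines. This is a conceptual sketch only: real LBR state lives in CPU registers, and the class name `ToyLBR` and the branch labels are hypothetical.

```python
from collections import deque

class ToyLBR:
    """Toy Last Branch Record: a fixed-size ring of (from, to) branch pairs."""

    def __init__(self, depth=32):
        self.ring = deque(maxlen=depth)   # a full deque drops its oldest entry

    def record_branch(self, src, dst):
        self.ring.append((src, dst))

    def snapshot(self):
        """Return the recorded branches, oldest first."""
        return list(self.ring)

# Record six branches into a 4-entry LBR: only the last four survive.
lbr = ToyLBR(depth=4)
for i in range(6):
    lbr.record_branch(f"f{i}", f"f{i + 1}")
snapshot = lbr.snapshot()
print(snapshot)
```

The fixed depth is why a profiler that relies on LBR can only see the most recent part of a deep call chain.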

3. `ftrace`: The Function Tracer

ftrace is the built-in Linux kernel tracer. It can trace nearly every function call inside the kernel; functions marked notrace or inlined by the compiler are not visible to it.

The Magic: Dynamic Patching

How does ftrace avoid overhead when not in use?

Compile Time: The kernel is built with -pg (plus -mfentry on x86), so the compiler emits a call to mcount or __fentry__ at the start of every traceable function.
Boot Time: The kernel finds all these call sites and overwrites each one with a NOP (a single 5-byte instruction on x86-64).
Activation: When you start tracing, the kernel dynamically patches the NOPs of the selected functions into CALL instructions to the tracer.

Result: Near-zero overhead when disabled (only a NOP per function), and high-precision tracing when enabled.
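
The byte-level effect of the patching can be sketched on a plain byte buffer. This is a toy model under stated assumptions: the addresses and helper names are hypothetical, and it ignores the cross-CPU synchronization the real kernel needs when modifying live code. The encodings used are the standard x86-64 5-byte NOP and `call rel32`.

```python
import struct

NOP5 = bytes([0x0F, 0x1F, 0x44, 0x00, 0x00])   # x86-64 5-byte NOP

def make_call(site_addr, target_addr):
    """Encode `call rel32`: opcode 0xE8 followed by a signed 32-bit
    displacement measured from the end of the 5-byte instruction."""
    rel = target_addr - (site_addr + 5)
    return b"\xE8" + struct.pack("<i", rel)

def enable_tracing(text, offset, base, tracer_addr):
    """Turn the NOP at `offset` into a call to the tracer."""
    if bytes(text[offset:offset + 5]) != NOP5:
        raise ValueError("site is not a patchable NOP")
    text[offset:offset + 5] = make_call(base + offset, tracer_addr)

def disable_tracing(text, offset):
    """Restore the NOP so the function runs without tracing again."""
    text[offset:offset + 5] = NOP5

# Hypothetical function body loaded at 0x1000: NOP5, then push/mov/ret.
text = bytearray(NOP5 + b"\x55\x48\x89\xe5\xc3")
enable_tracing(text, offset=0, base=0x1000, tracer_addr=0x2000)
patched = bytes(text[:5])
disable_tracing(text, offset=0)
restored = bytes(text[:5])
print(patched.hex(), restored.hex())
```

Because the NOP and the CALL have the same length, the rest of the function never moves, which is what makes the patch reversible at run time.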

4. `kprobes` and `uprobes`

Dynamic instrumentation allows you to “hook” almost any instruction in a running system.

kprobes: Hooks into kernel code.
uprobes: Hooks into user-space code (shared libraries, binaries).
How it works: The kernel saves the target instruction and replaces it with a Breakpoint (e.g., INT3 on x86). When hit, the CPU traps to the kernel, runs your probe handler (or eBPF program), then executes the saved original instruction (by single-stepping a copy or emulating it) and resumes normal execution.
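
The save, trap, and resume sequence can be illustrated with a toy interpreter. Nothing here is kernel code: `ToyMachine`, its "instructions" (plain Python callables), and the `BREAK` marker are all hypothetical stand-ins for real machine code and INT3.

```python
BREAK = "BREAK"   # stand-in for the breakpoint instruction

class ToyMachine:
    def __init__(self, program):
        self.program = list(program)   # each instruction is a callable(state)
        self.probes = {}               # index -> (saved instruction, handler)

    def register_probe(self, index, handler):
        """Save the original instruction and overwrite it with a breakpoint."""
        self.probes[index] = (self.program[index], handler)
        self.program[index] = BREAK

    def run(self, state):
        for pc, insn in enumerate(self.program):
            if insn == BREAK:
                saved, handler = self.probes[pc]
                handler(pc, state)     # the "trap": run the probe handler
                saved(state)           # then execute the saved instruction
            else:
                insn(state)
        return state

def set_one(state):
    state["x"] = 1

def double(state):
    state["x"] *= 2

hits = []
machine = ToyMachine([set_one, double, double])
machine.register_probe(1, lambda pc, st: hits.append((pc, st["x"])))
final = machine.run({})
print(hits, final)
```

The probed program produces the same result as the unprobed one; the handler only observes state at the hooked instruction.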

5. Flame Graphs

The industry standard for visualizing performance.

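X-axis: The sampled stacks, sorted alphabetically so that identical frames merge. It is not a timeline; the width of a box is the share of samples in which that function was on the stack.
Y-axis: Stack depth.
Goal: Find the “plateaus”: wide boxes along the top edge, which show the functions that were running on-CPU most often.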
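
As a rough sketch of the data behind a flame graph, the snippet below folds sampled stacks into the `frame;frame;frame count` form consumed by flamegraph.pl and computes each box's width. The sample stacks and function names are invented for illustration.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse identical stacks into sorted (folded_stack, count) pairs."""
    folded = Counter(";".join(stack) for stack in samples)
    return sorted(folded.items())

def frame_widths(samples):
    """Width of each box: fraction of samples whose stack starts with
    that path, i.e. the function was on the stack at that position."""
    total = len(samples)
    inclusive = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            inclusive[";".join(stack[:depth])] += 1
    return {path: n / total for path, n in inclusive.items()}

# Hypothetical profile: 10 samples, root frame first.
samples = (
    [["main", "parse", "read"]] * 2
    + [["main", "compute"]] * 6
    + [["main", "parse"]] * 2
)
folded = fold_stacks(samples)
widths = frame_widths(samples)
for line, count in folded:
    print(line, count)
```

Here `main;compute` is a plateau: it is 60% wide and has nothing above it, so that is where the CPU time is going.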

6. USE Method vs. RED Method

USE Method (Brendan Gregg): For Resources (CPU, Memory, Disk).
- Utilization: How busy is the resource?
- Saturation: Is there a backlog of work?
- Errors: Are there any hardware/software errors?
RED Method: For Services (Web servers, APIs).
- Rate: Requests per second.
- Errors: Failed requests.
- Duration: Response time (latency).
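
Both checklists reduce to a few numbers per resource or service. The sketch below shows one plausible way to compute them; the function names, the input shapes, and the choice to count HTTP status codes of 500 and above as errors are assumptions made for the example, not part of either method.

```python
def use_metrics(busy_seconds, window_seconds, queued_tasks, error_count):
    """USE for one resource over an observation window."""
    return {
        "utilization": busy_seconds / window_seconds,  # fraction of time busy
        "saturation": queued_tasks,                    # work waiting in a queue
        "errors": error_count,
    }

def red_metrics(requests, window_seconds):
    """RED for one service. requests is a list of (status_code, duration_ms)."""
    n = len(requests)
    if n == 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_ms": None}
    durations = sorted(duration for _, duration in requests)
    failed = sum(1 for status, _ in requests if status >= 500)  # assumption
    p99_rank = -(-99 * n // 100)   # ceil(0.99 * n) with integer arithmetic
    return {
        "rate_rps": n / window_seconds,
        "error_ratio": failed / n,
        "p99_ms": durations[p99_rank - 1],
    }

# Hypothetical data: a CPU busy 45 s out of 60 s, and 100 requests in 10 s.
cpu = use_metrics(busy_seconds=45.0, window_seconds=60.0,
                  queued_tasks=3, error_count=0)
api = red_metrics([(200, 10)] * 98 + [(500, 250), (200, 900)],
                  window_seconds=10)
print(cpu)
print(api)
```

USE describes the machine's point of view and RED the caller's, so a healthy `cpu` result does not rule out a bad `api` result, and vice versa.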

7. Perf + eBPF Triage Playbook

When a production system is slow, use this systematic approach:

Step 1: Identify the Bottleneck Type

# Quick health check
uptime                    # Load average (on Linux it also counts tasks in uninterruptible I/O wait)
free -h                   # Memory pressure?
iostat -x 1               # Disk utilization?
sar -n DEV 1              # Network saturation?

# PSI (Pressure Stall Information) - Linux 4.20+
cat /proc/pressure/cpu    # CPU stall time
cat /proc/pressure/memory # Memory stall time
cat /proc/pressure/io     # I/O stall time
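
PSI files share a simple `key=value` layout, so they are easy to parse in a script. The sketch below parses a made-up sample string rather than a live file; on a real system the text would come from reading one of the /proc/pressure files above. The function name `parse_psi` is hypothetical.

```python
def parse_psi(text):
    """Parse PSI output into {"some": {...}, "full": {...}}.
    avg10/avg60/avg300 are percentages of time stalled; total is microseconds."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {key: float(value)
                        for key, value in (f.split("=") for f in fields)}
    return result

# Made-up sample in the PSI format.
sample = """some avg10=12.50 avg60=4.20 avg300=1.00 total=123456
full avg10=3.00 avg60=1.10 avg300=0.25 total=45678"""

psi = parse_psi(sample)
print(psi["some"]["avg10"], psi["full"]["avg10"])
```

A rising `some` value means at least one task was stalled on the resource; `full` means all non-idle tasks were stalled at once.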

Step 2: CPU-Bound Triage

# 1. Identify hot processes
top -b -n 1 | head -20

# 2. Profile the hot process (sample for 30s)
sudo perf record -g -p <PID> -- sleep 30

# 3. Analyze the profile
perf report --stdio --sort=overhead,symbol

# 4. Generate flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# 5. Check IPC (Instructions Per Cycle) - are we stalled?
perf stat -p <PID> -- sleep 10
# Rule of thumb: IPC < 1 = likely stalled (often memory-bound), IPC > 2 = CPU-efficient
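
IPC is simply retired instructions divided by cycles, both of which perf stat reports. A minimal sketch of the arithmetic and of the rule of thumb above follows; the thresholds are the ones quoted in this playbook, the counter values are invented, and the right cut-offs vary by microarchitecture.

```python
def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles

def classify_ipc(value):
    """Apply the rough thresholds from the playbook."""
    if value < 1.0:
        return "likely stalled (often memory-bound)"
    if value > 2.0:
        return "CPU-efficient"
    return "inconclusive"

# Invented counter readings for two processes.
stalled = ipc(instructions=1_500_000, cycles=3_000_000)
busy = ipc(instructions=9_000_000, cycles=3_000_000)
print(stalled, classify_ipc(stalled))
print(busy, classify_ipc(busy))
```

Two processes can both show 100% CPU in top while one retires six times as many instructions per cycle as the other.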

Step 3: Memory-Bound Triage

# 1. Check for memory pressure
vmstat 1
# Look at: si/so (swap), wa (I/O wait), free

# 2. Profile cache misses
perf stat -e cache-misses,cache-references,LLC-load-misses -p <PID> -- sleep 10

# 3. Find functions causing cache misses
perf record -e cache-misses -g -p <PID> -- sleep 10
perf report

# 4. eBPF: Track page faults
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { @[comm] = count(); }'

Step 4: I/O-Bound Triage

# 1. Check disk latency distribution
sudo biolatency      # from bcc-tools

# 2. Identify slow I/O by file
sudo opensnoop       # See which files are being opened
sudo fileslower 10   # File reads/writes slower than 10 ms

# 3. eBPF: Trace read/write latency by process
sudo bpftrace -e '
  tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
  tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
  }
'
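
The `hist()` call above groups latencies into power-of-two buckets. The snippet below is an approximation of that bucketing for illustration, with invented latencies; the exact bucket edges and labels printed by bpftrace may differ.

```python
from collections import Counter

def log2_hist(values_us):
    """Group non-negative integer latencies by the largest power of two
    not exceeding them. Returns {bucket_lower_bound: count}, sorted."""
    hist = Counter()
    for value in values_us:
        lower = 0 if value < 1 else 1 << (int(value).bit_length() - 1)
        hist[lower] += 1
    return dict(sorted(hist.items()))

# Invented read latencies in microseconds.
latencies = [3, 5, 6, 7, 120, 900, 1000]
hist = log2_hist(latencies)
for lower, count in hist.items():
    upper = 1 if lower == 0 else lower * 2
    print(f"[{lower}, {upper}) {'@' * count}")
```

Logarithmic buckets keep the fast path and rare slow outliers visible in the same chart, which an average would hide.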

Step 5: Off-CPU Analysis

The CPU might not be the bottleneck—what is the process waiting for?

# eBPF: See why a process is sleeping
sudo offcputime -p <PID> 10    # from bcc-tools

# Visualize as a flame graph (off-cpu flame graph)
sudo offcputime -p <PID> -f 30 > out.offcpu
flamegraph.pl --color=io --countname=us < out.offcpu > offcpu.svg
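
Conceptually, off-CPU time is the interval between a thread being switched out and being switched back in, attributed to the stack it blocked on. A toy sketch of that bookkeeping follows; the event tuples, event names, and stacks are invented and do not reflect the actual offcputime implementation.

```python
from collections import defaultdict

def off_cpu_by_stack(events):
    """events: (timestamp_us, tid, kind, stack) in time order, where kind is
    "switch_out" (thread leaves the CPU) or "switch_in" (thread runs again).
    Returns total blocked microseconds per folded stack."""
    blocked_since = {}          # tid -> (switch-out time, folded stack)
    totals = defaultdict(int)
    for ts, tid, kind, stack in events:
        if kind == "switch_out":
            blocked_since[tid] = (ts, ";".join(stack))
        elif kind == "switch_in" and tid in blocked_since:
            start, folded = blocked_since.pop(tid)
            totals[folded] += ts - start
    return dict(totals)

# Invented scheduler events for two threads.
events = [
    (100, 1, "switch_out", ["main", "read_config", "sys_read"]),
    (150, 2, "switch_out", ["main", "worker", "mutex_lock"]),
    (400, 1, "switch_in", []),
    (450, 2, "switch_in", []),
    (500, 1, "switch_out", ["main", "read_config", "sys_read"]),
    (700, 1, "switch_in", []),
]
totals = off_cpu_by_stack(events)
for folded, micros in totals.items():
    print(folded, micros)
```

The output has the same folded shape as a CPU profile, except that the number per stack is time spent waiting rather than a sample count.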

Quick Reference Card

| Symptom | First Command | Deep Dive |
|---|---|---|
| High CPU | `perf top` | `perf record -g` + flame graph |
| High load, low CPU | `offcputime` | Check I/O and locks |
| Memory OOM | `slabtop`, `vmstat` | `perf mem record` |
| Slow disk | `biolatency` | `bpftrace` block tracepoints |
| Slow network | `tcpconnlat`, `tcpretrans` | `ss -i`, `netstat -s` |
| Lock contention | `perf lock` | `lockstat`, `mutexsnoop` |

Summary for Senior Engineers

Don’t trust top CPU %: High CPU % could be “stalled” waiting for memory (Cache Misses). Use perf stat to check IPC (Instructions Per Cycle).
Use ftrace to understand the “flow” of a specific kernel subsystem (e.g., the scheduler or network stack).
LBR is the secret to low-overhead call-graph profiling on production systems whose binaries were built without frame pointers.
Off-CPU Analysis: Performance is not just about what the CPU is doing, but what it’s waiting for (I/O, locks). Use eBPF to trace “off-cpu” time.

