# eBPF Deep Dive

> Master eBPF for observability: architecture, program types, maps, verifier, and production tooling

<Frame>
  <img src="https://mintcdn.com/devweeekends/1GcDwVN8SzYRbJg1/images/courses/linux-internals/ebpf-concept.svg?fit=max&auto=format&n=1GcDwVN8SzYRbJg1&q=85&s=7c0f02a08fe4c70a2013e6577ba75d2d" alt="eBPF - Safely extend the kernel at runtime" width="1080" height="1080" data-path="images/courses/linux-internals/ebpf-concept.svg" />
</Frame>

eBPF (extended Berkeley Packet Filter) is the technology revolutionizing observability, networking, and security in Linux. Companies like Datadog, Grafana, and Cloudflare use eBPF extensively. Mastering it is essential for infrastructure and observability engineering roles.

<Info>
  **Interview Frequency**: Very High (core skill for observability roles)\
  **Key Topics**: BPF architecture, program types, maps, verifier, bpftrace\
  **Time to Master**: 18-20 hours
</Info>

***

## What is eBPF?

eBPF allows running sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules.

<img src="https://mintcdn.com/devweeekends/1GcDwVN8SzYRbJg1/images/courses/linux-ebpf-lifecycle.svg?fit=max&auto=format&n=1GcDwVN8SzYRbJg1&q=85&s=2867753f88b6a6243ed13de2c2e1b1cd" alt="eBPF Lifecycle" width="1080" height="1080" data-path="images/courses/linux-ebpf-lifecycle.svg" />

***

## eBPF Program Types

Different attach points for different use cases:

### Core Program Types

| Type                          | Attach Point                | Use Case                           |
| ----------------------------- | --------------------------- | ---------------------------------- |
| `BPF_PROG_TYPE_KPROBE`        | Kernel function entry/exit  | Tracing syscalls, kernel functions |
| `BPF_PROG_TYPE_TRACEPOINT`    | Static kernel tracepoints   | Stable tracing points              |
| `BPF_PROG_TYPE_PERF_EVENT`    | perf events (PMU, software) | Performance monitoring             |
| `BPF_PROG_TYPE_XDP`           | Network driver (before SKB) | High-performance packet processing |
| `BPF_PROG_TYPE_SCHED_CLS`     | Traffic control classifier  | Container networking               |
| `BPF_PROG_TYPE_SOCKET_FILTER` | Socket                      | Packet filtering                   |
| `BPF_PROG_TYPE_CGROUP_*`      | Cgroup hooks                | Container resource control         |
| `BPF_PROG_TYPE_LSM`           | LSM hooks                   | Security policies                  |

### Kprobes vs Tracepoints

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    KPROBES VS TRACEPOINTS                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  KPROBES (Dynamic)                                                           │
│  ────────────────                                                           │
│  + Can attach to almost any kernel function (not inlined/notrace)           │
│  + Very flexible for debugging                                               │
│  - Unstable ABI (function signatures can change)                            │
│  - Higher overhead                                                           │
│  - May break between kernel versions                                         │
│                                                                              │
│  TRACEPOINTS (Static)                                                        │
│  ───────────────────                                                        │
│  + Stable ABI (maintained by kernel developers)                             │
│  + Lower overhead (optimized instrumentation)                               │
│  + Documented arguments                                                      │
│  - Limited to predefined points                                             │
│  - May not cover everything you need                                         │
│                                                                              │
│  Best Practice: Prefer tracepoints when available, use kprobes when needed  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Available Tracepoints

```bash theme={null}
# List all tracepoints
sudo ls /sys/kernel/debug/tracing/events/

# List syscall tracepoints
sudo ls /sys/kernel/debug/tracing/events/syscalls/

# View tracepoint format
sudo cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
```

***

## BPF Maps

Maps are key-value stores shared between eBPF programs and user space.

### Map Types

```c theme={null}
// Common map types
BPF_MAP_TYPE_HASH          // Hash table
BPF_MAP_TYPE_ARRAY         // Array (fixed-size)
BPF_MAP_TYPE_PERCPU_HASH   // Per-CPU hash (no locking)
BPF_MAP_TYPE_PERCPU_ARRAY  // Per-CPU array
BPF_MAP_TYPE_RINGBUF       // Ring buffer (efficient event streaming)
BPF_MAP_TYPE_PERF_EVENT_ARRAY  // Per-CPU event buffer
BPF_MAP_TYPE_LRU_HASH      // LRU evicting hash
BPF_MAP_TYPE_STACK_TRACE   // Stack traces storage
BPF_MAP_TYPE_LPM_TRIE      // Longest prefix match (routing)
```

### Map Declaration (libbpf style)

```c theme={null}
// In eBPF program (kernel side)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);               // Key: PID
    __type(value, u64);             // Value: count
} syscall_count SEC(".maps");

// Using the map
SEC("tracepoint/syscalls/sys_enter_read")
int trace_read(struct trace_event_raw_sys_enter *ctx)
{
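    // Upper 32 bits hold the TGID, which is what user space calls the PID
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *count = bpf_map_lookup_elem(&syscall_count, &pid);
    if (count) {
        // Atomic add: threads on other CPUs may update the same entry
        __sync_fetch_and_add(count, 1);
    } else {
        u64 initial = 1;
        // BPF_NOEXIST: do not overwrite an entry another CPU created first
        bpf_map_update_elem(&syscall_count, &pid, &initial, BPF_NOEXIST);
    }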
    return 0;
}
```

### Ring Buffer vs Perf Buffer

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    RINGBUF VS PERF_EVENT_ARRAY                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PERF_EVENT_ARRAY (Legacy)                                                   │
│  ─────────────────────────                                                  │
│  - Per-CPU buffers (separate buffer per CPU)                                │
│  - User space must poll each CPU                                            │
│  - Can lose events if buffer full                                           │
│  - Higher memory overhead                                                    │
│                                                                              │
│  RINGBUF (Preferred, kernel 5.8+)                                           │
│  ─────────────────────────────────                                          │
│  - Single shared ring buffer                                                 │
│  - Automatic ordering of events                                              │
│  - Reserve/submit API (reserve returns NULL when full)                      │
│  - More efficient memory usage                                               │
│  - Variable-length records                                                   │
│                                                                              │
│  Always use RINGBUF for new code on kernel 5.8+                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

```c theme={null}
// Ring buffer usage
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // 256 KB
} events SEC(".maps");

struct event {
    u32 pid;
    char comm[16];
    u64 timestamp;
};

SEC("tracepoint/sched/sched_process_exec")
int trace_exec(struct trace_event_raw_sched_process_exec *ctx)
{
    struct event *e;
    
    // Reserve space in ring buffer
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    
    // Fill event data
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->timestamp = bpf_ktime_get_ns();
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    
    // Submit to user space
    bpf_ringbuf_submit(e, 0);
    return 0;
}
```

***

## The BPF Verifier

The verifier ensures eBPF programs are safe to run in the kernel.

### Verifier Checks

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        BPF VERIFIER CHECKS                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. CONTROL FLOW                                                             │
│     - No unbounded loops (must have provable termination)                   │
│     - No unreachable instructions                                            │
│     - Complexity limit: 1M verified instructions                             │
│     - Maximum stack depth (512 bytes)                                        │
│                                                                              │
│  2. MEMORY SAFETY                                                            │
│     - All memory accesses must be bounded                                   │
│     - Pointer arithmetic checked                                            │
│     - NULL checks before dereference                                        │
│     - Stack access within bounds                                            │
│                                                                              │
│  3. TYPE SAFETY                                                              │
│     - Registers tracked for type                                            │
│     - Helper function argument types checked                                │
│     - Map key/value types verified                                          │
│                                                                              │
│  4. PRIVILEGE CHECKS                                                         │
│     - Certain helpers require CAP_BPF or CAP_SYS_ADMIN                      │
│     - Some program types restricted                                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Common Verifier Errors

```c theme={null}
// ERROR: Unbounded loop
for (int i = 0; i < n; i++) { }  // n is unknown at verify time

// SOLUTION: Use bounded loop
#pragma unroll
for (int i = 0; i < 100; i++) {
    if (i >= n) break;
}

// ERROR: Potential NULL dereference
u64 *value = bpf_map_lookup_elem(&my_map, &key);
*value = 42;  // ERROR: value might be NULL

// SOLUTION: Check for NULL
u64 *value = bpf_map_lookup_elem(&my_map, &key);
if (value)
    *value = 42;

// ERROR: Out of bounds access
char buf[16];
buf[idx] = 'x';  // ERROR: idx could be >= 16

// SOLUTION: Bound the index (the unsigned compare also rejects a negative idx)
if (idx < sizeof(buf))
    buf[idx] = 'x';
```
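
The bounded-loop pattern can be tried outside the kernel. The following is a minimal sketch in plain user-space C, assuming a hypothetical `find_byte` function and an illustrative `MAX_SCAN` cap (neither is part of any BPF API). The loop bound is a compile-time constant and the runtime length only triggers an early exit, which is the shape the verifier can prove terminates.

```c
#include <stddef.h>

#define MAX_SCAN 64  /* hypothetical compile-time cap on iterations */

/* Return the index of `needle` in the first `len` bytes of `buf`, or -1.
 * Never inspects more than MAX_SCAN bytes, whatever `len` says. */
int find_byte(const unsigned char *buf, size_t len, unsigned char needle)
{
    for (size_t i = 0; i < MAX_SCAN; i++) {
        if (i >= len)          /* runtime bound: early exit only */
            break;
        if (buf[i] == needle)
            return (int)i;
    }
    return -1;
}
```

The trade-off is explicit: data beyond `MAX_SCAN` bytes is never examined, so the cap has to be chosen for the largest input the program needs to handle.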

### Verifier Debugging

```bash theme={null}
# Get verbose verifier output
sudo bpftool prog load program.o /sys/fs/bpf/prog -d

# Dump the translated instructions of a loaded program (this is not the verifier log)
sudo bpftool prog dump xlated id <prog_id>
```

***

## Helper Functions

eBPF programs can call kernel-provided helper functions.

### Common Helpers

```c theme={null}
// Get current time in nanoseconds
u64 ts = bpf_ktime_get_ns();

// Get current PID/TID (upper 32 bits: TGID, i.e. the user-space PID; lower 32 bits: thread ID)
u64 pid_tgid = bpf_get_current_pid_tgid();
u32 pid = pid_tgid >> 32;
u32 tid = pid_tgid & 0xFFFFFFFF;

// Get current task's comm
char comm[16];
bpf_get_current_comm(&comm, sizeof(comm));

// Get stack trace ID (negative error code on failure, so keep it signed)
long stack_id = bpf_get_stackid(ctx, &stack_map, 0);

// Read from kernel memory
bpf_probe_read_kernel(&dst, size, src);

// Read from user memory
bpf_probe_read_user(&dst, size, src);

// Read from user string
bpf_probe_read_user_str(&dst, size, src);

// Print debug message (limited, for development)
bpf_printk("PID %d called\n", pid);

// Send signal to current task
bpf_send_signal(SIGKILL);

// Get current cgroup ID
u64 cgroup_id = bpf_get_current_cgroup_id();
```
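
Several helpers pack two 32-bit IDs into one 64-bit return value: `bpf_get_current_pid_tgid()` puts the TGID in the upper half and the thread ID in the lower half, and `bpf_get_current_uid_gid()` puts the GID in the upper half and the UID in the lower half. A small user-space sketch of the unpacking arithmetic, using hypothetical function names chosen for illustration:

```c
#include <stdint.h>

/* Model of the value returned by bpf_get_current_pid_tgid():
 * TGID (the user-space PID) in the upper 32 bits, thread ID in the lower 32. */
uint64_t pack_pid_tgid(uint32_t tgid, uint32_t tid)
{
    return ((uint64_t)tgid << 32) | tid;
}

/* Upper half: TGID for pid_tgid, GID for uid_gid. */
uint32_t upper32(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Lower half: thread ID for pid_tgid, UID for uid_gid. */
uint32_t lower32(uint64_t v)
{
    return (uint32_t)v;
}
```

Mixing up the two halves is an easy mistake to miss, because both values are plausible-looking integers.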

### Available Helpers Per Program Type

```bash theme={null}
# List helpers available for a program type
sudo bpftool feature probe kernel | grep -A 50 "helpers supported for program type kprobe"
```

***

## BPF CO-RE (Compile Once, Run Everywhere)

CO-RE solves the problem of kernel version differences.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            BPF CO-RE                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Without CO-RE:                                                              │
│  - Compile eBPF program for each kernel version                             │
│  - Must match exact struct layouts                                          │
│  - Breaks when kernel changes struct fields                                  │
│                                                                              │
│  With CO-RE:                                                                 │
│  - Compile once with BTF (BPF Type Format)                                  │
│  - libbpf adjusts offsets at load time                                      │
│  - Works across kernel versions                                              │
│                                                                              │
│  Example: struct task_struct                                                 │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │  Kernel 5.4:                  Kernel 5.15:                         │     │
│  │  struct task_struct {         struct task_struct {                  │     │
│  │    ...                          ...                                 │     │
│  │    pid_t pid;  // offset 100    void *new_field;  // added         │     │
│  │    ...                          pid_t pid;  // offset 108 (moved!) │     │
│  │  }                            }                                     │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  CO-RE reads BTF from running kernel, adjusts offsets automatically         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
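
The offset problem in the diagram can be reproduced in ordinary C. This sketch assumes two hypothetical layouts, `struct task_old` and `struct task_new`, standing in for `task_struct` on two kernels. `offsetof` plays the role of the BTF lookup that libbpf performs at load time. It is an illustration of the idea, not how libbpf is implemented.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two hypothetical layouts of the "same" struct on different kernels. */
struct task_old {
    char pad[100];
    int32_t pid;
};

struct task_new {
    char pad[100];
    void *new_field;   /* added field shifts everything after it */
    int32_t pid;
};

/* Read a 4-byte field at an offset that was resolved at "load time". */
int32_t read_i32_at(const void *base, size_t offset)
{
    int32_t value;
    memcpy(&value, (const char *)base + offset, sizeof(value));
    return value;
}
```

A reader compiled with the old offset hard-coded would fetch the wrong bytes from a `struct task_new`. Resolving the offset per layout, as in `read_i32_at(&t, offsetof(struct task_new, pid))`, keeps the same code correct on both.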

### CO-RE Code Example

```c theme={null}
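#include "vmlinux.h"  // Generated from kernel BTF
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>   // BPF_KPROBE
#include <bpf/bpf_core_read.h>

// Note: do_sys_open may be inlined or missing on newer kernels
SEC("kprobe/do_sys_open")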
int BPF_KPROBE(trace_open, int dfd, const char *filename)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    
    // CO-RE: works across kernel versions
    pid_t pid = BPF_CORE_READ(task, pid);
    pid_t tgid = BPF_CORE_READ(task, tgid);
    
    // Read parent's PID (nested struct access)
    pid_t ppid = BPF_CORE_READ(task, real_parent, pid);
    
    bpf_printk("PID %d (parent %d) opening file\n", pid, ppid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";  // bpf_printk is a GPL-only helper
```

***

## bpftrace - High-Level Tracing

bpftrace is the easiest way to write eBPF programs.

### bpftrace Basics

```bash theme={null}
# Trace all syscalls
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'

# Trace open() calls with filename
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { 
    printf("%s opened %s\n", comm, str(args->filename)); 
}'

# Histogram of read() sizes
sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { 
    @size = hist(args->ret); 
}'

# Track time spent in functions
sudo bpftrace -e 'kprobe:do_sys_open { @start[tid] = nsecs; }
                  kretprobe:do_sys_open /@start[tid]/ { 
                      @duration = hist(nsecs - @start[tid]); 
                      delete(@start[tid]); 
                  }'
```

### bpftrace One-Liners for Observability

```bash theme={null}
# Top syscalls by process
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Disk I/O latency histogram
sudo bpftrace -e 'kprobe:blk_account_io_start { @start[arg0] = nsecs; }
    kprobe:blk_account_io_done /@start[arg0]/ { 
        @us = hist((nsecs - @start[arg0]) / 1000); 
        delete(@start[arg0]); 
    }'

# TCP connections
sudo bpftrace -e 'kprobe:tcp_connect { 
    @[comm] = count(); 
    printf("%s connecting\n", comm); 
}'

# Page faults by process
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { 
    @[comm] = count(); 
}'

# Memory allocations
sudo bpftrace -e 'tracepoint:kmem:kmalloc { 
    @bytes = hist(args->bytes_alloc); 
}'

# Context switches
sudo bpftrace -e 'tracepoint:sched:sched_switch { 
    @[args->prev_comm] = count(); 
}'

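# Blocking context switches by process (counts switches where the task left the CPU in a non-running state; this is a count, not a duration)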
sudo bpftrace -e 'tracepoint:sched:sched_switch { 
    @[args->prev_comm] = sum(args->prev_state != 0 ? 1 : 0); 
}'
```

### bpftrace Variables

| Variable  | Description                  |
| --------- | ---------------------------- |
| `pid`     | Process ID                   |
| `tid`     | Thread ID                    |
| `uid`     | User ID                      |
| `comm`    | Process name                 |
| `nsecs`   | Timestamp (nanoseconds)      |
| `cpu`     | CPU number                   |
| `curtask` | Current task\_struct pointer |
| `args`    | Tracepoint arguments         |
| `retval`  | Return value (kretprobe)     |

***

## Production eBPF Tools

### BCC (BPF Compiler Collection)

```bash theme={null}
# Install BCC tools
sudo apt install bpfcc-tools linux-headers-$(uname -r)

# Trace process execution
sudo execsnoop-bpfcc

# Trace open() calls
sudo opensnoop-bpfcc

# Trace TCP connections
sudo tcpconnect-bpfcc
sudo tcpaccept-bpfcc
sudo tcpretrans-bpfcc

# Profile on-CPU time
sudo profile-bpfcc -F 99 10

# Trace block I/O
sudo biolatency-bpfcc

# Trace filesystem operations
sudo ext4slower-bpfcc 1

# Memory allocation tracing
sudo memleak-bpfcc

# Cache hit ratio
sudo cachestat-bpfcc
```

### libbpf-based Tools

```c theme={null}
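// Modern libbpf skeleton approach
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <bpf/libbpf.h>
#include "trace.skel.h"

static volatile sig_atomic_t exiting;

static void on_signal(int sig)
{
    exiting = 1;
}

// Called once per record submitted by the BPF program
static int handle_event(void *ctx, void *data, size_t size)
{
    printf("event: %zu bytes\n", size);
    return 0;
}

int main(int argc, char **argv)
{
    struct trace_bpf *skel;
    struct ring_buffer *rb = NULL;  // NULL so cleanup is safe on early exit
    int err;

    signal(SIGINT, on_signal);

    // Open and load BPF program
    skel = trace_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to load BPF skeleton\n");
        return 1;
    }

    // Attach BPF programs
    err = trace_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF skeleton\n");
        goto cleanup;
    }

    // Set up ring buffer callback
    rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
                          handle_event, NULL, NULL);
    if (!rb) {
        err = -1;
        fprintf(stderr, "Failed to create ring buffer\n");
        goto cleanup;
    }

    // Poll for events; returns the number of records consumed or a negative error
    while (!exiting) {
        err = ring_buffer__poll(rb, 100 /* timeout in ms */);
        if (err == -EINTR) {  // interrupted by a signal, e.g. Ctrl-C
            err = 0;
            break;
        }
        if (err < 0) {
            fprintf(stderr, "Error polling ring buffer: %d\n", err);
            break;
        }
    }

cleanup:
    ring_buffer__free(rb);
    trace_bpf__destroy(skel);
    return err < 0 ? 1 : 0;
}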
```

***

## XDP (eXpress Data Path)

XDP allows packet processing at the network driver level.

### XDP Program

```c theme={null}
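#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>          // IPPROTO_ICMP
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>   // SEC()
#include <bpf/bpf_endian.h>    // bpf_htons()

SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Parse Ethernet header (the bounds check is required by the verifier)
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // Only handle IPv4 (h_proto is in network byte order)
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    // Parse IP header (fixed 20-byte part, enough to read the protocol field)
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // Drop ICMP packets
    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";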
```

### XDP Return Codes

| Code           | Action                         |
| -------------- | ------------------------------ |
| `XDP_PASS`     | Pass to network stack          |
| `XDP_DROP`     | Drop packet                    |
| `XDP_TX`       | Bounce back out same interface |
| `XDP_REDIRECT` | Redirect to another interface  |
| `XDP_ABORTED`  | Error, drop and trace          |
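
The parse-then-decide logic of an XDP program can be exercised in user space before loading anything into the kernel. The sketch below mirrors the ICMP filter on a plain byte buffer. `classify`, `enum verdict`, and the constants are hypothetical stand-ins defined here so the example builds without kernel headers. It is a model of the bounds-checking discipline, not a replacement for testing the real program.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for XDP_PASS / XDP_DROP so the sketch builds without kernel headers. */
enum verdict { VERDICT_PASS, VERDICT_DROP };

#define ETH_HDR_LEN     14      /* dst MAC + src MAC + EtherType */
#define IPV4_MIN_HDR    20      /* IPv4 header without options */
#define ETHERTYPE_IPV4  0x0800
#define IP_PROTO_ICMP   1

enum verdict classify(const uint8_t *data, const uint8_t *data_end)
{
    size_t len = (size_t)(data_end - data);

    /* Every header must fit inside the buffer before it is read. */
    if (len < ETH_HDR_LEN)
        return VERDICT_PASS;

    /* EtherType is big-endian on the wire; assemble it byte by byte. */
    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype != ETHERTYPE_IPV4)
        return VERDICT_PASS;

    if (len < ETH_HDR_LEN + IPV4_MIN_HDR)
        return VERDICT_PASS;

    /* The protocol field sits at byte 9 of the IPv4 header. */
    if (data[ETH_HDR_LEN + 9] == IP_PROTO_ICMP)
        return VERDICT_DROP;

    return VERDICT_PASS;
}
```

Truncated packets fall through to `VERDICT_PASS`, matching the choice in the XDP program to let the network stack deal with anything it cannot parse.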

***

## Lab Exercises

<AccordionGroup>
  <Accordion title="Lab 1: First bpftrace Script" icon="terminal">
    **Objective**: Write basic tracing scripts

    ```bash theme={null}
    # List available tracepoints
    sudo bpftrace -l 'tracepoint:*' | head -50

    # Trace file opens with latency
    sudo bpftrace -e '
    tracepoint:syscalls:sys_enter_openat {
        @start[tid] = nsecs;
        @fname[tid] = args->filename;
    }
    tracepoint:syscalls:sys_exit_openat /@start[tid]/ {
        $dur = (nsecs - @start[tid]) / 1000;
        printf("%s opened %s in %d us (fd=%d)\n", 
               comm, str(@fname[tid]), $dur, args->ret);
        delete(@start[tid]);
        delete(@fname[tid]);
    }'

    # Histogram of process lifetimes
    sudo bpftrace -e '
    tracepoint:sched:sched_process_fork {
        @birth[args->child_pid] = nsecs;
    }
    tracepoint:sched:sched_process_exit /@birth[args->pid]/ {
        @lifetime = hist((nsecs - @birth[args->pid]) / 1000000);
        delete(@birth[args->pid]);
    }'
    ```
  </Accordion>

  <Accordion title="Lab 2: Write libbpf Program" icon="code">
    **Objective**: Create a complete eBPF program with libbpf

    ```c theme={null}
    // trace.bpf.c
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct event {
        u32 pid;
        u32 uid;
        char comm[16];
        char filename[256];
    };

    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);
    } events SEC(".maps");

    SEC("tracepoint/syscalls/sys_enter_openat")
    int tracepoint__syscalls__sys_enter_openat(
        struct trace_event_raw_sys_enter *ctx)
    {
        struct event *e;
        
        e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e)
            return 0;
        
        e->pid = bpf_get_current_pid_tgid() >> 32;
        // Lower 32 bits hold the UID (the upper 32 bits are the GID)
        e->uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
        bpf_get_current_comm(&e->comm, sizeof(e->comm));
        bpf_probe_read_user_str(&e->filename, sizeof(e->filename),
                                (void *)ctx->args[1]);
        
        bpf_ringbuf_submit(e, 0);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";
    ```

    Build with:

    ```bash theme={null}
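    # Generate vmlinux.h from the running kernel's BTF (needed by trace.bpf.c)
    bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
    clang -g -O2 -target bpf -c trace.bpf.c -o trace.bpf.o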
    bpftool gen skeleton trace.bpf.o > trace.skel.h
    ```
  </Accordion>

  <Accordion title="Lab 3: Performance Profiling with eBPF" icon="chart-line">
    **Objective**: Profile production workloads

    ```bash theme={null}
    # CPU profiling with flame graphs
    sudo profile-bpfcc -F 99 -f 30 > profile.txt
    flamegraph.pl profile.txt > profile.svg

    # Off-CPU analysis
    sudo offcputime-bpfcc -f 30 > offcpu.txt
    flamegraph.pl --color=io offcpu.txt > offcpu.svg

    # Combined on/off CPU
    sudo bpftrace -e '
    profile:hz:99 { @on[kstack] = count(); }
    tracepoint:sched:sched_switch { 
        @off[kstack] = count(); 
    }' -c "sleep 10"

    # Latency analysis
    sudo funclatency-bpfcc do_sys_open -m
    ```
  </Accordion>
</AccordionGroup>

***

## Interview Questions

<AccordionGroup>
  <Accordion title="Q1: Explain the eBPF verifier and why it's necessary" icon="question">
    **Answer**:

    The verifier ensures eBPF programs are safe to run in kernel context.

    **Why necessary**:

    * eBPF runs with kernel privileges
    * Bugs could crash the kernel or leak data
    * Must guarantee termination (no infinite loops)
    * Must prevent out-of-bounds memory access

    **Key checks**:

    1. **Control flow**: No unbounded loops, reachable exit
    2. **Memory safety**: All accesses bounds-checked
    3. **Type safety**: Correct helper function arguments
    4. **Privilege**: Capability checks for dangerous operations

    **Limitations**:

    * Some valid programs are rejected (the analysis is conservative, so it produces false positives)
    * Complex loop bounds hard to prove
    * Instruction count limits
  </Accordion>

  <Accordion title="Q2: What's the difference between kprobes and tracepoints?" icon="question">
    **Answer**:

    | Aspect        | Kprobes              | Tracepoints            |
    | ------------- | -------------------- | ---------------------- |
    | Type          | Dynamic              | Static                 |
    | Attach points | Any function         | Predefined only        |
    | ABI stability | None                 | Maintained             |
    | Overhead      | Higher               | Lower                  |
    | Arguments     | Read from stack/regs | Structured, documented |
    | Cross-kernel  | May break            | Stable                 |

    **Best practices**:

    * Use tracepoints when available (stable, efficient)
    * Use kprobes for specific functions not covered
    * CO-RE helps with kprobe portability
    * Document kprobe usage for maintenance
  </Accordion>

  <Accordion title="Q3: How would you use eBPF to debug latency in a microservice?" icon="question">
    **Answer**:

    **Approach**:

    1. **Identify entry/exit points**:

    ```bash theme={null}
    # Trace HTTP request handling
    sudo bpftrace -e '
    uprobe:/path/to/service:handleRequest { @start[tid] = nsecs; }
    uretprobe:/path/to/service:handleRequest /@start[tid]/ {
        @latency = hist((nsecs - @start[tid]) / 1000000);
        delete(@start[tid]);
    }'
    ```

    2. **Break down latency**:

    * Trace syscalls (read, write, connect)
    * Trace specific functions (DB queries, cache lookups)
    * Measure off-CPU time (blocking)

    3. **Identify bottlenecks**:

    ```bash theme={null}
    # Stack traces for slow operations
    sudo bpftrace -e 'uprobe:... /@lat > 10000000/ { print(ustack); }'
    ```

    4. **Production-safe approach**:

    * Start with low-overhead tracepoints
    * Sample rather than trace all events
    * Use ring buffer for event collection
    * Set reasonable map sizes
  </Accordion>

  <Accordion title="Q4: Explain XDP and when you would use it" icon="question">
    **Answer**:

    **What is XDP**:

    * Runs eBPF at network driver level
    * Before sk\_buff allocation (very early)
    * Near-native speed packet processing

    **Use cases**:

    * DDoS mitigation (drop malicious packets)
    * Load balancing (Facebook's Katran)
    * Packet filtering (Cloudflare)
    * Traffic steering

    **Performance**:

    * Can process 10M+ packets per second per core
    * 10-100x faster than iptables for simple rules

    **Limitations**:

    * Ingress only (no egress hook), and no `sk_buff` metadata is available yet
    * Driver must support XDP
    * Complex protocols need network stack

    **Comparison with TC**:

    * XDP: Earlier, faster, limited
    * TC: After sk\_buff, full networking features
  </Accordion>
</AccordionGroup>

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="eBPF Safety" icon="shield">
    Verifier ensures programs are safe before running in kernel
  </Card>

  <Card title="Maps for Data" icon="database">
    BPF maps enable data sharing between kernel and user space
  </Card>

  <Card title="CO-RE Portability" icon="arrows-rotate">
    Compile once, run everywhere with BTF and libbpf
  </Card>

  <Card title="bpftrace Power" icon="terminal">
    High-level scripting for quick observability tasks
  </Card>
</CardGroup>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="You need to build a production monitoring tool that tracks the latency of every disk I/O operation per container. Describe your eBPF-based approach, including the specific program types, maps, and how you would handle the per-container attribution." icon="message">
    **Strong Answer:**

    * I would use two tracepoint programs: `tracepoint/block/block_rq_issue` to record when a block I/O request is dispatched to the device driver, and `tracepoint/block/block_rq_complete` to record when it completes. On issue, I would store the timestamp keyed by `(dev, sector)` in a BPF hash map. On completion, I would look up the start time, compute the delta, and emit a latency event.
    * For per-container attribution, I would use `bpf_get_current_cgroup_id()` at the issue tracepoint to capture the cgroup ID of the process that initiated the I/O. I would store this alongside the timestamp in the hash map, so the completion handler can attribute the latency to the correct container even though the completion runs in interrupt context (where the "current" task is arbitrary).
    * For efficient data export, I would use a `BPF_MAP_TYPE_PERCPU_HASH` map keyed by `(cgroup_id, latency_bucket)` to build a per-container latency histogram directly in kernel space. The user-space agent reads this map periodically (every 5-10 seconds), aggregates across CPUs, and exports to Prometheus. This approach avoids per-event ring buffer overhead.
    * For production safety: bounded map sizes (10240 entries), PERCPU maps to avoid lock contention, and the programs attach to stable tracepoints (not kprobes) for kernel version compatibility.

    **Follow-up:** What happens if the hash map fills up because I/O requests are issued faster than they complete?

    **Follow-up Answer:**

    * If the `(dev, sector)` tracking map fills up, `bpf_map_update_elem()` fails with `-E2BIG` and the new entry is not inserted; unless the program checks the return value, the drop is silent. The corresponding completion event will not find a matching start timestamp and will skip that I/O. This means we lose visibility into some requests during extreme load, but the program remains safe and does not crash or block I/O. To mitigate, I would size the map based on the expected maximum I/O queue depth across all devices (for NVMe with 64K queue depth per queue, this could be large). I would also add a per-CPU counter for dropped entries so the monitoring system can alert when we are losing data.
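
    The bucketing and the cross-CPU aggregation described in this answer are both small pieces of arithmetic. The sketch below shows one possible shape in plain C, assuming power-of-two buckets in microseconds and hypothetical names (`latency_bucket`, `sum_percpu`, `MAX_BUCKET`). The real kernel-side code would compute the bucket inside the BPF program and the agent would sum the values returned by a per-CPU map lookup.

    ```c
    #include <stdint.h>

    #define MAX_BUCKET 31  /* hypothetical cap, also keeps the loop bounded */

    /* log2 bucket: bucket b covers [2^b, 2^(b+1)) microseconds.
     * 0 us shares bucket 0, and everything at or above 2^31 us lands in bucket 31. */
    unsigned latency_bucket(uint64_t latency_us)
    {
        unsigned b = 0;
        while (latency_us > 1 && b < MAX_BUCKET) {
            latency_us >>= 1;
            b++;
        }
        return b;
    }

    /* User-space side: a per-CPU map lookup yields one value per CPU; sum them. */
    uint64_t sum_percpu(const uint64_t *per_cpu, unsigned ncpus)
    {
        uint64_t total = 0;
        for (unsigned i = 0; i < ncpus; i++)
            total += per_cpu[i];
        return total;
    }
    ```

    With 32 buckets per cgroup, the histogram map stays small and its size depends on the number of containers, not on the I/O rate.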
  </Accordion>

  <Accordion title="Explain the BPF verifier in detail. What invariants does it enforce, why are bounded loops necessary, and describe a real scenario where a valid program is rejected by the verifier." icon="message">
    **Strong Answer:**

    * The BPF verifier is a static analyzer that runs at program load time (before any execution) and simulates every possible execution path through the program. It maintains a state machine tracking the type, value range, and liveness of each register and stack slot at every instruction.
    * Key invariants enforced: First, termination -- every loop must have a provable upper bound. The verifier tracks loop iterations and rejects programs where it cannot prove the loop exits within a bounded number of iterations. Second, memory safety -- every pointer dereference must be preceded by a bounds check. If `bpf_map_lookup_elem()` returns a pointer, the verifier marks it as "possibly NULL" and requires an explicit NULL check before dereferencing. Third, type safety -- the verifier tracks which registers contain pointers to map values, packet data, stack, or scalars, and ensures they are used correctly (you cannot pass a packet pointer where a map pointer is expected). Fourth, privilege -- certain helper functions require `CAP_BPF` or `CAP_SYS_ADMIN`.
    * A real rejection scenario: you write a program that iterates over a linked list of variable length in packet data. Even if you add `if (i >= MAX_ITERATIONS) break;`, the verifier might reject it because the packet pointer arithmetic creates too many possible states. The verifier explores states exponentially, and complex pointer arithmetic with conditional branches can exceed the instruction count limit (1 million verified instructions). The fix is to restructure the code to reduce branching complexity, use `bpf_loop()` helper (kernel 5.17+), or split the logic into multiple tail-called programs.

    **Follow-up:** How does BPF CO-RE (Compile Once, Run Everywhere) interact with the verifier to enable cross-kernel portability?

    **Follow-up Answer:**

    * CO-RE works by embedding relocation information in the compiled BPF program's ELF file. When you use `BPF_CORE_READ(task, pid)`, the compiler emits a relocation record saying "access field `pid` at offset X in struct `task_struct`." At load time, libbpf reads the running kernel's BTF (BPF Type Format) data to find the actual offset of `pid` in the current kernel's `task_struct`, which may differ from the compile-time offset. libbpf patches the BPF instructions to use the correct offset before submitting the program to the verifier, so the verifier sees a program with correct offsets for the running kernel and validates it normally. If the field does not exist at all (e.g., a field removed in a newer kernel), the relocation cannot be resolved. The program can guard such accesses with `bpf_core_field_exists()`, which libbpf resolves to a constant at load time, so the verifier treats the unavailable branch as dead code and the program can fall back to a default value or skip that logic.
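
    The load-time patching step can be illustrated with a toy user-space model. Everything here is an assumption made for illustration: `task_v1` and `task_v2` are made-up layouts standing in for `task_struct` on two kernels, `btf_field` is a stand-in for BTF, and `relocate()` is a hypothetical name for what libbpf does internally with far more machinery.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical layouts standing in for task_struct on two kernels. */
    struct task_v1 { uint64_t state; int32_t pid; };
    struct task_v2 { uint64_t state; uint64_t flags; int32_t pid; };

    struct btf_field { const char *name; size_t offset; };          /* stand-in for BTF */
    struct reloc { const char *field; size_t offset; int resolved; };

    static const struct btf_field btf_v1[] = {
        { "state", offsetof(struct task_v1, state) },
        { "pid",   offsetof(struct task_v1, pid) },
    };
    static const struct btf_field btf_v2[] = {
        { "state", offsetof(struct task_v2, state) },
        { "flags", offsetof(struct task_v2, flags) },
        { "pid",   offsetof(struct task_v2, pid) },
    };

    /* Load-time step: replace the compile-time offset with the target's offset. */
    static int relocate(struct reloc *r, const struct btf_field *btf, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (strcmp(btf[i].name, r->field) == 0) {
                r->offset = btf[i].offset;
                r->resolved = 1;
                return 0;
            }
        }
        r->resolved = 0;        /* field missing on this "kernel" */
        return -1;
    }

    /* Run-time step: read a 32-bit field through the patched offset. */
    static int32_t read_i32(const void *base, const struct reloc *r)
    {
        int32_t value;

        memcpy(&value, (const char *)base + r->offset, sizeof value);
        return value;
    }
    ```

    A relocation recorded against `task_v1` reads the wrong bytes from a `task_v2` object until `relocate()` patches it with the `btf_v2` table, which is the same reason a BPF program compiled against one kernel's headers needs its offsets fixed up for another.
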
  </Accordion>

  <Accordion title="Compare XDP and TC (traffic control) BPF programs for packet processing. When would you choose each, and what are the performance characteristics?" icon="message">
    **Strong Answer:**

    * XDP (eXpress Data Path) programs run at the earliest possible point in the network receive path, before the kernel allocates an `sk_buff` structure. They operate on raw `xdp_md` contexts with direct packet data access. TC BPF programs run later, after `sk_buff` allocation, at the traffic control layer in both ingress and egress paths.
    * Performance difference is significant: XDP can process roughly 10-20 million packets per second per core because it avoids the per-packet overhead of sk\_buff allocation (\~200-300 cycles). TC typically processes 2-5 million packets per second per core, which is still much faster than iptables but slower than XDP. Exact figures depend on the NIC, the driver, and the complexity of the program.
    * I would choose XDP for: DDoS mitigation (drop malicious packets before they consume memory), load balancing (redirect packets to different NICs or CPUs), and simple packet filtering (firewalling at line rate). Facebook's Katran load balancer and Cloudflare's DDoS mitigation use XDP.
    * I would choose TC for: more complex packet manipulation (full sk\_buff available with all parsed headers), egress path processing (XDP only works on ingress), container networking (Cilium uses TC BPF for pod-to-pod traffic because it needs access to socket-level metadata), and when compatibility with the full networking stack is needed (TC programs can interact with connection tracking, netfilter marks, etc.).
    * XDP limitations: it has no `sk_buff` metadata, so segmentation offload (GSO) state and other information the stack attaches later is unavailable; it cannot directly interact with the socket layer; and it requires NIC driver support for native mode (otherwise it falls back to the slower generic mode).
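
    The kind of early drop decision XDP is suited to can be sketched as a user-space mock. This is not a loadable XDP program: the `MOCK_` constants mirror the values of `XDP_DROP` and `XDP_PASS`, `filter_frame()` is a hypothetical name, and `data`/`data_end` stand in for the pointers an `xdp_md` context exposes.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    enum { MOCK_XDP_DROP = 1, MOCK_XDP_PASS = 2 };  /* same values as XDP_DROP / XDP_PASS */

    #define MOCK_ETH_HLEN      14   /* Ethernet header length */
    #define MOCK_IPV4_MIN_HLEN 20   /* IPv4 header without options */

    /* Drop IPv4 frames from one blocked source address; pass everything else. */
    static int filter_frame(const uint8_t *data, const uint8_t *data_end, uint32_t blocked_src)
    {
        size_t len = (size_t)(data_end - data);

        /* Bounds check first: never read a header that is not fully present. */
        if (len < MOCK_ETH_HLEN + MOCK_IPV4_MIN_HLEN)
            return MOCK_XDP_PASS;
        if (data[12] != 0x08 || data[13] != 0x00)   /* EtherType is not IPv4 */
            return MOCK_XDP_PASS;

        const uint8_t *ip = data + MOCK_ETH_HLEN;
        uint32_t src = ((uint32_t)ip[12] << 24) | ((uint32_t)ip[13] << 16) |
                       ((uint32_t)ip[14] << 8)  |  (uint32_t)ip[15];

        return src == blocked_src ? MOCK_XDP_DROP : MOCK_XDP_PASS;
    }
    ```

    The function only touches raw bytes between `data` and `data_end`, which reflects what is available at the XDP hook: there are no parsed headers or socket metadata to consult, as there would be in a TC program.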

    **Follow-up:** How does AF\_XDP enable zero-copy packet processing, and when would you use it instead of regular XDP?

    **Follow-up Answer:**

    * AF\_XDP is a socket type that works with XDP to deliver raw packets directly to user space. An XDP program calls `bpf_redirect_map()` on an `XSKMAP` (XDP socket map), which returns `XDP_REDIRECT` and steers the packet to the AF\_XDP socket. The user-space application creates that socket over a shared UMEM (user memory) region. When the NIC driver supports zero-copy mode, packet data is DMA'd directly from the NIC into this user-space-accessible memory and is never copied by the kernel; without driver support, AF\_XDP falls back to copy mode. I would use AF\_XDP when the application needs to process every packet (not just filter/drop), such as custom protocol implementations, high-frequency trading network stacks, or DPDK-like applications that want kernel bypass without the complexity of a full DPDK setup. The trade-off versus pure XDP is that AF\_XDP adds user-space processing latency, while XDP programs run to completion in the kernel.
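
    AF\_XDP passes ownership of UMEM frames between user space and the kernel through descriptor rings (a fill ring and an RX ring on the receive side). The single-threaded model below sketches that hand-off under simplifying assumptions: real rings are shared memory with memory barriers, `deliver()` is a hypothetical stand-in for the kernel/NIC side, and the `memcpy` stands in for the NIC's DMA write into UMEM.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    #define FRAME_SIZE 2048
    #define NUM_FRAMES 4
    #define RING_SIZE  4    /* power of two so masking wraps the indices */

    struct desc { uint64_t addr; uint32_t len; };   /* UMEM offset + packet length */
    struct ring { struct desc slots[RING_SIZE]; uint32_t prod, cons; };

    static uint8_t umem[NUM_FRAMES * FRAME_SIZE];   /* shared packet buffer area */

    static int ring_push(struct ring *r, struct desc d)
    {
        if (r->prod - r->cons == RING_SIZE)
            return -1;                              /* ring full */
        r->slots[r->prod & (RING_SIZE - 1)] = d;
        r->prod++;
        return 0;
    }

    static int ring_pop(struct ring *r, struct desc *d)
    {
        if (r->prod == r->cons)
            return -1;                              /* ring empty */
        *d = r->slots[r->cons & (RING_SIZE - 1)];
        r->cons++;
        return 0;
    }

    /* "Kernel" side: take a free frame from the fill ring, place the packet
     * in UMEM, and publish its descriptor on the RX ring. */
    static int deliver(struct ring *fill, struct ring *rx, const void *pkt, uint32_t len)
    {
        struct desc d;

        if (len > FRAME_SIZE || ring_pop(fill, &d) != 0)
            return -1;                              /* no free frame: packet is dropped */
        memcpy(umem + d.addr, pkt, len);            /* stands in for the NIC's DMA write */
        d.len = len;
        return ring_push(rx, d);    /* cannot be full while NUM_FRAMES <= RING_SIZE */
    }
    ```

    User space reads the payload in place at `umem + d.addr` and then pushes the descriptor back onto the fill ring, so only descriptors move between the two sides, never the packet bytes.
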
  </Accordion>
</AccordionGroup>

***

Next: [Tracing Infrastructure →](/courses/linux-internals/tracing)