eBPF - Safely extend the kernel at runtime

eBPF Deep Dive

eBPF (extended Berkeley Packet Filter) is the technology revolutionizing observability, networking, and security in Linux. Companies like Datadog, Grafana, and Cloudflare use eBPF extensively. Mastering it is essential for infrastructure and observability engineering roles.
Interview Frequency: Very High (core skill for observability roles)
Key Topics: BPF architecture, program types, maps, verifier, bpftrace
Time to Master: 18-20 hours

What is eBPF?

eBPF allows running sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules.

eBPF Program Types

Different attach points for different use cases:

Core Program Types

| Type | Attach Point | Use Case |
|------|--------------|----------|
| BPF_PROG_TYPE_KPROBE | Kernel function entry/exit | Tracing syscalls, kernel functions |
| BPF_PROG_TYPE_TRACEPOINT | Static kernel tracepoints | Stable tracing points |
| BPF_PROG_TYPE_PERF_EVENT | perf events (PMU, software) | Performance monitoring |
| BPF_PROG_TYPE_XDP | Network driver (before SKB) | High-performance packet processing |
| BPF_PROG_TYPE_SCHED_CLS | Traffic control classifier | Container networking |
| BPF_PROG_TYPE_SOCKET_FILTER | Socket | Packet filtering |
| BPF_PROG_TYPE_CGROUP_* | Cgroup hooks | Container resource control |
| BPF_PROG_TYPE_LSM | LSM hooks | Security policies |

Kprobes vs Tracepoints

┌─────────────────────────────────────────────────────────────────────────────┐
│                    KPROBES VS TRACEPOINTS                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  KPROBES (Dynamic)                                                           │
│  ────────────────                                                           │
│  + Can attach to ANY kernel function                                        │
│  + Very flexible for debugging                                               │
│  - Unstable ABI (function signatures can change)                            │
│  - Higher overhead                                                           │
│  - May break between kernel versions                                         │
│                                                                              │
│  TRACEPOINTS (Static)                                                        │
│  ───────────────────                                                        │
│  + Stable ABI (maintained by kernel developers)                             │
│  + Lower overhead (optimized instrumentation)                               │
│  + Documented arguments                                                      │
│  - Limited to predefined points                                             │
│  - May not cover everything you need                                         │
│                                                                              │
│  Best Practice: Prefer tracepoints when available, use kprobes when needed  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Available Tracepoints

# List all tracepoints
sudo ls /sys/kernel/debug/tracing/events/

# List syscall tracepoints
sudo ls /sys/kernel/debug/tracing/events/syscalls/

# View tracepoint format
sudo cat /sys/kernel/debug/tracing/events/sched/sched_switch/format

BPF Maps

Maps are key-value stores shared between eBPF programs and user space.

Map Types

// Common map types
BPF_MAP_TYPE_HASH          // Hash table
BPF_MAP_TYPE_ARRAY         // Array (fixed-size)
BPF_MAP_TYPE_PERCPU_HASH   // Per-CPU hash (no locking)
BPF_MAP_TYPE_PERCPU_ARRAY  // Per-CPU array
BPF_MAP_TYPE_RINGBUF       // Ring buffer (efficient event streaming)
BPF_MAP_TYPE_PERF_EVENT_ARRAY  // Per-CPU event buffer
BPF_MAP_TYPE_LRU_HASH      // LRU evicting hash
BPF_MAP_TYPE_STACK_TRACE   // Stack traces storage
BPF_MAP_TYPE_LPM_TRIE      // Longest prefix match (routing)

Map Declaration (libbpf style)

// In eBPF program (kernel side)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);               // Key: PID
    __type(value, u64);             // Value: count
} syscall_count SEC(".maps");

// Using the map
SEC("tracepoint/syscalls/sys_enter_read")
int trace_read(struct trace_event_raw_sys_enter *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *count = bpf_map_lookup_elem(&syscall_count, &pid);
    if (count) {
        // Atomic add: several CPUs may update the same PID's counter at once
        __sync_fetch_and_add(count, 1);
    } else {
        u64 initial = 1;
        bpf_map_update_elem(&syscall_count, &pid, &initial, BPF_ANY);
    }
    return 0;
}

Ring Buffer vs Perf Buffer

┌─────────────────────────────────────────────────────────────────────────────┐
│                    RINGBUF VS PERF_EVENT_ARRAY                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PERF_EVENT_ARRAY (Legacy)                                                   │
│  ─────────────────────────                                                  │
│  - Per-CPU buffers (separate buffer per CPU)                                │
│  - User space must poll each CPU                                            │
│  - Can lose events if buffer full                                           │
│  - Higher memory overhead                                                    │
│                                                                              │
│  RINGBUF (Preferred, kernel 5.8+)                                           │
│  ─────────────────────────────────                                          │
│  - Single shared ring buffer                                                 │
│  - Automatic ordering of events                                              │
│  - Reservation-based (reserve returns NULL when full; drops are explicit)   │
│  - More efficient memory usage                                               │
│  - Variable-length records                                                   │
│                                                                              │
│  Always use RINGBUF for new code on kernel 5.8+                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
// Ring buffer usage
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // 256 KB
} events SEC(".maps");

struct event {
    u32 pid;
    char comm[16];
    u64 timestamp;
};

SEC("tracepoint/sched/sched_process_exec")
int trace_exec(struct trace_event_raw_sched_process_exec *ctx)
{
    struct event *e;
    
    // Reserve space in ring buffer
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    
    // Fill event data
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->timestamp = bpf_ktime_get_ns();
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    
    // Submit to user space
    bpf_ringbuf_submit(e, 0);
    return 0;
}

The BPF Verifier

The verifier ensures eBPF programs are safe to run in the kernel.

Verifier Checks

┌─────────────────────────────────────────────────────────────────────────────┐
│                        BPF VERIFIER CHECKS                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. CONTROL FLOW                                                             │
│     - No unbounded loops (must have provable termination)                   │
│     - No unreachable instructions                                            │
│     - Complexity limit: 1M verified instructions (4096 if unprivileged)     │
│     - Maximum stack depth (512 bytes)                                        │
│                                                                              │
│  2. MEMORY SAFETY                                                            │
│     - All memory accesses must be bounded                                   │
│     - Pointer arithmetic checked                                            │
│     - NULL checks before dereference                                        │
│     - Stack access within bounds                                            │
│                                                                              │
│  3. TYPE SAFETY                                                              │
│     - Registers tracked for type                                            │
│     - Helper function argument types checked                                │
│     - Map key/value types verified                                          │
│                                                                              │
│  4. PRIVILEGE CHECKS                                                         │
│     - Certain helpers require CAP_BPF or CAP_SYS_ADMIN                      │
│     - Some program types restricted                                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Common Verifier Errors

// ERROR: Unbounded loop
for (int i = 0; i < n; i++) { }  // n is unknown at verify time

// SOLUTION: Use bounded loop
#pragma unroll
for (int i = 0; i < 100; i++) {
    if (i >= n) break;
}

// ERROR: Potential NULL dereference
u64 *value = bpf_map_lookup_elem(&my_map, &key);
*value = 42;  // ERROR: value might be NULL

// SOLUTION: Check for NULL
u64 *value = bpf_map_lookup_elem(&my_map, &key);
if (value)
    *value = 42;

// ERROR: Out of bounds access
char buf[16];
buf[idx] = 'x';  // ERROR: idx could be >= 16

// SOLUTION: Bound the index
if (idx < 16)
    buf[idx] = 'x';
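
Both fixes above work by giving the verifier a compile-time constant it can reason about. As a rough illustration only, the same two patterns can be tried out in ordinary user-space C. The names MAX_SCAN, BUF_SIZE, find_byte, and store_bounded are invented for this sketch, and nothing here is loaded into the kernel or checked by the real verifier.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SCAN 100   /* hypothetical compile-time loop bound */
#define BUF_SIZE 16    /* hypothetical buffer size */

/* Bounded loop: the iteration count never exceeds MAX_SCAN, whatever n is. */
static int find_byte(const unsigned char *buf, unsigned int n,
                     unsigned char needle)
{
    assert(buf != NULL);
    for (unsigned int i = 0; i < MAX_SCAN; i++) {
        if (i >= n)
            break;              /* runtime limit nested inside the constant one */
        if (buf[i] == needle)
            return (int)i;
    }
    return -1;
}

/* Bounded index: the write happens only after idx is proven in range. */
static int store_bounded(char *buf, unsigned int idx, char c)
{
    assert(buf != NULL);
    if (idx >= BUF_SIZE)
        return -1;
    buf[idx] = c;
    return 0;
}
```

In a BPF program the same shape applies: the constant bound comes first, and the data-dependent condition only narrows it.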

Verifier Debugging

# Get verbose verifier output
sudo bpftool prog load program.o /sys/fs/bpf/prog -d

# Dump the translated (post-verifier) instructions of a loaded program
sudo bpftool prog dump xlated id <prog_id>

Helper Functions

eBPF programs can call kernel-provided helper functions.

Common Helpers

// Get current time in nanoseconds
u64 ts = bpf_ktime_get_ns();

// Get current PID/TID
u64 pid_tgid = bpf_get_current_pid_tgid();
u32 pid = pid_tgid >> 32;
u32 tid = pid_tgid & 0xFFFFFFFF;

// Get current task's comm
char comm[16];
bpf_get_current_comm(&comm, sizeof(comm));

// Get stack trace
u64 stack_id = bpf_get_stackid(ctx, &stack_map, 0);

// Read from kernel memory
bpf_probe_read_kernel(&dst, size, src);

// Read from user memory
bpf_probe_read_user(&dst, size, src);

// Read from user string
bpf_probe_read_user_str(&dst, size, src);

// Print debug message (limited, for development)
bpf_printk("PID %d called\n", pid);

// Send signal to current task
bpf_send_signal(SIGKILL);

// Get current cgroup ID
u64 cgroup_id = bpf_get_current_cgroup_id();
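
Several helpers pack two 32-bit values into one u64: bpf_get_current_pid_tgid() puts the TGID (the user-space "PID") in the upper half and the thread ID in the lower half, and bpf_get_current_uid_gid() puts the GID in the upper half and the UID in the lower half. A minimal user-space sketch of the unpacking, with hypothetical names (struct halves, split_u64, pack_u64, self_check) that are not part of any BPF API:

```c
#include <assert.h>
#include <stdint.h>

struct halves {
    uint32_t upper;   /* tgid for pid_tgid, gid for uid_gid */
    uint32_t lower;   /* thread id for pid_tgid, uid for uid_gid */
};

/* Split a packed 64-bit helper return value into its two 32-bit halves. */
static struct halves split_u64(uint64_t packed)
{
    struct halves h = {
        .upper = (uint32_t)(packed >> 32),
        .lower = (uint32_t)(packed & 0xFFFFFFFFu),
    };
    return h;
}

/* Build a packed value the same way, handy for experiments. */
static uint64_t pack_u64(uint32_t upper, uint32_t lower)
{
    return ((uint64_t)upper << 32) | lower;
}

/* Round-trip check: packing then splitting returns the original halves. */
static void self_check(void)
{
    struct halves h = split_u64(pack_u64(1234, 5678));
    assert(h.upper == 1234 && h.lower == 5678);
}
```

Mixing up the halves is an easy mistake: shifting the uid_gid value right by 32 yields the GID, not the UID.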

Available Helpers Per Program Type

# List helpers available for a program type
sudo bpftool feature probe kernel | grep -A 100 "kprobe" | head -50

BPF CO-RE (Compile Once, Run Everywhere)

CO-RE solves the problem of kernel version differences.
┌─────────────────────────────────────────────────────────────────────────────┐
│                            BPF CO-RE                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Without CO-RE:                                                              │
│  - Compile eBPF program for each kernel version                             │
│  - Must match exact struct layouts                                          │
│  - Breaks when kernel changes struct fields                                  │
│                                                                              │
│  With CO-RE:                                                                 │
│  - Compile once with BTF (BPF Type Format)                                  │
│  - libbpf adjusts offsets at load time                                      │
│  - Works across kernel versions                                              │
│                                                                              │
│  Example: struct task_struct                                                 │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │  Kernel 5.4:                  Kernel 5.15:                         │     │
│  │  struct task_struct {         struct task_struct {                  │     │
│  │    ...                          ...                                 │     │
│  │    pid_t pid;  // offset 100    void *new_field;  // added         │     │
│  │    ...                          pid_t pid;  // offset 108 (moved!) │     │
│  │  }                            }                                     │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  CO-RE reads BTF from running kernel, adjusts offsets automatically         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

CO-RE Code Example

#include "vmlinux.h"  // Generated from kernel BTF
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>    // BPF_KPROBE macro
#include <bpf/bpf_core_read.h>

SEC("kprobe/do_sys_open")
int BPF_KPROBE(trace_open, int dfd, const char *filename)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    
    // CO-RE: works across kernel versions
    pid_t pid = BPF_CORE_READ(task, pid);
    pid_t tgid = BPF_CORE_READ(task, tgid);
    
    // Read parent's PID (nested struct access)
    pid_t ppid = BPF_CORE_READ(task, real_parent, pid);
    
    bpf_printk("PID %d (parent %d) opening file\n", pid, ppid);
    return 0;
}
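
To make the offset adjustment concrete, here is a small user-space analogy, not how libbpf is implemented. Two hypothetical struct layouts (task_old, task_new) stand in for two kernel versions, and a tiny name-to-offset table stands in for BTF. Real CO-RE patches the offsets into the BPF instructions once at load time; this sketch resolves them on each call purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two made-up layouts of the "same" struct in different kernel versions. */
struct task_old { long state; int32_t pid; };
struct task_new { long state; void *new_field; int32_t pid; };

/* Stand-in for BTF: field name mapped to its offset in one layout. */
struct field_info { const char *name; size_t offset; };

static const struct field_info btf_old[] = {
    { "pid", offsetof(struct task_old, pid) },
};
static const struct field_info btf_new[] = {
    { "pid", offsetof(struct task_new, pid) },
};

/* Read an int32 field by name, using the offset recorded for this layout. */
static int read_i32_field(const void *obj, const struct field_info *btf,
                          size_t nfields, const char *name, int32_t *out)
{
    assert(obj != NULL && out != NULL);
    for (size_t i = 0; i < nfields; i++) {
        if (strcmp(btf[i].name, name) == 0) {
            memcpy(out, (const char *)obj + btf[i].offset, sizeof(*out));
            return 0;
        }
    }
    return -1;   /* field missing in this "kernel" */
}
```

The caller names the field once, and the same code reads the right bytes from either layout. A missing field surfaces as an error the caller can handle instead of a silent misread.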

bpftrace - High-Level Tracing

bpftrace is the easiest way to write eBPF programs.

bpftrace Basics

# Trace all syscalls
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'

# Trace open() calls with filename
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { 
    printf("%s opened %s\n", comm, str(args->filename)); 
}'

# Histogram of read() sizes
sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { 
    @size = hist(args->ret); 
}'

# Track time spent in functions
sudo bpftrace -e 'kprobe:do_sys_open { @start[tid] = nsecs; }
                  kretprobe:do_sys_open /@start[tid]/ { 
                      @duration = hist(nsecs - @start[tid]); 
                      delete(@start[tid]); 
                  }'

bpftrace One-Liners for Observability

# Top syscalls by process
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Disk I/O latency histogram
sudo bpftrace -e 'kprobe:blk_account_io_start { @start[arg0] = nsecs; }
    kprobe:blk_account_io_done /@start[arg0]/ { 
        @us = hist((nsecs - @start[arg0]) / 1000); 
        delete(@start[arg0]); 
    }'

# TCP connections
sudo bpftrace -e 'kprobe:tcp_connect { 
    @[comm] = count(); 
    printf("%s connecting\n", comm); 
}'

# Page faults by process
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { 
    @[comm] = count(); 
}'

# Memory allocations
sudo bpftrace -e 'tracepoint:kmem:kmalloc { 
    @bytes = hist(args->bytes_alloc); 
}'

# Context switches
sudo bpftrace -e 'tracepoint:sched:sched_switch { 
    @[args->prev_comm] = count(); 
}'

# Blocking context switches by process (prev_state != 0: the task went to sleep)
sudo bpftrace -e 'tracepoint:sched:sched_switch { 
    @[args->prev_comm] = sum(args->prev_state != 0 ? 1 : 0); 
}'

bpftrace Variables

| Variable | Description |
|----------|-------------|
| pid | Process ID |
| tid | Thread ID |
| uid | User ID |
| comm | Process name |
| nsecs | Timestamp (nanoseconds) |
| cpu | CPU number |
| curtask | Current task_struct pointer |
| args | Tracepoint arguments |
| retval | Return value (kretprobe) |

Production eBPF Tools

BCC (BPF Compiler Collection)

# Install BCC tools
sudo apt install bpfcc-tools linux-headers-$(uname -r)

# Trace process execution
sudo execsnoop-bpfcc

# Trace open() calls
sudo opensnoop-bpfcc

# Trace TCP connections
sudo tcpconnect-bpfcc
sudo tcpaccept-bpfcc
sudo tcpretrans-bpfcc

# Profile on-CPU time
sudo profile-bpfcc -F 99 10

# Trace block I/O
sudo biolatency-bpfcc

# Trace filesystem operations
sudo ext4slower-bpfcc 1

# Memory allocation tracing
sudo memleak-bpfcc

# Cache hit ratio
sudo cachestat-bpfcc

libbpf-based Tools

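// Modern libbpf skeleton approach
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <bpf/libbpf.h>
#include "trace.skel.h"

static volatile sig_atomic_t exiting;

static void on_sigint(int sig)
{
    exiting = 1;
}

// Ring buffer callback: invoked once per submitted record
static int handle_event(void *ctx, void *data, size_t data_sz)
{
    // Cast data to the event struct shared with the BPF program
    return 0;
}

int main(int argc, char **argv)
{
    struct trace_bpf *skel;
    struct ring_buffer *rb = NULL;  // initialized so cleanup is safe on early exit
    int err;

    signal(SIGINT, on_sigint);

    // Open and load BPF program
    skel = trace_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to load BPF skeleton\n");
        return 1;
    }

    // Attach BPF programs
    err = trace_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF skeleton\n");
        goto cleanup;
    }

    // Set up ring buffer callback
    rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
                          handle_event, NULL, NULL);
    if (!rb) {
        err = -1;
        fprintf(stderr, "Failed to create ring buffer\n");
        goto cleanup;
    }

    // Poll for events (100 ms timeout) until interrupted
    while (!exiting) {
        err = ring_buffer__poll(rb, 100);
        if (err == -EINTR) {
            err = 0;
            break;
        }
        if (err < 0) {
            fprintf(stderr, "Error polling ring buffer: %d\n", err);
            break;
        }
        err = 0;  // a positive return is the number of records consumed
    }

cleanup:
    ring_buffer__free(rb);
    trace_bpf__destroy(skel);
    return err < 0 ? 1 : 0;
}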
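
The loader above hands a handle_event callback to ring_buffer__new() without showing what it might do. One possible shape is sketched below, assuming the struct event layout from the ring buffer example earlier. It also assumes a hypothetical struct stats is passed as the callback context (the third argument of ring_buffer__new(), which the loader above leaves as NULL). The callback is plain C, so it can be exercised without libbpf by calling it directly.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Must match the layout used on the BPF side. */
struct event {
    uint32_t pid;
    char comm[16];
    uint64_t timestamp;
};

/* Hypothetical per-consumer counters passed in through ctx. */
struct stats {
    unsigned long events;
    unsigned long short_records;
    uint32_t last_pid;
};

/* Same signature as libbpf's ring buffer sample callback. */
static int handle_event(void *ctx, void *data, size_t data_sz)
{
    struct stats *st = ctx;
    struct event e;

    assert(st != NULL);
    if (data_sz < sizeof(e)) {      /* never read past a truncated record */
        st->short_records++;
        return 0;
    }
    memcpy(&e, data, sizeof(e));    /* copy out: data may be unaligned */
    e.comm[sizeof(e.comm) - 1] = '\0';

    st->events++;
    st->last_pid = e.pid;
    printf("%llu pid=%u comm=%s\n",
           (unsigned long long)e.timestamp, (unsigned)e.pid, e.comm);
    return 0;
}
```

Checking data_sz before copying keeps the consumer safe if the kernel-side and user-space struct definitions ever drift apart.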

XDP (eXpress Data Path)

XDP allows packet processing at the network driver level.

XDP Program

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>          // IPPROTO_ICMP
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>   // SEC()
#include <bpf/bpf_endian.h>    // bpf_htons()

SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Parse Ethernet header
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // Only handle IPv4
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;
    
    // Parse IP header
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    
    // Drop ICMP packets
    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;
    
    return XDP_PASS;
}
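
The parsing logic above can be prototyped in user space before it goes anywhere near a NIC. The sketch below mirrors the same checks on a plain byte buffer. It is an approximation under stated assumptions: a length argument replaces the data/data_end pointer pair, the verdict enum and offset constants are invented stand-ins for XDP_PASS/XDP_DROP and the kernel header structs, and IPv4 options are ignored, as in the program above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for XDP_PASS / XDP_DROP. */
enum verdict { VERDICT_PASS, VERDICT_DROP };

#define ETH_HDR_LEN     14      /* dst MAC + src MAC + EtherType */
#define ETH_PROTO_OFF   12      /* EtherType offset, big-endian on the wire */
#define ETHERTYPE_IPV4  0x0800
#define IPV4_MIN_HDR    20      /* IPv4 header without options */
#define IPV4_PROTO_OFF  9       /* protocol field offset inside the IP header */
#define PROTO_ICMP      1

static enum verdict classify(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);

    /* Bounds check before touching the Ethernet header. */
    if (len < ETH_HDR_LEN)
        return VERDICT_PASS;

    uint16_t ethertype =
        (uint16_t)((pkt[ETH_PROTO_OFF] << 8) | pkt[ETH_PROTO_OFF + 1]);
    if (ethertype != ETHERTYPE_IPV4)
        return VERDICT_PASS;

    /* Bounds check before touching the IPv4 header. */
    if (len < ETH_HDR_LEN + IPV4_MIN_HDR)
        return VERDICT_PASS;

    if (pkt[ETH_HDR_LEN + IPV4_PROTO_OFF] == PROTO_ICMP)
        return VERDICT_DROP;

    return VERDICT_PASS;
}
```

Every header read is preceded by a length check, which is the property the verifier insists on in the real XDP program. Truncated packets fall through to PASS rather than being read out of bounds.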

XDP Return Codes

| Code | Action |
|------|--------|
| XDP_PASS | Pass to network stack |
| XDP_DROP | Drop packet |
| XDP_TX | Bounce back out same interface |
| XDP_REDIRECT | Redirect to another interface |
| XDP_ABORTED | Error, drop and trace |

Lab Exercises

Objective: Write basic tracing scripts
# List available tracepoints
sudo bpftrace -l 'tracepoint:*' | head -50

# Trace file opens with latency
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_openat {
    @start[tid] = nsecs;
    @fname[tid] = args->filename;
}
tracepoint:syscalls:sys_exit_openat /@start[tid]/ {
    $dur = (nsecs - @start[tid]) / 1000;
    printf("%s opened %s in %d us (fd=%d)\n", 
           comm, str(@fname[tid]), $dur, args->ret);
    delete(@start[tid]);
    delete(@fname[tid]);
}'

# Histogram of process lifetimes
sudo bpftrace -e '
tracepoint:sched:sched_process_fork {
    @birth[args->child_pid] = nsecs;
}
tracepoint:sched:sched_process_exit /@birth[args->pid]/ {
    @lifetime = hist((nsecs - @birth[args->pid]) / 1000000);
    delete(@birth[args->pid]);
}'

Objective: Create a complete eBPF program with libbpf
// trace.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct event {
    u32 pid;
    u32 uid;
    char comm[16];
    char filename[256];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int tracepoint__syscalls__sys_enter_openat(
    struct trace_event_raw_sys_enter *ctx)
{
    struct event *e;
    
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;  // UID is the lower 32 bits; GID is the upper
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_probe_read_user_str(&e->filename, sizeof(e->filename),
                            (void *)ctx->args[1]);
    
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
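
Build with:
# Generate vmlinux.h from the running kernel's BTF (needed by the #include above)
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
clang -g -O2 -target bpf -c trace.bpf.c -o trace.bpf.o
bpftool gen skeleton trace.bpf.o > trace.skel.h

Objective: Profile production workloads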
# CPU profiling with flame graphs
sudo profile-bpfcc -F 99 -f 30 > profile.txt
flamegraph.pl profile.txt > profile.svg

# Off-CPU analysis
sudo offcputime-bpfcc -f 30 > offcpu.txt
flamegraph.pl --color=io offcpu.txt > offcpu.svg

# Combined on/off CPU
sudo bpftrace -e '
profile:hz:99 { @on[kstack] = count(); }
tracepoint:sched:sched_switch { 
    @off[kstack] = count(); 
}' -c "sleep 10"

# Latency analysis
sudo funclatency-bpfcc do_sys_open -m

Interview Questions

Answer: The verifier ensures eBPF programs are safe to run in kernel context.
Why necessary:
  • eBPF runs with kernel privileges
  • Bugs could crash the kernel or leak data
  • Must guarantee termination (no infinite loops)
  • Must prevent out-of-bounds memory access
Key checks:
  1. Control flow: No unbounded loops, reachable exit
  2. Memory safety: All accesses bounds-checked
  3. Type safety: Correct helper function arguments
  4. Privilege: Capability checks for dangerous operations
Limitations:
  • Some safe programs are rejected (false positives)
  • Complex loop bounds hard to prove
  • Instruction count limits
Answer:
| Aspect | Kprobes | Tracepoints |
|--------|---------|-------------|
| Type | Dynamic | Static |
| Attach points | Any function | Predefined only |
| ABI stability | None | Maintained |
| Overhead | Higher | Lower |
| Arguments | Read from stack/regs | Structured, documented |
| Cross-kernel | May break | Stable |
Best practices:
  • Use tracepoints when available (stable, efficient)
  • Use kprobes for specific functions not covered
  • CO-RE helps with kprobe portability
  • Document kprobe usage for maintenance
Answer:
Approach:
  1. Identify entry/exit points:
# Trace HTTP request handling
sudo bpftrace -e '
uprobe:/path/to/service:handleRequest { @start[tid] = nsecs; }
uretprobe:/path/to/service:handleRequest /@start[tid]/ {
    @latency = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]);
}'
  2. Break down latency:
  • Trace syscalls (read, write, connect)
  • Trace specific functions (DB queries, cache lookups)
  • Measure off-CPU time (blocking)
  3. Identify bottlenecks:
# Stack traces for slow operations
sudo bpftrace -e 'uprobe:... /@lat > 10000000/ { print(ustack); }'
  4. Production-safe approach:
  • Start with low-overhead tracepoints
  • Sample rather than trace all events
  • Use ring buffer for event collection
  • Set reasonable map sizes
Answer:
What is XDP:
  • Runs eBPF at network driver level
  • Before sk_buff allocation (very early)
  • Near-native speed packet processing
Use cases:
  • DDoS mitigation (drop malicious packets)
  • Load balancing (Facebook’s Katran)
  • Packet filtering (Cloudflare)
  • Traffic steering
Performance:
  • Can process 10M+ packets per second per core
  • 10-100x faster than iptables for simple rules
Limitations:
  • Limited packet modification capabilities
  • Driver must support XDP
  • Complex protocols need network stack
Comparison with TC:
  • XDP: Earlier, faster, limited
  • TC: After sk_buff, full networking features

Key Takeaways

eBPF Safety

Verifier ensures programs are safe before running in kernel

Maps for Data

BPF maps enable data sharing between kernel and user space

CO-RE Portability

Compile once, run everywhere with BTF and libbpf

bpftrace Power

High-level scripting for quick observability tasks

Interview Deep-Dive

Strong Answer:
  • I would use two tracepoint programs: tracepoint/block/block_rq_issue to record when a block I/O request is dispatched to the device driver, and tracepoint/block/block_rq_complete to record when it completes. On issue, I would store the timestamp keyed by (dev, sector) in a BPF hash map. On completion, I would look up the start time, compute the delta, and emit a latency event.
  • For per-container attribution, I would use bpf_get_current_cgroup_id() at the issue tracepoint to capture the cgroup ID of the process that initiated the I/O. I would store this alongside the timestamp in the hash map, so the completion handler can attribute the latency to the correct container even though the completion runs in interrupt context (where the “current” task is arbitrary).
  • For efficient data export, I would use a BPF_MAP_TYPE_PERCPU_HASH map keyed by (cgroup_id, latency_bucket) to build a per-container latency histogram directly in kernel space. The user-space agent reads this map periodically (every 5-10 seconds), aggregates across CPUs, and exports to Prometheus. This approach avoids per-event ring buffer overhead.
  • For production safety: bounded map sizes (10240 entries), PERCPU maps to avoid lock contention, and the programs attach to stable tracepoints (not kprobes) for kernel version compatibility.
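
The user-space half of that design is mostly arithmetic: pick a histogram bucket for each latency and sum the per-CPU copies of each counter. A minimal sketch under stated assumptions follows. The power-of-two bucketing and the names log2_bucket, sum_percpu, and NBUCKETS are illustrative choices, and the per-CPU values are passed in as a plain array rather than read from a real map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 32   /* hypothetical number of histogram buckets */

/* Power-of-two bucket: 0..1 -> 0, 2..3 -> 1, 4..7 -> 2, and so on.
 * Values too large for the table land in the last bucket. */
static unsigned int log2_bucket(uint64_t value)
{
    unsigned int bucket = 0;

    while (value > 1 && bucket < NBUCKETS - 1) {
        value >>= 1;
        bucket++;
    }
    return bucket;
}

/* A lookup on a per-CPU map from user space yields one value per possible
 * CPU; the agent adds them up to get the total for that key. */
static uint64_t sum_percpu(const uint64_t *values, size_t ncpus)
{
    uint64_t total = 0;

    assert(values != NULL);
    for (size_t cpu = 0; cpu < ncpus; cpu++)
        total += values[cpu];
    return total;
}
```

With real maps, the array length would come from the number of possible CPUs (libbpf exposes libbpf_num_possible_cpus() for this), not the number currently online.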
Follow-up: What happens if the hash map fills up because I/O requests are issued faster than they complete?
Follow-up Answer:
  • If the (dev, sector) tracking map fills up, bpf_map_update_elem() fails with -E2BIG and the new entry is not stored. The corresponding completion event will not find a matching start timestamp and will skip that I/O. This means we lose visibility into some requests during extreme load, but the program remains safe and does not crash or block I/O. To mitigate, I would size the map based on the expected maximum I/O queue depth across all devices (for NVMe, where each queue can hold up to 64K entries, this could be large). I would also add a per-CPU counter for dropped entries so the monitoring system can alert when we are losing data.
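
The failure mode described here can be modelled in a few lines of user-space C. The sketch below is a toy stand-in for the BPF hash map, with an invented fixed-capacity table (struct tracker, MAX_INFLIGHT) and a dropped counter. It only illustrates the bookkeeping: inserts fail when the table is full, the matching completion finds nothing, and the loss is counted rather than hidden.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 4   /* hypothetical capacity, tiny on purpose */

struct slot { int used; uint64_t key; uint64_t start_ns; };

struct tracker {
    struct slot slots[MAX_INFLIGHT];
    uint64_t dropped;    /* issue events we could not record */
};

/* Record a request start. Returns -1 and counts a drop when full. */
static int track_issue(struct tracker *t, uint64_t key, uint64_t now_ns)
{
    struct slot *free_slot = NULL;

    assert(t != NULL);
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (t->slots[i].used && t->slots[i].key == key) {
            t->slots[i].start_ns = now_ns;   /* overwrite stale entry */
            return 0;
        }
        if (!t->slots[i].used && !free_slot)
            free_slot = &t->slots[i];
    }
    if (!free_slot) {
        t->dropped++;
        return -1;
    }
    free_slot->used = 1;
    free_slot->key = key;
    free_slot->start_ns = now_ns;
    return 0;
}

/* Match a completion. Returns -1 if the start was never recorded. */
static int track_complete(struct tracker *t, uint64_t key, uint64_t now_ns,
                          uint64_t *latency_ns)
{
    assert(t != NULL && latency_ns != NULL);
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (t->slots[i].used && t->slots[i].key == key) {
            *latency_ns = now_ns - t->slots[i].start_ns;
            t->slots[i].used = 0;            /* free the slot for reuse */
            return 0;
        }
    }
    return -1;
}
```

Exporting the dropped counter alongside the histogram lets a dashboard tell an idle device apart from a saturated tracking table.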
Strong Answer:
  • The BPF verifier is a static analyzer that runs at program load time (before any execution) and simulates every possible execution path through the program. It maintains a state machine tracking the type, value range, and liveness of each register and stack slot at every instruction.
  • Key invariants enforced: First, termination — every loop must have a provable upper bound. The verifier tracks loop iterations and rejects programs where it cannot prove the loop exits within a bounded number of iterations. Second, memory safety — every pointer dereference must be preceded by a bounds check. If bpf_map_lookup_elem() returns a pointer, the verifier marks it as “possibly NULL” and requires an explicit NULL check before dereferencing. Third, type safety — the verifier tracks which registers contain pointers to map values, packet data, stack, or scalars, and ensures they are used correctly (you cannot pass a packet pointer where a map pointer is expected). Fourth, privilege — certain helper functions require CAP_BPF or CAP_SYS_ADMIN.
  • A real rejection scenario: you write a program that iterates over a linked list of variable length in packet data. Even if you add if (i >= MAX_ITERATIONS) break;, the verifier might reject it because the packet pointer arithmetic creates too many possible states. The verifier explores states exponentially, and complex pointer arithmetic with conditional branches can exceed the instruction count limit (1 million verified instructions). The fix is to restructure the code to reduce branching complexity, use bpf_loop() helper (kernel 5.17+), or split the logic into multiple tail-called programs.
Follow-up: How does BPF CO-RE (Compile Once, Run Everywhere) interact with the verifier to enable cross-kernel portability?
Follow-up Answer:
  • CO-RE works by embedding relocation information in the compiled BPF program’s ELF file. When you use BPF_CORE_READ(task, pid), the compiler emits a relocation record saying “access field pid at offset X in struct task_struct.” At load time, libbpf reads the running kernel’s BTF (BPF Type Format) data to find the actual offset of pid in the current kernel’s task_struct, which may differ from the compile-time offset. libbpf patches the BPF instructions to use the correct offset before submitting to the verifier. The verifier then sees a program with correct offsets for the running kernel and validates it normally. If the field does not exist at all (e.g., a field removed in a newer kernel), libbpf can detect this and handle it gracefully (returning a default value or disabling that part of the program).
Strong Answer:
  • XDP (eXpress Data Path) programs run at the earliest possible point in the network receive path, before the kernel allocates an sk_buff structure. They operate on raw xdp_md contexts with direct packet data access. TC BPF programs run later, after sk_buff allocation, at the traffic control layer in both ingress and egress paths.
  • Performance difference is significant: XDP can process 10-20 million packets per second per core because it avoids the overhead of sk_buff allocation (~200-300 cycles per packet). TC processes 2-5 million packets per second per core, which is still much faster than iptables but slower than XDP.
  • I would choose XDP for: DDoS mitigation (drop malicious packets before they consume memory), load balancing (redirect packets to different NICs or CPUs), and simple packet filtering (firewalling at line rate). Facebook’s Katran load balancer and Cloudflare’s DDoS mitigation use XDP.
  • I would choose TC for: more complex packet manipulation (full sk_buff available with all parsed headers), egress path processing (XDP only works on ingress), container networking (Cilium uses TC BPF for pod-to-pod traffic because it needs access to socket-level metadata), and when compatibility with the full networking stack is needed (TC programs can interact with connection tracking, netfilter marks, etc.).
  • XDP limitations: cannot modify packets that need fragmentation (no access to GSO), cannot directly interact with the socket layer, requires NIC driver support for native mode (falls back to generic/slower mode otherwise).
Follow-up: How does AF_XDP enable zero-copy packet processing, and when would you use it instead of regular XDP?
Follow-up Answer:
  • AF_XDP is a socket type that works with XDP to deliver raw packets directly to user space without kernel copying. An XDP program uses XDP_REDIRECT with bpf_redirect_map() to send packets to an XSKMAP (XDP socket map). The user-space application creates an AF_XDP socket with shared UMEM (user memory), where packet data is DMA’d directly from the NIC into user-space-accessible memory. This achieves true zero-copy: the packet data is never copied by the kernel. I would use AF_XDP when the application needs to process every packet (not just filter/drop), such as custom protocol implementations, high-frequency trading network stacks, or DPDK-like applications that want kernel bypass without the complexity of a full DPDK setup. The trade-off versus pure XDP is that AF_XDP requires user-space processing latency, while XDP programs run to completion in the kernel.
