
eBPF Deep Dive

eBPF (extended Berkeley Packet Filter) is the technology revolutionizing observability, networking, and security in Linux. Companies like Datadog, Grafana, and Cloudflare use eBPF extensively. Mastering it is essential for infrastructure and observability engineering roles.
Interview Frequency: Very High (core skill for observability roles)
Key Topics: BPF architecture, program types, maps, verifier, bpftrace
Time to Master: 18-20 hours

What is eBPF?

eBPF allows running sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules.

[Figure: eBPF lifecycle]

eBPF Program Types

Different attach points for different use cases:

Core Program Types

| Type | Attach Point | Use Case |
|---|---|---|
| BPF_PROG_TYPE_KPROBE | Kernel function entry/exit | Tracing syscalls, kernel functions |
| BPF_PROG_TYPE_TRACEPOINT | Static kernel tracepoints | Stable tracing points |
| BPF_PROG_TYPE_PERF_EVENT | perf events (PMU, software) | Performance monitoring |
| BPF_PROG_TYPE_XDP | Network driver (before SKB) | High-performance packet processing |
| BPF_PROG_TYPE_SCHED_CLS | Traffic control classifier | Container networking |
| BPF_PROG_TYPE_SOCKET_FILTER | Socket | Packet filtering |
| BPF_PROG_TYPE_CGROUP_* | Cgroup hooks | Container resource control |
| BPF_PROG_TYPE_LSM | LSM hooks | Security policies |

Kprobes vs Tracepoints

┌─────────────────────────────────────────────────────────────────────────────┐
│                    KPROBES VS TRACEPOINTS                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  KPROBES (Dynamic)                                                           │
│  ────────────────                                                           │
│  + Can attach to ANY kernel function                                        │
│  + Very flexible for debugging                                               │
│  - Unstable ABI (function signatures can change)                            │
│  - Higher overhead                                                           │
│  - May break between kernel versions                                         │
│                                                                              │
│  TRACEPOINTS (Static)                                                        │
│  ───────────────────                                                        │
│  + Stable ABI (maintained by kernel developers)                             │
│  + Lower overhead (optimized instrumentation)                               │
│  + Documented arguments                                                      │
│  - Limited to predefined points                                             │
│  - May not cover everything you need                                         │
│                                                                              │
│  Best Practice: Prefer tracepoints when available, use kprobes when needed  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Available Tracepoints

# List all tracepoints
sudo ls /sys/kernel/debug/tracing/events/

# List syscall tracepoints
sudo ls /sys/kernel/debug/tracing/events/syscalls/

# View tracepoint format
sudo cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
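# Note: on newer kernels tracefs is also mounted at /sys/kernel/tracing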

BPF Maps

Maps are key-value stores shared between eBPF programs and user space.

Map Types

// Common map types
BPF_MAP_TYPE_HASH          // Hash table
BPF_MAP_TYPE_ARRAY         // Array (fixed-size)
BPF_MAP_TYPE_PERCPU_HASH   // Per-CPU hash (no locking)
BPF_MAP_TYPE_PERCPU_ARRAY  // Per-CPU array
BPF_MAP_TYPE_RINGBUF       // Ring buffer (efficient event streaming)
BPF_MAP_TYPE_PERF_EVENT_ARRAY  // Per-CPU event buffer
BPF_MAP_TYPE_LRU_HASH      // LRU evicting hash
BPF_MAP_TYPE_STACK_TRACE   // Stack traces storage
BPF_MAP_TYPE_LPM_TRIE      // Longest prefix match (routing)
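
The per-CPU variants trade user-space work for lock-free updates in the kernel: a lookup from user space returns one value per possible CPU, and the reader must aggregate. A minimal sketch with libbpf (the map fd and key are assumed to come from your loader code):

// Sum a BPF_MAP_TYPE_PERCPU_ARRAY value across CPUs (user space)
#include <bpf/bpf.h>
#include <bpf/libbpf.h>  // libbpf_num_possible_cpus()

__u64 sum_percpu(int map_fd, __u32 key)
{
    int ncpus = libbpf_num_possible_cpus();
    __u64 values[ncpus];
    __u64 total = 0;

    // One lookup fills values[0..ncpus-1], one slot per possible CPU
    if (bpf_map_lookup_elem(map_fd, &key, values) == 0)
        for (int i = 0; i < ncpus; i++)
            total += values[i];
    return total;
}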

Map Declaration (libbpf style)

// In eBPF program (kernel side)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);               // Key: PID
    __type(value, u64);             // Value: count
} syscall_count SEC(".maps");

// Using the map
SEC("tracepoint/syscalls/sys_enter_read")
int trace_read(struct trace_event_raw_sys_enter *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *count = bpf_map_lookup_elem(&syscall_count, &pid);
    if (count) {
        __sync_fetch_and_add(count, 1);  // atomic: the map is shared across CPUs
    } else {
        u64 initial = 1;
        bpf_map_update_elem(&syscall_count, &pid, &initial, BPF_NOEXIST);
    }
    return 0;
}
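
On the user-space side, the same map can be read with libbpf's syscall wrappers. A sketch, assuming the loader pinned the map at /sys/fs/bpf/syscall_count (a hypothetical pin path):

// Dump the syscall_count map from user space
#include <stdio.h>
#include <bpf/bpf.h>

int main(void)
{
    int fd = bpf_obj_get("/sys/fs/bpf/syscall_count");
    if (fd < 0) { perror("bpf_obj_get"); return 1; }

    __u32 key, next_key, *prev = NULL;
    __u64 count;

    // Passing NULL as the current key starts iteration at the first key
    while (bpf_map_get_next_key(fd, prev, &next_key) == 0) {
        if (bpf_map_lookup_elem(fd, &next_key, &count) == 0)
            printf("PID %u: %llu reads\n", next_key,
                   (unsigned long long)count);
        key = next_key;
        prev = &key;
    }
    return 0;
}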

Ring Buffer vs Perf Buffer

┌─────────────────────────────────────────────────────────────────────────────┐
│                    RINGBUF VS PERF_EVENT_ARRAY                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PERF_EVENT_ARRAY (Legacy)                                                   │
│  ─────────────────────────                                                  │
│  - Per-CPU buffers (separate buffer per CPU)                                │
│  - User space must poll each CPU                                            │
│  - Can lose events if buffer full                                           │
│  - Higher memory overhead                                                    │
│                                                                              │
│  RINGBUF (Preferred, kernel 5.8+)                                           │
│  ─────────────────────────────────                                          │
│  - Single shared ring buffer                                                 │
│  - Automatic ordering of events                                              │
│  - Reservation-based (no loss if size check)                                │
│  - More efficient memory usage                                               │
│  - Variable-length records                                                   │
│                                                                              │
│  Always use RINGBUF for new code on kernel 5.8+                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

// Ring buffer usage
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // 256 KB
} events SEC(".maps");

struct event {
    u32 pid;
    char comm[16];
    u64 timestamp;
};

SEC("tracepoint/sched/sched_process_exec")
int trace_exec(struct trace_event_raw_sched_process_exec *ctx)
{
    struct event *e;
    
    // Reserve space in ring buffer
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    
    // Fill event data
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->timestamp = bpf_ktime_get_ns();
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    
    // Submit to user space
    bpf_ringbuf_submit(e, 0);
    return 0;
}
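
The matching user-space consumer, built with ring_buffer__new() and drained with ring_buffer__poll(), is shown in the libbpf-based Tools section below.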

The BPF Verifier

The verifier ensures eBPF programs are safe to run in the kernel.

Verifier Checks

┌─────────────────────────────────────────────────────────────────────────────┐
│                        BPF VERIFIER CHECKS                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. CONTROL FLOW                                                             │
│     - No unbounded loops (must have provable termination)                   │
│     - No unreachable instructions                                            │
│     - Maximum instruction count (1M default)                                 │
│     - Maximum stack depth (512 bytes)                                        │
│                                                                              │
│  2. MEMORY SAFETY                                                            │
│     - All memory accesses must be bounded                                   │
│     - Pointer arithmetic checked                                            │
│     - NULL checks before dereference                                        │
│     - Stack access within bounds                                            │
│                                                                              │
│  3. TYPE SAFETY                                                              │
│     - Registers tracked for type                                            │
│     - Helper function argument types checked                                │
│     - Map key/value types verified                                          │
│                                                                              │
│  4. PRIVILEGE CHECKS                                                         │
│     - Certain helpers require CAP_BPF or CAP_SYS_ADMIN                      │
│     - Some program types restricted                                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Common Verifier Errors

// ERROR: Unbounded loop
for (int i = 0; i < n; i++) { }  // n is unknown at verify time

// SOLUTION: Use bounded loop
#pragma unroll
for (int i = 0; i < 100; i++) {
    if (i >= n) break;
}

// ERROR: Potential NULL dereference
u64 *value = bpf_map_lookup_elem(&my_map, &key);
*value = 42;  // ERROR: value might be NULL

// SOLUTION: Check for NULL
u64 *value = bpf_map_lookup_elem(&my_map, &key);
if (value)
    *value = 42;

// ERROR: Out of bounds access
char buf[16];
buf[idx] = 'x';  // ERROR: idx could be >= 16

// SOLUTION: Bound the index
if (idx < 16)
    buf[idx] = 'x';
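
// A related idiom: masking the index with a power-of-two bound lets the
// verifier prove the access is in range without a separate branch.
char buf2[16];
buf2[idx & 0xf] = 'x';  // idx & 0xf is provably < 16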

Verifier Debugging

# Get verbose verifier output (-d prints debug-level libbpf and verifier logs)
sudo bpftool -d prog load program.o /sys/fs/bpf/prog

# Dump the translated (post-verifier) instructions of a loaded program
sudo bpftool prog dump xlated id <prog_id>

Helper Functions

eBPF programs can call kernel-provided helper functions.

Common Helpers

// Get current time in nanoseconds
u64 ts = bpf_ktime_get_ns();

// Get current PID/TID
u64 pid_tgid = bpf_get_current_pid_tgid();
u32 pid = pid_tgid >> 32;
u32 tid = pid_tgid & 0xFFFFFFFF;

// Get current task's comm
char comm[16];
bpf_get_current_comm(&comm, sizeof(comm));

// Get stack trace
u64 stack_id = bpf_get_stackid(ctx, &stack_map, 0);

// Read from kernel memory
bpf_probe_read_kernel(&dst, size, src);

// Read from user memory
bpf_probe_read_user(&dst, size, src);

// Read from user string
bpf_probe_read_user_str(&dst, size, src);

// Print debug message (limited, for development)
bpf_printk("PID %d called\n", pid);

// Send signal to current task
bpf_send_signal(SIGKILL);

// Get current cgroup ID
u64 cgroup_id = bpf_get_current_cgroup_id();
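
bpf_printk() output goes to the kernel trace buffer; read it with sudo cat /sys/kernel/debug/tracing/trace_pipe.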

Available Helpers Per Program Type

# List helpers available for a program type
sudo bpftool feature probe kernel | grep -A 100 "kprobe" | head -50

BPF CO-RE (Compile Once, Run Everywhere)

CO-RE solves the problem of kernel version differences.

┌─────────────────────────────────────────────────────────────────────────────┐
│                            BPF CO-RE                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Without CO-RE:                                                              │
│  - Compile eBPF program for each kernel version                             │
│  - Must match exact struct layouts                                          │
│  - Breaks when kernel changes struct fields                                  │
│                                                                              │
│  With CO-RE:                                                                 │
│  - Compile once with BTF (BPF Type Format)                                  │
│  - libbpf adjusts offsets at load time                                      │
│  - Works across kernel versions                                              │
│                                                                              │
│  Example: struct task_struct                                                 │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │  Kernel 5.4:                  Kernel 5.15:                         │     │
│  │  struct task_struct {         struct task_struct {                  │     │
│  │    ...                          ...                                 │     │
│  │    pid_t pid;  // offset 100    void *new_field;  // added         │     │
│  │    ...                          pid_t pid;  // offset 108 (moved!) │     │
│  │  }                            }                                     │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  CO-RE reads BTF from running kernel, adjusts offsets automatically         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

CO-RE Code Example

#include "vmlinux.h"  // Generated from kernel BTF
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/do_sys_open")
int BPF_KPROBE(trace_open, int dfd, const char *filename)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    
    // CO-RE: works across kernel versions
    pid_t pid = BPF_CORE_READ(task, pid);
    pid_t tgid = BPF_CORE_READ(task, tgid);
    
    // Read parent's PID (nested struct access)
    pid_t ppid = BPF_CORE_READ(task, real_parent, pid);
    
    bpf_printk("PID %d (parent %d) opening file\n", pid, ppid);
    return 0;
}
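
CO-RE can also branch on struct layout differences at load time. A short sketch using bpf_core_field_exists() from bpf_core_read.h (the field chosen here is purely illustrative):

// The existence check is resolved against the running kernel's BTF
struct task_struct *t = (void *)bpf_get_current_task();
if (bpf_core_field_exists(t->mm)) {
    struct mm_struct *mm = BPF_CORE_READ(t, mm);
    // ... field is present on this kernel; safe to read ...
}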

bpftrace - High-Level Tracing

bpftrace is the easiest way to write eBPF programs.

bpftrace Basics

# Trace all syscalls
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'

# Trace open() calls with filename
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { 
    printf("%s opened %s\n", comm, str(args->filename)); 
}'

# Histogram of read() sizes
sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { 
    @size = hist(args->ret); 
}'

# Track time spent in functions
sudo bpftrace -e 'kprobe:do_sys_open { @start[tid] = nsecs; }
                  kretprobe:do_sys_open /@start[tid]/ { 
                      @duration = hist(nsecs - @start[tid]); 
                      delete(@start[tid]); 
                  }'

bpftrace One-Liners for Observability

# Top syscalls by process
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Disk I/O latency histogram (blk_account_io_* kprobes are kernel-version
# dependent; tracepoint:block:block_rq_issue/complete is the stable option)
sudo bpftrace -e 'kprobe:blk_account_io_start { @start[arg0] = nsecs; }
    kprobe:blk_account_io_done /@start[arg0]/ { 
        @us = hist((nsecs - @start[arg0]) / 1000); 
        delete(@start[arg0]); 
    }'

# TCP connections
sudo bpftrace -e 'kprobe:tcp_connect { 
    @[comm] = count(); 
    printf("%s connecting\n", comm); 
}'

# Page faults by process
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user { 
    @[comm] = count(); 
}'

# Memory allocations
sudo bpftrace -e 'tracepoint:kmem:kmalloc { 
    @bytes = hist(args->bytes_alloc); 
}'

# Context switches
sudo bpftrace -e 'tracepoint:sched:sched_switch { 
    @[args->prev_comm] = count(); 
}'

# Blocking (voluntary) context switches by process, a proxy for off-CPU events
sudo bpftrace -e 'tracepoint:sched:sched_switch { 
    @[args->prev_comm] = sum(args->prev_state != 0 ? 1 : 0); 
}'

bpftrace Variables

| Variable | Description |
|---|---|
| pid | Process ID |
| tid | Thread ID |
| uid | User ID |
| comm | Process name |
| nsecs | Timestamp (nanoseconds) |
| cpu | CPU number |
| curtask | Current task_struct pointer |
| args | Tracepoint arguments |
| retval | Return value (kretprobe) |

Production eBPF Tools

BCC (BPF Compiler Collection)

# Install BCC tools
sudo apt install bpfcc-tools linux-headers-$(uname -r)

# Trace process execution
sudo execsnoop-bpfcc

# Trace open() calls
sudo opensnoop-bpfcc

# Trace TCP connections
sudo tcpconnect-bpfcc
sudo tcpaccept-bpfcc
sudo tcpretrans-bpfcc

# Profile on-CPU time
sudo profile-bpfcc -F 99 10

# Trace block I/O
sudo biolatency-bpfcc

# Trace filesystem operations
sudo ext4slower-bpfcc 1

# Memory allocation tracing
sudo memleak-bpfcc

# Cache hit ratio
sudo cachestat-bpfcc

libbpf-based Tools

// Modern libbpf skeleton approach
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <bpf/libbpf.h>
#include "trace.skel.h"

static volatile sig_atomic_t exiting;

static void sig_handler(int sig) { exiting = 1; }

// Callback invoked for every record submitted to the ring buffer
static int handle_event(void *ctx, void *data, size_t data_sz)
{
    // Cast data to your event struct and process it here
    return 0;
}

int main(int argc, char **argv)
{
    struct ring_buffer *rb = NULL;
    struct trace_bpf *skel;
    int err;

    signal(SIGINT, sig_handler);

    // Open and load BPF program
    skel = trace_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to load BPF skeleton\n");
        return 1;
    }

    // Attach BPF programs
    err = trace_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF skeleton\n");
        goto cleanup;
    }

    // Set up ring buffer callback
    rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
                          handle_event, NULL, NULL);
    if (!rb) {
        err = -errno;
        fprintf(stderr, "Failed to create ring buffer\n");
        goto cleanup;
    }

    // Poll for events (100 ms timeout per iteration)
    while (!exiting) {
        err = ring_buffer__poll(rb, 100);
        if (err == -EINTR) {  // interrupted by a signal: clean exit
            err = 0;
            break;
        }
        if (err < 0)
            break;
    }

cleanup:
    ring_buffer__free(rb);
    trace_bpf__destroy(skel);
    return err < 0 ? 1 : 0;
}

XDP (eXpress Data Path)

XDP allows packet processing at the network driver level.

XDP Program

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>          // IPPROTO_ICMP
#include <linux/ip.h>
#include <bpf/bpf_endian.h>    // bpf_htons()
#include <bpf/bpf_helpers.h>   // SEC()

SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    
    // Parse Ethernet header (bounds check required by the verifier)
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    
    // Only handle IPv4
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;
    
    // Parse IP header
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    
    // Drop ICMP packets
    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;
    
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
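
Attach the compiled object with iproute2, e.g. sudo ip link set dev eth0 xdp obj xdp_drop.o sec xdp (the object file name here is assumed); detach with sudo ip link set dev eth0 xdp off.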

XDP Return Codes

| Code | Action |
|---|---|
| XDP_PASS | Pass to network stack |
| XDP_DROP | Drop packet |
| XDP_TX | Bounce back out same interface |
| XDP_REDIRECT | Redirect to another interface |
| XDP_ABORTED | Error, drop and trace |

Lab Exercises

Lab 1: bpftrace Tracing Scripts

Objective: Write basic tracing scripts

# List available tracepoints
sudo bpftrace -l 'tracepoint:*' | head -50

# Trace file opens with latency
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_openat {
    @start[tid] = nsecs;
    @fname[tid] = args->filename;
}
tracepoint:syscalls:sys_exit_openat /@start[tid]/ {
    $dur = (nsecs - @start[tid]) / 1000;
    printf("%s opened %s in %d us (fd=%d)\n", 
           comm, str(@fname[tid]), $dur, args->ret);
    delete(@start[tid]);
    delete(@fname[tid]);
}'

# Histogram of process lifetimes
sudo bpftrace -e '
tracepoint:sched:sched_process_fork {
    @birth[args->child_pid] = nsecs;
}
tracepoint:sched:sched_process_exit /@birth[args->pid]/ {
    @lifetime = hist((nsecs - @birth[args->pid]) / 1000000);
    delete(@birth[args->pid]);
}'

Lab 2: A Complete libbpf Program

Objective: Create a complete eBPF program with libbpf

// trace.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct event {
    u32 pid;
    u32 uid;
    char comm[16];
    char filename[256];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int tracepoint__syscalls__sys_enter_openat(
    struct trace_event_raw_sys_enter *ctx)
{
    struct event *e;
    
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->uid = bpf_get_current_uid_gid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_probe_read_user_str(&e->filename, sizeof(e->filename),
                            (void *)ctx->args[1]);
    
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

Build with:

clang -g -O2 -target bpf -c trace.bpf.c -o trace.bpf.o
bpftool gen skeleton trace.bpf.o > trace.skel.h
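
If the build environment lacks a vmlinux.h, generate one from the running kernel's BTF:

bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h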

Lab 3: Production Profiling

Objective: Profile production workloads

# CPU profiling with flame graphs
sudo profile-bpfcc -F 99 -f 30 > profile.txt
flamegraph.pl profile.txt > profile.svg

# Off-CPU analysis
sudo offcputime-bpfcc -f 30 > offcpu.txt
flamegraph.pl --color=io offcpu.txt > offcpu.svg

# Combined on/off CPU
sudo bpftrace -e '
profile:hz:99 { @on[kstack] = count(); }
tracepoint:sched:sched_switch { 
    @off[kstack] = count(); 
}' -c "sleep 10"

# Latency analysis
sudo funclatency-bpfcc do_sys_open -m

Interview Questions

What does the BPF verifier do, and why is it necessary?

Answer: The verifier ensures eBPF programs are safe to run in kernel context.

Why necessary:
  • eBPF runs with kernel privileges
  • Bugs could crash the kernel or leak data
  • Must guarantee termination (no infinite loops)
  • Must prevent out-of-bounds memory access
Key checks:
  1. Control flow: No unbounded loops, reachable exit
  2. Memory safety: All accesses bounds-checked
  3. Type safety: Correct helper function arguments
  4. Privilege: Capability checks for dangerous operations
Limitations:
  • Some valid programs are rejected (false positives)
  • Complex loop bounds are hard to prove
  • Instruction count limits

When should you use kprobes versus tracepoints?

Answer:

| Aspect | Kprobes | Tracepoints |
|---|---|---|
| Type | Dynamic | Static |
| Attach points | Any kernel function | Predefined only |
| ABI stability | None | Maintained |
| Overhead | Higher | Lower |
| Arguments | Read from stack/registers | Structured, documented |
| Cross-kernel | May break | Stable |

Best practices:
  • Use tracepoints when available (stable, efficient)
  • Use kprobes for specific functions not covered
  • CO-RE helps with kprobe portability
  • Document kprobe usage for maintenance

How would you debug latency in a production service with eBPF?

Answer:

1. Identify entry/exit points:

# Trace HTTP request handling
sudo bpftrace -e '
uprobe:/path/to/service:handleRequest { @start[tid] = nsecs; }
uretprobe:/path/to/service:handleRequest /@start[tid]/ {
    @latency = hist((nsecs - @start[tid]) / 1000000);
    delete(@start[tid]);
}'

2. Break down latency:
  • Trace syscalls (read, write, connect)
  • Trace specific functions (DB queries, cache lookups)
  • Measure off-CPU time (blocking)

3. Identify bottlenecks:

# Stack traces for slow operations
sudo bpftrace -e 'uprobe:... /@lat > 10000000/ { print(ustack); }'

4. Use a production-safe approach:
  • Start with low-overhead tracepoints
  • Sample rather than trace all events
  • Use ring buffer for event collection
  • Set reasonable map sizes

What is XDP, and when would you use it?

Answer:

What is XDP:
  • Runs eBPF at the network driver level
  • Before sk_buff allocation (very early)
  • Near-native speed packet processing
Use cases:
  • DDoS mitigation (drop malicious packets)
  • Load balancing (Facebook's Katran)
  • Packet filtering (Cloudflare)
  • Traffic steering
Performance:
  • Can process 10M+ packets per second per core
  • 10-100x faster than iptables for simple rules
Limitations:
  • Limited packet modification capabilities
  • Driver must support XDP for best performance
  • Complex protocols need the full network stack
Comparison with TC:
  • XDP: earlier, faster, limited context
  • TC: after sk_buff allocation, full networking features

Key Takeaways

eBPF Safety

Verifier ensures programs are safe before running in kernel

Maps for Data

BPF maps enable data sharing between kernel and user space

CO-RE Portability

Compile once, run everywhere with BTF and libbpf

bpftrace Power

High-level scripting for quick observability tasks

Next: Tracing Infrastructure →