eBPF Deep Dive
eBPF (extended Berkeley Packet Filter) is the technology revolutionizing observability, networking, and security in Linux. Companies like Datadog, Grafana, and Cloudflare use eBPF extensively. Mastering it is essential for infrastructure and observability engineering roles.
Interview Frequency: Very High (core skill for observability roles)
Key Topics: BPF architecture, program types, maps, verifier, bpftrace
Time to Master: 18-20 hours
What is eBPF?
eBPF allows running sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules.
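To make that concrete, here is a minimal sketch (illustrative names; assumes clang, libbpf headers, and a BTF-generated vmlinux.h) of a complete eBPF program that logs every execve() on the system via a stable tracepoint:
// minimal.bpf.c - minimal sketch: log every exec on the system
#include "vmlinux.h"            // kernel types generated from BTF
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_execve")
int log_exec(void *ctx)
{
    char comm[16];
    bpf_get_current_comm(&comm, sizeof(comm));
    // Output appears in /sys/kernel/debug/tracing/trace_pipe
    bpf_printk("exec by %s\n", comm);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
The rest of this section unpacks each piece of this model: program types, maps, the verifier, and the tooling around them.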
eBPF Program Types
Different attach points for different use cases:
Core Program Types
| Type | Attach Point | Use Case |
|------|--------------|----------|
| BPF_PROG_TYPE_KPROBE | Kernel function entry/exit | Tracing syscalls, kernel functions |
| BPF_PROG_TYPE_TRACEPOINT | Static kernel tracepoints | Stable tracing points |
| BPF_PROG_TYPE_PERF_EVENT | perf events (PMU, software) | Performance monitoring |
| BPF_PROG_TYPE_XDP | Network driver (before SKB) | High-performance packet processing |
| BPF_PROG_TYPE_SCHED_CLS | Traffic control classifier | Container networking |
| BPF_PROG_TYPE_SOCKET_FILTER | Socket | Packet filtering |
| BPF_PROG_TYPE_CGROUP_* | Cgroup hooks | Container resource control |
| BPF_PROG_TYPE_LSM | LSM hooks | Security policies |
Kprobes vs Tracepoints
┌─────────────────────────────────────────────────────────────────────────────┐
│ KPROBES VS TRACEPOINTS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ KPROBES (Dynamic) │
│ ──────────────── │
│ + Can attach to ANY kernel function │
│ + Very flexible for debugging │
│ - Unstable ABI (function signatures can change) │
│ - Higher overhead │
│ - May break between kernel versions │
│ │
│ TRACEPOINTS (Static) │
│ ─────────────────── │
│ + Stable ABI (maintained by kernel developers) │
│ + Lower overhead (optimized instrumentation) │
│ + Documented arguments │
│ - Limited to predefined points │
│ - May not cover everything you need │
│ │
│ Best Practice: Prefer tracepoints when available, use kprobes when needed │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
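As a hedged illustration of the trade-off, here is the same "observe clone()" logic attached both ways (fragment; assumes vmlinux.h and bpf_helpers.h). Only the SEC() string and context type change, but note that the kprobe symbol itself is unstable across kernels:
// Dynamic kprobe: attaches to a raw kernel symbol; the name itself
// is version-dependent (kernel_clone on 5.10+, _do_fork on older kernels)
SEC("kprobe/kernel_clone")
int count_clone_kprobe(void *ctx)
{
    bpf_printk("clone via kprobe\n");
    return 0;
}

// Static tracepoint: stable name, documented argument layout
SEC("tracepoint/syscalls/sys_enter_clone")
int count_clone_tp(struct trace_event_raw_sys_enter *ctx)
{
    bpf_printk("clone via tracepoint\n");
    return 0;
}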
Available Tracepoints
# List all tracepoints
sudo ls /sys/kernel/debug/tracing/events/
# List syscall tracepoints
sudo ls /sys/kernel/debug/tracing/events/syscalls/
# View tracepoint format
sudo cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
BPF Maps
Maps are key-value stores shared between eBPF programs and user space.
Map Types
// Common map types
BPF_MAP_TYPE_HASH // Hash table
BPF_MAP_TYPE_ARRAY // Array (fixed-size)
BPF_MAP_TYPE_PERCPU_HASH // Per-CPU hash (no locking)
BPF_MAP_TYPE_PERCPU_ARRAY // Per-CPU array
BPF_MAP_TYPE_RINGBUF // Ring buffer (efficient event streaming)
BPF_MAP_TYPE_PERF_EVENT_ARRAY // Per-CPU event buffer
BPF_MAP_TYPE_LRU_HASH // LRU evicting hash
BPF_MAP_TYPE_STACK_TRACE // Stack traces storage
BPF_MAP_TYPE_LPM_TRIE // Longest prefix match (routing)
Map Declaration (libbpf style)
// In eBPF program (kernel side)
struct {
__uint (type, BPF_MAP_TYPE_HASH);
__uint (max_entries, 10240 );
__type (key, u32); // Key: PID
__type (value, u64); // Value: count
} syscall_count SEC ( ".maps" );
// Using the map
SEC ( "tracepoint/syscalls/sys_enter_read" )
int trace_read ( struct trace_event_raw_sys_enter * ctx )
{
u32 pid = bpf_get_current_pid_tgid () >> 32 ;
u64 * count = bpf_map_lookup_elem ( & syscall_count, & pid);
if (count) {
( * count) ++ ;
} else {
u64 initial = 1 ;
bpf_map_update_elem ( & syscall_count, & pid, & initial, BPF_ANY);
}
return 0 ;
}
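User space can walk the same map through libbpf's syscall wrappers. A minimal sketch, assuming the map was pinned to /sys/fs/bpf/syscall_count (e.g. with bpftool map pin):
// dump_counts.c - minimal sketch: walk the pinned hash map from user space
#include <stdio.h>
#include <stdint.h>
#include <bpf/bpf.h>

int main(void)
{
    int fd = bpf_obj_get("/sys/fs/bpf/syscall_count");
    if (fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    uint32_t key, next_key;
    uint64_t value;
    uint32_t *cur = NULL; // NULL asks the kernel for the first key

    while (bpf_map_get_next_key(fd, cur, &next_key) == 0) {
        if (bpf_map_lookup_elem(fd, &next_key, &value) == 0)
            printf("PID %u: %llu reads\n",
                   next_key, (unsigned long long)value);
        key = next_key;
        cur = &key;
    }
    return 0;
}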
Ring Buffer vs Perf Buffer
┌─────────────────────────────────────────────────────────────────────────────┐
│ RINGBUF VS PERF_EVENT_ARRAY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PERF_EVENT_ARRAY (Legacy) │
│ ───────────────────────── │
│ - Per-CPU buffers (separate buffer per CPU) │
│ - User space must poll each CPU │
│ - Can lose events if buffer full │
│ - Higher memory overhead │
│ │
│ RINGBUF (Preferred, kernel 5.8+) │
│ ───────────────────────────────── │
│ - Single shared ring buffer │
│ - Automatic ordering of events │
│ - Reservation-based (no loss if size check) │
│ - More efficient memory usage │
│ - Variable-length records │
│ │
│ Always use RINGBUF for new code on kernel 5.8+ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
// Ring buffer usage
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); // 256 KB
} events SEC(".maps");

struct event {
    u32 pid;
    char comm[16];
    u64 timestamp;
};

SEC("tracepoint/sched/sched_process_exec")
int trace_exec(struct trace_event_raw_sched_process_exec *ctx)
{
    struct event *e;

    // Reserve space in ring buffer
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;

    // Fill event data
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->timestamp = bpf_ktime_get_ns();
    bpf_get_current_comm(&e->comm, sizeof(e->comm));

    // Submit to user space
    bpf_ringbuf_submit(e, 0);
    return 0;
}
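On the user-space side, each submitted record arrives in a callback registered with ring_buffer__new(). A minimal sketch of that consumer (the same handle_event name is referenced by the skeleton loader shown later):
// User-space callback: invoked once per record submitted to the ring buffer
#include <stdio.h>
#include <stdint.h>

struct event {            // must match the kernel-side layout exactly
    uint32_t pid;
    char comm[16];
    uint64_t timestamp;
};

static int handle_event(void *ctx, void *data, size_t len)
{
    const struct event *e = data;
    printf("%llu: %s (pid %u) exec'd\n",
           (unsigned long long)e->timestamp, e->comm, e->pid);
    return 0; // non-zero would abort ring_buffer__poll()
}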
The BPF Verifier
The verifier ensures eBPF programs are safe to run in the kernel.
Verifier Checks
┌─────────────────────────────────────────────────────────────────────────────┐
│ BPF VERIFIER CHECKS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. CONTROL FLOW │
│ - No unbounded loops (must have provable termination) │
│ - No unreachable instructions │
│ - Maximum instruction count (1M default) │
│ - Maximum stack depth (512 bytes) │
│ │
│ 2. MEMORY SAFETY │
│ - All memory accesses must be bounded │
│ - Pointer arithmetic checked │
│ - NULL checks before dereference │
│ - Stack access within bounds │
│ │
│ 3. TYPE SAFETY │
│ - Registers tracked for type │
│ - Helper function argument types checked │
│ - Map key/value types verified │
│ │
│ 4. PRIVILEGE CHECKS │
│ - Certain helpers require CAP_BPF or CAP_SYS_ADMIN │
│ - Some program types restricted │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Common Verifier Errors
// ERROR: Unbounded loop
for (int i = 0; i < n; i++) { } // n is unknown at verify time

// SOLUTION: Use a bounded loop
#pragma unroll
for (int i = 0; i < 100; i++) {
    if (i >= n)
        break;
}

// ERROR: Potential NULL dereference
u64 *value = bpf_map_lookup_elem(&my_map, &key);
*value = 42; // ERROR: value might be NULL

// SOLUTION: Check for NULL
u64 *value = bpf_map_lookup_elem(&my_map, &key);
if (value)
    *value = 42;

// ERROR: Out-of-bounds access
char buf[16];
buf[idx] = 'x'; // ERROR: idx could be >= 16

// SOLUTION: Bound the index
if (idx < 16)
    buf[idx] = 'x';
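Loops the verifier cannot bound can also be handed to the kernel: on 5.17+ the bpf_loop() helper drives a callback for a fixed number of iterations, so termination is guaranteed by the helper rather than proven from an unrolled body. A minimal sketch (fragment; assumes vmlinux.h and bpf_helpers.h):
// Callback runs once per iteration; returning 1 stops the loop early
static long count_cb(u64 index, void *ctx)
{
    u64 *sum = ctx;
    (*sum)++;
    return 0;
}

SEC("tracepoint/syscalls/sys_enter_read")
int use_bpf_loop(void *ctx)
{
    u64 sum = 0;
    // Accepted even for large iteration counts: the helper, not the
    // program body, guarantees termination
    bpf_loop(1000, count_cb, &sum, 0);
    bpf_printk("looped %llu times\n", sum);
    return 0;
}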
Verifier Debugging
# Get verbose verifier output
sudo bpftool prog load program.o /sys/fs/bpf/prog -d
# View loaded program with verifier log
sudo bpftool prog dump xlated id <prog_id>
Helper Functions
eBPF programs can call kernel-provided helper functions.
Common Helpers
// Get current time in nanoseconds
u64 ts = bpf_ktime_get_ns();

// Get current PID/TID
u64 pid_tgid = bpf_get_current_pid_tgid();
u32 pid = pid_tgid >> 32;
u32 tid = pid_tgid & 0xFFFFFFFF;

// Get current task's comm
char comm[16];
bpf_get_current_comm(&comm, sizeof(comm));

// Get stack trace
u64 stack_id = bpf_get_stackid(ctx, &stack_map, 0);

// Read from kernel memory
bpf_probe_read_kernel(&dst, size, src);

// Read from user memory
bpf_probe_read_user(&dst, size, src);

// Read a string from user memory
bpf_probe_read_user_str(&dst, size, src);

// Print debug message (limited, for development)
bpf_printk("PID %d called\n", pid);

// Send signal to current task
bpf_send_signal(SIGKILL);

// Get current cgroup ID
u64 cgroup_id = bpf_get_current_cgroup_id();
Available Helpers Per Program Type
# List helpers available for a program type
sudo bpftool feature probe kernel | grep -A 100 "kprobe" | head -50
BPF CO-RE (Compile Once, Run Everywhere)
CO-RE solves the problem of kernel version differences.
┌─────────────────────────────────────────────────────────────────────────────┐
│ BPF CO-RE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Without CO-RE: │
│ - Compile eBPF program for each kernel version │
│ - Must match exact struct layouts │
│ - Breaks when kernel changes struct fields │
│ │
│ With CO-RE: │
│ - Compile once with BTF (BPF Type Format) │
│ - libbpf adjusts offsets at load time │
│ - Works across kernel versions │
│ │
│ Example: struct task_struct │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Kernel 5.4: Kernel 5.15: │ │
│ │ struct task_struct { struct task_struct { │ │
│ │ ... ... │ │
│ │ pid_t pid; // offset 100 void *new_field; // added │ │
│ │ ... pid_t pid; // offset 108 (moved!) │ │
│ │ } } │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ CO-RE reads BTF from running kernel, adjusts offsets automatically │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
CO-RE Code Example
#include "vmlinux.h" // Generated from kernel BTF
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
SEC ( "kprobe/do_sys_open" )
int BPF_KPROBE (trace_open, int dfd , const char * filename )
{
struct task_struct * task = ( void * ) bpf_get_current_task ();
// CO-RE: works across kernel versions
pid_t pid = BPF_CORE_READ (task, pid);
pid_t tgid = BPF_CORE_READ (task, tgid);
// Read parent's PID (nested struct access)
pid_t ppid = BPF_CORE_READ (task, real_parent, pid);
bpf_printk ( "PID %d (parent %d ) opening file \n " , pid, ppid);
return 0 ;
}
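Beyond relocating offsets, CO-RE can test at load time whether a field exists at all. A hedged sketch using libbpf's bpf_core_field_exists() and a struct "flavor" for task_struct's state field, which was renamed __state around kernel 5.14 (fragment; assumes the same includes as above):
// Flavor type: carries the old field name so both spellings compile
struct task_struct___old {
    long state;
} __attribute__((preserve_access_index));

SEC("kprobe/do_sys_open")
int BPF_KPROBE(trace_task_state)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    long state;

    if (bpf_core_field_exists(task->__state)) {
        state = BPF_CORE_READ(task, __state);   // kernel 5.14+
    } else {
        struct task_struct___old *t = (void *)task;
        state = BPF_CORE_READ(t, state);        // older kernels
    }

    bpf_printk("task state: %ld\n", state);
    return 0;
}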
bpftrace - High-Level Tracing
bpftrace is the easiest way to write eBPF programs.
bpftrace Basics
# Trace all syscalls
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
# Trace open() calls with filename
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
printf("%s opened %s\n", comm, str(args->filename));
}'
# Histogram of read() sizes
sudo bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
@size = hist(args->ret);
}'
# Track time spent in functions
sudo bpftrace -e 'kprobe:do_sys_open { @start[tid] = nsecs; }
kretprobe:do_sys_open /@start[tid]/ {
@duration = hist(nsecs - @start[tid]);
delete(@start[tid]);
}'
bpftrace One-Liners for Observability
# Top syscalls by process
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Disk I/O latency histogram
sudo bpftrace -e 'kprobe:blk_account_io_start { @start[arg0] = nsecs; }
kprobe:blk_account_io_done /@start[arg0]/ {
@us = hist((nsecs - @start[arg0]) / 1000);
delete(@start[arg0]);
}'
# TCP connections
sudo bpftrace -e 'kprobe:tcp_connect {
@[comm] = count();
printf("%s connecting\n", comm);
}'
# Page faults by process
sudo bpftrace -e 'tracepoint:exceptions:page_fault_user {
@[comm] = count();
}'
# Memory allocations
sudo bpftrace -e 'tracepoint:kmem:kmalloc {
@bytes = hist(args->bytes_alloc);
}'
# Context switches
sudo bpftrace -e 'tracepoint:sched:sched_switch {
@[args->prev_comm] = count();
}'
# Blocking (voluntary) context switches per process - a rough proxy for off-CPU events
sudo bpftrace -e 'tracepoint:sched:sched_switch {
    @[args->prev_comm] = sum(args->prev_state != 0 ? 1 : 0);
}'
bpftrace Variables
| Variable | Description |
|----------|-------------|
| pid | Process ID |
| tid | Thread ID |
| uid | User ID |
| comm | Process name |
| nsecs | Timestamp (nanoseconds) |
| cpu | CPU number |
| curtask | Current task_struct pointer |
| args | Tracepoint arguments |
| retval | Return value (kretprobe) |
BCC (BPF Compiler Collection)
# Install BCC tools
sudo apt install bpfcc-tools linux-headers-$(uname -r)
# Trace process execution
sudo execsnoop-bpfcc
# Trace open() calls
sudo opensnoop-bpfcc
# Trace TCP connections
sudo tcpconnect-bpfcc
sudo tcpaccept-bpfcc
sudo tcpretrans-bpfcc
# Profile on-CPU time
sudo profile-bpfcc -F 99 10
# Trace block I/O
sudo biolatency-bpfcc
# Trace filesystem operations
sudo ext4slower-bpfcc 1
# Memory allocation tracing
sudo memleak-bpfcc
# Cache hit ratio
sudo cachestat-bpfcc
libbpf - Modern Development
// Modern libbpf skeleton approach
#include <stdio.h>
#include <bpf/libbpf.h>
#include "trace.skel.h"

int main(int argc, char **argv)
{
    struct trace_bpf *skel;
    struct ring_buffer *rb = NULL;
    int err;

    // Open and load BPF program
    skel = trace_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to load BPF skeleton\n");
        return 1;
    }

    // Attach BPF programs
    err = trace_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF skeleton\n");
        goto cleanup;
    }

    // Set up ring buffer callback (handle_event defined by the consumer)
    rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
                          handle_event, NULL, NULL);

    // Poll for events until asked to exit
    while (!exiting) {
        err = ring_buffer__poll(rb, 100 /* timeout, ms */);
        // Handle events...
    }

cleanup:
    ring_buffer__free(rb);
    trace_bpf__destroy(skel);
    return err;
}
XDP (eXpress Data Path)
XDP allows packet processing at the network driver level.
XDP Program
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h> // For bpf_htons

SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Parse Ethernet header (bounds check required by the verifier)
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // Only handle IPv4
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    // Parse IP header
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // Drop ICMP packets
    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;

    return XDP_PASS;
}
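To put the program on an interface you can use iproute2 (ip link set dev ... xdp obj ...) or libbpf. A minimal user-space sketch, assuming libbpf 1.0+, an object file named xdp_drop_icmp.o, and eth0 as the interface:
// attach_xdp.c - minimal sketch: load the object above and attach to eth0
#include <stdio.h>
#include <net/if.h>
#include <linux/if_link.h>
#include <bpf/libbpf.h>

int main(void)
{
    struct bpf_object *obj = bpf_object__open_file("xdp_drop_icmp.o", NULL);
    if (!obj || bpf_object__load(obj)) {
        fprintf(stderr, "failed to open/load BPF object\n");
        return 1;
    }

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "xdp_drop_icmp");
    int ifindex = if_nametoindex("eth0"); // interface name is an assumption

    // Request native (driver) mode; there is no automatic fallback to
    // generic mode, so this fails if the driver lacks XDP support
    if (!prog || ifindex == 0 ||
        bpf_xdp_attach(ifindex, bpf_program__fd(prog),
                       XDP_FLAGS_DRV_MODE, NULL) < 0) {
        fprintf(stderr, "failed to attach XDP program\n");
        return 1;
    }
    return 0;
}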
XDP Return Codes
| Code | Action |
|------|--------|
| XDP_PASS | Pass to network stack |
| XDP_DROP | Drop packet |
| XDP_TX | Bounce back out the same interface |
| XDP_REDIRECT | Redirect to another interface |
| XDP_ABORTED | Error; drop and trace |
Lab Exercises
Lab 1: First bpftrace Script
Objective: Write basic tracing scripts.
# List available tracepoints
sudo bpftrace -l 'tracepoint:*' | head -50
# Trace file opens with latency
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_openat {
@start[tid] = nsecs;
@fname[tid] = args->filename;
}
tracepoint:syscalls:sys_exit_openat /@start[tid]/ {
$dur = (nsecs - @start[tid]) / 1000;
printf("%s opened %s in %d us (fd=%d)\n",
comm, str(@fname[tid]), $dur, args->ret);
delete(@start[tid]);
delete(@fname[tid]);
}'
# Histogram of process lifetimes
sudo bpftrace -e '
tracepoint:sched:sched_process_fork {
@birth[args->child_pid] = nsecs;
}
tracepoint:sched:sched_process_exit /@birth[args->pid]/ {
@lifetime = hist((nsecs - @birth[args->pid]) / 1000000);
delete(@birth[args->pid]);
}'
Lab 2: Write libbpf Program
Objective: Create a complete eBPF program with libbpf.
// trace.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
struct event {
    u32 pid;
    u32 uid;
    char comm[16];
    char filename[256];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int tracepoint__syscalls__sys_enter_openat(struct trace_event_raw_sys_enter *ctx)
{
    struct event *e;

    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;

    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->uid = bpf_get_current_uid_gid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_probe_read_user_str(&e->filename, sizeof(e->filename),
                            (void *)ctx->args[1]);

    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
Build with: clang -g -O2 -target bpf -c trace.bpf.c -o trace.bpf.o
bpftool gen skeleton trace.bpf.o > trace.skel.h
Lab 3: Performance Profiling with eBPF
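No exercise body is given here, so the following is only one possible direction: a hedged kernel-side sketch of a sampling CPU profiler in the libbpf style (assumes user space opens a perf event per CPU at, say, 99 Hz, attaches this program, and later symbolizes the collected stack IDs):
// profile.bpf.c - sketch of a sampling profiler (kernel side only)
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_STACK_DEPTH 127

struct stack_key {
    u32 pid;
    int kstack_id;
    int ustack_id;
};

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 10240);
    __uint(key_size, sizeof(u32));
    __uint(value_size, MAX_STACK_DEPTH * sizeof(u64));
} stacks SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, struct stack_key);
    __type(value, u64);
} counts SEC(".maps");

// Fires once per sample on the CPU that took the perf interrupt
SEC("perf_event")
int do_sample(struct bpf_perf_event_data *ctx)
{
    struct stack_key key = {};
    u64 *count, one = 1;

    key.pid = bpf_get_current_pid_tgid() >> 32;
    key.kstack_id = bpf_get_stackid(ctx, &stacks, 0);
    key.ustack_id = bpf_get_stackid(ctx, &stacks, BPF_F_USER_STACK);

    count = bpf_map_lookup_elem(&counts, &key);
    if (count)
        __sync_fetch_and_add(count, 1);
    else
        bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
For a ready-made version of the same idea, compare profile-bpfcc from the BCC section above.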
Interview Questions
Q1: Explain the eBPF verifier and why it's necessary
Answer: The verifier ensures eBPF programs are safe to run in kernel context.
Why it's necessary:
eBPF runs with kernel privileges
Bugs could crash the kernel or leak data
Must guarantee termination (no infinite loops)
Must prevent out-of-bounds memory access
Key checks:
Control flow: No unbounded loops, reachable exit
Memory safety: All accesses bounds-checked
Type safety: Correct helper function arguments
Privilege: Capability checks for dangerous operations
Limitations:
Some safe programs are rejected (false positives)
Complex loop bounds are hard to prove
Instruction count limits
Q2: What's the difference between kprobes and tracepoints?
Answer:
| Aspect | Kprobes | Tracepoints |
|--------|---------|-------------|
| Type | Dynamic | Static |
| Attach points | Any function | Predefined only |
| ABI stability | None | Maintained |
| Overhead | Higher | Lower |
| Arguments | Read from stack/registers | Structured, documented |
| Cross-kernel | May break | Stable |
Best practices:
Use tracepoints when available (stable, efficient)
Use kprobes for specific functions not covered
CO-RE helps with kprobe portability
Document kprobe usage for maintenance
Q3: How would you use eBPF to debug latency in a microservice?
Answer: Approach:
1. Identify entry/exit points:
# Trace HTTP request handling
sudo bpftrace -e '
uprobe:/path/to/service:handleRequest { @start[tid] = nsecs; }
uretprobe:/path/to/service:handleRequest /@start[tid]/ {
@latency = hist((nsecs - @start[tid]) / 1000000);
delete(@start[tid]);
}'
2. Break down latency:
Trace syscalls (read, write, connect)
Trace specific functions (DB queries, cache lookups)
Measure off-CPU time (blocking)
3. Identify bottlenecks:
# Stack traces for slow operations
sudo bpftrace -e 'uprobe:... /@lat > 10000000/ { print(ustack); }'
Production-safe approach:
Start with low-overhead tracepoints
Sample rather than trace all events
Use ring buffer for event collection
Set reasonable map sizes
Q4: Explain XDP and when you would use it
Answer: What is XDP:
Runs eBPF at the network driver level
Before sk_buff allocation (very early)
Near-native-speed packet processing
Use cases:
DDoS mitigation (drop malicious packets)
Load balancing (Facebook's Katran)
Packet filtering (Cloudflare)
Traffic steering
Performance:
Can process 10M+ packets per second per core
10-100x faster than iptables for simple rules
Limitations:
Limited packet modification capabilities
Driver must support XDP
Complex protocols need the full network stack
Comparison with TC:
XDP: earlier, faster, more limited
TC: after sk_buff allocation, full networking features
Key Takeaways
eBPF Safety: The verifier ensures programs are safe before they run in the kernel
Maps for Data: BPF maps enable data sharing between kernel and user space
CO-RE Portability: Compile once, run everywhere with BTF and libbpf
bpftrace Power: High-level scripting for quick observability tasks
Next: Tracing Infrastructure →