eBPF Deep Dive
eBPF (extended Berkeley Packet Filter) is the technology revolutionizing observability, networking, and security in Linux. Companies like Datadog, Grafana, and Cloudflare use eBPF extensively. Mastering it is essential for infrastructure and observability engineering roles.
Key Topics: BPF architecture, program types, maps, verifier, bpftrace
Time to Master: 18-20 hours
What is eBPF?
eBPF allows running sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules.
eBPF Program Types
Different attach points for different use cases:
Core Program Types
| Type | Attach Point | Use Case |
|---|---|---|
| BPF_PROG_TYPE_KPROBE | Kernel function entry/exit | Tracing syscalls, kernel functions |
| BPF_PROG_TYPE_TRACEPOINT | Static kernel tracepoints | Stable tracing points |
| BPF_PROG_TYPE_PERF_EVENT | perf events (PMU, software) | Performance monitoring |
| BPF_PROG_TYPE_XDP | Network driver (before SKB) | High-performance packet processing |
| BPF_PROG_TYPE_SCHED_CLS | Traffic control classifier | Container networking |
| BPF_PROG_TYPE_SOCKET_FILTER | Socket | Packet filtering |
| BPF_PROG_TYPE_CGROUP_* | Cgroup hooks | Container resource control |
| BPF_PROG_TYPE_LSM | LSM hooks | Security policies |
Kprobes vs Tracepoints
Available Tracepoints
BPF Maps
Maps are key-value stores shared between eBPF programs and user space.
Map Types
Map Declaration (libbpf style)
Ring Buffer vs Perf Buffer
The BPF Verifier
The verifier ensures eBPF programs are safe to run in the kernel.
Verifier Checks
Common Verifier Errors
Verifier Debugging
Helper Functions
eBPF programs can call kernel-provided helper functions.
Common Helpers
Available Helpers Per Program Type
BPF CO-RE (Compile Once, Run Everywhere)
CO-RE solves the problem of kernel version differences.
CO-RE Code Example
bpftrace - High-Level Tracing
bpftrace is the easiest way to write eBPF programs.
bpftrace Basics
bpftrace One-Liners for Observability
bpftrace Variables
| Variable | Description |
|---|---|
| pid | Process ID |
| tid | Thread ID |
| uid | User ID |
| comm | Process name |
| nsecs | Timestamp (nanoseconds) |
| cpu | CPU number |
| curtask | Current task_struct pointer |
| args | Tracepoint arguments |
| retval | Return value (kretprobe) |
Production eBPF Tools
BCC (BPF Compiler Collection)
libbpf-based Tools
XDP (eXpress Data Path)
XDP allows packet processing at the network driver level.
XDP Program
XDP Return Codes
| Code | Action |
|---|---|
| XDP_PASS | Pass to network stack |
| XDP_DROP | Drop packet |
| XDP_TX | Bounce back out same interface |
| XDP_REDIRECT | Redirect to another interface |
| XDP_ABORTED | Error, drop and trace |
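The return codes above can be exercised with a small userspace model. The sketch below is plain C, not a loadable BPF program: it mirrors the check-bounds-then-read pattern an XDP parser follows, with a byte buffer standing in for the packet. The enum values match `enum xdp_action` in `<linux/bpf.h>`; the function name `classify`, the `PROTO_UDP` constant name, and the drop-UDP policy are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Action codes; the numeric values mirror enum xdp_action in <linux/bpf.h>. */
enum xdp_action_model {
    XDP_ABORTED  = 0,
    XDP_DROP     = 1,
    XDP_PASS     = 2,
    XDP_TX       = 3,
    XDP_REDIRECT = 4,
};

#define ETH_HDR_LEN  14      /* dst MAC + src MAC + EtherType */
#define IPV4_MIN_LEN 20      /* IPv4 header without options */
#define ETHERTYPE_IP 0x0800
#define PROTO_UDP    17      /* hypothetical local name for the UDP protocol number */

/* Hypothetical policy: drop IPv4/UDP frames, pass everything else. */
static enum xdp_action_model classify(const uint8_t *data, const uint8_t *data_end)
{
    size_t len = (size_t)(data_end - data);

    /* Bounds check before reading the Ethernet header. In a real XDP
     * program the equivalent check compares pointers against data_end. */
    if (len < ETH_HDR_LEN)
        return XDP_PASS;

    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype != ETHERTYPE_IP)
        return XDP_PASS;

    /* Second bounds check before reading the IPv4 header. */
    if (len < ETH_HDR_LEN + IPV4_MIN_LEN)
        return XDP_PASS;     /* truncated: leave the decision to the stack */

    const uint8_t *ip = data + ETH_HDR_LEN;
    return ip[9] == PROTO_UDP ? XDP_DROP : XDP_PASS;   /* byte 9 = protocol */
}
```

Every read is preceded by a length check, which is the property the verifier insists on for packet access; skipping either check in real BPF code would cause the program to be rejected at load time.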
Lab Exercises
Lab 1: First bpftrace Script
Lab 2: Write libbpf Program
Lab 3: Performance Profiling with eBPF
Interview Questions
Q1: Explain the eBPF verifier and why it's necessary
Why it is necessary:
- eBPF runs with kernel privileges
- Bugs could crash the kernel or leak data
- Must guarantee termination (no infinite loops)
- Must prevent out-of-bounds memory access
What it checks:
- Control flow: No unbounded loops, reachable exit
- Memory safety: All accesses bounds-checked
- Type safety: Correct helper function arguments
- Privilege: Capability checks for dangerous operations
Limitations:
- Some safe programs are rejected because the analysis is conservative
- Complex loop bounds are hard to prove
- Instruction count limits
Q2: What's the difference between kprobes and tracepoints?
| Aspect | Kprobes | Tracepoints |
|---|---|---|
| Type | Dynamic | Static |
| Attach points | Any function | Predefined only |
| ABI stability | None | Maintained |
| Overhead | Higher | Lower |
| Arguments | Read from stack/regs | Structured, documented |
| Cross-kernel | May break | Stable |
- Use tracepoints when available (stable, efficient)
- Use kprobes for specific functions not covered
- CO-RE helps with kprobe portability
- Document kprobe usage for maintenance
Q3: How would you use eBPF to debug latency in a microservice?
- Identify entry/exit points:
- Break down latency:
  - Trace syscalls (read, write, connect)
  - Trace specific functions (DB queries, cache lookups)
  - Measure off-CPU time (blocking)
- Identify bottlenecks:
- Production-safe approach:
  - Start with low-overhead tracepoints
  - Sample rather than trace all events
  - Use ring buffer for event collection
  - Set reasonable map sizes
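The entry/exit pairing behind these steps can be sketched in plain C. This is a userspace model, not BPF code: a fixed-size table stands in for a BPF hash map keyed by thread ID, and the names `tracker`, `probe_entry`, `probe_exit`, and `MAX_INFLIGHT` are hypothetical. A real probe would take the timestamp from a helper such as `bpf_ktime_get_ns()` instead of receiving it as a parameter.

```c
#include <stdint.h>

#define MAX_INFLIGHT 64   /* bounded, like a BPF map's max_entries */

struct start_slot {
    uint32_t tid;
    uint64_t start_ns;
    int      used;
};

struct tracker {
    struct start_slot slots[MAX_INFLIGHT];
    uint64_t dropped;     /* entries lost because the table was full */
};

/* Entry probe: remember when this thread entered the traced operation. */
static void probe_entry(struct tracker *t, uint32_t tid, uint64_t now_ns)
{
    struct start_slot *free_slot = 0;

    for (int i = 0; i < MAX_INFLIGHT; i++) {
        struct start_slot *s = &t->slots[i];
        if (s->used && s->tid == tid) {   /* re-entry: overwrite the timestamp */
            s->start_ns = now_ns;
            return;
        }
        if (!s->used && !free_slot)
            free_slot = s;
    }
    if (!free_slot) {                     /* table full: count the loss and move on */
        t->dropped++;
        return;
    }
    free_slot->tid = tid;
    free_slot->start_ns = now_ns;
    free_slot->used = 1;
}

/* Exit probe: returns 1 and writes the latency if a matching entry exists,
 * otherwise returns 0 (for example, the entry was dropped or never seen). */
static int probe_exit(struct tracker *t, uint32_t tid, uint64_t now_ns,
                      uint64_t *latency_ns)
{
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        struct start_slot *s = &t->slots[i];
        if (s->used && s->tid == tid) {
            *latency_ns = now_ns - s->start_ns;
            s->used = 0;                  /* free the slot for the next request */
            return 1;
        }
    }
    return 0;
}
```

Counting dropped entries instead of blocking keeps the probe cheap and makes data loss visible, which matches the bounded-map advice in the list above.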
Q4: Explain XDP and when you would use it
What it is:
- Runs eBPF at network driver level
- Before sk_buff allocation (very early)
- Near-native speed packet processing
Use cases:
- DDoS mitigation (drop malicious packets)
- Load balancing (Facebook’s Katran)
- Packet filtering (Cloudflare)
- Traffic steering
Performance:
- Can process 10M+ packets per second per core
- 10-100x faster than iptables for simple rules
Limitations:
- Limited packet modification capabilities
- Driver must support XDP
- Complex protocols need network stack
XDP vs TC:
- XDP: Earlier, faster, limited
- TC: After sk_buff, full networking features
Key Takeaways
eBPF Safety
Maps for Data
CO-RE Portability
bpftrace Power
Interview Deep-Dive
You need to build a production monitoring tool that tracks the latency of every disk I/O operation per container. Describe your eBPF-based approach, including the specific program types, maps, and how you would handle the per-container attribution.
- I would use two tracepoint programs: `tracepoint/block/block_rq_issue` to record when a block I/O request is dispatched to the device driver, and `tracepoint/block/block_rq_complete` to record when it completes. On issue, I would store the timestamp keyed by `(dev, sector)` in a BPF hash map. On completion, I would look up the start time, compute the delta, and emit a latency event.
- For per-container attribution, I would use `bpf_get_current_cgroup_id()` at the issue tracepoint to capture the cgroup ID of the process that initiated the I/O. I would store this alongside the timestamp in the hash map, so the completion handler can attribute the latency to the correct container even though the completion runs in interrupt context (where the “current” task is arbitrary).
- For efficient data export, I would use a `BPF_MAP_TYPE_PERCPU_HASH` map keyed by `(cgroup_id, latency_bucket)` to build a per-container latency histogram directly in kernel space. The user-space agent reads this map periodically (every 5-10 seconds), aggregates across CPUs, and exports to Prometheus. This approach avoids per-event ring buffer overhead.
- For production safety: bounded map sizes (10240 entries), PERCPU maps to avoid lock contention, and programs attached to stable tracepoints (not kprobes) for kernel version compatibility.
- If the `(dev, sector)` tracking map fills up, `bpf_map_update_elem()` fails with `-E2BIG` and the new entry is not stored. The corresponding completion event will not find a matching start timestamp and will skip that I/O. This means we lose visibility into some requests during extreme load, but the program remains safe and does not crash or block I/O. To mitigate, I would size the map based on the expected maximum I/O queue depth across all devices (for NVMe with 64K queue depth per queue, this could be large). I would also add a per-CPU counter for dropped entries so the monitoring system can alert when we are losing data.
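The histogram keyed by `(cgroup_id, latency_bucket)` described above can be modelled in userspace C. This is a sketch under stated assumptions, not the BPF program itself: a fixed array stands in for the per-CPU hash map, power-of-two bucketing is one common choice rather than something the text prescribes, and the names `log2_bucket`, `hist_table`, `record_latency`, `NUM_BUCKETS`, and `MAX_CGROUPS` are hypothetical.

```c
#include <stdint.h>

#define NUM_BUCKETS 32
#define MAX_CGROUPS 16    /* bounded, like a BPF map's max_entries */

/* Power-of-two bucket index: floor(log2(v)), with 0 and 1 both in bucket 0.
 * The loop is bounded by NUM_BUCKETS, as a BPF loop would have to be. */
static unsigned int log2_bucket(uint64_t v)
{
    unsigned int b = 0;
    while (v > 1 && b < NUM_BUCKETS - 1) {
        v >>= 1;
        b++;
    }
    return b;
}

struct cgroup_hist {
    uint64_t cgroup_id;
    int      used;
    uint64_t counts[NUM_BUCKETS];
};

struct hist_table {
    struct cgroup_hist rows[MAX_CGROUPS];
    uint64_t dropped;     /* samples lost because the table was full */
};

/* Completion handler model: attribute one latency sample to a cgroup. */
static void record_latency(struct hist_table *t, uint64_t cgroup_id,
                           uint64_t latency_us)
{
    struct cgroup_hist *row = 0;

    for (int i = 0; i < MAX_CGROUPS; i++) {
        if (t->rows[i].used && t->rows[i].cgroup_id == cgroup_id) {
            row = &t->rows[i];            /* existing row wins over a free one */
            break;
        }
        if (!t->rows[i].used && !row)
            row = &t->rows[i];            /* remember the first free row */
    }
    if (!row) {                           /* full: count the drop, do not block */
        t->dropped++;
        return;
    }
    if (!row->used) {
        row->used = 1;
        row->cgroup_id = cgroup_id;
    }
    row->counts[log2_bucket(latency_us)]++;
}
```

Aggregating into buckets in the kernel means user space reads a small, fixed amount of data per container on each poll, regardless of how many I/O requests completed in the interval.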
Explain the BPF verifier in detail. What invariants does it enforce, why are bounded loops necessary, and describe a real scenario where a valid program is rejected by the verifier.
- The BPF verifier is a static analyzer that runs at program load time (before any execution) and simulates every possible execution path through the program. It tracks the type, value range, and liveness of each register and stack slot at every instruction.
- Key invariants enforced: First, termination: every loop must have a provable upper bound. The verifier tracks loop iterations and rejects programs where it cannot prove the loop exits within a bounded number of iterations. Second, memory safety: every pointer dereference must be preceded by a bounds check. If `bpf_map_lookup_elem()` returns a pointer, the verifier marks it as “possibly NULL” and requires an explicit NULL check before dereferencing. Third, type safety: the verifier tracks which registers contain pointers to map values, packet data, stack, or scalars, and ensures they are used correctly (you cannot pass a packet pointer where a map pointer is expected). Fourth, privilege: certain helper functions require `CAP_BPF` or `CAP_SYS_ADMIN`.
- A real rejection scenario: you write a program that iterates over a linked list of variable length in packet data. Even if you add `if (i >= MAX_ITERATIONS) break;`, the verifier might reject it because the packet pointer arithmetic creates too many possible states. The number of states the verifier has to explore can grow exponentially with the number of branches, and complex pointer arithmetic with conditional branches can exceed the instruction count limit (1 million verified instructions). The fix is to restructure the code to reduce branching complexity, use the `bpf_loop()` helper (kernel 5.17+), or split the logic into multiple tail-called programs.
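The loop shape the verifier can accept, a compile-time iteration cap plus a bounds check before every read, can be illustrated in plain C. This is a userspace sketch, not BPF code; the type-length-value option format, the function name `count_options`, and the `MAX_OPTIONS` cap are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_OPTIONS 8   /* compile-time cap: the loop cannot run longer than this */

/* Walks a buffer of options, each encoded as: 1-byte type, 1-byte value
 * length, then the value bytes. Returns the number of options parsed
 * (at most MAX_OPTIONS), or -1 if an option would run past the buffer. */
static int count_options(const uint8_t *data, size_t len)
{
    size_t off = 0;
    int n = 0;

    for (int i = 0; i < MAX_OPTIONS; i++) {
        if (off == len)
            break;                      /* clean end of data */
        if (len - off < 2)
            return -1;                  /* header would run past the end */

        size_t vlen = data[off + 1];
        if (len - off - 2 < vlen)
            return -1;                  /* value would run past the end */

        off += 2 + vlen;
        n++;
    }
    return n;
}
```

The iteration cap gives a termination proof that does not depend on packet contents, and each read is dominated by a check against `len`, so no path through the loop can touch memory outside the buffer.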
- CO-RE works by embedding relocation information in the compiled BPF program’s ELF file. When you use `BPF_CORE_READ(task, pid)`, the compiler emits a relocation record saying “access field `pid` at offset X in struct `task_struct`.” At load time, libbpf reads the running kernel’s BTF (BPF Type Format) data to find the actual offset of `pid` in the current kernel’s `task_struct`, which may differ from the compile-time offset. libbpf patches the BPF instructions to use the correct offset before submitting to the verifier. The verifier then sees a program with correct offsets for the running kernel and validates it normally. If the field does not exist at all (e.g., a field removed in a newer kernel), libbpf can detect this and handle it gracefully (returning a default value or disabling that part of the program).
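The patch-at-load-time idea can be shown with a toy model in plain C. None of this is libbpf's actual data layout or API: `field_info` stands in for the target kernel's BTF, `load_insn` for a BPF load instruction, `core_reloc` for a relocation record, and `apply_relocs` for the loader step. All names and offsets are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the target kernel's BTF: field name -> byte offset. */
struct field_info {
    const char *name;
    uint32_t    offset;
};

/* Toy stand-in for a load instruction whose offset was fixed at compile time. */
struct load_insn {
    uint32_t offset;
    int      valid;
};

/* Toy relocation record: "instruction insn_idx reads the named field". */
struct core_reloc {
    size_t      insn_idx;
    const char *field;
};

/* Patch each recorded instruction with the offset found in the target
 * description. Returns how many relocations could not be resolved; the
 * affected instructions are marked invalid rather than left with a
 * stale compile-time offset. */
static int apply_relocs(struct load_insn *insns,
                        const struct core_reloc *relocs, size_t nrelocs,
                        const struct field_info *target, size_t nfields)
{
    int unresolved = 0;

    for (size_t r = 0; r < nrelocs; r++) {
        struct load_insn *insn = &insns[relocs[r].insn_idx];
        int found = 0;

        for (size_t f = 0; f < nfields; f++) {
            if (strcmp(target[f].name, relocs[r].field) == 0) {
                insn->offset = target[f].offset;   /* rewrite to the target offset */
                found = 1;
                break;
            }
        }
        if (!found) {
            insn->valid = 0;                       /* field missing on this kernel */
            unresolved++;
        }
    }
    return unresolved;
}
```

The key point the model captures is that the program text is rewritten before verification, so the verifier only ever sees offsets that are correct for the running kernel.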
Compare XDP and TC (traffic control) BPF programs for packet processing. When would you choose each, and what are the performance characteristics?
- XDP (eXpress Data Path) programs run at the earliest possible point in the network receive path, before the kernel allocates an `sk_buff` structure. They operate on raw `xdp_md` contexts with direct packet data access. TC BPF programs run later, after `sk_buff` allocation, at the traffic control layer in both ingress and egress paths.
- Performance difference is significant: XDP can process 10-20 million packets per second per core because it avoids the overhead of sk_buff allocation (~200-300 cycles per packet). TC processes 2-5 million packets per second per core, which is still much faster than iptables but slower than XDP.
- I would choose XDP for: DDoS mitigation (drop malicious packets before they consume memory), load balancing (redirect packets to different NICs or CPUs), and simple packet filtering (firewalling at line rate). Facebook’s Katran load balancer and Cloudflare’s DDoS mitigation use XDP.
- I would choose TC for: more complex packet manipulation (full sk_buff available with all parsed headers), egress path processing (XDP only works on ingress), container networking (Cilium uses TC BPF for pod-to-pod traffic because it needs access to socket-level metadata), and when compatibility with the full networking stack is needed (TC programs can interact with connection tracking, netfilter marks, etc.).
- XDP limitations: cannot modify packets that need fragmentation (no access to GSO), cannot directly interact with the socket layer, requires NIC driver support for native mode (falls back to generic/slower mode otherwise).
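The packet rates and cycle counts quoted above are related by simple arithmetic, sketched here in C. The 3 GHz clock used in the usage note is an assumption for illustration, not a figure from the text, and the function names are hypothetical.

```c
#include <stdint.h>

/* Cycles available per packet on one core at a given packet rate. */
static uint64_t cycles_per_packet(uint64_t core_hz, uint64_t pps)
{
    return core_hz / pps;
}

/* Packet rate one core can sustain if each packet costs this many cycles. */
static uint64_t max_pps(uint64_t core_hz, uint64_t cycles_per_pkt)
{
    return core_hz / cycles_per_pkt;
}
```

On an assumed 3 GHz core, `cycles_per_packet(3000000000, 10000000)` gives a budget of 300 cycles per packet at 10 Mpps. Adding a further 300 cycles per packet, the upper end of the sk_buff allocation cost quoted above, halves the sustainable rate: `max_pps(3000000000, 600)` is 5 Mpps. This is why skipping the allocation matters so much at these rates.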
- AF_XDP is a socket type that works with XDP to deliver raw packets directly to user space without kernel copying. An XDP program uses `XDP_REDIRECT` with `bpf_redirect_map()` to send packets to an `XSKMAP` (XDP socket map). The user-space application creates an AF_XDP socket with shared UMEM (user memory), where packet data is DMA’d directly from the NIC into user-space-accessible memory. This achieves true zero-copy: the packet data is never copied by the kernel. I would use AF_XDP when the application needs to process every packet (not just filter/drop), such as custom protocol implementations, high-frequency trading network stacks, or DPDK-like applications that want kernel bypass without the complexity of a full DPDK setup. The trade-off versus pure XDP is that AF_XDP requires user-space processing latency, while XDP programs run to completion in the kernel.