Hands-on Projects

Theory is essential, but hands-on practice is what separates good candidates from great ones. These projects will help you build practical skills that interviewers look for. Each project is designed to force you through the same kernel interfaces and trade-offs that production infrastructure tools deal with daily. By the end, you will not just know what namespaces or eBPF are — you will have felt the friction of using them, debugged the edge cases, and built the muscle memory that interviewers can sense immediately.
Skill Level: Intermediate to Advanced
Time Investment: 20-40 hours total
Outcome: Portfolio pieces for interviews + deep understanding

Project 1: Build a Container from Scratch

Difficulty: Medium-Hard
Time: 4-6 hours
Skills: Namespaces, cgroups, syscalls

Goal

Build a minimal container runtime in C or Go that demonstrates your understanding of Linux isolation primitives. This is arguably the single best project for infrastructure interviews because it touches namespaces, cgroups, filesystem isolation, and syscalls, all in one coherent application. When you can say “I built my own container runtime,” interviewers pay attention.

Requirements

1. Create isolated namespaces

// Clone with new namespaces
// _GNU_SOURCE must be defined before any #include to expose clone()
// and the CLONE_NEW* flags
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>

// Each CLONE_NEW* flag creates a new instance of that kernel namespace
// for the child process. This is exactly what 'docker run' does internally.
int flags = CLONE_NEWPID |   // New PID namespace -- child sees itself as PID 1
            CLONE_NEWNS  |   // New mount namespace -- isolated filesystem view
            CLONE_NEWNET |   // New network namespace -- empty network stack
            CLONE_NEWUTS |   // New UTS namespace -- separate hostname
            CLONE_NEWIPC;    // New IPC namespace -- isolated shared memory

// clone() is the syscall that fork() is built on, but with namespace control
// stack_top points to the TOP of allocated stack memory (stacks grow down)
pid_t pid = clone(child_func, stack_top, flags | SIGCHLD, NULL);
if (pid == -1)
    perror("clone");  // EPERM usually means the program lacks CAP_SYS_ADMIN (run as root)

// After clone: parent and child are in DIFFERENT namespaces.
// The child's /proc/self/ns/* symlinks will point to different inodes
// than the parent's, proving the isolation.
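Common mistake: Forgetting that clone() requires you to allocate and manage the child’s stack manually, unlike fork(). The glibc clone() wrapper rejects a NULL stack pointer with EINVAL, and passing the bottom of the allocation instead of the top makes the child crash as soon as it pushes its first stack frame. Allocate at least 64 KB with mmap() and pass the TOP of the allocation (the stack grows downward on x86 and most other architectures).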
2. Set up cgroup resource limits

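// Create a cgroup for our container (cgroups v2 unified hierarchy)
// The memory and cpu controllers must be enabled for child cgroups first,
// otherwise memory.max and cpu.max do not appear in the new directory
write_file("/sys/fs/cgroup/cgroup.subtree_control", "+memory +cpu");

// This creates the control group directory in the cgroup filesystem
mkdir("/sys/fs/cgroup/mycontainer", 0755);

// Set memory limit -- hard cap at 100MB
// If usage hits the cap and reclaim cannot free enough memory,
// the kernel OOM-kills a process inside the cgroup
write_file("/sys/fs/cgroup/mycontainer/memory.max", "100M");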

// Set CPU limit (50% of 1 CPU)
// Format: "quota period" in microseconds
// 50000/100000 = 50ms of CPU time per 100ms period = 50% of one core
write_file("/sys/fs/cgroup/mycontainer/cpu.max", "50000 100000");

// Add the child process to this cgroup
// Writing the PID moves it (and future children) into the cgroup
write_file("/sys/fs/cgroup/mycontainer/cgroup.procs", pid_str);

// IMPORTANT: Always add cleanup logic to rmdir() the cgroup on exit.
// The rmdir only succeeds once no processes remain in the cgroup.
// Leaked cgroups accumulate and keep kernel memory pinned.
cgroups v1 vs v2: If your test system uses cgroups v1 (check with mount | grep cgroup, or with stat -fc %T /sys/fs/cgroup, which prints cgroup2fs on v2), the paths change. Memory goes to /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes and CPU to /sys/fs/cgroup/cpu/mycontainer/cpu.cfs_quota_us (with the period in cpu.cfs_period_us). Most modern distros (Ubuntu 21.10+, Fedora 31+) default to v2.
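The snippets above call write_file() without defining it. One possible implementation is sketched below, together with a hypothetical format_cpu_max() helper that builds the "quota period" string from a percentage of one core. Both are assumptions for illustration, not the original author's code.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// One possible write_file(): open an existing control file, write the whole
// string in a single write(), close. Returns 0 on success, -1 on error.
int write_file(const char *path, const char *value)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    size_t len = strlen(value);
    ssize_t n = write(fd, value, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

// Hypothetical helper: formats "quota period" (both in microseconds) for
// cpu.max, where quota = period * percent / 100.
// Returns 0 on success, -1 if the buffer is too small.
int format_cpu_max(char *buf, size_t size, unsigned percent, unsigned period_us)
{
    unsigned long quota_us = (unsigned long)period_us * percent / 100;
    int n = snprintf(buf, size, "%lu %u", quota_us, period_us);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}
```

With percent = 50 and period_us = 100000 the buffer holds "50000 100000", the same value written to cpu.max above. Checking the return value of write_file() matters in practice: a failed write to cgroup.procs means the container runs without any limits.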
3. Set up root filesystem with pivot_root

// pivot_root swaps the root filesystem for the calling process
// This is more secure than chroot because it actually detaches the old root,
// while chroot only changes the pathname resolution starting point.

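// First, make every mount in this namespace private so nothing propagates
// back to the host. pivot_root fails with EINVAL when the mounts involved use
// shared propagation, which is the default on systemd-based hosts.
mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);

// Bind-mount the new root onto itself (pivot_root requires new_root
// to be a mount point)
mount(rootfs, rootfs, NULL, MS_BIND | MS_REC, NULL);

// Create a temporary mountpoint for the old root inside the new root
// (old_root is a path under rootfs, e.g. "<rootfs>/old_root")
mkdir(old_root, 0755);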

// Pivot: new root becomes /, old root becomes /old_root
// After this call, the container's / is the rootfs we prepared
syscall(SYS_pivot_root, rootfs, old_root);

// Clean up: unmount and remove the old root
// Without this, the container could access the host filesystem via /old_root
chdir("/");
umount2("/old_root", MNT_DETACH);  // Lazy unmount -- safe even if busy
rmdir("/old_root");

// Mount essential filesystems inside the container
mount("proc", "/proc", "proc", 0, NULL);   // Container needs its own /proc
mount("sysfs", "/sys", "sysfs", 0, NULL);  // And /sys for device info
mount("tmpfs", "/tmp", "tmpfs", 0, NULL);   // Writable temp space
4. Execute user command

// Set the container's hostname (visible because we have CLONE_NEWUTS)
sethostname("container", 9);

// Drop to non-root if possible (defense in depth)
// In production runtimes, this is where capabilities are dropped too

// Replace the current process image with the user's command
// execve does NOT return on success -- the child_func process becomes /bin/sh
execve("/bin/sh", argv, envp);

// If we reach here, execve failed
perror("execve");
exit(1);

Bonus Features

  • Network namespace with veth pair — creates a virtual ethernet cable between host and container, the same mechanism Docker uses for bridge networking
  • User namespace for rootless containers — maps container UID 0 to an unprivileged host UID, eliminating the need for real root privileges
  • Seccomp filtering — restrict which syscalls the container can make (block mount, ptrace, kexec_load, etc.)
  • Capability dropping — start with all capabilities, drop everything except what the workload needs

What You’ll Learn

  • How Docker/containerd actually work — your container runtime is a stripped-down version of what runc does
  • Practical namespace manipulation — you will feel the difference between PID 1 inside vs outside the namespace
  • Cgroup setup and limits — and what happens when a process exceeds them (OOM killer behavior)
  • Filesystem isolation with pivot_root — and why chroot is insufficient for security
  • The startup cost of containers — measuring clone+pivot+mount gives you real numbers for “containers are lightweight”
Interview leverage: When an interviewer asks “how do containers work?”, most candidates say “namespaces and cgroups.” You will be able to say “I built one. The PID namespace is created by clone() with CLONE_NEWPID, the filesystem isolation uses pivot_root rather than chroot because chroot can be escaped with fchdir, and the resource limits use cgroups v2 where you write to memory.max and cpu.max.” That level of specificity is unmistakable.

Project 2: Syscall Tracer with eBPF

Difficulty: Hard
Time: 6-8 hours
Skills: eBPF, kernel tracing, data structures

Goal

Build a strace-like tool using eBPF that can trace syscalls with minimal overhead. Traditional strace uses ptrace(), which stops the traced process at both syscall entry and syscall exit; each stop costs a context switch to the tracer and another one back. Your eBPF tracer runs inside the kernel with no such stops, so its overhead is typically far lower (around 1% or less at moderate syscall rates), which makes it much safer to run in production.

Requirements

1. Set up BPF program to trace syscalls

// syscall_tracer.bpf.c
// This program runs INSIDE the kernel at the raw_syscalls tracepoint.
// It fires on every syscall entry for every process on the system.

// The SEC() macro tells libbpf which tracepoint to attach to.
// raw_syscalls/sys_enter fires before the syscall handler runs.
SEC("tracepoint/raw_syscalls/sys_enter")
int trace_enter(struct trace_event_raw_sys_enter *ctx)
{
    // bpf_get_current_pid_tgid() returns (tgid << 32 | tid)
    // tgid = thread group ID = what userspace calls PID
    // tid = thread ID = what the kernel calls PID (confusing, but important)
    u64 id = bpf_get_current_pid_tgid();
    u32 pid = id >> 32;  // Extract tgid (userspace PID)
    
    // Filter to target PID only -- without this, we trace EVERYTHING
    // which floods the ring buffer and wastes CPU
    if (target_pid && pid != target_pid)
        return 0;
    
    // Reserve space in the ring buffer for our event
    // Ring buffers are the modern way to send data from BPF to userspace
    // (replacing the older perf buffer which had per-CPU overhead)
    struct syscall_event *event;
    event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
    if (!event)
        return 0;  // Buffer full -- drop event rather than block
    
    event->pid = pid;
    event->tid = (u32)id;  // Lower 32 bits = thread ID, for thread-level tracing
    event->syscall_nr = ctx->id;  // Syscall number (e.g., 1 = write)
    event->timestamp = bpf_ktime_get_ns();  // Monotonic nanosecond clock
    bpf_get_current_comm(&event->comm, sizeof(event->comm));  // Process name
    
    // Capture first 6 arguments (Linux syscalls have max 6 args)
    // These are the raw register values -- you decode them in userspace
    event->args[0] = ctx->args[0];
    event->args[1] = ctx->args[1];
    event->args[2] = ctx->args[2];
    event->args[3] = ctx->args[3];
    event->args[4] = ctx->args[4];
    event->args[5] = ctx->args[5];
    
    // Submit the event to userspace
    bpf_ringbuf_submit(event, 0);
    return 0;
}
2. Capture return values

// We need a SECOND tracepoint to capture what the syscall returned.
// sys_enter fires before execution, sys_exit fires after.
SEC("tracepoint/raw_syscalls/sys_exit")
int trace_exit(struct trace_event_raw_sys_exit *ctx)
{
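    u64 id = bpf_get_current_pid_tgid();
    u32 pid = id >> 32;
    
    // Apply the same PID filter as trace_enter; without it, exit events
    // from every process on the system land in the ring buffer
    if (target_pid && pid != target_pid)
        return 0;
    
    struct exit_event *event;
    event = bpf_ringbuf_reserve(&exits, sizeof(*event), 0);
    if (!event)
        return 0;
    
    event->tid = (u32)id;  // Lower 32 bits = thread ID, matches the enter event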
    event->ret = ctx->ret;  // Return value: >= 0 = success, < 0 = -errno
    event->timestamp = bpf_ktime_get_ns();
    
    bpf_ringbuf_submit(event, 0);
    return 0;
}
// To correlate enter/exit: match on tid + timestamp ordering.
// A BPF hash map keyed by tid can store enter events until
// the matching exit arrives, computing syscall duration.
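The enter/exit pairing described in the comment above can be modeled in plain userspace C. The sketch below is not BPF code: a small fixed array stands in for the BPF hash map, and the names record_enter, record_exit, and MAX_THREADS are made up for illustration. Keying by thread ID works because a thread can only be inside one syscall at a time.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 64  // illustrative capacity; a BPF hash map would use max_entries

struct pending { uint32_t tid; uint64_t enter_ns; int used; };
static struct pending pending_table[MAX_THREADS];

// Enter event: remember when this thread entered its syscall.
// Returns 0 on success, -1 if the table is full (the event is dropped).
int record_enter(uint32_t tid, uint64_t ts_ns)
{
    struct pending *free_slot = NULL;
    for (size_t i = 0; i < MAX_THREADS; i++) {
        if (pending_table[i].used && pending_table[i].tid == tid) {
            pending_table[i].enter_ns = ts_ns;  // overwrite a stale entry
            return 0;
        }
        if (!pending_table[i].used && !free_slot)
            free_slot = &pending_table[i];
    }
    if (!free_slot)
        return -1;
    *free_slot = (struct pending){ .tid = tid, .enter_ns = ts_ns, .used = 1 };
    return 0;
}

// Exit event: returns the syscall duration in nanoseconds and clears the
// entry, or -1 if no matching enter event was recorded for this thread.
int64_t record_exit(uint32_t tid, uint64_t ts_ns)
{
    for (size_t i = 0; i < MAX_THREADS; i++) {
        if (pending_table[i].used && pending_table[i].tid == tid) {
            pending_table[i].used = 0;
            return (int64_t)(ts_ns - pending_table[i].enter_ns);
        }
    }
    return -1;
}
```

The -1 case on exit is expected in practice: the tracer may attach while a thread is already inside a syscall, so the first exit it sees has no matching enter.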
3. Build user-space consumer

// User-space: Read events from the ring buffer and display them
// This callback fires for every event the BPF program submits
static int handle_event(void *ctx, void *data, size_t len)
{
    struct syscall_event *e = data;
    
    // syscall_name() maps the number to a string (e.g., 1 -> "write")
    // You can generate this from /usr/include/asm/unistd_64.h
    printf("[%s:%d] %s(",
           e->comm, e->pid, syscall_name(e->syscall_nr));
    
    // Format arguments based on syscall type
    // Different syscalls interpret the same register values differently:
    // write(fd, buf, count) vs open(pathname, flags, mode)
    format_syscall_args(e->syscall_nr, e->args);
    
    printf(")\n");
    return 0;
}

// Main loop: poll the ring buffer and invoke callback for each event
// ring_buffer__poll() blocks until events are available or timeout
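The consumer above relies on a syscall_name() helper that is not shown. A minimal sketch of such a lookup follows, with a hand-written table of a few x86-64 syscall numbers; a real tool would generate the full table from the header mentioned above. The reverse lookup syscall_number() is an added assumption, useful for a filter-by-name option.

```c
#include <stddef.h>
#include <string.h>

struct syscall_entry { long nr; const char *name; };

// A handful of x86-64 syscall numbers. Other architectures number
// syscalls differently, so this table is not portable.
static const struct syscall_entry syscall_table[] = {
    { 0, "read" }, { 1, "write" }, { 2, "open" }, { 3, "close" },
    { 5, "fstat" }, { 257, "openat" },
};

#define SYSCALL_TABLE_LEN (sizeof(syscall_table) / sizeof(syscall_table[0]))

// Number -> name, or "unknown" for numbers missing from the table.
const char *syscall_name(long nr)
{
    for (size_t i = 0; i < SYSCALL_TABLE_LEN; i++)
        if (syscall_table[i].nr == nr)
            return syscall_table[i].name;
    return "unknown";
}

// Name -> number, or -1 if the name is not in the table.
long syscall_number(const char *name)
{
    for (size_t i = 0; i < SYSCALL_TABLE_LEN; i++)
        if (strcmp(syscall_table[i].name, name) == 0)
            return syscall_table[i].nr;
    return -1;
}
```

A linear scan is fine for display purposes; for the full table of several hundred syscalls, indexing an array directly by syscall number is the simpler and faster choice.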
4. Add filtering capabilities

  • Filter by PID — use a BPF map to store target PIDs, check membership in the BPF program
  • Filter by syscall type (file, network, memory) — classify syscall numbers into categories
  • Show only slow syscalls (above threshold) — compute duration from enter/exit timestamps, only emit events that exceed the threshold
  • Filter by return value — show only errors (ret < 0) for debugging permission issues

Expected Output

$ ./syscall_tracer -p 1234
[myapp:1234] openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3       [2.1us]
[myapp:1234] fstat(3, {...}) = 0                                  [0.8us]
[myapp:1234] read(3, "root:x:0:0:..."..., 4096) = 2847           [1.5us]
[myapp:1234] close(3) = 0                                        [0.6us]
# The [duration] column shows wall-clock time for each syscall
# Long durations on read/write indicate I/O bottlenecks
# Long durations on futex indicate lock contention

Bonus Features

  • Latency histogram per syscall — aggregate durations into log2 buckets directly in BPF (avoid per-event export overhead)
  • Argument decoding (file paths, flags) — use bpf_probe_read_user_str() to read string arguments from userspace memory
  • Export to JSON format — for integration with monitoring pipelines
  • Integration with flame graphs — stack trace capture with bpf_get_stackid() to show WHERE syscalls originate
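Kernel version matters: Ring buffers (BPF_MAP_TYPE_RINGBUF) require kernel 5.8+. On older kernels, use perf buffers (BPF_MAP_TYPE_PERF_EVENT_ARRAY) instead. CO-RE needs kernel BTF (BPF Type Format), which requires 5.2+ and a kernel built with CONFIG_DEBUG_INFO_BTF=y. Check uname -r and confirm that /sys/kernel/btf/vmlinux exists (ls -l /sys/kernel/btf/vmlinux; the file is binary, so do not cat it to a terminal) before starting.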
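The log2 latency histogram from the bonus features boils down to one small function. The sketch below is plain C for illustration (NUM_BUCKETS, log2_bucket, and hist_add are assumed names); inside a BPF program the same bounded loop would still have to pass the verifier.

```c
#include <stdint.h>

#define NUM_BUCKETS 32  // illustrative size; bucket 31 absorbs everything larger

// Bucket index = floor(log2(value)), capped at NUM_BUCKETS - 1.
// Values 0 and 1 both land in bucket 0.
unsigned log2_bucket(uint64_t value)
{
    unsigned bucket = 0;
    while (value > 1 && bucket < NUM_BUCKETS - 1) {
        value >>= 1;
        bucket++;
    }
    return bucket;
}

// Add one latency sample (in nanoseconds) to a histogram.
void hist_add(uint64_t hist[NUM_BUCKETS], uint64_t latency_ns)
{
    hist[log2_bucket(latency_ns)]++;
}
```

Each bucket covers a power-of-two range, so a 2.1 us syscall (2100 ns) lands in bucket 11, which spans 2048 to 4095 ns. Exporting 32 counters per syscall is far cheaper than exporting one event per call.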

Project 3: Memory Leak Detector

Difficulty: Hard
Time: 6-8 hours
Skills: eBPF, memory management, stack traces

Goal

Build a tool that tracks memory allocations and identifies leaks in running processes. The core idea is elegant: hook malloc() to record every allocation (address, size, call stack), hook free() to remove the record. Whatever remains after a measurement period is a potential leak.

Approach

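// Track allocations by hooking libc's malloc() with a uprobe
// Uprobes are like kprobes but for userspace functions -- the kernel
// patches a breakpoint into the target function's code
// libbpf resolves the library name; libc.so.6 is the glibc shared object
// (a bare libc.so is usually a linker script, not an ELF file)
SEC("uprobe/libc.so.6:malloc")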
int trace_malloc(struct pt_regs *ctx)
{
    u64 size = PT_REGS_PARM1(ctx);  // malloc's first arg = requested size
    u64 id = bpf_get_current_pid_tgid();
    
    // Store the requested size temporarily, keyed by thread ID
    // We need this because we cannot get the return value (the pointer)
    // until the function returns -- that is what the uretprobe is for
    bpf_map_update_elem(&alloc_sizes, &id, &size, BPF_ANY);
    return 0;
}

// Capture the return value of malloc (the allocated pointer)
SEC("uretprobe/libc.so.6:malloc")
int trace_malloc_ret(struct pt_regs *ctx)
{
    u64 addr = PT_REGS_RC(ctx);  // Return value = pointer to allocated memory
    u64 id = bpf_get_current_pid_tgid();
    
    // Look up the size we stored in the entry probe
    u64 *size = bpf_map_lookup_elem(&alloc_sizes, &id);
    if (!size)
        return 0;  // Missed the entry -- can happen under extreme load
    if (!addr) {
        // malloc failed and returned NULL -- nothing to track
        bpf_map_delete_elem(&alloc_sizes, &id);
        return 0;
    }
    
    // Record the allocation with full context
    struct alloc_info info = {
        .size = *size,
        .timestamp = bpf_ktime_get_ns(),
    };
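    // Capture the userspace call stack -- this is HOW we identify
    // which code path leaked. Without stack traces, "you have a leak"
    // is useless. With them, you know exactly which function allocated
    // the leaked memory.
    // Note: info lives on the BPF stack, which is limited to 512 bytes, so
    // info.stack can only hold a few dozen frames. For deeper stacks, use
    // bpf_get_stackid() with a BPF_MAP_TYPE_STACK_TRACE map instead.
    bpf_get_stack(ctx, &info.stack, sizeof(info.stack), BPF_F_USER_STACK);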
    
    // Store allocation keyed by address -- free() will remove it
    bpf_map_update_elem(&allocations, &addr, &info, BPF_ANY);
    bpf_map_delete_elem(&alloc_sizes, &id);  // Clean up temp storage
    
    return 0;
}

// Track deallocations -- remove the allocation record
SEC("uprobe/libc.so.6:free")
int trace_free(struct pt_regs *ctx)
{
    u64 addr = PT_REGS_PARM1(ctx);  // free's first arg = pointer to free
    
    // Remove from allocations map -- this pointer is no longer outstanding
    // If the address is not in the map, this is a free of memory we
    // did not track (allocated before we attached), which is fine
    bpf_map_delete_elem(&allocations, &addr);
    
    return 0;
}
// After the measurement period, iterate the allocations map in userspace.
// Every remaining entry is memory that was allocated but never freed
// during our observation window -- a candidate leak.
Why this beats Valgrind for production: Valgrind runs your program on a synthetic CPU, typically slowing it 10-50x. This eBPF approach attaches to a running process instead. At moderate allocation rates the overhead stays under about 5%, because each probe runs only a small BPF program; allocation-heavy workloads pay more, since every malloc() and free() call takes a uprobe trap. You can attach to a production process, observe for 30 seconds, detach, and analyze, all without restarting anything.

Output Format

$ ./memleak -p 1234 30
Attaching to process 1234...
Tracing for 30 seconds...

[08:23:15] Top outstanding allocations:

24576 bytes in 12 allocations from:
    malloc+0x0              # libc entry point
    json_parse+0x123        # <-- this function allocates but never frees
    handle_request+0x456    # called from request handler
    main+0x789

8192 bytes in 4 allocations from:
    malloc+0x0
    create_buffer+0x55      # allocates buffers that outlive the function
    process_data+0x123
    worker_thread+0x456

# Stack traces are symbolized using /proc/<pid>/maps and DWARF debug info
# Without debug symbols, you get raw addresses -- still useful with addr2line
Sizing the BPF map: The allocations hash map must be large enough to hold all outstanding allocations simultaneously. A typical application might have 10,000-100,000 live allocations. Set max_entries accordingly. If the map fills up, bpf_map_update_elem() fails and new allocations go untracked, so any leaks among them are missed (false negatives). Check the return value in the BPF program and count the failures; bpftool map show reports the configured max_entries, and bpftool map dump lists the live entries.
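The bookkeeping behind the detector, including the map-full case, can be modeled in a few lines of userspace C. This is a sketch of the logic only, not BPF code: a tiny array stands in for the allocations map, and all names (track_alloc, track_free, MAX_TRACKED) are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 4  // deliberately tiny so the "map full" case is easy to hit

struct tracked { uint64_t addr; uint64_t size; };
static struct tracked live[MAX_TRACKED];  // addr == 0 marks an empty slot
static uint64_t dropped;                  // allocations that could not be recorded

// malloc returned `addr`: record it, or count a drop if the table is full.
void track_alloc(uint64_t addr, uint64_t size)
{
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (live[i].addr == 0) {
            live[i] = (struct tracked){ addr, size };
            return;
        }
    }
    dropped++;
}

// free(addr): forget the record. Unknown addresses (allocated before we
// attached, or dropped because the table was full) are ignored.
void track_free(uint64_t addr)
{
    if (addr == 0)
        return;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (live[i].addr == addr) {
            live[i].addr = 0;
            return;
        }
    }
}

// Bytes still outstanding: the leak candidates.
uint64_t outstanding_bytes(void)
{
    uint64_t total = 0;
    for (size_t i = 0; i < MAX_TRACKED; i++)
        if (live[i].addr != 0)
            total += live[i].size;
    return total;
}

uint64_t dropped_allocs(void) { return dropped; }
```

A non-zero drop counter means the report understates the outstanding memory, so the tool should print it next to the results rather than hide it.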

Project 4: Production CPU Profiler

Difficulty: Very Hard
Time: 8-10 hours
Skills: Perf events, stack unwinding, visualization

Goal

Build a sampling CPU profiler that generates flame graphs, suitable for production use. Unlike instrumentation-based profilers (which modify code), sampling profilers periodically interrupt the CPU and record what it is doing. The key insight: if you sample 100 times per second and function X appears in 30% of samples, it is using approximately 30% of CPU time. Statistical profiling with zero code changes.

Components

  1. Sampler: Use perf_event_open() for low-overhead sampling — the kernel does the interrupt and stack capture, you just read the results
  2. Stack Walker: Capture kernel and user stacks — frame pointers or DWARF unwinding
  3. Aggregator: Collapse and count identical stacks — convert N samples into “stack A appeared M times”
  4. Visualizer: Generate flame graph SVG — the standard visualization pioneered by Brendan Gregg

Key Implementation

// Set up a perf event that samples the CPU at ~100 Hz
struct perf_event_attr attr = {
    .type = PERF_TYPE_SOFTWARE,
    .size = sizeof(struct perf_event_attr),  // Tells the kernel which struct version we pass
    .config = PERF_COUNT_SW_CPU_CLOCK,  // Sample on CPU clock
    .sample_period = 10000000,  // ~100 Hz (10ms period in nanoseconds)
    .sample_type = PERF_SAMPLE_CALLCHAIN | PERF_SAMPLE_TID,
    // CALLCHAIN: capture the full stack trace at each sample
    // TID: record which thread was running
    .exclude_kernel = 0,  // Include kernel stacks (essential for I/O analysis)
    .exclude_user = 0,    // Include user stacks (your application code)
};

// Open the perf event -- pid=-1,cpu=N means "profile all processes on CPU N"
// For per-process profiling, use pid=target,cpu=-1 (as done here)
// glibc provides no perf_event_open() wrapper, so call it through syscall()
int fd = syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);

// Memory-map the ring buffer for zero-copy sample reading
// The kernel writes samples directly here, userspace reads without syscalls
// Size = 1 metadata page + 2^n data pages (^ is XOR in C, so use a shift)
void *mmap_base = mmap(NULL, page_size * (1 + (1 << n)),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

// Process samples from the ring buffer
struct perf_event_header *header;
while ((header = get_next_sample(mmap_base)) != NULL) {
    if (header->type == PERF_RECORD_SAMPLE) {
        // Each sample contains a callchain: an array of instruction pointers
        // from the current function up to the entry point
        process_sample((struct sample_event *)header);
    }
    // Also handle PERF_RECORD_MMAP2 events (set attr.mmap2 = 1 to receive
    // them) -- these tell you when shared libraries are loaded, which you
    // need to map addresses to symbol names
}

Flame Graph Generation

# Step 1: Collapse stacks -- aggregate identical call chains
stacks = {}
for sample in samples:
    # Reverse the callchain so it reads bottom-up (main -> ... -> leaf)
    # Symbolize addresses using the process's /proc/pid/maps
    stack_str = ";".join(reversed(sample.callchain))
    stacks[stack_str] = stacks.get(stack_str, 0) + 1

# Step 2: Output Brendan Gregg's folded stack format
# Each line is "frame1;frame2;frame3 count"
for stack, count in stacks.items():
    print(f"{stack} {count}")

# Step 3: Pipe through flamegraph.pl to generate SVG
# cat folded.txt | flamegraph.pl > profile.svg
# The width of each box is proportional to how often that function
# appeared in samples -- wider = more CPU time
Production safety: At 100 Hz sampling across 8 CPUs, you generate 800 samples/second. Each sample with a 64-deep callchain is roughly 600 bytes. That is about 480 KB/second of data, which is trivial for modern systems. The CPU overhead is typically under 1% because the interrupt that captures the stack takes only a few microseconds. (For the cpu-clock software event used above, that interrupt comes from a high-resolution timer; hardware PMU events such as cycles use an NMI, a non-maskable interrupt.) This is why sampling profilers are generally safe for production, while instrumentation profilers add cost to every instrumented call.
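The arithmetic in this estimate is easy to re-run for other sampling rates and machine sizes. The helpers below are illustrative only; the 600 bytes per sample is the rough figure quoted above, not a measured value.

```c
#include <stdint.h>

// Sampling period in nanoseconds for a target frequency, e.g. 100 Hz -> 10 ms.
uint64_t sample_period_ns(unsigned hz)
{
    return 1000000000ULL / hz;
}

// Total samples per second when every CPU is sampled at `hz`.
uint64_t samples_per_second(unsigned hz, unsigned cpus)
{
    return (uint64_t)hz * cpus;
}

// Data rate in bytes per second, given an assumed average sample size.
uint64_t bytes_per_second(unsigned hz, unsigned cpus, unsigned bytes_per_sample)
{
    return samples_per_second(hz, cpus) * bytes_per_sample;
}
```

For 100 Hz on 8 CPUs this gives 800 samples and 480,000 bytes per second, and sample_period_ns(100) returns the 10,000,000 ns used as sample_period in the perf_event_attr above. Raising the rate to 1000 Hz on a 64-CPU host multiplies the data rate by 80, which is the point where ring buffer sizing starts to matter.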

Project 5: Network Connection Tracker

Difficulty: Medium-Hard
Time: 4-6 hours
Skills: eBPF, networking, state machines

Goal

Build a tool that tracks all TCP connections with latency metrics. This gives you visibility into every connection your service makes: how long the TCP handshake took, how long the connection lasted, and which remote endpoints have the highest latency.

Features

  • Track connection establishment latency (SYN to ESTABLISHED)
  • Track connection duration (ESTABLISHED to CLOSE)
  • Group by remote IP/port for aggregate statistics
  • Show retransmission rates — a key indicator of network health

BPF Program

// Hook tcp_v4_connect -- fires when a process initiates a TCP connection
SEC("kprobe/tcp_v4_connect")
int trace_connect(struct pt_regs *ctx)
{
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    u64 ts = bpf_ktime_get_ns();  // Record when the connect() was initiated
    
    // Store the start timestamp keyed by the socket pointer
    // We will look this up when the connection reaches ESTABLISHED state
    bpf_map_update_elem(&connect_start, &sk, &ts, BPF_ANY);
    return 0;
}

// Hook tcp_rcv_state_process -- handles incoming segments for sockets that
// are not yet ESTABLISHED, so it runs when the SYN-ACK arrives
SEC("kprobe/tcp_rcv_state_process")
int trace_state_change(struct pt_regs *ctx)
{
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    // BPF_CORE_READ handles kernel struct field access safely across versions
    int state = BPF_CORE_READ(sk, __sk_common.skc_state);
    
    // A kprobe fires at function ENTRY, before the state is updated: the socket
    // is still in SYN_SENT here and becomes ESTABLISHED inside this call
    if (state == TCP_SYN_SENT) {
        // Handshake is completing -- compute handshake latency
        u64 *start_ts = bpf_map_lookup_elem(&connect_start, &sk);
        if (start_ts) {
            u64 latency = bpf_ktime_get_ns() - *start_ts;
            // latency is in nanoseconds. Typical LAN: 100-500us.
            // Cross-region: 10-100ms. If > 1s, there is a problem.
            emit_event(sk, latency);
            bpf_map_delete_elem(&connect_start, &sk);
        }
    }
    return 0;
}

// Also consider hooking:
// tcp_retransmit_skb -- fires on every retransmission (packet loss indicator)
// tcp_close -- fires when connection ends (measure total connection duration)
// tcp_rcv_established -- fires on every received packet (throughput analysis)
kprobe stability caveat: tcp_v4_connect and tcp_rcv_state_process are internal kernel functions, not part of the stable ABI. They can be renamed, inlined, or refactored between kernel versions. For production tools, prefer tracepoints such as tracepoint/sock/inet_sock_set_state (kernel 4.16+), which reports every TCP state transition and is kept stable. fentry programs (kernel 5.5+) attach to BTF-typed function signatures and have lower overhead than kprobes, but they still depend on the target function existing. For a learning project, kprobes are fine because the intent is understanding, not long-term maintenance.

Study Plan Integration

Week 1-2: Foundation

  • Complete Project 1 (Container from Scratch)
  • Understand namespaces and cgroups deeply
  • Milestone: Run /bin/sh inside your container with working PID isolation and a 100MB memory limit

Week 3-4: Tracing

  • Complete Project 2 (Syscall Tracer)
  • Master eBPF basics — BPF maps, ring buffers, tracepoints
  • Milestone: Trace a target process’s file I/O and show per-syscall latency

Week 5-6: Memory

  • Complete Project 3 (Memory Leak Detector)
  • Understand memory allocation internals — malloc arenas, mmap, sbrk
  • Milestone: Detect a synthetic memory leak in a test program and show the leaking call stack

Week 7-8: Production

  • Complete Project 4 or 5 (pick based on your target role)
  • Focus on production debugging skills — profiling, network analysis
  • Milestone: Generate a flame graph for a real application and identify its CPU hot spots

Interview Discussion Points

When discussing these projects in interviews:
  1. Design decisions: Why did you choose this approach? (e.g., “I used ring buffers instead of perf buffers because ring buffers have a single shared buffer across CPUs, reducing memory overhead for my use case where per-CPU isolation was not needed”)
  2. Trade-offs: What are the limitations? (e.g., “My leak detector has false positives for long-lived caches because it cannot distinguish intentional caching from leaks”)
  3. Production readiness: How would you make it production-safe? (e.g., “I would add a BPF map size monitor that emits a metric when the map is above 80% capacity, and a circuit breaker that detaches the probes if CPU overhead exceeds 2%”)
  4. Extensions: How would you add feature X? (e.g., “To add per-container attribution, I would call bpf_get_current_cgroup_id() and use it as a secondary key in the aggregation map”)
  5. Debugging: How did you debug issues while building it? (e.g., “The BPF verifier rejected my first loop because it could not prove termination. I restructured it to use a bounded for with #pragma unroll and verified the generated bytecode with llvm-objdump”)
Important: Do not just copy code — understand every line. Interviewers will ask follow-up questions to verify your understanding. If you cannot explain why bpf_ringbuf_reserve returns NULL (the ring buffer is full because userspace is not consuming events fast enough), the project becomes a liability rather than an asset.

Interview Deep-Dive

Strong Answer:
  • When clone() is called with CLONE_NEWPID, the kernel allocates a new struct pid_namespace and sets it as the child’s active PID namespace. The kernel maintains a hierarchy of PID namespaces: the new namespace is a child of the caller’s namespace. The child process gets PID 1 inside the new namespace (it becomes the init process of that namespace), but also has a PID in every ancestor namespace. The kernel’s struct pid contains an array of struct upid entries, one for each namespace level. When the child calls getpid(), the kernel returns the PID from the innermost namespace. The parent sees the child under a different number: the value clone() returned, which is the child’s PID in the parent’s namespace.
  • For CLONE_NEWNS, the kernel calls copy_mnt_ns() which duplicates the entire mount table of the parent process. Every struct mount in the parent’s namespace is cloned, creating a new struct mnt_namespace. After this, changes to mounts in the child (like pivot_root or mount) are invisible to the parent and vice versa. This is how a container can have its own /proc mounted without affecting the host’s /proc.
  • The isolation is enforced at the syscall boundary. When a process in the PID namespace calls kill(pid, sig), the kernel resolves pid within the caller’s namespace. PID 1 in the child namespace is unreachable by that PID number from the parent namespace (the parent must use the PID it was assigned in the parent’s namespace). For mount namespaces, open("/etc/passwd") resolves through the caller’s mount table, so the child sees its own filesystem even though the host has a different file at that path.
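  • An important subtlety: with clone(CLONE_NEWPID) the child is created directly inside the new namespace, but unshare(CLONE_NEWPID) and setns() never move the calling process itself. The caller stays in its original PID namespace, and only the children it creates afterwards land in the new one. This is why container runtimes built on unshare() or setns() fork one more time: the extra child becomes PID 1 of the new namespace and then calls execve() to replace itself with the container’s init process.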
Follow-up: What happens when PID 1 inside the namespace dies?
Follow-up Answer:
  • When PID 1 in a PID namespace dies, the kernel sends SIGKILL to all other processes in that namespace. PID 1 is special because it is the default parent for orphaned processes: if it exits, there is nobody to adopt orphans, so the kernel cleans up by killing everything. This is why container runtimes need a proper init process (like tini or dumb-init) that handles signals and reaps zombie children, rather than running the application directly as PID 1. PID 1 also gets special signal treatment: the kernel only delivers signals for which it has installed a handler. If your application runs as PID 1 without a SIGTERM handler (the usual case), the SIGTERM from docker stop is silently dropped, so Docker waits 10 seconds and then sends SIGKILL.
Strong Answer:
  • First, diagnosis: I would check the ring buffer’s drop counter. When bpf_ringbuf_reserve() returns NULL, I would increment a per-CPU counter in a BPF_MAP_TYPE_PERCPU_ARRAY map. Userspace reads this counter periodically to compute the drop rate. If the drop rate is high, the bottleneck is one of three things: (1) the ring buffer is too small, (2) userspace is consuming too slowly, or (3) the per-event data is too large.
  • For 500K syscalls/second, the naive approach (one ring buffer event per syscall with full context) generates roughly 500K * 200 bytes = 100 MB/second of data. This overwhelms the userspace consumer, which has to wake up through epoll_wait() and run a callback for every record in the memory-mapped buffer.
  • The architectural fix is to move aggregation into the BPF program itself. Instead of emitting per-event data, I would use a BPF_MAP_TYPE_PERCPU_HASH map keyed by (pid, syscall_nr) with value (count, total_latency_ns, max_latency_ns). The BPF program increments counters in-kernel for every syscall. Userspace reads the aggregated map every 1-5 seconds, getting a compact summary rather than a firehose of raw events. This reduces data transfer from 100 MB/s to kilobytes per read.
  • For cases where per-event detail is needed (e.g., capturing arguments of specific slow syscalls), I would use a two-tier approach: the BPF program aggregates everything in-kernel but also checks if a syscall exceeds a latency threshold (e.g., 10ms). Only threshold-exceeding events go to the ring buffer. This gives you both aggregate statistics and detailed traces of interesting events, without the overhead of logging everything.
  • For the ring buffer itself: size it based on the expected burst rate, not the average rate. If you expect 100 events/second going to the ring buffer but bursts of 10,000, size the buffer to hold at least 2 seconds of burst data. Use BPF_RB_FORCE_WAKEUP on high-priority events to wake the consumer immediately, and BPF_RB_NO_WAKEUP on routine events to let them batch.
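The sizing rule from the last bullet can be turned into a small calculation. The kernel requires a BPF ring buffer's max_entries to be a power of two and a multiple of the page size, so the sketch below rounds up accordingly. The function names are made up for illustration, and the 200-byte event size is the figure assumed earlier in this answer.

```c
#include <stdint.h>

// Round up to the next power of two (returns v itself if it already is one).
// Values above 2^63 cannot be rounded up and saturate at 2^63.
uint64_t next_pow2(uint64_t v)
{
    uint64_t p = 1;
    while (p < v && p <= UINT64_MAX / 2)
        p <<= 1;
    return p;
}

// Ring buffer size in bytes for `seconds` worth of burst traffic.
// page_size is assumed to be a power of two (4096 on most systems), so the
// result is both a power of two and page-aligned.
uint64_t ringbuf_bytes(uint64_t burst_events_per_sec, uint64_t seconds,
                       uint64_t event_bytes, uint64_t page_size)
{
    uint64_t need = burst_events_per_sec * seconds * event_bytes;
    if (need < page_size)
        need = page_size;
    return next_pow2(need);
}
```

For bursts of 10,000 events/second, 2 seconds of headroom, and 200-byte events, the raw requirement is 4,000,000 bytes, which rounds up to 4,194,304 (4 MiB). Each record also carries a small ring buffer header, so treat the result as a lower bound.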
Follow-up: How do PERCPU maps avoid lock contention, and what is the trade-off?
Follow-up Answer:
  • BPF_MAP_TYPE_PERCPU_HASH allocates a separate copy of each map value for each CPU. When CPU 3 increments a counter, it only touches CPU 3’s copy — no locks, no cache-line bouncing, no contention. This is critical at 500K syscalls/second because even a single atomic increment would cause cache-line bouncing across CPUs (each increment invalidates the cache line on all other CPUs). The trade-off is memory: if you have 64 CPUs and 10,000 map entries of 64 bytes each, the total memory is 64 * 10,000 * 64 = ~40 MB instead of 640 KB for a non-PERCPU map. The other trade-off is read complexity: when userspace reads the map, it gets an array of values (one per CPU) and must sum them. For counter-type values this is straightforward, but for more complex aggregations (histograms, min/max) the merge logic can be subtle.
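The merge step and the memory trade-off can both be sketched in plain C. The struct below mirrors the (count, total_latency_ns, max_latency_ns) value described earlier; merge_percpu and percpu_map_bytes are hypothetical helper names, and a real consumer would size the per-CPU array from the number of possible CPUs reported by libbpf.

```c
#include <stddef.h>
#include <stdint.h>

struct syscall_stats {
    uint64_t count;
    uint64_t total_latency_ns;
    uint64_t max_latency_ns;
};

// Merge one map value as userspace receives it: an array with one copy per
// CPU. Counters are summed, but the maximum must be taken as a max, not a sum.
struct syscall_stats merge_percpu(const struct syscall_stats *percpu, size_t ncpus)
{
    struct syscall_stats out = {0};
    for (size_t i = 0; i < ncpus; i++) {
        out.count += percpu[i].count;
        out.total_latency_ns += percpu[i].total_latency_ns;
        if (percpu[i].max_latency_ns > out.max_latency_ns)
            out.max_latency_ns = percpu[i].max_latency_ns;
    }
    return out;
}

// Value memory used by a per-CPU map: every entry exists once per CPU.
uint64_t percpu_map_bytes(uint64_t ncpus, uint64_t entries, uint64_t value_bytes)
{
    return ncpus * entries * value_bytes;
}
```

With 64 CPUs, 10,000 entries, and 64-byte values, percpu_map_bytes returns 40,960,000, the roughly 40 MB quoted above, against 640,000 bytes for a single shared copy.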
Strong Answer:
  • This is almost certainly a native memory leak outside the Java heap. The JVM allocates native memory through both malloc() and mmap() for many internal structures: JIT-compiled code (CodeCache), thread stacks, NIO direct byte buffers (ByteBuffer.allocateDirect() calls malloc() under the hood), JNI native allocations, and class metadata (Metaspace, which uses mmap() internally). None of these show up in Java heap metrics or GC logs.
  • My eBPF leak detector would already catch the malloc() leaks, but the stack traces would show JVM internal frames that are hard to interpret. The fix is multi-layered: First, I would add mmap() tracking to the detector (not just malloc), because the JVM uses mmap(MAP_ANONYMOUS) for large allocations including Metaspace and CodeCache. I would hook mmap and munmap the same way I hook malloc and free. Second, I would correlate the native allocations with JVM internal metrics: jcmd <pid> VM.native_memory summary gives a breakdown by JVM subsystem (CodeCache, Thread, Class, etc.). If Thread memory is growing linearly, it means threads are being created but not destroyed (common with thread-per-request models under load).
  • For NIO direct buffers specifically: these are allocated with malloc() from native code but tracked by Java’s sun.misc.Cleaner mechanism. If the Java heap has enough headroom that GC runs infrequently, the Cleaners do not fire, and the native buffers accumulate. The fix is either calling System.gc() periodically (ugly but effective), using -XX:MaxDirectMemorySize to cap direct buffer usage, or switching to heap-backed buffers.
  • To make the tool more JVM-aware, I would also hook dlopen() to detect loaded JNI libraries (which often have their own allocation patterns) and use the JVM’s -XX:NativeMemoryTracking=summary flag to get the JVM’s own view of native memory, then cross-reference with my eBPF data to identify discrepancies.
Follow-up: How would you distinguish between a genuine leak and legitimate memory growth like caching?
Follow-up Answer:
  • A genuine leak has a characteristic signature: allocations accumulate monotonically from the same call stack over time. A cache, by contrast, grows and then plateaus (when the cache is full, old entries are evicted and freed). My tool can distinguish these by tracking allocation rate over time windows. I would modify the userspace component to sample the allocations map every 30 seconds and compute the delta per call stack. If a specific stack shows a constant positive delta (say, 100 new outstanding allocations every 30 seconds) that never decreases, it is likely a leak. If the delta was high initially but has dropped to near-zero, it is a cache that has reached steady state. I would add a --duration flag that runs for multiple measurement windows and flags only stacks with a consistently positive growth rate as likely leaks, filtering out one-time allocations and caches.
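The growth-rate heuristic above fits in one function. The sketch below assumes the userspace side has already collected, for a single call stack, the number of outstanding allocations at the end of each measurement window; looks_like_leak and its min_growth parameter are illustrative names, not part of any existing tool.

```c
#include <stddef.h>

// samples[i] = outstanding allocations for one call stack at window i.
// Returns 1 when every window grew by at least min_growth (leak-like),
// 0 when growth stalled at any point (cache-like) or data is insufficient.
int looks_like_leak(const long *samples, size_t n, long min_growth)
{
    if (n < 3)
        return 0;  // fewer than two deltas: too little evidence
    for (size_t i = 1; i < n; i++)
        if (samples[i] - samples[i - 1] < min_growth)
            return 0;
    return 1;
}
```

A stack that reports 100, 200, 300, 400 outstanding allocations is flagged, while a cache that goes 100, 900, 1000, 1000 is not, because its last window shows no growth. Requiring growth in every window is strict; a more forgiving variant could tolerate a bounded number of flat windows.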

Resources

  • Linux kernel source - The ultimate reference. Start with kernel/nsproxy.c for namespaces, kernel/cgroup/ for cgroups
  • libbpf-bootstrap - Minimal BPF project templates. Start with minimal and bootstrap examples
  • bcc examples - Higher-level BPF tooling. Good for prototyping, but libbpf is preferred for production
  • Brendan Gregg’s blog - The definitive resource for performance analysis methodology and BPF tools
