Hands-on Projects
Theory is essential, but hands-on practice is what separates good candidates from great ones. These projects will help you build practical skills that interviewers look for. Each project is designed to force you through the same kernel interfaces and trade-offs that production infrastructure tools deal with daily. By the end, you will not just know what namespaces or eBPF are; you will have felt the friction of using them, debugged the edge cases, and built the muscle memory that interviewers can sense immediately.
Time Investment: 20-40 hours total
Outcome: Portfolio pieces for interviews + deep understanding
Project 1: Build a Container from Scratch
Difficulty: Medium-Hard
Time: 4-6 hours
Skills: Namespaces, cgroups, syscalls
Goal
Build a minimal container runtime in C or Go that demonstrates your understanding of Linux isolation primitives. This is arguably the single best project for infrastructure interviews because it touches namespaces, cgroups, filesystem isolation, and syscalls, all in one coherent application. When you can say “I built my own container runtime,” interviewers pay attention.
Requirements
Bonus Features
- Network namespace with veth pair: creates a virtual ethernet cable between host and container, the same mechanism Docker uses for bridge networking
- User namespace for rootless containers: maps container UID 0 to an unprivileged host UID, eliminating the need for real root privileges
- Seccomp filtering: restrict which syscalls the container can make (block mount, ptrace, kexec_load, etc.)
- Capability dropping: start with all capabilities, drop everything except what the workload needs
What You’ll Learn
- How Docker/containerd actually work: your container runtime is a stripped-down version of what runc does
- Practical namespace manipulation: you will feel the difference between PID 1 inside vs outside the namespace
- Cgroup setup and limits, and what happens when a process exceeds them (OOM killer behavior)
- Filesystem isolation with pivot_root, and why chroot is insufficient for security
- The startup cost of containers: measuring clone+pivot+mount gives you real numbers for “containers are lightweight”
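As a small illustration of the cgroup setup step, the sketch below computes the values a runtime would write into a cgroup v2 group directory. It is a Python model rather than the C or Go runtime itself, and the function names (`cgroup_v2_settings`, `apply_settings`) are hypothetical. The control files `memory.max`, `cpu.max`, and `pids.max` are standard cgroup v2 interfaces; the 100 ms CPU period is an assumed default.

```python
from pathlib import Path


def cgroup_v2_settings(memory_mb: int, cpu_percent: int, max_pids: int,
                       period_us: int = 100_000) -> dict:
    """Return {control file: value} for one cgroup v2 group (hypothetical helper)."""
    # cpu.max takes "<quota> <period>" in microseconds; quota/period is the CPU share.
    quota_us = period_us * cpu_percent // 100
    return {
        "memory.max": str(memory_mb * 1024 * 1024),  # hard limit in bytes
        "cpu.max": f"{quota_us} {period_us}",
        "pids.max": str(max_pids),
    }


def apply_settings(cgroup_dir: Path, settings: dict) -> None:
    """Write each value into the group directory (needs root or a delegated subtree)."""
    cgroup_dir.mkdir(parents=True, exist_ok=True)
    for name, value in settings.items():
        (cgroup_dir / name).write_text(value)
```

Calling `cgroup_v2_settings(100, 50, 64)` gives the values for a 100 MB, half-a-CPU, 64-process container. Writing the container's PID into the group's `cgroup.procs` file is the remaining step and is left out here.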
Project 2: Syscall Tracer with eBPF
Difficulty: Hard
Time: 6-8 hours
Skills: eBPF, kernel tracing, data structures
Goal
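Build a strace-like tool using eBPF that can trace syscalls with minimal overhead. Traditional strace uses ptrace(), which stops the traced process at every syscall entry and exit so the tracer can inspect it, costing multiple context switches per syscall. Your eBPF tracer runs inside the kernel without stopping the process, so its overhead is typically a small fraction of that (the exact figure depends on the syscall rate and on how much work the program does per event), which makes it far safer for production use.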
Requirements
Add filtering capabilities
- Filter by PID: use a BPF map to store target PIDs, check membership in the BPF program
- Filter by syscall type (file, network, memory): classify syscall numbers into categories
- Show only slow syscalls (above threshold): compute duration from enter/exit timestamps, only emit events that exceed the threshold
- Filter by return value: show only errors (ret < 0) for debugging permission issues
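The filtering rules above can be prototyped in plain Python before you move them into a BPF program. The sketch below is a model only: `SyscallFilter`, `Event`, and the `CATEGORIES` table are hypothetical names, and the table covers just a handful of x86-64 syscall numbers. In the real tracer the in-flight table would be a BPF hash map keyed by thread ID.

```python
from dataclasses import dataclass
from typing import Optional

# A few x86-64 syscall numbers mapped to categories (illustrative subset).
CATEGORIES = {0: "file", 1: "file", 257: "file",   # read, write, openat
              41: "network", 42: "network",         # socket, connect
              9: "memory", 12: "memory"}            # mmap, brk


@dataclass
class Event:
    pid: int
    syscall_nr: int
    duration_ns: int
    ret: int


class SyscallFilter:
    def __init__(self, target_pids=None, category=None,
                 min_duration_ns=0, errors_only=False):
        self.target_pids = set(target_pids or ())
        self.category = category
        self.min_duration_ns = min_duration_ns
        self.errors_only = errors_only
        self.inflight = {}  # tid -> (syscall_nr, enter timestamp)

    def on_enter(self, pid, tid, syscall_nr, ts_ns):
        # PID filter runs at entry so untracked processes cost almost nothing.
        if self.target_pids and pid not in self.target_pids:
            return
        self.inflight[tid] = (syscall_nr, ts_ns)

    def on_exit(self, pid, tid, ret, ts_ns) -> Optional[Event]:
        start = self.inflight.pop(tid, None)
        if start is None:
            return None
        syscall_nr, enter_ts = start
        duration = ts_ns - enter_ts
        if self.category and CATEGORIES.get(syscall_nr) != self.category:
            return None
        if duration < self.min_duration_ns:
            return None
        if self.errors_only and ret >= 0:
            return None
        return Event(pid, syscall_nr, duration, ret)
```

Keying the in-flight table by thread ID rather than PID matters because several threads of one process can be inside different syscalls at the same time.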
Expected Output
Bonus Features
- Latency histogram per syscall: aggregate durations into log2 buckets directly in BPF (avoid per-event export overhead)
- Argument decoding (file paths, flags): use bpf_probe_read_user_str() to read string arguments from userspace memory
- Export to JSON format, for integration with monitoring pipelines
- Integration with flame graphs: stack trace capture with bpf_get_stackid() to show where syscalls originate
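The log2 bucketing behind the latency histogram is simple enough to check in isolation. The sketch below is a Python model with hypothetical function names; a BPF program would compute the same bucket index with bit operations and increment a slot in an array map.

```python
def log2_bucket(value_ns: int) -> int:
    """Bucket i holds values in [2**i, 2**(i+1)); 0 and 1 both land in bucket 0."""
    return max(value_ns, 1).bit_length() - 1


def histogram(durations_ns, num_buckets=32):
    """Count durations per log2 bucket, clamping outliers into the last bucket."""
    buckets = [0] * num_buckets
    for duration in durations_ns:
        buckets[min(log2_bucket(duration), num_buckets - 1)] += 1
    return buckets
```

With 32 buckets the histogram covers durations up to roughly 4 seconds in nanoseconds, and the whole distribution for one syscall fits in a fixed-size map value.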
Project 3: Memory Leak Detector
Difficulty: Hard
Time: 6-8 hours
Skills: eBPF, memory management, stack traces
Goal
Build a tool that tracks memory allocations and identifies leaks in running processes. The core idea is elegant: hook malloc() to record every allocation (address, size, call stack), and hook free() to remove the record. Whatever remains after a measurement period is a potential leak.
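The bookkeeping described above can be modeled in a few lines of Python before any probes are written. This is a sketch under assumptions: `AllocationTracker` and its methods are hypothetical names, and the events are fed in by hand rather than by uprobes on malloc() and free().

```python
from collections import defaultdict


class AllocationTracker:
    def __init__(self):
        self.live = {}  # address -> (size, call stack)

    def on_malloc(self, addr, size, stack):
        if addr:  # a NULL return means the allocation failed; nothing to track
            self.live[addr] = (size, tuple(stack))

    def on_free(self, addr):
        # free(NULL) and pointers allocated before tracing started are ignored.
        self.live.pop(addr, None)

    def outstanding_by_stack(self):
        """Return (stack, count, bytes) rows, largest outstanding bytes first."""
        totals = defaultdict(lambda: [0, 0])
        for size, stack in self.live.values():
            totals[stack][0] += 1
            totals[stack][1] += size
        rows = [(stack, count, nbytes) for stack, (count, nbytes) in totals.items()]
        return sorted(rows, key=lambda row: row[2], reverse=True)
```

Grouping by call stack rather than by address is what turns a list of outstanding pointers into something actionable: one row per code path that allocated memory and never released it.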
Approach
Output Format
Project 4: Production CPU Profiler
Difficulty: Very Hard
Time: 8-10 hours
Skills: Perf events, stack unwinding, visualization
Goal
Build a sampling CPU profiler that generates flame graphs, suitable for production use. Unlike instrumentation-based profilers (which modify code), sampling profilers periodically interrupt the CPU and record what it is doing. The key insight: if you sample 100 times per second and function X appears in 30% of samples, it is using approximately 30% of CPU time. Statistical profiling with zero code changes.
Components
- Sampler: Use perf_event_open() for low-overhead sampling; the kernel does the interrupt and stack capture, you just read the results
- Stack Walker: Capture kernel and user stacks using frame pointers or DWARF unwinding
- Aggregator: Collapse and count identical stacks, converting N samples into “stack A appeared M times”
- Visualizer: Generate flame graph SVG, the standard visualization pioneered by Brendan Gregg
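The aggregator step can be sketched independently of the sampler. The Python model below uses hypothetical function names and assumes each sample arrives as a list of frames ordered leaf-first, which is how unwinders commonly report them. The output uses the folded format that flame graph tooling consumes: frames joined by semicolons from root to leaf, then a space and the sample count.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse identical stacks into 'root;...;leaf count' lines."""
    counts = Counter(";".join(reversed(stack)) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


def cpu_share(samples, function):
    """Fraction of samples in which `function` appears anywhere on the stack."""
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for stack in samples if function in stack)
    return hits / len(samples)
```

`cpu_share` is the statistical argument from the goal paragraph in code form: a function present in 30 of 100 samples is estimated to account for about 30% of CPU time, including time spent in its callees.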
Key Implementation
Flame Graph Generation
Project 5: Network Connection Tracker
Difficulty: Medium-Hard
Time: 4-6 hours
Skills: eBPF, networking, state machines
Goal
Build a tool that tracks all TCP connections with latency metrics. This gives you visibility into every connection your service makes: how long the TCP handshake took, how long the connection lasted, and which remote endpoints have the highest latency.
Features
- Track connection establishment latency (SYN to ESTABLISHED)
- Track connection duration (ESTABLISHED to CLOSE)
- Group by remote IP/port for aggregate statistics
- Show retransmission rates — a key indicator of network health
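The features above amount to a small state machine keyed by connection. The sketch below models it in Python with hypothetical names (`ConnTracker`, `on_state_change`, `on_retransmit`). One plausible source for the events, stated here as an assumption about your design rather than a requirement, is the sock:inet_sock_set_state tracepoint for state transitions and tcp:tcp_retransmit_skb for retransmissions.

```python
from collections import defaultdict


class ConnTracker:
    def __init__(self):
        self.syn_sent = {}     # conn_id -> timestamp of SYN_SENT
        self.established = {}  # conn_id -> timestamp of ESTABLISHED
        self.stats = defaultdict(lambda: {
            "connections": 0, "handshake_ns": 0,
            "duration_ns": 0, "retransmits": 0,
        })

    def on_state_change(self, conn_id, remote, new_state, ts_ns):
        if new_state == "SYN_SENT":
            self.syn_sent[conn_id] = ts_ns
        elif new_state == "ESTABLISHED":
            started = self.syn_sent.pop(conn_id, None)
            self.established[conn_id] = ts_ns
            entry = self.stats[remote]
            entry["connections"] += 1
            if started is not None:  # handshake began before tracing otherwise
                entry["handshake_ns"] += ts_ns - started
        elif new_state == "CLOSE":
            self.syn_sent.pop(conn_id, None)  # connection attempt that never completed
            started = self.established.pop(conn_id, None)
            if started is not None:
                self.stats[remote]["duration_ns"] += ts_ns - started

    def on_retransmit(self, remote):
        self.stats[remote]["retransmits"] += 1
```

Removing entries on CLOSE, including for connections that never reached ESTABLISHED, keeps the in-flight tables from growing without bound, which is the same concern a BPF hash map with a fixed max_entries would force on you.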
BPF Program
Study Plan Integration
Week 1-2: Foundation
- Complete Project 1 (Container from Scratch)
- Understand namespaces and cgroups deeply
- Milestone: Run /bin/sh inside your container with working PID isolation and a 100MB memory limit
Week 3-4: Tracing
- Complete Project 2 (Syscall Tracer)
- Master eBPF basics — BPF maps, ring buffers, tracepoints
- Milestone: Trace a target process’s file I/O and show per-syscall latency
Week 5-6: Memory
- Complete Project 3 (Memory Leak Detector)
- Understand memory allocation internals — malloc arenas, mmap, sbrk
- Milestone: Detect a synthetic memory leak in a test program and show the leaking call stack
Week 7-8: Production
- Complete Project 4 or 5 (pick based on your target role)
- Focus on production debugging skills — profiling, network analysis
- Milestone: Generate a flame graph for a real application and identify its CPU hot spots
Interview Discussion Points
When discussing these projects in interviews:
- Design decisions: Why did you choose this approach? (e.g., “I used ring buffers instead of perf buffers because ring buffers have a single shared buffer across CPUs, reducing memory overhead for my use case where per-CPU isolation was not needed”)
- Trade-offs: What are the limitations? (e.g., “My leak detector has false positives for long-lived caches because it cannot distinguish intentional caching from leaks”)
- Production readiness: How would you make it production-safe? (e.g., “I would add a BPF map size monitor that emits a metric when the map is above 80% capacity, and a circuit breaker that detaches the probes if CPU overhead exceeds 2%”)
- Extensions: How would you add feature X? (e.g., “To add per-container attribution, I would call bpf_get_current_cgroup_id() and use it as a secondary key in the aggregation map”)
- Debugging: How did you debug issues while building it? (e.g., “The BPF verifier rejected my first loop because it could not prove termination. I restructured it to use a bounded for with #pragma unroll and verified the generated bytecode with llvm-objdump”)
Interview Deep-Dive
You built a container from scratch. Walk me through exactly what happens in the kernel when you call clone() with CLONE_NEWPID | CLONE_NEWNS. What data structures are created, and how does the kernel maintain isolation?
- When clone() is called with CLONE_NEWPID, the kernel allocates a new struct pid_namespace and sets it as the child’s active PID namespace. The kernel maintains a hierarchy of PID namespaces: the new namespace is a child of the caller’s namespace. The child process gets PID 1 inside the new namespace (it becomes the init process of that namespace), but also has a PID in every ancestor namespace. The kernel’s struct pid contains an array of struct upid entries, one for each namespace level. When the child calls getpid(), the kernel returns the PID from the innermost namespace. When the parent refers to the child (for example, through the return value of clone() or in waitpid()), it sees the PID assigned in the parent’s namespace.
- For CLONE_NEWNS, the kernel calls copy_mnt_ns(), which duplicates the entire mount table of the parent process. Every struct mount in the parent’s namespace is cloned, creating a new struct mnt_namespace. After this, changes to mounts in the child (like pivot_root or mount) are invisible to the parent and vice versa, provided the affected mounts are not marked shared (mount events still propagate between namespaces through shared mounts, which is why runtimes usually make the root mount private first). This is how a container can have its own /proc mounted without affecting the host’s /proc.
- The isolation is enforced at the syscall boundary. When a process in the PID namespace calls kill(pid, sig), the kernel resolves pid within the caller’s namespace. PID 1 in the child namespace is unreachable by that PID number from the parent namespace (the parent must use the PID it was assigned in the parent’s namespace). For mount namespaces, open("/etc/passwd") resolves through the caller’s mount table, so the child sees its own filesystem even though the host has a different file at that path.
- An important subtlety: CLONE_NEWPID never moves the calling process. With clone(), only the new child lands in the new namespace and the caller remains in the original one. With unshare(CLONE_NEWPID), the caller also stays where it is, and only the children it creates afterwards enter the new namespace. This is why a runtime that uses unshare() has to fork once more: the first process forked after the call becomes PID 1 of the new namespace, and it then calls execve() to replace itself with the container’s init process.
- When PID 1 in a PID namespace dies, the kernel sends SIGKILL to all other processes in that namespace. This is the “init reaping” behavior: PID 1 is special because it is the default parent for orphaned processes. If it exits, there is nobody to adopt orphans, so the kernel cleans up by killing everything. This is why container runtimes need a proper init process (like tini or dumb-init) that handles signals and reaps zombie children, rather than running the application directly as PID 1. The kernel also delivers a signal to a namespace’s PID 1 only if that process has installed a handler for it (SIGKILL and SIGSTOP sent from an ancestor namespace are the exceptions). If your application runs as PID 1 without a SIGTERM handler, the SIGTERM from docker stop is silently dropped, so Docker waits 10 seconds (the default timeout) and then sends SIGKILL.
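The per-level PID bookkeeping is easier to reason about with a toy model. The Python sketch below is only an analogy for struct pid and its struct upid array: the class names and the `nr_in` method are hypothetical, and number allocation is reduced to a simple counter per namespace.

```python
class PidNamespace:
    def __init__(self, parent=None):
        self.parent = parent
        self.level = 0 if parent is None else parent.level + 1
        self.next_nr = 1  # the first process created here becomes PID 1

    def ancestry(self):
        """Namespaces from the root down to this one."""
        chain = []
        ns = self
        while ns is not None:
            chain.append(ns)
            ns = ns.parent
        return chain[::-1]


class Pid:
    """Toy analogue of struct pid: one number per namespace level."""

    def __init__(self, ns):
        self.numbers = []  # index = level, like the upid array
        for level_ns in ns.ancestry():
            self.numbers.append((level_ns, level_ns.next_nr))
            level_ns.next_nr += 1

    def nr_in(self, ns):
        """PID as seen from `ns`, or 0 if the process is not visible there."""
        if ns.level < len(self.numbers) and self.numbers[ns.level][0] is ns:
            return self.numbers[ns.level][1]
        return 0
```

A process created in a container namespace is PID 1 there and some ordinary number on the host, while a host process has no number at all inside the container. That asymmetry is the isolation: visibility flows down the hierarchy, never up or sideways.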
Your eBPF syscall tracer is dropping events in production. The ring buffer is filling up faster than userspace can consume. How do you diagnose this and what architectural changes would you make to handle 500,000 syscalls per second?
- First, diagnosis: I would check a drop counter that the tracer maintains itself. When bpf_ringbuf_reserve() returns NULL, I would increment a per-CPU counter in a BPF_MAP_TYPE_PERCPU_ARRAY map. Userspace reads this counter periodically to compute the drop rate. If the drop rate is high, the bottleneck is one of three things: (1) the ring buffer is too small, (2) userspace is consuming too slowly, or (3) the per-event data is too large.
- For 500K syscalls/second, the naive approach (one ring buffer event per syscall with full context) generates roughly 500K * 200 bytes = 100 MB/second of data. This overwhelms the ring buffer’s userspace consumer, which has to be woken through epoll_wait() and then process every record out of the memory-mapped buffer.
- The architectural fix is to move aggregation into the BPF program itself. Instead of emitting per-event data, I would use a BPF_MAP_TYPE_PERCPU_HASH map keyed by (pid, syscall_nr) with value (count, total_latency_ns, max_latency_ns). The BPF program increments counters in-kernel for every syscall. Userspace reads the aggregated map every 1-5 seconds, getting a compact summary rather than a firehose of raw events. This reduces data transfer from 100 MB/s to kilobytes per read.
- For cases where per-event detail is needed (e.g., capturing arguments of specific slow syscalls), I would use a two-tier approach: the BPF program aggregates everything in-kernel but also checks if a syscall exceeds a latency threshold (e.g., 10ms). Only threshold-exceeding events go to the ring buffer. This gives you both aggregate statistics and detailed traces of interesting events, without the overhead of logging everything.
- For the ring buffer itself: size it based on the expected burst rate, not the average rate. If you expect 100 events/second going to the ring buffer but bursts of 10,000, size the buffer to hold at least 2 seconds of burst data. Use BPF_RB_FORCE_WAKEUP on high-priority events to wake the consumer immediately, and BPF_RB_NO_WAKEUP on routine events to let them batch.
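The sizing arithmetic in this answer is worth writing down explicitly. The sketch below uses hypothetical helper names and assumes a 4096-byte page and 200-byte events. It rounds the result up to a power of two that is at least one page, because a BPF ring buffer's size has to be a power of two and page-aligned. The small per-record header that the ring buffer adds is ignored here.

```python
PAGE_SIZE = 4096  # assumed page size; query the real value on the target system


def event_bandwidth(events_per_sec: int, event_bytes: int) -> int:
    """Bytes per second produced by emitting one record per event."""
    return events_per_sec * event_bytes


def ringbuf_bytes(burst_events_per_sec: int, burst_seconds: int,
                  event_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Smallest power-of-two size (at least one page) that holds the burst."""
    needed = burst_events_per_sec * burst_seconds * event_bytes
    size = page_size
    while size < needed:
        size *= 2
    return size
```

For the numbers in the answer, 10,000 events/second for 2 seconds at 200 bytes each needs 4,000,000 bytes, which rounds up to a 4 MiB buffer, while the naive per-syscall design would have to move 100 MB every second.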
BPF_MAP_TYPE_PERCPU_HASH allocates a separate copy of each map value for each CPU. When CPU 3 increments a counter, it only touches CPU 3’s copy: no locks, no cache-line bouncing, no contention. This is critical at 500K syscalls/second because even a single shared atomic counter would cause cache-line bouncing across CPUs (each increment invalidates the cache line on all other CPUs). The trade-off is memory: if you have 64 CPUs and 10,000 map entries of 64 bytes each, the total value memory is 64 * 10,000 * 64 = about 41 MB instead of 640 KB for a non-PERCPU map. The other trade-off is read complexity: when userspace reads the map, it gets an array of values (one per CPU) and must merge them. For counter-type values this is a plain sum, but for more complex aggregations (histograms, min/max) the merge logic can be subtle.
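The userspace merge and the memory arithmetic can be sketched as follows. This is a Python model with hypothetical function names, assuming the (count, total_latency_ns, max_latency_ns) value layout from the earlier answer. Note that the three fields need different merge operators, which is the subtlety the paragraph refers to.

```python
def merge_percpu(values):
    """Merge one (count, total_latency_ns, max_latency_ns) tuple per CPU."""
    count = sum(v[0] for v in values)            # counters add up
    total = sum(v[1] for v in values)            # so do latency totals
    peak = max((v[2] for v in values), default=0)  # a maximum must not be summed
    return count, total, peak


def percpu_map_bytes(num_cpus: int, entries: int, value_bytes: int) -> int:
    """Value storage for a per-CPU map (keys and map overhead not included)."""
    return num_cpus * entries * value_bytes
```

Summing the max field by mistake is an easy bug to introduce and a hard one to notice, since the result still looks like a plausible latency.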
Your memory leak detector reports that a Java application is leaking memory, but the Java heap is stable and GC metrics look fine. What is happening and how would you modify your tool to diagnose the real issue?
- This is almost certainly a native memory leak outside the Java heap. The JVM allocates native memory for many internal structures: JIT-compiled code buffers (CodeCache), thread stacks, NIO direct byte buffers (ByteBuffer.allocateDirect() calls malloc() under the hood), JNI native allocations, and class metadata (Metaspace, which uses mmap() internally). None of these show up in Java heap metrics or GC logs.
- My eBPF leak detector would already catch the malloc() leaks, but the stack traces would show JVM internal frames that are hard to interpret. The fix is multi-layered. First, I would add mmap() tracking to the detector (not just malloc), because the JVM uses mmap(MAP_ANONYMOUS) for large allocations including Metaspace and CodeCache. I would hook mmap and munmap the same way I hook malloc and free. Second, I would correlate the native allocations with JVM internal metrics: jcmd <pid> VM.native_memory summary gives a breakdown by JVM subsystem (CodeCache, Thread, Class, etc.). If Thread memory is growing linearly, it means threads are being created but not destroyed (common with thread-per-request models under load).
- For NIO direct buffers specifically: these are allocated with malloc() from native code but released through Java’s sun.misc.Cleaner mechanism (jdk.internal.ref.Cleaner since JDK 9), which only runs once the owning buffer object has been garbage collected. If the Java heap has enough headroom that GC runs infrequently, the Cleaners do not fire, and the native buffers accumulate. The fix is either calling System.gc() periodically (ugly but effective), using -XX:MaxDirectMemorySize to cap direct buffer usage, or switching to heap-backed buffers.
- To make the tool more JVM-aware, I would also hook dlopen() to detect loaded JNI libraries (which often have their own allocation patterns) and start the JVM with the -XX:NativeMemoryTracking=summary flag, which the jcmd command relies on, to get the JVM’s own view of native memory. I would then cross-reference that with my eBPF data to identify discrepancies.
- A genuine leak has a characteristic signature: allocations accumulate monotonically from the same call stack over time. A cache, by contrast, grows and then plateaus (when the cache is full, old entries are evicted and freed). My tool can distinguish these by tracking allocation rate over time windows. I would modify the userspace component to sample the allocations map every 30 seconds and compute the delta per call stack. If a specific stack shows a constant positive delta (say, 100 new outstanding allocations every 30 seconds) that never decreases, it is likely a leak. If the delta was high initially but has dropped to near-zero, it is a cache that has reached steady state. I would add a --duration flag that runs for multiple measurement windows and flags only stacks with a consistently positive growth rate as likely leaks, filtering out one-time allocations and caches.
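The windowed classification can be sketched as a small function over the samples for one call stack. The names, the three-window minimum, and the zero tolerance are all assumptions for illustration; a real tool would tune them against the application's allocation noise.

```python
def classify_stack(outstanding, min_windows=3, tolerance=0):
    """Classify one call stack from its outstanding-allocation count per window.

    `outstanding` holds one sample per measurement window, oldest first.
    """
    deltas = [later - earlier for earlier, later in zip(outstanding, outstanding[1:])]
    if len(deltas) < min_windows:
        return "unknown"        # not enough windows to judge
    if all(d > tolerance for d in deltas):
        return "likely leak"    # grew in every single window
    if all(abs(d) <= tolerance for d in deltas[-min_windows:]):
        return "steady state"   # grew earlier, flat recently: cache or one-time setup
    return "inconclusive"
```

Requiring growth in every window is deliberately strict. It trades missed slow or bursty leaks for fewer false positives, which matches the goal of flagging only stacks with a consistently positive growth rate.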
Resources
- Linux kernel source - The ultimate reference. Start with kernel/nsproxy.c for namespaces, kernel/cgroup/ for cgroups
- libbpf-bootstrap - Minimal BPF project templates. Start with minimal and bootstrap examples
- bcc examples - Higher-level BPF tooling. Good for prototyping, but libbpf is preferred for production
- Brendan Gregg’s blog - The definitive resource for performance analysis methodology and BPF tools