This module contains actual interview questions from infrastructure, observability, and platform engineering roles at top companies. Each question includes detailed solutions and the key insights interviewers are looking for.

Think of this module the way a kernel developer thinks about test suites: you don’t just verify the happy path; you test the boundary conditions, the failure modes, and the performance under pressure. These questions work the same way. They test whether you have a working mental model of the kernel, not whether you memorized a man page.
What to Expect: These questions require deep understanding, not memorization
Interview Format: Usually 45-60 minute deep dives into 2-3 topics
Time to Prepare: 10-12 hours to work through all scenarios
A senior engineer would say: “The best interview answers trace the path from user space down to the kernel and back. If you can explain where in the stack a problem lives and why the kernel behaves that way, you’ve already separated yourself from 90% of candidates.”
Question 1: Implement a Syscall Counter (Datadog-style)
Context: You’re asked to implement a tool that counts syscalls by process in production without significant overhead.
The Question:
“Design and implement a production-safe syscall counter. It should show the top processes by syscall count in real-time. Discuss the trade-offs of different approaches.”
Discussion Points
Interviewer is looking for:
Knowledge of different approaches (strace, perf, eBPF)
Understanding of overhead implications
Production safety considerations
Sampling vs complete counting trade-offs
Key trade-offs to discuss:
strace: Per-syscall ptrace, very high overhead (~100x slowdown)
perf: Sampling-based, lower overhead, may miss syscalls
eBPF kprobes: Slightly higher overhead than tracepoints and no stable ABI, but more flexible (can attach to almost any kernel function)
Solution Approach
Best approach: eBPF tracepoint

Why eBPF over strace or perf? It comes down to how the kernel processes these hooks. strace uses ptrace, which stops the traced process on every single syscall: the kernel halts the process, notifies the tracer, waits for the tracer to resume it, then lets the process continue. At scale, that is a 100x slowdown. eBPF tracepoints, by contrast, run a small verified program inside the kernel at the tracepoint site, with no context switches and no copying of data to user space on every event.
// syscall_counter.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

// Why BPF_MAP_TYPE_HASH? We need O(1) lookups by PID, and the PID
// space is sparse (not all values 0..32768 are active). A hash map
// is the right fit. max_entries caps memory usage -- if we hit 10240
// tracked PIDs, new ones simply won't be inserted (graceful degradation).
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);    // PID -- upper 32 bits of pid_tgid
    __type(value, u64);  // Cumulative syscall count for this PID
} syscall_count SEC(".maps");

// Separate map for process names so the user-space reader can
// display human-friendly output without calling /proc/<pid>/comm
// (which itself triggers more syscalls and file I/O).
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);         // PID
    __type(value, char[16]);  // comm (16 bytes is the kernel TASK_COMM_LEN)
} pid_comm SEC(".maps");

// Why raw_syscalls/sys_enter? This tracepoint fires for EVERY syscall
// regardless of type. It sits at the very top of the syscall entry
// path in the kernel (entry_SYSCALL_64 -> do_syscall_64 -> tracepoint).
// Using a tracepoint (not a kprobe) gives us a stable ABI -- the
// tracepoint format is versioned, so this code survives kernel upgrades.
SEC("tracepoint/raw_syscalls/sys_enter")
int tracepoint__raw_syscalls__sys_enter(struct trace_event_raw_sys_enter *ctx)
{
    // bpf_get_current_pid_tgid() returns (tgid << 32 | tid).
    // We shift right by 32 to get the tgid, which is what user space
    // calls the "PID". The lower 32 bits are the thread ID.
    u32 pid = bpf_get_current_pid_tgid() >> 32;

    u64 *count = bpf_map_lookup_elem(&syscall_count, &pid);
    if (count) {
        // Hot path: PID already tracked. A plain (*count)++ is a
        // non-atomic read-modify-write, and on a HASH map the value is
        // NOT per-CPU, so concurrent threads in the same tgid would
        // race and lose updates. Use an atomic add here.
        // For a production tool, use PERCPU_HASH (see below).
        __sync_fetch_and_add(count, 1);
    } else {
        u64 initial = 1;
        bpf_map_update_elem(&syscall_count, &pid, &initial, BPF_ANY);

        // Store comm only on first sighting of this PID.
        // bpf_get_current_comm reads from current->comm in the
        // kernel task_struct -- it is always 16 bytes or fewer.
        char comm[16];
        bpf_get_current_comm(&comm, sizeof(comm));
        bpf_map_update_elem(&pid_comm, &pid, &comm, BPF_ANY);
    }
    return 0;
}

// A GPL-compatible license is required to call GPL-only helpers, which
// most tracing programs end up needing. Without it, the verifier
// rejects those helper calls at load time.
char LICENSE[] SEC("license") = "GPL";
User-space component:
Periodically read the map (every 1-2 seconds), sort by count, display top N processes
Handle PID reuse: cross-reference with /proc/<pid>/stat start time so a recycled PID doesn’t inherit the old count
Optionally reset counters each interval for a “rate” view (syscalls/sec per process)
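The PID-reuse check in the list above can be sketched as follows. This is a minimal sketch, assuming the usual /proc/<pid>/stat layout in which the process start time is field 22; the helper name parse_starttime is hypothetical. It takes the file contents as a string, so the parsing can be exercised without a live process.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract starttime (field 22, in clock ticks since
 * boot) from the contents of /proc/<pid>/stat. Returns 0 on success. */
int parse_starttime(const char *stat_line, unsigned long long *starttime)
{
    /* comm (field 2) is wrapped in parentheses and may itself contain
     * spaces or ')', so resume parsing after the LAST ')'. */
    const char *p = strrchr(stat_line, ')');
    if (!p)
        return -1;
    p++;

    /* Skip fields 3..21 (19 space-separated tokens). */
    for (int field = 3; field < 22; field++) {
        while (*p == ' ')
            p++;
        if (*p == '\0')
            return -1;
        while (*p != ' ' && *p != '\0')
            p++;
    }

    /* sscanf skips the leading space and reads field 22. */
    return sscanf(p, "%llu", starttime) == 1 ? 0 : -1;
}
```

A reader loop could store the start time next to each PID's count and discard the count whenever the start time changes, since that means the PID now belongs to a different process.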
Debugging tip: If your eBPF program fails to load, run bpftool -d prog load ./syscall_counter.bpf.o /sys/fs/bpf/test and read the verifier log that bpftool prints to the terminal (it does not go to dmesg). The verifier output tells you exactly which instruction failed and why, usually a missing bounds check or an uninitialized register.
Production Considerations
What makes this production-safe:
Bounded map size: Won’t consume unlimited memory
No locks in hot path: Per-CPU increments would be ideal
Graceful degradation: If map full, just skip new PIDs
Low overhead: Tracepoint, not ptrace
Improvements for production:
// Use per-CPU hash for counting. Each CPU gets its own copy of the
// value, eliminating lock contention entirely. The kernel allocates
// NR_CPUS copies of each value behind the scenes. User space reads
// all per-CPU values and sums them -- a small cost paid infrequently.
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);
    __type(value, u64);
} syscall_count SEC(".maps");
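The user-space side of a per-CPU map (sum the per-CPU slots, then rank PIDs) might look like the sketch below. It is an illustration only: the names pid_count, sum_percpu, and sort_top are hypothetical, and fetching the per-CPU array from the map (for example through libbpf) is deliberately left out so the logic stays self-contained.

```c
#include <stdint.h>
#include <stdlib.h>

struct pid_count {
    uint32_t pid;
    uint64_t count;
};

/* Sum the per-CPU slots returned for one key of a PERCPU_HASH map. */
uint64_t sum_percpu(const uint64_t *values, int ncpus)
{
    uint64_t total = 0;
    for (int i = 0; i < ncpus; i++)
        total += values[i];
    return total;
}

/* qsort comparator: descending by count. */
static int by_count_desc(const void *a, const void *b)
{
    const struct pid_count *x = a, *y = b;
    if (x->count == y->count)
        return 0;
    return x->count < y->count ? 1 : -1;
}

/* Sort in place so the busiest PIDs come first; print the first N rows
 * to get a "top N" view. */
void sort_top(struct pid_count *rows, size_t n)
{
    qsort(rows, n, sizeof(rows[0]), by_count_desc);
}
```

Summing in user space is what keeps the kernel-side hot path free of shared writes: each CPU only ever touches its own slot.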
Overhead estimation (these numbers matter in interviews — know them):
~50-100ns per syscall (compare to ptrace: ~10-50us per syscall)
On a busy system (100K syscalls/sec): ~1% CPU overhead
On an extremely busy system (1M syscalls/sec): ~5-10% CPU — at this point, consider sampling
Memory footprint: roughly 10240 entries * 8-byte value * NR_CPUS (keys are stored once, not per CPU), so about 5 MB on a 64-core machine, plus per-entry map overhead
Acceptable for production monitoring, but always set an upper bound on map size
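The CPU percentages above are simple arithmetic: event rate times per-event cost. A small sketch, assuming the per-call costs quoted in the list (the function name hook_overhead_fraction is made up for illustration):

```c
/* Fraction of one CPU core spent inside the hook, given an event rate
 * and an assumed per-event cost in nanoseconds. */
double hook_overhead_fraction(double syscalls_per_sec, double ns_per_call)
{
    return syscalls_per_sec * ns_per_call / 1e9;
}
```

With 100K syscalls/sec at 100ns each this gives 0.01 (1% of a core); with 1M syscalls/sec at 50-100ns it gives 0.05-0.10, which is where the 5-10% figure comes from.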
Production gotcha: If your map fills up (all 10240 slots taken), bpf_map_update_elem returns -E2BIG for new PIDs. Because the program ignores that return value, it silently stops tracking new processes. In production, monitor the map fill level from user space and either increase max_entries or implement an LRU eviction policy with BPF_MAP_TYPE_LRU_PERCPU_HASH.
Context: A service is experiencing intermittent latency spikes. You need to identify the cause without restarting the service.
The Question:
“A production service has p99 latency spikes from 10ms to 500ms every few minutes. How would you debug this? Walk me through your approach.”
Investigation Framework
Systematic approach:
Characterize the problem:
When do spikes occur? (Time correlation)
Which requests are affected? (Endpoint, payload)
Duration of spikes? (Seconds, minutes)
Gather baseline metrics:
CPU utilization (is there contention?)
Memory usage (swapping? GC?)
Disk I/O (latency, throughput)
Network (retransmits, latency)
Narrow down the layer:
Application code?
Runtime (GC pauses)?
Kernel (scheduling, I/O)?
Hardware (disk, network)?
Tools and Commands
Quick triage (a senior engineer runs these in this order — start wide, then narrow):
# 1. Check for scheduling issues -- are we being preempted?
# perf sched records scheduler events (context switches, migrations)
# and reports which tasks waited longest to get back on a CPU.
# Recording is system-wide; find the service's row in the report.
sudo perf sched record -- sleep 5
sudo perf sched latency

# 2. Check for off-CPU time -- where is the process blocked?
# offcputime uses eBPF to capture stack traces at every point the
# process is descheduled. Long off-CPU stacks point at I/O, locks,
# or page faults. Think of it as the inverse of a CPU profiler.
sudo offcputime-bpfcc -p <pid> 5

# 3. Check for I/O latency -- is the disk the bottleneck?
# biolatency hooks into the block I/O layer (blk_mq_start_request /
# blk_account_io_done) and builds a latency histogram. Spikes in
# the 10ms+ bucket usually mean disk contention or a slow device.
sudo biolatency-bpfcc 5

# 4. Check for memory pressure -- are we swapping or faulting?
# VmSwap > 0 means pages have been evicted to disk. Even small
# amounts of swap can cause tail latency spikes because swap I/O
# is synchronous in the page fault path.
grep -E "VmRSS|VmSwap" /proc/<pid>/status
sar -B 1 5   # Page faults -- look at majflt/s (disk-backed faults)

# 5. Check for lock contention -- last because it is expensive.
# perf lock instruments the kernel's locking primitives (mutexes,
# spinlocks, rwlocks) and shows which locks are most contended.
sudo perf lock record -p <pid> -- sleep 5
sudo perf lock report
Deep analysis with bpftrace:
# Trace slow syscalls -- this is the single most useful one-liner
# for latency debugging. It instruments the raw_syscalls tracepoints,
# which fire at the very beginning and end of EVERY syscall. We
# record a timestamp on entry, compute the delta on exit, and only
# print if the syscall took longer than our threshold (10ms here).
# The @slow map counts the slow calls per syscall number.
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter /pid == $1/ {
    @start[tid] = nsecs;
}
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000000;
    if ($lat > 10) {
        printf("%s syscall %d took %d ms\n", comm, args->id, $lat);
        @slow[args->id] = count();
    }
    delete(@start[tid]);
}' <pid>
# $1 is bpftrace's first positional parameter: pass the target PID as
# the trailing argument. Run for 30-60 seconds during a spike.
# If you see futex or epoll_wait dominating @slow, look at lock
# contention or slow downstream dependencies. If read/write dominate,
# look at disk I/O or network.
Debugging tip: If you cannot reproduce the latency spike interactively, leave a bpftrace script running and send its output to a log file with bpftrace -o /tmp/slow_syscalls.log. When the spike happens, you have the evidence. This is the kernel equivalent of always-on distributed tracing.
The Question:
“Explain what happens when a container hits its memory limit. What are the different behaviors, and how would you debug an OOM-killed container?”
Memory Limit Behavior
When container approaches memory limit:

Think of the kernel’s memory cgroup controller like a building with multiple fire alarms. Each trip point triggers a different response, from gentle warnings to full evacuation. Understanding which alarm you hit determines how you debug the problem.
Usage < memory.high
└── Normal operation -- kernel allocates pages freely

Usage > memory.high (the "throttle" zone)
└── Kernel activates direct reclaim on the allocating task
└── The process is forced to help reclaim pages before getting new ones
└── Reclaim pressure increases (page cache evicted first, then anon pages)
└── Application slows down -- this is INTENTIONAL back-pressure
└── Unlike memory.max, this does NOT trigger OOM -- it slows you down

Usage > memory.max (the "hard wall")
└── Kernel tries to reclaim first; if that fails, the allocation cannot be satisfied
└── OOM killer invoked WITHIN this cgroup (not system-wide)
└── Process with the highest oom_score (badness, shifted by oom_score_adj) in the cgroup is killed
└── If all processes die, the cgroup is empty and gets cleaned up
Cgroup v2 memory controls (know these cold for interviews):
memory.current: Current usage in bytes — what the cgroup is actually consuming right now
memory.max: Hard limit (OOM if exceeded) — the absolute ceiling, no negotiation
memory.high: Soft limit (throttling begins) — the kernel starts pushing back but does not kill
memory.low: Best-effort protection — the kernel tries not to reclaim from this cgroup when the system is under pressure, but will if there is no other choice
memory.min: Hard protection — the kernel will never reclaim below this watermark, even under extreme system-wide pressure. Use this for critical workloads that must not be evicted.
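To make the thresholds concrete, here is a small sketch that classifies a cgroup into the zones described above from the contents of memory.current, memory.high, and memory.max. The function and enum names are hypothetical; the only format assumption is that the limit files hold either a byte count or the literal string "max" (no limit).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CG_NO_LIMIT UINT64_MAX

enum mem_zone { ZONE_NORMAL, ZONE_THROTTLED, ZONE_AT_LIMIT };

/* Parse the contents of memory.high or memory.max: a byte count, or the
 * literal "max" meaning "no limit configured". */
uint64_t parse_cgroup_limit(const char *text)
{
    if (strncmp(text, "max", 3) == 0)
        return CG_NO_LIMIT;
    return strtoull(text, NULL, 10);
}

/* Map current usage onto the zones: at or beyond memory.max is the hard
 * wall, above memory.high is the throttle zone, anything else is normal. */
enum mem_zone classify(uint64_t current, uint64_t high, uint64_t max)
{
    if (current >= max)
        return ZONE_AT_LIMIT;
    if (current > high)
        return ZONE_THROTTLED;
    return ZONE_NORMAL;
}
```

A monitoring agent could alert on ZONE_THROTTLED as the early warning, since by the time a cgroup reaches ZONE_AT_LIMIT the OOM killer may already be running.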
Debugging OOM Kills
Immediate diagnostics (run these within minutes of the kill — some evidence is ephemeral):
# Check kernel logs -- the OOM killer always leaves a trace in dmesg.
# The kernel's oom_kill_process() function logs the victim PID, its
# memory usage breakdown, and every process in the cgroup at kill time.
dmesg | grep -i "oom\|killed"

# Check container events -- Docker's event stream captures OOM events
# from the cgroup notification. This tells you WHEN it happened.
docker events --filter 'event=oom'

# Get detailed OOM context from journald -- the -A 10 grabs the
# full memory dump that the kernel prints, including per-process
# RSS breakdowns for every task in the cgroup.
journalctl -k | grep -A 10 "invoked oom-killer"
Understanding OOM output (interviewers love when you can read raw kernel output):
Memory cgroup out of memory: Killed process 12345 (myapp)
total-pgfault:15000 total-pgmajfault:50
anon-rss:1048576kB file-rss:0kB shmem-rss:0kB

Key indicators (a senior engineer reads these like a doctor reads blood work):
- anon-rss: Heap, stack, anonymous mmap -- this is YOUR code's allocations
- file-rss: Mapped files (shared libraries, mmap'd data) -- can be evicted
- shmem-rss: Shared memory segments (tmpfs, IPC) -- charged to the cgroup
- total-pgmajfault: Major page faults = disk I/O. High values here mean the system was already under pressure before the OOM event.
Debugging tip: If OOM kills happen repeatedly but anon-rss is small, check shmem-rss. A common culprit is /dev/shm usage inside containers (many ML frameworks and databases use shared memory heavily). Also check memory.stat for kernel and slab entries — kernel memory charged to the cgroup can silently consume your limit.
Memory profiling:
# Track allocations over time
docker stats <container>

# Get detailed memory breakdown
cat /sys/fs/cgroup/<path>/memory.stat

# Profile with BPF
sudo memleak-bpfcc -p <pid>
Prevention Strategies
Proper memory sizing (this is where most teams get it wrong):
Profile application under realistic load — not curl localhost, but actual production traffic replayed against a staging environment
Account for peak usage, not average — the OOM killer does not care about your P50; it cares about your P100
Include headroom for GC (JVM needs ~2x heap for GC overhead), file cache (the kernel will use any free memory in the cgroup for page cache), and kernel slab allocations
Monitor memory.high throttling events — they are the early warning that your limit is too tight
Kubernetes recommendations:
resources:
  requests:
    memory: "256Mi"   # Used by the scheduler for bin-packing decisions
  limits:
    memory: "512Mi"   # Maps directly to memory.max in cgroup v2
    # Why 2x? Because page cache + GC overhead + kernel slab can
    # easily double your application's "heap" memory usage.
    # If your app reports 200Mi heap, you need 400-512Mi as the limit.
Application-level protections:
Set JVM heap below container limit: -Xmx should be ~75% of memory.max to leave room for non-heap memory (thread stacks, native memory, class metadata). A common formula: -Xmx = container_limit * 0.75
Use memory-aware allocators like jemalloc or tcmalloc that return memory to the OS more aggressively than glibc’s default allocator (which holds onto freed pages via brk)
Implement backpressure mechanisms: when memory usage exceeds a threshold, stop accepting new work. This is cheaper than being OOM-killed and restarting.
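The heap-sizing formula and the backpressure idea above can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, and the hysteresis (a separate, lower threshold for resuming work) is an addition not described in the text, included so the service does not flap at the boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/* -Xmx suggestion: 75% of the container limit, per the rule of thumb. */
uint64_t suggested_heap_bytes(uint64_t container_limit)
{
    return container_limit / 4 * 3;
}

/* Backpressure with hysteresis (an assumption of this sketch): start
 * shedding work at shed_pct percent of the limit, and once shedding,
 * keep shedding until usage drops below resume_pct. */
bool should_shed(uint64_t current, uint64_t limit, bool shedding,
                 unsigned shed_pct, unsigned resume_pct)
{
    double pct = limit ? 100.0 * (double)current / (double)limit : 0.0;

    if (shedding)
        return pct >= resume_pct;
    return pct >= shed_pct;
}
```

For a 512Mi limit, suggested_heap_bytes returns 384Mi. The current and limit values would come from memory.current and memory.max.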
The Question:
“Explain the journey of a packet from the NIC to the application. Where are the performance bottlenecks, and how would you optimize for high packet rates?”
Packet Journey
PACKET RECEIVE PATH

1. NIC receives packet
   └─→ DMA to ring buffer in memory
   └─→ Raise interrupt

2. Interrupt handler (hardirq)
   └─→ Acknowledge interrupt
   └─→ Schedule NAPI poll (softirq)
   └─→ Disable further interrupts for this queue

3. NAPI poll (softirq context)
   └─→ Poll ring buffer for packets
   └─→ XDP (if attached in native mode) - earliest hook, runs before sk_buff allocation
   └─→ Allocate sk_buff structures
   └─→ Process up to budget packets
   └─→ Re-enable interrupts if done

4. Network stack processing
   └─→ tc ingress
   └─→ netfilter/iptables
   └─→ IP routing
   └─→ TCP/UDP processing

5. Socket layer
   └─→ Socket buffer (sk_buff queue)
   └─→ Wake up waiting application

6. Application
   └─→ read()/recv() copies to user space
   └─→ Process data
Performance Bottlenecks
Common bottlenecks:
Interrupt overhead:
Each interrupt: ~1-2μs
At 1M pps with one interrupt per packet: 1-2 seconds of CPU time per second, so at least one core does nothing but handle interrupts
Solution: NAPI, interrupt coalescing
Memory allocation:
sk_buff allocation per packet
Solution: Page pools, recycling
Lock contention:
Socket lock for each packet
Solution: SO_REUSEPORT, RSS
Cache misses:
Packet data not in cache
Solution: Busy polling, NUMA awareness
Context switches:
Waking application per packet
Solution: Batching, busy polling
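The interrupt-overhead figures above, and the effect of coalescing, follow from one line of arithmetic. A sketch, with a hypothetical function name and the per-interrupt cost treated as an assumed input:

```c
/* CPU cores consumed by interrupt handling alone.
 * pps:            packets per second
 * us_per_irq:     assumed cost of one interrupt, in microseconds
 * frames_per_irq: packets delivered per interrupt (1 = no coalescing) */
double irq_cores(double pps, double us_per_irq, double frames_per_irq)
{
    return (pps / frames_per_irq) * us_per_irq / 1e6;
}
```

At 1M pps and 1us per interrupt, one interrupt per packet costs a full core (1.0). Coalescing 64 frames per interrupt drops that to about 0.016 of a core, which is why NAPI and interrupt coalescing are the first fixes to reach for.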
Optimization Techniques
Hardware level:
# Enable RSS (Receive Side Scaling) -- this tells the NIC to hash
# incoming packets (by 5-tuple: src/dst IP, src/dst port, protocol)
# and distribute them across multiple RX queues. Each queue gets its
# own interrupt, so multiple CPUs process packets in parallel.
# Without RSS, one CPU handles ALL packets -- a guaranteed bottleneck.
ethtool -L eth0 combined 8

# Configure interrupt coalescing -- instead of firing an interrupt
# for every single packet, wait up to 50 microseconds OR 64 packets
# before interrupting. This trades a tiny bit of latency for a huge
# reduction in interrupt overhead at high packet rates.
ethtool -C eth0 rx-usecs 50 rx-frames 64

# Pin interrupts to CPUs -- ensure each NIC queue's interrupt is
# handled by a specific CPU. This keeps packet processing cache-hot
# on that CPU. Without pinning, the kernel may migrate interrupts
# across CPUs, causing cache misses on the packet data.
echo 1 > /proc/irq/XX/smp_affinity
Kernel level:
# Increase socket buffer sizes -- the default rmem_max (212992 bytes)
# is far too small for high-throughput applications. At 10Gbps, the
# kernel can fill that buffer in ~170 microseconds. If the application
# does not read fast enough, packets are dropped at the socket layer.
sysctl -w net.core.rmem_max=26214400   # 25 MB

# Enable busy polling -- instead of sleeping in epoll_wait and being
# woken by a softirq, the application actively polls the NIC driver
# for new packets. This eliminates the softirq-to-application wakeup
# latency (~5-10us) at the cost of burning CPU cycles while polling.
# Only worth it for latency-sensitive apps with dedicated CPU cores.
sysctl -w net.core.busy_poll=50   # poll for 50us before sleeping
sysctl -w net.core.busy_read=50   # same for synchronous reads

# Tune NAPI GRO flush timeout -- GRO (Generic Receive Offload)
# coalesces multiple small packets into fewer large ones before
# passing them up the stack. This timeout controls how long the
# kernel waits for more packets to coalesce. The value is in
# NANOSECONDS, so useful settings are in the tens of microseconds.
echo 20000 > /sys/class/net/eth0/gro_flush_timeout   # 20us
Application level:
// SO_REUSEPORT lets multiple sockets bind to the same port.
// The kernel distributes incoming connections across them using
// a hash. Each thread gets its own socket and its own accept queue,
// eliminating the thundering herd problem where all threads wake
// up for a single incoming connection.
int opt = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));

// recvmmsg receives multiple messages in a single syscall.
// Each syscall has ~1us of overhead (user-to-kernel transition,
// register save/restore). Batching amortizes this across BATCH_SIZE
// messages. At 1M pps, this alone can save ~50% of CPU time.
// Note: every msgs[i].msg_hdr must point at an iovec and buffer
// before the call; that setup is omitted from this fragment.
struct mmsghdr msgs[BATCH_SIZE];
int ret = recvmmsg(fd, msgs, BATCH_SIZE, 0, NULL);
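The recvmmsg fragment leaves out the buffer setup, which is where most first attempts go wrong. The sketch below fills that in. The names recv_batch and MAX_DGRAM are hypothetical, and MSG_DONTWAIT is a choice made for this sketch (return whatever is already queued) rather than something the fragment above uses.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH_SIZE 32
#define MAX_DGRAM  2048

/* Receive up to BATCH_SIZE datagrams in one syscall.
 * bufs[i] receives the payload of message i and lens[i] its length.
 * Returns the number of messages received, or -1 on error. */
int recv_batch(int fd, char bufs[BATCH_SIZE][MAX_DGRAM],
               unsigned int lens[BATCH_SIZE])
{
    struct mmsghdr msgs[BATCH_SIZE];
    struct iovec iov[BATCH_SIZE];

    /* Each message header needs its own iovec pointing at a buffer. */
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH_SIZE; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = MAX_DGRAM;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* MSG_DONTWAIT: take what is queued now instead of blocking. */
    int n = recvmmsg(fd, msgs, BATCH_SIZE, MSG_DONTWAIT, NULL);
    for (int i = 0; i < n; i++)
        lens[i] = msgs[i].msg_len;
    return n;
}
```

The kernel writes the received length of each datagram into msg_len, so the caller never has to make a second syscall to learn message sizes.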
Ultimate performance: XDP (eXpress Data Path):
Process packets at the driver level, before sk_buff allocation. sk_buff allocation is one of the most expensive per-packet operations in the kernel (~100ns each)
Achievable throughput: 10M+ pps on a single core (vs ~1M pps through the normal stack)
Used by Cloudflare (DDoS mitigation), Meta (load balancing with Katran), and Cilium (Kubernetes networking)
Trade-off: you must write your packet processing logic in eBPF, which means no TCP/IP stack, no sockets — you are essentially writing a custom NIC firmware in a safe sandbox
The Question:
“We need sub-millisecond latency for a trading system. How would you configure Linux to minimize jitter?”
Sources of Jitter
Kernel sources:
Timer interrupts (every 1-4ms)
RCU callbacks
Kernel threads (kworker, ksoftirqd)
System call overhead
Hardware sources:
SMI (System Management Interrupt)
Cache pollution
NUMA remote access
Power management (C-states)
Isolation Configuration
Boot parameters (each parameter removes a different source of jitter — know why each one matters):
# /etc/default/grub
# The kernel command line must be one quoted line with no inline
# comments; each parameter is explained below.
GRUB_CMDLINE_LINUX="isolcpus=2,3,4,5 nohz_full=2,3,4,5 rcu_nocbs=2,3,4,5 irqaffinity=0,1 intel_pstate=disable processor.max_cstate=0 idle=poll nosoftlockup nmi_watchdog=0 audit=0 skew_tick=1"

# isolcpus=2,3,4,5         Remove from CFS scheduler's CPU mask --
#                          the scheduler will NEVER place a regular
#                          task on these CPUs unless explicitly pinned
# nohz_full=2,3,4,5        Disable the periodic timer tick (HZ) on
#                          these CPUs when only one task is running.
#                          Without this, the kernel interrupts every
#                          1-4ms for scheduling housekeeping.
# rcu_nocbs=2,3,4,5        Offload RCU (Read-Copy-Update) callbacks
#                          to housekeeping CPUs. RCU is the kernel's
#                          lock-free synchronization mechanism. Its
#                          callbacks can add 10-50us jitter.
# irqaffinity=0,1          Keep ALL hardware IRQs on CPUs 0,1 only.
#                          A NIC interrupt on an isolated CPU would
#                          cause 1-5us of jitter per interrupt.
# intel_pstate=disable     Disable the intel_pstate frequency-scaling
#                          driver. P-state transitions take 10-100us
#                          and cause unpredictable performance variation.
# processor.max_cstate=0   Disable ALL C-states -- CPU never sleeps.
#                          Waking from C6 takes 100-200us. For sub-ms
#                          latency, even C1 (1-2us) is too much.
# idle=poll                When idle, spin in a tight loop instead of
#                          entering any idle state. Burns power but
#                          guarantees zero wake-up latency.
# nosoftlockup             Don't check for soft lockups on isolated
#                          CPUs. The watchdog itself causes jitter.
# nmi_watchdog=0           Disable the NMI watchdog (perf-based hard
#                          lockup detector). NMIs cannot be masked
#                          and cause ~1us interruptions.
# audit=0                  Disable the audit subsystem entirely.
#                          Audit hooks in the syscall path add
#                          measurable overhead per system call.
# skew_tick=1              Offset timer ticks across CPUs so they
#                          don't all fire simultaneously and cause
#                          a burst of cache line bouncing.
CPU affinity:
# Pin critical threads to isolated CPUs
taskset -c 2,3,4,5 ./trading_app

# Move all other processes away
for pid in $(ps -eo pid --no-headers); do
    taskset -p 0x3 "$pid" 2>/dev/null   # CPUs 0,1
done
IRQ affinity:
# Move all IRQs to housekeeping CPUs
for irq in /proc/irq/*/smp_affinity; do
    echo 3 > "$irq" 2>/dev/null   # CPUs 0,1
done
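The 0x3 passed to taskset and the 3 written to smp_affinity are the same thing: a bitmask in which bit N means "CPU N allowed". The sketch below converts a CPU list into that mask. The function name cpu_list_to_mask is hypothetical, and the sketch only handles CPUs 0-63.

```c
#include <stdint.h>
#include <stdlib.h>

/* Convert a CPU list such as "0,1" or "2-5" into an affinity bitmask
 * (bit N set = CPU N allowed). Handles CPUs 0..63 only.
 * Returns 0 on malformed input. */
uint64_t cpu_list_to_mask(const char *list)
{
    uint64_t mask = 0;
    const char *p = list;

    while (*p) {
        char *end;
        unsigned long lo = strtoul(p, &end, 10);
        if (end == p)
            return 0;
        unsigned long hi = lo;

        /* Optional "-hi" turns a single CPU into a range. */
        if (*end == '-') {
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return 0;
        }
        if (lo > hi || hi > 63)
            return 0;

        for (unsigned long cpu = lo; cpu <= hi; cpu++)
            mask |= UINT64_C(1) << cpu;

        if (*end == ',')
            end++;
        else if (*end != '\0')
            return 0;
        p = end;
    }
    return mask;
}
```

Housekeeping CPUs "0,1" give 0x3, and the isolated set "2-5" gives 0x3c. Printing the result with %llx yields the hex form that smp_affinity expects.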
Verification and Testing
Verify isolation:
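# Check no IRQs on isolated CPUs
# (column 1 is the IRQ, column N+2 is CPU N, so CPUs 2-5 are $4-$7)
awk '{print $1, $4, $5, $6, $7}' /proc/interrupts

# Check no kernel threads on isolated CPUs
# (the trailing \s stops psr values such as 20-59 from matching)
ps -eo pid,psr,comm | grep -E "^\s*[0-9]+\s+[2-5]\s"

# Check timer behavior
perf stat -C 2,3,4,5 -e irq_vectors:local_timer_entry sleep 10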
Measure latency:
# Use cyclictest
cyclictest -m -p 99 -i 100 -h 1000 -D 1m -a 2 -t 1

# Interpret results:
# Min: 1 μs (good)
# Avg: 2 μs (good)
# Max: 50 μs (acceptable for many use cases)
# Max: 500 μs (investigate!)
The Question:
“Design a system to collect CPU, memory, and I/O metrics from 10,000 containers on each host with minimal overhead.”
Architecture Overview
METRICS COLLECTION ARCHITECTURE

Method 1: Poll cgroup files
─────────────────────────
For each container:
  - Read /sys/fs/cgroup/<id>/cpu.stat
  - Read /sys/fs/cgroup/<id>/memory.current
  - Read /sys/fs/cgroup/<id>/io.stat

Pros: Simple, no kernel changes
Cons: File system overhead, 30K file reads per collection cycle (10K containers x 3 files)

Method 2: eBPF-based collection
─────────────────────────────
  - Hook scheduler for CPU accounting
  - Hook memory allocator for memory tracking
  - Hook block layer for I/O
  - Aggregate per-cgroup in BPF maps

Pros: Lower overhead, real-time data
Cons: Complex, needs kernel support

Recommended: Hybrid approach
  - Use cgroup files for infrequent metrics (memory limits)
  - Use eBPF for high-frequency metrics (CPU, I/O)
  - Batch reads, use inotify for changes
Optimized Implementation
Batch cgroup file reading:
#include <stdlib.h>
#include <unistd.h>

// Key optimization: open file descriptors ONCE at startup, then
// reuse them on every collection cycle. Each open() in the cgroup
// filesystem traverses the VFS path lookup, allocates an fd, and
// creates a struct file -- about 2-5us per call. For 10K containers
// with 3 files each, that is 30K opens = 60-150ms of pure overhead
// per cycle. Keeping FDs open amortizes this to zero.
struct container_fds {
    int cpu_stat_fd;
    int memory_current_fd;
    int io_stat_fd;
};

// Minimal metric types so the example is self-contained.
// parse_cpu_stat() is assumed to be defined elsewhere.
struct cpu_stat { unsigned long long usage_usec; };
struct container_metrics { struct cpu_stat cpu; long long memory; };
void parse_cpu_stat(const char *buf, struct cpu_stat *out);

void collect_metrics(const struct container_fds *fds,
                     struct container_metrics *metrics, int count) {
    for (int i = 0; i < count; i++) {
        // Use pread with offset 0 instead of lseek + read.
        // pread is atomic and avoids the overhead of maintaining
        // the file position. For cgroup pseudo-files, reading from
        // offset 0 always returns the current value -- the kernel
        // regenerates the content on each read (these are not real
        // files on disk; they are backed by seq_file handlers that
        // call into the cgroup controller code).
        char buf[4096];
        ssize_t n;

        n = pread(fds[i].cpu_stat_fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';  // pread does not NUL-terminate
            parse_cpu_stat(buf, &metrics[i].cpu);
        }

        n = pread(fds[i].memory_current_fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            metrics[i].memory = atoll(buf);
        }
        // io.stat (io_stat_fd) is read and parsed the same way.
    }
}
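The collector above calls parse_cpu_stat without showing it. One possible shape is sketched below. It is hypothetical (the struct name, the chosen fields, and the return convention are all assumptions) and relies only on cgroup v2 cpu.stat being a list of "key value" lines.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct cpu_metrics {
    uint64_t usage_usec;      /* total CPU time consumed */
    uint64_t nr_throttled;    /* periods in which the cgroup was throttled */
    uint64_t throttled_usec;  /* total time spent throttled */
};

/* Parse "key value" lines from cpu.stat contents. Unknown keys are
 * ignored. Returns the number of recognized fields found. */
int parse_cpu_stat(const char *buf, struct cpu_metrics *out)
{
    int found = 0;

    memset(out, 0, sizeof(*out));
    for (const char *line = buf; line && *line; ) {
        char key[32];
        unsigned long long val;

        if (sscanf(line, "%31s %llu", key, &val) == 2) {
            if (strcmp(key, "usage_usec") == 0) {
                out->usage_usec = val;
                found++;
            } else if (strcmp(key, "nr_throttled") == 0) {
                out->nr_throttled = val;
                found++;
            } else if (strcmp(key, "throttled_usec") == 0) {
                out->throttled_usec = val;
                found++;
            }
        }

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return found;
}
```

CPU utilization over an interval is then the difference between two usage_usec samples divided by the elapsed wall-clock microseconds.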
Debugging tip: If your metrics collector shows stale values, verify you are reading from offset 0 on each cycle. A common bug is using read() without lseek(fd, 0, SEEK_SET) — after the first read, the file position is at EOF, and subsequent reads return 0 bytes. pread avoids this entirely.
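The stale-read bug is easy to demonstrate. In this sketch (function names are hypothetical), the read-based sampler returns data once and then 0 bytes forever, while the pread-based sampler returns fresh contents on every call. A regular file shows the same offset behavior as a cgroup pseudo-file for this purpose.

```c
#include <unistd.h>

/* Naive sampler: read() advances the file offset, so the second call on
 * the same fd starts at EOF and returns 0 bytes. */
ssize_t sample_with_read(int fd, char *buf, size_t len)
{
    return read(fd, buf, len);
}

/* Robust sampler: pread() at offset 0 ignores and never moves the fd's
 * offset. NUL-terminates the result so it can be parsed as a string. */
ssize_t sample_with_pread(int fd, char *buf, size_t len)
{
    ssize_t n = pread(fd, buf, len - 1, 0);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}
```

If a collector has to keep using read(), it needs an lseek(fd, 0, SEEK_SET) before every sample, which costs an extra syscall per file per cycle.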
eBPF for CPU tracking:
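#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

// Track CPU time per cgroup
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u64);    // cgroup id
    __type(value, u64);  // total CPU time ns
} cgroup_cpu SEC(".maps");

// Timestamp at which each task was last switched in
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, u32);    // pid (thread id)
    __type(value, u64);  // switch-in time ns
} last_switch SEC(".maps");

SEC("tracepoint/sched/sched_switch")
int trace_switch(struct trace_event_raw_sched_switch *ctx) {
    // At sched_switch the outgoing (prev) task is still "current",
    // so this is the cgroup of the task that just used the CPU.
    u64 cgroup_id = bpf_get_current_cgroup_id();
    u64 now = bpf_ktime_get_ns();

    // Copy the key fields out of ctx: the verifier does not accept a
    // pointer into the tracepoint context as a map key.
    u32 prev_pid = ctx->prev_pid;
    u32 next_pid = ctx->next_pid;

    // Record time for previous task's cgroup
    u64 *last_time = bpf_map_lookup_elem(&last_switch, &prev_pid);
    if (last_time) {
        u64 delta = now - *last_time;
        u64 *total = bpf_map_lookup_elem(&cgroup_cpu, &cgroup_id);
        if (total)
            __sync_fetch_and_add(total, delta);
        else  // first sample for this cgroup: create the entry
            bpf_map_update_elem(&cgroup_cpu, &cgroup_id, &delta, BPF_ANY);
    }

    // Record switch time for next task
    bpf_map_update_elem(&last_switch, &next_pid, &now, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";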
Situation: A container fails to start with “permission denied” but works as root.
Debugging Steps
# Check seccomp profile -- seccomp filters run BEFORE any other
# security check in the syscall path. If seccomp blocks a syscall,
# you get EPERM/ENOSYS but no audit log entry from SELinux/AppArmor.
# This is the most common "invisible" denial.
docker inspect <container> | jq '.[0].HostConfig.SecurityOpt'

# Check AppArmor/SELinux -- these are LSM hooks that run AFTER
# DAC checks. A process can pass Unix permission checks but still
# be denied by mandatory access control policy.
docker inspect <container> | jq '.[0].AppArmorProfile'
getenforce   # SELinux status: Enforcing = blocking, Permissive = logging only

# Check capabilities -- the kernel checks capabilities in capable()
# which is called from hundreds of places. A missing CAP_SYS_ADMIN
# or CAP_NET_RAW is often the culprit for "works as root, fails as
# non-root" problems.
docker inspect <container> | jq '.[0].HostConfig.CapAdd'
docker inspect <container> | jq '.[0].HostConfig.CapDrop'

# Trace syscall failures -- strace intercepts via ptrace, so it sees
# the error return value from the kernel. Look for EPERM (operation
# not permitted) and EACCES (permission denied) -- they come from
# different security layers (capabilities vs. DAC vs. LSM).
# Caveat: this traces the docker CLI only. The container's processes
# are started by the container runtime, not by the CLI, so to see
# their syscalls attach to the container's PID from the host instead
# (strace -f -p <pid>).
strace -f docker run --rm myimage 2>&1 | grep -i "denied\|EPERM\|EACCES"

# Check audit logs -- SELinux and AppArmor write AVC denial records
# to the audit subsystem. This is your definitive source of truth
# for MAC denials.
ausearch -m avc -ts recent
Debugging tip: When you see “permission denied” in a container, check security layers in this order: (1) seccomp (check dmesg | grep seccomp), (2) capabilities (check grep Cap /proc/<pid>/status), (3) DAC/Unix permissions, (4) LSM (audit logs). This matches the kernel’s evaluation order and avoids chasing the wrong layer.
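For step (2), the Cap lines in /proc/<pid>/status are hex bitmasks, and reading them by eye is error-prone. A minimal decoding sketch follows. The function names are hypothetical; the bit numbers 13 (CAP_NET_RAW) and 21 (CAP_SYS_ADMIN) are the values defined in linux/capability.h.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Capability numbers as defined in <linux/capability.h>. */
#define CAP_NET_RAW_BIT   13
#define CAP_SYS_ADMIN_BIT 21

/* Find the "CapEff:" line in the contents of /proc/<pid>/status and
 * parse its hex mask. Returns 0 on success, -1 if absent or malformed. */
int parse_cap_eff(const char *status, uint64_t *mask)
{
    const char *p = strstr(status, "CapEff:");
    unsigned long long v;

    if (!p || sscanf(p + strlen("CapEff:"), "%llx", &v) != 1)
        return -1;
    *mask = v;
    return 0;
}

/* True if the given capability bit is set in the effective mask. */
bool has_cap(uint64_t mask, int cap_bit)
{
    return (mask >> cap_bit) & 1;
}
```

For example, a CapEff of 00000000a80425fb has bit 13 set and bit 21 clear: the process can open raw sockets but will get EPERM from anything that requires CAP_SYS_ADMIN, such as mount().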
Common Causes
Seccomp blocking syscall:
Solution: Add syscall to profile or use --security-opt seccomp=unconfined
Missing capability:
Solution: --cap-add=SYS_ADMIN (or specific capability)
SELinux/AppArmor denial:
Solution: Check audit logs, update policy
User namespace UID mapping:
Solution: Check /etc/subuid, /etc/subgid
Read-only filesystem:
Solution: Keep --read-only but add tmpfs mounts (--tmpfs) for the paths the application must write to
Situation: Container shows 2GB used, but application reports only 500MB heap.

This is one of the most common debugging scenarios in container platforms, and the answer lies in understanding what the kernel counts as “memory used by this cgroup” versus what an application considers “my memory.”
Memory Accounting Deep Dive
# Get detailed memory breakdown -- memory.stat is the kernel's
# authoritative accounting of every byte charged to this cgroup.
# The kernel maintains these counters in struct mem_cgroup, updating
# them on every page charge/uncharge event.
cat /sys/fs/cgroup/<path>/memory.stat

# Key fields (these map directly to kernel page types):
# anon   - Anonymous memory (heap via brk/mmap, stack, mmap(MAP_ANONYMOUS))
# file   - Page cache (file-backed pages read into memory)
# kernel - Kernel memory charged to this cgroup (since cgroup v2)
# slab   - Slab allocations (dentry cache, inode cache, socket buffers)

# Check for file cache -- this is the #1 cause of "missing" memory.
# The kernel caches every file read into page cache. Your app reads
# a 1GB log file once? That 1GB stays in page cache (charged to your
# cgroup) until memory pressure forces reclaim.
grep -E "^(anon|file|kernel)" /sys/fs/cgroup/<path>/memory.stat

# Inside container -- /proc/meminfo shows the SYSTEM-WIDE view, not
# the cgroup view. This is a common source of confusion. The numbers
# here do not match memory.stat because /proc/meminfo is not cgroup-aware.
grep -E "Cached|Buffers|Slab" /proc/meminfo

# Check for memory mapped files -- each line in /proc/<pid>/maps
# is a VMA (Virtual Memory Area). File-backed VMAs (those with a
# non-zero device and inode) contribute to file RSS.
grep -v "00000000 00:00" /proc/<pid>/maps | wc -l
Common causes of the discrepancy (in order of likelihood):
Page cache (accounts for 60-80% of these cases):
Files read by the application are cached in memory by the kernel
Shows in memory.current but NOT in the application’s heap metrics
The good news: these pages are reclaimable — the kernel will evict them under pressure
The bad news: they still count toward memory.max and can trigger OOM if the limit is tight
Memory-mapped files (shared libraries, data files):
Every .so loaded by the application is mmap’d into the process address space
Only the resident pages (those actually touched) consume physical memory
A 200MB library might only have 20MB resident — but those 20MB are charged
Slab memory (kernel allocations on behalf of this cgroup):
A container running a web server with 10K connections could have 50-100MB of kernel slab memory
This is completely invisible to the application but fully charged to the cgroup
Shared memory / tmpfs:
/dev/shm usage, POSIX shared memory, and tmpfs mounts are charged as shmem
Many databases and ML frameworks allocate large shared memory segments
Charged once, to the cgroup that first touched the pages, even if multiple processes access them
A senior engineer would say: “When someone tells me their container is using more memory than expected, I check memory.stat first, not the application metrics. The kernel’s accounting is authoritative. The application only knows about its own heap — it has no visibility into page cache, slab, or shared memory charged to its cgroup.”
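The comparison described above can be sketched as a small shell helper. This is a minimal sketch, assuming a cgroup v2 directory that contains memory.current and memory.stat; the function name memory_breakdown is hypothetical, and fields that an older kernel does not export are simply not printed.

```sh
# memory_breakdown [CGROUP_DIR]
# Prints the main memory.stat categories next to memory.current, in MiB.
# CGROUP_DIR defaults to /sys/fs/cgroup, which is the container's own
# cgroup when run inside a container with a cgroup namespace.
memory_breakdown() (
    dir=${1:-/sys/fs/cgroup}
    current=$(cat "$dir/memory.current") || exit 1
    awk -v current="$current" '
        BEGIN { mib = 1024 * 1024 }
        # memory.stat lines have the form "<field> <bytes>".
        $1 == "anon" || $1 == "file" || $1 == "kernel" || $1 == "slab" || $1 == "shmem" {
            printf "%-8s %8.1f MiB\n", $1, $2 / mib
        }
        END { printf "%-8s %8.1f MiB\n", "current", current / mib }
    ' "$dir/memory.stat"
)
```

Calling `memory_breakdown /sys/fs/cgroup/<path>` prints one line per category. If anon is close to what the application reports for its heap and file makes up most of the remainder, page cache is the likely explanation for the gap.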
# Process investigation
ps aux --forest              # Process tree -- shows parent/child hierarchy
cat /proc/<pid>/status       # Detailed status -- caps, threads, memory, state
cat /proc/<pid>/maps         # Memory mappings -- every VMA (heap, stack, mmap, libs)
ls -la /proc/<pid>/fd/       # Open files -- symlinks to actual files/sockets/pipes
cat /proc/<pid>/stack        # Kernel stack -- where the process is stuck in the kernel

# Memory investigation
free -h                      # Memory overview -- total, used, free, available
cat /proc/meminfo            # Detailed memory stats -- 40+ fields from the kernel
vmstat 1 5                   # Virtual memory stats -- si/so (swap in/out) is key
slabtop                      # Slab allocations -- kernel object caches (dentry, inode)
cat /proc/buddyinfo          # Fragmentation -- free pages by order (0=4KB to 10=4MB)

# CPU investigation
top -H                       # Per-thread CPU -- find the hot thread, not just the process
mpstat -P ALL 1              # Per-CPU stats -- look for one core at 100% (single-threaded bottleneck)
perf top                     # Live profiling -- which functions are consuming CPU right now
pidstat 1                    # Per-process stats -- CPU, memory, I/O per process per second

# Disk I/O
iostat -xz 1                 # Disk stats -- await (avg I/O latency) is the key metric
iotop                        # Per-process I/O -- who is writing? who is reading?
cat /proc/<pid>/io           # Process I/O -- cumulative bytes read/written by this process

# Network
ss -tlnp                     # Listening sockets -- what is bound and on which port
ss -anp                      # All connections -- state, queues, backlog
cat /proc/net/dev            # Interface stats -- drops, errors, overruns per NIC
nstat                        # Network statistics -- TCP retransmits, resets, timeouts

# Container/cgroup
cat /sys/fs/cgroup/<path>/memory.current   # Bytes used by this cgroup right now
cat /sys/fs/cgroup/<path>/cpu.stat         # CPU usage, throttling events, nr_periods
cat /sys/fs/cgroup/<path>/io.stat          # Per-device read/write bytes and I/O counts

# Tracing
strace -p <pid>              # Syscall tracing -- HIGH overhead, use for quick diagnosis only
ltrace -p <pid>              # Library tracing -- libc calls (malloc, free, etc.)
perf trace -p <pid>          # Fast syscall tracing -- far less overhead than strace
bpftrace -e '...'            # eBPF tracing -- production-safe, custom analysis
Explain your reasoning as you work through problems. Interviewers want to see your thought process. Say “I am starting at the application layer and working down because…” rather than jumping to a tool.
Start Simple
Begin with the simplest approach, then discuss trade-offs and optimizations. “The naive approach is X, which works but has O(n) overhead. Here is how we improve it…”
Know the Stack
Be ready to go from application to syscall to kernel to hardware. A senior engineer would say: “The read() syscall enters the kernel via entry_SYSCALL_64, dispatches to ksys_read, which calls vfs_read, which calls the filesystem’s read handler, which may block on disk I/O…”
Practice Debugging
Work through real debugging scenarios on actual Linux machines. The muscle memory of knowing which tool to reach for and how to interpret its output is what separates candidates who “know about” Linux from those who “work with” Linux.
A pattern that impresses interviewers: When asked to debug something, state your hypothesis before you run the command. “I suspect this is a scheduling issue because the spikes are periodic, so I will check perf sched latency first.” This shows you are reasoning, not just running commands from a list.
A production service is leaking file descriptors. How do you find the leak and what kernel mechanisms are involved?
What the interviewer is testing: Whether you understand the VFS layer’s file descriptor table, how the kernel tracks open files per process, and your ability to debug a live production issue without restarting the service.

Strong answer framework:

Start by understanding the kernel’s fd tracking. Every process has a struct files_struct (pointed to by task->files) that contains an fd table: an array of struct file * pointers. Each open() allocates the lowest available fd and creates a struct file backed by a dentry/inode pair in the VFS. If close() is never called, the struct file is never freed, the dentry reference count stays elevated, and the inode stays pinned in memory.

Diagnosis steps:
# 1. Confirm the leak exists and measure its rate.
# /proc/<pid>/fd is a directory with a symlink for each open fd.
# Count them over time to establish the leak rate.
watch -n 5 'ls /proc/<pid>/fd | wc -l'

# 2. Identify WHAT is being leaked (files? sockets? pipes?)
# Each symlink points to the actual resource. The sed step collapses the
# unique inode number in "socket:[12345]" so that all sockets group together.
ls -la /proc/<pid>/fd | awk '{print $NF}' | sed -E 's/\[[0-9]+\]/[N]/' | sort | uniq -c | sort -rn | head -20
# If you see thousands of "socket:[N]" entries, it is socket leaks.
# If you see thousands of "/tmp/some-pattern", it is file leaks.

# 3. Find the code path that opens without closing -- use bpftrace to
# count user-space stack traces at every openat() in the target process.
# The stacks whose counts grow fastest are the leak candidates.
# Replace TARGET with the numeric PID.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == TARGET/ { @open_stacks[ustack] = count(); }'
What impresses interviewers: Mention that the per-process fd limit (ulimit -n, backed by RLIMIT_NOFILE) defaults to a soft limit of 1024 on many systems. When you hit it, every open(), socket(), accept(), and pipe() fails with EMFILE. This cascading failure is how a file descriptor leak in one component can crash an entire service: the process cannot open any new files, including log files to report the error.

Common mistake: Candidates suggest “just restart the process.” In production, restarting a stateful service (database, queue) can cause data loss or require lengthy recovery. The right answer is to identify the leak, apply a mitigation (raise the fd limit temporarily), then deploy a fix.
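Step 2 of the diagnosis can be wrapped in a reusable helper. The following is a minimal sketch, not a definitive tool: the function name fd_summary is hypothetical, and it assumes readlink and a sed that supports -E are available on the host.

```sh
# fd_summary FD_DIR
# Counts open descriptors by target. The unique inode numbers in entries
# such as "socket:[12345]" are collapsed to "[N]" so that all sockets
# (or all pipes) land in one bucket.
fd_summary() (
    dir=${1:?usage: fd_summary /proc/PID/fd}
    for link in "$dir"/*; do
        # Each entry is a symlink to the underlying file, socket, or pipe.
        [ -L "$link" ] || continue
        readlink "$link"
    done | sed -E 's/\[[0-9]+\]/[N]/' | sort | uniq -c | sort -rn
)
```

Running `fd_summary /proc/<pid>/fd` twice a few minutes apart and comparing the counts shows which bucket is growing, which narrows the search before any tracing is attached.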
Explain what happens in the kernel when a process calls fork() followed by exec(). Why is copy-on-write important here?
What the interviewer is testing: Deep understanding of process creation mechanics, virtual memory management, and the COW optimization that makes Unix process creation practical.

Strong answer framework:

When fork() is called, the kernel’s do_fork() (or kernel_clone() in modern kernels) creates a new task_struct, the kernel’s representation of a process. But it does NOT copy the parent’s physical memory. Instead, it duplicates the page table entries and marks every writable page as read-only in both the parent and the child. This is copy-on-write (COW).

When either process later tries to write to a COW page, the CPU raises a page fault (because the PTE is marked read-only). The kernel’s page fault handler (do_wp_page()) detects the COW condition, allocates a new physical page, copies the content, and updates the faulting process’s page table to point to the new page with write permission. The other process still points to the original page.

When exec() is called, the kernel’s do_execve() completely replaces the process’s address space. It:
Releases all existing VMAs (which drops the refcounts on all those COW pages)
Parses the ELF binary header to determine the program’s segments
Maps the text segment (code) as read-only + executable
Maps the data segment as read-write
Sets up a new stack, copies argv and envp onto it
Points the instruction pointer to the ELF entry point
Why COW matters: Without COW, fork() would copy all of the parent’s memory. A 2GB process forking would require 2GB of allocation and copying before the child can even call exec(), which immediately discards all of it. COW reduces fork() to duplicating page tables, which is far cheaper than copying the data (although the cost still grows with the size of the address space), and defers the actual copy to only the pages that are written.

What impresses interviewers: Mention that the MAP_PRIVATE flag on mmap’d files uses the same COW mechanism. Also mention that this is why vfork() exists: it is an even cheaper fork that shares the parent’s address space entirely (no page table copy), but the child MUST call exec() or _exit() immediately. posix_spawn() is the modern alternative; implementations such as glibc build it on a vfork-style clone(), so it avoids the page table copy as well.
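COW itself is not observable from user space, but the semantics it preserves are. The sketch below assumes a shell such as bash or dash, where command substitution and `( ... )` subshells are created with fork(); the variable names are illustrative only.

```sh
# After fork(), parent and child see the same values, but a write in the
# child lands on the child's private copy of the page.
counter=1
child_view=$(
    counter=2            # write in the child: only the child's copy changes
    echo "$counter"
)
echo "child saw:  $child_view"
echo "parent has: $counter"

# fork() followed by exec(): the subshell is the forked child, and exec
# replaces its whole address space with the new program image.
( exec echo "replaced image" )
```

The parent still holds its original value after the child exits, which is the isolation that COW has to preserve while skipping the up-front copy.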
You notice a Kubernetes pod is being throttled despite showing low average CPU usage. What is happening and how do you fix it?
What the interviewer is testing: Understanding of CFS bandwidth control, the relationship between CPU limits and cgroup throttling, and why averages lie in systems with bursty workloads.

Strong answer framework:

The CFS (Completely Fair Scheduler) bandwidth controller works on a period basis, not an average basis. When you set a Kubernetes CPU limit of 500m (half a core), the kernel translates this to cpu.max = "50000 100000", meaning the cgroup gets 50,000 microseconds of CPU time per 100,000-microsecond period.

Here is where averages are deceptive. Quota is consumed by every running thread in the cgroup at once, so a bursty multi-threaded application can exhaust it long before the period ends. Suppose the application is idle most of the time, but once a second a request fans out to 4 threads that each need 20ms of CPU (80ms of CPU time in total). Running in parallel, the threads use up the 50ms quota after 12.5ms of wall-clock time, the cgroup is throttled for the remaining 87.5ms of that period, and the last 30ms of work only runs in the next period. Average CPU usage over a minute shows about 8% of a core, yet a request that should take 20ms takes more than 100ms.

How to detect it:
# Check cgroup CPU stats -- nr_throttled and throttled_usec are the key
cat /sys/fs/cgroup/cpu.stat
# usage_usec 12345678
# nr_periods 5000         <- total scheduling periods
# nr_throttled 2500       <- periods where throttling occurred (50%!)
# throttled_usec 8000000  <- total time spent throttled (8 seconds)
How to fix it:
Increase the CPU limit — the most straightforward fix, but increases cost
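Set requests = limits with an integer CPU count (Guaranteed QoS class): when the kubelet runs the static CPU manager policy, the pod gets dedicated cores via cpuset pinning, which removes contention with other pods. Whether the CFS quota is still enforced on those pinned cores depends on the Kubernetes version and kubelet configuration, so verify with cpu.stat after the change.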
Increase only the period — some Kubernetes distributions expose --cpu-cfs-quota-period. A longer period (e.g., 200ms) allows longer bursts before throttling
Remove the limit entirely (set requests only) — controversial, but it means the pod runs in Burstable QoS and is never throttled. The risk is noisy-neighbor effects on shared nodes.
What impresses interviewers: Mention that nr_throttled / nr_periods is the metric to watch, not average CPU usage. A ratio above 5% indicates that the application is being actively harmed by its CPU limit. Also mention that in multi-threaded applications, a limit of 1000m (1 core) does NOT mean “use one core” — it means “use 100ms of CPU time per 100ms period, spread across ANY number of cores.” A 4-thread application with a 1-core limit will burn through its quota in 25ms and be throttled for 75ms.
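Both numbers in the paragraph above are easy to compute. This is a minimal sketch with two hypothetical helpers: throttle_pct reads a cgroup v2 cpu.stat file, and quota_exhausted_ms does the quota arithmetic under the assumption that all threads run flat out on separate cores.

```sh
# throttle_pct [CPU_STAT_FILE]
# Prints nr_throttled / nr_periods as a percentage with one decimal place.
throttle_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else             print "0.0"   # no limit set, or no period elapsed yet
        }
    ' "${1:-/sys/fs/cgroup/cpu.stat}"
}

# quota_exhausted_ms QUOTA_MS THREADS
# Wall-clock milliseconds until QUOTA_MS of CPU time is used up when
# THREADS threads run in parallel without ever blocking.
quota_exhausted_ms() {
    awk -v q="$1" -v t="$2" 'BEGIN { printf "%.1f\n", q / t }'
}
```

With the sample cpu.stat shown earlier, throttle_pct reports 50.0, and `quota_exhausted_ms 100 4` reproduces the 25ms figure for a 4-thread application under a 1-core limit.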
How does the kernel's OOM killer decide which process to kill, and how can you influence its decision in a container environment?
What the interviewer is testing: Knowledge of the OOM scoring algorithm, cgroup-scoped OOM behavior, and practical experience with tuning OOM behavior in production.

Strong answer framework:

The OOM killer is invoked when the kernel cannot free enough memory to satisfy an allocation. The core function is out_of_memory() in mm/oom_kill.c. In a cgroup context, the OOM killer is scoped to the cgroup that exceeded its limit, so it will not kill processes outside the cgroup.

The kernel scores each process using oom_badness(), which calculates a raw score from the memory the process would free if killed: its resident pages (RSS), plus its swap entries, plus the pages used by its page tables.
This raw score is then adjusted by oom_score_adj (a value from -1000 to +1000, set via /proc/<pid>/oom_score_adj):
oom_score_adj = -1000: Never kill this process (OOM-immune)
oom_score_adj = 0: Default scoring
oom_score_adj = +1000: Always kill this process first
The final score is visible in /proc/<pid>/oom_score (range 0-1000). The process with the highest score is killed.

In container environments:
Kubernetes sets oom_score_adj based on QoS class: Guaranteed = -997, Burstable = 2-999, BestEffort = 1000
This means BestEffort pods are killed first, then Burstable, and Guaranteed pods are killed last
You can also set memory.oom.group = 1 in cgroup v2 to kill ALL processes in the cgroup together (useful for ensuring a clean restart rather than partial kills that leave the container in a broken state)
What impresses interviewers: Mention that the OOM killer is a last resort. Before it fires, the kernel has already tried: (1) reclaiming page cache, (2) reclaiming slab caches, (3) writing dirty pages to disk, (4) swapping anonymous pages. Only when all of these fail does the OOM killer activate. Also mention memory.oom.group in cgroup v2 and why it matters for multi-process containers like those running an init system or a sidecar pattern.
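To see which processes the OOM killer would currently prefer, the per-process scores can be ranked directly from procfs. This is a minimal sketch; the function name oom_rank and its optional root argument are hypothetical conveniences, not part of any existing tool.

```sh
# oom_rank [PROC_ROOT] [COUNT]
# Lists the COUNT processes with the highest oom_score, one per line,
# as "score oom_score_adj pid".
oom_rank() (
    root=${1:-/proc}
    count=${2:-10}
    for d in "$root"/[0-9]*; do
        # Processes can exit while we iterate, so skip unreadable entries.
        score=$(cat "$d/oom_score" 2>/dev/null) || continue
        adj=$(cat "$d/oom_score_adj" 2>/dev/null) || continue
        echo "$score $adj ${d##*/}"
    done | sort -rn | head -n "$count"
)
```

Run inside a container, `oom_rank` only sees the processes in that PID namespace, which matches the cgroup-scoped view of who would be killed first. A process with a high score and an oom_score_adj of 0 is large on its own merits; one with a high score and a large positive adjustment was marked as expendable.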
Walk me through what happens in the kernel between a user calling write() on a file and the data being on disk.
What the interviewer is testing: End-to-end understanding of the VFS layer, page cache, block I/O layer, and the difference between write() returning and data being durable.

Strong answer framework:

The path from write() to disk traverses four major kernel subsystems:

1. VFS layer (ksys_write -> vfs_write): The kernel resolves the fd to a struct file, which points to the filesystem’s file_operations struct. It calls the filesystem’s .write_iter handler (e.g., ext4_file_write_iter).

2. Page cache (generic_perform_write): The filesystem writes data into page cache pages. It finds (or allocates) the page corresponding to the file offset, copies the user’s data into the page, and marks the page as dirty. At this point, write() returns to user space. The data is in memory but NOT on disk.

3. Writeback (writeback_single_inode): The kernel’s flusher threads (or the sync syscall) periodically walk the list of dirty inodes and submit their dirty pages to the block layer. By default the flusher threads wake up every 5 seconds (dirty_writeback_centisecs = 500) and write back pages that have been dirty for more than 30 seconds (dirty_expire_centisecs = 3000). Writeback starts earlier when dirty pages exceed dirty_background_ratio, and once they exceed dirty_ratio the writing process itself is throttled until pages have been written back.

4. Block I/O layer (submit_bio -> device driver -> disk): The block layer converts page writes into block I/O requests (struct bio), merges adjacent requests (elevator/IO scheduler), and submits them to the device driver. The driver programs the hardware (DMA for NVMe, SCSI commands for SAS) and the data is written to the physical medium.

Critical insight: write() returning success does NOT mean the data is on disk. It means the data is in page cache. If the machine loses power before writeback completes, the data is lost. This is why databases call fsync(), which forces the dirty pages for that file through the entire path to the disk’s persistent storage. fsync() does not return until the drive’s write cache has been flushed.

What impresses interviewers: Mention O_DIRECT as the bypass for page cache (used by databases that manage their own caching), O_DSYNC for synchronous data writes (like write + fdatasync on every call), and that NVMe drives with power-loss protection can safely report fsync completion before data hits NAND because the drive’s capacitors can flush the write cache during power loss. Also mention that io_uring changes the game by allowing the kernel to batch and submit these operations asynchronously without per-syscall overhead.
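The gap between write() returning and data being durable can be closed from a shell script as well. The following is a minimal sketch, assuming a dd that supports conv=fsync (GNU coreutils and BusyBox do); the function name durable_write and the temporary-file naming are hypothetical.

```sh
# durable_write SRC DST
# Copies SRC to DST and does not return until the data has been fsync()ed.
# Writing to a temporary name and renaming keeps readers from ever seeing
# a half-written DST.
# Limitation: the containing directory is not fsync()ed here, so after a
# crash the rename itself may not have been persisted.
durable_write() (
    src=$1
    dst=$2
    tmp="$dst.tmp.$$"
    # conv=fsync makes dd call fsync() on the output file before it exits,
    # pushing the dirty pages through writeback and the block layer.
    dd if="$src" of="$tmp" conv=fsync 2>/dev/null || exit 1
    mv "$tmp" "$dst"
)
```

A plain `cp` followed by a power loss can leave an empty or truncated file, because the copy only reached page cache. `durable_write config.new config` trades some latency for the guarantee that the file contents reached the storage device before the rename happened.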