
Real Interview Questions

This module contains actual interview questions from infrastructure, observability, and platform engineering roles at top companies. Each question includes detailed solutions and the key insights interviewers are looking for.
What to Expect: These questions require deep understanding, not memorization
Interview Format: Usually 45-60 minute deep dives into 2-3 topics
Time to Prepare: 10-12 hours to work through all scenarios

Company-Specific Patterns

Different companies emphasize different areas:
Company       | Focus Areas                         | Style
Datadog       | eBPF, tracing, metrics collection   | Hands-on implementation
Grafana Labs  | Observability stack, performance    | System design + coding
Cloudflare    | Network stack, performance          | Deep Linux networking
Chronosphere  | Time series, observability          | Architecture + internals
Meta Infra    | Large-scale systems                 | Design + debugging
Netflix       | Performance, containers             | Deep dives + scenarios

Observability Company Questions

Question 1: Implement a Syscall Counter (Datadog-style)

Context: You’re asked to implement a tool that counts syscalls by process in production without significant overhead.
The Question: “Design and implement a production-safe syscall counter. It should show the top processes by syscall count in real-time. Discuss the trade-offs of different approaches.”
Interviewer is looking for:
  1. Knowledge of different approaches (strace, perf, eBPF)
  2. Understanding of overhead implications
  3. Production safety considerations
  4. Sampling vs complete counting trade-offs
Key trade-offs to discuss:
  • strace: Per-syscall ptrace, very high overhead (~100x slowdown)
  • perf: Sampling-based, lower overhead, may miss syscalls
  • eBPF tracepoints: Low overhead (~1-5%), production-safe
  • eBPF kprobes: Slightly higher overhead, more flexible
Best approach: eBPF tracepoint
// syscall_counter.bpf.c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);  // PID
    __type(value, u64);  // count
} syscall_count SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);  // PID
    __type(value, char[16]);  // comm
} pid_comm SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int tracepoint__raw_syscalls__sys_enter(struct trace_event_raw_sys_enter *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *count = bpf_map_lookup_elem(&syscall_count, &pid);
    
    if (count) {
        (*count)++;
    } else {
        u64 initial = 1;
        bpf_map_update_elem(&syscall_count, &pid, &initial, BPF_ANY);
        
        // Store comm for this PID
        char comm[16];
        bpf_get_current_comm(&comm, sizeof(comm));
        bpf_map_update_elem(&pid_comm, &pid, &comm, BPF_ANY);
    }
    
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
User-space component (a minimal sketch follows after this list):
  • Periodically read map, sort by count
  • Display top N processes
  • Handle PID reuse (track start time)
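One way to write that reader is with libbpf's map helpers. The sketch below assumes the two maps have been pinned to bpffs at hypothetical paths (with a generated skeleton you would call bpf_map__fd() instead); sorting and the refresh loop are left out for brevity.
// top_syscalls.c -- sketch only; error handling and sorting trimmed
#include <bpf/bpf.h>
#include <stdint.h>
#include <stdio.h>

#define COUNT_MAP_PIN "/sys/fs/bpf/syscall_count"  /* hypothetical pin path */
#define COMM_MAP_PIN  "/sys/fs/bpf/pid_comm"       /* hypothetical pin path */

int main(void)
{
    int count_fd = bpf_obj_get(COUNT_MAP_PIN);
    int comm_fd  = bpf_obj_get(COMM_MAP_PIN);
    if (count_fd < 0 || comm_fd < 0)
        return 1;

    uint32_t key, next_key;
    void *prev = NULL;

    // Walk every PID in the map; a real tool would collect entries into an
    // array, sort by count, print the top N, and repeat on an interval.
    while (bpf_map_get_next_key(count_fd, prev, &next_key) == 0) {
        uint64_t count = 0;
        char comm[16] = "?";
        bpf_map_lookup_elem(count_fd, &next_key, &count);
        bpf_map_lookup_elem(comm_fd, &next_key, comm);
        printf("%-16s pid=%u syscalls=%llu\n", comm, next_key,
               (unsigned long long)count);
        key = next_key;
        prev = &key;
    }
    // PID reuse can be handled by also recording the process start time
    // (e.g. from /proc/<pid>/stat) and discarding entries that don't match.
    return 0;
}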
What makes this production-safe:
  1. Bounded map size: Won’t consume unlimited memory
  2. No locks in hot path: Per-CPU increments would be ideal
  3. Graceful degradation: If map full, just skip new PIDs
  4. Low overhead: Tracepoint, not ptrace
Improvements for production:
// Use a per-CPU hash for counting (no cross-CPU contention)
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);
    __type(value, u64);
} syscall_count SEC(".maps");
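One consequence worth mentioning in the interview: for per-CPU maps, a user-space lookup returns one value per possible CPU, and the reader must sum them. A small sketch (helper name hypothetical):
#include <bpf/bpf.h>
#include <bpf/libbpf.h>   /* libbpf_num_possible_cpus() */
#include <stdint.h>

// Sum the per-CPU counters for one PID (hypothetical helper)
static uint64_t read_percpu_count(int map_fd, uint32_t pid)
{
    int ncpus = libbpf_num_possible_cpus();
    uint64_t values[ncpus];
    uint64_t total = 0;

    if (bpf_map_lookup_elem(map_fd, &pid, values) == 0)
        for (int i = 0; i < ncpus; i++)
            total += values[i];
    return total;
}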
Overhead estimation:
  • ~50-100ns per syscall
  • On busy system (100K syscalls/sec): ~1% CPU
  • Acceptable for production monitoring

Question 2: Debug High Latency in Production

Context: A service is experiencing intermittent latency spikes. You need to identify the cause without restarting the service.
The Question: “A production service has p99 latency spikes from 10ms to 500ms every few minutes. How would you debug this? Walk me through your approach.”
Systematic approach:
  1. Characterize the problem:
    • When do spikes occur? (Time correlation)
    • Which requests are affected? (Endpoint, payload)
    • Duration of spikes? (Seconds, minutes)
  2. Gather baseline metrics:
    • CPU utilization (is there contention?)
    • Memory usage (swapping? GC?)
    • Disk I/O (latency, throughput)
    • Network (retransmits, latency)
  3. Narrow down the layer:
    • Application code?
    • Runtime (GC pauses)?
    • Kernel (scheduling, I/O)?
    • Hardware (disk, network)?
Quick triage:
# Check for scheduling issues (record first, then report)
perf sched record -p <pid> -- sleep 5
perf sched latency

# Check for off-CPU time
sudo offcputime-bpfcc -p <pid> 5

# Check for I/O latency
sudo biolatency-bpfcc 5

# Check for memory pressure
cat /proc/<pid>/status | grep -E "VmRSS|VmSwap"
sar -B 1 5  # Page faults

# Check for lock contention
sudo perf lock record -p <pid> -- sleep 5
sudo perf lock report
Deep analysis with bpftrace:
# Trace slow syscalls (pass the target PID as the first positional parameter)
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter /pid == $1/ {
    @start[tid] = nsecs;
}
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000000;
    if ($lat > 10) {
        printf("%s syscall %d took %d ms\n", comm, args->id, $lat);
        @slow[args->id] = count();
    }
    delete(@start[tid]);
}' <pid>
Check for GC pauses (if JVM/Go):
# Java
jstat -gc <pid> 1000

# Go - trace runtime events
GODEBUG=gctrace=1 ./myapp
Likely causes of intermittent spikes:
  1. Garbage Collection:
    • Symptom: Regular, predictable spikes
    • Detection: GC logs show long pauses
    • Fix: Tune GC, reduce allocation rate
  2. Disk I/O:
    • Symptom: Correlates with writes/fsync
    • Detection: biolatency shows spikes
    • Fix: Async I/O, better storage
  3. Memory Pressure:
    • Symptom: During memory spikes
    • Detection: sar -B shows page faults
    • Fix: Increase memory, reduce footprint
  4. CPU Throttling (containers):
    • Symptom: Regular, consistent spikes
    • Detection: nr_throttled and throttled_usec rising in the container's cpu.stat
    • Fix: Increase CPU limits
  5. Network Issues:
    • Symptom: Affects network calls
    • Detection: tcpretrans, ss -ti
    • Fix: Check network path, timeouts

Question 3: Container Memory Behavior

The Question: “Explain what happens when a container hits its memory limit. What are the different behaviors, and how would you debug an OOM-killed container?”
When container approaches memory limit:
Usage < memory.high
└── Normal operation

Usage > memory.high
└── Throttling begins
└── Reclaim pressure increases
└── Application may slow down

Usage > memory.max
└── Reclaim forced; allocations stall
└── If reclaim cannot free enough, OOM killer invoked
└── A process in the cgroup is killed
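A quick way to observe this end to end (a reproduction sketch, not part of the question): run a small allocator inside a cgroup with memory.max set, then watch dmesg and the cgroup's memory.events when it is killed.
// oom_demo.c -- sketch: allocate and touch memory until the cgroup limit hits
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t chunk = 64UL * 1024 * 1024;   // 64 MiB per step
    size_t total = 0;

    for (;;) {
        char *p = malloc(chunk);
        if (!p) {
            puts("malloc failed");       // allocation failure, not yet OOM-killed
            break;
        }
        memset(p, 1, chunk);             // touch the pages so they are actually charged
        total += chunk;
        printf("allocated %zu MiB\n", total >> 20);
    }
    return 0;
}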
Cgroup v2 memory controls:
  • memory.current: Current usage
  • memory.max: Hard limit (OOM if exceeded)
  • memory.high: Soft limit (throttling)
  • memory.low: Best-effort protection
  • memory.min: Hard protection
Immediate diagnostics:
# Check kernel logs
dmesg | grep -i "oom\|killed"

# Check container events
docker events --filter 'event=oom'

# Get OOM details from journald
journalctl -k | grep -A 10 "invoked oom-killer"
Understanding OOM output:
Memory cgroup out of memory: Killed process 12345 (myapp)
total-pgfault:15000 total-pgmajfault:50
anon-rss:1048576kB file-rss:0kB shmem-rss:0kB

Key indicators:
- anon-rss: Heap, stack, anonymous mmap
- file-rss: Mapped files (can be evicted)
- shmem-rss: Shared memory
Memory profiling:
# Track allocations over time
docker stats <container>

# Get detailed memory breakdown
cat /sys/fs/cgroup/<path>/memory.stat

# Profile with BPF
sudo memleak-bpfcc -p <pid>
Proper memory sizing:
  1. Profile application under realistic load
  2. Account for peak usage, not average
  3. Include headroom for GC, file cache
Kubernetes recommendations:
resources:
  requests:
    memory: "256Mi"  # For scheduling
  limits:
    memory: "512Mi"  # Hard limit (2x requests typical)
Application-level protections:
  • Set JVM heap < container limit (-Xmx)
  • Use memory-aware allocators
  • Implement backpressure mechanisms

Infrastructure Company Questions

Question 4: Network Stack Performance (Cloudflare-style)

The Question: “Explain the journey of a packet from the NIC to the application. Where are the performance bottlenecks, and how would you optimize for high packet rates?”
┌─────────────────────────────────────────────────────────────────────┐
│                     PACKET RECEIVE PATH                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  1. NIC receives packet                                              │
│     └─→ DMA to ring buffer in memory                                │
│     └─→ Raise interrupt                                             │
│                                                                      │
│  2. Interrupt handler (hardirq)                                      │
│     └─→ Acknowledge interrupt                                       │
│     └─→ Schedule NAPI poll (softirq)                                │
│     └─→ Disable further interrupts for this queue                  │
│                                                                      │
│  3. NAPI poll (softirq context)                                     │
│     └─→ Poll ring buffer for packets                                │
│     └─→ Allocate sk_buff structures                                 │
│     └─→ Process up to budget packets                                │
│     └─→ Re-enable interrupts if done                                │
│                                                                      │
│  4. Network stack processing                                         │
│     └─→ XDP (if attached) - earliest hook                          │
│     └─→ tc ingress                                                   │
│     └─→ netfilter/iptables                                          │
│     └─→ IP routing                                                   │
│     └─→ TCP/UDP processing                                          │
│                                                                      │
│  5. Socket layer                                                     │
│     └─→ Socket buffer (sk_buff queue)                               │
│     └─→ Wake up waiting application                                 │
│                                                                      │
│  6. Application                                                      │
│     └─→ read()/recv() copies to user space                         │
│     └─→ Process data                                                │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
Common bottlenecks:
  1. Interrupt overhead:
    • Each interrupt: ~1-2μs
    • At 1M pps: 100% CPU just handling interrupts
    • Solution: NAPI, interrupt coalescing
  2. Memory allocation:
    • sk_buff allocation per packet
    • Solution: Page pools, recycling
  3. Lock contention:
    • Socket lock for each packet
    • Solution: SO_REUSEPORT, RSS
  4. Cache misses:
    • Packet data not in cache
    • Solution: Busy polling, NUMA awareness
  5. Context switches:
    • Waking application per packet
    • Solution: Batching, busy polling
Hardware level:
# Enable RSS (Receive Side Scaling)
ethtool -L eth0 combined 8

# Configure interrupt coalescing
ethtool -C eth0 rx-usecs 50 rx-frames 64

# Pin interrupts to CPUs
echo 1 > /proc/irq/XX/smp_affinity
Kernel level:
# Increase socket buffer sizes
sysctl -w net.core.rmem_max=26214400

# Enable busy polling
sysctl -w net.core.busy_poll=50
sysctl -w net.core.busy_read=50

# Tune NAPI: hold off IRQ re-enable (gro_flush_timeout is in nanoseconds)
echo 20000 > /sys/class/net/eth0/gro_flush_timeout
Application level:
// Use SO_REUSEPORT so multiple threads/processes can bind the same port
int opt = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
// Use recvmmsg to receive a batch of datagrams in one syscall
struct mmsghdr msgs[BATCH_SIZE] = {0};
struct iovec iovs[BATCH_SIZE];
char bufs[BATCH_SIZE][2048];
for (int i = 0; i < BATCH_SIZE; i++) {
    iovs[i].iov_base = bufs[i];
    iovs[i].iov_len = sizeof(bufs[i]);
    msgs[i].msg_hdr.msg_iov = &iovs[i];
    msgs[i].msg_hdr.msg_iovlen = 1;
}
int ret = recvmmsg(fd, msgs, BATCH_SIZE, 0, NULL);
Ultimate performance: XDP (a minimal example follows below):
  • Process packets before sk_buff allocation
  • 10M+ pps on single core
  • Used by Cloudflare, Facebook
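For illustration, a minimal XDP sketch (not Cloudflare's code; the port number is a made-up example) that drops UDP traffic to one port before any sk_buff is allocated:
// xdp_drop_port.bpf.c -- sketch only
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define ETH_P_IP  0x0800   /* IPv4 ethertype */
#define DROP_PORT 9999     /* hypothetical port to drop */

SEC("xdp")
int xdp_drop_port(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != 17 /* UDP */)
        return XDP_PASS;

    struct udphdr *udp = (void *)ip + ip->ihl * 4;
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    // Drop before the kernel allocates an sk_buff for this packet
    if (udp->dest == bpf_htons(DROP_PORT))
        return XDP_DROP;

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
One common way to attach it for testing is ip link set dev eth0 xdpgeneric obj xdp_drop_port.bpf.o sec xdp; native or offloaded modes are used for real performance.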

Question 5: CPU Isolation for Low Latency

The Question: “We need sub-millisecond latency for a trading system. How would you configure Linux to minimize jitter?”
Kernel sources of jitter:
  • Timer interrupts (every 1-4ms)
  • RCU callbacks
  • Kernel threads (kworker, ksoftirqd)
  • System call overhead
Hardware sources of jitter:
  • SMI (System Management Interrupt)
  • Cache pollution
  • NUMA remote access
  • Power management (C-states)
Boot parameters (annotated for readability; in the real grub config the options go on one line, without comments):
# /etc/default/grub
GRUB_CMDLINE_LINUX="
    isolcpus=2,3,4,5              # Remove from scheduler
    nohz_full=2,3,4,5             # No timer ticks
    rcu_nocbs=2,3,4,5             # Offload RCU callbacks
    irqaffinity=0,1               # Keep IRQs off isolated CPUs
    intel_pstate=disable          # Disable dynamic frequency
    processor.max_cstate=0        # Disable C-states
    idle=poll                     # Poll instead of sleep
    nosoftlockup                  # Don't check for lockups
    nmi_watchdog=0                # Disable NMI watchdog
    audit=0                       # Disable audit
    skew_tick=1                   # Randomize timer ticks
"
CPU affinity:
# Pin critical threads to isolated CPUs
taskset -c 2,3,4,5 ./trading_app

# Move all other processes away
for pid in $(ps -eo pid --no-headers); do
    taskset -p 0x3 $pid 2>/dev/null  # CPUs 0,1
done
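On the application side, the pinned process usually locks its memory, pins itself, and runs at a real-time priority. A minimal sketch of that setup (an assumed complement to the kernel tuning above, not part of the original question):
// rt_setup.c -- sketch only
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    // Lock current and future pages to avoid page faults on the hot path
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");

    // Pin this thread to isolated CPU 2
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    // SCHED_FIFO so ordinary tasks cannot preempt the critical loop
    struct sched_param sp = { .sched_priority = 90 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    // ... latency-critical loop here ...
    return 0;
}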
IRQ affinity:
# Move all IRQs to housekeeping CPUs
for irq in /proc/irq/*/smp_affinity; do
    echo 3 > $irq 2>/dev/null  # CPUs 0,1
done
Verify isolation:
# Check no IRQs are landing on isolated CPUs 2-5 (columns 4-7 of /proc/interrupts)
cat /proc/interrupts | awk '{print $1, $4, $5, $6, $7}'

# Check no kernel threads are running on isolated CPUs 2-5 (psr column)
ps -eo pid,psr,comm | awk '$2 >= 2 && $2 <= 5'

# Check timer behavior
perf stat -C 2,3,4,5 -e irq_vectors:local_timer_entry sleep 10
Measure latency:
# Use cyclictest
cyclictest -m -p 99 -i 100 -h 1000 -D 1m -a 2 -t 1

# Interpret results:
# Min: 1 μs    (good)
# Avg: 2 μs    (good)
# Max: 50 μs   (acceptable for many use cases)
# Max: 500 μs  (investigate!)

System Design with Kernel Awareness

Question 6: Design a Container Metrics Collector

The Question: “Design a system to collect CPU, memory, and I/O metrics from 10,000 containers on each host with minimal overhead.”
┌─────────────────────────────────────────────────────────────────────┐
│                    METRICS COLLECTION ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   Method 1: Poll cgroup files                                        │
│   ─────────────────────────                                         │
│   For each container:                                                │
│     - Read /sys/fs/cgroup/<id>/cpu.stat                            │
│     - Read /sys/fs/cgroup/<id>/memory.current                      │
│     - Read /sys/fs/cgroup/<id>/io.stat                             │
│                                                                      │
│   Pros: Simple, no kernel changes                                   │
│   Cons: File system overhead, 30K reads/second                      │
│                                                                      │
│   ─────────────────────────────────────────────────────────────────  │
│                                                                      │
│   Method 2: eBPF-based collection                                    │
│   ─────────────────────────────                                     │
│   - Hook scheduler for CPU accounting                               │
│   - Hook memory allocator for memory tracking                       │
│   - Hook block layer for I/O                                        │
│   - Aggregate per-cgroup in BPF maps                                │
│                                                                      │
│   Pros: Lower overhead, real-time data                              │
│   Cons: Complex, needs kernel support                               │
│                                                                      │
│   ─────────────────────────────────────────────────────────────────  │
│                                                                      │
│   Recommended: Hybrid approach                                       │
│   - Use cgroup files for infrequent metrics (memory limits)         │
│   - Use eBPF for high-frequency metrics (CPU, I/O)                  │
│   - Batch reads, use inotify for changes                            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
Batch cgroup file reading:
// Open file descriptors once at startup and reuse them every cycle
struct container_fds {
    int cpu_stat_fd;
    int memory_current_fd;
    int io_stat_fd;
};

void collect_metrics(struct container_fds *fds,
                     struct container_metrics *metrics, int count) {
    char buf[4096];
    for (int i = 0; i < count; i++) {
        // pread reads from offset 0 without a separate lseek
        ssize_t n = pread(fds[i].cpu_stat_fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            parse_cpu_stat(buf, &metrics[i].cpu);
        }

        n = pread(fds[i].memory_current_fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            metrics[i].memory = atol(buf);
        }
    }
}
eBPF for CPU tracking:
// Track CPU time per cgroup
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);    // PID
    __type(value, u64);  // timestamp of last switch-in (ns)
} last_switch SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u64);  // cgroup id
    __type(value, u64);  // total CPU time ns
} cgroup_cpu SEC(".maps");

SEC("tracepoint/sched/sched_switch")
int trace_switch(struct trace_event_raw_sched_switch *ctx) {
    // At sched_switch, "current" is still the task being switched out,
    // so this is the cgroup that should be charged.
    u64 cgroup_id = bpf_get_current_cgroup_id();
    u64 now = bpf_ktime_get_ns();
    u32 prev_pid = ctx->prev_pid;
    u32 next_pid = ctx->next_pid;

    // Charge the on-CPU time of the task being switched out
    u64 *last_time = bpf_map_lookup_elem(&last_switch, &prev_pid);
    if (last_time) {
        u64 delta = now - *last_time;
        u64 *total = bpf_map_lookup_elem(&cgroup_cpu, &cgroup_id);
        if (total)
            *total += delta;
        else
            bpf_map_update_elem(&cgroup_cpu, &cgroup_id, &delta, BPF_ANY);
    }

    // Record the switch-in time for the incoming task
    bpf_map_update_elem(&last_switch, &next_pid, &now, BPF_ANY);
    return 0;
}
Polling approach (10K containers, 1-second interval):
  • 30K file reads per second
  • ~100μs per read = 3 CPU-seconds every second (three cores saturated)
  • Too much overhead!
Optimized polling:
  • Keep FDs open: eliminate open/close
  • Batch reads with io_uring (see the sketch after this list)
  • Stagger collection across time
  • Result: ~100ms of CPU per second
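As a sketch of the "batch reads with io_uring" point (assuming liburing and the per-container file descriptors opened once, as in the earlier snippet; the helper name is hypothetical):
// Submit one read per open cgroup file, then reap the completions.
// Assumes count <= ring depth; error handling trimmed for brevity.
#include <liburing.h>

static void batch_read(struct io_uring *ring, int *fds,
                       char (*bufs)[4096], int count)
{
    for (int i = 0; i < count; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, fds[i], bufs[i], sizeof(bufs[i]), 0);
        sqe->user_data = i;   // identifies which container this read is for
    }
    io_uring_submit(ring);    // one syscall for the whole batch

    for (int i = 0; i < count; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);
        // cqe->user_data = container index, cqe->res = bytes read (or -errno)
        io_uring_cqe_seen(ring, cqe);
    }
}
The ring itself is created once with io_uring_queue_init() and reused on every collection cycle.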
eBPF approach:
  • Constant overhead regardless of container count
  • ~1-2% CPU for tracing hooks
  • Scales to any number of containers
Hybrid approach:
  • eBPF for high-frequency (CPU, I/O): ~1% overhead
  • Polling for low-frequency (limits, configs): ~0.1% overhead
  • Total: ~1.1% CPU overhead for 10K containers

Debugging Scenarios

Scenario 1: Container Not Starting

Situation: A container fails to start with “permission denied” but works as root.
# Check seccomp profile
docker inspect <container> | jq '.[0].HostConfig.SecurityOpt'

# Check AppArmor/SELinux
docker inspect <container> | jq '.[0].AppArmorProfile'
getenforce  # SELinux status

# Check capabilities
docker inspect <container> | jq '.[0].HostConfig.CapAdd'
docker inspect <container> | jq '.[0].HostConfig.CapDrop'

# Trace syscall failures (note: this traces the docker CLI, not the container;
# to see the container's own denials, strace the container's PID on the host)
strace -f docker run --rm myimage 2>&1 | grep -i denied

# Check audit logs
ausearch -m avc -ts recent
Common causes:
  1. Seccomp blocking syscall:
    • Solution: Add syscall to profile or use --security-opt seccomp=unconfined
  2. Missing capability:
    • Solution: --cap-add=SYS_ADMIN (or specific capability)
  3. SELinux/AppArmor denial:
    • Solution: Check audit logs, update policy
  4. User namespace UID mapping:
    • Solution: Check /etc/subuid, /etc/subgid
  5. Read-only filesystem:
    • Solution: --read-only with appropriate tmpfs mounts

Scenario 2: High Memory Usage Mystery

Situation: Container shows 2GB used, but application reports only 500MB heap.
# Get detailed memory breakdown
cat /sys/fs/cgroup/<path>/memory.stat

# Key fields:
# anon - Anonymous memory (heap, stack)
# file - Page cache (file-backed)
# kernel - Kernel memory charged to cgroup
# slab - Slab allocations

# Check for file cache
# This is often the "missing" memory!
grep -E "^(anon|file|kernel)" /sys/fs/cgroup/<path>/memory.stat

# Inside container:
cat /proc/meminfo | grep -E "Cached|Buffers|Slab"

# Check for memory mapped files
cat /proc/<pid>/maps | grep -v "00000000 00:00" | wc -l
Common causes of discrepancy:
  1. Page cache: Files read by application cached in memory
    • Shows in cgroup, not in application heap
    • Will be reclaimed under pressure
  2. Memory-mapped files: Libraries, data files
    • mmap’d but not all pages resident
  3. Slab memory: Kernel allocations for this cgroup
    • Network buffers, file system metadata
  4. Shared memory: Multiple processes sharing
    • Charged once but used by many

Quick Reference: Commands for Interviews

# Process investigation
ps aux --forest                    # Process tree
cat /proc/<pid>/status             # Detailed status
cat /proc/<pid>/maps               # Memory mappings
ls -la /proc/<pid>/fd/             # Open files
cat /proc/<pid>/stack              # Kernel stack

# Memory investigation  
free -h                            # Memory overview
cat /proc/meminfo                  # Detailed memory stats
vmstat 1 5                         # Virtual memory stats
slabtop                            # Slab allocations
cat /proc/buddyinfo                # Fragmentation

# CPU investigation
top -H                             # Per-thread CPU
mpstat -P ALL 1                    # Per-CPU stats
perf top                           # Live profiling
pidstat 1                          # Per-process stats

# Disk I/O
iostat -xz 1                       # Disk stats
iotop                              # Per-process I/O
cat /proc/<pid>/io                 # Process I/O

# Network
ss -tlnp                           # Listening sockets
ss -anp                            # All connections
cat /proc/net/dev                  # Interface stats
nstat                              # Network statistics

# Container/cgroup
cat /sys/fs/cgroup/<path>/memory.current
cat /sys/fs/cgroup/<path>/cpu.stat
cat /sys/fs/cgroup/<path>/io.stat

# Tracing
strace -p <pid>                    # Syscall tracing
ltrace -p <pid>                    # Library tracing
perf trace -p <pid>                # Fast syscall tracing
bpftrace -e '...'                  # eBPF tracing

Key Interview Tips

Think Out Loud

Explain your reasoning as you work through problems. Interviewers want to see your thought process.

Start Simple

Begin with the simplest approach, then discuss trade-offs and optimizations.

Know the Stack

Be ready to go from application to syscall to kernel to hardware.

Practice Debugging

Work through real debugging scenarios. This experience shows in interviews.

Next: Hands-on Projects →