Real Interview Questions
This module contains actual interview questions from infrastructure, observability, and platform engineering roles at top companies. Each question includes detailed solutions and the key insights interviewers are looking for.
What to Expect: These questions require deep understanding, not memorization
Interview Format: Usually 45-60 minute deep dives into 2-3 topics
Time to Prepare: 10-12 hours to work through all scenarios
Company-Specific Patterns
Different companies emphasize different areas:

| Company | Focus Areas | Style |
|---|---|---|
| Datadog | eBPF, tracing, metrics collection | Hands-on implementation |
| Grafana Labs | Observability stack, performance | System design + coding |
| Cloudflare | Network stack, performance | Deep Linux networking |
| Chronosphere | Time series, observability | Architecture + internals |
| Meta Infra | Large scale systems | Design + debugging |
| Netflix | Performance, containers | Deep dives + scenarios |
Observability Company Questions
Question 1: Implement a Syscall Counter (Datadog-style)
Context: You’re asked to implement a tool that counts syscalls by process in production without significant overhead.
Discussion Points
Interviewer is looking for:
- Knowledge of different approaches (strace, perf, eBPF)
- Understanding of overhead implications
- Production safety considerations
- Sampling vs complete counting trade-offs
Approach comparison:
- strace: Per-syscall ptrace, very high overhead (~100x slowdown)
- perf: Sampling-based, lower overhead, may miss syscalls
- eBPF tracepoints: Low overhead (~1-5%), production-safe
- eBPF kprobes: Slightly higher overhead, more flexible
Solution Approach
Best approach: eBPF tracepoint
User-space component:
- Periodically read map, sort by count
- Display top N processes
- Handle PID reuse (track start time)
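As a concrete sketch of this approach, the following writes a small bpftrace program that counts syscalls per process via the `raw_syscalls:sys_enter` tracepoint (bpftrace and root privileges are assumed; the 5-second interval and top-10 cutoff are illustrative choices):

```shell
# Write a bpftrace program that counts syscalls per (pid, comm).
# Run it with: sudo bpftrace /tmp/syscall_top.bt
cat <<'EOF' > /tmp/syscall_top.bt
tracepoint:raw_syscalls:sys_enter {
    @counts[pid, comm] = count();
}
interval:s:5 {
    print(@counts, 10);   // top 10 entries every 5 seconds
    clear(@counts);
}
EOF
```

Because the map is keyed by PID, a long-running collector would also need to detect PID reuse (e.g., by comparing process start times), as noted above.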
Production Considerations
What makes this production-safe:
- Bounded map size: Won’t consume unlimited memory
- No locks in hot path: Per-CPU increments would be ideal
- Graceful degradation: If map full, just skip new PIDs
- Low overhead: Tracepoint, not ptrace
Overhead estimation:
- ~50-100ns per syscall
- On busy system (100K syscalls/sec): ~1% CPU
- Acceptable for production monitoring
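The overhead estimate above is arithmetic worth being able to reproduce on a whiteboard; here it is spelled out (using the upper-bound 100ns figure):

```shell
# Rough overhead estimate: 100K syscalls/sec at ~100ns of probe work each.
# 100,000 * 100ns = 10ms of CPU per second = 1% of one core.
awk 'BEGIN {
    rate_per_sec = 100000     # syscalls per second on a busy system
    cost_ns      = 100        # per-syscall probe cost, upper bound
    cpu_frac = rate_per_sec * cost_ns / 1e9
    printf "CPU overhead: %.1f%%\n", cpu_frac * 100
}'
```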
Question 2: Debug High Latency in Production
Context: A service is experiencing intermittent latency spikes. You need to identify the cause without restarting the service.
Investigation Framework
Systematic approach:
- Characterize the problem:
  - When do spikes occur? (Time correlation)
  - Which requests are affected? (Endpoint, payload)
  - Duration of spikes? (Seconds, minutes)
- Gather baseline metrics:
  - CPU utilization (is there contention?)
  - Memory usage (swapping? GC?)
  - Disk I/O (latency, throughput)
  - Network (retransmits, latency)
- Narrow down the layer:
  - Application code?
  - Runtime (GC pauses)?
  - Kernel (scheduling, I/O)?
  - Hardware (disk, network)?
Tools and Commands
Quick triage:
Deep analysis with bpftrace:
Check for GC pauses (if JVM/Go):
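A sketch of the quick-triage pass, in the style of the classic first-60-seconds checklist (the sysstat tools — vmstat, mpstat, pidstat, iostat, sar — are assumed to be installed):

```shell
# Quick-triage script: baseline CPU, memory, disk, and network in one pass.
cat <<'EOF' > /tmp/triage.sh
#!/bin/sh
uptime                      # load averages: is load climbing?
dmesg | tail -20            # recent kernel messages (OOM kills, I/O errors)
vmstat 1 5                  # run queue, swapping, CPU breakdown
mpstat -P ALL 1 3           # per-CPU utilization: one hot core?
pidstat 1 3                 # per-process CPU usage
iostat -xz 1 3              # disk latency (await) and utilization
sar -n TCP,ETCP 1 3         # TCP retransmits and errors
EOF
chmod +x /tmp/triage.sh
```

Each command answers one question from the framework above; running them in this order narrows the layer before any deep tracing is needed.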
Common Causes and Solutions
Likely causes of intermittent spikes:
- Garbage Collection:
  - Symptom: Regular, predictable spikes
  - Detection: GC logs show long pauses
  - Fix: Tune GC, reduce allocation rate
- Disk I/O:
  - Symptom: Correlates with writes/fsync
  - Detection: biolatency shows spikes
  - Fix: Async I/O, better storage
- Memory Pressure:
  - Symptom: During memory spikes
  - Detection: sar -B shows page faults
  - Fix: Increase memory, reduce footprint
- CPU Throttling (containers):
  - Symptom: Regular, consistent spikes
  - Detection: cat /sys/fs/cgroup/cpu.stat
  - Fix: Increase CPU limits
- Network Issues:
  - Symptom: Affects network calls
  - Detection: tcpretrans, ss -ti
  - Fix: Check network path, timeouts
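For the CPU-throttling case, the cpu.stat check can be scripted. The field names below are the real cgroup v2 `cpu.stat` keys; the sample values are fabricated for illustration (in production, read `/sys/fs/cgroup/<group>/cpu.stat`):

```shell
# Compute what fraction of scheduler periods this cgroup was throttled.
cat <<'EOF' > /tmp/cpu.stat
usage_usec 5000000
user_usec 4000000
system_usec 1000000
nr_periods 1000
nr_throttled 150
throttled_usec 900000
EOF
awk '/^nr_periods/ {p=$2} /^nr_throttled/ {t=$2}
     END { printf "throttled in %.1f%% of periods\n", 100*t/p }' /tmp/cpu.stat
```

A throttling rate this high lines up with the "regular, consistent spikes" symptom: each throttled period stalls the workload until the next quota refresh.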
Question 3: Container Memory Behavior
The Question: “Explain what happens when a container hits its memory limit. What are the different behaviors, and how would you debug an OOM-killed container?”
Memory Limit Behavior
When a container approaches its memory limit:
Cgroup v2 memory controls:
- memory.current: Current usage
- memory.max: Hard limit (OOM if exceeded)
- memory.high: Soft limit (throttling)
- memory.low: Best-effort protection
- memory.min: Hard protection
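A sketch of checking how close a cgroup is to its hard limit. The file names are the real cgroup v2 interfaces listed above; the directory and byte values here are fabricated stand-ins (point `CG` at `/sys/fs/cgroup/<your-group>` in production):

```shell
# How close is this cgroup to OOM? Compare memory.current to memory.max.
CG=/tmp/fake-cgroup
mkdir -p "$CG"
echo 1879048192 > "$CG/memory.current"   # ~1.75 GiB in use (sample value)
echo 2147483648 > "$CG/memory.max"       # 2 GiB hard limit (sample value)
cur=$(cat "$CG/memory.current"); max=$(cat "$CG/memory.max")
awk -v c="$cur" -v m="$max" 'BEGIN { printf "usage: %.1f%% of limit\n", 100*c/m }'
```

Crossing memory.high before memory.max gives the kernel room to throttle and reclaim instead of killing, which is why setting both is common practice.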
Debugging OOM Kills
Immediate diagnostics:
Understanding OOM output:
Memory profiling:
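A sketch of the immediate diagnostics. The dmesg message shape is the standard kernel OOM-killer record; the sample line below is fabricated so the parsing can be shown end to end:

```shell
# Was the container OOM-killed? In production, check the kernel log and runtime:
#   dmesg | grep -i 'killed process'
#   docker inspect <ctr> --format '{{.State.OOMKilled}}'
# Parse a kernel OOM line for the victim PID, command, and resident memory:
line="Out of memory: Killed process 12345 (java) total-vm:4200000kB, anon-rss:2100000kB, file-rss:0kB"
echo "$line" | sed -E 's/.*Killed process ([0-9]+) \(([^)]+)\).*anon-rss:([0-9]+)kB.*/pid=\1 comm=\2 rss_kB=\3/'
```

The anon-rss figure is the key one: it is the non-reclaimable memory that actually pushed the cgroup over its limit, as opposed to total-vm, which counts unmapped virtual address space.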
Prevention Strategies
Proper memory sizing:
- Profile application under realistic load
- Account for peak usage, not average
- Include headroom for GC, file cache
Application-level protections:
- Set JVM heap < container limit (-Xmx)
- Use memory-aware allocators
- Implement backpressure mechanisms
Infrastructure Company Questions
Question 4: Network Stack Performance (Cloudflare-style)
The Question: “Explain the journey of a packet from the NIC to the application. Where are the performance bottlenecks, and how would you optimize for high packet rates?”
Packet Journey
Performance Bottlenecks
Common bottlenecks:
- Interrupt overhead:
  - Each interrupt: ~1-2μs
  - At 1M pps: 100% CPU just handling interrupts
  - Solution: NAPI, interrupt coalescing
- Memory allocation:
  - sk_buff allocation per packet
  - Solution: Page pools, recycling
- Lock contention:
  - Socket lock for each packet
  - Solution: SO_REUSEPORT, RSS
- Cache misses:
  - Packet data not in cache
  - Solution: Busy polling, NUMA awareness
- Context switches:
  - Waking application per packet
  - Solution: Batching, busy polling
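The interrupt-overhead claim above is simple arithmetic worth reproducing at the whiteboard (1.5μs is taken as the midpoint of the 1-2μs range):

```shell
# At 1M packets/sec with ~1.5us of interrupt handling each, interrupts
# alone would consume more than a full core per second -- which is why
# NAPI switches the driver from interrupts to polling under load.
awk 'BEGIN {
    pps    = 1000000      # packets per second
    irq_us = 1.5          # cost per interrupt, microseconds
    printf "IRQ CPU time: %.1f core-seconds per second\n", pps * irq_us / 1e6
}'
```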
Optimization Techniques
Hardware level:
Kernel level:
Application level:
Ultimate performance: XDP:
- Process packets before sk_buff allocation
- 10M+ pps on single core
- Used by Cloudflare, Facebook
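Typical knobs at the hardware, kernel, and application levels can be sketched as follows. The ethtool and sysctl interfaces are real; `eth0`, `./server`, and the specific values are illustrative placeholders:

```shell
# Per-NIC and kernel tuning for high packet rates (values are examples).
cat <<'EOF' > /tmp/net-tune.sh
#!/bin/sh
# Hardware level: spread RX across cores (RSS), coalesce interrupts.
ethtool -L eth0 combined 8
ethtool -C eth0 rx-usecs 50
# Kernel level: deeper softirq backlog, busy polling.
sysctl -w net.core.netdev_max_backlog=250000
sysctl -w net.core.busy_poll=50
# Application level: SO_REUSEPORT listeners (set in code),
# pinned to the cores that own the RX queues.
taskset -c 2 ./server --reuseport
EOF
chmod +x /tmp/net-tune.sh
```

Each knob targets one bottleneck from the list above: RSS attacks lock contention, coalescing attacks interrupt overhead, and busy polling attacks cache misses and context switches.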
Question 5: CPU Isolation for Low Latency
The Question: “We need sub-millisecond latency for a trading system. How would you configure Linux to minimize jitter?”
Sources of Jitter
Kernel sources:
- Timer interrupts (every 1-4ms)
- RCU callbacks
- Kernel threads (kworker, ksoftirqd)
- System call overhead
Hardware sources:
- SMI (System Management Interrupt)
- Cache pollution
- NUMA remote access
- Power management (C-states)
Isolation Configuration
Boot parameters:
CPU affinity:
IRQ affinity:
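A sketch of the full setup. The kernel parameters and /proc interfaces are real; the core numbers (2-7 isolated, 0-1 housekeeping) and `./trading-engine` are illustrative placeholders:

```shell
# CPU isolation recipe: isolate cores 2-7, keep housekeeping on 0-1.
cat <<'EOF' > /tmp/isolate.sh
#!/bin/sh
# Boot parameters (append to the kernel cmdline, then reboot):
#   isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7
# CPU affinity: pin the latency-critical process to the isolated cores.
taskset -c 2-7 ./trading-engine &
# IRQ affinity: steer interrupts onto housekeeping CPUs 0-1 (mask 0x3).
for irq in /proc/irq/[0-9]*; do
    echo 3 > "$irq/smp_affinity" 2>/dev/null || true
done
EOF
chmod +x /tmp/isolate.sh
```

The three pieces work together: isolcpus keeps the scheduler off the cores, nohz_full stops the periodic timer tick on them, and rcu_nocbs plus the IRQ masks remove the remaining kernel interruptions.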
Verification and Testing
Verify isolation:
Measure latency:
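A sketch of the verification pass, continuing the cores-2-7 example (cyclictest comes from the rt-tests package; the flag values are illustrative):

```shell
# Verify isolation took effect, then measure wakeup jitter.
cat <<'EOF' > /tmp/verify-isolation.sh
#!/bin/sh
# Did the kernel accept the isolation parameters?
cat /sys/devices/system/cpu/isolated          # expect: 2-7
# What is still scheduled on the isolated cores? (psr = current CPU)
ps -eo pid,psr,comm | awk '$2 >= 2 && $2 <= 7'
# Worst-case wakeup latency on the isolated cores, 60-second run.
cyclictest --mlockall --priority=99 --interval=200 --affinity=2-7 --duration=60
EOF
chmod +x /tmp/verify-isolation.sh
```

For a trading-system target, the number that matters in the cyclictest output is the Max column: a single multi-millisecond outlier disqualifies the setup even if the average looks fine.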
System Design with Kernel Awareness
Question 6: Design a Container Metrics Collector
The Question: “Design a system to collect CPU, memory, and I/O metrics from 10,000 containers on each host with minimal overhead.”
Architecture Overview
Optimized Implementation
Batch cgroup file reading:
eBPF for CPU tracking:
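The batching idea can be sketched with simulated cgroup files (a real collector keeps the file descriptors open and batches reads with pread or io_uring; this shell version only shows the single-pass read replacing per-file process spawns):

```shell
# Simulate 100 container cgroups, then read every memory.current in one pass
# instead of an open/read/close round-trip per metric per scrape.
ROOT=/tmp/cgsim; rm -rf "$ROOT"; mkdir -p "$ROOT"
for i in $(seq 1 100); do
    mkdir -p "$ROOT/ctr$i"
    echo $((i * 1048576)) > "$ROOT/ctr$i/memory.current"   # i MiB each
done
# One awk invocation reads all 100 files: one process, one pass.
awk '{ total += $1; n++ } END { printf "%d cgroups, %.0f MiB total\n", n, total/1048576 }' \
    "$ROOT"/ctr*/memory.current
```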
Overhead Analysis
Polling approach (10K containers, 1-second interval):
- 30K file reads per second
- ~100μs per read = 3 seconds of CPU
- Too much overhead!
Optimized polling:
- Keep FDs open: eliminate open/close
- Batch reads with io_uring
- Stagger collection across time
- Result: ~100ms of CPU per second
eBPF approach:
- Constant overhead regardless of container count
- ~1-2% CPU for tracing hooks
- Scales to any number of containers
Hybrid (recommended):
- eBPF for high-frequency (CPU, I/O): ~1% overhead
- Polling for low-frequency (limits, configs): ~0.1% overhead
- Total: ~1.1% CPU overhead for 10K containers
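The naive-polling figures above are straightforward arithmetic (assuming three metric files per container, as the 30K-reads figure implies):

```shell
# 10K containers x 3 files each, read once per second, ~100us per read.
awk 'BEGIN {
    containers = 10000; files = 3; cost_us = 100
    reads = containers * files
    cpu_s = reads * cost_us / 1e6
    printf "%d reads/sec -> %.1f CPU-seconds per second\n", reads, cpu_s
}'
```

Three full CPU-seconds per second is why the design has to move to kept-open FDs, batched reads, or eBPF before it can scale to 10K containers.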
Debugging Scenarios
Scenario 1: Container Not Starting
Situation: A container fails to start with “permission denied” but works as root.
Debugging Steps
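A sketch of the debugging sequence. The tools and /proc interfaces are real; `myimage` and `/app/start` are hypothetical placeholders, and the strace step assumes strace exists inside the image:

```shell
# Debugging "permission denied" at container start, step by step.
cat <<'EOF' > /tmp/debug-container.sh
#!/bin/sh
# 1. Capture the exact failing syscall from inside the container.
docker run --rm --cap-add SYS_PTRACE myimage \
    strace -f /app/start 2>&1 | grep -e EPERM -e EACCES
# 2. Check security-module denials on the host.
dmesg | grep -i -e apparmor -e selinux | tail
ausearch -m AVC -ts recent 2>/dev/null      # SELinux audit log, if auditd runs
# 3. Compare effective capabilities inside the container.
docker run --rm myimage sh -c 'grep Cap /proc/self/status'
# 4. Check user-namespace UID/GID mappings on the host.
cat /etc/subuid /etc/subgid
EOF
chmod +x /tmp/debug-container.sh
```

Step 1 usually identifies the cause directly: EPERM on a specific syscall points at seccomp or a missing capability, while EACCES on a path points at filesystem permissions or an LSM policy.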
Common Causes
- Seccomp blocking syscall:
  - Solution: Add syscall to profile or use --security-opt seccomp=unconfined
- Missing capability:
  - Solution: --cap-add=SYS_ADMIN (or specific capability)
- SELinux/AppArmor denial:
  - Solution: Check audit logs, update policy
- User namespace UID mapping:
  - Solution: Check /etc/subuid, /etc/subgid
- Read-only filesystem:
  - Solution: --read-only with appropriate tmpfs mounts
Scenario 2: High Memory Usage Mystery
Situation: Container shows 2GB used, but application reports only 500MB heap.
Memory Accounting Deep Dive
- Page cache: Files read by application cached in memory
  - Shows in cgroup, not in application heap
  - Will be reclaimed under pressure
- Memory-mapped files: Libraries, data files
  - mmap’d but not all pages resident
- Slab memory: Kernel allocations for this cgroup
  - Network buffers, file system metadata
- Shared memory: Multiple processes sharing
  - Charged once but used by many
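The accounting gap can be made concrete by decomposing memory.stat. The field names below are real cgroup v2 keys; the byte values are fabricated to roughly match the 2GB-vs-500MB scenario:

```shell
# Where did 2GB go if the heap is only 500MB? Break down memory.stat.
# In production, read /sys/fs/cgroup/<group>/memory.stat instead.
cat <<'EOF' > /tmp/memory.stat
anon 524288000
file 1258291200
kernel_stack 8388608
slab 209715200
shmem 104857600
EOF
awk '{ printf "%-14s %6.0f MiB\n", $1, $2/1048576 }' /tmp/memory.stat
awk '{ total += $2 } END { printf "accounted      %6.0f MiB\n", total/1048576 }' /tmp/memory.stat
```

In this sample, anon (~500 MiB) is the heap the application reports; the remaining ~1.5 GiB is page cache (file), slab, and shared memory charged to the cgroup but invisible to the application.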
Quick Reference: Commands for Interviews
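As a compact crib, the tools referenced throughout this module (a partial list reconstructed from the scenarios above, not an exhaustive reference):

```shell
# One-screen command crib for interview prep.
cat <<'EOF' > /tmp/interview-crib.sh
#!/bin/sh
vmstat 1; mpstat -P ALL 1; pidstat 1        # CPU triage
sar -B 1; cat /sys/fs/cgroup/cpu.stat       # memory pressure, CPU throttling
iostat -xz 1; biolatency                    # disk latency
ss -ti; tcpretrans                          # network retransmits
dmesg | grep -i 'killed process'            # OOM kills
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
EOF
chmod +x /tmp/interview-crib.sh
```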
Key Interview Tips
Think Out Loud
Explain your reasoning as you work through problems. Interviewers want to see your thought process.
Start Simple
Begin with the simplest approach, then discuss trade-offs and optimizations.
Know the Stack
Be ready to go from application to syscall to kernel to hardware.
Practice Debugging
Work through real debugging scenarios. This experience shows in interviews.
Next: Hands-on Projects →