When you view a trace in Jaeger or Honeycomb, the backend has to perform a Distributed Join to reconstruct the tree:
Gathering: Collect all spans sharing the same trace-id.
Topological Sort: Use the parent-id to arrange spans into a Directed Acyclic Graph (DAG).
Clock Correction: Because different nodes have different clocks, the backend must adjust span start times so that a child span never appears to start before its parent (preserving causality).
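To make the join concrete, here is a minimal sketch of how a backend might rebuild the tree. The dict-based span records (`span_id`, `parent_id`, `start_time`) and the naive clamp-to-parent clock correction are assumptions for illustration, not how any particular backend implements it.

```python
from collections import defaultdict

def build_trace_tree(spans):
    """spans: list of dicts already filtered to a single trace_id.
    Returns root spans plus a child index, with child start times
    clamped so no child appears to start before its parent."""
    by_id = {s["span_id"]: s for s in spans}
    children = defaultdict(list)
    roots = []
    for s in spans:
        parent = s.get("parent_id")
        if parent in by_id:
            children[parent].append(s)
        else:
            roots.append(s)  # no known parent: root span (or orphaned span)

    def correct_clocks(parent):
        for child in children[parent["span_id"]]:
            # Naive clock correction: a child may not start before its parent
            child["start_time"] = max(child["start_time"], parent["start_time"])
            correct_clocks(child)

    for root in roots:
        correct_clocks(root)
    return roots, children
```

Real backends use more careful skew estimation (for example, comparing client-send and server-receive timestamps), but the shape of the join (group, order, correct) is the same.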
Standard tracing assumes a parent-child relationship (synchronous RPC). But for asynchronous messages (Kafka/SQS), we use Links.
The Problem: If a consumer processes 100 messages in one batch, who is the “parent”?
The Solution: The consumer's span carries "Links" to the 100 originating spans, letting you see the fan-in from all 100 messages without breaking the structure of any of the original traces.
| Feature    | Child-Of Relationship | Link Relationship        |
|------------|-----------------------|--------------------------|
| Model      | Synchronous (RPC)     | Asynchronous (Messages)  |
| Dependency | Direct parent         | Causal link              |
| Causality  | Strong (Blocked)      | Weak (Hand-off)          |
Staff Tip: When implementing tracing, never use a “global” trace context variable in your code. Always pass the Context object explicitly (in Go) or use ThreadLocal storage (in Java) to avoid trace leakage between concurrent requests.
The hardest part of distributed tracing is ensuring the trace context flows through every boundary: HTTP, gRPC, message queues, thread pools, and async callbacks.
The trickiest case: asynchronous messages break the parent-child relationship.
```python
import json

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

# `producer` and `consumer` are assumed to be pre-configured Kafka clients
# (e.g. kafka-python KafkaProducer / KafkaConsumer), and process_message is
# the application's handler.

# Producer: Inject context into message headers
def send_to_kafka(topic: str, message: dict):
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("kafka_produce"):
        carrier = {}
        inject(carrier)  # write the current trace context (traceparent, etc.) into the dict

        # Convert to Kafka header format: list of (key, bytes) tuples
        headers = [(key, value.encode("utf-8")) for key, value in carrier.items()]

        producer.send(
            topic,
            value=json.dumps(message).encode(),
            headers=headers,
        )

# Consumer: Extract context, but use LINKS not parent
def consume_from_kafka():
    tracer = trace.get_tracer(__name__)
    for message in consumer:
        # Extract the trace context sent by the producer
        carrier = {k: v.decode() for k, v in message.headers}
        producer_context = extract(carrier)

        # Create a LINK to the producer span (not parent-child!)
        producer_span = trace.get_current_span(producer_context)
        links = [trace.Link(producer_span.get_span_context())]

        # Start a new span with a link back to the producer
        with tracer.start_span(
            "kafka_consume",
            links=links,  # Links allow many-to-many relationships
        ) as span:
            process_message(message)
```
Why Links instead of Parent?
A consumer might process messages from 100 different producers in one batch
A span can have only one parent, so modeling this as parent-child either forces you to pick one message arbitrarily or splits one batch's work into 100 parallel traces
Links allow you to see “this processing was triggered by these 100 messages”
Context can be lost when work is submitted to a thread pool.
```python
from opentelemetry.context import attach, detach, get_current
from opentelemetry.instrumentation.threading import ThreadingInstrumentor

# `tracer`, `executor` (a ThreadPoolExecutor), and `do_work` are assumed
# to be defined elsewhere.

# BAD: Context is lost!
def bad_async_call():
    with tracer.start_as_current_span("parent"):
        # Context is NOT automatically copied to the thread pool worker
        executor.submit(do_work)  # do_work sees NO trace context

# GOOD: Explicitly propagate context
def good_async_call():
    with tracer.start_as_current_span("parent"):
        # Capture the current context on the submitting thread
        ctx = get_current()

        def wrapped_work():
            # Attach the captured context in the worker thread
            token = attach(ctx)
            try:
                do_work()  # Now sees the trace context!
            finally:
                detach(token)

        executor.submit(wrapped_work)

# BEST: Use the threading instrumentation
# (opentelemetry-instrumentation-threading package)
ThreadingInstrumentor().instrument()
# Now submitted work automatically runs with the submitter's context
```
```
CONTEXT PROPAGATION PITFALLS

1. LOST CONTEXT IN ASYNC
   Problem: Context not copied to worker threads
   Fix: Use instrumented executors or manual propagation

2. CONTEXT LEAKAGE
   Problem: Global context variable shared between requests
   Fix: Use ThreadLocal/ContextVar, never global

3. MISSING HEADERS IN RETRIES
   Problem: Retry library doesn't copy trace headers
   Fix: Ensure retry wrapper copies headers from original request

4. WRONG RELATIONSHIP FOR QUEUES
   Problem: Using parent-child for async messages
   Fix: Use Links for message queues, parent-child for sync calls

5. SAMPLING DECISIONS INCONSISTENT
   Problem: Parent is sampled but child isn't (or vice versa)
   Fix: Propagate sampling decision in tracestate
```
Staff Tip: When designing an observability stack, ensure your Trace Context propagates through asynchronous boundaries like message queues (Kafka headers) and background jobs (sidekiq/celery), otherwise your traces will have “gaps” that make debugging impossible.
```
METRICS BEST PRACTICES

USE (Brendan Gregg's method):
  Utilization: % of resource used
  Saturation:  Queue depth, backpressure
  Errors:      Error count/rate

RED (for services):
  Rate:     Requests per second
  Errors:   Error rate
  Duration: Latency distribution (p50, p95, p99)

THE FOUR GOLDEN SIGNALS (Google SRE):
  1. Latency:    Time to serve request
  2. Traffic:    Requests per second
  3. Errors:     Rate of failed requests
  4. Saturation: How full is your system?

ALERTING BEST PRACTICES:

  GOOD ALERTS                   | BAD ALERTS
  ------------------------------+--------------------
  Error rate > 1% for 5 min     | CPU > 80%
  p99 latency > 500ms           | Single node down
  Error budget < 10%            | Disk usage > 70%
  Zero successful payments      | Memory usage high

  ALERT ON SYMPTOMS, NOT CAUSES
  "Users are affected" not "CPU is high"
```
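As a concrete illustration of the RED method, here is a minimal sketch using the prometheus_client library; the metric names, label set, and `handle_request` wrapper are assumptions for the example rather than a prescribed layout.

```python
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Total requests", ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["route"])

def handle_request(route, handler):
    """Wrap a request handler and record Rate, Errors, and Duration."""
    start = time.monotonic()
    status = "200"
    try:
        return handler()
    except Exception:
        status = "500"  # Errors: surfaced through the status label
        raise
    finally:
        REQUESTS.labels(route=route, status=status).inc()               # Rate + Errors
        LATENCY.labels(route=route).observe(time.monotonic() - start)   # Duration
```

From these two series you can derive all three RED signals in the query layer: request rate from the counter, error rate from the ratio of 5xx to total, and latency quantiles from the histogram.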
| Failure Type | How to Inject |
|---|---|
| Process crash | kill -9, container stop |
| Instance failure | Terminate EC2, pod delete |
| Zone failure | Block traffic to zone, DNS manipulation |
| Region failure | Failover entire region |
| Network latency | tc netem, iptables delay |
| Network partition | iptables DROP, security groups |
| Packet loss | tc netem loss |
| CPU stress | stress-ng, burn CPU |
| Memory pressure | stress-ng --vm |
| Disk full | dd if=/dev/zero of=/tmp/fill |
| Disk slow | dm-delay, slow filesystem |
| Clock skew | date --set, chronyd manipulation |
| DNS failure | Block port 53, bad DNS response |
| Certificate expiry | Use expired cert, revoke cert |
```
ERROR BUDGETS IN PRACTICE

ERROR BUDGET POLICY:

  If Error Budget > 50%:
    ✓ Deploy normally
    ✓ Run experiments
    ✓ Add new features

  If Error Budget 10-50%:
    ⚠ Slower deployments
    ⚠ Extra review for risky changes
    ⚠ Focus on reliability improvements

  If Error Budget < 10%:
    🛑 Feature freeze
    🛑 Only reliability fixes deployed
    🛑 Postmortem required for any new issues

  If Error Budget Exhausted:
    🚨 All hands on reliability
    🚨 Executive escalation
    🚨 No deploys until budget restored

BALANCING ACT:
  Too much budget remaining = not innovating fast enough
  Too little budget         = reliability suffering
  Target: Consume ~100% of budget by end of period
```
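The arithmetic behind these tiers is straightforward. A sketch, with made-up numbers for the SLO and traffic:

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the period's error budget still unspent."""
    allowed_failures = (1 - slo) * total_requests  # 99.9% SLO -> 0.1% of requests may fail
    if allowed_failures == 0:
        return 0.0
    return 1 - (failed_requests / allowed_failures)

# Example: 99.9% SLO, 10M requests so far this period, 6,000 failures.
# Budget = 10,000 allowed failures; 6,000 spent -> 40% remaining,
# which lands in the "slower deployments" tier above.
print(f"{error_budget_remaining(0.999, 10_000_000, 6_000):.0%}")  # 40%
```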
A system is Statically Stable if it continues to operate in its “steady state” without needing to make changes during a dependency failure.
The Problem: Reactive autoscaling. If AZ-1 fails, AZ-2 and AZ-3 try to scale up. But if the control plane (Kubernetes/EC2 API) is also failing, they can’t scale, and the whole system crashes.
The Solution: Over-provisioning. Run AZ-1, AZ-2, and AZ-3 at no more than 50% utilization each. If one AZ fails, the surviving two already have enough spare capacity to absorb its traffic immediately (they rise to roughly 75% utilization) without calling any external APIs; see the sketch below.
Key Principle: Avoid “Control Plane” dependencies in the “Data Plane” recovery path.
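A back-of-the-envelope sketch of the over-provisioning rule; the load numbers are illustrative:

```python
def capacity_per_az(peak_load: float, num_azs: int, azs_tolerated_down: int = 1) -> float:
    """Capacity each AZ must already have provisioned so the survivors
    absorb peak load with zero scaling actions (static stability)."""
    survivors = num_azs - azs_tolerated_down
    return peak_load / survivors

peak = 300                                   # arbitrary load units
per_az = capacity_per_az(peak, num_azs=3)    # 150 units per AZ
steady_state_util = (peak / 3) / per_az      # ~67% with the minimum provisioning
after_failure_util = (peak / 2) / per_az     # 100%: the survivors are exactly full
```

Provisioning each AZ to run at 50%, as in the example above, is simply a more conservative version of the same rule: after one AZ fails, the survivors land at roughly 75% rather than exactly 100%.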
Instead of one giant “monolith” cluster, you split your infrastructure into many independent Cells.
Definition: A Cell is a complete, self-contained instance of the service (App + DB + Cache).
Blast Radius: If Cell A receives a "poison pill" request or suffers a hardware failure, only the users assigned to Cell A (say, 5% of the total) are affected. The other 95% of users in other cells are completely isolated.
Scaling: To double capacity, you don't grow individual cells; you add more cells (see the routing sketch below).
Used By: AWS (Lambda, DynamoDB), Salesforce, and Facebook.
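A minimal sketch of the routing layer that pins users to cells. The hash-modulo scheme and the four-cell list are assumptions; production systems typically use an explicit mapping service so users can be migrated between cells without reshuffling everyone.

```python
import hashlib

CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]  # each cell: full stack (app + DB + cache)

def cell_for_user(user_id: str) -> str:
    """Deterministically pin a user to exactly one cell, so a failure
    in any other cell never touches them (bounded blast radius)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

# Adding capacity means appending cells (and migrating a slice of users),
# not growing the cells that already exist.
```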
In a modern distributed environment, you don’t manage individual servers; you manage a Cluster. A Scheduler (like Google’s Borg or Kubernetes’ kube-scheduler) is responsible for deciding where your code runs.
In a cluster, users need different mixes of resources (e.g., User A needs lots of CPU, User B needs lots of RAM). How do you allocate resources fairly? Dominant Resource Fairness (DRF) is the standard algorithm.
Definition: DRF computes each user's "dominant share" (their allocation of whichever resource accounts for the largest fraction of the cluster's total capacity) and schedules so as to equalize these dominant shares across users.
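A toy sketch of DRF's core loop for two resources, following the classic example from the Ghodsi et al. paper; the greedy stop condition is simplified for brevity.

```python
def drf_allocate(capacity, demands, max_tasks=1000):
    """Repeatedly grant one task to the user with the smallest dominant share.
    capacity: total cluster resources, e.g. {"cpu": 9, "ram": 18}
    demands:  per-task demand per user, e.g. {"A": {"cpu": 1, "ram": 4}, ...}"""
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}
    dominant_share = {u: 0.0 for u in demands}

    for _ in range(max_tasks):
        user = min(dominant_share, key=dominant_share.get)  # most-starved user goes next
        need = demands[user]
        if any(used[r] + need[r] > capacity[r] for r in capacity):
            break  # simplification: stop when the chosen user no longer fits
        for r in capacity:
            used[r] += need[r]
        tasks[user] += 1
        dominant_share[user] = max(need[r] * tasks[user] / capacity[r] for r in capacity)
    return tasks

# Classic example: <9 CPU, 18 GB> cluster, A wants <1 CPU, 4 GB>, B wants <3 CPU, 1 GB>.
print(drf_allocate({"cpu": 9, "ram": 18},
                   {"A": {"cpu": 1, "ram": 4}, "B": {"cpu": 3, "ram": 1}}))
# {'A': 3, 'B': 2}: A's dominant resource is RAM, B's is CPU, both end near a 2/3 share
```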
Not all jobs are equal. Borg introduced two main categories:
Prod (Production): Low latency, high availability (e.g., Search, Gmail).
Non-Prod (Batch): High throughput, latency-insensitive (e.g., Log processing, ML training).
Preemption: If a high-priority “Prod” job needs resources, the scheduler will kill (evict) lower-priority “Non-Prod” jobs to make room. This allows Google to run clusters at 90%+ utilization.
```
SCHEDULER DECISION FLOW

Task arrives: {CPU: 2, RAM: 4GB, Priority: PROD}

1. FILTERING (Predicates):
   - Does node have enough free RAM?
   - Does node have required OS/Hardware?
   - Result: Set of "Feasible Nodes"

2. SCORING (Priorities):
   - Which node minimizes fragmentation? (Bin packing)
   - Which node has the data locally? (Locality)
   - Result: "Best Node" picked

3. PREEMPTION (If no feasible nodes):
   - Can I kill priority 0 jobs to fit priority 10?
```
Staff Tip: When designing a system, differentiate between Hard Constraints (the app cannot run without this) and Soft Constraints/Affinities (the app should run here for better performance). Over-constraining your scheduler leads to “unschedulable” tasks and wasted resources.
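A condensed sketch of the filter / score / preempt flow described above, with hard constraints as predicates and soft constraints as scoring terms. The node attributes, scoring weights, and preemption rule are invented for illustration and are far simpler than what Borg or kube-scheduler actually do.

```python
def schedule(task, nodes, running_tasks):
    """task:  {"cpu": 2, "ram": 4, "priority": 10, "os": "linux"}
    nodes: [{"name": ..., "total_cpu": ..., "free_cpu": ..., "free_ram": ...,
             "os": ..., "has_data": bool}, ...]"""
    # 1. Filtering: hard constraints (predicates). Any violation disqualifies a node.
    feasible = [n for n in nodes
                if n["free_cpu"] >= task["cpu"]
                and n["free_ram"] >= task["ram"]
                and n["os"] == task["os"]]

    # 2. Scoring: soft constraints (affinities). Prefer tight packing and data locality.
    def score(n):
        packing = 1 - (n["free_cpu"] - task["cpu"]) / n["total_cpu"]
        locality = 1.0 if n["has_data"] else 0.0
        return 0.7 * packing + 0.3 * locality

    if feasible:
        return max(feasible, key=score), []  # best node, nothing to evict

    # 3. Preemption: evict lower-priority (non-prod) work to make room.
    victims = [t for t in running_tasks if t["priority"] < task["priority"]]
    return None, victims
```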
Q: How would you debug a latency spike across 100 services?
Systematic approach:
1. Check dashboards first
   - Which services show elevated latency?
   - When did it start? Correlate with deployments.
   - Scope: all users or a specific segment?
2. Use distributed tracing
   - Find slow traces
   - Identify which span is slowest
   - Look for patterns (specific service, DB, external API)
3. Drill down
   - Check that service's metrics (CPU, memory, connections)
   - Check dependencies (DB latency, cache hit rate)
   - Check for new error types in logs
4. Common culprits
   - Database slow queries
   - Cache miss spike
   - Connection pool exhaustion
   - Garbage collection
   - Noisy neighbor (shared resources)
   - External API degradation
5. Mitigation while investigating
   - Scale up if resource-bound
   - Enable circuit breakers
   - Fail over to a backup
Q: Design a chaos engineering program for your team
```
PHASE 1: FOUNDATION (Months 1-2)
• Define steady state metrics
• Set up experiment tracking
• Train team on principles
• Start with game days (manual)

PHASE 2: BASIC EXPERIMENTS (Months 3-4)
• Instance failures
• Network latency injection
• Run in staging first
• Graduate to production with small blast radius

PHASE 3: ADVANCED (Months 5-6)
• Zone failures
• Database failover
• Certificate expiry
• Clock skew

PHASE 4: AUTOMATION (Month 7+)
• Continuous chaos in production
• Automatic rollback on failure
• Integrate with CI/CD
• Coverage reporting

KEY PRINCIPLES:
• Always have a rollback plan
• Start small, expand gradually
• Document everything
• Celebrate finding issues!
```
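The "always have a rollback plan" principle can be baked into the experiment harness itself. A hypothetical sketch; `steady_state_ok`, `inject`, and `rollback` are placeholders for your own metric check and failure-injection tooling.

```python
import time

def run_chaos_experiment(name, steady_state_ok, inject, rollback, duration_s=300):
    """Abort and roll back as soon as the steady-state hypothesis is violated."""
    assert steady_state_ok(), "System not in steady state; refusing to start"
    print(f"[{name}] injecting failure")
    inject()
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if not steady_state_ok():
                print(f"[{name}] steady state violated, aborting early")
                return False
            time.sleep(10)
        return True
    finally:
        rollback()  # undo the injection no matter how the experiment ended
        print(f"[{name}] rolled back")
```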
Q: How do you handle being paged 5 times a night?
Short-term mitigation:
- Analyze page patterns
- Suppress non-actionable alerts
- Add a secondary on-call
Medium-term fixes:
- Improve runbooks for faster resolution
- Automate common remediations
- Add better monitoring to prevent issues
Long-term solutions:
- Fix underlying reliability issues
- Add circuit breakers and retries
- Improve capacity planning
- Use the error budget to push back on feature pressure
Process changes:
- Track alert metrics (pages per week)
- Review every page in the team meeting
- Goal: < 2 pages per on-call shift
- Escalate if consistently exceeded
Q: How do you prepare for a 10x traffic spike?
```
PREPARATION TIMELINE:

T-4 weeks:
• Load test current capacity
• Identify bottlenecks
• Order additional capacity

T-2 weeks:
• Pre-scale databases (can't autoscale fast)
• Increase cache capacity
• Add read replicas
• Pre-warm caches

T-1 week:
• Scale compute to 3x normal
• Verify autoscaling works
• Test circuit breakers
• Prepare runbooks

T-1 day:
• Final scale to target (10x)
• War room ready
• All hands available

During event:
• Monitor dashboards
• Quick decisions on feature flags
• Prepared to shed load (graceful degradation)

After:
• Scale down gradually (30% per hour)
• Postmortem any issues
• Update capacity model
```
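A rough sketch of the capacity math behind the pre-scaling steps; the traffic numbers, per-instance throughput, and headroom factor are placeholders.

```python
import math

def instances_needed(peak_rps: float, rps_per_instance: float, headroom: float = 0.3) -> int:
    """Instances required to serve peak_rps while keeping headroom for
    retries, GC pauses, and uneven load balancing."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance)

normal_rps = 2_000
spike_rps = normal_rps * 10                                  # the 10x event
print(instances_needed(spike_rps, rps_per_instance=250))     # 104 instances
```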