Senior Level: Observability is how you debug production issues. Interviewers expect senior engineers to design systems that can be understood and debugged at scale.

The Three Pillars of Observability

┌─────────────────────────────────────────────────────────────────┐
│              The Three Pillars                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  METRICS                    LOGS                  TRACES        │
│  ───────                    ────                  ──────        │
│  "What happened?"           "Why?"                "Where?"      │
│                                                                 │
│  Aggregated data            Individual events     Request flow  │
│  Time-series               Text/structured       Distributed    │
│  Cheap to store            Expensive             Moderate       │
│  Alerting                  Debugging             Root cause     │
│                                                                 │
│  Examples:                  Examples:             Examples:      │
│  • Request rate             • Error messages      • Request path │
│  • Error rate               • Stack traces        • Latency/span │
│  • Latency (p50,p99)        • User actions        • Dependencies │
│  • CPU/Memory               • Audit trail         • Bottlenecks  │
│                                                                 │
│  Tools:                     Tools:                Tools:        │
│  • Prometheus               • ELK Stack           • Jaeger       │
│  • Datadog                  • Splunk              • Zipkin       │
│  • CloudWatch               • CloudWatch Logs     • AWS X-Ray    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Metrics

Key Metrics to Track (RED Method)

┌─────────────────────────────────────────────────────────────────┐
│              RED Method (Request-focused)                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  R - Rate      : Requests per second                           │
│  E - Errors    : Error rate (% of failed requests)             │
│  D - Duration  : Request latency (p50, p95, p99)               │
│                                                                 │
│  For every service, track:                                     │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Service: payment-service                               │   │
│  │  ─────────────────────────────────────────              │   │
│  │  Rate:      150 req/s                                   │   │
│  │  Errors:    0.1% (5xx), 2% (4xx)                        │   │
│  │  Duration:  p50=45ms, p95=120ms, p99=350ms              │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
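As a rough illustration of the RED numbers in the box above, the sketch below derives rate, error percentage, and latency percentiles from one window of raw request samples. The `RequestSample` type, the `red_summary` helper, and the nearest-rank percentile are assumptions made for this example; a real service would normally let its metrics backend do this aggregation.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    status: int         # HTTP status code
    duration_ms: float  # request latency in milliseconds


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list (p is an integer 1..100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(samples, window_seconds):
    """Compute Rate, Errors, and Duration for one window of samples."""
    durations = sorted(s.duration_ms for s in samples)
    server_errors = sum(1 for s in samples if s.status >= 500)
    return {
        "rate_per_s": len(samples) / window_seconds,
        "error_pct": 100 * server_errors / len(samples),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# 99 successful requests taking 1..99 ms, plus one slow 503
samples = [RequestSample(200, float(ms)) for ms in range(1, 100)]
samples.append(RequestSample(503, 350.0))
summary = red_summary(samples, window_seconds=10)
print(summary)
```

Note how the single 350 ms outlier does not move p99 here; percentiles only surface a slow tail once it covers at least that share of requests.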

USE Method (Resource-focused)

┌─────────────────────────────────────────────────────────────────┐
│              USE Method (Infrastructure)                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  U - Utilization : % time resource is busy                     │
│  S - Saturation  : Work resource can't service (queue length)  │
│  E - Errors      : Count of error events                       │
│                                                                 │
│  Apply to each resource:                                       │
│                                                                 │
│  CPU:                                                          │
│  • Utilization: 75%                                            │
│  • Saturation: Load average / CPU count                        │
│  • Errors: CPU errors (rare)                                   │
│                                                                 │
│  Memory:                                                        │
│  • Utilization: Used / Total                                   │
│  • Saturation: Swap usage, OOM events                          │
│  • Errors: Allocation failures                                 │
│                                                                 │
│  Network:                                                       │
│  • Utilization: Bandwidth used                                 │
│  • Saturation: Dropped packets, retransmits                    │
│  • Errors: Interface errors                                    │
│                                                                 │
│  Disk:                                                          │
│  • Utilization: I/O time %                                     │
│  • Saturation: I/O queue length                                │
│  • Errors: Bad sectors, I/O errors                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
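The CPU and memory rows above reduce to simple ratios. The following is a minimal sketch using only the standard library; the helper names `cpu_saturation` and `memory_utilization` are made up for this example, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def cpu_saturation(load_average, cpu_count):
    """Load average divided by CPU count; values above 1.0 suggest runnable work is queueing."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def memory_utilization(used_bytes, total_bytes):
    """Memory utilization as a percentage of total."""
    return 100 * used_bytes / total_bytes


# Live reading where the platform supports it (not available on Windows)
if hasattr(os, "getloadavg"):
    try:
        one_minute, _, _ = os.getloadavg()
        print("cpu saturation:", cpu_saturation(one_minute, os.cpu_count() or 1))
    except OSError:
        print("load average unavailable")
```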

Metric Types

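import time

from fastapi import FastAPI  # assumed framework; the middleware below uses its API
from prometheus_client import Counter, Gauge, Histogram, Summary

app = FastAPI()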

# COUNTER: Only goes up (request count, errors)
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# GAUGE: Can go up and down (current connections, queue size)
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# HISTOGRAM: Measures distribution (latency buckets)
request_latency = Histogram(
    'http_request_duration_seconds',
    'Request latency in seconds',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# SUMMARY: Like a histogram, but the Python client tracks only count and sum
# (some other client libraries also calculate quantiles client-side)
request_latency_summary = Summary(
    'http_request_duration_summary',
    'Request latency summary',
    ['endpoint']
)

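# Usage in a request handler (FastAPI/Starlette-style middleware)
@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.perf_counter()
    status_code = 500  # reported if call_next raises before a response exists
    
    active_connections.inc()
    try:
        response = await call_next(request)
        status_code = response.status_code
        return response
    finally:
        duration = time.perf_counter() - start
        active_connections.dec()
        
        # Note: raw paths such as /orders/123 create one time series per ID;
        # prefer the route template where the framework exposes it.
        requests_total.labels(
            method=request.method,
            endpoint=request.url.path,
            status=status_code
        ).inc()
        
        request_latency.labels(
            endpoint=request.url.path
        ).observe(duration)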
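To make the histogram type less abstract, here is a small standard-library sketch of what bucketed latency data looks like and how a quantile can be estimated from it by linear interpolation inside a bucket, which is roughly the idea behind Prometheus's `histogram_quantile`. The function names are invented for this example, and the bucket bounds are copied from the `Histogram` definition above.

```python
import bisect

# Upper bounds in seconds, copied from the Histogram definition above
BOUNDS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def cumulative_counts(observations, bounds=BOUNDS):
    """Count observations per bucket (value <= upper bound), then make the counts cumulative.

    The final entry is the implicit +Inf bucket, so it equals the total count.
    """
    per_bucket = [0] * (len(bounds) + 1)
    for value in observations:
        per_bucket[bisect.bisect_left(bounds, value)] += 1
    cumulative, running = [], 0
    for count in per_bucket:
        running += count
        cumulative.append(running)
    return cumulative


def estimate_quantile(q, cumulative, bounds=BOUNDS):
    """Estimate quantile q (0..1) by interpolating linearly inside the matching bucket."""
    total = cumulative[-1]
    if total == 0:
        return None
    rank = q * total
    for i, count in enumerate(cumulative):
        if count > 0 and count >= rank:
            if i == len(bounds):
                # Target falls in the +Inf bucket: the best we can say is the highest finite bound
                return bounds[-1]
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0
            fraction = (rank - below) / (count - below)
            return lower + (bounds[i] - lower) * fraction
    return bounds[-1]


latencies = [0.02, 0.03, 0.07, 0.2, 0.3, 0.4, 0.6, 0.9, 1.5, 3.0]
counts = cumulative_counts(latencies)
print(counts)
print(estimate_quantile(0.5, counts))
```

For this sample the estimated median is 0.375 s, while the true median lies between 0.3 s and 0.4 s. The estimate can only be as precise as the bucket boundaries, which is why bucket choice matters.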

Distributed Tracing

How Tracing Works

┌─────────────────────────────────────────────────────────────────┐
│              Distributed Trace                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Trace ID: abc123 (unique per request)                         │
│                                                                 │
│  API Gateway                                                    │
│  ├─ Span 1: gateway (trace_id=abc123, span_id=001)             │
│  │  │ start: 0ms, duration: 150ms                              │
│  │  │                                                          │
│  │  ├─ Span 2: auth-service (parent=001, span_id=002)         │
│  │  │  │ start: 5ms, duration: 20ms                           │
│  │  │  └─ tags: {user_id: "usr_123"}                          │
│  │  │                                                          │
│  │  ├─ Span 3: order-service (parent=001, span_id=003)        │
│  │  │  │ start: 30ms, duration: 100ms                         │
│  │  │  │                                                       │
│  │  │  ├─ Span 4: database (parent=003, span_id=004)          │
│  │  │  │  │ start: 35ms, duration: 45ms                       │
│  │  │  │  └─ tags: {query: "SELECT..."}                       │
│  │  │  │                                                       │
│  │  │  └─ Span 5: payment-service (parent=003, span_id=005)   │
│  │  │     │ start: 85ms, duration: 40ms                       │
│  │  │     └─ tags: {amount: 150.00}                           │
│  │  │                                                          │
│  │  └─ Span 6: notification (parent=001, span_id=006)         │
│  │     │ start: 135ms, duration: 10ms                         │
│  │     └─ tags: {type: "email"}                               │
│  │                                                              │
│  └─ Total: 150ms                                               │
│                                                                 │
│  Visualized:                                                   │
│  ┌──────────────────────────────────────────────────────┐     │
│  │ gateway                                              │     │
│  │  ├─auth──┤                                           │     │
│  │      ├──────────────order-service────────────────┤   │     │
│  │      │   ├────database────┤                      │   │     │
│  │      │                    ├────payment────┤      │   │     │
│  │                                              ├─notif─┤     │
│  └──────────────────────────────────────────────────────┘     │
│  0ms      25ms      50ms      75ms      100ms     125ms 150ms │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
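Once spans carry parent IDs and durations, finding where time actually went is a small tree computation. The sketch below uses the six spans from the diagram above and computes each span's self time (its duration minus the time spent in direct children). The tuple layout and the `self_times` helper are assumptions for this example, and the subtraction is only valid when sibling spans run one after another, as they do in the diagram.

```python
from collections import defaultdict

# (span_id, parent_id, name, start_ms, duration_ms), copied from the diagram above
SPANS = [
    ("001", None, "gateway", 0, 150),
    ("002", "001", "auth-service", 5, 20),
    ("003", "001", "order-service", 30, 100),
    ("004", "003", "database", 35, 45),
    ("005", "003", "payment-service", 85, 40),
    ("006", "001", "notification", 135, 10),
]


def self_times(spans):
    """Return {span name: duration minus the summed duration of its direct children}."""
    child_total = defaultdict(int)
    for _, parent_id, _, _, duration in spans:
        if parent_id is not None:
            child_total[parent_id] += duration
    return {
        name: duration - child_total[span_id]
        for span_id, _, name, _, duration in spans
    }


times = self_times(SPANS)
slowest = max(times, key=times.get)
print(times)
print("largest self time:", slowest)
```

Here the gateway's 150 ms total shrinks to 20 ms of its own work, and the database span turns out to be the largest single contributor.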

Implementing Tracing

import httpx
from fastapi import Request
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

# app, db, and OrderCreate are assumed to be defined elsewhere in the service

tracer = trace.get_tracer(__name__)

# Propagate context through HTTP headers
async def call_service(service_url: str, payload: dict, headers: dict):
    # Get current context and inject trace headers
    inject(headers)
    
    async with httpx.AsyncClient() as client:
        return await client.post(service_url, json=payload, headers=headers)

# Start a new span
@app.post("/orders")
async def create_order(request: Request, order: OrderCreate):
    # Extract context from incoming request
    context = extract(dict(request.headers))
    
    with tracer.start_as_current_span("create_order", context=context) as span:
        # Add attributes (searchable in UI)
        span.set_attribute("user_id", order.user_id)
        span.set_attribute("order_total", order.total)
        
        # Child span for database
        with tracer.start_as_current_span("db.insert_order") as db_span:
            db_span.set_attribute("db.system", "postgresql")
            db_span.set_attribute("db.statement", "INSERT INTO orders...")
            order_id = await db.insert_order(order)
        
        # Child span for external service
        with tracer.start_as_current_span("payment.charge") as pay_span:
            pay_span.set_attribute("payment.amount", order.total)
            try:
                await call_service(
                    "http://payment-service/charge",
                    {"amount": order.total},
                    {}
                )
            except Exception as e:
                pay_span.record_exception(e)
                pay_span.set_status(trace.Status(trace.StatusCode.ERROR))
                raise
        
        return {"order_id": order_id}
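With OpenTelemetry's default propagator, `inject` and `extract` read and write the W3C Trace Context `traceparent` header. The sketch below builds and parses that header by hand, purely to show what travels between services. The helper names are invented for this example, and real code should keep using the library's propagators.

```python
import re

# version - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id, span_id, sampled=True):
    """Format a version-00 traceparent header value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if the value is malformed."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
parsed = parse_traceparent(header)
print(header)
print(parsed)
```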

Structured Logging

Log Format Best Practices

import structlog

# Configure structured logging
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()

# Good: Structured, searchable, includes context
logger.info(
    "order_created",
    order_id="ord_123",
    user_id="usr_456",
    total=150.00,
    items_count=3,
    trace_id="abc123",
    duration_ms=45
)
# Output: {"event": "order_created", "order_id": "ord_123", 
#          "user_id": "usr_456", "total": 150.00, ...}

# Bad: Unstructured, hard to search
logger.info("Order ord_123 created for user usr_456 with total $150.00")
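The practical payoff of structured logs is that a query becomes a field comparison rather than a regular expression. The following standard-library sketch filters JSON log lines by exact field values. The `search` helper and the sample lines are made up for this example; a log platform would do the same thing with its own query language.

```python
import json

RAW_LINES = [
    '{"event": "order_created", "order_id": "ord_123", "user_id": "usr_456", "total": 150.0}',
    '{"event": "order_created", "order_id": "ord_124", "user_id": "usr_789", "total": 20.0}',
    '{"event": "payment_failed", "order_id": "ord_124", "user_id": "usr_789"}',
    'plain text line that is not JSON',
]


def search(lines, **criteria):
    """Return parsed entries whose fields equal every key=value in criteria."""
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines instead of failing the whole query
        if all(entry.get(key) == value for key, value in criteria.items()):
            matches.append(entry)
    return matches


by_user = search(RAW_LINES, user_id="usr_789")
created = search(RAW_LINES, event="order_created", user_id="usr_456")
print(len(by_user), created[0]["order_id"])
```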

Log Levels

┌─────────────────────────────────────────────────────────────────┐
│                   Log Level Guidelines                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DEBUG   │ Detailed diagnostic info                            │
│          │ Use: Development, troubleshooting                   │
│          │ Example: "Cache key: user:123, value: {...}"        │
│          │                                                      │
│  INFO    │ Normal operations                                   │
│          │ Use: Track key events                               │
│          │ Example: "Order created", "User logged in"          │
│          │                                                      │
│  WARNING │ Potential issues (but still working)                │
│          │ Use: Things that should be investigated             │
│          │ Example: "Rate limit approaching", "Retry succeeded"│
│          │                                                      │
│  ERROR   │ Operation failed (but service continues)            │
│          │ Use: Failed requests, exceptions                    │
│          │ Example: "Payment failed", "DB connection timeout"  │
│          │                                                      │
│  FATAL   │ Service is unusable                                 │
│          │ Use: Critical failures requiring immediate action   │
│          │ Example: "Database unreachable", "Config missing"   │
│                                                                 │
│  PRODUCTION LOG LEVELS:                                        │
│  • Default: INFO and above                                     │
│  • Per-service override for debugging                          │
│  • Never DEBUG by default in production (verbose, costly)      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
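One way to get "INFO by default, with a per-service override" is to resolve the level from configuration at startup. The sketch below reads it from a mapping such as `os.environ`. The variable names `LOG_LEVEL` and `LOG_LEVEL_<SERVICE>` are hypothetical conventions chosen for this example, not a standard.

```python
import logging

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_level(service, environ):
    """Pick the service-specific override, then the global level, then INFO."""
    service_key = "LOG_LEVEL_" + service.upper().replace("-", "_")
    name = environ.get(service_key) or environ.get("LOG_LEVEL") or "INFO"
    # Unknown names fall back to INFO rather than crashing the service at startup
    return LEVELS.get(name.upper(), logging.INFO)


env = {"LOG_LEVEL": "WARNING", "LOG_LEVEL_PAYMENT_SERVICE": "debug"}
logging.getLogger("payment-service").setLevel(resolve_level("payment-service", env))
logging.getLogger("order-service").setLevel(resolve_level("order-service", env))
print(logging.getLogger("payment-service").level, logging.getLogger("order-service").level)
```

In practice `environ` would be `os.environ`, so a single service can be switched to DEBUG temporarily without touching the rest of the fleet.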

Correlation IDs

from contextvars import ContextVar
import uuid

import structlog

# Context variable for request correlation
correlation_id_ctx: ContextVar[str] = ContextVar('correlation_id')

class CorrelationMiddleware:
    """
    Ensures every request has a correlation ID
    """
    
    async def __call__(self, request, call_next):
        # Get from header or generate new
        correlation_id = request.headers.get(
            'X-Correlation-ID',
            str(uuid.uuid4())
        )
        
        # Set in context for logging
        correlation_id_ctx.set(correlation_id)
        
        # Process request
        response = await call_next(request)
        
        # Add to response headers
        response.headers['X-Correlation-ID'] = correlation_id
        return response

# Logger wrapper that automatically includes the correlation ID
class CorrelationLogger:
    def _log(self, level, message, **kwargs):
        kwargs['correlation_id'] = correlation_id_ctx.get(None)
        # structlog's .log() expects a numeric level, so dispatch by method name
        getattr(structlog.get_logger(), level)(message, **kwargs)
    
    def info(self, message, **kwargs):
        self._log('info', message, **kwargs)
    
    def error(self, message, **kwargs):
        self._log('error', message, **kwargs)
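The same idea can be expressed with only the standard library by attaching a `logging.Filter` that copies the context variable onto every record. This is a minimal sketch, not a replacement for the wrapper above; the names `correlation_id_var`, `CorrelationFilter`, and `build_logger` are invented for this example.

```python
import io
import logging
from contextvars import ContextVar

# "-" is shown for log lines emitted outside any request
correlation_id_var: ContextVar[str] = ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto each record so formatters can use it."""

    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True  # never drop the record, only annotate it


def build_logger(stream):
    logger = logging.getLogger("correlation_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s %(correlation_id)s %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    return logger


stream = io.StringIO()
log = build_logger(stream)

token = correlation_id_var.set("req-42")  # what the middleware would do per request
log.info("order_created")
correlation_id_var.reset(token)           # restore the previous value afterwards
log.info("background_job")

print(stream.getvalue(), end="")
```

Resetting the token after the request keeps an ID from leaking into unrelated work that later runs in the same context.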

Alerting

Alert Design Principles

┌─────────────────────────────────────────────────────────────────┐
│                   Alerting Best Practices                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  GOOD ALERT:                                                    │
│  • Actionable (someone needs to do something)                  │
│  • Urgent (needs attention within SLA)                         │
│  • Clear (describes what's wrong and impact)                   │
│  • Rare (not noisy, doesn't cause alert fatigue)               │
│                                                                 │
│  BAD ALERT:                                                     │
│  • "CPU at 80%" (so what? is anything broken?)                 │
│  • Fires frequently (causes alert fatigue)                     │
│  • No clear action (what should I do?)                         │
│                                                                 │
│  ALERT ON SYMPTOMS, NOT CAUSES:                                │
│                                                                 │
│  BAD:  "Database CPU > 90%"                                 │
│  GOOD: "Order creation latency > 2s for 5 minutes"         │
│                                                                 │
│  Why? The symptom (slow orders) is what matters to users.      │
│  High CPU that doesn't affect users isn't urgent.              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
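The "latency > 2s for 5 minutes" rule above has two parts: a threshold and a hold period that filters out brief spikes. The sketch below shows one way such a rule could be evaluated. The `SustainedThresholdAlert` class is a made-up illustration, not how any particular alerting system is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThresholdAlert:
    threshold: float      # e.g. 2.0 seconds of latency
    hold_seconds: float   # e.g. 300 seconds (5 minutes)
    breach_started_at: Optional[float] = None

    def observe(self, timestamp, value):
        """Feed one evaluation; return True once the breach has lasted hold_seconds."""
        if value <= self.threshold:
            self.breach_started_at = None  # any recovery resets the timer
            return False
        if self.breach_started_at is None:
            self.breach_started_at = timestamp
        return timestamp - self.breach_started_at >= self.hold_seconds


alert = SustainedThresholdAlert(threshold=2.0, hold_seconds=300)

# One evaluation per minute with latency stuck at 2.5 s
firing = [alert.observe(t, 2.5) for t in range(0, 301, 60)]
print(firing)

# A recovery followed by a new spike starts the hold period again
recovered = alert.observe(360, 1.0)
respiked = alert.observe(420, 3.0)
print(recovered, respiked)
```

The hold period is what keeps this alert rare: a single slow minute never pages anyone, while a sustained user-visible slowdown does.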

Production-Ready Observability Implementation

Complete observability stack with metrics, logging, and tracing:
import time
import logging
import functools
from typing import Optional, Dict, Any, Callable
from dataclasses import dataclass, field
from datetime import datetime
from contextlib import contextmanager
from contextvars import ContextVar
import json
import uuid
import asyncio

# ============== Metrics Collection ==============
from prometheus_client import Counter, Histogram, Gauge, Info, REGISTRY
from prometheus_client.exposition import generate_latest

class MetricsCollector:
    """Production-ready metrics collection"""
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        
        # RED Method Metrics
        self.request_count = Counter(
            'http_requests_total',
            'Total HTTP requests',
            ['service', 'method', 'endpoint', 'status']
        )
        
        self.request_latency = Histogram(
            'http_request_duration_seconds',
            'HTTP request latency',
            ['service', 'method', 'endpoint'],
            buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        )
        
        self.request_in_progress = Gauge(
            'http_requests_in_progress',
            'HTTP requests currently being processed',
            ['service', 'endpoint']
        )
        
        # USE Method Metrics
        self.connection_pool_size = Gauge(
            'connection_pool_size',
            'Current connection pool size',
            ['service', 'pool_name']
        )
        
        self.queue_size = Gauge(
            'queue_size',
            'Current queue size',
            ['service', 'queue_name']
        )
        
        # Business Metrics
        self.orders_total = Counter(
            'orders_total',
            'Total orders processed',
            ['service', 'status', 'payment_method']
        )
        
        self.revenue_total = Counter(
            'revenue_total_cents',
            'Total revenue in cents',
            ['service', 'currency']
        )
        
        # Service Info
        self.service_info = Info(
            'service_info',
            'Service information'
        )
        self.service_info.info({
            'service': service_name,
            'version': '1.0.0'
        })
    
    def track_request(self, method: str, endpoint: str):
        """Decorator to track request metrics"""
        def decorator(func: Callable):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                self.request_in_progress.labels(
                    service=self.service_name,
                    endpoint=endpoint
                ).inc()
                
                start_time = time.time()
                status = "success"
                
                try:
                    result = await func(*args, **kwargs)
                    return result
                except Exception as e:
                    status = "error"
                    raise
                finally:
                    duration = time.time() - start_time
                    
                    self.request_count.labels(
                        service=self.service_name,
                        method=method,
                        endpoint=endpoint,
                        status=status
                    ).inc()
                    
                    self.request_latency.labels(
                        service=self.service_name,
                        method=method,
                        endpoint=endpoint
                    ).observe(duration)
                    
                    self.request_in_progress.labels(
                        service=self.service_name,
                        endpoint=endpoint
                    ).dec()
            
            return async_wrapper
        return decorator
    
    def export(self) -> bytes:
        """Export metrics in Prometheus format"""
        return generate_latest(REGISTRY)

# ============== Distributed Tracing ==============
trace_context: ContextVar[Dict[str, Any]] = ContextVar('trace_context', default={})

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    operation_name: str
    service_name: str
    start_time: float
    end_time: Optional[float] = None
    tags: Dict[str, Any] = field(default_factory=dict)
    logs: list = field(default_factory=list)
    status: str = "OK"
    
    def set_tag(self, key: str, value: Any) -> 'Span':
        self.tags[key] = value
        return self
    
    def log(self, message: str, **kwargs) -> 'Span':
        self.logs.append({
            "timestamp": time.time(),
            "message": message,
            **kwargs
        })
        return self
    
    def set_error(self, error: Exception) -> 'Span':
        self.status = "ERROR"
        self.tags["error"] = True
        self.tags["error.message"] = str(error)
        self.tags["error.type"] = type(error).__name__
        return self
    
    def finish(self) -> None:
        self.end_time = time.time()
    
    def duration_ms(self) -> float:
        if self.end_time is None:
            return 0
        return (self.end_time - self.start_time) * 1000
    
    def to_dict(self) -> Dict[str, Any]:
        return {
            "traceId": self.trace_id,
            "spanId": self.span_id,
            "parentSpanId": self.parent_span_id,
            "operationName": self.operation_name,
            "serviceName": self.service_name,
            "startTime": int(self.start_time * 1_000_000),
            "duration": int(self.duration_ms() * 1000),
            "tags": self.tags,
            "logs": self.logs,
            "status": self.status
        }

class Tracer:
    """Distributed tracing implementation"""
    
    def __init__(self, service_name: str, exporter: Optional['SpanExporter'] = None):
        self.service_name = service_name
        self.exporter = exporter or ConsoleSpanExporter()
    
    @contextmanager
    def start_span(self, operation_name: str, **tags):
        """Start a new span as context manager"""
        ctx = trace_context.get()
        
        trace_id = ctx.get('trace_id') or self._generate_id()
        parent_span_id = ctx.get('span_id')
        span_id = self._generate_id()
        
        span = Span(
            trace_id=trace_id,
            span_id=span_id,
            parent_span_id=parent_span_id,
            operation_name=operation_name,
            service_name=self.service_name,
            start_time=time.time(),
            tags=tags
        )
        
        # Set new context
        new_ctx = {
            'trace_id': trace_id,
            'span_id': span_id,
            'parent_span_id': parent_span_id
        }
        token = trace_context.set(new_ctx)
        
        try:
            yield span
        except Exception as e:
            span.set_error(e)
            raise
        finally:
            span.finish()
            self.exporter.export(span)
            trace_context.reset(token)
    
    def inject_headers(self) -> Dict[str, str]:
        """Inject trace context into HTTP headers"""
        ctx = trace_context.get()
        # Use `or ''` because a root span stores parent_span_id=None
        return {
            'X-Trace-ID': ctx.get('trace_id') or '',
            'X-Span-ID': ctx.get('span_id') or '',
            'X-Parent-Span-ID': ctx.get('parent_span_id') or ''
        }
    
    def extract_headers(self, headers: Dict[str, str]) -> None:
        """Extract trace context from HTTP headers"""
        # Header names are case-insensitive, and ASGI frameworks such as
        # Starlette deliver them lowercased, so normalise before lookup.
        lowered = {key.lower(): value for key, value in headers.items()}
        ctx = {
            'trace_id': lowered.get('x-trace-id') or self._generate_id(),
            'span_id': lowered.get('x-span-id'),
            'parent_span_id': lowered.get('x-parent-span-id')
        }
        trace_context.set(ctx)
    
    def _generate_id(self) -> str:
        return uuid.uuid4().hex[:16]

class ConsoleSpanExporter:
    def export(self, span: Span) -> None:
        print(json.dumps(span.to_dict(), indent=2))

class JaegerExporter:
    """Export spans to Jaeger"""
    
    def __init__(self, agent_host: str = "localhost", agent_port: int = 6831):
        self.agent_host = agent_host
        self.agent_port = agent_port
        self.batch: list = []
        self.batch_size = 100
    
    def export(self, span: Span) -> None:
        self.batch.append(span.to_dict())
        
        if len(self.batch) >= self.batch_size:
            self._flush()
    
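    def _flush(self) -> None:
        if not self.batch:
            return
        
        # Illustrative only: a real Jaeger agent expects Thrift-encoded batches
        # on this UDP port, not JSON, and a large batch can exceed the maximum
        # datagram size. Use an official exporter in production.
        import socket
        payload = json.dumps({"spans": self.batch}).encode()
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (self.agent_host, self.agent_port))
        
        self.batch = []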

# ============== Structured Logging ==============
correlation_id_ctx: ContextVar[str] = ContextVar('correlation_id', default='')

class StructuredLogger:
    """Production structured logging with correlation"""
    
    def __init__(self, service_name: str, log_level: str = "INFO"):
        self.service_name = service_name
        self.logger = logging.getLogger(service_name)
        self.logger.setLevel(getattr(logging, log_level))
        
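        # JSON formatter (attach once, so creating the logger twice
        # for the same service does not duplicate every log line)
        if not self.logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(JsonFormatter())
            self.logger.addHandler(handler)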
    
    def _enrich(self, extra: Dict[str, Any]) -> Dict[str, Any]:
        """Add standard fields to log entry"""
        ctx = trace_context.get()
        return {
            "service": self.service_name,
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "trace_id": ctx.get('trace_id', ''),
            "span_id": ctx.get('span_id', ''),
            "correlation_id": correlation_id_ctx.get(''),
            **extra
        }
    
    def info(self, message: str, **kwargs) -> None:
        self.logger.info(message, extra=self._enrich(kwargs))
    
    def error(self, message: str, **kwargs) -> None:
        self.logger.error(message, extra=self._enrich(kwargs))
    
    def warning(self, message: str, **kwargs) -> None:
        self.logger.warning(message, extra=self._enrich(kwargs))
    
    def debug(self, message: str, **kwargs) -> None:
        self.logger.debug(message, extra=self._enrich(kwargs))

class JsonFormatter(logging.Formatter):
    """Format logs as JSON"""
    
    def format(self, record: logging.LogRecord) -> str:
        log_entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        
        # Add extra fields
        if hasattr(record, '__dict__'):
            for key, value in record.__dict__.items():
                if key not in ['name', 'msg', 'args', 'levelname', 'levelno',
                              'pathname', 'filename', 'module', 'lineno',
                              'funcName', 'created', 'msecs', 'relativeCreated',
                              'thread', 'threadName', 'processName', 'process',
                              'getMessage', 'exc_info', 'exc_text', 'stack_info']:
                    log_entry[key] = value
        
        # default=str keeps non-JSON values (datetimes, exceptions) from raising
        return json.dumps(log_entry, default=str)

# ============== Full Observability Middleware ==============
class ObservabilityMiddleware:
    """Combined metrics, tracing, and logging middleware"""
    
    def __init__(
        self, 
        app, 
        service_name: str,
        metrics: MetricsCollector,
        tracer: Tracer,
        logger: StructuredLogger
    ):
        self.app = app
        self.service_name = service_name
        self.metrics = metrics
        self.tracer = tracer
        self.logger = logger
    
    async def __call__(self, request, call_next):
        # Extract or generate correlation ID
        correlation_id = request.headers.get(
            'X-Correlation-ID', 
            str(uuid.uuid4())
        )
        correlation_id_ctx.set(correlation_id)
        
        # Extract trace context
        self.tracer.extract_headers(dict(request.headers))
        
        endpoint = request.url.path
        method = request.method
        
        with self.tracer.start_span(
            f"{method} {endpoint}",
            http_method=method,
            http_url=str(request.url)
        ) as span:
            
            self.logger.info(
                "request_started",
                method=method,
                path=endpoint,
                user_agent=request.headers.get('user-agent')
            )
            
            start_time = time.time()
            
            try:
                response = await call_next(request)
                
                span.set_tag("http.status_code", response.status_code)
                
                self.logger.info(
                    "request_completed",
                    method=method,
                    path=endpoint,
                    status_code=response.status_code,
                    duration_ms=round((time.time() - start_time) * 1000, 2)
                )
                
                # Add trace headers to response
                response.headers['X-Trace-ID'] = span.trace_id
                response.headers['X-Correlation-ID'] = correlation_id
                
                return response
                
            except Exception as e:
                span.set_error(e)
                
                self.logger.error(
                    "request_failed",
                    method=method,
                    path=endpoint,
                    error=str(e),
                    error_type=type(e).__name__,
                    duration_ms=round((time.time() - start_time) * 1000, 2)
                )
                
                raise

Alert Template

# Example Prometheus alerting rule (routed to PagerDuty/Opsgenie via Alertmanager)
alert: OrderServiceHighLatency
expr: histogram_quantile(0.99, sum by (le) (rate(order_request_duration_seconds_bucket[5m]))) > 2
for: 5m
labels:
  severity: critical
  service: order-service
  team: payments
annotations:
  summary: "Order service p99 latency is {{ $value }}s (threshold: 2s)"
  description: |
    Impact: Users are experiencing slow checkout.
    
    Dashboard: https://grafana.example.com/d/orders
    Runbook: https://wiki.example.com/runbooks/order-latency
    
    Possible causes:
    - Database connection pool exhausted
    - Payment service slow
    - Increased traffic
    
    Immediate actions:
    1. Check dashboard for traffic spike
    2. Check payment-service health
    3. Check database metrics
  runbook_url: https://wiki.example.com/runbooks/order-latency

Alert Severity Levels

┌─────────────────────────────────────────────────────────────────┐
│                   Severity Levels                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SEV-1 (Critical) - Page immediately, all hands                │
│  ─────────────────────────────────────────────                  │
│  • Service is down                                             │
│  • Data loss occurring                                         │
│  • Security breach                                             │
│  Response: 15 minutes                                          │
│                                                                 │
│  SEV-2 (High) - Page on-call                                   │
│  ───────────────────────────                                    │
│  • Major feature degraded                                      │
│  • Significant latency increase                                │
│  • Error rate > 5%                                             │
│  Response: 30 minutes                                          │
│                                                                 │
│  SEV-3 (Medium) - Ticket, fix during business hours            │
│  ───────────────────────────────────────────────                │
│  • Non-critical feature broken                                 │
│  • Performance degradation (not severe)                        │
│  Response: 4 hours                                             │
│                                                                 │
│  SEV-4 (Low) - Ticket, fix when convenient                     │
│  ─────────────────────────────────────────                      │
│  • Minor issues                                                │
│  • Cosmetic problems                                           │
│  Response: 1 week                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Dashboards

Dashboard Design

┌─────────────────────────────────────────────────────────────────┐
│                   Dashboard Layout                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  TOP: High-level health (GREEN/YELLOW/RED)                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  [OK] API Gateway  [OK] Orders  [WARN] Payments  [OK] Database    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  GOLDEN SIGNALS (RED method):                                  │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐  │
│  │ Request Rate    │ │ Error Rate      │ │ Latency p99     │  │
│  │     ^           │ │     v           │ │     ~           │  │
│  │  1,234 req/s    │ │    0.05%        │ │    145ms        │  │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘  │
│                                                                 │
│  RESOURCE UTILIZATION (USE method):                            │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐  │
│  │ CPU             │ │ Memory          │ │ Disk I/O        │  │
│  │     ▂▃▅▆▇       │ │     ▃▃▃▃▃       │ │     ▁▂▁▂▁       │  │
│  │    65%          │ │    72%          │ │    15%          │  │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘  │
│                                                                 │
│  DEPENDENCIES:                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Payment API: 50ms │ Database: 10ms │ Redis: 1ms        │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  BOTTOM: Detailed graphs, logs, traces                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

SLIs, SLOs, and SLAs

┌─────────────────────────────────────────────────────────────────┐
│              SLI / SLO / SLA                                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SLI (Service Level Indicator)                                 │
│  ─────────────────────────────                                  │
│  The metric you measure                                        │
│  Example: "99th percentile latency of API requests"            │
│                                                                 │
│  SLO (Service Level Objective)                                 │
│  ─────────────────────────────                                  │
│  The target you aim for (internal)                             │
│  Example: "99th percentile latency < 200ms"                    │
│                                                                 │
│  SLA (Service Level Agreement)                                 │
│  ─────────────────────────────                                  │
│  The contract with customers (external)                        │
│  Example: "99.9% availability or refund"                       │
│  Note: SLA should be looser than SLO (buffer)                  │
│                                                                 │
│  Example SLO Document:                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Service: Order API                                     │   │
│  │  ────────────────────                                    │   │
│  │  Availability: 99.95% of requests successful            │   │
│  │  Latency: 99% of requests < 500ms                       │   │
│  │  Throughput: Handle 10,000 req/s                        │   │
│  │                                                         │   │
│  │  Error Budget: 0.05% = 21.6 minutes/month               │   │
│  │  Current Budget Remaining: 15.3 minutes                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
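The error budget arithmetic in the SLO document can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the 6.3 minutes of downtime is an assumed figure chosen so the result matches the "15.3 minutes" remaining shown in the box.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of error budget still unspent (negative means overspent)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.9995)        # 99.95% over a 30-day window
remaining = budget_remaining(0.9995, 6.3)    # assumed: 6.3 minutes of downtime so far
print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
```

The same calculation works for request-based SLOs if minutes are replaced by request counts.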

Senior Interview Questions

Systematic approach:
  1. Check dashboards: RED metrics, identify when it started
  2. Correlate: Deployments? Traffic spike? Dependency issue?
  3. Trace analysis: Find slow spans in traces
  4. Log analysis: Search for errors around that time
  5. Narrow down: Which endpoint? Which users?
Common causes:
  • Database slow queries (check slow query log)
  • GC pauses (check GC metrics)
  • Connection pool exhaustion
  • External dependency slowdown
  • Lock contention
Checklist:
  1. Instrument code: Add metrics (RED method)
  2. Add tracing: Propagate trace context
  3. Structured logging: With correlation IDs
  4. Create dashboard: Health, golden signals, resources
  5. Set up alerts: On symptoms, not causes
  6. Document SLOs: Define success criteria
  7. Create runbooks: What to do when alerts fire
Strategies:
  1. Alert on symptoms: User impact, not causes
  2. Use thresholds wisely: Alert on a sustained breach (e.g., above 80% for 5 minutes), not an instantaneous spike
  3. Group related alerts: One page per incident, not 10
  4. Regular review: Delete unused, tune noisy alerts
  5. Escalation policy: Low-priority → ticket, high → page
  6. On-call feedback: Track alert quality metrics
Key metric: If on-call is paged but no action needed, fix the alert!
Architecture:
  1. Collection: Agent or exporter on each host (Prometheus exporters, StatsD)
  2. Aggregation: Pre-aggregate at edge (reduce cardinality)
  3. Storage: Time-series DB (InfluxDB, M3DB, Thanos)
  4. Query: Federation for cross-cluster queries
  5. Visualization: Grafana dashboards
Scale challenges:
  • High cardinality labels (user_id) → Aggregate
  • Long retention → Downsampling
  • Many metrics → Drop unused
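One common way to "aggregate" a high-cardinality label is to normalize it before it reaches the metrics library. The sketch below is an assumption about how that could look, not a prescribed implementation: `normalize_path` and the `:id` / `:uuid` placeholders are hypothetical names, and real services often take the route template from their web framework instead.

```python
import re

# Hypothetical rule set: numeric IDs and UUIDs become placeholders so that
# /users/12345 and /users/67890 end up in the same time series.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_path(path: str) -> str:
    """Collapse per-entity path segments into a bounded set of label values."""
    segments = []
    # Drop the query string first; it is unbounded and never a useful label.
    for segment in path.split("?", 1)[0].split("/"):
        if segment.isdigit():
            segments.append(":id")
        elif _UUID.match(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments)


print(normalize_path("/users/12345"))
print(normalize_path("/orders/550e8400-e29b-41d4-a716-446655440000/items?page=2"))
```

The label's cardinality is then bounded by the number of routes rather than the number of users or orders.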

Interview Deep-Dive Questions

What the interviewer is really testing: Whether you can use all three observability pillars in a coordinated investigation, not just name them. The interviewer wants to see a systematic debugging methodology, not random guessing.
Strong Answer:
  • Start with metrics to scope the problem. Check the RED dashboard for the checkout service: is the spike affecting all endpoints or just one? Is it all users or a specific segment? Is the error rate also elevated, or is it purely a latency issue? Correlate with infrastructure metrics (USE method): is CPU saturated? Is memory under pressure (swap usage, GC pauses)? Is there a spike in database connection pool utilization or queue depth?
  • Metrics narrow the blast radius. Let’s say you find: p99 latency is up for the /checkout/complete endpoint only, error rate is unchanged, and the database connection pool is at 95% utilization. That tells you the database is likely the bottleneck.
  • Now switch to traces. Pull up a few traces from the last 10 minutes where the checkout took longer than 2 seconds. Distributed tracing shows the span breakdown: the checkout-service span is 3 seconds total, of which payment-service took 50ms and inventory-service took 30ms — both normal. But the db.query span inside checkout-service took 2.8 seconds. Open the span attributes: the query is SELECT * FROM orders WHERE user_id = ? AND status = 'pending'.
  • Now switch to logs. Search for log entries correlated with the trace ID from one of those slow traces. You find a log line: “Slow query detected: 2800ms, table=orders, missing index on (user_id, status).” Check the deployment log: a migration ran at 1:45 AM that added a new column to the orders table, and the migration accidentally dropped an index.
  • Root cause identified: the deployment at 1:45 AM dropped an index, causing a full table scan on a query that previously used the index. Fix: recreate the index. Immediate mitigation: either roll back the migration or run CREATE INDEX CONCURRENTLY on the affected columns.
  • The key methodology: metrics tell you WHAT is wrong and how widespread it is, traces tell you WHERE in the request path the time is going, and logs tell you WHY that component is slow. Using them in this order (metrics first, then traces, then logs) is the most efficient debugging path.
  • Example: Stripe’s internal debugging workflow follows exactly this pattern. They call it “start wide, go deep.” Their dashboards show service-level RED metrics with drill-down into per-endpoint metrics, which link directly to example traces for slow requests, which link to correlated logs. An engineer can go from “something is slow” to “this specific query is missing an index” in under 5 minutes.
Follow-up: The on-call engineer finds the root cause but notices that the alert fired 15 minutes after the deployment. How would you improve the alerting to catch this faster?
The alert is based on p99 latency exceeding a threshold, but it has a smoothing window (probably 5-10 minutes of sustained breach before firing). The 15-minute delay is the smoothing window plus the time for the latency spike to propagate. Improvements: (1) Add a deployment-correlated alert that automatically watches key metrics for 10 minutes after any deployment and fires immediately if they degrade beyond a baseline. (2) Use anomaly detection instead of static thresholds — a sudden jump from 200ms to 3 seconds should trigger faster than waiting for a sustained breach. (3) Integrate with the CI/CD pipeline: if the deployment system knows it ran a database migration, it should automatically verify that query performance metrics are stable before marking the deployment as successful (canary analysis).
What the interviewer is really testing: Whether you understand the economics of observability (what drives cost) and can make intelligent trade-offs between visibility and budget.
Strong Answer:
  • Observability costs are driven by three factors: data volume (how many metrics, logs, and traces you ingest), data retention (how long you store it), and data cardinality (how many unique time series your metrics create). Cutting cost means reducing one or more of these without losing the ability to debug production issues.
  • Metrics cost reduction: (1) Audit metric cardinality. A single metric with a high-cardinality label (like user_id or request_path with thousands of unique values) can create millions of time series. Replace high-cardinality labels with bucketed versions (e.g., replace request_path=/users/12345 with request_path_group=/users/:id). (2) Drop unused metrics. Query your metrics backend to find metrics that have not been queried in 90 days. If nobody is looking at them, stop collecting them. (3) Reduce collection frequency for low-priority services. Internal batch jobs do not need 10-second resolution metrics — 60-second is fine.
  • Logs cost reduction: (1) Reduce log volume at the source. Debug-level logging in production is almost never needed — set production log level to INFO or WARN. A single service logging at DEBUG can produce 10x the volume of INFO-level logging. (2) Use sampling for high-volume logs. If your API gateway logs every request, sample 10% of successful requests but keep 100% of errors. (3) Implement log-to-metrics pipelines: instead of storing every “request completed” log line, extract the latency and status code into a metric at the edge, then drop the log line. Metrics are orders of magnitude cheaper to store than logs.
  • Traces cost reduction: (1) Use tail-based sampling: keep 100% of error and slow traces, sample 1% of successful traces. This can reduce trace volume by 90% while retaining every trace you would actually need for debugging. (2) Reduce span depth — trace at service boundaries, not at every function call within a service.
  • Retention tiers: keep high-resolution data for 7 days (recent incidents), downsample to 1-minute resolution for 30 days, 1-hour resolution for 1 year. Most vendors charge significantly less for lower-resolution or archived data.
  • The 50% cut is achievable by combining: metric cardinality audit (20% savings), log level and sampling changes (40% savings on log costs, which are typically the largest line item), trace sampling (30% savings on trace costs), and retention tiering (15% savings across the board).
  • Example: Uber reduced their observability costs by 40% by building an internal “metric governance” system that automatically detected and alerted on high-cardinality metrics before they exploded the time series count. They also moved to a tiered storage model where data older than 72 hours was automatically downsampled and moved to cheaper storage.
Follow-up: After cutting costs, an incident occurs and the on-call engineer cannot find the relevant logs because they were sampled out. How do you handle this, and how do you prevent it in the future?
This is the inherent trade-off of sampling — you will occasionally lose data you needed. Mitigations: (1) Never sample error logs or logs associated with errored traces (tail-based sampling should guarantee this). (2) Implement “debug mode” per service: an API that temporarily sets a service to 100% log collection for 30 minutes when an engineer is actively debugging. This costs more during the debug window but is capped in duration. (3) Keep raw logs in cheap storage (S3) for 30 days before discarding. They are not indexed and not searchable in your observability tool, but an engineer can retrieve and search them manually if needed. The cost of S3 storage is a fraction of indexed log storage.
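The "sample successes, keep every error" rule can be expressed as a standard-library logging filter. This is a sketch under assumptions: the class name, the `api_gateway` logger name, and the choice of WARNING as the "always keep" cutoff are illustrative, and many teams apply this rule in the log shipper rather than in the application.

```python
import logging
import random


class SuccessSamplingFilter(logging.Filter):
    """Keep every WARNING+ record; keep only a fraction of lower-level records."""

    def __init__(self, sample_rate: float = 0.10, rng=None):
        super().__init__()
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # warnings and errors are never sampled out
        return self._rng.random() < self.sample_rate


# Hypothetical logger name; roughly 10% of INFO records pass, all errors pass.
sampler = SuccessSamplingFilter(sample_rate=0.10)
logger = logging.getLogger("api_gateway")
logger.addFilter(sampler)
```

A temporary "debug mode" then amounts to setting `sampler.sample_rate = 1.0` for a bounded period and restoring it afterwards.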
What the interviewer is really testing: Whether you understand the hierarchy of service level concepts and can translate business requirements into measurable technical targets.
Strong Answer:
  • SLI (Service Level Indicator) is the metric itself — a quantitative measurement of one aspect of service quality. SLO (Service Level Objective) is the target value for that SLI — what “good enough” looks like. SLA (Service Level Agreement) is a contractual commitment with consequences (usually financial) if the agreed service level is not met. The relationship: SLIs measure, SLOs set goals, SLAs have teeth.
  • For a payment processing service, I would define these SLIs and SLOs:
  • Availability SLI: the proportion of successful payment API requests (non-5xx responses) out of total requests, measured over a rolling 30-day window. SLO: 99.95% (allows 21.6 minutes of downtime per month). Why not 99.99%? Because the payment service depends on external providers (Stripe, banks) that themselves have SLAs around 99.95%, and your SLO cannot meaningfully exceed your dependencies’ reliability.
  • Latency SLI: p50 and p99 of payment processing time (from API request received to response sent). SLO: p50 under 300ms, p99 under 2 seconds. The p99 is generous because some payments require 3D Secure verification or bank redirects that add legitimate latency.
  • Correctness SLI: the proportion of payments where the amount charged matches the amount requested, measured by daily reconciliation. SLO: 99.999% (1 in 100,000 transactions may have a discrepancy, immediately flagged for investigation). Correctness has a much tighter SLO than availability because a wrong charge erodes trust far more than a brief outage.
  • The SLA (external contract with merchants): “Payment API will be available 99.9% of the time per calendar month. If availability falls below 99.9%, affected merchants receive a 10% credit on their monthly processing fees. If below 99.5%, a 25% credit.” Notice the SLA is looser than the SLO — the SLO is 99.95% but the SLA promises 99.9%. This gap between the SLO and SLA is your buffer. If you are burning through the SLO error budget, you freeze deployments and focus on reliability before the SLA is breached.
  • Error budget: at 99.95% SLO over 30 days, you have approximately 21.6 minutes of allowed downtime. Track this as a “remaining error budget” metric. When the budget is below 50%, prioritize reliability work. When it is below 25%, freeze non-critical deployments.
  • Example: Google’s internal SRE practices (documented in their SRE book) emphasize that SLOs should be set based on user expectations, not engineering ambition. Their payments team sets SLOs slightly below what they can actually achieve, so that the error budget is a meaningful lever for prioritizing reliability vs. feature work.
Follow-up: The payment service has been running at 99.99% availability for six months, well above the 99.95% SLO. An engineer argues this means you can afford to take more risks with deployments and move faster. Do you agree?
Partially. An overshoot of the SLO does mean you have error budget to spend, and spending it on velocity is legitimate. But there is a nuance: if you are consistently at 99.99% with the SLO at 99.95%, it might mean the SLO is too loose, not that you should deploy more recklessly. Check with your users and stakeholders — have they started depending on 99.99%? If so, you have an implicit SLO that is higher than your explicit one, and degrading to 99.95% will feel like a regression even though it is “within SLO.” The correct response is to either raise the SLO to match actual user expectations or explicitly communicate that reliability may vary within the SLO range. Then yes, use the error budget to deploy faster, run more experiments, and accept controlled risk.

Interview Questions

Strong Answer:
  • A Histogram counts observations into buckets defined by configurable upper bounds (e.g., ≤10ms, ≤50ms, ≤100ms) and stores a cumulative count per bucket, plus a total sum and count. Quantiles (p50, p95, p99) are computed at query time using histogram_quantile(). A Summary calculates quantiles client-side (inside the application process) and exposes them directly as pre-computed values.
  • The critical difference is aggregation. Histograms can be aggregated across multiple instances — if you have 20 pods, you can compute a meaningful p99 across all 20 by summing the bucket counts. Summaries cannot be meaningfully aggregated because you cannot merge pre-computed quantiles from separate processes into a correct global quantile. Averaging p99 values from 20 pods does not give you the true p99.
  • In practice, Histograms are almost always the right choice for production services because you always want to aggregate across instances. The only case where Summary wins is if you need exact quantiles from a single process with minimal query-time computation (e.g., a batch job running on a single node where the quantile must be precise, not interpolated from buckets).
  • The trade-off with Histograms is choosing good bucket boundaries. If your buckets are too coarse (e.g., 100ms, 500ms, 1s), your p99 estimate between 100ms and 500ms is imprecise. If they are too fine, you generate more time series (each bucket is a separate time series), increasing storage and scrape cost. A good default for HTTP latency, in seconds, is [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10], which matches the default buckets of the Prometheus client libraries.
  • Example: At a company running 50 replicas of an API service, switching from Summary to Histogram for request latency metrics enabled accurate global p99 calculations for the first time. The previous Summary-based dashboards were showing “average of p99 across pods,” which masked a single hot pod with a true p99 of 2 seconds behind the averaged value of 400ms.
Red flag answer: “They are basically the same thing, just different ways to measure latency.” This misses the aggregation problem entirely, which is the core reason Histograms dominate in production.
Follow-ups:
  1. You set up a Histogram with default buckets but your service has a bimodal latency distribution — 90% of requests complete in 5ms and 10% take 800ms due to cache misses. How would you adjust the bucket boundaries, and what would the default buckets get wrong?
  2. Your Prometheus storage is growing rapidly and you discover that a single Histogram metric with 5 labels is generating 500K time series. Walk me through how you would diagnose and fix this cardinality explosion.
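The aggregation argument from the Histogram vs. Summary answer can be made concrete with a small simulation. The sketch below approximates the linear interpolation that `histogram_quantile()` performs over cumulative buckets; it is a simplified re-implementation for illustration, not Prometheus's code, and the bucket layout and per-pod counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: dict) -> float:
    """Estimate the q-quantile from cumulative counts keyed by bucket upper bound."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]          # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                return prev_bound        # fell into +Inf: report last finite bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def merge(*histograms: dict) -> dict:
    """Sum cumulative counts per bucket across instances with the same layout."""
    merged = {}
    for histogram in histograms:
        for bound, count in histogram.items():
            merged[bound] = merged.get(bound, 0) + count
    return merged


# Invented data: one healthy pod and one hot pod, 1000 requests each.
healthy = {0.1: 900, 0.5: 1000, 1.0: 1000, 2.5: 1000, math.inf: 1000}
hot = {0.1: 500, 0.5: 700, 1.0: 800, 2.5: 1000, math.inf: 1000}

avg_of_p99 = (histogram_quantile(0.99, healthy) + histogram_quantile(0.99, hot)) / 2
global_p99 = histogram_quantile(0.99, merge(healthy, hot))
print(f"average of per-pod p99: {avg_of_p99:.2f}s, merged p99: {global_p99:.2f}s")
```

With these numbers the averaged value comes out around 1.44s while the merged histogram gives about 2.35s, which is the masking effect described in the example above.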
Strong Answer:
  • A correlation ID is a unique identifier generated at the entry point of a request (usually the API gateway or first service to receive the request) and propagated through every downstream service call via HTTP headers (commonly X-Correlation-ID). Its purpose is to tie together all logs, events, and side effects that belong to the same user-initiated action, even across asynchronous boundaries like message queues.
  • A trace ID (from distributed tracing) serves a similar linking purpose but lives within the tracing system specifically. The trace ID connects spans in a distributed trace. A correlation ID is broader — it appears in logs, metrics tags, audit trails, error reports, and even database records. Think of the trace ID as the tracing system’s view of a request, and the correlation ID as the business-level identity of the entire operation.
  • In practice, many teams use the trace ID as the correlation ID (they are the same value). This works well if all your systems participate in tracing. But some systems do not produce traces — message queues, batch processors, third-party webhook handlers. The correlation ID survives in places where tracing does not. For example, if a request triggers an async job via Kafka, the trace often ends at the Kafka producer (unless trace context is also propagated in the message headers), but the correlation ID is embedded in the Kafka message and picked up by the consumer, allowing you to search logs across the sync and async boundary.
  • Implementation detail: use ContextVar (Python) or AsyncLocalStorage (Node.js) to propagate the correlation ID within a service without passing it explicitly through every function. The middleware sets it on request entry, and the structured logger automatically includes it in every log line.
  • Example: During a production incident at an e-commerce company, a customer reported being double-charged. The support team searched logs by the correlation ID from the customer’s API response header and found the entire lifecycle: the initial checkout request, the payment service call, a timeout, the retry (with the same correlation ID), and the payment service processing both the original and the retry because the idempotency key was not checked. Without the correlation ID linking these events across 4 services and a retry boundary, finding this would have taken hours instead of minutes.
Red flag answer: “It is just a request ID that you log.” This misses the cross-service propagation, the difference from trace IDs, and the practical importance for debugging.
Follow-ups:
  1. A request enters your system, triggers 3 synchronous service calls, then publishes to Kafka, where 2 consumers process the message asynchronously. How do you ensure the correlation ID flows through the entire chain, including the async leg?
  2. Your team is debating whether to use the OpenTelemetry trace ID as the correlation ID or maintain a separate one. What are the arguments for and against each approach?
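The ContextVar approach mentioned in the correlation ID answer might look like the following in Python. This is a minimal sketch: the function and class names are hypothetical, headers are modeled as a plain dict with exact-case keys, and a real service would wire `begin_request` into its web framework's middleware.

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the current request (or async task) context.
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True


def begin_request(headers: dict) -> str:
    """Reuse the caller's correlation ID if present, otherwise mint a new one."""
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id_var.set(cid)
    return cid


def outgoing_headers() -> dict:
    """Headers to forward on downstream HTTP calls or as Kafka message headers."""
    return {"X-Correlation-ID": correlation_id_var.get()}


handler = logging.StreamHandler()
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(logging.Formatter("%(levelname)s cid=%(correlation_id)s %(message)s"))
```

On the consumer side of a queue, the same `begin_request` call can be fed the message headers so that logs on both sides of the async boundary share one ID.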
Strong Answer:
  • These are cause-based alerts, not symptom-based alerts. High CPU or high disk usage is not inherently a problem — it is a problem only if it impacts users. A server running at 85% CPU while serving all requests within SLO is healthy. Alerting on it creates noise and trains on-call engineers to ignore pages, which is the definition of alert fatigue.
  • The principle is: alert on symptoms (what users experience), investigate causes (what the infrastructure is doing). Symptoms are things like “error rate exceeded 1% for 5 minutes,” “p99 latency exceeded 2 seconds,” or “order completion rate dropped below baseline.” These tell you that users are suffering and action is needed. The CPU and disk metrics are useful for diagnosis after the symptom alert fires, not as alert triggers themselves.
  • There is one exception: predictive alerts for resource exhaustion. “Disk is 90% full” is a bad alert. “At the current growth rate, disk will be 100% full in 4 hours” is a good alert because it predicts an imminent user-facing failure (the service will crash when the disk fills up). Similarly, “connection pool is at 95% capacity” is worth alerting on because at 100% the service stops accepting new requests.
  • A well-designed alert has four properties: it is actionable (someone needs to do something), it is urgent (it cannot wait until Monday), it is clear (the alert message says what is wrong and links to a runbook), and it is rare (it does not fire frequently enough to be ignored).
  • Example: A team at a fintech company had 47 active alert rules. On-call engineers were paged 15-20 times per week, mostly for “CPU > 80%” on batch processing servers that regularly spiked to 90% during scheduled jobs — completely normal behavior. After a quarterly alert review, they reduced to 12 alerts, all symptom-based (error rates, latency SLO breaches, queue depth growing for more than 10 minutes). On-call pages dropped to 2-3 per week, and the Mean Time to Acknowledge improved from 12 minutes to 3 minutes because engineers trusted that every page was real.
Red flag answer: “CPU and disk alerts are fine because you need to know when resources are running out.” This shows a lack of understanding of alert fatigue and the symptom-vs-cause distinction.
Follow-ups:
  1. Your organization has 300 alert rules across all services. How would you audit them and decide which to keep, modify, or delete?
  2. A junior engineer pushes a new alert: “If any single request takes longer than 5 seconds, page on-call.” Why is this problematic, and how would you redesign it?
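The predictive "disk will be full in 4 hours" alert from the answer above comes down to a linear extrapolation, similar in spirit to PromQL's `predict_linear()`. The sketch below is an illustrative least-squares version; the function names, units, and the 4-hour horizon default are assumptions.

```python
from typing import List, Optional, Tuple


def hours_until_full(samples: List[Tuple[float, float]], capacity: float) -> Optional[float]:
    """Fit a line through (hour, used) samples and extrapolate to capacity.

    Returns hours from the last sample until usage reaches capacity,
    or None when usage is flat, shrinking, or there is too little data.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def should_page(samples, capacity, horizon_hours: float = 4.0) -> bool:
    """Page only when the projected time to exhaustion is within the horizon."""
    eta = hours_until_full(samples, capacity)
    return eta is not None and eta <= horizon_hours


# Disk at 80%, 85%, 90% over three hourly samples: growing 5 points per hour.
print(should_page([(0, 80), (1, 85), (2, 90)], capacity=100))
# Disk steady at 90%: high usage, but nothing is about to break.
print(should_page([(0, 90), (1, 90), (2, 90)], capacity=100))
```

The second case shows why the static "disk > 90%" rule is noisy: usage is high but not trending toward failure.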
Strong Answer:
  • An error budget is the complement of your SLO (1 − SLO), expressed as an allowed amount of unreliability. If your SLO is 99.95% availability over 30 days, you are allowed 0.05% failure, which translates to about 21.6 minutes of full downtime, or equivalently about 1 in every 2,000 requests returning errors. The error budget is that 21.6 minutes — a quantified “budget” that the team can “spend” on risky activities like deployments, migrations, or experiments.
  • The organizational impact is transformative. Without error budgets, product teams and reliability teams are in constant tension. Product wants to ship features fast (which introduces risk), reliability wants to freeze everything (which preserves uptime). Error budgets resolve this by making the trade-off explicit and data-driven: if the error budget has plenty remaining, the product team ships aggressively. If the budget is nearly exhausted, the team prioritizes reliability work. Neither side “wins” — the data decides.
  • In practice, error budget policies define what happens at different thresholds. Above 50% budget remaining: normal development velocity, deploy at will. Between 25-50%: increase testing rigor, require canary deployments, postpone risky migrations. Below 25%: feature freeze, all engineering effort goes to reliability improvements. Budget exhausted (0%): no deployments except reliability fixes, postmortem required.
  • The error budget also prevents over-investing in reliability. If you consistently have 90% of your budget remaining at the end of each month, you are either too conservative (deploy more aggressively) or your SLO is too loose (tighten it). The budget should be meaningfully spent — ideally 30-60% consumed in a normal month.
  • Example: Google’s SRE book describes how the Ads team uses error budgets. When their search ads serving system was too reliable (consistently 99.999% against a 99.99% SLO), the SRE team actually encouraged the product team to take more risks with deployments because the unspent error budget represented wasted velocity. This counterintuitive dynamic — reliability engineers encouraging risk — only works because the error budget framework makes the trade-off explicit.
Red flag answer: “Error budget is just the amount of downtime you are allowed.” This is technically correct but misses the organizational mechanism entirely, which is the whole point.
Follow-ups:
  1. The product team has a major launch next week, but the error budget is at 10% remaining. The VP of Product insists on shipping anyway. How do you handle this?
  2. Your error budget calculations show 99.97% availability, but customers are complaining about frequent errors. What could explain the discrepancy between the budget being healthy and user experience being poor?
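The threshold-based policy described in the error budget answer can be encoded directly, which makes it easy to surface in a dashboard or a deploy gate. This is a sketch: the function names and policy strings are illustrative, and treating exactly 25% and 50% as part of the middle tier is an assumption the answer leaves open.

```python
def remaining_fraction(slo: float, total_requests: int, failed_requests: int) -> float:
    """Share of the request-based error budget still unspent (can go negative)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


def budget_policy(remaining: float) -> str:
    """Map remaining error budget (1.0 = untouched) to a policy tier."""
    if remaining <= 0.0:
        return "reliability fixes only; postmortem required"
    if remaining < 0.25:
        return "feature freeze; all effort on reliability"
    if remaining <= 0.50:
        return "canary deployments required; postpone risky migrations"
    return "normal velocity; deploy at will"


# 99.95% SLO, 10M requests this window, 4,500 failed: about 10% of budget left.
left = remaining_fraction(0.9995, 10_000_000, 4_500)
print(f"{left:.0%} remaining -> {budget_policy(left)}")
```

Having the policy as code keeps the launch-week debate focused on the numbers rather than on who gets to decide.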
Strong Answer:
  • At 500GB per day, the architecture must address three challenges: ingestion throughput (can you capture all 500GB without dropping logs?), storage economics (indexed log storage is expensive at this scale), and query performance (can engineers search these logs quickly during an incident?).
  • Ingestion layer: applications write logs to local files or stdout (never directly to a remote service — that creates a coupling that can crash the app if the logging service is down). A local agent (Fluentd, Fluent Bit, or Vector) tails the logs, parses and enriches them (adds host, service name, environment), and forwards them to a central aggregation layer. Use a buffer (Kafka) between the agent and the indexer to absorb traffic spikes. Kafka also provides durability — if the indexer goes down, logs queue up in Kafka rather than being dropped.
  • Processing and routing: not all logs deserve the same treatment. Route logs based on level and source: ERROR and WARN logs go to the fast-query indexed store (Elasticsearch, Loki, or Splunk). INFO logs for critical services go to the indexed store. INFO logs for non-critical services and all DEBUG logs go directly to cold storage (S3 with Parquet format). This tiering typically reduces your indexed volume by 60-70%.
  • Indexed storage: Elasticsearch is the most common choice. At 500GB per day (with 60-70% going to cold storage, so ~150-200GB indexed daily), you need a cluster sized for both ingest throughput and query performance. Use time-based indices (one per day), apply an ILM (Index Lifecycle Management) policy to roll indices to warm storage after 7 days (fewer replicas, cheaper nodes) and delete after 30 days. For a cheaper alternative, Grafana Loki stores log lines without full-text indexing — it indexes only labels (service, level) and uses brute-force grep at query time. Much cheaper, slower for arbitrary text search, excellent for label-based filtering.
  • Cold storage: the 60-70% of logs that go to S3 can be queried on-demand using Athena or Presto when needed, typically during deep-dive investigations. The query takes minutes instead of seconds, but the storage cost is cents per GB vs. dollars per GB for indexed storage.
  • Reliability: the pipeline must not lose logs. Kafka’s durability guarantees cover the ingestion path. The agent should have a disk-backed buffer for when Kafka is unreachable. End-to-end, monitor the pipeline itself: track the lag between log generation and indexing, alert if it exceeds 5 minutes.
  • Example: Netflix processes over 1 PB of logs per day. They use a tiered architecture: a subset of logs is indexed in Elasticsearch for real-time search, while the full log stream goes to S3 via their data pipeline (“Keystone”). Engineers query S3 via Presto for historical investigations. This hybrid gives them fast search for recent logs and cost-effective access to everything.
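The routing rule and the volume arithmetic above can be sketched in a few lines. This is a minimal illustration, not a real agent configuration: `CRITICAL_SERVICES`, `LogRecord`, `route`, and `indexed_gb_per_day` are hypothetical names, and in practice the same rule would live in the Fluent Bit or Vector routing config.

```python
from dataclasses import dataclass

# Hypothetical list of services whose INFO logs are worth indexing.
CRITICAL_SERVICES = {"checkout", "payments"}


@dataclass(frozen=True)
class LogRecord:
    service: str
    level: str  # "DEBUG", "INFO", "WARN", or "ERROR"
    message: str


def route(record: LogRecord) -> str:
    """Return the destination tier for a record: "indexed" or "cold"."""
    level = record.level.upper()
    # ERROR and WARN always go to the fast-query indexed store.
    if level in ("ERROR", "WARN"):
        return "indexed"
    # INFO is indexed only for critical services.
    if level == "INFO" and record.service in CRITICAL_SERVICES:
        return "indexed"
    # Everything else (non-critical INFO, all DEBUG) goes to cold storage.
    return "cold"


def indexed_gb_per_day(total_gb: float, cold_fraction: float) -> float:
    """Daily volume left in the indexed store after tiering."""
    return total_gb * (1.0 - cold_fraction)
```

With 500GB per day and 60-70% routed to cold storage, `indexed_gb_per_day` gives roughly 150 to 200GB per day, which is the number to size the indexed cluster against.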
Red flag answer: “Just send everything to Elasticsearch” or “Just use CloudWatch Logs.” Both show a lack of understanding of cost at scale: 500GB/day in Elasticsearch with default settings would cost tens of thousands per month and require significant cluster management.
Follow-ups:
  1. Your Elasticsearch cluster is healthy but engineers complain that log searches during incidents are slow (taking 30+ seconds). What are the most likely causes and how do you optimize query performance?
  2. A compliance requirement mandates that logs must be retained for 7 years. How does this change your architecture, and what is the cost implication?
Strong Answer:
  • Head-based sampling makes the keep-or-drop decision at the start of the trace (when the first span is created). For example, “sample 10% of traces” means 90% of traces are never recorded at all. The decision is made before you know anything about whether the trace will be interesting. Tail-based sampling defers the decision until the trace is complete, so it can evaluate the entire trace before deciding whether to keep it.
  • Tail-based sampling is superior because it can apply intelligent criteria: keep all traces where any span has an error, keep all traces where the total duration exceeds a threshold (e.g., > 2 seconds), keep all traces for a specific user ID or transaction type, and randomly sample the rest. This means you retain 100% of the traces that matter for debugging (errors and slow requests) while dropping the vast majority of boring, successful traces.
  • With head-based sampling at 10%, you have a 90% chance of losing the trace for any given production error. During an incident, you might find zero traces for the failing request pattern, making tracing useless exactly when you need it most. With tail-based sampling, every error trace is preserved regardless of the overall sampling rate.
  • The trade-off: tail-based sampling is operationally more complex. It requires a collector that temporarily buffers all spans, waits for the trace to complete (or a timeout), evaluates the sampling rules, then either forwards or drops the trace. This buffer consumes memory and adds latency to trace availability (you cannot see the trace until the sampling decision is made, typically 30-60 seconds after trace completion). The OpenTelemetry Collector supports tail-based sampling through the tail sampling processor in its contrib distribution.
  • Example: A payments company switched from 5% head-based sampling to tail-based sampling that kept 100% of error traces, 100% of traces slower than 1 second, and 1% of everything else. Their total trace volume dropped by 70% (cost savings) while their ability to debug payment failures went from “we might have a trace” to “we always have the trace.” The first week after switching, they diagnosed a race condition in their idempotency logic that only manifested 0.01% of the time — a trace that head-based sampling would have almost certainly missed.
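The tail-based rules described above (keep every error, keep every slow trace, randomly sample the rest) reduce to a small decision function run once the trace is complete. The sketch below is illustrative only: `Span`, `keep_trace`, and the default thresholds are assumptions that mirror the payments example, not the API of any particular collector.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Span:
    name: str
    duration_ms: float
    is_error: bool = False


def keep_trace(
    spans: List[Span],
    root_duration_ms: float,
    slow_threshold_ms: float = 1000.0,
    baseline_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to keep a completed trace (tail-based sampling)."""
    # Rule 1: keep every trace that contains at least one error span.
    if any(span.is_error for span in spans):
        return True
    # Rule 2: keep every trace slower than the threshold.
    if root_duration_ms > slow_threshold_ms:
        return True
    # Rule 3: keep a small random share of the remaining traces.
    return rng() < baseline_rate
```

Head-based sampling would only be able to apply the third rule, because the first two need information that does not exist when the first span is created.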
Red flag answer: “Sampling means you just keep a percentage of traces. I would keep 100% to be safe.” This ignores the cost reality (100% retention at scale can cost hundreds of thousands per month) and misses the entire concept of intelligent sampling.
Follow-ups:
  1. Your tail-based sampling collector is buffering spans for 60 seconds before making a decision, but some traces in your system take 5 minutes to complete (long-running background jobs). How do you handle this?
  2. An engineer argues that with tail-based sampling, you can reduce your overall sampling to 0.1% of successful traces because you keep all errors. What is the risk of being this aggressive?
Strong Answer:
  • Monitoring answers pre-defined questions: “Is the CPU high? Is the error rate above threshold? Is the service up?” You set up checks for known failure modes. Observability is the ability to ask arbitrary questions about your system’s internal state using external outputs — questions you did not anticipate when you instrumented the system. The distinction: monitoring tells you THAT something is wrong, observability lets you figure out WHY.
  • The telltale sign of monitoring-without-observability is the “dashboard gap”: an alert fires, the on-call engineer opens a dashboard, sees something is red, but cannot drill down to root cause without SSHing into servers, running ad-hoc queries, or asking other engineers “have you seen this before?” Every incident becomes a manual investigation because the system does not expose enough context to reason about novel failures.
  • To bridge the gap, three concrete changes: (1) Add structured logging with high-cardinality fields. Replace log.info("Request failed") with log.info("request_failed", user_id="usr_123", endpoint="/checkout", error_code="PAYMENT_TIMEOUT", trace_id="abc789", duration_ms=4500). High-cardinality fields (user_id, order_id, trace_id) are what let you pivot and slice data in new ways during an incident. (2) Add distributed tracing. Tracing gives you the request-level view that metrics and logs individually cannot: which service was slow, what the call graph looked like, where the bottleneck was. (3) Correlate all three pillars. Every log line should include the trace_id. Dashboards should link to example traces. Trace views should link to correlated logs. This correlation is what turns three separate data sources into a unified investigative tool.
  • The cultural shift is equally important. Monitoring culture says “add an alert for this failure mode.” Observability culture says “instrument the system so that any failure mode — including ones we have not seen yet — can be diagnosed from the telemetry.” The instrumentation mindset changes from reactive (add a check when something breaks) to proactive (emit rich context from the start).
  • Example: Charity Majors (co-founder of Honeycomb) describes the distinction as “monitoring is for known-unknowns, observability is for unknown-unknowns.” A team at Honeycomb went from 4-hour Mean Time to Resolution to 15 minutes after adopting high-cardinality structured events with tracing, because engineers could query production telemetry the same way they would query a database: “Show me all requests from user X in the last hour, grouped by endpoint, sorted by duration.”
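To make the structured-logging point concrete, here is a minimal sketch using only the Python standard library. `format_event` and `durations_by_endpoint` are hypothetical helpers, and the field names are taken from the example log call above. A real system would ship these JSON lines to a log store and run the query there rather than in process.

```python
import json
import time
from collections import defaultdict
from typing import Dict, Iterable, List


def format_event(event: str, trace_id: str, **fields) -> str:
    """Serialize one structured log event as a single JSON line."""
    # Every event carries the trace_id so logs can be joined to traces.
    record = {"event": event, "trace_id": trace_id, "ts": time.time()}
    record.update(fields)
    return json.dumps(record, sort_keys=True)


def durations_by_endpoint(lines: Iterable[str], user_id: str) -> Dict[str, List[float]]:
    """Group duration_ms by endpoint for one user, slowest first."""
    grouped = defaultdict(list)
    for line in lines:
        rec = json.loads(line)
        if rec.get("user_id") == user_id and "duration_ms" in rec:
            grouped[rec.get("endpoint", "unknown")].append(rec["duration_ms"])
    return {endpoint: sorted(vals, reverse=True) for endpoint, vals in grouped.items()}
```

The second function is the "requests from user X, grouped by endpoint, sorted by duration" question in miniature. It is only answerable because `user_id` and `endpoint` were emitted as fields rather than buried in a free-text message.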
Red flag answer: “Monitoring and observability are the same thing. Observability is just a buzzword.” This misses a real and important distinction. Or: “Just add more dashboards.” More dashboards for known metrics is still monitoring, not observability.
Follow-ups:
  1. You are trying to convince leadership to invest in observability tooling, but they say “we already have Grafana dashboards and PagerDuty alerts, why do we need more?” How do you make the business case?
  2. What is “high cardinality” in the context of observability, and why is it both the most powerful feature and the biggest cost driver?
Strong Answer:
  • This is the “missing time” problem in distributed tracing, and it is one of the most common and frustrating debugging scenarios. The 2.2 seconds are hiding in uninstrumented gaps — parts of the request lifecycle that do not have spans.
  • The most common causes:
    (1) Queue wait time. The request was placed in a thread pool queue or connection pool queue and waited 2.2 seconds for an available worker or connection. This time falls between the parent span starting and the child span starting, but no span covers the gap. Fix: add instrumentation that records time spent waiting for resources (thread pool queue time, connection pool checkout time).
    (2) Middleware or framework overhead. The web framework does work before and after your handler code: request parsing, authentication middleware, response serialization, compression. If you only instrumented the handler, the framework overhead is invisible. Fix: add spans or metrics at the middleware layer.
    (3) Network latency between services. If Service A calls Service B and only B's server-side span is recorded, that span starts when B receives the request. The network transit time (plus any load balancer processing) is in no span. Fix: use client-side spans in A that cover the full HTTP call, not just the server-side processing.
    (4) Garbage collection pauses. A GC pause can add seconds of latency but does not show up in application-level spans. Fix: emit GC pause metrics and correlate them with trace timestamps.
    (5) DNS resolution or TLS handshake. The first request to a service may include DNS lookup (50-200ms) and TLS negotiation (50-150ms). Subsequent requests reuse the connection. Fix: instrument the HTTP client at the transport layer.
  • The debugging approach: look at the waterfall view of the trace. Identify the gaps between consecutive spans. The largest gap is where the time went. Check whether any span’s start time is significantly later than its parent’s start time — that delta is queue or network time. Check whether any span’s end time is much later than the last child span’s end time — that could be response serialization or middleware.
  • Example: A team debugging a slow API response found 1.5 seconds of “missing time” in their trace. The trace showed the handler span started 1.5 seconds after the request hit the server. Investigation revealed the service was using a synchronous thread pool with 10 threads, and under load, requests were queuing for the pool. Adding a span around thread pool checkout time revealed the queue wait. The fix was switching to async request handling, which eliminated the thread pool bottleneck entirely.
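The waterfall inspection described above can be automated: given a parent span and its direct children, compute the intervals of the parent that no child covers and report the largest one. This is a sketch under simplifying assumptions. `Span`, `uncovered_gaps`, and `largest_gap` are hypothetical names, timestamps are plain milliseconds, and clock skew between hosts is ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float


def uncovered_gaps(parent: Span, children: List[Span]) -> List[Tuple[float, float]]:
    """Return the (start, end) intervals of the parent not covered by any child."""
    gaps = []
    cursor = parent.start_ms
    for child in sorted(children, key=lambda s: s.start_ms):
        # Clamp each child to the parent's time range.
        start = min(max(child.start_ms, parent.start_ms), parent.end_ms)
        end = max(min(child.end_ms, parent.end_ms), parent.start_ms)
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < parent.end_ms:
        gaps.append((cursor, parent.end_ms))
    return gaps


def largest_gap(parent: Span, children: List[Span]) -> Optional[Tuple[float, float]]:
    """Return the longest uncovered interval, or None if the parent is fully covered."""
    gaps = uncovered_gaps(parent, children)
    return max(gaps, key=lambda g: g[1] - g[0]) if gaps else None
```

For a 3000ms request whose only children are a 100ms auth span at the start and a database span from 2300ms to 2700ms, the largest gap runs from 100ms to 2300ms. That 2.2-second interval is where to add queue-wait or middleware instrumentation first.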
Red flag answer: “The trace is probably broken or the spans were not recorded correctly.” While instrumentation bugs are possible, jumping to “the tool is wrong” before investigating the common causes of missing time shows a lack of debugging experience.
Follow-ups:
  1. You add instrumentation and discover the 2.2 seconds is connection pool wait time for the database. The pool has 20 connections and 200 concurrent requests. What are your options, and which do you prefer?
  2. How would you design your tracing instrumentation from the start to minimize these “missing time” gaps, without over-instrumenting and creating too many spans?