Senior Level: Observability is how you debug production issues. Interviewers expect senior engineers to design systems that can be understood and debugged at scale.

The Three Pillars of Observability

┌─────────────────────────────────────────────────────────────────┐
│              The Three Pillars                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  METRICS                    LOGS                  TRACES        │
│  ───────                    ────                  ──────        │
│  "What happened?"           "Why?"                "Where?"      │
│                                                                 │
│  Aggregated data            Individual events     Request flow  │
│  Time-series               Text/structured       Distributed    │
│  Cheap to store            Expensive             Moderate       │
│  Alerting                  Debugging             Root cause     │
│                                                                 │
│  Examples:                  Examples:             Examples:      │
│  • Request rate             • Error messages      • Request path │
│  • Error rate               • Stack traces        • Latency/span │
│  • Latency (p50,p99)        • User actions        • Dependencies │
│  • CPU/Memory               • Audit trail         • Bottlenecks  │
│                                                                 │
│  Tools:                     Tools:                Tools:        │
│  • Prometheus               • ELK Stack           • Jaeger       │
│  • Datadog                  • Splunk              • Zipkin       │
│  • CloudWatch               • CloudWatch Logs     • AWS X-Ray    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Observability Stack

Metrics

Key Metrics to Track (RED Method)

┌─────────────────────────────────────────────────────────────────┐
│              RED Method (Request-focused)                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  R - Rate      : Requests per second                           │
│  E - Errors    : Error rate (% of failed requests)             │
│  D - Duration  : Request latency (p50, p95, p99)               │
│                                                                 │
│  For every service, track:                                     │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Service: payment-service                               │   │
│  │  ─────────────────────────────────────────              │   │
│  │  Rate:      150 req/s                                   │   │
│  │  Errors:    0.1% (5xx), 2% (4xx)                        │   │
│  │  Duration:  p50=45ms, p95=120ms, p99=350ms              │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
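
To make the three numbers concrete, here is a minimal sketch of the arithmetic behind a RED panel, assuming two counter samples taken a minute apart and a list of observed request latencies (all values are hypothetical):

# Minimal sketch of the RED arithmetic, assuming two Prometheus-style counter
# samples taken 60 seconds apart and a list of per-request latencies.

def red_summary(total_prev, total_now, errors_prev, errors_now,
                latencies_ms, interval_s=60):
    requests = total_now - total_prev
    rate = requests / interval_s                                      # R: req/s
    error_pct = 100 * (errors_now - errors_prev) / max(requests, 1)   # E: % failed
    ordered = sorted(latencies_ms)

    def pct(q):  # D: latency percentiles from the observed distribution
        return ordered[min(int(q * len(ordered)), len(ordered) - 1)]

    return {
        "rate_rps": round(rate, 1),
        "error_pct": round(error_pct, 2),
        "p50_ms": pct(0.50), "p95_ms": pct(0.95), "p99_ms": pct(0.99),
    }

# 9,000 requests and 9 errors in the last minute -> 150 req/s, 0.1% errors
print(red_summary(100_000, 109_000, 50, 59,
                  latencies_ms=[12, 45, 47, 120, 350] * 100))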

USE Method (Resource-focused)

┌─────────────────────────────────────────────────────────────────┐
│              USE Method (Infrastructure)                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  U - Utilization : % time resource is busy                     │
│  S - Saturation  : Work resource can't service (queue length)  │
│  E - Errors      : Count of error events                       │
│                                                                 │
│  Apply to each resource:                                       │
│                                                                 │
│  CPU:                                                          │
│  • Utilization: 75%                                            │
│  • Saturation: Load average / CPU count                        │
│  • Errors: CPU errors (rare)                                   │
│                                                                 │
│  Memory:                                                        │
│  • Utilization: Used / Total                                   │
│  • Saturation: Swap usage, OOM events                          │
│  • Errors: Allocation failures                                 │
│                                                                 │
│  Network:                                                       │
│  • Utilization: Bandwidth used                                 │
│  • Saturation: Dropped packets, retransmits                    │
│  • Errors: Interface errors                                    │
│                                                                 │
│  Disk:                                                          │
│  • Utilization: I/O time %                                     │
│  • Saturation: I/O queue length                                │
│  • Errors: Bad sectors, I/O errors                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
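
The same checklist maps directly onto gauges. A minimal sketch of exporting USE-style host metrics with prometheus_client, assuming psutil is available for host statistics (metric names and the port are illustrative):

import os
import time

import psutil  # assumed dependency for host-level statistics
from prometheus_client import Gauge, start_http_server

cpu_utilization = Gauge('node_cpu_utilization_percent', 'CPU busy %')
cpu_saturation = Gauge('node_load_per_cpu', 'Load average / CPU count')
mem_utilization = Gauge('node_memory_utilization_percent', 'Memory used %')
mem_saturation = Gauge('node_swap_used_bytes', 'Swap in use (saturation signal)')

def collect_use_metrics():
    cpu_utilization.set(psutil.cpu_percent(interval=1))           # U: % busy
    cpu_saturation.set(os.getloadavg()[0] / psutil.cpu_count())   # S: queued work (Unix-only)
    vm = psutil.virtual_memory()
    mem_utilization.set(vm.percent)                               # U: used / total
    mem_saturation.set(psutil.swap_memory().used)                 # S: swapping under pressure

if __name__ == "__main__":
    start_http_server(9100)   # expose /metrics for Prometheus to scrape
    while True:
        collect_use_metrics()
        time.sleep(14)        # ~15s sample interval including the 1s CPU window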

Metric Types

import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Gauge, Histogram, Summary

app = FastAPI()

# COUNTER: Only goes up (request count, errors)
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# GAUGE: Can go up and down (current connections, queue size)
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# HISTOGRAM: Measures distribution (latency buckets)
request_latency = Histogram(
    'http_request_duration_seconds',
    'Request latency in seconds',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# SUMMARY: Like histogram but calculates quantiles client-side
request_latency_summary = Summary(
    'http_request_duration_summary',
    'Request latency summary',
    ['endpoint']
)

# Usage in request handler
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    status_code = 500  # recorded if the handler raises before returning

    active_connections.inc()
    try:
        response = await call_next(request)
        status_code = response.status_code
        return response
    finally:
        duration = time.time() - start
        active_connections.dec()

        requests_total.labels(
            method=request.method,
            endpoint=request.url.path,
            status=status_code
        ).inc()

        request_latency.labels(
            endpoint=request.url.path
        ).observe(duration)
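
To make these metrics scrapeable, expose the registry on a /metrics endpoint. A minimal sketch for the FastAPI app above, using prometheus_client's standard exposition helpers:

from fastapi import Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

@app.get("/metrics")
async def metrics():
    # Prometheus scrapes this endpoint on its configured interval
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)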

Distributed Tracing

How Tracing Works

┌─────────────────────────────────────────────────────────────────┐
│              Distributed Trace                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Trace ID: abc123 (unique per request)                         │
│                                                                 │
│  API Gateway                                                    │
│  ├─ Span 1: gateway (trace_id=abc123, span_id=001)             │
│  │  │ start: 0ms, duration: 150ms                              │
│  │  │                                                          │
│  │  ├─ Span 2: auth-service (parent=001, span_id=002)         │
│  │  │  │ start: 5ms, duration: 20ms                           │
│  │  │  └─ tags: {user_id: "usr_123"}                          │
│  │  │                                                          │
│  │  ├─ Span 3: order-service (parent=001, span_id=003)        │
│  │  │  │ start: 30ms, duration: 100ms                         │
│  │  │  │                                                       │
│  │  │  ├─ Span 4: database (parent=003, span_id=004)          │
│  │  │  │  │ start: 35ms, duration: 45ms                       │
│  │  │  │  └─ tags: {query: "SELECT..."}                       │
│  │  │  │                                                       │
│  │  │  └─ Span 5: payment-service (parent=003, span_id=005)   │
│  │  │     │ start: 85ms, duration: 40ms                       │
│  │  │     └─ tags: {amount: 150.00}                           │
│  │  │                                                          │
│  │  └─ Span 6: notification (parent=001, span_id=006)         │
│  │     │ start: 135ms, duration: 10ms                         │
│  │     └─ tags: {type: "email"}                               │
│  │                                                              │
│  └─ Total: 150ms                                               │
│                                                                 │
│  Visualized:                                                   │
│  ┌──────────────────────────────────────────────────────┐     │
│  │ gateway                                              │     │
│  │  ├─auth──┤                                           │     │
│  │      ├──────────────order-service────────────────┤   │     │
│  │      │   ├────database────┤                      │   │     │
│  │      │                    ├────payment────┤      │   │     │
│  │                                              ├─notif─┤     │
│  └──────────────────────────────────────────────────────┘     │
│  0ms      25ms      50ms      75ms      100ms     125ms 150ms │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
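
Between services, the trace_id/span_id pair travels in request headers. A minimal sketch of the W3C Trace Context traceparent format (version-trace_id-parent_span_id-flags), which OpenTelemetry propagates by default; the helper names here are illustrative:

import secrets

def new_traceparent() -> str:
    """Build a W3C traceparent header: version-trace_id-parent_span_id-flags."""
    trace_id = secrets.token_hex(16)      # 32 hex chars, shared by every span in the request
    span_id = secrets.token_hex(8)        # 16 hex chars, unique per span
    return f"00-{trace_id}-{span_id}-01"  # trailing 01 = "sampled" flag

def parse_traceparent(header: str) -> dict:
    _version, trace_id, parent_span_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id,
            "sampled": flags == "01"}

# A downstream service keeps the same trace_id and records the caller's span as its parent
print(parse_traceparent(new_traceparent()))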

Implementing Tracing

import httpx

from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Propagate context through HTTP headers
async def call_service(service_url: str, payload: dict, headers: dict):
    # Get current context and inject trace headers
    inject(headers)
    
    async with httpx.AsyncClient() as client:
        return await client.post(service_url, json=payload, headers=headers)

# Start a new span
@app.post("/orders")
async def create_order(request: Request, order: OrderCreate):
    # Extract context from incoming request
    context = extract(dict(request.headers))
    
    with tracer.start_as_current_span("create_order", context=context) as span:
        # Add attributes (searchable in UI)
        span.set_attribute("user_id", order.user_id)
        span.set_attribute("order_total", order.total)
        
        # Child span for database
        with tracer.start_as_current_span("db.insert_order") as db_span:
            db_span.set_attribute("db.system", "postgresql")
            db_span.set_attribute("db.statement", "INSERT INTO orders...")
            order_id = await db.insert_order(order)
        
        # Child span for external service
        with tracer.start_as_current_span("payment.charge") as pay_span:
            pay_span.set_attribute("payment.amount", order.total)
            try:
                await call_service(
                    f"http://payment-service/charge",
                    {"amount": order.total},
                    {}
                )
            except Exception as e:
                pay_span.record_exception(e)
                pay_span.set_status(trace.Status(trace.StatusCode.ERROR))
                raise
        
        return {"order_id": order_id}

Structured Logging

Log Format Best Practices

import structlog

# Configure structured logging
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()

# Good: Structured, searchable, includes context
logger.info(
    "order_created",
    order_id="ord_123",
    user_id="usr_456",
    total=150.00,
    items_count=3,
    trace_id="abc123",
    duration_ms=45
)
# Output: {"event": "order_created", "order_id": "ord_123", 
#          "user_id": "usr_456", "total": 150.00, ...}

# Bad: Unstructured, hard to search
logger.info(f"Order ord_123 created for user usr_456 with total $150.00")
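
structlog can also attach request-scoped context automatically, so fields like trace_id don't have to be repeated at every call site. A minimal sketch using its contextvars helpers (available in recent structlog versions):

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # pull in bound context
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)

# Bind once per request (e.g. in middleware); every log line below inherits it
structlog.contextvars.bind_contextvars(trace_id="abc123", user_id="usr_456")
structlog.get_logger().info("order_created", order_id="ord_123")
structlog.contextvars.clear_contextvars()  # reset at the end of the request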

Log Levels

┌─────────────────────────────────────────────────────────────────┐
│                   Log Level Guidelines                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DEBUG   │ Detailed diagnostic info                            │
│          │ Use: Development, troubleshooting                   │
│          │ Example: "Cache key: user:123, value: {...}"        │
│          │                                                      │
│  INFO    │ Normal operations                                   │
│          │ Use: Track key events                               │
│          │ Example: "Order created", "User logged in"          │
│          │                                                      │
│  WARNING │ Potential issues (but still working)                │
│          │ Use: Things that should be investigated             │
│          │ Example: "Rate limit approaching", "Retry succeeded"│
│          │                                                      │
│  ERROR   │ Operation failed (but service continues)            │
│          │ Use: Failed requests, exceptions                    │
│          │ Example: "Payment failed", "DB connection timeout"  │
│          │                                                      │
│  FATAL   │ Service is unusable                                 │
│          │ Use: Critical failures requiring immediate action   │
│          │ Example: "Database unreachable", "Config missing"   │
│                                                                 │
│  PRODUCTION LOG LEVELS:                                        │
│  • Default: INFO and above                                     │
│  • Per-service override for debugging                          │
│  • Never DEBUG in production (too verbose, costly)             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
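
The per-service override mentioned above can be done with the standard logging hierarchy. A minimal sketch, assuming the level comes from an environment variable so it can be raised temporarily without a redeploy (the variable and logger names are illustrative):

import logging
import os

# Default for everything: INFO and above
logging.basicConfig(level=logging.INFO)

# Temporarily turn on verbose logging for one component only,
# e.g. LOG_LEVEL_PAYMENTS=DEBUG set on the payment pods
payments_level = os.getenv("LOG_LEVEL_PAYMENTS", "INFO").upper()
logging.getLogger("payment_service").setLevel(payments_level)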

Correlation IDs

from contextvars import ContextVar
import uuid

# Context variable for request correlation
correlation_id_ctx: ContextVar[str] = ContextVar('correlation_id')

class CorrelationMiddleware:
    """
    Ensures every request has a correlation ID
    """
    
    async def __call__(self, request, call_next):
        # Get from header or generate new
        correlation_id = request.headers.get(
            'X-Correlation-ID',
            str(uuid.uuid4())
        )
        
        # Set in context for logging
        correlation_id_ctx.set(correlation_id)
        
        # Process request
        response = await call_next(request)
        
        # Add to response headers
        response.headers['X-Correlation-ID'] = correlation_id
        return response

# Logger automatically includes correlation ID
class CorrelationLogger:
    def _log(self, level, message, **kwargs):
        kwargs['correlation_id'] = correlation_id_ctx.get(None)
        # Dispatch to the matching structlog method (info, error, ...)
        getattr(structlog.get_logger(), level)(message, **kwargs)
    
    def info(self, message, **kwargs):
        self._log('info', message, **kwargs)
    
    def error(self, message, **kwargs):
        self._log('error', message, **kwargs)
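
The correlation ID only pays off if it is forwarded on outbound calls, so every service in the chain logs the same value. A minimal sketch using httpx and the context variable defined above (the downstream URL is illustrative):

import httpx

async def call_downstream(url: str, payload: dict) -> httpx.Response:
    # Reuse the current request's correlation ID so logs across services
    # can be joined on X-Correlation-ID
    headers = {'X-Correlation-ID': correlation_id_ctx.get(str(uuid.uuid4()))}
    async with httpx.AsyncClient() as client:
        return await client.post(url, json=payload, headers=headers)

# Example: await call_downstream("http://inventory-service/reserve", {"sku": "abc"})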

Alerting

Alert Design Principles

┌─────────────────────────────────────────────────────────────────┐
│                   Alerting Best Practices                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  GOOD ALERT:                                                    │
│  • Actionable (someone needs to do something)                  │
│  • Urgent (needs attention within SLA)                         │
│  • Clear (describes what's wrong and impact)                   │
│  • Rare (not noisy, doesn't cause alert fatigue)               │
│                                                                 │
│  BAD ALERT:                                                     │
│  • "CPU at 80%" (so what? is anything broken?)                 │
│  • Fires frequently (causes alert fatigue)                     │
│  • No clear action (what should I do?)                         │
│                                                                 │
│  ALERT ON SYMPTOMS, NOT CAUSES:                                │
│                                                                 │
│  BAD:  "Database CPU > 90%"                                 │
│  GOOD: "Order creation latency > 2s for 5 minutes"         │
│                                                                 │
│  Why? The symptom (slow orders) is what matters to users.      │
│  High CPU that doesn't affect users isn't urgent.              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Production-Ready Observability Implementation

Complete observability stack with metrics, logging, and tracing:
import time
import logging
import functools
from typing import Optional, Dict, Any, Callable
from dataclasses import dataclass, field
from datetime import datetime
from contextlib import contextmanager
from contextvars import ContextVar
import json
import uuid

# ============== Metrics Collection ==============
from prometheus_client import Counter, Histogram, Gauge, Info, REGISTRY
from prometheus_client.exposition import generate_latest

class MetricsCollector:
    """Production-ready metrics collection"""
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        
        # RED Method Metrics
        self.request_count = Counter(
            'http_requests_total',
            'Total HTTP requests',
            ['service', 'method', 'endpoint', 'status']
        )
        
        self.request_latency = Histogram(
            'http_request_duration_seconds',
            'HTTP request latency',
            ['service', 'method', 'endpoint'],
            buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        )
        
        self.request_in_progress = Gauge(
            'http_requests_in_progress',
            'HTTP requests currently being processed',
            ['service', 'endpoint']
        )
        
        # USE Method Metrics
        self.connection_pool_size = Gauge(
            'connection_pool_size',
            'Current connection pool size',
            ['service', 'pool_name']
        )
        
        self.queue_size = Gauge(
            'queue_size',
            'Current queue size',
            ['service', 'queue_name']
        )
        
        # Business Metrics
        self.orders_total = Counter(
            'orders_total',
            'Total orders processed',
            ['service', 'status', 'payment_method']
        )
        
        self.revenue_total = Counter(
            'revenue_total_cents',
            'Total revenue in cents',
            ['service', 'currency']
        )
        
        # Service Info
        self.service_info = Info(
            'service_info',
            'Service information'
        )
        self.service_info.info({
            'service': service_name,
            'version': '1.0.0'
        })
    
    def track_request(self, method: str, endpoint: str):
        """Decorator to track request metrics"""
        def decorator(func: Callable):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                self.request_in_progress.labels(
                    service=self.service_name,
                    endpoint=endpoint
                ).inc()
                
                start_time = time.time()
                status = "success"
                
                try:
                    result = await func(*args, **kwargs)
                    return result
                except Exception as e:
                    status = "error"
                    raise
                finally:
                    duration = time.time() - start_time
                    
                    self.request_count.labels(
                        service=self.service_name,
                        method=method,
                        endpoint=endpoint,
                        status=status
                    ).inc()
                    
                    self.request_latency.labels(
                        service=self.service_name,
                        method=method,
                        endpoint=endpoint
                    ).observe(duration)
                    
                    self.request_in_progress.labels(
                        service=self.service_name,
                        endpoint=endpoint
                    ).dec()
            
            return async_wrapper
        return decorator
    
    def export(self) -> bytes:
        """Export metrics in Prometheus format"""
        return generate_latest(REGISTRY)

# ============== Distributed Tracing ==============
trace_context: ContextVar[Dict[str, Any]] = ContextVar('trace_context', default={})

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    operation_name: str
    service_name: str
    start_time: float
    end_time: Optional[float] = None
    tags: Dict[str, Any] = field(default_factory=dict)
    logs: list = field(default_factory=list)
    status: str = "OK"
    
    def set_tag(self, key: str, value: Any) -> 'Span':
        self.tags[key] = value
        return self
    
    def log(self, message: str, **kwargs) -> 'Span':
        self.logs.append({
            "timestamp": time.time(),
            "message": message,
            **kwargs
        })
        return self
    
    def set_error(self, error: Exception) -> 'Span':
        self.status = "ERROR"
        self.tags["error"] = True
        self.tags["error.message"] = str(error)
        self.tags["error.type"] = type(error).__name__
        return self
    
    def finish(self) -> None:
        self.end_time = time.time()
    
    def duration_ms(self) -> float:
        if self.end_time is None:
            return 0
        return (self.end_time - self.start_time) * 1000
    
    def to_dict(self) -> Dict[str, Any]:
        return {
            "traceId": self.trace_id,
            "spanId": self.span_id,
            "parentSpanId": self.parent_span_id,
            "operationName": self.operation_name,
            "serviceName": self.service_name,
            "startTime": int(self.start_time * 1_000_000),
            "duration": int(self.duration_ms() * 1000),
            "tags": self.tags,
            "logs": self.logs,
            "status": self.status
        }

class Tracer:
    """Distributed tracing implementation"""
    
    def __init__(self, service_name: str, exporter: Optional['SpanExporter'] = None):
        self.service_name = service_name
        self.exporter = exporter or ConsoleSpanExporter()
    
    @contextmanager
    def start_span(self, operation_name: str, **tags):
        """Start a new span as context manager"""
        ctx = trace_context.get()
        
        trace_id = ctx.get('trace_id') or self._generate_id()
        parent_span_id = ctx.get('span_id')
        span_id = self._generate_id()
        
        span = Span(
            trace_id=trace_id,
            span_id=span_id,
            parent_span_id=parent_span_id,
            operation_name=operation_name,
            service_name=self.service_name,
            start_time=time.time(),
            tags=tags
        )
        
        # Set new context
        new_ctx = {
            'trace_id': trace_id,
            'span_id': span_id,
            'parent_span_id': parent_span_id
        }
        token = trace_context.set(new_ctx)
        
        try:
            yield span
        except Exception as e:
            span.set_error(e)
            raise
        finally:
            span.finish()
            self.exporter.export(span)
            trace_context.reset(token)
    
    def inject_headers(self) -> Dict[str, str]:
        """Inject trace context into HTTP headers"""
        ctx = trace_context.get()
        return {
            'X-Trace-ID': ctx.get('trace_id', ''),
            'X-Span-ID': ctx.get('span_id', ''),
            'X-Parent-Span-ID': ctx.get('parent_span_id', '')
        }
    
    def extract_headers(self, headers: Dict[str, str]) -> None:
        """Extract trace context from HTTP headers"""
        ctx = {
            'trace_id': headers.get('X-Trace-ID') or self._generate_id(),
            'span_id': headers.get('X-Span-ID'),
            'parent_span_id': headers.get('X-Parent-Span-ID')
        }
        trace_context.set(ctx)
    
    def _generate_id(self) -> str:
        return uuid.uuid4().hex[:16]

class ConsoleSpanExporter:
    def export(self, span: Span) -> None:
        print(json.dumps(span.to_dict(), indent=2))

class JaegerExporter:
    """Export spans to Jaeger"""
    
    def __init__(self, agent_host: str = "localhost", agent_port: int = 6831):
        self.agent_host = agent_host
        self.agent_port = agent_port
        self.batch: list = []
        self.batch_size = 100
    
    def export(self, span: Span) -> None:
        self.batch.append(span.to_dict())
        
        if len(self.batch) >= self.batch_size:
            self._flush()
    
    def _flush(self) -> None:
        if not self.batch:
            return
        
        # Send to the agent via UDP. NOTE: simplified for illustration; a real
        # Jaeger agent expects Thrift-encoded spans (or OTLP via the collector),
        # not raw JSON, so use the official client/SDK in practice.
        import socket
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        
        payload = json.dumps({"spans": self.batch}).encode()
        sock.sendto(payload, (self.agent_host, self.agent_port))
        
        self.batch = []

# ============== Structured Logging ==============
correlation_id_ctx: ContextVar[str] = ContextVar('correlation_id', default='')

class StructuredLogger:
    """Production structured logging with correlation"""
    
    def __init__(self, service_name: str, log_level: str = "INFO"):
        self.service_name = service_name
        self.logger = logging.getLogger(service_name)
        self.logger.setLevel(getattr(logging, log_level))
        
        # JSON formatter
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        self.logger.addHandler(handler)
    
    def _enrich(self, extra: Dict[str, Any]) -> Dict[str, Any]:
        """Add standard fields to log entry"""
        ctx = trace_context.get()
        return {
            "service": self.service_name,
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "trace_id": ctx.get('trace_id', ''),
            "span_id": ctx.get('span_id', ''),
            "correlation_id": correlation_id_ctx.get(''),
            **extra
        }
    
    def info(self, message: str, **kwargs) -> None:
        self.logger.info(message, extra=self._enrich(kwargs))
    
    def error(self, message: str, **kwargs) -> None:
        self.logger.error(message, extra=self._enrich(kwargs))
    
    def warning(self, message: str, **kwargs) -> None:
        self.logger.warning(message, extra=self._enrich(kwargs))
    
    def debug(self, message: str, **kwargs) -> None:
        self.logger.debug(message, extra=self._enrich(kwargs))

class JsonFormatter(logging.Formatter):
    """Format logs as JSON"""
    
    def format(self, record: logging.LogRecord) -> str:
        log_entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        
        # Add extra fields
        if hasattr(record, '__dict__'):
            for key, value in record.__dict__.items():
                if key not in ['name', 'msg', 'args', 'levelname', 'levelno',
                              'pathname', 'filename', 'module', 'lineno',
                              'funcName', 'created', 'msecs', 'relativeCreated',
                              'thread', 'threadName', 'processName', 'process',
                              'getMessage', 'exc_info', 'exc_text', 'stack_info']:
                    log_entry[key] = value
        
        return json.dumps(log_entry)

# ============== Full Observability Middleware ==============
class ObservabilityMiddleware:
    """Combined metrics, tracing, and logging middleware"""
    
    def __init__(
        self, 
        app, 
        service_name: str,
        metrics: MetricsCollector,
        tracer: Tracer,
        logger: StructuredLogger
    ):
        self.app = app
        self.service_name = service_name
        self.metrics = metrics
        self.tracer = tracer
        self.logger = logger
    
    async def __call__(self, request, call_next):
        # Extract or generate correlation ID
        correlation_id = request.headers.get(
            'X-Correlation-ID', 
            str(uuid.uuid4())
        )
        correlation_id_ctx.set(correlation_id)
        
        # Extract trace context
        self.tracer.extract_headers(dict(request.headers))
        
        endpoint = request.url.path
        method = request.method
        
        with self.tracer.start_span(
            f"{method} {endpoint}",
            http_method=method,
            http_url=str(request.url)
        ) as span:
            
            self.logger.info(
                "request_started",
                method=method,
                path=endpoint,
                user_agent=request.headers.get('user-agent')
            )
            
            start_time = time.time()
            
            try:
                response = await call_next(request)
                
                span.set_tag("http.status_code", response.status_code)
                
                self.logger.info(
                    "request_completed",
                    method=method,
                    path=endpoint,
                    status_code=response.status_code,
                    duration_ms=round((time.time() - start_time) * 1000, 2)
                )
                
                # Add trace headers to response
                response.headers['X-Trace-ID'] = span.trace_id
                response.headers['X-Correlation-ID'] = correlation_id
                
                return response
                
            except Exception as e:
                span.set_error(e)
                
                self.logger.error(
                    "request_failed",
                    method=method,
                    path=endpoint,
                    error=str(e),
                    error_type=type(e).__name__,
                    duration_ms=round((time.time() - start_time) * 1000, 2)
                )
                
                raise
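
A sketch of wiring these pieces into an application. FastAPI is assumed here; any ASGI framework with request/response middleware works the same way, and the endpoints are illustrative:

from fastapi import FastAPI, Request, Response
from prometheus_client import CONTENT_TYPE_LATEST

app = FastAPI()

metrics = MetricsCollector("order-service")
tracer = Tracer("order-service", exporter=ConsoleSpanExporter())
logger = StructuredLogger("order-service")
observability = ObservabilityMiddleware(app, "order-service", metrics, tracer, logger)

@app.middleware("http")
async def observe(request: Request, call_next):
    # Every request gets a correlation ID, a trace span, and structured request logs
    return await observability(request, call_next)

@app.get("/metrics")
async def metrics_endpoint():
    # Prometheus scrape target
    return Response(content=metrics.export(), media_type=CONTENT_TYPE_LATEST)

@app.post("/orders")
async def create_order():
    with tracer.start_span("db.insert_order", db_system="postgresql"):
        ...  # persist the order (placeholder)
    return {"status": "created"}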

Alert Template

# Example Prometheus alerting rule (routed to PagerDuty/Opsgenie via Alertmanager)
alert: OrderServiceHighLatency
expr: histogram_quantile(0.99, rate(order_request_duration_seconds_bucket[5m])) > 2
for: 5m
labels:
  severity: critical
  service: order-service
  team: payments
annotations:
  summary: "Order service p99 latency is {{ $value }}s (threshold: 2s)"
  description: |
    Impact: Users are experiencing slow checkout.
    
    Dashboard: https://grafana.example.com/d/orders
    Runbook: https://wiki.example.com/runbooks/order-latency
    
    Possible causes:
    - Database connection pool exhausted
    - Payment service slow
    - Increased traffic
    
    Immediate actions:
    1. Check dashboard for traffic spike
    2. Check payment-service health
    3. Check database metrics
  runbook_url: https://wiki.example.com/runbooks/order-latency

Alert Severity Levels

┌─────────────────────────────────────────────────────────────────┐
│                   Severity Levels                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SEV-1 (Critical) - Page immediately, all hands                │
│  ─────────────────────────────────────────────                  │
│  • Service is down                                             │
│  • Data loss occurring                                         │
│  • Security breach                                             │
│  Response: 15 minutes                                          │
│                                                                 │
│  SEV-2 (High) - Page on-call                                   │
│  ───────────────────────────                                    │
│  • Major feature degraded                                      │
│  • Significant latency increase                                │
│  • Error rate > 5%                                             │
│  Response: 30 minutes                                          │
│                                                                 │
│  SEV-3 (Medium) - Ticket, fix during business hours            │
│  ───────────────────────────────────────────────                │
│  • Non-critical feature broken                                 │
│  • Performance degradation (not severe)                        │
│  Response: 4 hours                                             │
│                                                                 │
│  SEV-4 (Low) - Ticket, fix when convenient                     │
│  ─────────────────────────────────────────                      │
│  • Minor issues                                                │
│  • Cosmetic problems                                           │
│  Response: 1 week                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Dashboards

Dashboard Design

┌─────────────────────────────────────────────────────────────────┐
│                   Dashboard Layout                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  TOP: High-level health (GREEN/YELLOW/RED)                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  [OK] Gateway  [OK] Orders  [WARN] Payments  [OK] DB    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  GOLDEN SIGNALS (RED method):                                  │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐  │
│  │ Request Rate    │ │ Error Rate      │ │ Latency p99     │  │
│  │     ^           │ │     v           │ │     ~           │  │
│  │  1,234 req/s    │ │    0.05%        │ │    145ms        │  │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘  │
│                                                                 │
│  RESOURCE UTILIZATION (USE method):                            │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐  │
│  │ CPU             │ │ Memory          │ │ Disk I/O        │  │
│  │     ▂▃▅▆▇       │ │     ▃▃▃▃▃       │ │     ▁▂▁▂▁       │  │
│  │    65%          │ │    72%          │ │    15%          │  │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘  │
│                                                                 │
│  DEPENDENCIES:                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Payment API: 50ms │ Database: 10ms │ Redis: 1ms        │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  BOTTOM: Detailed graphs, logs, traces                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

SLIs, SLOs, and SLAs

┌─────────────────────────────────────────────────────────────────┐
│              SLI / SLO / SLA                                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SLI (Service Level Indicator)                                 │
│  ─────────────────────────────                                  │
│  The metric you measure                                        │
│  Example: "99th percentile latency of API requests"            │
│                                                                 │
│  SLO (Service Level Objective)                                 │
│  ─────────────────────────────                                  │
│  The target you aim for (internal)                             │
│  Example: "99th percentile latency < 200ms"                    │
│                                                                 │
│  SLA (Service Level Agreement)                                 │
│  ─────────────────────────────                                  │
│  The contract with customers (external)                        │
│  Example: "99.9% availability or refund"                       │
│  Note: SLA should be looser than SLO (buffer)                  │
│                                                                 │
│  Example SLO Document:                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Service: Order API                                     │   │
│  │  ────────────────────                                    │   │
│  │  Availability: 99.95% of requests successful            │   │
│  │  Latency: 99% of requests < 500ms                       │   │
│  │  Throughput: Handle 10,000 req/s                        │   │
│  │                                                         │   │
│  │  Error Budget: 0.05% = 21.6 minutes/month               │   │
│  │  Current Budget Remaining: 15.3 minutes                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
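
The error-budget numbers in the example follow directly from the SLO. A minimal sketch of the arithmetic, assuming a 30-day window (the consumed-downtime figure is hypothetical):

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given availability SLO."""
    window_minutes = window_days * 24 * 60   # 30 days = 43,200 minutes
    return (1 - slo) * window_minutes

# 99.95% availability over 30 days -> 0.05% budget = 21.6 minutes
budget = error_budget_minutes(0.9995)
consumed = 6.3   # minutes of downtime so far this window (hypothetical)
print(f"budget={budget:.1f} min, remaining={budget - consumed:.1f} min")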

Senior Interview Questions

Q: "p99 latency suddenly spiked in production. How do you investigate?"

Systematic approach:
  1. Check dashboards: RED metrics, identify when it started
  2. Correlate: Deployments? Traffic spike? Dependency issue?
  3. Trace analysis: Find slow spans in traces
  4. Log analysis: Search for errors around that time
  5. Narrow down: Which endpoint? Which users?

Common causes:
  • Database slow queries (check slow query log)
  • GC pauses (check GC metrics)
  • Connection pool exhaustion
  • External dependency slowdown
  • Lock contention

Q: "You are launching a new service. How do you make it observable from day one?"

Checklist:
  1. Instrument code: Add metrics (RED method)
  2. Add tracing: Propagate trace context
  3. Structured logging: With correlation IDs
  4. Create dashboard: Health, golden signals, resources
  5. Set up alerts: On symptoms, not causes
  6. Document SLOs: Define success criteria
  7. Create runbooks: What to do when alerts fire

Q: "Your team is drowning in pages. How do you reduce alert fatigue?"

Strategies:
  1. Alert on symptoms: User impact, not causes
  2. Use thresholds wisely: 5 minutes > 80% vs instant spike
  3. Group related alerts: One page per incident, not 10
  4. Regular review: Delete unused, tune noisy alerts
  5. Escalation policy: Low-priority → ticket, high → page
  6. On-call feedback: Track alert quality metrics
Key signal: if on-call is paged but no action is needed, fix or delete the alert.

Q: "How would you design a metrics pipeline that works at company scale?"

Architecture:
  1. Collection: Agent on each host (Prometheus, StatsD)
  2. Aggregation: Pre-aggregate at edge (reduce cardinality)
  3. Storage: Time-series DB (InfluxDB, M3DB, Thanos)
  4. Query: Federation for cross-cluster queries
  5. Visualization: Grafana dashboards

Scale challenges:
  • High cardinality labels (user_id) → Aggregate (see the sketch below)
  • Long retention → Downsampling
  • Many metrics → Drop unused
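
As referenced above, a minimal sketch of the cardinality point: every distinct label combination becomes its own time series, so unbounded labels such as user_id should go to logs or traces instead of metrics (metric names are illustrative):

from prometheus_client import Counter

# BAD: one time series per user; with millions of users this overwhelms
# the metrics backend (a "cardinality explosion")
logins_by_user = Counter('logins_total_by_user', 'Logins', ['user_id'])

# GOOD: bounded label values; per-user detail belongs in logs or traces,
# which are built for high-cardinality lookups
logins = Counter('logins_total', 'Logins', ['method', 'status'])

logins.labels(method='password', status='success').inc()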