Observability & Monitoring

Senior Level: Observability is how you debug production issues. Interviewers expect senior engineers to design systems that can be understood and debugged at scale.

The Three Pillars of Observability

┌─────────────────────────────────────────────────────────────────┐
│              The Three Pillars                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  METRICS                    LOGS                  TRACES        │
│  ───────                    ────                  ──────        │
│  "What happened?"           "Why?"                "Where?"      │
│                                                                 │
│  Aggregated data            Individual events     Request flow  │
│  Time-series               Text/structured       Distributed    │
│  Cheap to store            Expensive             Moderate       │
│  Alerting                  Debugging             Root cause     │
│                                                                 │
│  Examples:                  Examples:             Examples:      │
│  • Request rate             • Error messages      • Request path │
│  • Error rate               • Stack traces        • Latency/span │
│  • Latency (p50,p99)        • User actions        • Dependencies │
│  • CPU/Memory               • Audit trail         • Bottlenecks  │
│                                                                 │
│  Tools:                     Tools:                Tools:        │
│  • Prometheus               • ELK Stack           • Jaeger       │
│  • Datadog                  • Splunk              • Zipkin       │
│  • CloudWatch               • CloudWatch Logs     • AWS X-Ray    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Metrics

Key Metrics to Track (RED Method)

┌─────────────────────────────────────────────────────────────────┐
│              RED Method (Request-focused)                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  R - Rate      : Requests per second                           │
│  E - Errors    : Error rate (% of failed requests)             │
│  D - Duration  : Request latency (p50, p95, p99)               │
│                                                                 │
│  For every service, track:                                     │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Service: payment-service                               │   │
│  │  ─────────────────────────────────────────              │   │
│  │  Rate:      150 req/s                                   │   │
│  │  Errors:    0.1% (5xx), 2% (4xx)                        │   │
│  │  Duration:  p50=45ms, p95=120ms, p99=350ms              │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
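
As a rough sketch of how the three RED numbers could be derived from raw request records: the `RequestRecord` type and the `red_summary` and `percentile` helpers below are hypothetical names, and nearest-rank is only one of several common percentile definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    status: int         # HTTP status code
    duration_ms: float  # time spent serving the request


def percentile(sorted_values, pct):
    """Nearest-rank percentile over an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(records, window_seconds):
    """Rate, Errors, Duration for one service over one time window."""
    durations = sorted(r.duration_ms for r in records)
    server_errors = sum(1 for r in records if r.status >= 500)
    return {
        "rate_per_s": len(records) / window_seconds,
        "error_pct": 100 * server_errors / len(records),
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }


# 100 requests observed over a 10-second window, two of them 5xx
records = [RequestRecord(200, float(ms)) for ms in range(1, 99)]
records += [RequestRecord(500, 400.0), RequestRecord(503, 900.0)]
summary = red_summary(records, window_seconds=10)
print(summary)
```

In practice a metrics backend computes these from counters and histograms rather than from raw records; the sketch only makes the definitions concrete.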

USE Method (Resource-focused)

┌─────────────────────────────────────────────────────────────────┐
│              USE Method (Infrastructure)                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  U - Utilization : % time resource is busy                     │
│  S - Saturation  : Work resource can't service (queue length)  │
│  E - Errors      : Count of error events                       │
│                                                                 │
│  Apply to each resource:                                       │
│                                                                 │
│  CPU:                                                          │
│  • Utilization: 75%                                            │
│  • Saturation: Load average / CPU count                        │
│  • Errors: CPU errors (rare)                                   │
│                                                                 │
│  Memory:                                                        │
│  • Utilization: Used / Total                                   │
│  • Saturation: Swap usage, OOM events                          │
│  • Errors: Allocation failures                                 │
│                                                                 │
│  Network:                                                       │
│  • Utilization: Bandwidth used                                 │
│  • Saturation: Dropped packets, retransmits                    │
│  • Errors: Interface errors                                    │
│                                                                 │
│  Disk:                                                          │
│  • Utilization: I/O time %                                     │
│  • Saturation: I/O queue length                                │
│  • Errors: Bad sectors, I/O errors                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
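
The CPU saturation line above (load average divided by CPU count) can be sketched as follows. The helper names and the 1.0 threshold are assumptions for illustration, and `os.getloadavg()` is not available on every platform, so the live reading is guarded.

```python
import os


def cpu_saturation(load_average: float, cpu_count: int) -> float:
    """Load per core; values above 1.0 suggest work is queueing for CPU."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


def is_saturated(load_average: float, cpu_count: int, threshold: float = 1.0) -> bool:
    return cpu_saturation(load_average, cpu_count) > threshold


# Live reading, where the platform supports it (e.g. not on Windows)
if hasattr(os, "getloadavg"):
    try:
        one_minute_load, _, _ = os.getloadavg()
        cores = os.cpu_count() or 1
        print(f"CPU saturation: {cpu_saturation(one_minute_load, cores):.2f}")
    except OSError:
        print("load average unavailable")
```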

Metric Types

import time

from prometheus_client import Counter, Gauge, Histogram, Summary

# COUNTER: Only goes up (request count, errors)
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# GAUGE: Can go up and down (current connections, queue size)
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# HISTOGRAM: Measures distribution (latency buckets)
request_latency = Histogram(
    'http_request_duration_seconds',
    'Request latency in seconds',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# SUMMARY: Tracks the count and sum of observations client-side
# (the Python client does not export quantiles; use a Histogram for percentiles)
request_latency_summary = Summary(
    'http_request_duration_summary',
    'Request latency summary',
    ['endpoint']
)

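# Usage in a request handler (assumes a FastAPI/Starlette `app`)
@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.time()
    status_code = 500  # reported if call_next raises before a response exists

    active_connections.inc()
    try:
        response = await call_next(request)
        status_code = response.status_code
        return response
    finally:
        duration = time.time() - start
        active_connections.dec()

        # Raw paths such as /users/123 create one time series per ID;
        # prefer the route template as the label value where available.
        requests_total.labels(
            method=request.method,
            endpoint=request.url.path,
            status=status_code
        ).inc()

        request_latency.labels(
            endpoint=request.url.path
        ).observe(duration)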
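
Bucket boundaries matter because percentiles are later estimated from the cumulative bucket counts, not from the raw samples. The sketch below approximates that estimate with linear interpolation inside the matching bucket, which is roughly what Prometheus's `histogram_quantile()` does; the function and the sample bucket counts are illustrative assumptions, not the server's actual code.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound, contain at least one finite
    bucket, and end with an overflow bucket whose bound is float("inf").
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")

    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Falls in the overflow bucket: the best available answer
                # is the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (upper_bound - prev_bound) * fraction
        prev_bound, prev_count = upper_bound, count
    return float("nan")


# Cumulative counts: 50 requests took <= 0.1s, 90 took <= 0.25s, and so on
buckets = [(0.1, 50), (0.25, 90), (0.5, 99), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)
p999 = histogram_quantile(0.999, buckets)
print(f"p95 ~ {p95:.3f}s, p99.9 ~ {p999:.3f}s")
```

The estimate can never be more precise than the bucket it lands in, which is why buckets are usually placed densely around the latencies that matter for the SLO.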

Distributed Tracing

How Tracing Works

┌─────────────────────────────────────────────────────────────────┐
│              Distributed Trace                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Trace ID: abc123 (unique per request)                         │
│                                                                 │
│  API Gateway                                                    │
│  ├─ Span 1: gateway (trace_id=abc123, span_id=001)             │
│  │  │ start: 0ms, duration: 150ms                              │
│  │  │                                                          │
│  │  ├─ Span 2: auth-service (parent=001, span_id=002)         │
│  │  │  │ start: 5ms, duration: 20ms                           │
│  │  │  └─ tags: {user_id: "usr_123"}                          │
│  │  │                                                          │
│  │  ├─ Span 3: order-service (parent=001, span_id=003)        │
│  │  │  │ start: 30ms, duration: 100ms                         │
│  │  │  │                                                       │
│  │  │  ├─ Span 4: database (parent=003, span_id=004)          │
│  │  │  │  │ start: 35ms, duration: 45ms                       │
│  │  │  │  └─ tags: {query: "SELECT..."}                       │
│  │  │  │                                                       │
│  │  │  └─ Span 5: payment-service (parent=003, span_id=005)   │
│  │  │     │ start: 85ms, duration: 40ms                       │
│  │  │     └─ tags: {amount: 150.00}                           │
│  │  │                                                          │
│  │  └─ Span 6: notification (parent=001, span_id=006)         │
│  │     │ start: 135ms, duration: 10ms                         │
│  │     └─ tags: {type: "email"}                               │
│  │                                                              │
│  └─ Total: 150ms                                               │
│                                                                 │
│  Visualized:                                                   │
│  ┌──────────────────────────────────────────────────────┐     │
│  │ gateway                                              │     │
│  │  ├─auth──┤                                           │     │
│  │      ├──────────────order-service────────────────┤   │     │
│  │      │   ├────database────┤                      │   │     │
│  │      │                    ├────payment────┤      │   │     │
│  │                                              ├─notif─┤     │
│  └──────────────────────────────────────────────────────┘     │
│  0ms      25ms      50ms      75ms      100ms     125ms 150ms │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
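
One thing a trace makes easy is seeing where the time actually went. As a minimal sketch using the span timings from the diagram above, the hypothetical `self_times` helper subtracts each span's direct children from its own duration, assuming the children run one after another rather than in parallel.

```python
from collections import defaultdict

# (span_id, parent_id, name, start_ms, duration_ms), copied from the diagram
spans = [
    ("001", None,  "gateway",           0, 150),
    ("002", "001", "auth-service",      5,  20),
    ("003", "001", "order-service",    30, 100),
    ("004", "003", "database",         35,  45),
    ("005", "003", "payment-service",  85,  40),
    ("006", "001", "notification",    135,  10),
]


def self_times(spans):
    """Time each span spent in its own code, excluding direct children.

    Assumes the children of a span do not overlap in time.
    """
    child_total = defaultdict(int)
    for _, parent_id, _, _, duration in spans:
        if parent_id is not None:
            child_total[parent_id] += duration
    return {
        name: duration - child_total[span_id]
        for span_id, _, name, _, duration in spans
    }


result = self_times(spans)
for name, ms in sorted(result.items(), key=lambda item: -item[1]):
    print(f"{name:16s} {ms:4d} ms")
```

Here the database (45 ms) and the payment call (40 ms) dominate, while order-service itself accounts for only 15 ms of its 100 ms span.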

Implementing Tracing

import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Propagate context through HTTP headers
async def call_service(service_url: str, payload: dict, headers: dict):
    # Get current context and inject trace headers
    inject(headers)
    
    async with httpx.AsyncClient() as client:
        return await client.post(service_url, json=payload, headers=headers)

# Start a new span
@app.post("/orders")
async def create_order(request: Request, order: OrderCreate):
    # Extract context from incoming request
    context = extract(dict(request.headers))
    
    with tracer.start_as_current_span("create_order", context=context) as span:
        # Add attributes (searchable in UI)
        span.set_attribute("user_id", order.user_id)
        span.set_attribute("order_total", order.total)
        
        # Child span for database
        with tracer.start_as_current_span("db.insert_order") as db_span:
            db_span.set_attribute("db.system", "postgresql")
            db_span.set_attribute("db.statement", "INSERT INTO orders...")
            order_id = await db.insert_order(order)
        
        # Child span for external service
        with tracer.start_as_current_span("payment.charge") as pay_span:
            pay_span.set_attribute("payment.amount", order.total)
            try:
                await call_service(
                    "http://payment-service/charge",
                    {"amount": order.total},
                    {}
                )
            except Exception as e:
                pay_span.record_exception(e)
                pay_span.set_status(trace.Status(trace.StatusCode.ERROR))
                raise
        
        return {"order_id": order_id}
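
With OpenTelemetry's default propagator, `inject` and `extract` typically carry the context in a W3C Trace Context `traceparent` header. The sketch below shows the version `00` layout of that header using only the standard library; the helper names are hypothetical and this is not the library's implementation.

```python
import re
import secrets

# version - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the header's fields, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Header for an outgoing call: same trace ID, freshly minted span ID."""
    parsed = parse_traceparent(header)
    if parsed is None:
        # No usable incoming context: start a new trace
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, sampled = parsed["trace_id"], parsed["sampled"]
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(parse_traceparent(incoming))
print(outgoing)
```

The trace ID stays constant across every hop, which is what lets the backend stitch spans from different services into one trace.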

Structured Logging

Log Format Best Practices

import structlog

# Configure structured logging
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()

# Good: Structured, searchable, includes context
logger.info(
    "order_created",
    order_id="ord_123",
    user_id="usr_456",
    total=150.00,
    items_count=3,
    trace_id="abc123",
    duration_ms=45
)
# Output: {"event": "order_created", "order_id": "ord_123", 
#          "user_id": "usr_456", "total": 150.00, ...}

# Bad: Unstructured, hard to search
logger.info(f"Order ord_123 created for user usr_456 with total $150.00")
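
To make "searchable" concrete, here is a small sketch that filters JSON log lines by exact field values, which is essentially what a log backend's field query does. The log lines and the `find_events` helper are made up for illustration.

```python
import json

log_lines = [
    '{"event": "order_created", "order_id": "ord_123", "user_id": "usr_456", "total": 150.0}',
    '{"event": "order_created", "order_id": "ord_124", "user_id": "usr_789", "total": 20.0}',
    '{"event": "payment_failed", "order_id": "ord_124", "user_id": "usr_789"}',
]


def find_events(lines, **criteria):
    """Return parsed entries whose fields equal every given criterion."""
    matches = []
    for line in lines:
        entry = json.loads(line)
        if all(entry.get(key) == value for key, value in criteria.items()):
            matches.append(entry)
    return matches


# Everything that happened for one user, without any regex on free text
for entry in find_events(log_lines, user_id="usr_789"):
    print(entry["event"], entry["order_id"])
```

The unstructured variant would need a pattern tailored to each message wording to answer the same question.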

Log Levels

┌─────────────────────────────────────────────────────────────────┐
│                   Log Level Guidelines                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  DEBUG   │ Detailed diagnostic info                            │
│          │ Use: Development, troubleshooting                   │
│          │ Example: "Cache key: user:123, value: {...}"        │
│          │                                                      │
│  INFO    │ Normal operations                                   │
│          │ Use: Track key events                               │
│          │ Example: "Order created", "User logged in"          │
│          │                                                      │
│  WARNING │ Potential issues (but still working)                │
│          │ Use: Things that should be investigated             │
│          │ Example: "Rate limit approaching", "Retry succeeded"│
│          │                                                      │
│  ERROR   │ Operation failed (but service continues)            │
│          │ Use: Failed requests, exceptions                    │
│          │ Example: "Payment failed", "DB connection timeout"  │
│          │                                                      │
│  FATAL   │ Service is unusable                                 │
│          │ Use: Critical failures requiring immediate action   │
│          │ Example: "Database unreachable", "Config missing"   │
│                                                                 │
│  PRODUCTION LOG LEVELS:                                        │
│  • Default: INFO and above                                     │
│  • Per-service override for debugging                          │
│  • Avoid DEBUG by default (too verbose and costly)             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
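
The "INFO by default, per-service override" guideline maps directly onto the standard `logging` hierarchy. A minimal sketch, assuming hypothetical logger names and a hypothetical `set_debug_override` helper:

```python
import logging

# Production default: INFO and above for everything
root = logging.getLogger()
root.setLevel(logging.INFO)

# Hypothetical per-service loggers; both inherit the root level
payment_log = logging.getLogger("payment-service")
orders_log = logging.getLogger("order-service")


def set_debug_override(logger_name: str, enabled: bool) -> None:
    """Temporarily turn DEBUG on for one logger, or hand control back to root."""
    level = logging.DEBUG if enabled else logging.NOTSET
    logging.getLogger(logger_name).setLevel(level)


# Investigating a payment issue: raise verbosity for that service only
set_debug_override("payment-service", True)
print(payment_log.isEnabledFor(logging.DEBUG))  # payment-service now logs DEBUG
print(orders_log.isEnabledFor(logging.DEBUG))   # order-service stays at INFO
```

Setting the level back to `NOTSET` makes the logger inherit from the root again, so the override leaves no residue once the investigation is over.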

Correlation IDs

from contextvars import ContextVar
import uuid

import structlog

# Context variable for request correlation
correlation_id_ctx: ContextVar[str] = ContextVar('correlation_id')

class CorrelationMiddleware:
    """
    Ensures every request has a correlation ID
    """
    
    async def __call__(self, request, call_next):
        # Get from header or generate new
        correlation_id = request.headers.get(
            'X-Correlation-ID',
            str(uuid.uuid4())
        )
        
        # Set in context for logging
        correlation_id_ctx.set(correlation_id)
        
        # Process request
        response = await call_next(request)
        
        # Add to response headers
        response.headers['X-Correlation-ID'] = correlation_id
        return response

# Logger wrapper that attaches the correlation ID to every entry
class CorrelationLogger:
    def _log(self, level, message, **kwargs):
        kwargs['correlation_id'] = correlation_id_ctx.get(None)
        # Dispatch to logger.info / logger.error by name; structlog's
        # .log() expects a numeric level, not a string like 'info'.
        getattr(structlog.get_logger(), level)(message, **kwargs)
    
    def info(self, message, **kwargs):
        self._log('info', message, **kwargs)
    
    def error(self, message, **kwargs):
        self._log('error', message, **kwargs)
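
The same idea can be sketched with only the standard library: a `logging.Filter` reads the context variable and stamps it on every record, so call sites never pass the ID explicitly. The names below (`correlation_id_var`, `CorrelationFilter`, `ListHandler`) are illustrative assumptions, and the in-memory handler exists only to make the output easy to inspect.

```python
import logging
from contextvars import ContextVar

correlation_id_var: ContextVar[str] = ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copies the current correlation ID onto each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True  # never drops a record, only annotates it


class ListHandler(logging.Handler):
    """Collects formatted lines in memory instead of writing them out."""

    def __init__(self) -> None:
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


handler = ListHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
)

log = logging.getLogger("correlation-demo")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

# Inside a request: the middleware would set the ID, handlers just log
token = correlation_id_var.set("req-42")
log.info("order created")
correlation_id_var.reset(token)

# Outside any request the default placeholder is used
log.info("background job tick")
print("\n".join(handler.lines))
```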

Alerting

Alert Design Principles

┌─────────────────────────────────────────────────────────────────┐
│                   Alerting Best Practices                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  GOOD ALERT:                                                    │
│  • Actionable (someone needs to do something)                  │
│  • Urgent (needs attention within SLA)                         │
│  • Clear (describes what's wrong and impact)                   │
│  • Rare (not noisy, doesn't cause alert fatigue)               │
│                                                                 │
│  BAD ALERT:                                                     │
│  • "CPU at 80%" (so what? is anything broken?)                 │
│  • Fires frequently (causes alert fatigue)                     │
│  • No clear action (what should I do?)                         │
│                                                                 │
│  ALERT ON SYMPTOMS, NOT CAUSES:                                │
│                                                                 │
│  BAD:  "Database CPU > 90%"                                 │
│  GOOD: "Order creation latency > 2s for 5 minutes"         │
│                                                                 │
│  Why? The symptom (slow orders) is what matters to users.      │
│  High CPU that doesn't affect users isn't urgent.              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
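
The "for 5 minutes" part of the good alert is what keeps a single slow request from paging anyone. A minimal sketch of that sustained-threshold check follows; the `should_alert` name, its defaults, and the sample data are assumptions for illustration.

```python
def should_alert(samples, threshold_s=2.0, for_seconds=300):
    """Decide whether a symptom has persisted long enough to alert.

    `samples` is a list of (timestamp_s, latency_s) pairs, oldest first.
    Fires only if the history reaches back at least `for_seconds` and
    every sample inside that trailing window breaches the threshold.
    """
    if not samples:
        return False
    now = samples[-1][0]
    window_start = now - for_seconds
    covers_window = samples[0][0] <= window_start
    window = [latency for ts, latency in samples if ts >= window_start]
    return covers_window and all(latency > threshold_s for latency in window)


# One sample per minute
sustained = [(t, 2.5) for t in range(0, 361, 60)]              # slow for 6 minutes
spike = [(t, 0.3) for t in range(0, 301, 60)] + [(360, 5.0)]   # one bad sample
too_short = [(t, 2.5) for t in range(0, 121, 60)]              # only 2 minutes of data

print(should_alert(sustained), should_alert(spike), should_alert(too_short))
```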

Production-Ready Observability Implementation

Complete observability stack with metrics, logging, and tracing:

Python

import time
import logging
import functools
from typing import Optional, Dict, Any, Callable
from dataclasses import dataclass, field
from datetime import datetime
from contextlib import contextmanager
from contextvars import ContextVar
import json
import uuid
import asyncio

# ============== Metrics Collection ==============
from prometheus_client import Counter, Histogram, Gauge, Info, REGISTRY
from prometheus_client.exposition import generate_latest

class MetricsCollector:
    """Production-ready metrics collection"""
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        
        # RED Method Metrics
        self.request_count = Counter(
            'http_requests_total',
            'Total HTTP requests',
            ['service', 'method', 'endpoint', 'status']
        )
        
        self.request_latency = Histogram(
            'http_request_duration_seconds',
            'HTTP request latency',
            ['service', 'method', 'endpoint'],
            buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        )
        
        self.request_in_progress = Gauge(
            'http_requests_in_progress',
            'HTTP requests currently being processed',
            ['service', 'endpoint']
        )
        
        # USE Method Metrics
        self.connection_pool_size = Gauge(
            'connection_pool_size',
            'Current connection pool size',
            ['service', 'pool_name']
        )
        
        self.queue_size = Gauge(
            'queue_size',
            'Current queue size',
            ['service', 'queue_name']
        )
        
        # Business Metrics
        self.orders_total = Counter(
            'orders_total',
            'Total orders processed',
            ['service', 'status', 'payment_method']
        )
        
        self.revenue_total = Counter(
            'revenue_total_cents',
            'Total revenue in cents',
            ['service', 'currency']
        )
        
        # Service Info
        self.service_info = Info(
            'service_info',
            'Service information'
        )
        self.service_info.info({
            'service': service_name,
            'version': '1.0.0'
        })
    
    def track_request(self, method: str, endpoint: str):
        """Decorator to track request metrics"""
        def decorator(func: Callable):
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                self.request_in_progress.labels(
                    service=self.service_name,
                    endpoint=endpoint
                ).inc()
                
                start_time = time.time()
                status = "success"
                
                try:
                    result = await func(*args, **kwargs)
                    return result
                except Exception:
                    status = "error"
                    raise
                finally:
                    duration = time.time() - start_time
                    
                    self.request_count.labels(
                        service=self.service_name,
                        method=method,
                        endpoint=endpoint,
                        status=status
                    ).inc()
                    
                    self.request_latency.labels(
                        service=self.service_name,
                        method=method,
                        endpoint=endpoint
                    ).observe(duration)
                    
                    self.request_in_progress.labels(
                        service=self.service_name,
                        endpoint=endpoint
                    ).dec()
            
            return async_wrapper
        return decorator
    
    def export(self) -> bytes:
        """Export metrics in Prometheus format"""
        return generate_latest(REGISTRY)

# ============== Distributed Tracing ==============
trace_context: ContextVar[Dict[str, Any]] = ContextVar('trace_context', default={})

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    operation_name: str
    service_name: str
    start_time: float
    end_time: Optional[float] = None
    tags: Dict[str, Any] = field(default_factory=dict)
    logs: list = field(default_factory=list)
    status: str = "OK"
    
    def set_tag(self, key: str, value: Any) -> 'Span':
        self.tags[key] = value
        return self
    
    def log(self, message: str, **kwargs) -> 'Span':
        self.logs.append({
            "timestamp": time.time(),
            "message": message,
            **kwargs
        })
        return self
    
    def set_error(self, error: Exception) -> 'Span':
        self.status = "ERROR"
        self.tags["error"] = True
        self.tags["error.message"] = str(error)
        self.tags["error.type"] = type(error).__name__
        return self
    
    def finish(self) -> None:
        self.end_time = time.time()
    
    def duration_ms(self) -> float:
        if self.end_time is None:
            return 0
        return (self.end_time - self.start_time) * 1000
    
    def to_dict(self) -> Dict[str, Any]:
        return {
            "traceId": self.trace_id,
            "spanId": self.span_id,
            "parentSpanId": self.parent_span_id,
            "operationName": self.operation_name,
            "serviceName": self.service_name,
            "startTime": int(self.start_time * 1_000_000),
            "duration": int(self.duration_ms() * 1000),
            "tags": self.tags,
            "logs": self.logs,
            "status": self.status
        }

class Tracer:
    """Distributed tracing implementation"""
    
    def __init__(self, service_name: str, exporter: Optional['SpanExporter'] = None):
        self.service_name = service_name
        self.exporter = exporter or ConsoleSpanExporter()
    
    @contextmanager
    def start_span(self, operation_name: str, **tags):
        """Start a new span as context manager"""
        ctx = trace_context.get()
        
        trace_id = ctx.get('trace_id') or self._generate_id()
        parent_span_id = ctx.get('span_id')
        span_id = self._generate_id()
        
        span = Span(
            trace_id=trace_id,
            span_id=span_id,
            parent_span_id=parent_span_id,
            operation_name=operation_name,
            service_name=self.service_name,
            start_time=time.time(),
            tags=tags
        )
        
        # Set new context
        new_ctx = {
            'trace_id': trace_id,
            'span_id': span_id,
            'parent_span_id': parent_span_id
        }
        token = trace_context.set(new_ctx)
        
        try:
            yield span
        except Exception as e:
            span.set_error(e)
            raise
        finally:
            span.finish()
            self.exporter.export(span)
            trace_context.reset(token)
    
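    def inject_headers(self) -> Dict[str, str]:
        """Inject trace context into HTTP headers"""
        ctx = trace_context.get()
        # Keys can be present with a None value (e.g. a root span has no
        # parent), so use `or ''` to keep every header value a string.
        return {
            'X-Trace-ID': ctx.get('trace_id') or '',
            'X-Span-ID': ctx.get('span_id') or '',
            'X-Parent-Span-ID': ctx.get('parent_span_id') or ''
        }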
    
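    def extract_headers(self, headers: Dict[str, str]) -> None:
        """Extract trace context from HTTP headers"""
        # Header names are case-insensitive and some frameworks
        # (e.g. Starlette) lowercase them, so normalize before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        ctx = {
            'trace_id': lowered.get('x-trace-id') or self._generate_id(),
            'span_id': lowered.get('x-span-id') or None,
            'parent_span_id': lowered.get('x-parent-span-id') or None
        }
        trace_context.set(ctx)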
    
    def _generate_id(self) -> str:
        return uuid.uuid4().hex[:16]

class ConsoleSpanExporter:
    def export(self, span: Span) -> None:
        print(json.dumps(span.to_dict(), indent=2))

class JaegerExporter:
    """Export spans to Jaeger"""
    
    def __init__(self, agent_host: str = "localhost", agent_port: int = 6831):
        self.agent_host = agent_host
        self.agent_port = agent_port
        self.batch: list = []
        self.batch_size = 100
    
    def export(self, span: Span) -> None:
        self.batch.append(span.to_dict())
        
        if len(self.batch) >= self.batch_size:
            self._flush()
    
    def _flush(self) -> None:
        if not self.batch:
            return
        
        # Illustrative only: a real Jaeger agent expects Thrift-encoded spans
        # on UDP port 6831, not JSON, and a datagram is limited to about 64 KB.
        import socket
        payload = json.dumps({"spans": self.batch}).encode()
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (self.agent_host, self.agent_port))
        
        self.batch = []

# ============== Structured Logging ==============
correlation_id_ctx: ContextVar[str] = ContextVar('correlation_id', default='')

class StructuredLogger:
    """Production structured logging with correlation"""
    
    def __init__(self, service_name: str, log_level: str = "INFO"):
        self.service_name = service_name
        self.logger = logging.getLogger(service_name)
        self.logger.setLevel(getattr(logging, log_level))
        
        # JSON formatter
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        self.logger.addHandler(handler)
    
    def _enrich(self, extra: Dict[str, Any]) -> Dict[str, Any]:
        """Add standard fields to log entry"""
        ctx = trace_context.get()
        return {
            "service": self.service_name,
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "trace_id": ctx.get('trace_id', ''),
            "span_id": ctx.get('span_id', ''),
            "correlation_id": correlation_id_ctx.get(''),
            **extra
        }
    
    def info(self, message: str, **kwargs) -> None:
        self.logger.info(message, extra=self._enrich(kwargs))
    
    def error(self, message: str, **kwargs) -> None:
        self.logger.error(message, extra=self._enrich(kwargs))
    
    def warning(self, message: str, **kwargs) -> None:
        self.logger.warning(message, extra=self._enrich(kwargs))
    
    def debug(self, message: str, **kwargs) -> None:
        self.logger.debug(message, extra=self._enrich(kwargs))

class JsonFormatter(logging.Formatter):
    """Format logs as JSON"""
    
    def format(self, record: logging.LogRecord) -> str:
        log_entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        
        # Add extra fields
        if hasattr(record, '__dict__'):
            for key, value in record.__dict__.items():
                if key not in ['name', 'msg', 'args', 'levelname', 'levelno',
                              'pathname', 'filename', 'module', 'lineno',
                              'funcName', 'created', 'msecs', 'relativeCreated',
                              'thread', 'threadName', 'processName', 'process',
                              'getMessage', 'exc_info', 'exc_text', 'stack_info']:
                    log_entry[key] = value
        
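        # default=str keeps logging from raising on non-JSON values
        # (datetimes, exceptions, custom objects) passed as extra fields
        return json.dumps(log_entry, default=str)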

# ============== Full Observability Middleware ==============
class ObservabilityMiddleware:
    """Combined metrics, tracing, and logging middleware"""
    
    def __init__(
        self, 
        app, 
        service_name: str,
        metrics: MetricsCollector,
        tracer: Tracer,
        logger: StructuredLogger
    ):
        self.app = app
        self.service_name = service_name
        self.metrics = metrics
        self.tracer = tracer
        self.logger = logger
    
    async def __call__(self, request, call_next):
        # Extract or generate correlation ID
        correlation_id = request.headers.get(
            'X-Correlation-ID', 
            str(uuid.uuid4())
        )
        correlation_id_ctx.set(correlation_id)
        
        # Extract trace context
        self.tracer.extract_headers(dict(request.headers))
        
        endpoint = request.url.path
        method = request.method
        
        with self.tracer.start_span(
            f"{method} {endpoint}",
            http_method=method,
            http_url=str(request.url)
        ) as span:
            
            self.logger.info(
                "request_started",
                method=method,
                path=endpoint,
                user_agent=request.headers.get('user-agent')
            )
            
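            start_time = time.time()
            status_code = 500  # assumed when the handler raises before responding
            
            try:
                response = await call_next(request)
                status_code = response.status_code
                
                span.set_tag("http.status_code", status_code)
                
                self.logger.info(
                    "request_completed",
                    method=method,
                    path=endpoint,
                    status_code=status_code,
                    duration_ms=round((time.time() - start_time) * 1000, 2)
                )
                
                # Add trace headers to response
                response.headers['X-Trace-ID'] = span.trace_id
                response.headers['X-Correlation-ID'] = correlation_id
                
                return response
                
            except Exception as e:
                span.set_error(e)
                
                self.logger.error(
                    "request_failed",
                    method=method,
                    path=endpoint,
                    error=str(e),
                    error_type=type(e).__name__,
                    duration_ms=round((time.time() - start_time) * 1000, 2)
                )
                
                raise
            
            finally:
                # Record RED metrics for every request, success or failure.
                # Raw paths as label values can explode cardinality; prefer
                # a route template if the framework exposes one.
                self.metrics.request_count.labels(
                    service=self.service_name,
                    method=method,
                    endpoint=endpoint,
                    status=str(status_code)
                ).inc()
                self.metrics.request_latency.labels(
                    service=self.service_name,
                    method=method,
                    endpoint=endpoint
                ).observe(time.time() - start_time)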

JavaScript

const { v4: uuidv4 } = require('uuid');
const { AsyncLocalStorage } = require('async_hooks');

// ============== Context Storage ==============
const traceContext = new AsyncLocalStorage();
const correlationContext = new AsyncLocalStorage();

// ============== Metrics Collection ==============
const promClient = require('prom-client');

class MetricsCollector {
  constructor(serviceName) {
    this.serviceName = serviceName;
    
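    // Enable default metrics. Prometheus metric names cannot contain '-',
    // so sanitize the service name before using it as a prefix.
    const prefix = `${serviceName.replace(/[^a-zA-Z0-9_]/g, '_')}_`;
    promClient.collectDefaultMetrics({ prefix });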

    // RED Method Metrics
    this.requestCount = new promClient.Counter({
      name: 'http_requests_total',
      help: 'Total HTTP requests',
      labelNames: ['service', 'method', 'endpoint', 'status']
    });

    this.requestLatency = new promClient.Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request latency',
      labelNames: ['service', 'method', 'endpoint'],
      buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
    });

    this.requestsInProgress = new promClient.Gauge({
      name: 'http_requests_in_progress',
      help: 'HTTP requests currently being processed',
      labelNames: ['service', 'endpoint']
    });

    // Business Metrics
    this.ordersTotal = new promClient.Counter({
      name: 'orders_total',
      help: 'Total orders processed',
      labelNames: ['service', 'status', 'payment_method']
    });

    this.revenueTotal = new promClient.Counter({
      name: 'revenue_total_cents',
      help: 'Total revenue in cents',
      labelNames: ['service', 'currency']
    });
  }

  trackRequest(method, endpoint) {
    return (handler) => {
      return async (req, res, next) => {
        const labels = { 
          service: this.serviceName, 
          method, 
          endpoint 
        };

        this.requestsInProgress.inc({ 
          service: this.serviceName, 
          endpoint 
        });

        const endTimer = this.requestLatency.startTimer(labels);
        let status = 'success';

        try {
          await handler(req, res, next);
        } catch (error) {
          status = 'error';
          throw error;
        } finally {
          endTimer();
          
          this.requestCount.inc({ ...labels, status });
          this.requestsInProgress.dec({ 
            service: this.serviceName, 
            endpoint 
          });
        }
      };
    };
  }

  async export() {
    return promClient.register.metrics();
  }
}

// ============== Distributed Tracing ==============
class Span {
  constructor(traceId, spanId, parentSpanId, operationName, serviceName) {
    this.traceId = traceId;
    this.spanId = spanId;
    this.parentSpanId = parentSpanId;
    this.operationName = operationName;
    this.serviceName = serviceName;
    this.startTime = Date.now();
    this.endTime = null;
    this.tags = {};
    this.logs = [];
    this.status = 'OK';
  }

  setTag(key, value) {
    this.tags[key] = value;
    return this;
  }

  log(message, fields = {}) {
    this.logs.push({
      timestamp: Date.now(),
      message,
      ...fields
    });
    return this;
  }

  setError(error) {
    this.status = 'ERROR';
    this.tags.error = true;
    this.tags['error.message'] = error.message;
    this.tags['error.type'] = error.name;
    return this;
  }

  finish() {
    this.endTime = Date.now();
  }

  durationMs() {
    return this.endTime ? this.endTime - this.startTime : 0;
  }

  toJSON() {
    return {
      traceId: this.traceId,
      spanId: this.spanId,
      parentSpanId: this.parentSpanId,
      operationName: this.operationName,
      serviceName: this.serviceName,
      startTime: this.startTime * 1000,    // milliseconds -> microseconds
      duration: this.durationMs() * 1000,  // milliseconds -> microseconds
      tags: this.tags,
      logs: this.logs,
      status: this.status
    };
  }
}

class Tracer {
  constructor(serviceName, exporter = null) {
    this.serviceName = serviceName;
    this.exporter = exporter || new ConsoleSpanExporter();
  }

  startSpan(operationName, tags = {}) {
    const ctx = traceContext.getStore() || {};
    
    const traceId = ctx.traceId || this.generateId();
    const parentSpanId = ctx.spanId;
    const spanId = this.generateId();

    const span = new Span(
      traceId,
      spanId,
      parentSpanId,
      operationName,
      this.serviceName
    );

    Object.entries(tags).forEach(([k, v]) => span.setTag(k, v));

    return span;
  }

  async runInSpan(operationName, tags, fn) {
    const span = this.startSpan(operationName, tags);
    
    const newCtx = {
      traceId: span.traceId,
      spanId: span.spanId,
      parentSpanId: span.parentSpanId
    };

    try {
      return await traceContext.run(newCtx, fn);
    } catch (error) {
      span.setError(error);
      throw error;
    } finally {
      span.finish();
      this.exporter.export(span);
    }
  }

  injectHeaders() {
    const ctx = traceContext.getStore() || {};
    return {
      'X-Trace-ID': ctx.traceId || '',
      'X-Span-ID': ctx.spanId || '',
      'X-Parent-Span-ID': ctx.parentSpanId || ''
    };
  }

  extractHeaders(headers) {
    const ctx = {
      traceId: headers['x-trace-id'] || this.generateId(),
      spanId: headers['x-span-id'],
      parentSpanId: headers['x-parent-span-id']
    };
    return ctx;
  }

  generateId() {
    return uuidv4().replace(/-/g, '').substring(0, 16);
  }
}

class ConsoleSpanExporter {
  export(span) {
    console.log(JSON.stringify(span.toJSON(), null, 2));
  }
}

// ============== Structured Logging ==============
class StructuredLogger {
  constructor(serviceName, logLevel = 'info') {
    this.serviceName = serviceName;
    this.logLevels = { debug: 0, info: 1, warn: 2, error: 3 };
    // Use ?? rather than ||: the 'debug' level is 0, which || would treat as missing
    this.currentLevel = this.logLevels[logLevel] ?? 1;
  }

  enrich(extra) {
    const ctx = traceContext.getStore() || {};
    const correlationId = correlationContext.getStore() || '';

    return {
      service: this.serviceName,
      timestamp: new Date().toISOString(),
      traceId: ctx.traceId || '',
      spanId: ctx.spanId || '',
      correlationId,
      ...extra
    };
  }

  log(level, message, extra = {}) {
    if (this.logLevels[level] < this.currentLevel) return;

    const entry = {
      level: level.toUpperCase(),
      message,
      ...this.enrich(extra)
    };

    console.log(JSON.stringify(entry));
  }

  info(message, extra) { this.log('info', message, extra); }
  error(message, extra) { this.log('error', message, extra); }
  warn(message, extra) { this.log('warn', message, extra); }
  debug(message, extra) { this.log('debug', message, extra); }
}

// ============== Express Middleware ==============
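function observabilityMiddleware(serviceName, metrics, tracer, logger) {
  return (req, res, next) => {
    // Extract or generate correlation ID
    const correlationId = req.headers['x-correlation-id'] || uuidv4();

    // Extract trace context from headers so this span joins the caller's trace
    const traceCtx = tracer.extractHeaders(req.headers);

    const endpoint = req.path;
    const method = req.method;

    correlationContext.run(correlationId, () => {
      // startSpan reads its parent from traceContext, so seed it with the
      // incoming context before creating the request span
      const span = traceContext.run(traceCtx, () =>
        tracer.startSpan(`${method} ${endpoint}`, {
          'http.method': method,
          'http.url': req.url
        })
      );

      // Make the request span the current context for downstream handlers
      const spanCtx = {
        traceId: span.traceId,
        spanId: span.spanId,
        parentSpanId: span.parentSpanId
      };

      traceContext.run(spanCtx, () => {
        logger.info('request_started', {
          method,
          path: endpoint,
          userAgent: req.headers['user-agent']
        });

        const startTime = Date.now();
        // Note: req.path can be high-cardinality; prefer the route template if available
        const labels = { service: serviceName, method, endpoint };

        // Track response
        const originalEnd = res.end;
        res.end = function(...args) {
          res.end = originalEnd; // restore so this hook runs only once
          const durationMs = Date.now() - startTime;

          span.setTag('http.status_code', res.statusCode);

          // Record RED metrics for this request
          metrics.requestLatency.observe(labels, durationMs / 1000);
          metrics.requestCount.inc({ ...labels, status: String(res.statusCode) });

          logger.info('request_completed', {
            method,
            path: endpoint,
            statusCode: res.statusCode,
            durationMs
          });

          // Add trace headers to response (only possible before headers are flushed)
          if (!res.headersSent) {
            res.setHeader('X-Trace-ID', span.traceId);
            res.setHeader('X-Correlation-ID', correlationId);
          }

          // Finish and export the span once the response is complete
          span.finish();
          tracer.exporter.export(span);
          return originalEnd.apply(this, args);
        };

        next();
      });
    });
  };
}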

// ============== Usage Example ==============
const express = require('express');
const app = express();

const metrics = new MetricsCollector('order-service');
const tracer = new Tracer('order-service');
const logger = new StructuredLogger('order-service');

// Apply middleware
app.use(observabilityMiddleware('order-service', metrics, tracer, logger));

// Metrics endpoint
app.get('/metrics', async (req, res) => {
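  // Use the registry's content type so scrapers get the expected exposition format
  res.set('Content-Type', promClient.register.contentType);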
  res.send(await metrics.export());
});

// Health check
app.get('/health', (req, res) => {
  res.json({ status: 'healthy', timestamp: new Date().toISOString() });
});

module.exports = {
  MetricsCollector,
  Tracer,
  Span,
  StructuredLogger,
  observabilityMiddleware
};

Alert Template

# Example Prometheus alerting rule (routed to PagerDuty/Opsgenie)
alert: OrderServiceHighLatency
expr: histogram_quantile(0.99, rate(order_request_duration_seconds_bucket[5m])) > 2
for: 5m
labels:
  severity: critical
  service: order-service
  team: payments
annotations:
  summary: "Order service p99 latency is {{ $value }}s (threshold: 2s)"
  description: |
    Impact: Users are experiencing slow checkout.
    
    Dashboard: https://grafana.example.com/d/orders
    Runbook: https://wiki.example.com/runbooks/order-latency
    
    Possible causes:
    - Database connection pool exhausted
    - Payment service slow
    - Increased traffic
    
    Immediate actions:
    1. Check dashboard for traffic spike
    2. Check payment-service health
    3. Check database metrics
  runbook_url: https://wiki.example.com/runbooks/order-latency
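
The `histogram_quantile(0.99, ...)` expression in the rule above estimates a percentile from cumulative bucket counts rather than from raw request durations. The sketch below shows the general idea (linear interpolation inside the bucket that contains the target rank). It is a simplified illustration, not Prometheus's exact implementation; the bucket counts are made up, and plain counts stand in for the per-second rates the real query uses.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), ending with +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # no upper edge to interpolate towards
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket counts for the order service over one window.
observed = [(0.5, 900), (1.0, 950), (2.5, 980), (5.0, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, observed)
print(f"p99 ~ {p99:.2f}s, alert condition met: {p99 > 2}")
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries: placing a boundary near the alert threshold (2s here) keeps the comparison meaningful.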

Alert Severity Levels

┌─────────────────────────────────────────────────────────────────┐
│                   Severity Levels                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SEV-1 (Critical) - Page immediately, all hands                │
│  ─────────────────────────────────────────────                  │
│  • Service is down                                             │
│  • Data loss occurring                                         │
│  • Security breach                                             │
│  Response: 15 minutes                                          │
│                                                                 │
│  SEV-2 (High) - Page on-call                                   │
│  ───────────────────────────                                    │
│  • Major feature degraded                                      │
│  • Significant latency increase                                │
│  • Error rate > 5%                                             │
│  Response: 30 minutes                                          │
│                                                                 │
│  SEV-3 (Medium) - Ticket, fix during business hours            │
│  ───────────────────────────────────────────────                │
│  • Non-critical feature broken                                 │
│  • Performance degradation (not severe)                        │
│  Response: 4 hours                                             │
│                                                                 │
│  SEV-4 (Low) - Ticket, fix when convenient                     │
│  ─────────────────────────────────────────                      │
│  • Minor issues                                                │
│  • Cosmetic problems                                           │
│  Response: 1 week                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
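
One way to make the levels above actionable is to encode them as data that routing code can consult. The following is a hypothetical sketch: `SeverityPolicy`, `POLICIES`, and `route` are illustrative names and do not correspond to any paging tool's API. The response times are copied from the table.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    page: bool                 # True: page a human, False: open a ticket
    respond_within: timedelta  # target response time from the table


# Values mirror the severity table above.
POLICIES = {
    1: SeverityPolicy("SEV-1", page=True, respond_within=timedelta(minutes=15)),
    2: SeverityPolicy("SEV-2", page=True, respond_within=timedelta(minutes=30)),
    3: SeverityPolicy("SEV-3", page=False, respond_within=timedelta(hours=4)),
    4: SeverityPolicy("SEV-4", page=False, respond_within=timedelta(weeks=1)),
}


def route(severity: int) -> str:
    """Describe how an alert of the given severity should be handled."""
    policy = POLICIES[severity]
    action = "page on-call" if policy.page else "open ticket"
    return f"{policy.name}: {action}, respond within {policy.respond_within}"


print(route(2))  # SEV-2: page on-call, respond within 0:30:00
```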

Dashboards

Dashboard Design

┌─────────────────────────────────────────────────────────────────┐
│                   Dashboard Layout                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  TOP: High-level health (GREEN/YELLOW/RED)                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  [OK] API Gateway  [OK] Orders  [WARN] Payments  [OK] Database    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  REQUEST METRICS (RED method):                                 │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐  │
│  │ Request Rate    │ │ Error Rate      │ │ Latency p99     │  │
│  │     ^           │ │     v           │ │     ~           │  │
│  │  1,234 req/s    │ │    0.05%        │ │    145ms        │  │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘  │
│                                                                 │
│  RESOURCE UTILIZATION (USE method):                            │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐  │
│  │ CPU             │ │ Memory          │ │ Disk I/O        │  │
│  │     ▂▃▅▆▇       │ │     ▃▃▃▃▃       │ │     ▁▂▁▂▁       │  │
│  │    65%          │ │    72%          │ │    15%          │  │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘  │
│                                                                 │
│  DEPENDENCIES:                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Payment API: 50ms │ Database: 10ms │ Redis: 1ms        │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  BOTTOM: Detailed graphs, logs, traces                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

SLIs, SLOs, and SLAs

┌─────────────────────────────────────────────────────────────────┐
│              SLI / SLO / SLA                                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SLI (Service Level Indicator)                                 │
│  ─────────────────────────────                                  │
│  The metric you measure                                        │
│  Example: "99th percentile latency of API requests"            │
│                                                                 │
│  SLO (Service Level Objective)                                 │
│  ─────────────────────────────                                  │
│  The target you aim for (internal)                             │
│  Example: "99th percentile latency < 200ms"                    │
│                                                                 │
│  SLA (Service Level Agreement)                                 │
│  ─────────────────────────────                                  │
│  The contract with customers (external)                        │
│  Example: "99.9% availability or refund"                       │
│  Note: SLA should be looser than SLO (buffer)                  │
│                                                                 │
│  Example SLO Document:                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Service: Order API                                     │   │
│  │  ────────────────────                                    │   │
│  │  Availability: 99.95% of requests successful            │   │
│  │  Latency: 99% of requests < 500ms                       │   │
│  │  Throughput: Handle 10,000 req/s                        │   │
│  │                                                         │   │
│  │  Error Budget: 0.05% = 21.6 minutes/month               │   │
│  │  Current Budget Remaining: 15.3 minutes                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
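
The error-budget figure in the example SLO document follows directly from the availability target. A minimal sketch of that arithmetic, assuming a 30-day month and treating the budget as minutes of full unavailability; the function names and the 6.3 minutes of consumed downtime are illustrative assumptions chosen to reproduce the numbers in the box:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability allowed by an availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left after the downtime consumed so far."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


budget = error_budget_minutes(99.95)
print(f"{budget:.1f} minutes/month")                        # 21.6 minutes/month
print(f"{budget_remaining(99.95, 6.3):.1f} minutes left")   # 15.3 minutes left
```

When the remaining budget approaches zero, a common policy is to pause risky releases until the window rolls over.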

Senior Interview Questions

How do you debug a latency spike in production?

Systematic approach:

Check dashboards: RED metrics, identify when it started
Correlate: Deployments? Traffic spike? Dependency issue?
Trace analysis: Find slow spans in traces
Log analysis: Search for errors around that time
Narrow down: Which endpoint? Which users?

Common causes:

Database slow queries (check slow query log)
GC pauses (check GC metrics)
Connection pool exhaustion
External dependency slowdown
Lock contention
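
The trace-analysis step usually comes down to finding the span with the most self time, meaning its own duration minus the time spent in its children. A rough sketch, assuming spans are plain dicts with hypothetical `span_id`, `parent_id`, `name`, and `duration_ms` fields (not any particular tracer's export format), that span names are unique, and that child spans run sequentially:

```python
from collections import defaultdict


def self_times(spans):
    """Return {span name: self time in ms}, assuming children run sequentially."""
    child_total = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["duration_ms"]
    return {
        span["name"]: span["duration_ms"] - child_total[span["span_id"]]
        for span in spans
    }


def slowest_span(spans):
    """Name of the span that spent the most time in its own work."""
    times = self_times(spans)
    return max(times, key=times.get)


# Made-up trace for a slow checkout request.
trace = [
    {"span_id": "a", "parent_id": None, "name": "POST /orders", "duration_ms": 900},
    {"span_id": "b", "parent_id": "a", "name": "validate_cart", "duration_ms": 20},
    {"span_id": "c", "parent_id": "a", "name": "charge_payment", "duration_ms": 780},
    {"span_id": "d", "parent_id": "c", "name": "payment_api_call", "duration_ms": 750},
]
print(slowest_span(trace))  # payment_api_call
```

Here the root span is slow only because of a downstream call, which points the investigation at the external dependency rather than the order service itself.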

How do you set up monitoring for a new service?

Checklist:

Instrument code: Add metrics (RED method)
Add tracing: Propagate trace context
Structured logging: With correlation IDs
Create dashboard: Health, golden signals, resources
Set up alerts: On symptoms, not causes
Document SLOs: Define success criteria
Create runbooks: What to do when alerts fire

What's your approach to reducing alert fatigue?

Strategies:

Alert on symptoms: User impact, not causes
Use thresholds wisely: Alert when a value stays above 80% for 5 minutes, not on an instant spike
Group related alerts: One page per incident, not 10
Regular review: Delete unused, tune noisy alerts
Escalation policy: Low-priority → ticket, high → page
On-call feedback: Track alert quality metrics

Key metric: If on-call is paged but no action needed, fix the alert!
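
The threshold strategy above (a sustained breach rather than an instant spike) can be sketched as a check that fires only when every sample in the trailing window exceeds the threshold, which is roughly what a `for:` clause does in an alert rule. The sample layout and function name are assumptions made for illustration:

```python
def should_alert(samples, threshold=80.0, window_seconds=300):
    """Fire only if every sample in the trailing window exceeds the threshold.

    samples: list of (timestamp_seconds, value) tuples in ascending time order.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    # Require data covering the whole window, so a single spike cannot fire.
    if latest - samples[0][0] < window_seconds:
        return False
    window = [value for ts, value in samples if ts >= latest - window_seconds]
    return all(value > threshold for value in window)


# One sample per minute: a brief spike versus a sustained breach.
spike = [(0, 50), (60, 55), (120, 95), (180, 60), (240, 58), (300, 57)]
sustained = [(0, 85), (60, 88), (120, 91), (180, 90), (240, 87), (300, 86)]

print(should_alert(spike))      # False
print(should_alert(sustained))  # True
```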

How would you design a metrics system at scale?

Architecture:

Collection: Agent or exporter on each host (Prometheus exporters, StatsD)
Aggregation: Pre-aggregate at edge (reduce cardinality)
Storage: Time-series DB (InfluxDB, M3DB, Thanos)
Query: Federation for cross-cluster queries
Visualization: Grafana dashboards

Scale challenges:

High cardinality labels (user_id) → Aggregate
Long retention → Downsampling
Many metrics → Drop unused
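
Two of the fixes above, aggregating away a high-cardinality label and downsampling for long retention, are both simple aggregations. A small sketch, assuming counters are held in a dict keyed by tuples of `(label, value)` pairs and raw points are `(timestamp, value)` tuples; neither layout comes from a specific time-series database:

```python
from collections import defaultdict


def drop_label(series, label):
    """Sum counter values across all values of one label (e.g. user_id)."""
    merged = defaultdict(float)
    for labels, value in series.items():
        kept = tuple(sorted((k, v) for k, v in labels if k != label))
        merged[kept] += value
    return dict(merged)


def downsample(points, bucket_seconds):
    """Average raw (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


requests = {
    (("endpoint", "/orders"), ("user_id", "u1")): 3,
    (("endpoint", "/orders"), ("user_id", "u2")): 5,
    (("endpoint", "/health"), ("user_id", "u1")): 1,
}
print(drop_label(requests, "user_id"))
# {(('endpoint', '/orders'),): 8.0, (('endpoint', '/health'),): 1.0}

cpu = [(0, 10.0), (30, 20.0), (60, 40.0), (90, 60.0)]
print(downsample(cpu, 60))  # [(0, 15.0), (60, 50.0)]
```

Dropping `user_id` collapses three series into two here; with millions of users the reduction is what keeps the storage layer tractable.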
