Senior Level: Observability is how you debug production issues. Interviewers expect senior engineers to design systems that can be understood and debugged at scale.
The Three Pillars of Observability
┌─────────────────────────────────────────────────────────────────┐
│ The Three Pillars │
├─────────────────────────────────────────────────────────────────┤
│ │
│ METRICS LOGS TRACES │
│ ─────── ──── ────── │
│ "What happened?" "Why?" "Where?" │
│ │
│ Aggregated data Individual events Request flow │
│ Time-series Text/structured Distributed │
│ Cheap to store Expensive Moderate │
│ Alerting Debugging Root cause │
│ │
│ Examples: Examples: Examples: │
│ • Request rate • Error messages • Request path │
│ • Error rate • Stack traces • Latency/span │
│ • Latency (p50,p99) • User actions • Dependencies │
│ • CPU/Memory • Audit trail • Bottlenecks │
│ │
│ Tools: Tools: Tools: │
│ • Prometheus • ELK Stack • Jaeger │
│ • Datadog • Splunk • Zipkin │
│ • CloudWatch • CloudWatch Logs • AWS X-Ray │
│ │
└─────────────────────────────────────────────────────────────────┘
Metrics
Key Metrics to Track (RED Method)
┌─────────────────────────────────────────────────────────────────┐
│ RED Method (Request-focused) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ R - Rate : Requests per second │
│ E - Errors : Error rate (% of failed requests) │
│ D - Duration : Request latency (p50, p95, p99) │
│ │
│ For every service, track: │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Service: payment-service │ │
│ │ ───────────────────────────────────────── │ │
│ │ Rate: 150 req/s │ │
│ │ Errors: 0.1% (5xx), 2% (4xx) │ │
│ │ Duration: p50=45ms, p95=120ms, p99=350ms │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
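The three RED numbers in the payment-service card above map directly to PromQL queries. A minimal sketch of pulling them from Prometheus' HTTP query API, assuming a server at http://prometheus:9090 and the http_requests_total / http_request_duration_seconds metrics defined later in this section (names and labels are assumptions):

import requests

PROM_QUERY_URL = "http://prometheus:9090/api/v1/query"

queries = {
    "rate": 'sum(rate(http_requests_total{service="payment-service"}[5m]))',
    "errors": 'sum(rate(http_requests_total{service="payment-service",status=~"5.."}[5m]))'
              ' / sum(rate(http_requests_total{service="payment-service"}[5m]))',
    "duration_p99": 'histogram_quantile(0.99, '
                    'sum(rate(http_request_duration_seconds_bucket{service="payment-service"}[5m])) by (le))',
}

for name, promql in queries.items():
    resp = requests.get(PROM_QUERY_URL, params={"query": promql}, timeout=5)
    # Each result is a vector of {"metric": {...}, "value": [timestamp, "value"]}
    print(name, resp.json()["data"]["result"])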
USE Method (Resource-focused)
┌─────────────────────────────────────────────────────────────────┐
│ USE Method (Infrastructure) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ U - Utilization : % time resource is busy │
│ S - Saturation : Work resource can't service (queue length) │
│ E - Errors : Count of error events │
│ │
│ Apply to each resource: │
│ │
│ CPU: │
│ • Utilization: 75% │
│ • Saturation: Load average / CPU count │
│ • Errors: CPU errors (rare) │
│ │
│ Memory: │
│ • Utilization: Used / Total │
│ • Saturation: Swap usage, OOM events │
│ • Errors: Allocation failures │
│ │
│ Network: │
│ • Utilization: Bandwidth used │
│ • Saturation: Dropped packets, retransmits │
│ • Errors: Interface errors │
│ │
│ Disk: │
│ • Utilization: I/O time % │
│ • Saturation: I/O queue length │
│ • Errors: Bad sectors, I/O errors │
│ │
└─────────────────────────────────────────────────────────────────┘
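In production these numbers usually come from node_exporter or a cloud agent, but a minimal sketch (assuming psutil is installed; load average is Unix-only) shows where each USE signal comes from:

import os
import psutil
from prometheus_client import Gauge

cpu_utilization = Gauge('node_cpu_utilization_percent', 'CPU busy time %')
cpu_saturation = Gauge('node_load_per_cpu', 'Load average divided by CPU count')
mem_utilization = Gauge('node_memory_utilization_percent', 'Used / total memory %')
swap_used_bytes = Gauge('node_swap_used_bytes', 'Swap in use (memory saturation signal)')
net_errors = Gauge('node_network_errors_total', 'Interface errors (in + out)')

def collect_use_metrics():
    # Utilization
    cpu_utilization.set(psutil.cpu_percent(interval=1))   # blocks ~1s to sample
    mem_utilization.set(psutil.virtual_memory().percent)
    # Saturation
    cpu_saturation.set(os.getloadavg()[0] / psutil.cpu_count())
    swap_used_bytes.set(psutil.swap_memory().used)
    # Errors
    net = psutil.net_io_counters()
    net_errors.set(net.errin + net.errout)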
Metric Types
from prometheus_client import Counter, Gauge, Histogram, Summary
# COUNTER: Only goes up (request count, errors)
requests_total = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
# GAUGE: Can go up and down (current connections, queue size)
active_connections = Gauge(
'active_connections',
'Number of active connections'
)
# HISTOGRAM: Measures distribution (latency buckets)
request_latency = Histogram(
'http_request_duration_seconds',
'Request latency in seconds',
['endpoint'],
buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
# SUMMARY: Like histogram but calculates quantiles client-side
request_latency_summary = Summary(
'http_request_duration_summary',
'Request latency summary',
['endpoint']
)
# Usage in a FastAPI request middleware (requires `import time`; `app` is a FastAPI instance)
@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.time()
    active_connections.inc()
    status_code = 500  # default if call_next raises before a response exists
    try:
        response = await call_next(request)
        status_code = response.status_code
        return response
    finally:
        duration = time.time() - start
        active_connections.dec()
        requests_total.labels(
            method=request.method,
            endpoint=request.url.path,
            status=status_code
        ).inc()
        request_latency.labels(endpoint=request.url.path).observe(duration)
Distributed Tracing
How Tracing Works
┌─────────────────────────────────────────────────────────────────┐
│ Distributed Trace │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Trace ID: abc123 (unique per request) │
│ │
│ API Gateway │
│ ├─ Span 1: gateway (trace_id=abc123, span_id=001) │
│ │ │ start: 0ms, duration: 150ms │
│ │ │ │
│ │ ├─ Span 2: auth-service (parent=001, span_id=002) │
│ │ │ │ start: 5ms, duration: 20ms │
│ │ │ └─ tags: {user_id: "usr_123"} │
│ │ │ │
│ │ ├─ Span 3: order-service (parent=001, span_id=003) │
│ │ │ │ start: 30ms, duration: 100ms │
│ │ │ │ │
│ │ │ ├─ Span 4: database (parent=003, span_id=004) │
│ │ │ │ │ start: 35ms, duration: 45ms │
│ │ │ │ └─ tags: {query: "SELECT..."} │
│ │ │ │ │
│ │ │ └─ Span 5: payment-service (parent=003, span_id=005) │
│ │ │ │ start: 85ms, duration: 40ms │
│ │ │ └─ tags: {amount: 150.00} │
│ │ │ │
│ │ └─ Span 6: notification (parent=001, span_id=006) │
│ │ │ start: 135ms, duration: 10ms │
│ │ └─ tags: {type: "email"} │
│ │ │
│ └─ Total: 150ms │
│ │
│ Visualized: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ gateway │ │
│ │ ├─auth──┤ │ │
│ │ ├──────────────order-service────────────────┤ │ │
│ │ │ ├────database────┤ │ │ │
│ │ │ ├────payment────┤ │ │ │
│ │ ├─notif─┤ │
│ └──────────────────────────────────────────────────────┘ │
│ 0ms 25ms 50ms 75ms 100ms 125ms 150ms │
│ │
└─────────────────────────────────────────────────────────────────┘
Implementing Tracing
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
tracer = trace.get_tracer(__name__)
# Propagate context through HTTP headers
async def call_service(service_url: str, payload: dict, headers: dict):
# Get current context and inject trace headers
inject(headers)
async with httpx.AsyncClient() as client:
return await client.post(service_url, json=payload, headers=headers)
# Start a new span
@app.post("/orders")
async def create_order(request: Request, order: OrderCreate):
# Extract context from incoming request
context = extract(dict(request.headers))
with tracer.start_as_current_span("create_order", context=context) as span:
# Add attributes (searchable in UI)
span.set_attribute("user_id", order.user_id)
span.set_attribute("order_total", order.total)
# Child span for database
with tracer.start_as_current_span("db.insert_order") as db_span:
db_span.set_attribute("db.system", "postgresql")
db_span.set_attribute("db.statement", "INSERT INTO orders...")
order_id = await db.insert_order(order)
# Child span for external service
with tracer.start_as_current_span("payment.charge") as pay_span:
pay_span.set_attribute("payment.amount", order.total)
try:
await call_service(
f"http://payment-service/charge",
{"amount": order.total},
{}
)
except Exception as e:
pay_span.record_exception(e)
pay_span.set_status(trace.Status(trace.StatusCode.ERROR))
raise
return {"order_id": order_id}
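The handler above assumes an OpenTelemetry SDK has already been configured; without a tracer provider and exporter, the spans go nowhere. A minimal setup sketch (requires the opentelemetry-sdk and opentelemetry-exporter-otlp packages; the service name and collector endpoint are placeholders):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "order-service"})
)
# Batch spans and ship them to a collector (Jaeger, Tempo, a vendor backend, ...)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)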
Structured Logging
Log Format Best Practices
import structlog
import logging
from datetime import datetime
# Configure structured logging
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer()
]
)
logger = structlog.get_logger()
# Good: Structured, searchable, includes context
logger.info(
"order_created",
order_id="ord_123",
user_id="usr_456",
total=150.00,
items_count=3,
trace_id="abc123",
duration_ms=45
)
# Output: {"event": "order_created", "order_id": "ord_123",
# "user_id": "usr_456", "total": 150.00, ...}
# Bad: Unstructured, hard to search
logger.info(f"Order ord_123 created for user usr_456 with total $150.00")
Log Levels
┌─────────────────────────────────────────────────────────────────┐
│ Log Level Guidelines │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DEBUG │ Detailed diagnostic info │
│ │ Use: Development, troubleshooting │
│ │ Example: "Cache key: user:123, value: {...}" │
│ │ │
│ INFO │ Normal operations │
│ │ Use: Track key events │
│ │ Example: "Order created", "User logged in" │
│ │ │
│ WARNING │ Potential issues (but still working) │
│ │ Use: Things that should be investigated │
│ │ Example: "Rate limit approaching", "Retry succeeded"│
│ │ │
│ ERROR │ Operation failed (but service continues) │
│ │ Use: Failed requests, exceptions │
│ │ Example: "Payment failed", "DB connection timeout" │
│ │ │
│ FATAL │ Service is unusable │
│ │ Use: Critical failures requiring immediate action │
│ │ Example: "Database unreachable", "Config missing" │
│ │
│ PRODUCTION LOG LEVELS: │
│ • Default: INFO and above │
│ • Per-service override for debugging │
│ • Never DEBUG in production (too verbose, costly) │
│ │
└─────────────────────────────────────────────────────────────────┘
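The per-service override is usually just an environment variable read at startup, so one noisy service can be switched to DEBUG without touching the others. A minimal sketch (LOG_LEVEL is an assumed variable name):

import logging
import os

# INFO by default; set LOG_LEVEL=DEBUG on a single service while investigating
logging.basicConfig(level=os.getenv("LOG_LEVEL", "INFO").upper())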
Correlation IDs
from contextvars import ContextVar
import uuid
# Context variable for request correlation
correlation_id_ctx: ContextVar[str] = ContextVar('correlation_id')
class CorrelationMiddleware:
"""
Ensures every request has a correlation ID
"""
async def __call__(self, request, call_next):
# Get from header or generate new
correlation_id = request.headers.get(
'X-Correlation-ID',
str(uuid.uuid4())
)
# Set in context for logging
correlation_id_ctx.set(correlation_id)
# Process request
response = await call_next(request)
# Add to response headers
response.headers['X-Correlation-ID'] = correlation_id
return response
# Logger automatically includes correlation ID
class CorrelationLogger:
    def _log(self, level, message, **kwargs):
        kwargs['correlation_id'] = correlation_id_ctx.get(None)
        # Dispatch to the matching structlog method (info/error/...), with the correlation ID attached
        getattr(structlog.get_logger(), level)(message, **kwargs)
def info(self, message, **kwargs):
self._log('info', message, **kwargs)
def error(self, message, **kwargs):
self._log('error', message, **kwargs)
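The middleware handles inbound and outbound headers for a single service; the ID is only useful if it is also forwarded on calls to downstream services. A sketch, assuming httpx as the HTTP client and the correlation_id_ctx defined above:

import uuid
import httpx

async def call_downstream(url: str, payload: dict) -> httpx.Response:
    # Reuse the current request's correlation ID, or mint one for background jobs
    headers = {"X-Correlation-ID": correlation_id_ctx.get(str(uuid.uuid4()))}
    async with httpx.AsyncClient() as client:
        return await client.post(url, json=payload, headers=headers)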
Alerting
Alert Design Principles
┌─────────────────────────────────────────────────────────────────┐
│ Alerting Best Practices │
├─────────────────────────────────────────────────────────────────┤
│ │
│ GOOD ALERT: │
│ • Actionable (someone needs to do something) │
│ • Urgent (needs attention within SLA) │
│ • Clear (describes what's wrong and impact) │
│ • Rare (not noisy, doesn't cause alert fatigue) │
│ │
│ BAD ALERT: │
│ • "CPU at 80%" (so what? is anything broken?) │
│ • Fires frequently (causes alert fatigue) │
│ • No clear action (what should I do?) │
│ │
│ ALERT ON SYMPTOMS, NOT CAUSES: │
│ │
│ BAD: "Database CPU > 90%" │
│ GOOD: "Order creation latency > 2s for 5 minutes" │
│ │
│ Why? The symptom (slow orders) is what matters to users. │
│ High CPU that doesn't affect users isn't urgent. │
│ │
└─────────────────────────────────────────────────────────────────┘
Production-Ready Observability Implementation
Complete observability stack with metrics, logging, and tracing. Python implementation first, followed by the JavaScript (Express) equivalent:
import time
import logging
import functools
from typing import Optional, Dict, Any, Callable
from dataclasses import dataclass, field
from datetime import datetime
from contextlib import contextmanager
from contextvars import ContextVar
import json
import uuid
import asyncio
# ============== Metrics Collection ==============
from prometheus_client import Counter, Histogram, Gauge, Info, REGISTRY
from prometheus_client.exposition import generate_latest
class MetricsCollector:
"""Production-ready metrics collection"""
def __init__(self, service_name: str):
self.service_name = service_name
# RED Method Metrics
self.request_count = Counter(
'http_requests_total',
'Total HTTP requests',
['service', 'method', 'endpoint', 'status']
)
self.request_latency = Histogram(
'http_request_duration_seconds',
'HTTP request latency',
['service', 'method', 'endpoint'],
buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)
self.request_in_progress = Gauge(
'http_requests_in_progress',
'HTTP requests currently being processed',
['service', 'endpoint']
)
# USE Method Metrics
self.connection_pool_size = Gauge(
'connection_pool_size',
'Current connection pool size',
['service', 'pool_name']
)
self.queue_size = Gauge(
'queue_size',
'Current queue size',
['service', 'queue_name']
)
# Business Metrics
self.orders_total = Counter(
'orders_total',
'Total orders processed',
['service', 'status', 'payment_method']
)
self.revenue_total = Counter(
'revenue_total_cents',
'Total revenue in cents',
['service', 'currency']
)
# Service Info
self.service_info = Info(
'service_info',
'Service information'
)
self.service_info.info({
'service': service_name,
'version': '1.0.0'
})
def track_request(self, method: str, endpoint: str):
"""Decorator to track request metrics"""
def decorator(func: Callable):
@functools.wraps(func)
async def async_wrapper(*args, **kwargs):
self.request_in_progress.labels(
service=self.service_name,
endpoint=endpoint
).inc()
start_time = time.time()
status = "success"
try:
result = await func(*args, **kwargs)
return result
except Exception as e:
status = "error"
raise
finally:
duration = time.time() - start_time
self.request_count.labels(
service=self.service_name,
method=method,
endpoint=endpoint,
status=status
).inc()
self.request_latency.labels(
service=self.service_name,
method=method,
endpoint=endpoint
).observe(duration)
self.request_in_progress.labels(
service=self.service_name,
endpoint=endpoint
).dec()
return async_wrapper
return decorator
def export(self) -> bytes:
"""Export metrics in Prometheus format"""
return generate_latest(REGISTRY)
# ============== Distributed Tracing ==============
trace_context: ContextVar[Dict[str, Any]] = ContextVar('trace_context', default={})
@dataclass
class Span:
trace_id: str
span_id: str
parent_span_id: Optional[str]
operation_name: str
service_name: str
start_time: float
end_time: Optional[float] = None
tags: Dict[str, Any] = field(default_factory=dict)
logs: list = field(default_factory=list)
status: str = "OK"
def set_tag(self, key: str, value: Any) -> 'Span':
self.tags[key] = value
return self
def log(self, message: str, **kwargs) -> 'Span':
self.logs.append({
"timestamp": time.time(),
"message": message,
**kwargs
})
return self
def set_error(self, error: Exception) -> 'Span':
self.status = "ERROR"
self.tags["error"] = True
self.tags["error.message"] = str(error)
self.tags["error.type"] = type(error).__name__
return self
def finish(self) -> None:
self.end_time = time.time()
def duration_ms(self) -> float:
if self.end_time is None:
return 0
return (self.end_time - self.start_time) * 1000
def to_dict(self) -> Dict[str, Any]:
return {
"traceId": self.trace_id,
"spanId": self.span_id,
"parentSpanId": self.parent_span_id,
"operationName": self.operation_name,
"serviceName": self.service_name,
"startTime": int(self.start_time * 1_000_000),
"duration": int(self.duration_ms() * 1000),
"tags": self.tags,
"logs": self.logs,
"status": self.status
}
class Tracer:
"""Distributed tracing implementation"""
def __init__(self, service_name: str, exporter: Optional['SpanExporter'] = None):
self.service_name = service_name
self.exporter = exporter or ConsoleSpanExporter()
@contextmanager
def start_span(self, operation_name: str, **tags):
"""Start a new span as context manager"""
ctx = trace_context.get()
trace_id = ctx.get('trace_id') or self._generate_id()
parent_span_id = ctx.get('span_id')
span_id = self._generate_id()
span = Span(
trace_id=trace_id,
span_id=span_id,
parent_span_id=parent_span_id,
operation_name=operation_name,
service_name=self.service_name,
start_time=time.time(),
tags=tags
)
# Set new context
new_ctx = {
'trace_id': trace_id,
'span_id': span_id,
'parent_span_id': parent_span_id
}
token = trace_context.set(new_ctx)
try:
yield span
except Exception as e:
span.set_error(e)
raise
finally:
span.finish()
self.exporter.export(span)
trace_context.reset(token)
def inject_headers(self) -> Dict[str, str]:
"""Inject trace context into HTTP headers"""
ctx = trace_context.get()
return {
'X-Trace-ID': ctx.get('trace_id', ''),
'X-Span-ID': ctx.get('span_id', ''),
'X-Parent-Span-ID': ctx.get('parent_span_id', '')
}
def extract_headers(self, headers: Dict[str, str]) -> None:
"""Extract trace context from HTTP headers"""
ctx = {
'trace_id': headers.get('X-Trace-ID') or self._generate_id(),
'span_id': headers.get('X-Span-ID'),
'parent_span_id': headers.get('X-Parent-Span-ID')
}
trace_context.set(ctx)
def _generate_id(self) -> str:
return uuid.uuid4().hex[:16]
class ConsoleSpanExporter:
def export(self, span: Span) -> None:
print(json.dumps(span.to_dict(), indent=2))
class JaegerExporter:
"""Export spans to Jaeger"""
def __init__(self, agent_host: str = "localhost", agent_port: int = 6831):
self.agent_host = agent_host
self.agent_port = agent_port
self.batch: list = []
self.batch_size = 100
def export(self, span: Span) -> None:
self.batch.append(span.to_dict())
if len(self.batch) >= self.batch_size:
self._flush()
def _flush(self) -> None:
if not self.batch:
return
        # Simplified for illustration: a real Jaeger agent expects Thrift-encoded payloads, not raw JSON over UDP
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = json.dumps({"spans": self.batch}).encode()
sock.sendto(payload, (self.agent_host, self.agent_port))
self.batch = []
# ============== Structured Logging ==============
correlation_id_ctx: ContextVar[str] = ContextVar('correlation_id', default='')
class StructuredLogger:
"""Production structured logging with correlation"""
def __init__(self, service_name: str, log_level: str = "INFO"):
self.service_name = service_name
self.logger = logging.getLogger(service_name)
self.logger.setLevel(getattr(logging, log_level))
# JSON formatter
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
self.logger.addHandler(handler)
def _enrich(self, extra: Dict[str, Any]) -> Dict[str, Any]:
"""Add standard fields to log entry"""
ctx = trace_context.get()
return {
"service": self.service_name,
"timestamp": datetime.utcnow().isoformat() + "Z",
"trace_id": ctx.get('trace_id', ''),
"span_id": ctx.get('span_id', ''),
"correlation_id": correlation_id_ctx.get(''),
**extra
}
def info(self, message: str, **kwargs) -> None:
self.logger.info(message, extra=self._enrich(kwargs))
def error(self, message: str, **kwargs) -> None:
self.logger.error(message, extra=self._enrich(kwargs))
def warning(self, message: str, **kwargs) -> None:
self.logger.warning(message, extra=self._enrich(kwargs))
def debug(self, message: str, **kwargs) -> None:
self.logger.debug(message, extra=self._enrich(kwargs))
class JsonFormatter(logging.Formatter):
"""Format logs as JSON"""
def format(self, record: logging.LogRecord) -> str:
log_entry = {
"level": record.levelname,
"message": record.getMessage(),
"logger": record.name,
}
# Add extra fields
if hasattr(record, '__dict__'):
for key, value in record.__dict__.items():
if key not in ['name', 'msg', 'args', 'levelname', 'levelno',
'pathname', 'filename', 'module', 'lineno',
'funcName', 'created', 'msecs', 'relativeCreated',
'thread', 'threadName', 'processName', 'process',
'getMessage', 'exc_info', 'exc_text', 'stack_info']:
log_entry[key] = value
return json.dumps(log_entry)
# ============== Full Observability Middleware ==============
class ObservabilityMiddleware:
"""Combined metrics, tracing, and logging middleware"""
def __init__(
self,
app,
service_name: str,
metrics: MetricsCollector,
tracer: Tracer,
logger: StructuredLogger
):
self.app = app
self.service_name = service_name
self.metrics = metrics
self.tracer = tracer
self.logger = logger
async def __call__(self, request, call_next):
# Extract or generate correlation ID
correlation_id = request.headers.get(
'X-Correlation-ID',
str(uuid.uuid4())
)
correlation_id_ctx.set(correlation_id)
# Extract trace context
self.tracer.extract_headers(dict(request.headers))
endpoint = request.url.path
method = request.method
with self.tracer.start_span(
f"{method} {endpoint}",
http_method=method,
http_url=str(request.url)
) as span:
self.logger.info(
"request_started",
method=method,
path=endpoint,
user_agent=request.headers.get('user-agent')
)
start_time = time.time()
try:
response = await call_next(request)
span.set_tag("http.status_code", response.status_code)
self.logger.info(
"request_completed",
method=method,
path=endpoint,
status_code=response.status_code,
duration_ms=round((time.time() - start_time) * 1000, 2)
)
# Add trace headers to response
response.headers['X-Trace-ID'] = span.trace_id
response.headers['X-Correlation-ID'] = correlation_id
return response
except Exception as e:
span.set_error(e)
self.logger.error(
"request_failed",
method=method,
path=endpoint,
error=str(e),
error_type=type(e).__name__,
duration_ms=round((time.time() - start_time) * 1000, 2)
)
raise
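The JavaScript version below ends with a usage example; the equivalent wiring for the Python classes above might look like this (a sketch, assuming FastAPI):

from fastapi import FastAPI, Request, Response

app = FastAPI()
metrics = MetricsCollector("order-service")
tracer = Tracer("order-service", exporter=JaegerExporter())
logger = StructuredLogger("order-service")
observability = ObservabilityMiddleware(app, "order-service", metrics, tracer, logger)

# Register the combined middleware with FastAPI
@app.middleware("http")
async def observe(request: Request, call_next):
    return await observability(request, call_next)

# Expose Prometheus metrics from the collector defined above
@app.get("/metrics")
def metrics_endpoint() -> Response:
    return Response(content=metrics.export(), media_type="text/plain")

# Business endpoint: RED metrics come from the decorator, business metrics are explicit
@app.post("/orders")
@metrics.track_request("POST", "/orders")
async def create_order():
    metrics.orders_total.labels(
        service="order-service", status="created", payment_method="card"
    ).inc()
    logger.info("order_created")
    return {"status": "created"}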
The same stack in JavaScript (Express):
const { v4: uuidv4 } = require('uuid');
const { AsyncLocalStorage } = require('async_hooks');
// ============== Context Storage ==============
const traceContext = new AsyncLocalStorage();
const correlationContext = new AsyncLocalStorage();
// ============== Metrics Collection ==============
const promClient = require('prom-client');
class MetricsCollector {
constructor(serviceName) {
this.serviceName = serviceName;
// Enable default metrics
promClient.collectDefaultMetrics({ prefix: `${serviceName}_` });
// RED Method Metrics
this.requestCount = new promClient.Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['service', 'method', 'endpoint', 'status']
});
this.requestLatency = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request latency',
labelNames: ['service', 'method', 'endpoint'],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
this.requestsInProgress = new promClient.Gauge({
name: 'http_requests_in_progress',
help: 'HTTP requests currently being processed',
labelNames: ['service', 'endpoint']
});
// Business Metrics
this.ordersTotal = new promClient.Counter({
name: 'orders_total',
help: 'Total orders processed',
labelNames: ['service', 'status', 'payment_method']
});
this.revenueTotal = new promClient.Counter({
name: 'revenue_total_cents',
help: 'Total revenue in cents',
labelNames: ['service', 'currency']
});
}
trackRequest(method, endpoint) {
return (handler) => {
return async (req, res, next) => {
const labels = {
service: this.serviceName,
method,
endpoint
};
this.requestsInProgress.inc({
service: this.serviceName,
endpoint
});
const endTimer = this.requestLatency.startTimer(labels);
let status = 'success';
try {
await handler(req, res, next);
} catch (error) {
status = 'error';
throw error;
} finally {
endTimer();
this.requestCount.inc({ ...labels, status });
this.requestsInProgress.dec({
service: this.serviceName,
endpoint
});
}
};
};
}
async export() {
return promClient.register.metrics();
}
}
// ============== Distributed Tracing ==============
class Span {
constructor(traceId, spanId, parentSpanId, operationName, serviceName) {
this.traceId = traceId;
this.spanId = spanId;
this.parentSpanId = parentSpanId;
this.operationName = operationName;
this.serviceName = serviceName;
this.startTime = Date.now();
this.endTime = null;
this.tags = {};
this.logs = [];
this.status = 'OK';
}
setTag(key, value) {
this.tags[key] = value;
return this;
}
log(message, fields = {}) {
this.logs.push({
timestamp: Date.now(),
message,
...fields
});
return this;
}
setError(error) {
this.status = 'ERROR';
this.tags.error = true;
this.tags['error.message'] = error.message;
this.tags['error.type'] = error.name;
return this;
}
finish() {
this.endTime = Date.now();
}
durationMs() {
return this.endTime ? this.endTime - this.startTime : 0;
}
toJSON() {
return {
traceId: this.traceId,
spanId: this.spanId,
parentSpanId: this.parentSpanId,
operationName: this.operationName,
serviceName: this.serviceName,
startTime: this.startTime * 1000,
duration: this.durationMs() * 1000,
tags: this.tags,
logs: this.logs,
status: this.status
};
}
}
class Tracer {
constructor(serviceName, exporter = null) {
this.serviceName = serviceName;
this.exporter = exporter || new ConsoleSpanExporter();
}
startSpan(operationName, tags = {}) {
const ctx = traceContext.getStore() || {};
const traceId = ctx.traceId || this.generateId();
const parentSpanId = ctx.spanId;
const spanId = this.generateId();
const span = new Span(
traceId,
spanId,
parentSpanId,
operationName,
this.serviceName
);
Object.entries(tags).forEach(([k, v]) => span.setTag(k, v));
return span;
}
async runInSpan(operationName, tags, fn) {
const span = this.startSpan(operationName, tags);
const newCtx = {
traceId: span.traceId,
spanId: span.spanId,
parentSpanId: span.parentSpanId
};
try {
return await traceContext.run(newCtx, fn);
} catch (error) {
span.setError(error);
throw error;
} finally {
span.finish();
this.exporter.export(span);
}
}
injectHeaders() {
const ctx = traceContext.getStore() || {};
return {
'X-Trace-ID': ctx.traceId || '',
'X-Span-ID': ctx.spanId || '',
'X-Parent-Span-ID': ctx.parentSpanId || ''
};
}
extractHeaders(headers) {
const ctx = {
traceId: headers['x-trace-id'] || this.generateId(),
spanId: headers['x-span-id'],
parentSpanId: headers['x-parent-span-id']
};
return ctx;
}
generateId() {
return uuidv4().replace(/-/g, '').substring(0, 16);
}
}
class ConsoleSpanExporter {
export(span) {
console.log(JSON.stringify(span.toJSON(), null, 2));
}
}
// ============== Structured Logging ==============
class StructuredLogger {
constructor(serviceName, logLevel = 'info') {
this.serviceName = serviceName;
this.logLevels = { debug: 0, info: 1, warn: 2, error: 3 };
this.currentLevel = this.logLevels[logLevel] || 1;
}
enrich(extra) {
const ctx = traceContext.getStore() || {};
const correlationId = correlationContext.getStore() || '';
return {
service: this.serviceName,
timestamp: new Date().toISOString(),
traceId: ctx.traceId || '',
spanId: ctx.spanId || '',
correlationId,
...extra
};
}
log(level, message, extra = {}) {
if (this.logLevels[level] < this.currentLevel) return;
const entry = {
level: level.toUpperCase(),
message,
...this.enrich(extra)
};
console.log(JSON.stringify(entry));
}
info(message, extra) { this.log('info', message, extra); }
error(message, extra) { this.log('error', message, extra); }
warn(message, extra) { this.log('warn', message, extra); }
debug(message, extra) { this.log('debug', message, extra); }
}
// ============== Express Middleware ==============
function observabilityMiddleware(serviceName, metrics, tracer, logger) {
return async (req, res, next) => {
// Extract or generate correlation ID
const correlationId = req.headers['x-correlation-id'] || uuidv4();
// Extract trace context from headers
const traceCtx = tracer.extractHeaders(req.headers);
const endpoint = req.path;
const method = req.method;
await correlationContext.run(correlationId, async () => {
await tracer.runInSpan(
`${method} ${endpoint}`,
{ 'http.method': method, 'http.url': req.url },
async () => {
const span = tracer.startSpan(`${method} ${endpoint}`);
logger.info('request_started', {
method,
path: endpoint,
userAgent: req.headers['user-agent']
});
const startTime = Date.now();
// Track response
const originalEnd = res.end;
res.end = function(...args) {
const durationMs = Date.now() - startTime;
span.setTag('http.status_code', res.statusCode);
logger.info('request_completed', {
method,
path: endpoint,
statusCode: res.statusCode,
durationMs
});
// Add trace headers to response
res.setHeader('X-Trace-ID', span.traceId);
res.setHeader('X-Correlation-ID', correlationId);
span.finish();
return originalEnd.apply(this, args);
};
next();
}
);
});
};
}
// ============== Usage Example ==============
const express = require('express');
const app = express();
const metrics = new MetricsCollector('order-service');
const tracer = new Tracer('order-service');
const logger = new StructuredLogger('order-service');
// Apply middleware
app.use(observabilityMiddleware('order-service', metrics, tracer, logger));
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', 'text/plain');
res.send(await metrics.export());
});
// Health check
app.get('/health', (req, res) => {
res.json({ status: 'healthy', timestamp: new Date().toISOString() });
});
module.exports = {
MetricsCollector,
Tracer,
Span,
StructuredLogger,
observabilityMiddleware
};
Alert Template
# Example Prometheus alerting rule (routed to PagerDuty/Opsgenie via Alertmanager)
alert: OrderServiceHighLatency
expr: histogram_quantile(0.99, sum(rate(order_request_duration_seconds_bucket[5m])) by (le)) > 2
for: 5m
labels:
severity: critical
service: order-service
team: payments
annotations:
summary: "Order service p99 latency is {{ $value }}s (threshold: 2s)"
description: |
Impact: Users are experiencing slow checkout.
Dashboard: https://grafana.example.com/d/orders
Runbook: https://wiki.example.com/runbooks/order-latency
Possible causes:
- Database connection pool exhausted
- Payment service slow
- Increased traffic
Immediate actions:
1. Check dashboard for traffic spike
2. Check payment-service health
3. Check database metrics
runbook_url: https://wiki.example.com/runbooks/order-latency
Alert Severity Levels
┌─────────────────────────────────────────────────────────────────┐
│ Severity Levels │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SEV-1 (Critical) - Page immediately, all hands │
│ ───────────────────────────────────────────── │
│ • Service is down │
│ • Data loss occurring │
│ • Security breach │
│ Response: 15 minutes │
│ │
│ SEV-2 (High) - Page on-call │
│ ─────────────────────────── │
│ • Major feature degraded │
│ • Significant latency increase │
│ • Error rate > 5% │
│ Response: 30 minutes │
│ │
│ SEV-3 (Medium) - Ticket, fix during business hours │
│ ─────────────────────────────────────────────── │
│ • Non-critical feature broken │
│ • Performance degradation (not severe) │
│ Response: 4 hours │
│ │
│ SEV-4 (Low) - Ticket, fix when convenient │
│ ───────────────────────────────────────── │
│ • Minor issues │
│ • Cosmetic problems │
│ Response: 1 week │
│ │
└─────────────────────────────────────────────────────────────────┘
Dashboards
Dashboard Design
┌─────────────────────────────────────────────────────────────────┐
│ Dashboard Layout │
├─────────────────────────────────────────────────────────────────┤
│ │
│ TOP: High-level health (GREEN/YELLOW/RED) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ [OK] API Gateway [OK] Orders [WARN] Payments [OK] Database │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ GOLDEN SIGNALS (RED method): │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Request Rate │ │ Error Rate │ │ Latency p99 │ │
│ │ ^ │ │ v │ │ ~ │ │
│ │ 1,234 req/s │ │ 0.05% │ │ 145ms │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ RESOURCE UTILIZATION (USE method): │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ CPU │ │ Memory │ │ Disk I/O │ │
│ │ ▂▃▅▆▇ │ │ ▃▃▃▃▃ │ │ ▁▂▁▂▁ │ │
│ │ 65% │ │ 72% │ │ 15% │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ DEPENDENCIES: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Payment API: 50ms │ Database: 10ms │ Redis: 1ms │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ BOTTOM: Detailed graphs, logs, traces │
│ │
└─────────────────────────────────────────────────────────────────┘
SLIs, SLOs, and SLAs
┌─────────────────────────────────────────────────────────────────┐
│ SLI / SLO / SLA │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SLI (Service Level Indicator) │
│ ───────────────────────────── │
│ The metric you measure │
│ Example: "99th percentile latency of API requests" │
│ │
│ SLO (Service Level Objective) │
│ ───────────────────────────── │
│ The target you aim for (internal) │
│ Example: "99th percentile latency < 200ms" │
│ │
│ SLA (Service Level Agreement) │
│ ───────────────────────────── │
│ The contract with customers (external) │
│ Example: "99.9% availability or refund" │
│ Note: SLA should be looser than SLO (buffer) │
│ │
│ Example SLO Document: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Service: Order API │ │
│ │ ──────────────────── │ │
│ │ Availability: 99.95% of requests successful │ │
│ │ Latency: 99% of requests < 500ms │ │
│ │ Throughput: Handle 10,000 req/s │ │
│ │ │ │
│ │ Error Budget: 0.05% = 21.6 minutes/month │ │
│ │ Current Budget Remaining: 15.3 minutes │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
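The error-budget line in the example SLO document is just arithmetic on the availability target; a quick check, assuming a 30-day month:

slo = 0.9995                                 # 99.95% availability objective
minutes_per_month = 30 * 24 * 60             # 43,200 minutes in a 30-day month
error_budget = (1 - slo) * minutes_per_month
print(f"{error_budget:.1f} minutes/month")   # 21.6 minutes/month, as in the document above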
Senior Interview Questions
How do you debug a latency spike in production?
Systematic approach:
- Check dashboards: RED metrics, identify when it started
- Correlate: Deployments? Traffic spike? Dependency issue?
- Trace analysis: Find slow spans in traces
- Log analysis: Search for errors around that time
- Narrow down: Which endpoint? Which users?
Common culprits:
- Database slow queries (check slow query log)
- GC pauses (check GC metrics)
- Connection pool exhaustion
- External dependency slowdown
- Lock contention
How do you set up monitoring for a new service?
Checklist:
- Instrument code: Add metrics (RED method)
- Add tracing: Propagate trace context
- Structured logging: With correlation IDs
- Create dashboard: Health, golden signals, resources
- Set up alerts: On symptoms, not causes
- Document SLOs: Define success criteria
- Create runbooks: What to do when alerts fire
What's your approach to reducing alert fatigue?
Strategies:
- Alert on symptoms: User impact, not causes
- Use thresholds wisely: alert on "above 80% for 5 minutes", not on an instant spike
- Group related alerts: One page per incident, not 10
- Regular review: Delete unused, tune noisy alerts
- Escalation policy: Low-priority → ticket, high → page
- On-call feedback: Track alert quality metrics
How would you design a metrics system at scale?
Architecture:
- Collection: Agent on each host (Prometheus, StatsD)
- Aggregation: Pre-aggregate at edge (reduce cardinality)
- Storage: Time-series DB (InfluxDB, M3DB, Thanos)
- Query: Federation for cross-cluster queries
- Visualization: Grafana dashboards
Scaling challenges:
- High-cardinality labels (user_id) → aggregate or drop the label
- Long retention → Downsampling
- Many metrics → Drop unused