Observability

In distributed systems, understanding what’s happening across services is crucial. Observability gives you visibility into your system through three pillars: Logs, Metrics, and Traces. In a monolith, a single stack trace tells you almost everything you need to debug a problem. In microservices, that same problem might involve 12 services, 3 databases, and 2 message brokers - and no single log file contains the full picture. Observability is not a “nice to have” in microservices; it is the cost of entry. You cannot operate what you cannot see, and once you move past 3-4 services, every production incident becomes a distributed detective case. The difference between a 15-minute outage and a 15-hour outage is almost always the quality of your observability.
Learning Objectives:
  • Understand the three pillars of observability
  • Implement structured logging
  • Set up distributed tracing with OpenTelemetry
  • Configure metrics with Prometheus
  • Build dashboards with Grafana

Three Pillars of Observability

Before diving into implementation, it helps to understand why these three pillars exist and why you need all three. Logs, metrics, and traces are not redundant - they answer fundamentally different questions. Logs tell you what specifically happened in a given moment (rich context, high detail, expensive to store at volume). Metrics tell you how the system is behaving in aggregate (cheap, fast, perfect for alerting, but with little context). Traces tell you how a single request flowed through the system (showing causality and latency attribution across service boundaries). Relying on only one pillar is a common mistake. Metrics-only monitoring tells you “error rate is up” but not why. Logs-only observability means drowning in data during an incident with no way to see the forest. Traces without metrics mean you have no alerting. The three pillars compose: metrics fire an alert, traces narrow down the failing call path, and logs provide the root-cause detail. Each pillar has different storage economics and query patterns, which is why mature systems use different backends for each (Prometheus for metrics, Jaeger/Tempo for traces, Loki/Elasticsearch for logs).
┌─────────────────────────────────────────────────────────────────────────────┐
│                    THREE PILLARS OF OBSERVABILITY                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                              LOGS                                    │    │
│  │                                                                      │    │
│  │  What happened?                                                      │    │
│  │                                                                      │    │
│  │  [2024-01-15T10:30:00Z] INFO  order-service: Order created          │    │
│  │  [2024-01-15T10:30:01Z] ERROR payment-service: Payment failed       │    │
│  │  [2024-01-15T10:30:02Z] WARN  inventory-service: Low stock alert   │    │
│  │                                                                      │    │
│  │  • Discrete events                                                   │    │
│  │  • Rich context                                                      │    │
│  │  • Good for debugging                                                │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                             METRICS                                  │    │
│  │                                                                      │    │
│  │  How is the system performing?                                       │    │
│  │                                                                      │    │
│  │  request_count{service="order",status="200"} = 1523                 │    │
│  │  request_duration_seconds{service="order",quantile="0.99"} = 0.245  │    │
│  │  error_rate{service="payment"} = 0.02                               │    │
│  │                                                                      │    │
│  │  • Numeric time-series data                                          │    │
│  │  • Aggregated values                                                 │    │
│  │  • Good for alerting & dashboards                                    │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                             TRACES                                   │    │
│  │                                                                      │    │
│  │  How do requests flow through the system?                            │    │
│  │                                                                      │    │
│  │  [trace-id: abc123]                                                  │    │
│  │  └─ order-service (50ms)                                             │    │
│  │     ├─ validate-order (5ms)                                          │    │
│  │     ├─ payment-service (30ms)                                        │    │
│  │     │  └─ process-payment (25ms)                                     │    │
│  │     └─ inventory-service (10ms)                                      │    │
│  │                                                                      │    │
│  │  • Request path visualization                                        │    │
│  │  • Latency breakdown                                                 │    │
│  │  • Dependency mapping                                                │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Structured Logging

Logs should be structured (JSON) for easy parsing and querying. The shift from plaintext to structured logging is one of the most impactful changes a team can make. Plaintext logs work fine when a human reads one log at a time, but they fall apart when machines need to parse them - and in microservices, machines are the primary readers. Log aggregators like Loki, Elasticsearch, and Splunk index structured fields (service, correlation_id, duration_ms, user_id) so you can run queries like “show me all errors for user X across all services in the last 5 minutes.” With plaintext, you’d need regex gymnastics that break the moment someone changes a log message. Structured logging also forces discipline: every field has a name and type, which makes logs self-documenting and prevents the “log rot” where messages drift over time. The alternative - unstructured printf-style logging - works in development but becomes unusable at production scale. Imagine grepping through 500 GB of logs per day across 40 services, where each service has its own ad-hoc format. Structured logs with a consistent schema (service, level, timestamp, correlation_id, message, and domain-specific fields) turn that chaos into a queryable dataset. The tradeoff is verbosity: JSON is harder for humans to read than plain text. The standard solution is to use pretty-printed colorized output in development and JSON in production, which is exactly what the examples below do.
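To make the "queryable dataset" point concrete, here is a minimal sketch using only the Node.js standard library. The logLine helper, the field values, and the user IDs are made up for illustration; a real setup would emit these lines through a logging library and query them in a log aggregator rather than in process.

```javascript
// Minimal sketch: emit structured log lines and query them by field.
// Field names follow the schema described above; the values are invented.
function logLine(level, message, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    service: 'order-service',
    message,
    ...fields
  });
}

const lines = [
  logLine('info', 'Order created', { correlation_id: 'req-1', user_id: 'u-42', duration_ms: 18 }),
  logLine('error', 'Payment failed', { correlation_id: 'req-1', user_id: 'u-42', duration_ms: 230 }),
  logLine('error', 'Payment failed', { correlation_id: 'req-2', user_id: 'u-7', duration_ms: 95 })
];

// "All errors for user u-42" becomes a field comparison instead of a regex
// over free-form message text, so rewording the message cannot break it.
const errorsForUser = lines
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.level === 'error' && entry.user_id === 'u-42');

console.log(errorsForUser.map((entry) => `${entry.correlation_id} ${entry.message}`));
```

The same filter expressed against plaintext would have to parse the user ID and level out of each message string, which is the regex fragility described above.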
Caveats & Common Pitfalls in Structured Logging
  • Logging PII by accident. Dumping req.body or user object into logs means credit card numbers, passwords, password-reset tokens, and SSNs end up in Splunk, Loki, or Datadog. Once written, they are practically impossible to fully purge because of replication, backups, and SIEM forwarders — you are looking at a GDPR Article 33 notification.
  • Log-level drift. Teams set everything to INFO in production “for visibility,” then pay 6x the log-ingestion bill and still cannot find anything. A single chatty INFO log inside a hot loop can generate 50 GB/day alone.
  • Unbounded fields. Logging full HTTP request bodies, SQL queries with parameters, or stack traces for every 404 will blow through your retention budget in days.
  • Lost context across async boundaries. Node.js async_hooks and Python contextvars can silently drop the correlation ID across setTimeout, worker threads, or asyncio.create_task if you are not careful, and half your logs appear “orphaned” from the request.
Solutions & Patterns
  • Maintain a central PII redaction list (password, token, ssn, credit_card, authorization, cookie, email in some regulated contexts) and enforce it in a single logging formatter so individual developers cannot forget. Libraries make this declarative: pino’s built-in redact option (Node) or a custom redacting processor in the structlog processor chain (Python).
  • Use sampling for high-volume INFO logs (keep 1 in 100) and keep 100% of WARN/ERROR. This keeps the signal and cuts the bill.
  • Define field-length caps — bodies truncated at 1 KB, stack traces at 4 KB — enforced in the formatter, not per-call site.
  • Use the idiomatic async context mechanism for your runtime (AsyncLocalStorage in Node.js, stable since v16.4; contextvars in Python) so correlation IDs survive await boundaries automatically. This works because the runtime itself threads context through the async machinery - no manual plumbing needed.
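The redaction list and field-length caps above can be sketched as one small function that a shared formatter would call on every log payload. This is an illustration only: the names REDACTED_KEYS, MAX_STRING_LENGTH, and sanitize are hypothetical, the key list is deliberately short, and the function assumes plain JSON-like metadata (objects, arrays, strings, numbers).

```javascript
// Hypothetical central redaction list; extend it to match your own data model.
const REDACTED_KEYS = new Set(['password', 'token', 'ssn', 'credit_card', 'authorization', 'cookie']);
const MAX_STRING_LENGTH = 1024; // cap for any single string field (about 1 KB)
const MAX_DEPTH = 5;            // stops runaway recursion on deep or cyclic data

function sanitize(value, depth = 0) {
  if (depth > MAX_DEPTH) return '[truncated: too deep]';

  if (typeof value === 'string') {
    return value.length > MAX_STRING_LENGTH
      ? `${value.slice(0, MAX_STRING_LENGTH)}...[truncated]`
      : value;
  }

  if (Array.isArray(value)) {
    return value.map((item) => sanitize(item, depth + 1));
  }

  if (value !== null && typeof value === 'object') {
    const clean = {};
    for (const [key, inner] of Object.entries(value)) {
      // Match keys case-insensitively so "Authorization" is caught too.
      clean[key] = REDACTED_KEYS.has(key.toLowerCase())
        ? '[REDACTED]'
        : sanitize(inner, depth + 1);
    }
    return clean;
  }

  return value; // numbers, booleans, null, undefined pass through unchanged
}

console.log(sanitize({ user: 'alice', password: 'hunter2', headers: { Authorization: 'Bearer abc' } }));
```

Because the rules live in one function, adding a key to the list protects every call site at once, which is the point of enforcing redaction in the formatter rather than per log call.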

Logger Implementation

Below is a production-grade logger that centralizes three critical concerns: a consistent JSON format for machine consumption, default metadata (service name, environment, version, hostname) attached to every log line, and the concept of “child loggers” that carry request-scoped context. The child logger pattern is the key to correlating logs across a single request - rather than passing context manually into every log call, you create a child logger once per request and use it everywhere. Without this pattern, you’ll end up either polluting every function signature with a context object or losing correlation context entirely.
// observability/Logger.js
const winston = require('winston');

class Logger {
  constructor(options = {}) {
    this.serviceName = options.serviceName || process.env.SERVICE_NAME || 'unknown';
    this.environment = options.environment || process.env.NODE_ENV || 'development';
    
    this.logger = winston.createLogger({
      level: options.level || 'info',
      format: winston.format.combine(
        winston.format.timestamp(),
        winston.format.errors({ stack: true }),
        winston.format.json()
      ),
      defaultMeta: {
        service: this.serviceName,
        environment: this.environment,
        version: process.env.APP_VERSION || '1.0.0',
        hostname: require('os').hostname()
      },
      transports: this.createTransports(options)
    });
  }

  createTransports(options) {
    const transports = [];
    
    // Console for development
    if (this.environment === 'development') {
      transports.push(new winston.transports.Console({
        format: winston.format.combine(
          winston.format.colorize(),
          winston.format.simple()
        )
      }));
    } else {
      // JSON for production (shipped to log aggregator)
      transports.push(new winston.transports.Console());
    }
    
    // File transport for local debugging
    if (options.logFile) {
      transports.push(new winston.transports.File({
        filename: options.logFile,
        maxsize: 5242880, // 5MB
        maxFiles: 5
      }));
    }
    
    return transports;
  }

  // Add correlation context to all logs
  child(context) {
    return {
      info: (message, meta = {}) => this.info(message, { ...context, ...meta }),
      warn: (message, meta = {}) => this.warn(message, { ...context, ...meta }),
      error: (message, meta = {}) => this.error(message, { ...context, ...meta }),
      debug: (message, meta = {}) => this.debug(message, { ...context, ...meta })
    };
  }

  info(message, meta = {}) {
    this.logger.info(message, this.enrichMeta(meta));
  }

  warn(message, meta = {}) {
    this.logger.warn(message, this.enrichMeta(meta));
  }

  error(message, meta = {}) {
    this.logger.error(message, this.enrichMeta(meta));
  }

  debug(message, meta = {}) {
    this.logger.debug(message, this.enrichMeta(meta));
  }

  enrichMeta(meta) {
    return {
      ...meta,
      timestamp: new Date().toISOString()
    };
  }
}

// Singleton export
const logger = new Logger({
  serviceName: process.env.SERVICE_NAME,
  level: process.env.LOG_LEVEL || 'info'
});

module.exports = { Logger, logger };

Request Logging Middleware

Correlation IDs are the single most important thing you can add to your logs. Every incoming request gets a unique ID that flows through every log line, every downstream call, and every database query related to that request. When a user complains “my order failed at 3:42 PM,” you search one correlation ID and see the full story across all services. Without correlation IDs, you’re reduced to grep-by-timestamp, which is meaningless when you’re processing 10,000 requests per second. The middleware below generates a correlation ID on the way in (or accepts one from an upstream caller) and attaches a child logger to the request so every subsequent log in that request lifecycle carries the context automatically. If you skip this middleware and rely on ad-hoc logging, debugging a single user’s complaint becomes a scavenger hunt across service boundaries. Each service has its own logs, each with its own IDs, and joining them requires human pattern-matching across timestamps. The cost of adding correlation IDs is tiny (a UUID and a header); the cost of not having them compounds with every incident.
// middleware/requestLogger.js
const { logger } = require('../observability/Logger');
const { v4: uuidv4 } = require('uuid');

function requestLogger(options = {}) {
  return (req, res, next) => {
    // Extract or generate correlation IDs
    const correlationId = req.headers['x-correlation-id'] || uuidv4();
    const traceId = req.headers['x-trace-id'] || correlationId;
    // Strip hyphens first so the span ID is 16 hex characters
    const spanId = uuidv4().replace(/-/g, '').substring(0, 16);
    
    // Attach to request
    req.correlationId = correlationId;
    req.traceId = traceId;
    req.spanId = spanId;
    
    // Create child logger with context
    req.logger = logger.child({
      correlationId,
      traceId,
      spanId,
      method: req.method,
      path: req.path,
      userAgent: req.headers['user-agent']
    });

    // Log request start
    const startTime = Date.now();
    req.logger.info('Request started', {
      query: req.query,
      ip: req.ip
    });

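    // Log when the response has been fully sent. Unlike patching res.send,
    // the 'finish' event fires exactly once and also covers responses that
    // never go through res.send (res.end(), streamed bodies).
    res.on('finish', () => {
      const duration = Date.now() - startTime;
      
      req.logger.info('Request completed', {
        statusCode: res.statusCode,
        duration,
        responseSize: Number(res.getHeader('content-length')) || 0
      });
    });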

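    // Echo the IDs back to the caller (response headers); outbound calls to
    // downstream services get them from the traced HTTP client instead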
    res.setHeader('X-Correlation-ID', correlationId);
    res.setHeader('X-Trace-ID', traceId);

    next();
  };
}

module.exports = { requestLogger };
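
The middleware above carries context on the req object, which works as long as req is passed around. As a complementary sketch of the AsyncLocalStorage approach mentioned earlier, the snippet below stores the correlation ID in an async context so that code without access to req can still read it. It uses only the Node.js standard library; the names requestContext, contextMiddleware, currentCorrelationId, and log are hypothetical, and crypto.randomUUID stands in for the uuid package.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

const requestContext = new AsyncLocalStorage();

// Express-style middleware: everything that runs as a result of next(),
// including awaited work and timers, sees the same store.
function contextMiddleware(req, res, next) {
  const correlationId = req.headers['x-correlation-id'] || randomUUID();
  requestContext.run({ correlationId }, next);
}

// Readable from any depth of the call stack, no parameter passing needed.
function currentCorrelationId() {
  const store = requestContext.getStore();
  return store ? store.correlationId : undefined;
}

function log(message, fields = {}) {
  return JSON.stringify({ correlationId: currentCorrelationId(), message, ...fields });
}

// A function deep in the call chain that never receives req.
async function chargeCard() {
  await new Promise((resolve) => setTimeout(resolve, 5)); // crosses an async boundary
  return log('card charged');
}
```

Outside of a request (for example in a startup script) currentCorrelationId() returns undefined, so log lines from background work are easy to tell apart from request-scoped ones.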

Propagating Context to Downstream Services

Correlation IDs are worthless if they stop at the first service. The whole point is end-to-end traceability, which means every outbound HTTP call must carry the correlation headers forward. A traced HTTP client wraps your normal client (axios, httpx, etc.) and automatically injects the headers for you, so developers never have to remember to pass them manually. This is a place where a small abstraction pays huge dividends: if you let each service author write their own HTTP calls, correlation IDs will be forgotten 30% of the time and your trace graphs will have mysterious gaps. The tradeoff is coupling - every outbound call goes through a shared helper - but in practice this is a good kind of coupling. It enforces a cross-cutting concern (observability) in one place rather than 300. If you ever need to add a new header (tenant ID, feature flags, experiment IDs), you change one file instead of hunting every HTTP call in the codebase.
// utils/httpClient.js
const axios = require('axios');

function createTracedClient(req) {
  return axios.create({
    headers: {
      'X-Correlation-ID': req.correlationId,
      'X-Trace-ID': req.traceId,
      'X-Parent-Span-ID': req.spanId
    }
  });
}

// Usage in route handler
app.post('/orders', async (req, res, next) => {
  const http = createTracedClient(req);
  
  try {
    req.logger.info('Creating order', { customerId: req.body.customerId });
    
    // Context is automatically propagated
    const payment = await http.post('http://payment-service/charge', {
      amount: req.body.total
    });
    
    req.logger.info('Payment processed', { paymentId: payment.data.id });
    
    res.json({ orderId: '123', paymentId: payment.data.id });
  } catch (error) {
    req.logger.error('Order creation failed', {
      error: error.message,
      stack: error.stack
    });
    // Express 4 does not catch rejections thrown from async handlers,
    // so pass the error to the error-handling middleware explicitly
    next(error);
  }
});

Distributed Tracing with OpenTelemetry

OpenTelemetry provides vendor-neutral instrumentation. Before OpenTelemetry, distributed tracing was a vendor lock-in nightmare. You’d instrument your code for Jaeger, and if you later switched to Datadog or Honeycomb, you’d rewrite every span. OpenTelemetry (OTel) emerged from the merger of OpenTracing and OpenCensus around 2019 precisely to solve this - a single instrumentation API, with pluggable exporters for every major backend. The practical implication is that you instrument once and decide later where traces go: Jaeger, Zipkin, Tempo, Datadog APM, New Relic, Honeycomb, Lightstep. This is the correct layer of abstraction: your application code should not know or care which SaaS monitoring vendor you use this quarter. The conceptual model is simple but powerful. A trace represents one end-to-end request. A span represents one unit of work within that trace (a database query, an HTTP call, a function execution). Spans have parent-child relationships that reconstruct causality - “this payment span was caused by the create-order span, which was caused by the HTTP POST to /orders.” Spans carry attributes (key-value metadata) and events (timestamped annotations). Context propagation via W3C Trace Context headers (traceparent, tracestate) ties it all together across process boundaries. Skip propagation and your traces break into disconnected fragments, one per service - useless for debugging cross-service latency.
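To show what actually travels between services, here is a small sketch that parses and builds the W3C traceparent header by hand. In practice the OpenTelemetry SDK does this for you; the functions parseTraceparent and childTraceparent are hypothetical helpers, and the sketch only handles the version 00 layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags).

```javascript
const { randomBytes } = require('node:crypto');

// version - trace-id - parent-id - trace-flags, all lowercase hex
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header || '');
  if (!match) return null;

  const [, version, traceId, parentSpanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid according to the spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) {
    return null;
  }

  return {
    version,
    traceId,
    parentSpanId,
    sampled: (parseInt(flags, 16) & 0x01) === 0x01 // lowest bit is the "sampled" flag
  };
}

// Build the header for an outgoing call: same trace ID, new span ID.
function childTraceparent(parent) {
  const spanId = randomBytes(8).toString('hex'); // 8 bytes -> 16 hex characters
  const flags = parent.sampled ? '01' : '00';
  return `00-${parent.traceId}-${spanId}-${flags}`;
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(incoming);
console.log(childTraceparent(incoming));
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets a backend stitch the spans back into one tree.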
Caveats & Common Pitfalls in Distributed Tracing
  • Head-based sampling loses exactly the incidents you need. If you sample 1% at the entry point, the dice roll happens before the request fails or times out. When the postmortem asks “show me the trace for that 30-second checkout,” it was probably dropped at the gateway.
  • Broken context propagation. A custom transport (gRPC without the OTel interceptor, Kafka without header forwarding, a fire-and-forget fetch in a background job) creates orphan traces that look like unrelated tiny fragments. Postmortems become needle-in-a-haystack exercises.
  • Span attributes that carry PII. http.request.body, db.statement with parameters, user email on span attributes — these go straight to whatever backend you export to, including third-party SaaS like Datadog or Honeycomb.
  • Instrumentation inside hot loops. Creating a span per row in a 10,000-row loop generates 10k spans per request, destroys the trace view, and bills like crazy.
Solutions & Patterns
  • Use tail-based sampling for anything incident-critical. Keep 100% of traces with errors, 100% above p99 latency, and head-sample 1-5% of the rest. Implement in the OpenTelemetry Collector via tail_sampling processor so app code is unchanged.
  • Centralize context propagation. Wrap every outbound transport (HTTP client, Kafka producer, SQS publisher, gRPC stub) in a traced helper so propagation is impossible to forget. One helper, tested once, used everywhere.
  • Apply redaction at the Collector, not the app. An attributes processor with delete (or hash) actions strips PII keys (http.request.body, db.statement, user.email) before export. Redact at the edge - the backend is untrusted.
  • Batch work, not spans. Wrap the bulk operation in one span with count as an attribute. If per-row detail matters for debugging, log instead of span.
  • Use W3C traceparent as the one true propagation format; avoid mixing B3, Jaeger, and W3C headers across services.
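
The tail-based sampling policy above (keep all errors, keep slow traces, sample the rest) boils down to a decision made after the whole trace is known. The sketch below shows that decision logic in plain JavaScript. It is not how the Collector's tail_sampling processor is implemented or configured; shouldKeepTrace, the span shape, and the fixed latencyThresholdMs (standing in for "above p99") are assumptions for illustration.

```javascript
// A finished trace is modeled as an array of spans:
//   { name, durationMs, status: 'OK' | 'ERROR', parentSpanId }
// `random` is injectable so the decision can be tested deterministically.
function shouldKeepTrace(spans, options = {}) {
  const {
    latencyThresholdMs = 2000, // assumed stand-in for the p99 latency
    baselineRate = 0.05,       // keep 5% of healthy, fast traces
    random = Math.random
  } = options;

  // 1. Keep every trace that contains a failed span.
  if (spans.some((span) => span.status === 'ERROR')) return true;

  // 2. Keep every trace whose root span was slow.
  const root = spans.find((span) => !span.parentSpanId);
  if (root && root.durationMs >= latencyThresholdMs) return true;

  // 3. Otherwise fall back to probabilistic sampling.
  return random() < baselineRate;
}

const failedCheckout = [
  { name: 'POST /orders', durationMs: 180, status: 'OK', parentSpanId: null },
  { name: 'processPayment', durationMs: 120, status: 'ERROR', parentSpanId: 'a1' }
];
console.log(shouldKeepTrace(failedCheckout)); // kept because of the ERROR span
```

Head-based sampling cannot apply rules 1 and 2 because the outcome and total duration are unknown when the request enters the system, which is why the buffering happens in the Collector.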
┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED TRACE VISUALIZATION                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Trace ID: abc123-def456-ghi789                                              │
│  ──────────────────────────────────────────────────────────────────         │
│                                                                              │
│  0ms      50ms     100ms    150ms    200ms    250ms    300ms               │
│  ├────────┴────────┴────────┴────────┴────────┴────────┤                   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────┐              │
│  │ api-gateway: POST /orders                                 │ 280ms       │
│  └┬─────────────────────────────────────────────────────────┘              │
│   │                                                                         │
│   │ ┌────────────────────────────────────────────────┐                     │
│   ├▶│ order-service: createOrder                      │ 220ms              │
│   │ └┬───────────────────────────────────────────────┘                     │
│   │  │                                                                      │
│   │  │ ┌─────────────────────┐                                             │
│   │  ├▶│ validate-order      │ 15ms                                        │
│   │  │ └─────────────────────┘                                             │
│   │  │                                                                      │
│   │  │ ┌─────────────────────────────────────┐                             │
│   │  ├▶│ payment-service: processPayment     │ 120ms                       │
│   │  │ └┬────────────────────────────────────┘                             │
│   │  │  │                                                                   │
│   │  │  │ ┌────────────────────────┐                                       │
│   │  │  └▶│ stripe-api: charge     │ 95ms                                  │
│   │  │    └────────────────────────┘                                       │
│   │  │                                                                      │
│   │  │ ┌──────────────────────────────┐                                    │
│   │  └▶│ inventory-service: reserve   │ 45ms                               │
│   │    └┬─────────────────────────────┘                                    │
│   │     │                                                                   │
│   │     │ ┌─────────────────────┐                                          │
│   │     └▶│ database: UPDATE    │ 20ms                                     │
│   │       └─────────────────────┘                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

OpenTelemetry Setup

Setting up OpenTelemetry in a microservice involves three main pieces: a Resource (metadata identifying the service), one or more Exporters (where traces and metrics go), and Instrumentation (code that creates spans for you automatically). The auto-instrumentation libraries are where most of the magic happens - they hook into your HTTP framework, database driver, and Redis/Kafka client and emit spans without you writing a line of tracing code. This is the correct default: 90% of your observability needs are covered by auto-instrumentation, and you only add manual spans for business-specific operations (like “validate coupon” or “calculate shipping”). A subtle but important consideration: in the recommended setup, the SDK sends data over OTLP to an OpenTelemetry Collector, not directly to your backend. The Collector acts as a buffer and translator - it can receive OTLP, batch it, add metadata, sample it, and forward it to Jaeger, Prometheus, Datadog, etc. Running the Collector as a sidecar or DaemonSet decouples your app from backend choice and handles backpressure. Skipping the Collector and exporting directly to a backend works but couples your app to that vendor and makes it harder to add processing later.
// observability/tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

function initTracing(serviceName) {
  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development'
  });

  const traceExporter = new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317'
  });

  const metricExporter = new OTLPMetricExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317'
  });

  const sdk = new NodeSDK({
    resource,
    traceExporter,
    metricReader: new PeriodicExportingMetricReader({
      exporter: metricExporter,
      exportIntervalMillis: 30000
    }),
    instrumentations: [
      getNodeAutoInstrumentations({
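        '@opentelemetry/instrumentation-http': {
          // Skip health checks and metric scrapes. Newer releases of the HTTP
          // instrumentation dropped `ignoreIncomingPaths` in favor of this hook.
          ignoreIncomingRequestHook: (request) =>
            ['/health', '/metrics'].some((path) => request.url?.startsWith(path))
        },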
        '@opentelemetry/instrumentation-express': {},
        '@opentelemetry/instrumentation-mongodb': {},
        '@opentelemetry/instrumentation-redis': {},
        '@opentelemetry/instrumentation-pg': {},
        '@opentelemetry/instrumentation-grpc': {}
      })
    ]
  });

  sdk.start();
  console.log('OpenTelemetry tracing initialized');

  // Graceful shutdown
  process.on('SIGTERM', () => {
    sdk.shutdown()
      .then(() => console.log('Tracing terminated'))
      .catch((error) => console.error('Error terminating tracing', error))
      .finally(() => process.exit(0));
  });

  return sdk;
}

module.exports = { initTracing };

Manual Instrumentation

Auto-instrumentation covers your HTTP, database, and cache calls - but your most interesting business logic isn’t in those layers. You want spans that say “validate coupon” or “calculate tax” or “send notification,” because those are the operations that show up in latency breakdowns and tell you which business logic is slow. Manual instrumentation is where you mark up meaningful units of work in your own code. The rule of thumb: wrap any operation that takes more than ~5ms, any operation that has a chance of failing in an interesting way, and any operation whose timing matters for debugging. The helper below encapsulates three best practices: set span status to OK on success, set status to ERROR and record the exception on failure, and always end the span (even on exceptions) so you don’t leak spans. Forgetting to set error status is the most common mistake - your trace viewer will show the span as “successful” even though the operation threw. Recording exceptions is what gives you the stack trace inside the trace, which is what turns a trace into a debugging tool rather than just a latency chart.
// observability/tracer.js
const { trace, SpanStatusCode } = require('@opentelemetry/api');

class Tracer {
  constructor(serviceName) {
    this.tracer = trace.getTracer(serviceName);
  }

  // Create a new span for an operation
  async trace(name, fn, options = {}) {
    return this.tracer.startActiveSpan(name, options, async (span) => {
      try {
        const result = await fn(span);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error.message
        });
        span.recordException(error);
        throw error;
      } finally {
        span.end();
      }
    });
  }

  // Add attributes to current span
  addAttribute(key, value) {
    const span = trace.getActiveSpan();
    if (span) {
      span.setAttribute(key, value);
    }
  }

  // Add event to current span
  addEvent(name, attributes = {}) {
    const span = trace.getActiveSpan();
    if (span) {
      span.addEvent(name, attributes);
    }
  }

  // Get current trace context for propagation
  getContext() {
    const span = trace.getActiveSpan();
    if (span) {
      const spanContext = span.spanContext();
      return {
        traceId: spanContext.traceId,
        spanId: spanContext.spanId,
        traceFlags: spanContext.traceFlags
      };
    }
    return null;
  }
}

// Usage
const tracer = new Tracer('order-service');

async function createOrder(orderData) {
  return tracer.trace('createOrder', async (span) => {
    span.setAttribute('order.customerId', orderData.customerId);
    span.setAttribute('order.itemCount', orderData.items.length);

    // Validate order
    await tracer.trace('validateOrder', async () => {
      // validation logic
    });

    // Process payment
    const payment = await tracer.trace('processPayment', async (paymentSpan) => {
      paymentSpan.setAttribute('payment.amount', orderData.total);
      return paymentService.charge(orderData.total);
    });

    span.addEvent('payment_completed', {
      paymentId: payment.id,
      amount: payment.amount
    });

    // Reserve inventory
    await tracer.trace('reserveInventory', async () => {
      return inventoryService.reserve(orderData.items);
    });

    return { orderId: '123', paymentId: payment.id };
  });
}

module.exports = { Tracer, tracer };

Context Propagation

Context propagation is the invisible plumbing that makes distributed tracing work at all. When Service A calls Service B over HTTP, OpenTelemetry injects traceparent and tracestate headers on the outgoing request. Service B’s auto-instrumentation reads these headers and creates a child span under the same trace ID. The result: one unified trace, spanning multiple services, with correct parent-child relationships. If propagation breaks (because a client doesn’t inject, or a server doesn’t extract), you get orphaned traces - each service has its own mini-trace, and you cannot see the causality between them. The OTel auto-instrumentation for HTTP clients (axios, httpx, requests) handles this for you transparently - you don’t usually need to write propagation code yourself. The example below shows the manual API for cases where you have a custom transport (a message queue, a gRPC client without auto-instrumentation, a WebSocket) and need to inject/extract context yourself. Message queues are the most common case: unless an OTel instrumentation exists for your client library and does it for you, you inject the trace context into message headers/metadata on the producer side and extract it on the consumer side.
// observability/contextPropagation.js
const { context, propagation } = require('@opentelemetry/api');

// Inject context into outgoing request headers
function injectContext(headers = {}) {
  propagation.inject(context.active(), headers);
  return headers;
}

// Extract context from incoming request headers
function extractContext(headers) {
  return propagation.extract(context.active(), headers);
}

// HTTP client with automatic context propagation
const axios = require('axios');

function createTracedAxios() {
  const instance = axios.create();

  instance.interceptors.request.use((config) => {
    const headers = injectContext(config.headers);
    config.headers = headers;
    return config;
  });

  return instance;
}

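// Usage
const http = createTracedAxios();

// CommonJS modules do not allow top-level await, so wrap the call
async function chargePayment() {
  // The interceptor injects the trace context headers before the request is sent
  const response = await http.post('http://payment-service/charge', {
    amount: 100
  });
  return response.data;
}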
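For a custom transport such as a message queue, the propagated context is ultimately just a header string. The sketch below builds and parses a W3C traceparent value by hand to show what travels with the message. The helper names (buildTraceparent, parseTraceparent, publishWithContext, startConsumerSpan) and the message shape are hypothetical and exist only for illustration; in a real service you would pass the message headers as the carrier to propagation.inject and propagation.extract instead.

```javascript
const crypto = require('crypto');

// W3C traceparent layout: version-traceId-parentSpanId-flags, lowercase hex.
function buildTraceparent(traceId, spanId, sampled = true) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header || '');
  if (!match) return null; // missing or malformed: the consumer starts a new trace
  const [, version, traceId, parentSpanId, flags] = match;
  // All-zero IDs are invalid according to the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Producer side: copy the active span's identity into the message headers.
function publishWithContext(message, activeSpan) {
  const traceparent = buildTraceparent(activeSpan.traceId, activeSpan.spanId);
  return { ...message, headers: { ...message.headers, traceparent } };
}

// Consumer side: continue the trace if a valid header arrived, else start fresh.
function startConsumerSpan(message) {
  const parent = parseTraceparent(message.headers && message.headers.traceparent);
  return {
    traceId: parent ? parent.traceId : crypto.randomBytes(16).toString('hex'),
    parentSpanId: parent ? parent.parentSpanId : null,
    spanId: crypto.randomBytes(8).toString('hex')
  };
}

const producerSpan = { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7' };
const outgoing = publishWithContext({ body: { orderId: 'o-1' }, headers: {} }, producerSpan);
const consumerSpan = startConsumerSpan(outgoing);

console.log(outgoing.headers.traceparent);
console.log(consumerSpan.traceId === producerSpan.traceId); // true
```

If the header is dropped anywhere between producer and consumer, startConsumerSpan falls back to a fresh trace ID, which is exactly the orphaned-trace symptom described above.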

Metrics with Prometheus

Metrics Implementation

Prometheus won the metrics war around 2016-2018 for a specific reason: its pull-based model and multidimensional data model fit microservices perfectly. Instead of your services pushing metrics to a central server (which couples them to the collector and causes backpressure problems), Prometheus scrapes /metrics endpoints on a schedule. This means a service that crashes simply stops being scraped - no lost metrics, no retry queues. The multidimensional model (metric_name{label="value"}) lets you slice and dice without pre-aggregating, which is essential when you don’t know in advance which dimension you’ll need to investigate.
The four core metric types - Counter, Gauge, Histogram, Summary - each solve a specific problem. Counters monotonically increase (requests served, errors encountered) and are used with rate() in PromQL to compute per-second rates. Gauges go up and down (active connections, memory usage, queue depth) and represent instantaneous values. Histograms bucket observations (request duration, payload size) and let you compute percentiles server-side via histogram_quantile(). Summaries compute percentiles client-side, which is cheaper at query time but loses the ability to aggregate across instances - prefer Histograms unless you specifically need client-side percentiles. Picking the wrong type is the most common metric mistake: using a gauge where a counter belongs makes your dashboards wrong, and using summaries for multi-instance services makes your percentiles meaningless.
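To make the server-side percentile idea concrete, here is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the approach histogram_quantile() takes for classic histograms. The function name histogramQuantile and the sample bucket counts are made up for illustration, and edge cases that Prometheus handles (negative bounds, missing +Inf bucket) are left out.

```javascript
// buckets: cumulative counts sorted by upper bound `le`, ending with +Inf.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const idx = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: the best available answer is the highest finite bound.
  if (idx === buckets.length - 1) return buckets[buckets.length - 2].le;
  const upper = buckets[idx].le;
  const lower = idx === 0 ? 0 : buckets[idx - 1].le;
  const below = idx === 0 ? 0 : buckets[idx - 1].count;
  const inBucket = buckets[idx].count - below;
  if (inBucket === 0) return lower;
  // Assume observations are spread evenly inside the bucket.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.25, count: 80 },
  { le: 0.5, count: 95 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 }
];

console.log(histogramQuantile(0.5, latencyBuckets));   // falls in the first bucket
console.log(histogramQuantile(0.9, latencyBuckets));   // interpolated between 0.25 and 0.5
console.log(histogramQuantile(0.995, latencyBuckets)); // lands in +Inf, so capped at 1
```

The estimate is only as precise as the bucket it lands in, which is why bucket boundaries should sit close to the latencies you care about.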
Caveats & Common Pitfalls in Metrics
  • High-cardinality label explosion. Attaching user_id, order_id, request_id, raw URL path, or a stack trace as a label turns one metric into millions of time series. Prometheus ingests each as a distinct series, RAM climbs linearly, and at around 2-5 million active series a single Prometheus replica OOMs. On Datadog / Chronosphere / New Relic, the bill jumps from hundreds to tens of thousands of dollars per month because they price on custom metrics, where a “custom metric” is every unique tag combination.
  • Raw URL paths as labels. /users/123, /users/124, /users/125 becomes three series and keeps growing for every new user. Nobody notices until the metrics backend falls over.
  • Using summaries across replicas. Summaries compute quantiles client-side, so you cannot aggregate them — a p99 averaged across 20 pods is mathematically meaningless.
  • Labels that look safe but are not. error_message, user_agent, tenant_id, SQL query text, or country_region all seem fine until a tenant onboards 50,000 customers or a bot sends malformed UAs.
Solutions & Patterns
  • Normalize paths before they hit labels. /users/123 becomes /users/:id. Prefer FastAPI route.path / Express req.route.path which already give you the template — never req.path / request.url.path raw.
  • Keep cardinality under 10,000 per metric. Run topk(10, count by (__name__)({__name__=~".+"})) daily in Prometheus and alert when any metric crosses 10,000 series. Cardinality growth is almost always a bug, caught late.
  • Always use Histograms for latency/size distributions unless you have a specific reason not to. The default buckets are rarely right — tune them to your SLOs (e.g., [0.05, 0.1, 0.25, 0.5, 1, 2.5] for a service targeting sub-100 ms p95).
  • Move high-cardinality dimensions to traces and logs. user_id belongs on a span attribute (where it helps debug a specific request) or a log field (where you can grep), never on a metric label.
  • For Datadog / New Relic users: audit custom metric count monthly. A single dev adding a debugging tag can quietly double your monthly observability bill.
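The cardinality arithmetic behind these pitfalls is simple multiplication, and a rough sketch makes the blow-up visible. The function estimateSeries and the label counts are illustrative assumptions; the result is a worst-case upper bound, since not every label combination occurs in practice.

```javascript
// Upper bound on time series for one metric: the product of distinct values per label.
function estimateSeries(labelCardinalities, histogramBuckets = 0) {
  const combinations = Object.values(labelCardinalities).reduce((acc, n) => acc * n, 1);
  // A histogram exposes one series per configured bucket, plus +Inf, _sum and _count.
  const seriesPerCombination = histogramBuckets > 0 ? histogramBuckets + 3 : 1;
  return combinations * seriesPerCombination;
}

const bounded = { method: 4, path: 20, status: 6 };
const withUserId = { ...bounded, user_id: 100000 };

console.log(estimateSeries(bounded));       // counter: 480
console.log(estimateSeries(bounded, 9));    // histogram with 9 buckets: 5760
console.log(estimateSeries(withUserId, 9)); // same histogram plus user_id: 576000000
```

One extra unbounded label moves the same histogram from a few thousand series to hundreds of millions, which is why the label belongs on a span or log line instead.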
Strong Answer Framework:
  1. Confirm the hypothesis with data, not intuition. Query prometheus_tsdb_symbol_table_size_bytes and prometheus_tsdb_head_series to see the active series count over time. Plot it against the deploy timeline — if series count doubled at 4:03pm Monday when the deploy landed, you have your culprit.
  2. Identify the offending metric. Run topk(20, count by (__name__)({__name__=~".+"})) to find metrics with the most series. Then count by (label_name)({__name__="suspect_metric"}) to find which label is blowing up cardinality.
  3. Mitigate immediately. Drop the offending label at scrape time via Prometheus metric_relabel_configs with a labeldrop action. This is a live config push — no code change needed, no deploy, and Prometheus stops ingesting the bad label within one scrape interval.
  4. Fix the root cause in code. Open a PR that replaces user_email with either a normalized bucket (user_tier: "free"|"paid"|"enterprise") or removes the label and moves user_email to span attributes and logs.
  5. Install a guardrail. Add a CI check that greps metric definitions in the codebase for the high-cardinality label anti-pattern, and a Prometheus alert on prometheus_tsdb_head_series > 3e6 that pages before the OOM.
Real-World Example: Grafana Labs published a 2022 blog post describing how a customer’s Mimir cluster ingested 40 million active series after a single service added a query_id label - ingestion cost jumped 15x overnight before they caught it with cardinality alerts.
Senior Follow-up Questions:
  • “Why not just give Prometheus more RAM?” Scaling vertically is a bandage — the next PR adds another label and you OOM again at double the cost. Cardinality is unbounded in both dimensions; RAM is not.
  • “What if the business genuinely needs per-user latency?” That is a tracing use case, not metrics. Keep 100% error traces plus tail-sampled slow traces; users can query Jaeger/Tempo by user.id. Metrics summarize; traces particularize.
  • “How do you catch this in review before it ships?” Add a lint rule (e.g., a promtool check metrics step or a custom AST linter) that flags any metric definition whose label list includes fields matching a high-cardinality deny list: *_id, email, path, query, url.
Common Wrong Answers:
  • “Just increase Prometheus memory.” Does not fix the growth; you defer the outage by one deploy. Also, Prometheus queries get linearly slower with series count even if it does not OOM.
  • “Switch to Thanos/Cortex/Mimir.” These help with long-term storage and horizontal scaling, but they still charge (in money or RAM) per series. Cardinality discipline is orthogonal to the backend.
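The review-time lint rule mentioned in the follow-up questions can be approximated in a few lines. This is a sketch under assumptions: the metric definitions are passed in as plain objects (a real linter would extract them from source code), and the names DENY_PATTERNS, findRiskyLabels and the allow-list format are hypothetical.

```javascript
// Deny list from the text: *_id, email, path, query, url.
const DENY_PATTERNS = [/_id$/i, /email/i, /^path$/i, /^query$/i, /^url$/i];

function findRiskyLabels(metricDefinitions, allowList = []) {
  const findings = [];
  for (const { name, labelNames = [] } of metricDefinitions) {
    for (const label of labelNames) {
      // Explicit, reviewed exceptions are written as "metric.label".
      if (allowList.includes(`${name}.${label}`)) continue;
      if (DENY_PATTERNS.some((pattern) => pattern.test(label))) {
        findings.push({ metric: name, label });
      }
    }
  }
  return findings;
}

const definitions = [
  { name: 'http_requests_total', labelNames: ['method', 'path', 'status'] },
  { name: 'login_attempts_total', labelNames: ['user_email', 'outcome'] },
  { name: 'orders_created_total', labelNames: ['status'] }
];

// `path` is allowed here only because it carries the normalized route template.
const findings = findRiskyLabels(definitions, ['http_requests_total.path']);
console.log(findings);
```

Failing the build when findings is non-empty forces the author either to remove the label or to add a reviewed allow-list entry.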
Strong Answer Framework:
  1. Name the actual problem: alert fatigue. When alerts fire constantly, humans adapt by ignoring them — this is well-documented in aviation safety research and applies directly to on-call. The alert is the bug, not the on-call behavior.
  2. Audit alert-to-action ratio. For each alert type, ask: “In the last 30 days, how many fired, and how many required human action?” If a HighLatency alert fired 1,500 times and 0 required action, it is a false-positive generator. Delete or silence it.
  3. Replace threshold alerts with SLO burn-rate alerts. Instead of “p99 over 1 second for 5 minutes,” use a multi-window multi-burn-rate alert that pages when error budget consumption rate implies the monthly SLO will be breached. This fires only when the issue is large enough to matter.
  4. Separate tickets from pages. ticket-severity routes to a Jira queue for next-day triage. page-severity wakes someone. Right now, every latency blip is a page — most should be tickets.
  5. Apply the “three strikes” rule. If an alert fires and is not actionable, open a PR to fix, tune, or delete it before the next on-call shift. Untuned alerts accumulate until on-call becomes unbearable.
Real-World Example: Google’s SRE book devotes a full chapter to alert fatigue; it is one of the top reported drivers of SRE burnout at every major tech company. At Shopify (documented in their SRE blog, around 2020), a single over-eager p99 alert accounted for 70% of pages in one quarter and was the number-one cause of missed real incidents - they killed it and replaced it with burn-rate alerts, cutting total pages by 85%.
Senior Follow-up Questions:
  • “How do you define the right SLO to set burn-rate alerts against?” Work backwards from customer experience — 99.9% of checkout requests under 500ms is a real SLO; “p99 under 1s” is a symptom proxy. The SLO must reflect what users actually care about.
  • “What if leadership wants to keep the noisy alert ‘just in case’?” Push back with data: show alerts-per-incident-caught ratio. A 1,500:0 ratio is not “just in case,” it is a distraction mechanism that actively reduces safety.
  • “How do you handle the first week after disabling an alert, just in case it catches something?” Route it to Slack with no paging for two weeks while you monitor. If it never catches anything real, delete it. If it catches something, re-tune and then re-enable paging.
Common Wrong Answers:
  • “Rotate the on-call more often so no one burns out.” This treats the symptom; the alert is still useless and the next on-call ignores it too.
  • “Just add more engineers to the rotation.” Same distribution of noise across more people — total wasted human time stays constant.
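A burn-rate alert boils down to a small calculation, sketched below. The burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and the page condition requires both a long and a short window to exceed the threshold. The function names are hypothetical, and the 14.4 threshold is the example value from the Google SRE workbook for a 1-hour window paired with a 5-minute window; treat it as a starting point to tune, not a fixed rule.

```javascript
// How many times faster than "exactly on budget" the error budget is being spent.
function burnRate(errorRatio, sloTarget) {
  return errorRatio / (1 - sloTarget);
}

// Page only when both windows agree: the long window shows the burn is significant,
// the short window shows it is still happening right now.
function shouldPage({ longWindowErrorRatio, shortWindowErrorRatio }, sloTarget, threshold = 14.4) {
  return (
    burnRate(longWindowErrorRatio, sloTarget) > threshold &&
    burnRate(shortWindowErrorRatio, sloTarget) > threshold
  );
}

const slo = 0.999; // 99.9% of requests succeed

// Sustained 2% errors: burn rate of about 20 in both windows.
console.log(shouldPage({ longWindowErrorRatio: 0.02, shortWindowErrorRatio: 0.02 }, slo)); // true

// Short blip: the 5-minute window spikes, the 1-hour window barely moves.
console.log(shouldPage({ longWindowErrorRatio: 0.002, shortWindowErrorRatio: 0.05 }, slo)); // false

// Incident already over: the long window is still elevated, the short one has recovered.
console.log(shouldPage({ longWindowErrorRatio: 0.02, shortWindowErrorRatio: 0.0005 }, slo)); // false
```

The second and third cases are the ones a plain threshold alert gets wrong: it pages on the blip and keeps paging after recovery.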
Strong Answer Framework:
  1. Identify the root problem: head-based sampling decides before the outcome is known. At the entry point, the tracer rolls a die: keep or drop. By the time you know the request was slow or failed, you have already dropped it. Head-based sampling optimizes for the average case and is blind to the important cases.
  2. Switch to tail-based sampling for incident-critical cases. Tail-based sampling buffers spans until the trace finishes, then decides. This lets you keep 100% of: errors (any 5xx or exception), slow requests (above p99 latency threshold), and traces touching specific high-value flows (checkout, payment, login).
  3. Use a layered strategy. 100% errors + 100% slow + head-sample 1% of the rest. This gives you the debuggable tail without the full cost of 100% sampling.
  4. Implement via OpenTelemetry Collector, not in-app. The tail_sampling processor (shipped in the Collector's contrib distribution) buffers spans for a configurable decision window, 30 seconds by default, and then applies policies (status, latency, attributes). Keeping sampling out of the app means no code changes per service.
  5. Track your sampling-to-incident ratio. After every incident, ask: “Did we have traces for the bad requests?” If the answer is “no” more than once a quarter, the sampling policy needs tuning.
Real-World Example: Uber’s Jaeger team publicly documented (around 2019) that they run tail-based sampling specifically to guarantee all error traces are captured; head-based sampling was abandoned after multiple incidents where the key traces were missing. Honeycomb’s BubbleUp feature is built around the same insight: the interesting traces are the ones a random sampler would drop.
Senior Follow-up Questions:
  • “What is the memory cost of tail-based sampling?” The Collector must buffer every trace until its decision window closes (typically 10-30s). For a service doing 10k RPS with one trace per request, that is roughly 100k-300k buffered traces x 5-10 spans x 1 KB, which is on the order of 0.5-3 GB. Size the Collector accordingly or use a dedicated tail-sampling gateway.
  • “How do you handle the memory pressure when traffic spikes?” Two strategies: shed sampling decisions (drop to head-based during pressure) or shard by trace ID across multiple Collector replicas so each sees a consistent subset of traces.
  • “What about PII in span attributes?” Apply a span processor before export that redacts http.request.body, db.statement (when it contains parameters), and any attribute matching a PII deny list. Redact at the edge, never at the backend, because the backend may be a third-party SaaS.
Common Wrong Answers:
  • “Just sample 100% and pay more.” For services above a few thousand RPS, 100% sampling costs are prohibitive (often $50k+/month on Datadog APM), and the vast majority of spans are never looked at. The right answer is smarter sampling.
  • “Increase head-based sample rate to 10%.” You still drop 90% of the error and slow traces; you just pay 10x for the privilege of missing most of them.
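The layered strategy (keep all errors, keep all slow traces, keep a small baseline of the rest) can be sketched as a decision function over a finished trace. This is an illustration only: the trace shape, the keepTrace name, and the thresholds are assumptions, and in production the equivalent policies would live in the Collector's tail_sampling configuration rather than in application code.

```javascript
function keepTrace(trace, { slowMs = 1000, baselineRate = 0.01 } = {}) {
  // Policy 1: any failed span keeps the whole trace.
  if (trace.spans.some((span) => span.error || span.httpStatus >= 500)) {
    return { keep: true, reason: 'error' };
  }
  // Policy 2: slow end-to-end latency keeps the trace.
  if (trace.durationMs >= slowMs) {
    return { keep: true, reason: 'slow' };
  }
  // Policy 3: deterministic baseline. Map the first 32 bits of the trace ID to [0, 1)
  // so every replica reaches the same decision for the same trace.
  const bucket = parseInt(trace.traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < baselineRate
    ? { keep: true, reason: 'baseline' }
    : { keep: false, reason: 'dropped' };
}

const failed = { traceId: 'ffffffff000000000000000000000001', durationMs: 40, spans: [{ httpStatus: 503 }] };
const slow = { traceId: 'ffffffff000000000000000000000002', durationMs: 2400, spans: [{ httpStatus: 200 }] };
const healthy = { traceId: 'ffffffff000000000000000000000003', durationMs: 40, spans: [{ httpStatus: 200 }] };
const lucky = { traceId: '00000001000000000000000000000004', durationMs: 40, spans: [{ httpStatus: 200 }] };

for (const trace of [failed, slow, healthy, lucky]) {
  console.log(trace.traceId.slice(-1), keepTrace(trace));
}
```

The decision can only be made once durationMs and every span status are known, which is the reason tail-based sampling needs the buffering discussed above.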
// observability/metrics.js
const client = require('prom-client');

class Metrics {
  constructor(options = {}) {
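    // Attach the service name to every metric from this process so that
    // queries such as `sum by (service)` also work for the custom metrics below
    if (options.serviceName) {
      client.register.setDefaultLabels({ service: options.serviceName });
    }

    // Enable default Node.js process metrics (event loop lag, heap, GC, ...)
    client.collectDefaultMetrics({
      prefix: options.prefix || ''
    });

    this.register = client.register;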

    // Custom metrics
    this.httpRequestsTotal = new client.Counter({
      name: 'http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'path', 'status']
    });

    this.httpRequestDuration = new client.Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request latency in seconds',
      labelNames: ['method', 'path', 'status'],
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
    });

    this.activeConnections = new client.Gauge({
      name: 'active_connections',
      help: 'Number of active connections',
      labelNames: ['type']
    });

    this.businessMetrics = {
      ordersCreated: new client.Counter({
        name: 'orders_created_total',
        help: 'Total number of orders created',
        labelNames: ['status']
      }),
      
      orderValue: new client.Histogram({
        name: 'order_value_dollars',
        help: 'Order value in dollars',
        buckets: [10, 25, 50, 100, 250, 500, 1000]
      }),

      paymentProcessingTime: new client.Histogram({
        name: 'payment_processing_seconds',
        help: 'Payment processing time in seconds',
        labelNames: ['provider', 'status'],
        buckets: [0.1, 0.5, 1, 2, 5, 10]
      })
    };
  }

  // Middleware for HTTP metrics
  middleware() {
    return (req, res, next) => {
      const start = process.hrtime();

      res.on('finish', () => {
        const [seconds, nanoseconds] = process.hrtime(start);
        const duration = seconds + nanoseconds / 1e9;
        
        const labels = {
          method: req.method,
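          // Prefer the route template; baseUrl keeps the mount prefix of nested routers
          path: this.normalizePath(req.route ? req.baseUrl + req.route.path : req.path),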
          status: res.statusCode
        };

        this.httpRequestsTotal.inc(labels);
        this.httpRequestDuration.observe(labels, duration);
      });

      next();
    };
  }

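  // Normalize path to prevent cardinality explosion. Only whole path segments
  // are replaced, so a segment like "/2fa" is left alone.
  normalizePath(path) {
    return path
      .replace(/\/[0-9a-f]{24}(?=\/|$)/gi, '/:id')  // MongoDB ObjectIds
      .replace(/\/\d+(?=\/|$)/g, '/:id');            // Numeric IDs
  }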

  // Get metrics for Prometheus scraping
  async getMetrics() {
    return this.register.metrics();
  }

  // Record business metric
  recordOrder(status, value) {
    this.businessMetrics.ordersCreated.inc({ status });
    this.businessMetrics.orderValue.observe(value);
  }

  recordPayment(provider, status, duration) {
    this.businessMetrics.paymentProcessingTime.observe(
      { provider, status },
      duration
    );
  }
}

module.exports = { Metrics };

Metrics Endpoint

Prometheus scrapes your service by polling a /metrics endpoint at a configurable interval (typically every 15 seconds). You must expose this endpoint on every service, and it must be cheap to serve - Prometheus will hit it thousands of times per day per replica. The endpoint returns a plaintext format that Prometheus parses. Exposing it on the same port as your API is simplest for development, but in production you often expose it on a separate port (9090) that isn’t publicly routable, so your metrics aren’t leaked to the internet and aren’t subject to auth middleware that would break Prometheus scraping.
// routes/metrics.js
const express = require('express');
const { Metrics } = require('../observability/metrics');

const router = express.Router();
const metrics = new Metrics({ serviceName: 'order-service' });

// Prometheus scrape endpoint
router.get('/metrics', async (req, res) => {
  res.set('Content-Type', metrics.register.contentType);
  res.send(await metrics.getMetrics());
});

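// Apply metrics middleware. There is no `app` in this module, so register it on
// the router and mount the router before your API routes: app.use(router)
router.use(metrics.middleware());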

module.exports = { router, metrics };
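It helps to see what the plaintext format behind /metrics actually looks like. The sketch below renders a counter by hand in the Prometheus text exposition format; renderCounter and escapeLabelValue are hypothetical helpers for illustration only, since prom-client's register.metrics() produces this output for you.

```javascript
// Label values must escape backslashes, double quotes and newlines.
function escapeLabelValue(value) {
  return String(value).replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, labelValue]) => `${key}="${escapeLabelValue(labelValue)}"`)
      .join(',');
    lines.push(`${name}{${labelText}} ${value}`);
  }
  return lines.join('\n') + '\n';
}

const exposition = renderCounter('http_requests_total', 'Total number of HTTP requests', [
  { labels: { method: 'GET', path: '/users/:id', status: 200 }, value: 42 },
  { labels: { method: 'POST', path: '/orders', status: 500 }, value: 3 }
]);

console.log(exposition);
```

Each line with a distinct label set is one time series, which makes the cost of every extra label value easy to see in the raw output.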

Prometheus Configuration

Prometheus configuration is declarative YAML: you tell it what to scrape, how often, and under what labels. The two common service discovery modes are static (hardcoded list of targets) and kubernetes_sd (auto-discover pods via Kubernetes API). The Kubernetes discovery pattern is especially powerful - it uses pod annotations (prometheus.io/scrape: "true") to decide what to scrape, so new services automatically appear in Prometheus as soon as they’re deployed with the right annotations. Without auto-discovery, adding a new microservice requires a manual Prometheus config change, which becomes a coordination bottleneck as teams grow.
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'microservices'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  # For non-Kubernetes environments
  - job_name: 'services'
    static_configs:
      - targets: 
          - 'order-service:3000'
          - 'payment-service:3000'
          - 'inventory-service:3000'
    metrics_path: '/metrics'
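The __address__ relabel rule in the Kubernetes job is the least obvious part of this config. Prometheus joins the source_labels values with a semicolon, matches the fully anchored regex against the joined string, and writes the replacement into the target label. The sketch below replays that logic in JavaScript; relabelAddress is a hypothetical helper and the addresses are made-up examples.

```javascript
function relabelAddress(address, portAnnotation) {
  // source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
  const joined = `${address};${portAnnotation}`;
  // Same pattern as the config, anchored at both ends as Prometheus does.
  const anchored = /^(?:([^:]+)(?::\d+)?;(\d+))$/;
  // When the regex does not match, the replace action leaves the target label untouched.
  return anchored.test(joined) ? joined.replace(anchored, '$1:$2') : address;
}

console.log(relabelAddress('10.0.0.5:8080', '9464')); // port swapped for the annotated one
console.log(relabelAddress('10.0.0.5', '9464'));      // port appended
console.log(relabelAddress('10.0.0.5:8080', ''));     // no annotation: address unchanged
```

This is why a pod only needs the prometheus.io/port annotation to be scraped on a port different from the one Kubernetes discovered.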

Alert Rules

Alerts are the difference between observability and paging. Good alerts wake people up only when action is required; bad alerts create alert fatigue where engineers start ignoring pages. The rules below embody two important practices: the for: 5m clause ensures an alert only fires if the condition persists for 5 minutes (avoiding flaps from transient blips), and severity labels feed routing rules in Alertmanager so critical alerts go to PagerDuty at 2 AM while warnings go to Slack and wait until morning. Every alert should answer three questions: what is wrong, what is the business impact, and what should the on-call do? Summary and description fields exist to answer exactly these.
Caveats & Common Pitfalls in Alerting
  • Alert fatigue from low-signal alerts. A HighLatency alert that fires 40 times a day and is never actionable trains the on-call engineer to ack-and-ignore. When a real incident arrives, it is indistinguishable from the noise — so it is ignored too.
  • Alerting on causes, not symptoms. CPU > 80% sounds useful until you realize high CPU is sometimes healthy (a batch job, an autoscaling event) and low CPU can coexist with a total outage (deadlocked process, upstream timeout). Page on what users experience (error rate, latency) and keep resource metrics for diagnosis.
  • No runbook. The page says HighErrorRate on order-service and nothing else. The 3am on-call has no idea what to do. Most teams do not write runbooks because nobody has time, so the alert remains a panic trigger rather than a prompt.
  • Flaky alerts that page and auto-resolve in 30 seconds. Transient network blips trigger a page, then auto-resolve before the engineer opens the laptop. Three of these a night and nobody sleeps — but nothing ever needed fixing.
Solutions & Patterns
  • Use multi-window multi-burn-rate alerts for SLOs, per the Google SRE workbook. A short-window alert fires on sudden large burn; a long-window alert catches slow steady degradation. Together they keep noise to fewer than 3 pages per week per service while catching real issues fast.
  • Every alert has a runbook link in annotations.runbook_url. If you cannot write a runbook for it, you should not page for it — it is not actionable.
  • Tier alerts strictly: page / ticket / log. Only incident-grade issues page. Everything else goes to a queue (Jira / GitHub issue) for next-day triage or a Slack channel for awareness.
  • Review alert fire-to-action ratio monthly. For each alert type, count fires vs. times the on-call took action. Ratios worse than 5:1 are deleted or tuned.
  • Use for: clauses to require persistence (5-10 minutes is typical for latency; 1 minute is fine for up == 0 which is already persistent by nature).
# prometheus/alerts.yml
groups:
  - name: microservices
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
          > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s for {{ $labels.service }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state{state="open"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for {{ $labels.target_service }}"
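The effect of the for: clause is easiest to see as a small state machine: an alert is inactive, becomes pending when the expression first holds, and only fires once it has held continuously for the configured duration. The sketch below is a simplified model of that behavior with made-up sample data; evaluateAlert is a hypothetical name, not part of Prometheus.

```javascript
// samples: [{ t: secondsSinceStart, value }], one entry per rule evaluation.
function evaluateAlert(samples, threshold, forSeconds) {
  let pendingSince = null;
  let state = 'inactive';
  for (const { t, value } of samples) {
    if (value > threshold) {
      if (pendingSince === null) pendingSince = t;
      state = t - pendingSince >= forSeconds ? 'firing' : 'pending';
    } else {
      // Any evaluation below the threshold resets the timer.
      pendingSince = null;
      state = 'inactive';
    }
  }
  return state;
}

// Error ratio evaluated once a minute against the 5% threshold with for: 5m.
const blip = [0.01, 0.08, 0.09, 0.01].map((value, i) => ({ t: i * 60, value }));
const sustained = [0.08, 0.08, 0.09, 0.07, 0.08, 0.08, 0.09].map((value, i) => ({ t: i * 60, value }));

console.log(evaluateAlert(blip, 0.05, 300));                  // inactive
console.log(evaluateAlert(sustained.slice(0, 4), 0.05, 300)); // pending
console.log(evaluateAlert(sustained, 0.05, 300));             // firing
```

The two-minute blip never leaves the pending state, so nobody is paged for it.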

Grafana Dashboards

Docker Compose for Observability Stack

The full observability stack involves several moving parts that work together: Prometheus scrapes and stores metrics, Grafana visualizes, Jaeger stores traces, Loki stores logs, and the OpenTelemetry Collector receives OTLP data and routes it to the right backend. Running this locally via Docker Compose is a great way to understand the data flow - instrument one small service and watch metrics, traces, and logs flow into their respective backends. In production you’d run each of these as its own deployment (or use a managed service like Grafana Cloud, Datadog, or New Relic), but the data flow pattern is the same.
# docker-compose.observability.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # Collector HTTP
      - "14250:14250"  # Collector gRPC

  otel-collector:
    image: otel/opentelemetry-collector:latest
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./otel/otel-collector-config.yaml:/etc/otelcol/config.yaml

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail:/etc/promtail

volumes:
  prometheus-data:
  grafana-data:
  loki-data:

OpenTelemetry Collector Configuration

The OTel Collector config uses a pipeline model: receivers take data in, processors transform it, exporters send it out. The batch processor is essential - it groups spans before export to reduce network overhead (hundreds of tiny exports become one big export). The memory_limiter protects the Collector from OOM when it’s overwhelmed by more data than it can export. Splitting pipelines by signal type (traces, metrics, logs) lets you have different processing and different destinations per signal. Misunderstanding this pipeline is a common operational surprise: a misconfigured exporter in the trace pipeline can cause silent data loss, and you only notice hours later when you look for traces that should be there.
# otel/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

exporters:
  # Jaeger accepts OTLP natively; the dedicated `jaeger` exporter has been
  # removed from recent Collector releases
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: microservices

  # `debug` replaces the older `logging` exporter
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Sample Grafana Dashboard JSON

{
  "title": "Microservices Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Active Instances",
      "type": "stat",
      "targets": [
        {
          "expr": "count(up{job=\"microservices\"} == 1) by (service)"
        }
      ]
    }
  ]
}

RED Method for Microservices

The RED Method focuses on three key metrics: Rate, Errors, and Duration. It was popularized by Tom Wilkie (who introduced it while at Weaveworks and later joined Grafana Labs) as a pragmatic minimum viable monitoring approach for request-driven services. The insight is that for any service that serves requests, most of its operational health can be summarized by these three numbers. Rate tells you whether the service is under load. Errors tell you whether it’s serving those requests successfully. Duration tells you whether users are happy with the response time. If all three look healthy, the service is probably fine from the user’s perspective, even if CPU is 80% or memory is 70%.
┌─────────────────────────────────────────────────────────────────────────────┐
│                          RED METHOD                                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  R - RATE                                                                    │
│  ─────────                                                                   │
│  How many requests per second?                                               │
│                                                                              │
│  Metric: sum(rate(http_requests_total[5m])) by (service)                    │
│                                                                              │
│                                                                              │
│  E - ERRORS                                                                  │
│  ──────────                                                                  │
│  How many requests are failing?                                              │
│                                                                              │
│  Metric: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)     │
│          / sum(rate(http_requests_total[5m])) by (service)                  │
│                                                                              │
│                                                                              │
│  D - DURATION                                                                │
│  ───────────                                                                 │
│  How long do requests take?                                                  │
│                                                                              │
│  Metric: histogram_quantile(0.99,                                           │
│            sum(rate(http_request_duration_seconds_bucket[5m]))              │
│            by (le, service))                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
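The Rate and Errors queries both rest on rate(), which turns ever-increasing counters into per-second values and has to cope with counters restarting from zero. Here is a simplified sketch of that calculation on raw samples; the rate helper and the numbers are illustrative assumptions, and the real PromQL function additionally extrapolates to the window boundaries.

```javascript
// Per-second increase over a window of counter samples, tolerating resets.
function rate(samples) {
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // A drop means the process restarted and the counter began again from zero.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return seconds > 0 ? increase / seconds : 0;
}

// http_requests_total scraped every 15s; the service restarts before t=45.
const allRequests = [
  { t: 0, value: 1000 }, { t: 15, value: 1150 }, { t: 30, value: 1300 },
  { t: 45, value: 50 }, { t: 60, value: 200 }
];
const failedRequests = [
  { t: 0, value: 10 }, { t: 15, value: 13 }, { t: 30, value: 16 },
  { t: 45, value: 1 }, { t: 60, value: 4 }
];

const requestRate = rate(allRequests);               // R: requests per second
const errorRatio = rate(failedRequests) / requestRate; // E: share of failing requests

console.log(requestRate.toFixed(2), errorRatio.toFixed(3));
```

Without the reset handling, the restart at t=45 would show up as a large negative rate instead of a steady load.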

Log Aggregation with Loki

Loki takes a fundamentally different approach to logs than Elasticsearch/Splunk: it indexes only labels (service, level, pod), not the log content itself. This can make it dramatically cheaper to run at scale, because indexing log bodies is what makes Elasticsearch expensive. The tradeoff is that full-text search over log contents is slower than in Elasticsearch - you compensate by narrowing down with labels first (service + time range) and then grepping the matched log lines. For most microservices use cases (find all errors in service X in the last hour, correlate with a trace ID), this pattern works great and saves huge amounts of infrastructure cost.
The transport below batches log entries and pushes to Loki’s HTTP API. Batching is essential - without it, every log line triggers an HTTP round-trip, which would saturate your service’s I/O and slow down every request. The 100-entry batch size with a 1-second flush interval is a typical tradeoff: you accept up to 1 second of log lag in exchange for up to 100x fewer HTTP requests.
In practice you’d usually use a sidecar agent (Promtail, Fluent Bit, Vector) to avoid running this logic inside your application process at all - the app writes to stdout in JSON, the sidecar ships to Loki.
// observability/lokiTransport.js
const Transport = require('winston-transport');
const axios = require('axios');

class LokiTransport extends Transport {
  constructor(opts) {
    super(opts);
    this.lokiUrl = opts.lokiUrl || 'http://localhost:3100';
    this.labels = opts.labels || {};
    this.batchSize = opts.batchSize || 100;
    this.batchInterval = opts.batchInterval || 1000;
    this.batch = [];
    
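    // unref() so the flush timer does not keep the process alive on shutdown
    this.timer = setInterval(() => this.flush(), this.batchInterval);
    this.timer.unref();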
  }

  log(info, callback) {
    const logEntry = {
      // Nanoseconds as a string: Date.now() * 1000000 exceeds
      // Number.MAX_SAFE_INTEGER and silently loses precision
      ts: `${Date.now()}000000`,
      line: JSON.stringify(info)
    };

    this.batch.push(logEntry);

    if (this.batch.length >= this.batchSize) {
      this.flush();
    }

    callback();
  }

  async flush() {
    if (this.batch.length === 0) return;

    const entries = this.batch;
    this.batch = [];

    const payload = {
      streams: [
        {
          stream: this.labels,
          values: entries.map(e => [String(e.ts), e.line])
        }
      ]
    };

    try {
      await axios.post(`${this.lokiUrl}/loki/api/v1/push`, payload, {
        headers: { 'Content-Type': 'application/json' }
      });
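    } catch (error) {
      console.error('Failed to send logs to Loki:', error.message);
      // Re-add to batch for retry, keeping only the newest entries so a
      // prolonged Loki outage cannot grow memory without bound
      const maxBuffered = this.batchSize * 10;
      this.batch = entries.concat(this.batch).slice(-maxBuffered);
    }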
  }
}

module.exports = { LokiTransport };
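
To make the wire format concrete, here is a minimal sketch of the JSON body that the transport above sends to `/loki/api/v1/push`, built without winston or axios. The helper names (`toNanoString`, `buildLokiPayload`) and the sample labels are illustrative assumptions, not part of the original transport. Timestamps are produced via BigInt so that nanosecond values never pass through a JavaScript number.

```javascript
// Sketch: build the request body for Loki's push endpoint from a batch of records.
// toNanoString, buildLokiPayload, and the sample labels are hypothetical names.

// Millisecond epoch -> nanosecond epoch as a decimal string.
// BigInt avoids the precision loss of Date.now() * 1e6.
function toNanoString(epochMs) {
  return (BigInt(epochMs) * 1000000n).toString();
}

function buildLokiPayload(labels, records) {
  return {
    streams: [
      {
        // Labels are what Loki indexes, so keep them low-cardinality.
        stream: labels,
        // Each value is a [timestamp-in-ns-as-string, log line] pair.
        values: records.map((r) => [toNanoString(r.timestampMs), JSON.stringify(r.body)]),
      },
    ],
  };
}

const payload = buildLokiPayload(
  { service: 'order-service', level: 'error' },
  [{ timestampMs: 1700000000000, body: { message: 'payment failed', traceId: 'abc123' } }]
);
console.log(JSON.stringify(payload, null, 2));
```

Note that the trace ID lives inside the log line rather than in the labels: putting it in `stream` would create one Loki stream per request.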

Interview Questions

Answer:
  1. Logs: Discrete events with rich context
    • What happened at specific moments
    • Good for debugging
    • Structured JSON format preferred
  2. Metrics: Numeric time-series data
    • System performance indicators
    • Good for alerting and dashboards
    • Examples: request rate, error rate, latency
  3. Traces: Request path through system
    • How requests flow across services
    • Latency breakdown per component
    • Good for debugging distributed issues
Together: Full visibility into distributed systems
Answer:
Components:
  • Trace: Complete request journey
  • Span: Single operation within trace
  • Context: Trace ID + Span ID propagated across services
How it works:
  1. First service creates trace ID
  2. Each operation creates a span with parent reference
  3. Context propagated via headers
  4. All spans collected and visualized
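Key headers:
  • traceparent: W3C standard trace context (carries the trace ID, parent span ID, and sampling flag)
  • X-B3-TraceId / X-B3-SpanId: Zipkin B3 propagation headers
  • X-Trace-ID / X-Span-ID: custom, non-standard trace and span identifiers used by some in-house setups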
Tools: Jaeger, Zipkin, OpenTelemetry
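
The propagation step above can be sketched with the W3C `traceparent` header, whose format is `version-traceid-parentid-flags` in lowercase hex. This is a minimal illustration using only Node's built-in `crypto` module; the function names are hypothetical, marking new traces as sampled is an assumption, and a real service would normally let the OpenTelemetry SDK handle propagation.

```javascript
// Sketch: parse an incoming traceparent header and create the header for a child span.
// parseTraceparent and childTraceparent are hypothetical helper names.
const crypto = require('crypto');

// version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header || '');
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Continue the incoming trace, or start a new one if the header is missing or invalid.
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  const traceId = parent ? parent.traceId : crypto.randomBytes(16).toString('hex');
  const spanId = crypto.randomBytes(8).toString('hex'); // new span ID for this hop
  // Assumption: new traces are sampled; existing traces keep the caller's decision.
  const flags = parent && !parent.sampled ? '00' : '01';
  return `00-${traceId}-${spanId}-${flags}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const outgoing = childTraceparent(incoming);
console.log(outgoing); // same trace ID, fresh span ID
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets the backend stitch spans into a single trace.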
Answer:
RED = Rate, Errors, Duration
  1. Rate: Requests per second
    • Indicates load on service
    • Alert on unusual spikes/drops
  2. Errors: Failed requests ratio
    • Track 4xx and 5xx separately
    • Alert when error rate exceeds threshold
  3. Duration: Request latency (p50, p95, p99)
    • User experience indicator
    • Alert on latency degradation
Why RED:
  • Simple, focused metrics
  • Covers most service health scenarios
  • Easy to implement and understand
Complementary: USE method for infrastructure (Utilization, Saturation, Errors)
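
As a rough illustration of what the three RED numbers mean, the sketch below derives them from raw request records for a single time window. The record shape (`{ status, durationMs }`) and function names are assumptions for illustration; in a real deployment these values would come from Prometheus counters and histograms rather than in-memory arrays.

```javascript
// Sketch: derive RED metrics from raw request records for one time window.
// The record shape ({ status, durationMs }) is an assumption for illustration.

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function redMetrics(requests, windowSeconds) {
  const total = requests.length;
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const count = (predicate) => requests.filter(predicate).length;
  return {
    // Rate
    ratePerSecond: total / windowSeconds,
    // Errors, with 4xx and 5xx tracked separately
    clientErrorRatio: total ? count((r) => r.status >= 400 && r.status < 500) / total : 0,
    serverErrorRatio: total ? count((r) => r.status >= 500) / total : 0,
    // Duration
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
    p99Ms: percentile(durations, 99),
  };
}

// Ten requests observed in a 5-second window: eight 200s, one 404, one 500.
const sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100].map((durationMs, i) => ({
  durationMs,
  status: i === 8 ? 404 : i === 9 ? 500 : 200,
}));
const red = redMetrics(sample, 5);
console.log(red);
```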

Summary

Key Takeaways

  • Three pillars: Logs, Metrics, Traces
  • Use structured logging with correlation IDs
  • OpenTelemetry for vendor-neutral instrumentation
  • Prometheus for metrics, Grafana for visualization
  • RED method for service health

Next Steps

In the next chapter, we’ll cover Security - authentication, authorization, and secure communication.

Interview Deep-Dive

Strong Answer:
My investigation follows a structured funnel: wide to narrow, infrastructure to application.
Step one: Grafana dashboard - the RED metrics view. I check which service has the highest latency spike. If the Order Service P99 jumped from 200ms to 5 seconds but Payment Service is steady, I know the bottleneck is in Order Service or something it calls. If all services spiked simultaneously, it is likely an infrastructure issue (database, network, DNS).
Step two: Jaeger trace search. I filter for traces with duration greater than 3 seconds in the affected time window. I look at 5-10 slow traces to find the pattern. Usually, one span dominates - say, the inventory check span went from 50ms to 4 seconds. That tells me exactly which service call is the bottleneck.
Step three: drill into that service. I check its Prometheus metrics: CPU usage, memory, connection pool utilization, database query latency. If the database query latency histogram shows P99 at 3 seconds, the problem is in the database layer. If the service CPU is at 95%, it is compute-bound. If the connection pool active count equals the max pool size, it is connection-starved.
Step four: log correlation. I take the trace ID from one of the slow traces and search in the log aggregator (Loki, ELK). The correlated logs from every service in the call chain show me exactly what happened for that request. Maybe the database returned a lock timeout error, or an external API returned a 429 rate limit response.
Step five: mitigation before root cause. If the issue is a slow database query, I might add a cache or increase connection pool size as an immediate fix while I investigate the root cause (missing index, table lock, vacuum not running). The goal at 2 AM is “stop the bleeding” - root cause analysis happens in the morning.
The key discipline: do not skip steps. I have seen engineers jump straight to code review at 2 AM, spending an hour reading application logic when a 30-second look at the Grafana dashboard would have shown the database server was at 100% disk I/O.
Follow-up: “How do you set up alerts that catch this before customers notice and before the on-call gets paged?”
I use multi-window burn-rate alerting: an alert fires when the error budget is being consumed 10x faster than sustainable. Concretely: if the SLO is 99.9% availability (about 43 minutes of downtime per 30-day month), the sustainable error rate is 0.1%, so a 10x burn rate means roughly 1% of requests failing over the last 5 minutes. This catches degradation early because it triggers on how fast the budget is being consumed, not on a fixed absolute threshold. I also set a slower alert over a 1-hour window to catch gradual degradation that the 5-minute window misses.
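
The burn-rate arithmetic behind this kind of alerting can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The rule names and the slow-window threshold of 2 are assumptions for illustration; only the 5-minute and 1-hour windows and the 10x fast threshold come from the answer above.

```javascript
// Sketch: error-budget burn-rate checks over two windows.
// Rule names and the slow-burn threshold are illustrative assumptions.

// Total error budget in minutes for a period, e.g. ~43.2 for 99.9% over 30 days.
function errorBudgetMinutes(sloTarget, days) {
  return (1 - sloTarget) * days * 24 * 60;
}

// How many times faster than "sustainable" the budget is being consumed.
function burnRate(errorRatio, sloTarget) {
  return errorRatio / (1 - sloTarget);
}

const rules = [
  { name: 'fast-burn', windowMinutes: 5, threshold: 10 },
  { name: 'slow-burn', windowMinutes: 60, threshold: 2 }, // assumed threshold
];

// errorRatioByWindow maps window length in minutes -> observed error ratio.
function firingAlerts(errorRatioByWindow, sloTarget, alertRules) {
  return alertRules
    .filter((rule) => burnRate(errorRatioByWindow[rule.windowMinutes], sloTarget) >= rule.threshold)
    .map((rule) => rule.name);
}

// Sudden spike: 2% errors in the last 5 minutes, 1-hour average still at budget.
const spike = firingAlerts({ 5: 0.02, 60: 0.001 }, 0.999, rules);
// Gradual degradation: a steady 0.3% error ratio in both windows.
const gradual = firingAlerts({ 5: 0.003, 60: 0.003 }, 0.999, rules);
console.log(spike, gradual);
```

In Prometheus the same comparison would typically be expressed as alerting rules over `rate()` of error and total request counters; the JavaScript here only shows the math.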
Strong Answer:
Cardinality explosion happens when a metric label has too many unique values, causing the number of time series to grow unbounded. Prometheus stores each unique combination of labels as a separate time series, and each time series costs memory and disk.
The classic example: someone adds a user_id label to an HTTP request duration metric. With 1 million users, you now have 1 million time series for that single metric. Multiply by the number of methods, paths, and status codes, and you can easily reach 100 million time series. Prometheus will run out of memory and crash.
I have dealt with this in production in three ways.
First, never use unbounded values as metric labels. User IDs, request IDs, session tokens, order IDs - these are all unbounded and must never be labels. They belong in logs and traces, not metrics. Metric labels should be low-cardinality: HTTP method (5 values), status code class (5 values: 2xx, 3xx, 4xx, 5xx, other), service name (tens of values), endpoint pattern (tens of values).
Second, normalize URL paths in metrics. If your API has /users/123, /users/456, etc., each unique path becomes a separate label value. I normalize to /users/:id before recording the metric. This is the single most common cause of cardinality explosion in REST API metrics.
Third, set up cardinality monitoring. I run a query like count({__name__=~".+"}) by (__name__) in Prometheus to see which metrics have the most time series. Any metric with more than 10,000 series gets investigated. I also set an alert when the total number of active time series exceeds a threshold relative to Prometheus’s memory allocation.
At a company I worked at, a well-intentioned engineer added a query_template label to database metrics, where each unique SQL query became a label value. With thousands of unique queries (including those with interpolated parameters), Prometheus ingestion rate quadrupled and the monitoring stack went down during a production incident - the exact moment we needed it most. We caught it at 3 AM and spent two hours rolling back the metric definition.
Follow-up: “How do you balance the need for detailed observability with the cost of storing all that data?”
I use a tiered retention strategy. High-resolution metrics (15-second scrape interval) are kept for 2 weeks - this is for active debugging. Downsampled metrics (5-minute averages) are kept for 6 months via Thanos or Cortex - this is for trend analysis. Distributed traces are sampled: I keep 100% of error traces and 1-5% of successful traces, with tail-based sampling for latency outliers (always keep traces above P99 latency), since the decision to keep a slow or failed trace can only be made after the request has finished. Logs are the most expensive to store, so I use log levels aggressively: DEBUG in development, INFO in production, and ERROR-only retention after 30 days.
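
The path normalization described in this answer can be sketched as a small pure function. The `:id` and `:uuid` placeholders and the function names are illustrative assumptions; frameworks that expose the matched route template make hand-written normalization unnecessary. The second helper shows why label choice matters: the upper bound on series for one metric is the product of the distinct values of each label.

```javascript
// Sketch: collapse high-cardinality URL paths into route templates before
// using them as a metric label. Placeholder names are assumptions.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function normalizePath(path) {
  const pathname = path.split('?')[0]; // query strings never belong in a label
  return pathname
    .split('/')
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ':id';
      if (UUID_RE.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

// Upper bound on time series for one metric: product of distinct values per label.
function maxSeries(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

console.log(normalizePath('/users/456/orders/789?expand=items'));
console.log(maxSeries({ method: 5, statusClass: 5, endpoint: 40 }));
console.log(maxSeries({ method: 5, statusClass: 5, endpoint: 40, userId: 1000000 }));
```

With bounded labels the metric tops out at 1,000 series; adding a `userId` label with a million values raises the bound to a billion.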
Strong Answer:
RED (Rate, Errors, Duration) measures how well a service is serving requests from the user’s perspective. USE (Utilization, Saturation, Errors) measures how well infrastructure resources are performing. They answer fundamentally different questions: RED tells you “are users happy?” while USE tells you “are resources healthy?”
I apply RED to every request-serving microservice: order service, payment service, user service. The dashboard for each service shows requests per second, error percentage, and latency percentiles (P50, P95, P99). If RED metrics look bad, something is wrong from the user’s perspective, even if the servers look fine.
I apply USE to infrastructure components: databases, message brokers, caches, CPU, memory, disk, network. For PostgreSQL: utilization is connection pool usage percentage, saturation is the number of queued connections, errors are failed queries. For Kafka: utilization is partition throughput relative to capacity, saturation is consumer lag (messages waiting to be processed), errors are failed produce/consume operations.
The gap in RED: it tells you something is wrong but not why. If P99 latency jumped, RED does not tell you whether it is a slow database, a noisy neighbor on the same host, or a network issue. You need USE metrics of the underlying infrastructure to diagnose the root cause.
The gap in USE: it can show all green while users are unhappy. If you have 20% CPU utilization and 30% memory usage, USE says “everything is fine.” But if your application has a deadlock that causes 50% of requests to time out, RED catches it immediately. USE misses application-level problems.
In practice, I build dashboards with RED metrics at the top (the “user experience” view) and USE metrics below (the “infrastructure health” view). Investigation flows top-down: RED alerts you to a problem, USE helps you diagnose it.
Follow-up: “What about the Four Golden Signals from Google’s SRE book? How do those relate?”
The Four Golden Signals (Latency, Traffic, Errors, Saturation) are essentially RED plus Saturation from USE. Latency equals Duration, Traffic equals Rate, Errors maps directly. Google adds Saturation because it is the leading indicator: a system at 95% memory utilization has not failed yet but is about to. I think of the Four Golden Signals as the pragmatic union of RED and USE, which is why Google recommends them as the minimum viable monitoring for any service.