Observability

In distributed systems, understanding what’s happening across services is crucial. Observability gives you visibility into your system through three pillars: Logs, Metrics, and Traces.
Learning Objectives:
  • Understand the three pillars of observability
  • Implement structured logging
  • Set up distributed tracing with OpenTelemetry
  • Configure metrics with Prometheus
  • Build dashboards with Grafana

Three Pillars of Observability

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THREE PILLARS OF OBSERVABILITY                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                              LOGS                                    │    │
│  │                                                                      │    │
│  │  What happened?                                                      │    │
│  │                                                                      │    │
│  │  [2024-01-15T10:30:00Z] INFO  order-service: Order created          │    │
│  │  [2024-01-15T10:30:01Z] ERROR payment-service: Payment failed       │    │
│  │  [2024-01-15T10:30:02Z] WARN  inventory-service: Low stock alert   │    │
│  │                                                                      │    │
│  │  • Discrete events                                                   │    │
│  │  • Rich context                                                      │    │
│  │  • Good for debugging                                                │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                             METRICS                                  │    │
│  │                                                                      │    │
│  │  How is the system performing?                                       │    │
│  │                                                                      │    │
│  │  request_count{service="order",status="200"} = 1523                 │    │
│  │  request_duration_seconds{service="order",p99} = 0.245              │    │
│  │  error_rate{service="payment"} = 0.02                               │    │
│  │                                                                      │    │
│  │  • Numeric time-series data                                          │    │
│  │  • Aggregated values                                                 │    │
│  │  • Good for alerting & dashboards                                    │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                             TRACES                                   │    │
│  │                                                                      │    │
│  │  How do requests flow through the system?                            │    │
│  │                                                                      │    │
│  │  [trace-id: abc123]                                                  │    │
│  │  └─ order-service (50ms)                                             │    │
│  │     ├─ validate-order (5ms)                                          │    │
│  │     ├─ payment-service (30ms)                                        │    │
│  │     │  └─ process-payment (25ms)                                     │    │
│  │     └─ inventory-service (10ms)                                      │    │
│  │                                                                      │    │
│  │  • Request path visualization                                        │    │
│  │  • Latency breakdown                                                 │    │
│  │  • Dependency mapping                                                │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Structured Logging

Emit logs as structured JSON rather than free text, so that a log aggregator can parse, index, and query individual fields.
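
To make the benefit concrete, here is a minimal sketch that uses only Node built-ins (no winston). The helper names `formatLogLine` and `findErrorsForOrder`, and the fields `orderId` and `sku`, are hypothetical and chosen only for illustration. It shows that a JSON log line can be filtered by field instead of by pattern matching on text.

```javascript
// Sketch: JSON log lines can be queried by field.
// formatLogLine, findErrorsForOrder, and the field names are illustrative assumptions.
function formatLogLine(level, message, meta = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...meta
  });
}

const lines = [
  formatLogLine('info', 'Order created', { service: 'order-service', orderId: 'o-1' }),
  formatLogLine('error', 'Payment failed', { service: 'payment-service', orderId: 'o-1' }),
  formatLogLine('warn', 'Low stock alert', { service: 'inventory-service', sku: 'sku-9' })
];

// Query by field: every error-level entry that belongs to one order.
function findErrorsForOrder(logLines, orderId) {
  return logLines
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.level === 'error' && entry.orderId === orderId);
}

const failedForOrder = findErrorsForOrder(lines, 'o-1');
console.log(failedForOrder.map((entry) => entry.message));
```

A log aggregator applies the same idea at scale: because `level` and `orderId` are real fields, the query does not depend on the wording of the message.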

Logger Implementation

// observability/Logger.js
const winston = require('winston');

class Logger {
  constructor(options = {}) {
    this.serviceName = options.serviceName || process.env.SERVICE_NAME || 'unknown';
    this.environment = options.environment || process.env.NODE_ENV || 'development';
    
    this.logger = winston.createLogger({
      level: options.level || 'info',
      format: winston.format.combine(
        winston.format.timestamp(),
        winston.format.errors({ stack: true }),
        winston.format.json()
      ),
      defaultMeta: {
        service: this.serviceName,
        environment: this.environment,
        version: process.env.APP_VERSION || '1.0.0',
        hostname: require('os').hostname()
      },
      transports: this.createTransports(options)
    });
  }

  createTransports(options) {
    const transports = [];
    
    // Console for development
    if (this.environment === 'development') {
      transports.push(new winston.transports.Console({
        format: winston.format.combine(
          winston.format.colorize(),
          winston.format.simple()
        )
      }));
    } else {
      // JSON for production (shipped to log aggregator)
      transports.push(new winston.transports.Console());
    }
    
    // File transport for local debugging
    if (options.logFile) {
      transports.push(new winston.transports.File({
        filename: options.logFile,
        maxsize: 5242880, // 5MB
        maxFiles: 5
      }));
    }
    
    return transports;
  }

  // Add correlation context to all logs
  child(context) {
    return {
      info: (message, meta = {}) => this.info(message, { ...context, ...meta }),
      warn: (message, meta = {}) => this.warn(message, { ...context, ...meta }),
      error: (message, meta = {}) => this.error(message, { ...context, ...meta }),
      debug: (message, meta = {}) => this.debug(message, { ...context, ...meta })
    };
  }

  info(message, meta = {}) {
    this.logger.info(message, this.enrichMeta(meta));
  }

  warn(message, meta = {}) {
    this.logger.warn(message, this.enrichMeta(meta));
  }

  error(message, meta = {}) {
    this.logger.error(message, this.enrichMeta(meta));
  }

  debug(message, meta = {}) {
    this.logger.debug(message, this.enrichMeta(meta));
  }

  enrichMeta(meta) {
    return {
      ...meta,
      timestamp: new Date().toISOString()
    };
  }
}

// Singleton export
const logger = new Logger({
  serviceName: process.env.SERVICE_NAME,
  level: process.env.LOG_LEVEL || 'info'
});

module.exports = { Logger, logger };

Request Logging Middleware

// middleware/requestLogger.js
const { logger } = require('../observability/Logger');
const { v4: uuidv4 } = require('uuid');

function requestLogger(options = {}) {
  return (req, res, next) => {
    // Extract or generate correlation IDs
    const correlationId = req.headers['x-correlation-id'] || uuidv4();
    const traceId = req.headers['x-trace-id'] || correlationId;
    // Span IDs are 16 hex characters, so strip the UUID's hyphens before truncating
    const spanId = uuidv4().replace(/-/g, '').substring(0, 16);
    
    // Attach to request
    req.correlationId = correlationId;
    req.traceId = traceId;
    req.spanId = spanId;
    
    // Create child logger with context
    req.logger = logger.child({
      correlationId,
      traceId,
      spanId,
      method: req.method,
      path: req.path,
      userAgent: req.headers['user-agent']
    });

    // Log request start
    const startTime = Date.now();
    req.logger.info('Request started', {
      query: req.query,
      ip: req.ip
    });

    // Capture response
    const originalSend = res.send;
    res.send = function(body) {
      const duration = Date.now() - startTime;
      
      req.logger.info('Request completed', {
        statusCode: res.statusCode,
        duration,
        responseSize: body?.length || 0
      });
      
      return originalSend.call(this, body);
    };

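    // Echo the IDs back to the caller so they can be quoted when reporting a problem;
    // outgoing calls to downstream services carry them via the traced HTTP client
    res.setHeader('X-Correlation-ID', correlationId);
    res.setHeader('X-Trace-ID', traceId);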

    next();
  };
}

module.exports = { requestLogger };
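
If the IDs produced by the middleware should be compatible with W3C Trace Context, trace IDs need to be 32 lowercase hex characters and span IDs 16. The following is a minimal sketch using Node's built-in `crypto` module rather than `uuid`; the helper names `newTraceId`, `newSpanId`, and `resolveTraceId` are hypothetical.

```javascript
const crypto = require('node:crypto');

// 16 random bytes encode to 32 lowercase hex characters
function newTraceId() {
  return crypto.randomBytes(16).toString('hex');
}

// 8 random bytes encode to 16 lowercase hex characters
function newSpanId() {
  return crypto.randomBytes(8).toString('hex');
}

// Reuse an incoming trace ID only when it has the expected shape
// (32 hex characters, not all zeros); otherwise mint a fresh one.
function resolveTraceId(headerValue) {
  const candidate = headerValue || '';
  const wellFormed = /^[0-9a-f]{32}$/.test(candidate) && !/^0+$/.test(candidate);
  return wellFormed ? candidate : newTraceId();
}

console.log(resolveTraceId(undefined).length, newSpanId().length);
```

Validating the incoming header before reusing it keeps malformed or oversized client-supplied values out of the logs.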

Propagating Context to Downstream Services

// utils/httpClient.js
const axios = require('axios');

function createTracedClient(req) {
  return axios.create({
    headers: {
      'X-Correlation-ID': req.correlationId,
      'X-Trace-ID': req.traceId,
      'X-Parent-Span-ID': req.spanId
    }
  });
}

// Usage in route handler (assumes an Express `app` with JSON body parsing enabled)
app.post('/orders', async (req, res, next) => {
  const http = createTracedClient(req);
  
  try {
    req.logger.info('Creating order', { customerId: req.body.customerId });
    
    // The correlation headers configured in createTracedClient are sent with this call
    const payment = await http.post('http://payment-service/charge', {
      amount: req.body.total
    });
    
    req.logger.info('Payment processed', { paymentId: payment.data.id });
    
    res.json({ orderId: '123', paymentId: payment.data.id });
  } catch (error) {
    req.logger.error('Order creation failed', {
      error: error.message,
      stack: error.stack
    });
    // Express 4 does not catch errors thrown from async handlers, so pass
    // the error to the error-handling middleware instead of rethrowing
    next(error);
  }
});

Distributed Tracing with OpenTelemetry

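OpenTelemetry provides vendor-neutral APIs and SDKs for producing traces, metrics, and logs, so the instrumentation in your services does not have to change when the backend does.
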
┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED TRACE VISUALIZATION                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Trace ID: abc123-def456-ghi789                                              │
│  ──────────────────────────────────────────────────────────────────         │
│                                                                              │
│  0ms      50ms     100ms    150ms    200ms    250ms    300ms               │
│  ├────────┴────────┴────────┴────────┴────────┴────────┤                   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────┐              │
│  │ api-gateway: POST /orders                                 │ 280ms       │
│  └┬─────────────────────────────────────────────────────────┘              │
│   │                                                                         │
│   │ ┌────────────────────────────────────────────────┐                     │
│   ├▶│ order-service: createOrder                      │ 220ms              │
│   │ └┬───────────────────────────────────────────────┘                     │
│   │  │                                                                      │
│   │  │ ┌─────────────────────┐                                             │
│   │  ├▶│ validate-order      │ 15ms                                        │
│   │  │ └─────────────────────┘                                             │
│   │  │                                                                      │
│   │  │ ┌─────────────────────────────────────┐                             │
│   │  ├▶│ payment-service: processPayment     │ 120ms                       │
│   │  │ └┬────────────────────────────────────┘                             │
│   │  │  │                                                                   │
│   │  │  │ ┌────────────────────────┐                                       │
│   │  │  └▶│ stripe-api: charge     │ 95ms                                  │
│   │  │    └────────────────────────┘                                       │
│   │  │                                                                      │
│   │  │ ┌──────────────────────────────┐                                    │
│   │  └▶│ inventory-service: reserve   │ 45ms                               │
│   │    └┬─────────────────────────────┘                                    │
│   │     │                                                                   │
│   │     │ ┌─────────────────────┐                                          │
│   │     └▶│ database: UPDATE    │ 20ms                                     │
│   │       └─────────────────────┘                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

OpenTelemetry Setup

// observability/tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

function initTracing(serviceName) {
  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development'
  });

  const traceExporter = new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317'
  });

  const metricExporter = new OTLPMetricExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317'
  });

  const sdk = new NodeSDK({
    resource,
    traceExporter,
    metricReader: new PeriodicExportingMetricReader({
      exporter: metricExporter,
      exportIntervalMillis: 30000
    }),
    instrumentations: [
      getNodeAutoInstrumentations({
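        '@opentelemetry/instrumentation-http': {
          // Skip health checks and Prometheus scrapes. Recent releases dropped the
          // older ignoreIncomingPaths option in favor of this hook.
          ignoreIncomingRequestHook: (req) =>
            ['/health', '/metrics'].includes((req.url || '').split('?')[0])
        },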
        '@opentelemetry/instrumentation-express': {},
        '@opentelemetry/instrumentation-mongodb': {},
        '@opentelemetry/instrumentation-redis': {},
        '@opentelemetry/instrumentation-pg': {},
        '@opentelemetry/instrumentation-grpc': {}
      })
    ]
  });

  sdk.start();
  console.log('OpenTelemetry tracing initialized');

  // Graceful shutdown
  process.on('SIGTERM', () => {
    sdk.shutdown()
      .then(() => console.log('Tracing terminated'))
      .catch((error) => console.error('Error terminating tracing', error))
      .finally(() => process.exit(0));
  });

  return sdk;
}

module.exports = { initTracing };

Manual Instrumentation

// observability/tracer.js
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');

class Tracer {
  constructor(serviceName) {
    this.tracer = trace.getTracer(serviceName);
  }

  // Create a new span for an operation
  async trace(name, fn, options = {}) {
    return this.tracer.startActiveSpan(name, options, async (span) => {
      try {
        const result = await fn(span);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error.message
        });
        span.recordException(error);
        throw error;
      } finally {
        span.end();
      }
    });
  }

  // Add attributes to current span
  addAttribute(key, value) {
    const span = trace.getActiveSpan();
    if (span) {
      span.setAttribute(key, value);
    }
  }

  // Add event to current span
  addEvent(name, attributes = {}) {
    const span = trace.getActiveSpan();
    if (span) {
      span.addEvent(name, attributes);
    }
  }

  // Get current trace context for propagation
  getContext() {
    const span = trace.getActiveSpan();
    if (span) {
      const spanContext = span.spanContext();
      return {
        traceId: spanContext.traceId,
        spanId: spanContext.spanId,
        traceFlags: spanContext.traceFlags
      };
    }
    return null;
  }
}

// Usage
const tracer = new Tracer('order-service');

async function createOrder(orderData) {
  return tracer.trace('createOrder', async (span) => {
    span.setAttribute('order.customerId', orderData.customerId);
    span.setAttribute('order.itemCount', orderData.items.length);

    // Validate order
    await tracer.trace('validateOrder', async () => {
      // validation logic
    });

    // Process payment
    const payment = await tracer.trace('processPayment', async (paymentSpan) => {
      paymentSpan.setAttribute('payment.amount', orderData.total);
      return paymentService.charge(orderData.total);
    });

    span.addEvent('payment_completed', {
      paymentId: payment.id,
      amount: payment.amount
    });

    // Reserve inventory
    await tracer.trace('reserveInventory', async () => {
      return inventoryService.reserve(orderData.items);
    });

    return { orderId: '123', paymentId: payment.id };
  });
}

module.exports = { Tracer, tracer };

Context Propagation

// observability/contextPropagation.js
const { context, propagation } = require('@opentelemetry/api');

// Inject context into outgoing request headers
function injectContext(headers = {}) {
  propagation.inject(context.active(), headers);
  return headers;
}

// Extract context from incoming request headers
function extractContext(headers) {
  return propagation.extract(context.active(), headers);
}

// HTTP client with automatic context propagation
const axios = require('axios');

function createTracedAxios() {
  const instance = axios.create();

  instance.interceptors.request.use((config) => {
    const headers = injectContext(config.headers);
    config.headers = headers;
    return config;
  });

  return instance;
}

// Usage
const http = createTracedAxios();

async function chargePayment() {
  // The interceptor injects the active trace context into the request headers
  const response = await http.post('http://payment-service/charge', {
    amount: 100
  });
  return response.data;
}
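
When the W3C Trace Context propagator is registered (the default for the Node SDK), the injected context travels in a `traceparent` header. The sketch below illustrates that header's layout using only plain JavaScript. It is not OpenTelemetry's implementation, the helper names `formatTraceparent` and `parseTraceparent` are hypothetical, and it accepts only the version `00` layout.

```javascript
// traceparent layout: version "-" trace-id "-" parent-id "-" trace-flags
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent({ traceId, spanId, sampled }) {
  const flags = sampled ? '01' : '00';
  return `00-${traceId}-${spanId}-${flags}`;
}

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header || '');
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const header = formatTraceparent({
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true
});
console.log(header);
console.log(parseTraceparent(header));
```

The receiving service keeps the trace ID, treats the incoming span ID as the parent of its own spans, and writes its own span ID into the header on any further outgoing calls.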

Metrics with Prometheus

Metrics Implementation

// observability/metrics.js
const client = require('prom-client');

class Metrics {
  constructor(options = {}) {
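    this.register = client.register;

    // Attach the service name to every metric from this process so that
    // queries grouping "by (service)" have a label to group on
    this.register.setDefaultLabels({ service: options.serviceName || 'unknown' });

    // Enable default metrics collection (event loop lag, heap usage, GC, ...)
    client.collectDefaultMetrics({
      prefix: options.prefix || ''
    });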

    // Custom metrics
    this.httpRequestsTotal = new client.Counter({
      name: 'http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'path', 'status']
    });

    this.httpRequestDuration = new client.Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request latency in seconds',
      labelNames: ['method', 'path', 'status'],
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
    });

    this.activeConnections = new client.Gauge({
      name: 'active_connections',
      help: 'Number of active connections',
      labelNames: ['type']
    });

    this.businessMetrics = {
      ordersCreated: new client.Counter({
        name: 'orders_created_total',
        help: 'Total number of orders created',
        labelNames: ['status']
      }),
      
      orderValue: new client.Histogram({
        name: 'order_value_dollars',
        help: 'Order value in dollars',
        buckets: [10, 25, 50, 100, 250, 500, 1000]
      }),

      paymentProcessingTime: new client.Histogram({
        name: 'payment_processing_seconds',
        help: 'Payment processing time in seconds',
        labelNames: ['provider', 'status'],
        buckets: [0.1, 0.5, 1, 2, 5, 10]
      })
    };
  }

  // Middleware for HTTP metrics
  middleware() {
    return (req, res, next) => {
      const start = process.hrtime();

      res.on('finish', () => {
        const [seconds, nanoseconds] = process.hrtime(start);
        const duration = seconds + nanoseconds / 1e9;
        
        const labels = {
          method: req.method,
          path: this.normalizePath(req.route?.path || req.path),
          status: res.statusCode
        };

        this.httpRequestsTotal.inc(labels);
        this.httpRequestDuration.observe(labels, duration);
      });

      next();
    };
  }

  // Normalize path to prevent cardinality explosion
  normalizePath(path) {
    return path
      .replace(/\/[0-9a-f]{24}/g, '/:id')  // MongoDB IDs
      .replace(/\/\d+/g, '/:id');           // Numeric IDs
  }

  // Get metrics for Prometheus scraping
  async getMetrics() {
    return this.register.metrics();
  }

  // Record business metric
  recordOrder(status, value) {
    this.businessMetrics.ordersCreated.inc({ status });
    this.businessMetrics.orderValue.observe(value);
  }

  recordPayment(provider, status, duration) {
    this.businessMetrics.paymentProcessingTime.observe(
      { provider, status },
      duration
    );
  }
}

module.exports = { Metrics };
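
The effect of path normalization on label cardinality is easy to check in isolation. The sketch below copies the `normalizePath` logic from the class above into a standalone function; the sample paths are made up for illustration.

```javascript
// Standalone copy of the normalizePath logic from the Metrics class
function normalizePath(path) {
  return path
    .replace(/\/[0-9a-f]{24}/g, '/:id')  // MongoDB IDs
    .replace(/\/\d+/g, '/:id');           // Numeric IDs
}

// Hypothetical raw request paths
const rawPaths = [
  '/orders/1',
  '/orders/2',
  '/orders/507f1f77bcf86cd799439011',
  '/orders/42/items/7'
];

// Four raw paths collapse into two label values
const distinctLabels = new Set(rawPaths.map(normalizePath));
console.log([...distinctLabels]);
```

Without normalization, every order ID would create a new time series for each metric that carries the `path` label.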

Metrics Endpoint

// routes/metrics.js
const express = require('express');
const { Metrics } = require('../observability/metrics');

const router = express.Router();
const metrics = new Metrics({ serviceName: 'order-service' });

// Prometheus scrape endpoint
router.get('/metrics', async (req, res) => {
  res.set('Content-Type', metrics.register.contentType);
  res.send(await metrics.getMetrics());
});

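// The HTTP metrics middleware must be registered on the Express app itself,
// before the routes it should measure, for example in the entry point:
//   app.use(metrics.middleware());
//   app.use(router);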

module.exports = { router, metrics };

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'microservices'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  # For non-Kubernetes environments
  - job_name: 'services'
    static_configs:
      - targets: 
          - 'order-service:3000'
          - 'payment-service:3000'
          - 'inventory-service:3000'
    metrics_path: '/metrics'

Alert Rules

# prometheus/alerts.yml
groups:
  - name: microservices
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
          > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s for {{ $labels.service }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state{state="open"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for {{ $labels.target_service }}"

Grafana Dashboards

Docker Compose for Observability Stack

# docker-compose.observability.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # Collector HTTP
      - "14250:14250"  # Collector gRPC

  otel-collector:
    image: otel/opentelemetry-collector:latest
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./otel/otel-collector-config.yaml:/etc/otelcol/config.yaml

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail:/etc/promtail

volumes:
  prometheus-data:
  grafana-data:
  loki-data:

OpenTelemetry Collector Configuration

# otel/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

exporters:
  # Jaeger accepts OTLP natively; the dedicated `jaeger` exporter has been
  # removed from recent Collector releases
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: microservices

  # Replaces the removed `logging` exporter
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Sample Grafana Dashboard JSON

{
  "title": "Microservices Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Active Instances",
      "type": "stat",
      "targets": [
        {
          "expr": "count(up{job=\"microservices\"} == 1) by (service)"
        }
      ]
    }
  ]
}

RED Method for Microservices

The RED Method focuses on three key metrics:
┌─────────────────────────────────────────────────────────────────────────────┐
│                          RED METHOD                                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  R - RATE                                                                    │
│  ─────────                                                                   │
│  How many requests per second?                                               │
│                                                                              │
│  Metric: sum(rate(http_requests_total[5m])) by (service)                    │
│                                                                              │
│                                                                              │
│  E - ERRORS                                                                  │
│  ──────────                                                                  │
│  How many requests are failing?                                              │
│                                                                              │
│  Metric: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)     │
│          / sum(rate(http_requests_total[5m])) by (service)                  │
│                                                                              │
│                                                                              │
│  D - DURATION                                                                │
│  ───────────                                                                 │
│  How long do requests take?                                                  │
│                                                                              │
│  Metric: histogram_quantile(0.99,                                           │
│            sum(rate(http_request_duration_seconds_bucket[5m]))              │
│            by (le, service))                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
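
The Duration query relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. The sketch below illustrates that idea on plain counts. It is not Prometheus' code, it assumes `q` is in the range (0, 1], and the function name and bucket values are made up.

```javascript
// buckets: cumulative counts sorted by upper bound (le); the last entry is +Inf
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = buckets.findIndex((b) => b.count >= rank);
  const bucket = buckets[idx];

  // Rank falls in the +Inf bucket: report the highest finite upper bound
  if (bucket.le === Infinity) return buckets[buckets.length - 2].le;

  // The first bucket is assumed to start at 0
  const lowerBound = idx === 0 ? 0 : buckets[idx - 1].le;
  const countBelow = idx === 0 ? 0 : buckets[idx - 1].count;
  const countInBucket = bucket.count - countBelow;

  // Linear interpolation inside the bucket
  return lowerBound + (bucket.le - lowerBound) * ((rank - countBelow) / countInBucket);
}

// Hypothetical request durations in seconds, as cumulative counts
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.25, count: 90 },
  { le: 0.5, count: 98 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 }
];

const p99 = estimateQuantile(0.99, buckets); // falls between 0.5s and 1s
console.log(p99);
```

Because the result is interpolated, its accuracy depends on the bucket boundaries: a p99 that lands in a wide bucket is only known to within that bucket's width.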

Log Aggregation with Loki

// observability/lokiTransport.js
const Transport = require('winston-transport');
const axios = require('axios');

class LokiTransport extends Transport {
  constructor(opts) {
    super(opts);
    this.lokiUrl = opts.lokiUrl || 'http://localhost:3100';
    this.labels = opts.labels || {};
    this.batchSize = opts.batchSize || 100;
    this.batchInterval = opts.batchInterval || 1000;
    this.batch = [];
    
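    // unref() so the flush timer does not keep the process alive on shutdown
    this.timer = setInterval(() => this.flush(), this.batchInterval);
    this.timer.unref();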
  }

  log(info, callback) {
    const logEntry = {
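      // Loki expects a nanosecond timestamp string. Append zeros instead of multiplying:
      // Date.now() * 1e6 exceeds Number.MAX_SAFE_INTEGER and can lose precision
      ts: `${Date.now()}000000`,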
      line: JSON.stringify(info)
    };

    this.batch.push(logEntry);

    if (this.batch.length >= this.batchSize) {
      this.flush();
    }

    callback();
  }

  async flush() {
    if (this.batch.length === 0) return;

    const entries = this.batch;
    this.batch = [];

    const payload = {
      streams: [
        {
          stream: this.labels,
          values: entries.map(e => [String(e.ts), e.line])
        }
      ]
    };

    try {
      await axios.post(`${this.lokiUrl}/loki/api/v1/push`, payload, {
        headers: { 'Content-Type': 'application/json' }
      });
    } catch (error) {
      console.error('Failed to send logs to Loki:', error.message);
      // Re-add to batch for retry
      this.batch = entries.concat(this.batch);
    }
  }
}

module.exports = { LokiTransport };
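
The push payload built in `flush()` can be examined without a running Loki instance. The sketch below constructs the same `streams` structure with only plain JavaScript and uses `BigInt` to produce nanosecond timestamps without precision loss. The helper names, labels, and log records are hypothetical.

```javascript
// Convert integer epoch milliseconds to a nanosecond string without precision loss
function toNanosecondString(epochMs) {
  return (BigInt(epochMs) * 1000000n).toString();
}

// Build the body for POST /loki/api/v1/push:
// one stream (a label set) with [timestamp, line] pairs
function buildLokiPayload(labels, entries) {
  return {
    streams: [
      {
        stream: labels,
        values: entries.map((entry) => [
          toNanosecondString(entry.timestampMs),
          JSON.stringify(entry.record)
        ])
      }
    ]
  };
}

const payload = buildLokiPayload(
  { service: 'order-service', environment: 'development' },
  [
    { timestampMs: 1705314600123, record: { level: 'info', message: 'Order created' } },
    { timestampMs: 1705314601456, record: { level: 'error', message: 'Payment failed' } }
  ]
);
console.log(JSON.stringify(payload, null, 2));
```

Keeping the label set small (service, environment) and putting request-specific fields inside the JSON line avoids creating a separate Loki stream per request.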

Interview Questions

What are the three pillars of observability?
Answer:
  1. Logs: Discrete events with rich context
    • What happened at specific moments
    • Good for debugging
    • Structured JSON format preferred
  2. Metrics: Numeric time-series data
    • System performance indicators
    • Good for alerting and dashboards
    • Examples: request rate, error rate, latency
  3. Traces: Request path through system
    • How requests flow across services
    • Latency breakdown per component
    • Good for debugging distributed issues
Together: Full visibility into distributed systems

How does distributed tracing work?
Answer:
Components:
  • Trace: Complete request journey
  • Span: Single operation within trace
  • Context: Trace ID + Span ID propagated across services
How it works:
  1. First service creates trace ID
  2. Each operation creates a span with parent reference
  3. Context propagated via headers
  4. All spans collected and visualized
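Key headers:
  • traceparent: W3C Trace Context standard (version, trace ID, parent span ID, flags)
  • X-Trace-ID: Custom trace identifier header (non-standard)
  • X-Span-ID: Custom header for the current span (non-standard)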
Tools: Jaeger, Zipkin, OpenTelemetry
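
The collection step described above (spans with parent references assembled into one trace) can be sketched in a few lines. This is an illustration only, not how Jaeger or any other backend stores traces. The function names are hypothetical, and the span IDs are made up for a trace shaped like the one in the Three Pillars diagram.

```javascript
// Rebuild a trace tree from a flat list of spans that reference their parents
function buildTraceTree(spans) {
  const nodesById = new Map(spans.map((span) => [span.spanId, { ...span, children: [] }]));
  const roots = [];
  for (const node of nodesById.values()) {
    const parent = node.parentSpanId ? nodesById.get(node.parentSpanId) : undefined;
    if (parent) parent.children.push(node);
    else roots.push(node); // no known parent: this span starts the trace
  }
  return roots;
}

// Render the tree as indented lines, one per span
function renderTree(nodes, depth = 0) {
  return nodes.flatMap((node) => [
    `${'  '.repeat(depth)}${node.name} (${node.durationMs}ms)`,
    ...renderTree(node.children, depth + 1)
  ]);
}

// Spans as a backend might receive them: flat, each pointing at its parent
const spans = [
  { spanId: 'a1', parentSpanId: null, name: 'order-service', durationMs: 50 },
  { spanId: 'b1', parentSpanId: 'a1', name: 'validate-order', durationMs: 5 },
  { spanId: 'b2', parentSpanId: 'a1', name: 'payment-service', durationMs: 30 },
  { spanId: 'c1', parentSpanId: 'b2', name: 'process-payment', durationMs: 25 },
  { spanId: 'b3', parentSpanId: 'a1', name: 'inventory-service', durationMs: 10 }
];

const rendered = renderTree(buildTraceTree(spans));
console.log(rendered.join('\n'));
```

A span whose parent was never received ends up as an extra root here, which is also how missing instrumentation tends to show up in trace views: as a broken or split trace.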

What is the RED method?
Answer: RED = Rate, Errors, Duration
  1. Rate: Requests per second
    • Indicates load on service
    • Alert on unusual spikes/drops
  2. Errors: Failed requests ratio
    • Track 4xx and 5xx separately
    • Alert when error rate exceeds threshold
  3. Duration: Request latency (p50, p95, p99)
    • User experience indicator
    • Alert on latency degradation
Why RED:
  • Simple, focused metrics
  • Covers most service health scenarios
  • Easy to implement and understand
Complementary: USE method for infrastructure (Utilization, Saturation, Errors)

Summary

Key Takeaways

  • Three pillars: Logs, Metrics, Traces
  • Use structured logging with correlation IDs
  • OpenTelemetry for vendor-neutral instrumentation
  • Prometheus for metrics, Grafana for visualization
  • RED method for service health

Next Steps

In the next chapter, we’ll cover Security - authentication, authorization, and secure communication.