Observability

In distributed systems, understanding what’s happening across services is crucial. Observability gives you visibility into your system through three pillars: Logs, Metrics, and Traces.

Learning Objectives:

Understand the three pillars of observability
Implement structured logging
Set up distributed tracing with OpenTelemetry
Configure metrics with Prometheus
Build dashboards with Grafana

Three Pillars of Observability

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THREE PILLARS OF OBSERVABILITY                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                              LOGS                                    │    │
│  │                                                                      │    │
│  │  What happened?                                                      │    │
│  │                                                                      │    │
│  │  [2024-01-15T10:30:00Z] INFO  order-service: Order created          │    │
│  │  [2024-01-15T10:30:01Z] ERROR payment-service: Payment failed       │    │
│  │  [2024-01-15T10:30:02Z] WARN  inventory-service: Low stock alert   │    │
│  │                                                                      │    │
│  │  • Discrete events                                                   │    │
│  │  • Rich context                                                      │    │
│  │  • Good for debugging                                                │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                             METRICS                                  │    │
│  │                                                                      │    │
│  │  How is the system performing?                                       │    │
│  │                                                                      │    │
│  │  request_count{service="order",status="200"} = 1523                 │    │
│  │  request_duration_seconds{service="order",quantile="0.99"} = 0.245  │    │
│  │  error_rate{service="payment"} = 0.02                               │    │
│  │                                                                      │    │
│  │  • Numeric time-series data                                          │    │
│  │  • Aggregated values                                                 │    │
│  │  • Good for alerting & dashboards                                    │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                             TRACES                                   │    │
│  │                                                                      │    │
│  │  How do requests flow through the system?                            │    │
│  │                                                                      │    │
│  │  [trace-id: abc123]                                                  │    │
│  │  └─ order-service (50ms)                                             │    │
│  │     ├─ validate-order (5ms)                                          │    │
│  │     ├─ payment-service (30ms)                                        │    │
│  │     │  └─ process-payment (25ms)                                     │    │
│  │     └─ inventory-service (10ms)                                      │    │
│  │                                                                      │    │
│  │  • Request path visualization                                        │    │
│  │  • Latency breakdown                                                 │    │
│  │  • Dependency mapping                                                │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Structured Logging

Logs should be structured (JSON) for easy parsing and querying.
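
To illustrate why structure helps, here is a minimal sketch that uses only built-in JavaScript and no logging library. The helper name `formatLogEntry` and the field names are assumptions made for this example.

```javascript
// Hypothetical helper: builds one JSON log line (one object per line).
function formatLogEntry(level, message, meta = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...meta
  });
}

// Three example log lines, as they might appear in a log stream
const lines = [
  formatLogEntry('info', 'Order created', { service: 'order-service', orderId: 'o-1' }),
  formatLogEntry('error', 'Payment failed', { service: 'payment-service', orderId: 'o-1' }),
  formatLogEntry('warn', 'Low stock alert', { service: 'inventory-service', sku: 'sku-9' })
];

// Because every line is JSON, querying is a parse plus a filter,
// with no regular expressions over free-form text.
const errorsForOrder = lines
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.level === 'error' && entry.orderId === 'o-1');

console.log(errorsForOrder.map((entry) => entry.message)); // [ 'Payment failed' ]
```

A log aggregator applies the same idea at scale: fields such as `level` and `orderId` become filterable keys instead of substrings.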

Logger Implementation

// observability/Logger.js
const winston = require('winston');

class Logger {
  constructor(options = {}) {
    this.serviceName = options.serviceName || process.env.SERVICE_NAME || 'unknown';
    this.environment = options.environment || process.env.NODE_ENV || 'development';
    
    this.logger = winston.createLogger({
      level: options.level || 'info',
      format: winston.format.combine(
        winston.format.timestamp(),
        winston.format.errors({ stack: true }),
        winston.format.json()
      ),
      defaultMeta: {
        service: this.serviceName,
        environment: this.environment,
        version: process.env.APP_VERSION || '1.0.0',
        hostname: require('os').hostname()
      },
      transports: this.createTransports(options)
    });
  }

  createTransports(options) {
    const transports = [];
    
    // Console for development
    if (this.environment === 'development') {
      transports.push(new winston.transports.Console({
        format: winston.format.combine(
          winston.format.colorize(),
          winston.format.simple()
        )
      }));
    } else {
      // JSON for production (shipped to log aggregator)
      transports.push(new winston.transports.Console());
    }
    
    // File transport for local debugging
    if (options.logFile) {
      transports.push(new winston.transports.File({
        filename: options.logFile,
        maxsize: 5242880, // 5MB
        maxFiles: 5
      }));
    }
    
    return transports;
  }

  // Add correlation context to all logs
  child(context) {
    return {
      info: (message, meta = {}) => this.info(message, { ...context, ...meta }),
      warn: (message, meta = {}) => this.warn(message, { ...context, ...meta }),
      error: (message, meta = {}) => this.error(message, { ...context, ...meta }),
      debug: (message, meta = {}) => this.debug(message, { ...context, ...meta })
    };
  }

  info(message, meta = {}) {
    this.logger.info(message, this.enrichMeta(meta));
  }

  warn(message, meta = {}) {
    this.logger.warn(message, this.enrichMeta(meta));
  }

  error(message, meta = {}) {
    this.logger.error(message, this.enrichMeta(meta));
  }

  debug(message, meta = {}) {
    this.logger.debug(message, this.enrichMeta(meta));
  }

  enrichMeta(meta) {
    return {
      ...meta,
      timestamp: new Date().toISOString()
    };
  }
}

// Singleton export
const logger = new Logger({
  serviceName: process.env.SERVICE_NAME,
  level: process.env.LOG_LEVEL || 'info'
});

module.exports = { Logger, logger };

Request Logging Middleware

// middleware/requestLogger.js
const { logger } = require('../observability/Logger');
const { v4: uuidv4 } = require('uuid');

function requestLogger(options = {}) {
  return (req, res, next) => {
    // Extract or generate correlation IDs
    const correlationId = req.headers['x-correlation-id'] || uuidv4();
    const traceId = req.headers['x-trace-id'] || correlationId;
    // Strip hyphens first so the span ID is 16 hex characters
    const spanId = uuidv4().replace(/-/g, '').substring(0, 16);
    
    // Attach to request
    req.correlationId = correlationId;
    req.traceId = traceId;
    req.spanId = spanId;
    
    // Create child logger with context
    req.logger = logger.child({
      correlationId,
      traceId,
      spanId,
      method: req.method,
      path: req.path,
      userAgent: req.headers['user-agent']
    });

    // Log request start
    const startTime = Date.now();
    req.logger.info('Request started', {
      query: req.query,
      ip: req.ip
    });

    // Log once when the response has been fully sent.
    // (Wrapping res.send can log twice: res.send(obj) delegates to res.json,
    // which calls res.send again.)
    res.on('finish', () => {
      const duration = Date.now() - startTime;

      req.logger.info('Request completed', {
        statusCode: res.statusCode,
        duration,
        responseSize: Number(res.getHeader('content-length')) || 0
      });
    });

    // Echo the IDs back to the caller in the response headers
    res.setHeader('X-Correlation-ID', correlationId);
    res.setHeader('X-Trace-ID', traceId);

    next();
  };
}

module.exports = { requestLogger };
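
The middleware derives its IDs from UUIDs. If you would rather generate IDs that match the sizes used by W3C Trace Context (32 hex characters for a trace ID, 16 for a span ID), one possible sketch uses Node's built-in `crypto` module. The function names below are hypothetical, and the `x-trace-id` header is the custom header from the middleware, not a standard one.

```javascript
const { randomBytes } = require('crypto');

// W3C Trace Context sizes: trace ID = 16 bytes (32 hex chars),
// span ID = 8 bytes (16 hex chars).
function generateTraceId() {
  return randomBytes(16).toString('hex');
}

function generateSpanId() {
  return randomBytes(8).toString('hex');
}

// Hypothetical helper: reuse a well-formed incoming trace ID, otherwise create one.
function resolveTraceId(headers) {
  const incoming = headers['x-trace-id'];
  return /^[0-9a-f]{32}$/.test(incoming || '') ? incoming : generateTraceId();
}

console.log(generateTraceId().length, generateSpanId().length); // 32 16
```

Validating the incoming value keeps malformed or oversized header content out of your logs.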

Propagating Context to Downstream Services

// utils/httpClient.js
const axios = require('axios');

function createTracedClient(req) {
  return axios.create({
    headers: {
      'X-Correlation-ID': req.correlationId,
      'X-Trace-ID': req.traceId,
      'X-Parent-Span-ID': req.spanId
    }
  });
}

// Usage in route handler
app.post('/orders', async (req, res, next) => {
  const http = createTracedClient(req);
  
  try {
    req.logger.info('Creating order', { customerId: req.body.customerId });
    
    // The client sends the correlation and trace headers with every request
    const payment = await http.post('http://payment-service/charge', {
      amount: req.body.total
    });
    
    req.logger.info('Payment processed', { paymentId: payment.data.id });
    
    res.json({ orderId: '123', paymentId: payment.data.id });
  } catch (error) {
    req.logger.error('Order creation failed', {
      error: error.message,
      stack: error.stack
    });
    // Express 4 does not catch rejections from async handlers; forward the error
    next(error);
  }
});

Distributed Tracing with OpenTelemetry

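OpenTelemetry provides vendor-neutral APIs and SDKs for producing traces, metrics, and logs, so your instrumentation is not tied to a specific backend such as Jaeger or Zipkin.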

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED TRACE VISUALIZATION                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Trace ID: abc123-def456-ghi789                                              │
│  ──────────────────────────────────────────────────────────────────         │
│                                                                              │
│  0ms      50ms     100ms    150ms    200ms    250ms    300ms               │
│  ├────────┴────────┴────────┴────────┴────────┴────────┤                   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────┐              │
│  │ api-gateway: POST /orders                                 │ 280ms       │
│  └┬─────────────────────────────────────────────────────────┘              │
│   │                                                                         │
│   │ ┌────────────────────────────────────────────────┐                     │
│   ├▶│ order-service: createOrder                      │ 220ms              │
│   │ └┬───────────────────────────────────────────────┘                     │
│   │  │                                                                      │
│   │  │ ┌─────────────────────┐                                             │
│   │  ├▶│ validate-order      │ 15ms                                        │
│   │  │ └─────────────────────┘                                             │
│   │  │                                                                      │
│   │  │ ┌─────────────────────────────────────┐                             │
│   │  ├▶│ payment-service: processPayment     │ 120ms                       │
│   │  │ └┬────────────────────────────────────┘                             │
│   │  │  │                                                                   │
│   │  │  │ ┌────────────────────────┐                                       │
│   │  │  └▶│ stripe-api: charge     │ 95ms                                  │
│   │  │    └────────────────────────┘                                       │
│   │  │                                                                      │
│   │  │ ┌──────────────────────────────┐                                    │
│   │  └▶│ inventory-service: reserve   │ 45ms                               │
│   │    └┬─────────────────────────────┘                                    │
│   │     │                                                                   │
│   │     │ ┌─────────────────────┐                                          │
│   │     └▶│ database: UPDATE    │ 20ms                                     │
│   │       └─────────────────────┘                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
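
A trace backend stores spans as flat records with parent references and derives views like the one above from them. As a rough sketch of the latency breakdown, the code below computes each span's "self time" (its duration minus its direct children). The span IDs and the `selfTimes` helper are assumptions for illustration; the names and durations come from the diagram.

```javascript
// Spans from the diagram above, flattened with parent references.
const spans = [
  { id: 'gateway', parentId: null, name: 'api-gateway: POST /orders', durationMs: 280 },
  { id: 'order', parentId: 'gateway', name: 'order-service: createOrder', durationMs: 220 },
  { id: 'validate', parentId: 'order', name: 'validate-order', durationMs: 15 },
  { id: 'payment', parentId: 'order', name: 'payment-service: processPayment', durationMs: 120 },
  { id: 'stripe', parentId: 'payment', name: 'stripe-api: charge', durationMs: 95 },
  { id: 'inventory', parentId: 'order', name: 'inventory-service: reserve', durationMs: 45 },
  { id: 'db', parentId: 'inventory', name: 'database: UPDATE', durationMs: 20 }
];

// Self time = span duration minus the durations of its direct children.
// Assumes children run sequentially (no overlap), as drawn in the diagram.
function selfTimes(allSpans) {
  const childTotal = new Map();
  for (const span of allSpans) {
    if (span.parentId !== null) {
      childTotal.set(span.parentId, (childTotal.get(span.parentId) || 0) + span.durationMs);
    }
  }
  return allSpans.map((span) => ({
    name: span.name,
    selfMs: span.durationMs - (childTotal.get(span.id) || 0)
  }));
}

const breakdown = selfTimes(spans).sort((a, b) => b.selfMs - a.selfMs);
console.log(breakdown[0]); // { name: 'stripe-api: charge', selfMs: 95 }
```

In this example the external Stripe call accounts for 95 of the 280 ms, which is the kind of conclusion a trace view makes visible at a glance.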

OpenTelemetry Setup

// observability/tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

function initTracing(serviceName) {
  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development'
  });

  const traceExporter = new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317'
  });

  const metricExporter = new OTLPMetricExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317'
  });

  const sdk = new NodeSDK({
    resource,
    traceExporter,
    metricReader: new PeriodicExportingMetricReader({
      exporter: metricExporter,
      exportIntervalMillis: 30000
    }),
    instrumentations: [
      getNodeAutoInstrumentations({
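        '@opentelemetry/instrumentation-http': {
          // Skip health checks and metrics scrapes.
          // (ignoreIncomingPaths was removed in newer versions of this instrumentation.)
          ignoreIncomingRequestHook: (req) => ['/health', '/metrics'].includes(req.url)
        },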
        '@opentelemetry/instrumentation-express': {},
        '@opentelemetry/instrumentation-mongodb': {},
        '@opentelemetry/instrumentation-redis': {},
        '@opentelemetry/instrumentation-pg': {},
        '@opentelemetry/instrumentation-grpc': {}
      })
    ]
  });

  sdk.start();
  console.log('OpenTelemetry tracing initialized');

  // Graceful shutdown
  process.on('SIGTERM', () => {
    sdk.shutdown()
      .then(() => console.log('Tracing terminated'))
      .catch((error) => console.error('Error terminating tracing', error))
      .finally(() => process.exit(0));
  });

  return sdk;
}

module.exports = { initTracing };

Manual Instrumentation

// observability/tracer.js
const { trace, SpanStatusCode } = require('@opentelemetry/api');

class Tracer {
  constructor(serviceName) {
    this.tracer = trace.getTracer(serviceName);
  }

  // Create a new span for an operation
  async trace(name, fn, options = {}) {
    return this.tracer.startActiveSpan(name, options, async (span) => {
      try {
        const result = await fn(span);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error.message
        });
        span.recordException(error);
        throw error;
      } finally {
        span.end();
      }
    });
  }

  // Add attributes to current span
  addAttribute(key, value) {
    const span = trace.getActiveSpan();
    if (span) {
      span.setAttribute(key, value);
    }
  }

  // Add event to current span
  addEvent(name, attributes = {}) {
    const span = trace.getActiveSpan();
    if (span) {
      span.addEvent(name, attributes);
    }
  }

  // Get current trace context for propagation
  getContext() {
    const span = trace.getActiveSpan();
    if (span) {
      const spanContext = span.spanContext();
      return {
        traceId: spanContext.traceId,
        spanId: spanContext.spanId,
        traceFlags: spanContext.traceFlags
      };
    }
    return null;
  }
}

// Usage
const tracer = new Tracer('order-service');

async function createOrder(orderData) {
  return tracer.trace('createOrder', async (span) => {
    span.setAttribute('order.customerId', orderData.customerId);
    span.setAttribute('order.itemCount', orderData.items.length);

    // Validate order
    await tracer.trace('validateOrder', async () => {
      // validation logic
    });

    // Process payment
    const payment = await tracer.trace('processPayment', async (paymentSpan) => {
      paymentSpan.setAttribute('payment.amount', orderData.total);
      return paymentService.charge(orderData.total);
    });

    span.addEvent('payment_completed', {
      paymentId: payment.id,
      amount: payment.amount
    });

    // Reserve inventory
    await tracer.trace('reserveInventory', async () => {
      return inventoryService.reserve(orderData.items);
    });

    return { orderId: '123', paymentId: payment.id };
  });
}

module.exports = { Tracer, tracer };

Context Propagation

// observability/contextPropagation.js
const { context, propagation } = require('@opentelemetry/api');

// Inject context into outgoing request headers
function injectContext(headers = {}) {
  propagation.inject(context.active(), headers);
  return headers;
}

// Extract context from incoming request headers
function extractContext(headers) {
  return propagation.extract(context.active(), headers);
}

// HTTP client with automatic context propagation
const axios = require('axios');

function createTracedAxios() {
  const instance = axios.create();

  instance.interceptors.request.use((config) => {
    const headers = injectContext(config.headers);
    config.headers = headers;
    return config;
  });

  return instance;
}

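// Usage (inside an async function, since CommonJS has no top-level await)
const http = createTracedAxios();

async function chargePayment() {
  // The interceptor injects the active trace context into the request headers
  const response = await http.post('http://payment-service/charge', {
    amount: 100
  });
  return response.data;
}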
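
With the W3C Trace Context propagator, which the OpenTelemetry Node SDK registers by default, the injected header is `traceparent`. The following sketch shows the header's layout by formatting and parsing it by hand. The function names are hypothetical, and the code is a simplified illustration rather than a replacement for the OpenTelemetry propagator (for example, it does not reject the reserved version `ff`).

```javascript
// Layout: version-traceId-parentId-flags, for example
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceparent(header) {
  const match = TRACEPARENT_PATTERN.exec(header || '');
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the specification
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // The lowest bit of the flags byte is the "sampled" flag
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const header = formatTraceparent({
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true
});
console.log(header); // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
console.log(parseTraceparent(header).sampled); // true
```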

Metrics with Prometheus

Metrics Implementation

// observability/metrics.js
const client = require('prom-client');

class Metrics {
  constructor(options = {}) {
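    this.register = client.register;

    // Attach a `service` label to every metric in the registry so that
    // queries and alerts can group `by (service)`
    this.register.setDefaultLabels({ service: options.serviceName || 'unknown' });

    // Enable default metrics collection (CPU, memory, event loop lag, ...)
    client.collectDefaultMetrics({
      prefix: options.prefix || ''
    });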

    // Custom metrics
    this.httpRequestsTotal = new client.Counter({
      name: 'http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'path', 'status']
    });

    this.httpRequestDuration = new client.Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request latency in seconds',
      labelNames: ['method', 'path', 'status'],
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
    });

    this.activeConnections = new client.Gauge({
      name: 'active_connections',
      help: 'Number of active connections',
      labelNames: ['type']
    });

    this.businessMetrics = {
      ordersCreated: new client.Counter({
        name: 'orders_created_total',
        help: 'Total number of orders created',
        labelNames: ['status']
      }),
      
      orderValue: new client.Histogram({
        name: 'order_value_dollars',
        help: 'Order value in dollars',
        buckets: [10, 25, 50, 100, 250, 500, 1000]
      }),

      paymentProcessingTime: new client.Histogram({
        name: 'payment_processing_seconds',
        help: 'Payment processing time in seconds',
        labelNames: ['provider', 'status'],
        buckets: [0.1, 0.5, 1, 2, 5, 10]
      })
    };
  }

  // Middleware for HTTP metrics
  middleware() {
    return (req, res, next) => {
      const start = process.hrtime();

      res.on('finish', () => {
        const [seconds, nanoseconds] = process.hrtime(start);
        const duration = seconds + nanoseconds / 1e9;
        
        const labels = {
          method: req.method,
          path: this.normalizePath(req.route?.path || req.path),
          status: res.statusCode
        };

        this.httpRequestsTotal.inc(labels);
        this.httpRequestDuration.observe(labels, duration);
      });

      next();
    };
  }

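  // Normalize path to prevent cardinality explosion.
  // The lookahead limits each match to a whole path segment, so a segment
  // such as "/2fa" is left untouched.
  normalizePath(path) {
    return path
      .replace(/\/[0-9a-f]{24}(?=\/|$)/gi, '/:id')  // MongoDB IDs
      .replace(/\/\d+(?=\/|$)/g, '/:id');           // Numeric IDs
  }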

  // Get metrics for Prometheus scraping
  async getMetrics() {
    return this.register.metrics();
  }

  // Record business metric
  recordOrder(status, value) {
    this.businessMetrics.ordersCreated.inc({ status });
    this.businessMetrics.orderValue.observe(value);
  }

  recordPayment(provider, status, duration) {
    this.businessMetrics.paymentProcessingTime.observe(
      { provider, status },
      duration
    );
  }
}

module.exports = { Metrics };
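
To make the cardinality concern concrete, the sketch below counts how many distinct `path` label values a handful of requests would produce with and without normalization. It is a standalone variant of the normalization idea that matches only whole path segments; the request paths are made-up sample data.

```javascript
// Standalone variant of path normalization, matching whole segments only
function normalizePath(path) {
  return path
    .replace(/\/[0-9a-f]{24}(?=\/|$)/gi, '/:id') // MongoDB ObjectIds
    .replace(/\/\d+(?=\/|$)/g, '/:id');          // Numeric IDs
}

// Hypothetical request paths
const requests = [
  '/orders/1', '/orders/2', '/orders/3',
  '/users/507f1f77bcf86cd799439011/orders/42',
  '/users/507f191e810c19729de860ea/orders/43'
];

const rawSeries = new Set(requests);
const normalizedSeries = new Set(requests.map(normalizePath));

console.log(rawSeries.size);        // 5
console.log([...normalizedSeries]); // [ '/orders/:id', '/users/:id/orders/:id' ]
```

Each distinct label combination becomes its own time series in Prometheus, so unnormalized paths grow without bound as new IDs appear.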

Metrics Endpoint

// routes/metrics.js
const express = require('express');
const { Metrics } = require('../observability/metrics');

const router = express.Router();
const metrics = new Metrics({ serviceName: 'order-service' });

// Prometheus scrape endpoint
router.get('/metrics', async (req, res) => {
  res.set('Content-Type', metrics.register.contentType);
  res.send(await metrics.getMetrics());
});

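// `app` is not defined in this module. Apply the middleware and mount the
// router in the application entry point:
//   app.use(metrics.middleware());
//   app.use(router);

module.exports = { router, metrics };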
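
For orientation, the scrape endpoint returns plain text in the Prometheus exposition format. The sketch below renders a counter by hand to show the shape of that output. `renderCounter` is a hypothetical helper for illustration (prom-client generates this text for you), and it omits label-value escaping.

```javascript
// Hypothetical helper: renders one counter in the Prometheus text exposition format.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, val]) => `${key}="${val}"`)
      .join(',');
    lines.push(`${name}{${labelText}} ${value}`);
  }
  return lines.join('\n') + '\n';
}

const output = renderCounter('http_requests_total', 'Total number of HTTP requests', [
  { labels: { method: 'GET', path: '/orders/:id', status: '200' }, value: 1523 },
  { labels: { method: 'POST', path: '/orders', status: '500' }, value: 7 }
]);

console.log(output);
```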

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'microservices'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  # For non-Kubernetes environments
  - job_name: 'services'
    static_configs:
      - targets: 
          - 'order-service:3000'
          - 'payment-service:3000'
          - 'inventory-service:3000'
    metrics_path: '/metrics'
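
The `__address__` relabel rule is the least obvious part of this configuration. Prometheus joins the source label values with `;`, matches the regex against the whole joined string, and writes the replacement into the target label. The sketch below reproduces that step in JavaScript; `rewriteAddress` is a hypothetical name and the addresses are sample values.

```javascript
// Prometheus anchors relabel regexes at both ends, hence the ^(?: ... )$ wrapper.
const RELABEL_REGEX = /^(?:([^:]+)(?::\d+)?;(\d+))$/;

function rewriteAddress(address, portAnnotation) {
  // Source labels are joined with ';' before matching
  const joined = `${address};${portAnnotation}`;
  const match = RELABEL_REGEX.exec(joined);
  // No match: the target label is left unchanged
  return match ? `${match[1]}:${match[2]}` : address;
}

console.log(rewriteAddress('10.0.0.5:8080', '3000')); // 10.0.0.5:3000
console.log(rewriteAddress('10.0.0.5', '9102'));      // 10.0.0.5:9102
console.log(rewriteAddress('10.0.0.5:8080', ''));     // 10.0.0.5:8080
```

The last call shows why pods without a port annotation keep their discovered address: the empty annotation value cannot satisfy `(\d+)`.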

Alert Rules

# prometheus/alerts.yml
groups:
  - name: microservices
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
          > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s for {{ $labels.service }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state{state="open"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for {{ $labels.target_service }}"
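
To see what the `HighErrorRate` expression evaluates, here is a simplified sketch of the arithmetic using two counter samples per series. The sample values are invented, and `simpleRate` is a stand-in: the real `rate()` function also handles counter resets and extrapolation, which this sketch does not.

```javascript
// Simplified per-second rate between two counter samples.
// Assumption: no counter resets inside the window.
function simpleRate(first, last) {
  const seconds = (last.timestampMs - first.timestampMs) / 1000;
  return (last.value - first.value) / seconds;
}

// Hypothetical samples taken 5 minutes apart
const windowMs = 5 * 60 * 1000;
const total = simpleRate(
  { timestampMs: 0, value: 10000 },
  { timestampMs: windowMs, value: 13000 }
); // 3000 requests / 300 s = 10 per second
const errors = simpleRate(
  { timestampMs: 0, value: 200 },
  { timestampMs: windowMs, value: 380 }
); // 180 errors / 300 s = 0.6 per second

const errorRatio = errors / total;   // about 0.06, i.e. 6% of requests fail
const wouldFire = errorRatio > 0.05; // same threshold as HighErrorRate

console.log(errorRatio.toFixed(2), wouldFire); // 0.06 true
```

Note that the alert only fires once the condition has held continuously for the `for: 5m` duration, so a single evaluation above the threshold is not enough.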

Grafana Dashboards

Docker Compose for Observability Stack

# docker-compose.observability.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # Collector HTTP
      - "14250:14250"  # Collector gRPC

  otel-collector:
    image: otel/opentelemetry-collector:latest
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./otel/otel-collector-config.yaml:/etc/otelcol/config.yaml

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail:/etc/promtail

volumes:
  prometheus-data:
  grafana-data:
  loki-data:

OpenTelemetry Collector Configuration

# otel/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

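exporters:
  # Recent Jaeger versions accept OTLP directly; the dedicated `jaeger`
  # exporter has been removed from current Collector releases.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: microservices

  # `debug` replaces the deprecated `logging` exporter
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]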

Sample Grafana Dashboard JSON

{
  "title": "Microservices Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Active Instances",
      "type": "stat",
      "targets": [
        {
          "expr": "count(up{job=\"microservices\"} == 1) by (service)"
        }
      ]
    }
  ]
}

RED Method for Microservices

The RED Method focuses on three key metrics:

┌─────────────────────────────────────────────────────────────────────────────┐
│                          RED METHOD                                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  R - RATE                                                                    │
│  ─────────                                                                   │
│  How many requests per second?                                               │
│                                                                              │
│  Metric: sum(rate(http_requests_total[5m])) by (service)                    │
│                                                                              │
│                                                                              │
│  E - ERRORS                                                                  │
│  ──────────                                                                  │
│  How many requests are failing?                                              │
│                                                                              │
│  Metric: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)     │
│          / sum(rate(http_requests_total[5m])) by (service)                  │
│                                                                              │
│                                                                              │
│  D - DURATION                                                                │
│  ───────────                                                                 │
│  How long do requests take?                                                  │
│                                                                              │
│  Metric: histogram_quantile(0.99,                                           │
│            sum(rate(http_request_duration_seconds_bucket[5m]))              │
│            by (le, service))                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
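
The Duration query relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below shows that estimation for classic histograms under simplifying assumptions: at least one finite bucket, a final `+Inf` bucket, and a lower bound of 0 for the first bucket. The bucket counts are invented sample data.

```javascript
// buckets: cumulative counts sorted by upper bound `le`, ending with le = Infinity
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const index = buckets.findIndex((bucket) => bucket.count >= rank);

  if (buckets[index].le === Infinity) {
    // Quantile falls in the +Inf bucket: report the highest finite bound
    return buckets[index - 1].le;
  }

  const lowerBound = index === 0 ? 0 : buckets[index - 1].le;
  const countBelow = index === 0 ? 0 : buckets[index - 1].count;
  const countInBucket = buckets[index].count - countBelow;

  // Linear interpolation within the bucket
  return lowerBound + (buckets[index].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

// Hypothetical request durations in seconds, as cumulative bucket counts
const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.25, count: 90 },
  { le: 0.5, count: 98 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 }
];

console.log(histogramQuantile(0.99, latencyBuckets)); // 0.75
console.log(histogramQuantile(0.5, latencyBuckets));  // 0.1
```

Because the result is interpolated, its accuracy depends on bucket boundaries: the 0.75 above only says the p99 lies somewhere between 0.5 s and 1 s.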

Log Aggregation with Loki

// observability/lokiTransport.js
const Transport = require('winston-transport');
const axios = require('axios');

class LokiTransport extends Transport {
  constructor(opts) {
    super(opts);
    this.lokiUrl = opts.lokiUrl || 'http://localhost:3100';
    this.labels = opts.labels || {};
    this.batchSize = opts.batchSize || 100;
    this.batchInterval = opts.batchInterval || 1000;
    this.batch = [];
    
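    // unref() so the flush timer does not keep the process alive on shutdown
    this.timer = setInterval(() => this.flush(), this.batchInterval);
    this.timer.unref();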
  }

  log(info, callback) {
    const logEntry = {
      // Nanoseconds as a string; multiplying as a Number would exceed 2^53 and lose precision
      ts: `${Date.now()}000000`,
      line: JSON.stringify(info)
    };

    this.batch.push(logEntry);

    if (this.batch.length >= this.batchSize) {
      this.flush();
    }

    callback();
  }

  async flush() {
    if (this.batch.length === 0) return;

    const entries = this.batch;
    this.batch = [];

    const payload = {
      streams: [
        {
          stream: this.labels,
          values: entries.map(e => [String(e.ts), e.line])
        }
      ]
    };

    try {
      await axios.post(`${this.lokiUrl}/loki/api/v1/push`, payload, {
        headers: { 'Content-Type': 'application/json' }
      });
    } catch (error) {
      console.error('Failed to send logs to Loki:', error.message);
      // Re-add to batch for retry
      this.batch = entries.concat(this.batch);
    }
  }
}

module.exports = { LokiTransport };
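
The body sent to Loki's push endpoint has a specific shape: a list of streams, each with a label set and an array of `[timestamp, line]` pairs, where the timestamp is a string of nanoseconds since the Unix epoch. The sketch below builds that payload without any HTTP call. The helper names and label values are assumptions for illustration.

```javascript
// Convert a millisecond timestamp to a nanosecond string.
// BigInt avoids the precision loss of multiplying a Number past 2^53.
function toNanoString(timestampMs) {
  return (BigInt(timestampMs) * 1000000n).toString();
}

// Hypothetical helper: builds the JSON body for POST /loki/api/v1/push
function buildPushPayload(labels, entries) {
  return {
    streams: [
      {
        stream: labels,
        values: entries.map((entry) => [
          toNanoString(entry.timestampMs),
          JSON.stringify(entry.log)
        ])
      }
    ]
  };
}

const payload = buildPushPayload(
  { service: 'order-service', environment: 'production' },
  [{ timestampMs: 1705314600000, log: { level: 'info', message: 'Order created' } }]
);

console.log(payload.streams[0].values[0][0]); // 1705314600000000000
```

Keep the label set small and static (service, environment): in Loki, as in Prometheus, every distinct label combination creates a separate stream.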

Interview Questions

Q1: What are the three pillars of observability?

Answer:

Logs: Discrete events with rich context
- What happened at specific moments
- Good for debugging
- Structured JSON format preferred

Metrics: Numeric time-series data
- System performance indicators
- Good for alerting and dashboards
- Examples: request rate, error rate, latency

Traces: Request path through system
- How requests flow across services
- Latency breakdown per component
- Good for debugging distributed issues

Together: Full visibility into distributed systems

Q2: Explain distributed tracing and its components

Answer:

Components:

Trace: Complete request journey
Span: Single operation within trace
Context: Trace ID + Span ID propagated across services

How it works:

First service creates trace ID
Each operation creates a span with parent reference
Context propagated via headers
All spans collected and visualized

Key headers:

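traceparent: W3C Trace Context standard header (carries trace ID, parent span ID, and flags)
X-Trace-ID: Custom, non-standard trace identifier header
X-Span-ID: Custom, non-standard header for the current span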

Tools: Jaeger, Zipkin, OpenTelemetry
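
The steps under "How it works" can be sketched in a few lines. This is a toy model, not OpenTelemetry's API: `startSpan` is a hypothetical helper, and real SDKs also record timing, status, and attributes.

```javascript
const { randomBytes } = require('crypto');

// Hypothetical helper: a root span creates the trace ID; a child span
// inherits it and records its parent's span ID.
function startSpan(name, parent = null) {
  return {
    name,
    traceId: parent ? parent.traceId : randomBytes(16).toString('hex'),
    spanId: randomBytes(8).toString('hex'),
    parentSpanId: parent ? parent.spanId : null
  };
}

const root = startSpan('api-gateway: POST /orders');
const order = startSpan('order-service: createOrder', root);
const payment = startSpan('payment-service: processPayment', order);

// Every span shares one trace ID; parent references let a backend rebuild the tree.
console.log(payment.traceId === root.traceId);      // true
console.log(payment.parentSpanId === order.spanId); // true
console.log(root.parentSpanId);                     // null
```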

Q3: What is the RED method?

Answer:

RED = Rate, Errors, Duration

Rate: Requests per second
- Indicates load on service
- Alert on unusual spikes/drops

Errors: Failed requests ratio
- Track 4xx and 5xx separately
- Alert when error rate exceeds threshold

Duration: Request latency (p50, p95, p99)
- User experience indicator
- Alert on latency degradation

Why RED:

Simple, focused metrics
Covers most service health scenarios
Easy to implement and understand

Complementary: USE method for infrastructure (Utilization, Saturation, Errors)
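
Duration is tracked as percentiles rather than an average because a few slow requests can hide behind a healthy-looking mean. The sketch below computes nearest-rank percentiles from raw samples; the sample durations are invented, and `percentile` is a hypothetical helper (Prometheus estimates percentiles from histogram buckets instead of raw samples).

```javascript
// Nearest-rank percentile over raw samples
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical request durations in milliseconds, with one slow outlier
const durationsMs = [12, 15, 11, 14, 13, 16, 18, 17, 19, 950];

const mean = durationsMs.reduce((sum, d) => sum + d, 0) / durationsMs.length;

console.log(mean);                        // 108.5
console.log(percentile(durationsMs, 50)); // 15
console.log(percentile(durationsMs, 99)); // 950
```

The mean (108.5 ms) describes none of the actual requests, while p50 and p99 together show that most requests are fast and the tail is very slow.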

Summary

Key Takeaways

Three pillars: Logs, Metrics, Traces
Use structured logging with correlation IDs
OpenTelemetry for vendor-neutral instrumentation
Prometheus for metrics, Grafana for visualization
RED method for service health

Next Steps

In the next chapter, we’ll cover Security - authentication, authorization, and secure communication.
