Observability
In distributed systems, understanding what’s happening across services is crucial. Observability gives you visibility into your system through three pillars: Logs, Metrics, and Traces.
Learning Objectives:
- Understand the three pillars of observability
- Implement structured logging
- Set up distributed tracing with OpenTelemetry
- Configure metrics with Prometheus
- Build dashboards with Grafana
Three Pillars of Observability
┌─────────────────────────────────────────────────────────────────────────────┐
│ THREE PILLARS OF OBSERVABILITY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ LOGS │ │
│ │ │ │
│ │ What happened? │ │
│ │ │ │
│ │ [2024-01-15T10:30:00Z] INFO order-service: Order created │ │
│ │ [2024-01-15T10:30:01Z] ERROR payment-service: Payment failed │ │
│ │ [2024-01-15T10:30:02Z] WARN inventory-service: Low stock alert │ │
│ │ │ │
│ │ • Discrete events │ │
│ │ • Rich context │ │
│ │ • Good for debugging │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ METRICS │ │
│ │ │ │
│ │ How is the system performing? │ │
│ │ │ │
│ │ request_count{service="order",status="200"} = 1523 │ │
│ │ request_duration_seconds{service="order",p99} = 0.245 │ │
│ │ error_rate{service="payment"} = 0.02 │ │
│ │ │ │
│ │ • Numeric time-series data │ │
│ │ • Aggregated values │ │
│ │ • Good for alerting & dashboards │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TRACES │ │
│ │ │ │
│ │ How do requests flow through the system? │ │
│ │ │ │
│ │ [trace-id: abc123] │ │
│ │ └─ order-service (50ms) │ │
│ │ ├─ validate-order (5ms) │ │
│ │ ├─ payment-service (30ms) │ │
│ │ │ └─ process-payment (25ms) │ │
│ │ └─ inventory-service (10ms) │ │
│ │ │ │
│ │ • Request path visualization │ │
│ │ • Latency breakdown │ │
│ │ • Dependency mapping │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Structured Logging
Logs should be structured (JSON) for easy parsing and querying.
Logger Implementation
// observability/Logger.js
const os = require('os');
const winston = require('winston');

class Logger {
  constructor(options = {}) {
    this.serviceName = options.serviceName || process.env.SERVICE_NAME || 'unknown';
    this.environment = options.environment || process.env.NODE_ENV || 'development';
    this.logger = winston.createLogger({
      level: options.level || 'info',
      format: winston.format.combine(
        winston.format.timestamp(),
        winston.format.errors({ stack: true }),
        winston.format.json()
      ),
      defaultMeta: {
        service: this.serviceName,
        environment: this.environment,
        version: process.env.APP_VERSION || '1.0.0',
        hostname: os.hostname()
      },
      transports: this.createTransports(options)
    });
  }

  createTransports(options) {
    const transports = [];
    // Console for development
    if (this.environment === 'development') {
      transports.push(new winston.transports.Console({
        format: winston.format.combine(
          winston.format.colorize(),
          winston.format.simple()
        )
      }));
    } else {
      // JSON for production (shipped to log aggregator)
      transports.push(new winston.transports.Console());
    }
    // File transport for local debugging
    if (options.logFile) {
      transports.push(new winston.transports.File({
        filename: options.logFile,
        maxsize: 5242880, // 5MB
        maxFiles: 5
      }));
    }
    return transports;
  }

  // Add correlation context to all logs
  child(context) {
    return {
      info: (message, meta = {}) => this.info(message, { ...context, ...meta }),
      warn: (message, meta = {}) => this.warn(message, { ...context, ...meta }),
      error: (message, meta = {}) => this.error(message, { ...context, ...meta }),
      debug: (message, meta = {}) => this.debug(message, { ...context, ...meta })
    };
  }

  info(message, meta = {}) {
    this.logger.info(message, this.enrichMeta(meta));
  }

  warn(message, meta = {}) {
    this.logger.warn(message, this.enrichMeta(meta));
  }

  error(message, meta = {}) {
    this.logger.error(message, this.enrichMeta(meta));
  }

  debug(message, meta = {}) {
    this.logger.debug(message, this.enrichMeta(meta));
  }

  enrichMeta(meta) {
    return {
      ...meta,
      timestamp: new Date().toISOString()
    };
  }
}

// Singleton export
const logger = new Logger({
  serviceName: process.env.SERVICE_NAME,
  level: process.env.LOG_LEVEL || 'info'
});

module.exports = { Logger, logger };
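To see why JSON logs are easier to query than free-text lines, here is a minimal sketch that uses only Node built-ins. The sample entries and the `queryLogs` helper are illustrative assumptions; they are not part of the Logger class or of any log aggregator's API.

```javascript
// Hypothetical sample output: one JSON object per line, as a JSON-format logger would emit.
const rawLines = [
  '{"level":"info","service":"order-service","message":"Order created","correlationId":"c-1"}',
  '{"level":"error","service":"payment-service","message":"Payment failed","correlationId":"c-1"}',
  '{"level":"warn","service":"inventory-service","message":"Low stock alert","correlationId":"c-2"}'
];

// Parse each line and keep the entries that match every key/value pair in `filter`.
function queryLogs(lines, filter) {
  return lines
    .map((line) => JSON.parse(line))
    .filter((entry) => Object.entries(filter).every(([key, value]) => entry[key] === value));
}

// All entries that belong to one request, across services.
const sameRequest = queryLogs(rawLines, { correlationId: 'c-1' });
console.log(sameRequest.map((e) => `${e.service}: ${e.message}`));
```

Because every field is addressable by name, the same filter works for `level`, `service`, or any other key without writing a regular expression per log format.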
Request Logging Middleware
// middleware/requestLogger.js
const { logger } = require('../observability/Logger');
const { v4: uuidv4 } = require('uuid');

function requestLogger(options = {}) {
  return (req, res, next) => {
    // Extract or generate correlation IDs
    const correlationId = req.headers['x-correlation-id'] || uuidv4();
    const traceId = req.headers['x-trace-id'] || correlationId;
    // Strip the hyphens first so the span ID is 16 hex characters
    const spanId = uuidv4().replace(/-/g, '').substring(0, 16);

    // Attach to request
    req.correlationId = correlationId;
    req.traceId = traceId;
    req.spanId = spanId;

    // Create child logger with context
    req.logger = logger.child({
      correlationId,
      traceId,
      spanId,
      method: req.method,
      path: req.path,
      userAgent: req.headers['user-agent']
    });

    // Log request start
    const startTime = Date.now();
    req.logger.info('Request started', {
      query: req.query,
      ip: req.ip
    });

    // Capture response
    const originalSend = res.send;
    res.send = function(body) {
      const duration = Date.now() - startTime;
      req.logger.info('Request completed', {
        statusCode: res.statusCode,
        duration,
        responseSize: body?.length || 0
      });
      return originalSend.call(this, body);
    };

    // Echo the IDs back to the caller in the response headers
    res.setHeader('X-Correlation-ID', correlationId);
    res.setHeader('X-Trace-ID', traceId);
    next();
  };
}

module.exports = { requestLogger };
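The ID handling in the middleware can be isolated as a small pure function, which makes the reuse-or-generate rule easier to test. The sketch below uses Node's built-in `crypto` module instead of `uuid`; the name `resolveTraceIds` is hypothetical.

```javascript
const crypto = require('crypto');

// Reuse IDs supplied by the caller; generate fresh ones only when none arrive.
function resolveTraceIds(headers = {}) {
  const correlationId = headers['x-correlation-id'] || crypto.randomUUID();
  const traceId = headers['x-trace-id'] || correlationId;
  // A new span ID per hop: 8 random bytes rendered as 16 hex characters.
  const spanId = crypto.randomBytes(8).toString('hex');
  return { correlationId, traceId, spanId };
}

const fromUpstream = resolveTraceIds({ 'x-correlation-id': 'abc-123' });
const atTheEdge = resolveTraceIds({});
console.log(fromUpstream, atTheEdge);
```

A request that already carries `x-correlation-id` keeps it, so every service along the call chain logs the same value.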
Propagating Context to Downstream Services
// utils/httpClient.js
const axios = require('axios');

function createTracedClient(req) {
  return axios.create({
    headers: {
      'X-Correlation-ID': req.correlationId,
      'X-Trace-ID': req.traceId,
      'X-Parent-Span-ID': req.spanId
    }
  });
}

// Usage in route handler
app.post('/orders', async (req, res, next) => {
  const http = createTracedClient(req);
  try {
    req.logger.info('Creating order', { customerId: req.body.customerId });
    // Context is propagated through the client's default headers
    const payment = await http.post('http://payment-service/charge', {
      amount: req.body.total
    });
    req.logger.info('Payment processed', { paymentId: payment.data.id });
    res.json({ orderId: '123', paymentId: payment.data.id });
  } catch (error) {
    req.logger.error('Order creation failed', {
      error: error.message,
      stack: error.stack
    });
    // Express 4 does not catch errors thrown from async handlers, so pass it on explicitly
    next(error);
  }
});
Distributed Tracing with OpenTelemetry
OpenTelemetry provides vendor-neutral instrumentation.
┌─────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED TRACE VISUALIZATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Trace ID: abc123-def456-ghi789 │
│ ────────────────────────────────────────────────────────────────── │
│ │
│ 0ms 50ms 100ms 150ms 200ms 250ms 300ms │
│ ├────────┴────────┴────────┴────────┴────────┴────────┤ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ api-gateway: POST /orders │ 280ms │
│ └┬─────────────────────────────────────────────────────────┘ │
│ │ │
│ │ ┌────────────────────────────────────────────────┐ │
│ ├▶│ order-service: createOrder │ 220ms │
│ │ └┬───────────────────────────────────────────────┘ │
│ │ │ │
│ │ │ ┌─────────────────────┐ │
│ │ ├▶│ validate-order │ 15ms │
│ │ │ └─────────────────────┘ │
│ │ │ │
│ │ │ ┌─────────────────────────────────────┐ │
│ │ ├▶│ payment-service: processPayment │ 120ms │
│ │ │ └┬────────────────────────────────────┘ │
│ │ │ │ │
│ │ │ │ ┌────────────────────────┐ │
│ │ │ └▶│ stripe-api: charge │ 95ms │
│ │ │ └────────────────────────┘ │
│ │ │ │
│ │ │ ┌──────────────────────────────┐ │
│ │ └▶│ inventory-service: reserve │ 45ms │
│ │ └┬─────────────────────────────┘ │
│ │ │ │
│ │ │ ┌─────────────────────┐ │
│ │ └▶│ database: UPDATE │ 20ms │
│ │ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
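A trace like the one in the diagram is stored as a flat list of spans that point at their parents. The following sketch shows how a latency breakdown can be derived from that list. The durations come from the diagram; the `id` and `parentId` values and the `selfTimes` helper are made up for illustration.

```javascript
// Durations from the diagram above; ids and parent links are hypothetical.
const spans = [
  { id: 'a', parentId: null, name: 'api-gateway: POST /orders', durationMs: 280 },
  { id: 'b', parentId: 'a', name: 'order-service: createOrder', durationMs: 220 },
  { id: 'c', parentId: 'b', name: 'validate-order', durationMs: 15 },
  { id: 'd', parentId: 'b', name: 'payment-service: processPayment', durationMs: 120 },
  { id: 'e', parentId: 'd', name: 'stripe-api: charge', durationMs: 95 },
  { id: 'f', parentId: 'b', name: 'inventory-service: reserve', durationMs: 45 },
  { id: 'g', parentId: 'f', name: 'database: UPDATE', durationMs: 20 }
];

// Self time = a span's duration minus the time spent in its direct children.
// This assumes the children run one after another, not in parallel.
function selfTimes(allSpans) {
  const childTotal = new Map();
  for (const span of allSpans) {
    if (span.parentId !== null) {
      childTotal.set(span.parentId, (childTotal.get(span.parentId) || 0) + span.durationMs);
    }
  }
  return allSpans.map((span) => ({
    name: span.name,
    selfMs: span.durationMs - (childTotal.get(span.id) || 0)
  }));
}

const breakdown = selfTimes(spans);
console.log(breakdown);
```

Sorting the result by `selfMs` points at the component that actually consumed the time, which in this trace is the external payment call rather than any of the services wrapping it.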
OpenTelemetry Setup
// observability/tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

function initTracing(serviceName) {
  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development'
  });

  const traceExporter = new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317'
  });

  const metricExporter = new OTLPMetricExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317'
  });

  const sdk = new NodeSDK({
    resource,
    traceExporter,
    metricReader: new PeriodicExportingMetricReader({
      exporter: metricExporter,
      exportIntervalMillis: 30000
    }),
    instrumentations: [
      getNodeAutoInstrumentations({
        '@opentelemetry/instrumentation-http': {
          ignoreIncomingPaths: ['/health', '/metrics']
        },
        '@opentelemetry/instrumentation-express': {},
        '@opentelemetry/instrumentation-mongodb': {},
        '@opentelemetry/instrumentation-redis': {},
        '@opentelemetry/instrumentation-pg': {},
        '@opentelemetry/instrumentation-grpc': {}
      })
    ]
  });

  sdk.start();
  console.log('OpenTelemetry tracing initialized');

  // Graceful shutdown
  process.on('SIGTERM', () => {
    sdk.shutdown()
      .then(() => console.log('Tracing terminated'))
      .catch((error) => console.error('Error terminating tracing', error))
      .finally(() => process.exit(0));
  });

  return sdk;
}

module.exports = { initTracing };
Manual Instrumentation
// observability/tracer.js
const { trace, SpanStatusCode } = require('@opentelemetry/api');

class Tracer {
  constructor(serviceName) {
    this.tracer = trace.getTracer(serviceName);
  }

  // Create a new span for an operation
  async trace(name, fn, options = {}) {
    return this.tracer.startActiveSpan(name, options, async (span) => {
      try {
        const result = await fn(span);
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error.message
        });
        span.recordException(error);
        throw error;
      } finally {
        span.end();
      }
    });
  }

  // Add attributes to current span
  addAttribute(key, value) {
    const span = trace.getActiveSpan();
    if (span) {
      span.setAttribute(key, value);
    }
  }

  // Add event to current span
  addEvent(name, attributes = {}) {
    const span = trace.getActiveSpan();
    if (span) {
      span.addEvent(name, attributes);
    }
  }

  // Get current trace context for propagation
  getContext() {
    const span = trace.getActiveSpan();
    if (span) {
      const spanContext = span.spanContext();
      return {
        traceId: spanContext.traceId,
        spanId: spanContext.spanId,
        traceFlags: spanContext.traceFlags
      };
    }
    return null;
  }
}

// Usage (paymentService and inventoryService are assumed to be defined elsewhere)
const tracer = new Tracer('order-service');

async function createOrder(orderData) {
  return tracer.trace('createOrder', async (span) => {
    span.setAttribute('order.customerId', orderData.customerId);
    span.setAttribute('order.itemCount', orderData.items.length);

    // Validate order
    await tracer.trace('validateOrder', async () => {
      // validation logic
    });

    // Process payment
    const payment = await tracer.trace('processPayment', async (paymentSpan) => {
      paymentSpan.setAttribute('payment.amount', orderData.total);
      return paymentService.charge(orderData.total);
    });

    span.addEvent('payment_completed', {
      paymentId: payment.id,
      amount: payment.amount
    });

    // Reserve inventory
    await tracer.trace('reserveInventory', async () => {
      return inventoryService.reserve(orderData.items);
    });

    return { orderId: '123', paymentId: payment.id };
  });
}

module.exports = { Tracer, tracer };
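Nested `tracer.trace` calls become child spans because each new span looks up whichever span is currently "active". The sketch below illustrates that idea with Node's built-in `AsyncLocalStorage`. It is a simplified model for understanding, not the OpenTelemetry SDK's code, and `withSpan`, `activeSpan`, and `finished` are hypothetical names.

```javascript
const { AsyncLocalStorage } = require('async_hooks');

const activeSpan = new AsyncLocalStorage();
const finished = [];

// Run `fn` inside a new span whose parent is whatever span is active right now.
async function withSpan(name, fn) {
  const parent = activeSpan.getStore();
  const span = { name, parent: parent ? parent.name : null, status: 'OK' };
  try {
    // Everything awaited inside `fn` sees `span` as the active span.
    return await activeSpan.run(span, fn);
  } catch (error) {
    span.status = 'ERROR';
    throw error;
  } finally {
    finished.push(span); // stands in for span.end() plus export
  }
}

async function main() {
  await withSpan('createOrder', async () => {
    await withSpan('validateOrder', async () => {});
    await withSpan('processPayment', async () => {});
  });
  return finished;
}

const done = main();
done.then((result) => console.log(result));
```

The inner spans record `createOrder` as their parent without any argument being passed down, which is the same convenience the active-span mechanism gives the `createOrder` example.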
Context Propagation
// observability/contextPropagation.js
const axios = require('axios');
const { context, propagation } = require('@opentelemetry/api');

// Inject context into outgoing request headers
function injectContext(headers = {}) {
  propagation.inject(context.active(), headers);
  return headers;
}

// Extract context from incoming request headers
function extractContext(headers) {
  return propagation.extract(context.active(), headers);
}

// HTTP client with automatic context propagation
function createTracedAxios() {
  const instance = axios.create();
  instance.interceptors.request.use((config) => {
    const headers = injectContext(config.headers);
    config.headers = headers;
    return config;
  });
  return instance;
}

// Usage (wrapped in an async function because CommonJS has no top-level await)
async function chargeCustomer() {
  const http = createTracedAxios();
  // Context is automatically propagated
  const response = await http.post('http://payment-service/charge', {
    amount: 100
  });
  return response.data;
}
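When the W3C Trace Context propagator is in use, the injected header is `traceparent`, whose value has four dash-separated fields: version, trace ID, parent span ID, and flags. The sketch below formats and parses that value by hand to make the layout concrete. The helper names are hypothetical, and only version `00` is handled.

```javascript
// Layout: "00" - 32 hex chars (trace id) - 16 hex chars (span id) - 2 hex chars (flags)
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header || '');
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero ids are defined as invalid by the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const header = formatTraceparent({
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true
});
const parsed = parseTraceparent(header);
console.log(header, parsed);
```

In practice `propagation.inject` and `propagation.extract` do this work; the manual version is only useful for reading headers in logs or debugging a service that drops them.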
Metrics with Prometheus
Metrics Implementation
// observability/metrics.js
const client = require('prom-client');

class Metrics {
  constructor(options = {}) {
    this.register = client.register;
    // Attach a `service` label to every metric so queries can group by (service)
    this.register.setDefaultLabels({ service: options.serviceName });

    // Enable default metrics collection
    client.collectDefaultMetrics({
      prefix: options.prefix || ''
    });

    // Custom metrics
    this.httpRequestsTotal = new client.Counter({
      name: 'http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'path', 'status']
    });

    this.httpRequestDuration = new client.Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request latency in seconds',
      labelNames: ['method', 'path', 'status'],
      buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
    });

    this.activeConnections = new client.Gauge({
      name: 'active_connections',
      help: 'Number of active connections',
      labelNames: ['type']
    });

    this.businessMetrics = {
      ordersCreated: new client.Counter({
        name: 'orders_created_total',
        help: 'Total number of orders created',
        labelNames: ['status']
      }),
      orderValue: new client.Histogram({
        name: 'order_value_dollars',
        help: 'Order value in dollars',
        buckets: [10, 25, 50, 100, 250, 500, 1000]
      }),
      paymentProcessingTime: new client.Histogram({
        name: 'payment_processing_seconds',
        help: 'Payment processing time in seconds',
        labelNames: ['provider', 'status'],
        buckets: [0.1, 0.5, 1, 2, 5, 10]
      })
    };
  }

  // Middleware for HTTP metrics
  middleware() {
    return (req, res, next) => {
      const start = process.hrtime();
      res.on('finish', () => {
        const [seconds, nanoseconds] = process.hrtime(start);
        const duration = seconds + nanoseconds / 1e9;
        const labels = {
          method: req.method,
          path: this.normalizePath(req.route?.path || req.path),
          status: res.statusCode
        };
        this.httpRequestsTotal.inc(labels);
        this.httpRequestDuration.observe(labels, duration);
      });
      next();
    };
  }

  // Normalize path to prevent cardinality explosion
  normalizePath(path) {
    return path
      .replace(/\/[0-9a-f]{24}/g, '/:id') // MongoDB IDs
      .replace(/\/\d+/g, '/:id'); // Numeric IDs
  }

  // Get metrics for Prometheus scraping
  async getMetrics() {
    return this.register.metrics();
  }

  // Record business metric
  recordOrder(status, value) {
    this.businessMetrics.ordersCreated.inc({ status });
    this.businessMetrics.orderValue.observe(value);
  }

  recordPayment(provider, status, duration) {
    this.businessMetrics.paymentProcessingTime.observe(
      { provider, status },
      duration
    );
  }
}

module.exports = { Metrics };
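The `buckets` option on a Prometheus histogram defines cumulative upper bounds: one observation increments every bucket whose bound is greater than or equal to the value. The sketch below models that bookkeeping in plain JavaScript so the exported `_bucket` series are easier to read. `createHistogram` is a hypothetical helper, not the prom-client implementation.

```javascript
// Same bucket boundaries as http_request_duration_seconds above.
const buckets = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

function createHistogram(upperBounds) {
  const counts = new Array(upperBounds.length + 1).fill(0); // last slot is +Inf
  let sum = 0;
  return {
    observe(value) {
      sum += value;
      // Cumulative buckets: count the value in every bucket with le >= value.
      upperBounds.forEach((le, i) => {
        if (value <= le) counts[i] += 1;
      });
      counts[upperBounds.length] += 1; // +Inf always counts every observation
    },
    snapshot() {
      return { le: [...upperBounds, Infinity], counts: [...counts], sum };
    }
  };
}

const histogram = createHistogram(buckets);
[0.004, 0.03, 0.2, 0.2, 3].forEach((seconds) => histogram.observe(seconds));
const snapshot = histogram.snapshot();
console.log(snapshot);
```

Because only per-bucket counts are stored, the original values are not recoverable; choosing boundaries close to the latency thresholds you alert on keeps quantile estimates useful.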
Metrics Endpoint
// routes/metrics.js
const express = require('express');
const { Metrics } = require('../observability/metrics');

const router = express.Router();
const metrics = new Metrics({ serviceName: 'order-service' });

// Prometheus scrape endpoint
router.get('/metrics', async (req, res) => {
  res.set('Content-Type', metrics.register.contentType);
  res.send(await metrics.getMetrics());
});

// Apply the metrics middleware in the app entry point, before the other routes:
//   app.use(metrics.middleware());
//   app.use(router);

module.exports = { router, metrics };
Prometheus Configuration
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'microservices'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  # For non-Kubernetes environments
  - job_name: 'services'
    static_configs:
      - targets:
          - 'order-service:3000'
          - 'payment-service:3000'
          - 'inventory-service:3000'
    metrics_path: '/metrics'
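The last relabel rule in the Kubernetes job is the least obvious one: Prometheus joins the source label values with `;`, matches the whole string against the regex, and writes the replacement into `__address__`. The sketch below replays that rule in JavaScript for a few inputs. The function name and the sample addresses are hypothetical.

```javascript
// Prometheus anchors relabel regexes at both ends, hence the ^(?: ... )$ wrapper.
const relabelRegex = /^(?:([^:]+)(?::\d+)?;(\d+))$/;

function rewriteAddress(address, portAnnotation) {
  // Source labels [__address__, ...prometheus_io_port] are joined with ";".
  const joined = `${address};${portAnnotation}`;
  // With action "replace", a non-matching regex leaves the target label untouched.
  return relabelRegex.test(joined) ? joined.replace(relabelRegex, '$1:$2') : address;
}

const withPort = rewriteAddress('10.0.0.5:8080', '9102');
const withoutPort = rewriteAddress('10.0.0.5', '9102');
const noAnnotation = rewriteAddress('10.0.0.5:8080', '');
console.log(withPort, withoutPort, noAnnotation);
```

The pod's own port is dropped and replaced with the port from the `prometheus.io/port` annotation, while pods without that annotation keep their discovered address.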
Alert Rules
# prometheus/alerts.yml
groups:
  - name: microservices
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
          > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s for {{ $labels.service }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state{state="open"} == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for {{ $labels.target_service }}"
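The `HighLatency` rule depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that estimate for a made-up set of buckets. It is a simplified model (classic buckets only, `0 < q <= 1`), and the sample counts are hypothetical.

```javascript
// Each bucket holds the cumulative count of observations <= le.
function histogramQuantile(q, bucketList) {
  const sorted = [...bucketList].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((bucket) => bucket.count >= rank);

  // If the rank lands in the +Inf bucket, report the highest finite bound.
  if (sorted[i].le === Infinity) return sorted[sorted.length - 2].le;

  const lowerBound = i === 0 ? 0 : sorted[i - 1].le;
  const countBelow = i === 0 ? 0 : sorted[i - 1].count;
  const countInBucket = sorted[i].count - countBelow;
  // Linear interpolation between the bucket's lower and upper bound.
  return lowerBound + (sorted[i].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

// Hypothetical 5-minute window: 100 requests, 2 of them slower than 1s.
const sample = [
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: 2.5, count: 100 },
  { le: Infinity, count: 100 }
];
const p99 = histogramQuantile(0.99, sample);
console.log(p99);
```

With these counts the estimate falls between 1s and 2.5s, so the `> 1` condition would hold. The result is bounded by the bucket edges, which is why a quantile can never be reported more precisely than the bucket layout allows.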
Grafana Dashboards
Docker Compose for Observability Stack
# docker-compose.observability.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "14268:14268" # Collector HTTP
      - "14250:14250" # Collector gRPC

  otel-collector:
    image: otel/opentelemetry-collector:latest
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    volumes:
      - ./otel/otel-collector-config.yaml:/etc/otelcol/config.yaml

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail:/etc/promtail

volumes:
  prometheus-data:
  grafana-data:
  loki-data:
OpenTelemetry Collector Configuration
# otel/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

exporters:
  # Recent Collector releases no longer ship the dedicated `jaeger` exporter;
  # Jaeger accepts OTLP directly, so traces are sent over OTLP gRPC instead.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: microservices
  # `debug` replaces the older `logging` exporter
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
Sample Grafana Dashboard JSON
{
  "title": "Microservices Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Active Instances",
      "type": "stat",
      "targets": [
        {
          "expr": "count(up{job=\"microservices\"} == 1) by (service)"
        }
      ]
    }
  ]
}
RED Method for Microservices
The RED Method focuses on three key metrics:
┌─────────────────────────────────────────────────────────────────────────────┐
│ RED METHOD │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ R - RATE │
│ ───────── │
│ How many requests per second? │
│ │
│ Metric: sum(rate(http_requests_total[5m])) by (service) │
│ │
│ │
│ E - ERRORS │
│ ────────── │
│ How many requests are failing? │
│ │
│ Metric: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) │
│ / sum(rate(http_requests_total[5m])) by (service) │
│ │
│ │
│ D - DURATION │
│ ─────────── │
│ How long do requests take? │
│ │
│ Metric: histogram_quantile(0.99, │
│ sum(rate(http_request_duration_seconds_bucket[5m])) │
│ by (le, service)) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
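To make the three RED numbers concrete outside of PromQL, the sketch below computes them from a small list of request records. The records, the 10-second window, and the `redSummary` helper are hypothetical; the percentile uses the simple nearest-rank method and assumes at least one record.

```javascript
// Hypothetical requests observed during a 10-second window.
const windowSeconds = 10;
const requests = [
  { status: 200, durationMs: 40 },
  { status: 200, durationMs: 55 },
  { status: 500, durationMs: 900 },
  { status: 201, durationMs: 70 },
  { status: 200, durationMs: 45 }
];

function redSummary(records, seconds) {
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  const errors = records.filter((r) => r.status >= 500).length;
  // Nearest-rank percentile over the sorted durations.
  const percentile = (q) => durations[Math.max(0, Math.ceil(q * durations.length) - 1)];
  return {
    ratePerSecond: records.length / seconds, // R
    errorRatio: errors / records.length,     // E
    p50Ms: percentile(0.5),                  // D
    p99Ms: percentile(0.99)
  };
}

const summary = redSummary(requests, windowSeconds);
console.log(summary);
```

The gap between the median and the 99th percentile in this sample shows why duration is tracked as percentiles rather than as an average: one slow request dominates the tail while barely moving the median.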
Log Aggregation with Loki
// observability/lokiTransport.js
const Transport = require('winston-transport');
const axios = require('axios');

class LokiTransport extends Transport {
  constructor(opts) {
    super(opts);
    this.lokiUrl = opts.lokiUrl || 'http://localhost:3100';
    this.labels = opts.labels || {};
    this.batchSize = opts.batchSize || 100;
    this.batchInterval = opts.batchInterval || 1000;
    this.batch = [];
    setInterval(() => this.flush(), this.batchInterval);
  }

  log(info, callback) {
    const logEntry = {
      // Nanoseconds, built as a string because the numeric value exceeds Number.MAX_SAFE_INTEGER
      ts: `${Date.now()}000000`,
      line: JSON.stringify(info)
    };
    this.batch.push(logEntry);
    if (this.batch.length >= this.batchSize) {
      this.flush();
    }
    callback();
  }

  async flush() {
    if (this.batch.length === 0) return;
    const entries = this.batch;
    this.batch = [];
    const payload = {
      streams: [
        {
          stream: this.labels,
          values: entries.map(e => [String(e.ts), e.line])
        }
      ]
    };
    try {
      await axios.post(`${this.lokiUrl}/loki/api/v1/push`, payload, {
        headers: { 'Content-Type': 'application/json' }
      });
    } catch (error) {
      console.error('Failed to send logs to Loki:', error.message);
      // Re-add to batch for retry
      this.batch = entries.concat(this.batch);
    }
  }
}

module.exports = { LokiTransport };
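The body sent to `/loki/api/v1/push` groups log lines into streams, where each stream is identified by its label set and each value is a pair of a nanosecond timestamp string and the log line. The sketch below builds such a payload as a pure function so its shape can be inspected without a running Loki. `buildLokiPayload` and the sample entry are hypothetical.

```javascript
// Build a push payload for a single stream from millisecond-timestamped entries.
function buildLokiPayload(labels, entries) {
  return {
    streams: [
      {
        stream: labels,
        // Append six zeros to turn milliseconds into a nanosecond string.
        values: entries.map((entry) => [`${entry.timeMs}000000`, JSON.stringify(entry.fields)])
      }
    ]
  };
}

const payload = buildLokiPayload(
  { service: 'order-service', environment: 'production' },
  [{ timeMs: 1705314600000, fields: { level: 'info', message: 'Order created' } }]
);
console.log(JSON.stringify(payload, null, 2));
```

Keeping the label set small and static (service, environment) and leaving request-specific fields inside the JSON line avoids creating a new stream per request.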
Interview Questions
Q1: What are the three pillars of observability?
Answer:
- Logs: Discrete events with rich context
  - What happened at specific moments
  - Good for debugging
  - Structured JSON format preferred
- Metrics: Numeric time-series data
  - System performance indicators
  - Good for alerting and dashboards
  - Examples: request rate, error rate, latency
- Traces: Request path through system
  - How requests flow across services
  - Latency breakdown per component
  - Good for debugging distributed issues
Q2: Explain distributed tracing and its components
Answer:
Components:
- Trace: Complete request journey
- Span: Single operation within trace
- Context: Trace ID + Span ID propagated across services
How it works:
- First service creates trace ID
- Each operation creates a span with parent reference
- Context propagated via headers
- All spans collected and visualized
Propagation headers:
- traceparent: W3C standard trace context
- X-Trace-ID: Trace identifier
- X-Span-ID: Current span
Q3: What is the RED method?
Answer:
RED = Rate, Errors, Duration
- Rate: Requests per second
  - Indicates load on service
  - Alert on unusual spikes/drops
- Errors: Failed requests ratio
  - Track 4xx and 5xx separately
  - Alert when error rate exceeds threshold
- Duration: Request latency (p50, p95, p99)
  - User experience indicator
  - Alert on latency degradation
Why it works:
- Simple, focused metrics
- Covers most service health scenarios
- Easy to implement and understand
Summary
Key Takeaways
- Three pillars: Logs, Metrics, Traces
- Use structured logging with correlation IDs
- OpenTelemetry for vendor-neutral instrumentation
- Prometheus for metrics, Grafana for visualization
- RED method for service health
Next Steps
In the next chapter, we’ll cover Security - authentication, authorization, and secure communication.