Observability
In distributed systems, understanding what’s happening across services is crucial. Observability gives you visibility into your system through three pillars: Logs, Metrics, and Traces.
In a monolith, a single stack trace tells you almost everything you need to debug a problem. In microservices, that same problem might involve 12 services, 3 databases, and 2 message brokers - and no single log file contains the full picture. Observability is not a “nice to have” in microservices; it is the cost of entry. You cannot operate what you cannot see, and once you move past 3-4 services, every production incident becomes a distributed detective case. The difference between a 15-minute outage and a 15-hour outage is almost always the quality of your observability.
- Understand the three pillars of observability
- Implement structured logging
- Set up distributed tracing with OpenTelemetry
- Configure metrics with Prometheus
- Build dashboards with Grafana
Three Pillars of Observability
Before diving into implementation, it helps to understand why these three pillars exist and why you need all three. Logs, metrics, and traces are not redundant - they answer fundamentally different questions. Logs tell you what specifically happened in a given moment (rich context, high detail, expensive to store at volume). Metrics tell you how the system is behaving in aggregate (cheap, fast, perfect for alerting, but with little context). Traces tell you how a single request flowed through the system (showing causality and latency attribution across service boundaries).
Relying on only one pillar is a common mistake. Metrics-only monitoring tells you “error rate is up” but not why. Logs-only observability means drowning in data during an incident with no way to see the forest. Traces without metrics mean you have no alerting. The three pillars compose: metrics fire an alert, traces narrow down the failing call path, and logs provide the root-cause detail. Each pillar has different storage economics and query patterns, which is why mature systems use different backends for each (Prometheus for metrics, Jaeger/Tempo for traces, Loki/Elasticsearch for logs).
Structured Logging
Logs should be structured (JSON) for easy parsing and querying. The shift from plaintext to structured logging is one of the most impactful changes a team can make. Plaintext logs work fine when a human reads one log at a time, but they fall apart when machines need to parse them - and in microservices, machines are the primary readers. Log aggregators like Loki, Elasticsearch, and Splunk index structured fields (service, correlation_id, duration_ms, user_id) so you can run queries like “show me all errors for user X across all services in the last 5 minutes.” With plaintext, you’d need regex gymnastics that break the moment someone changes a log message. Structured logging also forces discipline: every field has a name and type, which makes logs self-documenting and prevents the “log rot” where messages drift over time.
The alternative - unstructured printf-style logging - works in development but becomes unusable at production scale. Imagine grepping through 500 GB of logs per day across 40 services, where each service has its own ad-hoc format. Structured logs with a consistent schema (service, level, timestamp, correlation_id, message, and domain-specific fields) turn that chaos into a queryable dataset. The tradeoff is verbosity: JSON is harder for humans to read than plain text. The standard solution is to use pretty-printed colorized output in development and JSON in production, which is exactly what the examples below do.
Logger Implementation
Below is a production-grade logger that centralizes three critical concerns: a consistent JSON format for machine consumption, default metadata (service name, environment, version, hostname) attached to every log line, and the concept of “child loggers” that carry request-scoped context. The child logger pattern is the key to correlating logs across a single request - rather than passing context manually into every log call, you create a child logger once per request and use it everywhere. Without this pattern, you’ll end up either polluting every function signature with a context object or losing correlation context entirely.
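A minimal Python sketch of the child-logger pattern, using only the standard library `logging` module. The names `JsonFormatter`, `ChildLogger`, and `create_logger`, and the `APP_ENV` / `APP_VERSION` environment variables, are illustrative assumptions rather than a fixed API; the development-mode pretty printer is left out for brevity.

```python
import json
import logging
import os
import socket
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, defaults):
        super().__init__()
        self.defaults = defaults  # service-wide metadata added to every line

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            **self.defaults,
            # Request-scoped fields carried by a child logger, if any.
            **getattr(record, "context", {}),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


class ChildLogger(logging.LoggerAdapter):
    """Adapter that attaches a fixed context dict to every record."""

    def process(self, msg, kwargs):
        extra = kwargs.setdefault("extra", {})
        # Per-call context wins over the logger's own context on key clashes.
        extra["context"] = {**self.extra, **extra.get("context", {})}
        return msg, kwargs

    def child(self, **context):
        # Derive a new logger whose context extends the parent's.
        return ChildLogger(self.logger, {**self.extra, **context})


def create_logger(service, stream=sys.stdout):
    defaults = {
        "service": service,
        "environment": os.getenv("APP_ENV", "development"),  # assumed variable name
        "version": os.getenv("APP_VERSION", "unknown"),  # assumed variable name
        "hostname": socket.gethostname(),
    }
    base = logging.getLogger(service)
    base.setLevel(logging.INFO)
    base.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(defaults))
    base.handlers = [handler]
    return ChildLogger(base, {})
```

A request handler would call `log = create_logger("order-service")` once at startup and `request_log = log.child(correlation_id=...)` once per request; every line written through `request_log` then carries the correlation ID without it being passed to each call.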
Request Logging Middleware
Correlation IDs are the single most important thing you can add to your logs. Every incoming request gets a unique ID that flows through every log line, every downstream call, and every database query related to that request. When a user complains “my order failed at 3:42 PM,” you search one correlation ID and see the full story across all services. Without correlation IDs, you’re reduced to grep-by-timestamp, which is meaningless when you’re processing 10,000 requests per second. The middleware below generates a correlation ID on the way in (or accepts one from an upstream caller) and attaches a child logger to the request so every subsequent log in that request lifecycle carries the context automatically.
If you skip this middleware and rely on ad-hoc logging, debugging a single user’s complaint becomes a scavenger hunt across service boundaries. Each service has its own logs, each with its own IDs, and joining them requires human pattern-matching across timestamps. The cost of adding correlation IDs is tiny (a UUID and a header); the cost of not having them compounds with every incident.
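One way to sketch this middleware in Python is as plain WSGI middleware, so it does not depend on any particular framework. The header name `X-Correlation-ID` and the `app.correlation_id` / `app.logger` environ keys are assumptions chosen for illustration.

```python
import logging
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name
base_logger = logging.getLogger("http")


class CorrelationIdMiddleware:
    """WSGI middleware that assigns one correlation ID per request."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Reuse the upstream caller's ID when present, otherwise mint a new one.
        correlation_id = environ.get("HTTP_X_CORRELATION_ID") or str(uuid.uuid4())
        environ["app.correlation_id"] = correlation_id
        # Request-scoped logger: every record gets a correlation_id attribute.
        environ["app.logger"] = logging.LoggerAdapter(
            base_logger, {"correlation_id": correlation_id}
        )

        def start_with_header(status, headers, exc_info=None):
            # Echo the ID back so clients can quote it in bug reports.
            headers = list(headers) + [(CORRELATION_HEADER, correlation_id)]
            return start_response(status, headers, exc_info)

        return self.app(environ, start_with_header)


def demo_app(environ, start_response):
    """Tiny WSGI app used only to show the middleware in action."""
    environ["app.logger"].info("handling request")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["app.correlation_id"].encode()]
```

Wrapping the application once (`app = CorrelationIdMiddleware(demo_app)`) is enough; handlers read the logger from the request instead of building their own.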
Propagating Context to Downstream Services
Correlation IDs are worthless if they stop at the first service. The whole point is end-to-end traceability, which means every outbound HTTP call must carry the correlation headers forward. A traced HTTP client wraps your normal client (axios, httpx, etc.) and automatically injects the headers for you, so developers never have to remember to pass them manually. This is a place where a small abstraction pays huge dividends: if you let each service author write their own HTTP calls, correlation IDs will be forgotten 30% of the time and your trace graphs will have mysterious gaps.
The tradeoff is coupling - every outbound call goes through a shared helper - but in practice this is a good kind of coupling. It enforces a cross-cutting concern (observability) in one place rather than 300. If you ever need to add a new header (tenant ID, feature flags, experiment IDs), you change one file instead of hunting every HTTP call in the codebase.
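As a rough illustration of the shared-helper idea, the sketch below builds outbound requests with the standard library's `urllib.request` and copies the propagated headers from the current request context. The header names in `PROPAGATED_HEADERS` and the helper names are hypothetical; a real service would more likely wrap httpx or requests in the same way.

```python
import urllib.request
import uuid

# Headers copied from the inbound request to every outbound call (assumed names).
PROPAGATED_HEADERS = ("X-Correlation-ID", "X-Tenant-ID")


def build_traced_request(url, context, method="GET", data=None, headers=None):
    """Return a urllib Request carrying the caller's correlation headers."""
    merged = dict(headers or {})
    for name in PROPAGATED_HEADERS:
        if name in context:
            merged.setdefault(name, context[name])
    # If this call starts a new chain, mint an ID instead of sending none.
    merged.setdefault("X-Correlation-ID", str(uuid.uuid4()))
    return urllib.request.Request(url, data=data, headers=merged, method=method)


def traced_call(url, context, timeout=5.0, **kwargs):
    """Send the request and return (status, body)."""
    request = build_traced_request(url, context, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

Adding another propagated header later means extending `PROPAGATED_HEADERS` in this one module, which is the single point of change the paragraph above argues for.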
Distributed Tracing with OpenTelemetry
OpenTelemetry provides vendor-neutral instrumentation. Before OpenTelemetry, distributed tracing was a vendor lock-in nightmare. You’d instrument your code for Jaeger, and if you later switched to Datadog or Honeycomb, you’d rewrite every span. OpenTelemetry (OTel) emerged from the merger of OpenTracing and OpenCensus around 2019 precisely to solve this - a single instrumentation API, with pluggable exporters for every major backend. The practical implication is that you instrument once and decide later where traces go: Jaeger, Zipkin, Tempo, Datadog APM, New Relic, Honeycomb, Lightstep. This is the correct layer of abstraction: your application code should not know or care which SaaS monitoring vendor you use this quarter. The conceptual model is simple but powerful. A trace represents one end-to-end request. A span represents one unit of work within that trace (a database query, an HTTP call, a function execution). Spans have parent-child relationships that reconstruct causality - “this payment span was caused by the create-order span, which was caused by the HTTP POST to /orders.” Spans carry attributes (key-value metadata) and events (timestamped annotations). Context propagation via W3C Trace Context headers (traceparent, tracestate) ties it all together across process boundaries. Skip propagation and your traces break into disconnected fragments, one per service - useless for debugging cross-service latency.
OpenTelemetry Setup
Setting up OpenTelemetry in a microservice involves three main pieces: a Resource (metadata identifying the service), one or more Exporters (where traces and metrics go), and Instrumentation (code that creates spans for you automatically). The auto-instrumentation libraries are where most of the magic happens - they hook into your HTTP framework, database driver, and Redis/Kafka client and emit spans without you writing a line of tracing code. This is the correct default: 90% of your observability needs are covered by auto-instrumentation, and you only add manual spans for business-specific operations (like “validate coupon” or “calculate shipping”).
A subtle but important consideration: in the recommended setup, your application sends data over OTLP to an OpenTelemetry Collector rather than directly to your backend. The Collector acts as a buffer and translator - it can receive OTLP, batch it, add metadata, sample it, and forward it to Jaeger, Prometheus, Datadog, etc. Running the Collector as a sidecar or DaemonSet decouples your app from backend choice and handles backpressure. Skipping the Collector and exporting directly to a backend works but couples your app to that vendor and makes it harder to add processing later.
Manual Instrumentation
Auto-instrumentation covers your HTTP, database, and cache calls - but your most interesting business logic isn’t in those layers. You want spans that say “validate coupon” or “calculate tax” or “send notification,” because those are the operations that show up in latency breakdowns and tell you which business logic is slow. Manual instrumentation is where you mark up meaningful units of work in your own code. The rule of thumb: wrap any operation that takes more than ~5ms, any operation that has a chance of failing in an interesting way, and any operation whose timing matters for debugging.
The helper below encapsulates three best practices: set span status to OK on success, set status to ERROR and record the exception on failure, and always end the span (even on exceptions) so you don’t leak spans. Forgetting to set error status is the most common mistake - your trace viewer will show the span as “successful” even though the operation threw. Recording exceptions is what gives you the stack trace inside the trace, which is what turns a trace into a debugging tool rather than just a latency chart.
Context Propagation
Context propagation is the invisible plumbing that makes distributed tracing work at all. When Service A calls Service B over HTTP, OpenTelemetry injects `traceparent` and `tracestate` headers on the outgoing request. Service B’s auto-instrumentation reads these headers and creates a child span under the same trace ID. The result: one unified trace, spanning multiple services, with correct parent-child relationships. If propagation breaks (because a client doesn’t inject, or a server doesn’t extract), you get orphaned traces - each service has its own mini-trace, and you cannot see the causality between them.
The OTel auto-instrumentation for HTTP clients (axios, httpx, requests) handles this for you transparently - you don’t usually need to write propagation code yourself. The example below shows the manual API for cases where you have a custom transport (a message queue, a gRPC client without auto-instrumentation, a WebSocket) and need to inject/extract context yourself. For message queues specifically, you must manually propagate because OTel can’t auto-inject into message payloads - you inject the trace context into message headers/metadata on the producer side and extract it on the consumer side.
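To make the inject/extract step concrete, here is a standard-library sketch that writes and reads a W3C `traceparent` value in a message's header dictionary, as a producer and consumer on a queue might. It hand-rolls the header format purely for illustration; the function names are assumptions, and in a real service the OpenTelemetry propagation API would do this work.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id():
    """Random 8-byte span ID rendered as 16 hex characters."""
    return secrets.token_hex(8)


def inject(trace_id, span_id, headers, sampled=True):
    """Producer side: write the current span's identity into message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Consumer side: return (trace_id, parent_span_id, sampled), or None."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid in the W3C format
    return trace_id, parent_id, (int(flags, 16) & 0x01) == 1
```

The consumer starts its own span with a fresh `new_span_id()`, the extracted trace ID, and the extracted span ID as parent, which is what keeps the two halves in a single trace.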
Metrics with Prometheus
Metrics Implementation
Prometheus won the metrics war around 2016-2018 for a specific reason: its pull-based model and multidimensional data model fit microservices perfectly. Instead of your services pushing metrics to a central server (which couples them to the collector and causes backpressure problems), Prometheus scrapes `/metrics` endpoints on a schedule. This means a service that crashes simply stops being scraped - no lost metrics, no retry queues. The multidimensional model (`metric_name{label="value"}`) lets you slice and dice without pre-aggregating, which is essential when you don’t know in advance which dimension you’ll need to investigate.
The four core metric types - Counter, Gauge, Histogram, Summary - each solve a specific problem. Counters monotonically increase (requests served, errors encountered) and are used with rate() in PromQL to compute per-second rates. Gauges go up and down (active connections, memory usage, queue depth) and represent instantaneous values. Histograms bucket observations (request duration, payload size) and let you compute percentiles server-side via histogram_quantile(). Summaries compute percentiles client-side, which is cheaper at query time but loses the ability to aggregate across instances - prefer Histograms unless you specifically need client-side percentiles. Picking the wrong type is the most common metric mistake: gauges for counters make your dashboards wrong, and summaries for multi-instance services make your percentiles meaningless.
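To show why histograms can be aggregated and still yield percentiles, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the matching bucket, which is the general approach behind `histogram_quantile()`. It is a simplified model under stated assumptions (sorted buckets, a final `+Inf` bucket, lowest bucket starting at 0), not Prometheus's actual implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with a final (math.inf, total_count) entry.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Cannot interpolate into infinity: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    raise ValueError("buckets must end with a +Inf bucket holding the total")


# Request durations in seconds: 50 <= 0.1s, 90 <= 0.5s, 99 <= 1.0s, 100 total.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
```

Because bucket counts are plain counters, the buckets from several instances can be summed before calling the function, which is exactly what a client-side summary cannot offer. The result is only as precise as the bucket boundaries allow.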
Scenario: Your Prometheus cluster OOMs every Tuesday around 11am. On-call finds a new service was deployed Monday that added `user_email` as a metric label. How do you diagnose, mitigate, and prevent recurrence?
- Confirm the hypothesis with data, not intuition. Query `prometheus_tsdb_symbol_table_size_bytes` and `prometheus_tsdb_head_series` to see the active series count over time. Plot it against the deploy timeline - if series count doubled at 4:03pm Monday when the deploy landed, you have your culprit.
- Identify the offending metric. Run `topk(20, count by (__name__)({__name__=~".+"}))` to find metrics with the most series. Then `count by (label_name)({__name__="suspect_metric"})` to find which label is blowing up cardinality.
- Mitigate immediately. Drop the offending label at scrape time via Prometheus `metric_relabel_configs` with a `labeldrop` action. This is a live config push - no code change needed, no deploy, and Prometheus stops ingesting the bad label within one scrape interval.
- Fix the root cause in code. Open a PR that replaces `user_email` with either a normalized bucket (`user_tier: "free"|"paid"|"enterprise"`) or removes the label and moves `user_email` to span attributes and logs.
- Install a guardrail. Add a CI check that greps Dockerfiles / code for the high-cardinality label anti-pattern, and a Prometheus alert on `prometheus_tsdb_head_series > 3_000_000` that pages before the OOM.
query_id label - ingestion cost jumped 15x overnight before they caught it with cardinality alerts.
Senior Follow-up Questions:
- “Why not just give Prometheus more RAM?” Scaling vertically is a bandage - the next PR adds another label and you OOM again at double the cost. Cardinality is unbounded in both dimensions; RAM is not.
- “What if the business genuinely needs per-user latency?” That is a tracing use case, not metrics. Keep 100% error traces plus tail-sampled slow traces; users can query Jaeger/Tempo by `user.id`. Metrics summarize; traces particularize.
- “How do you catch this in review before it ships?” Add a lint rule (e.g., a `promtool check metrics` step or a custom AST linter) that flags any metric definition whose label list includes fields matching a high-cardinality deny list: `*_id`, `email`, `path`, `query`, `url`.
- “Just increase Prometheus memory.” Does not fix the growth; you defer the outage by one deploy. Also, Prometheus queries get linearly slower with series count even if it does not OOM.
- “Switch to Thanos/Cortex/Mimir.” These help with long-term storage and horizontal scaling, but they still charge (in money or RAM) per series. Cardinality discipline is orthogonal to the backend.
- Prometheus docs: Cardinality and dimensionality
- Grafana Labs: “How to tame the cardinality beast”
- Datadog: “Controlling custom metrics usage”
Scenario: Your p99 latency alert fires 50 times a day. The on-call engineer now acks every page in under 3 seconds without looking. An outage happens and the real alert is missed. What is the root cause and how do you fix it?
- Name the actual problem: alert fatigue. When alerts fire constantly, humans adapt by ignoring them - this is well-documented in aviation safety research and applies directly to on-call. The alert is the bug, not the on-call behavior.
- Audit alert-to-action ratio. For each alert type, ask: “In the last 30 days, how many fired, and how many required human action?” If a `HighLatency` alert fired 1,500 times and 0 required action, it is a false-positive generator. Delete or silence it.
- Replace threshold alerts with SLO burn-rate alerts. Instead of “p99 over 1 second for 5 minutes,” use a multi-window multi-burn-rate alert that pages when error budget consumption rate implies the monthly SLO will be breached. This fires only when the issue is large enough to matter.
- Separate tickets from pages. `ticket`-severity routes to a Jira queue for next-day triage. `page`-severity wakes someone. Right now, every latency blip is a page - most should be tickets.
- Apply the “three strikes” rule. If an alert fires and is not actionable, open a PR to fix, tune, or delete it before the next on-call shift. Untuned alerts accumulate until on-call becomes unbearable.
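The multi-window burn-rate idea from the list above can be sketched in a few lines of Python. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); an alert fires only when both a long and a short window exceed the threshold. The window pairs and thresholds below follow the commonly cited scheme for a 30-day SLO and are a starting assumption, not a rule; in practice this logic would live in Prometheus alert rules rather than application code.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


# (long window, short window, burn-rate threshold, action)
# 14.4 = 2% of a 30-day budget in 1 hour; 6 = 5% in 6 hours; 1 = 10% in 3 days.
POLICIES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]


def evaluate(error_ratios, slo_target):
    """error_ratios maps a window name to the error ratio observed in it."""
    for long_window, short_window, threshold, action in POLICIES:
        long_burn = burn_rate(error_ratios[long_window], slo_target)
        short_burn = burn_rate(error_ratios[short_window], slo_target)
        # The long window shows the burn is significant; the short window
        # shows it is still happening, so the alert resets quickly afterwards.
        if long_burn > threshold and short_burn > threshold:
            return action
    return None


spike = {"5m": 0.03, "30m": 0.03, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
blip = {"5m": 0.05, "30m": 0.009, "1h": 0.004, "6h": 0.0008, "3d": 0.0002}
```

With a 99.9% SLO, `spike` pages because both the 1-hour and 5-minute windows burn well above 14.4x, while `blip` produces nothing: a short burst that has not consumed meaningful budget never reaches a human.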
- “How do you define the right SLO to set burn-rate alerts against?” Work backwards from customer experience - `99.9% of checkout requests under 500ms` is a real SLO; “p99 under 1s” is a symptom proxy. The SLO must reflect what users actually care about.
- “What if leadership wants to keep the noisy alert ‘just in case’?” Push back with data: show alerts-per-incident-caught ratio. A 1,500:0 ratio is not “just in case,” it is a distraction mechanism that actively reduces safety.
- “How do you handle the first week after disabling an alert, just in case it catches something?” Route it to Slack with no paging for two weeks while you monitor. If it never catches anything real, delete it. If it catches something, re-tune and then re-enable paging.
- “Rotate the on-call more often so no one burns out.” This treats the symptom; the alert is still useless and the next on-call ignores it too.
- “Just add more engineers to the rotation.” Same distribution of noise across more people — total wasted human time stays constant.
- Google SRE Book: Chapter 6 — “Monitoring Distributed Systems” and Chapter 4 — “Service Level Objectives”
- Google SRE Workbook: “Alerting on SLOs” — the canonical reference for multi-window burn-rate alerts
- Charity Majors, Honeycomb: “Observability, not just monitoring”
Scenario: You are using 1% head-based trace sampling to control cost. After an incident, you find that the two slow requests that caused the outage were not sampled and you have no trace data for them. What went wrong and how do you redesign sampling?
- Identify the root problem: head-based sampling decides before the outcome is known. At the entry point, the tracer rolls a die: keep or drop. By the time you know the request was slow or failed, you have already dropped it. Head-based sampling optimizes for the average case and is blind to the important cases.
- Switch to tail-based sampling for incident-critical cases. Tail-based sampling buffers spans until the trace finishes, then decides. This lets you keep 100% of: errors (any 5xx or exception), slow requests (above p99 latency threshold), and traces touching specific high-value flows (checkout, payment, login).
- Use a layered strategy. 100% errors + 100% slow + head-sample 1% of the rest. This gives you the debuggable tail without the full cost of 100% sampling.
- Implement via OpenTelemetry Collector, not in-app. The `tail_sampling` processor in the OTel Collector buffers spans for a configurable decision window (30 seconds by default) and applies policies (status, latency, attributes). Keeping sampling out of the app means no code changes per service.
- Track your sampling-to-incident ratio. After every incident, ask: “Did we have traces for the bad requests?” If the answer is “no” more than once a quarter, the sampling policy needs tuning.
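The layered strategy above (keep all errors, keep all slow traces, keep a small fraction of the rest) boils down to a decision made once the whole trace is available. The Python sketch below models that decision; the span dictionary shape, the 1000 ms threshold, and the function name are assumptions for illustration, and in production the Collector's tail-sampling policies would express the same rules in configuration.

```python
import hashlib


def keep_trace(trace_id, spans, slow_threshold_ms=1000.0, baseline_rate=0.01):
    """Decide, after the trace has finished, whether to keep it.

    spans: list of dicts with "status" ("ok" or "error") and "duration_ms".
    """
    if not spans:
        return False
    # Layer 1: keep every trace that contains an error.
    if any(span["status"] == "error" for span in spans):
        return True
    # Layer 2: keep every slow trace; the longest span stands in for the root span.
    if max(span["duration_ms"] for span in spans) >= slow_threshold_ms:
        return True
    # Layer 3: keep a fixed fraction of the remaining traces. Hashing the trace ID
    # makes the verdict deterministic, so every replica agrees on the same trace.
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") < baseline_rate * 2**64
```

The first two layers are what head-based sampling cannot do: they depend on the outcome of the request, which is only known after the spans have been collected.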
- “What is the memory cost of tail-based sampling?” The Collector must buffer every trace until its decision window closes (typically 10-30s). For a service doing 10k RPS, a 10-second window holds roughly 100k traces; at 5-10 spans of about 1 KB each, that is on the order of 0.5-1 GB, and a 30-second window triples it. Size the Collector accordingly or use a dedicated tail-sampling gateway.
- “How do you handle the memory pressure when traffic spikes?” Two strategies: shed sampling decisions (drop to head-based during pressure) or shard by trace ID across multiple Collector replicas so each sees a consistent subset of traces.
- “What about PII in span attributes?” Apply a span processor before export that redacts `http.request.body`, `db.statement` (when it contains parameters), and any attribute matching a PII deny list. Redact at the edge, never at the backend, because the backend may be a third-party SaaS.
- “Just sample 100% and pay more.” For services above a few thousand RPS, 100% sampling costs are prohibitive (often $50k+/month on Datadog APM), and the vast majority of spans are never looked at. The right answer is smarter sampling.
- “Increase head-based sample rate to 10%.” You still miss about 90% of the problem requests; you just pay 10x for the privilege of missing most of them.
- OpenTelemetry Collector: Tail Sampling Processor docs
- Honeycomb: “Sampling at the edge” by Liz Fong-Jones
- Uber Engineering: “Evolving Distributed Tracing at Uber”
Metrics Endpoint
Prometheus scrapes your service by polling a `/metrics` endpoint at a configurable interval (typically every 15 seconds). You must expose this endpoint on every service, and it must be cheap to serve - Prometheus will hit it thousands of times per day per replica. The endpoint returns a plaintext format that Prometheus parses. Exposing it on the same port as your API is simplest for development, but in production you often expose it on a separate port (9090) that isn’t publicly routable, so your metrics aren’t leaked to the internet and aren’t subject to auth middleware that would break Prometheus scraping.
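To show what the endpoint actually returns, the sketch below renders one counter in the Prometheus text exposition format and serves it with the standard library's `http.server`. It is a hand-rolled illustration under simplifying assumptions (a single counter, no locking, hypothetical function names); a real service would normally rely on an official Prometheus client library rather than formatting the output itself.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# (method, status) -> count. Not thread-safe; fine for a single-threaded sketch.
REQUESTS_TOTAL = {}


def record_request(method, status):
    """Increment the counter for one handled request."""
    key = (method, str(status))
    REQUESTS_TOTAL[key] = REQUESTS_TOTAL.get(key, 0) + 1


def render_metrics():
    """Render the counter in the plaintext format Prometheus parses."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), count in sorted(REQUESTS_TOTAL.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve_metrics(port=9090):
    """Serve /metrics on its own port, separate from the public API."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Serving the handler from its own port keeps it away from API auth middleware, which matches the production layout described above.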
Prometheus Configuration
Prometheus configuration is declarative YAML: you tell it what to scrape, how often, and under what labels. The two common service discovery modes are static (hardcoded list of targets) and kubernetes_sd (auto-discover pods via Kubernetes API). The Kubernetes discovery pattern is especially powerful - it uses pod annotations (prometheus.io/scrape: "true") to decide what to scrape, so new services automatically appear in Prometheus as soon as they’re deployed with the right annotations. Without auto-discovery, adding a new microservice requires a manual Prometheus config change, which becomes a coordination bottleneck as teams grow.
Alert Rules
Alerts are the difference between observability and paging. Good alerts wake people up only when action is required; bad alerts create alert fatigue where engineers start ignoring pages. The rules below embody two important practices: the `for: 5m` clause ensures an alert only fires if the condition persists for 5 minutes (avoiding flaps from transient blips), and severity labels feed routing rules in Alertmanager so critical alerts go to PagerDuty at 2 AM while warnings go to Slack and wait until morning. Every alert should answer three questions: what is wrong, what is the business impact, and what should the on-call do? Summary and description fields exist to answer exactly these.
Grafana Dashboards
Docker Compose for Observability Stack
The full observability stack involves several moving parts that work together: Prometheus scrapes and stores metrics, Grafana visualizes, Jaeger stores traces, Loki stores logs, and the OpenTelemetry Collector receives OTLP data and routes it to the right backend. Running this locally via Docker Compose is a great way to understand the data flow - instrument one small service and watch metrics, traces, and logs flow into their respective backends. In production you’d run each of these as its own deployment (or use a managed service like Grafana Cloud, Datadog, or New Relic), but the data flow pattern is the same.
OpenTelemetry Collector Configuration
The OTel Collector config uses a pipeline model: receivers take data in, processors transform it, exporters send it out. The `batch` processor is essential - it groups spans before export to reduce network overhead (hundreds of tiny exports become one big export). The `memory_limiter` protects the Collector from OOM when it’s overwhelmed by more data than it can export. Splitting pipelines by signal type (traces, metrics, logs) lets you have different processing and different destinations per signal. Misunderstanding this pipeline is a common operational surprise: a misconfigured exporter in the trace pipeline can cause silent data loss, and you only notice hours later when you look for traces that should be there.
Sample Grafana Dashboard JSON
RED Method for Microservices
The RED method focuses on three key metrics: Rate, Errors, and Duration. It was popularized by Tom Wilkie of Grafana Labs as a pragmatic minimum viable monitoring approach for request-driven services. The insight is that for any service that serves requests, 99% of operational health can be summarized by these three numbers. Rate tells you whether the service is under load. Errors tell you whether it’s serving those requests successfully. Duration tells you whether users are happy with the response time. If all three look healthy, the service is probably fine from the user’s perspective, even if CPU is 80% or memory is 70%.
Log Aggregation with Loki
Loki takes a fundamentally different approach to logs than Elasticsearch/Splunk: it indexes only labels (service, level, pod), not the log content itself. This makes it 10-100x cheaper to run at scale, because indexing log bodies is what makes Elasticsearch expensive. The tradeoff is that full-text search over log contents is slower than in Elasticsearch - you compensate by narrowing down with labels first (service + time range) and then grepping the matched log lines. For most microservices use cases (find all errors in service X in the last hour, correlate with a trace ID), this pattern works great and saves huge amounts of infrastructure cost.
The transport below batches log entries and pushes to Loki’s HTTP API. Batching is essential - without it, every log line triggers an HTTP round-trip, which would saturate your service’s I/O and slow down every request. The 100-entry batch size with a 1-second flush interval is a typical tradeoff: you trade up to 1 second of log lag for 100x throughput. In practice you’d usually use a sidecar agent (Promtail, Fluent Bit, Vector) to avoid running this logic inside your application process at all - the app writes to stdout in JSON, the sidecar ships to Loki.
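A minimal Python sketch of such a batching transport, assuming Loki's JSON push endpoint (`/loki/api/v1/push`) and its `streams` / `values` payload shape. The class and function names are hypothetical, and the sender is injected so the batching logic can be exercised without a running Loki. For simplicity the age check only runs when a new line arrives; a production transport would also flush from a timer and on shutdown.

```python
import json
import time
import urllib.request


class LokiBatcher:
    """Collect log lines and hand them to `send` in batches."""

    def __init__(self, labels, send, max_entries=100, max_age_s=1.0, clock=time.monotonic):
        self.labels = labels          # stream labels, e.g. {"service": "orders"}
        self.send = send              # callable taking one push payload (dict)
        self.max_entries = max_entries
        self.max_age_s = max_age_s
        self.clock = clock
        self.entries = []
        self.oldest = None            # clock reading when the batch was started

    def add(self, line):
        if not self.entries:
            self.oldest = self.clock()
        # Each value is [timestamp in nanoseconds as a string, log line].
        self.entries.append([str(time.time_ns()), line])
        too_big = len(self.entries) >= self.max_entries
        too_old = self.clock() - self.oldest >= self.max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self):
        if not self.entries:
            return
        payload = {"streams": [{"stream": self.labels, "values": self.entries}]}
        self.entries = []
        self.send(payload)


def http_sender(base_url, timeout=5.0):
    """Build a sender that POSTs payloads to a Loki instance at base_url."""
    def send(payload):
        request = urllib.request.Request(
            f"{base_url}/loki/api/v1/push",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(request, timeout=timeout).close()
    return send
```

Keeping `labels` small and fixed (service, level) matters here for the same reason as in Prometheus: Loki indexes labels, so high-cardinality values belong in the log line itself.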
Interview Questions
Q1: What are the three pillars of observability?
- Logs: Discrete events with rich context
  - What happened at specific moments
  - Good for debugging
  - Structured JSON format preferred
- Metrics: Numeric time-series data
  - System performance indicators
  - Good for alerting and dashboards
  - Examples: request rate, error rate, latency
- Traces: Request path through system
  - How requests flow across services
  - Latency breakdown per component
  - Good for debugging distributed issues
Q2: Explain distributed tracing and its components
- Trace: Complete request journey
- Span: Single operation within trace
- Context: Trace ID + Span ID propagated across services
- First service creates trace ID
- Each operation creates a span with parent reference
- Context propagated via headers
- All spans collected and visualized
- traceparent: W3C standard trace context
- X-Trace-ID: Trace identifier
- X-Span-ID: Current span
Q3: What is the RED method?
- Rate: Requests per second
  - Indicates load on service
  - Alert on unusual spikes/drops
- Errors: Failed requests ratio
  - Track 4xx and 5xx separately
  - Alert when error rate exceeds threshold
- Duration: Request latency (p50, p95, p99)
  - User experience indicator
  - Alert on latency degradation
- Simple, focused metrics
- Covers most service health scenarios
- Easy to implement and understand
Summary
Key Takeaways
- Three pillars: Logs, Metrics, Traces
- Use structured logging with correlation IDs
- OpenTelemetry for vendor-neutral instrumentation
- Prometheus for metrics, Grafana for visualization
- RED method for service health
Next Steps
Interview Deep-Dive
'It is 2 AM and you get paged for high latency on the checkout flow. You have Prometheus, Grafana, Jaeger, and centralized logs. Walk me through your debugging process step by step.'
'What is cardinality explosion in metrics, and how have you dealt with it in production?'
user_id label to an HTTP request duration metric. With 1 million users, you now have 1 million time series for that single metric. Multiply by the number of methods, paths, and status codes, and you can easily reach 100 million time series. Prometheus will run out of memory and crash.
I have dealt with this in production in three ways. First, never use unbounded values as metric labels. User IDs, request IDs, session tokens, order IDs - these are all unbounded and must never be labels. They belong in logs and traces, not metrics. Metric labels should be low-cardinality: HTTP method (5 values), status code class (5 values: 2xx, 3xx, 4xx, 5xx, other), service name (tens of values), endpoint pattern (tens of values).
Second, normalize URL paths in metrics. If your API has /users/123, /users/456, etc., each unique path becomes a separate label value. I normalize to /users/:id before recording the metric. This is the single most common cause of cardinality explosion in REST API metrics.
Third, set up cardinality monitoring. I run a query like count({__name__=~".+"}) by (__name__) in Prometheus to see which metrics have the most time series. Any metric with more than 10,000 series gets investigated. I also set an alert when the total number of active time series exceeds a threshold relative to Prometheus’s memory allocation.
At a company I worked at, a well-intentioned engineer added a query_template label to database metrics, where each unique SQL query became a label value. With thousands of unique queries (including those with interpolated parameters), Prometheus ingestion rate quadrupled and the monitoring stack went down during a production incident - the exact moment we needed it most. We caught it at 3 AM and spent two hours rolling back the metric definition.
Follow-up: “How do you balance the need for detailed observability with the cost of storing all that data?”
I use a tiered retention strategy. High-resolution metrics (15-second scrape interval) are kept for 2 weeks - this is for active debugging. Downsampled metrics (5-minute averages) are kept for 6 months via Thanos or Cortex - this is for trend analysis. Distributed traces are sampled: I keep 100% of error traces and 1-5% of successful traces, with tail-based sampling for latency outliers (always keep traces above P99 latency). Logs are the most expensive to store, so I use log levels aggressively: DEBUG in development, INFO in production, and ERROR-only retention after 30 days.
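The path-normalization step mentioned in the answer above can be sketched as a small pure function applied before a path is used as a metric label. The placeholder names (`:id`, `:uuid`) and the two patterns are illustrative assumptions; frameworks that expose the matched route template make this unnecessary, and that template is the better label when available.

```python
import re

# (pattern for one path segment, placeholder to substitute)
_SEGMENT_PATTERNS = [
    (re.compile(r"^[0-9]+$"), ":id"),
    (re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$"), ":uuid"),
]


def normalize_path(path):
    """Collapse unbounded path segments so the label stays low-cardinality."""
    path = path.split("?", 1)[0]  # query strings never belong in a label
    segments = []
    for segment in path.split("/"):
        for pattern, placeholder in _SEGMENT_PATTERNS:
            if pattern.match(segment):
                segment = placeholder
                break
        segments.append(segment)
    return "/".join(segments)
```

With this in place, `/users/123` and `/users/456` both record under `/users/:id`, so the number of series for the metric depends on the number of routes rather than the number of users.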
'Compare the RED method and the USE method. When do you apply each, and what gaps does each one have?'