The Three Pillars of Observability
Metrics
Key Metrics to Track (RED Method)
USE Method (Resource-focused)
Metric Types
Distributed Tracing
How Tracing Works
Implementing Tracing
Structured Logging
Log Format Best Practices
Log Levels
Correlation IDs
Alerting
Alert Design Principles
Production-Ready Observability Implementation
Complete observability stack with metrics, logging, and tracing:
- Python
- JavaScript
Alert Template
Alert Severity Levels
Dashboards
Dashboard Design
SLIs, SLOs, and SLAs
Senior Interview Questions
How do you debug a latency spike in production?
- Check dashboards: RED metrics, identify when it started
- Correlate: Deployments? Traffic spike? Dependency issue?
- Trace analysis: Find slow spans in traces
- Log analysis: Search for errors around that time
- Narrow down: Which endpoint? Which users?
Common causes to check:
- Database slow queries (check slow query log)
- GC pauses (check GC metrics)
- Connection pool exhaustion
- External dependency slowdown
- Lock contention
How do you set up monitoring for a new service?
- Instrument code: Add metrics (RED method)
- Add tracing: Propagate trace context
- Structured logging: With correlation IDs
- Create dashboard: Health, golden signals, resources
- Set up alerts: On symptoms, not causes
- Document SLOs: Define success criteria
- Create runbooks: What to do when alerts fire
What's your approach to reducing alert fatigue?
- Alert on symptoms: User impact, not causes
- Use thresholds wisely: alert on a sustained condition (above 80% for 5 minutes) rather than on an instantaneous spike
- Group related alerts: One page per incident, not 10
- Regular review: Delete unused, tune noisy alerts
- Escalation policy: Low-priority → ticket, high → page
- On-call feedback: Track alert quality metrics
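The "sustained condition versus instant spike" point can be illustrated with a small check. This is a minimal sketch, not a real alerting engine: the function name `breaches_sustained`, the one-sample-per-minute spacing, and the sample values are all assumptions made for the example.

```python
from typing import Sequence


def breaches_sustained(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """True only if the most recent `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


# Assumed: one sample per minute, so 5 consecutive samples approximates
# "above 80% for 5 minutes".
spiky = [40, 45, 95, 42, 41, 43]        # a single instantaneous spike
sustained = [60, 82, 85, 88, 91, 90]    # elevated for the whole window

print(breaches_sustained(spiky, 80, 5))      # False
print(breaches_sustained(sustained, 80, 5))  # True
```

The spike never pages anyone, while the sustained breach does, which is the behavior the threshold guidance is aiming for.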
How would you design a metrics system at scale?
- Collection: Agent on each host (Prometheus, StatsD)
- Aggregation: Pre-aggregate at edge (reduce cardinality)
- Storage: Time-series DB (InfluxDB, M3DB, Thanos)
- Query: Federation for cross-cluster queries
- Visualization: Grafana dashboards
Scaling challenges and mitigations:
- High cardinality labels (user_id) → Aggregate
- Long retention → Downsampling
- Many metrics → Drop unused
Interview Deep-Dive Questions
Q1: It is 2 AM and the on-call engineer gets paged: p99 latency for the checkout service has spiked from 200ms to 3 seconds. Walk me through how you would use metrics, logs, and traces together to find the root cause.
- Start with metrics to scope the problem. Check the RED dashboard for the checkout service: is the spike affecting all endpoints or just one? Is it all users or a specific segment? Is the error rate also elevated, or is it purely a latency issue? Correlate with infrastructure metrics (USE method): is CPU saturated? Is memory under pressure (swap usage, GC pauses)? Is there a spike in database connection pool utilization or queue depth?
- Metrics narrow the blast radius. Let’s say you find: p99 latency is up for the /checkout/complete endpoint only, error rate is unchanged, and the database connection pool is at 95% utilization. That tells you the database is likely the bottleneck.
- Now switch to traces. Pull up a few traces from the last 10 minutes where the checkout took longer than 2 seconds. Distributed tracing shows the span breakdown: the checkout-service span is 3 seconds total, of which payment-service took 50ms and inventory-service took 30ms, both normal. But the db.query span inside checkout-service took 2.8 seconds. Open the span attributes: the query is SELECT * FROM orders WHERE user_id = ? AND status = 'pending'.
- Now switch to logs. Search for log entries correlated with the trace ID from one of those slow traces. You find a log line: “Slow query detected: 2800ms, table=orders, missing index on (user_id, status).” Check the deployment log: a migration ran at 1:45 AM that added a new column to the orders table, and the migration accidentally dropped an index.
- Root cause identified: the deployment at 1:45 AM dropped an index, causing a full table scan on a query that previously used the index. Fix: recreate the index. Immediate mitigation: either roll back the migration or run CREATE INDEX CONCURRENTLY on the affected columns.
- The key methodology: metrics tell you WHAT is wrong and WHERE, traces tell you which component in the request path is slow, and logs tell you WHY that component is slow. Using them in this order (metrics first, then traces, then logs) is the most efficient debugging path.
- Example: Stripe’s internal debugging workflow follows exactly this pattern. They call it “start wide, go deep.” Their dashboards show service-level RED metrics with drill-down into per-endpoint metrics, which link directly to example traces for slow requests, which link to correlated logs. An engineer can go from “something is slow” to “this specific query is missing an index” in under 5 minutes.
Q2: Your team has 200 microservices and the monthly observability bill (Datadog/New Relic/etc.) is $150K and growing. The CFO wants you to cut it by 50%. How do you reduce observability costs without going blind?
- Observability costs are driven by three factors: data volume (how many metrics, logs, and traces you ingest), data retention (how long you store it), and data cardinality (how many unique time series your metrics create). Cutting cost means reducing one or more of these without losing the ability to debug production issues.
- Metrics cost reduction: (1) Audit metric cardinality. A single metric with a high-cardinality label (like user_id or request_path with thousands of unique values) can create millions of time series. Replace high-cardinality labels with bucketed versions (e.g., replace request_path=/users/12345 with request_path_group=/users/:id). (2) Drop unused metrics. Query your metrics backend to find metrics that have not been queried in 90 days. If nobody is looking at them, stop collecting them. (3) Reduce collection frequency for low-priority services. Internal batch jobs do not need 10-second resolution metrics; 60-second is fine.
- Logs cost reduction: (1) Reduce log volume at the source. Debug-level logging in production is almost never needed, so set the production log level to INFO or WARN. A single service logging at DEBUG can produce 10x the volume of INFO-level logging. (2) Use sampling for high-volume logs. If your API gateway logs every request, sample 10% of successful requests but keep 100% of errors. (3) Implement log-to-metrics pipelines: instead of storing every “request completed” log line, extract the latency and status code into a metric at the edge, then drop the log line. Metrics are orders of magnitude cheaper to store than logs.
- Traces cost reduction: (1) Use tail-based sampling: keep 100% of error and slow traces, sample 1% of successful traces. This can reduce trace volume by 90% while retaining every trace you would actually need for debugging. (2) Reduce span depth — trace at service boundaries, not at every function call within a service.
- Retention tiers: keep high-resolution data for 7 days (recent incidents), downsample to 1-minute resolution for 30 days, 1-hour resolution for 1 year. Most vendors charge significantly less for lower-resolution or archived data.
- The 50% cut is achievable by combining: metric cardinality audit (20% savings), log level and sampling changes (40% savings on log costs, which are typically the largest line item), trace sampling (30% savings on trace costs), and retention tiering (15% savings across the board).
- Example: Uber reduced their observability costs by 40% by building an internal “metric governance” system that automatically detected and alerted on high-cardinality metrics before they exploded the time series count. They also moved to a tiered storage model where data older than 72 hours was automatically downsampled and moved to cheaper storage.
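The cardinality fix described above (replacing request_path=/users/12345 with request_path_group=/users/:id) can be sketched as a small normalizer. This is an illustrative sketch only: the function name `normalize_path` and the choice to treat purely numeric segments and UUIDs as variable are assumptions, and a real service would more likely take the route template from its web framework.

```python
import re

# Assumed heuristics: numeric IDs and UUIDs are the variable path segments.
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_path(path: str) -> str:
    """Collapse variable segments so the metric label has bounded cardinality."""
    segments = path.split("?", 1)[0].split("/")  # drop the query string first
    normalized = [":id" if _NUMERIC.match(s) or _UUID.match(s) else s for s in segments]
    return "/".join(normalized)


print(normalize_path("/users/12345"))                   # /users/:id
print(normalize_path("/users/12345/orders/987?page=2")) # /users/:id/orders/:id

# 10,000 distinct raw paths collapse into a single label value.
raw_paths = [f"/users/{i}" for i in range(10_000)]
print(len(set(raw_paths)), len({normalize_path(p) for p in raw_paths}))  # 10000 1
```

The last line is the point of the exercise: the number of time series is driven by the count of distinct label values, so normalizing before labeling is what keeps it bounded.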
Q3: What are SLIs, SLOs, and SLAs, and how would you define them for a payment processing service? Be specific with numbers.
- SLI (Service Level Indicator) is the metric itself — a quantitative measurement of one aspect of service quality. SLO (Service Level Objective) is the target value for that SLI — what “good enough” looks like. SLA (Service Level Agreement) is a contractual commitment with consequences (usually financial) if the SLO is breached. The relationship: SLIs measure, SLOs set goals, SLAs have teeth.
- For a payment processing service, I would define these SLIs and SLOs:
- Availability SLI: the proportion of successful payment API requests (non-5xx responses) out of total requests, measured over a rolling 30-day window. SLO: 99.95% (allows 21.6 minutes of downtime per month). Why not 99.99%? Because the payment service depends on external providers (Stripe, banks) that themselves have SLAs around 99.95%, and your SLO cannot meaningfully exceed your dependencies’ reliability.
- Latency SLI: p50 and p99 of payment processing time (from API request received to response sent). SLO: p50 under 300ms, p99 under 2 seconds. The p99 is generous because some payments require 3D Secure verification or bank redirects that add legitimate latency.
- Correctness SLI: the proportion of payments where the amount charged matches the amount requested, measured by daily reconciliation. SLO: 99.999% (1 in 100,000 transactions may have a discrepancy, immediately flagged for investigation). Correctness has a much tighter SLO than availability because a wrong charge erodes trust far more than a brief outage.
- The SLA (external contract with merchants): “Payment API will be available 99.9% of the time per calendar month. If availability falls below 99.9%, affected merchants receive a 10% credit on their monthly processing fees. If below 99.5%, a 25% credit.” Notice the SLA is looser than the SLO: the SLO is 99.95% but the SLA promises 99.9%. The gap between the SLO and the SLA is your safety margin. If you are burning through the SLO error budget, you freeze deployments and focus on reliability before the SLA is breached.
- Error budget: at 99.95% SLO over 30 days, you have approximately 21.6 minutes of allowed downtime. Track this as a “remaining error budget” metric. When the budget is below 50%, prioritize reliability work. When it is below 25%, freeze non-critical deployments.
- Example: Google’s internal SRE practices (documented in their SRE book) emphasize that SLOs should be set based on user expectations, not engineering ambition. Their payments team sets SLOs slightly below what they can actually achieve, so that the error budget is a meaningful lever for prioritizing reliability vs. feature work.
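The downtime figures quoted above follow from simple arithmetic, sketched here. The function names `error_budget_minutes` and `budget_remaining` are made up for this example, and the model assumes availability is measured as wall-clock downtime rather than as a ratio of failed requests.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a rolling window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1.0 - downtime_minutes / error_budget_minutes(slo, window_days)


print(round(error_budget_minutes(0.9995), 1))    # 21.6  (the internal SLO)
print(round(error_budget_minutes(0.999), 1))     # 43.2  (the external SLA)
print(round(budget_remaining(0.9995, 10.8), 2))  # 0.5   (half the budget spent)
```

The difference between the two numbers, roughly 21.6 minutes, is the margin between breaching the internal objective and breaching the contract.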
Interview Questions
1. Explain the difference between a Histogram and a Summary in Prometheus. When would you choose one over the other?
- A Histogram buckets observations into configurable ranges (e.g., 0-10ms, 10-50ms, 50-100ms) and stores a cumulative count per bucket, plus a total sum and count. Quantiles (p50, p95, p99) are computed at query time using histogram_quantile(). A Summary calculates quantiles client-side (inside the application process) and exposes them directly as pre-computed values.
- The critical difference is aggregation. Histograms can be aggregated across multiple instances: if you have 20 pods, you can compute a meaningful p99 across all 20 by summing the bucket counts. Summaries cannot be meaningfully aggregated because you cannot merge pre-computed quantiles from separate processes into a correct global quantile. Averaging p99 values from 20 pods does not give you the true p99.
- In practice, Histograms are almost always the right choice for production services because you always want to aggregate across instances. The only case where Summary wins is if you need exact quantiles from a single process with minimal query-time computation (e.g., a batch job running on a single node where the quantile must be precise, not interpolated from buckets).
- The trade-off with Histograms is choosing good bucket boundaries. If your buckets are too coarse (e.g., 100ms, 500ms, 1s), your p99 estimate between 100ms and 500ms is imprecise. If they are too fine, you generate more time series (each bucket is a separate time series), increasing storage and scrape cost. A good default for HTTP latency is [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10].
- Example: At a company running 50 replicas of an API service, switching from Summary to Histogram for request latency metrics enabled accurate global p99 calculations for the first time. The previous Summary-based dashboards were showing “average of p99 across pods,” which masked a single hot pod with a true p99 of 2 seconds behind the averaged value of 400ms.
- You set up a Histogram with default buckets but your service has a bimodal latency distribution — 90% of requests complete in 5ms and 10% take 800ms due to cache misses. How would you adjust the bucket boundaries, and what would the default buckets get wrong?
- Your Prometheus storage is growing rapidly and you discover that a single Histogram metric with 5 labels is generating 500K time series. Walk me through how you would diagnose and fix this cardinality explosion.
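The aggregation argument in the answer above can be checked numerically. The sketch below is a simplified pure-Python imitation of bucket-based quantile estimation (linear interpolation inside the bucket that contains the target rank), not Prometheus's actual implementation. The bucket bounds, the pod counts, and the names `merge` and `quantile` are invented for the illustration.

```python
import math
from typing import Dict, List

Buckets = Dict[float, int]  # bucket upper bound (seconds) -> cumulative count


def merge(histograms: List[Buckets]) -> Buckets:
    """Sum cumulative bucket counts across instances (assumes identical bounds)."""
    merged: Buckets = {}
    for hist in histograms:
        for upper, count in hist.items():
            merged[upper] = merged.get(upper, 0) + count
    return merged


def quantile(q: float, buckets: Buckets) -> float:
    """Estimate a quantile by linear interpolation within the matching bucket."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0 or not 0.0 < q <= 1.0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper in bounds:
        count = buckets[upper]
        if count >= rank:
            if math.isinf(upper):
                return prev_bound  # fall back to the highest finite bound
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count
    return prev_bound


healthy = {0.1: 990, 0.5: 1000, 1.0: 1000, 2.5: 1000, math.inf: 1000}
hot = {0.1: 500, 0.5: 700, 1.0: 800, 2.5: 1000, math.inf: 1000}
pods = [healthy, healthy, healthy, hot]

avg_of_p99 = sum(quantile(0.99, p) for p in pods) / len(pods)
global_p99 = quantile(0.99, merge(pods))
print(round(avg_of_p99, 2), round(global_p99, 2))  # 0.68 2.2
```

Averaging the per-pod p99 values reports about 0.68 seconds, while the merged histogram puts the fleet-wide p99 at about 2.2 seconds: the hot pod is hidden by the average, which is exactly the failure mode of aggregating Summaries.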
2. What is a correlation ID and why is it critical in a microservices architecture? How does it differ from a trace ID?
- A correlation ID is a unique identifier generated at the entry point of a request (usually the API gateway or first service to receive the request) and propagated through every downstream service call via HTTP headers (commonly X-Correlation-ID). Its purpose is to tie together all logs, events, and side effects that belong to the same user-initiated action, even across asynchronous boundaries like message queues.
- A trace ID (from distributed tracing) serves a similar linking purpose but lives within the tracing system specifically. The trace ID connects spans in a distributed trace. A correlation ID is broader: it appears in logs, metrics tags, audit trails, error reports, and even database records. Think of the trace ID as the tracing system’s view of a request, and the correlation ID as the business-level identity of the entire operation.
- In practice, many teams use the trace ID as the correlation ID (they are the same value). This works well if all your systems participate in tracing. But some systems often sit outside tracing: message queues, batch processors, third-party webhook handlers. The correlation ID survives in places where tracing does not. For example, if a request triggers an async job via Kafka and trace context is not propagated through the message, the trace ends at the Kafka producer, but the correlation ID is embedded in the Kafka message and picked up by the consumer, allowing you to search logs across the sync and async boundary.
- Implementation detail: use ContextVar (Python) or AsyncLocalStorage (Node.js) to propagate the correlation ID within a service without passing it explicitly through every function. The middleware sets it on request entry, and the structured logger automatically includes it in every log line.
- Example: During a production incident at an e-commerce company, a customer reported being double-charged. The support team searched logs by the correlation ID from the customer’s API response header and found the entire lifecycle: the initial checkout request, the payment service call, a timeout, the retry (with the same correlation ID), and the payment service processing both the original and the retry because the idempotency key was not checked. Without the correlation ID linking these events across 4 services and a retry boundary, finding this would have taken hours instead of minutes.
- A request enters your system, triggers 3 synchronous service calls, then publishes to Kafka, where 2 consumers process the message asynchronously. How do you ensure the correlation ID flows through the entire chain, including the async leg?
- Your team is debating whether to use the OpenTelemetry trace ID as the correlation ID or maintain a separate one. What are the arguments for and against each approach?
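The ContextVar approach mentioned in the answer can be sketched with only the Python standard library. Treat this as one possible shape rather than a reference implementation: `handle_request`, `charge_card`, and the logger name are hypothetical, and a real service would wire the same idea into its web framework's middleware hook.

```python
import contextvars
import io
import logging
import uuid

# Holds the current request's correlation ID; "-" means "no request in scope".
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copies the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


# Log to an in-memory stream so the example output is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s cid=%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationIdFilter())
logger = logging.getLogger("checkout_example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def charge_card() -> None:
    # No correlation ID argument: the filter reads it from the context.
    logger.info("charging card")


def handle_request(headers: dict) -> dict:
    """Hypothetical middleware: reuse the inbound ID or mint one, then clean up."""
    token = correlation_id.set(headers.get("X-Correlation-ID") or uuid.uuid4().hex)
    try:
        charge_card()
        # Headers (or message headers) to attach to downstream calls.
        return {"X-Correlation-ID": correlation_id.get()}
    finally:
        correlation_id.reset(token)


outgoing = handle_request({"X-Correlation-ID": "abc123"})
print(stream.getvalue(), end="")  # INFO cid=abc123 charging card
print(outgoing)                   # {'X-Correlation-ID': 'abc123'}
```

For the asynchronous leg, the same returned dictionary would be written into the message headers by the producer and passed to `handle_request` by the consumer, so both sides log the same ID.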
3. Your team alerts on 'CPU > 80%' and 'disk > 90%'. An SRE tells you these are bad alerts. Why, and what should you alert on instead?
- These are cause-based alerts, not symptom-based alerts. High CPU or high disk usage is not inherently a problem — it is a problem only if it impacts users. A server running at 85% CPU while serving all requests within SLO is healthy. Alerting on it creates noise and trains on-call engineers to ignore pages, which is the definition of alert fatigue.
- The principle is: alert on symptoms (what users experience), investigate causes (what the infrastructure is doing). Symptoms are things like “error rate exceeded 1% for 5 minutes,” “p99 latency exceeded 2 seconds,” or “order completion rate dropped below baseline.” These tell you that users are suffering and action is needed. The CPU and disk metrics are useful for diagnosis after the symptom alert fires, not as alert triggers themselves.
- There is one exception: predictive alerts for resource exhaustion. “Disk is 90% full” is a bad alert. “At the current growth rate, disk will be 100% full in 4 hours” is a good alert because it predicts an imminent user-facing failure (the service will crash when the disk fills up). Similarly, “connection pool is at 95% capacity” is worth alerting on because at 100% the service stops accepting new requests.
- A well-designed alert has four properties: it is actionable (someone needs to do something), it is urgent (it cannot wait until Monday), it is clear (the alert message says what is wrong and links to a runbook), and it is rare (it does not fire frequently enough to be ignored).
- Example: A team at a fintech company had 47 active alert rules. On-call engineers were paged 15-20 times per week, mostly for “CPU > 80%” on batch processing servers that regularly spiked to 90% during scheduled jobs — completely normal behavior. After a quarterly alert review, they reduced to 12 alerts, all symptom-based (error rates, latency SLO breaches, queue depth growing for more than 10 minutes). On-call pages dropped to 2-3 per week, and the Mean Time to Acknowledge improved from 12 minutes to 3 minutes because engineers trusted that every page was real.
- Your organization has 300 alert rules across all services. How would you audit them and decide which to keep, modify, or delete?
- A junior engineer pushes a new alert: “If any single request takes longer than 5 seconds, page on-call.” Why is this problematic, and how would you redesign it?
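The predictive disk alert described in the answer ("at the current growth rate, disk will be full in 4 hours") amounts to a linear extrapolation. The sketch below fits a least-squares line through recent samples. The function names, the sample data, and the 4-hour paging horizon are assumptions for illustration. PromQL's built-in predict_linear() function serves a similar purpose inside Prometheus.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (hours since start, percent of disk used)


def hours_until_full(samples: Sequence[Sample], capacity: float = 100.0) -> Optional[float]:
    """Extrapolate from a least-squares slope; None if usage is flat or shrinking."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    # Simplification: project forward from the last observed value.
    return max(0.0, (capacity - samples[-1][1]) / slope)


def should_page(samples: Sequence[Sample], horizon_hours: float = 4.0) -> bool:
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= horizon_hours


growing = [(0, 90.0), (1, 92.5), (2, 95.0)]  # +2.5 points per hour
stable = [(0, 90.0), (1, 90.0), (2, 90.0)]   # 90% full but not growing

print(hours_until_full(growing), should_page(growing))  # 2.0 True
print(hours_until_full(stable), should_page(stable))    # None False
```

Both disks are at or above 90%, but only the growing one predicts a user-facing failure, so only it pages.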
4. Explain the concept of an error budget. How does it change the relationship between the product team and the reliability team?
- An error budget is the inverse of your SLO, expressed as an allowed amount of unreliability. If your SLO is 99.95% availability over 30 days, you are allowed 0.05% failure, which translates to about 21.6 minutes of downtime or about 0.05% of requests returning errors. The error budget is that 21.6 minutes — a quantified “budget” that the team can “spend” on risky activities like deployments, migrations, or experiments.
- The organizational impact is transformative. Without error budgets, product teams and reliability teams are in constant tension. Product wants to ship features fast (which introduces risk), reliability wants to freeze everything (which preserves uptime). Error budgets resolve this by making the trade-off explicit and data-driven: if the error budget has plenty remaining, the product team ships aggressively. If the budget is nearly exhausted, the team prioritizes reliability work. Neither side “wins” — the data decides.
- In practice, error budget policies define what happens at different thresholds. Above 50% budget remaining: normal development velocity, deploy at will. Between 25-50%: increase testing rigor, require canary deployments, postpone risky migrations. Below 25%: feature freeze, all engineering effort goes to reliability improvements. Budget exhausted (0%): no deployments except reliability fixes, postmortem required.
- The error budget also prevents over-investing in reliability. If you consistently have 90% of your budget remaining at the end of each month, you are either too conservative (deploy more aggressively) or your SLO is too loose (tighten it). The budget should be meaningfully spent — ideally 30-60% consumed in a normal month.
- Example: Google’s SRE book describes how the Ads team uses error budgets. When their search ads serving system was too reliable (consistently 99.999% against a 99.99% SLO), the SRE team actually encouraged the product team to take more risks with deployments because the unspent error budget represented wasted velocity. This counterintuitive dynamic — reliability engineers encouraging risk — only works because the error budget framework makes the trade-off explicit.
- The product team has a major launch next week, but the error budget is at 10% remaining. The VP of Product insists on shipping anyway. How do you handle this?
- Your error budget calculations show 99.97% availability, but customers are complaining about frequent errors. What could explain the discrepancy between the budget being healthy and user experience being poor?
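The threshold-based policy in the answer can be written down as a lookup, which is roughly what an automated deployment gate would evaluate. The function name `deployment_policy` and the returned strings are illustrative. The thresholds are the ones stated in the answer, with the boundary values 25% and 50% assigned to the middle tier as an assumption.

```python
def deployment_policy(remaining_fraction: float) -> str:
    """Map the unspent share of the error budget to a deployment policy."""
    if remaining_fraction <= 0.0:
        return "budget exhausted: reliability fixes only, postmortem required"
    if remaining_fraction < 0.25:
        return "feature freeze: all effort on reliability"
    if remaining_fraction <= 0.50:
        return "caution: canary deployments, postpone risky migrations"
    return "normal velocity: deploy at will"


for remaining in (0.80, 0.40, 0.10, 0.0):
    print(f"{remaining:.0%} remaining -> {deployment_policy(remaining)}")
```

With 10% of the budget left, as in the launch scenario above, the policy returns the feature-freeze tier, which gives the conversation with product leadership a pre-agreed rule to point to.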
5. You are designing a logging pipeline for a platform that generates 500GB of logs per day. How do you architect this to be searchable, cost-effective, and reliable?
- At 500GB per day, the architecture must address three challenges: ingestion throughput (can you capture all 500GB without dropping logs?), storage economics (indexed log storage is expensive at this scale), and query performance (can engineers search these logs quickly during an incident?).
- Ingestion layer: applications write logs to local files or stdout (never directly to a remote service — that creates a coupling that can crash the app if the logging service is down). A local agent (Fluentd, Fluent Bit, or Vector) tails the logs, parses and enriches them (adds host, service name, environment), and forwards them to a central aggregation layer. Use a buffer (Kafka) between the agent and the indexer to absorb traffic spikes. Kafka also provides durability — if the indexer goes down, logs queue up in Kafka rather than being dropped.
- Processing and routing: not all logs deserve the same treatment. Route logs based on level and source: ERROR and WARN logs go to the fast-query indexed store (Elasticsearch, Loki, or Splunk). INFO logs for critical services go to the indexed store. INFO logs for non-critical services and all DEBUG logs go directly to cold storage (S3 with Parquet format). This tiering typically reduces your indexed volume by 60-70%.
- Indexed storage: Elasticsearch is the most common choice. At 500GB per day (with 60-70% going to cold storage, so ~150-200GB indexed daily), you need a cluster sized for both ingest throughput and query performance. Use time-based indices (one per day), apply an ILM (Index Lifecycle Management) policy to roll indices to warm storage after 7 days (fewer replicas, cheaper nodes) and delete after 30 days. As a cheaper alternative, Grafana Loki stores log lines without full-text indexing: it indexes only labels (service, level) and uses brute-force grep at query time. It is much cheaper, slower for arbitrary text search, and excellent for label-based filtering.
- Cold storage: the 60-70% of logs that go to S3 can be queried on-demand using Athena or Presto when needed, typically during deep-dive investigations. The query takes minutes instead of seconds, but the storage cost is cents per GB vs. dollars per GB for indexed storage.
- Reliability: the pipeline must not lose logs. Kafka’s durability guarantees cover the ingestion path. The agent should have a disk-backed buffer for when Kafka is unreachable. End-to-end, monitor the pipeline itself: track the lag between log generation and indexing, alert if it exceeds 5 minutes.
- Example: Netflix processes over 1 PB of logs per day. They use a tiered architecture: a subset of logs is indexed in Elasticsearch for real-time search, while the full log stream goes to S3 via their custom pipeline (“Keystone”). Engineers query S3 via Presto for historical investigations. This hybrid gives them sub-second search for recent logs and cost-effective access to everything.
- Your Elasticsearch cluster is healthy but engineers complain that log searches during incidents are slow (taking 30+ seconds). What are the most likely causes and how do you optimize query performance?
- A compliance requirement mandates that logs must be retained for 7 years. How does this change your architecture, and what is the cost implication?
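The routing rule from the processing stage above can be sketched as a pure function that an agent or stream processor would apply per log record. The names `Destination`, `CRITICAL_SERVICES`, and `route` are hypothetical, as is the service list. Sending FATAL and CRITICAL to the indexed store is an added assumption, since the answer only names ERROR and WARN.

```python
from enum import Enum


class Destination(Enum):
    INDEXED = "indexed"  # fast-query store, e.g. Elasticsearch or Loki
    COLD = "cold"        # cheap object storage, queried on demand


# Hypothetical list of services whose INFO logs are worth indexing.
CRITICAL_SERVICES = {"checkout", "payments"}

# Assumption: anything at WARN or above is indexed.
ALWAYS_INDEXED = {"FATAL", "CRITICAL", "ERROR", "WARN", "WARNING"}


def route(level: str, service: str) -> Destination:
    """Decide where a log record goes based on its level and source service."""
    level = level.upper()
    if level in ALWAYS_INDEXED:
        return Destination.INDEXED
    if level == "INFO" and service in CRITICAL_SERVICES:
        return Destination.INDEXED
    return Destination.COLD  # non-critical INFO and all DEBUG


print(route("error", "recommendations"))  # Destination.INDEXED
print(route("info", "checkout"))          # Destination.INDEXED
print(route("info", "recommendations"))   # Destination.COLD
print(route("debug", "checkout"))         # Destination.COLD
```

Keeping the rule as data (two sets) rather than scattered conditionals makes it easy to review which services are paying for indexed storage.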
6. What is tail-based sampling in distributed tracing, and why is it superior to head-based sampling for debugging production issues?
- Head-based sampling makes the keep-or-drop decision at the start of the trace (when the first span is created). For example, “sample 10% of traces” means 90% of traces are never recorded at all. The decision is made before you know anything about whether the trace will be interesting. Tail-based sampling defers the decision until the trace is complete, so it can evaluate the entire trace before deciding whether to keep it.
- Tail-based sampling is superior because it can apply intelligent criteria: keep all traces where any span has an error, keep all traces where the total duration exceeds a threshold (e.g., >2 seconds), keep all traces for a specific user ID or transaction type, and randomly sample the rest. This means you retain 100% of the traces that matter for debugging (errors and slow requests) while dropping the vast majority of boring, successful traces.
- With head-based sampling at 10%, you have a 90% chance of losing the trace for any given production error. During an incident, you might find zero traces for the failing request pattern, making tracing useless exactly when you need it most. With tail-based sampling, every error trace is preserved regardless of the overall sampling rate.
- The trade-off: tail-based sampling is operationally more complex. It requires a collector that temporarily buffers all spans, waits for the trace to complete (or a timeout), evaluates the sampling rules, then either forwards or drops the trace. This buffer consumes memory and adds latency to trace availability (you cannot see the trace until the sampling decision is made, typically 30-60 seconds after trace completion). The OpenTelemetry Collector supports tail-based sampling through the tail sampling processor shipped in its contrib distribution.
- Example: A payments company switched from 5% head-based sampling to tail-based sampling that kept 100% of error traces, 100% of traces slower than 1 second, and 1% of everything else. Their total trace volume dropped by 70% (cost savings) while their ability to debug payment failures went from “we might have a trace” to “we always have the trace.” The first week after switching, they diagnosed a race condition in their idempotency logic that only manifested 0.01% of the time — a trace that head-based sampling would have almost certainly missed.
- Your tail-based sampling collector is buffering spans for 60 seconds before making a decision, but some traces in your system take 5 minutes to complete (long-running background jobs). How do you handle this?
- An engineer argues that with tail-based sampling, you can reduce your overall sampling to 0.1% of successful traces because you keep all errors. What is the risk of being this aggressive?
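The keep-or-drop rules described in the answer can be expressed as a decision function applied once a trace is complete. This sketch covers only the decision, not the buffering collector around it. The `Span` shape, the `keep_trace` name, and the default thresholds (2 seconds, 1% baseline) are assumptions chosen to mirror the numbers in the text.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    duration_ms: float
    error: bool = False


def keep_trace(
    spans: List[Span],
    root_duration_ms: float,
    slow_ms: float = 2000.0,
    baseline_rate: float = 0.01,
    rng: Optional[random.Random] = None,
) -> bool:
    """Tail-based decision: always keep errors and slow traces, sample the rest."""
    if any(span.error for span in spans):
        return True
    if root_duration_ms > slow_ms:
        return True
    draw = (rng or random).random()  # uniform in [0.0, 1.0)
    return draw < baseline_rate


failed = [Span("checkout", 120.0), Span("payment", 80.0, error=True)]
slow = [Span("checkout", 3000.0), Span("db.query", 2800.0)]
boring = [Span("checkout", 45.0)]

print(keep_trace(failed, 120.0))                     # True (error span)
print(keep_trace(slow, 3000.0))                      # True (over 2 seconds)
print(keep_trace(boring, 45.0, baseline_rate=0.0))   # False (nothing sampled)
```

Passing a seeded `random.Random` as `rng` makes the baseline sampling reproducible in tests, while production code can rely on the module-level generator.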
7. You join a team that has monitoring but no observability. Dashboards exist, but every incident takes hours to diagnose. What is the difference, and how do you bridge the gap?
- Monitoring answers pre-defined questions: “Is the CPU high? Is the error rate above threshold? Is the service up?” You set up checks for known failure modes. Observability is the ability to ask arbitrary questions about your system’s internal state using external outputs — questions you did not anticipate when you instrumented the system. The distinction: monitoring tells you THAT something is wrong, observability lets you figure out WHY.
- The telltale sign of monitoring-without-observability is the “dashboard gap”: an alert fires, the on-call engineer opens a dashboard, sees something is red, but cannot drill down to root cause without SSHing into servers, running ad-hoc queries, or asking other engineers “have you seen this before?” Every incident becomes a manual investigation because the system does not expose enough context to reason about novel failures.
- To bridge the gap, three concrete changes: (1) Add structured logging with high-cardinality fields. Replace log.info("Request failed") with log.info("request_failed", user_id="usr_123", endpoint="/checkout", error_code="PAYMENT_TIMEOUT", trace_id="abc789", duration_ms=4500). High-cardinality fields (user_id, order_id, trace_id) are what let you pivot and slice data in new ways during an incident. (2) Add distributed tracing. Tracing gives you the request-level view that metrics and logs individually cannot: which service was slow, what the call graph looked like, where the bottleneck was. (3) Correlate all three pillars. Every log line should include the trace_id. Dashboards should link to example traces. Trace views should link to correlated logs. This correlation is what turns three separate data sources into a unified investigative tool.
- The cultural shift is equally important. Monitoring culture says “add an alert for this failure mode.” Observability culture says “instrument the system so that any failure mode, including ones we have not seen yet, can be diagnosed from the telemetry.” The instrumentation mindset changes from reactive (add a check when something breaks) to proactive (emit rich context from the start).
- Example: Charity Majors (co-founder and CTO of Honeycomb) describes the distinction as “monitoring is for known-unknowns, observability is for unknown-unknowns.” A team at Honeycomb went from 4-hour Mean Time to Resolution to 15 minutes after adopting high-cardinality structured events with tracing, because engineers could query production telemetry the same way they would query a database: “Show me all requests from user X in the last hour, grouped by endpoint, sorted by duration.”
- You are trying to convince leadership to invest in observability tooling, but they say “we already have Grafana dashboards and PagerDuty alerts, why do we need more?” How do you make the business case?
- What is “high cardinality” in the context of observability, and why is it both the most powerful feature and the biggest cost driver?
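The structured-event idea from the answer, and the kind of ad-hoc question it enables, can be sketched with plain JSON lines. The helper `log_event` is hypothetical and stands in for whatever structured logging library a team uses. The field names are the ones from the example log call in the text, and the second and third events are invented sample data.

```python
import json
import time
from collections import defaultdict


def log_event(event: str, level: str = "info", **fields) -> str:
    """Emit one JSON object per line; every keyword becomes a queryable field."""
    record = {"ts": time.time(), "level": level, "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


lines = [
    log_event("request_failed", level="error", user_id="usr_123", endpoint="/checkout",
              error_code="PAYMENT_TIMEOUT", trace_id="abc789", duration_ms=4500),
    log_event("request_completed", user_id="usr_123", endpoint="/cart",
              trace_id="abc790", duration_ms=35),
    log_event("request_completed", user_id="usr_456", endpoint="/checkout",
              trace_id="abc791", duration_ms=210),
]

# An unanticipated question: "requests from user usr_123, grouped by endpoint,
# slowest first". It is answerable only because user_id was recorded as a field.
events = [json.loads(line) for line in lines]
by_endpoint = defaultdict(list)
for e in events:
    if e["user_id"] == "usr_123":
        by_endpoint[e["endpoint"]].append(e["duration_ms"])
slowest_first = sorted(by_endpoint.items(), key=lambda kv: max(kv[1]), reverse=True)
print(slowest_first)  # [('/checkout', [4500]), ('/cart', [35])]
```

With the unstructured "Request failed" message, none of this filtering or grouping would be possible without parsing free text, which is the practical difference between the two logging styles.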
8. Your distributed trace shows a request took 3 seconds, but when you add up all the spans, they only account for 800ms. Where did the other 2.2 seconds go?
- This is the “missing time” problem in distributed tracing, and it is one of the most common and frustrating debugging scenarios. The 2.2 seconds are hiding in uninstrumented gaps — parts of the request lifecycle that do not have spans.
- The most common causes: (1) Queue wait time. The request was placed in a thread pool queue or connection pool queue and waited 2.2 seconds for an available worker or connection. This time is between the parent span starting and the child span starting, but no span covers the gap. Fix: add instrumentation that records time spent waiting for resources (thread pool queue time, connection pool checkout time). (2) Middleware or framework overhead. The web framework does work before and after your handler code — request parsing, authentication middleware, response serialization, compression. If you only instrumented the handler, the framework overhead is invisible. Fix: add spans or metrics at the middleware layer. (3) Network latency between services. If Service A calls Service B, the parent span in A ends when it sends the request, and the child span in B starts when B receives it. The network transit time (plus any load balancer processing) is in neither span. Fix: use client-side spans that cover the full HTTP call, not just the server-side processing. (4) Garbage collection pauses. A GC pause can add seconds of latency but does not show up in application-level spans. Fix: emit GC pause metrics and correlate them with trace timestamps. (5) DNS resolution or TLS handshake. The first request to a service may include DNS lookup (50-200ms) and TLS negotiation (50-150ms). Subsequent requests reuse the connection. Fix: instrument the HTTP client at the transport layer.
- The debugging approach: look at the waterfall view of the trace. Identify the gaps between consecutive spans. The largest gap is where the time went. Check whether any span’s start time is significantly later than its parent’s start time — that delta is queue or network time. Check whether any span’s end time is much later than the last child span’s end time — that could be response serialization or middleware.
- Example: A team debugging a slow API response found 1.5 seconds of “missing time” in their trace. The trace showed the handler span started 1.5 seconds after the request hit the server. Investigation revealed the service was using a synchronous thread pool with 10 threads, and under load, requests were queuing for the pool. Adding a span around thread pool checkout time revealed the queue wait. The fix was switching to async request handling, which eliminated the thread pool bottleneck entirely.
- You add instrumentation and discover the 2.2 seconds is connection pool wait time for the database. The pool has 20 connections and 200 concurrent requests. What are your options, and which do you prefer?
- How would you design your tracing instrumentation from the start to minimize these “missing time” gaps, without over-instrumenting and creating too many spans?
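The waterfall-gap approach from the answer can be automated: given the root span and its direct children, compute the intervals that no child covers. The sketch below uses made-up timings that reproduce the 3-second request with 800ms of instrumented work, and the name `uncovered_gaps` is invented for the example.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms) relative to request start


def uncovered_gaps(root: Interval, children: List[Interval]) -> List[Interval]:
    """Return the parts of the root span that no child span covers."""
    root_start, root_end = root
    gaps: List[Interval] = []
    cursor = root_start
    for start, end in sorted(children):
        # Clamp each child to the root span and skip anything outside it.
        start, end = max(start, root_start), min(end, root_end)
        if end <= start:
            continue
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < root_end:
        gaps.append((cursor, root_end))
    return gaps


root = (0.0, 3000.0)
children = [(0.0, 100.0), (1600.0, 2000.0), (2700.0, 3000.0)]  # 800ms instrumented

gaps = uncovered_gaps(root, children)
print(gaps)                                        # [(100.0, 1600.0), (2000.0, 2700.0)]
print(sum(end - start for start, end in gaps))     # 2200.0
print(max(gaps, key=lambda g: g[1] - g[0]))        # (100.0, 1600.0)
```

The largest gap, here 1.5 seconds between the first and second child spans, is the first place to add instrumentation such as a span around pool checkout or queue wait.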