December 2025 Update: Covers LangSmith, Langfuse, Phoenix, and custom observability solutions for production LLM systems.
Why Observability Matters for LLMs
Here is the fundamental challenge with LLM applications: they are non-deterministic black boxes. Unlike a traditional API where a bug produces the same wrong answer every time, an LLM might give a perfect answer 95% of the time and hallucinate wildly the other 5%. Without observability, those 5% of failures are invisible until a customer complains, and by then you have no idea what went wrong, because the same input might produce a correct answer on retry.
Think of it like running a restaurant without any way to see the kitchen. Customers tell you the food is bad, but you cannot see which chef made it, which ingredients were used, or what went wrong. LLM observability gives you security cameras in the kitchen.
Without observability, you can’t:
- Debug why a response was wrong (was it the prompt? the retrieved context? the model?)
- Identify cost spikes (a single prompt engineering mistake can 10x your bill overnight)
- Detect quality degradation (model updates can silently break your prompts)
- Optimize performance (you cannot improve what you cannot measure)
Key Metrics to Track
Core Metrics
Track these five categories from day one. You do not need a fancy dashboard to start: even logging to a file and running a weekly analysis script is better than nothing. But get the data flowing early, because you cannot retroactively add observability to requests you did not log.

| Category | Metrics | Why It Matters |
|---|---|---|
| Latency | P50, P95, P99 response time | P95 matters more than average — 5% of users having a terrible experience is a real problem |
| Cost | Tokens per request, $ per request | A single prompt change can 10x your bill. Track daily to catch spikes early |
| Quality | User feedback, success rate | The only metric that actually matters — everything else is a proxy |
| Errors | Rate, types, retry success | LLM APIs fail more often than traditional APIs. 1-3% error rate is normal |
| Usage | Requests/min, active users | Capacity planning and detecting abuse (one user making 10K requests/day) |
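
The latency and cost rows above need very little machinery to get started. The following is a minimal sketch rather than a definitive implementation: `PRICES_PER_MILLION`, `log_llm_call`, and the record field names are hypothetical, and the prices are placeholders, so substitute your provider's current rates.

```python
import json
import math
import time
import uuid

# Hypothetical prices in USD per 1M tokens; replace with your provider's real rates.
PRICES_PER_MILLION = {"example-model": {"input": 1.00, "output": 3.00}}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, derived from its token counts."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def log_llm_call(model, input_tokens, output_tokens, latency_ms, error=None) -> dict:
    """Build one structured record per call and emit it as a JSON line on stdout."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "cost_usd": request_cost_usd(model, input_tokens, output_tokens),
        "error": error,
    }
    print(json.dumps(record))
    return record


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile of a non-empty list, with pct between 0 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Twenty sample latencies from 100 ms to 2000 ms, summarized as P50/P95/P99.
latencies = [float(ms) for ms in range(100, 2100, 100)]
summary = {p: percentile(latencies, p) for p in (50, 95, 99)}
```

Running `percentile` over a day of logged `latency_ms` values gives the P50, P95, and P99 figures without any dashboard, which is enough for the weekly analysis script mentioned above.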
LLM-Specific Metrics
Observability Tool Decision Framework
Before choosing a tool, answer three questions: (1) Do you need self-hosting for data privacy? (2) Are you already using LangChain? (3) Is this for development or production?

| Criteria | Langfuse | LangSmith | Phoenix | Custom (OTel) |
|---|---|---|---|---|
| Self-hosted | Yes (Docker) | No (cloud only) | Yes (local-first) | Yes |
| Free tier | 50K observations/mo | 5K traces/mo | Unlimited (local) | N/A |
| LangChain integration | Good (manual) | Native (automatic) | Good (OpenInference) | Manual |
| OpenAI SDK integration | Drop-in wrapper | Via @traceable | OpenInference auto-instrument | Manual |
| Evaluation workflows | Basic scoring | Full eval pipelines + datasets | LLM-as-judge built-in | Build your own |
| Data residency | Full control if self-hosted | US only (as of 2025) | Full control | Full control |
| Team size sweet spot | 2-50 engineers | LangChain-heavy teams | Solo dev / prototyping | 50+ engineers with existing infra |
| Production readiness | High | High | Medium (better for dev) | Depends on your build |
| Setup time | 10 minutes | 5 minutes | 2 minutes | Days to weeks |
Langfuse: Open-Source LLM Observability
Langfuse is the open-source option that most teams start with, and for good reason: it can be self-hosted (important for data privacy), has a generous free tier on its cloud, and the integration is genuinely minimal, often just a decorator or a drop-in OpenAI client replacement. Think of it as “Datadog for LLMs.”
Setup
Tracing LLM Calls
Custom Metrics and Evaluations
LangSmith: LangChain’s Platform
LangSmith is the natural choice if you are already in the LangChain ecosystem: tracing is automatic for chains, agents, and LangGraph workflows. The trade-off: it is a closed-source, hosted service (no self-hosting option), so it may not work for teams with strict data residency requirements. Where it shines is the evaluation workflow, where you can build test datasets, run automated evals, and compare prompt versions all in one place.
Setup
Tracing Chains
Custom Tracing
Feedback and Evaluation
Arize Phoenix: Open-Source Tracing
Phoenix takes a different approach: local-first observability. It runs entirely on your machine with a beautiful UI, making it ideal for development and debugging. You do not need to send data anywhere; just pip install and launch. For production, you can export traces to any OpenTelemetry-compatible backend. Think of Phoenix as “the development tool” and Langfuse/LangSmith as “the production tool.”
Setup
Tracing RAG Pipelines
Custom Observability Stack
When should you build your own instead of using Langfuse or LangSmith? Two scenarios: (1) you have strict compliance requirements that prevent sending data to third-party services and cannot self-host Langfuse, or (2) you need tight integration with existing infrastructure (your own Grafana dashboards, your own alerting pipeline, your own data warehouse). For most teams, start with a managed tool and only build custom when you outgrow it. Building your own observability gives you complete control.
Dashboards and Alerting
Key Dashboards
Alert Rules
The alerts below represent hard-won lessons from production LLM systems. The latency alert catches model provider degradation (which happens more often than you would expect). The error rate alert catches prompt regressions after deployments. The cost alert prevents runaway spending from infinite loops or unexpectedly verbose prompts.
What to Alert On vs. What to Dashboard
This is a common mistake: teams alert on everything and get paged for non-issues, or dashboard everything and miss real problems. Here is the split:

| Metric | Alert (page someone) | Dashboard (review daily) | Why |
|---|---|---|---|
| Error rate > 5% for 5 min | Yes | Yes | Immediate user impact |
| P95 latency > 10s for 5 min | Yes | Yes | Users are waiting, may abandon |
| Daily cost > 2x normal | Yes | Yes | Runaway loop or abuse |
| P50 latency creeping up | No | Yes | Gradual trend, not an emergency |
| Token usage per request increasing | No | Yes | Prompt drift, review weekly |
| Single user negative feedback | No | Yes | Noise; look for patterns |
| Model provider returning 503 | Yes (if sustained 2+ min) | Yes | Transient blips are normal |
| New model version deployed | No | Yes | Compare before/after quality |
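
The paging rows of the table can be expressed as plain checks over a window of recent requests. This is a minimal sketch under stated assumptions: the record fields `ok` and `latency_ms`, the `WindowStats` type, and the `summarize` and `should_page` helpers are all hypothetical names, and the caller is assumed to pass only the last 5 minutes of records so that the "for 5 min" condition is handled by the window itself.

```python
import math
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates over one evaluation window (assumed to be the last 5 minutes)."""
    error_rate: float             # fraction of failed requests, 0.0 to 1.0
    p95_latency_s: float          # 95th percentile latency in seconds
    cost_today_usd: float         # spend so far today
    normal_daily_cost_usd: float  # baseline daily spend


def summarize(records: list, cost_today_usd: float, normal_daily_cost_usd: float) -> WindowStats:
    """Reduce a non-empty list of request records to the stats the rules need."""
    latencies = sorted(r["latency_ms"] / 1000 for r in records)
    rank = max(1, math.ceil(95 * len(latencies) / 100))  # nearest-rank P95
    errors = sum(1 for r in records if not r["ok"])
    return WindowStats(errors / len(records), latencies[rank - 1],
                       cost_today_usd, normal_daily_cost_usd)


def should_page(stats: WindowStats) -> list:
    """Return the reasons to page someone; an empty list means dashboard only."""
    reasons = []
    if stats.error_rate > 0.05:
        reasons.append("error rate above 5%")
    if stats.p95_latency_s > 10:
        reasons.append("P95 latency above 10s")
    if stats.cost_today_usd > 2 * stats.normal_daily_cost_usd:
        reasons.append("daily cost above 2x normal")
    return reasons


# Example window: 100 requests, every tenth one failed, all took 800 ms.
records = [{"ok": i % 10 != 0, "latency_ms": 800} for i in range(100)]
stats = summarize(records, cost_today_usd=40.0, normal_daily_cost_usd=50.0)
reasons = should_page(stats)
```

In this example only the error-rate rule fires, because latency and cost stay under their thresholds. The dashboard-only rows, such as P50 drift, are deliberately left out of `should_page`.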
Debugging LLM Issues
Debugging LLM issues is fundamentally different from debugging traditional software. The “bug” is often not in your code at all; it is in the interaction between your prompt, the retrieved context, and the model’s interpretation. Codifying the most common failure patterns lets you diagnose issues systematically rather than staring at logs hoping for insight.
Common Issues and Diagnosis
Debugging Decision Tree
When a user reports “the AI gave a wrong answer,” use this systematic approach:

| Step | Check | What You Are Looking For | Tool |
|---|---|---|---|
| 1 | Trace ID lookup | Find the exact request that went wrong | Langfuse/LangSmith trace view |
| 2 | Retrieved context | Were the right documents retrieved? Was relevant info missing? | RAG span in trace |
| 3 | System prompt | Was the correct prompt version active? Any recent changes? | Prompt registry / version history |
| 4 | Input tokens | Was context truncated due to token limits? | Token count in trace |
| 5 | Model response | Did the model hallucinate, or did it follow bad context? | Compare response to retrieved context |
| 6 | Tool calls | Did the model call the right tools with correct arguments? | Tool call span in trace |
| 7 | Post-processing | Was the response correctly parsed and formatted? | Application logs |
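
Several steps of the decision tree can be checked mechanically once a trace has been fetched by ID. The sketch below is illustrative only: the trace is modeled as a plain dict, and every field name (`retrieved_docs`, `prompt_version`, `expected_prompt_version`, `input_tokens`, `context_limit`, `tool_calls`, `parse_error`) is a hypothetical stand-in for whatever your tracing tool actually exports. Step 5, deciding whether the model hallucinated or followed bad context, still needs a human or an evaluator and is not automated here.

```python
def diagnose(trace: dict) -> list:
    """Run the automatable checks in decision-tree order and return the findings."""
    findings = []
    # Step 2: were any documents retrieved at all?
    if not trace.get("retrieved_docs"):
        findings.append("step 2: no documents were retrieved")
    # Step 3: was the expected prompt version active?
    if trace.get("prompt_version") != trace.get("expected_prompt_version"):
        findings.append("step 3: unexpected prompt version was active")
    # Step 4: did the input hit the context limit?
    if trace.get("input_tokens", 0) >= trace.get("context_limit", float("inf")):
        findings.append("step 4: input reached the context limit, so context may have been truncated")
    # Step 6: did any tool call report an error?
    for call in trace.get("tool_calls", []):
        if call.get("error"):
            findings.append(f"step 6: tool call {call['name']} failed")
    # Step 7: did post-processing fail to parse the response?
    if trace.get("parse_error"):
        findings.append("step 7: response failed post-processing")
    return findings


example_trace = {
    "retrieved_docs": [],
    "prompt_version": "v3",
    "expected_prompt_version": "v3",
    "input_tokens": 8000,
    "context_limit": 8000,
    "tool_calls": [{"name": "search", "error": None}],
    "parse_error": None,
}
example_findings = diagnose(example_trace)
```

For the example trace, the checks for steps 2 and 4 fire, which points at retrieval and truncation before anyone needs to read the model output.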
Key Takeaways
Trace Everything
Log every LLM call with inputs, outputs, tokens, latency, and cost.
Structured Logging
Use structured logs (JSON) for easy querying and analysis.
Track Quality
Collect user feedback and run automated evaluations.
Set Alerts
Alert on latency spikes, error rates, and cost anomalies.
What’s Next
AI Security & Guardrails
Learn how to secure LLM applications and implement safety guardrails
Interview Deep-Dive
You just deployed an LLM-powered feature to production. What observability would you set up on day one, and what would you add in the first month?
Strong Answer:
- Day one, I set up five things. First, structured logging of every LLM call: model name, input token count, output token count, latency in milliseconds, HTTP status code, and a request ID that ties back to the user’s session. This is a simple JSON log line per request — no fancy infrastructure needed, just write to stdout and let your log aggregator (Datadog, CloudWatch, whatever you already have) index it. Second, cost tracking: I compute the dollar cost per request based on token counts and model pricing, and I emit it as a metric. I set up a daily cost alert at 150% of the expected daily spend. This catches prompt engineering mistakes and infinite loops before they drain the budget. Third, error rate monitoring with alerting at 5% over a 5-minute window. LLM API error rates above 3% usually indicate a provider issue, not your code. Fourth, latency percentiles — P50, P95, P99 — because average latency is misleading. If P50 is 800ms but P99 is 12 seconds, 1% of your users are having a terrible experience. Fifth, a simple “thumbs up / thumbs down” feedback mechanism in the UI. This is the only metric that directly measures output quality, and you need it from day one to establish a baseline.
- In the first month, I layer on three things. First, trace-level observability using Langfuse or LangSmith, where each user request produces a full trace showing the prompt, retrieved context (if RAG), model response, and any tool calls. This makes debugging specific user complaints trivial — “show me the trace for request X” gives you everything you need. Second, automated quality evaluation: I sample 5-10% of production traffic and run it through an LLM judge that scores responses on relevance, accuracy, and helpfulness. This catches gradual quality degradation that no single user complaint would reveal. Third, a cost breakdown dashboard that shows spend by model, by feature, by user segment, and by day. This is where you discover that one power user is consuming 30% of your budget, or that the summarization feature is 10x more expensive per request than chat.
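
Sampling 5-10% of production traffic for an LLM judge works best when the decision is deterministic per request, so that a retry or a replay lands in the same bucket. A minimal sketch of hash-based sampling follows; `in_eval_sample` is a hypothetical helper name, the request ID format is an assumption, and the judge call itself is out of scope here.

```python
import hashlib


def in_eval_sample(request_id: str, sample_rate: float = 0.05) -> bool:
    """Decide deterministically whether a request is sent to automated evaluation."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# The same request ID always gets the same decision.
decision = in_eval_sample("req-123")
```

Hashing the request ID instead of calling a random number generator keeps the sample stable across services, so the trace, the judge score, and any user feedback for a sampled request can all be joined later.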
Compare Langfuse, LangSmith, and Arize Phoenix for LLM observability. How would you choose between them for a startup versus an enterprise?
Strong Answer:
- Langfuse is open-source, can be self-hosted, and has a generous free cloud tier. Its integration is minimal — you can get tracing working with a single decorator or a drop-in OpenAI client replacement. It is framework-agnostic, so it works whether you are using LangChain, LlamaIndex, raw OpenAI calls, or your own framework. The trade-off is that the evaluation and dataset management features are less mature than LangSmith’s.
- LangSmith is the natural choice if you are already invested in the LangChain ecosystem. Tracing is automatic for chains, agents, and LangGraph workflows. The evaluation workflow is the most polished: you can build test datasets, run automated evaluations, compare prompt versions, and track metrics over time, all from a single UI. The trade-offs: it is closed-source with no self-hosting option, which is a dealbreaker for companies with data residency requirements. All your prompts, user queries, and model responses are sent to LangChain’s servers.
- Arize Phoenix takes a local-first approach — it runs entirely on your machine with a browser UI, which makes it ideal for development and debugging. You do not need to send data anywhere. For production, you can export traces to any OpenTelemetry-compatible backend (Jaeger, Grafana Tempo, Datadog). The trade-off is that Phoenix is primarily a development tool; its production monitoring capabilities are less polished than Langfuse or LangSmith.
- For a startup, I would start with Langfuse Cloud for simplicity. You get tracing, cost tracking, and user feedback collection in under an hour of integration work. As you grow, you can self-host Langfuse to reduce costs and gain data control. For an enterprise with data privacy requirements, I would self-host Langfuse or build a custom observability stack on OpenTelemetry. The custom stack is more engineering effort upfront but integrates seamlessly with existing Grafana dashboards, PagerDuty alerting, and data warehouse infrastructure that enterprises already have. I would never recommend LangSmith for an enterprise that handles PII in user queries — the data residency risk is not worth the convenience.
llm.model, llm.input_tokens, llm.output_tokens, llm.cost_usd, llm.latency_ms, llm.finish_reason. For RAG pipelines, the retrieval step gets its own child span with retrieval.num_docs, retrieval.latency_ms, and retrieval.strategy. I would export spans to a backend that supports both real-time dashboarding (Grafana with Tempo) and long-term analytics (a data warehouse like BigQuery or Snowflake). The critical addition on top of standard OpenTelemetry is content logging — the actual prompts and responses. OTel spans are designed for structured metadata, not large text blobs. I would log the full prompt and response to a separate store (S3 or a document database) and include a reference ID in the span. This keeps the tracing infrastructure fast while still giving me the ability to inspect full conversations when debugging. The alert rules would be: cost per hour exceeding 2x baseline, error rate above 3% for 5 minutes, P95 latency above 5 seconds for 5 minutes, and user feedback thumbs-down rate above 15% for any 1-hour window.
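The split between span metadata and content storage described in that answer can be sketched without any tracing library. This is an illustration under assumptions, not an OpenTelemetry integration: `content_store` is an in-memory dict standing in for S3 or a document database, `build_llm_span_attributes` is a hypothetical helper, and `llm.content_ref` is an invented attribute name for the reference ID. The other attribute names are the ones listed in the answer.

```python
import uuid

# Stand-in for S3 or a document database; holds the large text blobs.
content_store = {}


def build_llm_span_attributes(model, prompt, response, input_tokens, output_tokens,
                              cost_usd, latency_ms, finish_reason) -> dict:
    """Store prompt and response separately and return small span attributes."""
    content_ref = str(uuid.uuid4())
    content_store[content_ref] = {"prompt": prompt, "response": response}
    return {
        "llm.model": model,
        "llm.input_tokens": input_tokens,
        "llm.output_tokens": output_tokens,
        "llm.cost_usd": cost_usd,
        "llm.latency_ms": latency_ms,
        "llm.finish_reason": finish_reason,
        "llm.content_ref": content_ref,  # hypothetical name for the reference ID
    }


attrs = build_llm_span_attributes(
    model="example-model", prompt="What is our refund policy?",
    response="Refunds are available within 30 days.",
    input_tokens=12, output_tokens=9, cost_usd=0.00004,
    latency_ms=640, finish_reason="stop",
)
full_content = content_store[attrs["llm.content_ref"]]
```

The returned dict holds only scalars, so it can be attached to a span as attributes, while the full conversation stays retrievable through the reference ID when debugging.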
How do you detect and respond to LLM quality degradation in production when the provider silently updates the model?
Strong Answer:
- Silent model updates are one of the most frustrating aspects of building on third-party LLM APIs. OpenAI has done this multiple times — the model behind the “gpt-4” endpoint changes behavior without any notification. Your prompts that worked perfectly last week suddenly produce slightly different formatting, miss edge cases, or change their interpretation of ambiguous instructions. And because the change is gradual (not a hard failure), your error rate and latency metrics look perfectly normal.
- The detection strategy is continuous evaluation against a stable golden dataset. I run a subset of my golden dataset (50-100 cases) through the production model every 6 hours and compute quality scores. I track these scores as a time series. A sudden drop (more than 10% in one interval) triggers an alert. A gradual decline (5% over a week) triggers a review. The key is that the golden dataset and the evaluation criteria do not change — so any score change must be attributable to the model.
- I also track the system_fingerprint field that OpenAI returns with each response. When the fingerprint changes, I know the backend configuration changed, which may indicate a model update, and I proactively run a full evaluation suite rather than waiting for the scheduled run. This gives me same-day detection rather than waiting for the next 6-hour window.
- The response strategy depends on the severity. For minor regressions (formatting changes, slightly different phrasing), I update my parsing logic and acceptance criteria. For moderate regressions (quality drop of 5-10% on critical use cases), I adjust the prompt to compensate, often adding more explicit instructions or examples that anchor the model to the desired behavior. For severe regressions (quality drop above 10% or new failure modes), I switch to a pinned model version if available (like gpt-4-0613 instead of gpt-4), or fail over to an alternative provider while I investigate.
- The long-term mitigation is reducing provider dependency. I maintain a backup prompt variant tested against Claude, so that if OpenAI’s model degrades, I can redirect traffic to Anthropic within minutes. This is not about permanently switching; it is about having a hot standby that buys me time to investigate and fix the OpenAI-specific prompt.
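
The detection rules in this answer reduce to two comparisons on a time series of golden-dataset scores plus a fingerprint check. The sketch below makes some assumptions that the answer leaves open: drops are measured as relative changes, one score is recorded per 6-hour run (so 28 runs cover a week), and `classify_score_change` and `fingerprint_changed` are hypothetical helper names.

```python
from typing import Optional


def classify_score_change(history: list, latest: float) -> str:
    """Compare the latest score with earlier runs (oldest first).

    Returns 'alert' for a sudden drop, 'review' for a gradual decline, else 'ok'.
    """
    if not history:
        return "ok"
    previous = history[-1]
    # Sudden drop: more than 10% relative to the previous run.
    if previous > 0 and (previous - latest) / previous > 0.10:
        return "alert"
    # Gradual decline: 5% or more relative to the oldest run in the last week.
    week_start = history[-28:][0]
    if week_start > 0 and (week_start - latest) / week_start >= 0.05:
        return "review"
    return "ok"


def fingerprint_changed(previous: Optional[str], current: Optional[str]) -> bool:
    """True when both fingerprints are known and differ."""
    return previous is not None and current is not None and previous != current


sudden = classify_score_change([0.90], 0.78)
gradual = classify_score_change([0.90, 0.89, 0.88, 0.87], 0.85)
steady = classify_score_change([0.90, 0.90], 0.89)
```

A fingerprint change is a reason to run the full evaluation suite immediately; the score comparison then decides whether the change actually hurt quality.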
What alert thresholds would you set for an LLM application, and how do you avoid alert fatigue?
Strong Answer:
- The biggest mistake teams make is setting alert thresholds based on intuition rather than data. “Alert if latency exceeds 5 seconds” sounds reasonable but might fire 50 times a day if your P99 is naturally 6 seconds. My approach is to run the system for 1-2 weeks with logging but no alerts, establish baselines for all metrics, and then set thresholds at statistically meaningful deviations from those baselines.
- I use three severity levels. Critical alerts (page someone immediately): error rate above 10% for 3 minutes (the system is probably down), cost per hour exceeding 5x baseline (runaway loop or prompt explosion), and zero successful requests for 2 minutes (complete outage). Warning alerts (review within 1 hour): P95 latency above 2x baseline for 10 minutes, error rate above 5% for 5 minutes, daily cost exceeding 150% of expected. Informational alerts (review next business day): user thumbs-down rate above 20% for any 4-hour window, model fingerprint change detected, token usage per request trending up over 7 days.
- To avoid alert fatigue, I follow three rules. First, every alert must have a clear action. “Latency is high” is not actionable. “P95 latency for model gpt-4o in region us-east-1 exceeded 8 seconds for 10 minutes” tells the on-call engineer exactly what to investigate. Second, I use alert aggregation and deduplication — if the same alert fires every minute for an hour, the on-call gets one notification, not 60. Third, I do monthly alert reviews: if an alert fired more than 5 times in a month without resulting in a meaningful action, I either fix the underlying issue, adjust the threshold, or remove the alert.
- One LLM-specific alert that most teams miss: I alert on output length anomalies. If the average output token count suddenly doubles, it often means the model started adding unnecessary preambles, repeating itself, or getting stuck in a verbose pattern. This is an early signal of prompt regression or model behavior change, and it shows up before quality metrics degrade.
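
Two of the ideas above, output length anomalies and alert deduplication, are small enough to sketch directly. The helper names `output_length_anomaly` and `AlertDeduplicator`, the default doubling factor, and the one-hour cooldown are assumptions chosen to mirror the answer, not settings from any particular alerting tool.

```python
from statistics import mean


def output_length_anomaly(baseline_tokens: list, recent_tokens: list,
                          factor: float = 2.0) -> bool:
    """True when the recent average output length is at least `factor` times the baseline."""
    if not baseline_tokens or not recent_tokens:
        return False
    return mean(recent_tokens) >= factor * mean(baseline_tokens)


class AlertDeduplicator:
    """Suppress repeats of the same alert key within a cooldown period."""

    def __init__(self, cooldown_s: float = 3600):
        self.cooldown_s = cooldown_s
        self._last_sent = {}

    def should_notify(self, key: str, now: float) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False  # still inside the cooldown, stay quiet
        self._last_sent[key] = now
        return True


# The same alert firing once a minute for an hour produces a single notification.
dedup = AlertDeduplicator(cooldown_s=3600)
notifications = sum(dedup.should_notify("p95-latency", now=60.0 * i) for i in range(60))
doubled = output_length_anomaly([100, 120, 110], [230, 240, 220])
```

Keying the deduplicator on a descriptive string such as model plus region plus metric also serves the first rule in the answer, since the key itself tells the on-call engineer what to investigate.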