December 2025 Update: Covers LangSmith, Langfuse, Phoenix, and custom observability solutions for production LLM systems.

Why Observability Matters for LLMs

Here is the fundamental challenge with LLM applications: they are non-deterministic black boxes. Unlike a traditional API where a bug produces the same wrong answer every time, an LLM might give a perfect answer 95% of the time and hallucinate wildly the other 5%. Without observability, those 5% of failures are invisible until a customer complains, and by then you have no idea what went wrong, because the same input might produce a correct answer on retry.

Think of it like running a restaurant without any way to see the kitchen. Customers tell you the food is bad, but you cannot see which chef made it, which ingredients were used, or what went wrong. LLM observability gives you security cameras in the kitchen.

Without observability, you can’t:
  • Debug why a response was wrong (was it the prompt? the retrieved context? the model?)
  • Identify cost spikes (a single prompt engineering mistake can 10x your bill overnight)
  • Detect quality degradation (model updates can silently break your prompts)
  • Optimize performance (you cannot improve what you cannot measure)
Without Observability          With Observability
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"The AI gave wrong answer"     ├── Prompt: "..."
                               ├── Retrieved Docs: 5 chunks
"No idea why"                  ├── Latency: 2.3s
                               ├── Tokens: 1,847 (input: 1,200)
"Can't reproduce"              ├── Cost: $0.012
                               └── User: user_123
                                   └── Root cause: missing context

Key Metrics to Track

Core Metrics

Track these five categories from day one. You do not need a fancy dashboard to start — even logging to a file and running a weekly analysis script is better than nothing. But get the data flowing early, because you cannot retroactively add observability to requests you did not log.
| Category | Metrics | Why It Matters |
|---|---|---|
| Latency | P50, P95, P99 response time | P95 matters more than average: 5% of users having a terrible experience is a real problem |
| Cost | Tokens per request, $ per request | A single prompt change can 10x your bill. Track daily to catch spikes early |
| Quality | User feedback, success rate | The only metric that actually matters; everything else is a proxy |
| Errors | Rate, types, retry success | LLM APIs fail more often than traditional APIs. 1-3% error rate is normal |
| Usage | Requests/min, active users | Capacity planning and detecting abuse (one user making 10K requests/day) |
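As a starting point for the latency row, percentiles can be computed from logged per-request latencies with nothing but the standard library. This is a minimal sketch; the function name `latency_percentiles` and the choice of the "inclusive" interpolation method are assumptions, not part of any tool covered here.

```python
import statistics
from typing import Dict, List

def latency_percentiles(latencies_ms: List[float]) -> Dict[str, float]:
    """Return P50/P95/P99 from a list of per-request latencies in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Example: most requests are fast, a few are very slow
samples = [800.0] * 95 + [12000.0] * 5
print(latency_percentiles(samples))
print("mean:", statistics.mean(samples))  # the mean hides how bad the slow tail is
```

Running this over a daily log file is enough to see whether the tail is drifting, even before a dashboard exists.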

LLM-Specific Metrics

from dataclasses import dataclass
from datetime import datetime
from typing import Optional, List

@dataclass
class LLMTrace:
    """Complete trace for an LLM interaction"""
    trace_id: str
    timestamp: datetime
    
    # Request
    model: str
    prompt: str
    system_prompt: Optional[str]
    messages: List[dict]
    
    # Response
    response: str
    finish_reason: str
    
    # Token usage
    input_tokens: int
    output_tokens: int
    total_tokens: int
    
    # Timing
    latency_ms: float
    time_to_first_token_ms: Optional[float]
    
    # Cost
    cost_usd: float
    
    # Context (for RAG)
    retrieved_documents: Optional[List[dict]]
    retrieval_latency_ms: Optional[float]
    
    # Tool calls
    tool_calls: Optional[List[dict]]
    tool_results: Optional[List[dict]]
    
    # Metadata
    user_id: Optional[str]
    session_id: Optional[str]
    environment: str
    version: str
    
    # Quality signals
    user_feedback: Optional[str]  # thumbs_up, thumbs_down
    success: Optional[bool]
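
The `cost_usd` field is usually derived from the token counts rather than returned by the API. The sketch below shows one way to do that; the model name and the per-million-token prices in `PRICE_PER_1M_TOKENS` are placeholders, so substitute your provider's current rates.

```python
from typing import Dict

# Hypothetical prices in USD per 1M tokens (placeholders, not real rates)
PRICE_PER_1M_TOKENS: Dict[str, Dict[str, float]] = {
    "example-model": {"input": 2.50, "output": 10.00},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    price = PRICE_PER_1M_TOKENS[model]  # raises KeyError for unknown models
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000

# 1,200 input tokens and 647 output tokens at the placeholder rates
print(estimate_cost_usd("example-model", 1200, 647))
```

Failing loudly on an unknown model is a deliberate choice here: a silent zero would hide exactly the cost spikes this field is meant to surface.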

Observability Tool Decision Framework

Before choosing a tool, answer three questions: (1) Do you need self-hosting for data privacy? (2) Are you already using LangChain? (3) Is this for development or production?
| Criteria | Langfuse | LangSmith | Phoenix | Custom (OTel) |
|---|---|---|---|---|
| Self-hosted | Yes (Docker) | Enterprise plan only (cloud otherwise) | Yes (local-first) | Yes |
| Free tier | 50K observations/mo | 5K traces/mo | Unlimited (local) | N/A |
| LangChain integration | Good (manual) | Native (automatic) | Good (OpenInference) | Manual |
| OpenAI SDK integration | Drop-in wrapper | Via @traceable | OpenInference auto-instrument | Manual |
| Evaluation workflows | Basic scoring | Full eval pipelines + datasets | LLM-as-judge built-in | Build your own |
| Data residency | Full control if self-hosted | US or EU cloud region | Full control | Full control |
| Team size sweet spot | 2-50 engineers | LangChain-heavy teams | Solo dev / prototyping | 50+ engineers with existing infra |
| Production readiness | High | High | Medium (better for dev) | Depends on your build |
| Setup time | 10 minutes | 5 minutes | 2 minutes | Days to weeks |
Decision shortcut: If you are using LangChain, start with LangSmith. If you need self-hosting, use Langfuse. If you just need to debug locally, use Phoenix. If you already have Grafana/Datadog, build custom with OpenTelemetry.
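
The shortcut can be written down as a small function, which is handy when the choice needs to be documented in a design review. This is only a sketch: the function name and parameters are hypothetical, and the precedence (self-hosting treated as a hard constraint that is checked first, Langfuse as the fallback) is an assumption layered on top of the shortcut.

```python
def suggest_tool(
    uses_langchain: bool = False,
    needs_self_hosting: bool = False,
    local_debugging_only: bool = False,
    has_existing_observability: bool = False,
) -> str:
    """Map the decision shortcut onto a single recommendation."""
    # Assumption: a self-hosting requirement is a hard constraint, so check it first
    if needs_self_hosting:
        return "Langfuse"
    if uses_langchain:
        return "LangSmith"
    if local_debugging_only:
        return "Phoenix"
    if has_existing_observability:
        return "Custom (OpenTelemetry)"
    # Assumption: with no strong signal, default to the general-purpose option
    return "Langfuse"

print(suggest_tool(uses_langchain=True))
print(suggest_tool(uses_langchain=True, needs_self_hosting=True))
```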

Langfuse: Open-Source LLM Observability

Langfuse is the open-source option that most teams start with, and for good reason: it can be self-hosted (important for data privacy), has a generous free tier on their cloud, and the integration is genuinely minimal — often just a decorator or a drop-in OpenAI client replacement. Think of it as “Datadog for LLMs.”

Setup

pip install langfuse
import os
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

# Initialize
langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host="https://cloud.langfuse.com"  # or self-hosted
)

Tracing LLM Calls

from openai import OpenAI
from langfuse.openai import openai  # Drop-in replacement

# Automatic tracing with OpenAI
client = openai.OpenAI()

@observe()  # This single decorator creates a full trace automatically
def chat(user_message: str, user_id: str) -> str:
    # Attach metadata so you can filter traces by user, feature, etc.
    langfuse_context.update_current_observation(
        user_id=user_id,
        metadata={"feature": "chat"}
    )
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}]
    )
    
    return response.choices[0].message.content

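@observe(name="retrieval")  # Nested @observe calls become child spans of the caller
def traced_retrieval(query: str) -> list:
    """Retrieval step, traced as its own span"""
    docs = retrieve_documents(query)  # retrieve_documents is your own retriever
    langfuse_context.update_current_observation(
        input={"query": query},
        output={"num_docs": len(docs)},
        metadata={"retrieval_strategy": "hybrid"}
    )
    return docs

@observe()
def rag_pipeline(query: str, user_id: str) -> dict:
    """Full RAG pipeline with tracing"""
    
    langfuse_context.update_current_observation(name="rag-pipeline")
    
    # Retrieval runs as a separate child span
    docs = traced_retrieval(query)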
    
    # Build context
    context = "\n".join([d["content"] for d in docs])
    
    # LLM generation (auto-traced)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based on context."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"}
        ]
    )
    
    answer = response.choices[0].message.content
    
    # Score the trace
    langfuse_context.score_current_trace(
        name="relevance",
        value=0.9,  # From your evaluation
        comment="High relevance based on source alignment"
    )
    
    return {"answer": answer, "sources": docs}

Custom Metrics and Evaluations

from langfuse import Langfuse

langfuse = Langfuse()

def evaluate_and_log(trace_id: str, answer: str, expected: str):
    """Log evaluation scores to Langfuse"""
    
    # Calculate metrics
    is_correct = check_answer(answer, expected)
    relevance = calculate_relevance(answer, expected)
    
    # Log scores
    langfuse.score(
        trace_id=trace_id,
        name="correctness",
        value=1.0 if is_correct else 0.0
    )
    
    langfuse.score(
        trace_id=trace_id,
        name="relevance",
        value=relevance
    )

def log_user_feedback(trace_id: str, feedback: str):
    """Log user feedback"""
    langfuse.score(
        trace_id=trace_id,
        name="user_feedback",
        value=1.0 if feedback == "thumbs_up" else 0.0,
        comment=feedback
    )

LangSmith: LangChain’s Platform

LangSmith is the natural choice if you are already in the LangChain ecosystem: tracing is automatic for chains, agents, and LangGraph workflows. The trade-off: it is a closed-source service, and self-hosting is offered only on the Enterprise plan, so it may not work for teams with strict data residency requirements. Where it shines is the evaluation workflow: you can build test datasets, run automated evals, and compare prompt versions all in one place.

Setup

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-ai-app"

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langsmith import traceable

llm = ChatOpenAI(model="gpt-4o")

Tracing Chains

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Chains are automatically traced
prompt = ChatPromptTemplate.from_template(
    "Answer this question: {question}"
)

chain = (
    {"question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Invoke with metadata
response = chain.invoke(
    "What is machine learning?",
    config={
        "metadata": {"user_id": "user_123"},
        "tags": ["production", "v2"]
    }
)

Custom Tracing

from langsmith import traceable

@traceable(name="custom-rag")
def my_rag_function(query: str) -> str:
    """Custom function with tracing"""
    docs = retrieve(query)
    answer = generate(query, docs)
    return answer

@traceable(run_type="retriever")
def retrieve(query: str) -> list:
    # Traced as retriever type
    return vector_db.search(query)

@traceable(run_type="llm")
def generate(query: str, docs: list) -> str:
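    # Traced as LLM call; format_prompt is your own prompt builder
    # llm.invoke returns a message object, so return its text content
    return llm.invoke(format_prompt(query, docs)).content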

Feedback and Evaluation

from langsmith import Client

client = Client()

# Log feedback
client.create_feedback(
    run_id="run-uuid",
    key="correctness",
    score=1.0,
    comment="Answer was accurate"
)

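# Run evaluations
from langsmith.evaluation import evaluate, LangChainStringEvaluator

def my_custom_evaluator(run, example) -> dict:
    """Example custom evaluator: did the target return a non-empty answer?"""
    answer = (run.outputs or {}).get("answer", "")
    return {"key": "has_answer", "score": int(bool(answer))}

results = evaluate(
    lambda inputs: {"answer": my_rag_function(inputs["question"])},
    data="my-dataset",  # Name of an existing LangSmith dataset
    evaluators=[
        LangChainStringEvaluator("qa"),  # Built-in QA evaluator
        LangChainStringEvaluator("criteria", config={"criteria": "relevance"}),
        my_custom_evaluator
    ]
)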

Arize Phoenix: Open-Source Tracing

Phoenix takes a different approach: local-first observability. It runs entirely on your machine with a beautiful UI, making it ideal for development and debugging. You do not need to send data anywhere — just pip install and launch. For production, you can export traces to any OpenTelemetry-compatible backend. Think of Phoenix as “the development tool” and Langfuse/LangSmith as “the production tool.”

Setup

pip install arize-phoenix openinference-instrumentation-openai
import phoenix as px

# Launch Phoenix UI
session = px.launch_app()
print(f"Phoenix UI: {session.url}")

# Instrument OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Now all OpenAI calls are traced
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Tracing RAG Pipelines

from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

# Instrument LlamaIndex
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

# Now all LlamaIndex operations are traced
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Query is fully traced
response = query_engine.query("What is the main topic?")

Custom Observability Stack

When should you build your own instead of using Langfuse or LangSmith? Two scenarios: (1) you have strict compliance requirements that prevent sending data to third-party services and you cannot self-host Langfuse, or (2) you need tight integration with existing infrastructure (your own Grafana dashboards, your own alerting pipeline, your own data warehouse). For most teams, start with a managed tool and only build custom when you outgrow it. If you do build your own, the stack below combines structured logging with OpenTelemetry spans:
import time
import uuid
import json
from datetime import datetime
from dataclasses import dataclass, asdict
from typing import Optional, Any
import structlog
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure structured logging
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)
logger = structlog.get_logger()

# Configure OpenTelemetry
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

class LLMObserver:
    """Custom LLM observability"""
    
    def __init__(self, service_name: str):
        self.service_name = service_name
        # MetricsCollector is your own class exposing record_latency() and record_cost()
        self.metrics_collector = MetricsCollector()
    
    def trace_llm_call(
        self,
        model: str,
        messages: list,
        user_id: Optional[str] = None,
        session_id: Optional[str] = None
    ):
        """Context manager for tracing LLM calls"""
        return LLMTraceContext(
            observer=self,
            model=model,
            messages=messages,
            user_id=user_id,
            session_id=session_id
        )

class LLMTraceContext:
    """Context manager for LLM call tracing"""
    
    def __init__(self, observer: LLMObserver, **kwargs):
        self.observer = observer
        self.trace_id = str(uuid.uuid4())
        self.start_time = None
        self.kwargs = kwargs
    
    def __enter__(self):
        self.start_time = time.perf_counter()
        
        # Start OTel span
        self.span = tracer.start_span(
            "llm.call",
            attributes={
                "llm.model": self.kwargs["model"],
                # user_id is always present in kwargs (possibly None), so fall back explicitly
                "llm.user_id": self.kwargs.get("user_id") or "anonymous",
                "trace.id": self.trace_id
            }
        )
        
        logger.info(
            "llm_call_started",
            trace_id=self.trace_id,
            model=self.kwargs["model"],
            user_id=self.kwargs.get("user_id")
        )
        
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        duration_ms = (time.perf_counter() - self.start_time) * 1000
        
        if exc_type:
            self.span.set_status(trace.Status(trace.StatusCode.ERROR, str(exc_val)))
            logger.error(
                "llm_call_failed",
                trace_id=self.trace_id,
                error=str(exc_val),
                duration_ms=duration_ms
            )
        else:
            self.span.set_status(trace.Status(trace.StatusCode.OK))
        
        self.span.set_attribute("llm.duration_ms", duration_ms)
        self.span.end()
        
        # Record metrics
        self.observer.metrics_collector.record_latency(
            self.kwargs["model"],
            duration_ms
        )
    
    def record_response(
        self,
        response: str,
        input_tokens: int,
        output_tokens: int,
        cost_usd: float
    ):
        """Record response details"""
        self.span.set_attributes({
            "llm.input_tokens": input_tokens,
            "llm.output_tokens": output_tokens,
            "llm.total_tokens": input_tokens + output_tokens,
            "llm.cost_usd": cost_usd
        })
        
        logger.info(
            "llm_call_completed",
            trace_id=self.trace_id,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost_usd
        )
        
        # Record cost metrics
        self.observer.metrics_collector.record_cost(
            self.kwargs["model"],
            cost_usd
        )
    
    def record_feedback(self, feedback: str, score: float):
        """Record user feedback"""
        logger.info(
            "user_feedback",
            trace_id=self.trace_id,
            feedback=feedback,
            score=score
        )

# Usage
observer = LLMObserver("my-ai-app")

def chat(user_message: str, user_id: str) -> str:
    # client is an OpenAI client; calculate_cost is your own pricing helper
    with observer.trace_llm_call(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}],
        user_id=user_id
    ) as llm_trace:  # Not named "trace", which would shadow the OpenTelemetry module
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": user_message}]
        )
        
        llm_trace.record_response(
            response=response.choices[0].message.content,
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens,
            cost_usd=calculate_cost(response.usage)
        )
        
        return response.choices[0].message.content
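
The observer above relies on a `MetricsCollector` that is not defined in this section. A minimal in-memory stand-in might look like the sketch below; it assumes only the two methods the observer calls (`record_latency` and `record_cost`), and a production version would more likely forward these values to Prometheus or StatsD.

```python
from collections import defaultdict
from typing import DefaultDict, List

class MetricsCollector:
    """In-memory stand-in for the collector used by the custom observer (hypothetical)."""

    def __init__(self) -> None:
        # Per-model latency samples and running cost totals
        self.latencies_ms: DefaultDict[str, List[float]] = defaultdict(list)
        self.cost_usd: DefaultDict[str, float] = defaultdict(float)

    def record_latency(self, model: str, duration_ms: float) -> None:
        self.latencies_ms[model].append(duration_ms)

    def record_cost(self, model: str, cost_usd: float) -> None:
        self.cost_usd[model] += cost_usd

collector = MetricsCollector()
collector.record_latency("gpt-4o", 2300.0)
collector.record_cost("gpt-4o", 0.012)
print(dict(collector.cost_usd))
```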

Dashboards and Alerting

Key Dashboards

# Prometheus metrics for Grafana
from prometheus_client import Counter, Histogram, Gauge

# Counters
llm_requests_total = Counter(
    "llm_requests_total",
    "Total LLM requests",
    ["model", "status", "environment"]
)

# Histograms
llm_latency_seconds = Histogram(
    "llm_latency_seconds",
    "LLM request latency",
    ["model"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

llm_tokens_used = Histogram(
    "llm_tokens_used",
    "Tokens used per request",
    ["model", "token_type"],
    buckets=[100, 500, 1000, 2000, 4000, 8000, 16000]
)

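# Cost counter (a cumulative total only goes up, so use Counter rather than Gauge)
llm_cost_usd = Counter(
    "llm_cost_usd",  # prometheus_client exposes this as llm_cost_usd_total
    "Cumulative LLM cost in USD",
    ["model"]
)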

Alert Rules

The alerts below represent hard-won lessons from production LLM systems. The latency alert catches model provider degradation (which happens more often than you would expect). The error rate alert catches prompt regressions after deployments. The cost alert prevents runaway spending from infinite loops or unexpectedly verbose prompts.
# prometheus_alerts.yml
groups:
  - name: llm_alerts
    rules:
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(llm_latency_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High LLM latency detected"
          
      - alert: LLMErrorRateHigh
        expr: |
          sum(rate(llm_requests_total{status="error"}[5m]))
          / sum(rate(llm_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate above 5%"
          
      - alert: DailyCostExceeded
        expr: sum(increase(llm_cost_usd_total[1d])) > 1000
        labels:
          severity: warning
        annotations:
          summary: "Daily LLM cost exceeded $1000"

What to Alert On vs. What to Dashboard

This is a common mistake: teams alert on everything and get paged for non-issues, or dashboard everything and miss real problems. Here is the split:
| Metric | Alert (page someone) | Dashboard (review daily) | Why |
|---|---|---|---|
| Error rate > 5% for 5 min | Yes | Yes | Immediate user impact |
| P95 latency > 10s for 5 min | Yes | Yes | Users are waiting, may abandon |
| Daily cost > 2x normal | Yes | Yes | Runaway loop or abuse |
| P50 latency creeping up | No | Yes | Gradual trend, not an emergency |
| Token usage per request increasing | No | Yes | Prompt drift, review weekly |
| Single user negative feedback | No | Yes | Noise; look for patterns |
| Model provider returning 503 | Yes (if sustained 2+ min) | Yes | Transient blips are normal |
| New model version deployed | No | Yes | Compare before/after quality |
Edge case — alert fatigue: If your LLM error rate naturally sits at 2-3% (normal for most providers), do not set your alert threshold at 3%. Set it at 5% or use a rate-of-change alert: “error rate increased by 2x in the last 10 minutes.”
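
The rate-of-change idea can be sketched in a few lines of Python. The function names, the 5% absolute ceiling, the 2x spike factor, and the 2% floor (`min_rate`, added so that a jump from 0.1% to 0.2% does not page anyone) are illustrative assumptions to tune against your own baseline.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests in a window; 0.0 when the window is empty."""
    return errors / total if total else 0.0

def should_page(
    prev_errors: int,
    prev_total: int,
    curr_errors: int,
    curr_total: int,
    absolute_threshold: float = 0.05,
    spike_factor: float = 2.0,
    min_rate: float = 0.02,
) -> bool:
    """Page when the error rate is high outright or has spiked versus the previous window."""
    prev = error_rate(prev_errors, prev_total)
    curr = error_rate(curr_errors, curr_total)
    if curr > absolute_threshold:
        return True
    # Rate-of-change check: the current window is at least spike_factor times the previous one
    return prev > 0 and curr >= min_rate and curr >= spike_factor * prev

print(should_page(25, 1000, 30, 1000))  # 2.5% -> 3.0%: normal noise
print(should_page(10, 1000, 25, 1000))  # 1.0% -> 2.5%: more than doubled
```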

Debugging LLM Issues

Debugging LLM issues is fundamentally different from debugging traditional software. The “bug” is often not in your code at all — it is in the interaction between your prompt, the retrieved context, and the model’s interpretation. The debugger below codifies the most common failure patterns so you can diagnose issues systematically rather than staring at logs hoping for insight.

Common Issues and Diagnosis

class LLMDebugger:
    """Debug common LLM issues"""
    
    def analyze_trace(self, trace_id: str) -> dict:
        """Analyze a trace for issues"""
        # fetch_trace is your own lookup; it should return an LLMTrace-like object
        trace = self.fetch_trace(trace_id)
        issues = []
        
        # Check for high latency
        if trace.latency_ms > 5000:
            issues.append({
                "type": "high_latency",
                "details": f"Latency {trace.latency_ms}ms exceeds threshold",
                "suggestions": [
                    "Use streaming for long responses",
                    "Consider smaller model for simple tasks",
                    "Check network latency to API provider"
                ]
            })
        
        # Check for token explosion
        if trace.input_tokens > 10000:
            issues.append({
                "type": "large_context",
                "details": f"Input tokens {trace.input_tokens} is high",
                "suggestions": [
                    "Reduce context size",
                    "Summarize long documents",
                    "Use better chunking strategy"
                ]
            })
        
        # Check for quality issues
        if trace.user_feedback == "thumbs_down":
            issues.append({
                "type": "quality_issue",
                "details": "User gave negative feedback",
                "suggestions": [
                    "Review retrieved documents for relevance",
                    "Check system prompt clarity",
                    "Analyze response for hallucinations"
                ]
            })
        
        return {
            "trace_id": trace_id,
            "issues": issues,
            "trace_details": trace
        }

Debugging Decision Tree

When a user reports “the AI gave a wrong answer,” use this systematic approach:
| Step | Check | What You Are Looking For | Tool |
|---|---|---|---|
| 1 | Trace ID lookup | Find the exact request that went wrong | Langfuse/LangSmith trace view |
| 2 | Retrieved context | Were the right documents retrieved? Was relevant info missing? | RAG span in trace |
| 3 | System prompt | Was the correct prompt version active? Any recent changes? | Prompt registry / version history |
| 4 | Input tokens | Was context truncated due to token limits? | Token count in trace |
| 5 | Model response | Did the model hallucinate, or did it follow bad context? | Compare response to retrieved context |
| 6 | Tool calls | Did the model call the right tools with correct arguments? | Tool call span in trace |
| 7 | Post-processing | Was the response correctly parsed and formatted? | Application logs |
80% of “wrong answer” bugs fall into three categories: (1) bad retrieval — the right document was not retrieved, (2) context truncation — the relevant information was cut off to fit the context window, (3) prompt regression — a recent prompt change broke an edge case. Check these first before investigating model-level issues.

Key Takeaways

Trace Everything

Log every LLM call with inputs, outputs, tokens, latency, and cost.

Structured Logging

Use structured logs (JSON) for easy querying and analysis.

Track Quality

Collect user feedback and run automated evaluations.

Set Alerts

Alert on latency spikes, error rates, and cost anomalies.

What’s Next

AI Security & Guardrails

Learn how to secure LLM applications and implement safety guardrails

Interview Deep-Dive

Strong Answer:
  • Day one, I set up five things. First, structured logging of every LLM call: model name, input token count, output token count, latency in milliseconds, HTTP status code, and a request ID that ties back to the user’s session. This is a simple JSON log line per request — no fancy infrastructure needed, just write to stdout and let your log aggregator (Datadog, CloudWatch, whatever you already have) index it. Second, cost tracking: I compute the dollar cost per request based on token counts and model pricing, and I emit it as a metric. I set up a daily cost alert at 150% of the expected daily spend. This catches prompt engineering mistakes and infinite loops before they drain the budget. Third, error rate monitoring with alerting at 5% over a 5-minute window. LLM API error rates above 3% usually indicate a provider issue, not your code. Fourth, latency percentiles — P50, P95, P99 — because average latency is misleading. If P50 is 800ms but P99 is 12 seconds, 1% of your users are having a terrible experience. Fifth, a simple “thumbs up / thumbs down” feedback mechanism in the UI. This is the only metric that directly measures output quality, and you need it from day one to establish a baseline.
  • In the first month, I layer on three things. First, trace-level observability using Langfuse or LangSmith, where each user request produces a full trace showing the prompt, retrieved context (if RAG), model response, and any tool calls. This makes debugging specific user complaints trivial — “show me the trace for request X” gives you everything you need. Second, automated quality evaluation: I sample 5-10% of production traffic and run it through an LLM judge that scores responses on relevance, accuracy, and helpfulness. This catches gradual quality degradation that no single user complaint would reveal. Third, a cost breakdown dashboard that shows spend by model, by feature, by user segment, and by day. This is where you discover that one power user is consuming 30% of your budget, or that the summarization feature is 10x more expensive per request than chat.
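Sampling 5-10% of production traffic for automated evaluation can be done deterministically by hashing the trace ID, so a given request is always either in or out of the sample. This is a sketch under that assumption; the function name `should_sample` and the use of SHA-256 are illustrative choices.

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of traces for evaluation."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

sampled = sum(should_sample(f"trace-{i}") for i in range(10_000))
print(f"sampled {sampled} of 10000 traces")
```

Hash-based sampling avoids shared state between workers and makes reruns reproducible, which a random draw per request would not.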
Follow-up: How do you debug a situation where users are reporting bad responses but your automated metrics all look healthy?
This is the classic “metrics are green but users are unhappy” scenario, and it happens more often than you would think with LLM applications. The first step is to pull the specific traces for the complaining users and read them manually. Often the issue is obvious once you see the actual prompt and response: the model is technically answering the question but missing the user’s intent, or it is providing accurate but unhelpful information. The second step is to check whether the bad responses share a pattern: are they all from the same user segment, the same feature, the same time window, or the same query topic? I have seen cases where a prompt template change broke responses specifically for questions about refunds; general quality metrics stayed flat because refund questions were only 3% of traffic, but 100% of refund users were unhappy. The third step is to compare the bad responses against your golden dataset. If the golden dataset still passes, the problem is a coverage gap: you have a category of queries that your golden dataset does not represent. Add the failing queries to the dataset. If the golden dataset also fails, something more fundamental changed, such as a model update, a retrieval regression, or a data pipeline issue.
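The refund example shows why feedback should be broken down by segment rather than read as one aggregate. A minimal sketch, assuming each feedback record is a dict with a `topic` key and a `feedback` value of `thumbs_up` or `thumbs_down` (both field names are hypothetical):

```python
from collections import defaultdict
from typing import Dict, List

def thumbs_down_rate_by_segment(
    feedback: List[dict], key: str = "topic"
) -> Dict[str, float]:
    """Compute the thumbs-down rate per segment instead of one global rate."""
    totals: Dict[str, int] = defaultdict(int)
    downs: Dict[str, int] = defaultdict(int)
    for item in feedback:
        segment = item[key]
        totals[segment] += 1
        if item["feedback"] == "thumbs_down":
            downs[segment] += 1
    return {segment: downs[segment] / totals[segment] for segment in totals}

# 97 general queries (5 unhappy) and 3 refund queries (all unhappy)
records = (
    [{"topic": "general", "feedback": "thumbs_down"}] * 5
    + [{"topic": "general", "feedback": "thumbs_up"}] * 92
    + [{"topic": "refunds", "feedback": "thumbs_down"}] * 3
)
print(thumbs_down_rate_by_segment(records))
```

The overall thumbs-down rate here is only 8%, but the per-segment view makes the broken refunds flow obvious.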
Strong Answer:
  • Langfuse is open-source, can be self-hosted, and has a generous free cloud tier. Its integration is minimal — you can get tracing working with a single decorator or a drop-in OpenAI client replacement. It is framework-agnostic, so it works whether you are using LangChain, LlamaIndex, raw OpenAI calls, or your own framework. The trade-off is that the evaluation and dataset management features are less mature than LangSmith’s.
  • LangSmith is the natural choice if you are already invested in the LangChain ecosystem. Tracing is automatic for chains, agents, and LangGraph workflows. The evaluation workflow is the most polished: you can build test datasets, run automated evaluations, compare prompt versions, and track metrics over time, all from a single UI. The trade-offs: it is closed-source, and self-hosting is offered only on the Enterprise plan, which can be a dealbreaker for companies with data residency requirements. On the hosted service, all your prompts, user queries, and model responses are sent to LangChain’s servers.
  • Arize Phoenix takes a local-first approach — it runs entirely on your machine with a browser UI, which makes it ideal for development and debugging. You do not need to send data anywhere. For production, you can export traces to any OpenTelemetry-compatible backend (Jaeger, Grafana Tempo, Datadog). The trade-off is that Phoenix is primarily a development tool; its production monitoring capabilities are less polished than Langfuse or LangSmith.
  • For a startup, I would start with Langfuse Cloud for simplicity. You get tracing, cost tracking, and user feedback collection in under an hour of integration work. As you grow, you can self-host Langfuse to reduce costs and gain data control. For an enterprise with data privacy requirements, I would self-host Langfuse or build a custom observability stack on OpenTelemetry. The custom stack is more engineering effort upfront but integrates seamlessly with existing Grafana dashboards, PagerDuty alerting, and data warehouse infrastructure that enterprises already have. I would never recommend LangSmith for an enterprise that handles PII in user queries — the data residency risk is not worth the convenience.
Follow-up: If you were building a custom observability stack from scratch for LLM applications, what would you build on top of OpenTelemetry?
I would model LLM calls as OpenTelemetry spans with a set of standardized semantic attributes: llm.model, llm.input_tokens, llm.output_tokens, llm.cost_usd, llm.latency_ms, llm.finish_reason. For RAG pipelines, the retrieval step gets its own child span with retrieval.num_docs, retrieval.latency_ms, and retrieval.strategy. I would export spans to a backend that supports both real-time dashboarding (Grafana with Tempo) and long-term analytics (a data warehouse like BigQuery or Snowflake). The critical addition on top of standard OpenTelemetry is content logging: the actual prompts and responses. OTel spans are designed for structured metadata, not large text blobs. I would log the full prompt and response to a separate store (S3 or a document database) and include a reference ID in the span. This keeps the tracing infrastructure fast while still giving me the ability to inspect full conversations when debugging. The alert rules would be: cost per hour exceeding 2x baseline, error rate above 3% for 5 minutes, P95 latency above 5 seconds for 5 minutes, and user feedback thumbs-down rate above 15% for any 1-hour window.
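The "store content separately, keep a reference in the span" pattern can be sketched with the standard library. Here a plain dict stands in for S3 or a document database, the reference ID is a content hash, and the attribute names (`llm.content_ref` and the character counts) are hypothetical; none of these choices come from OpenTelemetry itself.

```python
import hashlib
import json
from typing import Dict

def offload_content(prompt: str, response: str, store: Dict[str, str]) -> Dict[str, object]:
    """Write the full prompt/response to a blob store and return small span attributes."""
    payload = json.dumps({"prompt": prompt, "response": response}, sort_keys=True)
    # Assumption: use a content hash as the reference ID (a UUID would also work)
    content_ref = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    store[content_ref] = payload
    # Only lightweight metadata goes on the span, never the text itself
    return {
        "llm.content_ref": content_ref,
        "llm.prompt_chars": len(prompt),
        "llm.response_chars": len(response),
    }

blob_store: Dict[str, str] = {}  # stand-in for S3 or a document database
attributes = offload_content("What is our refund policy?", "Refunds are ...", blob_store)
print(attributes)
```

The returned dict is what would be passed to the span's attributes; looking up `llm.content_ref` in the store recovers the full conversation when debugging.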
Strong Answer:
  • Silent model updates are one of the most frustrating aspects of building on third-party LLM APIs. OpenAI has done this multiple times — the model behind the “gpt-4” endpoint changes behavior without any notification. Your prompts that worked perfectly last week suddenly produce slightly different formatting, miss edge cases, or change their interpretation of ambiguous instructions. And because the change is gradual (not a hard failure), your error rate and latency metrics look perfectly normal.
  • The detection strategy is continuous evaluation against a stable golden dataset. I run a subset of my golden dataset (50-100 cases) through the production model every 6 hours and compute quality scores. I track these scores as a time series. A sudden drop (more than 10% in one interval) triggers an alert. A gradual decline (5% over a week) triggers a review. The key is that the golden dataset and the evaluation criteria do not change — so any score change must be attributable to the model.
  • I also track the system_fingerprint field that OpenAI returns with each response. The fingerprint identifies the backend configuration the model runs with, so when it changes I treat it as a possible model update and proactively run a full evaluation suite rather than waiting for the scheduled run. This lets me detect the change as soon as it shows up in production traffic rather than waiting up to 6 hours for the next scheduled run.
  • The response strategy depends on the severity. For minor regressions (formatting changes, slightly different phrasing), I update my parsing logic and acceptance criteria. For moderate regressions (quality drop of 5-10% on critical use cases), I adjust the prompt to compensate — often adding more explicit instructions or examples that anchor the model to the desired behavior. For severe regressions (quality drop above 10% or new failure modes), I switch to a pinned model version if available (like gpt-4-0613 instead of gpt-4), or fail over to an alternative provider while I investigate.
  • The long-term mitigation is reducing provider dependency. I maintain a backup prompt variant tested against Claude, so that if OpenAI’s model degrades, I can redirect traffic to Anthropic within minutes. This is not about permanently switching — it is about having a hot standby that buys me time to investigate and fix the OpenAI-specific prompt.
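The detection thresholds above (a sudden drop of more than 10% in one interval, a gradual decline of 5% over a week) can be sketched as a small classifier over the time series of golden-dataset scores. This is an illustration under assumptions: `classify_score_trend` is a hypothetical helper, drops are measured as relative changes, and 28 runs per week follows from the 6-hour schedule.

```python
def classify_score_trend(scores, sudden_drop=0.10, gradual_drop=0.05,
                         runs_per_week=28):
    """Classify the latest golden-dataset score as "alert", "review", or "ok".

    scores: chronological quality scores (0 to 1), one per scheduled run.
    """
    if len(scores) < 2:
        return "ok"  # not enough history to compare against

    previous, latest = scores[-2], scores[-1]
    # Sudden drop: relative decline of more than 10% since the previous run.
    if previous > 0 and (previous - latest) / previous > sudden_drop:
        return "alert"

    # Gradual decline: compare with the run from one week ago
    # (or the oldest run available if history is shorter than a week).
    week_ago = scores[-(runs_per_week + 1):][0]
    if week_ago > 0 and (week_ago - latest) / week_ago >= gradual_drop:
        return "review"

    return "ok"
```

Because the golden inputs and scoring criteria stay fixed, the only moving part in this series is the model, which is what makes such a simple comparison meaningful.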
Follow-up: How do you distinguish between a genuine model degradation and a change in your user query distribution that makes the model appear worse? This is a critical distinction and one that most teams get wrong. If your users suddenly start asking more complex questions (maybe a new marketing campaign brought in a more technical audience), your quality metrics will drop even though the model has not changed. My approach is two-fold. First, I always evaluate against the fixed golden dataset as described above. This is immune to user distribution changes because the inputs are constant. If the golden dataset scores are stable but production quality metrics are dropping, the issue is distribution shift, not model degradation. Second, I segment production quality metrics by query category. I classify each query into categories (simple factual, complex reasoning, creative, code, etc.) and track quality per category. If “complex reasoning” quality dropped but “simple factual” stayed constant, and I also see that the proportion of complex reasoning queries increased, I can attribute the overall quality drop to distribution shift rather than model degradation. The fix for distribution shift is different from the fix for model degradation: you need to improve your prompts for the underperforming query category, not switch providers.
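The two checks above can be sketched in a few lines. Both function names and the 0.02 tolerance are assumptions made for illustration. The first function applies the golden-dataset test; the second estimates how much of an overall quality change comes purely from the query mix changing, with per-category quality held fixed.

```python
def diagnose_quality_drop(golden_before, golden_after,
                          prod_before, prod_after, tolerance=0.02):
    """Attribute a quality drop using the fixed golden dataset as a control."""
    if golden_before - golden_after > tolerance:
        # Inputs are constant, so a golden-set drop points at the model.
        return "model_degradation"
    if prod_before - prod_after > tolerance:
        # Golden set is stable but production dropped: the queries changed.
        return "distribution_shift"
    return "no_significant_change"


def mix_shift_effect(share_before, share_after, quality_by_category):
    """Overall quality change caused only by the category mix changing."""
    return sum(
        (share_after[category] - share_before[category]) * quality
        for category, quality in quality_by_category.items()
    )
```

For example, if complex-reasoning queries grow from 20% to 50% of traffic while per-category quality stays at 0.95 (simple) and 0.70 (complex), `mix_shift_effect` reports an overall drop of about 0.075 even though the model behaves exactly as before.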
Strong Answer:
  • The biggest mistake teams make is setting alert thresholds based on intuition rather than data. “Alert if latency exceeds 5 seconds” sounds reasonable but might fire 50 times a day if your P99 is naturally 6 seconds. My approach is to run the system for 1-2 weeks with logging but no alerts, establish baselines for all metrics, and then set thresholds at statistically meaningful deviations from those baselines.
  • I use three severity levels. Critical alerts (page someone immediately): error rate above 10% for 3 minutes (the system is probably down), cost per hour exceeding 5x baseline (runaway loop or prompt explosion), and zero successful requests for 2 minutes (complete outage). Warning alerts (review within 1 hour): P95 latency above 2x baseline for 10 minutes, error rate above 5% for 5 minutes, daily cost exceeding 150% of expected. Informational alerts (review next business day): user thumbs-down rate above 20% for any 4-hour window, model fingerprint change detected, token usage per request trending up over 7 days.
  • To avoid alert fatigue, I follow three rules. First, every alert must have a clear action. “Latency is high” is not actionable. “P95 latency for model gpt-4o in region us-east-1 exceeded 8 seconds for 10 minutes” tells the on-call engineer exactly what to investigate. Second, I use alert aggregation and deduplication — if the same alert fires every minute for an hour, the on-call gets one notification, not 60. Third, I do monthly alert reviews: if an alert fired more than 5 times in a month without resulting in a meaningful action, I either fix the underlying issue, adjust the threshold, or remove the alert.
  • One LLM-specific alert that most teams miss: I alert on output length anomalies. If the average output token count suddenly doubles, it often means the model started adding unnecessary preambles, repeating itself, or getting stuck in a verbose pattern. This is an early signal of prompt regression or model behavior change, and it shows up before quality metrics degrade.
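As an illustration of the output length alert, the sketch below compares the average output token count in a recent window against a baseline window and flags the case where it has at least doubled. The function name, the window inputs, and the 2.0 default ratio are assumptions; in practice the baseline would come from the 1-2 week baselining period described above.

```python
from statistics import mean


def output_length_anomaly(baseline_tokens, recent_tokens, ratio_threshold=2.0):
    """Return True if recent average output length is >= threshold x baseline.

    baseline_tokens / recent_tokens: lists of output token counts per request.
    """
    if not baseline_tokens or not recent_tokens:
        return False  # no data, nothing to compare
    baseline_avg = mean(baseline_tokens)
    return baseline_avg > 0 and mean(recent_tokens) >= ratio_threshold * baseline_avg
```

A ratio against a baseline is used instead of a fixed token limit so the same check works for both terse and verbose endpoints.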
Follow-up: How do you handle cost alerting when your LLM spend is legitimately variable, for example when some days have 3x the traffic of others? Fixed-threshold cost alerts do not work for variable workloads. My approach is to normalize cost by traffic volume and alert on cost-per-request anomalies rather than absolute spend. If your typical cost-per-request is $0.02 and it suddenly jumps to $0.06, that is a 3x anomaly worth investigating regardless of whether total traffic is high or low. For the absolute spend alert, I use a dynamic threshold based on a rolling 7-day average adjusted for day-of-week seasonality. Monday typically has different traffic patterns than Saturday. I compute the expected spend for this specific day-of-week and hour-of-day and alert at 2x that expected value. This eliminates false positives from legitimate traffic spikes while still catching genuine cost anomalies like a prompt change that tripled token usage.
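A minimal sketch of both cost checks, under stated assumptions: the expected spend per slot is simplified to a plain average of historical spend for the same weekday and hour, and all function names are hypothetical. The 2x multiplier on absolute spend comes from the answer above; using the same multiplier for cost-per-request is an assumption.

```python
from collections import defaultdict
from statistics import mean


def expected_spend_by_slot(history):
    """Average historical spend per (weekday, hour) slot.

    history: iterable of (weekday, hour, spend_usd) tuples.
    """
    buckets = defaultdict(list)
    for weekday, hour, spend in history:
        buckets[(weekday, hour)].append(spend)
    return {slot: mean(values) for slot, values in buckets.items()}


def spend_alert(expected, weekday, hour, actual_spend, multiplier=2.0):
    """Alert when spend exceeds multiplier x the expected value for this slot."""
    baseline = expected.get((weekday, hour))
    if baseline is None:
        return False  # no history for this slot yet
    return actual_spend > multiplier * baseline


def cost_per_request_alert(total_cost, request_count, baseline_cpr, multiplier=2.0):
    """Alert on cost-per-request anomalies, independent of traffic volume."""
    if request_count == 0:
        return False
    return total_cost / request_count > multiplier * baseline_cpr
```

With a baseline of $0.02 per request, $60 across 1,000 requests ($0.06 each) triggers the per-request alert, while the same $60 across 3,000 requests does not, because it is just three times the traffic at the normal unit cost.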