December 2025 Update: Covers LLM-as-Judge, automated evaluation pipelines, A/B testing for AI, and observability with LangSmith/Langfuse.
Why Evaluation Matters
Think of evaluating an LLM like grading essays rather than grading a math test. With deterministic code, you check “does 2 + 2 equal 4?” and you are done. With LLMs, the output is creative, variable, and subjective, more like asking “is this essay well-argued?” You need a fundamentally different testing philosophy: one built on rubrics, statistical thresholds, and automated judges rather than exact-match assertions.
Building AI is easy. Building AI that works reliably is hard. Without proper evaluation:
- You ship broken features: a prompt tweak that improves one case silently breaks ten others
- Regressions go unnoticed: unlike a compile error, quality degradation is invisible without measurement
- Users lose trust: inconsistent AI outputs erode confidence faster than occasional downtime
- Costs spiral out of control: without benchmarks, you cannot tell if a cheaper model would perform just as well
The Testing Gap: Most teams test traditional code thoroughly but deploy AI features with zero evaluation. This module fixes that.
The Evaluation Stack
Building Eval Datasets
Your eval dataset is the single most important artifact in your AI project: more important than the prompt, more important than the model choice. It is the ground truth that every other decision is measured against. Think of it like a test suite for traditional code: without it, you are deploying on vibes.
Dataset Structure
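One way to represent an eval example is sketched below. The field names (`input`, `expected`, `category`, `tags`) and the JSONL storage layout are illustrative assumptions, not a format this module prescribes; adapt them to whatever your system actually consumes.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class EvalExample:
    """One test case. Field names are illustrative assumptions."""
    id: str
    input: str                       # the user query sent to the system
    expected: str                    # a known-good reference answer
    category: str = "general"        # lets you slice pass rates by case type
    tags: list[str] = field(default_factory=list)


def to_jsonl(examples: list[EvalExample]) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(ex)) for ex in examples)


def from_jsonl(text: str) -> list[EvalExample]:
    """Parse JSONL back into EvalExample objects, skipping blank lines."""
    return [EvalExample(**json.loads(line)) for line in text.splitlines() if line.strip()]


examples = [
    EvalExample("refund-001", "How do I get a refund?",
                "Refunds are issued within 30 days.", "happy_path"),
    EvalExample("scope-001", "What is the weather today?",
                "I can only help with billing questions.", "out_of_scope", ["refusal"]),
]
round_tripped = from_jsonl(to_jsonl(examples))
```

Keeping one example per line makes diffs readable in code review, which matters once the dataset lives under version control.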
Creating a Golden Dataset
A “golden dataset” is your curated set of representative examples with known-good outputs. Practical tip: start with 20-50 examples covering your most important cases, then grow organically by adding every production failure as a new test case. Within a few months you will have a dataset that catches most regressions before they ship.
Unit Testing for AI
Even though LLM outputs are non-deterministic, plenty of properties are testable with traditional assertions: “the response is valid JSON,” “it is under the token limit,” “it does not leak the system prompt.” These are the low-hanging fruit: cheap to write, fast to run, and they catch real bugs. Think of them as smoke tests: they will not tell you the response is good, but they will tell you when something is fundamentally broken.
Deterministic Checks
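The three properties named above can be checked with plain functions, as in this minimal sketch. The function names are hypothetical, the token check counts whitespace-separated words as a rough stand-in for real tokens, and the leak check only catches verbatim copying of an 8-word run; all three are assumptions to tighten for your own system.

```python
import json


def is_valid_json(response: str) -> bool:
    """True if the whole response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def within_token_limit(response: str, max_tokens: int) -> bool:
    """Rough check: words approximate tokens. Use your model's tokenizer for exact counts."""
    return len(response.split()) <= max_tokens


def leaks_system_prompt(response: str, system_prompt: str, window: int = 8) -> bool:
    """Flag a response that repeats any run of `window` consecutive words from the system prompt."""
    words = system_prompt.lower().split()
    if not words:
        return False
    haystack = " ".join(response.lower().split())
    last_start = max(1, len(words) - window + 1)
    return any(" ".join(words[i:i + window]) in haystack for i in range(last_start))
```

Because these checks are pure functions, they drop straight into ordinary `assert` statements in a test file and run on every commit.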
LLM-as-Judge
Why Use LLM Evaluation
Using an LLM to evaluate another LLM’s output sounds circular, but it works surprisingly well, much like having a senior editor review a junior writer’s work. The key insight is that judging quality is an easier task than generating quality. A model that cannot write a perfect legal brief can still tell you whether a legal brief addresses the right issues. Traditional n-gram overlap metrics (BLEU, ROUGE) often correlate poorly with human judgments of quality on open-ended tasks. LLM judges are:
- Flexible: Evaluate any criteria you can describe in natural language
- Scalable: Thousands of evals per hour at pennies each
- Explainable: Provide reasoning you can audit and debug
Basic LLM Judge
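A basic judge can be reduced to three steps: build a grading prompt, call a model, and parse a structured verdict. The sketch below assumes a hypothetical `call_model` callable that takes a prompt string and returns the model's text, so you can plug in whichever client you use. The prompt wording and the 1-5 scale are illustrative choices, and `fake_model` is a stub standing in for a real API call.

```python
import json
import re
from typing import Callable

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question: {question}
Reference answer: {reference}
Assistant answer: {answer}

Rate the assistant answer from 1 (poor) to 5 (excellent).
Reply with JSON only: {{"score": <int>, "reasoning": "<one sentence>"}}"""


def judge(question: str, reference: str, answer: str,
          call_model: Callable[[str], str]) -> dict:
    """Ask the judge model for a verdict and validate its shape."""
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    raw = call_model(prompt)
    # Judges sometimes wrap the JSON in extra text, so extract the outermost braces.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"judge returned no JSON: {raw!r}")
    verdict = json.loads(match.group(0))
    score = int(verdict["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reasoning": str(verdict.get("reasoning", ""))}


def fake_model(prompt: str) -> str:
    """Stub used for illustration; replace with a real model call."""
    return 'Here is my verdict: {"score": 4, "reasoning": "Correct but terse."}'


result = judge("What is the refund window?", "30 days", "You have 30 days.", fake_model)
```

Validating the score range and failing loudly on unparseable output keeps a misbehaving judge from silently inflating or deflating your pass rate.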
Multi-Criteria Evaluation
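When a judge scores several criteria separately, you still need one number to gate on. A minimal sketch of a weighted average follows; the criterion names and the 2x weight on faithfulness are assumptions chosen for illustration, so substitute the criteria and weights that match your own risk profile.

```python
# Illustrative weights: faithfulness counts double because hallucination
# is assumed to be the costliest failure in this example.
WEIGHTS = {"faithfulness": 2.0, "relevance": 1.0, "completeness": 1.0, "clarity": 1.0}


def weighted_score(scores: dict[str, int], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine per-criterion scores (each 1-5) into a single weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# A fluent but unfaithful answer: the plain average is 4.25,
# while the weighted score is pulled down by the faithfulness score.
fluent_but_wrong = {"faithfulness": 2, "relevance": 5, "completeness": 5, "clarity": 5}
combined = weighted_score(fluent_but_wrong)
```

Here `combined` works out to (2·2 + 5 + 5 + 5) / 5 = 3.8, which shows how weighting lets one critical criterion outvote several cosmetic ones.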
Pairwise Comparison
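A minimal sketch of a position-bias-resistant pairwise comparison: instead of randomizing the order once, it asks the judge twice with the order swapped and only declares a winner when both runs agree. The `judge_pair` callable and its `"first"` / `"second"` / `"tie"` return values are hypothetical conventions for this example, and the two stub judges exist only to show the behavior.

```python
from typing import Callable

# judge_pair(question, first, second) is assumed to return "first", "second", or "tie".
JudgePair = Callable[[str, str, str], str]


def compare_both_orders(question: str, response_a: str, response_b: str,
                        judge_pair: JudgePair) -> str:
    """Return "A", "B", or "tie"; disagreement between the two orderings counts as a tie."""
    forward = judge_pair(question, response_a, response_b)    # A shown first
    backward = judge_pair(question, response_b, response_a)   # B shown first
    forward_winner = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_winner = {"first": "B", "second": "A"}.get(backward, "tie")
    return forward_winner if forward_winner == backward_winner else "tie"


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Stub: always prefers whatever is shown first."""
    return "first"


def prefers_longer_judge(question: str, first: str, second: str) -> str:
    """Stub: prefers the longer response regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With the position-biased stub the two runs contradict each other, so the result collapses to a tie rather than crediting a winner the judge never really chose.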
Pairwise comparison is more reliable than absolute scoring because humans and LLMs are better at saying “A is better than B” than assigning a number on a 1-5 scale. This is exactly how Chatbot Arena (lmsys.org) ranks models, and it is the gold standard for A/B testing prompt changes. Practical tip: randomize which response is A vs B across runs to cancel out position bias.
RAG Evaluation
RAG evaluation is trickier than evaluating a standalone LLM because there are two stages that can fail independently: retrieval (did you find the right documents?) and generation (did you synthesize them correctly?). A wrong answer might be caused by bad retrieval, bad generation, or both, and the fix is completely different for each. The metrics below separate these concerns so you can diagnose root causes instead of guessing.
RAG-Specific Metrics
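For the retrieval stage, standard ranking metrics need only the retrieved document IDs and a labeled set of relevant IDs. The sketch below covers precision@k, recall@k, and reciprocal rank; it is not the full metric set, and the document IDs are made up for illustration. Generation-side qualities such as faithfulness are usually scored by an LLM judge and are not shown here.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots filled with relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(doc in relevant for doc in retrieved[:k]) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Made-up example: two relevant docs, one of them ranked second, the other fourth.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
```

If recall@k is low, the generator never saw the right evidence and prompt changes will not help; if recall is high but answers are still wrong, the problem sits on the generation side.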
Automated Eval Pipelines
The goal is to make AI evaluation as routine as running pytest. Every time someone changes a prompt or updates the model version, the eval pipeline should run automatically and block the merge if quality drops below your threshold. This is the AI equivalent of “no green CI, no merge.”
CI/CD Integration
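The merge gate itself can be a small function that turns a list of judge scores into a process exit code. This is a sketch under stated assumptions: the thresholds (90% pass rate, 3.5 average, pass at a score of 4 or higher) are example values, and how the scores are produced is left to your eval runner.

```python
from statistics import mean


def gate(scores: list[float], pass_score: float = 4.0,
         min_pass_rate: float = 0.90, min_avg: float = 3.5) -> tuple[bool, str]:
    """Decide whether an eval run is good enough to merge. Thresholds are example values."""
    if not scores:
        return False, "no eval results"
    pass_rate = sum(score >= pass_score for score in scores) / len(scores)
    average = mean(scores)
    ok = pass_rate >= min_pass_rate and average >= min_avg
    return ok, f"pass_rate={pass_rate:.1%} avg={average:.2f}"


def main(scores: list[float]) -> int:
    """Return a process exit code: 0 lets CI continue, 1 blocks the merge."""
    ok, summary = gate(scores)
    print(("PASS " if ok else "FAIL ") + summary)
    return 0 if ok else 1


# In a CI entry point you would end with something like:
#     sys.exit(main(scores_from_your_eval_run))
```

Treating an empty result set as a failure is deliberate: a pipeline that silently evaluated nothing should never count as green.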
GitHub Actions Workflow
Observability & Monitoring
Evaluation tells you how your system performs before deployment. Observability tells you how it performs after. Think of eval as your pre-flight checklist and monitoring as your cockpit instruments: you need both. The tools below let you trace every LLM call from prompt to response, measure latency, track costs, and correlate all of it with user satisfaction.
LangSmith Integration
Custom Metrics Dashboard
If LangSmith or Langfuse do not fit your stack, building a lightweight metrics collector is straightforward. The trick is logging to an append-only JSONL file (one JSON object per line): it is simple, grep-friendly, and easy to ingest into any analytics tool later.
Key Takeaways
Test Before Deploy
Build eval datasets first. Never ship AI without automated testing.
LLM Judges Scale
Use a stronger model (for example, GPT-4o) to evaluate a weaker one (GPT-4o-mini). For subjective quality, LLM judges are the most practical way to evaluate at scale.
Monitor in Production
Track latency, costs, and user feedback. Catch regressions early.
Version Everything
Prompts, models, and eval datasets all need version control.
What’s Next
Production Patterns
Learn architecture patterns for reliable AI systems at scale
Interview Deep-Dive
Your team ships a RAG chatbot. PM asks how you will know if it is working. Walk us through your evaluation strategy from day one to month three.
Strong Answer:
- Day one I would start with a golden dataset of 50-100 curated question-answer pairs spanning our core use cases: happy path queries, edge cases, adversarial inputs, and out-of-scope questions. Each example gets tagged by category so I can slice pass rates later.
- For automated eval I would set up an LLM-as-Judge pipeline using a stronger model (GPT-4o) to evaluate the weaker production model (GPT-4o-mini). The judge scores on four weighted criteria: faithfulness to retrieved context (weight 2x since hallucination is the top risk), answer relevance, completeness, and clarity. I weight faithfulness highest because a wrong-but-confident answer is worse than an incomplete one.
- I would wire this into CI so every prompt change or retrieval config change triggers an eval run. The pipeline has a hard gate: if pass rate drops below 90% or average score drops below 3.5 out of 5, the PR cannot merge. At one company we caught a prompt regression that would have doubled our hallucination rate — the CI eval blocked the merge before it hit production.
- In production I would track latency, cost per query, retrieval hit rate, and user feedback (thumbs up/down). By month three, I want to run pairwise comparisons of prompt versions using A/B testing — not just automated scores but actual user preference data. The offline evals tell you if you broke something; the online metrics tell you if users actually care about your improvements.
What is the difference between exact-match evaluation and LLM-as-Judge evaluation? When would you use each?
Strong Answer:
- Exact-match evaluation checks deterministic properties: does the output contain required phrases, is it valid JSON, does it stay under a token limit, does it avoid banned terms. These are cheap, fast, and have zero variance — they belong in your unit test suite and run on every commit. For example, checking that a customer support bot always includes a case number in its response, or that a SQL generation tool outputs syntactically valid SQL.
- LLM-as-Judge evaluates subjective qualities that you cannot write a regex for: is this response helpful, is the tone appropriate, is the reasoning sound, does it actually answer the question. You need this for any task where “correct” is fuzzy. The tradeoff is cost (you are paying for another LLM call per evaluation) and non-determinism (the judge might score the same response differently on two runs).
- In practice I use both in layers. The CI pipeline runs exact-match checks first as a fast gate — if format validation fails, there is no point running the expensive LLM judge. Then the LLM judge runs on the subset that passes deterministic checks. This layered approach cut our eval costs by about 40% at my previous team because we caught 30% of regressions at the cheap deterministic layer.
- The key thing people miss is that you should never use LLM-as-Judge for properties you can check deterministically. If you need JSON output, parse it. Do not ask GPT-4 “is this valid JSON.” That is burning money for a worse answer.
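The layered approach described in this answer can be sketched as a single function: run the cheap deterministic check first and call the judge only when it passes. The JSON validity check, the `judge` callable, and the pass threshold of 4 are assumptions for illustration; `counting_judge` is a stub that records how often the expensive layer was reached.

```python
import json
from typing import Callable


def layered_eval(response: str, judge: Callable[[str], int], pass_score: int = 4) -> dict:
    """Run the deterministic gate first; call the judge only if it passes."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        # Format failure: no point paying for a judge call.
        return {"passed": False, "stage": "deterministic", "judge_called": False}
    score = judge(response)
    return {"passed": score >= pass_score, "stage": "judge",
            "judge_called": True, "score": score}


judge_calls: list[str] = []


def counting_judge(response: str) -> int:
    """Stub judge that records each call and always returns 5."""
    judge_calls.append(response)
    return 5


results = [layered_eval(text, counting_judge)
           for text in ['{"sql": "SELECT 1"}', "not json at all"]]
```

Only the first response reaches the judge here, which is where the cost saving comes from: every format failure is resolved without an extra model call.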
Your eval pipeline shows 95% pass rate, but users are complaining. What do you investigate?
Strong Answer:
- The first thing I suspect is that the eval dataset does not represent real user traffic. This is the most common failure mode I have seen. Teams build golden datasets from their own imagination of what users will ask, not from actual production queries. I would pull the last 1000 production queries, sample 100, and run them through the eval pipeline. If the pass rate drops, the dataset has a coverage gap.
- Second, I would check if the evaluation criteria match what users actually care about. We might be scoring on “factual correctness” while users are frustrated about tone, latency, or the bot not knowing when to escalate to a human. I would segment the user complaints and see which categories the eval pipeline does not cover at all.
- Third, I would look for distribution shift. Maybe the eval dataset was built three months ago and the product has changed since: new features, new policies, new edge cases that the golden dataset does not cover. At one company, our eval dataset had zero examples about a pricing change that happened after the dataset was created. The bot was giving outdated pricing info, but the eval said everything was fine because we never tested pricing questions.
- Finally, I would check if the LLM judge threshold is too lenient. A score of 3 out of 5 means “acceptable with some problems,” so counting it as a pass hides real issues. When we had been counting 3 as a pass, tightening the threshold to 4 immediately revealed 15% more failures that aligned with user complaints.
You need to compare two prompt versions for a production system. How do you design the experiment?
Strong Answer:
- I would run this as a structured A/B test, not just eyeball a few examples. First, I define the primary metric: for most production systems this is a composite of automated quality score plus user satisfaction signal (thumbs up/down or task completion rate). I also define guardrail metrics that must not regress: latency p95, cost per query, and safety pass rate.
- For the offline phase, I run both prompts against the full eval dataset and compare on every criterion. I use pairwise comparison with the LLM judge rather than independent scoring because pairwise is more reliable — the judge sees both responses side by side and picks a winner. I randomize presentation order to avoid position bias and run each comparison twice with swapped order. If the results disagree, I count it as a tie.
- For the online phase, I deploy both prompts behind a feature flag with a 50/50 traffic split. I use consistent assignment so the same user always sees the same prompt version during the experiment. I let the experiment run until I have statistical significance, which for most LLM experiments means at least 1000 sessions per variant because the variance in user feedback is high.
- The thing most teams get wrong is they only look at average scores. I always look at the tail: what does the distribution of scores look like? A new prompt might have the same average but much higher variance, which means more really bad responses mixed with more really good ones. Users remember the bad ones.
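Two pieces of the experiment design above lend themselves to a short sketch: consistent assignment of users to a prompt variant, and looking at the tail of the score distribution rather than the mean. Hashing the user ID with the experiment name is one common way to get sticky assignment; the function names, the experiment name, and the score lists are all made up for illustration.

```python
import hashlib
from statistics import mean


def assign_variant(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically map a user to "A" or "B" so they always see the same prompt."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return "A" if bucket < split else "B"


def low_score_share(scores: list[int], threshold: int = 2) -> float:
    """Fraction of responses scoring at or below `threshold` (the bad tail)."""
    return sum(score <= threshold for score in scores) / len(scores)


# Made-up judge scores: identical means, very different tails.
old_prompt_scores = [3, 3, 3, 3, 3, 3]
new_prompt_scores = [5, 5, 5, 1, 1, 1]
same_mean = mean(old_prompt_scores) == mean(new_prompt_scores)
old_tail = low_score_share(old_prompt_scores)
new_tail = low_score_share(new_prompt_scores)
```

Including the experiment name in the hash means a user's bucket in one experiment does not determine their bucket in the next, and the tail comparison shows why an unchanged average can still hide a worse experience.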