Selecting the right LLM for your application requires systematic evaluation. This chapter covers benchmark frameworks, evaluation metrics, and decision criteria for model selection.
Benchmark Framework
Building an Evaluation Harness
Create a structured framework for comparing models:
Evaluation Metrics
Response Quality Scorers
Implement various quality metrics:
Task-Specific Metrics
Different tasks require specialized metrics:
Model Selection Framework
Decision Matrix
Systematically compare models across dimensions:
A/B Testing Framework
Test models in production:
Model Selection Best Practices
- Test on representative data from your actual use case
- Consider total cost of ownership, not just per-token pricing
- Measure latency at the 95th percentile, not just average
- Run A/B tests long enough for statistical significance
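To illustrate why the 95th percentile matters more than the average, here is a minimal sketch that summarizes a latency sample. The `percentile` helper uses the nearest-rank definition, and the `latencies_ms` values are hypothetical numbers chosen for illustration, not measurements from any real model.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 18 fast responses and 2 slow ones (milliseconds).
latencies_ms = [420] * 18 + [3900, 4100]

summary = {
    "mean_ms": mean(latencies_ms),
    "p50_ms": percentile(latencies_ms, 50),
    "p95_ms": percentile(latencies_ms, 95),
}
print(summary)
```

In this made-up sample the mean sits well below the p95 value, so a report based on the mean alone would hide the slow tail that one in twenty users would actually experience.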
Practice Exercise
Build a comprehensive model evaluation pipeline:
- Create benchmark suites for different task types
- Implement at least 5 scoring metrics
- Build an automated selection recommendation system
- Set up A/B testing infrastructure
- Generate comparison reports with visualizations
- Statistical significance in comparisons
- Cost-quality tradeoff analysis
- Latency requirements for your use case
- Reproducible evaluation methodology
Interview Deep-Dive
Your team needs to select an LLM for a production application. Walk me through your evaluation methodology -- what do you measure, and how do you avoid common benchmarking pitfalls?
Strong Answer:
- The first step is defining your evaluation criteria with explicit weights before you run any benchmarks. If you benchmark first and then decide what matters, you will unconsciously bias toward whichever model already performed well. I use a weighted decision matrix with typically four to six dimensions: quality on my specific task (not general benchmarks), latency at the 95th percentile (not average), cost per 1,000 requests, context window requirements, and any hard constraints like data residency or compliance.
- For quality measurement, I never rely on public benchmarks like MMLU or HumanEval. Those measure general capability, but your application has specific needs. I build a custom evaluation dataset of 100-200 examples that represent my actual use case — real user queries, real expected outputs, real edge cases. I then score each model’s responses using a combination of automated metrics (format compliance, factual overlap with reference) and an LLM judge for subjective quality dimensions.
- The biggest benchmarking pitfall is sample size. Running 10 queries per model and declaring a winner is statistically meaningless. With 10 samples, the confidence interval on your quality estimate is enormous: at an observed 80% pass rate, the 95% interval is roughly plus or minus 25 percentage points. I run at minimum 100 queries per model, and for close comparisons I run 500+ with statistical significance testing. A 2% quality difference on 100 samples is noise; on 500 samples with p < 0.05, it is a signal.
- The second pitfall is ignoring the cost-quality Pareto frontier. Model A might score 92% on quality and cost $0.03 per request. Model B might score 94% and cost $0.15 per request. Is that 2% quality improvement worth 5x the cost? The answer depends entirely on your application. For medical triage, yes. For a casual chatbot, absolutely not. I always plot the cost-quality tradeoff curve and present it to stakeholders rather than declaring a single “best” model.
- The third pitfall is temporal instability. I have seen teams run a benchmark, select a model, and then not re-evaluate for six months. Model providers update their models continuously. GPT-4o in January may behave differently from GPT-4o in June. I schedule quarterly re-evaluations against the same golden dataset.
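The weighted decision matrix and the cost-quality Pareto view described in this answer can be sketched as follows. This is an illustrative sketch only: the `Candidate` fields, the `WEIGHTS` values, the normalization bounds, and the three example models are all hypothetical, and a real evaluation would substitute its own dimensions and weights, fixed before any benchmark is run.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    quality: float        # 0-1 score on your own evaluation set
    p95_latency_s: float  # seconds
    cost_per_1k: float    # USD per 1,000 requests
    context_window: int   # tokens


# Hypothetical weights, decided before benchmarking.
WEIGHTS = {"quality": 0.5, "latency": 0.2, "cost": 0.3}


def normalize(value, worst, best):
    """Map value onto 0-1 where `best` scores 1 and `worst` scores 0."""
    score = (value - worst) / (best - worst)
    return min(1.0, max(0.0, score))


def weighted_score(c, min_context):
    """Return the weighted score, or None if a hard constraint fails."""
    if c.context_window < min_context:
        return None
    parts = {
        "quality": c.quality,
        "latency": normalize(c.p95_latency_s, worst=10.0, best=0.5),
        "cost": normalize(c.cost_per_1k, worst=200.0, best=0.0),
    }
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)


def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both quality and cost."""
    def dominated(c):
        return any(
            o.quality >= c.quality and o.cost_per_1k <= c.cost_per_1k
            and (o.quality > c.quality or o.cost_per_1k < c.cost_per_1k)
            for o in candidates
        )
    return [c for c in candidates if not dominated(c)]


candidates = [
    Candidate("model-a", 0.92, 2.0, 30.0, 128_000),
    Candidate("model-b", 0.94, 4.0, 150.0, 128_000),
    Candidate("model-c", 0.90, 3.0, 60.0, 32_000),
]

for c in candidates:
    print(c.name, weighted_score(c, min_context=16_000))
print("Pareto front:", [c.name for c in pareto_front(candidates)])
```

The Pareto filter drops any model that is both worse and more expensive than another, which leaves stakeholders with only the genuine tradeoffs. The weighted score then ranks what remains under one particular set of priorities.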
How would you set up an A/B test comparing two LLMs in production, and what metrics would you track to determine a winner?
Strong Answer:
- The first design decision is the traffic split and user assignment. I use sticky sessions — a given user always sees the same model for the duration of the test. This is critical because if a user gets Model A for one query and Model B for the next, their experience is inconsistent, and any feedback they give is contaminated. I assign users to groups by hashing their user ID with the test name, which gives deterministic, balanced assignment without maintaining a lookup table.
- For metrics, I track three tiers. Primary metrics are the ones that determine the winner: user satisfaction (measured by thumbs up/down ratio), task completion rate (did the user accomplish their goal), and retention (did the user come back). Secondary metrics inform the decision but do not determine it: latency P95, cost per session, tokens per interaction. Guardrail metrics are things that must not regress: error rate, safety violations, timeout rate. If a guardrail metric regresses beyond a threshold, the test is automatically stopped regardless of primary metrics.
- The minimum sample size calculation is critical and most teams skip it. For a thumbs-up rate of around 80% (typical for a decent LLM application), detecting a 5% relative improvement (from 80% to 84%) with 80% power and 95% confidence requires roughly 1,500 samples per group, or about 3,000 in total across both groups. If your application gets 500 queries per day, that is a 6-day test at minimum. Running for 3 days and declaring a winner is not a valid A/B test.
- One subtlety specific to LLM A/B tests is prompt sensitivity. If you are also iterating on prompts during the test, you have confounded variables. I strictly separate model comparison tests (same prompt, different model) from prompt comparison tests (same model, different prompt). Never change both simultaneously.
- Finally, I watch for novelty effects. Users might initially prefer the model that gives longer, more detailed answers (verbosity bias), but over a two-week test, they might start preferring the concise model because they are tired of reading. Running the test for at least two weeks captures these behavioral shifts.
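The hashed user assignment and the sample size calculation from this answer can be sketched as below. The function names `assign_variant` and `samples_per_group` are hypothetical, SHA-256 is one reasonable choice of hash rather than a requirement, and the sample size uses the standard normal-approximation formula for comparing two proportions.

```python
import hashlib
import math
from statistics import NormalDist


def assign_variant(user_id: str, test_name: str, share_a: float = 0.5) -> str:
    """Deterministically map a user to 'A' or 'B' for a given test.

    Hashing the test name together with the user ID keeps a user in the
    same group for the whole test without storing a lookup table.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "A" if bucket < share_a else "B"


def samples_per_group(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate samples needed per group to detect p_base -> p_target
    with a two-sided test at significance level alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_target - p_base) ** 2)


n = samples_per_group(0.80, 0.84)
days = math.ceil(2 * n / 500)  # both groups share 500 queries per day
print(assign_variant("user-123", "model-comparison-v1"), n, days)
```

For the 80% to 84% case this formula gives a little under 1,500 samples per group, consistent with the estimate in the answer, and about six days at 500 queries per day.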
When does it make sense to use an LLM-as-judge for model evaluation versus human evaluators, and how do you calibrate the judge?
Strong Answer:
- Use LLM-as-judge when you need to evaluate at scale (hundreds to thousands of responses), when the evaluation criteria can be explicitly defined in a rubric, and when the domain does not require specialized expertise. Evaluating whether a customer support response is helpful, relevant, and grammatically correct — LLM judges do this reliably. Evaluating whether a medical diagnosis is clinically accurate or whether a legal argument is sound — you need domain expert humans.
- Use human evaluators when the stakes are high (medical, legal, financial), when the evaluation requires cultural or contextual knowledge the LLM does not have, when you are establishing ground truth for the first time (you need humans to create the calibration dataset), and when you need to evaluate safety-critical dimensions like bias and harmful content (LLMs have blind spots that humans can catch).
- The calibration process is what makes or breaks an LLM judge. I start with 100 examples scored by 3 human annotators. I compute inter-annotator agreement to understand how much humans themselves disagree (this sets the ceiling for LLM judge performance). Then I run the same 100 examples through the LLM judge and compute correlation with the human consensus scores. If the LLM-human correlation is within 0.1 of the human-human correlation, the judge is trustworthy for this rubric and domain.
- When calibration fails — the LLM judge disagrees with humans on more than 20% of cases — I diagnose why. Common causes: the rubric is ambiguous (humans interpret it differently too), the judge has a systematic bias (always scores longer answers higher), or the domain requires knowledge the judge lacks. The fix is usually rubric refinement: adding explicit examples of high-scoring and low-scoring responses directly into the judge prompt. Including 3-5 calibration examples with their correct scores in the judge prompt improves accuracy by 10-15% in my experience.
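The calibration check described in this answer can be sketched as follows. It assumes Pearson correlation as the agreement measure and reads "within 0.1" as "no more than 0.1 below the human-human figure". The answer specifies neither choice, so both are assumptions, and the scores below are made-up illustrative data.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def calibrate_judge(human_scores, judge_scores, tolerance=0.1):
    """Compare judge-vs-consensus agreement with human-vs-human agreement."""
    # Consensus score per example: mean across annotators.
    consensus = [mean(per_example) for per_example in zip(*human_scores)]
    # Ceiling: average pairwise correlation between annotators.
    human_human = mean(pearson(a, b) for a, b in combinations(human_scores, 2))
    judge_human = pearson(judge_scores, consensus)
    return {
        "human_human": human_human,
        "judge_human": judge_human,
        "trustworthy": judge_human >= human_human - tolerance,
    }


# Hypothetical 1-5 scores from three annotators on eight examples.
humans = [
    [5, 4, 2, 1, 3, 5, 2, 4],
    [4, 4, 2, 2, 3, 5, 1, 4],
    [5, 3, 1, 1, 4, 5, 2, 5],
]
judge = [5, 4, 2, 1, 3, 5, 2, 4]  # hypothetical LLM judge scores

report = calibrate_judge(humans, judge)
print(report)
```

For ordinal rubric scores, a rank correlation or an agreement statistic such as Cohen's kappa may be a better fit than Pearson. The structure of the check stays the same.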