Testing LLM applications requires different strategies than traditional software due to non-deterministic outputs and external API dependencies.Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Testing Challenges
Mocking LLM Responses
Basic Mocking with pytest
LLM Mock Fixture
Response Caching for Tests
Semantic Assertions
Schema Validation
Golden Dataset Testing
Integration Testing
Performance Testing
Test Configuration
Test Strategy Summary
| Test Type | Speed | Cost | Coverage |
|---|---|---|---|
| Unit (mocked) | Fast | Free | Logic |
| Cached | Fast | One-time | Regression |
| Integration | Medium | Low | Endpoints |
| Golden Dataset | Slow | Medium | Quality |
| Semantic | Slow | Medium | Meaning |
What is Next
Production Logging
Learn structured logging and debugging for LLM applications
Interview Deep-Dive
How do you approach testing an LLM-powered application when the outputs are non-deterministic? What is your testing strategy from unit to production?
How do you approach testing an LLM-powered application when the outputs are non-deterministic? What is your testing strategy from unit to production?
Strong Answer:
- The fundamental shift from traditional testing is that you cannot assert on exact outputs. Instead, you build a layered testing strategy that trades precision for coverage at each level. At the unit test level, I mock the LLM entirely. The tests validate your application logic — prompt construction, response parsing, error handling, token counting — not the model’s behavior. These are fast, free, and deterministic. If your prompt template inserts user context into position 3 of the messages array, the unit test verifies that, regardless of what the model would say.
- At the integration test level, I use response caching. The first run hits the real API and records the response. Subsequent runs replay the cached response. This gives you regression detection — if you change your prompt and the test still passes with the old cached response, something is wrong. I invalidate caches on prompt changes using a hash of the prompt template as part of the cache key.
- At the evaluation level, I use golden datasets with semantic assertions. A golden dataset is 50-200 curated input-output pairs that represent your critical use cases. Instead of exact match, you use metrics like: does the response contain the required information (concept containment), is it semantically similar to the reference answer (embedding similarity above 0.8), does it conform to the expected schema (Pydantic validation), and does it avoid known failure patterns (regex for hallucinated URLs, fabricated statistics).
- At the production level, I run continuous evaluation on a sample of live traffic. Every Nth request gets scored by an LLM judge on dimensions like relevance, correctness, and helpfulness. This catches quality regressions that static golden datasets miss — for example, when a model update changes behavior on a query pattern you did not anticipate.
- The meta-point is: you need all four levels. Teams that only mock never catch model-related regressions. Teams that only use golden datasets have slow, expensive test suites. The pyramid structure — many unit tests, some integration tests, focused golden dataset evals, sampled production monitoring — gives you confidence without breaking the bank.
What is an LLM-as-judge evaluation, and what are the pitfalls of using one LLM to evaluate another?
What is an LLM-as-judge evaluation, and what are the pitfalls of using one LLM to evaluate another?
Strong Answer:
- LLM-as-judge is the pattern of using a capable model (typically GPT-4o or Claude) to score the output of another model on dimensions like accuracy, relevance, helpfulness, and harmfulness. You provide the judge with the original prompt, the response to evaluate, optionally a reference answer, and a scoring rubric. The judge returns structured scores, usually on a 1-10 or 1-5 scale. This is enormously valuable because it scales — you can evaluate thousands of responses automatically where human evaluation would take weeks.
- The first major pitfall is self-preference bias. Studies have shown that GPT-4 systematically rates GPT-4 outputs higher than Claude outputs, and vice versa. If you are using GPT-4 as a judge to compare GPT-4 against Claude, the benchmark is biased. The mitigation is to use a different model family as the judge than the one being evaluated, or to use multiple judges and average their scores.
- The second pitfall is verbosity bias. LLM judges consistently prefer longer, more detailed answers over concise ones, even when the concise answer is more appropriate. A response that says “Paris” to “What is the capital of France?” will score lower than a response that says “The capital of France is Paris, a city located in the northern part of the country on the Seine River.” The mitigation is to explicitly instruct the judge that conciseness is a positive attribute and that longer is not always better.
- The third pitfall is position bias. When comparing two responses side by side, the judge tends to prefer whichever response is presented first (or sometimes second, depending on the model). The mitigation is to evaluate each response independently rather than comparatively, or to run pairwise comparisons twice with swapped positions and average the scores.
- The fourth pitfall is rubric sensitivity. Small changes in how you phrase the evaluation criteria can swing scores by 20-30%. “Is this response accurate?” versus “Does this response contain any factual errors?” will produce different score distributions. You need to iterate on your rubric with calibration examples and validate that the judge’s scores correlate with human judgment on a held-out set.
How do you design a regression testing pipeline for a RAG system where both the retrieval and generation components can change?
How do you design a regression testing pipeline for a RAG system where both the retrieval and generation components can change?
Strong Answer:
- A RAG system has two independent failure surfaces: retrieval quality and generation quality. Your regression tests need to isolate them. I build two separate test suites. The retrieval test suite has 100+ queries with labeled relevant documents. For each query, I know which document chunks should be retrieved. I run the retrieval pipeline and measure recall at k (did the correct chunks appear in the top k results?) and mean reciprocal rank. If I change the chunking strategy, the embedding model, or the reranking logic, this suite catches regressions immediately — without ever calling the LLM.
- The generation test suite uses a fixed set of retrieved contexts paired with queries and reference answers. By fixing the retrieved context, I isolate the LLM generation step. If I change the prompt template or switch models, this suite tells me if generation quality changed. I evaluate using semantic similarity, concept containment (does the response mention all required facts?), and a format compliance check (does it return valid JSON if expected?).
- The end-to-end test suite connects both: real queries, real retrieval, real generation. This catches interaction effects — for example, a retrieval change that slightly alters the context ordering, which then triggers a different generation behavior due to the lost-in-the-middle effect. I run this suite less frequently (nightly rather than on every commit) because it is slow and expensive.
- The most valuable test type that most teams skip is the “known failure” suite. Every time a user reports a bad response, I add the query and the correct answer to a dedicated test file. This ensures that every fixed bug stays fixed. Over six months, this suite grows to 200-300 cases and becomes the single best indicator of system quality.
- For CI integration, I run retrieval tests and mock-based unit tests on every PR. Golden dataset evaluations run nightly. End-to-end tests run weekly or on-demand before major releases.