Basic RAG retrieves documents using direct embedding similarity, but in production that is often not enough. User queries are short and vague while documents are long and specific. The vocabulary mismatch between how people ask questions and how answers are written creates a “semantic gap” that tanks retrieval quality. Advanced RAG techniques close that gap through query transformation, hierarchical retrieval, and self-correction. Think of basic RAG like searching a library by matching exact words on the spine of each book. Advanced RAG is like having a librarian who understands what you actually need, rephrases your question several different ways, checks whether the books she pulled are actually relevant, and goes back for more if they are not.
The 80/20 of RAG quality: In most production systems, retrieval quality matters more than generation quality. If you feed the right context to the LLM, even a smaller model produces great answers. If you feed the wrong context, even GPT-4 hallucinates confidently. Invest your optimization time accordingly.
HyDE: Hypothetical Document Embeddings
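To make the flow concrete, here is a minimal sketch of HyDE: generate a hypothetical answer, embed it, and rank documents against that embedding rather than the query's. The bag-of-words `embed` function and the `fake_llm` stub are stand-ins assumed for illustration only; a real system would call an embedding model and an LLM in their place.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(query: str, docs: List[str],
                generate_hypothetical: Callable[[str], str],
                k: int = 1) -> List[str]:
    # Step 1: ask the LLM to write a plausible answer (it may be wrong).
    hypothetical = generate_hypothetical(query)
    # Step 2: embed the hypothetical answer instead of the raw query.
    hyp_vec = embed(hypothetical)
    # Step 3: rank real documents by similarity to the hypothetical answer.
    ranked = sorted(docs, key=lambda d: cosine(hyp_vec, embed(d)), reverse=True)
    return ranked[:k]


def fake_llm(query: str) -> str:
    # Hypothetical stub standing in for an LLM call.
    return "Python was created by Guido van Rossum and first released in 1991."


docs = [
    "The python is a large snake found in Africa and Asia.",
    "Guido van Rossum created Python, which was first released in 1991.",
]
top = hyde_search("When was Python created?", docs, fake_llm)
print(top[0])
```

Passing the generator in as a callable keeps the sketch testable and makes it easy to swap in a real model later.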
The core insight behind HyDE is counterintuitive: instead of embedding the user’s question and looking for similar documents, you first ask the LLM to imagine what the answer document would look like, then embed that hypothetical answer to search. Why does this work? Because a hypothetical answer lives in the same “semantic neighborhood” as the real answer document: it uses similar vocabulary, structure, and phrasing. A short question like “When was Python created?” lives far from its answer in embedding space, but a hypothetical answer paragraph about Python’s creation date lives right next to the real Wikipedia paragraph.
The trade-off: HyDE adds one LLM call per query (latency and cost), and it can hurt performance when the LLM’s hypothetical answer is confidently wrong, pulling retrieval in the wrong direction. Use it when your queries are short and your documents are long-form prose.
Multi-Query Retrieval
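In code, multi-query retrieval amounts to rewriting the query several ways, searching once per variation, and merging the hits. The sketch below is one possible shape, not a reference implementation: `rewrite` and `search` are hypothetical callables (an LLM rewriter and a vector search), and keeping each document's best score is just one reasonable merge rule.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (doc_id, similarity score)


def multi_query_retrieve(query: str,
                         rewrite: Callable[[str], List[str]],
                         search: Callable[[str], List[Hit]],
                         k: int = 3) -> List[str]:
    # Search with the original query plus every reformulation.
    queries = [query, *rewrite(query)]
    best: Dict[str, float] = {}
    for q in queries:
        for doc_id, score in search(q):
            # Keep the best score seen for each document across all queries.
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy stand-ins (assumptions for illustration only).
INDEX: Dict[str, List[Hit]] = {
    "cancel subscription": [("doc_cancel", 0.9)],
    "terminate billing plan": [("doc_billing", 0.8), ("doc_cancel", 0.6)],
    "stop recurring payments": [("doc_payments", 0.7)],
}


def fake_rewrite(query: str) -> List[str]:
    return ["terminate billing plan", "stop recurring payments"]


def fake_search(query: str) -> List[Hit]:
    return INDEX.get(query, [])


results = multi_query_retrieve("cancel subscription", fake_rewrite, fake_search)
print(results)
```

The original query alone would only have found `doc_cancel`; the reformulations pull in the other two documents.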
Multi-query retrieval solves a fundamental problem: users express the same intent in wildly different ways, and a single query phrasing may miss relevant documents that use different vocabulary. Think of it like asking five different people to search Google for the same thing: they would each type something different, and the union of their results covers more ground than any single search. The technique generates several reformulations of the original query, runs each one independently, then merges the results. This can substantially improve recall (finding all relevant documents) at the cost of one LLM call to generate the reformulations plus an extra embedding and search call for each of them.
Parent Document Retrieval
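A small sketch of the two-level idea: index small child chunks, but hand back the parent they came from. Word-overlap scoring stands in for embedding similarity here, and the chunk size, helper names, and sample texts are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_child_index(parents: Dict[str, str],
                      child_words: int = 8) -> List[Tuple[str, str]]:
    """Split every parent into small child chunks that remember their parent ID."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for i in range(0, len(words), child_words):
            children.append((parent_id, " ".join(words[i:i + child_words])))
    return children


def retrieve_parents(query: str, parents: Dict[str, str],
                     children: List[Tuple[str, str]], k: int = 1) -> List[str]:
    q = tokens(query)
    # Match against the small chunks (toy overlap score instead of embeddings).
    ranked = sorted(children, key=lambda c: len(q & tokens(c[1])), reverse=True)
    seen, results = set(), []
    for parent_id, _ in ranked:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])  # return the full parent text
        if len(results) == k:
            break
    return results


parents = {
    "refunds": ("Refunds are issued within five business days. To request a refund, "
                "open the billing page and select the order you want refunded."),
    "shipping": ("Orders ship within two days. Tracking numbers are emailed after "
                 "dispatch. International orders can take up to three weeks to arrive."),
}
children = build_child_index(parents)
context = retrieve_parents("How do I request a refund?", parents, children)
print(context[0])
```

The matching child chunk only covers "request a refund, open the billing page and", yet the returned context also includes the processing time from the same parent.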
This technique solves the “Goldilocks chunking problem”: small chunks are better for precise embedding matching (less noise in the vector), but small chunks lack the surrounding context the LLM needs to generate a good answer. Parent document retrieval gives you the best of both worlds: search against small, focused chunks, but return the larger parent document that contains the full context. Think of it like a book index: you look up a specific keyword (small chunk matching), but then you read the entire page or section (parent document) to get the full picture.
Query Decomposition
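As a rough sketch, query decomposition splits a compound question into sub-queries, retrieves for each, and merges the results without duplicates. `decompose` and `search` are hypothetical callables (an LLM prompt and a retriever); the stubs and document IDs below are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple


def decomposed_retrieve(query: str,
                        decompose: Callable[[str], List[str]],
                        search: Callable[[str], List[str]],
                        k_per_sub: int = 2) -> Tuple[List[str], List[str]]:
    # Fall back to the original query if the decomposer returns nothing.
    sub_queries = decompose(query) or [query]
    merged: List[str] = []
    seen = set()
    for sub in sub_queries:
        for doc_id in search(sub)[:k_per_sub]:
            if doc_id not in seen:  # de-duplicate across sub-queries
                seen.add(doc_id)
                merged.append(doc_id)
    return sub_queries, merged


# Toy stand-ins (assumptions for illustration only).
def fake_decompose(query: str) -> List[str]:
    return ["What is the pricing of Plan A?", "What is the pricing of Plan B?"]


INDEX: Dict[str, List[str]] = {
    "What is the pricing of Plan A?": ["plan_a_pricing", "pricing_faq"],
    "What is the pricing of Plan B?": ["plan_b_pricing", "pricing_faq"],
}


def fake_search(query: str) -> List[str]:
    return INDEX.get(query, [])


subs, doc_ids = decomposed_retrieve(
    "How do Plan A and Plan B compare on price?", fake_decompose, fake_search
)
print(doc_ids)
```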
Break complex queries into simpler sub-queries, retrieve for each one separately, and combine the results.
Corrective RAG (CRAG)
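A minimal sketch of the corrective loop: retrieve, grade each document, and if too few pass, rewrite the query and retry up to a fixed cap. `search`, `grade`, and `rewrite` are hypothetical callables (a retriever, a relevance scorer such as a reranker, and an LLM rewriter), and the threshold and cap values are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple


def corrective_retrieve(query: str,
                        search: Callable[[str], List[str]],
                        grade: Callable[[str, str], float],
                        rewrite: Callable[[str], str],
                        threshold: float = 0.5,
                        min_relevant: int = 1,
                        max_rewrites: int = 2) -> Tuple[List[str], int]:
    """Return (relevant_docs, rewrites_used); an empty list means 'not found'."""
    current = query
    for attempt in range(max_rewrites + 1):
        # Grade against the ORIGINAL query so rewrites cannot drift off-topic.
        relevant = [d for d in search(current) if grade(query, d) >= threshold]
        if len(relevant) >= min_relevant:
            return relevant, attempt
        if attempt < max_rewrites:
            current = rewrite(current)
    # Give up: better to say "not in the corpus" than to keep searching.
    return [], max_rewrites


# Toy stand-ins (assumptions for illustration only).
CORPUS: Dict[str, List[str]] = {
    "cancel subscription": ["Our office dog is named Biscuit."],
    "terminate recurring billing plan":
        ["To terminate your recurring billing plan, open Settings."],
}


def fake_search(query: str) -> List[str]:
    return CORPUS.get(query, [])


def fake_grade(query: str, doc: str) -> float:
    return 1.0 if "billing" in doc.lower() else 0.0


def fake_rewrite(query: str) -> str:
    return "terminate recurring billing plan"


docs, rewrites_used = corrective_retrieve(
    "cancel subscription", fake_search, fake_grade, fake_rewrite
)
print(docs, rewrites_used)
```

The first retrieval returns only an irrelevant document, so the loop rewrites once and succeeds on the second attempt.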
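Corrective RAG grades the retrieved documents for relevance and, when too few of them pass, rewrites the query and retrieves again before generating an answer.
Reciprocal Rank Fusion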
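Reciprocal rank fusion scores each document by summing 1 / (k + rank) over every ranked list it appears in, so it needs only ranks, not comparable scores. The sketch below uses k = 60, a commonly used constant; the two input rankings are made-up examples.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in many lists accumulate the most score.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]   # e.g. from embedding search
keyword_hits = ["b", "d", "a"]  # e.g. from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Here `b` edges out `a` because it ranks first and second, while `a` ranks first and third.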
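Reciprocal rank fusion (RRF) combines ranked results from multiple retrieval methods by scoring each document on its rank in every list rather than on raw similarity scores, which are not comparable across methods.
Advanced RAG Best Practices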
- Use HyDE when queries differ in style from the documents, such as short questions against long-form prose
- Multi-query retrieval improves recall for ambiguous queries
- Parent document retrieval preserves context for answers
- Always assess retrieval quality before generation
- Combine multiple techniques for best results
Practice Exercise
Build an advanced RAG system that:
- Implements HyDE for query transformation
- Uses multi-query retrieval for improved recall
- Applies parent document retrieval for context
- Includes self-correction with relevance assessment
- Combines methods using reciprocal rank fusion
Focus on:
- Measuring retrieval quality improvements
- Balancing latency vs quality tradeoffs
- Handling edge cases gracefully
- Providing explainable retrieval decisions
Interview Deep-Dive
You are seeing low retrieval quality in your RAG system despite using a good embedding model. Walk me through your debugging process.
Strong Answer:
- Before touching the model or retrieval logic, I start by inspecting the data. The most common cause of poor retrieval is bad chunking, not bad embeddings. I pull 20-30 failing queries, look at what chunks were retrieved versus what chunks should have been retrieved, and check whether the correct answer even exists in any chunk. In at least half the cases I have debugged, the answer was split across two chunks or buried in a chunk dominated by irrelevant surrounding text.
- Next I check for vocabulary mismatch. If users ask “How do I cancel my subscription?” but the docs say “To terminate your recurring billing plan,” even good embeddings may not bridge that gap. This is exactly where techniques like HyDE or multi-query retrieval help, because they generate text in the vocabulary of the answer space rather than the question space.
- I evaluate retrieval separately from generation. I compute recall at k (what percentage of relevant chunks appear in the top k results) and mean reciprocal rank across a test set. If recall at 5 is below 70%, the problem is definitely retrieval. If recall is fine but the final answers are bad, the problem is in the generation prompt or context assembly.
- The chunk size and overlap parameters are the highest-leverage tuning knobs. Too small and you lose context. Too large and you dilute the signal with noise. I typically test three configurations (256, 512, 1024 tokens) on a benchmark set and pick the one with the best recall. Parent document retrieval is a great hybrid approach when you want precise matching but full-context answers.
- Finally, I check for index issues. If you are using HNSW, the ef_search parameter controls the accuracy-speed trade-off at query time. A low ef_search (the default in many libraries) can silently miss relevant results. Increasing it from 40 to 200 often recovers 5-10% recall with a modest latency increase.
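The two retrieval metrics named in this answer take only a few lines to compute. The sketch below assumes each test case is a pair of (ranked retrieved chunk IDs, set of relevant chunk IDs); the function names and sample data are illustrative, not from any particular evaluation library.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(cases: List[Tuple[List[str], Set[str]]], k: int) -> Tuple[float, float]:
    """Return (mean recall@k, mean reciprocal rank) over a test set."""
    recalls = [recall_at_k(ret, rel, k) for ret, rel in cases]
    rranks = [reciprocal_rank(ret, rel) for ret, rel in cases]
    return sum(recalls) / len(cases), sum(rranks) / len(cases)


cases = [
    (["c1", "c2", "c3"], {"c2"}),        # relevant chunk found at rank 2
    (["c9", "c8", "c7"], {"c1", "c7"}),  # one of two relevant chunks, at rank 3
]
mean_recall, mrr = evaluate(cases, k=2)
print(f"recall@2={mean_recall:.2f}  MRR={mrr:.3f}")
```

Running the same evaluation before and after a chunking or query-transformation change shows whether the change actually helped retrieval, independent of generation.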
Explain the trade-offs between HyDE and multi-query retrieval. When would you pick one over the other?
Strong Answer:
- HyDE and multi-query retrieval both solve the query-document vocabulary gap, but they attack it from different angles. HyDE generates a hypothetical answer document and embeds that, betting that a fake answer will be semantically closer to the real answer than the original short question. Multi-query generates multiple rephrased versions of the question and retrieves against all of them, betting that at least one rephrasing will match the vocabulary of the relevant document.
- HyDE works best when queries are very short and documents are long-form prose. A query like “Python creation date” is far from any document in embedding space, but a hypothetical paragraph about Python’s history lands right in the neighborhood. The risk with HyDE is that if the model generates a confidently wrong hypothetical, it pulls retrieval in the wrong direction. For a domain-specific corpus where the model has poor training data coverage, HyDE can actually hurt recall.
- Multi-query is safer and more robust. It does not require the model to know the answer, just to rephrase the question. This makes it better for specialized or proprietary domains where the LLM might generate an inaccurate hypothetical. The downside is cost and latency: you are embedding 3-5 queries instead of one, plus the LLM call to generate variations.
- In practice, I choose HyDE for general-knowledge domains with short queries and long documents (help centers, Wikipedia-style knowledge bases). I choose multi-query for specialized domains (legal, medical, internal docs) where the model might not know the answer but can still generate useful query reformulations. When in doubt, multi-query is the safer default because it degrades gracefully.
- One nuance people miss: you can combine them. Use multi-query to generate 3 variations, then apply HyDE to each variation. This gives you 3 hypothetical documents plus the 3 query variations, all contributing to retrieval. Expensive, but for high-value queries where recall matters more than latency, it is extremely effective.
You are building a Corrective RAG system. How do you design the relevance assessment step without it becoming a bottleneck?
Strong Answer:
- The relevance assessment in Corrective RAG is the step that checks whether retrieved documents actually answer the query before passing them to generation. The naive approach is to call an LLM for each retrieved document, but if you retrieve 10 documents, that is 10 additional LLM calls per query, which destroys latency.
- The first optimization is to use a cross-encoder reranker instead of an LLM call. Models like Cohere Rerank or a fine-tuned cross-encoder (like ms-marco-MiniLM) take a query-document pair and output a relevance score in 5-20ms versus 500ms+ for an LLM call. This is the approach most production CRAG systems actually use.
- The second optimization is to batch the assessment. Instead of asking “Is document X relevant to query Y?” for each document separately, pass all retrieved documents in a single LLM call with the prompt “Score each of these documents for relevance to the query on a 0-1 scale.” One call instead of N. The trade-off is that the model may give less precise scores when evaluating many documents at once, but for a binary relevant/irrelevant decision, it works well.
- The third approach is to use the embedding similarity score itself as a first-pass filter. Set a threshold (say, cosine similarity below 0.3 is definitely irrelevant) and only run the expensive relevance check on documents in the uncertain zone (0.3-0.7). Documents above 0.7 are assumed relevant without checking. This cascading approach reduces the number of documents that need LLM-based assessment by 50-70% in practice.
- For the refinement loop where you rewrite the query and re-retrieve, I cap it at 2 iterations maximum. Beyond that, if you still have not found relevant documents, the information probably is not in your corpus, and it is better to tell the user honestly than to keep searching and burning tokens.
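The cascading filter from this answer can be sketched as follows, using the 0.3 and 0.7 thresholds mentioned above. `expensive_check` is a hypothetical callable standing in for a cross-encoder or LLM relevance check, and real thresholds would need tuning per embedding model.

```python
from typing import Callable, List, Tuple


def cascade_filter(scored_docs: List[Tuple[str, float]],
                   expensive_check: Callable[[str], bool],
                   low: float = 0.3,
                   high: float = 0.7) -> Tuple[List[str], int]:
    """Return (kept_docs, number_of_expensive_checks_made)."""
    kept: List[str] = []
    checks = 0
    for doc, similarity in scored_docs:
        if similarity < low:
            continue            # clearly irrelevant: drop without checking
        if similarity > high:
            kept.append(doc)    # clearly relevant: keep without checking
            continue
        checks += 1             # uncertain zone: pay for the expensive check
        if expensive_check(doc):
            kept.append(doc)
    return kept, checks


scored = [("a", 0.9), ("b", 0.5), ("c", 0.4), ("d", 0.1)]
kept, checks = cascade_filter(scored, expensive_check=lambda doc: doc == "b")
print(kept, checks)
```

Only two of the four documents reach the expensive check in this example; the other two are decided by similarity alone.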
Explain how parent document retrieval works and when it breaks down. What is the fundamental tension it tries to resolve?
Strong Answer:
- Parent document retrieval addresses a core tension in RAG: small chunks are better for precise matching, but large chunks are better for providing complete context to the LLM. If you chunk at 100 tokens, your embedding search is very precise because each chunk is focused on a single idea. But when you pass that 100-token chunk to the LLM for answer generation, it often lacks enough context to produce a good response. If you chunk at 1000 tokens, the LLM gets plenty of context, but the embedding is a blurry average of many ideas, which hurts retrieval precision.
- Parent document retrieval solves this by maintaining two levels: small child chunks for retrieval and larger parent documents for context. You embed and search against the small chunks, but when you find a match, you return the parent document (or a larger surrounding window) to the LLM. You get the best of both worlds: precise retrieval and rich context.
- It breaks down in a few scenarios. First, when the parent document is too large (multiple pages), you end up stuffing too much irrelevant context into the LLM prompt, which can actually reduce answer quality and wastes tokens. Second, when multiple relevant chunks come from different parent documents, you may exceed the context window by returning all parents. You need a strategy for this: either return partial parents (a window around the matching chunk) or cap the total context size and prioritize by relevance score.
- The third failure mode is information that spans parent boundaries. If the answer to a question starts at the end of one parent and continues at the beginning of the next, neither parent alone contains the full answer. Overlapping parent windows help, but they increase storage and can introduce duplicate information in the context.
- In production, I typically set child chunks at 200-300 tokens and parent windows at 1000-1500 tokens (not entire documents). This gives roughly a 5x context expansion at the midpoints (about 3x to 7x at the extremes), which is enough for most use cases without blowing up context size.
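Two of the mitigations described in this answer, returning a window around the matching chunk and capping total context by relevance, can be sketched as below. Token lists are plain Python lists here, and the function names, window size, and budget are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def expand_window(tokens: Sequence, start: int, end: int, window: int) -> Sequence:
    """Grow tokens[start:end] to about `window` tokens, centred on the match.

    Near a document edge the window is truncated rather than shifted.
    """
    extra = max(window - (end - start), 0)
    left = max(start - extra // 2, 0)
    right = min(end + (extra - extra // 2), len(tokens))
    return tokens[left:right]


def assemble_context(hits: List[Tuple[float, Sequence]], budget: int) -> List[Sequence]:
    """Add the highest-scoring windows first, skipping any that exceed the budget."""
    used = 0
    chosen: List[Sequence] = []
    for _score, window in sorted(hits, key=lambda hit: hit[0], reverse=True):
        if used + len(window) > budget:
            continue
        chosen.append(window)
        used += len(window)
    return chosen


doc = list(range(100))                     # stand-in for a tokenized document
window = expand_window(doc, 40, 50, 30)    # 10-token child, 30-token window
hits = [(0.9, [1] * 30), (0.8, [2] * 30), (0.5, [3] * 10)]
context = assemble_context(hits, budget=45)
print(len(window), [len(w) for w in context])
```

With a 45-token budget the second-best window does not fit, so the smaller third window is used instead; whether to skip or stop at that point is a design choice.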