Basic RAG retrieves documents using direct embedding similarity — but in production, that is often not enough. User queries are short and vague while documents are long and specific. The vocabulary mismatch between how people ask questions and how answers are written creates a “semantic gap” that tanks retrieval quality. Advanced RAG techniques close that gap through query transformation, hierarchical retrieval, and self-correction. Think of basic RAG like searching a library by matching exact words on the spine of each book. Advanced RAG is like having a librarian who understands what you actually need, rephrases your question several different ways, checks whether the books she pulled are actually relevant, and goes back for more if they are not.
The 80/20 of RAG quality: In most production systems, retrieval quality matters more than generation quality. If you feed the right context to the LLM, even a smaller model produces great answers. If you feed the wrong context, even GPT-4 hallucinates confidently. Invest your optimization time accordingly.

HyDE: Hypothetical Document Embeddings

The core insight behind HyDE is counterintuitive: instead of embedding the user’s question and looking for similar documents, you first ask the LLM to imagine what the answer document would look like, then embed that hypothetical answer to search. Why does this work? Because a hypothetical answer lives in the same “semantic neighborhood” as the real answer document — it uses similar vocabulary, structure, and phrasing. A short question like “When was Python created?” lives far from its answer in embedding space, but a hypothetical answer paragraph about Python’s creation date lives right next to the real Wikipedia paragraph. The trade-off: HyDE adds one LLM call per query (latency and cost), and it can hurt performance when the LLM’s hypothetical answer is confidently wrong, pulling retrieval in the wrong direction. Use it when your queries are short and your documents are long-form prose.
from openai import OpenAI
import numpy as np


class HyDERetriever:
    """Hypothetical Document Embeddings for improved retrieval.
    
    Instead of embedding the raw query, we first generate what the ideal
    answer document WOULD look like, then use that for similarity search.
    This bridges the gap between question-space and document-space embeddings.
    """
    
    def __init__(
        self,
        documents: list[str],
        model: str = "gpt-4o-mini",
        embedding_model: str = "text-embedding-3-small"
    ):
        self.client = OpenAI()
        self.model = model
        self.embedding_model = embedding_model
        self.documents = documents
        # Pre-compute document embeddings once at init time -- this is the
        # expensive step, so do it upfront rather than per-query
        self.doc_embeddings = self._embed_documents()
    
    def _embed_documents(self) -> np.ndarray:
        """Embed all documents."""
        response = self.client.embeddings.create(
            model=self.embedding_model,
            input=self.documents
        )
        return np.array([e.embedding for e in response.data])
    
    def _generate_hypothetical_answer(self, query: str) -> str:
        """Generate a hypothetical answer to the query.
        
        The prompt instructs the model to write AS IF quoting a real document.
        This ensures the hypothetical text uses document-style language rather
        than conversational Q&A style, which improves embedding alignment.
        """
        prompt = f"""Given this question, write a passage that would answer it.
Write as if you're quoting from a document that contains the answer.
Be specific and detailed.

Question: {query}

Hypothetical document passage:"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            # Moderate temperature -- too low gives generic text, too high
            # introduces hallucinated details that pull search off-course
            temperature=0.7
        )
        
        return response.choices[0].message.content
    
    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        use_hyde: bool = True
    ) -> list[tuple[str, float]]:
        """Retrieve documents using HyDE or direct query."""
        # The key insight: we search with the hypothetical answer's embedding,
        # not the original query's embedding. This moves us from question-space
        # into document-space before computing similarity.
        if use_hyde:
            search_text = self._generate_hypothetical_answer(query)
        else:
            search_text = query
        
        # Embed search text
        response = self.client.embeddings.create(
            model=self.embedding_model,
            input=[search_text]
        )
        query_embedding = np.array(response.data[0].embedding)
        
        # Dot product works here because OpenAI embeddings are normalized,
        # making dot product equivalent to cosine similarity
        similarities = np.dot(self.doc_embeddings, query_embedding)
        
        # Get top-k results
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = [
            (self.documents[i], similarities[i])
            for i in top_indices
        ]
        
        return results


# Usage
documents = [
    "The Python programming language was created by Guido van Rossum and first released in 1991.",
    "Machine learning is a subset of artificial intelligence focused on building systems that learn from data.",
    "The Great Wall of China is over 13,000 miles long and was built over many centuries.",
    "Quantum computing uses quantum-mechanical phenomena to perform computation.",
    "The human brain contains approximately 86 billion neurons.",
]

retriever = HyDERetriever(documents)

query = "When was Python created and by whom?"

# Compare HyDE vs direct retrieval
print("With HyDE:")
results = retriever.retrieve(query, top_k=2, use_hyde=True)
for doc, score in results:
    print(f"  [{score:.3f}] {doc[:80]}...")

print("\nWithout HyDE:")
results = retriever.retrieve(query, top_k=2, use_hyde=False)
for doc, score in results:
    print(f"  [{score:.3f}] {doc[:80]}...")
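
The trade-off noted above (a confidently wrong hypothetical answer pulling retrieval off-course) can be softened by blending the hypothetical-answer embedding with the original query embedding instead of replacing it outright. The following is a minimal sketch of that idea, not part of the `HyDERetriever` class: `blend_embeddings` and its `weight` parameter are hypothetical names introduced here, and the toy vectors stand in for real embeddings.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def blend_embeddings(
    query_vec: list[float],
    hyde_vec: list[float],
    weight: float = 0.5,
) -> list[float]:
    """Weighted average of the query and hypothetical-answer embeddings.

    weight=0.0 searches with the raw query only, weight=1.0 is plain HyDE.
    The result is re-normalized so it can be compared by dot product.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    if len(query_vec) != len(hyde_vec):
        raise ValueError("embeddings must have the same dimension")
    mixed = [(1 - weight) * q + weight * h for q, h in zip(query_vec, hyde_vec)]
    return normalize(mixed)


# Toy 3-dimensional vectors (assumption: real embeddings have far more dimensions)
query_vec = [1.0, 0.0, 0.0]
hyde_vec = [0.0, 1.0, 0.0]
print(blend_embeddings(query_vec, hyde_vec, weight=0.5))
```

A blended vector keeps some pull toward the literal query, so a bad hypothetical cannot move the search arbitrarily far. The right `weight` is something to tune on your own evaluation set.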

Multi-Query Retrieval

Multi-query retrieval solves a fundamental problem: users express the same intent in wildly different ways, and a single query phrasing may miss relevant documents that use different vocabulary. Think of it like asking five different people to search Google for the same thing: they would each type something different, and the union of their results covers more ground than any single search. The technique generates several reformulations of the original query, runs each one independently, then merges the results. This typically improves recall (finding all relevant documents) at the cost of one extra LLM call to generate the variations, plus an embedding for each additional query.
from openai import OpenAI
import numpy as np
from collections import defaultdict


class MultiQueryRetriever:
    """Generate multiple query variations for improved recall.
    
    Why this works: embedding similarity is sensitive to word choice.
    "Python web frameworks" and "Flask Django web development" may retrieve
    different document sets even though they mean the same thing. By searching
    with multiple phrasings, we cast a wider net.
    """
    
    def __init__(
        self,
        documents: list[str],
        model: str = "gpt-4o-mini"
    ):
        self.client = OpenAI()
        self.model = model
        self.documents = documents
        self.doc_embeddings = self._embed_documents()
    
    def _embed_documents(self) -> np.ndarray:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=self.documents
        )
        return np.array([e.embedding for e in response.data])
    
    def generate_query_variations(
        self,
        query: str,
        num_variations: int = 3
    ) -> list[str]:
        """Generate alternative phrasings of the query."""
        prompt = f"""Generate {num_variations} different versions of this search query.
Each version should capture the same intent but use different words or perspectives.
Make them diverse to improve search coverage.

Original query: {query}

Return only the queries, one per line, without numbering."""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )
        
        variations = [
            line.strip()
            for line in response.choices[0].message.content.split("\n")
            if line.strip()
        ]
        
        # Include original query
        return [query] + variations[:num_variations]
    
    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        num_variations: int = 3
    ) -> list[tuple[str, float]]:
        """Retrieve using multiple query variations."""
        # Generate variations
        queries = self.generate_query_variations(query, num_variations)
        
        # Embed all queries
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=queries
        )
        query_embeddings = np.array([e.embedding for e in response.data])
        
        # Aggregate scores across all queries -- each document gets scored
        # against every query variation
        doc_scores = defaultdict(list)
        
        for query_embedding in query_embeddings:
            similarities = np.dot(self.doc_embeddings, query_embedding)
            
            for i, score in enumerate(similarities):
                doc_scores[i].append(score)
        
        # Use MAX score (not mean) so a document only needs to match ONE
        # query variation well. Mean would penalize docs that are very
        # relevant to one phrasing but not others.
        final_scores = [
            (i, max(scores))
            for i, scores in doc_scores.items()
        ]
        
        # Sort and get top-k
        final_scores.sort(key=lambda x: x[1], reverse=True)
        
        return [
            (self.documents[i], score)
            for i, score in final_scores[:top_k]
        ]


# Usage
documents = [
    "Python is a high-level programming language known for its simple syntax.",
    "Snake charming is an ancient practice found in parts of Asia and Africa.",
    "The python snake is one of the largest snake species in the world.",
    "Django and Flask are popular Python web frameworks.",
    "Anaconda is both a snake species and a Python distribution.",
]

retriever = MultiQueryRetriever(documents)

query = "Python programming frameworks"
results = retriever.retrieve(query, top_k=3)

print(f"Query: {query}\n")
for doc, score in results:
    print(f"[{score:.3f}] {doc}")
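
To make the max-versus-mean choice in `retrieve` concrete, here is a small standalone sketch with made-up similarity scores. The function name `aggregate_scores` and the numbers are illustrative assumptions; only the aggregation rule mirrors the class above.

```python
from statistics import mean


def aggregate_scores(
    scores_per_query: list[list[float]],
    how: str = "max",
) -> list[float]:
    """Collapse a [num_queries][num_docs] score table into one score per document."""
    if how not in ("max", "mean"):
        raise ValueError("how must be 'max' or 'mean'")
    if not scores_per_query:
        return []
    reducer = max if how == "max" else mean
    # zip(*rows) transposes the table so each tuple holds one document's scores
    return [reducer(doc_scores) for doc_scores in zip(*scores_per_query)]


# Rows are query variations, columns are documents (toy numbers).
# Document 0 matches one phrasing strongly; document 1 matches all of them weakly.
scores = [
    [0.9, 0.5],
    [0.1, 0.5],
    [0.1, 0.5],
]

print("max :", aggregate_scores(scores, "max"))   # document 0 ranks first
print("mean:", aggregate_scores(scores, "mean"))  # document 1 ranks first
```

With max aggregation the document that one phrasing hits strongly wins; with mean it is outranked by a document that is only moderately similar to every phrasing.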

Parent Document Retrieval

This technique solves the “Goldilocks chunking problem”: small chunks are better for precise embedding matching (less noise in the vector), but small chunks lack the surrounding context the LLM needs to generate a good answer. Parent document retrieval gives you the best of both worlds — search against small, focused chunks, but return the larger parent document that contains the full context. Think of it like a book index: you look up a specific keyword (small chunk matching), but then you read the entire page or section (parent document) to get the full picture.
from openai import OpenAI
from dataclasses import dataclass
import numpy as np
import uuid


@dataclass
class Chunk:
    """A chunk with reference to parent document."""
    id: str
    parent_id: str
    text: str
    start_pos: int
    end_pos: int


@dataclass  
class ParentDocument:
    """A parent document with its chunks."""
    id: str
    text: str
    chunks: list[Chunk]


class ParentDocumentRetriever:
    """Retrieve chunks and return parent documents."""
    
    def __init__(
        self,
        chunk_size: int = 200,
        chunk_overlap: int = 50
    ):
        self.client = OpenAI()
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        
        self.parents: dict[str, ParentDocument] = {}
        self.chunks: list[Chunk] = []
        # None until the first document is added
        self.chunk_embeddings: np.ndarray | None = None
    
    def add_document(self, text: str) -> str:
        """Add a document and create chunks."""
        parent_id = str(uuid.uuid4())
        
        # Create chunks
        chunks = self._create_chunks(text, parent_id)
        
        parent = ParentDocument(
            id=parent_id,
            text=text,
            chunks=chunks
        )
        
        self.parents[parent_id] = parent
        self.chunks.extend(chunks)
        
        # Recompute embeddings
        self._update_embeddings()
        
        return parent_id
    
    def _create_chunks(self, text: str, parent_id: str) -> list[Chunk]:
        """Split text into overlapping chunks.
        
        Overlap is critical -- without it, information that spans a chunk
        boundary gets split and neither chunk captures the full meaning.
        A 50-character overlap on 200-character chunks is a good starting point.
        """
        chunks = []
        start = 0
        
        while start < len(text):
            end = min(start + self.chunk_size, len(text))
            
            # Snap to word boundary to avoid cutting mid-word,
            # which would corrupt both the chunk text and its embedding
            if end < len(text):
                space_pos = text.rfind(" ", start, end)
                if space_pos > start:
                    end = space_pos
            
            chunks.append(Chunk(
                id=str(uuid.uuid4()),
                parent_id=parent_id,
                text=text[start:end].strip(),
                start_pos=start,
                end_pos=end
            ))
            
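            # Stop once the final chunk reaches the end of the text; otherwise
            # stepping back by the overlap would re-emit the tail forever
            if end >= len(text):
                break
            
            # Step back by the overlap, but always advance at least one character
            start = max(end - self.chunk_overlap, start + 1)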
        
        return chunks
    
    def _update_embeddings(self):
        """Update embeddings for all chunks."""
        if not self.chunks:
            return
        
        texts = [c.text for c in self.chunks]
        
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=texts
        )
        
        self.chunk_embeddings = np.array([e.embedding for e in response.data])
    
    def retrieve(
        self,
        query: str,
        top_k: int = 3,
        return_parent: bool = True
    ) -> list[tuple[str, float]]:
        """Retrieve chunks or parent documents."""
        if self.chunk_embeddings is None:
            return []
        
        # Embed query
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=[query]
        )
        query_embedding = np.array(response.data[0].embedding)
        
        # Calculate similarities
        similarities = np.dot(self.chunk_embeddings, query_embedding)
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        if return_parent:
            # Return unique parent documents
            seen_parents = set()
            results = []
            
            for i in top_indices:
                chunk = self.chunks[i]
                if chunk.parent_id not in seen_parents:
                    seen_parents.add(chunk.parent_id)
                    parent = self.parents[chunk.parent_id]
                    results.append((parent.text, similarities[i]))
            
            return results
        else:
            return [
                (self.chunks[i].text, similarities[i])
                for i in top_indices
            ]


# Usage
retriever = ParentDocumentRetriever(chunk_size=100, chunk_overlap=20)

# Add documents
doc1 = """
Machine learning is a branch of artificial intelligence that enables systems to learn 
from data. It includes supervised learning, unsupervised learning, and reinforcement 
learning. Deep learning, a subset of machine learning, uses neural networks with many 
layers to learn complex patterns.
"""

doc2 = """
Natural language processing (NLP) combines linguistics and machine learning to enable 
computers to understand human language. Key tasks include sentiment analysis, named 
entity recognition, and machine translation. Modern NLP relies heavily on transformer 
architectures like BERT and GPT.
"""

retriever.add_document(doc1)
retriever.add_document(doc2)

query = "deep learning neural networks"

print("Retrieved chunks:")
for text, score in retriever.retrieve(query, return_parent=False):
    print(f"[{score:.3f}] {text[:100]}...")

print("\nRetrieved parent documents:")
for text, score in retriever.retrieve(query, return_parent=True):
    print(f"[{score:.3f}] {text[:150]}...")
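
The overlapping-chunk logic can also be exercised on its own, without any API calls. The sketch below is a standalone variant of `_create_chunks`, written under the assumption that the loop must stop once a chunk reaches the end of the text and must always move forward; `chunk_text` is a hypothetical helper name, not part of the class above.

```python
def chunk_text(
    text: str,
    chunk_size: int = 200,
    overlap: int = 50,
) -> list[tuple[int, int, str]]:
    """Split text into overlapping chunks, returned as (start, end, chunk) tuples."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))

        # Snap the end back to the last space so a word is not cut in half
        if end < len(text):
            space_pos = text.rfind(" ", start, end)
            if space_pos > start:
                end = space_pos

        chunks.append((start, end, text[start:end].strip()))

        # The last chunk has been emitted: stop instead of stepping back again
        if end >= len(text):
            break

        # Step back by the overlap, but always advance at least one character
        start = max(end - overlap, start + 1)

    return chunks


sample = "alpha beta gamma delta epsilon zeta eta theta"
for start, end, chunk in chunk_text(sample, chunk_size=20, overlap=5):
    print(f"[{start:>2}:{end:>2}] {chunk}")
```

Because the overlap is measured in characters, a chunk can begin mid-word even though it never ends mid-word. Snapping the start to a word boundary as well is a reasonable refinement if that matters for your embeddings.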

Query Decomposition

Break complex queries into sub-queries, retrieve for each one independently, then synthesize a single answer from the combined context:
from openai import OpenAI
import json


class QueryDecomposer:
    """Decompose complex queries into simpler sub-queries."""
    
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
    
    def decompose(self, query: str) -> list[str]:
        """Break query into sub-queries."""
        prompt = f"""Analyze this complex query and break it into simpler sub-queries 
that can be answered independently.

Query: {query}

Return as JSON: {{"sub_queries": ["query1", "query2", ...]}}

Only decompose if necessary. For simple queries, return the original query."""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        data = json.loads(response.choices[0].message.content)
        # Fall back to the original query if the key is missing or the list is empty
        return data.get("sub_queries") or [query]
    
    def retrieve_and_synthesize(
        self,
        query: str,
        retriever,
        top_k: int = 3
    ) -> str:
        """Decompose query, retrieve for each, and synthesize answer."""
        # Decompose query
        sub_queries = self.decompose(query)
        
        # Retrieve for each sub-query
        all_context = []
        
        for sub_query in sub_queries:
            results = retriever.retrieve(sub_query, top_k=top_k)
            context = [doc for doc, score in results]
            all_context.extend(context)
        
        # Deduplicate context
        unique_context = list(dict.fromkeys(all_context))
        
        # Synthesize answer
        context_text = "\n\n".join(unique_context)
        
        synthesis_prompt = f"""Based on the following context, answer the question.

Context:
{context_text}

Question: {query}

Provide a comprehensive answer based on the context."""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": synthesis_prompt}]
        )
        
        return response.choices[0].message.content


# Usage
decomposer = QueryDecomposer()

complex_query = """
What are the main differences between supervised and unsupervised learning, 
and how does deep learning relate to each of them?
"""

sub_queries = decomposer.decompose(complex_query)
print("Decomposed queries:")
for i, sq in enumerate(sub_queries, 1):
    print(f"  {i}. {sq}")
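
Model output is not guaranteed to match the requested shape, so it can help to validate the decomposition before retrieving. The helper below is a defensive sketch: `parse_sub_queries` and the `max_sub_queries` cap are hypothetical additions, not part of `QueryDecomposer`, and the raw strings are hand-written stand-ins for model responses.

```python
import json


def parse_sub_queries(
    raw: str,
    original_query: str,
    max_sub_queries: int = 5,
) -> list[str]:
    """Extract a clean list of sub-queries, falling back to the original query."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return [original_query]

    candidates = data.get("sub_queries") if isinstance(data, dict) else None
    if not isinstance(candidates, list):
        return [original_query]

    # Keep only non-empty strings
    cleaned = [q.strip() for q in candidates if isinstance(q, str) and q.strip()]

    # dict.fromkeys removes duplicates while preserving the original order
    unique = list(dict.fromkeys(cleaned))

    # Cap the fan-out so one query cannot trigger an unbounded number of retrievals
    return unique[:max_sub_queries] or [original_query]


raw_response = '{"sub_queries": ["What is supervised learning?", "What is unsupervised learning?", "What is supervised learning?"]}'
print(parse_sub_queries(raw_response, "original question"))
print(parse_sub_queries("not json at all", "original question"))
```

Capping the number of sub-queries also bounds latency and cost, since each sub-query triggers its own retrieval.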

Corrective RAG (CRAG)

Self-correct retrieval by grading each retrieved document for relevance and, when too many fail, rewriting the query and retrieving again:
from openai import OpenAI
import json
import numpy as np


class CorrectiveRAG:
    """RAG with self-correction for improved accuracy."""
    
    def __init__(
        self,
        documents: list[str],
        model: str = "gpt-4o-mini"
    ):
        self.client = OpenAI()
        self.model = model
        self.documents = documents
        self.doc_embeddings = self._embed_documents()
    
    def _embed_documents(self) -> np.ndarray:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=self.documents
        )
        return np.array([e.embedding for e in response.data])
    
    def _retrieve(self, query: str, top_k: int) -> list[tuple[str, float]]:
        """Basic retrieval."""
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=[query]
        )
        query_embedding = np.array(response.data[0].embedding)
        
        similarities = np.dot(self.doc_embeddings, query_embedding)
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        return [(self.documents[i], similarities[i]) for i in top_indices]
    
    def _assess_relevance(
        self,
        query: str,
        document: str
    ) -> dict:
        """Assess if document is relevant to query."""
        prompt = f"""Assess if this document is relevant to the query.

Query: {query}

Document: {document}

Respond with JSON:
{{
    "is_relevant": true/false,
    "relevance_score": 0.0-1.0,
    "reasoning": "brief explanation"
}}"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)
    
    def _refine_query(self, query: str, context: str) -> str:
        """Refine query based on initial context."""
        prompt = f"""Based on this context, refine the query to get better results.

Original query: {query}

Available context: {context}

If the context is insufficient, create a more specific or alternative query.
Return only the refined query."""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.choices[0].message.content.strip()
    
    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        relevance_threshold: float = 0.5,
        max_iterations: int = 2
    ) -> list[str]:
        """Retrieve with self-correction."""
        current_query = query
        all_relevant_docs = []
        
        for iteration in range(max_iterations):
            # Retrieve documents
            results = self._retrieve(current_query, top_k)
            
            # Assess relevance of each document
            relevant_docs = []
            irrelevant_count = 0
            
            for doc, score in results:
                assessment = self._assess_relevance(query, doc)
                
                # A missing or null score counts as irrelevant rather than raising KeyError
                if float(assessment.get("relevance_score") or 0.0) >= relevance_threshold:
                    if doc not in all_relevant_docs:
                        relevant_docs.append(doc)
                        all_relevant_docs.append(doc)
                else:
                    irrelevant_count += 1
            
            # If too many irrelevant, refine query
            if irrelevant_count > top_k // 2 and iteration < max_iterations - 1:
                context = "\n".join(relevant_docs) if relevant_docs else "No relevant context found."
                current_query = self._refine_query(query, context)
                print(f"Refined query: {current_query}")
            else:
                break
        
        return all_relevant_docs
    
    def answer(
        self,
        query: str,
        top_k: int = 5
    ) -> str:
        """Retrieve and generate answer with CRAG."""
        relevant_docs = self.retrieve(query, top_k)
        
        if not relevant_docs:
            return "I couldn't find relevant information to answer this question."
        
        context = "\n\n".join(relevant_docs)
        
        prompt = f"""Answer the question based on the following context.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.choices[0].message.content


# Usage
documents = [
    "Python 3.12 was released in October 2023 with improved error messages.",
    "The Django web framework is built on Python and follows the MTV pattern.",
    "Machine learning models can be trained using scikit-learn in Python.",
    "Flask is a lightweight Python web framework for building APIs.",
    "Rust is a systems programming language focused on safety and performance.",
]

crag = CorrectiveRAG(documents)

query = "What Python web frameworks are available?"
answer = crag.answer(query)
print(f"Query: {query}")
print(f"Answer: {answer}")
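
The correction loop is easier to reason about, and to unit test, when retrieval, grading, and query refinement are passed in as plain functions. The sketch below restates the control flow of `CorrectiveRAG.retrieve` under that assumption; `corrective_retrieve` and the three stub functions are hypothetical and make no API calls.

```python
from typing import Callable


def corrective_retrieve(
    query: str,
    retrieve_fn: Callable[[str, int], list[str]],
    grade_fn: Callable[[str, str], float],
    refine_fn: Callable[[str, list[str]], str],
    top_k: int = 4,
    threshold: float = 0.5,
    max_iterations: int = 2,
) -> list[str]:
    """Retrieve, grade each document, and refine the query if most results fail."""
    current_query = query
    kept: list[str] = []

    for iteration in range(max_iterations):
        docs = retrieve_fn(current_query, top_k)

        irrelevant = 0
        for doc in docs:
            # Always grade against the ORIGINAL query, not the refined one
            if grade_fn(query, doc) >= threshold:
                if doc not in kept:
                    kept.append(doc)
            else:
                irrelevant += 1

        is_last = iteration == max_iterations - 1
        if irrelevant > top_k // 2 and not is_last:
            current_query = refine_fn(query, kept)
        else:
            break

    return kept


# Stubs standing in for the vector search and the LLM calls
def fake_retrieve(q: str, top_k: int) -> list[str]:
    if q == "refined":
        return ["flask doc", "django and flask"]
    return ["rust doc", "go doc", "java doc", "flask doc"]


def fake_grade(q: str, doc: str) -> float:
    return 1.0 if "flask" in doc else 0.0


def fake_refine(q: str, kept: list[str]) -> str:
    return "refined"


print(corrective_retrieve("python web frameworks", fake_retrieve, fake_grade, fake_refine))
```

Swapping `grade_fn` for a cross-encoder or a batched LLM call then changes cost and latency without touching the loop itself.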

Reciprocal Rank Fusion

Combine ranked results from multiple retrieval methods by summing 1 / (k + rank) for each document across the lists:
import numpy as np
from collections import defaultdict


class RRFRetriever:
    """Combine multiple retrievers using Reciprocal Rank Fusion."""
    
    def __init__(self, retrievers: list, k: int = 60):
        self.retrievers = retrievers
        self.k = k  # RRF parameter
    
    def retrieve(self, query: str, top_k: int = 10) -> list[tuple[str, float]]:
        """Retrieve using RRF to combine results."""
        doc_scores = defaultdict(float)
        
        # Get results from each retriever
        for retriever in self.retrievers:
            results = retriever.retrieve(query)
            
            for rank, (doc, _) in enumerate(results):
                # RRF formula: 1 / (k + rank) with a 1-based rank;
                # enumerate() is 0-based, hence the "+ 1" below
                doc_scores[doc] += 1.0 / (self.k + rank + 1)
        
        # Sort by combined score
        sorted_docs = sorted(
            doc_scores.items(),
            key=lambda x: x[1],
            reverse=True
        )
        
        return sorted_docs[:top_k]


# Can combine with keyword (BM25) and semantic retrievers
# See semantic-search.mdx for full implementation
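
As a worked example of the fusion arithmetic, the sketch below applies the same formula to two hand-written rankings. The document IDs and the function name `reciprocal_rank_fusion` are illustrative assumptions; in practice the lists would come from a keyword retriever and a dense retriever.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


keyword_ranking = ["a", "b", "c"]  # e.g. from a BM25 retriever
dense_ranking = ["b", "d", "a"]    # e.g. from an embedding retriever

for doc_id, score in reciprocal_rank_fusion([keyword_ranking, dense_ranking]):
    print(f"{doc_id}: {score:.5f}")
# Fused order: b, a, d, c
```

Documents "a" and "b" appear in both lists, so they outrank "c" and "d", each of which only one retriever found. Because only ranks are used, the raw scores of the two retrievers never need to be on the same scale.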

Advanced RAG Best Practices

  • Use HyDE when queries are short or phrased very differently from the documents (for example, terse questions against long-form prose)
  • Multi-query retrieval improves recall for ambiguous queries
  • Parent document retrieval preserves context for answers
  • Always assess retrieval quality before generation
  • Combine multiple techniques for best results

Practice Exercise

Build an advanced RAG system that:
  1. Implements HyDE for query transformation
  2. Uses multi-query retrieval for improved recall
  3. Applies parent document retrieval for context
  4. Includes self-correction with relevance assessment
  5. Combines methods using reciprocal rank fusion
Focus on:
  • Measuring retrieval quality improvements
  • Balancing latency vs quality tradeoffs
  • Handling edge cases gracefully
  • Providing explainable retrieval decisions

Interview Deep-Dive

Strong Answer:
  • Before touching the model or retrieval logic, I start by inspecting the data. The most common cause of poor retrieval is bad chunking, not bad embeddings. I pull 20-30 failing queries, look at what chunks were retrieved versus what chunks should have been retrieved, and check whether the correct answer even exists in any chunk. In at least half the cases I have debugged, the answer was split across two chunks or buried in a chunk dominated by irrelevant surrounding text.
  • Next I check for vocabulary mismatch. If users ask “How do I cancel my subscription?” but the docs say “To terminate your recurring billing plan,” even good embeddings may not bridge that gap. This is exactly where techniques like HyDE or multi-query retrieval help, because they generate text in the vocabulary of the answer space rather than the question space.
  • I evaluate retrieval separately from generation. I compute recall at k (what percentage of relevant chunks appear in the top k results) and mean reciprocal rank across a test set. If recall at 5 is below 70%, the problem is definitely retrieval. If recall is fine but the final answers are bad, the problem is in the generation prompt or context assembly.
  • The chunk size and overlap parameters are the highest-leverage tuning knobs. Too small and you lose context. Too large and you dilute the signal with noise. I typically test three configurations (256, 512, 1024 tokens) on a benchmark set and pick the one with the best recall. Parent document retrieval is a great hybrid approach when you want precise matching but full-context answers.
  • Finally, I check for index issues. If you are using HNSW, the ef_search parameter controls the accuracy-speed trade-off at query time. A low ef_search (the default in many libraries) can silently miss relevant results. Increasing it from 40 to 200 often recovers 5-10% recall with modest latency increase.
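The retrieval metrics mentioned in this answer, recall at k and mean reciprocal rank, are simple to compute once you have a labeled test set. The sketch below assumes each test case is a pair of (retrieved document IDs in rank order, set of relevant IDs); the function names and the sample data are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average recall@k and MRR over a list of (retrieved, relevant) test cases."""
    if not runs:
        raise ValueError("need at least one test case")
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


test_runs = [
    (["d1", "d2", "d3"], {"d2", "d9"}),  # one of two relevant docs found, at rank 2
    (["d4", "d5"], {"d4"}),              # the only relevant doc found at rank 1
]
print(evaluate(test_runs, k=2))  # {'recall_at_k': 0.75, 'mrr': 0.75}
```

Running this before and after each chunking or index change gives a like-for-like comparison that is independent of the generation step.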
Follow-up: When would you move from a single retrieval strategy to Reciprocal Rank Fusion with multiple retrievers?
When you have queries that span different retrieval modalities. For example, some queries are best served by semantic similarity (conceptual questions) while others need keyword matching (specific error codes, product names, or exact phrases). A single embedding model cannot excel at both. RRF lets you combine a BM25 keyword retriever with a dense embedding retriever, and the fusion naturally up-ranks documents that both methods agree on while still surfacing results that only one method found. In practice, I add a second retriever when I see a bimodal distribution in my failure analysis: one cluster of failures is semantic mismatches (embeddings would help) and another is lexical mismatches (keywords would help). The RRF constant k=60 works well as a default but is worth tuning on your specific query distribution.
Strong Answer:
  • HyDE and multi-query retrieval both solve the query-document vocabulary gap, but they attack it from different angles. HyDE generates a hypothetical answer document and embeds that, betting that a fake answer will be semantically closer to the real answer than the original short question. Multi-query generates multiple rephrased versions of the question and retrieves against all of them, betting that at least one rephrasing will match the vocabulary of the relevant document.
  • HyDE works best when queries are very short and documents are long-form prose. A query like “Python creation date” is far from any document in embedding space, but a hypothetical paragraph about Python’s history lands right in the neighborhood. The risk with HyDE is that if the model generates a confidently wrong hypothetical, it pulls retrieval in the wrong direction. For a domain-specific corpus where the model has poor training data coverage, HyDE can actually hurt recall.
  • Multi-query is safer and more robust. It does not require the model to know the answer, just to rephrase the question. This makes it better for specialized or proprietary domains where the LLM might generate an inaccurate hypothetical. The downside is cost and latency: you are embedding 3-5 queries instead of one, plus the LLM call to generate variations.
  • In practice, I choose HyDE for general-knowledge domains with short queries and long documents (help centers, Wikipedia-style knowledge bases). I choose multi-query for specialized domains (legal, medical, internal docs) where the model might not know the answer but can still generate useful query reformulations. When in doubt, multi-query is the safer default because it degrades gracefully.
  • One nuance people miss: you can combine them. Use multi-query to generate 3 variations, then apply HyDE to each variation. This gives you 3 hypothetical documents plus the 3 query variations, all contributing to retrieval. Expensive, but for high-value queries where recall matters more than latency, it is extremely effective.
Follow-up: How do you evaluate whether HyDE is actually improving your retrieval, versus just adding latency?
A/B test on your actual query distribution. Take 200 representative queries with known relevant documents. Run each query with and without HyDE, measure recall at 5 and MRR for both. If HyDE improves recall by more than 5% without tanking precision, it is worth the extra LLM call. Also measure the “hurt rate”: the percentage of queries where HyDE made retrieval worse. If the hurt rate exceeds 15%, you need to add a classifier that decides per-query whether to use HyDE. Short vague queries benefit from HyDE; long specific queries with domain terms usually do not.
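The comparison described here reduces to a few lines once per-query recall has been measured for both configurations. The sketch below assumes two parallel lists of per-query recall values; `compare_runs`, the `tolerance` parameter, and the sample numbers are hypothetical.

```python
def compare_runs(
    baseline: list[float],
    hyde: list[float],
    tolerance: float = 1e-9,
) -> dict[str, float]:
    """Compare per-query recall with and without HyDE.

    help_rate / hurt_rate are the fractions of queries where HyDE scored
    higher / lower than the baseline by more than the tolerance.
    """
    if len(baseline) != len(hyde) or not baseline:
        raise ValueError("need two non-empty lists of equal length")

    n = len(baseline)
    helped = sum(1 for b, h in zip(baseline, hyde) if h > b + tolerance)
    hurt = sum(1 for b, h in zip(baseline, hyde) if h < b - tolerance)

    return {
        "mean_baseline": sum(baseline) / n,
        "mean_hyde": sum(hyde) / n,
        "help_rate": helped / n,
        "hurt_rate": hurt / n,
    }


# Toy per-query recall@5 values (assumption: measured on the same queries, same order)
recall_without_hyde = [0.5, 1.0, 0.0, 0.5]
recall_with_hyde = [1.0, 0.5, 0.5, 0.5]

print(compare_runs(recall_without_hyde, recall_with_hyde))
```

Looking at the hurt rate alongside the mean matters because an average improvement can hide a sizeable group of queries that got worse.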
Strong Answer:
  • The relevance assessment in Corrective RAG is the step that checks whether retrieved documents actually answer the query before passing them to generation. The naive approach is to call an LLM for each retrieved document, but if you retrieve 10 documents, that is 10 additional LLM calls per query, which destroys latency.
  • The first optimization is to use a cross-encoder reranker instead of an LLM call. Models like Cohere Rerank or a fine-tuned cross-encoder (like ms-marco-MiniLM) take a query-document pair and output a relevance score in 5-20ms versus 500ms+ for an LLM call. This is the approach most production CRAG systems actually use.
  • The second optimization is to batch the assessment. Instead of asking “Is document X relevant to query Y?” for each document separately, pass all retrieved documents in a single LLM call with the prompt “Score each of these documents for relevance to the query on a 0-1 scale.” One call instead of N. The trade-off is that the model may give less precise scores when evaluating many documents at once, but for a binary relevant/irrelevant decision, it works well.
  • The third approach is to use the embedding similarity score itself as a first-pass filter. Set a threshold (say, cosine similarity below 0.3 is treated as irrelevant) and only run the expensive relevance check on documents in the uncertain zone (0.3-0.7). Documents above 0.7 are assumed relevant without checking. The exact cutoffs depend on your embedding model's score distribution, so calibrate them on labeled queries rather than copying these numbers. This cascading approach reduces the number of documents that need LLM-based assessment by 50-70% in practice.
  • For the refinement loop where you rewrite the query and re-retrieve, I cap it at 2 iterations maximum. Beyond that, if you still have not found relevant documents, the information probably is not in your corpus, and it is better to tell the user honestly than to keep searching and burning tokens.
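The cascading filter from the third approach can be sketched as follows. This is an illustration under stated assumptions: `cascade_filter` is a hypothetical name, `expensive_check` stands in for whatever cross-encoder or LLM call you use, and the 0.3 and 0.7 cutoffs are the example values from the text, not recommended defaults.

```python
from typing import Callable, List, Tuple


def cascade_filter(
    scored_docs: List[Tuple[str, float]],
    expensive_check: Callable[[str], bool],
    low: float = 0.3,
    high: float = 0.7,
) -> Tuple[List[str], int]:
    """Triage documents by similarity; only the uncertain zone gets the costly check.

    scored_docs holds (document_text, cosine_similarity) pairs.
    Returns the kept documents and the number of expensive checks performed.
    """
    kept: List[str] = []
    checks_run = 0
    for text, similarity in scored_docs:
        if similarity < low:
            # Below the low cutoff: drop without checking.
            continue
        if similarity > high:
            # Above the high cutoff: assume relevant without checking.
            kept.append(text)
            continue
        # Uncertain zone: pay for the cross-encoder or LLM assessment.
        checks_run += 1
        if expensive_check(text):
            kept.append(text)
    return kept, checks_run
```

Returning the check count alongside the kept documents makes it easy to measure how much work the cascade actually saves on your own traffic.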
Follow-up: How do you prevent Corrective RAG from over-correcting and filtering out documents that are partially relevant?
This is a real problem. Strict relevance filtering can remove documents that contain useful supporting context even though they do not directly answer the question. I use a three-tier classification instead of binary relevant/irrelevant: “directly answers the query,” “provides useful context,” and “irrelevant.” I keep the first two tiers and only discard the third. When assembling the context for generation, I put directly-relevant documents first and supporting-context documents after a separator, so the model knows which sources to prioritize for the answer versus which are background. This preserves recall while giving the generation step a signal about source quality.
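One way to express the three-tier assembly step in code is shown below. It assumes the grading has already happened upstream (by an LLM or reranker); the `Tier` enum, the `assemble_context` function, and the separator text are all hypothetical choices for illustration.

```python
from enum import Enum
from typing import List, Tuple


class Tier(Enum):
    DIRECT = "directly answers the query"
    CONTEXT = "provides useful context"
    IRRELEVANT = "irrelevant"


# Hypothetical separator; any marker the generation prompt explains would do.
SEPARATOR = "--- supporting context ---"


def assemble_context(graded_docs: List[Tuple[str, Tier]]) -> str:
    """Put direct answers first, then supporting context after a separator.

    Documents graded IRRELEVANT are dropped; the other two tiers are kept.
    """
    direct = [text for text, tier in graded_docs if tier is Tier.DIRECT]
    support = [text for text, tier in graded_docs if tier is Tier.CONTEXT]
    blocks = list(direct)
    if support:
        blocks.append(SEPARATOR)
        blocks.extend(support)
    return "\n\n".join(blocks)
```

The generation prompt would then need a sentence telling the model that everything after the separator is background, otherwise the ordering signal is easy to miss.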
Strong Answer:
  • Parent document retrieval addresses a core tension in RAG: small chunks are better for precise matching, but large chunks are better for providing complete context to the LLM. If you chunk at 100 tokens, your embedding search is very precise because each chunk is focused on a single idea. But when you pass that 100-token chunk to the LLM for answer generation, it often lacks enough context to produce a good response. If you chunk at 1000 tokens, the LLM gets plenty of context, but the embedding is a blurry average of many ideas, which hurts retrieval precision.
  • Parent document retrieval solves this by maintaining two levels: small child chunks for retrieval and larger parent documents for context. You embed and search against the small chunks, but when you find a match, you return the parent document (or a larger surrounding window) to the LLM. You get the best of both worlds: precise retrieval and rich context.
  • It breaks down in a few scenarios. First, when the parent document is too large (multiple pages), you end up stuffing too much irrelevant context into the LLM prompt, which can actually reduce answer quality and wastes tokens. Second, when multiple relevant chunks come from different parent documents, you may exceed the context window by returning all parents. You need a strategy for this: either return partial parents (a window around the matching chunk) or cap the total context size and prioritize by relevance score.
  • The third failure mode is information that spans parent boundaries. If the answer to a question starts at the end of one parent and continues at the beginning of the next, neither parent alone contains the full answer. Overlapping parent windows help, but they increase storage and can introduce duplicate information in the context.
  • In production, I typically set child chunks at 200-300 tokens and parent windows at 1000-1500 tokens (not entire documents). This gives a 3-5x context expansion, which is enough for most use cases without blowing up context size.
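The two-level idea above can be sketched without any external dependencies. This is a toy illustration: whitespace-separated words stand in for tokens, and a word-overlap score stands in for embedding similarity, so it only shows the child-to-parent mapping, not real retrieval quality. All function names are hypothetical.

```python
from typing import List, Tuple


def split_windows(tokens: List[str], size: int) -> List[List[str]]:
    """Split a token list into consecutive, non-overlapping windows."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def build_index(text: str, child_size: int, parent_size: int) -> List[Tuple[str, str]]:
    """Return (child_chunk, parent_window) pairs; each child remembers its parent."""
    index: List[Tuple[str, str]] = []
    for parent in split_windows(text.split(), parent_size):
        parent_text = " ".join(parent)
        for child in split_windows(parent, child_size):
            index.append((" ".join(child), parent_text))
    return index


def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of lowercase words."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def retrieve_parents(query: str, index: List[Tuple[str, str]], top_k: int = 2) -> List[str]:
    """Search against child chunks, but return their de-duplicated parent windows."""
    ranked = sorted(index, key=lambda pair: overlap_score(query, pair[0]), reverse=True)
    parents: List[str] = []
    for _, parent_text in ranked[:top_k]:
        if parent_text not in parents:
            parents.append(parent_text)
    return parents
```

De-duplicating parents matters because several top-ranked children often share one parent, and returning it twice wastes context budget. In a real system the overlap score would be replaced by a vector search over child embeddings, with the parent ID stored as metadata on each child.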
Follow-up: How do you choose the right chunk size for child chunks versus parent documents, and is there a systematic way to tune these?
I treat it as a hyperparameter search guided by evaluation metrics. Create a benchmark set of 50-100 query-answer pairs. Test child chunk sizes of 128, 256, and 512 tokens with parent windows of 512, 1024, and 2048 tokens. Measure recall at k for the child retrieval step and answer quality (using LLM-as-judge or human evaluation) for the full pipeline. Usually the best child size is the one where most answers fit entirely within a single chunk, which you can estimate by looking at the distribution of answer lengths in your evaluation set. The parent size should be large enough to include the 2-3 sentences surrounding any answer, which for most prose documents is around 3-5x the child size.
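The answer-length heuristic can be turned into a quick starting point for the search. The sketch below assumes you already have token counts for the gold answers in your benchmark set; the function name and the 80% coverage target (one reading of "most answers") are assumptions for illustration.

```python
from typing import Optional, Sequence


def smallest_covering_size(
    answer_lengths: Sequence[int],
    candidates: Sequence[int] = (128, 256, 512),
    coverage: float = 0.8,
) -> Optional[int]:
    """Smallest candidate child size that fits at least `coverage` of the answers.

    answer_lengths are token counts of gold answers in the evaluation set.
    Returns None when there are no answers or no candidate reaches the target.
    """
    if not answer_lengths:
        return None
    for size in sorted(candidates):
        fitting = sum(1 for length in answer_lengths if length <= size)
        if fitting / len(answer_lengths) >= coverage:
            return size
    return None
```

This only narrows the candidate range; the recall at k and answer-quality measurements on the full pipeline still decide the final setting.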