Semantic search goes beyond keyword matching to understand meaning and intent. Traditional keyword search is like looking up a word in a dictionary — it finds exact matches but misses everything else. Semantic search is like asking a knowledgeable friend — “show me things about vacation policy” will find documents about “PTO guidelines,” “time off procedures,” and “leave of absence rules” even if they never use the word “vacation.” This is the retrieval backbone of every RAG system, search feature, and recommendation engine in modern AI.

Search Methods Comparison

Method              Strengths                    Weaknesses
-----------------------------------------------------------------
Keyword (BM25)      Exact matches, fast          Misses synonyms
Semantic            Understands meaning          Misses keywords
Hybrid              Best of both                 More complex

BM25 Implementation

BM25 (Best Match 25) is the default ranking function in Lucene-based engines such as Elasticsearch and Solr, and in many other traditional search systems. It is a probabilistic ranking function that scores documents based on term frequency (how often the query words appear), weighted by inverse document frequency (rare words matter more than common ones) and normalized by document length. Think of it as “smart keyword matching”: it handles the math that makes a “rare important word” rank higher than a “common filler word.” Despite being decades old, BM25 remains a very strong baseline for exact-match queries like product SKUs, error codes, and proper nouns:
import math
from collections import Counter
from typing import List, Dict, Tuple

class BM25:
    """BM25 ranking algorithm implementation"""
    
    def __init__(
        self,
        documents: List[str],
        k1: float = 1.5,
        b: float = 0.75
    ):
        self.k1 = k1
        self.b = b
        self.documents = documents
        self.doc_count = len(documents)
        
        # Tokenize documents
        self.tokenized_docs = [self._tokenize(doc) for doc in documents]
        
        # Calculate document lengths
        self.doc_lengths = [len(doc) for doc in self.tokenized_docs]
        # Guard against an empty corpus to avoid division by zero
        self.avg_doc_length = (
            sum(self.doc_lengths) / self.doc_count if self.doc_count else 0.0
        )
        
        # Build inverted index
        self.doc_freqs = {}  # term -> number of docs containing term
        self.term_freqs = []  # per-document term frequencies
        
        for doc_tokens in self.tokenized_docs:
            term_freq = Counter(doc_tokens)
            self.term_freqs.append(term_freq)
            
            for term in set(doc_tokens):
                self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1
    
    def _tokenize(self, text: str) -> List[str]:
        """Simple tokenization"""
        return text.lower().split()
    
    def _idf(self, term: str) -> float:
        """Calculate inverse document frequency.
        
        IDF is the key insight: a word appearing in 1 out of 10,000 docs
        is much more informative than one appearing in 9,000 out of 10,000.
        "Python" in a programming corpus has low IDF (common), but "asyncpg"
        has high IDF (rare and specific).
        """
        doc_freq = self.doc_freqs.get(term, 0)
        return math.log(
            (self.doc_count - doc_freq + 0.5) / (doc_freq + 0.5) + 1
        )
    
    def _score_document(
        self,
        query_terms: List[str],
        doc_idx: int
    ) -> float:
        """Score a single document against query"""
        score = 0.0
        doc_len = self.doc_lengths[doc_idx]
        term_freqs = self.term_freqs[doc_idx]
        
        for term in query_terms:
            if term not in term_freqs:
                continue
            
            tf = term_freqs[term]
            idf = self._idf(term)
            
            # BM25 formula
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (
                1 - self.b + self.b * (doc_len / self.avg_doc_length)
            )
            
            score += idf * (numerator / denominator)
        
        return score
    
    def search(
        self,
        query: str,
        top_k: int = 10
    ) -> List[Tuple[int, float]]:
        """Search documents and return top-k results"""
        query_terms = self._tokenize(query)
        
        scores = []
        for idx in range(self.doc_count):
            score = self._score_document(query_terms, idx)
            if score > 0:
                scores.append((idx, score))
        
        # Sort by score descending
        scores.sort(key=lambda x: x[1], reverse=True)
        
        return scores[:top_k]

# Usage
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret images"
]

bm25 = BM25(documents)
results = bm25.search("neural network deep learning")

for idx, score in results:
    print(f"Score: {score:.3f} - {documents[idx][:50]}...")
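
To make the scoring terms concrete, here is a small worked sketch that uses the same smoothed IDF and per-term formula as the class above. The corpus size, document frequencies, and document lengths are made-up numbers chosen for illustration only.

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """Smoothed IDF, same form as BM25._idf above."""
    return math.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.5, b=0.75):
    """Contribution of a single query term to one document's score."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Hypothetical corpus of 10,000 documents
rare_idf = bm25_idf(10_000, 1)        # term appears in 1 document
common_idf = bm25_idf(10_000, 9_000)  # term appears in 9,000 documents
print(f"rare IDF: {rare_idf:.2f}, common IDF: {common_idf:.2f}")

# Term frequency saturates: 10 occurrences score far less than 10x one occurrence
one_hit = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, idf=rare_idf)
ten_hits = bm25_term_score(tf=10, doc_len=100, avg_doc_len=100, idf=rare_idf)

# Length normalization: the same tf counts for less in a longer document
long_doc = bm25_term_score(tf=1, doc_len=400, avg_doc_len=100, idf=rare_idf)
print(f"one hit: {one_hit:.2f}, ten hits: {ten_hits:.2f}, long doc: {long_doc:.2f}")
```

The two effects to notice are saturation (controlled by k1) and length normalization (controlled by b); both are why BM25 behaves better than raw term counting.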

Semantic Search with Embeddings

from openai import OpenAI
import numpy as np
from typing import List, Tuple

client = OpenAI()

class SemanticSearch:
    """Semantic search using embeddings"""
    
    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        self.documents = []
        self.embeddings = []
    
    def _embed(self, texts: List[str]) -> np.ndarray:
        """Get embeddings for texts"""
        response = client.embeddings.create(
            model=self.model,
            input=texts
        )
        return np.array([e.embedding for e in response.data])
    
    def add_documents(self, documents: List[str]):
        """Add documents to the index"""
        self.documents.extend(documents)
        embeddings = self._embed(documents)
        
        if len(self.embeddings) == 0:
            self.embeddings = embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, embeddings])
    
    def search(
        self,
        query: str,
        top_k: int = 10
    ) -> List[Tuple[int, float, str]]:
        """Search for similar documents"""
        query_embedding = self._embed([query])[0]
        
        # Cosine similarity
        similarities = np.dot(self.embeddings, query_embedding) / (
            np.linalg.norm(self.embeddings, axis=1) * 
            np.linalg.norm(query_embedding)
        )
        
        # Get top-k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            results.append((
                int(idx),
                float(similarities[idx]),
                self.documents[idx]
            ))
        
        return results

# Usage
semantic = SemanticSearch()
semantic.add_documents(documents)
results = semantic.search("How do machines learn?")
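
The cosine-similarity ranking inside `search` can be tried without any API call. The sketch below uses hand-written three-dimensional vectors as stand-ins for real embeddings; the vectors and document names are hypothetical and only illustrate the ranking step.

```python
import math
from typing import Dict, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (ignores their lengths)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; a real model returns far more dimensions
toy_index: Dict[str, List[float]] = {
    "PTO guidelines": [0.9, 0.1, 0.0],
    "time off procedures": [0.8, 0.3, 0.1],
    "GPU driver install": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for the query "vacation policy"

def rank(query: List[float], index: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Score every document against the query and sort best-first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranked = rank(query_vec, toy_index)
for name, score in ranked:
    print(f"{score:.3f}  {name}")
```

Documents whose vectors point in roughly the same direction as the query rank first, regardless of whether they share any words with it.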

Neither BM25 nor semantic search is universally better; they have complementary strengths. BM25 excels at exact matches (error codes, function names, acronyms), while semantic search excels at meaning (synonyms, paraphrases, conceptual similarity). Combining them usually outperforms either alone on mixed query sets, so the main question is how to weight them. The class below combines BM25 and semantic scores with a tunable weight alpha:
from dataclasses import dataclass
from typing import List, Dict, Optional, Tuple

@dataclass
class SearchResult:
    doc_id: int
    document: str
    bm25_score: float = 0.0
    semantic_score: float = 0.0
    hybrid_score: float = 0.0

class HybridSearch:
    """Combine BM25 and semantic search"""
    
    def __init__(
        self,
        documents: List[str],
        alpha: float = 0.5,  # Weight for semantic (1-alpha for BM25)
        # Tip: alpha=0.7 works well for general text; alpha=0.3 for
        # technical docs where exact terminology matters more
        embedding_model: str = "text-embedding-3-small"
    ):
        self.documents = documents
        self.alpha = alpha
        
        # Initialize both search methods
        self.bm25 = BM25(documents)
        self.semantic = SemanticSearch(embedding_model)
        self.semantic.add_documents(documents)
    
    def _normalize_scores(
        self,
        scores: List[Tuple[int, float]]
    ) -> Dict[int, float]:
        """Min-max normalize scores to 0-1"""
        if not scores:
            return {}
        
        values = [s[1] for s in scores]
        min_val = min(values)
        max_val = max(values)
        range_val = max_val - min_val if max_val > min_val else 1
        
        return {
            idx: (score - min_val) / range_val
            for idx, score in scores
        }
    
    def search(
        self,
        query: str,
        top_k: int = 10
    ) -> List[SearchResult]:
        """Perform hybrid search"""
        
        # Get BM25 results
        bm25_results = self.bm25.search(query, top_k=top_k * 2)
        bm25_scores = self._normalize_scores(bm25_results)
        
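        # Get semantic results and normalize them the same way, so both
        # score sets share a 0-1 range before weighting
        semantic_results = self.semantic.search(query, top_k=top_k * 2)
        semantic_scores = self._normalize_scores(
            [(idx, score) for idx, score, _ in semantic_results]
        )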
        
        # Combine all unique document IDs
        all_docs = set(bm25_scores.keys()) | set(semantic_scores.keys())
        
        # Calculate hybrid scores
        results = []
        for doc_id in all_docs:
            bm25 = bm25_scores.get(doc_id, 0.0)
            semantic = semantic_scores.get(doc_id, 0.0)
            hybrid = self.alpha * semantic + (1 - self.alpha) * bm25
            
            results.append(SearchResult(
                doc_id=doc_id,
                document=self.documents[doc_id],
                bm25_score=bm25,
                semantic_score=semantic,
                hybrid_score=hybrid
            ))
        
        # Sort by hybrid score
        results.sort(key=lambda x: x.hybrid_score, reverse=True)
        
        return results[:top_k]

# Usage
hybrid = HybridSearch(documents, alpha=0.7)  # 70% semantic, 30% BM25
results = hybrid.search("machine learning algorithms")

for r in results:
    print(f"Hybrid: {r.hybrid_score:.3f} (BM25: {r.bm25_score:.3f}, "
          f"Semantic: {r.semantic_score:.3f})")
    print(f"  {r.document[:60]}...")
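
Why normalization matters before weighting can be seen with a few numbers. This is a minimal sketch with hypothetical raw scores for three documents; `min_max` and `blend` are illustrative helpers, not part of the classes above.

```python
from typing import Dict

def min_max(scores: Dict[int, float]) -> Dict[int, float]:
    """Rescale scores to the 0-1 range within one result set."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc: (s - lo) / span for doc, s in scores.items()}

def blend(bm25: Dict[int, float], semantic: Dict[int, float], alpha: float) -> Dict[int, float]:
    """Weighted sum: alpha for semantic, (1 - alpha) for BM25."""
    docs = set(bm25) | set(semantic)
    return {
        d: alpha * semantic.get(d, 0.0) + (1 - alpha) * bm25.get(d, 0.0)
        for d in docs
    }

def top(scores: Dict[int, float]) -> int:
    """Document id with the highest combined score."""
    return max(scores, key=scores.get)

# Hypothetical raw scores: doc 0 is the keyword favorite, doc 1 the semantic favorite
raw_bm25 = {0: 18.0, 1: 6.0, 2: 2.0}
raw_semantic = {0: 0.70, 1: 0.95, 2: 0.72}

unnormalized = blend(raw_bm25, raw_semantic, alpha=0.7)
normalized = blend(min_max(raw_bm25), min_max(raw_semantic), alpha=0.7)

print("top without normalization:", top(unnormalized))
print("top with normalization:   ", top(normalized))
```

Even with 70% of the weight on the semantic side, the unnormalized blend is decided by the BM25 scale; after min-max normalization the weight behaves as intended.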

Reciprocal Rank Fusion (RRF)

RRF is a widely used method for merging multiple ranked lists. Unlike weighted averaging (which requires score normalization), RRF only uses rank positions, making it robust across different scoring scales. The formula is simple: for each document, sum 1/(k + rank) across all rankings, where rank is the document's 1-based position in each list and documents missing from a list contribute nothing for it. Documents that appear near the top in multiple lists get the highest combined score.
from collections import defaultdict
from typing import List, Dict, Tuple

def reciprocal_rank_fusion(
    rankings: List[List[int]],
    k: int = 60  # From the original paper -- rarely needs tuning
) -> List[Tuple[int, float]]:
    """
    Combine multiple rankings using RRF.
    
    Args:
        rankings: List of ranked document ID lists
        k: Constant to prevent high ranks from dominating (60 is standard)
    
    Returns:
        List of (doc_id, rrf_score) sorted by score
    """
    rrf_scores = defaultdict(float)
    
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            rrf_scores[doc_id] += 1 / (k + rank)
    
    # Sort by RRF score
    sorted_docs = sorted(
        rrf_scores.items(),
        key=lambda x: x[1],
        reverse=True
    )
    
    return sorted_docs

class RRFHybridSearch:
    """Hybrid search using Reciprocal Rank Fusion"""
    
    def __init__(self, documents: List[str]):
        self.documents = documents
        self.bm25 = BM25(documents)
        self.semantic = SemanticSearch()
        self.semantic.add_documents(documents)
    
    def search(
        self,
        query: str,
        top_k: int = 10,
        rrf_k: int = 60
    ) -> List[Tuple[int, float, str]]:
        """Search using RRF to combine rankings"""
        
        # Get rankings from both methods
        bm25_results = self.bm25.search(query, top_k=top_k * 2)
        bm25_ranking = [idx for idx, _ in bm25_results]
        
        semantic_results = self.semantic.search(query, top_k=top_k * 2)
        semantic_ranking = [idx for idx, _, _ in semantic_results]
        
        # Apply RRF
        fused = reciprocal_rank_fusion(
            [bm25_ranking, semantic_ranking],
            k=rrf_k
        )
        
        results = []
        for doc_id, score in fused[:top_k]:
            results.append((doc_id, score, self.documents[doc_id]))
        
        return results
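
A worked example makes the fusion arithmetic easier to follow. The sketch below re-implements the same sum of 1/(k + rank) on two short, hypothetical rankings with string ids so the result can be checked by hand.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def rrf(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Sum 1 / (k + rank) for each document over every ranking it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings from two retrievers
bm25_ranking = ["A", "B", "C"]
semantic_ranking = ["C", "A", "D"]

fused = rrf([bm25_ranking, semantic_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")

# By hand: A = 1/61 + 1/62, C = 1/63 + 1/61, B = 1/62, D = 1/63
```

A and C appear in both lists and so outrank B and D, which each appear in only one; A edges out C because its two positions (1st and 2nd) are better than C's (3rd and 1st).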

Reranking

Retrieval is fast but approximate. Reranking is slow but precise. The two-stage pattern exploits this: retrieve roughly 100 candidates cheaply (milliseconds), then rerank only those candidates with a more powerful cross-encoder model that reads the query and each document together (typically hundreds of milliseconds, depending on model and hardware). A cross-encoder sees the query-document pair simultaneously, so it catches subtle relevance signals that bi-encoder similarity misses:
from dataclasses import dataclass
from typing import List

@dataclass
class RerankResult:
    index: int
    document: str
    relevance_score: float

class CrossEncoderReranker:
    """Rerank using cross-encoder models"""
    
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        from sentence_transformers import CrossEncoder
        self.model = CrossEncoder(model_name)
    
    def rerank(
        self,
        query: str,
        documents: List[str],
        top_k: int = None
    ) -> List[RerankResult]:
        """Rerank documents by relevance to query"""
        
        # Create query-document pairs
        pairs = [[query, doc] for doc in documents]
        
        # Get relevance scores
        scores = self.model.predict(pairs)
        
        # Create results with scores
        results = [
            RerankResult(
                index=i,
                document=doc,
                relevance_score=float(score)
            )
            for i, (doc, score) in enumerate(zip(documents, scores))
        ]
        
        # Sort by relevance
        results.sort(key=lambda x: x.relevance_score, reverse=True)
        
        if top_k is not None:
            results = results[:top_k]
        
        return results

# Cohere Rerank API
class CohereReranker:
    """Rerank using Cohere API"""
    
    def __init__(self, api_key: str = None):
        import cohere
        self.client = cohere.Client(api_key)
    
    def rerank(
        self,
        query: str,
        documents: List[str],
        top_k: int = 10
    ) -> List[RerankResult]:
        """Rerank using Cohere rerank endpoint"""
        
        response = self.client.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=documents,
            top_n=top_k
        )
        
        return [
            RerankResult(
                index=r.index,
                document=documents[r.index],
                relevance_score=r.relevance_score
            )
            for r in response.results
        ]

# Two-stage retrieval: retrieve then rerank
class TwoStageRetriever:
    """Retrieve candidates then rerank"""
    
    def __init__(
        self,
        documents: List[str],
        retrieval_k: int = 100,
        final_k: int = 10
    ):
        self.documents = documents
        self.retrieval_k = retrieval_k
        self.final_k = final_k
        
        # Fast retrieval
        self.hybrid = HybridSearch(documents)
        
        # Accurate reranking
        self.reranker = CrossEncoderReranker()
    
    def search(self, query: str) -> List[RerankResult]:
        """Two-stage search"""
        
        # Stage 1: Fast retrieval of candidates
        candidates = self.hybrid.search(query, top_k=self.retrieval_k)
        candidate_docs = [c.document for c in candidates]
        
        # Stage 2: Accurate reranking
        reranked = self.reranker.rerank(
            query,
            candidate_docs,
            top_k=self.final_k
        )
        
        return reranked
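
The retrieve-then-rerank flow can be sketched without any model downloads. In the example below both scorers are deliberately simple stand-ins: token overlap plays the role of the fast retriever, and word-pair overlap plays the role of the cross-encoder. Neither is a real retrieval model, and the documents are hypothetical.

```python
from typing import List, Set, Tuple

def cheap_score(query: str, doc: str) -> float:
    """Stage 1 stand-in: number of shared tokens (ignores word order)."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def bigrams(text: str) -> Set[Tuple[str, str]]:
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def expensive_score(query: str, doc: str) -> float:
    """Stage 2 stand-in for a cross-encoder: shared adjacent word pairs."""
    return float(len(bigrams(query) & bigrams(doc)))

def two_stage(query: str, docs: List[str], retrieval_k: int, final_k: int) -> List[int]:
    # Stage 1: keep the retrieval_k best documents under the cheap scorer
    candidates = sorted(
        range(len(docs)), key=lambda i: cheap_score(query, docs[i]), reverse=True
    )[:retrieval_k]
    # Stage 2: reorder only those candidates with the expensive scorer
    reranked = sorted(
        candidates, key=lambda i: expensive_score(query, docs[i]), reverse=True
    )
    return reranked[:final_k]

docs = [
    "policy refund not applicable to gift cards",
    "our refund policy covers all online orders",
    "shipping times for online orders",
    "gift cards never expire",
]

print(two_stage("refund policy", docs, retrieval_k=2, final_k=1))  # reranker can fix the order
print(two_stage("refund policy", docs, retrieval_k=1, final_k=1))  # candidate set too small
```

With two candidates the second stage promotes the document that actually contains the phrase; with only one candidate it has nothing to choose from, which is why the size of the retrieval set matters.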

Query Expansion

Improve recall by expanding queries:
from typing import List, Tuple
from openai import OpenAI

client = OpenAI()

class QueryExpander:
    """Expand queries for better recall"""
    
    def __init__(self):
        self.client = OpenAI()
    
    def expand_with_synonyms(self, query: str) -> List[str]:
        """Generate query variations with synonyms"""
        
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Generate 3 alternative phrasings of the search query. Return one per line, no numbering."
                },
                {"role": "user", "content": query}
            ],
            temperature=0.7
        )
        
        variations = response.choices[0].message.content.strip().split("\n")
        return [query] + [v.strip() for v in variations if v.strip()]
    
    def expand_with_hypothetical_answer(self, query: str) -> str:
        """HyDE: Generate hypothetical answer for better semantic match"""
        
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Write a short paragraph that would answer the following question. Be factual and specific."
                },
                {"role": "user", "content": query}
            ],
            temperature=0.3
        )
        
        return response.choices[0].message.content

class ExpandedSearch:
    """Search with query expansion"""
    
    def __init__(self, documents: List[str]):
        self.documents = documents
        self.semantic = SemanticSearch()
        self.semantic.add_documents(documents)
        self.expander = QueryExpander()
    
    def search_with_expansion(
        self,
        query: str,
        top_k: int = 10,
        expansion_method: str = "synonyms"  # or "hyde"
    ) -> List[Tuple[int, float, str]]:
        """Search with query expansion"""
        
        if expansion_method == "synonyms":
            queries = self.expander.expand_with_synonyms(query)
        elif expansion_method == "hyde":
            hyde_query = self.expander.expand_with_hypothetical_answer(query)
            queries = [hyde_query]
        else:
            queries = [query]
        
        # Search with all query variations
        all_results = {}
        
        for q in queries:
            results = self.semantic.search(q, top_k=top_k)
            for idx, score, doc in results:
                if idx not in all_results:
                    all_results[idx] = {"scores": [], "doc": doc}
                all_results[idx]["scores"].append(score)
        
        # Aggregate scores (max pooling)
        final_results = [
            (idx, max(data["scores"]), data["doc"])
            for idx, data in all_results.items()
        ]
        
        final_results.sort(key=lambda x: x[1], reverse=True)
        
        return final_results[:top_k]
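
The max-pooling step that merges results from several query variations can be isolated from the LLM and embedding calls. This sketch uses hypothetical `(doc_id, score)` lists in place of real search results; `merge_max` is an illustrative helper name.

```python
from typing import Dict, List, Tuple

def merge_max(
    per_query_results: List[List[Tuple[int, float]]],
    top_k: int
) -> List[Tuple[int, float]]:
    """Keep each document's best score across all query variations."""
    best: Dict[int, float] = {}
    for results in per_query_results:
        for doc_id, score in results:
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Hypothetical results for the original query and two rephrasings
original = [(0, 0.62), (1, 0.55)]
variant_a = [(2, 0.81), (0, 0.60)]
variant_b = [(1, 0.58), (3, 0.40)]

merged = merge_max([original, variant_a, variant_b], top_k=3)
print(merged)
```

Document 2 is found only by a rephrased query, which is the recall gain expansion is meant to provide. Max pooling rewards a strong match on any single variation; averaging would instead favor documents that match every variation moderately well.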

Contextual Retrieval

Add context to chunks before embedding:
from typing import List

from openai import OpenAI

class ContextualChunker:
    """Add document context to chunks for better retrieval.
    
    The problem: a chunk that says "This increased by 15% in Q3" is meaningless
    without knowing what "this" refers to. Contextual retrieval prepends a brief
    description of where the chunk sits in the surrounding document before
    embedding. Anthropic, which popularized the technique, reported a 35%
    reduction in top-20 retrieval failures from contextual embeddings alone,
    and 49% when combined with contextual BM25.
    """
    
    def __init__(self):
        self.client = OpenAI()
    
    def add_context(
        self,
        document: str,
        chunk: str
    ) -> str:
        """Generate contextual prefix for chunk"""
        
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Provide a brief context (1-2 sentences) that situates this chunk within the larger document. The context should help understand what the chunk is about without reading the full document."""
                },
                {
                    "role": "user",
                    "content": f"Document:\n{document[:2000]}...\n\nChunk:\n{chunk}"
                }
            ],
            max_tokens=100,
            temperature=0.3
        )
        
        context = response.choices[0].message.content
        return f"{context}\n\n{chunk}"
    
    def process_document(
        self,
        document: str,
        chunk_size: int = 500,
        overlap: int = 50
    ) -> List[str]:
        """Chunk document and add context to each chunk"""
        
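        # Simple character-based chunking with overlap
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        
        chunks = []
        start = 0
        while start < len(document):
            end = start + chunk_size
            chunks.append(document[start:end])
            if end >= len(document):
                break  # last chunk reached; avoid an overlap-only tail chunk
            start = end - overlap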
        
        # Add context to each chunk
        contextual_chunks = []
        for chunk in chunks:
            contextual = self.add_context(document, chunk)
            contextual_chunks.append(contextual)
        
        return contextual_chunks
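
The sliding-window chunking and the context-prefix step can be checked in isolation. In this sketch a fixed string stands in for the LLM-generated context, and `chunk_text` and `with_context` are illustrative helper names rather than part of the class above.

```python
from typing import List

def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the window reached the end of the text
        start = end - overlap
    return chunks

def with_context(context: str, chunk: str) -> str:
    """Prepend a short situating context to a chunk before embedding."""
    return f"{context}\n\n{chunk}"

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, overlap=3)
print(chunks)

# Hypothetical context; in the class above this would come from the LLM call
print(with_context("From the Q3 revenue section of the annual report.", chunks[0]))
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.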

Search Pipeline

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SearchConfig:
    retrieval_k: int = 100
    rerank_k: int = 10
    use_expansion: bool = True
    use_reranking: bool = True
    hybrid_alpha: float = 0.7

class SearchPipeline:
    """Complete search pipeline"""
    
    def __init__(
        self,
        documents: List[str],
        config: SearchConfig = None
    ):
        self.documents = documents
        self.config = config or SearchConfig()
        
        # Initialize components
        self.hybrid = HybridSearch(documents, alpha=self.config.hybrid_alpha)
        self.expander = QueryExpander() if self.config.use_expansion else None
        self.reranker = CrossEncoderReranker() if self.config.use_reranking else None
    
    def search(self, query: str) -> List[dict]:
        """Full search pipeline"""
        
        # Query expansion
        if self.expander:
            queries = self.expander.expand_with_synonyms(query)
        else:
            queries = [query]
        
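        # Retrieve candidates, keeping the best-scoring hit per document
        all_candidates = {}
        for q in queries:
            results = self.hybrid.search(q, top_k=self.config.retrieval_k)
            for r in results:
                best = all_candidates.get(r.doc_id)
                if best is None or r.hybrid_score > best.hybrid_score:
                    all_candidates[r.doc_id] = r
        
        # Sort so the non-reranked fallback returns the highest scores first
        candidates = sorted(
            all_candidates.values(),
            key=lambda c: c.hybrid_score,
            reverse=True
        )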
        candidate_docs = [c.document for c in candidates]
        
        # Reranking
        if self.reranker and len(candidate_docs) > 0:
            reranked = self.reranker.rerank(
                query,
                candidate_docs,
                top_k=self.config.rerank_k
            )
            
            return [
                {
                    "rank": i + 1,
                    "document": r.document,
                    "score": r.relevance_score
                }
                for i, r in enumerate(reranked)
            ]
        
        # Return hybrid results if no reranking
        return [
            {
                "rank": i + 1,
                "document": c.document,
                "score": c.hybrid_score
            }
            for i, c in enumerate(candidates[:self.config.rerank_k])
        ]

Performance Comparison

These numbers are rough, representative figures; exact values vary with the corpus, the models, and the hardware. The takeaway: each layer of sophistication buys real recall improvement, but at increasing cost and latency. Choose based on your quality requirements.

Method              Recall@10    Latency    Cost                   When to Use
--------------------------------------------------------------------------------------------------------
BM25                0.65         <10ms      None                   Exact-match heavy queries, zero budget
Semantic            0.75         50ms       Embeddings             General-purpose, meaning-based search
Hybrid              0.82         60ms       Embeddings             Most production systems (best bang for buck)
Hybrid + Rerank     0.90         150ms      Embeddings + Rerank    High-stakes: legal, medical, compliance
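
Recall@10 in the comparison above is the fraction of a query's relevant documents that appear in the top 10 results, averaged over queries. A minimal sketch of that metric follows; the labeled runs are hypothetical.

```python
from typing import List, Set, Tuple

def recall_at_k(retrieved: List[int], relevant: Set[int], k: int = 10) -> float:
    """Fraction of the relevant documents found in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(runs: List[Tuple[List[int], Set[int]]], k: int = 10) -> float:
    """Average recall@k over a set of labeled queries."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

# Hypothetical labeled queries: (ranked doc ids returned, relevant doc ids)
runs = [
    ([3, 7, 1, 9], {3, 9}),  # both relevant docs retrieved
    ([4, 2, 8, 6], {2, 5}),  # one of two relevant docs retrieved
]

print(mean_recall_at_k(runs, k=10))
```

Running the same labeled queries through each search method gives numbers you can compare directly for your own corpus.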

Search Failure Modes and Fixes

Understanding why search fails is more valuable than understanding why it succeeds. These are the failure patterns you will encounter in production:

Failure Mode | Symptom | Root Cause | Fix
--------------------------------------------------------------------------------------------------------
Vocabulary mismatch | User says “PTO” but docs say “paid time off” | Bi-encoder embeds query and docs independently | Query expansion or HyDE
False positive saturation | Top results are vaguely related but not useful | Chunks too large, meaning is diluted | Smaller chunks + reranking
Negation blindness | “Not Python” still returns Python docs | Embeddings encode topic, not negation | Add keyword filter for negated terms
Recency bias | Old documents outrank updated ones | No time-decay in scoring | Add created_at weight or metadata filter
Short query collapse | One-word queries (“auth”) return wildly varied results | Not enough semantic signal | Expand short queries with LLM or require minimum length
Score plateau | Top 20 results all score 0.78-0.82 | All docs are equally “kind of relevant” | Reranking with cross-encoder breaks the tie

Edge case: queries with embedded constraints. Take “What is our refund policy for orders over $500?” The embedding captures the topic (refund policy) but not the numeric constraint ($500). Semantic search finds refund policy docs but cannot filter by dollar amount. Fix: extract structured constraints with an LLM before search, apply them as metadata filters, and use semantic search only for the topical component.
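
The split between a topical query and a structured filter can be sketched as follows. To keep the example runnable offline, a regular expression stands in for the LLM extraction step recommended above, and the metadata fields `order_min` and `order_max` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Matches phrases such as "over $500" or "under 100"
CONSTRAINT_RE = re.compile(r"\b(over|under)\s+\$?(\d+(?:\.\d+)?)", re.IGNORECASE)

Constraint = Tuple[str, float]

def split_query(query: str) -> Tuple[str, Optional[Constraint]]:
    """Return (topical text, constraint); constraint is None if none is found."""
    match = CONSTRAINT_RE.search(query)
    if match is None:
        return query, None
    topical = query[:match.start()] + query[match.end():]
    topical = re.sub(r"\s+", " ", topical).strip()
    topical = re.sub(r"\s+([?.!,])", r"\1", topical)  # tidy space left before punctuation
    return topical, (match.group(1).lower(), float(match.group(2)))

def matches(doc: dict, constraint: Optional[Constraint]) -> bool:
    """Metadata filter: does the doc's order-total range cover the constraint?"""
    if constraint is None:
        return True
    op, amount = constraint
    if op == "over":
        return doc["order_max"] > amount
    return doc["order_min"] < amount

# Hypothetical documents with the order-total range each policy applies to
docs: List[dict] = [
    {"title": "Refund policy: standard orders", "order_min": 0.0, "order_max": 500.0},
    {"title": "Refund policy: high-value orders", "order_min": 500.0, "order_max": float("inf")},
]

topical, constraint = split_query("What is our refund policy for orders over $500?")
filtered = [d["title"] for d in docs if matches(d, constraint)]
print(topical, constraint, filtered)
```

Only the topical text would then be embedded and searched, and only over the documents that pass the filter.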

What is Next

Context Window Management

Learn to manage context windows effectively with compression and optimization

Interview Deep-Dive

Strong Answer:
  • The answer depends on the content and query patterns, but for an internal knowledge base I would almost certainly end up with hybrid search. Here is the reasoning: internal docs contain a mix of natural language (policy documents, onboarding guides) and highly specific terms (project codenames, internal tool names, error codes, Jira ticket IDs). Pure semantic search excels at the first category but completely misses exact-match needs. Pure BM25 handles exact terms but fails when someone asks “how do I take time off” and the document says “PTO request procedure.”
  • I would start by building both pipelines independently and running a retrieval evaluation. Take 50-100 real user queries from search logs (or create them manually if no logs exist), have domain experts label the top 5 relevant documents for each query, then measure Recall@10 for BM25 alone, semantic alone, and hybrid at different alpha values. In my experience, hybrid consistently beats either individual method by 10-25% on Recall@10 for mixed-content corpora.
  • For tuning the alpha weight, I would start at 0.7 semantic / 0.3 BM25 as a default. Then I would segment queries into categories — exact-match queries (error codes, names), conceptual queries (how-to, explanations), and mixed. Tune alpha per category if your system can classify query type, or find the alpha that maximizes recall across the blended query set. I have found that alpha between 0.5 and 0.7 works for most knowledge bases. Technical documentation with lots of code and acronyms benefits from lower alpha (more BM25 weight), around 0.4-0.5.
  • The practical implementation detail most people miss: score normalization. BM25 scores and cosine similarity scores are on completely different scales. BM25 scores are unbounded and can easily reach 20 or more, while cosine similarity is bounded by -1 and 1 (and usually falls between 0 and 1 for text embeddings). You must normalize both to the same range before combining, or the raw BM25 scores will dominate regardless of your alpha. Min-max normalization within each result set is the simplest approach; Reciprocal Rank Fusion (RRF) avoids the normalization problem entirely by using rank positions instead of scores.
Follow-up: You mentioned RRF avoids the normalization problem. When would you prefer RRF over weighted score combination, and what is the downside of RRF?
RRF is more robust when combining rankings from systems with incompatible score distributions, which is exactly the BM25 + semantic case. It only uses rank positions, so it does not care about score scales. The constant k=60 from the original paper rarely needs tuning, which makes it operationally simpler. The downside is that RRF throws away magnitude information. If semantic search returns a document with 0.99 similarity (a near-perfect match) and another at 0.72, RRF treats the gap between rank 1 and rank 2 the same regardless. Weighted score combination preserves that signal: a 0.99 match contributes much more than a 0.72 match. In practice, this matters when you have a “golden” document that is a clear best match. RRF can dilute that signal by boosting a document that ranked high in BM25 but is semantically mediocre. I would use RRF as the default for simplicity and switch to weighted combination only if evaluation shows that top-1 precision matters significantly for your use case.
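
The loss of magnitude information is easy to demonstrate. In this sketch, with hypothetical similarity scores, a clear winner and a near tie produce identical RRF contributions because only the order survives.

```python
from typing import Dict

def rrf_contributions(scores: Dict[str, float], k: int = 60) -> Dict[str, float]:
    """Convert one retriever's scores to RRF terms; only the ordering is used."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: 1 / (k + rank) for rank, doc in enumerate(ordered, start=1)}

# Hypothetical semantic scores: same order, very different gaps
clear_winner = {"A": 0.99, "B": 0.72}
near_tie = {"A": 0.80, "B": 0.79}

print(rrf_contributions(clear_winner))
print(rrf_contributions(near_tie))

# If BM25 happens to rank the two documents the other way round,
# the fused scores tie even though A is a near-perfect semantic match.
bm25_terms = rrf_contributions({"B": 12.0, "A": 3.0})
semantic_terms = rrf_contributions(clear_winner)
fused = {doc: bm25_terms[doc] + semantic_terms[doc] for doc in ("A", "B")}
print(fused)
```

Whether that tie is acceptable depends on how much top-1 precision matters for the application.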
Strong Answer:
  • The two-stage pattern exists because retrieval speed and ranking quality are fundamentally at odds. A bi-encoder (used in embedding-based retrieval) encodes the query and each document independently, which means document embeddings can be pre-computed and indexed. Searching 1 million pre-computed embeddings takes milliseconds using approximate nearest neighbor (ANN) indexes. A cross-encoder (used in reranking) encodes the query and document together as a single input, which means it must do a forward pass for every query-document pair at query time. Running a cross-encoder against 1 million documents would take hours.
  • The two-stage approach exploits this asymmetry: use the fast but approximate bi-encoder to retrieve 50-200 candidates from the full corpus (milliseconds), then use the slow but accurate cross-encoder to rerank only those candidates (hundreds of milliseconds). You get cross-encoder quality at bi-encoder speed. In benchmarks, this pattern typically improves Recall@10 by 5-15% over retrieval alone, with only 100-200ms added latency.
  • The critical tuning parameter is the retrieval set size — how many candidates you pass to the reranker. Too few (say 10) and the reranker cannot recover relevant documents that the bi-encoder missed. Too many (say 1000) and the reranker becomes the latency bottleneck. I typically start with 100 candidates and measure recall improvement as I increase to 200, 500. There are diminishing returns — going from 50 to 100 candidates usually helps significantly, but going from 200 to 500 rarely does.
  • The other nuance is that bi-encoders and cross-encoders often disagree on what is relevant, and that disagreement is exactly where the value lives. The bi-encoder might rank a document at position 50 because it captures semantic similarity but misses a subtle relevance signal. The cross-encoder, seeing both query and document together, catches that signal and promotes it to position 3. Documents that both agree on (top 5 in both) are slam-dunk relevant. Documents where they disagree are the interesting cases.
Follow-up: In production, the reranker adds 150ms of latency. How would you decide if that latency is justified, and what alternatives exist if it is not?
Measure the impact on end-user metrics, not just retrieval metrics. If you are building a RAG system, compare the LLM answer quality with and without reranking using a blind evaluation. If the LLM answers are 10% better with reranking, and your users are paying customers making important decisions based on those answers, 150ms is trivially worth it. If you are building a casual search feature and users care more about speed than precision, skip it. Alternatives to a full cross-encoder reranker include: Cohere Rerank API (managed, fast, pay-per-call), ColBERT-style late interaction models that are faster than cross-encoders but more accurate than bi-encoders, or a lightweight “reranker” that is just an LLM prompt asking “which of these 10 documents best answers the query?”, which is surprisingly effective for small candidate sets and reuses the LLM you already have in your pipeline.
Strong Answer:
  • HyDE is a query expansion technique where instead of embedding the user’s query directly, you first ask an LLM to generate a hypothetical answer to the query, then embed that hypothetical answer and use it for retrieval. The intuition is that a hypothetical answer is closer in embedding space to the actual relevant documents than a short question is. A query like “how to handle database connection pooling” is a question, but the relevant document is an explanation — they live in different parts of embedding space. A hypothetical answer about connection pooling is an explanation, so it lands closer to the real document.
  • In benchmarks, HyDE improves recall by 10-20% on knowledge-intensive queries where the query and the documents have different linguistic structures. It works best when the query is short and abstract (“best practices for microservice auth”) and the documents are long and detailed.
  • When it goes wrong: the LLM can hallucinate facts in the hypothetical answer that steer retrieval in the wrong direction. If the query is “What is the company’s remote work policy?” and the LLM generates a hypothetical answer about a flexible remote policy when the actual policy is strict in-office, the embedding of the hallucinated answer may retrieve documents about flexible work rather than the actual policy. You are searching for what the LLM thinks the answer is, not what the answer actually is.
  • It also adds latency and cost: one full LLM call to generate the hypothetical document before you even start retrieval. For a search feature where users expect sub-second results, this 500ms+ overhead is significant. And you are paying for an LLM generation on every query just for retrieval, before you even get to the answer-generation step.
  • I would use HyDE selectively: for complex analytical queries where recall is more important than latency (research assistants, legal search), and skip it for simple factual queries where standard embedding works fine. A good heuristic: if the query is under 10 words and looks like a keyword search, skip HyDE. If it is a full question or a complex information need, try HyDE.
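The selective use of HyDE described above can be sketched as a small routing function. This is a minimal sketch under stated assumptions: `should_use_hyde`, `retrieval_vector`, the question-word list, and the prompt wording are all hypothetical, and the LLM and embedding model are passed in as plain callables so the example stays independent of any particular provider. The stubs at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, List

# Assumption: a small set of openers that mark a query as a full question.
QUESTION_WORDS = {"how", "what", "why", "when", "where", "which", "who", "should", "can", "does"}


def should_use_hyde(query: str, min_words: int = 10) -> bool:
    """Skip HyDE for short keyword-style queries; use it for questions or long queries."""
    words = query.split()
    if not words:
        return False
    looks_like_question = query.strip().endswith("?") or words[0].lower() in QUESTION_WORDS
    return looks_like_question or len(words) >= min_words


def retrieval_vector(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], List[float]],
) -> List[float]:
    """Embed a hypothetical answer when HyDE applies, otherwise embed the query itself."""
    if should_use_hyde(query):
        hypothetical = generate(f"Write a short passage that answers: {query}")
        return embed(hypothetical)
    return embed(query)


# Stand-in callables (assumptions): replace with a real LLM and embedding model.
def fake_generate(prompt: str) -> str:
    return "Connection pooling reuses open database connections instead of opening a new one per request."


def fake_embed(text: str) -> List[float]:
    return [float(len(text.split()))]


print(should_use_hyde("ERR_CONN_RESET 10054"))
print(should_use_hyde("how to handle database connection pooling"))
```

Keeping the decision in one function makes it easy to log which path each query took, so the latency and cost of the extra LLM call can be compared against any recall gain.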
Follow-up: Could you combine HyDE with hybrid search, and how would that interaction work?
Yes, and it is actually a strong combination. Use HyDE for the semantic arm of hybrid search (embed the hypothetical answer for vector similarity) while keeping BM25 on the original query for the keyword arm. This way, the semantic search benefits from the hypothetical document being closer to relevant passages, while BM25 still catches exact-match terms from the original query that might be lost in the LLM-generated hypothesis. The BM25 arm acts as a safety net against HyDE hallucination: if HyDE steers semantic search toward the wrong topic, BM25 can still surface the right document based on keyword overlap. In practice, I have seen this combination outperform both standard hybrid and HyDE-only approaches, but it is the most expensive option per query (an LLM call on top of the embedding call and the BM25 lookup), so it is only justified when retrieval quality is paramount.
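A minimal sketch of that combination, assuming the two arms have already produced per-document scores: cosine similarities computed against the HyDE embedding, and BM25 scores computed against the original query. The function names, the min-max normalization, and the toy scores are assumptions for illustration; `alpha` weights the semantic arm, as elsewhere on this page.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1] so the two arms are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_hyde_hybrid(
    hyde_semantic: Dict[str, float],
    bm25_original: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weighted sum: alpha * semantic (HyDE vector) + (1 - alpha) * BM25 (original query)."""
    sem = min_max(hyde_semantic)
    kw = min_max(bm25_original)
    docs = set(sem) | set(kw)
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Toy scores (assumption): HyDE hallucinated a flexible policy, so the semantic
# arm prefers the wrong document; BM25 on the original query prefers the right one.
hyde_semantic = {"flexible_work_blog": 0.82, "remote_policy_2024": 0.74, "office_map": 0.40}
bm25_original = {"remote_policy_2024": 9.1, "flexible_work_blog": 2.3, "office_map": 0.0}

for doc, score in fuse_hyde_hybrid(hyde_semantic, bm25_original, alpha=0.5):
    print(f"{doc}: {score:.3f}")
```

With equal weighting, the keyword arm pulls the actual policy document back to the top even though the hallucinated hypothesis ranked it second semantically.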
Strong Answer:
  • First, categorize the failures. Pull the 15% of bad-result queries and classify them: Are they exact-match queries where BM25 should dominate? Conceptual queries where semantic should dominate? Multi-intent queries? Queries in a language or jargon the embedding model was not trained on? The distribution of failure types tells you where to focus.
  • Second, check the retrieval stage independently from the rest of the pipeline. For each failing query, look at what the retriever returned (the raw document chunks) before any reranking or LLM processing. If the relevant document is not in the top 100 retrieved candidates, the problem is retrieval. If it is in the top 100 but ranked at position 80, the problem is ranking. If it is ranked at position 3 but the LLM still gave a bad answer, the problem is downstream — not search.
  • Third, for retrieval failures, check the chunking. The number one cause of bad search results in my experience is bad chunking: a relevant passage got split across two chunks and neither chunk is self-contained enough to rank well. Pull the actual chunk that should have been retrieved and examine it. Does it make sense in isolation, or does it start with “This approach…” with no antecedent? If chunking is the issue, increase chunk overlap, switch to semantic-boundary chunking, or add contextual retrieval (prepending a summary to each chunk).
  • Fourth, check the embedding quality. Take a failing query and its known-relevant document, embed both, and compute their cosine similarity. If similarity is low (below roughly 0.7 is a useful rule of thumb, though typical score ranges vary by embedding model, so compare against pairs you know retrieve well), the embedding model is not capturing the semantic relationship. This happens with domain-specific jargon, acronyms, or niche technical content. Solutions: fine-tune the embedding model on your domain data, add synonyms to the query via query expansion, or switch to a larger embedding model.
  • Fifth, check the hybrid weighting. It is possible your alpha is wrong for the query distribution. Run a sweep of alpha values (0.3, 0.5, 0.7) on the failing queries and see if a different weight recovers the relevant documents. If the failing queries are mostly exact-match but alpha is 0.8 (heavy semantic), you need to lower alpha or implement query-type-aware weighting.
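The stage-isolation step and the embedding check above can be sketched as two small helpers. This is a rough sketch, not a full diagnostic tool: the names are hypothetical, and the `good_rank=10` cutoff separating a ranking problem from a downstream problem is an assumption (the text only gives position 80 and position 3 as examples).

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between a query vector and a document vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_failure(
    retrieved_ids: Sequence[str],
    relevant_id: str,
    candidate_k: int = 100,
    good_rank: int = 10,  # assumption: ranks at or above this count as "ranked well"
) -> str:
    """Label a failing query as a retrieval, ranking, or downstream problem."""
    top = list(retrieved_ids[:candidate_k])
    if relevant_id not in top:
        return "retrieval"    # relevant doc never reached the candidate set
    rank = top.index(relevant_id) + 1
    if rank > good_rank:
        return "ranking"      # retrieved, but buried too deep
    return "downstream"       # search did its job; look at the LLM or prompt


# Toy example (assumption): the retriever returned d1..d200 in this order.
retrieved = [f"d{i}" for i in range(1, 201)]
failing = {"q1": "d80", "q2": "d3", "q3": "d150", "q4": "not_indexed"}

labels = {q: classify_failure(retrieved, doc) for q, doc in failing.items()}
tally = Counter(labels.values())
print(labels)
print(tally)
```

Tallying the labels across the failing queries shows which stage deserves attention first, before spending time on chunking, embeddings, or alpha.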
Follow-up: You have identified that chunking is the root cause for 60% of the failures. How do you fix chunking without re-processing your entire 500K document corpus?
You do not avoid re-processing; you plan for it. Chunking changes require re-embedding because different text produces different vectors. The real question is how to do it efficiently and without downtime. I would implement versioned indexes: create a new index (documents_v2) with the improved chunking strategy, embed in batch overnight, then atomically swap the search endpoint to point to the new index. Keep the old index available for rollback. For the batch re-processing itself, send the embedding API as many texts per request as it accepts (up to 2048 texts per call) and parallelize across workers. For 500K documents with an average of 5 chunks each, that is 2.5M embeddings; at 2048 per batch, that is about 1,220 API calls. With parallelism, this takes a few hours and costs roughly $2-5 with text-embedding-3-small, though the exact figure depends on average chunk length. The key lesson: always design your vector store for re-indexing from day one, because you will change your chunking strategy at least twice.
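The batch arithmetic and the index swap can be sketched as follows. This is an in-memory illustration under stated assumptions: `plan_reindex`, `batched`, and `IndexAlias` are hypothetical names, `documents_v1` is an assumed name for the existing index, and a real deployment would use whatever alias or collection-swap mechanism the vector store provides.

```python
from typing import Iterator, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def plan_reindex(num_docs: int, chunks_per_doc: int, batch_size: int) -> Tuple[int, int]:
    """Return (total chunks to embed, number of batched API calls)."""
    total = num_docs * chunks_per_doc
    calls = -(-total // batch_size)  # ceiling division
    return total, calls


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items for one embedding request each."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


class IndexAlias:
    """Search reads from `active`; the previous index is kept for rollback."""

    def __init__(self, active: str) -> None:
        self.active = active
        self.previous: Optional[str] = None

    def swap(self, new_index: str) -> None:
        self.previous, self.active = self.active, new_index

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no previous index to roll back to")
        self.active, self.previous = self.previous, self.active


total, calls = plan_reindex(num_docs=500_000, chunks_per_doc=5, batch_size=2048)
print(f"{total:,} chunks -> {calls:,} embedding calls")

alias = IndexAlias("documents_v1")  # assumed name of the current index
alias.swap("documents_v2")          # point search at the re-chunked index
print(alias.active, alias.previous)
```

Because search only ever reads through the alias, the swap is a single assignment and the old index stays untouched until you are confident enough to delete it.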