Semantic search goes beyond keyword matching to understand meaning and intent. Traditional keyword search is like looking up a word in a dictionary: it finds exact matches but misses everything else. Semantic search is like asking a knowledgeable friend: “show me things about vacation policy” will find documents about “PTO guidelines,” “time off procedures,” and “leave of absence rules” even if they never use the word “vacation.”
This is the retrieval backbone of most RAG systems, search features, and recommendation engines in modern AI.
Method           Strengths             Weaknesses
-----------------------------------------------------------------
Keyword (BM25)   Exact matches, fast   Misses synonyms
Semantic         Understands meaning   Misses keywords
Hybrid           Best of both          More complex
BM25 (Best Match 25) is the default ranking algorithm behind Elasticsearch, Solr, and most traditional search engines. It is a probabilistic ranking function that scores documents based on term frequency (how often the query words appear), weighted by inverse document frequency (rare words matter more than common ones) and normalized by document length. Think of it as “smart keyword matching”: it handles the math that makes a rare, important word rank higher than a common filler word.
Despite being decades old, BM25 remains hard to beat for exact-match queries like product SKUs, error codes, and proper nouns:
import math
from collections import Counter
from typing import List, Dict, Tuple


class BM25:
    """BM25 ranking algorithm implementation"""

    def __init__(
        self,
        documents: List[str],
        k1: float = 1.5,
        b: float = 0.75
    ):
        self.k1 = k1
        self.b = b
        self.documents = documents
        self.doc_count = len(documents)

        # Tokenize documents
        self.tokenized_docs = [self._tokenize(doc) for doc in documents]

        # Calculate document lengths
        self.doc_lengths = [len(doc) for doc in self.tokenized_docs]
        self.avg_doc_length = sum(self.doc_lengths) / self.doc_count

        # Build inverted index
        self.doc_freqs: Dict[str, int] = {}  # term -> number of docs containing term
        self.term_freqs = []  # per-document term frequencies

        for doc_tokens in self.tokenized_docs:
            term_freq = Counter(doc_tokens)
            self.term_freqs.append(term_freq)
            for term in set(doc_tokens):
                self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1

    def _tokenize(self, text: str) -> List[str]:
        """Simple tokenization"""
        return text.lower().split()

    def _idf(self, term: str) -> float:
        """Calculate inverse document frequency.

        IDF is the key insight: a word appearing in 1 out of 10,000 docs
        is much more informative than one appearing in 9,000 out of 10,000.
        "Python" in a programming corpus has low IDF (common), but
        "asyncpg" has high IDF (rare and specific).
        """
        doc_freq = self.doc_freqs.get(term, 0)
        return math.log(
            (self.doc_count - doc_freq + 0.5) / (doc_freq + 0.5) + 1
        )

    def _score_document(
        self,
        query_terms: List[str],
        doc_idx: int
    ) -> float:
        """Score a single document against query"""
        score = 0.0
        doc_len = self.doc_lengths[doc_idx]
        term_freqs = self.term_freqs[doc_idx]

        for term in query_terms:
            if term not in term_freqs:
                continue
            tf = term_freqs[term]
            idf = self._idf(term)

            # BM25 formula
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (
                1 - self.b + self.b * (doc_len / self.avg_doc_length)
            )
            score += idf * (numerator / denominator)

        return score

    def search(
        self,
        query: str,
        top_k: int = 10
    ) -> List[Tuple[int, float]]:
        """Search documents and return top-k results"""
        query_terms = self._tokenize(query)
        scores = []

        for idx in range(self.doc_count):
            score = self._score_document(query_terms, idx)
            if score > 0:
                scores.append((idx, score))

        # Sort by score descending
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:top_k]


# Usage
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret images"
]

bm25 = BM25(documents)
results = bm25.search("neural network deep learning")

for idx, score in results:
    print(f"Score: {score:.3f} - {documents[idx][:50]}...")
Neither BM25 nor semantic search is universally better; they have complementary strengths. BM25 excels at exact matches (error codes, function names, acronyms) while semantic search excels at meaning (synonyms, paraphrases, conceptual similarity). Combining them usually outperforms either alone, so the main question is how to weight them.
Combine BM25 and semantic search for best results:
from dataclasses import dataclass
from typing import List, Dict, Optional, Tuple


@dataclass
class SearchResult:
    doc_id: int
    document: str
    bm25_score: float = 0.0
    semantic_score: float = 0.0
    hybrid_score: float = 0.0


class HybridSearch:
    """Combine BM25 and semantic search"""

    def __init__(
        self,
        documents: List[str],
        alpha: float = 0.5,  # Weight for semantic (1-alpha for BM25)
        # Tip: alpha=0.7 works well for general text; alpha=0.3 for
        # technical docs where exact terminology matters more
        embedding_model: str = "text-embedding-3-small"
    ):
        self.documents = documents
        self.alpha = alpha

        # Initialize both search methods
        # (SemanticSearch is not defined in this section; it is assumed
        # to be the embedding-based search class defined elsewhere)
        self.bm25 = BM25(documents)
        self.semantic = SemanticSearch(embedding_model)
        self.semantic.add_documents(documents)

    def _normalize_scores(
        self,
        scores: List[Tuple[int, float]]
    ) -> Dict[int, float]:
        """Min-max normalize scores to 0-1"""
        if not scores:
            return {}
        values = [s[1] for s in scores]
        min_val = min(values)
        max_val = max(values)
        range_val = max_val - min_val if max_val > min_val else 1
        return {
            idx: (score - min_val) / range_val
            for idx, score in scores
        }

    def search(
        self,
        query: str,
        top_k: int = 10
    ) -> List[SearchResult]:
        """Perform hybrid search"""
        # Get BM25 results
        bm25_results = self.bm25.search(query, top_k=top_k * 2)
        bm25_scores = self._normalize_scores(bm25_results)

        # Get semantic results
        semantic_results = self.semantic.search(query, top_k=top_k * 2)
        semantic_scores = {idx: score for idx, score, _ in semantic_results}

        # Combine all unique document IDs
        all_docs = set(bm25_scores.keys()) | set(semantic_scores.keys())

        # Calculate hybrid scores
        results = []
        for doc_id in all_docs:
            bm25 = bm25_scores.get(doc_id, 0.0)
            semantic = semantic_scores.get(doc_id, 0.0)
            hybrid = self.alpha * semantic + (1 - self.alpha) * bm25
            results.append(SearchResult(
                doc_id=doc_id,
                document=self.documents[doc_id],
                bm25_score=bm25,
                semantic_score=semantic,
                hybrid_score=hybrid
            ))

        # Sort by hybrid score
        results.sort(key=lambda x: x.hybrid_score, reverse=True)
        return results[:top_k]


# Usage
hybrid = HybridSearch(documents, alpha=0.7)  # 70% semantic, 30% BM25
results = hybrid.search("machine learning algorithms")

for r in results:
    print(f"Hybrid: {r.hybrid_score:.3f} (BM25: {r.bm25_score:.3f}, "
          f"Semantic: {r.semantic_score:.3f})")
    print(f"  {r.document[:60]}...")
Reciprocal Rank Fusion (RRF) is a widely used method for merging multiple ranked lists. Unlike weighted averaging (which requires score normalization), RRF only uses rank positions, making it robust across different scoring scales. The formula is simple: for each document, sum 1/(k + rank) across all rankings in which it appears. Documents that appear near the top in multiple lists get the highest combined score.
from collections import defaultdict
from typing import List, Dict, Tuple


def reciprocal_rank_fusion(
    rankings: List[List[int]],
    k: int = 60  # From the original paper -- rarely needs tuning
) -> List[Tuple[int, float]]:
    """
    Combine multiple rankings using RRF.

    Args:
        rankings: List of ranked document ID lists
        k: Constant to prevent high ranks from dominating (60 is standard)

    Returns:
        List of (doc_id, rrf_score) sorted by score
    """
    rrf_scores = defaultdict(float)

    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            rrf_scores[doc_id] += 1 / (k + rank)

    # Sort by RRF score
    sorted_docs = sorted(
        rrf_scores.items(),
        key=lambda x: x[1],
        reverse=True
    )
    return sorted_docs


class RRFHybridSearch:
    """Hybrid search using Reciprocal Rank Fusion"""

    def __init__(self, documents: List[str]):
        self.documents = documents
        self.bm25 = BM25(documents)
        self.semantic = SemanticSearch()
        self.semantic.add_documents(documents)

    def search(
        self,
        query: str,
        top_k: int = 10,
        rrf_k: int = 60
    ) -> List[Tuple[int, float, str]]:
        """Search using RRF to combine rankings"""
        # Get rankings from both methods
        bm25_results = self.bm25.search(query, top_k=top_k * 2)
        bm25_ranking = [idx for idx, _ in bm25_results]

        semantic_results = self.semantic.search(query, top_k=top_k * 2)
        semantic_ranking = [idx for idx, _, _ in semantic_results]

        # Apply RRF
        fused = reciprocal_rank_fusion(
            [bm25_ranking, semantic_ranking],
            k=rrf_k
        )

        results = []
        for doc_id, score in fused[:top_k]:
            results.append((doc_id, score, self.documents[doc_id]))
        return results
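To make the 1/(k + rank) arithmetic concrete, here is a small standalone sketch that fuses two toy rankings. The document IDs and the two input rankings are made up for illustration; the function simply restates the RRF formula described above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf(rankings: List[List[int]], k: int = 60) -> List[Tuple[int, float]]:
    """Sum 1 / (k + rank) for each document across all rankings."""
    scores: Dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings, best result first
bm25_ranking = [3, 1, 2]
semantic_ranking = [1, 4, 3]

fused = rrf([bm25_ranking, semantic_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Document 1 (ranks 2 and 1) edges out document 3 (ranks 1 and 3), and both beat the documents that appear in only one list, which is the behavior the formula is designed to produce.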
Retrieval is fast but approximate. Reranking is slow but precise. The two-stage pattern exploits this: retrieve around 100 candidates cheaply (milliseconds), then rerank only those candidates with a more powerful cross-encoder model (hundreds of milliseconds to seconds, depending on model size and hardware). A cross-encoder reads the query and each document together as a single input, so it catches subtle relevance signals that bi-encoder similarity misses.
Rerank initial results with a more powerful model:
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RerankResult:
    index: int
    document: str
    relevance_score: float


class CrossEncoderReranker:
    """Rerank using cross-encoder models"""

    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        from sentence_transformers import CrossEncoder
        self.model = CrossEncoder(model_name)

    def rerank(
        self,
        query: str,
        documents: List[str],
        top_k: Optional[int] = None
    ) -> List[RerankResult]:
        """Rerank documents by relevance to query"""
        # Create query-document pairs
        pairs = [[query, doc] for doc in documents]

        # Get relevance scores
        scores = self.model.predict(pairs)

        # Create results with scores
        results = [
            RerankResult(
                index=i,
                document=doc,
                relevance_score=float(score)
            )
            for i, (doc, score) in enumerate(zip(documents, scores))
        ]

        # Sort by relevance
        results.sort(key=lambda x: x.relevance_score, reverse=True)

        if top_k:
            results = results[:top_k]
        return results


# Cohere Rerank API
class CohereReranker:
    """Rerank using Cohere API"""

    def __init__(self, api_key: Optional[str] = None):
        import cohere
        self.client = cohere.Client(api_key)

    def rerank(
        self,
        query: str,
        documents: List[str],
        top_k: int = 10
    ) -> List[RerankResult]:
        """Rerank using Cohere rerank endpoint"""
        response = self.client.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=documents,
            top_n=top_k
        )
        return [
            RerankResult(
                index=r.index,
                document=documents[r.index],
                relevance_score=r.relevance_score
            )
            for r in response.results
        ]


# Two-stage retrieval: retrieve then rerank
class TwoStageRetriever:
    """Retrieve candidates then rerank"""

    def __init__(
        self,
        documents: List[str],
        retrieval_k: int = 100,
        final_k: int = 10
    ):
        self.documents = documents
        self.retrieval_k = retrieval_k
        self.final_k = final_k

        # Fast retrieval
        self.hybrid = HybridSearch(documents)

        # Accurate reranking
        self.reranker = CrossEncoderReranker()

    def search(self, query: str) -> List[RerankResult]:
        """Two-stage search"""
        # Stage 1: Fast retrieval of candidates
        candidates = self.hybrid.search(query, top_k=self.retrieval_k)
        candidate_docs = [c.document for c in candidates]

        # Stage 2: Accurate reranking
        # Note: RerankResult.index refers to the position in candidate_docs,
        # not to the original document ID
        reranked = self.reranker.rerank(
            query,
            candidate_docs,
            top_k=self.final_k
        )
        return reranked
from typing import List

from openai import OpenAI


class ContextualChunker:
    """Add document context to chunks for better retrieval.

    The problem: a chunk that says "This increased by 15% in Q3" is
    meaningless without knowing what "this" refers to. Contextual
    retrieval prepends a brief summary of the surrounding document
    to each chunk before embedding. This technique, popularized by
    Anthropic's research, can improve retrieval recall by 20-40%
    on real-world datasets.
    """

    def __init__(self):
        self.client = OpenAI()

    def add_context(
        self,
        document: str,
        chunk: str
    ) -> str:
        """Generate contextual prefix for chunk"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Provide a brief context (1-2 sentences) that situates
this chunk within the larger document. The context should help
understand what the chunk is about without reading the full document."""
                },
                {
                    "role": "user",
                    # Only the first 2000 characters of the document are sent
                    "content": f"Document:\n{document[:2000]}...\n\nChunk:\n{chunk}"
                }
            ],
            max_tokens=100,
            temperature=0.3
        )
        context = response.choices[0].message.content
        return f"{context}\n\n{chunk}"

    def process_document(
        self,
        document: str,
        chunk_size: int = 500,
        overlap: int = 50
    ) -> List[str]:
        """Chunk document and add context to each chunk"""
        if overlap >= chunk_size:
            # Otherwise the window never advances and the loop never ends
            raise ValueError("overlap must be smaller than chunk_size")

        # Simple character-based chunking
        chunks = []
        start = 0
        while start < len(document):
            end = start + chunk_size
            chunks.append(document[start:end])
            if end >= len(document):
                break  # avoid a trailing chunk that only repeats the overlap
            start = end - overlap

        # Add context to each chunk (one LLM call per chunk)
        contextual_chunks = []
        for chunk in chunks:
            contextual = self.add_context(document, chunk)
            contextual_chunks.append(contextual)
        return contextual_chunks
These numbers are representative across multiple benchmarks. The takeaway: each layer of sophistication buys real recall improvement, but at increasing cost and latency. Choose based on your quality requirements.
Expand short queries with LLM or require minimum length
Score plateau: top 20 results all score 0.78-0.82. All docs are equally “kind of relevant.” Reranking with a cross-encoder breaks the tie.
Edge case: queries with embedded constraints. For “What is our refund policy for orders over $500?”, the embedding captures the topic (refund policy) but not the numeric constraint ($500). Semantic search finds refund policy docs but cannot filter by dollar amount. Fix: extract structured constraints with an LLM before search, apply them as metadata filters, and use semantic search only for the topical component.
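The constraint-splitting step can be sketched as follows. This is a minimal illustration that swaps the LLM extraction for a regular expression so it runs offline; the `amount` metadata field, the `gt`/`lt` operator names, and both function names are assumptions made for this example, not part of any library.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: only handles "over/above/under/below $N"
_AMOUNT = re.compile(r"\b(over|above|under|below)\s+\$?(\d+(?:\.\d+)?)", re.IGNORECASE)


def split_query(query: str) -> Tuple[str, Optional[dict]]:
    """Return (topical text for semantic search, structured constraint or None)."""
    match = _AMOUNT.search(query)
    if match is None:
        return query, None
    op = "gt" if match.group(1).lower() in ("over", "above") else "lt"
    constraint = {"field": "amount", "op": op, "value": float(match.group(2))}
    # Remove the constraint phrase and tidy up whitespace before punctuation
    topical = " ".join((query[:match.start()] + query[match.end():]).split())
    topical = re.sub(r"\s+([?.!,])", r"\1", topical)
    return topical, constraint


def passes(metadata: dict, constraint: Optional[dict]) -> bool:
    """Apply the constraint as a metadata filter on one document."""
    if constraint is None:
        return True
    value = metadata.get(constraint["field"])
    if value is None:
        return False
    if constraint["op"] == "gt":
        return value > constraint["value"]
    return value < constraint["value"]


topical, constraint = split_query("What is our refund policy for orders over $500?")
print(topical)
print(constraint)
```

The topical string goes to the embedding model, while the constraint is applied as a filter on whatever metadata the vector store exposes. A real system would likely need an LLM or a proper parser to cover the variety of ways users phrase constraints.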
You are building a search feature for an internal knowledge base with 500K documents. Walk me through how you would decide between pure semantic search, BM25, and hybrid search -- and how you would tune the hybrid weighting.
Strong Answer:
The answer depends on the content and query patterns, but for an internal knowledge base I would almost certainly end up with hybrid search. Here is the reasoning: internal docs contain a mix of natural language (policy documents, onboarding guides) and highly specific terms (project codenames, internal tool names, error codes, Jira ticket IDs). Pure semantic search excels at the first category but completely misses exact-match needs. Pure BM25 handles exact terms but fails when someone asks “how do I take time off” and the document says “PTO request procedure.”
I would start by building both pipelines independently and running a retrieval evaluation. Take 50-100 real user queries from search logs (or create them manually if no logs exist), have domain experts label the top 5 relevant documents for each query, then measure Recall@10 for BM25 alone, semantic alone, and hybrid at different alpha values. In my experience, hybrid consistently beats either individual method by 10-25% on Recall@10 for mixed-content corpora.
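A minimal sketch of the Recall@k measurement described above, assuming each pipeline produces a ranked list of document IDs per query and experts provide a set of relevant IDs per query. The function names and the toy data are illustrative only.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[int], relevant: Set[int], k: int = 10) -> float:
    """Fraction of the labeled relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[int]],
    labels: Dict[str, Set[int]],
    k: int = 10,
) -> float:
    """Average Recall@k over all labeled queries."""
    scores = [recall_at_k(runs.get(query, []), labels[query], k) for query in labels]
    return sum(scores) / len(scores)


# Toy example: two labeled queries, one retrieval run
labels = {"q1": {1, 2}, "q2": {5}}
run = {"q1": [1, 9, 8], "q2": [7, 5, 3]}
print(mean_recall_at_k(run, labels, k=2))
```

Running the same function over the BM25, semantic, and hybrid outputs (one `run` dictionary each) gives directly comparable numbers for each method and each alpha value.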
For tuning the alpha weight, I would start at 0.7 semantic / 0.3 BM25 as a default. Then I would segment queries into categories — exact-match queries (error codes, names), conceptual queries (how-to, explanations), and mixed. Tune alpha per category if your system can classify query type, or find the alpha that maximizes recall across the blended query set. I have found that alpha between 0.5 and 0.7 works for most knowledge bases. Technical documentation with lots of code and acronyms benefits from lower alpha (more BM25 weight), around 0.4-0.5.
The practical implementation detail most people miss: score normalization. BM25 scores and cosine similarity scores are on completely different scales. BM25 is unbounded and can easily reach 20 or more, while cosine similarity is bounded between -1 and 1 (and usually falls between 0 and 1 for text embeddings). You must normalize both to the same range before combining, or the raw BM25 scores will dominate regardless of your alpha. Min-max normalization within each result set is the simplest approach; Reciprocal Rank Fusion (RRF) avoids the normalization problem entirely by using rank positions instead of scores.
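A small sketch of why normalization matters, using made-up scores for two documents. Document "a" is a strong BM25 match and a weak semantic match; document "b" is the reverse. The helper names are hypothetical.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to the 0-1 range within one result set."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc: (score - lo) / span for doc, score in scores.items()}


def blend(semantic: Dict[str, float], bm25: Dict[str, float], alpha: float) -> Dict[str, float]:
    """Weighted combination: alpha for semantic, 1 - alpha for BM25."""
    docs = set(semantic) | set(bm25)
    return {
        doc: alpha * semantic.get(doc, 0.0) + (1 - alpha) * bm25.get(doc, 0.0)
        for doc in docs
    }


bm25_raw = {"a": 18.0, "b": 2.0}   # illustrative raw BM25 scores
semantic = {"a": 0.40, "b": 0.95}  # illustrative cosine similarities
alpha = 0.7                        # intended: 70% semantic

raw = blend(semantic, bm25_raw, alpha)
normalized = blend(min_max(semantic), min_max(bm25_raw), alpha)

print("raw winner:", max(raw, key=raw.get))
print("normalized winner:", max(normalized, key=normalized.get))
```

With raw scores, document "a" wins even though alpha says semantic similarity should carry 70% of the weight; after normalizing both sides, "b" wins as intended.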
Follow-up: You mentioned RRF avoids the normalization problem. When would you prefer RRF over weighted score combination, and what is the downside of RRF?
RRF is more robust when combining rankings from systems with incompatible score distributions, which is exactly the BM25 + semantic case. It only uses rank positions, so it does not care about score scales. The constant k=60 from the original paper rarely needs tuning, which makes it operationally simpler. The downside is that RRF throws away magnitude information. If semantic search returns a document with 0.99 similarity (a near-perfect match) and another at 0.72, RRF treats the gap between rank 1 and rank 2 the same regardless. Weighted score combination preserves that signal: a 0.99 match contributes much more than a 0.72 match. In practice, this matters when you have a “golden” document that is a clear best match. RRF can dilute that signal by boosting a document that ranked high in BM25 but is semantically mediocre. I would use RRF as the default for simplicity and switch to weighted combination only if evaluation shows that top-1 precision matters significantly for your use case.
Explain the two-stage retrieve-then-rerank pattern. Why not just use the reranker directly on all documents?
Strong Answer:
The two-stage pattern exists because retrieval speed and ranking quality are fundamentally at odds. A bi-encoder (used in embedding-based retrieval) encodes the query and each document independently, which means document embeddings can be pre-computed and indexed. Searching 1 million pre-computed embeddings takes milliseconds using approximate nearest neighbor (ANN) indexes. A cross-encoder (used in reranking) encodes the query and document together as a single input, which means it must do a forward pass for every query-document pair at query time. Running a cross-encoder against 1 million documents would take hours.
The two-stage approach exploits this asymmetry: use the fast but approximate bi-encoder to retrieve 50-200 candidates from the full corpus (milliseconds), then use the slow but accurate cross-encoder to rerank only those candidates (hundreds of milliseconds). You get cross-encoder quality at bi-encoder speed. In benchmarks, this pattern typically improves Recall@10 by 5-15% over retrieval alone, with only 100-200ms added latency.
The critical tuning parameter is the retrieval set size — how many candidates you pass to the reranker. Too few (say 10) and the reranker cannot recover relevant documents that the bi-encoder missed. Too many (say 1000) and the reranker becomes the latency bottleneck. I typically start with 100 candidates and measure recall improvement as I increase to 200, 500. There are diminishing returns — going from 50 to 100 candidates usually helps significantly, but going from 200 to 500 rarely does.
The other nuance is that bi-encoders and cross-encoders often disagree on what is relevant, and that disagreement is exactly where the value lives. The bi-encoder might rank a document at position 50 because it captures semantic similarity but misses a subtle relevance signal. The cross-encoder, seeing both query and document together, catches that signal and promotes it to position 3. Documents that both agree on (top 5 in both) are slam-dunk relevant. Documents where they disagree are the interesting cases.
Follow-up: In production, the reranker adds 150ms of latency. How would you decide if that latency is justified, and what alternatives exist if it is not?
Measure the impact on end-user metrics, not just retrieval metrics. If you are building a RAG system, compare the LLM answer quality with and without reranking using a blind evaluation. If the LLM answers are 10% better with reranking, and your users are paying customers making important decisions based on those answers, 150ms is trivially worth it. If you are building a casual search feature and users care more about speed than precision, skip it. Alternatives to a self-hosted cross-encoder reranker include: the Cohere Rerank API (managed, fast, pay-per-call); ColBERT-style late interaction models, which are faster than cross-encoders but more accurate than bi-encoders; or a lightweight “reranker” that is just an LLM prompt asking “which of these 10 documents best answers the query?”, which is surprisingly effective for small candidate sets when you already have the LLM in your pipeline.
What is HyDE (Hypothetical Document Embeddings), and when would you use it versus standard query embedding? What can go wrong with it?
Strong Answer:
HyDE is a query expansion technique where instead of embedding the user’s query directly, you first ask an LLM to generate a hypothetical answer to the query, then embed that hypothetical answer and use it for retrieval. The intuition is that a hypothetical answer is closer in embedding space to the actual relevant documents than a short question is. A query like “how to handle database connection pooling” is a question, but the relevant document is an explanation — they live in different parts of embedding space. A hypothetical answer about connection pooling is an explanation, so it lands closer to the real document.
In benchmarks, HyDE improves recall by 10-20% on knowledge-intensive queries where the query and the documents have different linguistic structures. It works best when the query is short and abstract (“best practices for microservice auth”) and the documents are long and detailed.
When it goes wrong: the LLM can hallucinate facts in the hypothetical answer that steer retrieval in the wrong direction. If the query is “What is the company’s remote work policy?” and the LLM generates a hypothetical answer about a flexible remote policy when the actual policy is strict in-office, the embedding of the hallucinated answer may retrieve documents about flexible work rather than the actual policy. You are searching for what the LLM thinks the answer is, not what the answer actually is.
It also adds latency and cost: one full LLM call to generate the hypothetical document before you even start retrieval. For a search feature where users expect sub-second results, this 500ms+ overhead is significant. And you are paying for an LLM generation on every query just for retrieval, before you even get to the answer-generation step.
I would use HyDE selectively: for complex analytical queries where recall is more important than latency (research assistants, legal search), and skip it for simple factual queries where standard embedding works fine. A good heuristic: if the query is under 10 words and looks like a keyword search, skip HyDE. If it is a full question or a complex information need, try HyDE.
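The selective-HyDE heuristic can be sketched like this. The question-word list, the function names, and the injected `generate` and `embed` callables are all assumptions for illustration; in a real system `generate` would wrap an LLM call and `embed` an embedding model.

```python
from typing import Callable, List

# Hypothetical list of words that suggest a full question rather than keywords
QUESTION_WORDS = {"how", "what", "why", "when", "where", "which", "who", "should", "can", "does"}


def should_use_hyde(query: str, min_words: int = 10) -> bool:
    """Skip HyDE for short, keyword-like queries; use it for questions or long queries."""
    words = query.lower().split()
    looks_like_question = query.strip().endswith("?") or (
        bool(words) and words[0] in QUESTION_WORDS
    )
    return len(words) >= min_words or looks_like_question


def hyde_query_vector(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], List[float]],
) -> List[float]:
    """Embed a hypothetical answer when HyDE applies, otherwise embed the query itself."""
    text = generate(query) if should_use_hyde(query) else query
    return embed(text)


print(should_use_hyde("asyncpg pool timeout"))
print(should_use_hyde("How do I handle database connection pooling?"))
```

Passing the LLM and embedding calls in as functions keeps the routing logic testable without network access; the word-count threshold and the question detection are rough heuristics that would need tuning against real query logs.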
Follow-up: Could you combine HyDE with hybrid search, and how would that interaction work?
Yes, and it is actually a strong combination. Use HyDE for the semantic arm of hybrid search (embed the hypothetical answer for vector similarity) while keeping BM25 on the original query for the keyword arm. This way, the semantic search benefits from the hypothetical document being closer to relevant passages, while BM25 still catches exact-match terms from the original query that might be lost in the LLM-generated hypothesis. The BM25 arm acts as a safety net against HyDE hallucination: if HyDE steers semantic search toward the wrong topic, BM25 can still surface the right document based on keyword overlap. In practice, I have seen this combination outperform both standard hybrid and HyDE-only approaches, but it adds an LLM generation to every query on top of the embedding call and the BM25 lookup, so it is only justified when retrieval quality is paramount.
Your hybrid search system is returning irrelevant results for 15% of queries. Walk me through your debugging process from end to end.
Strong Answer:
First, categorize the failures. Pull the 15% of bad-result queries and classify them: Are they exact-match queries where BM25 should dominate? Conceptual queries where semantic should dominate? Multi-intent queries? Queries in a language or jargon the embedding model was not trained on? The distribution of failure types tells you where to focus.
Second, check the retrieval stage independently from the rest of the pipeline. For each failing query, look at what the retriever returned (the raw document chunks) before any reranking or LLM processing. If the relevant document is not in the top 100 retrieved candidates, the problem is retrieval. If it is in the top 100 but ranked at position 80, the problem is ranking. If it is ranked at position 3 but the LLM still gave a bad answer, the problem is downstream — not search.
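The triage rule in this step can be written as a small function. This is a sketch under the assumption that you have, for each failing query, the ordered list of retrieved document IDs and one known-relevant ID; the `good_rank` cutoff of 10 is an illustrative choice, not a fixed rule.

```python
from typing import List


def diagnose(
    retrieved_ids: List[int],
    relevant_id: int,
    pool_size: int = 100,
    good_rank: int = 10,
) -> str:
    """Classify a failing query as a retrieval, ranking, or downstream problem."""
    pool = retrieved_ids[:pool_size]
    if relevant_id not in pool:
        return "retrieval"   # the relevant doc never reached the candidate set
    rank = pool.index(relevant_id) + 1
    if rank > good_rank:
        return "ranking"     # retrieved, but buried too deep to be used
    return "downstream"      # search did its job; look at the LLM or prompt


candidates = list(range(100))      # toy candidate list: IDs 0..99 in rank order
print(diagnose(candidates, 500))   # relevant doc missing from candidates
print(diagnose(candidates, 79))    # relevant doc at position 80
print(diagnose(candidates, 2))     # relevant doc at position 3
```

Counting the three labels over all failing queries shows which stage deserves attention first.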
Third, for retrieval failures, check the chunking. The number one cause of bad search results in my experience is bad chunking: a relevant passage got split across two chunks and neither chunk is self-contained enough to rank well. Pull the actual chunk that should have been retrieved and examine it. Does it make sense in isolation, or does it start with “This approach…” with no antecedent? If chunking is the issue, increase chunk overlap, switch to semantic-boundary chunking, or add contextual retrieval (prepending a summary to each chunk).
Fourth, check the embedding quality. Take a failing query and its known-relevant document, embed both, and compute their cosine similarity. If similarity is low (below roughly 0.7 as a starting point; the right threshold depends on the embedding model, so calibrate it against pairs you know are good matches), the embedding model is not capturing the semantic relationship. This happens with domain-specific jargon, acronyms, or niche technical content. Solutions: fine-tune the embedding model on your domain data, add synonyms to the query via query expansion, or switch to a larger embedding model.
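The similarity check itself needs only a few lines. The sketch below uses toy two-dimensional vectors in place of real embeddings, and the 0.7 default threshold is the starting point mentioned above, not a universal constant.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_looks_weak(
    query_vec: Sequence[float],
    doc_vec: Sequence[float],
    threshold: float = 0.7,
) -> bool:
    """Flag a query/document pair whose similarity falls below the threshold."""
    return cosine_similarity(query_vec, doc_vec) < threshold


# Toy vectors standing in for real query and document embeddings
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
print(embedding_looks_weak([1.0, 0.0], [0.0, 1.0]))
```

In practice the two vectors would come from the same embedding model used by the index, so the number reflects what the retriever actually sees.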
Fifth, check the hybrid weighting. It is possible your alpha is wrong for the query distribution. Run a sweep of alpha values (0.3, 0.5, 0.7) on the failing queries and see if a different weight recovers the relevant documents. If the failing queries are mostly exact-match but alpha is 0.8 (heavy semantic), you need to lower alpha or implement query-type-aware weighting.
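An alpha sweep over failing queries could look like the following sketch. It assumes the semantic and BM25 scores for each query are already normalized to 0-1 and that each case has one known-relevant document; the case data here is invented to mimic exact-match queries.

```python
from typing import Dict, List, Sequence, Tuple

Case = Tuple[Dict[str, float], Dict[str, float], str]  # (semantic, bm25, relevant_id)


def sweep_alpha(
    cases: List[Case],
    alphas: Sequence[float] = (0.3, 0.5, 0.7),
    k: int = 1,
) -> Dict[float, float]:
    """For each alpha, return the fraction of cases whose relevant doc lands in the top k."""
    results = {}
    for alpha in alphas:
        hits = 0
        for semantic, bm25, relevant in cases:
            docs = set(semantic) | set(bm25)
            ranked = sorted(
                docs,
                key=lambda d: alpha * semantic.get(d, 0.0) + (1 - alpha) * bm25.get(d, 0.0),
                reverse=True,
            )
            hits += relevant in ranked[:k]
        results[alpha] = hits / len(cases)
    return results


# Two hypothetical exact-match failures: BM25 prefers the relevant doc, semantic does not
failing_cases = [
    ({"a": 0.9, "b": 0.2}, {"a": 0.1, "b": 1.0}, "b"),
    ({"e": 0.8, "f": 0.5}, {"e": 0.2, "f": 1.0}, "f"),
]
print(sweep_alpha(failing_cases))
```

Here the hit rate drops as alpha rises, which would suggest lowering alpha or routing exact-match queries to a BM25-heavier weighting. The same sweep should also be run on queries that currently succeed, to confirm the change does not break them.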
Follow-up: You have identified that chunking is the root cause for 60% of the failures. How do you fix chunking without re-processing your entire 500K document corpus?
You do not avoid re-processing; you plan for it. Chunking changes require re-embedding because different text produces different vectors. The real question is how to do it efficiently and without downtime. I would implement versioned indexes: create a new index (documents_v2) with the improved chunking strategy, embed in batch overnight, then atomically swap the search endpoint to point to the new index. Keep the old index available for rollback. For the batch re-processing itself, send multiple texts per embedding API request (up to 2048 texts per call) and parallelize across workers. For 500K documents with an average of 5 chunks each, that is 2.5M embeddings; at 2048 per batch, about 1,200 API calls. With parallelism, this takes a few hours and costs roughly $2-5 with text-embedding-3-small. The key lesson: always design your vector store for re-indexing from day one, because you will change your chunking strategy at least twice.
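The batch arithmetic and the versioned-index swap can be sketched as follows. `plan_reembedding` and `IndexAlias` are hypothetical helpers written for this example; a real vector store would provide its own alias or collection-swap mechanism.

```python
import math
from typing import Dict, Optional


def plan_reembedding(num_docs: int, chunks_per_doc: int, batch_size: int = 2048) -> Dict[str, int]:
    """Estimate how many embedding requests a full re-index needs."""
    total_chunks = num_docs * chunks_per_doc
    return {
        "total_chunks": total_chunks,
        "api_calls": math.ceil(total_chunks / batch_size),
    }


class IndexAlias:
    """Minimal stand-in for an alias that points search traffic at one index version."""

    def __init__(self, active: str):
        self.active = active
        self.previous: Optional[str] = None

    def swap(self, new_index: str) -> None:
        """Point traffic at the new index, remembering the old one for rollback."""
        self.previous, self.active = self.active, new_index

    def rollback(self) -> None:
        """Return to the previous index if the new one misbehaves."""
        if self.previous is None:
            raise RuntimeError("no previous index to roll back to")
        self.active, self.previous = self.previous, self.active


plan = plan_reembedding(num_docs=500_000, chunks_per_doc=5)
print(plan)

alias = IndexAlias("documents_v1")
alias.swap("documents_v2")
print(alias.active)
```

The plan confirms the estimate of roughly 1,200 requests for 2.5M chunks, and the alias keeps the old index reachable until the new one has been validated.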