Search Methods Comparison
BM25 Implementation
BM25 (Best Match 25) is the default ranking function in Lucene-based engines such as Elasticsearch and Solr, and the baseline behind most traditional keyword search. It is a probabilistic ranking function that scores documents by term frequency (how often the query words appear), weighted by inverse document frequency (rare words matter more than common ones) and normalized for document length. Think of it as “smart keyword matching”: it handles the math that makes a rare, important word rank higher than common filler. Despite being decades old, BM25 remains hard to beat for exact-match queries like product SKUs, error codes, and proper nouns.
Semantic Search with Embeddings
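Semantic search embeds the query and every document as vectors, then ranks documents by vector similarity, most often cosine similarity. The sketch below is a minimal illustration rather than a production implementation: the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `semantic_search` are hypothetical helper names. A real system would get its vectors from an embedding model and search them with an ANN index instead of a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, doc_vecs, top_k=3):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (assumption for illustration).
doc_vecs = {
    "pto_policy": [0.9, 0.1, 0.0],
    "expense_report": [0.1, 0.9, 0.1],
    "vpn_setup": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "how do I take time off"

results = semantic_search(query_vec, doc_vecs, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Because the query vector points in nearly the same direction as the `pto_policy` vector, that document ranks first even though the query text shares no keywords with its id.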
Hybrid Search
Neither BM25 nor semantic search is universally better; they have complementary strengths. BM25 excels at exact matches (error codes, function names, acronyms), while semantic search excels at meaning (synonyms, paraphrases, conceptual similarity). Combining them usually outperforms either alone on mixed workloads, so the main question is how to weight them.
Reciprocal Rank Fusion (RRF)
RRF is a widely used method for merging multiple ranked lists. Unlike weighted averaging (which requires score normalization), RRF uses only rank positions, making it robust across different scoring scales. The formula is simple: for each document, sum 1/(k + rank) across all rankings, where k is a smoothing constant (60 is a common default). Documents that appear near the top of multiple lists get the highest combined score.
Reranking
Retrieval is fast but approximate. Reranking is slow but precise. The two-stage pattern exploits this: retrieve about 100 candidates cheaply (milliseconds), then rerank those candidates with a more powerful cross-encoder model (typically hundreds of milliseconds, more for large candidate sets or slow hardware). A cross-encoder reads the query and each document together, so it catches subtle relevance signals that bi-encoder similarity misses.
Query Expansion
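Query expansion improves recall by rewriting or augmenting the query before retrieval, for example by adding synonyms or spelled-out acronyms.
Contextual Retrieval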
Contextual retrieval adds context to each chunk before embedding, for example by prepending a short summary of the surrounding document.
Search Pipeline
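One way to wire the pieces together is to run keyword and embedding retrieval side by side and fuse the two rankings with RRF. The sketch below is an assumption about how such a pipeline might look, not a reference implementation: `bm25_rank` uses the classic BM25 formula with textbook parameters (`k1=1.5`, `b=0.75`) and naive whitespace tokenization, `rrf_fuse` uses `k=60`, and the semantic ranking is hard-coded where a real embedding retriever would go.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (assumption; real systems use analyzers)."""
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank doc ids by BM25 score; docs maps id -> text. Zero-score docs are dropped."""
    if not docs:
        return []
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for doc_id, terms in tokens.items():
        tf = Counter(terms)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) for each doc across all rankings."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


docs = {
    "a": "error E4012 raised when the connection pool is exhausted",
    "b": "how to request paid time off",
    "c": "connection pooling best practices for databases",
}
bm25_ranking = bm25_rank("error E4012", docs)
semantic_ranking = ["c", "a", "b"]  # placeholder for an embedding retriever's output
fused = rrf_fuse([bm25_ranking, semantic_ranking])
print(fused)
```

Document `a` appears in both lists, so it collects two reciprocal-rank contributions and ends up ahead of `c`, which the semantic ranking alone had placed first. A reranking step would slot in after `rrf_fuse`, operating on the fused candidates.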
Performance Comparison
These numbers are representative across multiple benchmarks. The takeaway: each layer of sophistication buys real recall improvement, but at increasing cost and latency. Choose based on your quality requirements.
| Method | Recall@10 | Latency | Cost | When to Use |
|---|---|---|---|---|
| BM25 | 0.65 | <10ms | None | Exact-match heavy queries, zero budget |
| Semantic | 0.75 | 50ms | Embeddings | General-purpose, meaning-based search |
| Hybrid | 0.82 | 60ms | Embeddings | Most production systems (best bang for buck) |
| Hybrid + Rerank | 0.90 | 150ms | Embeddings + Rerank | High-stakes: legal, medical, compliance |
Search Failure Modes and Fixes
Understanding why search fails is more valuable than understanding why it succeeds. These are the failure patterns you will encounter in production:
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Vocabulary mismatch | User says “PTO” but docs say “paid time off” | Bi-encoder embeds query and docs independently | Query expansion or HyDE |
| False positive saturation | Top results are vaguely related but not useful | Chunks too large, meaning is diluted | Smaller chunks + reranking |
| Negation blindness | “Not Python” still returns Python docs | Embeddings encode topic, not negation | Add keyword filter for negated terms |
| Stale results (no recency signal) | Old documents outrank updated ones | No time-decay in scoring | Add created_at weight or metadata filter |
| Short query collapse | One-word queries (“auth”) return wildly varied results | Not enough semantic signal | Expand short queries with LLM or require minimum length |
| Score plateau | Top 20 results all score 0.78-0.82 | All docs are equally “kind of relevant” | Reranking with cross-encoder breaks the tie |
What is Next
Context Window Management
Interview Deep-Dive
You are building a search feature for an internal knowledge base with 500K documents. Walk me through how you would decide between pure semantic search, BM25, and hybrid search -- and how you would tune the hybrid weighting.
- The answer depends on the content and query patterns, but for an internal knowledge base I would almost certainly end up with hybrid search. Here is the reasoning: internal docs contain a mix of natural language (policy documents, onboarding guides) and highly specific terms (project codenames, internal tool names, error codes, Jira ticket IDs). Pure semantic search excels at the first category but completely misses exact-match needs. Pure BM25 handles exact terms but fails when someone asks “how do I take time off” and the document says “PTO request procedure.”
- I would start by building both pipelines independently and running a retrieval evaluation. Take 50-100 real user queries from search logs (or create them manually if no logs exist), have domain experts label the top 5 relevant documents for each query, then measure Recall@10 for BM25 alone, semantic alone, and hybrid at different alpha values. In my experience, hybrid consistently beats either individual method by 10-25% on Recall@10 for mixed-content corpora.
- For tuning alpha (the weight on the semantic score, with 1 - alpha going to BM25), I would start at 0.7 semantic / 0.3 BM25 as a default. Then I would segment queries into categories: exact-match queries (error codes, names), conceptual queries (how-to, explanations), and mixed. Tune alpha per category if your system can classify query type, or find the alpha that maximizes recall across the blended query set. I have found that alpha between 0.5 and 0.7 works for most knowledge bases. Technical documentation with lots of code and acronyms benefits from lower alpha (more BM25 weight), around 0.4-0.5.
- The practical implementation detail most people miss: score normalization. BM25 scores and cosine similarity scores are on completely different scales. BM25 scores are unbounded and can easily reach 20 or more, while cosine similarity lies between -1 and 1 (and usually between 0 and 1 for text embeddings). You must normalize both to the same range before combining, or the raw BM25 scores will dominate regardless of your alpha. Min-max normalization within each result set is the simplest approach; Reciprocal Rank Fusion (RRF) avoids the normalization problem entirely by using rank positions instead of scores.
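The normalization and weighting steps from this answer can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: `min_max_normalize` and `hybrid_scores` are hypothetical helper names, alpha is taken to be the weight on the semantic score, and a document missing from one result set is assumed to contribute 0 from that side.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} dict to [0, 1]; a constant result set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(bm25_scores, semantic_scores, alpha=0.7):
    """Blend normalized scores: alpha * semantic + (1 - alpha) * BM25."""
    bm25_norm = min_max_normalize(bm25_scores)
    sem_norm = min_max_normalize(semantic_scores)
    combined = {
        doc_id: alpha * sem_norm.get(doc_id, 0.0)
        + (1 - alpha) * bm25_norm.get(doc_id, 0.0)
        for doc_id in set(bm25_norm) | set(sem_norm)
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))


# Made-up raw scores: BM25 on an unbounded scale, cosine similarity near 0-1.
bm25_raw = {"a": 18.0, "b": 6.0, "c": 2.0}
semantic_raw = {"a": 0.62, "b": 0.81, "d": 0.55}

semantic_heavy = hybrid_scores(bm25_raw, semantic_raw, alpha=0.7)
keyword_heavy = hybrid_scores(bm25_raw, semantic_raw, alpha=0.3)
print(semantic_heavy[0][0], keyword_heavy[0][0])
```

With these toy numbers the top result flips from `b` to `a` as alpha drops from 0.7 to 0.3, which is the kind of sensitivity an alpha sweep over labeled queries is meant to expose.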
Explain the two-stage retrieve-then-rerank pattern. Why not just use the reranker directly on all documents?
- The two-stage pattern exists because retrieval speed and ranking quality are fundamentally at odds. A bi-encoder (used in embedding-based retrieval) encodes the query and each document independently, which means document embeddings can be pre-computed and indexed. Searching 1 million pre-computed embeddings takes milliseconds using approximate nearest neighbor (ANN) indexes. A cross-encoder (used in reranking) encodes the query and document together as a single input, which means it must do a forward pass for every query-document pair at query time. Running a cross-encoder against 1 million documents would take hours.
- The two-stage approach exploits this asymmetry: use the fast but approximate bi-encoder to retrieve 50-200 candidates from the full corpus (milliseconds), then use the slow but accurate cross-encoder to rerank only those candidates (hundreds of milliseconds). You get most of the cross-encoder’s ranking quality without paying to score the whole corpus. In benchmarks, this pattern typically improves Recall@10 by 5-15% over retrieval alone, with only 100-200ms added latency.
- The critical tuning parameter is the retrieval set size — how many candidates you pass to the reranker. Too few (say 10) and the reranker cannot recover relevant documents that the bi-encoder missed. Too many (say 1000) and the reranker becomes the latency bottleneck. I typically start with 100 candidates and measure recall improvement as I increase to 200, 500. There are diminishing returns — going from 50 to 100 candidates usually helps significantly, but going from 200 to 500 rarely does.
- The other nuance is that bi-encoders and cross-encoders often disagree on what is relevant, and that disagreement is exactly where the value lives. The bi-encoder might rank a document at position 50 because it captures semantic similarity but misses a subtle relevance signal. The cross-encoder, seeing both query and document together, catches that signal and promotes it to position 3. Documents that both agree on (top 5 in both) are slam-dunk relevant. Documents where they disagree are the interesting cases.
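The control flow of retrieve-then-rerank is simple enough to sketch without any model. Everything here is a stand-in chosen for illustration: `cheap_retrieve` counts query-word occurrences in place of a bi-encoder with an ANN index, and `toy_pair_score` counts shared word pairs in place of a cross-encoder, since it is the only scorer that looks at query and document together. Only the shape of `retrieve_then_rerank` is the point.

```python
def cheap_retrieve(query, corpus, n_candidates):
    """First stage stand-in: rank docs by how many of their words are query words."""
    query_words = set(query.lower().split())
    scored = [(sum(1 for w in doc.lower().split() if w in query_words), doc)
              for doc in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:n_candidates]]


def toy_pair_score(query, doc):
    """Second stage stand-in: number of adjacent word pairs shared by query and doc."""
    def bigrams(text):
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(doc))


def retrieve_then_rerank(query, corpus, n_candidates=100, top_k=10):
    """Retrieve a candidate set cheaply, then re-score only those candidates."""
    candidates = cheap_retrieve(query, corpus, n_candidates)
    scored = [(doc, toy_pair_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


corpus = [
    "password policy rotate your password and never share your password",
    "to reset your password open account settings",
    "reset the router to factory settings",
    "billing and invoices",
]
query = "reset your password"
first_stage = cheap_retrieve(query, corpus, n_candidates=3)
results = retrieve_then_rerank(query, corpus, n_candidates=3, top_k=2)
print(results[0][0])
```

The first stage puts the password-policy document on top because it repeats the query words most often; the pair-aware second stage promotes the document that actually describes resetting a password. That reordering is the disagreement between stages described above.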
What is HyDE (Hypothetical Document Embeddings), and when would you use it versus standard query embedding? What can go wrong with it?
- HyDE is a query expansion technique where instead of embedding the user’s query directly, you first ask an LLM to generate a hypothetical answer to the query, then embed that hypothetical answer and use it for retrieval. The intuition is that a hypothetical answer is closer in embedding space to the actual relevant documents than a short question is. A query like “how to handle database connection pooling” is a question, but the relevant document is an explanation — they live in different parts of embedding space. A hypothetical answer about connection pooling is an explanation, so it lands closer to the real document.
- In benchmarks, HyDE improves recall by 10-20% on knowledge-intensive queries where the query and the documents have different linguistic structures. It works best when the query is short and abstract (“best practices for microservice auth”) and the documents are long and detailed.
- When it goes wrong: the LLM can hallucinate facts in the hypothetical answer that steer retrieval in the wrong direction. If the query is “What is the company’s remote work policy?” and the LLM generates a hypothetical answer about a flexible remote policy when the actual policy is strict in-office, the embedding of the hallucinated answer may retrieve documents about flexible work rather than the actual policy. You are searching for what the LLM thinks the answer is, not what the answer actually is.
- It also adds latency and cost: one full LLM call to generate the hypothetical document before you even start retrieval. For a search feature where users expect sub-second results, this 500ms+ overhead is significant. And you are paying for an LLM generation on every query just for retrieval, before you even get to the answer-generation step.
- I would use HyDE selectively: for complex analytical queries where recall is more important than latency (research assistants, legal search), and skip it for simple factual queries where standard embedding works fine. A good heuristic: if the query is under 10 words and looks like a keyword search, skip HyDE. If it is a full question or a complex information need, try HyDE.
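A selective HyDE step along these lines might look like the following sketch. The names are hypothetical: `generate` stands for whatever LLM call you use and `embed` for your embedding model, both passed in as plain callables, and the stubs below exist only so the example runs. The routing rule is one reading of the heuristic above (short keyword-style queries skip HyDE, questions and long queries use it), with the question-word list as an added assumption.

```python
QUESTION_WORDS = {"how", "what", "why", "when", "where", "which", "who", "can", "should"}


def should_use_hyde(query, long_query_words=10):
    """Skip HyDE for short keyword-style queries; use it for questions or long queries."""
    words = query.split()
    if not words:
        return False
    is_question = query.strip().endswith("?") or words[0].lower() in QUESTION_WORDS
    return is_question or len(words) >= long_query_words


def hyde_query_vector(query, generate, embed):
    """Embed a hypothetical answer instead of the query when HyDE is warranted."""
    if should_use_hyde(query):
        hypothetical = generate(f"Write a short passage that answers: {query}")
        return embed(hypothetical)
    return embed(query)


# Stubs so the sketch runs without an LLM or embedding model (assumptions).
def fake_generate(prompt):
    return ("Connection pooling reuses open database connections "
            "instead of opening a new one per request.")


def fake_embed(text):
    return [float(len(text.split()))]  # word count as a one-dimensional "embedding"


keyword_vec = hyde_query_vector("E4012 timeout", fake_generate, fake_embed)
question_vec = hyde_query_vector(
    "How do I handle database connection pooling?", fake_generate, fake_embed
)
print(keyword_vec, question_vec)
```

Routing this way keeps the extra LLM call, and its latency and cost, off the path of queries that standard embedding already handles well.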
Your hybrid search system is returning irrelevant results for 15% of queries. Walk me through your debugging process from end to end.
- First, categorize the failures. Pull the 15% of bad-result queries and classify them: Are they exact-match queries where BM25 should dominate? Conceptual queries where semantic should dominate? Multi-intent queries? Queries in a language or jargon the embedding model was not trained on? The distribution of failure types tells you where to focus.
- Second, check the retrieval stage independently from the rest of the pipeline. For each failing query, look at what the retriever returned (the raw document chunks) before any reranking or LLM processing. If the relevant document is not in the top 100 retrieved candidates, the problem is retrieval. If it is in the top 100 but ranked at position 80, the problem is ranking. If it is ranked at position 3 but the LLM still gave a bad answer, the problem is downstream — not search.
- Third, for retrieval failures, check the chunking. The number one cause of bad search results in my experience is bad chunking: a relevant passage got split across two chunks and neither chunk is self-contained enough to rank well. Pull the actual chunk that should have been retrieved and examine it. Does it make sense in isolation, or does it start with “This approach…” with no antecedent? If chunking is the issue, increase chunk overlap, switch to semantic-boundary chunking, or add contextual retrieval (prepending a summary to each chunk).
- Fourth, check the embedding quality. Take a failing query and its known-relevant document, embed both, and compute their cosine similarity. If the similarity is clearly lower than what the same model produces for known-good pairs (0.7 is a rough cutoff for many models, but the scale varies by model), the embedding model is not capturing the semantic relationship. This happens with domain-specific jargon, acronyms, or niche technical content. Solutions: fine-tune the embedding model on your domain data, add synonyms to the query via query expansion, or switch to a larger embedding model.
- Fifth, check the hybrid weighting. It is possible your alpha is wrong for the query distribution. Run a sweep of alpha values (0.3, 0.5, 0.7) on the failing queries and see if a different weight recovers the relevant documents. If the failing queries are mostly exact-match but alpha is 0.8 (heavy semantic), you need to lower alpha or implement query-type-aware weighting.
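The second step, locating which stage failed, is easy to automate once each failing query has a labeled relevant document. The sketch below is a rough illustration: `diagnose_stage` is a hypothetical helper, the retrieved list is assumed to hold the retriever's top candidates in rank order, and `good_rank=10` is an arbitrary cutoff for "ranked well enough that search did its job".

```python
from collections import Counter


def diagnose_stage(relevant_id, retrieved_ids, good_rank=10):
    """Classify a failing query by where its relevant document ended up."""
    if relevant_id not in retrieved_ids:
        return "retrieval"  # never made it into the candidate set
    rank = retrieved_ids.index(relevant_id) + 1
    if rank > good_rank:
        return "ranking"  # retrieved, but buried too deep
    return "downstream"  # search worked; look at the LLM or prompt


# Made-up data: the same 100 candidates for three failing queries.
retrieved = [f"doc_{i}" for i in range(100)]
failing_queries = [
    ("q1", "doc_500"),  # relevant doc absent from the candidates
    ("q2", "doc_79"),   # relevant doc at rank 80
    ("q3", "doc_2"),    # relevant doc at rank 3
]

labels = {qid: diagnose_stage(doc_id, retrieved) for qid, doc_id in failing_queries}
summary = Counter(labels.values())
print(summary)
```

Running this over the full set of failing queries gives a count per stage, which tells you whether to spend time on chunking and embeddings, on weighting and reranking, or outside search entirely.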
text-embedding-3-small. The key lesson: always design your vector store for re-indexing from day one, because you will change your chunking strategy at least twice.