If LLMs are the brain, embeddings are the language it uses to think about similarity. An embedding converts a chunk of text into a list of numbers (a vector) where similar meanings end up close together in number-space. It is like plotting cities on a map: New York and Boston end up near each other, while Tokyo is far away. Except instead of geographic coordinates, you have 1536 dimensions capturing meaning, topic, tone, and intent.

Embeddings convert text into dense numerical vectors that capture semantic meaning:
Text                       Embedding Vector
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"I love pizza"           → [0.12, -0.34, 0.87, ..., 0.23]
"Pizza is my favorite"   → [0.11, -0.32, 0.85, ..., 0.21]    ← Similar!
"I hate broccoli"        → [-0.45, 0.12, -0.33, ..., 0.67]   ← Different
Choosing an embedding model is a three-way trade-off between quality, cost, and speed. OpenAI’s text-embedding-3-small is the default choice for most teams — it is cheap, fast, and good enough. Move to text-embedding-3-large when you need higher accuracy (legal, medical). Go open-source with BGE or MiniLM when cost is critical at scale (millions of documents) or you cannot send data to an external API. The table below gives the concrete numbers.
Dimension reduction is one of the most underused features of OpenAI’s embedding models. You can request a 256-dimensional embedding instead of the full 1536, and OpenAI applies Matryoshka Representation Learning to give you a smaller vector that retains most of the quality. The practical impact is huge: 256 dimensions instead of 1536 means roughly 6x less storage in your vector database, roughly 6x less work per similarity comparison, and cheaper pgvector indexes, all for a quality drop that is often less than 5% on retrieval benchmarks.

OpenAI’s text-embedding-3 models support native dimension reduction:
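As a rough illustration of what a shortened embedding looks like, the sketch below truncates a vector to its leading dimensions and re-normalizes it to unit length, which is the manual counterpart of requesting fewer dimensions from the API. The helper name `shorten_embedding` and the toy 4-dimensional vector are illustrative assumptions, not part of any library; the API route is mentioned only in a comment because it needs a network call and an API key.

```python
import math


def shorten_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length.

    Hypothetical helper for illustration. With the OpenAI Python client,
    the same effect is requested server-side by passing `dimensions=256`
    to `client.embeddings.create(...)` for a text-embedding-3 model.
    """
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


# Toy unit-length vector standing in for a real 1536-dimensional embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = shorten_embedding(full, 2)
print(len(short), short)
```

Re-normalizing matters because a truncated vector is no longer unit length, so dot product and cosine similarity would otherwise stop agreeing.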
Open-source embedding models run on your own hardware, which means zero API costs and no data leaving your network. The trade-off is that you need to manage the infrastructure: GPU for fast inference (or accept slower CPU speeds), model loading, and batching. For teams processing millions of documents, the math usually favors self-hosted: embedding 1M chunks costs ~$20 with OpenAI but effectively $0 with a local model (after the one-time GPU cost).
from sentence_transformers import SentenceTransformer
import numpy as np


class LocalEmbedder:
    """Local embedding using sentence-transformers.

    Tip: BGE models expect a query prefix for retrieval tasks
    ("Represent this sentence for searching relevant passages:").
    Forgetting this prefix can drop recall by 10-15%. Always check
    the model card.
    """

    def __init__(self, model_name: str = "BAAI/bge-large-en-v1.5"):
        self.model = SentenceTransformer(model_name)

    def embed(self, text: str) -> np.ndarray:
        return self.model.encode(text, normalize_embeddings=True)

    def embed_batch(self, texts: list[str]) -> np.ndarray:
        return self.model.encode(
            texts,
            normalize_embeddings=True,
            batch_size=32,
            show_progress_bar=True
        )

    def embed_with_instruction(
        self,
        text: str,
        instruction: str = "Represent this sentence for searching relevant passages:"
    ) -> np.ndarray:
        """Some models perform better with instructions.

        For BGE, add the instruction to queries only, not to the
        passages being indexed.
        """
        return self.model.encode(
            f"{instruction} {text}",
            normalize_embeddings=True
        )


# Usage
embedder = LocalEmbedder()
embedding = embedder.embed("What is artificial intelligence?")
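The API-versus-local cost comparison for 1M chunks can be sanity-checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the chunk size of 1,000 tokens and the price of $0.02 per million tokens are assumptions chosen for illustration, so substitute your provider's current pricing and your own average chunk length.

```python
def embedding_cost_usd(
    num_chunks: int,
    tokens_per_chunk: int,
    usd_per_million_tokens: float,
) -> float:
    """Estimate the API cost of embedding a corpus once."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed inputs: 1M chunks, ~1,000 tokens each, $0.02 per 1M tokens.
cost = embedding_cost_usd(
    num_chunks=1_000_000,
    tokens_per_chunk=1_000,
    usd_per_million_tokens=0.02,
)
print(f"Estimated one-time embedding cost: ${cost:.2f}")
```

The estimate covers a single pass over the corpus; re-indexing after a model change or a chunking change pays it again, which is where a local model or a cache starts to matter.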
Similarity metrics answer the question “how close are these two vectors?” Different metrics measure “closeness” differently, and picking the wrong one can silently degrade your search quality. The rule of thumb: use cosine similarity unless you have a specific reason not to. It is magnitude-invariant (a long document and a short document about the same topic will still be similar), which is exactly what you want for text.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors.

    Returns a value between -1 and 1, where 1 means identical direction
    (identical meaning), 0 means orthogonal (unrelated), and -1 means
    opposite (rare in practice with text embeddings).
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def cosine_similarity_normalized(a: np.ndarray, b: np.ndarray) -> float:
    """For pre-normalized vectors, just dot product"""
    return np.dot(a, b)


# Usage
# get_embedding is not defined in this snippet; it is assumed to be a
# helper that returns an np.ndarray for a piece of text.
emb1 = get_embedding("I love pizza")
emb2 = get_embedding("Pizza is my favorite food")
emb3 = get_embedding("The weather is nice today")

print(f"Similar: {cosine_similarity(emb1, emb2):.3f}")    # ~0.85
print(f"Different: {cosine_similarity(emb1, emb3):.3f}")  # ~0.40
import numpy as np


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L2 distance - lower is more similar.

    Good when vector magnitude carries meaning
    (e.g., document length matters)."""
    return np.linalg.norm(a - b)


def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product - higher is more similar.

    For pre-normalized vectors, this is equivalent to cosine similarity
    but faster (skips the normalization division)."""
    return np.dot(a, b)


def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L1 distance - lower is more similar.

    More robust to outlier dimensions than L2. Rarely used for embeddings
    in practice but useful for sparse feature vectors."""
    return np.sum(np.abs(a - b))


# When to use each:
# - Cosine: General purpose, magnitude-invariant -- the safe default
# - Dot product: For pre-normalized vectors (fastest -- just a sum of products)
# - Euclidean (L2): When magnitude matters (rare for text embeddings)
# - Manhattan (L1): Sparse vectors, high dimensions -- not common in embedding search
#
# Practical tip: Most embedding APIs return normalized vectors, so cosine
# similarity and dot product give identical results. Use dot product for speed.
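The claim that the metrics agree on normalized vectors follows from a short identity: for unit vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, so ranking by smallest L2 distance and ranking by largest cosine give the same order. The sketch below checks this numerically with plain Python lists; the vectors are arbitrary toy values, not real embeddings.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Arbitrary toy vectors, normalized to unit length.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cos_ab = cosine(a, b)
dot_ab = dot(a, b)
dist_sq = euclidean(a, b) ** 2

print(f"cosine      = {cos_ab:.6f}")
print(f"dot product = {dot_ab:.6f}")
print(f"L2 squared  = {dist_sq:.6f}")
print(f"2 - 2*cos   = {2 - 2 * cos_ab:.6f}")
```

The identity breaks as soon as either vector is not unit length, which is why mixing normalized and unnormalized vectors in one index changes the ranking.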
Here is a scenario that pure embedding search fails at: a user asks “error code E-4021” and the most similar embeddings are about generic error handling rather than the specific error code. That is because embeddings capture meaning, not exact strings. Keyword search (BM25) handles this perfectly: it matches the literal text “E-4021.” Hybrid search combines both approaches: semantic similarity for understanding intent, keyword matching for precision. In practice, hybrid search outperforms either approach alone for 80-90% of real-world retrieval tasks.

The alpha parameter controls the blend: 0.7 means 70% semantic weight and 30% keyword weight. Start there, then tune based on your query patterns. If users frequently search for specific identifiers, product names, or codes, lower alpha (more keyword weight). If queries are natural-language questions, raise it.

Combine semantic search with keyword matching:
from typing import List, Tuple

import numpy as np
from rank_bm25 import BM25Okapi

# Document and VectorSearchEngine are not defined in this snippet; they are
# assumed to exist elsewhere, with Document exposing .id and .text and
# VectorSearchEngine exposing .documents, .add_documents() and .search().


class HybridSearchEngine:
    """Combines vector search with BM25 keyword search"""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha  # Weight for semantic vs keyword
        self.vector_engine = VectorSearchEngine()
        self.bm25 = None
        self.tokenized_docs = []

    def add_documents(self, documents: List[Document]):
        """Add documents to both indices"""
        # Vector index
        self.vector_engine.add_documents(documents)

        # BM25 index
        self.tokenized_docs = [
            doc.text.lower().split() for doc in documents
        ]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def search(
        self,
        query: str,
        top_k: int = 5
    ) -> List[Tuple[Document, float]]:
        """Hybrid search combining semantic and keyword scores"""
        # Semantic search (over-fetch so the blend has candidates to reorder)
        semantic_results = self.vector_engine.search(query, top_k=top_k * 2)

        # BM25 keyword search
        tokenized_query = query.lower().split()
        bm25_scores = self.bm25.get_scores(tokenized_query)

        # Index semantic scores by document id
        semantic_scores = {r[0].id: r[1] for r in semantic_results}

        # Normalize BM25 scores to 0-1
        max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1
        normalized_bm25 = {
            self.vector_engine.documents[i].id: score / max_bm25
            for i, score in enumerate(bm25_scores)
        }

        # Combine scores
        combined_scores = {}
        all_doc_ids = set(semantic_scores.keys()) | set(normalized_bm25.keys())
        for doc_id in all_doc_ids:
            semantic = semantic_scores.get(doc_id, 0)
            keyword = normalized_bm25.get(doc_id, 0)
            combined_scores[doc_id] = (
                self.alpha * semantic + (1 - self.alpha) * keyword
            )

        # Sort by combined score
        sorted_ids = sorted(
            combined_scores.keys(),
            key=lambda x: combined_scores[x],
            reverse=True
        )[:top_k]

        # Return documents with scores
        doc_map = {d.id: d for d in self.vector_engine.documents}
        return [
            (doc_map[doc_id], combined_scores[doc_id])
            for doc_id in sorted_ids
        ]


# Usage
hybrid = HybridSearchEngine(alpha=0.7)  # 70% semantic, 30% keyword
hybrid.add_documents(documents)
results = hybrid.search("What is AI and machine learning?")
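To see how alpha changes the outcome for the “E-4021” scenario, the sketch below applies the same weighted blend to two hand-written score tables. The document ids and all score values are made up for illustration; only the blending formula mirrors the engine above.

```python
def blend_scores(
    semantic: dict[str, float],
    keyword: dict[str, float],
    alpha: float,
) -> dict[str, float]:
    """alpha * semantic + (1 - alpha) * keyword, with keyword scores
    scaled to 0-1 by dividing by the largest keyword score."""
    max_kw = max(keyword.values(), default=0.0) or 1.0
    doc_ids = set(semantic) | set(keyword)
    return {
        doc_id: alpha * semantic.get(doc_id, 0.0)
        + (1 - alpha) * keyword.get(doc_id, 0.0) / max_kw
        for doc_id in doc_ids
    }


def top_hit(scores: dict[str, float]) -> str:
    """Return the id of the highest-scoring document."""
    return max(scores, key=scores.get)


# Hypothetical scores for the query "error code E-4021".
semantic = {"generic-error-handling": 0.82, "e4021-runbook": 0.64}
keyword = {"generic-error-handling": 0.0, "e4021-runbook": 7.5}

for alpha in (0.9, 0.7, 0.3):
    blended = blend_scores(semantic, keyword, alpha)
    print(f"alpha={alpha}: top hit is {top_hit(blended)}")
```

With these numbers the generic page wins at a semantic-heavy alpha of 0.9, while the runbook containing the literal code wins once the keyword share reaches 30%, which is the behavior the tuning advice relies on.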
The difference between a hobby project and a production embedding pipeline is how you handle scale. Embedding 100 documents is trivial. Embedding 1 million documents means dealing with rate limits, batching to reduce HTTP overhead, and caching to avoid re-embedding documents that haven’t changed. The patterns below address each of these concerns.
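Batching and rate-limit handling can be sketched without committing to a particular provider. In the example below, `embed_fn` is a hypothetical callable that takes a list of texts and returns one vector per text; in a real pipeline it would wrap an API client or a local model, and the retry would catch that client's specific rate-limit exception rather than a bare `Exception`. The batch size, retry count, and delays are placeholder values.

```python
import time
from typing import Callable, Iterator, Sequence

Vector = list[float]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list[Vector]],
    batch_size: int = 100,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[Vector]:
    """Embed texts batch by batch, retrying with exponential backoff."""
    vectors: list[Vector] = []
    for batch in batched(texts, batch_size):
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:
                # Give up after the last attempt; otherwise wait 1x, 2x, 4x...
                if attempt == max_retries - 1:
                    raise
                sleep(base_delay * 2 ** attempt)
    return vectors


def fake_embed(batch: Sequence[str]) -> list[Vector]:
    """Stand-in for a real model: one-dimensional 'embedding' per text."""
    return [[float(len(text))] for text in batch]


demo = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo)
```

Passing `sleep` as a parameter keeps the backoff testable: a test can record the requested delays instead of actually waiting.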
Embeddings are effectively deterministic: the same text with the same model produces the same vector, apart from tiny floating-point differences that some hosted APIs can introduce. This makes them perfect for caching. If a user re-uploads a document or you re-index your knowledge base, a cache prevents paying for the same embedding twice. The file-based cache below works for development and small datasets. For production, swap in Redis or a database-backed cache for concurrent access and TTL management.
import hashlib
import pickle
from pathlib import Path

import numpy as np
from openai import OpenAI


class CachedEmbedder:
    """Cache embeddings to avoid recomputation.

    Pitfall: Cache key must include the model name. If you switch from
    text-embedding-3-small to text-embedding-3-large, old cached vectors
    have different dimensions and will cause silent errors.
    """

    def __init__(self, cache_dir: str = ".embedding_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.client = OpenAI()

    def _cache_key(self, text: str, model: str) -> str:
        content = f"{model}:{text}"
        return hashlib.sha256(content.encode()).hexdigest()

    def _cache_path(self, key: str) -> Path:
        return self.cache_dir / f"{key}.pkl"

    def get_embedding(
        self,
        text: str,
        model: str = "text-embedding-3-small"
    ) -> np.ndarray:
        """Get embedding, using cache if available"""
        key = self._cache_key(text, model)
        cache_path = self._cache_path(key)

        # Check cache (only unpickle files this process wrote itself;
        # pickle is not safe for untrusted data)
        if cache_path.exists():
            with open(cache_path, "rb") as f:
                return pickle.load(f)

        # Compute embedding
        response = self.client.embeddings.create(
            model=model,
            input=text
        )
        embedding = np.array(response.data[0].embedding)

        # Cache it
        with open(cache_path, "wb") as f:
            pickle.dump(embedding, f)

        return embedding
When off-the-shelf embeddings aren’t cutting it (medical jargon isn’t matching synonyms, legal terms aren’t clustering correctly, or your domain-specific acronyms are treated as noise), fine-tuning adapts the model to your vocabulary and similarity relationships. The approach below uses contrastive learning: you provide pairs of texts with similarity scores, and the model adjusts its internal weights so that your domain’s notion of “similar” is reflected in the embedding space. Even 500-1000 labeled pairs can produce meaningful improvements.

For domain-specific applications, fine-tune embedding models:
from typing import List, Tuple

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader


def fine_tune_embeddings(
    model_name: str,
    training_data: List[Tuple[str, str, float]],  # (text1, text2, similarity)
    output_path: str,
    epochs: int = 3
):
    """Fine-tune an embedding model on domain data"""
    # Load base model
    model = SentenceTransformer(model_name)

    # Prepare training data
    train_examples = [
        InputExample(texts=[t1, t2], label=sim)
        for t1, t2, sim in training_data
    ]
    train_dataloader = DataLoader(
        train_examples,
        shuffle=True,
        batch_size=16
    )

    # Use cosine similarity loss
    train_loss = losses.CosineSimilarityLoss(model)

    # Train
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        warmup_steps=100,
        output_path=output_path
    )
    return model


# Example training data for a medical domain
training_data = [
    ("patient has fever", "elevated temperature", 0.9),
    ("patient has fever", "broken leg", 0.1),
    ("chest pain", "cardiac symptoms", 0.85),
    # ... more examples
]

model = fine_tune_embeddings(
    "sentence-transformers/all-MiniLM-L6-v2",
    training_data,
    "medical-embeddings"
)
Queries and documents at different abstraction levels. A user asks “how do I make my app faster?” but the relevant document chunk says “optimize database query performance by adding indexes.” The semantic gap between the high-level question and the specific answer reduces similarity scores. Mitigation: generate hypothetical questions for each chunk at indexing time (the mirror image of HyDE, which generates a hypothetical answer at query time and embeds that instead of the raw query), or index at multiple granularity levels.

Short queries against long chunks. A two-word query like “refund policy” produces an embedding with very little context, which matches poorly against 500-word chunks. The chunk’s embedding blends many topics, diluting the signal. Either use shorter chunks for short-query use cases, or apply query expansion (“refund policy” becomes “what is the refund and return policy for customers who want their money back”).

Embedding model version mismatches. You embedded 1M documents with text-embedding-ada-002, then switched to text-embedding-3-small for new documents. These models produce vectors in different embedding spaces, so cosine similarity between them is meaningless. Every vector in your database must come from the same model. Model changes require full re-indexing.

Near-duplicate detection thresholds. Two documents are 95% identical except for a date. Their cosine similarity will be 0.98+. But a 0.95 threshold intended for semantic caching will also match “what is our return policy?” with “what is our shipping policy?”: similar structure, completely different intent. Tune your threshold on your actual data: plot similarity distributions for true matches vs. false matches and pick the threshold that minimizes overlap.

Embedding normalization assumptions. OpenAI models return normalized vectors (unit length), so dot product equals cosine similarity. Some open-source models (older sentence-transformers) do NOT normalize by default. If you skip normalization and use dot product, longer documents get artificially higher scores. Always pass normalize_embeddings=True when using sentence-transformers, or normalize manually.
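The threshold-tuning advice can be turned into a small procedure: collect similarity scores for pairs you have labeled as true matches and as false matches, then pick the cutoff that misclassifies the fewest pairs. The sketch below does this by brute force over the observed scores. The function name and the score lists are hypothetical toy data; real distributions usually overlap, in which case the returned error count will be above zero.

```python
def best_threshold(
    true_match_scores: list[float],
    false_match_scores: list[float],
) -> tuple[float, int]:
    """Return (threshold, errors) minimizing misclassified pairs.

    A pair counts as a match when its score is >= threshold. Ties are
    resolved in favor of the lowest qualifying threshold.
    """
    candidates = sorted(set(true_match_scores) | set(false_match_scores))
    best_t, best_errors = candidates[0], None
    for t in candidates:
        missed = sum(s < t for s in true_match_scores)    # true matches rejected
        leaked = sum(s >= t for s in false_match_scores)  # false matches accepted
        errors = missed + leaked
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors


# Toy labeled scores: near-duplicates vs. "same structure, different intent".
true_scores = [0.99, 0.98, 0.97, 0.985]
false_scores = [0.95, 0.96, 0.90, 0.93]

threshold, errors = best_threshold(true_scores, false_scores)
print(f"threshold={threshold}, misclassified pairs={errors}")
```

Counting both error types equally is a design choice; for semantic caching a false match (serving the wrong cached answer) is usually worse than a miss, so weighting `leaked` more heavily is a reasonable variation.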