December 2025 Update: Comprehensive guide to embeddings including model selection, dimensionality, fine-tuning, and production patterns.

What Are Embeddings?

If an LLM is the brain, embeddings are the language it uses to think about similarity. An embedding converts a chunk of text into a list of numbers (a vector) where similar meanings end up close together in number-space. It is like plotting cities on a map: New York and Boston end up near each other, while Tokyo is far away. Except instead of two geographic coordinates, you have hundreds or thousands of dimensions (1536 for OpenAI’s text-embedding-3-small) capturing meaning, topic, tone, and intent.

Embeddings convert text into dense numerical vectors that capture semantic meaning:
Text                          Embedding Vector
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"I love pizza"         →      [0.12, -0.34, 0.87, ..., 0.23]
"Pizza is my favorite" →      [0.11, -0.32, 0.85, ..., 0.21]  ← Similar!
"I hate broccoli"      →      [-0.45, 0.12, -0.33, ..., 0.67] ← Different

| Use Case | What Embeddings Enable |
| --- | --- |
| Semantic Search | Find documents by meaning, not keywords |
| RAG | Retrieve relevant context for LLMs |
| Clustering | Group similar content automatically |
| Recommendations | Find similar items/users |
| Deduplication | Detect near-duplicate content |

Embedding Models Comparison

Choosing an embedding model is a three-way trade-off between quality, cost, and speed. OpenAI’s text-embedding-3-small is the default choice for most teams — it is cheap, fast, and good enough. Move to text-embedding-3-large when you need higher accuracy (legal, medical). Go open-source with BGE or MiniLM when cost is critical at scale (millions of documents) or you cannot send data to an external API. The table below gives the concrete numbers.
# Model comparison (December 2024)
EMBEDDING_MODELS = {
    # OpenAI
    "text-embedding-3-small": {
        "dimensions": 1536,
        "max_tokens": 8191,
        "cost_per_1m": 0.02,
        "quality": "good",
        "speed": "fast"
    },
    "text-embedding-3-large": {
        "dimensions": 3072,
        "max_tokens": 8191,
        "cost_per_1m": 0.13,
        "quality": "excellent",
        "speed": "medium"
    },
    
    # Cohere
    "embed-english-v3.0": {
        "dimensions": 1024,
        "max_tokens": 512,
        "cost_per_1m": 0.10,
        "quality": "excellent",
        "speed": "fast"
    },
    
    # Open Source (via HuggingFace)
    "BAAI/bge-large-en-v1.5": {
        "dimensions": 1024,
        "max_tokens": 512,
        "cost_per_1m": 0,  # Free if self-hosted
        "quality": "excellent",
        "speed": "varies"
    },
    "sentence-transformers/all-MiniLM-L6-v2": {
        "dimensions": 384,
        "max_tokens": 256,
        "cost_per_1m": 0,
        "quality": "good",
        "speed": "very fast"
    }
}
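
To make the trade-off concrete, the sketch below turns the per-token price and dimension count into a one-off API cost and a raw storage footprint. The helper name `estimate_footprint`, the 500-token average chunk size, and the float32 (4 bytes per value) storage assumption are illustrative choices, not figures from any provider; index overhead in a real vector database is not included.

```python
# Rough cost/storage estimate. Prices and dimensions are copied from the
# comparison above; everything else is an assumption for illustration.
MODELS = {
    "text-embedding-3-small": {"dimensions": 1536, "cost_per_1m": 0.02},
    "text-embedding-3-large": {"dimensions": 3072, "cost_per_1m": 0.13},
    "sentence-transformers/all-MiniLM-L6-v2": {"dimensions": 384, "cost_per_1m": 0.0},
}

def estimate_footprint(model: str, num_chunks: int, avg_tokens: int = 500) -> dict:
    """Estimate API cost (USD) and raw vector storage (GB) for a corpus."""
    spec = MODELS[model]
    total_tokens = num_chunks * avg_tokens
    api_cost = total_tokens / 1_000_000 * spec["cost_per_1m"]
    storage_bytes = num_chunks * spec["dimensions"] * 4  # float32 = 4 bytes
    return {
        "api_cost_usd": round(api_cost, 2),
        "storage_gb": round(storage_bytes / 1e9, 2),
    }

for name in MODELS:
    print(name, estimate_footprint(name, num_chunks=1_000_000))
```

Self-hosted models show an API cost of zero here, but the GPU or CPU time to run them is a real cost that this sketch does not model.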

Getting Embeddings

OpenAI Embeddings

from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(
    text: str,
    model: str = "text-embedding-3-small"
) -> np.ndarray:
    """Get embedding for a single text"""
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return np.array(response.data[0].embedding)

def get_embeddings_batch(
    texts: list[str],
    model: str = "text-embedding-3-small"
) -> list[np.ndarray]:
    """Get embeddings for multiple texts efficiently"""
    response = client.embeddings.create(
        model=model,
        input=texts
    )
    return [np.array(e.embedding) for e in response.data]

# Usage
embedding = get_embedding("What is machine learning?")
print(f"Dimensions: {len(embedding)}")

# Batch processing
texts = ["Hello world", "How are you?", "Machine learning is cool"]
embeddings = get_embeddings_batch(texts)

Dimensionality Reduction

This is one of the most underused features of OpenAI’s embedding models. You can request a 256-dimensional embedding instead of the full 1536, and OpenAI applies Matryoshka Representation Learning to give you a smaller vector that retains most of the quality. The practical impact is huge: 256 dimensions instead of 1536 means 6x less storage in your vector database, 6x faster similarity search, and cheaper pgvector indexes — all for a quality drop that is often less than 5% on retrieval benchmarks. OpenAI’s text-embedding-3 models support native dimension reduction:
def get_embedding_with_dimensions(
    text: str,
    dimensions: int = 256,
    model: str = "text-embedding-3-small"
) -> np.ndarray:
    """Get embedding with reduced dimensions"""
    response = client.embeddings.create(
        model=model,
        input=text,
        dimensions=dimensions  # 256, 512, 1024, 1536...
    )
    return np.array(response.data[0].embedding)

# Smaller embeddings = faster search, less storage
small_embedding = get_embedding_with_dimensions("Hello", dimensions=256)
print(f"Reduced dimensions: {len(small_embedding)}")
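
If you have already stored full-size vectors, an alternative to re-calling the API is to shorten them yourself: keep the leading values and re-normalize to unit length. The sketch below assumes a model trained so that leading dimensions carry most of the signal (as the text-embedding-3 family is described above); the function name `truncate_and_normalize` is hypothetical, and the vector is random stand-in data rather than a real API response.

```python
import numpy as np

def truncate_and_normalize(embedding: np.ndarray, dimensions: int) -> np.ndarray:
    """Keep the first `dimensions` values and rescale to unit length."""
    truncated = embedding[:dimensions]
    norm = np.linalg.norm(truncated)
    if norm == 0:
        return truncated  # avoid division by zero on an all-zero vector
    return truncated / norm

# Stand-in for a stored 1536-dim embedding (random data, unit length)
rng = np.random.default_rng(0)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)

small = truncate_and_normalize(full, 256)
print(small.shape, round(float(np.linalg.norm(small)), 6))

# Raw storage per vector at float32
print(f"{1536 * 4} bytes -> {256 * 4} bytes ({1536 // 256}x smaller)")
```

Re-normalizing matters because a truncated vector is no longer unit length, so dot product would stop matching cosine similarity.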

Open Source Embeddings

Open-source embedding models run on your own hardware, which means zero API costs and no data leaving your network. The trade-off is that you need to manage the infrastructure: GPU for fast inference (or accept slower CPU speeds), model loading, and batching. For teams processing millions of documents, the math usually favors self-hosted: embedding 1M chunks costs ~$20 with OpenAI but effectively $0 with a local model (after the one-time GPU cost).
from sentence_transformers import SentenceTransformer
import numpy as np

class LocalEmbedder:
    """Local embedding using sentence-transformers.
    
    Tip: BGE v1.5 models recommend prefixing short retrieval queries (not the
    documents) with "Represent this sentence for searching relevant passages: ".
    Forgetting this prefix can drop recall by 10-15%. Always check the model card.
    """
    
    def __init__(self, model_name: str = "BAAI/bge-large-en-v1.5"):
        self.model = SentenceTransformer(model_name)
    
    def embed(self, text: str) -> np.ndarray:
        return self.model.encode(text, normalize_embeddings=True)
    
    def embed_batch(self, texts: list[str]) -> np.ndarray:
        return self.model.encode(
            texts,
            normalize_embeddings=True,
            batch_size=32,
            show_progress_bar=True
        )
    
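    def embed_with_instruction(
        self,
        text: str,
        instruction: str = "Represent this sentence for searching relevant passages:"
    ) -> np.ndarray:
        """Embed a query with an instruction prefix (use for queries, not documents).

        The default is the query instruction from the BGE English v1.5 model
        card; other models expect a different prefix or none at all.
        """
        return self.model.encode(
            f"{instruction} {text}",
            normalize_embeddings=True
        )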

# Usage
embedder = LocalEmbedder()
embedding = embedder.embed("What is artificial intelligence?")

Similarity Metrics

Similarity metrics answer the question “how close are these two vectors?” Different metrics measure “closeness” differently, and picking the wrong one can silently degrade your search quality. The rule of thumb: use cosine similarity unless you have a specific reason not to. It is magnitude-invariant (a long document and a short document about the same topic will still be similar), which is exactly what you want for text.

Cosine Similarity

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors.
    
    Returns a value between -1 and 1, where 1 means identical direction
    (identical meaning), 0 means orthogonal (unrelated), and -1 means
    opposite (rare in practice with text embeddings).
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_similarity_normalized(a: np.ndarray, b: np.ndarray) -> float:
    """For pre-normalized vectors, just dot product"""
    return np.dot(a, b)

# Usage
emb1 = get_embedding("I love pizza")
emb2 = get_embedding("Pizza is my favorite food")
emb3 = get_embedding("The weather is nice today")

print(f"Similar: {cosine_similarity(emb1, emb2):.3f}")  # ~0.85
print(f"Different: {cosine_similarity(emb1, emb3):.3f}")  # ~0.40
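
The magnitude-invariance claim is easy to check without calling any API. In this sketch the three vectors are made-up toy values, not real embeddings: `b` points in the same direction as `a` but is ten times longer, and `c` points somewhere else.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                      # same direction, 10x the magnitude
c = np.array([3.0, -1.0, 0.5])  # different direction

print("cosine(a, b):   ", cosine_similarity(a, b))       # essentially 1.0
print("euclidean(a, b):", float(np.linalg.norm(a - b)))  # large, despite same direction
print("cosine(a, c):   ", cosine_similarity(a, c))       # clearly lower
```

Cosine treats `a` and `b` as the same, while Euclidean distance penalizes the length difference. That is the behavior the rule of thumb above relies on.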

Other Metrics

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L2 distance - lower is more similar.
    Good when vector magnitude carries meaning (e.g., document length matters)."""
    return np.linalg.norm(a - b)

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product - higher is more similar.
    For pre-normalized vectors, this is equivalent to cosine similarity
    but faster (skips the normalization division)."""
    return np.dot(a, b)

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L1 distance - lower is more similar.
    More robust to outlier dimensions than L2. Rarely used for embeddings
    in practice but useful for sparse feature vectors."""
    return np.sum(np.abs(a - b))

# When to use each:
# - Cosine: General purpose, magnitude-invariant -- the safe default
# - Dot product: For pre-normalized vectors (fastest -- just a sum of products)
# - Euclidean (L2): When magnitude matters (rare for text embeddings)
# - Manhattan (L1): Sparse vectors, high dimensions -- not common in embedding search
#
# Practical tip: Most embedding APIs return normalized vectors, so cosine
# similarity and dot product give identical results. Use dot product for speed.
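
For unit-length vectors the metrics are tied together by the identity ‖a − b‖² = 2 − 2(a · b), so ranking by highest dot product and ranking by smallest L2 distance give the same order. The sketch below checks this numerically on random unit vectors (stand-in data, not real embeddings).

```python
import numpy as np

rng = np.random.default_rng(42)

# Five random "documents" and one "query", all scaled to unit length
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=8)
query /= np.linalg.norm(query)

dot = docs @ query                            # equals cosine for unit vectors
l2_sq = np.sum((docs - query) ** 2, axis=1)   # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
identity_holds = np.allclose(l2_sq, 2 - 2 * dot)

# Best-first order by dot product vs. best-first order by distance
same_ranking = np.array_equal(np.argsort(-dot), np.argsort(l2_sq))

print(identity_holds, same_ranking)
```

The identity breaks as soon as vectors are not normalized, which is why the choice of metric only becomes a real decision for un-normalized embeddings.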

Building a Similarity Search Engine

import numpy as np
from typing import List, Optional, Tuple
from dataclasses import dataclass
from openai import OpenAI

@dataclass
class Document:
    id: str
    text: str
    embedding: Optional[np.ndarray] = None  # filled in by add_documents
    metadata: Optional[dict] = None

class VectorSearchEngine:
    """Simple in-memory vector search"""
    
    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        self.model = embedding_model
        self.documents: List[Document] = []
        self.embeddings: np.ndarray = None
        self.client = OpenAI()
    
    def add_documents(self, documents: List[Document]):
        """Add documents and compute embeddings"""
        texts = [doc.text for doc in documents]
        
        # Batch embed
        response = self.client.embeddings.create(
            model=self.model,
            input=texts
        )
        
        for doc, emb_data in zip(documents, response.data):
            doc.embedding = np.array(emb_data.embedding)
            self.documents.append(doc)
        
        # Build matrix for fast search
        self._rebuild_index()
    
    def _rebuild_index(self):
        """Rebuild the embedding matrix"""
        if self.documents:
            self.embeddings = np.vstack([
                doc.embedding for doc in self.documents
            ])
    
    def search(
        self,
        query: str,
        top_k: int = 5,
        threshold: float = 0.0
    ) -> List[Tuple[Document, float]]:
        """Search for similar documents"""
        if self.embeddings is None:
            return []  # nothing indexed yet

        # Embed query
        response = self.client.embeddings.create(
            model=self.model,
            input=query
        )
        query_embedding = np.array(response.data[0].embedding)
        
        # Compute similarities (matrix operation)
        similarities = np.dot(self.embeddings, query_embedding) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_embedding)
        )
        
        # Get top-k
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            score = similarities[idx]
            if score >= threshold:
                results.append((self.documents[idx], float(score)))
        
        return results

# Usage
engine = VectorSearchEngine()

documents = [
    Document(id="1", text="Machine learning is a subset of AI"),
    Document(id="2", text="Deep learning uses neural networks"),
    Document(id="3", text="Python is a programming language"),
    Document(id="4", text="Natural language processing handles text"),
]

engine.add_documents(documents)

results = engine.search("What is artificial intelligence?", top_k=3)
for doc, score in results:
    print(f"[{score:.3f}] {doc.text}")
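
The `search` method above sorts every score with `np.argsort`, which is fine for a few thousand documents. For larger in-memory indexes, one common optimization is `np.argpartition`, which finds the top k without fully sorting the rest. This is a sketch of that idea; `top_k_indices` is a hypothetical helper and the scores are made-up values.

```python
import numpy as np

def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest scores, best first."""
    k = min(k, len(scores))
    if k == 0:
        return np.array([], dtype=int)
    # argpartition places the k largest scores in the last k slots (unordered)
    candidates = np.argpartition(scores, -k)[-k:]
    # fully sort only those k candidates, highest score first
    return candidates[np.argsort(scores[candidates])[::-1]]

scores = np.array([0.12, 0.91, 0.45, 0.78, 0.33])
print(top_k_indices(scores, 3))
```

Beyond roughly a million vectors, an approximate nearest-neighbor index in a vector database is usually a better fit than any brute-force scan.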

Hybrid Search: Embeddings + Keywords

Here is a scenario where pure embedding search fails: a user asks “error code E-4021” and the most similar embeddings are about generic error handling rather than the specific error code. That is because embeddings capture meaning, not exact strings. Keyword search (BM25) handles this well because it matches the literal text “E-4021.” Hybrid search combines both approaches: semantic similarity for understanding intent, keyword matching for precision. In practice, hybrid search often outperforms either approach alone on real-world retrieval tasks.

The alpha parameter controls the blend: 0.7 means 70% semantic weight and 30% keyword weight. Start there, then tune based on your query patterns. If users frequently search for specific identifiers, product names, or codes, lower alpha (more keyword weight). If queries are natural-language questions, raise it. Note that the class below defaults to alpha=0.5, so pass alpha=0.7 explicitly, as the usage example does.

Combine semantic search with keyword matching:
from rank_bm25 import BM25Okapi
import numpy as np

class HybridSearchEngine:
    """Combines vector search with BM25 keyword search"""
    
    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha  # Weight for semantic vs keyword
        self.vector_engine = VectorSearchEngine()
        self.bm25 = None
        self.tokenized_docs = []
    
    def add_documents(self, documents: List[Document]):
        """Add documents to both indices"""
        # Vector index
        self.vector_engine.add_documents(documents)
        
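        # BM25 index: extend (not replace) the token lists so BM25 row i stays
        # aligned with self.vector_engine.documents[i] across repeated calls
        self.tokenized_docs.extend(
            doc.text.lower().split() for doc in documents
        )
        self.bm25 = BM25Okapi(self.tokenized_docs)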
    
    def search(
        self,
        query: str,
        top_k: int = 5
    ) -> List[Tuple[Document, float]]:
        """Hybrid search combining semantic and keyword scores"""
        
        # Semantic search
        semantic_results = self.vector_engine.search(query, top_k=top_k * 2)
        
        # BM25 keyword search
        tokenized_query = query.lower().split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        
        # Normalize scores
        semantic_scores = {r[0].id: r[1] for r in semantic_results}
        
        # Normalize BM25 scores to 0-1
        max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1
        normalized_bm25 = {
            self.vector_engine.documents[i].id: score / max_bm25
            for i, score in enumerate(bm25_scores)
        }
        
        # Combine scores
        combined_scores = {}
        all_doc_ids = set(semantic_scores.keys()) | set(normalized_bm25.keys())
        
        for doc_id in all_doc_ids:
            semantic = semantic_scores.get(doc_id, 0)
            keyword = normalized_bm25.get(doc_id, 0)
            combined_scores[doc_id] = (
                self.alpha * semantic + (1 - self.alpha) * keyword
            )
        
        # Sort by combined score
        sorted_ids = sorted(
            combined_scores.keys(),
            key=lambda x: combined_scores[x],
            reverse=True
        )[:top_k]
        
        # Return documents with scores
        doc_map = {d.id: d for d in self.vector_engine.documents}
        return [
            (doc_map[doc_id], combined_scores[doc_id])
            for doc_id in sorted_ids
        ]

# Usage
hybrid = HybridSearchEngine(alpha=0.7)  # 70% semantic, 30% keyword
hybrid.add_documents(documents)
results = hybrid.search("What is AI and machine learning?")
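
To see how alpha changes the outcome for the “E-4021” scenario, the sketch below applies the same weighted blend to two documents. The document ids and all scores are hypothetical values chosen for illustration; `blend` is a stand-alone helper, not part of the class above.

```python
def blend(semantic: dict, keyword: dict, alpha: float) -> list:
    """Rank doc ids by alpha * semantic + (1 - alpha) * keyword, best first."""
    ids = set(semantic) | set(keyword)
    scores = {
        doc_id: alpha * semantic.get(doc_id, 0.0)
        + (1 - alpha) * keyword.get(doc_id, 0.0)
        for doc_id in ids
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical normalized scores for the query "error code E-4021"
semantic = {"generic-error-handling": 0.85, "e4021-runbook": 0.60}
keyword = {"generic-error-handling": 0.05, "e4021-runbook": 1.00}

for alpha in (1.0, 0.9, 0.7):
    print(alpha, blend(semantic, keyword, alpha))
```

With these numbers the generic page wins at alpha 1.0 and 0.9, and the exact-match runbook takes first place at 0.7, which is the effect of giving the keyword signal enough weight.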

Embedding Optimization

The difference between a hobby project and a production embedding pipeline is how you handle scale. Embedding 100 documents is trivial. Embedding 1 million documents means dealing with rate limits, batching to reduce HTTP overhead, and caching to avoid re-embedding documents that haven’t changed. The patterns below address each of these concerns.

Batching and Rate Limiting

import asyncio
import time
from typing import List

import numpy as np
from openai import AsyncOpenAI

class OptimizedEmbedder:
    """Efficient batch embedding with rate limiting"""
    
    def __init__(
        self,
        model: str = "text-embedding-3-small",
        batch_size: int = 100,
        requests_per_minute: int = 3000
    ):
        self.model = model
        self.batch_size = batch_size
        self.min_interval = 60 / requests_per_minute
        self.client = AsyncOpenAI()
        self.last_request_time = 0
    
    async def _rate_limit(self):
        """Ensure we don't exceed rate limits"""
        elapsed = time.time() - self.last_request_time
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)
        self.last_request_time = time.time()
    
    async def embed_batch(self, texts: List[str]) -> List[np.ndarray]:
        """Embed a batch of texts"""
        await self._rate_limit()
        
        response = await self.client.embeddings.create(
            model=self.model,
            input=texts
        )
        
        return [np.array(e.embedding) for e in response.data]
    
    async def embed_all(self, texts: List[str]) -> List[np.ndarray]:
        """Embed all texts with batching"""
        all_embeddings = []
        
        for i in range(0, len(texts), self.batch_size):
            batch = texts[i:i + self.batch_size]
            embeddings = await self.embed_batch(batch)
            all_embeddings.extend(embeddings)
            
            print(f"Processed {min(i + self.batch_size, len(texts))}/{len(texts)}")
        
        return all_embeddings

# Usage
async def main():
    embedder = OptimizedEmbedder()
    
    texts = ["Text " + str(i) for i in range(1000)]
    embeddings = await embedder.embed_all(texts)
    print(f"Embedded {len(embeddings)} texts")

asyncio.run(main())
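
Client-side pacing like the above reduces rate-limit errors but does not eliminate them, so production pipelines usually also retry failed requests with exponential backoff. The sketch below is a generic, synchronous version: `call_with_backoff` is a hypothetical helper, the failing call is simulated, and catching every `Exception` is a simplification (in real code you would likely catch only the client's rate-limit error).

```python
import random
import time

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)

# Demo: a fake embedding call that fails twice before succeeding
attempts = {"count": 0}

def flaky_embed():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return [0.1, 0.2, 0.3]

waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_embed, base_delay=0.5, sleep=waits.append)
print(result, [round(w, 1) for w in waits])
```

The `sleep` parameter is injectable only so the demo runs instantly; by default the helper really sleeps.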

Caching Embeddings

Embeddings are effectively deterministic: the same text with the same model produces the same vector, up to tiny floating-point differences. This makes them perfect for caching. If a user re-uploads a document or you re-index your knowledge base, a cache prevents paying for the same embedding twice. The file-based cache below works for development and small datasets. For production, swap in Redis or a database-backed cache for concurrent access and TTL management.
import hashlib
import pickle
from pathlib import Path

import numpy as np
from openai import OpenAI

class CachedEmbedder:
    """Cache embeddings to avoid recomputation.
    
    Pitfall: Cache key must include the model name. If you switch from
    text-embedding-3-small to text-embedding-3-large, old cached vectors
    have different dimensions and will cause silent errors.
    """
    
    def __init__(self, cache_dir: str = ".embedding_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.client = OpenAI()
    
    def _cache_key(self, text: str, model: str) -> str:
        content = f"{model}:{text}"
        return hashlib.sha256(content.encode()).hexdigest()
    
    def _cache_path(self, key: str) -> Path:
        return self.cache_dir / f"{key}.pkl"
    
    def get_embedding(
        self,
        text: str,
        model: str = "text-embedding-3-small"
    ) -> np.ndarray:
        """Get embedding, using cache if available"""
        key = self._cache_key(text, model)
        cache_path = self._cache_path(key)
        
        # Check cache
        if cache_path.exists():
            with open(cache_path, "rb") as f:
                return pickle.load(f)
        
        # Compute embedding
        response = self.client.embeddings.create(
            model=model,
            input=text
        )
        embedding = np.array(response.data[0].embedding)
        
        # Cache it
        with open(cache_path, "wb") as f:
            pickle.dump(embedding, f)
        
        return embedding
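
If you also use reduced dimensions, the same pitfall applies: a 256-dimension vector and a 1536-dimension vector for the same text and model must not share a cache entry. A minimal sketch of a key that covers both, assuming a hypothetical `cache_key` helper that serializes the parameters as JSON before hashing:

```python
import hashlib
import json
from typing import Optional

def cache_key(text: str, model: str, dimensions: Optional[int] = None) -> str:
    """Hash every parameter that changes the resulting vector."""
    payload = json.dumps(
        {"model": model, "dimensions": dimensions, "text": text},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k_small = cache_key("refund policy", "text-embedding-3-small")
k_large = cache_key("refund policy", "text-embedding-3-large")
k_small_256 = cache_key("refund policy", "text-embedding-3-small", dimensions=256)

# Three distinct keys for three distinct vectors
print(len({k_small, k_large, k_small_256}))
```

Serializing to JSON rather than joining with a separator avoids collisions when the text itself happens to contain the separator character.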

Fine-Tuning Embeddings

When off-the-shelf embeddings aren’t cutting it (medical jargon isn’t matching synonyms, legal terms aren’t clustering correctly, or your domain-specific acronyms are treated as noise), fine-tuning adapts the model to your vocabulary and similarity relationships. The approach below trains on labeled pairs: you provide pairs of texts with target similarity scores, and a cosine-similarity loss adjusts the model’s weights so that your domain’s notion of “similar” is reflected in the embedding space. Even 500-1000 labeled pairs can produce meaningful improvements.

For domain-specific applications, fine-tune embedding models:
from typing import List, Tuple

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def fine_tune_embeddings(
    model_name: str,
    training_data: List[Tuple[str, str, float]],  # (text1, text2, similarity)
    output_path: str,
    epochs: int = 3
):
    """Fine-tune an embedding model on domain data"""
    
    # Load base model
    model = SentenceTransformer(model_name)
    
    # Prepare training data
    train_examples = [
        InputExample(texts=[t1, t2], label=sim)
        for t1, t2, sim in training_data
    ]
    
    train_dataloader = DataLoader(
        train_examples,
        shuffle=True,
        batch_size=16
    )
    
    # Use cosine similarity loss
    train_loss = losses.CosineSimilarityLoss(model)
    
    # Train
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        warmup_steps=100,
        output_path=output_path
    )
    
    return model

# Example training data for a medical domain
training_data = [
    ("patient has fever", "elevated temperature", 0.9),
    ("patient has fever", "broken leg", 0.1),
    ("chest pain", "cardiac symptoms", 0.85),
    # ... more examples
]

model = fine_tune_embeddings(
    "sentence-transformers/all-MiniLM-L6-v2",
    training_data,
    "medical-embeddings"
)

Embedding Model Selection Framework

Choosing an embedding model is a three-way trade-off. This decision table covers the most common scenarios.

| Scenario | Recommended Model | Dimensions | Why |
| --- | --- | --- | --- |
| General-purpose, getting started | text-embedding-3-small | 1536 (or 256 reduced) | Best cost/quality ratio, no infrastructure to manage |
| Need highest retrieval accuracy | text-embedding-3-large | 3072 (or 1024 reduced) | 3-5% better on benchmarks than small, worth it for legal/medical |
| Data cannot leave your network | BAAI/bge-large-en-v1.5 (self-hosted) | 1024 | Top-tier open-source, runs on a single GPU |
| Millions of documents, cost-sensitive | all-MiniLM-L6-v2 (self-hosted) | 384 | Fastest open-source option, 384 dims means minimal storage |
| Multilingual content | Cohere embed-multilingual-v3.0 | 1024 | Best multilingual support among commercial options |
| Code search (functions, docstrings) | text-embedding-3-small with code-specific chunking | 1536 | General models handle code adequately; specialized code models (CodeBERT) rarely outperform in practice |

Decision flowchart:
  1. Can your data leave your network? If no, self-host BGE-large or E5-large.
  2. If yes, is your budget under $50/month for embeddings? Use text-embedding-3-small with reduced dimensions (256 or 512).
  3. If budget is flexible, is retrieval accuracy critical (legal, medical, compliance)? Use text-embedding-3-large.
  4. Are you embedding in multiple languages? Use Cohere multilingual or BGE-m3 (open-source multilingual).
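
The flowchart can be sketched as a small function. This is one possible reading of the four questions, not a definitive rule: the function name and parameters are hypothetical, and treating the multilingual question as an override checked right after the network constraint is an assumption made here.

```python
def recommend_model(
    data_can_leave_network: bool,
    monthly_budget_usd: float,
    accuracy_critical: bool = False,
    multilingual: bool = False,
) -> str:
    """Encode the decision flowchart; returns a model name as a plain string."""
    # 1. Data must stay in-house: self-host an open-source model
    if not data_can_leave_network:
        return "BAAI/bge-m3" if multilingual else "BAAI/bge-large-en-v1.5"
    # 4. (assumed override) multilingual content on a hosted API
    if multilingual:
        return "embed-multilingual-v3.0"
    # 2. Tight budget: small model with reduced dimensions
    if monthly_budget_usd < 50:
        return "text-embedding-3-small (256 or 512 dims)"
    # 3. Flexible budget and accuracy-critical domain
    if accuracy_critical:
        return "text-embedding-3-large"
    return "text-embedding-3-small"

print(recommend_model(data_can_leave_network=True, monthly_budget_usd=30))
print(recommend_model(data_can_leave_network=False, monthly_budget_usd=500))
```

Whatever the function suggests, benchmark the shortlisted models on a sample of your own queries before committing, since switching later means re-embedding everything.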

Similarity Search Edge Cases

Queries and documents at different abstraction levels. A user asks “how do I make my app faster?” but the relevant document chunk says “optimize database query performance by adding indexes.” The semantic gap between the high-level question and the specific answer reduces similarity scores. Mitigation: generate hypothetical questions for each chunk at indexing time and embed them alongside the chunk, use HyDE (generate a hypothetical answer to the query at search time and embed that instead of the raw query), or index at multiple granularity levels.

Short queries against long chunks. A two-word query like “refund policy” carries very little context, so its embedding matches poorly against 500-word chunks. The chunk’s embedding blends many topics, diluting the signal. Either use shorter chunks for short-query use cases, or apply query expansion (“refund policy” becomes “what is the refund and return policy for customers who want their money back”).

Embedding model version mismatches. You embedded 1M documents with text-embedding-ada-002, then switched to text-embedding-3-small for new documents. These models produce vectors in different embedding spaces, so cosine similarity between them is meaningless. Every vector in your database must come from the same model. Model changes require full re-indexing.

Near-duplicate detection thresholds. Two documents are 95% identical except for a date. Their cosine similarity will be 0.98+. But a 0.95 threshold intended for semantic caching will also match “what is our return policy?” with “what is our shipping policy?”, which have similar structure but completely different intent. Tune your threshold on your actual data: plot similarity distributions for true matches vs. false matches and pick the threshold that minimizes overlap.

Embedding normalization assumptions. OpenAI models return normalized vectors (unit length), so dot product equals cosine similarity. Some open-source models (older sentence-transformers) do NOT normalize by default. If you skip normalization and use dot product, vectors with larger magnitude get artificially higher scores. Always check normalize_embeddings=True when using sentence-transformers, or normalize manually.
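
The threshold-tuning advice can be sketched without plotting: given similarity scores for hand-labeled true and false matches, try each observed score as a cutoff and keep the one that misclassifies the fewest pairs. The function name `pick_threshold` and the sample scores are hypothetical; a real tuning set should be far larger and drawn from your own traffic.

```python
import numpy as np

def pick_threshold(true_scores, false_scores) -> float:
    """Return the candidate threshold with the fewest misclassified pairs."""
    true_scores = np.asarray(true_scores, dtype=float)
    false_scores = np.asarray(false_scores, dtype=float)
    candidates = np.unique(np.concatenate([true_scores, false_scores]))

    best_threshold, best_errors = None, None
    for t in candidates:
        missed = int(np.sum(true_scores < t))    # true matches rejected
        wrong = int(np.sum(false_scores >= t))   # false matches accepted
        errors = missed + wrong
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = float(t), errors
    return best_threshold

# Hypothetical cosine similarities from a hand-labeled sample
true_matches = [0.99, 0.98, 0.97, 0.985]   # genuine near-duplicates
false_matches = [0.95, 0.96, 0.93, 0.90]   # similar wording, different intent

print(pick_threshold(true_matches, false_matches))
```

On this sample a cutoff of 0.95 would accept two of the false matches, while the selected value separates the two groups cleanly. If the cost of a wrong cache hit is higher than a miss, weight `wrong` more heavily than `missed`.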

Key Takeaways

Choose the Right Model

Balance quality, speed, and cost for your use case

Normalize Your Vectors

Pre-normalize for faster similarity search

Hybrid Search Works

Combine semantic + keyword for best results

Cache Everything

Embeddings are deterministic - cache aggressively

What’s Next

AI Streaming

Master streaming responses for real-time AI applications