Why Vector Databases Matter
Every AI product with memory or search uses vector databases. ChatGPT’s memory, Notion AI, Perplexity’s search, GitHub Copilot’s context: all vector databases under the hood. A traditional database answers “find the row where name = ‘John’.” A vector database answers “find me things similar to this,” and that’s the fundamental operation behind semantic search, RAG, recommendations, and deduplication.
The Mental Model
The idea: convert text (or images, or audio) into a list of numbers (a vector) that captures meaning. Similar meanings produce similar vectors. Then use the database to find the closest vectors to your query.
Choosing Your Database
| Database | Best For | Scale | Cost | Setup |
|---|---|---|---|---|
| pgvector | Existing Postgres apps | Millions | Free (self-host) | Add extension |
| Pinecone | Production, scale | Billions | $70/month+ | Managed |
| Chroma | Development, prototyping | Thousands | Free | pip install |
| Weaviate | Multi-modal, GraphQL | Millions | Free/Managed | Docker |
| Qdrant | Self-hosted production | Millions | Free | Docker |
Index Type Comparison
The index type determines how vectors are organized for fast search. This is the most impactful performance decision after choosing your database.
| Index Type | Build Time | Query Speed | Memory | Recall@10 | Best For |
|---|---|---|---|---|---|
| Flat (brute force) | None | Slowest | Low | 100% (exact) | Under 50K vectors, benchmarks |
| IVFFlat | Fast | Good | Low | 95-99% | 100K-5M vectors, balanced workloads |
| HNSW | Slow | Fastest | High (2-3x) | 97-99.5% | Latency-critical production apps |
| DiskANN (managed DBs) | Medium | Fast | Very Low | 95-99% | Billions of vectors, cost-sensitive |
- Under 100K vectors: Flat or HNSW. Everything is fast at this scale.
- 100K-10M vectors: HNSW for query-heavy workloads, IVFFlat if you need faster index rebuilds.
- Over 10M vectors: Managed solution (Pinecone, Qdrant) that handles sharding automatically. Self-hosted pgvector becomes hard to tune.
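- Frequent inserts: IVFFlat absorbs inserts more cheaply than HNSW, because every HNSW insert has to update the graph and HNSW rebuilds are expensive. IVFFlat's cluster centroids are fixed at build time, though, so plan a periodic rebuild as the data drifts.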
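The rules of thumb above can be written down as a small helper. This is a minimal sketch: the function name, the flags, and the return labels are illustrative assumptions, and the thresholds are simply the ones quoted in the list, not benchmarked values.

```python
def recommend_index(num_vectors: int,
                    frequent_inserts: bool = False,
                    query_heavy: bool = True) -> str:
    """Map the list's rules of thumb to an index label (hypothetical helper)."""
    if num_vectors < 100_000:
        # Everything is fast at this scale, so exact search is a safe default.
        return "flat"
    if num_vectors <= 10_000_000:
        # Prefer IVFFlat when rebuild or insert cost matters more than latency.
        if frequent_inserts or not query_heavy:
            return "ivfflat"
        return "hnsw"
    # Past ~10M vectors the list recommends a managed, sharded service.
    return "managed"


if __name__ == "__main__":
    for n in (50_000, 2_000_000, 50_000_000):
        print(n, recommend_index(n))
```

Treat the output as a starting point for a benchmark on your own data rather than a final answer.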
pgvector: Production-Ready PostgreSQL
Complete Setup
Production Python Integration
Pinecone: Managed Scale
Complete Setup
Pinecone Filter Syntax
Chunking: The Art of Splitting
Bad chunking = bad search results. This is where most RAG systems fail, and it is the single highest-leverage improvement you can make. The analogy: imagine cutting a book into pieces at arbitrary points. Some pieces start mid-sentence, others combine the end of one chapter with the start of another. Your search results will be garbage because the chunks don’t represent coherent ideas. Smart chunking splits at semantic boundaries (paragraphs, sections, headers) so each chunk is a self-contained unit of meaning.
Smart Chunking System
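A minimal sketch of boundary-aware chunking, assuming plain text where paragraphs are separated by blank lines. The function names and the choice to carry the previous chunk's last sentence as overlap are illustrative assumptions, not a prescribed design.

```python
import re

_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def last_sentence(paragraph: str) -> str:
    """Return the final sentence of a paragraph (naive punctuation split)."""
    return _SENTENCE_END.split(paragraph.strip())[-1]


def chunk_by_paragraph(text: str, max_chars: int = 1000,
                       overlap: bool = True) -> list:
    """Pack whole paragraphs into chunks of roughly max_chars.

    Paragraphs are never split, so a single oversized paragraph becomes
    its own chunk and a chunk can exceed max_chars.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = []  # paragraphs in the chunk being built
    size = 0      # characters accumulated in `current`
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            # Start the next chunk with the previous chunk's last sentence.
            current = [last_sentence(current[-1])] if overlap else []
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on headers instead of blank lines follows the same pattern: change the `re.split` pattern and keep the header text as a prefix of each chunk.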
Hybrid Search: Best of Both Worlds
Semantic search finds “things that mean the same thing.” Keyword search finds “things that use the same words.” Neither alone is sufficient: semantic search misses exact terms like error codes (“ERR_4012”) and function names (asyncpg.create_pool), while keyword search misses synonyms and paraphrases. After chunking, combining both is one of the biggest retrieval quality improvements you can make: expect 10-25% better recall.
Combine semantic search with keyword search for better results.
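One common way to merge the two result lists is reciprocal rank fusion (RRF). The text above does not say which fusion method it has in mind, so treat this as one reasonable option rather than the page's method. The sketch assumes each search returns document IDs ordered best-first; `k = 60` is the conventional smoothing constant and is an assumption here.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several best-first ID lists into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by both searches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    semantic = ["a", "b", "c"]   # hypothetical vector-search results
    keyword = ["c", "a", "d"]    # hypothetical full-text results
    print(reciprocal_rank_fusion([semantic, keyword]))
```

RRF only needs ranks, which sidesteps the problem that cosine similarities and keyword scores live on different scales.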
Mini-Project: Document Q&A System
Performance Optimization
Index Tuning for pgvector
Index tuning is the difference between 10ms and 500ms queries at scale. The two main knobs: how the index is built (affects accuracy) and how it’s searched (affects speed vs. recall trade-off).
Embedding Caching
Embeddings are deterministic: the same text always produces the same vector. This means you can cache them aggressively. In practice, 30-50% of embedding requests in a RAG system are duplicates (common queries, re-indexed documents). Caching can cut your OpenAI embedding costs by 50-90%.
Key Takeaways
pgvector for Most Apps
Chunking Is Critical
Hybrid Search Wins
Cache Everything
What’s Next
RAG Systems
Interview Deep-Dive
You are choosing a vector database for a new RAG product. You have 5 million documents, expect 100 queries per second at peak, and your team already runs PostgreSQL in production. Walk me through your decision framework.
- Given those parameters, pgvector is the strong default choice, and let me explain why before considering alternatives. You already run PostgreSQL, which means your team has operational expertise, monitoring, backup procedures, connection pooling, and deployment pipelines for Postgres. Adding the pgvector extension is one SQL command (`CREATE EXTENSION vector`). Introducing Pinecone or Qdrant means adding an entirely new service to your infrastructure: new deployment, new monitoring, new on-call runbook, new failure modes. The operational overhead of a new database is almost always underestimated.
- At 5 million documents with 1536-dimension embeddings, you are looking at roughly 30GB of vector data. pgvector handles this comfortably on a single machine with 64GB RAM. With an HNSW index (`m=16, ef_construction=64`), you will get sub-50ms query latency at 95% recall for single queries. At 100 QPS, you need connection pooling (pgbouncer or asyncpg pool), and you may need to increase `shared_buffers` and `effective_cache_size` in PostgreSQL configuration to keep the HNSW index in memory. But this is standard Postgres tuning, not new knowledge.
- Where pgvector falls short and I would switch: if the document count grows beyond 50 million, pgvector’s single-node architecture becomes a bottleneck. Pinecone or Qdrant can distribute across multiple nodes, giving you horizontal scalability. If you need multi-tenancy with thousands of separate namespaces (each customer has their own isolated vector space), Pinecone’s namespace feature handles this natively, while in pgvector you would use metadata filtering or separate tables, which adds query complexity. If your team has zero PostgreSQL experience and is already running on a serverless stack, managed Pinecone eliminates all operational burden.
- The decision framework: operational fit first (does the team know the technology?), scale requirements second (how big will the data grow?), feature requirements third (namespaces, metadata filtering, hybrid search support), and cost fourth. Most teams that start with Pinecone because it sounds modern end up paying $500-2000/month for a managed service they could have run on their existing Postgres instance for free.
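The 30GB figure in the sizing bullet above follows from simple arithmetic, sketched here. It assumes 4-byte single-precision floats, which is what pgvector's `vector` type stores, and counts only raw vector data: per-row overhead and the index itself come on top. The function name is illustrative.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the vector values alone (no row or index overhead)."""
    return num_vectors * dims * bytes_per_value


if __name__ == "__main__":
    size = raw_vector_bytes(5_000_000, 1536)
    print(f"{size / 1e9:.1f} GB")  # prints "30.7 GB"
```

Rerunning it with 512 dimensions gives about 10.2 GB, which is why dimension reduction is such an effective lever for memory.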
lists = sqrt(20_000_000) = ~4500 and probes = 20-40 to balance recall and latency. Third, shard the data: split by a natural partition key (document date, customer ID, content category) across multiple Postgres instances, each holding a subset of the vectors. Query the relevant shard based on the query context. This is more complex but gives you horizontal scalability. If none of these are sufficient, it is time to migrate to a purpose-built distributed vector database like Qdrant or Weaviate that handles sharding natively.
Explain IVFFlat versus HNSW indexing in pgvector. When would you choose each, and what are the tuning parameters that actually matter in production?
- IVFFlat (an inverted file index over flat, meaning uncompressed, vectors) partitions all vectors into a configurable number of clusters (called “lists”) using k-means. At query time, it identifies the nearest clusters to the query vector and performs an exhaustive search only within those clusters. Think of it as dividing a city into neighborhoods and only searching the nearby neighborhoods rather than every house in the city.
- HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each vector is a node connected to its nearest neighbors. Higher layers have fewer nodes and longer-range connections (like highway exits), lower layers have more nodes and local connections (like city streets). A query traverses from the top layer down, getting progressively more precise. Think of it as GPS navigation: first find the right region, then the right city, then the right street.
- Performance differences: HNSW has faster query times (2-10x faster than IVFFlat at the same recall level) because graph traversal is algorithmically more efficient than cluster scanning. But HNSW uses 2-3x more memory than IVFFlat because it stores the graph structure alongside the vectors. HNSW also has significantly slower index build times — building the graph for 5 million vectors can take 30 minutes versus 5 minutes for IVFFlat.
- Choose IVFFlat when: memory is constrained, you do frequent bulk inserts (IVFFlat is faster to rebuild), or your data changes frequently enough that you need to reindex regularly. IVFFlat also works well when recall requirements are moderate (90% is acceptable). Choose HNSW when: query latency is critical (real-time search, user-facing features), you need high recall (95%+), and your data is relatively stable (not reindexing every hour).
- Tuning parameters that matter. For IVFFlat: `lists` controls the number of clusters. Rule of thumb: `sqrt(num_rows)` (the pgvector README suggests `rows / 1000` for tables up to about 1 million rows and `sqrt(rows)` above that). Too few lists means each cluster is large and queries are slow. Too many lists means clusters are too small and recall drops. `probes` controls how many clusters to search at query time. Default is 1 (terrible recall). Set it to 10-20 for production. Higher probes = better recall but slower queries. Benchmark to find your sweet spot.
- For HNSW: `m` controls how many neighbors each node connects to. Higher m = better recall but more memory and slower build. Default 16 works well up to 10 million vectors. `ef_construction` controls build-time quality. Higher values produce better graphs but slower builds. 64-128 is typical. `ef_search` (set at query time via `SET hnsw.ef_search`) controls query-time graph exploration depth. Higher = better recall, slower queries. Default 40; production systems often use 100-200.
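The knobs above map onto pgvector SQL roughly as follows. This is a sketch that only builds the statements as strings: the table name `chunks` and column name `embedding` are hypothetical, cosine distance (`vector_cosine_ops`) is an assumption, and `lists` uses the `sqrt(num_rows)` rule of thumb from the text.

```python
import math


def ivfflat_statements(table: str, column: str, num_rows: int,
                       probes: int = 10) -> list:
    """Build-time and query-time statements for an IVFFlat index."""
    lists = max(1, round(math.sqrt(num_rows)))  # rule of thumb from the text
    return [
        f"CREATE INDEX ON {table} USING ivfflat ({column} vector_cosine_ops) "
        f"WITH (lists = {lists});",
        f"SET ivfflat.probes = {probes};",  # clusters scanned per query
    ]


def hnsw_statements(table: str, column: str, m: int = 16,
                    ef_construction: int = 64, ef_search: int = 100) -> list:
    """Build-time and query-time statements for an HNSW index."""
    return [
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});",
        f"SET hnsw.ef_search = {ef_search};",  # candidate list size per query
    ]


if __name__ == "__main__":
    # Identifiers are interpolated directly, so pass only trusted names.
    for stmt in ivfflat_statements("chunks", "embedding", 5_000_000):
        print(stmt)
    for stmt in hnsw_statements("chunks", "embedding"):
        print(stmt)
```

The `SET` statements apply per session, so run them on the same connection as the search query (or use `SET LOCAL` inside a transaction).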
EXPLAIN ANALYZE on the vector search query. If the index scan itself is 80ms, the issue is graph traversal depth. Lower `ef_search` from 100 to 40: this will drop latency to roughly 30-40ms, but recall might drop to 93-94%. If that recall is acceptable, you are done. If not, the next lever is reducing vector dimensions. If you are using 1536-dimension embeddings, consider `text-embedding-3-small` with `dimensions=512` or `dimensions=256` using OpenAI’s dimension reduction feature. Halving dimensions roughly halves search time and memory usage, with a modest recall reduction (typically 1-3%). The final lever is hardware: move to a machine with faster NVMe storage and more RAM to ensure the entire index is in memory. A common mistake is running vector search on shared database instances with memory pressure from other workloads. Give the vector index its own machine or at least its own memory allocation.
Your vector search returns the right documents but the top result has a similarity score of 0.82 and the fifth result has 0.79. The user asks: are these scores meaningful? Can I use them for filtering? What do you tell them?
- Cosine similarity scores between embeddings are meaningful for relative ranking within a single query but dangerous for absolute thresholds across queries. Within your query, 0.82 being higher than 0.79 reliably means the first result is more relevant. But comparing 0.82 from query A with 0.82 from query B tells you almost nothing — the absolute score depends heavily on the query length, specificity, and the embedding model’s behavior.
- Here is the concrete problem with fixed thresholds: a short, specific query like “asyncpg connection pool configuration” against a document that is exactly about that topic might score 0.92. A broad query like “how to build software” against a relevant but general document might score 0.78. A threshold of 0.80 would correctly include the first result but incorrectly exclude the second, even though both are relevant to their respective queries. Conversely, an irrelevant document might score 0.81 for a query where many documents are topically similar, passing the threshold despite being useless.
- For production filtering, I recommend three approaches instead of a fixed threshold. First, use a relative threshold: take the top result’s score and accept results within 0.1 of it. If the top result is 0.82, accept everything above 0.72. This adapts to the score distribution of each query. Second, use the score gap: if there is a sudden drop in scores (top 5 are 0.82-0.79 but the 6th is 0.65), use the gap as a natural cutoff. Third, do not filter by embedding score at all — retrieve a fixed top-k (say 20) and use a reranker to determine true relevance. The reranker’s cross-encoder scores are better calibrated for absolute thresholds because the model sees the query and document together.
- The deeper issue is that cosine similarity in high-dimensional spaces has a known problem: all pairs tend to cluster in a narrow range (typically 0.6-0.95 for related content). The discriminative power lives in small differences within that range, which is why relative ranking works but absolute thresholds do not. This effect is usually called distance concentration; it is closely related to the “hubness problem,” where a few vectors show up as nearest neighbors for a disproportionate number of queries.
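The first two filtering approaches described above can be sketched in a few lines. The function names and the default margin and gap values are illustrative assumptions (the 0.1 margin mirrors the example in the text) and should be tuned against labeled queries.

```python
def relative_threshold(scores: list, margin: float = 0.1) -> list:
    """Keep scores within `margin` of the best score for this query."""
    if not scores:
        return []
    cutoff = max(scores) - margin
    return [s for s in scores if s >= cutoff]


def cut_at_first_gap(scores: list, min_gap: float = 0.1) -> list:
    """Keep results up to the first drop of at least `min_gap`."""
    ranked = sorted(scores, reverse=True)
    for i in range(1, len(ranked)):
        if ranked[i - 1] - ranked[i] >= min_gap:
            return ranked[:i]
    return ranked  # no clear gap: keep everything


if __name__ == "__main__":
    scores = [0.82, 0.81, 0.80, 0.80, 0.79, 0.65]
    print(relative_threshold(scores))
    print(cut_at_first_gap(scores))
```

Both functions drop the 0.65 result for this input. Neither replaces a reranker, which remains the better choice when you need an absolute relevance cutoff.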
You mentioned that bad chunking is the number one cause of bad search results. Design a chunking strategy for a mixed-content knowledge base that has technical documentation, legal contracts, and marketing pages.
- The critical insight is that one chunking strategy does not fit all content types. A 1000-character chunk is too large for a dense legal clause (where every sentence is independently important) and too small for a technical tutorial (where a concept needs multiple paragraphs of context to be retrievable). The solution is content-type-aware chunking.
- For technical documentation: chunk at section headers (H2 or H3 level). Technical docs are already structured with headers that delineate concepts. Each section becomes one chunk. If a section exceeds 1500 characters, split at paragraph boundaries within it. Include the section header and parent header as a prefix to every chunk (“API Reference > Authentication > OAuth2 Flow: …”) so the chunk is self-contained. Use an overlap of 1-2 sentences to capture cross-paragraph references.
- For legal contracts: chunk at clause or sub-clause boundaries. Legal documents have clear structural markers (Section 3.2.1, Article IV, etc.). Each numbered clause becomes a chunk, regardless of length. Prefix each chunk with the full hierarchical section path (“Employment Agreement > Section 5: Termination > 5.2: Termination for Cause”). Do not split a single clause across chunks — even if it is 3000 characters, keep it whole. Lawyers search for complete clauses, not fragments.
- For marketing pages: chunk at paragraph level with generous overlap (200-300 characters). Marketing content is less structured and more narrative. Key phrases and value propositions can appear anywhere. Larger chunks (1500-2000 characters) work better because marketing queries are often conceptual (“what makes your product different”) and need surrounding context to answer meaningfully.
- Implementation: route documents through a content-type classifier (can be as simple as checking the file path or metadata tag) and apply the appropriate chunking strategy. Store the content type and chunk parameters in the metadata alongside the embedding, so you can later analyze retrieval performance per content type and tune independently.
- The universal rules regardless of content type: always add overlap to prevent information loss at boundaries. Always include enough context for the chunk to be self-contained. Always store chunk position metadata (chunk 3 of 12) so the application can retrieve neighboring chunks if needed. Test your chunking by reading 20 random chunks and asking: “Could I understand what this chunk is about without reading the full document?” If the answer is no for more than 20% of chunks, your chunking is too aggressive.
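The routing and metadata steps described above might look like the following. Everything named here is hypothetical: the folder-to-type mapping, the default content type, and the record fields are assumptions for illustration. The size limits are the ones quoted in the bullets, with `None` meaning a clause is never split.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from top-level folder to content type.
CONTENT_TYPES = {"docs": "technical", "legal": "legal", "marketing": "marketing"}

# Size limits from the bullets above; None = keep the unit whole.
MAX_CHARS = {"technical": 1500, "legal": None, "marketing": 2000}


def classify(path: str) -> str:
    """Route on the first path segment; unknown folders default to technical."""
    parts = PurePosixPath(path).parts
    top = parts[0] if parts else ""
    return CONTENT_TYPES.get(top, "technical")


def build_records(path: str, header_path: list, chunks: list) -> list:
    """Prefix each chunk with its header path and attach position metadata."""
    content_type = classify(path)
    prefix = " > ".join(header_path)
    return [
        {
            "text": f"{prefix}: {chunk}",
            "content_type": content_type,
            "max_chars": MAX_CHARS[content_type],
            "chunk_index": i,            # 1-based, as in "chunk 3 of 12"
            "chunk_count": len(chunks),
        }
        for i, chunk in enumerate(chunks, start=1)
    ]
```

Storing `content_type` and the chunk parameters next to each embedding is what makes per-type retrieval analysis possible later.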
Embedding costs are 40% of your monthly AI spend. Walk me through an embedding caching strategy and how you would measure whether it is working.
- Embedding caching exploits a fundamental property: embeddings are deterministic. The same text with the same model always produces the same vector. Once you have embedded a text, you never need to embed it again. In a RAG system, the same popular queries (“how do I reset my password”) are asked hundreds of times, and you pay for a new embedding each time. Caching eliminates this waste entirely.
- Architecture: two-tier cache. Layer one is an in-memory dictionary (or Redis in multi-process deployments) keyed by `md5(model_name + text)`. Layer two is disk-based (SQLite or a simple file cache) for persistence across restarts. On an embedding request: check memory cache first (sub-millisecond), then disk cache (1-5ms), then fall back to the API call (50-100ms). Store the result in both cache layers.
- For query embeddings (the user’s search query), the cache hit rate in a typical product is 30-50% because users ask similar questions. For document embeddings (the corpus you are indexing), the cache hit rate should be near 100%: you only need to embed each document chunk once, and re-embedding only happens when the document content changes. The highest-impact optimization is caching document embeddings aggressively and invalidating only when the source content changes.
- Measuring effectiveness requires four metrics: cache hit rate (target: 40%+ for queries, 95%+ for documents), cost savings (compare monthly embedding API spend before and after caching), latency improvement (cached responses are 10-100x faster than API calls), and cache freshness (are you serving stale embeddings for content that has changed?). The most important is cost savings, because that is the business justification. Track `embedding_api_calls_without_cache - embedding_api_calls_with_cache` monthly.
- Cache invalidation strategy: for document embeddings, invalidate when the document content changes. Hash the document content and compare against the stored hash. For query embeddings, use a TTL of 24 hours to 7 days. Queries do not change meaning over time, so long TTLs are safe. The only reason to expire query embeddings is to control cache size.
- Common mistake: caching embeddings but not caching the API response metadata (model version, dimensions). If you upgrade from `text-embedding-3-small` to a new model, all cached embeddings are invalid because different models produce incompatible vector spaces. Include the model name in the cache key to prevent cross-model contamination.
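A minimal sketch of the two-tier cache described above, using an in-memory dict in front of SQLite. The class name, table layout, and hit/miss counters are illustrative assumptions, and `embed_fn` stands in for whatever function actually calls your embedding API. The key follows the `md5(model_name + text)` scheme from the text, with a separator byte added so model and text cannot run together.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


class EmbeddingCache:
    """Memory -> SQLite -> embed_fn lookup, keyed by model name and text."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model: str,
                 db_path: str = ":memory:"):
        self.embed_fn = embed_fn
        self.model = model
        self.memory = {}  # layer one: in-process dict
        self.db = sqlite3.connect(db_path)  # layer two: persistent store
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings "
            "(key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )
        self.hits = 0
        self.misses = 0

    def _key(self, text: str) -> str:
        # Model name is part of the key so a model upgrade never reuses
        # vectors from an incompatible embedding space.
        return hashlib.md5(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key in self.memory:
            self.hits += 1
            return self.memory[key]
        row = self.db.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            self.hits += 1
            vector = json.loads(row[0])
        else:
            self.misses += 1
            vector = self.embed_fn(text)  # the only paid call
            self.db.execute(
                "INSERT OR REPLACE INTO embeddings (key, vector) VALUES (?, ?)",
                (key, json.dumps(vector)),
            )
            self.db.commit()
        self.memory[key] = vector
        return vector

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding API call, for demonstration only."""
    return [float(len(text)), 0.0]
```

Passing a file path as `db_path` gives persistence across restarts, and exporting `hit_rate` to your metrics system covers the first of the four measurements listed above.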