Why RAG is the Killer App
RAG (Retrieval-Augmented Generation) is how you build AI products that actually work with real data. ChatGPT’s Browsing, Perplexity’s search, enterprise knowledge bots: all RAG. Think of it this way: an LLM without RAG is like a brilliant consultant who hasn’t read your company’s documents. They can reason and communicate well, but they don’t know your specific data. RAG gives that consultant a research assistant who instantly pulls up the right internal documents before every answer.
Industry Reality: 90% of enterprise AI projects are RAG systems. Mastering RAG means you can build products that work with any company’s data without expensive fine-tuning. The alternative, fine-tuning a model on your data, costs 10-100x more and needs to be redone every time your data changes.
The RAG Mental Model
Production RAG System
Complete Implementation
Advanced Techniques
1. Parent-Child Retrieval
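A minimal sketch of parent-child retrieval: small child chunks are indexed for search, and each hit is mapped back to its larger parent chunk. It assumes a toy bag-of-words similarity in place of a real embedding model; `embed`, `build_index`, `retrieve_parents`, and the sample documents are hypothetical names and data chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(parents: dict[str, str], child_words: int = 8) -> list[tuple[str, str, Counter]]:
    """Split each parent into small child chunks and remember the parent of each."""
    index = []
    for parent_id, text in parents.items():
        words = text.split()
        for i in range(0, len(words), child_words):
            child = " ".join(words[i:i + child_words])
            index.append((parent_id, child, embed(child)))
    return index


def retrieve_parents(query: str, index, parents: dict[str, str], top_k: int = 3) -> list[str]:
    """Search over child chunks, then return the de-duplicated parent texts."""
    q = embed(query)
    ranked = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    seen, results = set(), []
    for parent_id, _child, _vec in ranked[:top_k]:  # top_k counts child hits
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


PARENTS = {
    "refunds": "Refunds are issued within 14 days of purchase. Contact support "
               "with your order number to start a refund request.",
    "shipping": "Standard shipping takes five business days. Express shipping "
                "is available for an extra fee at checkout.",
}
INDEX = build_index(PARENTS)
top_parents = retrieve_parents("how do I get a refund", INDEX, PARENTS, top_k=1)
print(top_parents[0])
```

The match happens on the short child chunk ending in "a refund request.", but the caller receives the full refund paragraph, which is what the LLM would see as context.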
The fundamental tension: small chunks embed better (more precise meaning), but small chunks lack context (the LLM can’t understand a 50-word snippet). Parent-child retrieval solves this: search on small child chunks for precision, then return the larger parent chunk for context. It’s like using an index to find the right page, then reading the whole page.
2. Agentic RAG with Query Decomposition
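A minimal sketch of query decomposition: a complex question is split into sub-questions, each sub-question is retrieved for independently, and the evidence is grouped for a later synthesis step. The `decompose` function here is a naive rule-based stand-in for what would normally be an LLM call, and `keyword_retrieve`, `agentic_retrieve`, and the corpus are hypothetical.

```python
from typing import Callable


def decompose(question: str) -> list[str]:
    """Naive stand-in for an LLM decomposition call: split on ' and '."""
    parts = [p.strip(" ?.") for p in question.split(" and ")]
    return [p + "?" for p in parts if p]


def keyword_retrieve(query: str, corpus: dict[str, str], top_k: int = 1) -> list[str]:
    """Toy retriever: rank documents by words shared with the query."""
    q = set(query.lower().strip("?").split())
    ranked = sorted(
        corpus,
        key=lambda doc_id: len(q & set(corpus[doc_id].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def agentic_retrieve(
    question: str,
    corpus: dict[str, str],
    decomposer: Callable[[str], list[str]] = decompose,
) -> dict[str, list[str]]:
    """Retrieve independently per sub-question; group evidence by sub-question."""
    return {sub_q: keyword_retrieve(sub_q, corpus) for sub_q in decomposer(question)}


CORPUS = {
    "q3_report": "q3 revenue grew on strong enterprise renewals",
    "q4_report": "q4 revenue was flat after seasonal slowdown",
    "churn_memo": "churn increased because onboarding support was reduced",
}
evidence = agentic_retrieve(
    "Compare our Q3 and Q4 revenue trends and explain why churn increased", CORPUS
)
for sub_question, doc_ids in evidence.items():
    print(sub_question, "->", doc_ids)
```

In a real pipeline the grouped evidence would be passed to the LLM in a final synthesis prompt, and the decomposer would be swapped for a model call that produces cleaner sub-questions than a string split can.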
When a question is too complex for a single retrieval pass (“Compare our Q3 and Q4 revenue trends and explain why churn increased”), the system needs to break it into sub-questions, retrieve independently for each, then synthesize. This is the agentic pattern.
Evaluation Framework
Retrieval Strategy Decision Framework
Choosing the right retrieval strategy is the single most impactful decision in your RAG pipeline. Here is a decision table for the three main strategies and their sub-techniques:

| Signal in Your Data | Recommended Strategy | Why |
|---|---|---|
| Queries use different words than docs (“PTO” vs “vacation”) | Hybrid search + query expansion | Vector catches synonyms, expansion bridges vocabulary gaps |
| Queries contain exact identifiers (error codes, SKUs, names) | Hybrid search, keyword-weighted | BM25 nails exact matches that vector similarity misses |
| Documents are long and dense (legal, academic) | Parent-child retrieval + reranking | Small chunks embed precisely; parent chunks provide LLM context |
| Users ask multi-part comparative questions | Agentic RAG with query decomposition | Single retrieval pass cannot address distinct sub-questions |
| Context window is tight (using smaller models) | Aggressive reranking (top_k=10 -> rerank to 3) | Fewer, higher-quality docs beat more mediocre docs |
| Queries are mostly well-formed and specific | Vector-only with threshold 0.75+ | Simpler pipeline, lower latency, sufficient accuracy |
| Low-latency requirement (under 500ms total) | Vector-only, pre-computed embeddings, skip reranking | Every pipeline stage adds 50-200ms |
Common Failures and Fixes
Retrieval returns irrelevant documents
Symptoms: Low precision, answer quality poor. The LLM generates answers that are technically well-written but address the wrong topic because it was fed the wrong context.
Root Cause: Usually a chunking problem. If your chunks are too large, every chunk is “kind of about everything” and similarity scores flatten. If too small, they lack enough meaning to match well.
Fixes (try in this order):
- Improve chunking — smaller chunks with semantic boundaries, not arbitrary character splits
- Add query expansion — the user’s words may not match your document’s vocabulary
- Use hybrid search (vector + keyword) — catches what either method alone misses
- Add re-ranking step — a cross-encoder can catch false positives that slipped through
- Tune similarity threshold — raise it to cut noise, lower it if you’re missing results
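
As a rough illustration of the first fix, the sketch below chunks on paragraph boundaries and falls back to sentence boundaries for long paragraphs, rather than cutting at arbitrary character offsets. The function name `chunk_by_paragraph`, the `max_chars` budget, and the sample text are assumptions for the example; real pipelines often count tokens instead of characters.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Paragraphs longer than max_chars are split on sentence boundaries. A single
    sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    units: list[str] = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # collapse internal whitespace
        if len(para) <= max_chars:
            units.append(para)
        else:
            units.extend(re.split(r"(?<=[.!?])\s+", para))

    chunks: list[str] = []
    current = ""
    for unit in units:
        candidate = f"{current} {unit}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk at a natural boundary
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


DOC = (
    "Refunds are issued within 14 days.\n\n"
    "Shipping takes five business days. Express shipping costs extra.\n\n"
    "Contact support by email."
)
chunks = chunk_by_paragraph(DOC, max_chars=70)
for chunk in chunks:
    print(repr(chunk))
```

With a 70-character budget each paragraph becomes its own chunk; with a tighter budget the shipping paragraph is split between its two sentences instead of in the middle of one.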
LLM ignores provided context
Symptoms: Answer doesn’t use sources, makes up facts. This is the most dangerous RAG failure because it looks correct but is hallucinated.
Root Cause: The LLM’s parametric knowledge (training data) is competing with your retrieved context. If the context is poorly formatted or buried in the prompt, the model defaults to what it “knows.”
Fixes:
- Set temperature=0 — reduces creative drift from sources
- Put context closer to the question — LLMs attend more to nearby text (recency bias)
- Add explicit instructions: “ONLY use information from the provided sources. If not found, say so.”
- Use structured output to force citations — the model must produce [Source N] references
- Try a more capable model — GPT-4o follows grounding instructions better than GPT-4o-mini
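
Several of these fixes can be combined in how the prompt is assembled and how the answer is checked. The sketch below is one possible shape, not a prescribed template: it numbers the sources, places them directly before the question, states the grounding rule, and then verifies that the answer cites only sources that exist. `build_prompt`, `check_citations`, and the wording of `GROUNDING_RULE` are hypothetical, and no model is called here.

```python
import re

GROUNDING_RULE = (
    "ONLY use information from the provided sources. "
    "If the answer is not in the sources, say so. "
    "Cite every claim as [Source N]."
)


def build_prompt(question: str, sources: list[str]) -> str:
    """Number the sources and place them directly before the question."""
    numbered = "\n".join(
        f"[Source {i}] {text}" for i, text in enumerate(sources, start=1)
    )
    return f"{GROUNDING_RULE}\n\nSources:\n{numbered}\n\nQuestion: {question}\nAnswer:"


def check_citations(answer: str, num_sources: int) -> bool:
    """True if the answer cites at least one source and every cited number exists."""
    cited = [int(n) for n in re.findall(r"\[Source (\d+)\]", answer)]
    return bool(cited) and all(1 <= n <= num_sources for n in cited)


sources = ["Refunds are issued within 14 days.", "Express shipping costs extra."]
prompt = build_prompt("How long do refunds take?", sources)
print(prompt)

print(check_citations("Refunds take up to 14 days [Source 1].", len(sources)))
print(check_citations("Refunds take 30 days.", len(sources)))
print(check_citations("See [Source 5] for details.", len(sources)))
```

The citation check only confirms that references are present and well-formed. It does not verify that the cited source actually supports the claim, which needs a separate faithfulness evaluation.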
Missing relevant documents
Symptoms: Low recall, answer says “no information”
Fixes:
- Query expansion (multiple query variations)
- HyDE (hypothetical document embeddings)
- Lower similarity threshold
- Increase top_k before re-ranking
- Improve document coverage
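
To show how query expansion can recover a missed document, here is a small sketch using the “PTO” vs “vacation” vocabulary gap from the decision table. The synonym table stands in for what would usually be LLM-generated query variations, the word-overlap score stands in for embedding similarity, and every name and document in it is hypothetical.

```python
# Hypothetical synonym table; a real system might ask an LLM for variations.
SYNONYMS = {"pto": ["vacation", "time off"], "vacation": ["pto", "time off"]}


def expand_query(query: str) -> list[str]:
    """Return the original query plus one variant per known synonym."""
    variants = [query]
    words = query.lower().split()
    for i, word in enumerate(words):
        for alt in SYNONYMS.get(word, []):
            variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants


def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words found in the document."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q) if q else 0.0


def retrieve_expanded(
    query: str, corpus: dict[str, str], top_k: int = 2, threshold: float = 0.3
) -> list[str]:
    """Score each document by its best-matching variant, then apply the threshold."""
    best: dict[str, float] = {}
    for variant in expand_query(query):
        for doc_id, text in corpus.items():
            best[doc_id] = max(best.get(doc_id, 0.0), overlap_score(variant, text))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked if score >= threshold][:top_k]


CORPUS = {
    "leave_policy": "employees accrue vacation days each month and may carry over five days",
    "expense_policy": "submit expense reports within thirty days of travel",
}
baseline = [d for d, text in CORPUS.items() if overlap_score("pto policy", text) >= 0.3]
expanded = retrieve_expanded("pto policy", CORPUS)
print("without expansion:", baseline)
print("with expansion:", expanded)
```

The unexpanded query matches nothing because the document never says “PTO”; the “vacation policy” variant clears the threshold and surfaces the leave policy.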
Slow response times
Symptoms: >3s total latency
Fixes:
- Cache embeddings for the most common queries
- Use async database connections with pooling
- Stream LLM responses
- Use faster embedding model
- Pre-compute common query answers
Key Takeaways
Hybrid Search Wins
Vector + keyword search with RRF scoring outperforms either alone for most use cases.
Re-ranking Is Worth It
Cross-encoder or LLM re-ranking significantly improves precision at modest latency cost.
Query Processing Matters
Query expansion and HyDE can dramatically improve recall for ambiguous queries.
Evaluate Continuously
Track retrieval precision, answer faithfulness, and latency. What you don’t measure, you can’t improve.
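As a starting point for tracking retrieval quality and latency, the sketch below computes precision@k and recall@k against a hand-labeled set of relevant documents and times a call. The document IDs and labels are made up, and answer faithfulness is not covered here because it typically needs human labels or an LLM judge.

```python
import time


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    return sum(doc in relevant for doc in top) / len(top) if top else 0.0


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


retrieved = ["doc_a", "doc_x", "doc_b", "doc_y"]  # hypothetical retriever output
relevant = {"doc_a", "doc_b", "doc_c"}            # hypothetical ground-truth labels

p = precision_at_k(retrieved, relevant, k=3)  # 2 of the top 3 are relevant
r = recall_at_k(retrieved, relevant, k=3)     # 2 of the 3 relevant docs were found
_, elapsed = timed(precision_at_k, retrieved, relevant, 3)
print(f"precision@3={p:.2f} recall@3={r:.2f} latency={elapsed * 1000:.3f}ms")
```

Running these metrics over a fixed set of labeled queries after every chunking, threshold, or reranking change makes regressions visible before users report them.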
What’s Next
AI Agents
Build autonomous agents that use tools, make decisions, and complete multi-step tasks