RAG has evolved far beyond simple “retrieve and generate.” Think of it like restaurant service: Basic RAG is a waiter who brings you whatever matches your order from the menu. Advanced RAG is a sommelier who understands what you really want, checks multiple sources, and curates the perfect selection. Agentic RAG is a personal chef who plans multiple courses, adjusts based on your reactions, and synthesizes a complete experience.
Modern RAG systems use these sophisticated architectures to handle complex queries, maintain context, and deliver accurate, grounded responses.
What It Is
Basic RAG is the simplest form: search for relevant documents using vector similarity, then feed them to an LLM to generate an answer. It is the “SELECT * WHERE similar” of AI: straightforward, predictable, and often good enough. Think of it as asking a librarian who quickly finds relevant books and summarizes them for you, but who only makes one trip to the shelves.
Real-World Example
Use Case: Company documentation Q&A
Employee asks: “What’s our remote work policy?”
System finds: 3 policy documents mentioning remote work
LLM summarizes: “According to company policy, employees can work remotely up to 3 days per week with manager approval…”

    from openai import OpenAI
    from typing import List
    import json


    class BasicRAG:
        """Simple retrieve-and-generate RAG"""

        def __init__(self, database):
            self.client = OpenAI()
            self.db = database

        def query(self, question: str, top_k: int = 5) -> str:
            # 1. Embed the question -- convert text to a vector for similarity comparison
            embedding = self._embed(question)

            # 2. Retrieve similar documents -- "find me chunks that mean something similar"
            docs = self.db.vector_search(embedding, top_k=top_k)

            # 3. Build context -- format docs so the LLM can cite them
            context = "\n\n".join([
                f"[Source {i+1}]: {doc['content']}"
                for i, doc in enumerate(docs)
            ])

            # 4. Generate answer -- the LLM reads your docs and responds
            response = self.client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {
                        "role": "system",
                        "content": """Answer the question based only on the provided sources.
    Cite sources using [Source N] format."""
                    },
                    {
                        "role": "user",
                        "content": f"Sources:\n{context}\n\nQuestion: {question}"
                    }
                ]
            )
            return response.choices[0].message.content

        def _embed(self, text: str) -> List[float]:
            response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=text
            )
            return response.data[0].embedding

When to Use Basic RAG:
Simple Q&A over documentation
Small to medium document collections
Well-formed, specific questions
MVP/prototype stage
Limitations:
Single retrieval step may miss context
No query understanding or rewriting
Limited handling of complex queries
No reasoning over multiple documents
Practical Tips:
Chunk Size: Keep chunks 200-500 tokens. Too small = no context for the LLM. Too large = diluted relevance scores. A good rule of thumb: each chunk should be understandable on its own.
Embedding Model: text-embedding-3-small offers a good cost-to-quality ratio for most use cases. Upgrade to text-embedding-3-large only if you see retrieval quality issues.
Top-K: Start with 3-5 documents. More is not always better — irrelevant docs confuse the LLM and waste tokens.
Temperature: Use 0.0 for factual answers. The moment you raise it, you invite creative paraphrasing — fine for chatbots, dangerous for compliance docs.
What It Is
Advanced RAG improves accuracy by adding query processing, hybrid search (vector + keyword), and re-ranking. If Basic RAG is a single Google search, Advanced RAG is what a research analyst does: rephrase the question several ways, search multiple databases, cross-reference results, and rank everything by relevance before presenting findings.
Real-World Example
Use Case: Legal document search
Lawyer asks: “Cases about contract breach in California”
System expands to: [“contract breach California”, “contractual violations CA”, “breach of agreement California courts”]
Searches using both semantic similarity AND keyword matching
Re-ranks results by legal relevance
Returns top 5 most relevant cases

    class AdvancedRAG:
        """RAG with query expansion, hybrid search, and re-ranking"""

        def __init__(self, database):
            self.client = OpenAI()
            self.db = database

        async def query(self, question: str) -> dict:
            # 1. Query Processing
            processed_queries = await self._process_query(question)

            # 2. Hybrid Retrieval (Vector + Keyword)
            # _vector_search, _keyword_search, and _generate are not shown here.
            all_docs = []
            for q in processed_queries:
                vector_docs = await self._vector_search(q)
                keyword_docs = await self._keyword_search(q)
                all_docs.extend(vector_docs + keyword_docs)

            # 3. Reciprocal Rank Fusion
            fused_docs = self._rrf_fusion(all_docs)

            # 4. Re-ranking (LLM-as-judge below; a cross-encoder is a common alternative)
            reranked_docs = await self._rerank(question, fused_docs[:20])

            # 5. Generate with best documents
            answer = await self._generate(question, reranked_docs[:5])

            return {
                "answer": answer,
                "sources": reranked_docs[:5],
                "query_expansions": processed_queries
            }

        async def _process_query(self, question: str) -> List[str]:
            """Expand query into multiple search variations"""
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        "content": """Generate 3 alternative phrasings of this question
    to improve search. Return JSON: {"queries": ["...", "...", "..."]}"""
                    },
                    {"role": "user", "content": question}
                ],
                response_format={"type": "json_object"}
            )
            result = json.loads(response.choices[0].message.content)
            return [question] + result.get("queries", [])

        def _rrf_fusion(self, docs: List[dict], k: int = 60) -> List[dict]:
            """Reciprocal Rank Fusion: a principled way to merge ranked lists.

            RRF assigns score = 1/(k + rank) per list, then sums across lists.
            A doc ranked #1 in both lists beats a doc ranked #1 in one and
            absent from the other. The k=60 constant prevents top ranks
            from dominating.
            """
            scores = {}
            doc_map = {}

            for doc in docs:
                doc_id = doc['id']
                if doc_id not in scores:
                    scores[doc_id] = 0
                    doc_map[doc_id] = doc
                # RRF score: 1 / (k + rank)
                # Assumes each result dict carries its 1-based 'rank' within the
                # list it came from; without it, every hit is treated as rank 1.
                rank = doc.get('rank', 1)
                scores[doc_id] += 1 / (k + rank)

            # Sort by fused score
            sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
            return [doc_map[doc_id] for doc_id in sorted_ids]

        async def _rerank(self, question: str, docs: List[dict]) -> List[dict]:
            """Re-rank documents using LLM as judge"""
            doc_texts = "\n".join([
                f"[{i+1}] {doc['content'][:500]}"
                for i, doc in enumerate(docs)
            ])

            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        "content": """Rate relevance of each document to the question (0-10).
    Return JSON: {"rankings": [{"doc": 1, "score": 8}, ...]}"""
                    },
                    {
                        "role": "user",
                        "content": f"Question: {question}\n\nDocuments:\n{doc_texts}"
                    }
                ],
                response_format={"type": "json_object"}
            )
            rankings = json.loads(response.choices[0].message.content)

            # Apply scores and re-sort
            for ranking in rankings.get("rankings", []):
                idx = ranking["doc"] - 1
                if 0 <= idx < len(docs):
                    docs[idx]["rerank_score"] = ranking["score"]

            return sorted(docs, key=lambda x: x.get("rerank_score", 0), reverse=True)

When to Use Advanced RAG:
Production systems requiring high accuracy
Diverse queries with varying terminology
Technical domains with specific jargon
Ambiguous or complex user questions
Need for better precision and recall
Limitations:
Higher cost than Basic RAG (~2-3x)
Increased latency due to multiple processing steps
Requires more infrastructure (re-ranking models)
More complex to implement and maintain
May be overkill for simple use cases
Performance Tips:
Query Expansion: Use gpt-4o-mini for cost efficiency
Re-ranking: Only re-rank top 20 candidates to balance cost/quality
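Hybrid Search: Start with weights of about 0.7 for vector scores and 0.3 for keyword scores, then tune on your own queries
RRF Fusion: k=60 is the commonly used default; it damps the advantage of top ranks so that no single list dominates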
Caching: Cache query expansions and common re-ranking results
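To make the RRF tip concrete, here is a minimal, self-contained sketch. It assumes each retriever returns an ordered list of document IDs; the function name `rrf_fuse` and the sample IDs are hypothetical and not part of the class above. Each document scores the sum of 1/(k + rank) over the lists in which it appears.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        # Ranks are 1-based: the top hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search ranking
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical keyword-search ranking
fused = rrf_fuse([vector_hits, keyword_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents found by both retrievers (doc_b, doc_a) rise above those found by only one, which is the behavior hybrid search relies on.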
What It Is
Memory RAG adds conversation history and user context, enabling personalized, context-aware responses. Without memory, every question starts from scratch: the user says “what about lunch?” and the system has no idea they were just discussing diabetic-friendly breakfasts. Memory RAG is the difference between talking to a stranger every time versus talking to a colleague who remembers your past conversations.
Real-World Example
Use Case: Personal health assistant
First conversation: “I have diabetes. What should I eat for breakfast?”
System remembers: User has diabetes, prefers quick meals
Later conversation: “What about lunch?”
System uses memory: Suggests low-carb lunches, remembers breakfast preferences

    from datetime import datetime
    from typing import Optional


    class MemoryRAG:
        """RAG with short-term and long-term memory"""

        def __init__(self, database, memory_store):
            self.client = OpenAI()
            self.db = database
            self.memory = memory_store

        async def query(
            self,
            question: str,
            user_id: str,
            session_id: str
        ) -> dict:
            # 1. Short-term memory: what did we just talk about? (last N turns)
            conversation = await self.memory.get_conversation(session_id, limit=10)

            # 2. Long-term memory: what do we know about this user? (preferences, facts)
            user_context = await self.memory.get_user_context(user_id)

            # 3. Rewrite the query with context -- "what about lunch?" becomes
            # "suggest diabetic-friendly lunch options for someone who prefers quick meals"
            contextualized_query = await self._contextualize_query(
                question, conversation, user_context
            )

            # 4. Retrieve with contextualized query (_retrieve and _generate are not shown here)
            docs = await self._retrieve(contextualized_query)

            # 5. Generate with full context
            answer = await self._generate(
                question=question,
                documents=docs,
                conversation=conversation,
                user_context=user_context
            )

            # 6. Update memory
            await self._update_memory(
                user_id=user_id,
                session_id=session_id,
                question=question,
                answer=answer
            )

            return {"answer": answer, "sources": docs}

        async def _contextualize_query(
            self,
            question: str,
            conversation: List[dict],
            user_context: dict
        ) -> str:
            """Rewrite query with conversation context"""
            if not conversation:
                return question

            history = "\n".join([
                f"{'User' if m['role'] == 'user' else 'Assistant'}: {m['content']}"
                for m in conversation[-5:]  # Last 5 turns
            ])

            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        "content": """Given the conversation history, rewrite the user's question
    to be self-contained. Include relevant context from the conversation.
    Return only the rewritten question."""
                    },
                    {
                        "role": "user",
                        "content": f"""Conversation:
    {history}

    User preferences: {json.dumps(user_context.get('preferences', {}))}

    Current question: {question}

    Rewritten question:"""
                    }
                ]
            )
            return response.choices[0].message.content

        async def _update_memory(
            self,
            user_id: str,
            session_id: str,
            question: str,
            answer: str
        ):
            """Update both short-term and long-term memory"""
            # Short-term: conversation history
            await self.memory.add_message(session_id, "user", question)
            await self.memory.add_message(session_id, "assistant", answer)

            # Long-term: extract and store facts
            facts = await self._extract_facts(question, answer)
            for fact in facts:
                await self.memory.add_user_fact(user_id, fact)

        async def _extract_facts(self, question: str, answer: str) -> List[dict]:
            """Extract memorable facts from conversation"""
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "system",
                        "content": """Extract user preferences or facts to remember.
    Return JSON: {"facts": [{"type": "preference|fact", "content": "..."}]}
    Only include genuinely useful information. Return empty if nothing notable."""
                    },
                    {
                        "role": "user",
                        "content": f"User asked: {question}\nResponse: {answer}"
                    }
                ],
                response_format={"type": "json_object"}
            )
            result = json.loads(response.choices[0].message.content)
            return result.get("facts", [])

When to Use Memory RAG:
Building chatbots or conversational assistants
Users have repeat interactions with the system
Personalization improves user experience
Multi-turn conversations with context dependencies
Need to remember user preferences and facts
Limitations:
Requires memory storage infrastructure
Privacy concerns with storing user data
Memory can become stale or incorrect over time
Additional cost for memory operations
More complex state management
Performance Tips:
Short-term Memory: Keep last 5-10 conversation turns for context
Long-term Memory: Extract only meaningful facts, not every detail
Memory Retrieval: Use vector search for semantic memory lookup
Memory Updates: Batch updates to reduce database calls
Context Window: Limit conversation history to avoid token limits
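The short-term memory and context window tips can be combined in one helper. The sketch below assumes messages are dicts with "role" and "content" keys, as in the class above. The function name `trim_history` is hypothetical, and whitespace word counts stand in for real token counts, so a production system would swap in the model's tokenizer.

```python
from typing import Dict, List


def trim_history(messages: List[Dict[str, str]], max_turns: int = 10,
                 max_tokens: int = 1000) -> List[Dict[str, str]]:
    """Keep the newest messages that fit a turn limit and a rough token budget."""
    if max_turns <= 0:
        return []
    kept: List[Dict[str, str]] = []
    used = 0
    # Walk backwards so the most recent messages are considered first.
    for message in reversed(messages[-max_turns:]):
        cost = len(message["content"].split())  # crude word-count proxy for tokens
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "I have diabetes"},
    {"role": "assistant", "content": "Noted, thanks"},
    {"role": "user", "content": "Lunch?"},
]
recent = trim_history(history, max_turns=10, max_tokens=3)
```

Dropping whole messages from the oldest end keeps the remaining history coherent, at the price of losing early context that long-term memory is meant to preserve.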
What It Is
Agentic RAG uses iterative reasoning to answer complex, multi-step questions. While other RAG types do a single “retrieve then generate” pass, Agentic RAG operates in a loop: plan what information is needed, retrieve it, evaluate whether the answer is complete, and retrieve more if not. It is the most powerful pattern but also the most expensive. Think of it as hiring a research assistant who bills by the hour rather than a librarian who fetches one book.
Real-World Example
Use Case: Research assistant for academic papers
User asks: “Compare the effectiveness of transformer models vs RNNs for machine translation, considering recent papers from 2023-2024”
System plans:
Step 1: Search for “transformer models machine translation”
Step 2: Search for “RNN machine translation comparison”
Step 3: Search for “transformer vs RNN 2023 2024”
Step 4: Synthesize findings and compare
System answers: Comprehensive comparison with citations from multiple sources

    from enum import Enum


    class Action(Enum):
        SEARCH = "search"
        ANSWER = "answer"
        CLARIFY = "clarify"


    class AgenticRAG:
        """RAG with iterative reasoning and multi-hop retrieval"""

        def __init__(self, database):
            self.client = OpenAI()
            self.db = database
            self.max_iterations = 5

        async def query(self, question: str) -> dict:
            context = []
            reasoning_chain = []

            for i in range(self.max_iterations):
                # Decide next action
                action, action_input = await self._plan_action(
                    question=question,
                    context=context,
                    reasoning=reasoning_chain
                )
                reasoning_chain.append({
                    "step": i + 1,
                    "action": action.value,
                    "input": action_input
                })

                if action == Action.ANSWER:
                    # Ready to answer
                    return {
                        "answer": action_input,
                        "reasoning": reasoning_chain,
                        "sources": context
                    }
                elif action == Action.SEARCH:
                    # Retrieve more information
                    docs = await self._retrieve(action_input)
                    context.extend(docs)
                    reasoning_chain[-1]["result"] = f"Found {len(docs)} documents"
                elif action == Action.CLARIFY:
                    # Need clarification
                    return {
                        "answer": None,
                        "clarification_needed": action_input,
                        "reasoning": reasoning_chain
                    }

            # Max iterations reached (_force_answer and _embed are not shown here)
            return {
                "answer": await self._force_answer(question, context),
                "reasoning": reasoning_chain,
                "sources": context,
                "warning": "Max iterations reached"
            }

        async def _plan_action(
            self,
            question: str,
            context: List[dict],
            reasoning: List[dict]
        ) -> tuple[Action, str]:
            """Decide what to do next"""
            context_summary = "\n".join([
                f"- {doc['content'][:200]}..."
                for doc in context[-5:]
            ]) if context else "No information gathered yet."

            reasoning_str = "\n".join([
                f"Step {r['step']}: {r['action']} - {r.get('result', r.get('input', ''))}"
                for r in reasoning
            ]) if reasoning else "No steps taken yet."

            response = self.client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {
                        "role": "system",
                        "content": """You are a reasoning agent. Decide what to do next:
    1. SEARCH: Need more information. Provide a search query.
    2. ANSWER: Have enough info to answer. Provide the answer.
    3. CLARIFY: Question is ambiguous. Provide clarification request.

    Return JSON: {"action": "search|answer|clarify", "input": "..."}

    Be thorough: search multiple times if needed for complex questions."""
                    },
                    {
                        "role": "user",
                        "content": f"""Question: {question}

    Information gathered:
    {context_summary}

    Reasoning so far:
    {reasoning_str}

    What should I do next?"""
                    }
                ],
                response_format={"type": "json_object"}
            )
            result = json.loads(response.choices[0].message.content)
            action = Action(result["action"])
            return action, result["input"]

        async def _retrieve(self, query: str) -> List[dict]:
            """Search for documents"""
            embedding = self._embed(query)
            return self.db.vector_search(embedding, top_k=5)

When to Use Agentic RAG:
Complex research questions requiring multiple sources
Questions that need synthesis across documents
Multi-hop reasoning (“who works at the company that acquired X?”)
Comparative analysis questions
Questions requiring iterative information gathering
Limitations:
Highest cost among RAG types ($15-40 per 1000 queries)
Slowest latency due to multiple iterations
Can get stuck in loops of repeated searches, which becomes costly if max_iterations is set too high
Requires careful prompt engineering for action planning
More complex to debug and monitor
Performance Tips:
Max Iterations: Set to 3-5 for most use cases
Action Planning: Use gpt-4o for better reasoning, gpt-4o-mini for cost savings
Early Stopping: Implement confidence thresholds to stop early
Query Generation: Cache common query patterns
Monitoring: Track iteration count and reasoning chains for optimization
What It Is
Multi-Vector RAG uses multiple embedding types (dense, sparse, metadata) for precise retrieval. By combining semantic understanding, exact keyword matching, and structured metadata, it provides more accurate and flexible search capabilities.
Real-World Example
Use Case: Technical documentation search
User asks: “Python async/await error handling”
System searches:
Dense vectors: Find semantically similar docs about async programming
Sparse vectors: Match exact keywords “async”, “await”, “error”
Metadata vectors: Filter by language=“Python”, category=“error-handling”
System combines: Weighted fusion returns most relevant technical docs

    class MultiVectorRAG:
        """RAG with dense, sparse, and metadata vectors"""

        def __init__(self, database):
            self.client = OpenAI()
            self.db = database

        async def index_document(self, doc_id: str, content: str, metadata: dict):
            """Index with multiple vector types"""
            # 1. Dense embedding (semantic); _embed_dense is not shown here
            dense_embedding = self._embed_dense(content)

            # 2. Sparse embedding (BM25/keywords)
            sparse_embedding = self._embed_sparse(content)

            # 3. Metadata embedding
            metadata_text = " ".join([
                f"{k}: {v}" for k, v in metadata.items()
            ])
            metadata_embedding = self._embed_dense(metadata_text)

            # Store all embeddings
            await self.db.store_multi_vector(
                doc_id=doc_id,
                content=content,
                dense=dense_embedding,
                sparse=sparse_embedding,
                metadata_vec=metadata_embedding,
                metadata=metadata
            )

        async def query(
            self,
            question: str,
            filters: dict = None,
            weights: dict = None
        ) -> List[dict]:
            """Multi-vector retrieval with configurable weights"""
            weights = weights or {
                "dense": 0.5,
                "sparse": 0.3,
                "metadata": 0.2
            }

            # Embed query
            query_dense = self._embed_dense(question)
            query_sparse = self._embed_sparse(question)

            # Search with each vector type
            dense_results = await self.db.search(
                vector=query_dense,
                vector_type="dense",
                filters=filters
            )
            sparse_results = await self.db.search(
                vector=query_sparse,
                vector_type="sparse",
                filters=filters
            )

            # If metadata filters provided, search metadata vectors
            if filters:
                filter_text = " ".join([f"{k}: {v}" for k, v in filters.items()])
                filter_vec = self._embed_dense(filter_text)
                metadata_results = await self.db.search(
                    vector=filter_vec,
                    vector_type="metadata"
                )
            else:
                metadata_results = []

            # Weighted combination
            combined = self._weighted_fusion(
                dense_results, sparse_results, metadata_results, weights
            )
            return combined

        def _weighted_fusion(
            self,
            dense: List[dict],
            sparse: List[dict],
            metadata: List[dict],
            weights: dict
        ) -> List[dict]:
            """Combine results with weighted scoring"""
            scores = {}
            docs = {}

            # Each list contributes weight / (1-based rank) to a document's score
            for rank, doc in enumerate(dense):
                doc_id = doc['id']
                docs[doc_id] = doc
                scores[doc_id] = scores.get(doc_id, 0) + weights["dense"] / (rank + 1)

            for rank, doc in enumerate(sparse):
                doc_id = doc['id']
                docs[doc_id] = doc
                scores[doc_id] = scores.get(doc_id, 0) + weights["sparse"] / (rank + 1)

            for rank, doc in enumerate(metadata):
                doc_id = doc['id']
                docs[doc_id] = doc
                scores[doc_id] = scores.get(doc_id, 0) + weights["metadata"] / (rank + 1)

            sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
            return [docs[doc_id] for doc_id in sorted_ids]

        def _embed_sparse(self, text: str) -> dict:
            """Create sparse BM25-style embedding"""
            # Simplified: use actual BM25 or SPLADE in production.
            # Note: "indices" here are positions within this text's own token list,
            # not IDs in a shared vocabulary, so they are not comparable across documents.
            from collections import Counter
            import re

            words = re.findall(r'\w+', text.lower())
            word_counts = Counter(words)
            return {
                "indices": list(range(len(word_counts))),
                "values": list(word_counts.values()),
                "tokens": list(word_counts.keys())
            }

When to Use Multi-Vector RAG:
Technical documentation with exact terminology
Need both semantic and keyword matching
Rich metadata available for filtering
Mixed content types (code, docs, comments)
Require fine-tuned relevance control
Limitations:
Requires storing multiple embeddings per document
More storage and indexing overhead
Weight tuning requires experimentation
Sparse embeddings need specialized infrastructure
More complex than single-vector approaches
Performance Tips:
Weight Tuning: Start with dense=0.5, sparse=0.3, metadata=0.2, adjust based on domain
Sparse Embeddings: Use BM25 or SPLADE for production
Metadata Indexing: Index frequently filtered fields separately
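For readers wondering what "use BM25" means in practice, here is a compact sketch of Okapi BM25 scoring in pure Python. It uses the common IDF variant with +1 inside the logarithm and the usual defaults k1=1.5 and b=0.75. The function names and sample documents are hypothetical, and a real deployment would more likely rely on a search engine or library than on hand-rolled scoring.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on word characters."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents need more hits for the same score.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


sample_docs = [
    "Python async await error handling guide",
    "Cooking pasta at home",
    "Handling errors in Go",
]
sample_scores = bm25_scores("async await error", sample_docs)
```

Only the first document scores above zero here: exact-token matching treats "errors" and "error" as different terms, which is the kind of gap the dense vectors are meant to cover.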
What It Is
Graph RAG traverses knowledge graphs to follow relationships and discover connected information. It understands how entities relate to each other, enabling multi-hop reasoning and discovery of indirectly related information.
Real-World Example
Use Case: Company knowledge base
User asks: “Who are the key engineers working on projects related to machine learning?”
System:
Selecting the appropriate RAG architecture depends on your use case, complexity requirements, and performance needs. Use this guide to make the right choice.
Use this flowchart logic to pick the right RAG type:
1. Is the user asking relationship/connection questions?
   ("Who reports to the VP who launched project X?")
   YES -> Graph RAG
2. Does the query require synthesizing 3+ distinct topics?
   ("Compare our Q3 revenue, churn rate, and NPS trends")
   YES -> Agentic RAG
3. Do you need exact keyword matching AND semantic search?
   (Technical docs with function names, error codes, SKUs)
   YES -> Multi-Vector RAG
4. Is this a multi-turn conversation with returning users?
   YES -> Memory RAG
5. Are users reporting irrelevant results or low accuracy?
   YES -> Advanced RAG
6. None of the above -> Basic RAG (start here and upgrade)
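The same decision logic can be written as a small function, which is handy if you want to record the choice in a design doc or test it. This is only a sketch: the `Requirements` fields and the name `choose_rag_type` are hypothetical labels for the six questions above, checked in the same order.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """One flag per question in the flowchart (hypothetical field names)."""
    relationship_questions: bool = False
    multi_topic_synthesis: bool = False
    exact_and_semantic_match: bool = False
    multi_turn_returning_users: bool = False
    accuracy_complaints: bool = False


def choose_rag_type(req: Requirements) -> str:
    """Return the first RAG type whose question is answered YES."""
    if req.relationship_questions:
        return "Graph RAG"
    if req.multi_topic_synthesis:
        return "Agentic RAG"
    if req.exact_and_semantic_match:
        return "Multi-Vector RAG"
    if req.multi_turn_returning_users:
        return "Memory RAG"
    if req.accuracy_complaints:
        return "Advanced RAG"
    return "Basic RAG"


print(choose_rag_type(Requirements(multi_turn_returning_users=True)))  # Memory RAG
```

Because the checks run top to bottom, a system that needs both graph traversal and memory resolves to Graph RAG first; in practice the patterns can be combined.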
Building production-ready RAG systems requires careful consideration of infrastructure, costs, performance, and monitoring. Here’s what you need to know.
Pitfall 1: “Chunk Boundaries Cut Off Important Info”
Problem: The answer spans 2 chunks, and neither chunk alone contains enough information to be retrieved. This is the most common silent failure in RAG systems. It looks like “no results found,” but the information is actually there, just split across a boundary.
Chunk 1: "...The policy states that remote work..."
Chunk 2: "...is allowed up to 3 days per week."
Solution: Use overlapping chunks (the overlap region contains both pieces)
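A minimal sketch of overlapping chunking follows. It splits on whitespace and counts words rather than tokens, which is an assumption made to keep the example dependency-free. The function name `chunk_with_overlap` and the tiny chunk sizes are for illustration only.

```python
from typing import List


def chunk_with_overlap(words: List[str], chunk_size: int = 300,
                       overlap: int = 50) -> List[List[str]]:
    """Split a word list into chunks where each chunk repeats the tail of the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "The policy states that remote work is allowed up to 3 days per week."
chunks = chunk_with_overlap(text.split(), chunk_size=8, overlap=4)
for chunk in chunks:
    print(" ".join(chunk))
```

With these settings the middle chunk reads "remote work is allowed up to 3 days", so the fact that was split across the boundary now sits inside a single retrievable chunk.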
Pitfall 2: “Too Many Irrelevant Results”
Problem: Top-k returns junk documents
Solutions:
Similarity Threshold: Only use docs whose similarity exceeds a threshold (e.g., 0.7; tune it for your embedding model)
Metadata Filtering: Filter by date, category, etc.
Re-ranking: Use LLM to score relevance
Better Chunking: Smaller, more focused chunks
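The similarity threshold can be applied as a filter before top-k truncation. The sketch below computes cosine similarity by hand over small made-up vectors. The function names and the 0.7 default are illustrative, and many vector databases can apply a score cutoff for you.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(query_vec: Sequence[float],
                         docs: List[Tuple[str, Sequence[float]]],
                         threshold: float = 0.7,
                         top_k: int = 5) -> List[Tuple[str, float]]:
    """Return up to top_k (doc_id, score) pairs whose score exceeds the threshold."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs]
    relevant = [pair for pair in scored if pair[1] > threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return relevant[:top_k]


toy_docs = [("a", [1.0, 0.0]), ("b", [1.0, 1.0]), ("c", [0.0, 1.0])]
hits = filter_by_similarity([1.0, 0.0], toy_docs)
```

Here "c" is orthogonal to the query and is dropped even though top_k would have had room for it, which is the point: returning fewer documents beats padding the context with junk.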
Pitfall 3: “Hallucinations Despite RAG”
Problem: LLM makes things up even with sources provided. This is the most dangerous RAG failure because the answer looks authoritative and cited, but the actual claims don’t appear in the sources. It happens because the LLM’s parametric knowledge “leaks through” even when instructed to stick to sources.
Solutions:
Stricter System Prompt: “ONLY use information from the provided sources. If the answer is not in the sources, say ‘I don’t have information about that in the provided sources.’ Do NOT use your general knowledge.”
Temperature = 0: Reduces creative drift from source material
Post-processing: Programmatically verify that key claims in the answer actually appear in the retrieved sources
Citation Requirement: Force [Source N] citations — this makes hallucination easier to detect and catches the model when it invents references
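A lightweight version of the post-processing and citation checks can be done with regular expressions. The sketch below assumes answers use the [Source N] format from the earlier prompts. The function names are hypothetical, and the check only verifies that citations exist and point at real sources, not that the cited source actually supports the claim.

```python
import re
from typing import List, Set

CITATION = re.compile(r"\[Source (\d+)\]")


def invalid_citations(answer: str, num_sources: int) -> Set[int]:
    """Return cited source numbers that do not correspond to a provided source."""
    cited = {int(n) for n in CITATION.findall(answer)}
    return {n for n in cited if not 1 <= n <= num_sources}


def uncited_sentences(answer: str) -> List[str]:
    """Return sentences that carry no [Source N] citation at all."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not CITATION.search(s)]


answer = (
    "Employees may work remotely 3 days per week [Source 1]. "
    "Manager approval is required [Source 4]. "
    "Offices close at noon."
)
bad_refs = invalid_citations(answer, num_sources=3)
unsupported = uncited_sentences(answer)
```

With three sources provided, the reference to Source 4 is flagged as invented and the last sentence is flagged as uncited; both are candidates for a retry or a stricter fallback answer.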
Pitfall 4: “Slow Query Times”
Problem: Takes 5-10 seconds per query
Solutions:
Use faster embedding models
Implement caching
Reduce top-k
Use async operations
Consider hybrid database with cache layer
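Caching is often the cheapest of these wins, since repeated questions re-embed identical text. The sketch below wraps a stand-in embedding function with Python's `functools.lru_cache`. `slow_embed` is a hypothetical placeholder for a real embedding API call, and an in-process cache like this only helps within a single worker.

```python
from functools import lru_cache
from typing import Tuple

CALL_COUNT = {"embed": 0}  # counts how often the slow path actually runs


def slow_embed(text: str) -> Tuple[float, ...]:
    """Hypothetical stand-in for a real (slow, billed) embedding API call."""
    CALL_COUNT["embed"] += 1
    return tuple(float(ord(ch)) for ch in text[:8])


@lru_cache(maxsize=10_000)
def cached_embed(text: str) -> Tuple[float, ...]:
    """Return a cached embedding; a tuple is returned so cached values stay immutable."""
    return slow_embed(text)


first = cached_embed("remote work policy")
second = cached_embed("remote work policy")  # served from the cache
print(CALL_COUNT["embed"])  # 1
```

For multi-process deployments, the same idea is usually moved to a shared store keyed by a hash of the text and the embedding model name.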
Pitfall 5: “Memory RAG Recalls Stale or Wrong Facts”
Problem: The long-term memory stores a user fact (“I’m vegetarian”) that the user later contradicts (“I started eating fish”). The system keeps using the outdated fact, producing irrelevant responses.
Solutions:
Timestamp all memory entries and decay old facts
When extracting new facts, check for contradictions with existing memory and overwrite
Allow users to explicitly view and delete their stored preferences
Set a maximum memory age (e.g., 90 days) after which facts require re-confirmation
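The timestamp, overwrite, and maximum-age ideas fit in a small in-memory store. This sketch simplifies contradiction handling to "a newer fact with the same key replaces the older one", which assumes facts have already been normalized to keys such as "diet". The `FactStore` class and its methods are hypothetical and unrelated to the `memory_store` interface used earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class Fact:
    value: str
    updated_at: datetime


class FactStore:
    """Keyed user facts with timestamps and a maximum age."""

    def __init__(self, max_age_days: int = 90):
        self.max_age = timedelta(days=max_age_days)
        self._facts: Dict[str, Fact] = {}

    def upsert(self, key: str, value: str, now: datetime) -> None:
        # A newer statement about the same key overwrites the old one.
        self._facts[key] = Fact(value, now)

    def get(self, key: str, now: datetime) -> Optional[str]:
        # Facts older than max_age are treated as unknown until re-confirmed.
        fact = self._facts.get(key)
        if fact is None or now - fact.updated_at > self.max_age:
            return None
        return fact.value

    def delete(self, key: str) -> None:
        # Lets users remove a stored preference explicitly.
        self._facts.pop(key, None)


t0 = datetime(2024, 1, 1)
store = FactStore(max_age_days=90)
store.upsert("diet", "vegetarian", now=t0)
store.upsert("diet", "eats fish", now=t0 + timedelta(days=10))
current = store.get("diet", now=t0 + timedelta(days=20))
expired = store.get("diet", now=t0 + timedelta(days=200))
```

Passing `now` explicitly keeps the store easy to test; real contradiction detection between free-text facts would need an extra comparison step, for example an LLM check against existing entries.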
Pitfall 6: “Agentic RAG Gets Stuck in Loops”
Problem: The planner keeps issuing search queries that return the same documents, never deciding it has enough information to answer.
Solutions:
Deduplicate retrieved documents across iterations — skip docs already in context
Add a “diminishing returns” check: if the last search returned zero new documents, force an answer
Include retrieved document count and content summaries in the planning prompt so the model knows what it already has
Set max_iterations conservatively (3-5) and implement a _force_answer fallback
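The first two solutions can share one helper: deduplicate on document ID and report how many new documents arrived, so the loop can stop when a search adds nothing. This is a sketch assuming each document dict has an "id" key. The helper name is hypothetical, and wiring it into the agent loop is left to the caller.

```python
from typing import Dict, List, Set


def add_new_documents(context: List[Dict], seen_ids: Set[str],
                      retrieved: List[Dict]) -> int:
    """Append unseen documents to context and return how many were new."""
    new_count = 0
    for doc in retrieved:
        if doc["id"] in seen_ids:
            continue  # already in context from an earlier iteration
        seen_ids.add(doc["id"])
        context.append(doc)
        new_count += 1
    return new_count


context: List[Dict] = []
seen: Set[str] = set()
first_round = add_new_documents(context, seen, [{"id": "d1"}, {"id": "d2"}])
second_round = add_new_documents(context, seen, [{"id": "d2"}, {"id": "d1"}])
# second_round == 0 signals diminishing returns: stop searching and answer.
```

Inside the loop, a zero return value is the cue to call the forced-answer fallback instead of planning another search.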