RAG has evolved far beyond simple “retrieve and generate.” Modern RAG systems use sophisticated architectures to handle complex queries, maintain context, and deliver accurate, grounded responses.
What It Is
Basic RAG is the simplest form: search for relevant documents using vector similarity, then feed them to an LLM to generate an answer. It’s like asking a librarian who quickly finds relevant books and summarizes them for you.

Real-World Example
Use Case: Company documentation Q&A
Employee asks: “What’s our remote work policy?”
System finds: 3 policy documents mentioning remote work
LLM summarizes: “According to company policy, employees can work remotely up to 3 days per week with manager approval…”
from openai import OpenAI
from typing import List
import json


class BasicRAG:
    """Simple retrieve-and-generate RAG"""

    def __init__(self, database):
        self.client = OpenAI()
        self.db = database

    def query(self, question: str, top_k: int = 5) -> str:
        # 1. Embed the question
        embedding = self._embed(question)

        # 2. Retrieve similar documents
        docs = self.db.vector_search(embedding, top_k=top_k)

        # 3. Build context
        context = "\n\n".join([
            f"[Source {i+1}]: {doc['content']}"
            for i, doc in enumerate(docs)
        ])

        # 4. Generate answer
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Answer the question based only on the provided sources.
                    Cite sources using [Source N] format."""
                },
                {
                    "role": "user",
                    "content": f"Sources:\n{context}\n\nQuestion: {question}"
                }
            ]
        )
        return response.choices[0].message.content

    def _embed(self, text: str) -> List[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
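BasicRAG leaves the database abstract; the only interface it relies on is vector_search(embedding, top_k) returning dicts with a "content" field. A toy in-memory stand-in for that interface (illustrative only, using brute-force cosine similarity) could look like this:

import numpy as np  # used only for the brute-force cosine similarity below


class ToyVectorStore:
    """Minimal in-memory store exposing the vector_search interface BasicRAG expects."""

    def __init__(self):
        self.docs = []  # each entry: {"content": str, "embedding": np.ndarray}

    def add(self, content: str, embedding: list):
        self.docs.append({"content": content, "embedding": np.array(embedding)})

    def vector_search(self, embedding: list, top_k: int = 5) -> list:
        query = np.array(embedding)
        ranked = sorted(
            self.docs,
            key=lambda d: float(
                np.dot(d["embedding"], query)
                / (np.linalg.norm(d["embedding"]) * np.linalg.norm(query))
            ),
            reverse=True,
        )
        return ranked[:top_k]

Swap this out for a real vector database in anything beyond a prototype.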
When to Use Basic RAG:
Simple Q&A over documentation
Small to medium document collections
Well-formed, specific questions
MVP/prototype stage
Limitations:
Single retrieval step may miss context
No query understanding or rewriting
Limited handling of complex queries
No reasoning over multiple documents
Performance Tips:
Chunk Size: Keep chunks 200-500 tokens for best results (a chunking sketch follows this list)
Embedding Model: text-embedding-3-small is fast and cheap
Top-K: Start with 3-5 documents, adjust based on answer quality
Temperature: Use 0.0 for factual answers, 0.3-0.7 for creative responses
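For the chunk-size tip, a minimal token-based chunker, assuming the tiktoken package is available; the 400-token window and 50-token overlap are just illustrative values in the middle of the 200-500 range:

import tiktoken


def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly max_tokens tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        start += max_tokens - overlap  # slide with overlap so context isn't cut mid-thought
    return chunks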
What It Is
Advanced RAG improves accuracy by adding query processing, hybrid search (vector + keyword), and re-ranking. It’s like having a smart librarian who understands your question better, searches multiple ways, and ranks results by relevance.

Real-World Example
Use Case: Legal document search
Lawyer asks: “Cases about contract breach in California”
System expands to: [“contract breach California”, “contractual violations CA”, “breach of agreement California courts”]
Searches using both semantic similarity AND keyword matching
Re-ranks results by legal relevance
Returns top 5 most relevant cases
class AdvancedRAG:
    """RAG with query expansion, hybrid search, and re-ranking"""

    def __init__(self, database):
        self.client = OpenAI()
        self.db = database

    async def query(self, question: str) -> dict:
        # 1. Query Processing
        processed_queries = await self._process_query(question)

        # 2. Hybrid Retrieval (Vector + Keyword)
        all_docs = []
        for q in processed_queries:
            vector_docs = await self._vector_search(q)
            keyword_docs = await self._keyword_search(q)
            all_docs.extend(vector_docs + keyword_docs)

        # 3. Reciprocal Rank Fusion
        fused_docs = self._rrf_fusion(all_docs)

        # 4. Re-ranking (LLM-as-judge here; a cross-encoder also works)
        reranked_docs = await self._rerank(question, fused_docs[:20])

        # 5. Generate with best documents
        answer = await self._generate(question, reranked_docs[:5])

        return {
            "answer": answer,
            "sources": reranked_docs[:5],
            "query_expansions": processed_queries
        }

    async def _process_query(self, question: str) -> List[str]:
        """Expand query into multiple search variations"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Generate 3 alternative phrasings of this question to improve search.
                    Return JSON: {"queries": ["...", "...", "..."]}"""
                },
                {"role": "user", "content": question}
            ],
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)
        return [question] + result.get("queries", [])

    def _rrf_fusion(self, docs: List[dict], k: int = 60) -> List[dict]:
        """Reciprocal Rank Fusion for combining multiple rankings"""
        scores = {}
        doc_map = {}

        for doc in docs:
            doc_id = doc['id']
            if doc_id not in scores:
                scores[doc_id] = 0
                doc_map[doc_id] = doc
            # RRF score: 1 / (k + rank)
            rank = doc.get('rank', 1)
            scores[doc_id] += 1 / (k + rank)

        # Sort by fused score
        sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
        return [doc_map[doc_id] for doc_id in sorted_ids]

    async def _rerank(self, question: str, docs: List[dict]) -> List[dict]:
        """Re-rank documents using LLM as judge"""
        doc_texts = "\n".join([
            f"[{i+1}] {doc['content'][:500]}"
            for i, doc in enumerate(docs)
        ])

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Rate relevance of each document to the question (0-10).
                    Return JSON: {"rankings": [{"doc": 1, "score": 8}, ...]}"""
                },
                {
                    "role": "user",
                    "content": f"Question: {question}\n\nDocuments:\n{doc_texts}"
                }
            ],
            response_format={"type": "json_object"}
        )
        rankings = json.loads(response.choices[0].message.content)

        # Apply scores and re-sort
        for ranking in rankings.get("rankings", []):
            idx = ranking["doc"] - 1
            if 0 <= idx < len(docs):
                docs[idx]["rerank_score"] = ranking["score"]

        return sorted(docs, key=lambda x: x.get("rerank_score", 0), reverse=True)
When to Use Advanced RAG:
Production systems requiring high accuracy
Diverse queries with varying terminology
Technical domains with specific jargon
Ambiguous or complex user questions
Need for better precision and recall
Limitations:
Higher cost than Basic RAG (~2-3x)
Increased latency due to multiple processing steps
Requires more infrastructure (re-ranking models)
More complex to implement and maintain
May be overkill for simple use cases
Performance Tips:
Query Expansion: Use gpt-4o-mini for cost efficiency
Re-ranking: Only re-rank top 20 candidates to balance cost/quality
Hybrid Search: Start by weighting vector results 0.7 and keyword results 0.3, then tune per domain (a weighted-fusion sketch follows this list)
RRF Fusion: k=60 is the commonly used default and a good starting point
Caching: Cache query expansions and common re-ranking results
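The fusion step in AdvancedRAG above uses unweighted RRF; if you want the explicit 0.7/0.3 split from the hybrid-search tip, a minimal sketch looks like the following. It assumes each result dict carries an "id" and a similarity "score" normalized to 0-1; both field names are assumptions about your store, not part of the example above.

def weighted_hybrid_fusion(vector_results, keyword_results,
                           vector_weight=0.7, keyword_weight=0.3):
    """Combine vector and keyword results with a fixed weight split."""
    scores, docs = {}, {}
    for doc in vector_results:
        docs[doc["id"]] = doc
        scores[doc["id"]] = scores.get(doc["id"], 0.0) + vector_weight * doc.get("score", 0.0)
    for doc in keyword_results:
        docs[doc["id"]] = doc
        scores[doc["id"]] = scores.get(doc["id"], 0.0) + keyword_weight * doc.get("score", 0.0)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked]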
What It Is
Memory RAG adds conversation history and user context, enabling personalized, context-aware responses. It’s like talking to someone who remembers your previous conversations and preferences.

Real-World Example
Use Case: Personal health assistant
First conversation: “I have diabetes. What should I eat for breakfast?”
System remembers: User has diabetes, prefers quick meals
Later conversation: “What about lunch?”
System uses memory: Suggests low-carb lunches, remembers breakfast preferences
from datetime import datetime
from typing import Optional


class MemoryRAG:
    """RAG with short-term and long-term memory"""

    def __init__(self, database, memory_store):
        self.client = OpenAI()
        self.db = database
        self.memory = memory_store

    async def query(
        self,
        question: str,
        user_id: str,
        session_id: str
    ) -> dict:
        # 1. Get short-term memory (conversation context)
        conversation = await self.memory.get_conversation(session_id, limit=10)

        # 2. Get long-term memory (user preferences, facts)
        user_context = await self.memory.get_user_context(user_id)

        # 3. Contextualize the query
        contextualized_query = await self._contextualize_query(
            question, conversation, user_context
        )

        # 4. Retrieve with contextualized query
        docs = await self._retrieve(contextualized_query)

        # 5. Generate with full context
        answer = await self._generate(
            question=question,
            documents=docs,
            conversation=conversation,
            user_context=user_context
        )

        # 6. Update memory
        await self._update_memory(
            user_id=user_id,
            session_id=session_id,
            question=question,
            answer=answer
        )

        return {"answer": answer, "sources": docs}

    async def _contextualize_query(
        self,
        question: str,
        conversation: List[dict],
        user_context: dict
    ) -> str:
        """Rewrite query with conversation context"""
        if not conversation:
            return question

        history = "\n".join([
            f"{'User' if m['role'] == 'user' else 'Assistant'}: {m['content']}"
            for m in conversation[-5:]  # Last 5 turns
        ])

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Given the conversation history, rewrite the user's question to be self-contained.
                    Include relevant context from the conversation.
                    Return only the rewritten question."""
                },
                {
                    "role": "user",
                    "content": f"""Conversation:
{history}

User preferences: {json.dumps(user_context.get('preferences', {}))}

Current question: {question}

Rewritten question:"""
                }
            ]
        )
        return response.choices[0].message.content

    async def _update_memory(
        self,
        user_id: str,
        session_id: str,
        question: str,
        answer: str
    ):
        """Update both short-term and long-term memory"""
        # Short-term: conversation history
        await self.memory.add_message(session_id, "user", question)
        await self.memory.add_message(session_id, "assistant", answer)

        # Long-term: extract and store facts
        facts = await self._extract_facts(question, answer)
        for fact in facts:
            await self.memory.add_user_fact(user_id, fact)

    async def _extract_facts(self, question: str, answer: str) -> List[dict]:
        """Extract memorable facts from conversation"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Extract user preferences or facts to remember.
                    Return JSON: {"facts": [{"type": "preference|fact", "content": "..."}]}
                    Only include genuinely useful information. Return empty if nothing notable."""
                },
                {
                    "role": "user",
                    "content": f"User asked: {question}\nResponse: {answer}"
                }
            ],
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)
        return result.get("facts", [])
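The memory_store passed to MemoryRAG is left abstract above. A toy in-memory version of the interface it calls (get_conversation, add_message, get_user_context, add_user_fact) might look like the sketch below; a production version would persist to a database and use vector search for semantic memory lookup, as noted in the tips further down.

from collections import defaultdict


class InMemoryMemoryStore:
    """Toy memory store matching the interface MemoryRAG expects.
    A real deployment would persist this and add semantic retrieval."""

    def __init__(self):
        self.conversations = defaultdict(list)   # session_id -> list of messages
        self.user_facts = defaultdict(list)      # user_id -> list of fact dicts

    async def get_conversation(self, session_id: str, limit: int = 10) -> list:
        return self.conversations[session_id][-limit:]

    async def add_message(self, session_id: str, role: str, content: str):
        self.conversations[session_id].append({"role": role, "content": content})

    async def get_user_context(self, user_id: str) -> dict:
        return {"preferences": {}, "facts": self.user_facts[user_id]}

    async def add_user_fact(self, user_id: str, fact: dict):
        self.user_facts[user_id].append(fact)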
When to Use Memory RAG:
Building chatbots or conversational assistants
Users have repeat interactions with the system
Personalization improves user experience
Multi-turn conversations with context dependencies
Need to remember user preferences and facts
Limitations:
Requires memory storage infrastructure
Privacy concerns with storing user data
Memory can become stale or incorrect over time
Additional cost for memory operations
More complex state management
Performance Tips:
Short-term Memory: Keep last 5-10 conversation turns for context
Long-term Memory: Extract only meaningful facts, not every detail
Memory Retrieval: Use vector search for semantic memory lookup
Memory Updates: Batch updates to reduce database calls
Context Window: Limit conversation history to avoid token limits
What It Is
Agentic RAG uses iterative reasoning to answer complex, multi-step questions. The system can plan multiple retrieval steps, synthesize information from different sources, and decide when it has enough information to provide an answer.

Real-World Example
Use Case: Research assistant for academic papers
User asks: “Compare the effectiveness of transformer models vs RNNs for machine translation, considering recent papers from 2023-2024”
System plans:
Step 1: Search for “transformer models machine translation”
Step 2: Search for “RNN machine translation comparison”
Step 3: Search for “transformer vs RNN 2023 2024”
Step 4: Synthesize findings and compare
System answers: Comprehensive comparison with citations from multiple sources
from enum import Enum


class Action(Enum):
    SEARCH = "search"
    ANSWER = "answer"
    CLARIFY = "clarify"


class AgenticRAG:
    """RAG with iterative reasoning and multi-hop retrieval"""

    def __init__(self, database):
        self.client = OpenAI()
        self.db = database
        self.max_iterations = 5

    async def query(self, question: str) -> dict:
        context = []
        reasoning_chain = []

        for i in range(self.max_iterations):
            # Decide next action
            action, action_input = await self._plan_action(
                question=question,
                context=context,
                reasoning=reasoning_chain
            )

            reasoning_chain.append({
                "step": i + 1,
                "action": action.value,
                "input": action_input
            })

            if action == Action.ANSWER:
                # Ready to answer
                return {
                    "answer": action_input,
                    "reasoning": reasoning_chain,
                    "sources": context
                }
            elif action == Action.SEARCH:
                # Retrieve more information
                docs = await self._retrieve(action_input)
                context.extend(docs)
                reasoning_chain[-1]["result"] = f"Found {len(docs)} documents"
            elif action == Action.CLARIFY:
                # Need clarification
                return {
                    "answer": None,
                    "clarification_needed": action_input,
                    "reasoning": reasoning_chain
                }

        # Max iterations reached
        return {
            "answer": await self._force_answer(question, context),
            "reasoning": reasoning_chain,
            "sources": context,
            "warning": "Max iterations reached"
        }

    async def _plan_action(
        self,
        question: str,
        context: List[dict],
        reasoning: List[dict]
    ) -> tuple[Action, str]:
        """Decide what to do next"""
        context_summary = "\n".join([
            f"- {doc['content'][:200]}..."
            for doc in context[-5:]
        ]) if context else "No information gathered yet."

        reasoning_str = "\n".join([
            f"Step {r['step']}: {r['action']} - {r.get('result', r.get('input', ''))}"
            for r in reasoning
        ]) if reasoning else "No steps taken yet."

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """You are a reasoning agent. Decide what to do next:
1. SEARCH: Need more information. Provide a search query.
2. ANSWER: Have enough info to answer. Provide the answer.
3. CLARIFY: Question is ambiguous. Provide clarification request.

Return JSON: {"action": "search|answer|clarify", "input": "..."}

Be thorough—search multiple times if needed for complex questions."""
                },
                {
                    "role": "user",
                    "content": f"""Question: {question}

Information gathered:
{context_summary}

Reasoning so far:
{reasoning_str}

What should I do next?"""
                }
            ],
            response_format={"type": "json_object"}
        )
        result = json.loads(response.choices[0].message.content)
        action = Action(result["action"])
        return action, result["input"]

    async def _retrieve(self, query: str) -> List[dict]:
        """Search for documents"""
        embedding = self._embed(query)
        return self.db.vector_search(embedding, top_k=5)
When to Use Agentic RAG:
Complex research questions requiring multiple sources
Questions that need synthesis across documents
Multi-hop reasoning (“who works at company that acquired X”)
Comparative analysis questions
Questions requiring iterative information gathering
Limitations:
Highest cost among RAG types ($15-40 per 1000 queries)
Slowest latency due to multiple iterations
Can get stuck in loops if max_iterations too high
Requires careful prompt engineering for action planning
More complex to debug and monitor
Performance Tips:
Max Iterations: Set to 3-5 for most use cases
Action Planning: Use gpt-4o for better reasoning, gpt-4o-mini for cost savings
Early Stopping: Implement confidence thresholds to stop early (a sketch follows this list)
Query Generation: Cache common query patterns
Monitoring: Track iteration count and reasoning chains for optimization
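One way to implement the early-stopping tip is a cheap judge call between iterations. The sketch below is illustrative: the 0-1 confidence scale, the 0.8 threshold, and the prompt are assumptions, not part of the AgenticRAG example above.

import json
from openai import OpenAI


def confident_enough(client: OpenAI, question: str, context_snippets: list[str],
                     threshold: float = 0.8) -> bool:
    """Return True if a cheap judge model is confident the context answers the question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": 'Rate 0-1 how fully the provided context answers the question. '
                        'Return JSON: {"confidence": 0.0}'},
            {"role": "user",
             "content": f"Question: {question}\n\nContext:\n" + "\n".join(context_snippets)},
        ],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("confidence", 0.0) >= threshold

Called at the top of each loop iteration in AgenticRAG.query, it lets the agent skip straight to generating an answer once the judge is confident enough.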
What It Is
Multi-Vector RAG uses multiple embedding types (dense, sparse, metadata) for precise retrieval. By combining semantic understanding, exact keyword matching, and structured metadata, it provides more accurate and flexible search capabilities.

Real-World Example
Use Case: Technical documentation search
User asks: “Python async/await error handling”
System searches:
Dense vectors: Find semantically similar docs about async programming
Sparse vectors: Match exact keywords “async”, “await”, “error”
Metadata vectors: Filter by language=“Python”, category=“error-handling”
System combines: Weighted fusion returns most relevant technical docs
class MultiVectorRAG:
    """RAG with dense, sparse, and metadata vectors"""

    def __init__(self, database):
        self.client = OpenAI()
        self.db = database

    async def index_document(self, doc_id: str, content: str, metadata: dict):
        """Index with multiple vector types"""
        # 1. Dense embedding (semantic)
        dense_embedding = self._embed_dense(content)

        # 2. Sparse embedding (BM25/keywords)
        sparse_embedding = self._embed_sparse(content)

        # 3. Metadata embedding
        metadata_text = " ".join([
            f"{k}: {v}" for k, v in metadata.items()
        ])
        metadata_embedding = self._embed_dense(metadata_text)

        # Store all embeddings
        await self.db.store_multi_vector(
            doc_id=doc_id,
            content=content,
            dense=dense_embedding,
            sparse=sparse_embedding,
            metadata_vec=metadata_embedding,
            metadata=metadata
        )

    async def query(
        self,
        question: str,
        filters: dict = None,
        weights: dict = None
    ) -> List[dict]:
        """Multi-vector retrieval with configurable weights"""
        weights = weights or {
            "dense": 0.5,
            "sparse": 0.3,
            "metadata": 0.2
        }

        # Embed query
        query_dense = self._embed_dense(question)
        query_sparse = self._embed_sparse(question)

        # Search with each vector type
        dense_results = await self.db.search(
            vector=query_dense,
            vector_type="dense",
            filters=filters
        )
        sparse_results = await self.db.search(
            vector=query_sparse,
            vector_type="sparse",
            filters=filters
        )

        # If metadata filters provided, search metadata vectors
        if filters:
            filter_text = " ".join([f"{k}: {v}" for k, v in filters.items()])
            filter_vec = self._embed_dense(filter_text)
            metadata_results = await self.db.search(
                vector=filter_vec,
                vector_type="metadata"
            )
        else:
            metadata_results = []

        # Weighted combination
        combined = self._weighted_fusion(
            dense_results, sparse_results, metadata_results, weights
        )
        return combined

    def _weighted_fusion(
        self,
        dense: List[dict],
        sparse: List[dict],
        metadata: List[dict],
        weights: dict
    ) -> List[dict]:
        """Combine results with weighted scoring"""
        scores = {}
        docs = {}

        for rank, doc in enumerate(dense):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + weights["dense"] / (rank + 1)

        for rank, doc in enumerate(sparse):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + weights["sparse"] / (rank + 1)

        for rank, doc in enumerate(metadata):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + weights["metadata"] / (rank + 1)

        sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
        return [docs[doc_id] for doc_id in sorted_ids]

    def _embed_sparse(self, text: str) -> dict:
        """Create sparse BM25-style embedding"""
        # Simplified—use actual BM25 or SPLADE in production
        from collections import Counter
        import re

        words = re.findall(r'\w+', text.lower())
        word_counts = Counter(words)
        return {
            "indices": list(range(len(word_counts))),
            "values": list(word_counts.values()),
            "tokens": list(word_counts.keys())
        }
When to Use Multi-Vector RAG:
Technical documentation with exact terminology
Need both semantic and keyword matching
Rich metadata available for filtering
Mixed content types (code, docs, comments)
Require fine-tuned relevance control
Limitations:
Requires storing multiple embeddings per document
More storage and indexing overhead
Weight tuning requires experimentation
Sparse embeddings need specialized infrastructure
More complex than single-vector approaches
Performance Tips:
Weight Tuning: Start with dense=0.5, sparse=0.3, metadata=0.2, adjust based on domain
Sparse Embeddings: Use BM25 or SPLADE for production (see the sketch after this list)
Metadata Indexing: Index frequently filtered fields separately
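For the sparse-embedding tip, the rank_bm25 package is one lightweight option (SPLADE requires serving a model). A sketch of a keyword-side index that could replace the simplified _embed_sparse word counts above; the whitespace tokenizer and the "content"/"score" field names are simplifications:

from rank_bm25 import BM25Okapi  # assumes the rank_bm25 package is installed


class BM25Index:
    """Keyword-side index for the sparse leg of multi-vector retrieval."""

    def __init__(self, documents: list[dict]):
        self.documents = documents
        tokenized = [doc["content"].lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        scores = self.bm25.get_scores(query.lower().split())
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [{**self.documents[i], "score": float(scores[i])} for i in ranked]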
What It Is
Graph RAG traverses knowledge graphs to follow relationships and discover connected information. It understands how entities relate to each other, enabling multi-hop reasoning and discovery of indirectly related information.

Real-World Example
Use Case: Company knowledge base
User asks: “Who are the key engineers working on projects related to machine learning?”
System: starts from entities mentioned in the question (machine learning, projects), follows relationship edges in the knowledge graph (project → works-on → engineer), and returns the engineers connected to machine-learning projects, even when no single document lists them together.
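The traversal step can be sketched with networkx; the graph, the node attributes, and the two-hop cutoff below are all illustrative assumptions rather than part of the original example:

import networkx as nx


def graph_retrieve(graph: nx.Graph, seed_entities: list[str], max_hops: int = 2) -> list[dict]:
    """Collect entities reachable within max_hops of the query's seed entities."""
    results = []
    for seed in seed_entities:
        if seed not in graph:
            continue
        # distances to every node reachable within max_hops of the seed
        reachable = nx.single_source_shortest_path_length(graph, seed, cutoff=max_hops)
        for node, hops in reachable.items():
            if node == seed:
                continue
            results.append({
                "entity": node,
                "seed": seed,
                "hops": hops,
                "attributes": dict(graph.nodes[node]),
            })
    return results

# e.g. seed with entities extracted from the question, such as ["machine learning"],
# then keep nodes whose attributes mark them as engineers on connected projects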
Selecting the appropriate RAG architecture depends on your use case, complexity requirements, and performance needs. Use this guide to make the right choice.
Building production-ready RAG systems requires careful consideration of infrastructure, costs, performance, and monitoring. Here’s what you need to know.