December 2025 Update: Covers the latest RAG patterns including Agentic RAG, Graph RAG, and Multi-Vector approaches used in production systems.

The RAG Evolution

Retrieval-Augmented Generation (RAG) has evolved far beyond simple “retrieve and generate.” Modern RAG systems use sophisticated architectures to handle complex queries, maintain context, and deliver accurate, grounded responses.
RAG Evolution Timeline
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

2022          2023           2024           2025
 │             │              │              │
 ▼             ▼              ▼              ▼
Basic RAG → Advanced RAG → Agentic RAG → Graph RAG
             + Re-ranking    + Memory       + Knowledge Graphs
             + Hybrid        + Multi-hop    + Multi-Modal
                            + Self-RAG     

1. Basic RAG

What It Is

Basic RAG is the simplest form: search for relevant documents using vector similarity, then feed them to an LLM to generate an answer. It’s like asking a librarian who quickly finds relevant books and summarizes them for you.

Real-World Example

Use Case: Company documentation Q&A
Employee asks: “What’s our remote work policy?”
System finds: 3 policy documents mentioning remote work
LLM summarizes: “According to company policy, employees can work remotely up to 3 days per week with manager approval…”
from openai import OpenAI
from typing import List
import json

class BasicRAG:
    """Simple retrieve-and-generate RAG"""
    
    def __init__(self, database):
        self.client = OpenAI()
        self.db = database
    
    def query(self, question: str, top_k: int = 5) -> str:
        # 1. Embed the question
        embedding = self._embed(question)
        
        # 2. Retrieve similar documents
        docs = self.db.vector_search(embedding, top_k=top_k)
        
        # 3. Build context
        context = "\n\n".join([
            f"[Source {i+1}]: {doc['content']}"
            for i, doc in enumerate(docs)
        ])
        
        # 4. Generate answer
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Answer the question based only on the provided sources.
                    Cite sources using [Source N] format."""
                },
                {
                    "role": "user",
                    "content": f"Sources:\n{context}\n\nQuestion: {question}"
                }
            ]
        )
        
        return response.choices[0].message.content
    
    def _embed(self, text: str) -> List[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
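To make the flow concrete, here is a minimal usage sketch. The InMemoryDB class is a hypothetical stand-in for a real vector database: it only implements the vector_search(embedding, top_k) call that BasicRAG relies on, and policy_docs is assumed to be a list of pre-embedded documents.
import math

class InMemoryDB:
    """Toy stand-in for a real vector database (hypothetical, for illustration only)."""

    def __init__(self, docs):
        # docs: list of {"content": str, "embedding": List[float]}
        self.docs = docs

    def vector_search(self, embedding, top_k=5):
        # Brute-force cosine similarity against stored embeddings (no index).
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(self.docs, key=lambda d: cosine(embedding, d["embedding"]), reverse=True)
        return ranked[:top_k]

# rag = BasicRAG(InMemoryDB(policy_docs))
# print(rag.query("What's our remote work policy?"))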
When to Use Basic RAG:
  • Simple Q&A over documentation
  • Small to medium document collections
  • Well-formed, specific questions
  • MVP/prototype stage
Limitations:
  • Single retrieval step may miss context
  • No query understanding or rewriting
  • Limited handling of complex queries
  • No reasoning over multiple documents
Performance Tips:
  • Chunk Size: Keep chunks 200-500 tokens for best results
  • Embedding Model: text-embedding-3-small is fast and cheap
  • Top-K: Start with 3-5 documents, adjust based on answer quality
  • Temperature: Use 0.0 for factual answers, 0.3-0.7 for creative responses

2. Advanced RAG

What It Is

Advanced RAG improves accuracy by adding query processing, hybrid search (vector + keyword), and re-ranking. It’s like having a smart librarian who understands your question better, searches multiple ways, and ranks results by relevance.

Real-World Example

Use Case: Legal document search
Lawyer asks: “Cases about contract breach in California”
System expands to: [“contract breach California”, “contractual violations CA”, “breach of agreement California courts”]
Searches using both semantic similarity AND keyword matching
Re-ranks results by legal relevance
Returns top 5 most relevant cases
class AdvancedRAG:
    """RAG with query expansion, hybrid search, and re-ranking"""
    
    def __init__(self, database):
        self.client = OpenAI()
        self.db = database
    
    async def query(self, question: str) -> dict:
        # 1. Query Processing
        processed_queries = await self._process_query(question)
        
        # 2. Hybrid Retrieval (Vector + Keyword)
        all_docs = []
        for q in processed_queries:
            vector_docs = await self._vector_search(q)
            keyword_docs = await self._keyword_search(q)
            all_docs.extend(vector_docs + keyword_docs)
        
        # 3. Reciprocal Rank Fusion
        fused_docs = self._rrf_fusion(all_docs)
        
        # 4. Re-ranking with Cross-Encoder
        reranked_docs = await self._rerank(question, fused_docs[:20])
        
        # 5. Generate with best documents
        answer = await self._generate(question, reranked_docs[:5])
        
        return {
            "answer": answer,
            "sources": reranked_docs[:5],
            "query_expansions": processed_queries
        }
    
    async def _process_query(self, question: str) -> List[str]:
        """Expand query into multiple search variations"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Generate 3 alternative phrasings of this question to improve search.
                    Return JSON: {"queries": ["...", "...", "..."]}"""
                },
                {"role": "user", "content": question}
            ],
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        return [question] + result.get("queries", [])
    
    def _rrf_fusion(self, docs: List[dict], k: int = 60) -> List[dict]:
        """Reciprocal Rank Fusion for combining multiple rankings"""
        scores = {}
        doc_map = {}
        
        for doc in docs:
            doc_id = doc['id']
            if doc_id not in scores:
                scores[doc_id] = 0
                doc_map[doc_id] = doc
            
            # RRF score: 1 / (k + rank)
            rank = doc.get('rank', 1)
            scores[doc_id] += 1 / (k + rank)
        
        # Sort by fused score
        sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
        return [doc_map[doc_id] for doc_id in sorted_ids]
    
    async def _rerank(self, question: str, docs: List[dict]) -> List[dict]:
        """Re-rank documents using LLM as judge"""
        doc_texts = "\n".join([
            f"[{i+1}] {doc['content'][:500]}"
            for i, doc in enumerate(docs)
        ])
        
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Rate relevance of each document to the question (0-10).
                    Return JSON: {"rankings": [{"doc": 1, "score": 8}, ...]}"""
                },
                {
                    "role": "user",
                    "content": f"Question: {question}\n\nDocuments:\n{doc_texts}"
                }
            ],
            response_format={"type": "json_object"}
        )
        
        rankings = json.loads(response.choices[0].message.content)
        
        # Apply scores and re-sort
        for ranking in rankings.get("rankings", []):
            idx = ranking["doc"] - 1
            if 0 <= idx < len(docs):
                docs[idx]["rerank_score"] = ranking["score"]
        
        return sorted(docs, key=lambda x: x.get("rerank_score", 0), reverse=True)
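The class above omits its retrieval and generation helpers for brevity. A minimal sketch of what they might look like, assuming the database exposes async vector_search and keyword_search methods that return documents with id, content, and rank fields (these method names are assumptions, not a specific library API); they would sit inside the AdvancedRAG class:
    # Sketches of the helpers omitted from AdvancedRAG above (assumed database API).
    async def _vector_search(self, query: str, top_k: int = 10) -> List[dict]:
        embedding = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=query
        ).data[0].embedding
        return await self.db.vector_search(embedding, top_k=top_k)
    
    async def _keyword_search(self, query: str, top_k: int = 10) -> List[dict]:
        # Assumes the database exposes a BM25 / full-text search method.
        return await self.db.keyword_search(query, top_k=top_k)
    
    async def _generate(self, question: str, docs: List[dict]) -> str:
        context = "\n\n".join(
            f"[Source {i+1}]: {doc['content']}" for i, doc in enumerate(docs)
        )
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "Answer using only the provided sources. Cite with [Source N]."
                },
                {
                    "role": "user",
                    "content": f"Sources:\n{context}\n\nQuestion: {question}"
                }
            ]
        )
        return response.choices[0].message.content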
When to Use Advanced RAG:
  • Production systems requiring high accuracy
  • Diverse queries with varying terminology
  • Technical domains with specific jargon
  • Ambiguous or complex user questions
  • Need for better precision and recall
Limitations:
  • Higher cost than Basic RAG (~2-3x)
  • Increased latency due to multiple processing steps
  • Requires more infrastructure (re-ranking models)
  • More complex to implement and maintain
  • May be overkill for simple use cases
Performance Tips:
  • Query Expansion: Use gpt-4o-mini for cost efficiency
  • Re-ranking: Only re-rank top 20 candidates to balance cost/quality
  • Hybrid Search: Weight vector (0.7) and keyword (0.3) for best results
  • RRF Fusion: Use k=60 for optimal ranking combination
  • Caching: Cache query expansions and common re-ranking results

3. Memory RAG

What It Is

Memory RAG adds conversation history and user context, enabling personalized, context-aware responses. It’s like talking to someone who remembers your previous conversations and preferences.

Real-World Example

Use Case: Personal health assistant
First conversation: “I have diabetes. What should I eat for breakfast?”
System remembers: User has diabetes, prefers quick meals
Later conversation: “What about lunch?”
System uses memory: Suggests low-carb lunches, remembers breakfast preferences
from datetime import datetime
from typing import Optional

class MemoryRAG:
    """RAG with short-term and long-term memory"""
    
    def __init__(self, database, memory_store):
        self.client = OpenAI()
        self.db = database
        self.memory = memory_store
    
    async def query(
        self,
        question: str,
        user_id: str,
        session_id: str
    ) -> dict:
        # 1. Get short-term memory (conversation context)
        conversation = await self.memory.get_conversation(session_id, limit=10)
        
        # 2. Get long-term memory (user preferences, facts)
        user_context = await self.memory.get_user_context(user_id)
        
        # 3. Contextualize the query
        contextualized_query = await self._contextualize_query(
            question, 
            conversation,
            user_context
        )
        
        # 4. Retrieve with contextualized query
        docs = await self._retrieve(contextualized_query)
        
        # 5. Generate with full context
        answer = await self._generate(
            question=question,
            documents=docs,
            conversation=conversation,
            user_context=user_context
        )
        
        # 6. Update memory
        await self._update_memory(
            user_id=user_id,
            session_id=session_id,
            question=question,
            answer=answer
        )
        
        return {"answer": answer, "sources": docs}
    
    async def _contextualize_query(
        self,
        question: str,
        conversation: List[dict],
        user_context: dict
    ) -> str:
        """Rewrite query with conversation context"""
        if not conversation:
            return question
        
        history = "\n".join([
            f"{'User' if m['role'] == 'user' else 'Assistant'}: {m['content']}"
            for m in conversation[-5:]  # Last 5 turns
        ])
        
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Given the conversation history, rewrite the user's question
                    to be self-contained. Include relevant context from the conversation.
                    Return only the rewritten question."""
                },
                {
                    "role": "user",
                    "content": f"""Conversation:
{history}

User preferences: {json.dumps(user_context.get('preferences', {}))}

Current question: {question}

Rewritten question:"""
                }
            ]
        )
        
        return response.choices[0].message.content
    
    async def _update_memory(
        self,
        user_id: str,
        session_id: str,
        question: str,
        answer: str
    ):
        """Update both short-term and long-term memory"""
        # Short-term: conversation history
        await self.memory.add_message(session_id, "user", question)
        await self.memory.add_message(session_id, "assistant", answer)
        
        # Long-term: extract and store facts
        facts = await self._extract_facts(question, answer)
        for fact in facts:
            await self.memory.add_user_fact(user_id, fact)
    
    async def _extract_facts(self, question: str, answer: str) -> List[dict]:
        """Extract memorable facts from conversation"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Extract user preferences or facts to remember.
                    Return JSON: {"facts": [{"type": "preference|fact", "content": "..."}]}
                    Only include genuinely useful information. Return empty if nothing notable."""
                },
                {
                    "role": "user",
                    "content": f"User asked: {question}\nResponse: {answer}"
                }
            ],
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        return result.get("facts", [])
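The memory_store dependency is assumed to provide get_conversation, get_user_context, add_message, and add_user_fact. A minimal in-memory sketch of that interface (a production system would back this with Redis, Postgres, or a vector store):
from collections import defaultdict

class InMemoryMemoryStore:
    """Toy memory store matching the interface MemoryRAG expects above."""

    def __init__(self):
        self.conversations = defaultdict(list)  # session_id -> [{"role", "content"}]
        self.user_facts = defaultdict(list)     # user_id -> [fact dicts]

    async def get_conversation(self, session_id: str, limit: int = 10) -> List[dict]:
        return self.conversations[session_id][-limit:]

    async def get_user_context(self, user_id: str) -> dict:
        facts = self.user_facts[user_id]
        return {
            "preferences": [f["content"] for f in facts if f.get("type") == "preference"],
            "facts": [f["content"] for f in facts if f.get("type") == "fact"],
        }

    async def add_message(self, session_id: str, role: str, content: str):
        self.conversations[session_id].append({"role": role, "content": content})

    async def add_user_fact(self, user_id: str, fact: dict):
        self.user_facts[user_id].append(fact)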
When to Use Memory RAG:
  • Building chatbots or conversational assistants
  • Users have repeat interactions with the system
  • Personalization improves user experience
  • Multi-turn conversations with context dependencies
  • Need to remember user preferences and facts
Limitations:
  • Requires memory storage infrastructure
  • Privacy concerns with storing user data
  • Memory can become stale or incorrect over time
  • Additional cost for memory operations
  • More complex state management
Performance Tips:
  • Short-term Memory: Keep last 5-10 conversation turns for context
  • Long-term Memory: Extract only meaningful facts, not every detail
  • Memory Retrieval: Use vector search for semantic memory lookup
  • Memory Updates: Batch updates to reduce database calls
  • Context Window: Limit conversation history to avoid token limits

4. Agentic RAG

What It Is

Agentic RAG uses iterative reasoning to answer complex, multi-step questions. The system can plan multiple retrieval steps, synthesize information from different sources, and decide when it has enough information to provide an answer.

Real-World Example

Use Case: Research assistant for academic papers
User asks: “Compare the effectiveness of transformer models vs RNNs for machine translation, considering recent papers from 2023-2024”
System plans:
  • Step 1: Search for “transformer models machine translation”
  • Step 2: Search for “RNN machine translation comparison”
  • Step 3: Search for “transformer vs RNN 2023 2024”
  • Step 4: Synthesize findings and compare
System answers: Comprehensive comparison with citations from multiple sources
from enum import Enum

class Action(Enum):
    SEARCH = "search"
    ANSWER = "answer"
    CLARIFY = "clarify"

class AgenticRAG:
    """RAG with iterative reasoning and multi-hop retrieval"""
    
    def __init__(self, database):
        self.client = OpenAI()
        self.db = database
        self.max_iterations = 5
    
    async def query(self, question: str) -> dict:
        context = []
        reasoning_chain = []
        
        for i in range(self.max_iterations):
            # Decide next action
            action, action_input = await self._plan_action(
                question=question,
                context=context,
                reasoning=reasoning_chain
            )
            
            reasoning_chain.append({
                "step": i + 1,
                "action": action.value,
                "input": action_input
            })
            
            if action == Action.ANSWER:
                # Ready to answer
                return {
                    "answer": action_input,
                    "reasoning": reasoning_chain,
                    "sources": context
                }
            
            elif action == Action.SEARCH:
                # Retrieve more information
                docs = await self._retrieve(action_input)
                context.extend(docs)
                
                reasoning_chain[-1]["result"] = f"Found {len(docs)} documents"
            
            elif action == Action.CLARIFY:
                # Need clarification
                return {
                    "answer": None,
                    "clarification_needed": action_input,
                    "reasoning": reasoning_chain
                }
        
        # Max iterations reached
        return {
            "answer": await self._force_answer(question, context),
            "reasoning": reasoning_chain,
            "sources": context,
            "warning": "Max iterations reached"
        }
    
    async def _plan_action(
        self,
        question: str,
        context: List[dict],
        reasoning: List[dict]
    ) -> tuple[Action, str]:
        """Decide what to do next"""
        context_summary = "\n".join([
            f"- {doc['content'][:200]}..." for doc in context[-5:]
        ]) if context else "No information gathered yet."
        
        reasoning_str = "\n".join([
            f"Step {r['step']}: {r['action']} - {r.get('result', r.get('input', ''))}"
            for r in reasoning
        ]) if reasoning else "No steps taken yet."
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """You are a reasoning agent. Decide what to do next:

1. SEARCH: Need more information. Provide a search query.
2. ANSWER: Have enough info to answer. Provide the answer.
3. CLARIFY: Question is ambiguous. Provide clarification request.

Return JSON: {"action": "search|answer|clarify", "input": "..."}

Be thorough—search multiple times if needed for complex questions."""
                },
                {
                    "role": "user",
                    "content": f"""Question: {question}

Information gathered:
{context_summary}

Reasoning so far:
{reasoning_str}

What should I do next?"""
                }
            ],
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        action = Action(result["action"])
        return action, result["input"]
    
    async def _retrieve(self, query: str) -> List[dict]:
        """Search for documents"""
        embedding = self._embed(query)
        return self.db.vector_search(embedding, top_k=5)
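The _embed and _force_answer helpers are omitted above. A minimal sketch, reusing the same OpenAI calls as the earlier classes; they would sit inside the AgenticRAG class:
    # Sketches of the helpers omitted from AgenticRAG above.
    def _embed(self, text: str) -> List[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
    
    async def _force_answer(self, question: str, context: List[dict]) -> str:
        """Best-effort answer from whatever was gathered before the iteration cap."""
        gathered = "\n\n".join(doc["content"] for doc in context) or "No documents found."
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "Answer as best you can from the gathered information. "
                               "State clearly if the information is incomplete."
                },
                {
                    "role": "user",
                    "content": f"Question: {question}\n\nGathered information:\n{gathered}"
                }
            ]
        )
        return response.choices[0].message.content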
When to Use Agentic RAG:
  • Complex research questions requiring multiple sources
  • Questions that need synthesis across documents
  • Multi-hop reasoning (“who works at company that acquired X”)
  • Comparative analysis questions
  • Questions requiring iterative information gathering
Limitations:
  • Highest cost among RAG types ($15-40 per 1000 queries)
  • Slowest latency due to multiple iterations
  • Can get stuck in loops if max_iterations too high
  • Requires careful prompt engineering for action planning
  • More complex to debug and monitor
Performance Tips:
  • Max Iterations: Set to 3-5 for most use cases
  • Action Planning: Use gpt-4o for better reasoning, gpt-4o-mini for cost savings
  • Early Stopping: Implement confidence thresholds to stop early
  • Query Generation: Cache common query patterns
  • Monitoring: Track iteration count and reasoning chains for optimization

5. Multi-Vector RAG

What It Is

Multi-Vector RAG uses multiple embedding types (dense, sparse, metadata) for precise retrieval. By combining semantic understanding, exact keyword matching, and structured metadata, it provides more accurate and flexible search capabilities.

Real-World Example

Use Case: Technical documentation search
User asks: “Python async/await error handling”
System searches:
  • Dense vectors: Find semantically similar docs about async programming
  • Sparse vectors: Match exact keywords “async”, “await”, “error”
  • Metadata vectors: Filter by language=“Python”, category=“error-handling”
System combines: Weighted fusion returns most relevant technical docs
class MultiVectorRAG:
    """RAG with dense, sparse, and metadata vectors"""
    
    def __init__(self, database):
        self.client = OpenAI()
        self.db = database
    
    async def index_document(self, doc_id: str, content: str, metadata: dict):
        """Index with multiple vector types"""
        
        # 1. Dense embedding (semantic)
        dense_embedding = self._embed_dense(content)
        
        # 2. Sparse embedding (BM25/keywords)
        sparse_embedding = self._embed_sparse(content)
        
        # 3. Metadata embedding
        metadata_text = " ".join([
            f"{k}: {v}" for k, v in metadata.items()
        ])
        metadata_embedding = self._embed_dense(metadata_text)
        
        # Store all embeddings
        await self.db.store_multi_vector(
            doc_id=doc_id,
            content=content,
            dense=dense_embedding,
            sparse=sparse_embedding,
            metadata_vec=metadata_embedding,
            metadata=metadata
        )
    
    async def query(
        self,
        question: str,
        filters: dict = None,
        weights: dict = None
    ) -> List[dict]:
        """Multi-vector retrieval with configurable weights"""
        weights = weights or {
            "dense": 0.5,
            "sparse": 0.3,
            "metadata": 0.2
        }
        
        # Embed query
        query_dense = self._embed_dense(question)
        query_sparse = self._embed_sparse(question)
        
        # Search with each vector type
        dense_results = await self.db.search(
            vector=query_dense,
            vector_type="dense",
            filters=filters
        )
        
        sparse_results = await self.db.search(
            vector=query_sparse,
            vector_type="sparse",
            filters=filters
        )
        
        # If metadata filters provided, search metadata vectors
        if filters:
            filter_text = " ".join([f"{k}: {v}" for k, v in filters.items()])
            filter_vec = self._embed_dense(filter_text)
            metadata_results = await self.db.search(
                vector=filter_vec,
                vector_type="metadata"
            )
        else:
            metadata_results = []
        
        # Weighted combination
        combined = self._weighted_fusion(
            dense_results,
            sparse_results,
            metadata_results,
            weights
        )
        
        return combined
    
    def _weighted_fusion(
        self,
        dense: List[dict],
        sparse: List[dict],
        metadata: List[dict],
        weights: dict
    ) -> List[dict]:
        """Combine results with weighted scoring"""
        scores = {}
        docs = {}
        
        for rank, doc in enumerate(dense):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + weights["dense"] / (rank + 1)
        
        for rank, doc in enumerate(sparse):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + weights["sparse"] / (rank + 1)
        
        for rank, doc in enumerate(metadata):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + weights["metadata"] / (rank + 1)
        
        sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
        return [docs[doc_id] for doc_id in sorted_ids]
    
    def _embed_sparse(self, text: str) -> dict:
        """Create sparse BM25-style embedding"""
        # Simplified—use actual BM25 or SPLADE in production
        from collections import Counter
        import re
        
        words = re.findall(r'\w+', text.lower())
        word_counts = Counter(words)
        
        return {
            "indices": list(range(len(word_counts))),
            "values": list(word_counts.values()),
            "tokens": list(word_counts.keys())
        }
When to Use Multi-Vector RAG:
  • Technical documentation with exact terminology
  • Need both semantic and keyword matching
  • Rich metadata available for filtering
  • Mixed content types (code, docs, comments)
  • Require fine-tuned relevance control
Limitations:
  • Requires storing multiple embeddings per document
  • More storage and indexing overhead
  • Weight tuning requires experimentation
  • Sparse embeddings need specialized infrastructure
  • More complex than single-vector approaches
Performance Tips:
  • Weight Tuning: Start with dense=0.5, sparse=0.3, metadata=0.2, adjust based on domain
  • Sparse Embeddings: Use BM25 or SPLADE for production
  • Metadata Indexing: Index frequently filtered fields separately
  • Storage: Compress sparse embeddings to save space
  • Query Optimization: Cache dense embeddings, compute sparse on-demand

6. Graph RAG

What It Is

Graph RAG traverses knowledge graphs to follow relationships and discover connected information. It understands how entities relate to each other, enabling multi-hop reasoning and discovery of indirectly related information.

Real-World Example

Use Case: Company knowledge base
User asks: “Who are the key engineers working on projects related to machine learning?”
System:
  • Extracts entities: “engineers”, “projects”, “machine learning”
  • Finds in graph: Engineers → Work On → Projects → Related To → “machine learning”
  • Traverses relationships: Discovers connected engineers and projects
  • Combines with vector search: Adds relevant documents
System answers: Lists engineers with their ML-related projects and expertise
from dataclasses import dataclass
from typing import Set

@dataclass
class Entity:
    id: str
    name: str
    type: str
    properties: dict

@dataclass
class Relationship:
    source: str
    target: str
    type: str
    properties: dict

class GraphRAG:
    """RAG using knowledge graph traversal"""
    
    def __init__(self, graph_db, vector_db):
        self.client = OpenAI()
        self.graph = graph_db
        self.vectors = vector_db
    
    async def query(self, question: str) -> dict:
        # 1. Extract entities from question
        entities = await self._extract_entities(question)
        
        # 2. Find entities in knowledge graph
        graph_entities = []
        for entity in entities:
            matches = await self.graph.find_entity(entity["name"], entity["type"])
            graph_entities.extend(matches)
        
        # 3. Traverse graph to find related information
        subgraph = await self._traverse_graph(graph_entities, depth=2)
        
        # 4. Also do vector search for additional context
        vector_docs = await self._vector_search(question)
        
        # 5. Combine graph and vector context
        context = self._build_context(subgraph, vector_docs)
        
        # 6. Generate answer
        answer = await self._generate(question, context)
        
        return {
            "answer": answer,
            "entities": [e.name for e in graph_entities],
            "relationships": len(subgraph["relationships"]),
            "documents": len(vector_docs)
        }
    
    async def _extract_entities(self, question: str) -> List[dict]:
        """Extract named entities from question"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Extract named entities from the question.
                    Return JSON: {"entities": [{"name": "...", "type": "person|org|concept|product|..."}]}"""
                },
                {"role": "user", "content": question}
            ],
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        return result.get("entities", [])
    
    async def _traverse_graph(
        self,
        start_entities: List[Entity],
        depth: int = 2
    ) -> dict:
        """Traverse knowledge graph from starting entities"""
        visited: Set[str] = set()
        entities = []
        relationships = []
        
        queue = [(e, 0) for e in start_entities]
        
        while queue:
            entity, current_depth = queue.pop(0)
            
            if entity.id in visited or current_depth > depth:
                continue
            
            visited.add(entity.id)
            entities.append(entity)
            
            # Get relationships
            rels = await self.graph.get_relationships(entity.id)
            
            for rel in rels:
                relationships.append(rel)
                
                # Get connected entity
                connected_id = rel.target if rel.source == entity.id else rel.source
                connected = await self.graph.get_entity(connected_id)
                
                if connected and connected.id not in visited:
                    queue.append((connected, current_depth + 1))
        
        return {"entities": entities, "relationships": relationships}
    
    def _build_context(
        self,
        subgraph: dict,
        vector_docs: List[dict]
    ) -> str:
        """Build context from graph and documents"""
        parts = []
        
        # Add graph context
        parts.append("## Knowledge Graph Context")
        
        # Entities
        parts.append("\n### Entities:")
        for entity in subgraph["entities"]:
            parts.append(f"- **{entity.name}** ({entity.type}): {entity.properties}")
        
        # Relationships
        parts.append("\n### Relationships:")
        for rel in subgraph["relationships"]:
            parts.append(f"- {rel.source} --[{rel.type}]--> {rel.target}")
        
        # Add document context
        parts.append("\n## Document Context")
        for i, doc in enumerate(vector_docs, 1):
            parts.append(f"\n[Document {i}]: {doc['content']}")
        
        return "\n".join(parts)
    
    async def index_with_graph(self, doc_id: str, content: str):
        """Index document and extract graph entities"""
        # 1. Standard vector indexing
        embedding = self._embed(content)
        await self.vectors.store(doc_id, content, embedding)
        
        # 2. Extract entities and relationships
        graph_data = await self._extract_graph(content)
        
        # 3. Add to knowledge graph
        for entity in graph_data["entities"]:
            await self.graph.add_entity(entity)
        
        for rel in graph_data["relationships"]:
            await self.graph.add_relationship(rel)
    
    async def _extract_graph(self, content: str) -> dict:
        """Extract entities and relationships from text"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Extract entities and relationships from this text.
                    Return JSON:
                    {
                        "entities": [{"id": "unique_id", "name": "...", "type": "...", "properties": {...}}],
                        "relationships": [{"source": "id1", "target": "id2", "type": "WORKS_AT|OWNS|RELATED_TO|..."}]
                    }"""
                },
                {"role": "user", "content": content}
            ],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)
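A usage sketch for the class above. MyGraphDB and MyVectorDB are hypothetical placeholders for backends that implement the methods GraphRAG calls (find_entity, get_relationships, get_entity, add_entity, add_relationship, store, plus vector search):
import asyncio

async def main():
    # MyGraphDB and MyVectorDB are hypothetical placeholders; any backends
    # implementing the methods GraphRAG calls above would work here.
    rag = GraphRAG(graph_db=MyGraphDB(), vector_db=MyVectorDB())

    # Index a document: store its embedding and extract entities/relationships.
    await rag.index_with_graph(
        doc_id="doc-001",
        content="Alice is a senior engineer on the recommendation ML project at Acme."
    )

    # Query: graph traversal plus vector retrieval.
    result = await rag.query(
        "Who are the key engineers working on projects related to machine learning?"
    )
    print(result["answer"])
    print("Entities found:", result["entities"])

# asyncio.run(main())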
When to Use Graph RAG:
  • Data with rich entity relationships
  • Questions about connections and relationships
  • Multi-hop queries (“who works at company that acquired X”)
  • Knowledge bases with structured information
  • Need to discover indirectly related information
Limitations:
  • Requires graph database infrastructure
  • Entity extraction and graph construction overhead
  • Graph traversal can be slow for large graphs
  • More complex to set up and maintain
  • Requires structured data or entity extraction pipeline
Performance Tips:
  • Graph Depth: Limit traversal to depth 2-3 for performance
  • Entity Extraction: Use gpt-4o for accurate extraction, cache results
  • Graph Indexing: Index frequently queried entity types
  • Hybrid Approach: Combine graph traversal with vector search
  • Caching: Cache common graph traversal paths

Choosing the Right Architecture

Selecting the appropriate RAG architecture depends on your use case, complexity requirements, and performance needs. Use this guide to make the right choice.

Decision Matrix

If You Need…                                  Use This
Simple Q&A over docs                          Basic RAG
Production system with quality requirements   Advanced RAG
Conversational AI                             Memory RAG
Complex research/multi-step questions         Agentic RAG
Technical docs (exact + semantic)             Multi-Vector RAG
Relationship-based queries                    Graph RAG

When to Upgrade

Basic → Advanced:
  • Accuracy < 70% (see the evaluation sketch after this list)
  • Users complain about irrelevant results
  • Technical domain with specific terminology
  • Ambiguous queries are common
Advanced → Memory:
  • Building a chatbot
  • Users have repeat interactions
  • Personalization improves experience
  • Multi-turn conversations
Memory → Agentic:
  • Questions require research
  • Need to synthesize multiple sources
  • Multi-hop reasoning required
  • Questions like “compare X and Y across Z”
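The accuracy threshold above implies some form of offline evaluation. A minimal sketch of one way to measure it, assuming a small hand-labeled set of question/expected-keyword pairs and any query function from this article (the evaluation set and helper names are illustrative, not a standard benchmark):
def evaluate_accuracy(rag_query, eval_set):
    """Crude accuracy: fraction of answers that contain the expected key fact."""
    hits = 0
    for item in eval_set:
        answer = rag_query(item["question"])
        if item["expected_keyword"].lower() in answer.lower():
            hits += 1
    return hits / len(eval_set)

# eval_set = [
#     {"question": "What's our remote work policy?", "expected_keyword": "3 days"},
#     ...
# ]
# print(f"Accuracy: {evaluate_accuracy(rag.query, eval_set):.0%}")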

Production Considerations

Building production-ready RAG systems requires careful consideration of infrastructure, costs, performance, and monitoring. Here’s what you need to know.

1. Vector Database Selection

Choosing the right vector database is critical for production performance and reliability.
Database   Best For           Pros                         Cons
Pinecone   Production apps    Managed, fast, reliable      Cost
Weaviate   Open-source needs  Self-hosted, feature-rich    Setup complexity
Qdrant     High performance   Very fast, Rust-based        Smaller community
ChromaDB   Prototyping        Easy setup, Python-friendly  Not production-ready

2. Cost Optimization

Understanding and managing costs is essential for sustainable RAG deployments.
Typical Costs per 1000 Queries:
  • Basic RAG: $2-5
  • Advanced RAG: $6-12
  • Memory RAG: $3-8
  • Agentic RAG: $15-40
Optimization Strategies:
  • Cache common queries (see the caching sketch after this list)
  • Use cheaper models for embeddings
  • Implement query throttling
  • Batch operations when possible
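Caching is usually the cheapest win of the strategies above. A minimal sketch of a query cache that can wrap any of the query methods in this article (the normalization and TTL choices are illustrative):
import hashlib
import time

class QueryCache:
    """Tiny in-memory cache keyed on the normalized question text."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (timestamp, answer)

    def _key(self, question: str) -> str:
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, question: str):
        entry = self.store.get(self._key(question))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def set(self, question: str, answer: str):
        self.store[self._key(question)] = (time.time(), answer)

# cache = QueryCache()
# if (cached := cache.get(question)) is not None:
#     return cached
# answer = rag.query(question)
# cache.set(question, answer)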

3. Performance Tuning

Different document types and use cases require different chunk sizes and retrieval settings.
# Chunk size optimization
CHUNK_SIZES = {
    "technical_docs": 800,  # More context needed
    "chat_logs": 400,       # Conversational
    "legal_documents": 1000, # Dense information
    "news_articles": 600    # Balanced
}

# Top-k tuning
TOP_K_SETTINGS = {
    "simple_faq": 3,
    "research": 10,
    "general_qa": 5
}

# Re-ranking budget
USE_RERANKING_IF = {
    "query_length": "> 10 words",
    "ambiguous": True,
    "high_stakes": True  # Legal, medical, financial
}

4. Monitoring and Metrics

Track key metrics to ensure your RAG system performs well in production.
{
    "retrieval_metrics": {
        "latency_p50": "< 500ms",
        "latency_p99": "< 2s",
        "relevance_score": "> 0.7"
    },
    "generation_metrics": {
        "answer_latency": "< 3s",
        "answer_length": "50-300 words",
        "citation_rate": "> 80%"
    },
    "quality_metrics": {
        "user_satisfaction": "> 4/5",
        "thumbs_up_rate": "> 70%",
        "follow_up_rate": "< 30%"
    }
}
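One lightweight way to track the latency targets above is to record per-query timings and compute percentiles. A minimal sketch (the stage names and thresholds mirror the illustrative targets above):
import statistics
import time

class LatencyTracker:
    """Records per-stage latencies and reports p50/p99 for comparison against targets."""

    def __init__(self):
        self.samples = {"retrieval": [], "generation": []}

    def record(self, stage: str, seconds: float):
        self.samples[stage].append(seconds)

    def report(self) -> dict:
        out = {}
        for stage, values in self.samples.items():
            if not values:
                continue
            ordered = sorted(values)
            out[stage] = {
                "p50": statistics.median(ordered),
                "p99": ordered[min(len(ordered) - 1, int(len(ordered) * 0.99))],
            }
        return out

# tracker = LatencyTracker()
# start = time.time()
# docs = db.vector_search(embedding, top_k=5)
# tracker.record("retrieval", time.time() - start)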

5. Common Pitfalls and Solutions

Pitfall 1: “Chunk Boundaries Cut Off Important Info”
Problem: The answer spans two chunks, so neither is retrieved:
Chunk 1: "...The policy states that remote work..."
Chunk 2: "...is allowed up to 3 days per week."
Solution: Use overlapping chunks
def chunk_with_overlap(text, size=500, overlap=50):
    """Split text into fixed-size chunks that overlap, so facts spanning a boundary appear intact in at least one chunk."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap  # step forward by less than the chunk size so neighbors share `overlap` characters
    return chunks
Pitfall 2: “Too Many Irrelevant Results”
Problem: Top-k returns junk documents.
Solutions:
  • Similarity Threshold: Only use docs > 0.7 similarity
  • Metadata Filtering: Filter by date, category, etc.
  • Re-ranking: Use LLM to score relevance
  • Better Chunking: Smaller, more focused chunks
Pitfall 3: “Hallucinations Despite RAG”
Problem: The LLM makes things up even with sources.
Solutions:
  • Stricter System Prompt: “ONLY use information from sources. If unsure, say ‘I don’t have information about that in the provided sources.’”
  • Temperature = 0: Reduce creativity
  • Post-processing: Check if answer content appears in sources (see the sketch after this list)
  • Citation Requirement: Force [Source N] citations
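For the post-processing and citation checks above, a minimal sketch that verifies the answer cites sources and that its wording overlaps the retrieved documents (the lexical-overlap heuristic and threshold are illustrative):
import re

def check_grounding(answer: str, sources: list[str]) -> dict:
    """Crude grounding check: citations present and some lexical overlap with the sources."""
    citations = re.findall(r"\[Source (\d+)\]", answer)
    answer_words = set(re.findall(r"\w+", answer.lower()))
    source_words = set()
    for src in sources:
        source_words.update(re.findall(r"\w+", src.lower()))
    overlap = len(answer_words & source_words) / max(len(answer_words), 1)
    return {
        "has_citations": bool(citations),
        "cited_sources": sorted(set(int(c) for c in citations)),
        "lexical_overlap": round(overlap, 2),
    }

# result = check_grounding(answer, [doc["content"] for doc in docs])
# if not result["has_citations"] or result["lexical_overlap"] < 0.3:
#     # Flag for review or regenerate with a stricter prompt.
#     ...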
Pitfall 4: “Slow Query Times”
Problem: Queries take 5-10 seconds.
Solutions:
  • Use faster embedding models
  • Implement caching
  • Reduce top-k
  • Use async operations
  • Consider hybrid database with cache layer

Summary and Next Steps

Key Takeaways:
  • Start Simple: Begin with Basic RAG, add complexity as needed
  • Hybrid Works Best: Vector + keyword search outperforms either alone
  • Memory for Conversations: Essential for chatbots and assistants
  • Monitor Quality: Track metrics, iterate on poor performance
  • Cost vs Quality: Advanced techniques cost more but deliver better results

Choosing the Right RAG Type

RAG Type      Best For                              Complexity  Latency
Basic         Simple Q&A, MVPs                      Low         Fast
Advanced      Production systems, diverse queries   Medium      Moderate
Memory        Chatbots, personalized assistants     Medium      Moderate
Agentic       Complex, multi-step questions         High        Slow
Multi-Vector  Hybrid search requirements            Medium      Moderate
Graph         Connected data, relationship queries  High        Variable

Key Takeaways

Start Simple

Begin with Basic RAG, then add complexity as needed based on failure analysis.

Hybrid is Best

Combining vector + keyword search outperforms either alone for most use cases.

Memory Matters

For conversational AI, memory context dramatically improves response relevance.

Graphs for Relationships

When entities and relationships matter, Graph RAG provides structured understanding.

What’s Next

Tool Calling

Learn how to give LLMs the ability to call functions and external APIs