Types of RAG - Dev Weekends

December 2025 Update: Covers the latest RAG patterns including Agentic RAG, Graph RAG, and Multi-Vector approaches used in production systems.

The RAG Evolution

RAG has evolved far beyond simple “retrieve and generate.” Modern RAG systems use sophisticated architectures to handle complex queries, maintain context, and deliver accurate, grounded responses.

RAG Evolution Timeline
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

2022          2023           2024           2025
 │             │              │              │
 ▼             ▼              ▼              ▼
Basic RAG → Advanced RAG → Agentic RAG → Graph RAG
             + Re-ranking    + Memory       + Knowledge Graphs
             + Hybrid        + Multi-hop    + Multi-Modal
                            + Self-RAG     

1. Basic RAG

What It Is

Basic RAG is the simplest form: search for relevant documents using vector similarity, then feed them to an LLM to generate an answer. It’s like asking a librarian who quickly finds relevant books and summarizes them for you.

Real-World Example

Use Case: Company documentation Q&A

Employee asks: “What’s our remote work policy?”
System finds: 3 policy documents mentioning remote work
LLM summarizes: “According to company policy, employees can work remotely up to 3 days per week with manager approval…”

from openai import OpenAI
from typing import List
import json

class BasicRAG:
    """Simple retrieve-and-generate RAG"""
    
    def __init__(self, database):
        self.client = OpenAI()
        self.db = database
    
    def query(self, question: str, top_k: int = 5) -> str:
        # 1. Embed the question
        embedding = self._embed(question)
        
        # 2. Retrieve similar documents
        docs = self.db.vector_search(embedding, top_k=top_k)
        
        # 3. Build context
        context = "\n\n".join([
            f"[Source {i+1}]: {doc['content']}"
            for i, doc in enumerate(docs)
        ])
        
        # 4. Generate answer
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Answer the question based only on the provided sources.
                    Cite sources using [Source N] format."""
                },
                {
                    "role": "user",
                    "content": f"Sources:\n{context}\n\nQuestion: {question}"
                }
            ]
        )
        
        return response.choices[0].message.content
    
    def _embed(self, text: str) -> List[float]:
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

When to Use Basic RAG:

Simple Q&A over documentation
Small to medium document collections
Well-formed, specific questions
MVP/prototype stage

Limitations:

Single retrieval step may miss context
No query understanding or rewriting
Limited handling of complex queries
No reasoning over multiple documents

Performance Tips:

Chunk Size: Start with chunks of 200-500 tokens and adjust based on retrieval quality
Embedding Model: text-embedding-3-small is fast and cheap
Top-K: Start with 3-5 documents, adjust based on answer quality
Temperature: Use 0.0 for factual answers, 0.3-0.7 for creative responses
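
The chunk-size tip can be illustrated with a small splitter. This is a sketch under two assumptions that are not from the original text: whitespace-separated words stand in for tokens (a real tokenizer will count differently), and consecutive chunks share a small overlap so that sentences cut at a boundary still appear whole in one chunk. The function name `chunk_text` and its parameters are illustrative.

```python
from typing import List


def chunk_text(text: str, max_tokens: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks of at most max_tokens words.

    Whitespace-separated words are used as a rough proxy for tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(sample, max_tokens=4, overlap=1)
print(chunks)
```

With ten words, a window of 4, and an overlap of 1, the result is three chunks, each starting on the last word of the previous one.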

2. Advanced RAG

What It Is

Advanced RAG improves accuracy by adding query processing, hybrid search (vector + keyword), and re-ranking. It’s like having a smart librarian who understands your question better, searches multiple ways, and ranks results by relevance.

Real-World Example

Use Case: Legal document search

Lawyer asks: “Cases about contract breach in California”
System expands to: [“contract breach California”, “contractual violations CA”, “breach of agreement California courts”]
Searches using both semantic similarity AND keyword matching
Re-ranks results by legal relevance
Returns top 5 most relevant cases

class AdvancedRAG:
    """RAG with query expansion, hybrid search, and re-ranking"""
    
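    def __init__(self, database):
        # OpenAI() is the synchronous client, so the un-awaited calls below block
        # the event loop; use AsyncOpenAI and await each call for real concurrency.
        self.client = OpenAI()
        self.db = database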
    
    async def query(self, question: str) -> dict:
        # 1. Query Processing
        processed_queries = await self._process_query(question)
        
        # 2. Hybrid Retrieval (Vector + Keyword)
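        all_docs = []
        for q in processed_queries:
            vector_docs = await self._vector_search(q)
            keyword_docs = await self._keyword_search(q)
            # Record each document's 1-based position in its own result list;
            # _rrf_fusion reads this 'rank' field.
            for results in (vector_docs, keyword_docs):
                for rank, doc in enumerate(results, start=1):
                    all_docs.append({**doc, "rank": rank})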
        
        # 3. Reciprocal Rank Fusion
        fused_docs = self._rrf_fusion(all_docs)
        
        # 4. Re-ranking (LLM-as-judge here; a cross-encoder is a common alternative)
        reranked_docs = await self._rerank(question, fused_docs[:20])
        
        # 5. Generate with best documents
        answer = await self._generate(question, reranked_docs[:5])
        
        return {
            "answer": answer,
            "sources": reranked_docs[:5],
            "query_expansions": processed_queries
        }
    
    async def _process_query(self, question: str) -> List[str]:
        """Expand query into multiple search variations"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Generate 3 alternative phrasings of this question to improve search.
                    Return JSON: {"queries": ["...", "...", "..."]}"""
                },
                {"role": "user", "content": question}
            ],
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        return [question] + result.get("queries", [])
    
    def _rrf_fusion(self, docs: List[dict], k: int = 60) -> List[dict]:
        """Reciprocal Rank Fusion for combining multiple rankings"""
        scores = {}
        doc_map = {}
        
        for doc in docs:
            doc_id = doc['id']
            if doc_id not in scores:
                scores[doc_id] = 0
                doc_map[doc_id] = doc
            
            # RRF score: 1 / (k + rank)
            rank = doc.get('rank', 1)
            scores[doc_id] += 1 / (k + rank)
        
        # Sort by fused score
        sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
        return [doc_map[doc_id] for doc_id in sorted_ids]
    
    async def _rerank(self, question: str, docs: List[dict]) -> List[dict]:
        """Re-rank documents using LLM as judge"""
        doc_texts = "\n".join([
            f"[{i+1}] {doc['content'][:500]}"
            for i, doc in enumerate(docs)
        ])
        
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Rate relevance of each document to the question (0-10).
                    Return JSON: {"rankings": [{"doc": 1, "score": 8}, ...]}"""
                },
                {
                    "role": "user",
                    "content": f"Question: {question}\n\nDocuments:\n{doc_texts}"
                }
            ],
            response_format={"type": "json_object"}
        )
        
        rankings = json.loads(response.choices[0].message.content)
        
        # Apply scores and re-sort
        for ranking in rankings.get("rankings", []):
            idx = ranking["doc"] - 1
            if 0 <= idx < len(docs):
                docs[idx]["rerank_score"] = ranking["score"]
        
        return sorted(docs, key=lambda x: x.get("rerank_score", 0), reverse=True)

When to Use Advanced RAG:

Production systems requiring high accuracy
Diverse queries with varying terminology
Technical domains with specific jargon
Ambiguous or complex user questions
Need for better precision and recall

Limitations:

Higher cost than Basic RAG (~2-3x)
Increased latency due to multiple processing steps
Requires more infrastructure (re-ranking models)
More complex to implement and maintain
May be overkill for simple use cases

Performance Tips:

Query Expansion: Use gpt-4o-mini for cost efficiency
Re-ranking: Only re-rank top 20 candidates to balance cost/quality
Hybrid Search: Start with weights of 0.7 (vector) and 0.3 (keyword), then tune on your own queries
RRF Fusion: k=60 is the commonly used default; larger k flattens the difference between ranks
Caching: Cache query expansions and common re-ranking results
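
To make the RRF tip concrete, here is a minimal standalone sketch of Reciprocal Rank Fusion over plain lists of document IDs. Unlike `_rrf_fusion` above, it takes each ranking as a separate list, so a document's rank is simply its position in that list. The function name and the sample IDs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank is
    1-based), and the scores are summed across lists.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]   # best match first
keyword_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Document "b" ends up first because it ranks near the top of both lists, even though "a" wins the vector ranking outright. Rewarding that kind of agreement between retrievers is the point of the fusion step.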

3. Memory RAG

What It Is

Memory RAG adds conversation history and user context, enabling personalized, context-aware responses. It’s like talking to someone who remembers your previous conversations and preferences.

Real-World Example

Use Case: Personal health assistant

First conversation: “I have diabetes. What should I eat for breakfast?”
System remembers: User has diabetes, prefers quick meals
Later conversation: “What about lunch?”
System uses memory: Suggests low-carb lunches, remembers breakfast preferences

from datetime import datetime
from typing import Optional

class MemoryRAG:
    """RAG with short-term and long-term memory"""
    
    def __init__(self, database, memory_store):
        self.client = OpenAI()
        self.db = database
        self.memory = memory_store
    
    async def query(
        self,
        question: str,
        user_id: str,
        session_id: str
    ) -> dict:
        # 1. Get short-term memory (conversation context)
        conversation = await self.memory.get_conversation(session_id, limit=10)
        
        # 2. Get long-term memory (user preferences, facts)
        user_context = await self.memory.get_user_context(user_id)
        
        # 3. Contextualize the query
        contextualized_query = await self._contextualize_query(
            question, 
            conversation,
            user_context
        )
        
        # 4. Retrieve with contextualized query
        docs = await self._retrieve(contextualized_query)
        
        # 5. Generate with full context
        answer = await self._generate(
            question=question,
            documents=docs,
            conversation=conversation,
            user_context=user_context
        )
        
        # 6. Update memory
        await self._update_memory(
            user_id=user_id,
            session_id=session_id,
            question=question,
            answer=answer
        )
        
        return {"answer": answer, "sources": docs}
    
    async def _contextualize_query(
        self,
        question: str,
        conversation: List[dict],
        user_context: dict
    ) -> str:
        """Rewrite query with conversation context"""
        if not conversation:
            return question
        
        history = "\n".join([
            f"{'User' if m['role'] == 'user' else 'Assistant'}: {m['content']}"
            for m in conversation[-5:]  # Last 5 turns
        ])
        
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Given the conversation history, rewrite the user's question
                    to be self-contained. Include relevant context from the conversation.
                    Return only the rewritten question."""
                },
                {
                    "role": "user",
                    "content": f"""Conversation:
{history}

User preferences: {json.dumps(user_context.get('preferences', {}))}

Current question: {question}

Rewritten question:"""
                }
            ]
        )
        
        return response.choices[0].message.content
    
    async def _update_memory(
        self,
        user_id: str,
        session_id: str,
        question: str,
        answer: str
    ):
        """Update both short-term and long-term memory"""
        # Short-term: conversation history
        await self.memory.add_message(session_id, "user", question)
        await self.memory.add_message(session_id, "assistant", answer)
        
        # Long-term: extract and store facts
        facts = await self._extract_facts(question, answer)
        for fact in facts:
            await self.memory.add_user_fact(user_id, fact)
    
    async def _extract_facts(self, question: str, answer: str) -> List[dict]:
        """Extract memorable facts from conversation"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Extract user preferences or facts to remember.
                    Return JSON: {"facts": [{"type": "preference|fact", "content": "..."}]}
                    Only include genuinely useful information. Return empty if nothing notable."""
                },
                {
                    "role": "user",
                    "content": f"User asked: {question}\nResponse: {answer}"
                }
            ],
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        return result.get("facts", [])

When to Use Memory RAG:

Building chatbots or conversational assistants
Users have repeat interactions with the system
Personalization improves user experience
Multi-turn conversations with context dependencies
Need to remember user preferences and facts

Limitations:

Requires memory storage infrastructure
Privacy concerns with storing user data
Memory can become stale or incorrect over time
Additional cost for memory operations
More complex state management

Performance Tips:

Short-term Memory: Keep last 5-10 conversation turns for context
Long-term Memory: Extract only meaningful facts, not every detail
Memory Retrieval: Use vector search for semantic memory lookup
Memory Updates: Batch updates to reduce database calls
Context Window: Limit conversation history to avoid token limits
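
The short-term memory and context-window tips can be combined in one helper that keeps only the most recent messages fitting a budget. This is a sketch, not part of `MemoryRAG`: the names `trim_history` and `approx_tokens` are hypothetical, and the four-characters-per-token estimate is a crude assumption that a real tokenizer should replace.

```python
from typing import Dict, List


def approx_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(
    messages: List[Dict[str, str]],
    max_tokens: int = 1000,
    max_messages: int = 10,
) -> List[Dict[str, str]]:
    """Keep the most recent messages that fit within both limits."""
    kept: List[Dict[str, str]] = []
    used = 0
    # Walk backwards so the newest messages are considered first
    for message in reversed(messages[-max_messages:]):
        cost = approx_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, max_tokens=25)
print([m["content"][0] for m in trimmed])
```

Each sample message costs about 10 tokens under the estimate, so a budget of 25 keeps the two newest messages and drops the oldest.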

4. Agentic RAG

What It Is

Agentic RAG uses iterative reasoning to answer complex, multi-step questions. The system can plan multiple retrieval steps, synthesize information from different sources, and decide when it has enough information to provide an answer.

Real-World Example

Use Case: Research assistant for academic papers

User asks: “Compare the effectiveness of transformer models vs RNNs for machine translation, considering recent papers from 2023-2024”

System plans:

Step 1: Search for “transformer models machine translation”
Step 2: Search for “RNN machine translation comparison”
Step 3: Search for “transformer vs RNN 2023 2024”
Step 4: Synthesize findings and compare

System answers: Comprehensive comparison with citations from multiple sources

from enum import Enum

class Action(Enum):
    SEARCH = "search"
    ANSWER = "answer"
    CLARIFY = "clarify"

class AgenticRAG:
    """RAG with iterative reasoning and multi-hop retrieval"""
    
    def __init__(self, database):
        self.client = OpenAI()
        self.db = database
        self.max_iterations = 5
    
    async def query(self, question: str) -> dict:
        context = []
        reasoning_chain = []
        
        for i in range(self.max_iterations):
            # Decide next action
            action, action_input = await self._plan_action(
                question=question,
                context=context,
                reasoning=reasoning_chain
            )
            
            reasoning_chain.append({
                "step": i + 1,
                "action": action.value,
                "input": action_input
            })
            
            if action == Action.ANSWER:
                # Ready to answer
                return {
                    "answer": action_input,
                    "reasoning": reasoning_chain,
                    "sources": context
                }
            
            elif action == Action.SEARCH:
                # Retrieve more information
                docs = await self._retrieve(action_input)
                context.extend(docs)
                
                reasoning_chain[-1]["result"] = f"Found {len(docs)} documents"
            
            elif action == Action.CLARIFY:
                # Need clarification
                return {
                    "answer": None,
                    "clarification_needed": action_input,
                    "reasoning": reasoning_chain
                }
        
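        # Max iterations reached. _force_answer is not shown here; it is assumed
        # to generate a best-effort answer from whatever context was gathered.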
        return {
            "answer": await self._force_answer(question, context),
            "reasoning": reasoning_chain,
            "sources": context,
            "warning": "Max iterations reached"
        }
    
    async def _plan_action(
        self,
        question: str,
        context: List[dict],
        reasoning: List[dict]
    ) -> tuple[Action, str]:
        """Decide what to do next"""
        context_summary = "\n".join([
            f"- {doc['content'][:200]}..." for doc in context[-5:]
        ]) if context else "No information gathered yet."
        
        reasoning_str = "\n".join([
            f"Step {r['step']}: {r['action']} - {r.get('result', r.get('input', ''))}"
            for r in reasoning
        ]) if reasoning else "No steps taken yet."
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """You are a reasoning agent. Decide what to do next:

1. SEARCH: Need more information. Provide a search query.
2. ANSWER: Have enough info to answer. Provide the answer.
3. CLARIFY: Question is ambiguous. Provide clarification request.

Return JSON: {"action": "search|answer|clarify", "input": "..."}

Be thorough—search multiple times if needed for complex questions."""
                },
                {
                    "role": "user",
                    "content": f"""Question: {question}

Information gathered:
{context_summary}

Reasoning so far:
{reasoning_str}

What should I do next?"""
                }
            ],
            response_format={"type": "json_object"}
        )
        
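        result = json.loads(response.choices[0].message.content)
        # The prompt names actions in upper case, so normalize before the Enum
        # lookup; Action("SEARCH") would raise ValueError.
        action = Action(result["action"].strip().lower())
        return action, result["input"]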
    
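    async def _retrieve(self, query: str) -> List[dict]:
        """Search for documents"""
        embedding = self._embed(query)
        return self.db.vector_search(embedding, top_k=5)

    def _embed(self, text: str) -> List[float]:
        """Same embedding helper as in BasicRAG"""
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding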

When to Use Agentic RAG:

Complex research questions requiring multiple sources
Questions that need synthesis across documents
Multi-hop reasoning (“who works at company that acquired X”)
Comparative analysis questions
Questions requiring iterative information gathering

Limitations:

Highest cost among RAG types ($15-40 per 1000 queries)
Slowest latency due to multiple iterations
Can loop on near-identical searches; a high max_iterations lets such loops run longer and cost more
Requires careful prompt engineering for action planning
More complex to debug and monitor

Performance Tips:

Max Iterations: Set to 3-5 for most use cases
Action Planning: Use gpt-4o for better reasoning, gpt-4o-mini for cost savings
Early Stopping: Implement confidence thresholds to stop early
Query Generation: Cache common query patterns
Monitoring: Track iteration count and reasoning chains for optimization
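
One simple form of early stopping is to notice when the agent is about to repeat a search it has already run. The sketch below is an assumption about how that check might be done, not part of `AgenticRAG`: it compares queries by word overlap (Jaccard similarity), and the names `is_repeated_search` and `normalize_query` as well as the 0.8 threshold are illustrative.

```python
import re
from typing import FrozenSet, List


def normalize_query(query: str) -> FrozenSet[str]:
    """Lowercase a query and reduce it to its set of words."""
    return frozenset(re.findall(r"\w+", query.lower()))


def is_repeated_search(query: str, previous: List[str], threshold: float = 0.8) -> bool:
    """True if query's word overlap (Jaccard) with any earlier query >= threshold."""
    current = normalize_query(query)
    for earlier in previous:
        other = normalize_query(earlier)
        union = current | other
        if not union:
            continue  # both queries are empty; nothing to compare
        if len(current & other) / len(union) >= threshold:
            return True
    return False


previous = ["transformer models machine translation"]
print(is_repeated_search("Machine translation transformer models", previous))
print(is_repeated_search("RNN machine translation comparison", previous))
```

In the agent loop, a repeated query could trigger a forced answer instead of another retrieval. Word overlap misses paraphrases, so comparing query embeddings is a natural refinement.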

5. Multi-Vector RAG

What It Is

Multi-Vector RAG uses multiple embedding types (dense, sparse, metadata) for precise retrieval. By combining semantic understanding, exact keyword matching, and structured metadata, it provides more accurate and flexible search capabilities.

Real-World Example

Use Case: Technical documentation search

User asks: “Python async/await error handling”

System searches:

Dense vectors: Find semantically similar docs about async programming
Sparse vectors: Match exact keywords “async”, “await”, “error”
Metadata vectors: Filter by language=“Python”, category=“error-handling”

System combines: Weighted fusion returns most relevant technical docs

class MultiVectorRAG:
    """RAG with dense, sparse, and metadata vectors"""
    
    def __init__(self, database):
        self.client = OpenAI()
        self.db = database
    
    async def index_document(self, doc_id: str, content: str, metadata: dict):
        """Index with multiple vector types"""
        
        # 1. Dense embedding (semantic)
        dense_embedding = self._embed_dense(content)
        
        # 2. Sparse embedding (BM25/keywords)
        sparse_embedding = self._embed_sparse(content)
        
        # 3. Metadata embedding
        metadata_text = " ".join([
            f"{k}: {v}" for k, v in metadata.items()
        ])
        metadata_embedding = self._embed_dense(metadata_text)
        
        # Store all embeddings
        await self.db.store_multi_vector(
            doc_id=doc_id,
            content=content,
            dense=dense_embedding,
            sparse=sparse_embedding,
            metadata_vec=metadata_embedding,
            metadata=metadata
        )
    
    async def query(
        self,
        question: str,
        filters: Optional[dict] = None,
        weights: Optional[dict] = None
    ) -> List[dict]:
        """Multi-vector retrieval with configurable weights"""
        weights = weights or {
            "dense": 0.5,
            "sparse": 0.3,
            "metadata": 0.2
        }
        
        # Embed query
        query_dense = self._embed_dense(question)
        query_sparse = self._embed_sparse(question)
        
        # Search with each vector type
        dense_results = await self.db.search(
            vector=query_dense,
            vector_type="dense",
            filters=filters
        )
        
        sparse_results = await self.db.search(
            vector=query_sparse,
            vector_type="sparse",
            filters=filters
        )
        
        # If metadata filters provided, search metadata vectors
        if filters:
            filter_text = " ".join([f"{k}: {v}" for k, v in filters.items()])
            filter_vec = self._embed_dense(filter_text)
            metadata_results = await self.db.search(
                vector=filter_vec,
                vector_type="metadata"
            )
        else:
            metadata_results = []
        
        # Weighted combination
        combined = self._weighted_fusion(
            dense_results,
            sparse_results,
            metadata_results,
            weights
        )
        
        return combined
    
    def _weighted_fusion(
        self,
        dense: List[dict],
        sparse: List[dict],
        metadata: List[dict],
        weights: dict
    ) -> List[dict]:
        """Combine results with weighted scoring"""
        scores = {}
        docs = {}
        
        for rank, doc in enumerate(dense):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + weights["dense"] / (rank + 1)
        
        for rank, doc in enumerate(sparse):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + weights["sparse"] / (rank + 1)
        
        for rank, doc in enumerate(metadata):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0) + weights["metadata"] / (rank + 1)
        
        sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
        return [docs[doc_id] for doc_id in sorted_ids]
    
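    def _embed_sparse(self, text: str, dim: int = 2**20) -> dict:
        """Create a sparse term-frequency embedding (stand-in for BM25)"""
        # Simplified: use actual BM25 or SPLADE in production
        from collections import Counter
        import re
        import zlib
        
        words = re.findall(r'\w+', text.lower())
        word_counts = Counter(words)
        
        # Hash each token to a stable index so the same word maps to the same
        # dimension in every document and query. Positional indices (0, 1, 2, ...)
        # would differ per text and make vectors incomparable.
        weights = {}
        for token, count in word_counts.items():
            idx = zlib.crc32(token.encode("utf-8")) % dim
            weights[idx] = weights.get(idx, 0) + count
        
        return {
            "indices": list(weights.keys()),
            "values": list(weights.values()),
            # Kept for debugging; may not align with indices if two tokens collide
            "tokens": list(word_counts.keys())
        }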

When to Use Multi-Vector RAG:

Technical documentation with exact terminology
Need both semantic and keyword matching
Rich metadata available for filtering
Mixed content types (code, docs, comments)
Require fine-tuned relevance control

Limitations:

Requires storing multiple embeddings per document
More storage and indexing overhead
Weight tuning requires experimentation
Sparse embeddings need specialized infrastructure
More complex than single-vector approaches

Performance Tips:

Weight Tuning: Start with dense=0.5, sparse=0.3, metadata=0.2, adjust based on domain
Sparse Embeddings: Use BM25 or SPLADE for production
Metadata Indexing: Index frequently filtered fields separately
Storage: Compress sparse embeddings to save space
Query Optimization: Cache dense embeddings, compute sparse on-demand
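
Since the tips recommend BM25 for the sparse side, a compact in-memory version may help show what it computes. This is an illustrative sketch of Okapi BM25 with the usual `k1` and `b` parameters and the non-negative IDF variant (the one with `+ 1` inside the logarithm). The class name, tokenizer, and sample corpus are assumptions for the example; production systems would normally rely on a search engine or a dedicated library instead.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


class BM25:
    """Small in-memory Okapi BM25 scorer (illustrative, not optimized)."""

    def __init__(self, corpus: List[str], k1: float = 1.5, b: float = 0.75) -> None:
        self.k1 = k1
        self.b = b
        self.docs = [Counter(tokenize(text)) for text in corpus]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs) if self.docs else 0.0
        # Document frequency: number of documents containing each term
        self.df: Dict[str, int] = Counter()
        for doc in self.docs:
            self.df.update(doc.keys())

    def idf(self, term: str) -> float:
        n = self.df.get(term, 0)
        # "+ 1" inside the log keeps IDF non-negative for very common terms
        return math.log((len(self.docs) - n + 0.5) / (n + 0.5) + 1)

    def score(self, query: str, index: int) -> float:
        doc = self.docs[index]
        length_ratio = self.lengths[index] / self.avg_len if self.avg_len else 0.0
        norm = 1 - self.b + self.b * length_ratio  # document length normalization
        total = 0.0
        for term in set(tokenize(query)):
            tf = doc.get(term, 0)
            if tf:
                total += self.idf(term) * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return total

    def search(self, query: str, top_k: int = 5) -> List[int]:
        """Return indices of the top_k highest-scoring documents."""
        ranked = sorted(
            range(len(self.docs)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return ranked[:top_k]


corpus = [
    "Python async await error handling",
    "JavaScript promises and callbacks",
    "Handling errors in Go",
]
bm25 = BM25(corpus)
print(bm25.search("async await error", top_k=1))
```

Only the first document contains any of the query terms, so it is the one returned. Note that "errors" in the third document does not match "error": exact-term matching is both the strength of sparse retrieval and the reason it is paired with dense vectors.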

6. Graph RAG

What It Is

Graph RAG traverses knowledge graphs to follow relationships and discover connected information. It understands how entities relate to each other, enabling multi-hop reasoning and discovery of indirectly related information.

Real-World Example

Use Case: Company knowledge base

User asks: “Who are the key engineers working on projects related to machine learning?”

System:

Extracts entities: “engineers”, “projects”, “machine learning”
Finds in graph: Engineers → Work On → Projects → Related To → “machine learning”
Traverses relationships: Discovers connected engineers and projects
Combines with vector search: Adds relevant documents

System answers: Lists engineers with their ML-related projects and expertise

from dataclasses import dataclass
from typing import Set

@dataclass
class Entity:
    id: str
    name: str
    type: str
    properties: dict

@dataclass
class Relationship:
    source: str
    target: str
    type: str
    properties: dict

class GraphRAG:
    """RAG using knowledge graph traversal"""
    
    def __init__(self, graph_db, vector_db):
        self.client = OpenAI()
        self.graph = graph_db
        self.vectors = vector_db
    
    async def query(self, question: str) -> dict:
        # 1. Extract entities from question
        entities = await self._extract_entities(question)
        
        # 2. Find entities in knowledge graph
        graph_entities = []
        for entity in entities:
            matches = await self.graph.find_entity(entity["name"], entity["type"])
            graph_entities.extend(matches)
        
        # 3. Traverse graph to find related information
        subgraph = await self._traverse_graph(graph_entities, depth=2)
        
        # 4. Also do vector search for additional context
        vector_docs = await self._vector_search(question)
        
        # 5. Combine graph and vector context
        context = self._build_context(subgraph, vector_docs)
        
        # 6. Generate answer
        answer = await self._generate(question, context)
        
        return {
            "answer": answer,
            "entities": [e.name for e in graph_entities],
            "relationships": len(subgraph["relationships"]),
            "documents": len(vector_docs)
        }
    
    async def _extract_entities(self, question: str) -> List[dict]:
        """Extract named entities from question"""
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": """Extract named entities from the question.
                    Return JSON: {"entities": [{"name": "...", "type": "person|org|concept|product|..."}]}"""
                },
                {"role": "user", "content": question}
            ],
            response_format={"type": "json_object"}
        )
        
        result = json.loads(response.choices[0].message.content)
        return result.get("entities", [])
    
    async def _traverse_graph(
        self,
        start_entities: List[Entity],
        depth: int = 2
    ) -> dict:
        """Traverse knowledge graph from starting entities"""
        from collections import deque
        
        visited: Set[str] = set()
        seen_rels: Set[tuple] = set()
        entities = []
        relationships = []
        
        # deque gives O(1) pops from the left; list.pop(0) is O(n)
        queue = deque((e, 0) for e in start_entities)
        
        while queue:
            entity, current_depth = queue.popleft()
            
            if entity.id in visited or current_depth > depth:
                continue
            
            visited.add(entity.id)
            entities.append(entity)
            
            # Get relationships
            rels = await self.graph.get_relationships(entity.id)
            
            for rel in rels:
                # The same edge can be reached from both endpoints, so skip repeats
                rel_key = (rel.source, rel.type, rel.target)
                if rel_key not in seen_rels:
                    seen_rels.add(rel_key)
                    relationships.append(rel)
                
                # Get connected entity
                connected_id = rel.target if rel.source == entity.id else rel.source
                connected = await self.graph.get_entity(connected_id)
                
                if connected and connected.id not in visited:
                    queue.append((connected, current_depth + 1))
        
        return {"entities": entities, "relationships": relationships}
    
    def _build_context(
        self,
        subgraph: dict,
        vector_docs: List[dict]
    ) -> str:
        """Build context from graph and documents"""
        parts = []
        
        # Add graph context
        parts.append("## Knowledge Graph Context")
        
        # Entities
        parts.append("\n### Entities:")
        for entity in subgraph["entities"]:
            parts.append(f"- **{entity.name}** ({entity.type}): {entity.properties}")
        
        # Relationships
        parts.append("\n### Relationships:")
        for rel in subgraph["relationships"]:
            parts.append(f"- {rel.source} --[{rel.type}]--> {rel.target}")
        
        # Add document context
        parts.append("\n## Document Context")
        for i, doc in enumerate(vector_docs, 1):
            parts.append(f"\n[Document {i}]: {doc['content']}")
        
        return "\n".join(parts)
    
    async def index_with_graph(self, doc_id: str, content: str):
        """Index document and extract graph entities"""
        # 1. Standard vector indexing
        embedding = self._embed(content)
        await self.vectors.store(doc_id, content, embedding)
        
        # 2. Extract entities and relationships
        graph_data = await self._extract_graph(content)
        
        # 3. Add to knowledge graph
        for entity in graph_data["entities"]:
            await self.graph.add_entity(entity)
        
        for rel in graph_data["relationships"]:
            await self.graph.add_relationship(rel)
    
    async def _extract_graph(self, content: str) -> dict:
        """Extract entities and relationships from text"""
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Extract entities and relationships from this text.
                    Return JSON:
                    {
                        "entities": [{"id": "unique_id", "name": "...", "type": "...", "properties": {...}}],
                        "relationships": [{"source": "id1", "target": "id2", "type": "WORKS_AT|OWNS|RELATED_TO|..."}]
                    }"""
                },
                {"role": "user", "content": content}
            ],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)

When to Use Graph RAG:

Data with rich entity relationships
Questions about connections and relationships
Multi-hop queries (“who works at company that acquired X”)
Knowledge bases with structured information
Need to discover indirectly related information

Limitations:

Requires graph database infrastructure
Entity extraction and graph construction overhead
Graph traversal can be slow for large graphs
More complex to set up and maintain
Requires structured data or entity extraction pipeline

Performance Tips:

Graph Depth: Limit traversal to depth 2-3 for performance
Entity Extraction: Use gpt-4o for accurate extraction, cache results
Graph Indexing: Index frequently queried entity types
Hybrid Approach: Combine graph traversal with vector search
Caching: Cache common graph traversal paths
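
The graph depth tip can be sketched as a breadth-first traversal that stops expanding once it reaches a fixed number of hops. This is a minimal in-memory illustration: the adjacency-dict layout, the `neighbors_within` helper, and the toy graph are all assumptions, not the graph store API used in the class above. A real graph database would typically express the same bound in its own query language.

```python
from collections import deque

def neighbors_within(graph, start, max_depth=2):
    """Return {entity_id: hop_count} for entities reachable within max_depth hops.

    `graph` is a hypothetical adjacency dict: {entity_id: [neighbor_id, ...]}.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # Depth limit reached: do not expand this node further
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = depth + 1
                queue.append(neighbor)
    del seen[start]  # The start entity is not its own neighbor
    return seen

# Hypothetical toy graph: alice -> acme -> widgetco -> bob
toy_graph = {
    "alice": ["acme"],
    "acme": ["widgetco"],
    "widgetco": ["bob"],
    "bob": [],
}

two_hops = neighbors_within(toy_graph, "alice", max_depth=2)
```

With `max_depth=2`, the traversal reaches `acme` and `widgetco` but never touches `bob`, which is what keeps the cost bounded on large graphs.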

Choosing the Right Architecture

Selecting the appropriate RAG architecture depends on your use case, complexity requirements, and performance needs. Use this guide to make the right choice.

Decision Matrix

If You Need…	Use This
Simple Q&A over docs	Basic RAG
Production system with quality requirements	Advanced RAG
Conversational AI	Memory RAG
Complex research/multi-step questions	Agentic RAG
Technical docs (exact + semantic)	Multi-Vector RAG
Relationship-based queries	Graph RAG

When to Upgrade

Basic → Advanced:

Answer accuracy below 70% on your own evaluation set
Users complain about irrelevant results
Technical domain with specific terminology
Ambiguous queries are common

Advanced → Memory:

Building a chatbot
Users have repeat interactions
Personalization improves experience
Multi-turn conversations

Memory → Agentic:

Questions require research
Need to synthesize multiple sources
Multi-hop reasoning required
Questions like “compare X and Y across Z”

Production Considerations

Building production-ready RAG systems requires careful consideration of infrastructure, costs, performance, and monitoring. Here’s what you need to know.

1. Vector Database Selection

Choosing the right vector database is critical for production performance and reliability.

Database	Best For	Pros	Cons
Pinecone	Production apps	Managed, fast, reliable	Cost
Weaviate	Open-source needs	Self-hosted, feature-rich	Setup complexity
Qdrant	High performance	Very fast, Rust-based	Smaller community
ChromaDB	Prototyping	Easy setup, Python-friendly	Not production-ready

2. Cost Optimization

Understanding and managing costs is essential for sustainable RAG deployments.

Typical Costs per 1000 Queries (rough estimates; actual costs depend on model pricing and context length):

Basic RAG: $2-5
Advanced RAG: $6-12
Memory RAG: $3-8
Agentic RAG: $15-40

Optimization Strategies:

Cache common queries
Use cheaper models for embeddings
Implement query throttling
Batch operations when possible
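
As one possible take on "cache common queries", the sketch below keeps a small LRU cache keyed on a normalized form of the question, so repeated or trivially reworded questions skip the retrieval and generation calls. The `QueryCache` class and the `compute` callback are hypothetical names for illustration. Exact-match caching like this will not catch paraphrases, which would need a semantic cache.

```python
import hashlib
from collections import OrderedDict

class QueryCache:
    """Small LRU cache keyed on a normalized form of the question (illustrative)."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(question):
        # Lowercase and collapse whitespace so trivial variations share one entry
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, question, compute):
        """Return the cached answer, or call compute(question) and cache the result."""
        key = self._key(question)
        if key in self._store:
            self._store.move_to_end(key)  # Mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = compute(question)
        self._store[key] = answer
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # Evict the least recently used entry
        return answer
```

In use, `compute` would be the expensive RAG call, for example `cache.get_or_compute(question, rag.query)`.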

3. Performance Tuning

Different document types and use cases require different chunk sizes and retrieval settings.

# Chunk size optimization
CHUNK_SIZES = {
    "technical_docs": 800,  # More context needed
    "chat_logs": 400,       # Conversational
    "legal_documents": 1000, # Dense information
    "news_articles": 600    # Balanced
}

# Top-k tuning
TOP_K_SETTINGS = {
    "simple_faq": 3,
    "research": 10,
    "general_qa": 5
}

# Re-ranking budget
USE_RERANKING_IF = {
    "query_length": "> 10 words",
    "ambiguous": True,
    "high_stakes": True  # Legal, medical, financial
}
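
One way to read the `USE_RERANKING_IF` table is as a rule that fires when any single condition holds. The sketch below makes that reading executable. The `should_rerank` name, its parameters, and the any-condition interpretation are assumptions for illustration, not part of the original configuration.

```python
def should_rerank(query, ambiguous=False, high_stakes=False, min_words=10):
    """Return True when the (assumed) re-ranking rule applies.

    Re-rank if the query is longer than `min_words` words, or the caller has
    flagged it as ambiguous or high stakes (legal, medical, financial).
    """
    long_query = len(query.split()) > min_words
    return long_query or ambiguous or high_stakes

# Short, unambiguous FAQ-style query: skip the extra re-ranking cost
skip = should_rerank("What is our refund policy?")

# High-stakes query: always pay for re-ranking
rerank = should_rerank("Is this clause enforceable?", high_stakes=True)
```

How `ambiguous` and `high_stakes` get set is left open here; in practice they might come from a classifier or from the route the query arrived on.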

4. Monitoring and Metrics

Track key metrics to ensure your RAG system performs well in production.

{
    "retrieval_metrics": {
        "latency_p50": "< 500ms",
        "latency_p99": "< 2s",
        "relevance_score": "> 0.7"
    },
    "generation_metrics": {
        "answer_latency": "< 3s",
        "answer_length": "50-300 words",
        "citation_rate": "> 80%"
    },
    "quality_metrics": {
        "user_satisfaction": "> 4/5",
        "thumbs_up_rate": "> 70%",
        "follow_up_rate": "< 30%"
    }
}
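
To check the latency targets above against real traffic, you need percentiles computed from recorded samples. The following is a minimal sketch using the nearest-rank method. The `percentile` and `check_retrieval_latency` helpers are hypothetical, and the default budgets simply mirror the retrieval targets listed above.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_retrieval_latency(latencies_ms, p50_budget_ms=500, p99_budget_ms=2000):
    """Compare observed p50/p99 latency (in milliseconds) with the budgets."""
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "p50_ok": p50 < p50_budget_ms,
        "p99_ok": p99 < p99_budget_ms,
    }

# Hypothetical samples: mostly fast, with two slow outliers out of 100
report = check_retrieval_latency([100] * 98 + [3000, 3000])
```

Here the median looks healthy while the p99 budget is missed, which is why tracking both is useful.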

5. Common Pitfalls and Solutions

Pitfall 1: “Chunk Boundaries Cut Off Important Info”

Problem: The answer spans two chunks, and neither chunk alone contains enough of it to be retrieved.

Chunk 1: "...The policy states that remote work..."
Chunk 2: "...is allowed up to 3 days per week."

Solution: Use overlapping chunks

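def chunk_with_overlap(text, size=500, overlap=50):
    """Split text into fixed-size chunks that share `overlap` characters."""
    if not 0 <= overlap < size:
        # overlap >= size would never advance and loop forever
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    step = size - overlap  # Advance less than `size` so neighbors overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # This chunk reached the end; skip a redundant tail chunk
        start += step
    return chunks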

Pitfall 2: “Too Many Irrelevant Results”

Problem: Top-k returns junk documents

Solutions:

Similarity Threshold: Only use docs > 0.7 similarity
Metadata Filtering: Filter by date, category, etc.
Re-ranking: Use LLM to score relevance
Better Chunking: Smaller, more focused chunks
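
The first two solutions can be combined in a small post-retrieval filter. This is a sketch under stated assumptions: each result is taken to be a dict with a `score` similarity value and an optional `metadata` dict, and `filter_results` is a hypothetical helper, not part of any vector database client. Many vector databases can apply metadata filters server-side, which is usually cheaper than filtering afterwards.

```python
def filter_results(results, min_score=0.7, required_metadata=None):
    """Keep results above the similarity threshold that match all metadata fields."""
    required_metadata = required_metadata or {}
    kept = []
    for doc in results:
        if doc["score"] <= min_score:
            continue  # Below (or at) the similarity threshold
        metadata = doc.get("metadata", {})
        if all(metadata.get(key) == value for key, value in required_metadata.items()):
            kept.append(doc)
    # Highest-scoring documents first
    return sorted(kept, key=lambda doc: doc["score"], reverse=True)

# Hypothetical retrieval output
results = [
    {"content": "Remote work policy", "score": 0.82, "metadata": {"category": "hr"}},
    {"content": "Cafeteria menu", "score": 0.41, "metadata": {"category": "hr"}},
    {"content": "VPN setup guide", "score": 0.91, "metadata": {"category": "it"}},
]

hr_docs = filter_results(results, min_score=0.7, required_metadata={"category": "hr"})
```

If the filter removes everything, it is generally better to tell the user nothing relevant was found than to fall back to the low-scoring documents.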

Pitfall 3: “Hallucinations Despite RAG”

Problem: LLM makes things up even with sources

Solutions:

Stricter System Prompt: “ONLY use information from sources. If unsure, say ‘I don’t have information about that in the provided sources.’”
Temperature = 0: Reduce creativity
Post-processing: Check if answer content appears in sources
Citation Requirement: Force [Source N] citations
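
The post-processing and citation checks can be approximated with simple string heuristics. The sketch below assumes answers cite sources as `[Source N]`, matching the format used in the Basic RAG prompt. `check_citations`, `unsupported_sentences`, and the 0.5 word-overlap threshold are hypothetical. Word overlap is a crude proxy for grounding, so treat flagged sentences as candidates for review rather than proven hallucinations.

```python
import re

def check_citations(answer, num_sources):
    """Verify the answer cites at least one source and only cites existing ones."""
    cited = {int(n) for n in re.findall(r"\[Source (\d+)\]", answer)}
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    return {
        "cited": sorted(cited),
        "invalid": invalid,
        "ok": bool(cited) and not invalid,
    }

def unsupported_sentences(answer, sources, min_overlap=0.5):
    """Flag sentences whose words mostly do not appear anywhere in the sources."""
    source_words = set(re.findall(r"[a-z0-9]+", " ".join(sources).lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        text = re.sub(r"\[Source \d+\]", "", sentence)  # Ignore the citation marker itself
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

sources = ["Employees can work remotely up to 3 days per week with manager approval."]
answer = ("Employees can work remotely up to 3 days per week [Source 1]. "
          "The office has a rooftop pool.")

citation_report = check_citations(answer, num_sources=len(sources))
suspect = unsupported_sentences(answer, sources)
```

In this example the citation check passes, but the second sentence is flagged because none of its words occur in the source.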

Pitfall 4: “Slow Query Times”

Problem: Takes 5-10 seconds per query

Solutions:

Use faster embedding models
Implement caching
Reduce top-k
Use async operations
Consider adding a cache layer in front of the database
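
For the async suggestion, the main win usually comes from running independent retrieval calls concurrently instead of one after another. The sketch below uses only the standard library. `retrieve_all` and the idea that each retriever is an async callable returning a list of documents are assumptions for illustration.

```python
import asyncio

async def retrieve_all(question, retrievers, timeout_s=2.0):
    """Run several async retrievers concurrently; skip any that fail or time out."""

    async def run(retriever):
        # Bound each retriever individually so one slow backend cannot stall the query
        return await asyncio.wait_for(retriever(question), timeout=timeout_s)

    results = await asyncio.gather(
        *(run(retriever) for retriever in retrievers),
        return_exceptions=True,  # Collect failures instead of raising on the first one
    )
    docs = []
    for result in results:
        if isinstance(result, Exception):
            continue  # A failed or timed-out retriever contributes nothing
        docs.extend(result)
    return docs
```

Total latency is then roughly that of the slowest retriever (capped by `timeout_s`) rather than the sum of all of them. Silently dropping failures keeps the query alive, but in production you would likely also log them.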

Summary and Next Steps

Key Takeaways:

Start Simple: Begin with Basic RAG, add complexity as needed
Hybrid Works Best: Vector + keyword search outperforms either alone
Memory for Conversations: Essential for chatbots and assistants
Monitor Quality: Track metrics, iterate on poor performance
Cost vs Quality: Advanced techniques cost more but deliver better results

Choosing the Right RAG Type

RAG Type	Best For	Complexity	Latency
Basic	Simple Q&A, MVPs	Low	Fast
Advanced	Production systems, diverse queries	Medium	Moderate
Memory	Chatbots, personalized assistants	Medium	Moderate
Agentic	Complex, multi-step questions	High	Slow
Multi-Vector	Hybrid search requirements	Medium	Moderate
Graph	Connected data, relationship queries	High	Variable

Key Takeaways

Start Simple

Begin with Basic RAG, then add complexity as needed based on failure analysis.

Hybrid is Best

Combining vector + keyword search outperforms either alone for most use cases.

Memory Matters

For conversational AI, memory context dramatically improves response relevance.

Graphs for Relationships

When entities and relationships matter, Graph RAG provides structured understanding.

What’s Next

Tool Calling

Learn how to give LLMs the ability to call functions and external APIs
