December 2025 Update: Complete guide to DSPy - the framework that treats LLM calls as optimizable modules instead of brittle prompts.

What is DSPy?

The core insight behind DSPy is that prompt engineering is a local maximum. You spend hours tweaking a prompt for GPT-4, then OpenAI releases a new model and your carefully crafted prompt works worse. Or you switch to Claude and everything breaks because the two models respond differently to the same instructions. DSPy treats this problem like a compiler treats assembly: you write high-level intent, and the framework figures out the optimal “machine code” (prompt) for your target model.

DSPy (Declarative Self-improving Python) is a framework from Stanford NLP that replaces:
  • Prompting → with Programming
  • String manipulation → with Typed signatures
  • Manual tuning → with Automatic optimization
Traditional Prompting              DSPy Approach
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
"You are an expert..."             class MyModule(dspy.Module):
"Step 1: First..."                     def forward(self, question):
"Example: ..."                             return self.predictor(question=question)
Manual prompt engineering           Automatic optimization
Fragile to model changes           Model-agnostic

Installation and Setup

pip install dspy-ai

import dspy
from dspy import Example

# Configure LM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Or use local models
# lm = dspy.LM("ollama/llama3.2")
# lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")

Core Concepts

Signatures: Define Input/Output

Signatures are DSPy’s replacement for prompt templates. Instead of writing “You are an expert that takes a question and returns an answer,” you declare the input and output fields and their types. The framework generates the actual prompt from that declaration and, crucially, can optimize the generated prompt later without you changing any code. Think of a signature as a function type signature in a statically typed language: it states what goes in and what comes out, and the implementation details are handled elsewhere. A signature can be written inline or as a class:
import dspy

# Simple signature (inline)
qa = dspy.Predict("question -> answer")
result = qa(question="What is the capital of France?")
print(result.answer)  # Paris

# Class-based signature for more control
class QuestionAnswer(dspy.Signature):
    """Answer questions with concise, accurate responses."""
    
    question: str = dspy.InputField(desc="The question to answer")
    answer: str = dspy.OutputField(desc="A concise answer")

qa = dspy.Predict(QuestionAnswer)
result = qa(question="What is machine learning?")
print(result.answer)
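
To make the "signature as type declaration" idea concrete, here is a toy parser for the inline string form. It only illustrates how a string such as "question -> answer" separates into named inputs and outputs. `parse_signature` is a hypothetical helper written for this guide, not DSPy's own parser, and it ignores cases the real one may handle (for example, types that themselves contain commas).

```python
def parse_signature(spec: str) -> tuple[dict[str, str], dict[str, str]]:
    """Split 'a, b: int -> c' into ({'a': 'str', 'b': 'int'}, {'c': 'str'})."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")

    def parse_fields(part: str) -> dict[str, str]:
        fields: dict[str, str] = {}
        for chunk in part.split(","):
            # "name: type" or just "name"; untyped fields default to str.
            name, _, type_name = chunk.partition(":")
            name = name.strip()
            if not name.isidentifier():
                raise ValueError(f"bad field name {name!r} in {spec!r}")
            fields[name] = type_name.strip() or "str"
        return fields

    return parse_fields(left), parse_fields(right)


inputs, outputs = parse_signature("context, question -> answer")
print(inputs, outputs)
```

The point of the sketch is that nothing in the declaration says how to phrase the prompt; that part is left to the framework.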

Multi-field Signatures

class SentimentAnalysis(dspy.Signature):
    """Analyze the sentiment of text."""
    
    text: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")
    confidence: float = dspy.OutputField(desc="Confidence score 0-1")
    reasoning: str = dspy.OutputField(desc="Brief explanation")

analyzer = dspy.Predict(SentimentAnalysis)
result = analyzer(text="I absolutely loved this product!")
print(f"Sentiment: {result.sentiment} ({result.confidence})")
print(f"Reason: {result.reasoning}")

Modules: Building Blocks

DSPy modules are like LEGO bricks for LLM pipelines. Each module wraps a specific reasoning pattern (step-by-step thinking, code generation, tool use) and can be composed with other modules to build complex workflows. The key advantage over raw API calls: each module is independently optimizable, so you can improve one stage of your pipeline without disrupting others.

ChainOfThought: Step-by-Step Reasoning

ChainOfThought automatically adds a reasoning field to your output, prompting the model to show its work before giving a final answer. This is the DSPy equivalent of asking someone to “think out loud”: it often improves accuracy on tasks that require multi-step logic, at the cost of more output tokens.
class MathProblem(dspy.Signature):
    """Solve math word problems step by step."""
    
    problem: str = dspy.InputField()
    answer: float = dspy.OutputField()

# ChainOfThought wraps your signature with a "reasoning" step.
# The model must explain its thinking before committing to an answer.
solver = dspy.ChainOfThought(MathProblem)

result = solver(problem="""
    A store has 45 apples. They sell 12 in the morning and 
    receive a shipment of 30 more. How many apples do they have now?
""")

print(f"Answer: {result.answer}")
print(f"Reasoning: {result.reasoning}")  # Shows step-by-step work

ProgramOfThought: Code-Based Reasoning

class Calculation(dspy.Signature):
    """Solve problems by writing Python code."""
    
    problem: str = dspy.InputField()
    answer: str = dspy.OutputField()

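# ProgramOfThought asks the model to write Python and then executes it locally.
# Recent DSPy releases may require Deno to be installed for the sandboxed interpreter.
solver = dspy.ProgramOfThought(Calculation)
result = solver(problem="Calculate compound interest on $1000 at 5% for 10 years")
print(result.answer)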

ReAct: Reason and Act

class SearchAndAnswer(dspy.Signature):
    """Answer questions using search when needed."""
    
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# Define tools
def search(query: str) -> str:
    """Search the web for information."""
    # Implement actual search
    return f"Search results for: {query}"

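def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    # WARNING: eval() runs arbitrary Python. The model chooses this string,
    # so replace eval() with a restricted evaluator before production use.
    return str(eval(expression))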

# Create ReAct agent
react = dspy.ReAct(
    SearchAndAnswer,
    tools=[search, calculate],
    max_iters=5
)

result = react(question="What is the population of Tokyo and what's 10% of that?")
print(result.answer)
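
Because the model decides what string reaches the calculate tool, passing that string to eval() lets it run arbitrary Python. A minimal sketch of a restricted alternative follows. It parses the expression with the standard library's ast module and allows only numeric literals and basic arithmetic. `safe_calculate` is a hypothetical name, and the operator whitelist is an assumption you would extend to fit your use case.

```python
import ast
import operator

# Whitelisted operators; any other syntax (calls, names, attributes) is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST) -> float:
    """Recursively evaluate a parsed expression made of numbers and whitelisted operators."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def safe_calculate(expression: str) -> str:
    """Evaluate +, -, *, / over numeric literals; return the result or an error string."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_evaluate(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError, OverflowError) as exc:
        return f"Error: {exc}"


print(safe_calculate("2 + 3 * 4"))
print(safe_calculate("__import__('os').getcwd()"))
```

Returning the error as a string rather than raising keeps the tool usable inside an agent loop: the model sees the failure as an observation and can try a different expression.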

Building Complex Pipelines

Custom Modules

class RAGModule(dspy.Module):
    """RAG pipeline as a DSPy module."""
    
    def __init__(self, num_docs: int = 3):
        super().__init__()
        self.num_docs = num_docs
        # dspy.Retrieve needs a retrieval model configured first, e.g. dspy.configure(rm=...)
        self.retriever = dspy.Retrieve(k=num_docs)
        self.generate = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question: str) -> dspy.Prediction:
        # Retrieve relevant documents
        context = self.retriever(question).passages
        
        # Generate answer using context
        result = self.generate(
            context=context,
            question=question
        )
        
        return result

# Usage
rag = RAGModule(num_docs=5)
answer = rag(question="What are the benefits of RAG?")

Multi-Stage Pipelines

class ResearchPipeline(dspy.Module):
    """Multi-stage research and synthesis pipeline."""
    
    def __init__(self):
        super().__init__()
        
        # Stage 1: Query decomposition
        self.decompose = dspy.ChainOfThought(
            "question -> sub_questions: list[str]"
        )
        
        # Stage 2: Research each sub-question
        self.research = dspy.ChainOfThought(
            "sub_question -> findings"
        )
        
        # Stage 3: Synthesize findings
        self.synthesize = dspy.ChainOfThought(
            "question, all_findings: list[str] -> comprehensive_answer"
        )
    
    def forward(self, question: str) -> dspy.Prediction:
        # Decompose into sub-questions
        decomposed = self.decompose(question=question)
        sub_questions = decomposed.sub_questions
        
        # Research each
        all_findings = []
        for sq in sub_questions[:5]:  # Limit
            result = self.research(sub_question=sq)
            all_findings.append(result.findings)
        
        # Synthesize
        final = self.synthesize(
            question=question,
            all_findings=all_findings
        )
        
        return final

Optimization: The Power of DSPy

This is where DSPy justifies its learning curve. Instead of manually tweaking prompts and hoping for the best, you provide training examples and a metric function (“did the answer match?”), and DSPy searches the space of possible prompts, few-shot examples, and configurations to find what works best. It is like hyperparameter tuning for prompts — except the “hyperparameters” are the instructions, examples, and reasoning strategies themselves.

Automatic Prompt Optimization

DSPy can automatically optimize your prompts using training examples:
# Define training examples
trainset = [
    Example(question="What is 2+2?", answer="4").with_inputs("question"),
    Example(question="What is the capital of Japan?", answer="Tokyo").with_inputs("question"),
    Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
]

# Create module
qa = dspy.Predict("question -> answer")

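# Optimize with MIPROv2
from dspy.teleprompt import MIPROv2

# Metrics are also called as metric(example, pred) during evaluation,
# so trace needs a default value.
def exact_match(example, pred, trace=None):
    return pred.answer.lower() == example.answer.lower()

# Note: constructor arguments differ across DSPy releases. Some newer versions
# default to auto="light" and may reject num_candidates unless auto=None is passed.
optimizer = MIPROv2(
    metric=exact_match,
    num_candidates=10,
    init_temperature=1.0
)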

optimized_qa = optimizer.compile(qa, trainset=trainset)

# The optimized module has better prompts
result = optimized_qa(question="What is the speed of light?")

BootstrapFewShot: Learn from Examples

BootstrapFewShot is the fastest path from “I have some labeled examples” to “my model is measurably better.” It runs your module on the training examples, keeps the runs that pass your metric (including the intermediate reasoning the model produced), and injects those as few-shot demonstrations into future prompts. The name “bootstrap” reflects that the demonstrations are generated by the model itself: you are bootstrapping quality examples from your own model’s successful runs.
from dspy.teleprompt import BootstrapFewShot

# The metric function is the heart of optimization. DSPy uses it to
# decide which prompts and examples produce good results.
def accuracy_metric(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# Bootstrap optimizer
bootstrap = BootstrapFewShot(
    metric=accuracy_metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=16
)

# Compile
optimized = bootstrap.compile(qa, trainset=trainset)
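
The run, filter, and inject loop described above can be sketched in plain Python. This is a simplified illustration of the idea, not DSPy's implementation: `toy_model` stands in for an LLM call, and `bootstrap_demos` and `build_prompt` are hypothetical helpers.

```python
from typing import Callable

Record = dict[str, str]  # {"question": ..., "answer": ...}


def bootstrap_demos(
    model: Callable[[str], str],
    trainset: list[Record],
    metric: Callable[[Record, str], bool],
    max_demos: int = 4,
) -> list[Record]:
    """Run the model on each training example and keep the runs that pass the metric."""
    demos: list[Record] = []
    for example in trainset:
        prediction = model(example["question"])
        if metric(example, prediction):
            # Keep the model's own output as the demonstration, not the gold label.
            demos.append({"question": example["question"], "answer": prediction})
        if len(demos) >= max_demos:
            break
    return demos


def build_prompt(demos: list[Record], question: str) -> str:
    """Prepend the bootstrapped demonstrations as few-shot examples."""
    shots = "\n\n".join(f"Q: {d['question']}\nA: {d['answer']}" for d in demos)
    return f"{shots}\n\nQ: {question}\nA:" if shots else f"Q: {question}\nA:"


def toy_model(question: str) -> str:
    """Stand-in for an LLM: answers two known questions, fails on everything else."""
    canned = {"What is 2+2?": "4", "What is the capital of Japan?": "Tokyo"}
    return canned.get(question, "I don't know")


toy_trainset = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]
demos = bootstrap_demos(
    toy_model,
    toy_trainset,
    metric=lambda example, pred: example["answer"].lower() in pred.lower(),
)
print(build_prompt(demos, "What is the capital of France?"))
```

Only the two questions the toy model answers correctly become demonstrations. The failed Hamlet run is discarded, which is why a model that never succeeds leaves nothing to bootstrap from.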

Evaluation

from dspy.evaluate import Evaluate

# Create test set
testset = [
    Example(question="What is 3+3?", answer="6").with_inputs("question"),
    Example(question="Capital of Germany?", answer="Berlin").with_inputs("question"),
]

# Evaluate
evaluator = Evaluate(
    devset=testset,
    metric=accuracy_metric,
    display_progress=True
)

score = evaluator(optimized_qa)
print(f"Accuracy: {score}%")
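
The score above is only meaningful if the test examples were never seen during optimization. One way to guarantee that is to split a single labeled pool once, with a fixed seed, before optimizing. The sketch below uses only the standard library; `split_examples` is a hypothetical helper, and it works on any list, so the same call should apply to a list of Example objects.

```python
import random


def split_examples(examples: list, test_fraction: float = 0.2, seed: int = 0) -> tuple[list, list]:
    """Shuffle a copy of the examples with a fixed seed and return (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one example.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


pool = [f"example-{i}" for i in range(10)]
train, test = split_examples(pool, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Fixing the seed keeps the split stable across runs, so a change in the score reflects a change in the module rather than a different test set.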

Advanced Patterns

Assertions and Constraints

Assertions are DSPy’s answer to the “the model sometimes returns garbage” problem. Instead of praying that the model follows your format requirements, you declare hard constraints and DSPy automatically retries when they are violated. This is like adding type checking at runtime: if the model returns a verdict that is not one of your three allowed values, DSPy backtracks and tries again with the constraint error as feedback.
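import dspy
# Note: dspy.Assert and assert_transform_module belong to the DSPy 2.5-era API.
# Later releases replace them with dspy.Refine and dspy.BestOfN.
from dspy.primitives.assertions import assert_transform_module, backtrack_handler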

class FactChecker(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("claim -> verdict, evidence")
    
    def forward(self, claim: str):
        result = self.generate(claim=claim)
        
        # Assert constraints
        # Normalize case and whitespace so "True" or " false " is not rejected.
        dspy.Assert(
            result.verdict.strip().lower() in ["true", "false", "unverifiable"],
            f"Verdict must be true/false/unverifiable, got: {result.verdict}"
        )
        
        dspy.Assert(
            len(result.evidence) > 20,
            "Evidence must be detailed (>20 chars)"
        )
        
        return result

# Wrap with assertion handling
checker = assert_transform_module(
    FactChecker(),
    backtrack_handler
)
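
The backtracking behaviour described above amounts to a retry loop that feeds the constraint error back into the next attempt. The following sketch shows that loop in plain Python under simplifying assumptions: `toy_generate` stands in for an LLM call, and `validate` and `generate_with_retries` are hypothetical helpers, not DSPy internals.

```python
from typing import Callable, Optional

ALLOWED_VERDICTS = {"true", "false", "unverifiable"}


def validate(verdict: str) -> Optional[str]:
    """Return an error message if the verdict breaks the constraint, else None."""
    if verdict.strip().lower() not in ALLOWED_VERDICTS:
        return f"Verdict must be true/false/unverifiable, got: {verdict}"
    return None


def generate_with_retries(
    generate: Callable[[str, Optional[str]], str],
    claim: str,
    max_retries: int = 2,
) -> str:
    """Call generate(claim, feedback) until the output validates or retries run out."""
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        verdict = generate(claim, feedback)
        feedback = validate(verdict)
        if feedback is None:
            return verdict.strip().lower()
    raise ValueError(f"constraint still violated after {max_retries} retries: {feedback}")


def toy_generate(claim: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM: ignores the format at first, complies once it sees feedback."""
    return "Probably true, I think" if feedback is None else "true"


print(generate_with_retries(toy_generate, "Water boils at 100 C at sea level."))
```

Bounding the number of retries matters in practice: each retry is another paid model call, and a model that never satisfies the constraint should fail loudly rather than loop.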

Typed Predictors

Typed predictors bring Pydantic validation to LLM outputs. Instead of parsing JSON strings and hoping the keys are right, you get a fully validated Python object. If the model returns people: "Tim Cook" (a string instead of a list), the type system catches it. This eliminates an entire class of bugs that plague JSON-mode prompt engineering.
from pydantic import BaseModel
from typing import List

class ExtractedEntities(BaseModel):
    people: List[str]
    organizations: List[str]
    locations: List[str]

class EntityExtraction(dspy.Signature):
    """Extract named entities from text."""
    
    text: str = dspy.InputField()
    entities: ExtractedEntities = dspy.OutputField()

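# In DSPy 2.5+, dspy.Predict handles Pydantic-typed fields directly;
# dspy.TypedPredictor is the older, deprecated entry point.
extractor = dspy.Predict(EntityExtraction)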
result = extractor(text="Apple CEO Tim Cook visited Paris to meet with Emmanuel Macron.")

print(result.entities.people)  # ['Tim Cook', 'Emmanuel Macron']
print(result.entities.organizations)  # ['Apple']
print(result.entities.locations)  # ['Paris']
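
To show the class of bug that validation catches, here is a dependency-free sketch of the same check. It uses a standard-library dataclass as a stand-in for the Pydantic model, and `parse_entities` is a hypothetical helper. Pydantic performs this kind of checking for you, with richer error messages.

```python
import json
from dataclasses import dataclass


@dataclass
class Entities:
    people: list[str]
    organizations: list[str]
    locations: list[str]


def parse_entities(raw_json: str) -> Entities:
    """Parse model output and reject any field that is not a list of strings."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object")
    fields = {}
    for name in ("people", "organizations", "locations"):
        value = data.get(name)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise TypeError(f"{name!r} must be a list of strings, got {value!r}")
        fields[name] = value
    return Entities(**fields)


good = parse_entities('{"people": ["Tim Cook"], "organizations": ["Apple"], "locations": ["Paris"]}')
print(good.people)

try:
    # A bare string where a list is expected is rejected instead of silently accepted.
    parse_entities('{"people": "Tim Cook", "organizations": [], "locations": []}')
except TypeError as exc:
    print(f"rejected: {exc}")
```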

Caching and Efficiency

import dspy

# dspy.LM caches responses by default; cache=True just makes that explicit.
# Pass cache=False to force a fresh call every time.
lm = dspy.LM("openai/gpt-4o-mini", cache=True)
dspy.configure(lm=lm)

# For testing without API calls, use a dummy LM with canned outputs
from dspy.utils import DummyLM

dspy.configure(lm=DummyLM([
    {"answer": "Paris"},
    {"answer": "Tokyo"}
]))

DSPy vs LangChain

This is one of the most common questions in the AI engineering community. The honest answer: they solve different problems. LangChain is a toolkit for quickly assembling LLM pipelines from pre-built components. DSPy is a compiler for making those pipelines work reliably. You might prototype with LangChain in an afternoon, then port the production version to DSPy when you need it to be measurably good rather than impressively demoed.
| Aspect | DSPy | LangChain |
| --- | --- | --- |
| Philosophy | Programming LLMs | Chaining prompts |
| Optimization | Automatic (data-driven) | Manual (prompt-driven) |
| Type Safety | Built-in with Pydantic | Limited, added via extensions |
| Learning Curve | Steeper (new mental model) | Gentler (familiar patterns) |
| Best For | Production systems with metrics | Prototyping and exploration |

When to Use DSPy

Use DSPy when:
  • Building production systems that need measurable optimization
  • You have training data to improve prompts
  • Type safety and reliability matter
  • You want model-agnostic code
Consider alternatives when:
  • Rapid prototyping is the priority
  • You don’t have training examples
  • Simple one-off tasks

Full Framework Comparison

The AI engineering framework landscape is crowded. Here is an honest comparison to help you choose.
| Factor | DSPy | LangChain | LlamaIndex | Raw API Calls |
| --- | --- | --- | --- | --- |
| Philosophy | Compile and optimize LLM programs | Chain together pre-built components | Specialized for data indexing and retrieval | Full control, no abstractions |
| Learning curve | Steep: requires understanding of signatures, modules, optimizers | Moderate: familiar patterns but large API surface | Moderate: focused on RAG use cases | Lowest: just HTTP calls |
| Prompt optimization | Automatic: core value proposition | Manual: you write and tweak prompts yourself | Manual with some template support | Manual |
| RAG support | Via dspy.Retrieve module | Extensive: many retriever integrations | Best-in-class: purpose-built for this | Build it yourself |
| Type safety | Built-in with Pydantic-backed signatures | Limited, improving | Limited | None unless you add it |
| Debugging | Inspect optimized prompts, trace module execution | Verbose callback system, LangSmith for tracing | Callback events, LlamaTrace | You control everything |
| Community and ecosystem | Growing, academic roots | Largest community, most integrations | Strong for RAG, growing | Universal |
| Production maturity | Newer, rapidly evolving API | Widely deployed but frequent breaking changes | Stable for RAG pipelines | Battle-tested |
| Best for | Teams with labeled data who need measurable quality improvements | Teams that need many integrations and rapid prototyping | Teams focused on document Q&A and retrieval | Teams that want no dependencies and full control |

Decision framework:
  1. Do you need the fastest path to a working demo? Use LangChain or raw API calls.
  2. Is your core use case document retrieval and Q&A? Start with LlamaIndex — it is purpose-built for this.
  3. Do you have labeled examples and need to measurably improve quality? Use DSPy — its optimization loop is unmatched.
  4. Do you want to minimize dependencies and maximize control? Use raw API calls with your own thin wrapper.
  5. In practice, many production systems use a combination: LlamaIndex for retrieval, DSPy for the generation module, and raw API calls for simple classification tasks.

DSPy Optimization Edge Cases

Small training sets produce overfitting. With fewer than 20 examples, BootstrapFewShot may find prompts that ace your training set but fail on new inputs. Always hold out a test set and evaluate on it. Never optimize and evaluate on the same data.

Optimization costs real money. MIPROv2 with num_candidates=10 makes many LLM calls to evaluate candidate prompts. On GPT-4o, optimizing a complex module can cost $5-50 depending on the number of training examples and candidates. Use GPT-4o-mini for optimization runs, then evaluate the winning prompt on your target model.

Non-deterministic optimization results. Running the same optimizer twice on the same data can produce different optimized prompts. Set random seeds where possible, and run optimization 2-3 times to check stability. If results vary wildly, your training set is too small or your metric is too noisy.

Metric function design is harder than it looks. A naive pred.answer == example.answer metric fails on valid paraphrases (“Paris, France” vs. “Paris”). Use fuzzy matching, semantic similarity, or LLM-as-judge metrics for open-ended outputs. The metric function is the single most important input to DSPy optimization: a bad metric optimizes toward the wrong goal.
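
As one illustration of the fuzzy-matching option, the sketch below normalizes both answers, accepts the gold answer when it appears as whole words inside the prediction, and otherwise falls back to a character-level similarity ratio from the standard library's difflib. `lenient_match` and the 0.8 threshold are assumptions chosen for the example. The function keeps the (example, pred, trace=None) shape used by the metrics in this guide.

```python
import re
import string
from difflib import SequenceMatcher
from types import SimpleNamespace


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def lenient_match(example, pred, trace=None, threshold: float = 0.8) -> bool:
    """True if the gold answer appears as whole words in the prediction, or is a near match."""
    gold, guess = normalize(example.answer), normalize(pred.answer)
    if not gold or not guess:
        return False
    # Pad with spaces so "4" does not match inside "14".
    if f" {gold} " in f" {guess} ":
        return True
    return SequenceMatcher(None, gold, guess).ratio() >= threshold


# SimpleNamespace stands in for the example and prediction objects.
gold = SimpleNamespace(answer="Paris")
print(lenient_match(gold, SimpleNamespace(answer="Paris, France")))
print(lenient_match(SimpleNamespace(answer="4"), SimpleNamespace(answer="14")))
```

A threshold like this trades false negatives for false positives, so it is worth spot-checking a sample of accepted and rejected pairs before trusting the metric during optimization.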

Complete Example: Optimized QA System

import dspy
from dspy import Example
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate import Evaluate

# 1. Configure
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# 2. Define signature
class QASignature(dspy.Signature):
    """Answer questions accurately and concisely."""
    
    context: str = dspy.InputField(desc="Background information")
    question: str = dspy.InputField(desc="Question to answer")
    answer: str = dspy.OutputField(desc="Concise, accurate answer")

# 3. Build module
class QASystem(dspy.Module):
    def __init__(self):
        super().__init__()
        self.qa = dspy.ChainOfThought(QASignature)
    
    def forward(self, context: str, question: str):
        return self.qa(context=context, question=question)

# 4. Create training data
trainset = [
    Example(
        context="Python is a programming language created by Guido van Rossum.",
        question="Who created Python?",
        answer="Guido van Rossum"
    ).with_inputs("context", "question"),
    Example(
        context="The Eiffel Tower is 330 meters tall and located in Paris.",
        question="How tall is the Eiffel Tower?",
        answer="330 meters"
    ).with_inputs("context", "question"),
    # Add more examples...
]

# 5. Define metric
def answer_match(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# 6. Optimize
optimizer = BootstrapFewShot(
    metric=answer_match,
    max_bootstrapped_demos=4
)

optimized_qa = optimizer.compile(QASystem(), trainset=trainset)

# 7. Evaluate
testset = [...]  # Your test examples
evaluator = Evaluate(devset=testset, metric=answer_match)
score = evaluator(optimized_qa)
print(f"Test accuracy: {score}%")

# 8. Use in production
result = optimized_qa(
    context="Machine learning is a subset of AI.",
    question="What is machine learning a subset of?"
)
print(result.answer)

# 9. Save optimized module
optimized_qa.save("optimized_qa.json")

# 10. Load later
loaded_qa = QASystem()
loaded_qa.load("optimized_qa.json")

Key Takeaways

Signatures Over Prompts

Define what your LLM does with typed signatures, not string prompts

Automatic Optimization

DSPy optimizes prompts automatically using your training examples

Composable Modules

Build complex pipelines from simple, reusable modules

Model Agnostic

Switch models without rewriting prompts

What’s Next

Capstone Project

Apply everything you’ve learned in a comprehensive AI engineering project

Interview Deep-Dive

Strong Answer:
  • DSPy is worth it when three conditions are met simultaneously: (1) you have measurable quality metrics (accuracy, F1, user satisfaction scores), (2) you have labeled training examples (at least 50-100 to bootstrap), and (3) you plan to maintain the system long-term across model updates. If all three are true, DSPy’s automatic optimization will typically match or outperform hand-crafted prompts within a few optimization cycles and can keep improving as you add data.
  • DSPy is NOT worth it when: you are prototyping and the prompt changes daily, you have no evaluation data to optimize against, your task is simple enough that a well-written zero-shot prompt gets 95% accuracy, or your team has no Python ML experience (the learning curve is real). A hand-crafted prompt that takes 2 hours to write and works well enough is better than spending 2 weeks learning DSPy for marginal improvement.
  • The honest trade-off is time horizon. In the short term (next 2 weeks), prompt engineering is always faster. In the medium term (3-6 months), DSPy pays off because OpenAI releases GPT-5 and your hand-crafted GPT-4o prompts break, while DSPy re-optimizes automatically. In the long term (1+ year), DSPy’s programmatic approach is dramatically more maintainable than a collection of prompt strings scattered across your codebase.
  • The migration path I recommend: do not rewrite everything at once. Pick your most important module (the one where quality matters most), port it to DSPy, optimize it, and measure the improvement. If DSPy improves quality by more than 5% on your eval set, port the next module. If not, keep your hand-crafted prompts.
Follow-up: Your DSPy-optimized module scores 92% on your eval set, but the optimized prompt it generates is a 2,000-token monstrosity that you cannot understand or debug. How do you handle prompt interpretability in DSPy?

This is a legitimate concern and a common criticism of DSPy. The optimized prompt might include automatically selected few-shot examples, chain-of-thought instructions, and formatting directives that work well but are opaque. The mitigation is treating the optimized prompt like a compiled binary: you do not read it, you test it. Your eval suite is your source of truth, not the prompt text. If you need interpretability for debugging, DSPy’s inspect_history() lets you see the actual prompts sent and responses received. For regulatory or compliance requirements where you must explain the system’s behavior, you can constrain DSPy’s optimization to only select few-shot examples (no prompt rewriting), which keeps the prompt human-readable while still getting the benefit of automated example selection.

Strong Answer:
  • BootstrapFewShot is essentially self-training through the LLM. You start with a small set of labeled examples (say, 20 question-answer pairs). DSPy runs your module on each training example and checks whether the output passes your metric function. For the examples where the model got the right answer AND produced good intermediate reasoning (for ChainOfThought modules), DSPy saves the complete trace — the input, the reasoning, and the correct output.
  • These successful traces become “bootstrapped demonstrations.” They are real examples of the model doing the task correctly, extracted from the model’s own behavior. DSPy then selects the best subset of these demonstrations (up to max_bootstrapped_demos) and injects them as few-shot examples into the prompt for future predictions.
  • The insight is that the model already knows how to do the task some percentage of the time. BootstrapFewShot identifies those successful cases and uses them as templates. If your model gets 70% accuracy zero-shot, the bootstrapped examples come from that 70% and help push the remaining 30% higher.
  • The quality filter is the metric function. Only traces where the metric returns True become candidates. This is why your metric function is so important — a loose metric (“does the output contain the answer anywhere?”) bootstraps mediocre examples. A strict metric (“does the output exactly match the expected answer?”) bootstraps only high-quality demonstrations.
  • The limitation: if your model gets 0% accuracy on a task zero-shot, there are no successful traces to bootstrap from. BootstrapFewShot cannot create quality from nothing — it amplifies existing capability. For tasks where the base model completely fails, you need to provide manually-crafted demonstrations via max_labeled_demos instead.
Follow-up: You run BootstrapFewShot and it improves accuracy from 70% to 85%. But the 15% of remaining failures are concentrated on a specific type of question. How do you diagnose and address this with DSPy?

Run your eval set through the optimized module and categorize the failures. DSPy’s Evaluate gives you per-example results, so you can filter for failures and look for patterns. If the failures cluster around a specific question type (say, multi-hop reasoning questions), the bootstrapped examples probably did not include any successful multi-hop traces. The fix is targeted: manually create 5-10 high-quality demonstrations for the failing category and add them to the labeled demos. Then re-run BootstrapFewShot, which will mix your manual demonstrations with new bootstrapped ones. Alternatively, switch to the MIPROv2 optimizer, which can also rewrite the task instructions (not just select examples) and may find better prompt phrasing for the difficult cases.
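
The categorization step can be as simple as counting failures per question type. The sketch below assumes you have exported per-example results into dictionaries with a hand-assigned `category` label and a boolean `passed` flag. Those field names and the sample records are made up for illustration and are not a DSPy output format.

```python
from collections import Counter


def failure_rates(results: list[dict]) -> dict[str, float]:
    """Return the fraction of failed examples for each category."""
    totals = Counter(r["category"] for r in results)
    failed = Counter(r["category"] for r in results if not r["passed"])
    return {category: failed[category] / totals[category] for category in totals}


# Made-up records for illustration only.
results = [
    {"category": "single-hop", "passed": True},
    {"category": "single-hop", "passed": True},
    {"category": "single-hop", "passed": False},
    {"category": "multi-hop", "passed": False},
    {"category": "multi-hop", "passed": False},
    {"category": "multi-hop", "passed": True},
    {"category": "multi-hop", "passed": False},
]
rates = failure_rates(results)
worst = max(rates, key=rates.get)
print(worst, rates[worst])
```

Comparing rates rather than raw counts avoids blaming a category just because it has more examples in the eval set.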
Strong Answer:
  • With raw API calls, each stage is a function that constructs a prompt string, calls the API, and parses the response. The three stages are coupled through their prompts — changing the output format of stage 1 requires updating the input parsing of stage 2. Prompt optimization is manual: you tweak stage 1’s prompt, re-run the pipeline, check final answer quality, and guess which stage caused the regression. This works for simple pipelines but becomes unmaintainable at 3+ stages.
  • With DSPy modules, each stage declares its input/output types through Signatures. Stage 1 outputs sub_questions: list[str], stage 2 takes a sub_question: str and outputs findings: str, stage 3 takes all_findings: list[str] and outputs comprehensive_answer: str. The types enforce a contract between stages — if stage 1 changes its output format, the type system catches the incompatibility.
  • The major practical benefit is independent optimization. You can optimize stage 2 (research quality) without touching stages 1 and 3. DSPy’s optimizers trace through the entire pipeline but adjust each module’s prompt independently. With raw API calls, “optimizing stage 2” means changing the prompt and hoping it does not break the downstream stages.
  • The second benefit is model swapping. Each module can use a different model. Stage 1 (query decomposition) might use GPT-4o-mini because it is simple. Stage 2 (research) might use a local model. Stage 3 (synthesis) uses GPT-4o for quality. With DSPy, this is a configuration change. With raw API calls, each function has its own OpenAI client setup.
  • The trade-off: DSPy adds an abstraction layer that has a learning curve and makes debugging harder when things go wrong. For a 2-stage pipeline on a hackathon project, the abstraction overhead is not worth it. For a 4+ stage production pipeline maintained by a team, it pays for itself in maintainability.
Follow-up: Your 3-stage DSPy pipeline works but is slow. Each stage makes a serial LLM call, and the total latency is 6 seconds. How do you optimize for speed?

The first optimization is parallelizing independent calls. If stage 1 decomposes into 5 sub-questions, the 5 stage-2 calls are independent and can run concurrently with asyncio.gather. This turns 5 serial calls into 1 parallel batch, cutting stage 2 latency from roughly five call durations to about one. Second, consider whether all stages need the same model. Stage 1 (decomposition) is a simple task, and GPT-4o-mini is often 2-3x faster than GPT-4o with comparable quality on tasks like this. Third, cache stage-2 results. If the same sub-question appears across different user queries (common for FAQs), cache the findings to avoid redundant LLM calls. Fourth, evaluate whether you need all three stages. Sometimes a single ChainOfThought module with “decompose the question, research each part, then synthesize” produces comparable quality to the 3-stage pipeline at 1/3 the latency. More stages is not always better.
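
The parallelization step can be sketched with the standard library alone. Here `research` is a hypothetical coroutine in which a short sleep simulates one stage-2 LLM call. Wiring this up to real DSPy modules, which are synchronous by default, is outside the scope of the sketch.

```python
import asyncio
import time


async def research(sub_question: str) -> str:
    """Stand-in for one stage-2 LLM call; the sleep simulates network latency."""
    await asyncio.sleep(0.1)
    return f"findings for: {sub_question}"


async def research_all(sub_questions: list[str]) -> list[str]:
    """Run all stage-2 calls concurrently; results come back in input order."""
    return await asyncio.gather(*(research(q) for q in sub_questions))


questions = [f"sub-question {i}" for i in range(5)]
start = time.perf_counter()
findings = asyncio.run(research_all(questions))
elapsed = time.perf_counter() - start
print(f"{len(findings)} results in {elapsed:.2f}s")
```

Because the five simulated calls overlap, the elapsed time should be close to a single call's latency rather than the sum of all five. With a real provider, rate limits cap how much concurrency actually helps.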