What is DSPy?
The core insight behind DSPy is that prompt engineering is a local maximum. You spend hours tweaking a prompt for GPT-4, then OpenAI releases a new model and your carefully crafted prompt works worse. Or you switch to Claude and everything breaks because the two models respond differently to the same instructions. DSPy treats this problem like a compiler treats assembly: you write high-level intent, and the framework figures out the optimal “machine code” (prompt) for your target model.
DSPy (Declarative Self-improving Python) is a framework from Stanford NLP that replaces:
- Prompting → with Programming
- String manipulation → with Typed signatures
- Manual tuning → with Automatic optimization
Installation and Setup
Core Concepts
Signatures: Define Input/Output
Signatures are DSPy’s replacement for prompt templates. Instead of writing “You are an expert that takes a question and returns an answer,” you declare the input and output types. The framework generates the actual prompt automatically and, crucially, can optimize that generated prompt later without you changing any code. Think of signatures as function type signatures in a statically typed language: they tell the system what goes in and what comes out, and the implementation details are handled elsewhere. Signatures define what your LLM module does.
Multi-field Signatures
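To make the declarative idea concrete, here is a minimal, library-free sketch of a multi-field signature being turned into a prompt. It is an illustration only, not DSPy's implementation or API: the `Signature` dataclass, `render_prompt`, and the field names are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Signature:
    """Hypothetical stand-in for a declarative signature (not the DSPy class)."""
    instructions: str
    inputs: dict[str, str] = field(default_factory=dict)   # field name -> description
    outputs: dict[str, str] = field(default_factory=dict)  # field name -> description


def render_prompt(sig: Signature, **values: str) -> str:
    """Build a prompt from the declaration; callers never write prompt text."""
    missing = [name for name in sig.inputs if name not in values]
    if missing:
        raise ValueError(f"missing input fields: {missing}")
    lines = [sig.instructions, ""]
    for name in sig.inputs:
        lines.append(f"{name}: {values[name]}")
    # Ask for each declared output field by name so the reply can be parsed.
    for name, desc in sig.outputs.items():
        lines.append(f"{name} ({desc}):")
    return "\n".join(lines)


qa = Signature(
    instructions="Answer the question using the context.",
    inputs={"context": "background passage", "question": "user question"},
    outputs={"answer": "short answer", "confidence": "low, medium, or high"},
)
prompt = render_prompt(
    qa,
    context="Paris is the capital of France.",
    question="What is the capital of France?",
)
print(prompt)
```

Because the prompt text is derived from the declaration, a framework is free to change how it is rendered (or add demonstrations) without the calling code changing.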
Modules: Building Blocks
DSPy modules are like LEGO bricks for LLM pipelines. Each module wraps a specific reasoning pattern (step-by-step thinking, code generation, tool use) and can be composed with other modules to build complex workflows. The key advantage over raw API calls: each module is independently optimizable, so you can improve one stage of your pipeline without disrupting others.
ChainOfThought: Step-by-Step Reasoning
ChainOfThought automatically adds a reasoning field to your output, prompting the model to show its work before giving a final answer. This is the DSPy equivalent of asking someone to “think out loud”: it often improves accuracy on tasks that require multi-step logic, at the cost of more output tokens.
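The mechanism can be pictured as a small transformation on the output fields. The sketch below is plain Python, not DSPy code: `with_reasoning` and `parse_fields` are hypothetical helpers, and a canned string stands in for a real model reply.

```python
def with_reasoning(output_fields: list[str]) -> list[str]:
    """Return the output fields with a reasoning field placed first."""
    if "reasoning" in output_fields:
        return list(output_fields)
    return ["reasoning", *output_fields]


def parse_fields(reply: str, fields: list[str]) -> dict[str, str]:
    """Parse 'name: value' lines from a model reply into a dict."""
    parsed = {}
    for line in reply.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in fields:
            parsed[name.strip()] = value.strip()
    return parsed


fields = with_reasoning(["answer"])
# A canned reply stands in for a real model call in this sketch.
canned_reply = "reasoning: 17 * 3 = 51, and 51 + 4 = 55.\nanswer: 55"
result = parse_fields(canned_reply, fields)
print(fields)
print(result["answer"])
```

The reasoning is requested first so that the answer is generated after, and conditioned on, the intermediate steps.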
ProgramOfThought: Code-Based Reasoning
ReAct: Reason and Act
Building Complex Pipelines
Custom Modules
Multi-Stage Pipelines
Optimization: The Power of DSPy
This is where DSPy justifies its learning curve. Instead of manually tweaking prompts and hoping for the best, you provide training examples and a metric function (“did the answer match?”), and DSPy searches the space of possible prompts, few-shot examples, and configurations to find what works best. It is like hyperparameter tuning for prompts, except the “hyperparameters” are the instructions, examples, and reasoning strategies themselves.
Automatic Prompt Optimization
DSPy can automatically optimize your prompts using training examples.
BootstrapFewShot: Learn from Examples
BootstrapFewShot is the fastest path from “I have some labeled examples” to “my model is measurably better.” It works by running your module on training examples, keeping the runs where the model got the right answer AND produced good reasoning, and then injecting those as few-shot demonstrations into future prompts. The name “bootstrap” reflects that the quality examples come from your own model’s successful runs.
Evaluation
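Evaluation boils down to running the program on held-out examples and averaging a metric. The following is a minimal, library-free sketch of that loop, not DSPy's evaluation API: `Example`, `evaluate`, `exact_match`, and the stub program are all hypothetical.

```python
from typing import Callable, NamedTuple


class Example(NamedTuple):
    question: str
    answer: str


def exact_match(example: Example, prediction: str) -> bool:
    """Case-insensitive exact match between prediction and reference."""
    return prediction.strip().lower() == example.answer.strip().lower()


def evaluate(
    program: Callable[[str], str],
    devset: list[Example],
    metric: Callable[[Example, str], bool],
) -> float:
    """Return the fraction of held-out examples where the metric passes."""
    if not devset:
        raise ValueError("devset is empty")
    passed = sum(metric(ex, program(ex.question)) for ex in devset)
    return passed / len(devset)


# Stub program standing in for an LLM-backed module.
def stub_program(question: str) -> str:
    return {"2+2?": "4", "Capital of France?": "paris"}.get(question, "unknown")


devset = [
    Example("2+2?", "4"),
    Example("Capital of France?", "Paris"),
    Example("Largest planet?", "Jupiter"),
]
score = evaluate(stub_program, devset, exact_match)
print(f"{score:.2f}")
```

Running the same function before and after optimization, on the same held-out set, is what makes an improvement measurable rather than anecdotal.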
Advanced Patterns
Assertions and Constraints
Assertions are DSPy’s answer to the “the model sometimes returns garbage” problem. Instead of praying that the model follows your format requirements, you declare hard constraints and DSPy automatically retries when they are violated. This is like adding type checking at runtime: if the model returns a verdict that is not one of your three allowed values, DSPy backtracks and tries again with the constraint error as feedback.
Typed Predictors
Typed predictors bring Pydantic validation to LLM outputs. Instead of parsing JSON strings and hoping the keys are right, you get a fully validated Python object. If the model returns people: "Tim Cook" (a string instead of a list), the type system catches it. This eliminates an entire class of bugs that plague JSON-mode prompt engineering.
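The validate-then-retry idea behind both typed predictors and the assertions described earlier can be sketched without any library. This is an assumption-laden illustration, not the Pydantic or DSPy API: a stdlib dataclass stands in for a Pydantic model, and `validate`, `extract_with_retry`, and `stub_model` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class Extraction:
    people: list[str]
    organizations: list[str]


def validate(raw: str) -> Extraction:
    """Parse JSON and enforce field types; raise ValueError with a readable message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in ("people", "organizations"):
        value = data.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"field '{key}' must be a list of strings")
    return Extraction(people=data["people"], organizations=data["organizations"])


def extract_with_retry(call_model, text: str, max_attempts: int = 3) -> Extraction:
    """Call the model, validate, and retry with the error message as feedback."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(text, feedback)
        try:
            return validate(raw)
        except ValueError as exc:
            feedback = f"Previous output was rejected: {exc}"
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Stub model: returns a string for 'people' first, then corrects itself after feedback.
def stub_model(text: str, feedback: str) -> str:
    if not feedback:
        return '{"people": "Tim Cook", "organizations": ["Apple"]}'
    return '{"people": ["Tim Cook"], "organizations": ["Apple"]}'


result = extract_with_retry(stub_model, "Tim Cook leads Apple.")
print(result)
```

The first stub reply reproduces the string-instead-of-list mistake; the validator rejects it, and the second attempt (which sees the error message) passes.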
Caching and Efficiency
DSPy vs LangChain
This is one of the most common questions in the AI engineering community. The honest answer: they solve different problems. LangChain is a toolkit for quickly assembling LLM pipelines from pre-built components. DSPy is a compiler for making those pipelines work reliably. You might prototype with LangChain in an afternoon, then port the production version to DSPy when you need it to be measurably good rather than impressively demoed.

| Aspect | DSPy | LangChain |
|---|---|---|
| Philosophy | Programming LLMs | Chaining prompts |
| Optimization | Automatic (data-driven) | Manual (prompt-driven) |
| Type Safety | Built-in with Pydantic | Limited, added via extensions |
| Learning Curve | Steeper (new mental model) | Gentler (familiar patterns) |
| Best For | Production systems with metrics | Prototyping and exploration |
When to Use DSPy
Use DSPy when:
- You are building production systems that need measurable optimization
- You have training data to improve prompts
- Type safety and reliability matter
- You want model-agnostic code
Use LangChain when:
- Rapid prototyping is the priority
- You don’t have training examples
- The task is a simple one-off
Full Framework Comparison
The AI engineering framework landscape is crowded. Here is an honest comparison to help you choose.

| Factor | DSPy | LangChain | LlamaIndex | Raw API Calls |
|---|---|---|---|---|
| Philosophy | Compile and optimize LLM programs | Chain together pre-built components | Specialized for data indexing and retrieval | Full control, no abstractions |
| Learning curve | Steep — requires understanding of signatures, modules, optimizers | Moderate — familiar patterns but large API surface | Moderate — focused on RAG use cases | Lowest — just HTTP calls |
| Prompt optimization | Automatic — core value proposition | Manual — you write and tweak prompts yourself | Manual with some template support | Manual |
| RAG support | Via dspy.Retrieve module | Extensive — many retriever integrations | Best-in-class — purpose-built for this | Build it yourself |
| Type safety | Built-in with Pydantic-backed signatures | Limited, improving | Limited | None unless you add it |
| Debugging | Inspect optimized prompts, trace module execution | Verbose callback system, LangSmith for tracing | Callback events, LlamaTrace | You control everything |
| Community and ecosystem | Growing, academic roots | Largest community, most integrations | Strong for RAG, growing | Universal |
| Production maturity | Newer, rapidly evolving API | Widely deployed but frequent breaking changes | Stable for RAG pipelines | Battle-tested |
| Best for | Teams with labeled data who need measurable quality improvements | Teams that need many integrations and rapid prototyping | Teams focused on document Q&A and retrieval | Teams that want no dependencies and full control |
- Do you need the fastest path to a working demo? Use LangChain or raw API calls.
- Is your core use case document retrieval and Q&A? Start with LlamaIndex — it is purpose-built for this.
- Do you have labeled examples and need to measurably improve quality? Use DSPy — its optimization loop is unmatched.
- Do you want to minimize dependencies and maximize control? Use raw API calls with your own thin wrapper.
- In practice, many production systems use a combination: LlamaIndex for retrieval, DSPy for the generation module, and raw API calls for simple classification tasks.
DSPy Optimization Edge Cases
Small training sets produce overfitting. With fewer than 20 examples, BootstrapFewShot may find prompts that ace your training set but fail on new inputs. Always hold out a test set and evaluate on it; never optimize and evaluate on the same data.
Optimization costs real money. MIPROv2 with num_candidates=10 makes many LLM calls to evaluate candidate prompts. On GPT-4o, optimizing a complex module can cost $5-50 depending on the number of training examples and candidates. Use GPT-4o-mini for optimization runs, then evaluate the winning prompt on your target model.
Non-deterministic optimization results. Running the same optimizer twice on the same data can produce different optimized prompts. Set random seeds where possible, and run optimization 2-3 times to check stability. If results vary wildly, your training set is too small or your metric is too noisy.
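The hold-out and stability advice above can be sketched in a few lines of plain Python. The helper names (`split_holdout`, `is_stable`), the 30% test fraction, the 0.05 spread threshold, and the scores are all illustrative assumptions, not DSPy defaults or measured results.

```python
import random


def split_holdout(examples: list, test_fraction: float = 0.3, seed: int = 0):
    """Shuffle with a fixed seed and split into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def is_stable(test_scores: list[float], max_spread: float = 0.05) -> bool:
    """Treat runs as stable if best and worst test scores differ by at most max_spread."""
    return max(test_scores) - min(test_scores) <= max_spread


examples = [f"example-{i}" for i in range(20)]
train, test = split_holdout(examples, test_fraction=0.3, seed=42)

# Hypothetical held-out scores from three optimization runs (made-up numbers).
run_scores = [0.71, 0.74, 0.70]
stable = is_stable(run_scores)
print(len(train), len(test), stable)
```

Fixing the seed keeps the split identical across runs, so any variation in held-out scores comes from the optimizer rather than from a different test set.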
Metric function design is harder than it looks. A naive pred.answer == example.answer metric fails on valid paraphrases (“Paris, France” vs. “Paris”). Use fuzzy matching, semantic similarity, or LLM-as-judge metrics for open-ended outputs. The metric function is the single most important input to DSPy optimization — a bad metric optimizes toward the wrong goal.
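One lenient alternative to strict equality is token-overlap F1 after normalization. The sketch below is one possible design under stated assumptions, not a DSPy built-in: `normalize`, `token_f1`, `lenient_metric`, and the 0.5 threshold are hypothetical choices you would tune for your task.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, and split into tokens."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lenient_metric(example_answer: str, predicted_answer: str, threshold: float = 0.5) -> bool:
    """Pass when the prediction overlaps the reference enough (threshold is an assumption)."""
    return token_f1(predicted_answer, example_answer) >= threshold


print(lenient_metric("Paris", "Paris, France"))
print(lenient_metric("Paris", "London"))
```

Token overlap still misses true paraphrases with no shared words, which is where semantic similarity or an LLM-as-judge metric becomes necessary.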
Complete Example: Optimized QA System
Key Takeaways
Signatures Over Prompts
Automatic Optimization
Composable Modules
Model Agnostic
Resources
What’s Next
Capstone Project
Interview Deep-Dive
Your team has spent two weeks hand-crafting prompts for a RAG pipeline. A colleague suggests rewriting everything in DSPy. What is your honest assessment of when DSPy is worth the migration cost versus when prompt engineering is sufficient?
- DSPy is worth it when three conditions are met simultaneously: (1) you have measurable quality metrics (accuracy, F1, user satisfaction scores), (2) you have labeled training examples (at least 50-100 to bootstrap), and (3) you plan to maintain the system long-term across model updates. If all three are true, DSPy’s automatic optimization will typically match or beat hand-crafted prompts within a few optimization cycles and can keep improving as you add data.
- DSPy is NOT worth it when: you are prototyping and the prompt changes daily, you have no evaluation data to optimize against, your task is simple enough that a well-written zero-shot prompt gets 95% accuracy, or your team has no Python ML experience (the learning curve is real). A hand-crafted prompt that takes 2 hours to write and works well enough is better than spending 2 weeks learning DSPy for marginal improvement.
- The honest trade-off is time horizon. In the short term (next 2 weeks), prompt engineering is usually faster. In the medium term (3-6 months), DSPy pays off when a new model ships (say OpenAI releases GPT-5) and your hand-crafted GPT-4o prompts regress: you re-run the optimizer against the new model instead of rewriting prompts by hand. In the long term (1+ year), DSPy’s programmatic approach is easier to maintain than a collection of prompt strings scattered across your codebase.
- The migration path I recommend: do not rewrite everything at once. Pick your most important module (the one where quality matters most), port it to DSPy, optimize it, and measure the improvement. If DSPy improves quality by more than 5% on your eval set, port the next module. If not, keep your hand-crafted prompts.
inspect_history() lets you see the actual prompts sent and responses received. For regulatory or compliance requirements where you must explain the system’s behavior, you can constrain DSPy’s optimization to only select few-shot examples (no prompt rewriting), which keeps the prompt human-readable while still getting the benefit of automated example selection.
Explain how BootstrapFewShot optimization works under the hood. Where does the 'bootstrapped' training data come from if you only provided a few labeled examples?
- BootstrapFewShot is essentially self-training through the LLM. You start with a small set of labeled examples (say, 20 question-answer pairs). DSPy runs your module on each training example and checks whether the output passes your metric function. For the examples where the model got the right answer AND produced good intermediate reasoning (for ChainOfThought modules), DSPy saves the complete trace — the input, the reasoning, and the correct output.
- These successful traces become “bootstrapped demonstrations.” They are real examples of the model doing the task correctly, extracted from the model’s own behavior. DSPy then selects the best subset of these demonstrations (up to max_bootstrapped_demos) and injects them as few-shot examples into the prompt for future predictions.
- The insight is that the model already knows how to do the task some percentage of the time. BootstrapFewShot identifies those successful cases and uses them as templates. If your model gets 70% accuracy zero-shot, the bootstrapped examples come from that 70% and help push the remaining 30% higher.
- The quality filter is the metric function. Only traces where the metric returns True become candidates. This is why your metric function is so important — a loose metric (“does the output contain the answer anywhere?”) bootstraps mediocre examples. A strict metric (“does the output exactly match the expected answer?”) bootstraps only high-quality demonstrations.
- The limitation: if your model gets 0% accuracy on a task zero-shot, there are no successful traces to bootstrap from. BootstrapFewShot cannot create quality from nothing; it amplifies existing capability. For tasks where the base model completely fails, you need to rely on labeled demonstrations taken directly from your training set (controlled by max_labeled_demos) instead.
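The loop described in the answer above can be sketched in plain Python. This is a simplified illustration of the idea, not DSPy's actual optimizer (which has more moving parts): `bootstrap_demos`, `build_prompt`, `Trace`, and the stub module are hypothetical, and only the parameter name `max_bootstrapped_demos` is borrowed from the text.

```python
from typing import Callable, NamedTuple


class Example(NamedTuple):
    question: str
    answer: str


class Trace(NamedTuple):
    question: str
    reasoning: str
    answer: str


def bootstrap_demos(
    run_module: Callable[[str], tuple[str, str]],
    trainset: list[Example],
    metric: Callable[[Example, str], bool],
    max_bootstrapped_demos: int = 4,
) -> list[Trace]:
    """Run the module on each example and keep traces that pass the metric."""
    demos: list[Trace] = []
    for example in trainset:
        reasoning, answer = run_module(example.question)
        if metric(example, answer):
            demos.append(Trace(example.question, reasoning, answer))
        if len(demos) >= max_bootstrapped_demos:
            break
    return demos


def build_prompt(demos: list[Trace], question: str) -> str:
    """Prepend the kept traces as few-shot demonstrations."""
    blocks = [
        f"Question: {d.question}\nReasoning: {d.reasoning}\nAnswer: {d.answer}"
        for d in demos
    ]
    blocks.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(blocks)


# Stub module: answers two of the three training questions correctly.
def stub_module(question: str) -> tuple[str, str]:
    table = {"2+2?": ("2 plus 2 is 4.", "4"), "3*3?": ("3 times 3 is 9.", "9")}
    return table.get(question, ("Not sure.", "0"))


trainset = [Example("2+2?", "4"), Example("10-7?", "3"), Example("3*3?", "9")]
demos = bootstrap_demos(stub_module, trainset, lambda ex, ans: ans == ex.answer)
prompt = build_prompt(demos, "5+5?")
print(prompt)
```

The failed run on "10-7?" never reaches the prompt, which is the filtering role the metric plays; if the stub got every question wrong, `demos` would be empty and there would be nothing to bootstrap from.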
You are building a multi-stage pipeline: query decomposition, retrieval, and answer synthesis. In DSPy, each stage is a separate module. How does this compare to building the same pipeline with raw API calls, and what are the practical benefits of the module abstraction?
- With raw API calls, each stage is a function that constructs a prompt string, calls the API, and parses the response. The three stages are coupled through their prompts — changing the output format of stage 1 requires updating the input parsing of stage 2. Prompt optimization is manual: you tweak stage 1’s prompt, re-run the pipeline, check final answer quality, and guess which stage caused the regression. This works for simple pipelines but becomes unmaintainable at 3+ stages.
- With DSPy modules, each stage declares its input/output types through Signatures. Stage 1 outputs sub_questions: list[str], stage 2 takes a sub_question: str and outputs findings: str, stage 3 takes all_findings: list[str] and outputs comprehensive_answer: str. The types enforce a contract between stages: if stage 1 changes its output format, the type system catches the incompatibility.
- The major practical benefit is independent optimization. You can optimize stage 2 (retrieval quality) without touching stages 1 and 3. DSPy’s optimizers trace through the entire pipeline but adjust each module’s prompt independently. With raw API calls, “optimizing stage 2” means changing the prompt and hoping it does not break the downstream stages.
- The second benefit is model swapping. Each module can use a different model. Stage 1 (query decomposition) might use GPT-4o-mini because it is simple. Stage 2 (retrieval ranking) might use a local model. Stage 3 (synthesis) uses GPT-4o for quality. With DSPy, this is a configuration change. With raw API calls, each function has its own OpenAI client setup.
- The trade-off: DSPy adds an abstraction layer that has a learning curve and makes debugging harder when things go wrong. For a 2-stage pipeline on a hackathon project, the abstraction overhead is not worth it. For a 4+ stage production pipeline maintained by a team, it pays for itself in maintainability.
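The stage contracts from the answer above can be sketched as three typed functions with a swappable model per stage. This is a library-free illustration, not DSPy code: the `LM` type alias, the stage functions, and the stub models are hypothetical, while the field names mirror those used in the text.

```python
from typing import Callable

LM = Callable[[str], str]  # hypothetical model interface: prompt in, text out


def decompose(question: str, lm: LM) -> list[str]:
    """Stage 1: split a question into sub_questions, one per line."""
    reply = lm(f"Break into sub-questions, one per line:\n{question}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def research(sub_question: str, lm: LM) -> str:
    """Stage 2: produce findings for one sub_question."""
    return lm(f"Findings for: {sub_question}").strip()


def synthesize(question: str, all_findings: list[str], lm: LM) -> str:
    """Stage 3: combine all_findings into a comprehensive_answer."""
    joined = "\n".join(f"- {finding}" for finding in all_findings)
    return lm(f"Question: {question}\nFindings:\n{joined}\nAnswer:").strip()


def pipeline(question: str, small_lm: LM, large_lm: LM) -> str:
    """Use a cheaper model for stages 1-2 and a stronger one for synthesis."""
    sub_questions = decompose(question, small_lm)
    all_findings = [research(sq, small_lm) for sq in sub_questions]
    return synthesize(question, all_findings, large_lm)


# Stub models standing in for real API clients.
def stub_small(prompt: str) -> str:
    if prompt.startswith("Break into"):
        return "What is DSPy?\nWhat is LangChain?"
    return "finding about " + prompt.removeprefix("Findings for: ")


def stub_large(prompt: str) -> str:
    return f"Synthesized answer from {prompt.count(chr(10) + '- ')} findings."


answer = pipeline("How do DSPy and LangChain differ?", stub_small, stub_large)
print(answer)
```

Swapping the model for a stage is a change at the call site of `pipeline`, and each stage can be tested (or tuned) in isolation because its inputs and outputs are fixed by the function signature.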