Chain-of-Thought (CoT) Prompting
Encourage step-by-step reasoning for complex problems. This is arguably the highest-ROI prompting technique you can learn. Just as a math teacher requires students to show their work (not because the answer matters less, but because the reasoning process catches errors), CoT forces the model to externalize its reasoning chain, which tends to reduce logical mistakes on multi-step problems.
Zero-Shot CoT
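A minimal sketch of zero-shot CoT: append a reasoning trigger to the question and ask for a clearly marked final answer so it can be extracted reliably. The `llm` callable is a placeholder for whatever client you use (an assumption, not a real library API), and the `Answer:` convention is one illustrative choice among many.

```python
import re
from typing import Callable, Optional

# Widely used zero-shot CoT trigger phrase.
COT_TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(question: str) -> str:
    """Append the reasoning trigger and request a marked final answer."""
    return (
        f"Question: {question}\n"
        f"{COT_TRIGGER}\n"
        "Finish with a line of the form 'Answer: <final answer>'."
    )


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if absent."""
    matches = re.findall(r"^Answer:\s*(.+)$", completion, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


def zero_shot_cot(question: str, llm: Callable[[str], str]) -> Optional[str]:
    """`llm` is a hypothetical prompt -> completion function."""
    return extract_answer(llm(build_zero_shot_cot_prompt(question)))
```

Separating prompt construction from answer extraction keeps both pieces easy to test without calling a model.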
Few-Shot CoT
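A minimal sketch of few-shot CoT: show the model a handful of worked examples, each with its reasoning spelled out, then pose the new question in the same format. The `Example` type and the `Q:`/`A:` layout are illustrative assumptions.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    question: str
    reasoning: str
    answer: str


def build_few_shot_cot_prompt(examples: List[Example], question: str) -> str:
    """Render worked examples followed by the new question."""
    parts = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in examples
    ]
    # Leave the final "A:" open so the model continues with its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


demo_examples = [
    Example(
        question="Tom has 3 apples and buys 2 more. How many apples does he have?",
        reasoning="Tom starts with 3 apples. 3 + 2 = 5.",
        answer="5",
    )
]
```

Keeping the examples in a consistent format matters more than their number: the model imitates the structure it sees.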
Self-Ask with CoT
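A rough sketch of Self-Ask: the model decides whether it needs follow-up questions, each follow-up is answered separately (by a search tool or another model call), and the loop ends when the model states a final answer. The marker strings are modeled on commonly used Self-Ask templates, and `llm` and `answer_follow_up` are hypothetical callables.

```python
from typing import Callable, Optional

FOLLOW_UP = "Follow up:"
FINAL = "So the final answer is:"


def self_ask(
    question: str,
    llm: Callable[[str], str],
    answer_follow_up: Callable[[str], str],
    max_rounds: int = 5,
) -> Optional[str]:
    """Alternate between model-generated follow-ups and their answers."""
    transcript = f"Question: {question}\nAre follow up questions needed here:"
    for _ in range(max_rounds):
        step = llm(transcript).strip()
        transcript += f" {step}\n"
        if FINAL in step:
            return step.split(FINAL, 1)[1].strip()
        if FOLLOW_UP not in step:
            return None  # Model left the expected format; give up.
        rest = step.split(FOLLOW_UP, 1)[1].strip()
        sub_question = rest.splitlines()[0] if rest else ""
        transcript += f"Intermediate answer: {answer_follow_up(sub_question)}\n"
    return None  # No final answer within max_rounds.
```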
Tree-of-Thought (ToT) Prompting
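One common way to implement ToT is a beam search over partial reasoning paths, sketched below. Note that this variant prunes to `beam_width` paths per level rather than expanding the full tree. The `propose` and `score` callables are hypothetical; in practice each would wrap an LLM call that generates candidate thoughts or rates a partial path.

```python
from typing import Callable, List, Tuple

Path = List[str]


def tree_of_thought(
    problem: str,
    propose: Callable[[str, Path, int], List[str]],
    score: Callable[[str, Path], float],
    breadth: int = 3,
    depth: int = 3,
    beam_width: int = 2,
) -> Path:
    """Return the highest-scoring reasoning path found by beam search."""
    beam: List[Path] = [[]]
    for _ in range(depth):
        candidates: List[Tuple[float, Path]] = []
        for path in beam:
            # Generate up to `breadth` next thoughts for this partial path.
            for thought in propose(problem, path, breadth):
                new_path = path + [thought]
                candidates.append((score(problem, new_path), new_path))
        if not candidates:
            break
        # Keep only the most promising paths for the next level.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = [path for _, path in candidates[:beam_width]]
    return beam[0]
```

The `beam_width` parameter is the main cost lever: a smaller beam means fewer generation and evaluation calls at each level.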
Explore multiple reasoning paths for complex problems. If Chain-of-Thought is like solving a maze by walking one path, Tree-of-Thought is like sending scouts down every fork simultaneously and having them report back which routes look most promising. ToT excels at problems where the first approach might be a dead end: creative writing, puzzle-solving, code architecture decisions, and strategic planning. The trade-off is real: ToT makes multiple LLM calls per step (generating thoughts, then evaluating each one), so latency and cost scale with breadth and depth. Use it when answer quality matters more than speed.
ReAct (Reasoning + Acting)
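A minimal sketch of a ReAct loop, assuming the model is prompted to emit either `Action: tool[input]` or `Final Answer: ...` at each step. The text format, the `llm` callable, and the `tools` mapping are illustrative assumptions rather than a specific library's API.

```python
import re
from typing import Callable, Dict, List, Optional

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.+)", re.DOTALL)


def react(
    question: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate model steps and tool observations until a final answer."""
    trajectory: List[str] = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm("\n".join(trajectory))
        trajectory.append(step)
        final = FINAL_RE.search(step)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(step)
        if action is None:
            trajectory.append("Observation: no valid action found.")
            continue
        name, arg = action.groups()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool '{name}'"
        trajectory.append(f"Observation: {observation}")
    return None  # Hit max_steps without a final answer.
```

The `max_steps` cap is the basic safety net against a loop that never converges.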
Combine reasoning with tool use for complex tasks.
Self-Consistency
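A minimal sketch of Self-Consistency: sample several completions, extract each final answer, and take a majority vote. Here `sample` stands in for a model call with temperature above zero and `extract` for your answer parser; both are hypothetical callables.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def self_consistency(
    prompt: str,
    sample: Callable[[str], str],
    extract: Callable[[str], Optional[str]],
    n_samples: int = 5,
) -> Tuple[Optional[str], float]:
    """Return the majority answer and the fraction of samples that agree."""
    answers: List[str] = []
    for _ in range(n_samples):
        answer = extract(sample(prompt))
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    # Samples with no extractable answer count against the agreement rate.
    return winner, votes / n_samples
```

The agreement rate is a rough confidence signal, not a calibrated probability.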
Generate multiple reasoning paths and select the most consistent answer.
Prompt Chaining
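A minimal sketch of a prompt chain, where each step's output becomes the next step's input and a validator runs between steps. The `(template, validator)` step shape and the `llm` callable are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Each step is a prompt template containing "{input}" plus an output validator.
Step = Tuple[str, Callable[[str], bool]]


def run_chain(initial_input: str, steps: List[Step], llm: Callable[[str], str]) -> str:
    """Run steps in order, stopping at the first output that fails validation."""
    current = initial_input
    for template, is_valid in steps:
        output = llm(template.format(input=current))
        if not is_valid(output):
            raise ValueError(f"Step failed validation: {template!r}")
        current = output
    return current
```

Failing fast at a validator keeps a bad intermediate result from propagating to later steps.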
Break complex tasks into sequential steps.
Least-to-Most Prompting
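A minimal sketch of Least-to-Most prompting: solve subproblems from simplest to hardest, feeding every earlier answer into the prompt for the next one. The `decompose` callable is hypothetical; it would typically be a model call that lists subproblems in order of difficulty, ending with the original question.

```python
from typing import Callable, List, Tuple


def least_to_most(
    problem: str,
    decompose: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (subproblem, answer) pairs; the last pair answers the full problem."""
    solved: List[Tuple[str, str]] = []
    for sub in decompose(problem):
        # Earlier question/answer pairs become context for the next subproblem.
        context = "".join(f"Q: {q}\nA: {a}\n" for q, a in solved)
        answer = llm(f"{context}Q: {sub}\nA:").strip()
        solved.append((sub, answer))
    return solved
```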
Decompose complex problems into simpler subproblems.
- Choose the technique based on problem complexity
- Chain-of-Thought works well for math and logical reasoning
- Tree-of-Thought excels at creative problem-solving
- ReAct is ideal when external tools are needed
- Self-Consistency improves reliability for high-stakes answers
Practice Exercise
Build a reasoning system that:
- Automatically selects the best prompting technique
- Implements at least 3 different techniques
- Evaluates and compares results across techniques
- Provides confidence scores for answers
- Supports custom tool integration
- Problem classification for technique selection
- Robust answer extraction
- Performance measurement
- Fallback strategies
Interview Deep-Dive
When would you choose Tree-of-Thought over Chain-of-Thought in a production system, and what are the cost implications?
- The key decision factor is whether the problem has a single reasoning path or multiple viable approaches. Chain-of-Thought works well for linear reasoning tasks like math, logical deduction, or step-by-step analysis where there is one clear path forward. Tree-of-Thought excels at problems where the first approach might be a dead end, such as creative generation, code architecture decisions, puzzle-solving, or strategic planning.
- In production, the cost trade-off is significant. ToT makes multiple LLM calls per step: you generate N candidate thoughts, then evaluate each one, and repeat this at every depth level. For a fully expanded tree with breadth 3 and depth 3 (no pruning), you are looking at 3 + 9 + 27 = 39 generation calls plus the same number of evaluation calls. That is 78 LLM calls for a single user query versus 1 call for CoT.
- The practical heuristic I use: if the task has a verifiable correct answer and the first-attempt success rate with CoT is above 80%, stick with CoT. If you need the answer to be creative or exploratory, or if the cost of a wrong answer is very high (say, generating a legal contract), the additional cost of ToT is justified.
- One pattern that works well in production is a “CoT-first, ToT-fallback” approach. Run CoT first, evaluate the confidence of the result, and only escalate to ToT if the confidence is below a threshold. This keeps average cost low while catching the hard cases.
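The call-count arithmetic and the CoT-first, ToT-fallback pattern from this answer can be sketched as follows. The count assumes a fully expanded tree with one evaluation call per generated thought. The `run_cot` and `run_tot` callables and the 0.7 threshold are hypothetical, and how confidence is measured (for example, a Self-Consistency agreement rate) is left open.

```python
from typing import Callable, Tuple


def tot_llm_calls(breadth: int, depth: int) -> int:
    """LLM calls for a fully expanded tree: generation plus evaluation."""
    generation = sum(breadth ** level for level in range(1, depth + 1))
    return 2 * generation  # One evaluation call per generated thought.


def answer_with_fallback(
    question: str,
    run_cot: Callable[[str], Tuple[str, float]],
    run_tot: Callable[[str], str],
    threshold: float = 0.7,
) -> Tuple[str, str]:
    """Try CoT first; escalate to ToT only when confidence is below threshold."""
    answer, confidence = run_cot(question)
    if confidence >= threshold:
        return answer, "cot"
    return run_tot(question), "tot"


print(tot_llm_calls(breadth=3, depth=3))  # 78
```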
Self-Consistency samples multiple reasoning paths and takes a majority vote. What failure modes have you seen with this approach, and how do you mitigate them?
- The biggest failure mode is when the model is confidently wrong in a consistent way. If 4 out of 5 samples all arrive at the same incorrect answer through similar flawed reasoning, majority voting amplifies the error instead of catching it. This happens most often with questions that trigger a common misconception the model learned during training.
- A second failure mode is answer extraction brittleness. Self-Consistency depends on being able to compare final answers across samples, but if the model phrases the same answer differently each time (“42”, “forty-two”, “the answer is 42 units”), your extraction logic might treat them as different answers and report artificially low confidence.
- Temperature selection matters more than most people realize. Too low (under 0.3) and all samples follow nearly identical reasoning paths, defeating the purpose. Too high (above 1.0) and you get incoherent samples that pollute the vote. The sweet spot is usually 0.5 to 0.8 depending on the task.
- To mitigate these issues, I normalize extracted answers before comparison (strip units, convert to canonical forms), use semantic similarity rather than exact match when answers are free-text, and combine Self-Consistency with a verification step where I ask the model to critique the majority answer. If the critique identifies a flaw, I fall back to the second-most-common answer or flag for human review.
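The answer-normalization step can be sketched as below for numeric answers. This is a deliberately small illustration: the word table covers only 0 to 99, and free-text answers fall through to a lowercase comparison rather than semantic similarity. All names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

WORD_TO_NUM = {word: i for i, word in enumerate(ONES)}
for tens_word, tens_value in TENS.items():
    WORD_TO_NUM[tens_word] = tens_value
    for i in range(1, 10):
        WORD_TO_NUM[f"{tens_word}-{ONES[i]}"] = tens_value + i


def normalize_answer(raw: str) -> str:
    """Map '42', 'forty-two', and 'the answer is 42 units' to '42'."""
    text = raw.strip().lower().replace(",", "")
    number = re.search(r"-?\d+(?:\.\d+)?", text)
    if number:
        value = float(number.group())
        return str(int(value)) if value.is_integer() else str(value)
    for token in re.findall(r"[a-z]+(?:-[a-z]+)?", text):
        if token in WORD_TO_NUM:
            return str(WORD_TO_NUM[token])
    return text.rstrip(".")  # Free text: compare the cleaned string.


def vote(raw_answers: List[str]) -> Tuple[str, float]:
    """Majority vote over normalized answers, with the agreement rate."""
    if not raw_answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize_answer(a) for a in raw_answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(raw_answers)
```

Without normalization, the three phrasings of 42 would split the vote three ways and the agreement rate would look artificially low.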
You are building a customer support agent that uses ReAct. The agent sometimes enters infinite loops between Thought and Action steps. How do you diagnose and fix this?
- Infinite loops in ReAct agents typically happen for one of three reasons. First, the observation from a tool is ambiguous or unhelpful, so the agent keeps retrying the same tool hoping for a better result. Second, the agent oscillates between two tools, each of which provides partial information that triggers calling the other. Third, the agent’s reasoning gets stuck in a circular pattern where it keeps re-deriving the same thought without making progress toward the answer.
- The diagnostic approach starts with logging. You need to record every Thought, Action, and Observation in the trajectory. Then look for patterns: repeated identical tool calls, oscillation between two actions, or thoughts that repeat nearly verbatim. In the code from this chapter, the trajectory list captures this, but in production you want structured logging with deduplication detection.
- To fix it, I implement multiple safeguards. A hard cap on iterations (the max_steps parameter) is the safety net, but you also want a “stale detection” mechanism that compares the current thought to the last N thoughts using simple string similarity. If similarity exceeds 80%, inject a meta-prompt like “You seem to be repeating yourself. Summarize what you know so far and either provide a final answer or try a completely different approach.”
- Another effective technique is tool call deduplication: if the agent tries to call the same tool with the same arguments twice in a row, intercept it and return the cached result along with a nudge like “You already searched for this. Here is what you found. Use this information to proceed.”
- At a system design level, I also add a “give up gracefully” escape hatch. If the agent hits 70% of max_steps without converging, force it to synthesize a partial answer from what it has gathered so far, rather than continuing to flail.
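The three safeguards described above can be sketched with the standard library. The 0.8 similarity threshold and the 70% cutoff come from the answer itself; the function and class names are hypothetical. Note that this cache deduplicates any repeated call within a run, a slightly broader rule than "twice in a row".

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


def is_stale(thought: str, recent_thoughts: List[str], threshold: float = 0.8) -> bool:
    """True if the thought is nearly identical to any of the recent ones."""
    return any(
        SequenceMatcher(None, thought, previous).ratio() > threshold
        for previous in recent_thoughts
    )


class ToolCallCache:
    """Return cached results for repeated (tool, argument) calls."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, str], str] = {}

    def call(self, name: str, arg: str, tool: Callable[[str], str]) -> Tuple[str, bool]:
        """Return (result, was_cached) so the caller can add a nudge message."""
        key = (name, arg)
        if key in self._results:
            return self._results[key], True
        result = tool(arg)
        self._results[key] = result
        return result, False


def should_force_answer(step: int, max_steps: int, fraction: float = 0.7) -> bool:
    """True once the agent has used the given fraction of its step budget."""
    return step >= fraction * max_steps
```

When `is_stale` fires or a cached result is returned, the agent loop can inject the corresponding nudge into the next prompt instead of silently continuing.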
Prompt chaining versus a single long prompt: walk me through how you decide which approach to use for a complex task.
- The way I think about this is in terms of task decomposability and error propagation. If a task can be cleanly split into independent subtasks where each step has a well-defined input and output, chaining wins. If the subtasks are deeply interdependent and the model needs to hold the full context simultaneously to make good decisions, a single prompt is better.
- Chaining has several concrete advantages. Each step can use a different model, so you can use gpt-4o-mini for classification and extraction steps but gpt-4o for the final synthesis. Each step can be independently tested, cached, and retried. You can add validation between steps to catch errors early. And each prompt is shorter, which tends to reduce hallucinations because the model has less context to get confused by.
- The downside of chaining is error compounding. If step 1 has 90% accuracy and step 2 has 90% accuracy, your end-to-end accuracy is 81%. With 5 steps at 90% each, you are down to 59%. So you need each step to be highly reliable, or you need error correction mechanisms between steps.
- A single long prompt avoids the error compounding problem and lets the model see all context at once, but it hits practical limits. In my experience, beyond about 4,000 tokens of instructions, models become more likely to ignore parts of the prompt, though the exact point varies by model. And you lose the ability to use different models or add intermediate validation.
- My decision framework: if the task has more than 3 clearly separable stages, chain. If the task requires tight integration between steps (like writing code where the function signature in step 1 affects the implementation in step 3), use a single prompt with clear section markers. For the middle ground, I use a hybrid: chain the major stages but keep tightly coupled substeps within a single prompt.
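The error-compounding arithmetic from this answer, as a small sketch. It assumes step errors are independent and that any failed step fails the whole chain, which is a simplification. The function names are hypothetical.

```python
import math
from typing import Sequence


def end_to_end_accuracy(step_accuracies: Sequence[float]) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return math.prod(step_accuracies)


def required_step_accuracy(target: float, n_steps: int) -> float:
    """Per-step accuracy needed for n equal steps to reach the target."""
    return target ** (1 / n_steps)


print(round(end_to_end_accuracy([0.9] * 2), 2))  # 0.81
print(round(end_to_end_accuracy([0.9] * 5), 2))  # 0.59
print(round(required_step_accuracy(0.9, 5), 3))  # 0.979
```

Read the other way around, a 5-step chain needs roughly 98% accuracy per step to stay at 90% end to end, which is why validation between steps matters.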