Anthropic’s Claude models offer unique capabilities and API patterns. This chapter covers the Claude API in depth, from basic usage to advanced features like extended thinking.Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Getting Started with Claude
Installation and Setup
Basic Message API
Claude uses a messages-based API structure:Multi-Turn Conversations
System Prompts
System prompts in Claude are powerful for shaping behavior.Effective System Prompt Patterns
Dynamic System Prompts
Streaming Responses
Stream Claude responses for better UX:Async Streaming
Tool Use with Claude
Claude has powerful tool/function calling capabilities:Vision Capabilities
Claude can analyze images:Extended Thinking
Claude’s extended thinking mode for complex reasoning:Token Counting and Cost Management
Error Handling and Retries
Claude API Best Practices
- Use system prompts to establish consistent behavior
- Stream responses for long outputs to improve UX
- Implement proper error handling with exponential backoff
- Track token usage for cost management
- Use the appropriate model tier for your task complexity
Practice Exercise
Build a Claude-powered assistant with these features:- Multi-turn conversation with memory
- Tool use for external data access
- Image analysis capabilities
- Cost tracking per session
- Graceful error handling
- Effective system prompt design
- Proper conversation state management
- Robust error recovery
- Usage monitoring and limits
Interview Deep-Dive
You are choosing between OpenAI and Anthropic APIs for a production application. What are the key technical differences that would influence your decision?
You are choosing between OpenAI and Anthropic APIs for a production application. What are the key technical differences that would influence your decision?
Strong Answer:
- The API structures are similar but have important differences in how they handle conversations. OpenAI uses a flat messages array where the system prompt is a message with role “system.” Anthropic separates the system prompt into its own top-level parameter, which is a cleaner abstraction because the system prompt is architecturally different from conversation messages. This separation matters when you are managing conversation state: with Anthropic, you never accidentally trim the system prompt when truncating message history.
- Tool use (function calling) works differently between the two. OpenAI returns tool calls as part of the assistant message and expects tool results as separate messages with role “tool.” Anthropic uses content blocks: the assistant response contains both text blocks and tool_use blocks, and tool results are sent as tool_result content blocks in the next user message. Anthropic’s block-based approach is more flexible for multi-tool calls but requires slightly different message construction logic.
- For streaming, both support SSE, but Anthropic’s streaming API provides more structured events (message_start, content_block_start, content_block_delta, etc.) versus OpenAI’s simpler chunk-based stream. Anthropic’s approach gives you better control over rendering multi-modal responses (text + tool calls) in real time.
- Extended thinking is a differentiator for Claude. When you need the model to work through complex reasoning before responding, the thinking parameter lets you allocate a token budget specifically for internal reasoning. The thinking content is returned separately from the answer, so you can log it for debugging without exposing it to users. OpenAI does not have a direct equivalent; o1 models have internal reasoning but do not expose it.
- Cost and rate limits differ significantly by tier. For high-volume applications, Anthropic offers prompt caching for system prompts and frequently reused context, which can reduce costs by 90% on cached tokens. This is a major advantage when your system prompt is long or when you are using few-shot examples that repeat across requests.
Claude's extended thinking feature allocates a token budget for internal reasoning. How do you decide when to use it and how to set the budget?
Claude's extended thinking feature allocates a token budget for internal reasoning. How do you decide when to use it and how to set the budget?
Strong Answer:
- Extended thinking is worth the cost for tasks where the reasoning process is complex and the quality of the final answer depends heavily on working through intermediate steps. Math problems, multi-constraint planning, complex code generation, and multi-hop reasoning questions all benefit significantly. For simple factual questions, summaries, or creative writing, extended thinking adds cost and latency without improving quality.
- The budget_tokens parameter controls how many tokens the model can use for internal reasoning. Setting it too low (under 2000) can cut off the model’s reasoning mid-thought, producing worse results than no thinking at all. Setting it too high (over 20000) wastes tokens on tasks that do not need that much deliberation. My approach is to start with a moderate budget (5000-10000) and measure answer quality on a benchmark set, then adjust.
- A practical pattern I use is adaptive thinking budgets. For a coding assistant, I classify the request difficulty using a lightweight model call: “Is this a simple syntax question, a moderate implementation task, or a complex architecture problem?” Simple questions get no extended thinking, moderate tasks get 5000 tokens, and complex problems get 15000. This keeps average cost low while giving hard problems the reasoning space they need.
- One important nuance: the thinking tokens count toward your billing but are not deducted from the max_tokens for the actual response. So if you set budget_tokens=10000 and max_tokens=16000, the model can use up to 10000 tokens for thinking and up to 16000 for the response. You are paying for up to 26000 output tokens total. Plan your cost estimates accordingly.
- The thinking content itself is valuable for debugging and transparency. I log it (with PII scrubbing) and use it to diagnose cases where the model gets the wrong answer. Often the thinking reveals exactly where the reasoning went wrong, which helps you improve the system prompt or add relevant context.
You are building a multi-turn conversation system with Claude. How do you manage context window limits as conversations grow long?
You are building a multi-turn conversation system with Claude. How do you manage context window limits as conversations grow long?
Strong Answer:
- The core challenge is that Claude’s context window has a hard limit, and every turn in the conversation consumes tokens from that budget. The system prompt, conversation history, and the current user message all compete for the same space. As conversations grow, you must choose what to keep and what to drop.
- The simplest strategy is sliding window: keep the system prompt, the first user message (for task context), and the last N message pairs. The Conversation class in this chapter stores all messages but sends all of them to the API, which will fail once the context is exceeded. In production, I implement a token counter that measures the total tokens before each API call and trims from the middle of the conversation (keeping the beginning and end) when approaching the limit.
- A more sophisticated approach is summarization. When the conversation exceeds a threshold (say, 70% of the context window), I take the oldest messages that are about to be trimmed and generate a summary using a cheap, fast model (claude-3-haiku or gpt-4o-mini). This summary replaces the old messages, compressing the history while preserving key information. The trade-off is that summaries lose detail, so if the user references something specific from an earlier message, the model might not have the exact wording.
- For applications where conversation state matters (like a coding assistant that has seen multiple file edits), I use a structured memory approach. Instead of keeping raw message history, I maintain a running state object that tracks key decisions, code snippets, and user preferences. This state object is injected into the system prompt and stays compact regardless of conversation length.
- Token counting with Claude is straightforward using the count_tokens endpoint, which gives you an exact count before making the actual API call. I call this before every message and implement the trimming logic if the count exceeds 80% of the model’s limit, leaving 20% headroom for the response.
- One pitfall teams hit: they forget that the system prompt contributes to the token count. A 2000-token system prompt means you effectively have 2000 fewer tokens for conversation history. Keep system prompts concise and consider moving dynamic content (like few-shot examples) into the message history where it can be trimmed.