Function calling enables LLMs to interact with external systems by generating structured function calls that your application executes. Think of it like a voice assistant that can press buttons on your behalf: you say “check the weather in Tokyo,” the LLM decides to callDocumentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
get_weather(location="Tokyo"), your code actually fetches the weather, and the LLM weaves the result into a natural response. The model never actually executes code — it just decides which function to call and with what arguments. Your application is always in control of execution.
Function Schema Design
The schema is the menu you hand to the model. A clear, well-documented schema is the single biggest factor in reliable function calling. If your schema descriptions are vague, the model will guess wrong about which function to use and what arguments to pass. Practical tip: write function descriptions as if you are explaining to a new team member when to use each tool.OpenAI Function Schema
Pydantic Schema Generation
Manually writing JSON schemas is tedious and error-prone. A better approach is to define your parameters as Pydantic models and auto-generate the OpenAI schema. This gives you validation on both sides: the model is constrained to the schema when generating arguments, and Pydantic validates the result before your function runs.Function Execution Engine
The registry pattern decouples “what functions exist” from “how they get called.” This matters because the LLM returns a function name and arguments as strings — you need a clean way to look up the actual Python function, validate the arguments, execute it, and handle errors. Think of the registry as a switchboard operator connecting calls to the right department.Parallel Function Execution
Argument Validation
Error Handling Patterns
Tool Choice Control
Streaming with Function Calls
Key Patterns
| Pattern | Use Case | Implementation |
|---|---|---|
| Pydantic Schemas | Type-safe function definitions | Convert models to JSON Schema |
| Parallel Execution | Multiple independent calls | asyncio.gather |
| Validation | Argument correctness | Pydantic validators |
| Retry Logic | Transient failures | Exponential backoff |
| Tool Choice Control | Directing model behavior | force/auto/none/required |
What is Next
LLM Orchestration
Learn to orchestrate multiple LLM providers with unified APIs
Interview Deep-Dive
Explain the function calling loop in LLM APIs. What happens between the model generating a tool call and the user getting their final answer?
Explain the function calling loop in LLM APIs. What happens between the model generating a tool call and the user getting their final answer?
Strong Answer:
- The function calling loop is a multi-turn conversation between your application and the LLM. Here is the exact flow. Step one: you send the user’s message plus a list of tool schemas (JSON Schema definitions of your available functions) to the model. Step two: instead of returning a text response, the model returns one or more tool call objects, each containing a function name and JSON arguments. Critically, the model does not execute anything — it just generates structured data describing what it wants to call. Step three: your application parses the tool calls, validates the arguments, executes the actual functions against your APIs or databases, and collects the results. Step four: you append the tool call results as tool-role messages back into the conversation and send it to the model again. Step five: the model either generates another tool call (if it needs more information) or produces a final text response to the user.
- The loop continues until the model decides it has enough information to answer, or until you hit a max-iterations safety limit. In production I always set a max of 5-10 iterations to prevent runaway loops where the model keeps calling tools without converging on an answer. I have seen cases where a model gets stuck in a cycle calling the same search function with slightly different queries because none of the results satisfy it.
- The key architectural insight is that the model never directly touches your systems. It is always your code executing the functions and deciding what happens with the results. This is what makes it safe — you control validation, permissions, rate limiting, and error handling at the execution layer, not the LLM layer.
strict: true in the function schema rarely generates invalid arguments because strict mode uses constrained decoding to guarantee schema conformance. But with non-strict mode or weaker models, I see validation failures on about 2-3% of calls, so the retry mechanism is essential. The Pydantic validation layer also acts as a security boundary — it prevents the model from injecting unexpected parameters that your function was not designed to handle.You have 15 tools registered for an agent. The model keeps choosing the wrong tool or calling tools unnecessarily. How do you debug and fix this?
You have 15 tools registered for an agent. The model keeps choosing the wrong tool or calling tools unnecessarily. How do you debug and fix this?
Strong Answer:
- Tool selection problems almost always trace back to one of three root causes: poor tool descriptions, overlapping tool purposes, or missing routing signals.
- First, I audit every tool’s name and description. The description is the single most important factor in tool selection — it is the model’s only guide for when to use each tool. Vague descriptions like “search for data” cause confusion. I rewrite descriptions to be specific about when to use the tool and when not to: “Search the product catalog by keyword. Use this when the user asks about specific products, pricing, or availability. Do NOT use this for general knowledge questions.” Including explicit negative guidance (“do not use when…”) reduces false selections significantly.
- Second, I look for overlapping tools. If I have both
search_productsandget_product_details, the model might callsearch_productswhen the user asks about a specific product ID, because the descriptions are not clear about the boundary. I either consolidate overlapping tools or add explicit disambiguation: “Use search_products for keyword searches across the catalog. Use get_product_details only when you have a specific product ID.” - Third, I reduce the tool set. With 15 tools, the model spends significant context on schema parsing and the probability of mis-selection increases. I segment tools by intent: a routing step first determines the user’s intent category, then only the 3-5 relevant tools for that category are passed to the model. This two-stage approach cut our tool mis-selection rate from 12% to under 2% at a company I worked at.
- Finally, I use
tool_choicestrategically. For lookup intents where I know a tool must be called, I usetool_choice: "required"or even force a specific tool. For conversational turns where tools are optional, I useauto.
asyncio.gather rather than sequentially, because they are independent by definition — the model generated them in parallel specifically because they do not depend on each other. The main gotchas are: first, error isolation — if one tool call fails, you need to return the error for that specific call while still returning results for the successful ones. Do not let one failure abort the entire batch. Second, rate limiting — five parallel API calls might hit your external service’s rate limit. I use a semaphore to cap concurrency at 3-5 simultaneous outbound calls. Third, response assembly — each tool result must be sent back with the correct tool_call_id matching the original call. If you mix these up, the model gets confused about which result corresponds to which request. I have debugged this exact issue where swapped IDs caused the model to synthesize nonsensical answers because it attributed a weather API response to a database query.Compare function calling in OpenAI's API versus building tool-use with Anthropic's Claude. What are the key differences in design and capabilities?
Compare function calling in OpenAI's API versus building tool-use with Anthropic's Claude. What are the key differences in design and capabilities?
Strong Answer:
- Both APIs follow the same conceptual pattern (model suggests tool calls, you execute, you return results), but the implementation details differ in important ways.
- OpenAI’s function calling uses a
toolsarray with JSON Schema definitions and returns tool calls as a structuredtool_callsarray on the assistant message. The standout feature isstrict: truemode, which uses constrained decoding to guarantee the generated arguments conform exactly to your JSON Schema. This eliminates argument validation errors at the cost of slightly higher latency. OpenAI also supportstool_choicewith fine-grained control: auto, required, none, or force-a-specific-tool. - Anthropic’s tool use follows a similar pattern but with different ergonomics. The tool definitions go in a top-level
toolsparameter, and tool calls come back astool_usecontent blocks within the response. One key difference is how system messages work: Anthropic separates the system prompt from message history, which affects how you structure tool instructions. Anthropic does not have an equivalent to OpenAI’s strict mode as of my last check, so argument validation on your side is more important. - The practical difference that matters most in production is how each handles multi-turn tool conversations and streaming. OpenAI streams tool calls as deltas that you need to accumulate (function name and arguments arrive in chunks), which requires careful buffer management. Anthropic streams tool use blocks more atomically.
- For choosing between them: if argument schema compliance is critical (financial calculations, database queries), OpenAI’s strict mode is a significant advantage. If you need the model to handle ambiguous, open-ended tool selection with good reasoning about when not to use tools, Claude tends to be more conservative and less trigger-happy with tool calls, which can be either a pro or a con depending on your use case.
BaseModel with typed fields and descriptions, then I have converter functions that generate the provider-specific schema format: pydantic_to_openai_function() and pydantic_to_anthropic_tool(). The execution layer is provider-agnostic — it receives a function name and a dictionary of arguments regardless of which provider generated them. The tricky part is handling provider-specific behaviors: OpenAI might generate null for optional fields while Anthropic omits them entirely, and the tool response format differs between providers. I normalize these differences in a thin adapter layer so the rest of my application code never knows or cares which LLM generated the tool call. This abstraction paid off when we switched from OpenAI to Anthropic for our primary agent — the tool definitions and execution logic stayed identical, and we only changed the adapter layer.