Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
response_format, and GPT-4.5 capabilities.Why This Module Matters
The OpenAI API is the most widely-used LLM interface. Every AI startup, enterprise AI feature, and AI-powered tool uses it or something similar. Master this, and you can build anything.What’s New in 2025
| Feature | Description | Use Case |
|---|---|---|
| Responses API | Simpler, more powerful completions | New projects |
| Predicted Outputs | Speed up edits with known structure | Code refactoring |
| GPT-4.5 | Most capable model | Complex reasoning |
| o1 Reasoning | Chain-of-thought built-in | Math, coding |
| Structured Outputs | Guaranteed JSON schema | API responses |
Your Development Environment
Chat Completions: The Foundation
Chat completions are the bread and butter of every LLM application. The mental model is simple: you send a conversation (a list of messages with roles), and the model continues the conversation. Think of it like passing a script to an actor — the system message is the stage direction, the user messages are the audience’s lines, and the assistant messages are the actor’s previous lines. The model reads the whole script and generates the next line.The Complete Request Object
Production-Ready Chat Function
Streaming: Real-Time Responses
Why Streaming Matters
Without streaming, users wait 5-30 seconds staring at nothing. With streaming, they see the first token within 200-500ms — even if the full response takes 10 seconds. This is the same principle behind progressive image loading on the web: perceived performance matters as much as actual performance. In user studies, a streaming response that takes 10 seconds total feels faster than a non-streaming response that takes 5 seconds, because the user sees progress immediately.Production Streaming with FastAPI
Function Calling: LLMs That Take Action
Function calling is the bridge between “chatbot” and “agent.” Without it, an LLM can only generate text. With it, an LLM can check the weather, query a database, send an email, or call any API you expose. The model does not actually execute the function — it generates a structured request (“call get_weather with city=Paris”), you execute it in your code, and you feed the result back. This keeps the LLM in the reasoning seat while your code handles the doing.The Pattern
- You define functions the model can “call” (name, description, parameters)
- Model decides which function to call based on user input
- You execute the function and return results (the model never runs code)
- Model uses results to form final response
Complete Function Calling System
Parallel Function Calls
GPT-4 can call multiple functions in one response:Function Calling Edge Cases
Edge case — the model calls a function you did not expect: The model might callsend_email when you only expected search_products. Always validate the function name before executing. Never blindly dispatch tool calls without checking that the function is safe for the current context.
Edge case — malformed arguments: The model occasionally generates invalid JSON in the arguments field, especially for complex nested schemas. Wrap json.loads(tool_call.function.arguments) in a try/except and return a helpful error message to the model so it can retry.
Edge case — infinite tool-call loops: The model might call a function, get a result, and decide to call the same function again with slightly different parameters. Set a maximum loop count (3-5 iterations) and force a final response with tool_choice="none" after the limit.
Edge case — tool_choice="required" vs. "auto": Use "auto" (default) when the model should decide whether to call a function. Use "required" when you know a function call is needed (e.g., the user said “book the flight”). Use {"type": "function", "function": {"name": "specific_func"}} when you need a specific function called — useful for structured extraction where you want the model to always populate a schema.
Structured Outputs: Guaranteed JSON
Structured outputs solve the single most frustrating problem in LLM engineering: parsing. Before this feature, you would ask the model for JSON and get back… sometimes JSON, sometimes JSON wrapped in markdown, sometimes a conversational response with JSON buried in it, and sometimes invalid JSON that crashes your parser. Structured outputs use constrained decoding to guarantee the output matches your schema. It is not “usually works” — it is mathematically guaranteed.Structured Output Methods Compared
| Method | Guarantee | Supported Models | Limitations |
|---|---|---|---|
response_format: {"type": "json_object"} | Soft — model is nudged to produce JSON | All chat models | Can still produce invalid JSON, no schema enforcement |
response_format: {"type": "json_schema", "strict": true} | Hard — constrained decoding | GPT-4o, GPT-4o-mini (2024+) | Schema must follow OpenAI’s subset of JSON Schema (no oneOf, patternProperties) |
Instructor library (response_model=) | Soft + auto-retry on validation failure | Any model (wraps API) | Adds latency for retries, depends on model compliance |
| Function calling with strict schemas | Hard — constrained decoding | GPT-4o, GPT-4o-mini | Schema limitations same as structured outputs |
json_schema with strict: true for new projects — it is the gold standard. Use Instructor when you need Pydantic validators that go beyond schema validation (e.g., “age must be between 0 and 150”). Use json_object only as a fallback for older models that do not support strict schemas.
The Problem Structured Outputs Solve
With Structured Outputs - Guaranteed
Complex Nested Extraction
Vision: Processing Images
Image Analysis
Multiple Images
Production Error Handling
Cost Optimization Strategies
Cost optimization is not about being cheap — it is about being smart. The difference between a well-optimized and naive LLM application can be 10-50x in monthly spend. The biggest lever is model selection: gpt-4o-mini handles 80% of tasks at 6% of the cost. The second biggest lever is prompt length: every token in your system prompt is charged on every single request.Model Selection Matrix
| Task | Recommended Model | Cost/1M tokens | Why |
|---|---|---|---|
| Simple Q&A | gpt-4o-mini | 0.60 | Fast, cheap, sufficient |
| Code generation | gpt-4o | 10.00 | Better accuracy |
| Complex reasoning | gpt-4o | 10.00 | Necessary for quality |
| Summarization | gpt-4o-mini | 0.60 | Simple task |
| Data extraction | gpt-4o-mini + structured | 0.60 | Structured output helps |
| Creative writing | gpt-4o | 10.00 | Better quality |
Cost Estimation Quick Reference
For back-of-envelope cost estimation, use these rules of thumb:| Metric | Approximation |
|---|---|
| 1 token | Roughly 4 characters or 0.75 words in English |
| Typical system prompt | 200-500 tokens |
| Typical user message | 50-200 tokens |
| Typical assistant response | 100-500 tokens |
| 1,000 conversations/day (gpt-4o-mini) | ~2.00/day |
| 1,000 conversations/day (gpt-4o) | ~30/day |
| RAG with 5 chunks context (gpt-4o) | ~$0.01-0.03 per query |
Smart Model Router
Mini-Project: AI Customer Support Bot
Key Takeaways
Streaming Is Essential
Functions Enable Actions
Structured Outputs Save Time
Cost Awareness Matters
Temperature, top_p, and Penalties: A Decision Guide
These parameters interact in subtle ways. Most developers either ignore them entirely or tweak them randomly. Here is a principled framework:| Parameter | Value | Effect | When to Use |
|---|---|---|---|
temperature=0 | Deterministic | Near-identical outputs each run | Data extraction, classification, structured output |
temperature=0.3-0.5 | Low creativity | Consistent but with minor variation | Code generation, summarization, factual Q&A |
temperature=0.7-1.0 | Balanced | Good variety with reasonable coherence | Conversational chatbots, general writing |
temperature=1.5-2.0 | High creativity | Unpredictable, sometimes incoherent | Brainstorming, creative writing experiments |
top_p=0.1 | Very narrow | Only the most likely tokens considered | When you need extreme precision |
top_p=0.9 | Broad | Most tokens available, rare ones excluded | General-purpose, slightly more controlled than default |
frequency_penalty=0.5 | Reduce repetition | Penalizes tokens proportional to their count | Long-form writing that tends to repeat phrases |
presence_penalty=0.5 | Encourage variety | Penalizes any token that has appeared | When you want the model to explore new topics |
seed — but even then, OpenAI only guarantees “mostly deterministic.” For true determinism, use the logprobs response to verify consistency.
Bonus: Responses API (2025)
The Responses API is OpenAI’s next-generation interface, designed to fix the rough edges of chat completions. The key difference: instead of managing a messages array yourself, you pass a singleinput and optional instructions. It also handles multi-turn conversations, tool calls, and file search natively without you managing the message flow. For new projects, prefer this over chat completions. For existing projects, there is no urgency to migrate — chat completions will continue to work.
Predicted Outputs (Speed Boost)
Predicted outputs exploit a clever optimization: when the model’s output is likely to be very similar to something you already have (like refactoring code), you provide the “prediction” and the model only needs to generate the diff. Under the hood, tokens that match the prediction are processed much faster. The result is 2-5x faster generation for edit-like tasks. When you know most of the output in advance (like code refactoring), use predicted outputs for 2-5x faster generation:What’s Next
Vector Databases
Interview Deep-Dive
You are building a customer-facing AI feature that needs to extract structured data from user messages and call internal APIs based on the results. Walk me through how you would design this using the OpenAI API, and what failure modes you would design around.
You are building a customer-facing AI feature that needs to extract structured data from user messages and call internal APIs based on the results. Walk me through how you would design this using the OpenAI API, and what failure modes you would design around.
- The architecture has three layers: structured extraction, function calling, and error handling. I would use structured outputs with a strict JSON schema for the extraction step, function calling for the API interaction, and a retry wrapper around the entire flow.
- For extraction, I would define Pydantic models that represent the business entities — say, an OrderIntent with fields like action (enum: track, return, cancel), order_id (optional string), and reason (optional string). I would use
response_formatwithjson_schemaandstrict: Trueto guarantee the output matches. This is critical because without strict mode, you get “usually valid” JSON, and “usually” is not good enough when a parse failure crashes your webhook handler at 3am. - For function calling, I would define tools for each internal API (lookup_order, initiate_return, etc.) with tight parameter schemas. The key design decision: never let the model construct free-form API calls. Every parameter should be constrained — enums for status values, regex patterns for IDs, explicit required fields. The model decides WHICH function to call and with what arguments; my code validates and executes.
- The failure modes I would design around: (1) The model hallucinates a function that does not exist — handle with a strict whitelist check before execution. (2) The model extracts a plausible-looking but invalid order_id — validate against the database before processing. (3) Rate limits during high-traffic periods — implement exponential backoff with jitter, and a circuit breaker that falls back to a human agent queue after 3 failed retries. (4) The model returns valid JSON but semantically wrong data (extracts the wrong order_id from a message mentioning multiple orders) — add a confirmation step for high-stakes actions like cancellations.
- The gap between test and production is almost always input distribution. Test data is clean and well-formed; production users write in fragments, mix languages, include typos, and paste content from other apps. I would start by pulling the 3% failures and categorizing them.
- Common culprits: (1) Messages that are too long and get truncated by max_tokens on the response side — the model starts generating the JSON but hits the token limit before closing all brackets. Fix: increase max_tokens for extraction tasks or truncate the input. (2) Messages with special characters or Unicode that confuse the tokenizer. (3) Ambiguous messages where the model cannot confidently fill a required field and produces a schema violation trying to leave it blank.
- I would add logging that captures the raw input, the model’s raw output, and the Pydantic validation error for every failure. Then I would batch the failure cases into categories, create regression tests for each category, and either adjust the prompt to handle edge cases or add pre-processing to normalize the input before it hits the model.
Explain the difference between temperature, top_p, frequency_penalty, and presence_penalty. When would you use each, and what mistakes do people make combining them?
Explain the difference between temperature, top_p, frequency_penalty, and presence_penalty. When would you use each, and what mistakes do people make combining them?
- Temperature and top_p both control randomness, but through different mechanisms. Temperature scales the logits before softmax: at 0 the model always picks the highest-probability token (deterministic), at 1 it samples from the natural distribution, and at 2 it flattens the distribution dramatically so even low-probability tokens have a decent chance. Top_p (nucleus sampling) is a different approach: it dynamically truncates the distribution to include only the smallest set of tokens whose cumulative probability exceeds p. At top_p=0.1, only the most likely tokens are considered; at top_p=1.0, all tokens are candidates.
- The critical mistake: changing both simultaneously. They interact in non-obvious ways. If you set temperature=0.3 and top_p=0.5, you are double-constraining the distribution — the temperature already concentrated probability on top tokens, and then top_p further truncates. The result is more deterministic than either setting alone, which is usually not what you want. OpenAI’s own documentation says to change one or the other, not both.
- Frequency penalty and presence penalty both reduce repetition, but differently. Frequency penalty scales with how many times a token has appeared — it penalizes “the” more each time “the” appears. Presence penalty is binary: it penalizes a token the same amount whether it has appeared once or ten times. Use frequency penalty (0.3-0.8) when the model gets stuck repeating phrases. Use presence penalty (0.3-0.6) when you want topic diversity — it encourages the model to explore new concepts rather than rehashing the same point.
- My production defaults: temperature=0 for extraction and classification (determinism matters), temperature=0.7 with top_p=1.0 for conversational responses, frequency_penalty=0.3 for any task where repetition is noticeable. I almost never touch presence_penalty because it can cause the model to go off-topic.
- Temperature=0 is “approximately deterministic” but not guaranteed. There are several sources of non-determinism: GPU floating-point operations are not associative (the order of additions changes the result at the bit level), different hardware produces slightly different rounding, and OpenAI may route your request to different GPU clusters. The
seedparameter helps — when you set it, OpenAI returns asystem_fingerprintin the response, and outputs are deterministic as long as the fingerprint matches. But the fingerprint can change when OpenAI updates their infrastructure. - For true reproducibility in production, I cache the response keyed on the hash of the full request (messages + model + seed + all parameters). If the same request comes in again, return the cached response. This also saves money and reduces latency. For evaluation, I run each test case 3 times and check that all 3 outputs match; if they diverge, I flag it as a non-determinism issue and increase the test tolerance.
Walk me through how you would design a cost optimization strategy for an application making 500K OpenAI API calls per month across different task types.
Walk me through how you would design a cost optimization strategy for an application making 500K OpenAI API calls per month across different task types.
- At 500K calls/month, cost optimization is not about saving pennies — it is likely a $5K-50K/month line item depending on model mix. The first step is instrumentation: log every API call with model, input tokens, output tokens, task type, and calculated cost. Without this data, you are optimizing blind.
- The biggest lever is model routing. I would categorize every API call by task: classification, extraction, summarization, generation, reasoning. Then I would benchmark GPT-4o-mini against GPT-4o for each category using our actual prompts and a labeled evaluation set. In my experience, GPT-4o-mini handles classification and extraction at 95%+ of GPT-4o accuracy at 6% of the cost. That alone, if 60% of our calls are simple tasks, cuts our bill by 50%.
- The second lever is prompt length. Output tokens cost 2-4x more than input tokens. A verbose system prompt is cheap (amortized across many requests), but a verbose response is expensive on every single call. I would audit our prompts: add
max_tokenslimits to every call (prevents runaway responses), add explicit length constraints in the system prompt (“respond in 2-3 sentences”), and remove any unnecessary context from the messages array. - The third lever is caching. If the same question gets asked repeatedly (common in customer support), cache the response keyed on a hash of the messages. Even a 10% cache hit rate on 500K calls saves 50K API calls per month. Use Redis with a TTL of 1-24 hours depending on how dynamic your data is.
- The fourth lever is batching. For non-real-time workloads (nightly summarization, weekly report generation), use the Batch API which offers a 50% discount in exchange for up to 24-hour turnaround.
- The routing logic is probably miscategorizing some complex queries as simple and sending them to GPT-4o-mini when they need GPT-4o. I would pull the complaints, match them to the logged API calls, and check which model served each one. If the pattern is “all complaints were served by mini,” the routing heuristic is too aggressive.
- The fix is a quality feedback loop. Add a thumbs-up/thumbs-down to the UI, log the feedback with the request metadata, and periodically review which task types have the lowest satisfaction scores on GPT-4o-mini. Move those task types back to GPT-4o. You can also implement a “try cheap first, escalate if bad” pattern: route to GPT-4o-mini, use a lightweight quality check on the response (length, confidence score, keyword presence), and if it fails, automatically retry with GPT-4o. This adds latency for the escalated cases but keeps costs low for the majority.