
# OpenAI API

> Master the OpenAI API: chat completions, function calling, streaming, and structured outputs

<Info>
  **December 2025 Update**: Now covers the new Responses API, Predicted Outputs, structured outputs with `response_format`, and GPT-4.5 capabilities.
</Info>

## Why This Module Matters

The OpenAI API is one of the most widely used LLM interfaces. Most AI startups, enterprise AI features, and AI-powered tools use it or an API modeled on it, so what you learn here transfers broadly.

<Note>
  **Career Impact**: Companies pay \$200-350K for engineers who can build reliable, production-grade AI features. This module teaches exactly that.
</Note>

## What's New in 2025

| Feature                | Description                         | Use Case          |
| ---------------------- | ----------------------------------- | ----------------- |
| **Responses API**      | Simpler, more powerful completions  | New projects      |
| **Predicted Outputs**  | Speed up edits with known structure | Code refactoring  |
| **GPT-4.5**            | Most capable model                  | Complex reasoning |
| **o1 Reasoning**       | Chain-of-thought built-in           | Math, coding      |
| **Structured Outputs** | Guaranteed JSON schema              | API responses     |

## Your Development Environment

```python theme={null}
# Install dependencies
# pip install openai python-dotenv pydantic

# .env file (NEVER commit this)
# OPENAI_API_KEY=sk-...

from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI()  # Automatically reads OPENAI_API_KEY

# Verify connection
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello!"}],
    max_tokens=10
)
print(response.choices[0].message.content)
```

<Warning>
  **Security**: Never hardcode API keys. Never commit `.env` files. Use environment variables or secret managers in production.
</Warning>

## Chat Completions: The Foundation

Chat completions are the bread and butter of every LLM application. The mental model is simple: you send a conversation (a list of messages with roles), and the model continues the conversation. Think of it like passing a script to an actor -- the system message is the stage direction, the user messages are the audience's lines, and the assistant messages are the actor's previous lines. The model reads the whole script and generates the next line.

### The Complete Request Object

```python theme={null}
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    # Required
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Python decorators"}
    ],
    
    # Optional but important -- these are your main control knobs
    temperature=0.7,        # 0=deterministic, 1=creative, 2=chaos. Use 0 for extraction, 0.7 for conversation
    max_tokens=1000,        # Hard cap on output length. Prevents runaway costs. Always set this.
    top_p=1.0,              # Nucleus sampling -- change EITHER temperature OR top_p, not both
    frequency_penalty=0,    # -2 to 2. Positive values penalize repeated tokens (reduces "the the the")
    presence_penalty=0,     # -2 to 2. Positive values encourage topic diversity
    stop=None,              # Stop sequences -- model halts when it generates any of these strings
    n=1,                    # Number of completions. >1 is useful for self-consistency but multiplies cost
    
    # Advanced
    seed=42,                # For reproducible outputs (best-effort, not guaranteed)
    user="user_123",        # Passed to OpenAI for abuse detection -- use your internal user ID
    logprobs=False,         # Return token probabilities -- useful for confidence scoring
)

# Access the response
print(response.choices[0].message.content)
print(response.usage)  # Token counts
```

### Production-Ready Chat Function

```python theme={null}
from openai import OpenAI
from dataclasses import dataclass
from typing import Optional, List
import json

@dataclass
class ChatMessage:
    role: str
    content: str

@dataclass  
class ChatResponse:
    content: str
    model: str
    input_tokens: int
    output_tokens: int
    finish_reason: str
    cost_estimate: float

class ChatClient:
    """Production-ready OpenAI chat client"""
    
    # Pricing per 1M tokens (USD). Keep this updated -- prices drop frequently.
    # Pitfall: output tokens cost 3-4x more than input tokens for these models.
    # A verbose system prompt is cheap; a verbose response is expensive. Always set max_tokens.
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    }
    
    def __init__(self, default_model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.default_model = default_model
    
    def _calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        pricing = self.PRICING.get(model, self.PRICING["gpt-4o"])
        return (input_tokens / 1_000_000 * pricing["input"] + 
                output_tokens / 1_000_000 * pricing["output"])
    
    def chat(
        self,
        messages: List[ChatMessage],
        model: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        json_response: bool = False
    ) -> ChatResponse:
        """Send chat completion with full response metadata"""
        
        model = model or self.default_model
        
        kwargs = {
            "model": model,
            "messages": [{"role": m.role, "content": m.content} for m in messages],
            "temperature": temperature,
        }
        
        if max_tokens:
            kwargs["max_tokens"] = max_tokens
        
        if json_response:
            # JSON mode requires the word "JSON" to appear somewhere in the messages
            kwargs["response_format"] = {"type": "json_object"}
        
        response = self.client.chat.completions.create(**kwargs)
        
        choice = response.choices[0]
        usage = response.usage
        
        return ChatResponse(
            content=choice.message.content,
            model=response.model,
            input_tokens=usage.prompt_tokens,
            output_tokens=usage.completion_tokens,
            finish_reason=choice.finish_reason,
            cost_estimate=self._calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)
        )


# Usage
chat = ChatClient()
response = chat.chat([
    ChatMessage("system", "You are a coding tutor."),
    ChatMessage("user", "Explain list comprehensions in Python")
])

print(response.content)
print(f"Cost: ${response.cost_estimate:.6f}")
print(f"Tokens: {response.input_tokens} in, {response.output_tokens} out")
```

## Streaming: Real-Time Responses

### Why Streaming Matters

Without streaming, users can wait 5-30 seconds staring at nothing. With streaming, they typically see the first token within 200-500ms -- even if the full response takes 10 seconds. This is the same principle behind progressive image loading on the web: perceived performance matters as much as actual performance. A streaming response that takes 10 seconds total can feel faster than a non-streaming response that takes 5 seconds, because the user sees progress immediately.

```python theme={null}
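from typing import Callable

def print_token(token: str) -> None:
    """Default callback: write each token to stdout without a newline."""
    print(token, end="", flush=True)

def stream_chat(prompt: str, on_token: Callable[[str], None] = print_token) -> str:
    """Stream response with callback for each token"""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    
    full_response = []
    for chunk in stream:
        # Some chunks carry no choices or no text (e.g., the final chunk)
        if chunk.choices and chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            full_response.append(token)
            on_token(token)
    
    print()  # Newline at end
    return "".join(full_response)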

# Basic usage
response = stream_chat("Explain machine learning in one paragraph")
```

### Production Streaming with FastAPI

```python theme={null}
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
import json

app = FastAPI()
client = OpenAI()

@app.post("/chat/stream")
async def chat_stream(request: dict):
    """Server-Sent Events streaming endpoint"""
    
    def generate():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=request["messages"],
            stream=True
        )
        
        for chunk in stream:
            if chunk.choices[0].delta.content:
                data = {"content": chunk.choices[0].delta.content}
                yield f"data: {json.dumps(data)}\n\n"
        
        yield "data: [DONE]\n\n"
    
    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )

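# Frontend JavaScript to consume. EventSource only issues GET requests,
# so this POST endpoint needs fetch() with a stream reader instead:
# const res = await fetch('/chat/stream', {
#   method: 'POST',
#   headers: {'Content-Type': 'application/json'},
#   body: JSON.stringify({messages}),
# });
# const reader = res.body.getReader();
# const decoder = new TextDecoder();
# let buffer = '';
# while (true) {
#   const {done, value} = await reader.read();
#   if (done) break;
#   buffer += decoder.decode(value, {stream: true});
#   const events = buffer.split('\n\n');
#   buffer = events.pop();  // keep any incomplete event for the next read
#   for (const event of events) {
#     const payload = event.replace(/^data: /, '');
#     if (payload !== '[DONE]') appendToChat(JSON.parse(payload).content);
#   }
# }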
```
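
The endpoint above emits one `data: {...}` line per token and ends with a `data: [DONE]` sentinel. As a rough sketch of how a Python consumer might parse that stream, the helper below takes any iterable of already-decoded text lines (for example, from an HTTP client's line iterator). The name `iter_sse_content` is hypothetical and not part of any library.

```python
import json
from typing import Iterable, Iterator

def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "content" field of each SSE data line until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)["content"]

# Simulated stream in the shape the endpoint above produces
sample = [
    'data: {"content": "Hel"}',
    "",
    'data: {"content": "lo"}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```

Keeping the parsing separate from the network call makes it easy to test with canned lines like `sample`.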

## Function Calling: LLMs That Take Action

Function calling is the bridge between "chatbot" and "agent." Without it, an LLM can only generate text. With it, an LLM can check the weather, query a database, send an email, or call any API you expose. The model does not actually execute the function -- it generates a structured request ("call get\_weather with city=Paris"), you execute it in your code, and you feed the result back. This keeps the LLM in the reasoning seat while your code handles the doing.

### The Pattern

1. You define functions the model can "call" (name, description, parameters)
2. Model decides which function to call based on user input
3. **You** execute the function and return results (the model never runs code)
4. Model uses results to form final response

### Complete Function Calling System

```python theme={null}
import json
from typing import Callable, Any
from openai import OpenAI

client = OpenAI()

class FunctionRegistry:
    """Register and execute functions that LLMs can call"""
    
    def __init__(self):
        self.functions: dict[str, Callable] = {}
        self.schemas: list[dict] = []
    
    def register(self, name: str, description: str, parameters: dict):
        """Decorator to register a function"""
        def decorator(func: Callable):
            self.functions[name] = func
            self.schemas.append({
                "type": "function",
                "function": {
                    "name": name,
                    "description": description,
                    "parameters": parameters
                }
            })
            return func
        return decorator
    
    def execute(self, name: str, arguments: dict) -> Any:
        """Execute a registered function"""
        if name not in self.functions:
            raise ValueError(f"Unknown function: {name}")
        return self.functions[name](**arguments)


# Create registry and register functions
registry = FunctionRegistry()

@registry.register(
    name="get_weather",
    description="Get current weather for a city",
    parameters={
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "default": "celsius"}
        },
        "required": ["city"]
    }
)
def get_weather(city: str, unit: str = "celsius") -> dict:
    # Mock implementation - replace with real API
    return {"city": city, "temp": 22, "unit": unit, "condition": "sunny"}

@registry.register(
    name="search_products",
    description="Search product catalog",
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms"},
            "max_price": {"type": "number", "description": "Maximum price"},
            "category": {"type": "string", "enum": ["electronics", "clothing", "books"]}
        },
        "required": ["query"]
    }
)
def search_products(query: str, max_price: float = None, category: str = None) -> list:
    # Mock implementation
    return [
        {"id": "1", "name": f"{query} Pro", "price": 99.99},
        {"id": "2", "name": f"{query} Basic", "price": 49.99}
    ]

@registry.register(
    name="send_email",
    description="Send an email to a recipient",
    parameters={
        "type": "object",
        "properties": {
            "to": {"type": "string", "description": "Recipient email"},
            "subject": {"type": "string"},
            "body": {"type": "string"}
        },
        "required": ["to", "subject", "body"]
    }
)
def send_email(to: str, subject: str, body: str) -> dict:
    # Mock - replace with real email service
    return {"status": "sent", "to": to}


def chat_with_functions(user_message: str) -> str:
    """Complete function calling flow"""
    messages = [{"role": "user", "content": user_message}]
    
    # First call - model may request function calls
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=registry.schemas,
        tool_choice="auto"
    )
    
    message = response.choices[0].message
    
    # If no function calls, return directly
    if not message.tool_calls:
        return message.content
    
    # Execute all requested functions
    messages.append(message)
    
    for tool_call in message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        
        print(f"Executing: {function_name}({arguments})")
        result = registry.execute(function_name, arguments)
        
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })
    
    # Final call with function results
    final_response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    
    return final_response.choices[0].message.content


# Usage examples
print(chat_with_functions("What's the weather in Paris?"))
print(chat_with_functions("Find me some laptop options under $100"))
print(chat_with_functions("Send an email to bob@example.com about the meeting tomorrow"))
```

### Parallel Function Calls

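Models that support parallel tool calls, including gpt-4o, can request several function calls in a single response:

```python theme={null}
# User: "What's the weather in Paris and London?"
# Model responds with TWO entries in message.tool_calls

from concurrent.futures import ThreadPoolExecutor

def run_tool_call(tool_call):
    """Execute one tool call using the registry defined earlier."""
    arguments = json.loads(tool_call.function.arguments)
    return tool_call.id, registry.execute(tool_call.function.name, arguments)

# The calls are independent, so run them concurrently; map() keeps request order
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_tool_call, message.tool_calls))
```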

### Function Calling Edge Cases

**Edge case -- the model calls a function you did not expect**: The model might call `send_email` when you only expected `search_products`. Always validate the function name before executing. Never blindly dispatch tool calls without checking that the function is safe for the current context.

**Edge case -- malformed arguments**: The model occasionally generates invalid JSON in the `arguments` field, especially for complex nested schemas. Wrap `json.loads(tool_call.function.arguments)` in a try/except and return a helpful error message to the model so it can retry.
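
The two edge cases above can be handled in one place. The following is a minimal sketch, assuming a plain dict of callables and a per-request allow-list. `safe_dispatch` and its parameters are hypothetical names, not part of the OpenAI SDK. It always returns a JSON string, so a failure can be sent back to the model as the `tool` message content and the model can retry.

```python
import json
from typing import Any, Callable

def safe_dispatch(
    name: str,
    raw_arguments: str,
    functions: dict[str, Callable[..., Any]],
    allowed: set[str],
) -> str:
    """Run one tool call and always return a JSON string for the tool message."""
    # Reject functions that are unknown or not permitted in this context
    if name not in allowed or name not in functions:
        return json.dumps({"error": f"Function '{name}' is not available here."})
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"Arguments were not valid JSON: {exc.msg}"})
    if not isinstance(arguments, dict):
        return json.dumps({"error": "Arguments must be a JSON object."})
    try:
        return json.dumps(functions[name](**arguments))
    except TypeError as exc:
        # Typically raised for missing or unexpected keyword arguments
        return json.dumps({"error": f"Bad arguments: {exc}"})

# Hypothetical registry with a single mock function
functions = {
    "get_weather": lambda city, unit="celsius": {"city": city, "temp": 22, "unit": unit}
}
allowed = {"get_weather"}

print(safe_dispatch("get_weather", '{"city": "Paris"}', functions, allowed))
print(safe_dispatch("send_email", "{}", functions, allowed))
print(safe_dispatch("get_weather", '{"city": ', functions, allowed))
```

Returning errors as data rather than raising keeps the conversation going: the model sees what went wrong and can correct its next call.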

**Edge case -- infinite tool-call loops**: The model might call a function, get a result, and decide to call the same function again with slightly different parameters. Set a maximum loop count (3-5 iterations) and force a final response with `tool_choice="none"` after the limit.
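
A bounded loop along these lines might look like the sketch below. To keep it runnable without network access, the model call is injected as a function: `call_model(messages, tool_choice)` and `run_tool(call)` are hypothetical callables you would back with `client.chat.completions.create` and your own dispatcher. Messages are plain dicts here for simplicity.

```python
from typing import Callable

def run_tool_loop(
    messages: list[dict],
    call_model: Callable[[list[dict], str], dict],
    run_tool: Callable[[dict], str],
    max_rounds: int = 3,
) -> str:
    """Alternate model calls and tool execution, with a hard cap on rounds."""
    for _ in range(max_rounds):
        reply = call_model(messages, "auto")
        if not reply.get("tool_calls"):
            return reply["content"]
        messages.append(reply)
        for call in reply["tool_calls"]:
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": run_tool(call)}
            )
    # Limit reached: forbid further tool calls so the model must answer in text
    return call_model(messages, "none")["content"]

# Stand-in model for demonstration: keeps requesting a tool until forbidden
def fake_model(messages: list[dict], tool_choice: str) -> dict:
    if tool_choice == "none":
        return {"role": "assistant", "content": "Here is my best answer."}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [{"id": f"call_{len(messages)}", "name": "get_weather", "arguments": "{}"}],
    }

history = [{"role": "user", "content": "What's the weather in Paris?"}]
answer = run_tool_loop(history, fake_model, run_tool=lambda call: '{"temp": 22}')
print(answer)
```

With the stand-in model, the loop runs three tool rounds and then makes one final call with `tool_choice` set to `"none"`.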

**Edge case -- `tool_choice="required"` vs. `"auto"`**: Use `"auto"` (default) when the model should decide whether to call a function. Use `"required"` when you know a function call is needed (e.g., the user said "book the flight"). Use `{"type": "function", "function": {"name": "specific_func"}}` when you need a specific function called -- useful for structured extraction where you want the model to always populate a schema.

## Structured Outputs: Guaranteed JSON

Structured outputs solve the single most frustrating problem in LLM engineering: parsing. Before this feature, you would ask the model for JSON and get back... sometimes JSON, sometimes JSON wrapped in markdown, sometimes a conversational response with JSON buried in it, and sometimes invalid JSON that crashes your parser. Structured outputs use constrained decoding, so every completed response conforms to your schema. It is not "usually works" -- the decoder can only emit tokens the schema allows. Two cases still need handling: a safety refusal (returned in the message's `refusal` field) and a response cut off by `max_tokens` (signaled by a `finish_reason` of `"length"`).

### Structured Output Methods Compared

| Method                                                     | Guarantee                               | Supported Models            | Limitations                                                                         |
| ---------------------------------------------------------- | --------------------------------------- | --------------------------- | ----------------------------------------------------------------------------------- |
| `response_format: {"type": "json_object"}`                 | Valid JSON syntax, no schema guarantee  | All chat models             | No schema enforcement; truncated output is invalid; prompt must mention JSON        |
| `response_format: {"type": "json_schema", "strict": true}` | Hard -- constrained decoding            | GPT-4o, GPT-4o-mini (2024+) | Schema must follow OpenAI's subset of JSON Schema (no `oneOf`, `patternProperties`) |
| Instructor library (`response_model=`)                     | Soft + auto-retry on validation failure | Any model (wraps API)       | Adds latency for retries, depends on model compliance                               |
| Function calling with strict schemas                       | Hard -- constrained decoding            | GPT-4o, GPT-4o-mini         | Schema limitations same as structured outputs                                       |

**When to use which**: Use `json_schema` with `strict: true` for new projects -- it is the gold standard. Use Instructor when you need Pydantic validators that go beyond schema validation (e.g., "age must be between 0 and 150"). Use `json_object` only as a fallback for older models that do not support strict schemas.
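
To illustrate the kind of check that goes beyond the schema, here is a minimal sketch using Pydantic v2 directly (it does not use Instructor itself). `PersonRecord` and `validate_or_explain` are hypothetical names. The error string it returns is what a retry loop could send back to the model.

```python
from typing import Optional
from pydantic import BaseModel, ValidationError, field_validator

class PersonRecord(BaseModel):
    name: str
    age: int

    @field_validator("age")
    @classmethod
    def age_in_range(cls, value: int) -> int:
        # A business rule applied after the JSON has already parsed
        if not 0 <= value <= 150:
            raise ValueError("age must be between 0 and 150")
        return value

def validate_or_explain(raw_json: str) -> tuple[Optional[PersonRecord], Optional[str]]:
    """Return (record, None) on success or (None, error text) on failure."""
    try:
        return PersonRecord.model_validate_json(raw_json), None
    except ValidationError as exc:
        return None, str(exc)

record, error = validate_or_explain('{"name": "Ada", "age": 200}')
print(record, error)
```

Both inputs in a retry loop would be well-formed JSON. Only the validator distinguishes an acceptable record from one that should trigger another attempt.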

### The Problem Structured Outputs Solve

```python theme={null}
# Without structured outputs - unreliable
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the name and email from: John at john@example.com"}]
)
# Might return: "Name: John, Email: john@example.com"
# Or: {"name": "John", "email": "john@example.com"}
# Or: "The name is John and email is john@example.com"
# Parsing nightmare!
```

### With Structured Outputs - Guaranteed

```python theme={null}
from pydantic import BaseModel, Field
from typing import Optional, List
from enum import Enum

class Priority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class Task(BaseModel):
    title: str = Field(description="Short task title")
    description: str = Field(description="Detailed description")
    priority: Priority
    due_date: Optional[str] = Field(description="Due date in YYYY-MM-DD format")
    tags: List[str] = Field(default_factory=list)

class TaskExtraction(BaseModel):
    tasks: List[Task]
    summary: str

# Use with OpenAI. The SDK's parse() helper converts the Pydantic model into a
# strict schema (all fields required, additionalProperties: false). Passing
# TaskExtraction.model_json_schema() directly with "strict": True is rejected
# by the API, because Pydantic's raw schema omits those constraints.
# Newer SDK versions also expose this helper as client.chat.completions.parse.
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract tasks from the user's message."},
        {"role": "user", "content": """
            Need to finish the quarterly report by Friday, it's high priority.
            Also should review the new hire's code when I get a chance.
            Don't forget to book the team dinner for next month.
        """}
    ],
    # Strict mode uses constrained decoding, so the output must match the
    # schema. response_format={"type": "json_object"} only ensures valid JSON
    # and does not enforce any schema.
    response_format=TaskExtraction,
)

message = response.choices[0].message
if message.refusal:
    raise RuntimeError(f"Model refused: {message.refusal}")

# Already validated against TaskExtraction by the SDK
tasks = message.parsed

for task in tasks.tasks:
    print(f"[{task.priority.value.upper()}] {task.title}")
    if task.due_date:
        print(f"  Due: {task.due_date}")
```

### Complex Nested Extraction

```python theme={null}
from pydantic import BaseModel, Field
from typing import Optional, List

class Address(BaseModel):
    street: str
    city: str
    state: Optional[str]
    country: str
    postal_code: Optional[str]

class Company(BaseModel):
    name: str
    industry: str
    employee_count: Optional[str]

class Person(BaseModel):
    full_name: str
    email: Optional[str]
    phone: Optional[str]
    job_title: Optional[str]
    company: Optional[Company]
    address: Optional[Address]
    skills: List[str]

class ExtractionResult(BaseModel):
    people: List[Person]
    confidence: float = Field(ge=0, le=1, description="Confidence score 0-1")

# Extract from unstructured text like resumes, emails, business cards
text = """
Hi, I'm Sarah Chen, a Senior Software Engineer at TechCorp (they have about 500 employees
in the fintech space). You can reach me at sarah.chen@techcorp.io or 555-0123.
I specialize in Python, distributed systems, and machine learning.
Our office is at 123 Innovation Drive, San Francisco, CA 94102.
"""

# parse() converts the Pydantic model into a strict schema for the API
# (all fields required, additionalProperties: false on every object).
# The raw output of model_json_schema() lacks both and would be rejected.
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract structured information from text."},
        {"role": "user", "content": text}
    ],
    response_format=ExtractionResult,
)

# Already validated against ExtractionResult; None if the model refused
result = response.choices[0].message.parsed
print(result.people[0].full_name)  # "Sarah Chen"
print(result.people[0].company.name)  # "TechCorp"
```

## Vision: Processing Images

### Image Analysis

```python theme={null}
import base64
from pathlib import Path

def encode_image(image_path: str) -> str:
    """Encode image to base64"""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def analyze_image(image_path: str, prompt: str = "Describe this image in detail") -> str:
    """Analyze an image with a vision-capable model (gpt-4o here)"""
    base64_image = encode_image(image_path)
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"  # or "low" for faster/cheaper
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    
    return response.choices[0].message.content

# Usage
description = analyze_image("receipt.jpg", "Extract all items and prices from this receipt")
```

### Multiple Images

```python theme={null}
def compare_images(image_paths: list[str], prompt: str) -> str:
    """Compare multiple images"""
    content = [{"type": "text", "text": prompt}]
    
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}
        })
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}]
    )
    
    return response.choices[0].message.content

# Compare before/after, find differences, etc.
result = compare_images(
    ["before.jpg", "after.jpg"],
    "What changed between these two images?"
)
```

## Production Error Handling

```python theme={null}
from openai import OpenAI, APIError, RateLimitError, APIConnectionError, AuthenticationError
import time
from functools import wraps
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Decorator for robust API calls with retry logic.
    
    Why exponential backoff? Rate limits and transient errors are normal
    with LLM APIs. Retrying immediately adds load and tends to keep you
    rate-limited. Exponential backoff (1s, 2s, 4s) gives the server time
    to recover. Note that the OpenAI client also retries some failures
    itself (see its max_retries option), so these retries stack on top.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                    
                except AuthenticationError:
                    logger.error("Invalid API key")
                    raise  # Don't retry auth errors
                    
                except RateLimitError as e:
                    last_exception = e
                    wait = base_delay * (2 ** attempt)
                    logger.warning(f"Rate limited. Retry {attempt + 1}/{max_retries} in {wait}s")
                    time.sleep(wait)
                    
                except APIConnectionError as e:
                    last_exception = e
                    wait = base_delay * (2 ** attempt)
                    logger.warning(f"Connection error. Retry {attempt + 1}/{max_retries} in {wait}s")
                    time.sleep(wait)
                    
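                except APIError as e:
                    last_exception = e
                    # Not every APIError carries status_code (APIStatusError does)
                    status_code = getattr(e, "status_code", None)
                    if status_code is not None and status_code >= 500:  # Server errors
                        wait = base_delay * (2 ** attempt)
                        logger.warning(f"Server error {status_code}. Retry {attempt + 1}/{max_retries}")
                        time.sleep(wait)
                    else:
                        raise  # Don't retry client errors (4xx) or unclassified errors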
            
            raise last_exception
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def reliable_chat(messages: list, model: str = "gpt-4o-mini") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        timeout=30  # 30 second timeout
    )
    return response.choices[0].message.content
```
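
One common refinement, which the decorator above does not include, is to randomize each wait so that many clients hitting a rate limit at the same moment do not all retry in lockstep. The sketch below shows the "full jitter" variant. `backoff_delay` and `max_delay` are hypothetical names introduced for illustration.

```python
import random
from typing import Optional

def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a randomized wait for a zero-based retry attempt ("full jitter")."""
    rng = rng or random.Random()
    # The ceiling doubles per attempt (1s, 2s, 4s, ...) up to max_delay
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return rng.uniform(0, ceiling)

# The actual wait is drawn at random between 0 and each ceiling
for attempt in range(3):
    print(f"attempt {attempt}: wait {backoff_delay(attempt):.2f}s")
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests.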

## Cost Optimization Strategies

Cost optimization is not about being cheap -- it is about being smart. The difference between a well-optimized and a naive LLM application can be 10-50x in monthly spend. The biggest lever is model selection: gpt-4o-mini handles most routine tasks at 6% of the cost of gpt-4o (\$0.15 vs. \$2.50 per 1M input tokens). The second biggest lever is prompt length: every token in your system prompt is charged on every single request.

### Model Selection Matrix

| Task              | Recommended Model        | Cost per 1M tokens (input/output) | Why                     |
| ----------------- | ------------------------ | --------------------------------- | ----------------------- |
| Simple Q\&A       | gpt-4o-mini              | $0.15/$0.60                       | Fast, cheap, sufficient |
| Code generation   | gpt-4o                   | $2.50/$10.00                      | Better accuracy         |
| Complex reasoning | gpt-4o                   | $2.50/$10.00                      | Necessary for quality   |
| Summarization     | gpt-4o-mini              | $0.15/$0.60                       | Simple task             |
| Data extraction   | gpt-4o-mini + structured | $0.15/$0.60                       | Structured output helps |
| Creative writing  | gpt-4o                   | $2.50/$10.00                      | Better quality          |

### Cost Estimation Quick Reference

For back-of-envelope cost estimation, use these rules of thumb:

| Metric                                | Approximation                                 |
| ------------------------------------- | --------------------------------------------- |
| 1 token                               | Roughly 4 characters or 0.75 words in English |
| Typical system prompt                 | 200-500 tokens                                |
| Typical user message                  | 50-200 tokens                                 |
| Typical assistant response            | 100-500 tokens                                |
| 1,000 conversations/day (gpt-4o-mini) | \~$0.50-$2.00/day                             |
| 1,000 conversations/day (gpt-4o)      | \~$8-$30/day                                  |
| RAG with 5 chunks context (gpt-4o)    | \~\$0.01-0.03 per query                       |
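
The arithmetic behind these estimates is simple enough to script. The sketch below reuses the per-1M-token prices quoted in this module. `estimate_daily_cost` is a hypothetical helper, and the token counts are illustrative assumptions for a single-turn request (a 500-token system prompt, a 200-token user message, and a 500-token response). Multi-turn conversations resend history and cost more.

```python
# USD per 1M tokens as (input, output), copied from the tables in this module
PRICING = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def estimate_daily_cost(
    model: str, requests_per_day: int, input_tokens: int, output_tokens: int
) -> float:
    """Estimate daily spend in USD for uniform single-turn requests."""
    input_rate, output_rate = PRICING[model]
    per_request = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return per_request * requests_per_day

for model in PRICING:
    cost = estimate_daily_cost(model, requests_per_day=1000, input_tokens=700, output_tokens=500)
    print(f"{model}: ${cost:.3f}/day")
```

In this scenario the output tokens account for most of the cost on both models, which is why capping response length matters more than trimming the prompt.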

**Edge case -- hidden cost multipliers**: Function calling adds tokens for the function schemas on every request (often 200-500 tokens per function). If you define 10 functions, that is 2,000-5,000 extra input tokens per request. Only include functions relevant to the current context, not your entire function catalog.
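
One way to send only the relevant schemas is to filter the catalog before each request. This is a minimal sketch. The `TOOLS_BY_CONTEXT` mapping and `select_tools` helper are hypothetical, and how you determine the current context (routing rules, a classifier, the page the user is on) is up to your application.

```python
# Hypothetical mapping from a conversation context to the tools it needs
TOOLS_BY_CONTEXT = {
    "shopping": {"search_products"},
    "weather": {"get_weather"},
    "support": {"search_products", "send_email"},
}

def select_tools(schemas: list[dict], context: str) -> list[dict]:
    """Return only the tool schemas relevant to the given context."""
    allowed = TOOLS_BY_CONTEXT.get(context, set())
    return [s for s in schemas if s["function"]["name"] in allowed]

# Stand-in catalog using the same schema shape as the registry in this module
all_schemas = [
    {"type": "function", "function": {"name": name, "description": "", "parameters": {}}}
    for name in ("get_weather", "search_products", "send_email")
]
print([s["function"]["name"] for s in select_tools(all_schemas, "weather")])
```

If the selection comes back empty, it is probably safest to omit the `tools` argument from the request rather than pass an empty list.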

**Edge case -- conversation history growth**: In multi-turn chat, you resend the entire conversation history on every request. A 20-turn conversation might accumulate 10,000+ tokens of history, costing 5-10x more than the first message. Implement conversation summarization or sliding window truncation for long conversations.
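
A sliding window can be sketched in a few lines. The version below keeps every system message plus as many of the most recent turns as fit a token budget, using the rough 4-characters-per-token rule from the table above instead of a real tokenizer. `estimate_tokens` and `truncate_history` are hypothetical helpers. The sketch assumes string `content` and does not try to keep tool calls paired with their results.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

def truncate_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))

# Six 40-character turns (about 10 tokens each) after a 10-token system prompt
history = [{"role": "system", "content": "s" * 40}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": str(i) * 40}
    for i in range(6)
]
trimmed = truncate_history(history, max_tokens=35)
print([m["content"][0] for m in trimmed])
```

For production use, a real tokenizer gives tighter budgets, and summarizing the dropped turns preserves context that plain truncation loses.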

### Smart Model Router

```python theme={null}
from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "simple"      # FAQ, summarization, translation
    MODERATE = "moderate"  # Code explanation, analysis
    COMPLEX = "complex"    # Code generation, reasoning, creative

def select_model(complexity: TaskComplexity) -> str:
    """Select appropriate model based on task complexity"""
    return {
        TaskComplexity.SIMPLE: "gpt-4o-mini",
        TaskComplexity.MODERATE: "gpt-4o-mini",  # Try cheap first
        TaskComplexity.COMPLEX: "gpt-4o",
    }[complexity]

def smart_chat(prompt: str, complexity: TaskComplexity) -> tuple[str, float]:
    """Chat with cost-aware model selection"""
    model = select_model(complexity)
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Calculate cost
    usage = response.usage
    pricing = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}
    input_rate, output_rate = pricing[model]
    cost = (usage.prompt_tokens / 1_000_000 * input_rate + 
            usage.completion_tokens / 1_000_000 * output_rate)
    
    return response.choices[0].message.content, cost
```

## Mini-Project: AI Customer Support Bot

```python theme={null}
from openai import OpenAI
from pydantic import BaseModel
from typing import Optional, List
from enum import Enum
from datetime import datetime

client = OpenAI()

# Knowledge base (in production, this would be a vector database)
KNOWLEDGE_BASE = {
    "refund_policy": "Refunds are available within 30 days of purchase for unused items.",
    "shipping": "Standard shipping takes 5-7 business days. Express is 2-3 days.",
    "contact": "Email support@example.com or call 1-800-555-0123",
    "hours": "Customer support is available Mon-Fri 9am-6pm EST"
}

class TicketPriority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    urgent = "urgent"

class SupportResponse(BaseModel):
    answer: str
    confidence: float
    needs_human: bool
    priority: TicketPriority
    suggested_actions: List[str]

# Function definitions for the bot
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_knowledge",
            "description": "Look up information in the knowledge base",
            "parameters": {
                "type": "object",
                "properties": {
                    "topic": {"type": "string", "enum": list(KNOWLEDGE_BASE.keys())}
                },
                "required": ["topic"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "create_ticket",
            "description": "Create a support ticket for human follow-up",
            "parameters": {
                "type": "object",
                "properties": {
                    "summary": {"type": "string"},
                    "priority": {"type": "string", "enum": ["low", "medium", "high", "urgent"]},
                    "customer_email": {"type": "string"}
                },
                "required": ["summary", "priority"]
            }
        }
    }
]

def lookup_knowledge(topic: str) -> str:
    return KNOWLEDGE_BASE.get(topic, "Information not found")

def create_ticket(summary: str, priority: str, customer_email: Optional[str] = None) -> dict:
    ticket_id = f"TKT-{datetime.now().strftime('%Y%m%d%H%M%S')}"
    return {"ticket_id": ticket_id, "status": "created", "priority": priority}

def handle_support_request(customer_message: str) -> SupportResponse:
    """Handle a customer support request"""
    
    messages = [
        {
            "role": "system",
            "content": """You are a helpful customer support agent. 
            Use the lookup_knowledge function to find answers.
            Create tickets for complex issues that need human attention.
            Always be polite and helpful."""
        },
        {"role": "user", "content": customer_message}
    ]
    
    # First call - may request functions
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )
    
    message = response.choices[0].message
    
    # Process function calls if any
    if message.tool_calls:
        messages.append(message)
        
        for tool_call in message.tool_calls:
            func_name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)
            
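            if func_name == "lookup_knowledge":
                result = lookup_knowledge(**args)
            elif func_name == "create_ticket":
                result = create_ticket(**args)
            else:
                # Unknown tool name: report it instead of reusing a stale result
                result = {"error": f"Unknown function: {func_name}"}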
            
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })
        
        # Get final response
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        message = response.choices[0].message
    
    # Get structured analysis
    analysis_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Analyze this support interaction"},
            {"role": "user", "content": f"Customer: {customer_message}\nAgent: {message.content}"}
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "support_response",
                "schema": SupportResponse.model_json_schema()
            }
        }
    )
    
    return SupportResponse.model_validate_json(
        analysis_response.choices[0].message.content
    )

# Test the bot
result = handle_support_request("I want a refund for my order from last week")
print(f"Answer: {result.answer}")
print(f"Needs human: {result.needs_human}")
print(f"Priority: {result.priority}")
```
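
The `if`/`elif` dispatch in `handle_support_request` gets awkward as tools are added, and it should never execute a function name the model invented. One possible alternative is a registry that doubles as an allowlist. The names `TOOL_REGISTRY` and `dispatch_tool_call` are hypothetical, and the two tool functions here are simplified stand-ins for the ones in the bot above.

```python
import json
from typing import Any, Callable, Dict

# Stand-ins for the real tool implementations in the bot above
def lookup_knowledge(topic: str) -> str:
    return {"shipping": "Standard shipping takes 5-7 business days."}.get(
        topic, "Information not found"
    )

def create_ticket(summary: str, priority: str, customer_email: str = "") -> dict:
    return {"status": "created", "priority": priority}

# Only functions listed here can ever be executed
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "lookup_knowledge": lookup_knowledge,
    "create_ticket": create_ticket,
}

def dispatch_tool_call(name: str, raw_arguments: str) -> str:
    """Run an allowlisted tool and return a JSON string for the tool message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"Unknown function: {name}"})
    try:
        args = json.loads(raw_arguments)
        return json.dumps(func(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or missing/unexpected parameters from the model
        return json.dumps({"error": f"Invalid arguments: {exc}"})
```

Inside the tool-call loop, the tool message's `content` would then be `dispatch_tool_call(tool_call.function.name, tool_call.function.arguments)`.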

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Streaming Is Essential" icon="signal-stream">
    Always stream for user-facing apps. Nobody wants to wait 10 seconds for a response to appear.
  </Card>

  <Card title="Functions Enable Actions" icon="bolt">
    Function calling turns LLMs from chatbots into agents that can search, book, send, and execute.
  </Card>

  <Card title="Structured Outputs Save Time" icon="code">
    Use Pydantic models + json\_schema with strict mode for responses that are guaranteed to match your schema. No more regex parsing.
  </Card>

  <Card title="Cost Awareness Matters" icon="dollar-sign">
    gpt-4o-mini is roughly 17x cheaper than gpt-4o (\$0.15 vs \$2.50 per 1M input tokens). Use it for simple tasks, save gpt-4o for complex reasoning.
  </Card>
</CardGroup>

### Temperature, top\_p, and Penalties: A Decision Guide

These parameters interact in subtle ways. Most developers either ignore them entirely or tweak them randomly. Here is a principled framework:

| Setting                 | Behavior          | Effect                                       | When to Use                                            |
| ----------------------- | ----------------- | -------------------------------------------- | ------------------------------------------------------ |
| `temperature=0`         | Most predictable  | Near-identical outputs each run              | Data extraction, classification, structured output     |
| `temperature=0.3-0.5`   | Low creativity    | Consistent but with minor variation          | Code generation, summarization, factual Q\&A           |
| `temperature=0.7-1.0`   | Balanced          | Good variety with reasonable coherence       | Conversational chatbots, general writing               |
| `temperature=1.5-2.0`   | High creativity   | Unpredictable, sometimes incoherent          | Brainstorming, creative writing experiments            |
| `top_p=0.1`             | Very narrow       | Only the most likely tokens considered       | When you need extreme precision                        |
| `top_p=0.9`             | Broad             | Most tokens available, rare ones excluded    | General-purpose, slightly more controlled than default |
| `frequency_penalty=0.5` | Reduce repetition | Penalizes tokens proportional to their count | Long-form writing that tends to repeat phrases         |
| `presence_penalty=0.5`  | Encourage variety | Penalizes any token that has appeared        | When you want the model to explore new topics          |
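
The two penalty rows can be made concrete with a toy example. The sketch below follows the commonly documented form of the adjustment, in which each penalty is subtracted from a token's logit based on how often that token has already appeared. The function name, vocabulary, and logits are illustrative assumptions, not OpenAI's implementation.

```python
from collections import Counter
from typing import Dict, List

def apply_penalties(
    logits: Dict[str, float],
    generated: List[str],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
) -> Dict[str, float]:
    """Lower the logit of every token that already appears in `generated`."""
    counts = Counter(generated)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        seen = 1.0 if count > 0 else 0.0
        # Frequency term grows with each repetition; presence term is a flat, one-time hit
        adjusted[token] = logit - count * frequency_penalty - seen * presence_penalty
    return adjusted

logits = {"the": 2.0, "cat": 1.0, "dog": 1.0}
history = ["the", "the", "the", "cat"]

print(apply_penalties(logits, history, frequency_penalty=0.5))
print(apply_penalties(logits, history, presence_penalty=0.5))
```

With the frequency penalty, "the" (seen three times) loses three times as much as "cat" (seen once). With the presence penalty, both lose the same amount, and "dog" is untouched in either case.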

**Key rule**: Change *either* temperature *or* top\_p, never both at once. They control the same underlying mechanism (token sampling distribution) from different angles. Adjusting both creates unpredictable interactions.
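
To see why the two knobs overlap, it helps to look at what each does to a toy distribution. The sketch below is a simplified illustration of temperature scaling and nucleus (`top_p`) filtering, not OpenAI's actual sampler. The function names and logits are made up for the example.

```python
import math
from typing import Dict

def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Divide logits by the temperature, then normalize to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 here; 0 is treated as greedy decoding")
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalize the survivors

logits = {"blue": 3.0, "grey": 2.0, "green": 1.0, "loud": -1.0}

for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})

# Low temperature plus a tight top_p squeezes the distribution twice
print(top_p_filter(softmax_with_temperature(logits, 0.3), top_p=0.5))
```

At temperature 0.3 nearly all of the probability already sits on "blue", so adding `top_p=0.5` leaves it as the only candidate. That is the double-constraining effect the key rule warns about.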

**Edge case -- temperature=0 is not truly deterministic**: In practice, you will see occasional variation even at temperature=0 due to floating-point non-determinism in GPU computation. If you need reproducibility, also set `seed`, but even then OpenAI describes the behavior as best-effort and only "mostly deterministic." Comparing `logprobs` across runs can show you where two outputs diverged, but it does not make sampling deterministic; when you need an exact repeat, cache the response.
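
When an exact repeat is required, one option is to stop resampling altogether and store the first response under a hash of the full request. A minimal in-memory sketch follows. `request_key`, `cached_completion`, and the `call_model` callable are hypothetical names, and a production system would likely use a shared store with a TTL instead of a module-level dict.

```python
import hashlib
import json
from typing import Any, Callable, Dict

_cache: Dict[str, str] = {}

def request_key(params: Dict[str, Any]) -> str:
    """Stable hash of the request: same model, messages, and settings give the same key."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def cached_completion(
    params: Dict[str, Any],
    call_model: Callable[[Dict[str, Any]], str],
) -> str:
    """Return the cached text for identical requests; otherwise call the model once."""
    key = request_key(params)
    if key not in _cache:
        # call_model could wrap client.chat.completions.create(**params)
        _cache[key] = call_model(params)
    return _cache[key]
```

Sorting the keys before hashing means two requests with the same content but different dict ordering share one cache entry.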

## Bonus: Responses API (2025)

The Responses API is OpenAI's next-generation interface, designed to fix the rough edges of chat completions. The key difference: instead of managing a messages array yourself, you pass a single `input` and optional `instructions`. It also handles multi-turn conversations, tool calls, and file search natively without you managing the message flow. For new projects, prefer this over chat completions. For existing projects, there is no urgency to migrate -- chat completions will continue to work.

```python theme={null}
from openai import OpenAI

client = OpenAI()

# Basic Responses API usage
response = client.responses.create(
    model="gpt-4o",
    input="Explain quantum computing simply",
    instructions="You are a helpful physics teacher",
)

print(response.output_text)

# With structured output
from pydantic import BaseModel

class Explanation(BaseModel):
    concept: str
    simple_explanation: str
    analogy: str
    difficulty_level: str

response = client.responses.create(
    model="gpt-4o",
    input="Explain neural networks",
    text={
        "format": {
            "type": "json_schema",
            "name": "explanation",
            "schema": Explanation.model_json_schema()
        }
    }
)

result = Explanation.model_validate_json(response.output_text)
print(f"Analogy: {result.analogy}")
```
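
For multi-turn conversations, the Responses API can link a new request to an earlier one through `previous_response_id`, so earlier turns do not have to be resent. The helper below is a small sketch of that flow. `ask_with_context` is a hypothetical name, and `client` is assumed to be an `OpenAI()` instance like the one above.

```python
from typing import Any, Optional, Tuple

def ask_with_context(
    client: Any,
    user_input: str,
    previous_response_id: Optional[str] = None,
    model: str = "gpt-4o",
) -> Tuple[str, str]:
    """Send one turn and return (reply_text, response_id) for chaining the next turn."""
    kwargs = {"model": model, "input": user_input}
    if previous_response_id is not None:
        # Links this request to the earlier response so prior turns carry over
        kwargs["previous_response_id"] = previous_response_id
    response = client.responses.create(**kwargs)
    return response.output_text, response.id

# Usage with a real client (requires OPENAI_API_KEY):
# text, rid = ask_with_context(client, "Explain quantum computing simply")
# follow_up, rid = ask_with_context(client, "Now give me an analogy", previous_response_id=rid)
```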

### Predicted Outputs (Speed Boost)

Predicted outputs exploit a clever optimization: when the model's output is likely to be very similar to something you already have (like refactored code), you provide that text as a "prediction." The model still returns the complete output, but tokens that match the prediction are processed much faster than tokens generated from scratch. The result is 2-5x faster generation for edit-like tasks.

```python theme={null}
original_code = '''
def calculate_total(items):
    total = 0
    for item in items:
        total += item.price
    return total
'''

# We predict the output will be similar to input
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Refactor this function to use sum()"},
        {"role": "user", "content": original_code}
    ],
    prediction={
        "type": "content",
        "content": original_code  # Model uses this as a starting point
    }
)

# Faster when most of the output matches the prediction
print(response.choices[0].message.content)
```

<Tip>
  **When to use Predicted Outputs**: Code editing, document revisions, template filling—any time the output is structurally similar to something you already have.
</Tip>
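
Whether a prediction actually helped can be checked after the fact. The usage object on a chat completion reports accepted and rejected prediction tokens under `completion_tokens_details`, and the sketch below turns them into an acceptance ratio. `prediction_acceptance` is a hypothetical helper, and it falls back to zero when the fields are absent.

```python
from typing import Any

def prediction_acceptance(usage: Any) -> float:
    """Fraction of predicted tokens the model accepted (0.0 if none were reported)."""
    details = getattr(usage, "completion_tokens_details", None)
    accepted = getattr(details, "accepted_prediction_tokens", 0) or 0
    rejected = getattr(details, "rejected_prediction_tokens", 0) or 0
    total = accepted + rejected
    return accepted / total if total else 0.0

# Usage with the response from the example above:
# ratio = prediction_acceptance(response.usage)
# print(f"{ratio:.0%} of predicted tokens were accepted")
```

A low ratio suggests the prediction is too far from the final output to be worth sending, since rejected prediction tokens are still billed as completion tokens.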

## What's Next

<Card title="Vector Databases" icon="arrow-right" href="/ai-engineering/vector-databases">
  Store embeddings at scale with pgvector and Pinecone for semantic search
</Card>

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="You are building a customer-facing AI feature that needs to extract structured data from user messages and call internal APIs based on the results. Walk me through how you would design this using the OpenAI API, and what failure modes you would design around.">
    **Strong Answer:**

    * The architecture has three layers: structured extraction, function calling, and error handling. I would use structured outputs with a strict JSON schema for the extraction step, function calling for the API interaction, and a retry wrapper around the entire flow.
    * For extraction, I would define Pydantic models that represent the business entities -- say, an OrderIntent with fields like action (enum: track, return, cancel), order\_id (optional string), and reason (optional string). I would use `response_format` with `json_schema` and `strict: True` to guarantee the output matches. This is critical because without strict mode, you get "usually valid" JSON, and "usually" is not good enough when a parse failure crashes your webhook handler at 3am.
    * For function calling, I would define tools for each internal API (lookup\_order, initiate\_return, etc.) with tight parameter schemas. The key design decision: never let the model construct free-form API calls. Every parameter should be constrained -- enums for status values, regex patterns for IDs, explicit required fields. The model decides WHICH function to call and with what arguments; my code validates and executes.
    * The failure modes I would design around: (1) The model hallucinates a function that does not exist -- handle with a strict whitelist check before execution. (2) The model extracts a plausible-looking but invalid order\_id -- validate against the database before processing. (3) Rate limits during high-traffic periods -- implement exponential backoff with jitter, and a circuit breaker that falls back to a human agent queue after 3 failed retries. (4) The model returns valid JSON but semantically wrong data (extracts the wrong order\_id from a message mentioning multiple orders) -- add a confirmation step for high-stakes actions like cancellations.
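
Failure mode (3) can be sketched as a retry wrapper. This is a minimal illustration with hypothetical names (`call_with_backoff`, `TransientError`), and it covers only the backoff part, not a full circuit breaker. In real code the exception to catch would be the SDK's rate-limit error, and the caller would route to the human agent queue once the final retry fails.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class TransientError(Exception):
    """Stand-in for a retryable failure such as a rate-limit response."""

def call_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: the caller falls back to a human queue
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))  # jitter spreads out retry bursts
    raise RuntimeError("unreachable")
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real delays.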

    **Follow-up: Your structured output extraction works 99.2% of the time in testing, but in production you are seeing a 3% failure rate on certain user messages. How do you debug this?**

    * The gap between test and production is almost always input distribution. Test data is clean and well-formed; production users write in fragments, mix languages, include typos, and paste content from other apps. I would start by pulling the 3% failures and categorizing them.
    * Common culprits: (1) Messages that are too long and get truncated by max\_tokens on the response side -- the model starts generating the JSON but hits the token limit before closing all brackets. Fix: increase max\_tokens for extraction tasks or truncate the input. (2) Messages with special characters or Unicode that confuse the tokenizer. (3) Ambiguous messages where the model cannot confidently fill a required field and produces a schema violation trying to leave it blank.
    * I would add logging that captures the raw input, the model's raw output, and the Pydantic validation error for every failure. Then I would batch the failure cases into categories, create regression tests for each category, and either adjust the prompt to handle edge cases or add pre-processing to normalize the input before it hits the model.
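
The logging step above can start small. The sketch below builds one log record per extraction and separates truncation (`finish_reason == "length"`) from malformed JSON. `classify_extraction` is a hypothetical helper, and it uses `json.loads` as a stand-in for the Pydantic validation described in the answer.

```python
import json
from typing import Any, Dict

def classify_extraction(raw_input: str, response: Any) -> Dict[str, Any]:
    """Build a log record that says why an extraction failed, if it did."""
    choice = response.choices[0]
    raw_output = choice.message.content or ""
    record = {"input": raw_input, "output": raw_output, "category": "ok", "error": None}
    if choice.finish_reason == "length":
        # Hit the token limit: the JSON is likely cut off before its closing brackets
        record["category"] = "truncated"
        return record
    try:
        json.loads(raw_output)
    except json.JSONDecodeError as exc:
        record["category"] = "invalid_json"
        record["error"] = str(exc)
    return record
```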
  </Accordion>

  <Accordion title="Explain the difference between temperature, top_p, frequency_penalty, and presence_penalty. When would you use each, and what mistakes do people make combining them?">
    **Strong Answer:**

    * Temperature and top\_p both control randomness, but through different mechanisms. Temperature scales the logits before softmax: at 0 the model always picks the highest-probability token (deterministic), at 1 it samples from the natural distribution, and at 2 it flattens the distribution dramatically so even low-probability tokens have a decent chance. Top\_p (nucleus sampling) is a different approach: it dynamically truncates the distribution to include only the smallest set of tokens whose cumulative probability exceeds p. At top\_p=0.1, only the most likely tokens are considered; at top\_p=1.0, all tokens are candidates.
    * The critical mistake: changing both simultaneously. They interact in non-obvious ways. If you set temperature=0.3 and top\_p=0.5, you are double-constraining the distribution -- the temperature already concentrated probability on top tokens, and then top\_p further truncates. The result is more deterministic than either setting alone, which is usually not what you want. OpenAI's own documentation says to change one or the other, not both.
    * Frequency penalty and presence penalty both reduce repetition, but differently. Frequency penalty scales with how many times a token has appeared -- it penalizes "the" more each time "the" appears. Presence penalty is binary: it penalizes a token the same amount whether it has appeared once or ten times. Use frequency penalty (0.3-0.8) when the model gets stuck repeating phrases. Use presence penalty (0.3-0.6) when you want topic diversity -- it encourages the model to explore new concepts rather than rehashing the same point.
    * My production defaults: temperature=0 for extraction and classification (determinism matters), temperature=0.7 with top\_p=1.0 for conversational responses, frequency\_penalty=0.3 for any task where repetition is noticeable. I almost never touch presence\_penalty because it can cause the model to go off-topic.

    **Follow-up: You set temperature=0 for determinism, but you notice that the same prompt sometimes gives different outputs. Why, and how do you get true reproducibility?**

    * Temperature=0 is "approximately deterministic" but not guaranteed. There are several sources of non-determinism: GPU floating-point operations are not associative (the order of additions changes the result at the bit level), different hardware produces slightly different rounding, and OpenAI may route your request to different GPU clusters. The `seed` parameter helps: responses include a `system_fingerprint`, and with the same seed, parameters, and fingerprint, outputs are mostly (not strictly) deterministic. The fingerprint can also change when OpenAI updates their infrastructure.
    * For true reproducibility in production, I cache the response keyed on the hash of the full request (messages + model + seed + all parameters). If the same request comes in again, return the cached response. This also saves money and reduces latency. For evaluation, I run each test case 3 times and check that all 3 outputs match; if they diverge, I flag it as a non-determinism issue and increase the test tolerance.
  </Accordion>

  <Accordion title="Walk me through how you would design a cost optimization strategy for an application making 500K OpenAI API calls per month across different task types.">
    **Strong Answer:**

    * At 500K calls/month, cost optimization is not about saving pennies -- it is likely a \$5K-50K/month line item depending on model mix. The first step is instrumentation: log every API call with model, input tokens, output tokens, task type, and calculated cost. Without this data, you are optimizing blind.
    * The biggest lever is model routing. I would categorize every API call by task: classification, extraction, summarization, generation, reasoning. Then I would benchmark GPT-4o-mini against GPT-4o for each category using our actual prompts and a labeled evaluation set. In my experience, GPT-4o-mini handles classification and extraction at 95%+ of GPT-4o accuracy at 6% of the cost. That alone, if 60% of our calls are simple tasks, cuts our bill by more than half, assuming call sizes are similar across task types.
    * The second lever is token volume, especially output length. Output tokens cost 4x as much as input tokens on both GPT-4o and GPT-4o-mini at the rates used earlier in this module. A verbose system prompt is comparatively cheap per token, though it is still billed on every request; a verbose response is expensive on every single call. I would audit our prompts: add `max_tokens` limits to every call (prevents runaway responses), add explicit length constraints in the system prompt ("respond in 2-3 sentences"), and remove any unnecessary context from the messages array.
    * The third lever is caching. If the same question gets asked repeatedly (common in customer support), cache the response keyed on a hash of the messages. Even a 10% cache hit rate on 500K calls saves 50K API calls per month. Use Redis with a TTL of 1-24 hours depending on how dynamic your data is.
    * The fourth lever is batching. For non-real-time workloads (nightly summarization, weekly report generation), use the Batch API which offers a 50% discount in exchange for up to 24-hour turnaround.
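
The routing estimate in the model-routing bullet can be checked with a few lines of arithmetic. The sketch assumes every call uses the same number of tokens (2,000 input and 500 output, both made-up figures) and reuses the per-million-token rates from the cost-aware model selection example earlier in this module.

```python
PRICING = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}  # USD per 1M tokens

def monthly_cost(calls: int, model: str, input_tokens: int, output_tokens: int) -> float:
    """Monthly spend for `calls` identical requests on one model."""
    input_rate, output_rate = PRICING[model]
    per_call = (input_tokens / 1_000_000 * input_rate
                + output_tokens / 1_000_000 * output_rate)
    return calls * per_call

CALLS = 500_000
INPUT_TOKENS, OUTPUT_TOKENS = 2_000, 500  # assumed average request size

mini_calls = CALLS * 60 // 100   # 60% of traffic routed to gpt-4o-mini
strong_calls = CALLS - mini_calls

baseline = monthly_cost(CALLS, "gpt-4o", INPUT_TOKENS, OUTPUT_TOKENS)
routed = (monthly_cost(mini_calls, "gpt-4o-mini", INPUT_TOKENS, OUTPUT_TOKENS)
          + monthly_cost(strong_calls, "gpt-4o", INPUT_TOKENS, OUTPUT_TOKENS))

print(f"All gpt-4o:  ${baseline:,.0f}/month")
print(f"60% on mini: ${routed:,.0f}/month")
print(f"Savings:     {1 - routed / baseline:.0%}")
```

Under these assumptions the bill drops from about \$5,000 to about \$2,180 per month, roughly 56%. Real savings depend on how token counts differ between simple and complex tasks.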

    **Follow-up: After implementing model routing and caching, your costs dropped 45% but you are getting complaints that some responses feel "dumber" since the switch. How do you investigate?**

    * The routing logic is probably miscategorizing some complex queries as simple and sending them to GPT-4o-mini when they need GPT-4o. I would pull the complaints, match them to the logged API calls, and check which model served each one. If the pattern is "all complaints were served by mini," the routing heuristic is too aggressive.
    * The fix is a quality feedback loop. Add a thumbs-up/thumbs-down to the UI, log the feedback with the request metadata, and periodically review which task types have the lowest satisfaction scores on GPT-4o-mini. Move those task types back to GPT-4o. You can also implement a "try cheap first, escalate if bad" pattern: route to GPT-4o-mini, use a lightweight quality check on the response (length, confidence score, keyword presence), and if it fails, automatically retry with GPT-4o. This adds latency for the escalated cases but keeps costs low for the majority.
  </Accordion>
</AccordionGroup>