December 2025 Update: Now covers the new Responses API, Predicted Outputs, structured outputs with response_format, and GPT-4.5 capabilities.

Why This Module Matters

The OpenAI API is one of the most widely used LLM interfaces. Most AI startups, enterprise AI features, and AI-powered tools use it or something with a very similar shape. Master this, and the same patterns carry over to nearly any provider.
Career Impact: Companies pay $200-350K for engineers who can build reliable, production-grade AI features. This module teaches exactly that.

What’s New in 2025

| Feature | Description | Use Case |
| --- | --- | --- |
| Responses API | Simpler, more powerful completions | New projects |
| Predicted Outputs | Speed up edits with known structure | Code refactoring |
| GPT-4.5 | Most capable model | Complex reasoning |
| o1 Reasoning | Chain-of-thought built-in | Math, coding |
| Structured Outputs | Guaranteed JSON schema | API responses |

Your Development Environment

# Install dependencies
# pip install openai python-dotenv pydantic

# .env file (NEVER commit this)
# OPENAI_API_KEY=sk-...

from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()

client = OpenAI()  # Automatically reads OPENAI_API_KEY

# Verify connection
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello!"}],
    max_tokens=10
)
print(response.choices[0].message.content)
Security: Never hardcode API keys. Never commit .env files. Use environment variables or secret managers in production.

Chat Completions: The Foundation

Chat completions are the bread and butter of every LLM application. The mental model is simple: you send a conversation (a list of messages with roles), and the model continues the conversation. Think of it like passing a script to an actor — the system message is the stage direction, the user messages are the audience’s lines, and the assistant messages are the actor’s previous lines. The model reads the whole script and generates the next line.
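
To make the "script" model concrete, here is a minimal sketch of how a conversation accumulates between requests. The helper name `add_turn` is hypothetical (it is not part of the OpenAI SDK); it only manipulates the plain list of role/content dictionaries that the API expects.

```python
def add_turn(messages: list[dict], role: str, content: str) -> list[dict]:
    """Return a new message list with one more turn appended."""
    return messages + [{"role": role, "content": content}]

# The API is stateless: every request carries the whole script so far.
script = [{"role": "system", "content": "You are a helpful assistant."}]
script = add_turn(script, "user", "What is a decorator?")
script = add_turn(script, "assistant", "A function that wraps another function.")
script = add_turn(script, "user", "Show me an example.")

# This list is what would be passed as messages= on the next request.
print([m["role"] for m in script])
```

Because the full list is resent each time, the model only "remembers" what is in `messages`; dropping a turn from the list removes it from the model's view.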

The Complete Request Object

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    # Required
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Python decorators"}
    ],
    
    # Optional but important -- these are your main control knobs
    temperature=0.7,        # 0=deterministic, 1=creative, 2=chaos. Use 0 for extraction, 0.7 for conversation
    max_tokens=1000,        # Hard cap on output length. Prevents runaway costs. Always set this.
    top_p=1.0,              # Nucleus sampling -- change EITHER temperature OR top_p, not both
    frequency_penalty=0,    # -2 to 2. Positive values penalize repeated tokens (reduces "the the the")
    presence_penalty=0,     # -2 to 2. Positive values encourage topic diversity
    stop=None,              # Stop sequences -- model halts when it generates any of these strings
    n=1,                    # Number of completions. >1 is useful for self-consistency but multiplies cost
    
    # Advanced
    seed=42,                # For reproducible outputs (best-effort, not guaranteed)
    user="user_123",        # Passed to OpenAI for abuse detection -- use your internal user ID
    logprobs=False,         # Return token probabilities -- useful for confidence scoring
)

# Access the response
print(response.choices[0].message.content)
print(response.usage)  # Token counts

Production-Ready Chat Function

from openai import OpenAI
from dataclasses import dataclass
from typing import Optional, List
import json

@dataclass
class ChatMessage:
    role: str
    content: str

@dataclass  
class ChatResponse:
    content: str
    model: str
    input_tokens: int
    output_tokens: int
    finish_reason: str
    cost_estimate: float

class ChatClient:
    """Production-ready OpenAI chat client"""
    
    # Pricing per 1M tokens (USD). Keep this updated -- prices drop frequently.
    # Pitfall: output tokens cost 3-4x more than input tokens for the models
    # listed here. A verbose system prompt is comparatively cheap; a verbose
    # response is expensive. Always set max_tokens.
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    }
    
    def __init__(self, default_model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.default_model = default_model
    
    def _calculate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float:
        pricing = self.PRICING.get(model, self.PRICING["gpt-4o"])
        return (input_tokens / 1_000_000 * pricing["input"] + 
                output_tokens / 1_000_000 * pricing["output"])
    
    def chat(
        self,
        messages: List[ChatMessage],
        model: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        json_response: bool = False
    ) -> ChatResponse:
        """Send chat completion with full response metadata"""
        
        model = model or self.default_model
        
        kwargs = {
            "model": model,
            "messages": [{"role": m.role, "content": m.content} for m in messages],
            "temperature": temperature,
        }
        
        if max_tokens is not None:
            kwargs["max_tokens"] = max_tokens
        
        if json_response:
            kwargs["response_format"] = {"type": "json_object"}
        
        response = self.client.chat.completions.create(**kwargs)
        
        choice = response.choices[0]
        usage = response.usage
        
        return ChatResponse(
            content=choice.message.content,
            model=response.model,
            input_tokens=usage.prompt_tokens,
            output_tokens=usage.completion_tokens,
            finish_reason=choice.finish_reason,
            cost_estimate=self._calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)
        )


# Usage
chat = ChatClient()
response = chat.chat([
    ChatMessage("system", "You are a coding tutor."),
    ChatMessage("user", "Explain list comprehensions in Python")
])

print(response.content)
print(f"Cost: ${response.cost_estimate:.6f}")
print(f"Tokens: {response.input_tokens} in, {response.output_tokens} out")

Streaming: Real-Time Responses

Why Streaming Matters

Without streaming, users wait 5-30 seconds staring at nothing. With streaming, they see the first token within 200-500ms, even if the full response takes 10 seconds. This is the same principle behind progressive image loading on the web: perceived performance matters as much as actual performance. A streaming response that takes 10 seconds total can feel faster than a non-streaming response that takes 5 seconds, because the user sees progress immediately.
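from typing import Callable, Optional

def stream_chat(prompt: str, on_token: Optional[Callable[[str], None]] = None) -> str:
    """Stream a response, invoking on_token for each text fragment"""
    if on_token is None:
        # Default: print fragments as they arrive, without newlines
        on_token = lambda text: print(text, end="", flush=True)

    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    
    full_response = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta (e.g. the final chunk)
        if chunk.choices and chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            full_response.append(token)
            on_token(token)
    
    print()  # Newline at end
    return "".join(full_response)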

# Basic usage
response = stream_chat("Explain machine learning in one paragraph")
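
The latency argument above can be checked with a small timing harness. This is a sketch under stated assumptions: `measure_stream` and `fake_stream` are hypothetical names, and the fake generator stands in for a real API stream so the example runs without a network call.

```python
import time
from typing import Iterable, Iterator

def measure_stream(tokens: Iterable[str]) -> dict:
    """Consume a token stream and record time-to-first-token and total time."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    return {"text": "".join(parts), "ttft_s": first_token_at, "total_s": total}

def fake_stream() -> Iterator[str]:
    """Stand-in for a real stream: yields tokens with a small delay each."""
    for token in ["Machine ", "learning ", "is ", "fun."]:
        time.sleep(0.01)
        yield token

stats = measure_stream(fake_stream())
print(f"first token after {stats['ttft_s']:.3f}s, done after {stats['total_s']:.3f}s")
```

With a real stream, the same function could be fed a generator that yields `chunk.choices[0].delta.content` values; the gap between the two numbers is the wait that streaming hides from the user.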

Production Streaming with FastAPI

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
import json

app = FastAPI()
client = OpenAI()

@app.post("/chat/stream")
async def chat_stream(request: dict):
    """Server-Sent Events streaming endpoint"""
    
    def generate():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=request["messages"],
            stream=True
        )
        
        for chunk in stream:
            if chunk.choices[0].delta.content:
                data = {"content": chunk.choices[0].delta.content}
                yield f"data: {json.dumps(data)}\n\n"
        
        yield "data: [DONE]\n\n"
    
    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )

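# Frontend JavaScript to consume. EventSource only issues GET requests, so a
# POST endpoint like this one is read with fetch() and a stream reader.
# (handleSseText is a hypothetical helper that splits the "data: ..." lines.)
# const res = await fetch('/chat/stream', {
#   method: 'POST',
#   headers: {'Content-Type': 'application/json'},
#   body: JSON.stringify({messages}),
# });
# const reader = res.body.getReader();
# const decoder = new TextDecoder();
# while (true) {
#   const {done, value} = await reader.read();
#   if (done) break;
#   handleSseText(decoder.decode(value, {stream: true}));
# }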
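
For non-browser clients, the same wire format can be parsed in a few lines. The sketch below assumes the exact framing produced by the endpoint above (`data: {"content": ...}` lines and a final `data: [DONE]`); `iter_sse_content` is a hypothetical helper name, and the sample lines stand in for a real HTTP response body.

```python
import json
from typing import Iterable, Iterator

def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "content" field of each SSE data line until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip blank separators and non-data fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        yield json.loads(payload)["content"]

# Lines as they would arrive from the /chat/stream endpoint
sample = ['data: {"content": "Hel"}', "", 'data: {"content": "lo"}', "", "data: [DONE]", ""]
print("".join(iter_sse_content(sample)))
```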

Function Calling: LLMs That Take Action

Function calling is the bridge between “chatbot” and “agent.” Without it, an LLM can only generate text. With it, an LLM can check the weather, query a database, send an email, or call any API you expose. The model does not actually execute the function — it generates a structured request (“call get_weather with city=Paris”), you execute it in your code, and you feed the result back. This keeps the LLM in the reasoning seat while your code handles the doing.

The Pattern

  1. You define functions the model can “call” (name, description, parameters)
  2. Model decides which function to call based on user input
  3. You execute the function and return results (the model never runs code)
  4. Model uses results to form final response

Complete Function Calling System

import json
from typing import Callable, Any
from openai import OpenAI

client = OpenAI()

class FunctionRegistry:
    """Register and execute functions that LLMs can call"""
    
    def __init__(self):
        self.functions: dict[str, Callable] = {}
        self.schemas: list[dict] = []
    
    def register(self, name: str, description: str, parameters: dict):
        """Decorator to register a function"""
        def decorator(func: Callable):
            self.functions[name] = func
            self.schemas.append({
                "type": "function",
                "function": {
                    "name": name,
                    "description": description,
                    "parameters": parameters
                }
            })
            return func
        return decorator
    
    def execute(self, name: str, arguments: dict) -> Any:
        """Execute a registered function"""
        if name not in self.functions:
            raise ValueError(f"Unknown function: {name}")
        return self.functions[name](**arguments)


# Create registry and register functions
registry = FunctionRegistry()

@registry.register(
    name="get_weather",
    description="Get current weather for a city",
    parameters={
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "default": "celsius"}
        },
        "required": ["city"]
    }
)
def get_weather(city: str, unit: str = "celsius") -> dict:
    # Mock implementation - replace with real API
    return {"city": city, "temp": 22, "unit": unit, "condition": "sunny"}

@registry.register(
    name="search_products",
    description="Search product catalog",
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms"},
            "max_price": {"type": "number", "description": "Maximum price"},
            "category": {"type": "string", "enum": ["electronics", "clothing", "books"]}
        },
        "required": ["query"]
    }
)
def search_products(query: str, max_price: float = None, category: str = None) -> list:
    # Mock implementation
    return [
        {"id": "1", "name": f"{query} Pro", "price": 99.99},
        {"id": "2", "name": f"{query} Basic", "price": 49.99}
    ]

@registry.register(
    name="send_email",
    description="Send an email to a recipient",
    parameters={
        "type": "object",
        "properties": {
            "to": {"type": "string", "description": "Recipient email"},
            "subject": {"type": "string"},
            "body": {"type": "string"}
        },
        "required": ["to", "subject", "body"]
    }
)
def send_email(to: str, subject: str, body: str) -> dict:
    # Mock - replace with real email service
    return {"status": "sent", "to": to}


def chat_with_functions(user_message: str) -> str:
    """Complete function calling flow"""
    messages = [{"role": "user", "content": user_message}]
    
    # First call - model may request function calls
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=registry.schemas,
        tool_choice="auto"
    )
    
    message = response.choices[0].message
    
    # If no function calls, return directly
    if not message.tool_calls:
        return message.content
    
    # Execute all requested functions
    messages.append(message)
    
    for tool_call in message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        
        print(f"Executing: {function_name}({arguments})")
        result = registry.execute(function_name, arguments)
        
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })
    
    # Final call with function results
    final_response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    
    return final_response.choices[0].message.content


# Usage examples
print(chat_with_functions("What's the weather in Paris?"))
print(chat_with_functions("Find me some laptop options under $100"))
print(chat_with_functions("Send an email to bob@example.com about the meeting tomorrow"))

Parallel Function Calls

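Recent models such as gpt-4o can request multiple function calls in a single response:
# User: "What's the weather in Paris and London?"
# Model responds with TWO entries in message.tool_calls
from concurrent.futures import ThreadPoolExecutor

def run_tool_call(tool_call):
    # Reuses `registry` and `json` from the previous section
    return registry.execute(tool_call.function.name, json.loads(tool_call.function.arguments))

# Independent calls can run concurrently; map() preserves the original order
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_tool_call, message.tool_calls))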

Function Calling Edge Cases

Edge case: the model calls a function you did not expect. The model might call send_email when you only expected search_products. Always validate the function name before executing. Never blindly dispatch tool calls without checking that the function is safe for the current context.

Edge case: malformed arguments. The model occasionally generates invalid JSON in the arguments field, especially for complex nested schemas. Wrap json.loads(tool_call.function.arguments) in a try/except and return a helpful error message to the model so it can retry.

Edge case: infinite tool-call loops. The model might call a function, get a result, and decide to call the same function again with slightly different parameters. Set a maximum loop count (3-5 iterations) and force a final response with tool_choice="none" after the limit.

Edge case: tool_choice="required" vs. "auto". Use "auto" (default) when the model should decide whether to call a function. Use "required" when you know a function call is needed (e.g., the user said “book the flight”). Use {"type": "function", "function": {"name": "specific_func"}} when you need a specific function called, which is useful for structured extraction where you want the model to always populate a schema.
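
The first two edge cases can be handled in one defensive dispatcher. This is a minimal sketch, not the SDK's own mechanism: `dispatch_tool_call`, the `allowed` set, and `MAX_TOOL_ROUNDS` are hypothetical names introduced for illustration. Errors are returned as JSON strings so they can be sent back in a `tool` message and the model can correct itself.

```python
import json
from typing import Any, Callable

MAX_TOOL_ROUNDS = 3  # after this many rounds, make the next request with tool_choice="none"

def dispatch_tool_call(
    name: str,
    raw_arguments: str,
    functions: dict[str, Callable[..., Any]],
    allowed: set[str],
) -> str:
    """Run one tool call defensively; always return a JSON string."""
    # Unexpected function: refuse instead of executing
    if name not in allowed or name not in functions:
        return json.dumps({"error": f"Function '{name}' is not available here."})
    # Malformed arguments: report the parse problem back to the model
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"Arguments were not valid JSON: {exc.msg}"})
    if not isinstance(arguments, dict):
        return json.dumps({"error": "Arguments must be a JSON object."})
    # Wrong or missing parameter names surface as TypeError
    try:
        return json.dumps(functions[name](**arguments))
    except TypeError as exc:
        return json.dumps({"error": f"Bad arguments: {exc}"})

functions = {"get_weather": lambda city, unit="celsius": {"city": city, "temp": 22, "unit": unit}}
print(dispatch_tool_call("get_weather", '{"city": "Paris"}', functions, {"get_weather"}))
print(dispatch_tool_call("send_email", "{}", functions, {"get_weather"}))
print(dispatch_tool_call("get_weather", '{"city": ', functions, {"get_weather"}))
```

Passing a per-request `allowed` set, rather than relying on everything in the registry, is one way to keep a sensitive function such as send_email out of contexts where only read-only lookups make sense.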

Structured Outputs: Guaranteed JSON

Structured outputs solve the single most frustrating problem in LLM engineering: parsing. Before this feature, you would ask the model for JSON and get back… sometimes JSON, sometimes JSON wrapped in markdown, sometimes a conversational response with JSON buried in it, and sometimes invalid JSON that crashes your parser. Structured outputs use constrained decoding to guarantee the output matches your schema. It is not “usually works”: whenever the model finishes a response normally, the output conforms to the schema. The two cases you still need to handle are a refusal and a response cut off by max_tokens.

Structured Output Methods Compared

| Method | Guarantee | Supported Models | Limitations |
| --- | --- | --- | --- |
| response_format: {"type": "json_object"} | Soft: model is nudged to produce JSON | All chat models | Can still produce invalid JSON, no schema enforcement |
| response_format: {"type": "json_schema", "strict": true} | Hard: constrained decoding | GPT-4o, GPT-4o-mini (2024+) | Schema must follow OpenAI’s subset of JSON Schema (no oneOf, patternProperties) |
| Instructor library (response_model=) | Soft + auto-retry on validation failure | Any model (wraps API) | Adds latency for retries, depends on model compliance |
| Function calling with strict schemas | Hard: constrained decoding | GPT-4o, GPT-4o-mini | Schema limitations same as structured outputs |
When to use which: Use json_schema with strict: true for new projects — it is the gold standard. Use Instructor when you need Pydantic validators that go beyond schema validation (e.g., “age must be between 0 and 150”). Use json_object only as a fallback for older models that do not support strict schemas.
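
Two requirements of the strict subset are easy to miss: every object must set additionalProperties to false, and every property must appear in required. The sketch below shows one way to close a hand-written schema accordingly. `to_strict_schema` is a hypothetical helper, and it covers only these two rules; it does not remove unsupported keywords such as oneOf.

```python
from typing import Any

def to_strict_schema(node: Any) -> Any:
    """Recursively copy a JSON Schema, closing every object for strict mode."""
    if isinstance(node, list):
        return [to_strict_schema(item) for item in node]
    if not isinstance(node, dict):
        return node
    result = {key: to_strict_schema(value) for key, value in node.items()}
    if result.get("type") == "object" and "properties" in result:
        result["additionalProperties"] = False           # no extra keys allowed
        result["required"] = list(result["properties"])  # every field is required
    return result

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
    },
    "required": ["name"],
}

strict = to_strict_schema(schema)
print(strict["required"])
print(strict["properties"]["address"])
```

Since every field becomes required, an optional value is usually modeled as a union with null (for example `{"type": ["string", "null"]}`) rather than by omitting the key.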

The Problem Structured Outputs Solve

# Without structured outputs - unreliable
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the name and email from: John at john@example.com"}]
)
# Might return: "Name: John, Email: john@example.com"
# Or: {"name": "John", "email": "john@example.com"}
# Or: "The name is John and email is john@example.com"
# Parsing nightmare!

With Structured Outputs - Guaranteed

from pydantic import BaseModel, Field
from typing import Optional, List
from enum import Enum

class Priority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class Task(BaseModel):
    title: str = Field(description="Short task title")
    description: str = Field(description="Detailed description")
    priority: Priority
    due_date: Optional[str] = Field(description="Due date in YYYY-MM-DD format")
    tags: List[str] = Field(default_factory=list)

class TaskExtraction(BaseModel):
    tasks: List[Task]
    summary: str

# Use with OpenAI. The SDK's parse() helper converts the Pydantic model into a
# strict JSON schema (every field required, additionalProperties: false) and
# turns on constrained decoding, so the model CANNOT produce JSON that violates
# the schema. Passing TaskExtraction.model_json_schema() by hand with
# "strict": True is rejected by the API, because the raw Pydantic schema does
# not meet those requirements. This also differs from
# response_format={"type": "json_object"}, which only nudges the model toward JSON.
# (Newer SDK versions also expose this helper as client.chat.completions.parse.)
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract tasks from the user's message."},
        {"role": "user", "content": """
            Need to finish the quarterly report by Friday, it's high priority.
            Also should review the new hire's code when I get a chance.
            Don't forget to book the team dinner for next month.
        """}
    ],
    response_format=TaskExtraction,
)

# Already validated: .parsed is a TaskExtraction instance (None if the model refused)
tasks = response.choices[0].message.parsed

for task in tasks.tasks:
    print(f"[{task.priority.value.upper()}] {task.title}")
    if task.due_date:
        print(f"  Due: {task.due_date}")

Complex Nested Extraction

from pydantic import BaseModel, Field
from typing import Optional, List

class Address(BaseModel):
    street: str
    city: str
    state: Optional[str]
    country: str
    postal_code: Optional[str]

class Company(BaseModel):
    name: str
    industry: str
    employee_count: Optional[str]

class Person(BaseModel):
    full_name: str
    email: Optional[str]
    phone: Optional[str]
    job_title: Optional[str]
    company: Optional[Company]
    address: Optional[Address]
    skills: List[str]

class ExtractionResult(BaseModel):
    people: List[Person]
    confidence: float = Field(ge=0, le=1, description="Confidence score 0-1")

# Extract from unstructured text like resumes, emails, business cards
text = """
Hi, I'm Sarah Chen, a Senior Software Engineer at TechCorp (they have about 500 employees
in the fintech space). You can reach me at sarah.chen@techcorp.io or 555-0123.
I specialize in Python, distributed systems, and machine learning.
Our office is at 123 Innovation Drive, San Francisco, CA 94102.
"""

# parse() builds a strict-compatible schema from ExtractionResult; the raw
# model_json_schema() output would be rejected with "strict": True.
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract structured information from text."},
        {"role": "user", "content": text}
    ],
    response_format=ExtractionResult,
)

# .parsed is an ExtractionResult instance (None if the model refused)
result = response.choices[0].message.parsed
print(result.people[0].full_name)  # "Sarah Chen"
print(result.people[0].company.name)  # "TechCorp"

Vision: Processing Images

Image Analysis

import base64
from pathlib import Path

def encode_image(image_path: str) -> str:
    """Encode image to base64"""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def analyze_image(image_path: str, prompt: str = "Describe this image in detail") -> str:
    """Analyze an image with a vision-capable model (gpt-4o here)"""
    base64_image = encode_image(image_path)
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"  # or "low" for faster/cheaper
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    
    return response.choices[0].message.content

# Usage
description = analyze_image("receipt.jpg", "Extract all items and prices from this receipt")

Multiple Images

def compare_images(image_paths: list[str], prompt: str) -> str:
    """Compare multiple images"""
    content = [{"type": "text", "text": prompt}]
    
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}
        })
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}]
    )
    
    return response.choices[0].message.content

# Compare before/after, find differences, etc.
result = compare_images(
    ["before.jpg", "after.jpg"],
    "What changed between these two images?"
)
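
Both functions above hardcode image/jpeg in the data URL, which mislabels PNG or WebP input. A small sketch of deriving the MIME type from the file name with the standard library follows; `to_data_url` is a hypothetical helper, and it trusts the file extension rather than inspecting the bytes.

```python
import base64
import mimetypes

def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Build a data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {filename}")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# The bytes here are a placeholder, not a complete image
print(to_data_url(b"\x89PNG", "chart.png"))
```

The returned string can be used as the "url" value inside an image_url content part in place of the hardcoded f-string.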

Production Error Handling

from openai import OpenAI, APIError, RateLimitError, APIConnectionError, AuthenticationError
import time
from functools import wraps
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Decorator for robust API calls with retry logic.
    
    Why exponential backoff? Rate limits and transient errors are normal
    with LLM APIs. Retrying immediately floods the server and gets you
    rate-limited harder. Exponential backoff (1s, 2s, 4s) gives the
    server time to recover, and it absorbs most short-lived failures
    before they ever reach your users.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                    
                except AuthenticationError:
                    logger.error("Invalid API key")
                    raise  # Don't retry auth errors
                    
                except RateLimitError as e:
                    last_exception = e
                    wait = base_delay * (2 ** attempt)
                    logger.warning(f"Rate limited. Retry {attempt + 1}/{max_retries} in {wait}s")
                    time.sleep(wait)
                    
                except APIConnectionError as e:
                    last_exception = e
                    wait = base_delay * (2 ** attempt)
                    logger.warning(f"Connection error. Retry {attempt + 1}/{max_retries} in {wait}s")
                    time.sleep(wait)
                    
                except APIError as e:
                    last_exception = e
                    # Only APIStatusError subclasses carry status_code
                    status = getattr(e, "status_code", None)
                    if status is not None and status >= 500:  # Server errors
                        wait = base_delay * (2 ** attempt)
                        logger.warning(f"Server error {status}. Retry {attempt + 1}/{max_retries}")
                        time.sleep(wait)
                    else:
                        raise  # Don't retry client errors (4xx)
            
            raise last_exception
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def reliable_chat(messages: list, model: str = "gpt-4o-mini") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        timeout=30  # 30 second timeout
    )
    return response.choices[0].message.content
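
One common refinement, not used in the decorator above, is to add jitter so that many clients hitting a rate limit at the same moment do not all retry in lockstep. The sketch below uses the "full jitter" variant, where each wait is drawn uniformly between zero and the exponential ceiling. `backoff_delays` and the `cap` parameter are hypothetical names for illustration.

```python
import random
from typing import Optional

def backoff_delays(
    max_retries: int,
    base_delay: float = 1.0,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Full-jitter schedule: wait i is uniform in [0, min(cap, base * 2**i)]."""
    rng = rng or random.Random()
    return [
        rng.uniform(0, min(cap, base_delay * (2 ** attempt)))
        for attempt in range(max_retries)
    ]

# A seeded generator keeps the demo repeatable
delays = backoff_delays(5, rng=random.Random(0))
print([round(d, 2) for d in delays])
```

Inside the decorator, `wait = base_delay * (2 ** attempt)` could be swapped for a value drawn this way; the cap keeps a long retry chain from sleeping for minutes.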

Cost Optimization Strategies

Cost optimization is not about being cheap — it is about being smart. The difference between a well-optimized and naive LLM application can be 10-50x in monthly spend. The biggest lever is model selection: gpt-4o-mini handles 80% of tasks at 6% of the cost. The second biggest lever is prompt length: every token in your system prompt is charged on every single request.

Model Selection Matrix

| Task | Recommended Model | Cost/1M tokens (input / output) | Why |
| --- | --- | --- | --- |
| Simple Q&A | gpt-4o-mini | $0.15 / $0.60 | Fast, cheap, sufficient |
| Code generation | gpt-4o | $2.50 / $10.00 | Better accuracy |
| Complex reasoning | gpt-4o | $2.50 / $10.00 | Necessary for quality |
| Summarization | gpt-4o-mini | $0.15 / $0.60 | Simple task |
| Data extraction | gpt-4o-mini + structured | $0.15 / $0.60 | Structured output helps |
| Creative writing | gpt-4o | $2.50 / $10.00 | Better quality |

Cost Estimation Quick Reference

For back-of-envelope cost estimation, use these rules of thumb:
| Metric | Approximation |
| --- | --- |
| 1 token | Roughly 4 characters or 0.75 words in English |
| Typical system prompt | 200-500 tokens |
| Typical user message | 50-200 tokens |
| Typical assistant response | 100-500 tokens |
| 1,000 conversations/day (gpt-4o-mini) | ~$0.50-2.00/day |
| 1,000 conversations/day (gpt-4o) | ~$8-30/day |
| RAG with 5 chunks context (gpt-4o) | ~$0.01-0.03 per query |
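
As a worked example of these rules of thumb, the sketch below estimates daily spend from per-turn token counts. The prices are the ones used elsewhere in this module; the traffic profile (5 turns per conversation, 600 input and 300 output tokens per turn) is an assumption chosen for illustration, and `estimate_daily_cost` is a hypothetical helper.

```python
PRICES_PER_1M = {  # USD per 1M tokens: (input, output)
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_daily_cost(
    model: str,
    conversations: int,
    turns: int,
    input_tokens_per_turn: int,
    output_tokens_per_turn: int,
) -> float:
    """Rough daily spend in USD, using flat per-turn token counts."""
    input_rate, output_rate = PRICES_PER_1M[model]
    total_in = conversations * turns * input_tokens_per_turn
    total_out = conversations * turns * output_tokens_per_turn
    return total_in / 1_000_000 * input_rate + total_out / 1_000_000 * output_rate

for model in PRICES_PER_1M:
    cost = estimate_daily_cost(
        model, conversations=1000, turns=5,
        input_tokens_per_turn=600, output_tokens_per_turn=300,
    )
    print(f"{model}: ${cost:.2f}/day")
```

Under these assumptions the totals are 3M input and 1.5M output tokens per day, which works out to $22.50 for gpt-4o and $1.35 for gpt-4o-mini, both inside the ranges in the table.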
Edge case: hidden cost multipliers. Function calling adds tokens for the function schemas on every request (often 200-500 tokens per function). If you define 10 functions, that is 2,000-5,000 extra input tokens per request. Only include functions relevant to the current context, not your entire function catalog.

Edge case: conversation history growth. In multi-turn chat, you resend the entire conversation history on every request. A 20-turn conversation might accumulate 10,000+ tokens of history, costing 5-10x more than the first message. Implement conversation summarization or sliding window truncation for long conversations.
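
A sliding window can be sketched in a few lines. This version keeps all system messages and then as many of the most recent turns as fit a token budget, using the 4-characters-per-token rule of thumb instead of a real tokenizer. `estimate_tokens` and `trim_history` are hypothetical helpers, and the sketch assumes every message has string content.

```python
def estimate_tokens(text: str) -> int:
    """Rule-of-thumb estimate: about 4 characters per token."""
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + kept[::-1]  # restore chronological order

history = [{"role": "system", "content": "Be brief."}]
for i in range(10):
    history.append({"role": "user", "content": f"Question {i}: " + "x" * 80})
    history.append({"role": "assistant", "content": f"Answer {i}: " + "y" * 80})

trimmed = trim_history(history, max_tokens=100)
print(len(history), "->", len(trimmed))
```

In a conversation that uses function calling, a window like this should also avoid separating a `tool` message from the assistant message that requested it; that bookkeeping is left out here.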

Smart Model Router

from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "simple"      # FAQ, summarization, translation
    MODERATE = "moderate"  # Code explanation, analysis
    COMPLEX = "complex"    # Code generation, reasoning, creative

def select_model(complexity: TaskComplexity) -> str:
    """Select appropriate model based on task complexity"""
    return {
        TaskComplexity.SIMPLE: "gpt-4o-mini",
        TaskComplexity.MODERATE: "gpt-4o-mini",  # Try cheap first
        TaskComplexity.COMPLEX: "gpt-4o",
    }[complexity]

def smart_chat(prompt: str, complexity: TaskComplexity) -> tuple[str, float]:
    """Chat with cost-aware model selection"""
    model = select_model(complexity)
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Calculate cost
    usage = response.usage
    pricing = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}
    input_rate, output_rate = pricing[model]
    cost = (usage.prompt_tokens / 1_000_000 * input_rate + 
            usage.completion_tokens / 1_000_000 * output_rate)
    
    return response.choices[0].message.content, cost

Mini-Project: AI Customer Support Bot

import json
from openai import OpenAI
from pydantic import BaseModel
from typing import Optional, List
from enum import Enum
from datetime import datetime

client = OpenAI()

# Knowledge base (in production, this would be a vector database)
KNOWLEDGE_BASE = {
    "refund_policy": "Refunds are available within 30 days of purchase for unused items.",
    "shipping": "Standard shipping takes 5-7 business days. Express is 2-3 days.",
    "contact": "Email support@example.com or call 1-800-555-0123",
    "hours": "Customer support is available Mon-Fri 9am-6pm EST"
}

class TicketPriority(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    urgent = "urgent"

class SupportResponse(BaseModel):
    answer: str
    confidence: float
    needs_human: bool
    priority: TicketPriority
    suggested_actions: List[str]

# Function definitions for the bot
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_knowledge",
            "description": "Look up information in the knowledge base",
            "parameters": {
                "type": "object",
                "properties": {
                    "topic": {"type": "string", "enum": list(KNOWLEDGE_BASE.keys())}
                },
                "required": ["topic"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "create_ticket",
            "description": "Create a support ticket for human follow-up",
            "parameters": {
                "type": "object",
                "properties": {
                    "summary": {"type": "string"},
                    "priority": {"type": "string", "enum": ["low", "medium", "high", "urgent"]},
                    "customer_email": {"type": "string"}
                },
                "required": ["summary", "priority"]
            }
        }
    }
]

def lookup_knowledge(topic: str) -> str:
    return KNOWLEDGE_BASE.get(topic, "Information not found")

def create_ticket(summary: str, priority: str, customer_email: str = None) -> dict:
    ticket_id = f"TKT-{datetime.now().strftime('%Y%m%d%H%M%S')}"
    return {"ticket_id": ticket_id, "status": "created", "priority": priority}

def handle_support_request(customer_message: str) -> SupportResponse:
    """Handle a customer support request"""
    
    messages = [
        {
            "role": "system",
            "content": """You are a helpful customer support agent. 
            Use the lookup_knowledge function to find answers.
            Create tickets for complex issues that need human attention.
            Always be polite and helpful."""
        },
        {"role": "user", "content": customer_message}
    ]
    
    # First call - may request functions
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )
    
    message = response.choices[0].message
    
    # Process function calls if any
    if message.tool_calls:
        messages.append(message)
        
        for tool_call in message.tool_calls:
            func_name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)
            
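            if func_name == "lookup_knowledge":
                result = lookup_knowledge(**args)
            elif func_name == "create_ticket":
                result = create_ticket(**args)
            else:
                # Unknown tool: report it back instead of reusing a stale result
                result = {"error": f"Unknown function: {func_name}"}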
            
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })
        
        # Get final response
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        message = response.choices[0].message
    
    # Get structured analysis
    analysis_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Analyze this support interaction"},
            {"role": "user", "content": f"Customer: {customer_message}\nAgent: {message.content}"}
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "support_response",
                "schema": SupportResponse.model_json_schema()
            }
        }
    )
    
    return SupportResponse.model_validate_json(
        analysis_response.choices[0].message.content
    )

# Test the bot
result = handle_support_request("I want a refund for my order from last week")
print(f"Answer: {result.answer}")
print(f"Needs human: {result.needs_human}")
print(f"Priority: {result.priority}")

Key Takeaways

Streaming Is Essential

Always stream for user-facing apps. Nobody wants to wait 10 seconds for a response to appear.

Functions Enable Actions

Function calling turns LLMs from chatbots into agents that can search, book, send, and execute.

Structured Outputs Save Time

Use Pydantic models + json_schema (with strict: True) for guaranteed parseable responses. No more regex parsing.

Cost Awareness Matters

gpt-4o-mini is roughly 17x cheaper per token than gpt-4o. Use it for simple tasks, save gpt-4o for complex reasoning.
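
To make the price gap concrete, here is a minimal sketch of a per-call cost estimator. The per-million-token prices are assumptions for illustration only (check the current pricing page before relying on them), and `PRICES_PER_MILLION` and `estimate_cost` are hypothetical names, not part of the SDK.

```python
# Assumed USD prices per 1M tokens (illustrative; verify against current pricing)
PRICES_PER_MILLION = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one call under the assumed prices."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Same request size on both models
big = estimate_cost("gpt-4o", input_tokens=1_000, output_tokens=500)
small = estimate_cost("gpt-4o-mini", input_tokens=1_000, output_tokens=500)
print(f"gpt-4o: ${big:.5f}  gpt-4o-mini: ${small:.5f}  ratio: {big / small:.1f}x")
```

In a real application the token counts would come from `response.usage.prompt_tokens` and `response.usage.completion_tokens` rather than hand-picked numbers.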

Temperature, top_p, and Penalties: A Decision Guide

These parameters interact in subtle ways. Most developers either ignore them entirely or tweak them randomly. Here is a principled framework:
| Setting | Behavior | Effect | When to Use |
| --- | --- | --- | --- |
| temperature=0 | Deterministic | Near-identical outputs each run | Data extraction, classification, structured output |
| temperature=0.3-0.5 | Low creativity | Consistent but with minor variation | Code generation, summarization, factual Q&A |
| temperature=0.7-1.0 | Balanced | Good variety with reasonable coherence | Conversational chatbots, general writing |
| temperature=1.5-2.0 | High creativity | Unpredictable, sometimes incoherent | Brainstorming, creative writing experiments |
| top_p=0.1 | Very narrow | Only the most likely tokens considered | When you need extreme precision |
| top_p=0.9 | Broad | Most tokens available, rare ones excluded | General-purpose, slightly more controlled than default |
| frequency_penalty=0.5 | Reduce repetition | Penalizes tokens proportional to their count | Long-form writing that tends to repeat phrases |
| presence_penalty=0.5 | Encourage variety | Penalizes any token that has appeared | When you want the model to explore new topics |
Key rule: Change either temperature or top_p, not both at once. They shape the same token sampling distribution from different angles, so adjusting both makes the combined effect hard to predict; OpenAI’s documentation recommends altering one or the other.

Edge case: temperature=0 is not truly deterministic. OpenAI’s documentation describes determinism as “best-effort.” In practice, you will see occasional variation even at temperature=0 due to floating-point non-determinism in GPU computation. If you need reproducibility, also set seed, but even then OpenAI only promises “mostly deterministic” output. Neither setting guarantees identical results; to check whether two runs actually matched, compare their system_fingerprint values and, if you need token-level detail, their logprobs.
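
The following toy sketch shows how the two sampling controls act on a distribution. It is an illustration of the general technique (temperature-scaled softmax and nucleus filtering), not OpenAI's actual implementation; the logits are made up, and the helper names are hypothetical. Temperature must be greater than zero here; temperature=0 corresponds to always taking the highest-scoring token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# With top_p=0.5 only the single most likely token survives in this example
print(top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.5))
```

Running it shows the top token's probability shrinking as temperature rises, while top_p cuts off the tail entirely instead of reweighting it. That is why stacking both tends to over-constrain the output.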

Bonus: Responses API (2025)

The Responses API is OpenAI’s next-generation interface, designed to fix the rough edges of chat completions. The key difference: instead of managing a messages array yourself, you pass a single input and optional instructions. It also handles multi-turn conversations, tool calls, and file search natively without you managing the message flow. For new projects, prefer this over chat completions. For existing projects, there is no urgency to migrate — chat completions will continue to work.
from openai import OpenAI

client = OpenAI()

# Basic Responses API usage
response = client.responses.create(
    model="gpt-4o",
    input="Explain quantum computing simply",
    instructions="You are a helpful physics teacher",
)

print(response.output_text)

# With structured output
from pydantic import BaseModel

class Explanation(BaseModel):
    concept: str
    simple_explanation: str
    analogy: str
    difficulty_level: str

response = client.responses.create(
    model="gpt-4o",
    input="Explain neural networks",
    text={
        "format": {
            "type": "json_schema",
            "name": "explanation",
            "schema": Explanation.model_json_schema()
        }
    }
)

result = Explanation.model_validate_json(response.output_text)
print(f"Analogy: {result.analogy}")
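
For the multi-turn handling mentioned above, a minimal sketch could chain two calls with `previous_response_id`, which links a new response to an earlier one so the prior turns are not resent by hand. The helper name `ask_with_followup` is hypothetical, the client is passed in as a parameter, and the approach relies on the earlier response being stored server-side.

```python
def ask_with_followup(client, model: str, first_question: str, followup: str) -> tuple[str, str]:
    """Ask a question, then a follow-up that reuses server-side conversation state."""
    first = client.responses.create(model=model, input=first_question)
    # previous_response_id ties the new turn to the earlier one,
    # so the earlier messages do not have to be rebuilt manually.
    second = client.responses.create(
        model=model,
        input=followup,
        previous_response_id=first.id,
    )
    return first.output_text, second.output_text

# Usage (requires a configured OpenAI client):
# client = OpenAI()
# a, b = ask_with_followup(client, "gpt-4o", "What is a qubit?", "Give me an analogy.")
```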

Predicted Outputs (Speed Boost)

Predicted outputs exploit a clever optimization: when the model’s output is likely to be very similar to something you already have (like code being refactored), you pass that text as a “prediction.” The model still produces the full output, but tokens that match the prediction are processed much faster, so only the parts that differ cost normal generation time. The result can be 2-5x faster generation for edit-like tasks:
original_code = '''
def calculate_total(items):
    total = 0
    for item in items:
        total += item.price
    return total
'''

# We predict the output will be similar to input
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Refactor this function to use sum()"},
        {"role": "user", "content": original_code}
    ],
    prediction={
        "type": "content",
        "content": original_code  # Model uses this as a starting point
    }
)

# Faster when most of the output matches the prediction
print(response.choices[0].message.content)
When to use Predicted Outputs: Code editing, document revisions, template filling—any time the output is structurally similar to something you already have.
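
To check whether a prediction is actually paying off, one option is to look at how many predicted tokens were accepted versus rejected. The sketch below assumes the usage object exposes `completion_tokens_details.accepted_prediction_tokens` and `rejected_prediction_tokens`; the helper name and the numbers in the stand-in object are made up for illustration.

```python
from types import SimpleNamespace

def prediction_acceptance_rate(usage) -> float:
    """Share of predicted tokens the model accepted (0.0 if none were reported)."""
    details = usage.completion_tokens_details
    accepted = details.accepted_prediction_tokens or 0
    rejected = details.rejected_prediction_tokens or 0
    total = accepted + rejected
    return accepted / total if total else 0.0

# Stand-in for response.usage, with made-up numbers for illustration
fake_usage = SimpleNamespace(
    completion_tokens_details=SimpleNamespace(
        accepted_prediction_tokens=30,
        rejected_prediction_tokens=10,
    )
)
print(f"{prediction_acceptance_rate(fake_usage):.0%} of predicted tokens accepted")
```

A low acceptance rate suggests the prediction diverges too much from the real output. Rejected prediction tokens are typically still billed, so in that case the prediction may add cost without adding speed.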

What’s Next

Vector Databases

Store embeddings at scale with pgvector and Pinecone for semantic search

Interview Deep-Dive

Strong Answer:
  • The architecture has three layers: structured extraction, function calling, and error handling. I would use structured outputs with a strict JSON schema for the extraction step, function calling for the API interaction, and a retry wrapper around the entire flow.
  • For extraction, I would define Pydantic models that represent the business entities — say, an OrderIntent with fields like action (enum: track, return, cancel), order_id (optional string), and reason (optional string). I would use response_format with json_schema and strict: True to guarantee the output matches. This is critical because without strict mode, you get “usually valid” JSON, and “usually” is not good enough when a parse failure crashes your webhook handler at 3am.
  • For function calling, I would define tools for each internal API (lookup_order, initiate_return, etc.) with tight parameter schemas. The key design decision: never let the model construct free-form API calls. Every parameter should be constrained — enums for status values, regex patterns for IDs, explicit required fields. The model decides WHICH function to call and with what arguments; my code validates and executes.
  • The failure modes I would design around: (1) The model hallucinates a function that does not exist — handle with a strict whitelist check before execution. (2) The model extracts a plausible-looking but invalid order_id — validate against the database before processing. (3) Rate limits during high-traffic periods — implement exponential backoff with jitter, and a circuit breaker that falls back to a human agent queue after 3 failed retries. (4) The model returns valid JSON but semantically wrong data (extracts the wrong order_id from a message mentioning multiple orders) — add a confirmation step for high-stakes actions like cancellations.
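
Two of the safeguards above, the whitelist check and exponential backoff with jitter, can be sketched as follows. This is a simplified illustration: `lookup_order` is a placeholder for an internal API, all helper names are hypothetical, and a production version would catch the SDK's specific rate-limit exception instead of a bare `Exception`.

```python
import json
import random
import time

def lookup_order(order_id: str) -> dict:
    """Placeholder for an internal API call (hypothetical)."""
    return {"order_id": order_id, "status": "shipped"}

# Only functions registered here can ever be executed
ALLOWED_TOOLS = {"lookup_order": lookup_order}

def execute_tool_call(name: str, raw_arguments: str) -> dict:
    """Run a model-requested tool only if it is on the whitelist."""
    if name not in ALLOWED_TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        args = json.loads(raw_arguments)
        return ALLOWED_TOOLS[name](**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or wrong/missing parameters for the function
        return {"error": f"invalid arguments: {exc}"}

def call_with_backoff(fn, max_retries: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Retry fn with exponential backoff and full jitter; re-raise after max_retries."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # the caller can fall back to the human agent queue here
            delay = random.uniform(0, base_delay * 2 ** attempt)
            sleep(delay)

print(execute_tool_call("lookup_order", '{"order_id": "A1"}'))
print(execute_tool_call("drop_tables", "{}"))
```

The `sleep` parameter is injectable so the retry logic can be unit-tested without real delays.
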
Follow-up: Your structured output extraction works 99.2% of the time in testing, but in production you are seeing a 3% failure rate on certain user messages. How do you debug this?
  • The gap between test and production is almost always input distribution. Test data is clean and well-formed; production users write in fragments, mix languages, include typos, and paste content from other apps. I would start by pulling the 3% failures and categorizing them.
  • Common culprits: (1) Messages that are too long and get truncated by max_tokens on the response side — the model starts generating the JSON but hits the token limit before closing all brackets. Fix: increase max_tokens for extraction tasks or truncate the input. (2) Messages with special characters or Unicode that confuse the tokenizer. (3) Ambiguous messages where the model cannot confidently fill a required field and produces a schema violation trying to leave it blank.
  • I would add logging that captures the raw input, the model’s raw output, and the Pydantic validation error for every failure. Then I would batch the failure cases into categories, create regression tests for each category, and either adjust the prompt to handle edge cases or add pre-processing to normalize the input before it hits the model.
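
As a starting point for that categorization, a coarse classifier over logged failures might look like the sketch below. The category names and the function are hypothetical; `finish_reason` is the value reported on each chat completion choice, where "length" indicates the output hit the token limit.

```python
import json

def classify_extraction_failure(raw_output: str, finish_reason: str) -> str:
    """Assign a coarse category to a failed extraction, for log aggregation."""
    if finish_reason == "length":
        return "truncated"          # hit the token limit before the JSON closed
    if finish_reason == "content_filter":
        return "filtered"
    try:
        json.loads(raw_output)
    except json.JSONDecodeError:
        return "invalid_json"
    return "schema_or_semantic"     # parses as JSON, so the problem is in the fields

# Made-up failure samples: (raw model output, finish_reason)
failures = [
    ('{"action": "track", "order_id": "12', "length"),
    ('{"action": "teleport"}', "stop"),
    ("Sorry, I cannot help with that.", "stop"),
]
for raw, reason in failures:
    print(classify_extraction_failure(raw, reason))
```

Counting failures per category quickly shows whether the fix belongs in max_tokens, the prompt, or input pre-processing.
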
Strong Answer:
  • Temperature and top_p both control randomness, but through different mechanisms. Temperature scales the logits before softmax: at 0 the model always picks the highest-probability token (deterministic), at 1 it samples from the natural distribution, and at 2 it flattens the distribution dramatically so even low-probability tokens have a decent chance. Top_p (nucleus sampling) is a different approach: it dynamically truncates the distribution to include only the smallest set of tokens whose cumulative probability exceeds p. At top_p=0.1, only the most likely tokens are considered; at top_p=1.0, all tokens are candidates.
  • The critical mistake: changing both simultaneously. They interact in non-obvious ways. If you set temperature=0.3 and top_p=0.5, you are double-constraining the distribution — the temperature already concentrated probability on top tokens, and then top_p further truncates. The result is more deterministic than either setting alone, which is usually not what you want. OpenAI’s own documentation says to change one or the other, not both.
  • Frequency penalty and presence penalty both reduce repetition, but differently. Frequency penalty scales with how many times a token has appeared — it penalizes “the” more each time “the” appears. Presence penalty is binary: it penalizes a token the same amount whether it has appeared once or ten times. Use frequency penalty (0.3-0.8) when the model gets stuck repeating phrases. Use presence penalty (0.3-0.6) when you want topic diversity — it encourages the model to explore new concepts rather than rehashing the same point.
  • My production defaults: temperature=0 for extraction and classification (determinism matters), temperature=0.7 with top_p=1.0 for conversational responses, frequency_penalty=0.3 for any task where repetition is noticeable. I almost never touch presence_penalty because it can cause the model to go off-topic.
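
The difference between the two penalties is easier to see as arithmetic on logits. The sketch below follows the form described in OpenAI's parameter documentation (subtract count times the frequency penalty, plus a flat presence penalty once a token has appeared); it uses words instead of token IDs and made-up logits, so treat it as an illustration only.

```python
from collections import Counter

def apply_penalties(logits: dict, generated_tokens: list,
                    frequency_penalty: float = 0.0, presence_penalty: float = 0.0) -> dict:
    """Lower the logit of each already-generated token before sampling."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        adjusted[token] = (
            logit
            - count * frequency_penalty                        # grows with every repetition
            - (1.0 if count > 0 else 0.0) * presence_penalty   # flat, applied once
        )
    return adjusted

logits = {"the": 2.0, "cat": 1.0, "dog": 1.0}
history = ["the", "the", "the", "cat"]
print(apply_penalties(logits, history, frequency_penalty=0.5))
print(apply_penalties(logits, history, presence_penalty=0.5))
```

With the frequency penalty, "the" (seen three times) loses three times as much as "cat" (seen once); with the presence penalty, both lose the same amount and the unseen "dog" is untouched.
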
Follow-up: You set temperature=0 for determinism, but you notice that the same prompt sometimes gives different outputs. Why, and how do you get true reproducibility?
  • Temperature=0 is “approximately deterministic” but not guaranteed. There are several sources of non-determinism: GPU floating-point operations are not associative (the order of additions changes the result at the bit level), different hardware produces slightly different rounding, and OpenAI may route your request to different GPU clusters. The seed parameter helps: responses include a system_fingerprint identifying the backend configuration, and with the same seed, the same parameters, and a matching fingerprint, outputs are mostly deterministic. But the fingerprint can change when OpenAI updates their infrastructure.
  • For true reproducibility in production, I cache the response keyed on the hash of the full request (messages + model + seed + all parameters). If the same request comes in again, return the cached response. This also saves money and reduces latency. For evaluation, I run each test case 3 times and check that all 3 outputs match; if they diverge, I flag it as a non-determinism issue and increase the test tolerance.
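
A minimal sketch of that request-hash cache is shown below. It uses an in-memory dict as a stand-in for Redis, and `request_cache_key`, `cached_completion`, and the `call_api` callable are hypothetical names introduced for illustration.

```python
import hashlib
import json

def request_cache_key(model: str, messages: list, **params) -> str:
    """Build a stable hash of everything that influences the response."""
    payload = {"model": model, "messages": messages, "params": params}
    # sort_keys makes the serialization independent of dict ordering
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

_cache: dict = {}  # stand-in for Redis or another shared store

def cached_completion(call_api, model: str, messages: list, **params) -> str:
    """Return a cached response if this exact request was seen before."""
    key = request_cache_key(model, messages, **params)
    if key not in _cache:
        _cache[key] = call_api(model=model, messages=messages, **params)
    return _cache[key]
```

Because every parameter is part of the key, changing the temperature or seed produces a new cache entry instead of silently returning a response generated under different settings.
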
Strong Answer:
  • At 500K calls/month, cost optimization is not about saving pennies — it is likely a $5K-50K/month line item depending on model mix. The first step is instrumentation: log every API call with model, input tokens, output tokens, task type, and calculated cost. Without this data, you are optimizing blind.
  • The biggest lever is model routing. I would categorize every API call by task: classification, extraction, summarization, generation, reasoning. Then I would benchmark GPT-4o-mini against GPT-4o for each category using our actual prompts and a labeled evaluation set. In my experience, GPT-4o-mini handles classification and extraction at 95%+ of GPT-4o accuracy at 6% of the cost. That alone, if 60% of our calls are simple tasks of similar size, cuts our bill by more than half.
  • The second lever is token count, starting with output length. Output tokens cost 2-4x more than input tokens. A verbose system prompt is comparatively cheap (input tokens are billed at the lower rate, and repeated prompt prefixes can benefit from prompt caching), but a verbose response is expensive on every single call. I would audit our prompts: add max_tokens limits to every call (prevents runaway responses), add explicit length constraints in the system prompt (“respond in 2-3 sentences”), and remove any unnecessary context from the messages array.
  • The third lever is caching. If the same question gets asked repeatedly (common in customer support), cache the response keyed on a hash of the messages. Even a 10% cache hit rate on 500K calls saves 50K API calls per month. Use Redis with a TTL of 1-24 hours depending on how dynamic your data is.
  • The fourth lever is batching. For non-real-time workloads (nightly summarization, weekly report generation), use the Batch API which offers a 50% discount in exchange for up to 24-hour turnaround.
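
The routing estimate can be checked with a few lines of arithmetic. The sketch assumes every call costs about the same on the original model, which is a simplification; the function name is hypothetical.

```python
def blended_cost_fraction(share_routed: float, cheap_cost_ratio: float) -> float:
    """Fraction of the original bill left after routing a share of spend to a cheaper model."""
    return (1 - share_routed) + share_routed * cheap_cost_ratio

# 60% of calls moved to a model that costs 6% as much
remaining = blended_cost_fraction(share_routed=0.60, cheap_cost_ratio=0.06)
print(f"Remaining bill: {remaining:.1%}, savings: {1 - remaining:.1%}")

# Caching: a 10% hit rate on 500K monthly calls
calls_avoided = int(500_000 * 0.10)
print(f"Calls avoided by cache: {calls_avoided:,}")
```

Under these assumptions routing alone leaves about 44% of the original bill, so the savings land slightly above half before caching and batching are counted.
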
Follow-up: After implementing model routing and caching, your costs dropped 45% but you are getting complaints that some responses feel “dumber” since the switch. How do you investigate?
  • The routing logic is probably miscategorizing some complex queries as simple and sending them to GPT-4o-mini when they need GPT-4o. I would pull the complaints, match them to the logged API calls, and check which model served each one. If the pattern is “all complaints were served by mini,” the routing heuristic is too aggressive.
  • The fix is a quality feedback loop. Add a thumbs-up/thumbs-down to the UI, log the feedback with the request metadata, and periodically review which task types have the lowest satisfaction scores on GPT-4o-mini. Move those task types back to GPT-4o. You can also implement a “try cheap first, escalate if bad” pattern: route to GPT-4o-mini, use a lightweight quality check on the response (length, confidence score, keyword presence), and if it fails, automatically retry with GPT-4o. This adds latency for the escalated cases but keeps costs low for the majority.
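
The "try cheap first, escalate if bad" pattern might be sketched like this. The quality gate is a deliberately crude, hypothetical heuristic (length plus a few refusal phrases), and `call_cheap` and `call_strong` are placeholders for functions that wrap the two model calls.

```python
def looks_acceptable(text: str, min_chars: int = 40) -> bool:
    """Very rough quality gate: long enough and not an obvious non-answer."""
    stripped = text.strip()
    refusal_markers = ("i'm not sure", "i cannot", "i don't know")
    return len(stripped) >= min_chars and not any(m in stripped.lower() for m in refusal_markers)

def answer_with_escalation(prompt: str, call_cheap, call_strong, is_ok=looks_acceptable):
    """Try the cheap model first; retry with the strong model if the check fails."""
    draft = call_cheap(prompt)
    if is_ok(draft):
        return draft, "cheap"
    return call_strong(prompt), "strong"

# Usage with stand-in callables instead of real API calls
answer, tier = answer_with_escalation(
    "How long do refunds take?",
    call_cheap=lambda p: "I'm not sure.",
    call_strong=lambda p: "Refunds are processed within 5-7 business days after approval.",
)
print(tier, "->", answer)
```

Returning the tier alongside the answer makes it easy to log how often escalation happens, which feeds back into tuning the routing rules.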