Updated December 2025: Now covers GPT-4.5, Claude 3.5 Opus, Gemini 2.0 Flash, and the latest multimodal capabilities.

Why This Matters

Most developers use LLMs like magic boxes. They copy-paste prompts from Twitter and pray they work. You’ll be different. Understanding how LLMs work lets you:
  • Debug when outputs are wrong
  • Optimize costs (save 10x on API bills)
  • Design prompts that actually work
  • Know when to use which model
Real Talk: Companies waste thousands on AI because developers don’t understand token economics. After this module, you won’t.

The Core Mental Model

LLMs are next-token predictors. That’s it. Everything else is a consequence of this simple idea.
Input: "The capital of France is"
Model thinks: What token is most likely next?
Output: " Paris" (with high probability)
They don’t “know” facts. They predict what text is likely to follow based on patterns in training data.
This explains hallucinations: If “The CEO of Apple is Steve Jobs” appeared often in training data, the model might predict “Steve Jobs” even though Tim Cook is the current CEO. It predicts likely text, not true text.
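You can peek at this prediction step directly: the Chat Completions API can return log probabilities for the top candidate tokens. A minimal sketch (the exact candidates and probabilities you get back will vary):
import math
from openai import OpenAI

client = OpenAI()

# Ask for a single token and inspect the model's top candidates for it
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

# Each candidate comes back as a log probability; exp() converts it to a probability
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(f"{candidate.token!r}: {math.exp(candidate.logprob):.2%}")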

2025 Model Landscape

Understanding the current model landscape helps you make the right choices for your applications.

Model Comparison (December 2025)

| Model | Best For | Context | Speed | Cost |
| --- | --- | --- | --- | --- |
| GPT-4.5 | Complex reasoning, research | 128K | Slow | $$$$$ |
| GPT-4o | General purpose, balanced | 128K | Fast | $$ |
| GPT-4o-mini | Simple tasks, high volume | 128K | Very Fast | $ |
| o1 | Math, coding, deep reasoning | 200K | Very Slow | $$$$ |
| o1-mini | Quick reasoning tasks | 128K | Slow | $$ |
| Claude 3.5 Sonnet | Coding, long context | 200K | Fast | $$ |
| Claude 3.5 Opus | Most capable, nuanced | 200K | Medium | $$$$ |
| Gemini 2.0 Flash | Speed, multimodal, cheap | 1M | Very Fast | $ |
| Gemini 1.5 Pro | Long context analysis | 2M | Medium | $$ |

When to Use What

def choose_model(task: str, priority: str = "balanced") -> str:
    """Simple model selection guide"""
    
    recommendations = {
        # Task-based
        "simple_qa": "gpt-4o-mini",
        "complex_reasoning": "o1",
        "coding": "claude-3-5-sonnet",
        "long_document": "gemini-1.5-pro",
        "real_time_chat": "gpt-4o",
        "vision_analysis": "gpt-4o",
        "math_problems": "o1-mini",
        "creative_writing": "claude-3-5-opus",
        
        # Priority-based fallbacks
        "cheapest": "gemini-2.0-flash",
        "fastest": "gpt-4o-mini",
        "smartest": "o1",
        "balanced": "gpt-4o",
    }
    
    return recommendations.get(task, recommendations[priority])
Pro Tip: Start with gpt-4o-mini for development and testing. It’s roughly 17x cheaper than gpt-4o and fast enough to iterate quickly. Switch to a more capable model only when needed.

Tokenization: The Foundation

What is a Token?

Before we dive into code, let’s understand what a token actually is. A token is the smallest unit of text that an LLM processes. Think of it like this:
  • Humans read words
  • Computers process bytes
  • LLMs understand tokens
Tokens are created by breaking text into chunks based on common patterns the model learned during training. A token can be:
  • A whole word: "hello" → 1 token
  • Part of a word: "unhappiness" → ["un", "happiness"] → 2 tokens
  • A single character: "🎉" → 1 token
  • Punctuation: "," → 1 token
Why not just use words? Because:
  1. New words appear all the time (“ChatGPT”, “blockchain”)
  2. Different languages have different word structures
  3. Code has symbols and syntax that aren’t “words”
  4. Tokens allow the model to handle ANY text, even typos
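You can see these splits for yourself with tiktoken. A quick sketch (exact pieces depend on the tokenizer; o200k_base is the encoding used by gpt-4o, and emoji may decode as partial bytes):
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the encoding used by gpt-4o

for text in ["hello", "unhappiness", "ChatGPT", "🎉"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} → {pieces} ({len(token_ids)} tokens)")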

Why Tokens Matter

Now that you know what tokens are, here’s why they’re critical:
  • Pricing: You pay per token, not per word
  • Context limits: 128K tokens, not 128K words
  • Output quality: Some words are multiple tokens, affecting generation
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# English is efficient (~4 chars per token)
english = "Hello, how are you today?"
print(f"English: {len(english)} chars → {len(enc.encode(english))} tokens")
# English: 25 chars → 6 tokens

# Code is less efficient
code = "def calculate_average(numbers: list[float]) -> float:"
print(f"Code: {len(code)} chars → {len(enc.encode(code))} tokens")
# Code: 54 chars → 15 tokens

# Non-English is expensive
arabic = "مرحبا كيف حالك"
print(f"Arabic: {len(arabic)} chars → {len(enc.encode(arabic))} tokens")
# Arabic: 14 chars → 12 tokens (almost 1 token per char!)

# Numbers are weird
numbers = "1234567890"
print(f"Numbers: {len(numbers)} chars → {len(enc.encode(numbers))} tokens")
# Numbers: 10 chars → 3 tokens
Why the difference?
  • English: GPT models were trained primarily on English text, so common English words are single tokens
  • Code: Special characters like :, [, ], -> each become separate tokens
  • Arabic: Less represented in training data, so the tokenizer breaks it into smaller pieces
  • Numbers: Tokenized in chunks (e.g., “1234” might be one token, “567890” another)
Cost Impact: If you’re building an app for Arabic users, your API costs could be 3-4x higher than for English users with the same amount of text!
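To see how that plays out in dollars, count tokens for roughly the same message in each language and multiply by a per-token price (the gpt-4o input price from the table below is used as an example; the Arabic text is an illustrative translation):
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
price_per_million_input = 2.50  # example: gpt-4o input price per million tokens

# Roughly the same greeting in both languages (illustrative)
samples = {
    "English": "Hello, how are you today?",
    "Arabic": "مرحبا كيف حالك اليوم؟",
}

for label, text in samples.items():
    tokens = len(enc.encode(text))
    cost = tokens / 1_000_000 * price_per_million_input
    print(f"{label}: {tokens:>2} tokens → ${cost:.8f} per message")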

Token Economics Calculator

Now that you understand tokens affect pricing, let’s build a calculator to estimate costs before making API calls. This is critical for:
  • Budgeting your application
  • Choosing the right model for your use case
  • Avoiding surprise bills
import tiktoken
from dataclasses import dataclass

@dataclass
class ModelPricing:
    name: str
    input_per_million: float
    output_per_million: float
    context_window: int

# Updated December 2025 pricing
MODELS = {
    # OpenAI Models
    "gpt-4.5": ModelPricing("gpt-4.5", 75.00, 150.00, 128000),
    "gpt-4o": ModelPricing("gpt-4o", 2.50, 10.00, 128000),
    "gpt-4o-mini": ModelPricing("gpt-4o-mini", 0.15, 0.60, 128000),
    "o1": ModelPricing("o1", 15.00, 60.00, 200000),
    "o1-mini": ModelPricing("o1-mini", 3.00, 12.00, 128000),
    # Anthropic Models
    "claude-3-5-sonnet": ModelPricing("claude-3-5-sonnet", 3.00, 15.00, 200000),
    "claude-3-5-haiku": ModelPricing("claude-3-5-haiku", 0.80, 4.00, 200000),
    "claude-3-opus": ModelPricing("claude-3-opus", 15.00, 75.00, 200000),
    # Google Models
    "gemini-2.0-flash": ModelPricing("gemini-2.0-flash", 0.10, 0.40, 1000000),
    "gemini-1.5-pro": ModelPricing("gemini-1.5-pro", 1.25, 5.00, 2000000),
}

def estimate_cost(
    prompt: str,
    expected_output_tokens: int = 500,
    model: str = "gpt-4o"
) -> dict:
    """Estimate API call cost"""
    enc = tiktoken.encoding_for_model("gpt-4")  # Close enough for estimation
    input_tokens = len(enc.encode(prompt))
    
    pricing = MODELS[model]
    
    input_cost = (input_tokens / 1_000_000) * pricing.input_per_million
    output_cost = (expected_output_tokens / 1_000_000) * pricing.output_per_million
    
    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "input_cost": f"${input_cost:.6f}",
        "output_cost": f"${output_cost:.6f}",
        "total_cost": f"${input_cost + output_cost:.6f}",
        "context_used": f"{(input_tokens / pricing.context_window) * 100:.1f}%"
    }

# Compare costs across models
prompt = "Explain quantum computing in detail with examples..."
for model in ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"]:
    print(estimate_cost(prompt, model=model))

The Token Limit Trap

Every model has a context window (e.g., 128K tokens for GPT-4o). But what happens when your input is too large? You have two options:
  1. Truncate (cut off text) - but you might lose important information
  2. Summarize (compress the content) - but this costs extra API calls
Here’s a smart truncation strategy that preserves the most important parts:
def smart_truncate(text: str, max_tokens: int = 4000, model: str = "gpt-4") -> str:
    """Truncate text to fit token limit while preserving meaning"""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    
    if len(tokens) <= max_tokens:
        return text
    
    # Keep first 80% and last 20% to preserve context
    keep_start = int(max_tokens * 0.8)
    keep_end = max_tokens - keep_start
    
    truncated_tokens = tokens[:keep_start] + tokens[-keep_end:]
    
    return enc.decode(truncated_tokens)

Embeddings: Semantic Understanding

Why Embeddings?

Imagine you’re building a search feature. A user searches for “python programming”. Traditional keyword search handles this badly:
  • “coding in Python” (relevant, but phrased differently from the query)
  • “Python development” (relevant, but phrased differently)
  • “pythons are large snakes” ← a keyword match, but the wrong meaning!
Embeddings solve this by converting text into numbers (vectors) that capture meaning, not just keywords. Use cases:
  • Semantic search: Find documents by meaning, not just keywords
  • Recommendations: “Users who liked X also liked Y”
  • Clustering: Group similar documents together
  • Classification: Categorize text by meaning
Embeddings convert text into vectors where similar meanings are close together in high-dimensional space.

The Intuition

Notice how the relationship direction is consistent: King - Man + Woman ≈ Queen. This vector arithmetic allows the model to capture analogies and relationships.
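The clean analogy arithmetic comes from classic word-vector models like word2vec; modern sentence-embedding APIs won’t reproduce it exactly, but a rough sketch of the idea (using text-embedding-3-small as an example) looks like this:
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king, man, woman = embed("king"), embed("man"), embed("woman")
analogy = king - man + woman

# The analogy vector should land closer to "queen" than to an unrelated word
print(f"similarity to 'queen':  {cosine(analogy, embed('queen')):.3f}")
print(f"similarity to 'banana': {cosine(analogy, embed('banana')):.3f}")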

Practical Embedding System

from openai import OpenAI
import numpy as np
from typing import List
import json

client = OpenAI()

class EmbeddingCache:
    """Cache embeddings to avoid repeated API calls"""
    
    def __init__(self, cache_file: str = "embeddings_cache.json"):
        self.cache_file = cache_file
        self.cache = self._load_cache()
    
    def _load_cache(self) -> dict:
        try:
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
    
    def _save_cache(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)
    
    def get_embedding(self, text: str, model: str = "text-embedding-3-small") -> List[float]:
        cache_key = f"{model}:{text[:100]}"  # prefix as key (texts sharing their first 100 chars will collide)
        
        if cache_key in self.cache:
            return self.cache[cache_key]
        
        response = client.embeddings.create(model=model, input=text)
        embedding = response.data[0].embedding
        
        self.cache[cache_key] = embedding
        self._save_cache()
        
        return embedding
    
    def get_embeddings_batch(self, texts: List[str], model: str = "text-embedding-3-small") -> List[List[float]]:
        """Batch embed for efficiency (up to 2048 texts per call)"""
        # Check cache first
        uncached = [(i, t) for i, t in enumerate(texts) if f"{model}:{t[:100]}" not in self.cache]
        
        if uncached:
            indices, uncached_texts = zip(*uncached)
            response = client.embeddings.create(model=model, input=list(uncached_texts))
            
            for i, emb_data in zip(indices, response.data):
                cache_key = f"{model}:{texts[i][:100]}"
                self.cache[cache_key] = emb_data.embedding
            
            self._save_cache()
        
        return [self.cache[f"{model}:{t[:100]}"] for t in texts]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two vectors"""
    # Cosine similarity measures the angle between two vectors
    # Returns 1.0 for identical, 0.0 for unrelated, -1.0 for opposite
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def find_most_similar(query: str, documents: List[str], top_k: int = 3) -> List[tuple]:
    """Find most similar documents to query"""
    cache = EmbeddingCache()
    
    query_emb = cache.get_embedding(query)
    doc_embs = cache.get_embeddings_batch(documents)
    
    similarities = [
        (doc, cosine_similarity(query_emb, doc_emb))
        for doc, doc_emb in zip(documents, doc_embs)
    ]
    
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_k]


# Example usage
documents = [
    "Python is a programming language",
    "JavaScript runs in browsers",
    "Machine learning uses neural networks",
    "Snakes are reptiles that slither",
    "The stock market closed higher today"
]

results = find_most_similar("coding languages", documents)
for doc, score in results:
    print(f"{score:.3f}: {doc}")
# 0.847: Python is a programming language
# 0.812: JavaScript runs in browsers
# 0.623: Machine learning uses neural networks
Why cache embeddings? Each embedding API call costs money and takes time. Since the same text always produces the same embedding, caching can save you 90%+ on costs for repeated queries.
When to use embeddings vs. keyword search:
  • Use embeddings: When meaning matters (“cheap flights” = “affordable airfare”)
  • Use keywords: When exact matches matter (error codes, product SKUs)
  • Use both: Hybrid search often works best
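A minimal hybrid sketch just blends the two scores. This reuses the EmbeddingCache and cosine_similarity helpers above; the 0.7/0.3 weighting is an assumption you would tune for your data:
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query words that literally appear in the document (very naive)."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)

def hybrid_search(query: str, documents: List[str], alpha: float = 0.7, top_k: int = 3) -> List[tuple]:
    """Rank documents by a weighted mix of semantic and keyword similarity."""
    cache = EmbeddingCache()
    query_emb = cache.get_embedding(query)
    doc_embs = cache.get_embeddings_batch(documents)

    scored = [
        (doc, alpha * cosine_similarity(query_emb, doc_emb) + (1 - alpha) * keyword_score(query, doc))
        for doc, doc_emb in zip(documents, doc_embs)
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Same documents as above; exact keyword hits get a small boost on top of semantic similarity
for doc, score in hybrid_search("Python programming", documents):
    print(f"{score:.3f}: {doc}")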

Temperature & Sampling

What is Sampling?

Remember: LLMs predict the most likely next token. But if they always picked the #1 choice, outputs would be boring and repetitive. Sampling means randomly choosing from the top candidates based on their probabilities. This adds variety while still favoring likely tokens. Think of it like a weighted lottery:
  • “Paris” has 85 tickets
  • “Lyon” has 5 tickets
  • “a” has 3 tickets
  • “the” has 2 tickets
Most of the time you’ll draw “Paris”, but occasionally you’ll get something else.

How Sampling Works

When the model predicts the next token, it outputs probabilities for all possible tokens:
"The capital of France is" → 
  " Paris": 0.85
  " Lyon": 0.05
  " a": 0.03
  " the": 0.02
  ...
Temperature controls how these probabilities are used. Think of it as a “creativity dial”:
  • Low temperature (0-0.3): Conservative, picks the most likely tokens → Predictable output
  • Medium temperature (0.7-1.0): Balanced, some variety → Natural conversation
  • High temperature (1.5+): Wild, picks unlikely tokens → Creative but potentially nonsensical
What are logits? Raw scores before converting to probabilities. Temperature scales these scores before the conversion.
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    """Demonstrate temperature sampling"""
    # Temperature 0 means greedy decoding: always pick the single most likely token
    if temperature == 0:
        return int(np.argmax(logits))

    # Apply temperature (lower = more confident, higher = more random)
    scaled_logits = logits / temperature

    # Convert to probabilities (softmax; subtract the max for numerical stability)
    scaled_logits = scaled_logits - np.max(scaled_logits)
    probs = np.exp(scaled_logits) / np.sum(np.exp(scaled_logits))

    # Sample one token index according to those probabilities
    return int(np.random.choice(len(probs), p=probs))

# Temperature 0: Always pick highest probability (deterministic)
# Temperature 0.5: Slightly random, but still favors likely tokens
# Temperature 1.0: Sample according to original probabilities
# Temperature 2.0: More random, even unlikely tokens have a chance

When to Use What

| Task | Temperature | Why |
| --- | --- | --- |
| Code generation | 0 | Deterministic, reproducible |
| Factual Q&A | 0-0.3 | Minimize hallucination |
| General chat | 0.7 | Natural variation |
| Creative writing | 0.9-1.2 | Unexpected combinations |
| Brainstorming | 1.0-1.5 | Explore diverse ideas |
from openai import OpenAI

client = OpenAI()

def generate(prompt: str, task_type: str = "general") -> str:
    """Generate with appropriate temperature for task"""
    temp_map = {
        "code": 0,
        "factual": 0.2,
        "general": 0.7,
        "creative": 1.0,
        "brainstorm": 1.3
    }
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp_map.get(task_type, 0.7)
    )
    
    return response.choices[0].message.content

The Attention Mechanism

The Problem Attention Solves

Imagine reading a long document and trying to remember every single word equally. Impossible, right? You naturally focus on important parts and skim over others. LLMs had the same problem. Early models (RNNs) processed text sequentially, treating all words equally. They struggled with:
  • Long-range dependencies (“The cat, which was sitting on the mat that my grandmother gave me, was hungry”)
  • Understanding which words relate to each other
Attention lets the model decide which words to focus on when processing each token, like highlighting the important parts of a document. This mechanism is what makes Transformers special. Consider the sentence:
“The animal didn’t cross the street because it was too tired.”
To understand what “it” refers to, the model must pay attention to “animal” and ignore “street”. Without attention, the model wouldn’t know whether “it” referred to the animal or the street, which makes accurate translation and comprehension far harder.
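To make the mechanism concrete, here is a toy single-head scaled dot-product attention computation in NumPy. The random vectors stand in for token embeddings; real models use learned query/key/value projections and many attention heads:
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Toy attention: softmax(Q @ K.T / sqrt(d)) @ V"""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # how strongly each token attends to every other token
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights                    # blended values + the (n × n) attention matrix

# 6 tokens with 8-dimensional embeddings (random stand-ins)
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
output, weights = scaled_dot_product_attention(x, x, x)

print(weights.shape)        # (6, 6): this n × n matrix is where the O(n²) cost comes from
print(weights[2].round(2))  # how the third token spreads its attention over all six tokens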

Why Context Window Matters

Quadratic complexity means the computational cost grows with the square of the input length.
# GPT-4o has 128K context ≈ 96K words ≈ 300 pages

# But attention has quadratic complexity: O(n²)
# This means every token must "look at" every other token
# 128K tokens = 128K × 128K = 16 billion attention computations

# This is why:
# 1. Long contexts are slower
# 2. Long contexts cost more
# 3. Information at the start/end is remembered better than middle

The “Lost in the Middle” Problem

Research finding (Liu et al., 2023): LLMs are best at recalling information at the start and end of their context window. Information in the middle gets “lost”. This is similar to human memory - you remember the first and last items in a list better than the middle ones. Practical implication: When feeding multiple documents to an LLM, put the most important ones at the beginning and end:
def structure_for_attention(docs: list[str], query: str) -> str:
    """Structure documents to avoid 'lost in the middle' problem"""
    # Put the most relevant docs at the START and END,
    # and the least relevant docs in the MIDDLE
    
    # Note: rank_by_relevance would use embeddings to score relevance
    # (implementation depends on your use case); assume most relevant first
    ranked = rank_by_relevance(docs, query)
    
    # Alternate docs between the front and the back so the least
    # relevant ones end up buried in the middle
    front, back = [], []
    for i, doc in enumerate(ranked):
        if i % 2 == 0:
            front.append(doc)    # ranks 1, 3, 5, ... filling in from the start
        else:
            back.insert(0, doc)  # ranks 2, 4, 6, ... filling in from the end
    
    return "\n\n".join(front + back)

Prompt Engineering That Works

Why Prompt Structure Matters

You wouldn’t send an email that just says “help” and expect a useful response. Same with LLMs: structure and context dramatically improve output quality.
Bad prompt: “Write a blog post about AI”
Good prompt: specific context, clear objective, defined format
Why structured prompts work better:
  1. Reduces ambiguity - LLM doesn’t have to guess what you want
  2. Provides context - Like briefing a colleague before asking for help
  3. Sets expectations - Defines tone, style, and format upfront
  4. Improves consistency - Same structure = similar quality outputs

The COSTAR Framework

def build_prompt_costar(
    context: str,
    objective: str,
    style: str,
    tone: str,
    audience: str,
    response_format: str
) -> str:
    """
    COSTAR framework for structured prompts
    
    C - Context: Background information
    O - Objective: What you want to achieve
    S - Style: Writing style (formal, casual, technical)
    T - Tone: Emotional tone (professional, friendly)
    A - Audience: Who will read this
    R - Response: Format of output
    """
    return f"""
# Context
{context}

# Objective
{objective}

# Style
Write in a {style} style.

# Tone
Maintain a {tone} tone.

# Audience
This is for {audience}.

# Response Format
{response_format}
"""

# Example
prompt = build_prompt_costar(
    context="We're launching a new AI code review tool for developers.",
    objective="Write a product announcement for our blog.",
    style="technical but accessible",
    tone="excited but professional",
    audience="software developers who use GitHub",
    response_format="Blog post with headline, 3-4 paragraphs, and a call to action."
)

Few-Shot Prompting That Scales

The idea: Show the LLM examples of what you want, then ask it to do the same for new input. When to use:
  • Few-shot (2-5 examples): When you have examples and want consistent formatting
  • Zero-shot (no examples): When the task is simple or you want creative freedom
  • Many-shot (10+ examples): When you need very specific behavior (but watch token costs!)
Why it works: LLMs are pattern matchers. Examples show the pattern more clearly than descriptions.
def few_shot_prompt(
    task_description: str,
    examples: list[dict],  # [{"input": ..., "output": ...}]
    input_text: str
) -> str:
    """Build a few-shot prompt with examples"""
    prompt = f"{task_description}\n\n"
    
    for i, ex in enumerate(examples, 1):
        prompt += f"Example {i}:\n"
        prompt += f"Input: {ex['input']}\n"
        prompt += f"Output: {ex['output']}\n\n"
    
    prompt += f"Now process this:\n"
    prompt += f"Input: {input_text}\n"
    prompt += f"Output:"
    
    return prompt

# Example: Sentiment analysis
examples = [
    {"input": "This product is amazing!", "output": "positive"},
    {"input": "Worst purchase ever.", "output": "negative"},
    {"input": "It's okay, nothing special.", "output": "neutral"},
]

prompt = few_shot_prompt(
    "Classify the sentiment of the text as positive, negative, or neutral.",
    examples,
    "The quality exceeded my expectations!"
)

Chain of Thought (CoT) for Complex Reasoning

Research finding (Wei et al., 2022): LLMs perform significantly better on complex tasks when asked to “think step by step”. Why it works: Breaking down reasoning into steps helps the model:
  1. Avoid jumping to conclusions
  2. Show its work (so you can debug)
  3. Handle multi-step logic
  4. Reduce errors on math and reasoning tasks
When to use CoT:
  • Math problems
  • Logic puzzles
  • Multi-step reasoning
  • Code debugging
  • Planning tasks
def cot_prompt(question: str) -> str:
    """Force step-by-step reasoning"""
    return f"""
{question}

Let's solve this step by step:
1. First, I'll identify what we know
2. Then, I'll figure out what we need to find
3. Next, I'll work through the logic
4. Finally, I'll state the answer

Step 1:"""


Production Patterns

Why These Patterns Matter

In development, you can retry manually when an API call fails. In production with thousands of users, you need automatic handling for:
  • Rate limits - APIs have request limits per minute
  • Transient errors - Network blips, server restarts
  • Cost optimization - Avoid redundant calls

Retry with Exponential Backoff

The problem: APIs rate-limit you (e.g., 3,500 requests/minute for GPT-4). If you hit the limit, requests fail. The solution: Wait and retry, with increasing delays (1s, 2s, 4s, 8s…). This gives the rate limit time to reset.
import time
from openai import OpenAI, RateLimitError, APIError
from functools import wraps

def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Decorator for automatic retry with exponential backoff"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    print(f"Rate limited. Waiting {delay}s...")
                    time.sleep(delay)
                except APIError as e:
                    if attempt == max_retries - 1:
                        raise
                    print(f"API error: {e}. Retrying...")
                    time.sleep(base_delay)
            return None
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def call_llm(prompt: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

Response Caching

The problem: If 100 users ask “What is Python?”, you’re paying for 100 identical API calls. The solution: Cache responses keyed by the request. The first user pays; the next 99 get instant, free responses. When to cache:
  • ✅ Factual questions with stable answers
  • ✅ Common queries (FAQs, documentation)
  • ✅ Expensive prompts with long context
When NOT to cache:
  • ❌ Personalized responses (different per user)
  • ❌ Time-sensitive data (stock prices, news)
  • ❌ Creative content (you want variety)
import hashlib
import json
from pathlib import Path
from openai import OpenAI

class LLMCache:
    """Cache LLM responses to disk"""
    
    def __init__(self, cache_dir: str = ".llm_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
    
    def _hash_request(self, model: str, messages: list, **kwargs) -> str:
        """Create unique hash for request"""
        key_data = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
        return hashlib.md5(key_data.encode()).hexdigest()
    
    def get(self, model: str, messages: list, **kwargs) -> str | None:
        """Get cached response if exists"""
        cache_key = self._hash_request(model, messages, **kwargs)
        cache_file = self.cache_dir / f"{cache_key}.json"
        
        if cache_file.exists():
            with open(cache_file) as f:
                return json.load(f)["response"]
        return None
    
    def set(self, model: str, messages: list, response: str, **kwargs):
        """Cache a response"""
        cache_key = self._hash_request(model, messages, **kwargs)
        cache_file = self.cache_dir / f"{cache_key}.json"
        
        with open(cache_file, 'w') as f:
            json.dump({"response": response}, f)

def cached_llm_call(prompt: str, model: str = "gpt-4o-mini", use_cache: bool = True) -> str:
    """LLM call with caching"""
    cache = LLMCache()
    messages = [{"role": "user", "content": prompt}]
    
    if use_cache:
        cached = cache.get(model, messages)
        if cached:
            return cached
    
    client = OpenAI()
    response = client.chat.completions.create(model=model, messages=messages)
    result = response.choices[0].message.content
    
    if use_cache:
        cache.set(model, messages, result)
    
    return result

Mini-Project: Cost-Aware Chat Application

Let’s build a complete chat app that tracks costs and automatically optimizes token usage. This demonstrates:
  • Token counting and cost calculation
  • Budget enforcement
  • Automatic conversation summarization when approaching limits
  • Production-ready error handling
Key design decisions:
  1. Why summarize at 10K tokens? Leaves room for the response while staying well under the 128K limit
  2. Why keep last 5 messages? Recent context is most important for coherent conversation
  3. Why track per-message tokens? Enables smart decisions about what to keep vs. summarize
Here’s the full implementation:
from openai import OpenAI
from dataclasses import dataclass, field
from typing import List
import tiktoken

@dataclass
class Message:
    role: str
    content: str
    tokens: int = 0

@dataclass
class ConversationStats:
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_cost: float = 0.0
    message_count: int = 0

class CostAwareChat:
    """Chat application with cost tracking and optimization"""
    
    def __init__(self, model: str = "gpt-4o-mini", budget_limit: float = 1.0):
        self.client = OpenAI()
        self.model = model
        self.budget_limit = budget_limit
        self.messages: List[Message] = []
        self.stats = ConversationStats()
        self.encoder = tiktoken.encoding_for_model("gpt-4")
        
        # Pricing per million tokens
        self.pricing = {
            "gpt-4o": {"input": 2.50, "output": 10.00},
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        }
    
    def _count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))
    
    def _calculate_cost(self, input_tokens: int, output_tokens: int) -> float:
        pricing = self.pricing[self.model]
        return (input_tokens / 1_000_000) * pricing["input"] + \
               (output_tokens / 1_000_000) * pricing["output"]
    
    def _build_messages(self) -> list:
        """Build message list, potentially summarizing old messages"""
        total_tokens = sum(m.tokens for m in self.messages)
        
        # If approaching context limit, summarize old messages
        if total_tokens > 10000:
            return self._summarize_and_build()
        
        return [{"role": m.role, "content": m.content} for m in self.messages]
    
    def _summarize_and_build(self) -> list:
        """Summarize conversation history to save tokens"""
        # Keep last 5 messages, summarize the rest
        recent = self.messages[-5:]
        old = self.messages[:-5]
        
        if old:
            old_text = "\n".join([f"{m.role}: {m.content}" for m in old])
            summary_response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{
                    "role": "user",
                    "content": f"Summarize this conversation in 2-3 sentences:\n{old_text}"
                }],
                max_tokens=200
            )
            summary = summary_response.choices[0].message.content
            
            return [
                {"role": "system", "content": f"Previous conversation summary: {summary}"},
                *[{"role": m.role, "content": m.content} for m in recent]
            ]
        
        return [{"role": m.role, "content": m.content} for m in recent]
    
    def chat(self, user_message: str) -> str:
        """Send message and get response with cost tracking"""
        # Check budget
        if self.stats.total_cost >= self.budget_limit:
            return f"Budget limit of ${self.budget_limit} reached. Total spent: ${self.stats.total_cost:.4f}"
        
        # Add user message
        user_tokens = self._count_tokens(user_message)
        self.messages.append(Message("user", user_message, user_tokens))
        
        # Build and send
        messages = self._build_messages()
        input_tokens = sum(self._count_tokens(m["content"]) for m in messages)
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages
        )
        
        assistant_content = response.choices[0].message.content
        output_tokens = self._count_tokens(assistant_content)
        
        # Track stats
        self.stats.total_input_tokens += input_tokens
        self.stats.total_output_tokens += output_tokens
        self.stats.total_cost += self._calculate_cost(input_tokens, output_tokens)
        self.stats.message_count += 1
        
        # Add assistant message
        self.messages.append(Message("assistant", assistant_content, output_tokens))
        
        return assistant_content
    
    def get_stats(self) -> dict:
        """Get conversation statistics"""
        return {
            "messages": self.stats.message_count,
            "input_tokens": self.stats.total_input_tokens,
            "output_tokens": self.stats.total_output_tokens,
            "total_cost": f"${self.stats.total_cost:.6f}",
            "budget_remaining": f"${self.budget_limit - self.stats.total_cost:.6f}",
            "model": self.model
        }


# Usage
chat = CostAwareChat(model="gpt-4o-mini", budget_limit=0.50)

print(chat.chat("What is machine learning?"))
print(chat.chat("Give me an example"))
print(chat.chat("How is it different from AI?"))

print("\n--- Stats ---")
print(chat.get_stats())

Key Takeaways

Tokens = Money

Every token costs. Cache, truncate, and choose models wisely. gpt-4o-mini is 17x cheaper than gpt-4o.

Embeddings Power Search

Convert text to vectors for semantic similarity. Cache embeddings to avoid repeated costs.

Temperature = Creativity Dial

0 for deterministic (code), 0.7 for balanced, 1+ for creative. Match to your task.

Context Has Limits

128K tokens sounds like a lot, but attention degrades. Put important info at edges, not middle.

What’s Next

OpenAI API Deep Dive

Master function calling, structured outputs, streaming, and production patterns