Updated December 2025: Now covers GPT-4.5, Claude 3.5 Opus, Gemini 2.0 Flash, and the latest multimodal capabilities.

Why This Matters

Most developers use LLMs like magic boxes. They copy-paste prompts from Twitter and pray they work. You’ll be different. Understanding how LLMs work lets you:
  • Debug when outputs are wrong
  • Optimize costs (save 10x on API bills)
  • Design prompts that actually work
  • Know when to use which model
Real Talk: Companies waste thousands on AI because developers don’t understand token economics. After this module, you won’t.

The Core Mental Model

LLMs are next-token predictors. That’s it. Everything else is a consequence of this simple idea.
Input: "The capital of France is"
Model thinks: What token is most likely next?
Output: " Paris" (with high probability)
They don’t “know” facts. They predict what text is likely to follow based on patterns in training data.
This explains hallucinations: If “The CEO of Apple is Steve Jobs” appeared often in training data, the model might predict “Steve Jobs” even though Tim Cook is the current CEO. It predicts likely text, not true text.
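You can peek at this prediction step directly: the Chat Completions API can return log probabilities for the top candidate tokens. A minimal sketch (the exact candidates and probabilities you get back will vary):
import math
from openai import OpenAI

client = OpenAI()

# Ask for a single token and inspect the model's top candidates for it
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

# Each candidate comes back as a log probability; exp() converts it to a probability
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(f"{candidate.token!r}: {math.exp(candidate.logprob):.2%}")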

2025 Model Landscape

Understanding the current model landscape helps you make the right choices for your applications.

Model Comparison (December 2025)

| Model | Best For | Context | Speed | Cost |
| --- | --- | --- | --- | --- |
| GPT-4.5 | Complex reasoning, research | 128K | Slow | $$$$$ |
| GPT-4o | General purpose, balanced | 128K | Fast | $$ |
| GPT-4o-mini | Simple tasks, high volume | 128K | Very Fast | $ |
| o1 | Math, coding, deep reasoning | 200K | Very Slow | $$$$ |
| o1-mini | Quick reasoning tasks | 128K | Slow | $$ |
| Claude 3.5 Sonnet | Coding, long context | 200K | Fast | $$ |
| Claude 3.5 Opus | Most capable, nuanced | 200K | Medium | $$$$ |
| Gemini 2.0 Flash | Speed, multimodal, cheap | 1M | Very Fast | $ |
| Gemini 1.5 Pro | Long context analysis | 2M | Medium | $$ |

When to Use What

def choose_model(task: str, priority: str = "balanced") -> str:
    """Simple model selection guide"""
    
    recommendations = {
        # Task-based
        "simple_qa": "gpt-4o-mini",
        "complex_reasoning": "o1",
        "coding": "claude-3-5-sonnet",
        "long_document": "gemini-1.5-pro",
        "real_time_chat": "gpt-4o",
        "vision_analysis": "gpt-4o",
        "math_problems": "o1-mini",
        "creative_writing": "claude-3-5-opus",
        
        # Priority-based fallbacks
        "cheapest": "gemini-2.0-flash",
        "fastest": "gpt-4o-mini",
        "smartest": "o1",
        "balanced": "gpt-4o",
    }
    
    return recommendations.get(task, recommendations[priority])
Pro Tip: Start with gpt-4o-mini for development and testing. It’s roughly 17x cheaper than gpt-4o and fast enough to iterate quickly. Switch to a more capable model only when needed.

Tokenization: The Foundation

What is a Token?

Before we dive into code, let’s understand what a token actually is. A token is the smallest unit of text that an LLM processes. Think of it like this:
  • Humans read words
  • Computers process bytes
  • LLMs understand tokens
Tokens are created by breaking text into chunks based on common patterns the model learned during training. A token can be:
  • A whole word: "hello" → 1 token
  • Part of a word: "unhappiness" → ["un", "happiness"] → 2 tokens
  • A single character: "🎉" → 1 token
  • Punctuation: "," → 1 token
Why not just use words? Because:
  1. New words appear all the time (“ChatGPT”, “blockchain”)
  2. Different languages have different word structures
  3. Code has symbols and syntax that aren’t “words”
  4. Tokens allow the model to handle ANY text, even typos
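You can see these splits for yourself with tiktoken. A quick sketch (exact pieces depend on the tokenizer; o200k_base is the encoding used by gpt-4o, and emoji may decode as partial bytes):
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the encoding used by gpt-4o

for text in ["hello", "unhappiness", "ChatGPT", "🎉"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} → {pieces} ({len(token_ids)} tokens)")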

Why Tokens Matter

Now that you know what tokens are, here’s why they’re critical:
  • Pricing: You pay per token, not per word
  • Context limits: 128K tokens, not 128K words
  • Output quality: Some words are multiple tokens, affecting generation
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# English is efficient (~4 chars per token)
english = "Hello, how are you today?"
print(f"English: {len(english)} chars → {len(enc.encode(english))} tokens")
# English: 25 chars → 6 tokens

# Code is less efficient
code = "def calculate_average(numbers: list[float]) -> float:"
print(f"Code: {len(code)} chars → {len(enc.encode(code))} tokens")
# Code: 54 chars → 15 tokens

# Non-English is expensive
arabic = "مرحبا كيف حالك"
print(f"Arabic: {len(arabic)} chars → {len(enc.encode(arabic))} tokens")
# Arabic: 14 chars → 12 tokens (almost 1 token per char!)

# Numbers are weird
numbers = "1234567890"
print(f"Numbers: {len(numbers)} chars → {len(enc.encode(numbers))} tokens")
# Numbers: 10 chars → 3 tokens
Why the difference?
  • English: GPT models were trained primarily on English text, so common English words are single tokens
  • Code: Special characters like :, [, ], -> each become separate tokens
  • Arabic: Less represented in training data, so the tokenizer breaks it into smaller pieces
  • Numbers: Tokenized in chunks (e.g., “1234” might be one token, “567890” another)
Cost Impact: If you’re building an app for Arabic users, your API costs could be 3-4x higher than for English users with the same amount of text!
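To see how that plays out in dollars, count tokens for roughly the same message in each language and multiply by a per-token price (the gpt-4o input price from the table below is used as an example; the Arabic text is an illustrative translation):
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
price_per_million_input = 2.50  # example: gpt-4o input price per million tokens

# Roughly the same greeting in both languages (illustrative)
samples = {
    "English": "Hello, how are you today?",
    "Arabic": "مرحبا كيف حالك اليوم؟",
}

for label, text in samples.items():
    tokens = len(enc.encode(text))
    cost = tokens / 1_000_000 * price_per_million_input
    print(f"{label}: {tokens:>2} tokens → ${cost:.8f} per message")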

Token Economics Calculator

Now that you understand tokens affect pricing, let’s build a calculator to estimate costs before making API calls. This is critical for:
  • Budgeting your application
  • Choosing the right model for your use case
  • Avoiding surprise bills
import tiktoken
from dataclasses import dataclass

@dataclass
class ModelPricing:
    name: str
    input_per_million: float
    output_per_million: float
    context_window: int

# Updated December 2025 pricing
MODELS = {
    # OpenAI Models
    "gpt-4.5": ModelPricing("gpt-4.5", 75.00, 150.00, 128000),
    "gpt-4o": ModelPricing("gpt-4o", 2.50, 10.00, 128000),
    "gpt-4o-mini": ModelPricing("gpt-4o-mini", 0.15, 0.60, 128000),
    "o1": ModelPricing("o1", 15.00, 60.00, 200000),
    "o1-mini": ModelPricing("o1-mini", 3.00, 12.00, 128000),
    # Anthropic Models
    "claude-3-5-sonnet": ModelPricing("claude-3-5-sonnet", 3.00, 15.00, 200000),
    "claude-3-5-haiku": ModelPricing("claude-3-5-haiku", 0.80, 4.00, 200000),
    "claude-3-opus": ModelPricing("claude-3-opus", 15.00, 75.00, 200000),
    # Google Models
    "gemini-2.0-flash": ModelPricing("gemini-2.0-flash", 0.10, 0.40, 1000000),
    "gemini-1.5-pro": ModelPricing("gemini-1.5-pro", 1.25, 5.00, 2000000),
}

def estimate_cost(
    prompt: str,
    expected_output_tokens: int = 500,
    model: str = "gpt-4o"
) -> dict:
    """Estimate API call cost"""
    enc = tiktoken.encoding_for_model("gpt-4")  # Close enough for estimation
    input_tokens = len(enc.encode(prompt))
    
    pricing = MODELS[model]
    
    input_cost = (input_tokens / 1_000_000) * pricing.input_per_million
    output_cost = (expected_output_tokens / 1_000_000) * pricing.output_per_million
    
    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "input_cost": f"${input_cost:.6f}",
        "output_cost": f"${output_cost:.6f}",
        "total_cost": f"${input_cost + output_cost:.6f}",
        "context_used": f"{(input_tokens / pricing.context_window) * 100:.1f}%"
    }

# Compare costs across models
prompt = "Explain quantum computing in detail with examples..."
for model in ["gpt-4o", "gpt-4o-mini", "claude-3-5-sonnet"]:
    print(estimate_cost(prompt, model=model))

The Token Limit Trap

Every model has a context window (e.g., 128K tokens for GPT-4o). But what happens when your input is too large? You have two options:
  1. Truncate (cut off text) - but you might lose important information
  2. Summarize (compress the content) - but this costs extra API calls
Here’s a smart truncation strategy that preserves the most important parts:
def smart_truncate(text: str, max_tokens: int = 4000, model: str = "gpt-4") -> str:
    """Truncate text to fit token limit while preserving meaning"""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    
    if len(tokens) <= max_tokens:
        return text
    
    # Keep first 80% and last 20% to preserve context
    keep_start = int(max_tokens * 0.8)
    keep_end = max_tokens - keep_start
    
    truncated_tokens = tokens[:keep_start] + tokens[-keep_end:]
    
    return enc.decode(truncated_tokens)

Embeddings: Semantic Understanding

Why Embeddings?

Imagine you’re building a search feature. A user searches for “python programming”. Traditional keyword search handles this badly:
  • “coding in Python” (relevant, but phrased differently from the query)
  • “Python development” (relevant, but phrased differently)
  • “pythons are large snakes” ← a keyword match, but the wrong meaning!
Embeddings solve this by converting text into numbers (vectors) that capture meaning, not just keywords. Use cases:
  • Semantic search: Find documents by meaning, not just keywords
  • Recommendations: “Users who liked X also liked Y”
  • Clustering: Group similar documents together
  • Classification: Categorize text by meaning
Embeddings convert text into vectors where similar meanings are close together in high-dimensional space.

The Intuition

Notice how the relationship direction is consistent: King - Man + Woman ≈ Queen. This vector arithmetic allows the model to capture analogies and relationships.
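The clean analogy arithmetic comes from classic word-vector models like word2vec; modern sentence-embedding APIs won’t reproduce it exactly, but a rough sketch of the idea (using text-embedding-3-small as an example) looks like this:
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king, man, woman = embed("king"), embed("man"), embed("woman")
analogy = king - man + woman

# The analogy vector should land closer to "queen" than to an unrelated word
print(f"similarity to 'queen':  {cosine(analogy, embed('queen')):.3f}")
print(f"similarity to 'banana': {cosine(analogy, embed('banana')):.3f}")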

Practical Embedding System

from openai import OpenAI
import numpy as np
from typing import List
import json

client = OpenAI()

class EmbeddingCache:
    """Cache embeddings to avoid repeated API calls"""
    
    def __init__(self, cache_file: str = "embeddings_cache.json"):
        self.cache_file = cache_file
        self.cache = self._load_cache()
    
    def _load_cache(self) -> dict:
        try:
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
    
    def _save_cache(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)
    
    def get_embedding(self, text: str, model: str = "text-embedding-3-small") -> List[float]:
        cache_key = f"{model}:{text[:100]}"  # prefix as key (texts sharing their first 100 chars will collide)
        
        if cache_key in self.cache:
            return self.cache[cache_key]
        
        response = client.embeddings.create(model=model, input=text)
        embedding = response.data[0].embedding
        
        self.cache[cache_key] = embedding
        self._save_cache()
        
        return embedding
    
    def get_embeddings_batch(self, texts: List[str], model: str = "text-embedding-3-small") -> List[List[float]]:
        """Batch embed for efficiency (up to 2048 texts per call)"""
        # Check cache first
        uncached = [(i, t) for i, t in enumerate(texts) if f"{model}:{t[:100]}" not in self.cache]
        
        if uncached:
            indices, uncached_texts = zip(*uncached)
            response = client.embeddings.create(model=model, input=list(uncached_texts))
            
            for i, emb_data in zip(indices, response.data):
                cache_key = f"{model}:{texts[i][:100]}"
                self.cache[cache_key] = emb_data.embedding
            
            self._save_cache()
        
        return [self.cache[f"{model}:{t[:100]}"] for t in texts]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two vectors"""
    # Cosine similarity measures the angle between two vectors
    # Returns 1.0 for identical, 0.0 for unrelated, -1.0 for opposite
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def find_most_similar(query: str, documents: List[str], top_k: int = 3) -> List[tuple]:
    """Find most similar documents to query"""
    cache = EmbeddingCache()
    
    query_emb = cache.get_embedding(query)
    doc_embs = cache.get_embeddings_batch(documents)
    
    similarities = [
        (doc, cosine_similarity(query_emb, doc_emb))
        for doc, doc_emb in zip(documents, doc_embs)
    ]
    
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_k]


# Example usage
documents = [
    "Python is a programming language",
    "JavaScript runs in browsers",
    "Machine learning uses neural networks",
    "Snakes are reptiles that slither",
    "The stock market closed higher today"
]

results = find_most_similar("coding languages", documents)
for doc, score in results:
    print(f"{score:.3f}: {doc}")
# 0.847: Python is a programming language
# 0.812: JavaScript runs in browsers
# 0.623: Machine learning uses neural networks
Why cache embeddings? Each embedding API call costs money and takes time. Since the same text always produces the same embedding, caching can save you 90%+ on costs for repeated queries.
When to use embeddings vs. keyword search:
  • Use embeddings: When meaning matters (“cheap flights” = “affordable airfare”)
  • Use keywords: When exact matches matter (error codes, product SKUs)
  • Use both: Hybrid search often works best
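A minimal hybrid sketch just blends the two scores. This reuses the EmbeddingCache and cosine_similarity helpers above; the 0.7/0.3 weighting is an assumption you would tune for your data:
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query words that literally appear in the document (very naive)."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)

def hybrid_search(query: str, documents: List[str], alpha: float = 0.7, top_k: int = 3) -> List[tuple]:
    """Rank documents by a weighted mix of semantic and keyword similarity."""
    cache = EmbeddingCache()
    query_emb = cache.get_embedding(query)
    doc_embs = cache.get_embeddings_batch(documents)

    scored = [
        (doc, alpha * cosine_similarity(query_emb, doc_emb) + (1 - alpha) * keyword_score(query, doc))
        for doc, doc_emb in zip(documents, doc_embs)
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Same documents as above; exact keyword hits get a small boost on top of semantic similarity
for doc, score in hybrid_search("Python programming", documents):
    print(f"{score:.3f}: {doc}")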

Temperature & Sampling

What is Sampling?

Remember: LLMs predict the most likely next token. But if they always picked the #1 choice, outputs would be boring and repetitive. Sampling means randomly choosing from the top candidates based on their probabilities. This adds variety while still favoring likely tokens. Think of it like a weighted lottery:
  • “Paris” has 85 tickets
  • “Lyon” has 5 tickets
  • “a” has 3 tickets
  • “the” has 2 tickets
Most of the time you’ll draw “Paris”, but occasionally you’ll get something else.

How Sampling Works

When the model predicts the next token, it outputs probabilities for all possible tokens:
"The capital of France is" → 
  " Paris": 0.85
  " Lyon": 0.05
  " a": 0.03
  " the": 0.02
  ...
Temperature controls how these probabilities are used. Think of it as a “creativity dial”:
  • Low temperature (0-0.3): Conservative, picks the most likely tokens → Predictable output
  • Medium temperature (0.7-1.0): Balanced, some variety → Natural conversation
  • High temperature (1.5+): Wild, picks unlikely tokens → Creative but potentially nonsensical
What are logits? Raw scores before converting to probabilities. Temperature scales these scores before the conversion.
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    """Demonstrate temperature sampling"""
    # Temperature 0 means greedy decoding: always pick the single most likely token
    if temperature == 0:
        return int(np.argmax(logits))

    # Apply temperature (lower = more confident, higher = more random)
    scaled_logits = logits / temperature

    # Convert to probabilities (softmax; subtract the max for numerical stability)
    scaled_logits = scaled_logits - np.max(scaled_logits)
    probs = np.exp(scaled_logits) / np.sum(np.exp(scaled_logits))

    # Sample one token index according to those probabilities
    return int(np.random.choice(len(probs), p=probs))

# Temperature 0: Always pick highest probability (deterministic)
# Temperature 0.5: Slightly random, but still favors likely tokens
# Temperature 1.0: Sample according to original probabilities
# Temperature 2.0: More random, even unlikely tokens have a chance

When to Use What

| Task | Temperature | Why |
| --- | --- | --- |
| Code generation | 0 | Deterministic, reproducible |
| Factual Q&A | 0-0.3 | Minimize hallucination |
| General chat | 0.7 | Natural variation |
| Creative writing | 0.9-1.2 | Unexpected combinations |
| Brainstorming | 1.0-1.5 | Explore diverse ideas |
from openai import OpenAI

client = OpenAI()

def generate(prompt: str, task_type: str = "general") -> str:
    """Generate with appropriate temperature for task"""
    temp_map = {
        "code": 0,
        "factual": 0.2,
        "general": 0.7,
        "creative": 1.0,
        "brainstorm": 1.3
    }
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp_map.get(task_type, 0.7)
    )
    
    return response.choices[0].message.content

The Attention Mechanism

The Problem Attention Solves

Imagine reading a long document and trying to remember every single word equally. Impossible, right? You naturally focus on important parts and skim over others. LLMs had the same problem. Early models (RNNs) processed text sequentially, treating all words equally. They struggled with:
  • Long-range dependencies (“The cat, which was sitting on the mat that my grandmother gave me, was hungry”)
  • Understanding which words relate to each other
Attention lets the model decide which words to focus on when processing each token, like highlighting the important parts of a document. This mechanism is what makes Transformers special. Consider the sentence:
“The animal didn’t cross the street because it was too tired.”
To understand what “it” refers to, the model must pay attention to “animal” and ignore “street”. Without attention, the model wouldn’t know whether “it” referred to the animal or the street, which makes accurate translation and comprehension far harder.
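To make the mechanism concrete, here is a toy single-head scaled dot-product attention computation in NumPy. The random vectors stand in for token embeddings; real models use learned query/key/value projections and many attention heads:
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Toy attention: softmax(Q @ K.T / sqrt(d)) @ V"""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # how strongly each token attends to every other token
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights                    # blended values + the (n × n) attention matrix

# 6 tokens with 8-dimensional embeddings (random stand-ins)
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
output, weights = scaled_dot_product_attention(x, x, x)

print(weights.shape)        # (6, 6): this n × n matrix is where the O(n²) cost comes from
print(weights[2].round(2))  # how the third token spreads its attention over all six tokens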

Why Context Window Matters

Quadratic complexity means the computational cost grows with the square of the input length.
# GPT-4o has 128K context ≈ 96K words ≈ 300 pages

# But attention has quadratic complexity: O(n²)
# This means every token must "look at" every other token
# 128K tokens = 128K × 128K = 16 billion attention computations

# This is why:
# 1. Long contexts are slower
# 2. Long contexts cost more
# 3. Information at the start/end is remembered better than middle

The “Lost in the Middle” Problem

Research finding (Liu et al., 2023): LLMs are best at recalling information at the start and end of their context window. Information in the middle gets “lost”. This is similar to human memory - you remember the first and last items in a list better than the middle ones. Practical implication: When feeding multiple documents to an LLM, put the most important ones at the beginning and end:
def structure_for_attention(docs: list[str], query: str) -> str:
    """Structure documents to avoid 'lost in the middle' problem"""
    # Put the most relevant docs at the START and END,
    # and the least relevant docs in the MIDDLE
    
    # Note: rank_by_relevance would use embeddings to score relevance
    # (implementation depends on your use case); assume most relevant first
    ranked = rank_by_relevance(docs, query)
    
    # Alternate docs between the front and the back so the least
    # relevant ones end up buried in the middle
    front, back = [], []
    for i, doc in enumerate(ranked):
        if i % 2 == 0:
            front.append(doc)    # ranks 1, 3, 5, ... filling in from the start
        else:
            back.insert(0, doc)  # ranks 2, 4, 6, ... filling in from the end
    
    return "\n\n".join(front + back)

Prompt Engineering That Works

Why Prompt Structure Matters

You wouldn’t send an email that just says “help” and expect a useful response. Same with LLMs: structure and context dramatically improve output quality.
Bad prompt: “Write a blog post about AI”
Good prompt: specific context, clear objective, defined format
Why structured prompts work better:
  1. Reduces ambiguity - LLM doesn’t have to guess what you want
  2. Provides context - Like briefing a colleague before asking for help
  3. Sets expectations - Defines tone, style, and format upfront
  4. Improves consistency - Same structure = similar quality outputs

The COSTAR Framework

def build_prompt_costar(
    context: str,
    objective: str,
    style: str,
    tone: str,
    audience: str,
    response_format: str
) -> str:
    """
    COSTAR framework for structured prompts
    
    C - Context: Background information
    O - Objective: What you want to achieve
    S - Style: Writing style (formal, casual, technical)
    T - Tone: Emotional tone (professional, friendly)
    A - Audience: Who will read this
    R - Response: Format of output
    """
    return f"""
# Context
{context}

# Objective
{objective}

# Style
Write in a {style} style.

# Tone
Maintain a {tone} tone.

# Audience
This is for {audience}.

# Response Format
{response_format}
"""

# Example
prompt = build_prompt_costar(
    context="We're launching a new AI code review tool for developers.",
    objective="Write a product announcement for our blog.",
    style="technical but accessible",
    tone="excited but professional",
    audience="software developers who use GitHub",
    response_format="Blog post with headline, 3-4 paragraphs, and a call to action."
)

Few-Shot Prompting That Scales

The idea: Show the LLM examples of what you want, then ask it to do the same for new input. When to use:
  • Few-shot (2-5 examples): When you have examples and want consistent formatting
  • Zero-shot (no examples): When the task is simple or you want creative freedom
  • Many-shot (10+ examples): When you need very specific behavior (but watch token costs!)
Why it works: LLMs are pattern matchers. Examples show the pattern more clearly than descriptions.
def few_shot_prompt(
    task_description: str,
    examples: list[dict],  # [{"input": ..., "output": ...}]
    input_text: str
) -> str:
    """Build a few-shot prompt with examples"""
    prompt = f"{task_description}\n\n"
    
    for i, ex in enumerate(examples, 1):
        prompt += f"Example {i}:\n"
        prompt += f"Input: {ex['input']}\n"
        prompt += f"Output: {ex['output']}\n\n"
    
    prompt += f"Now process this:\n"
    prompt += f"Input: {input_text}\n"
    prompt += f"Output:"
    
    return prompt

# Example: Sentiment analysis
examples = [
    {"input": "This product is amazing!", "output": "positive"},
    {"input": "Worst purchase ever.", "output": "negative"},
    {"input": "It's okay, nothing special.", "output": "neutral"},
]

prompt = few_shot_prompt(
    "Classify the sentiment of the text as positive, negative, or neutral.",
    examples,
    "The quality exceeded my expectations!"
)

Chain of Thought (CoT) for Complex Reasoning

Research finding (Wei et al., 2022): LLMs perform significantly better on complex tasks when asked to “think step by step”. Why it works: Breaking down reasoning into steps helps the model:
  1. Avoid jumping to conclusions
  2. Show its work (so you can debug)
  3. Handle multi-step logic
  4. Reduce errors on math and reasoning tasks
When to use CoT:
  • Math problems
  • Logic puzzles
  • Multi-step reasoning
  • Code debugging
  • Planning tasks
def cot_prompt(question: str) -> str:
    """Force step-by-step reasoning"""
    return f"""
{question}

Let's solve this step by step:
1. First, I'll identify what we know
2. Then, I'll figure out what we need to find
3. Next, I'll work through the logic
4. Finally, I'll state the answer

Step 1:"""


Production Patterns

Why These Patterns Matter

In development, you can retry manually when an API call fails. In production with thousands of users, you need automatic handling for:
  • Rate limits - APIs have request limits per minute
  • Transient errors - Network blips, server restarts
  • Cost optimization - Avoid redundant calls

Retry with Exponential Backoff

The problem: APIs rate-limit you (e.g., 3,500 requests/minute for GPT-4). If you hit the limit, requests fail. The solution: Wait and retry, with increasing delays (1s, 2s, 4s, 8s…). This gives the rate limit time to reset.
import time
from openai import OpenAI, RateLimitError, APIError
from functools import wraps

def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Decorator for automatic retry with exponential backoff"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    print(f"Rate limited. Waiting {delay}s...")
                    time.sleep(delay)
                except APIError as e:
                    if attempt == max_retries - 1:
                        raise
                    print(f"API error: {e}. Retrying...")
                    time.sleep(base_delay)
            return None
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def call_llm(prompt: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

Response Caching

The problem: If 100 users ask “What is Python?”, you’re paying for 100 identical API calls. The solution: Cache responses keyed by the request. The first user pays; the next 99 get instant, free responses. When to cache:
  • ✅ Factual questions with stable answers
  • ✅ Common queries (FAQs, documentation)
  • ✅ Expensive prompts with long context
When NOT to cache:
  • ❌ Personalized responses (different per user)
  • ❌ Time-sensitive data (stock prices, news)
  • ❌ Creative content (you want variety)
import hashlib
import json
from pathlib import Path
from openai import OpenAI

class LLMCache:
    """Cache LLM responses to disk"""
    
    def __init__(self, cache_dir: str = ".llm_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
    
    def _hash_request(self, model: str, messages: list, **kwargs) -> str:
        """Create unique hash for request"""
        key_data = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
        return hashlib.md5(key_data.encode()).hexdigest()
    
    def get(self, model: str, messages: list, **kwargs) -> str | None:
        """Get cached response if exists"""
        cache_key = self._hash_request(model, messages, **kwargs)
        cache_file = self.cache_dir / f"{cache_key}.json"
        
        if cache_file.exists():
            with open(cache_file) as f:
                return json.load(f)["response"]
        return None
    
    def set(self, model: str, messages: list, response: str, **kwargs):
        """Cache a response"""
        cache_key = self._hash_request(model, messages, **kwargs)
        cache_file = self.cache_dir / f"{cache_key}.json"
        
        with open(cache_file, 'w') as f:
            json.dump({"response": response}, f)

def cached_llm_call(prompt: str, model: str = "gpt-4o-mini", use_cache: bool = True) -> str:
    """LLM call with caching"""
    cache = LLMCache()
    messages = [{"role": "user", "content": prompt}]
    
    if use_cache:
        cached = cache.get(model, messages)
        if cached:
            return cached
    
    client = OpenAI()
    response = client.chat.completions.create(model=model, messages=messages)
    result = response.choices[0].message.content
    
    if use_cache:
        cache.set(model, messages, result)
    
    return result

Mini-Project: Cost-Aware Chat Application

Let’s build a complete chat app that tracks costs and automatically optimizes token usage. This demonstrates:
  • Token counting and cost calculation
  • Budget enforcement
  • Automatic conversation summarization when approaching limits
  • Production-ready error handling
Key design decisions:
  1. Why summarize at 10K tokens? Leaves room for the response while staying well under the 128K limit
  2. Why keep last 5 messages? Recent context is most important for coherent conversation
  3. Why track per-message tokens? Enables smart decisions about what to keep vs. summarize
Here’s the full implementation:
from openai import OpenAI
from dataclasses import dataclass, field
from typing import List
import tiktoken

@dataclass
class Message:
    role: str
    content: str
    tokens: int = 0

@dataclass
class ConversationStats:
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_cost: float = 0.0
    message_count: int = 0

class CostAwareChat:
    """Chat application with cost tracking and optimization"""
    
    def __init__(self, model: str = "gpt-4o-mini", budget_limit: float = 1.0):
        self.client = OpenAI()
        self.model = model
        self.budget_limit = budget_limit
        self.messages: List[Message] = []
        self.stats = ConversationStats()
        self.encoder = tiktoken.encoding_for_model("gpt-4")
        
        # Pricing per million tokens
        self.pricing = {
            "gpt-4o": {"input": 2.50, "output": 10.00},
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        }
    
    def _count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))
    
    def _calculate_cost(self, input_tokens: int, output_tokens: int) -> float:
        pricing = self.pricing[self.model]
        return (input_tokens / 1_000_000) * pricing["input"] + \
               (output_tokens / 1_000_000) * pricing["output"]
    
    def _build_messages(self) -> list:
        """Build message list, potentially summarizing old messages"""
        total_tokens = sum(m.tokens for m in self.messages)
        
        # If approaching context limit, summarize old messages
        if total_tokens > 10000:
            return self._summarize_and_build()
        
        return [{"role": m.role, "content": m.content} for m in self.messages]
    
    def _summarize_and_build(self) -> list:
        """Summarize conversation history to save tokens"""
        # Keep last 5 messages, summarize the rest
        recent = self.messages[-5:]
        old = self.messages[:-5]
        
        if old:
            old_text = "\n".join([f"{m.role}: {m.content}" for m in old])
            summary_response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{
                    "role": "user",
                    "content": f"Summarize this conversation in 2-3 sentences:\n{old_text}"
                }],
                max_tokens=200
            )
            summary = summary_response.choices[0].message.content
            
            return [
                {"role": "system", "content": f"Previous conversation summary: {summary}"},
                *[{"role": m.role, "content": m.content} for m in recent]
            ]
        
        return [{"role": m.role, "content": m.content} for m in recent]
    
    def chat(self, user_message: str) -> str:
        """Send message and get response with cost tracking"""
        # Check budget
        if self.stats.total_cost >= self.budget_limit:
            return f"Budget limit of ${self.budget_limit} reached. Total spent: ${self.stats.total_cost:.4f}"
        
        # Add user message
        user_tokens = self._count_tokens(user_message)
        self.messages.append(Message("user", user_message, user_tokens))
        
        # Build and send
        messages = self._build_messages()
        input_tokens = sum(self._count_tokens(m["content"]) for m in messages)
        
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages
        )
        
        assistant_content = response.choices[0].message.content
        output_tokens = self._count_tokens(assistant_content)
        
        # Track stats
        self.stats.total_input_tokens += input_tokens
        self.stats.total_output_tokens += output_tokens
        self.stats.total_cost += self._calculate_cost(input_tokens, output_tokens)
        self.stats.message_count += 1
        
        # Add assistant message
        self.messages.append(Message("assistant", assistant_content, output_tokens))
        
        return assistant_content
    
    def get_stats(self) -> dict:
        """Get conversation statistics"""
        return {
            "messages": self.stats.message_count,
            "input_tokens": self.stats.total_input_tokens,
            "output_tokens": self.stats.total_output_tokens,
            "total_cost": f"${self.stats.total_cost:.6f}",
            "budget_remaining": f"${self.budget_limit - self.stats.total_cost:.6f}",
            "model": self.model
        }


# Usage
chat = CostAwareChat(model="gpt-4o-mini", budget_limit=0.50)

print(chat.chat("What is machine learning?"))
print(chat.chat("Give me an example"))
print(chat.chat("How is it different from AI?"))

print("\n--- Stats ---")
print(chat.get_stats())

Key Takeaways

Tokens = Money

Every token costs. Cache, truncate, and choose models wisely. gpt-4o-mini is 17x cheaper than gpt-4o.

Embeddings Power Search

Convert text to vectors for semantic similarity. Cache embeddings to avoid repeated costs.

Temperature = Creativity Dial

0 for deterministic (code), 0.7 for balanced, 1+ for creative. Match to your task.

Context Has Limits

128K tokens sounds like a lot, but attention degrades. Put important info at edges, not middle.

What’s Next

OpenAI API Deep Dive

Master function calling, structured outputs, streaming, and production patterns