Most developers use LLMs like magic boxes. They copy-paste prompts from Twitter and pray they work. You’ll be different. Understanding how LLMs work lets you:
Debug when outputs are wrong
Optimize costs (save 10x on API bills)
Design prompts that actually work
Know when to use which model
Real Talk: Companies waste thousands on AI because developers don’t understand token economics. After this module, you won’t.
LLMs are next-token predictors. That’s it. Everything else is a consequence of this simple idea.
Input: "The capital of France is"Model thinks: What token is most likely next?Output: " Paris" (with high probability)
They don’t “know” facts. They predict what text is likely to follow based on patterns in training data.
This explains hallucinations: If “The CEO of Apple is Steve Jobs” appeared often in training data, the model might predict “Steve Jobs” even though Tim Cook is the current CEO. It predicts likely text, not true text.
Pro Tip: Start with gpt-4o-mini for development and testing. It’s 15x cheaper than gpt-4o and fast enough to iterate quickly. Switch to a more capable model only when needed.
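One low-friction way to follow this tip is to read the model name from an environment variable, so you can switch models without touching code. A minimal sketch (the OPENAI_MODEL variable name is just a convention assumed here, not anything the SDK requires):

import os
from openai import OpenAI

client = OpenAI()

# Cheap, fast default for development; override OPENAI_MODEL when you need more capability.
MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)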
Notice how the relationship direction is consistent:
King - Man + Woman ≈ Queen
This vector arithmetic allows the model to understand analogies and relationships.
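As a toy illustration of that arithmetic, here is a sketch with hand-picked 3-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and modern sentence-embedding models only approximate this analogy behavior:

import numpy as np

# Toy 3-d "embeddings" chosen by hand to illustrate the idea: the dimensions
# loosely encode (royalty, maleness, femaleness).
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.1, 0.8])
man   = np.array([0.1, 0.8, 0.1])
woman = np.array([0.1, 0.1, 0.8])

result = king - man + woman
print(result)                       # approximately [0.9 0.1 0.8]
print(np.allclose(result, queen))   # True: the result lands on "queen"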
from openai import OpenAI
import numpy as np
from typing import List
import json

client = OpenAI()

class EmbeddingCache:
    """Cache embeddings to avoid repeated API calls"""

    def __init__(self, cache_file: str = "embeddings_cache.json"):
        self.cache_file = cache_file
        self.cache = self._load_cache()

    def _load_cache(self) -> dict:
        try:
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _save_cache(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)

    def get_embedding(self, text: str, model: str = "text-embedding-3-small") -> List[float]:
        cache_key = f"{model}:{text[:100]}"  # Use prefix for key
        if cache_key in self.cache:
            return self.cache[cache_key]

        response = client.embeddings.create(model=model, input=text)
        embedding = response.data[0].embedding

        self.cache[cache_key] = embedding
        self._save_cache()
        return embedding

    def get_embeddings_batch(self, texts: List[str], model: str = "text-embedding-3-small") -> List[List[float]]:
        """Batch embed for efficiency (up to 2048 texts per call)"""
        # Check cache first
        uncached = [(i, t) for i, t in enumerate(texts) if f"{model}:{t[:100]}" not in self.cache]

        if uncached:
            indices, uncached_texts = zip(*uncached)
            response = client.embeddings.create(model=model, input=list(uncached_texts))
            for i, emb_data in zip(indices, response.data):
                cache_key = f"{model}:{texts[i][:100]}"
                self.cache[cache_key] = emb_data.embedding
            self._save_cache()

        return [self.cache[f"{model}:{t[:100]}"] for t in texts]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two vectors"""
    # Cosine similarity measures the angle between two vectors
    # Returns 1.0 for identical, 0.0 for unrelated, -1.0 for opposite
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_most_similar(query: str, documents: List[str], top_k: int = 3) -> List[tuple]:
    """Find most similar documents to query"""
    cache = EmbeddingCache()
    query_emb = cache.get_embedding(query)
    doc_embs = cache.get_embeddings_batch(documents)

    similarities = [
        (doc, cosine_similarity(query_emb, doc_emb))
        for doc, doc_emb in zip(documents, doc_embs)
    ]
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_k]

# Example usage
documents = [
    "Python is a programming language",
    "JavaScript runs in browsers",
    "Machine learning uses neural networks",
    "Snakes are reptiles that slither",
    "The stock market closed higher today"
]

results = find_most_similar("coding languages", documents)
for doc, score in results:
    print(f"{score:.3f}: {doc}")

# 0.847: Python is a programming language
# 0.812: JavaScript runs in browsers
# 0.623: Machine learning uses neural networks
Why cache embeddings? Each embedding API call costs money and takes time. Since the same text always produces the same embedding, caching can save you 90%+ on costs for repeated queries.
When to use embeddings vs. keyword search:
Use embeddings: When meaning matters (“cheap flights” = “affordable airfare”)
Use keywords: When exact matches matter (error codes, product SKUs)
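To make the difference concrete, here is a small sketch: keyword overlap finds nothing shared between “cheap flights” and “affordable airfare”, while embedding similarity picks up the shared meaning. It assumes the same text-embedding-3-small model as the example above, and the exact score will vary:

from openai import OpenAI
import numpy as np

client = OpenAI()

def keyword_overlap(a: str, b: str) -> int:
    """Crude stand-in for keyword search: count shared words."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

query, doc = "cheap flights", "affordable airfare"
print(keyword_overlap(query, doc))  # 0 -- no words in common

a, b = embed(query), embed(doc)
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # noticeably higher than for unrelated text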
Remember: LLMs predict the most likely next token. But if they always picked the #1 choice, outputs would be boring and repetitive.
Sampling means randomly choosing from the top candidates based on their probabilities. This adds variety while still favoring likely tokens.
Think of it like a weighted lottery:
“Paris” has 85 tickets
“Lyon” has 5 tickets
“a” has 3 tickets
“the” has 2 tickets
Most of the time you’ll draw “Paris”, but occasionally you’ll get something else.
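You can simulate that lottery directly. The ticket counts below are the illustrative numbers from the list above, not real model probabilities:

import random

tokens = [" Paris", " Lyon", " a", " the"]
tickets = [85, 5, 3, 2]  # illustrative weights, not real model outputs

# Draw 10 times: mostly " Paris", with the occasional surprise
print(random.choices(tokens, weights=tickets, k=10))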
When the model predicts the next token, it outputs probabilities for all possible tokens:
"The capital of France is" → " Paris": 0.85 " Lyon": 0.05 " a": 0.03 " the": 0.02 ...
Temperature controls how these probabilities are used. Think of it as a “creativity dial”:
Low temperature (0-0.3): Conservative, picks the most likely tokens → Predictable output
Medium temperature (0.7-1.0): Balanced, some variety → Natural conversation
High temperature (1.5+): Wild, picks unlikely tokens → Creative but potentially nonsensical
What are logits? Raw scores before converting to probabilities. Temperature scales these scores before the conversion.
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    """Demonstrate temperature sampling"""
    # Apply temperature (lower = more confident, higher = more random)
    scaled_logits = logits / temperature

    # Convert to probabilities (softmax)
    probs = np.exp(scaled_logits) / np.sum(np.exp(scaled_logits))

    # Sample
    return np.random.choice(len(probs), p=probs)

# Temperature → 0: approaches always picking the highest-probability token
#   (true temperature 0 is implemented as argmax, since dividing by zero fails)
# Temperature 0.5: Slightly random, but still favors likely tokens
# Temperature 1.0: Sample according to original probabilities
# Temperature 2.0: More random, even unlikely tokens have a chance
Imagine reading a long document and trying to remember every single word equally. Impossible, right? You naturally focus on important parts and skim over others.
LLMs had the same problem. Early models (RNNs) processed text sequentially, treating all words equally. They struggled with:
Long-range dependencies (“The cat, which was sitting on the mat that my grandmother gave me, was hungry”)
Understanding which words relate to each other
Attention lets the model decide which words to focus on when processing each token. It’s like highlighting the important parts of a document.
This is what makes Transformers special: when processing a specific word, the model can attend to whichever parts of the input matter for it.
Consider the sentence:
“The animal didn’t cross the street because it was too tired.”
To understand what “it” refers to, the model must pay attention to “animal” and ignore “street”. Without attention, the model couldn’t reliably tell whether “it” refers to the animal or the street, and translation and comprehension break down.
Quadratic complexity means the computational cost grows with the square of the input length.
# GPT-4o has 128K context ≈ 96K words ≈ 300 pages
# But attention has quadratic complexity: O(n²)
# This means every token must "look at" every other token

# 128K tokens = 128K × 128K ≈ 16 billion attention computations

# This is why:
# 1. Long contexts are slower
# 2. Long contexts cost more
# 3. Information at the start/end is remembered better than the middle
Research finding (Liu et al., 2023): LLMs are best at recalling information at the start and end of their context window. Information in the middle gets “lost”.
This is similar to human memory: you remember the first and last items in a list better than the middle ones.
Practical implication: When feeding multiple documents to an LLM, put the most important ones at the beginning and end:
def structure_for_attention(docs: list[str], query: str) -> str:
    """Structure documents to avoid the 'lost in the middle' problem"""
    # Put most relevant docs at START and END
    # Put less relevant docs in MIDDLE

    # Note: rank_by_relevance would use embeddings to score relevance
    # (implementation depends on your use case)
    ranked = rank_by_relevance(docs, query)  # most relevant first

    # Interleave, working from least to most relevant, so the most
    # relevant documents end up at the edges and the least relevant
    # documents end up in the middle
    reordered = []
    for i, doc in enumerate(reversed(ranked)):
        if i % 2 == 0:
            reordered.insert(0, doc)  # Add to start
        else:
            reordered.append(doc)     # Add to end

    return "\n\n".join(reordered)
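The rank_by_relevance helper is left open above. One possible sketch, reusing the EmbeddingCache and cosine_similarity helpers from the embeddings example earlier (an assumption; any scorer that returns documents most-relevant-first would work):

def rank_by_relevance(docs: list[str], query: str) -> list[str]:
    """Rank documents by embedding similarity to the query, most relevant first."""
    cache = EmbeddingCache()
    query_emb = cache.get_embedding(query)
    doc_embs = cache.get_embeddings_batch(docs)
    scored = sorted(
        zip(docs, doc_embs),
        key=lambda pair: cosine_similarity(query_emb, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored]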
You wouldn’t send an email that just says “help” and expect a useful response. The same goes for LLMs: structure and context dramatically improve output quality.
Bad prompt: “Write a blog post about AI”
Good prompt: Specific context, clear objective, defined format
Why structured prompts work better:
Reduces ambiguity - LLM doesn’t have to guess what you want
Provides context - Like briefing a colleague before asking for help
Sets expectations - Defines tone, style, and format upfront
Improves consistency - Same structure = similar quality outputs
def build_prompt_costar(
    context: str,
    objective: str,
    style: str,
    tone: str,
    audience: str,
    response_format: str
) -> str:
    """
    COSTAR framework for structured prompts

    C - Context: Background information
    O - Objective: What you want to achieve
    S - Style: Writing style (formal, casual, technical)
    T - Tone: Emotional tone (professional, friendly)
    A - Audience: Who will read this
    R - Response: Format of output
    """
    return f"""# Context
{context}

# Objective
{objective}

# Style
Write in a {style} style.

# Tone
Maintain a {tone} tone.

# Audience
This is for {audience}.

# Response Format
{response_format}"""

# Example
prompt = build_prompt_costar(
    context="We're launching a new AI code review tool for developers.",
    objective="Write a product announcement for our blog.",
    style="technical but accessible",
    tone="excited but professional",
    audience="software developers who use GitHub",
    response_format="Blog post with headline, 3-4 paragraphs, and a call to action."
)
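To turn the built prompt into actual text, send it through the chat completions API. A minimal sketch (the model and temperature choices are assumptions, picking the cheap default from the Pro Tip above):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],  # `prompt` built by build_prompt_costar above
    temperature=0.7,  # balanced: natural-sounding copy without going off the rails
)
print(response.choices[0].message.content)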
Research finding (Wei et al., 2022): LLMs perform significantly better on complex tasks when asked to “think step by step”.
Why it works: Breaking down reasoning into steps helps the model:
Avoid jumping to conclusions
Show its work (so you can debug)
Handle multi-step logic
Reduce errors on math and reasoning tasks
When to use CoT:
Math problems
Logic puzzles
Multi-step reasoning
Code debugging
Planning tasks
def cot_prompt(question: str) -> str:
    """Force step-by-step reasoning"""
    return f"""{question}

Let's solve this step by step:
1. First, I'll identify what we know
2. Then, I'll figure out what we need to find
3. Next, I'll work through the logic
4. Finally, I'll state the answer

Step 1:"""

# CoT is especially powerful for:
# - Math problems
# - Logic puzzles
# - Multi-step reasoning
# - Code debugging
The problem: APIs rate-limit you (e.g., 3,500 requests/minute for GPT-4). If you hit the limit, requests fail.
The solution: Wait and retry with increasing delays (1s, 2s, 4s, 8s, …). This gives the rate limit time to reset.
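A minimal retry sketch with exponential backoff (the openai SDK also has built-in retries via its max_retries setting, so treat this as an illustration of the pattern rather than required code):

import random
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def chat_with_backoff(messages: list[dict], model: str = "gpt-4o-mini", max_attempts: int = 5):
    """Retry rate-limited requests with exponential backoff: 1s, 2s, 4s, 8s, ..."""
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries, let the caller handle it
            delay = 2 ** attempt + random.random()  # jitter avoids synchronized retries
            time.sleep(delay)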
The problem: If 100 users ask “What is Python?”, you’re paying for 100 identical API calls.
The solution: Cache responses by request. The first user pays; the next 99 get instant, free responses.
When to cache: