Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Updated December 2025: Now covers GPT-4.5, Claude 3.5 Opus, Gemini 2.0 Flash, and the latest multimodal capabilities.
Why This Matters
Most developers use LLMs like magic boxes. They copy-paste prompts from Twitter and pray it works. You’ll be different. Understanding how LLMs work lets you:- Debug when outputs are wrong
- Optimize costs (save 10x on API bills)
- Design prompts that actually work
- Know when to use which model
Real Talk: Companies waste thousands on AI because developers don’t understand token economics. After this module, you won’t.
The Core Mental Model
LLMs are next-token predictors. That’s it. Everything else is a consequence of this simple idea.2025 Model Landscape
Understanding the current model landscape helps you make the right choices for your applications.Model Comparison (December 2025)
| Model | Best For | Context | Speed | Cost |
|---|---|---|---|---|
| GPT-4.5 | Complex reasoning, research | 128K | Slow | $$$$$ |
| GPT-4o | General purpose, balanced | 128K | Fast | $$ |
| GPT-4o-mini | Simple tasks, high volume | 128K | Very Fast | $ |
| o1 | Math, coding, deep reasoning | 200K | Very Slow | $$$$ |
| o1-mini | Quick reasoning tasks | 128K | Slow | $$ |
| Claude 3.5 Sonnet | Coding, long context | 200K | Fast | $$ |
| Claude 3.5 Opus | Most capable, nuanced | 200K | Medium | $$$$ |
| Gemini 2.0 Flash | Speed, multimodal, cheap | 1M | Very Fast | $ |
| Gemini 1.5 Pro | Long context analysis | 2M | Medium | $$ |
When to Use What
Tokenization: The Foundation
What is a Token?
Before we dive into code, let’s understand what a token actually is. A token is the smallest unit of text that an LLM processes. Think of it like this:- Humans read words
- Computers process bytes
- LLMs understand tokens
- A whole word:
"hello"→ 1 token - Part of a word:
"unhappiness"→["un", "happiness"]→ 2 tokens - A single character:
"🎉"→ 1 token - Punctuation:
","→ 1 token
- New words appear all the time (“ChatGPT”, “blockchain”)
- Different languages have different word structures
- Code has symbols and syntax that aren’t “words”
- Tokens allow the model to handle ANY text, even typos
Why Tokens Matter
Now that you know what tokens are, here’s why they’re critical:- Pricing: You pay per token, not per word
- Context limits: 128K tokens, not 128K words
- Output quality: Some words are multiple tokens, affecting generation
- English: GPT models were trained primarily on English text, so common English words are single tokens
- Code: Special characters like
:,[,],->each become separate tokens - Arabic: Less represented in training data, so the tokenizer breaks it into smaller pieces
- Numbers: Tokenized in chunks (e.g., “1234” might be one token, “567890” another)
Token Economics Calculator
Now that you understand tokens affect pricing, let’s build a calculator to estimate costs before making API calls. This is critical for:- Budgeting your application
- Choosing the right model for your use case
- Avoiding surprise bills
The Token Limit Trap
Every model has a context window (e.g., 128K tokens for GPT-4o). But what happens when your input is too large? You have two options:- Truncate (cut off text) - but you might lose important information
- Summarize (compress the content) - but this costs extra API calls
Embeddings: Semantic Understanding
Why Embeddings?
Imagine you’re building a search feature. A user searches for “python programming”. Traditional keyword search would miss results containing:- “coding in Python”
- “Python development”
- “snake (the reptile)” ← Wrong match!
- Semantic search: Find documents by meaning, not just keywords
- Recommendations: “Users who liked X also liked Y”
- Clustering: Group similar documents together
- Classification: Categorize text by meaning
The Intuition
King - Man + Woman ≈ Queen
This vector arithmetic allows the model to understand analogies and relationships.
Practical Embedding System
Why cache embeddings? Each embedding API call costs money and takes time. Since the same text always produces the same embedding, caching can save you 90%+ on costs for repeated queries.
- Use embeddings: When meaning matters (“cheap flights” = “affordable airfare”)
- Use keywords: When exact matches matter (error codes, product SKUs)
- Use both: Hybrid search often works best
Temperature & Sampling
What is Sampling?
Remember: LLMs predict the most likely next token. But if they always picked the #1 choice, outputs would be boring and repetitive. Sampling means randomly choosing from the top candidates based on their probabilities. This adds variety while still favoring likely tokens. Think of it like a weighted lottery:- “Paris” has 85 tickets
- “Lyon” has 5 tickets
- “a” has 3 tickets
- “the” has 2 tickets
How Sampling Works
When the model predicts the next token, it outputs probabilities for all possible tokens:- Low temperature (0-0.3): Conservative, picks the most likely tokens → Predictable output
- Medium temperature (0.7-1.0): Balanced, some variety → Natural conversation
- High temperature (1.5+): Wild, picks unlikely tokens → Creative but potentially nonsensical
When to Use What
| Task | Temperature | Why |
|---|---|---|
| Code generation | 0 | Deterministic, reproducible |
| Factual Q&A | 0-0.3 | Minimize hallucination |
| General chat | 0.7 | Natural variation |
| Creative writing | 0.9-1.2 | Unexpected combinations |
| Brainstorming | 1.0-1.5 | Explore diverse ideas |
The Attention Mechanism
The Problem Attention Solves
Imagine reading a long document and trying to remember every single word equally. Impossible, right? You naturally focus on important parts and skim over others. LLMs had the same problem. Early models (RNNs) processed text sequentially, treating all words equally. They struggled with:- Long-range dependencies (“The cat, which was sitting on the mat that my grandmother gave me, was hungry”)
- Understanding which words relate to each other
“The animal didn’t cross the street because it was too tired.”To understand what “it” refers to, the model must pay attention to “animal” and ignore “street”.
Why Context Window Matters
Quadratic complexity means the computational cost grows with the square of the input length.The “Lost in the Middle” Problem
Research finding (Liu et al., 2023): LLMs are best at recalling information at the start and end of their context window. Information in the middle gets “lost”. This is similar to human memory - you remember the first and last items in a list better than the middle ones. Practical implication: When feeding multiple documents to an LLM, put the most important ones at the beginning and end:Prompt Engineering That Works
Why Prompt Structure Matters
You wouldn’t send an email that just says “help” and expect a useful response. Same with LLMs - structure and context dramatically improve output quality. Bad prompt: “Write a blog post about AI” Good prompt: Specific context, clear objective, defined format Why structured prompts work better:- Reduces ambiguity - LLM doesn’t have to guess what you want
- Provides context - Like briefing a colleague before asking for help
- Sets expectations - Defines tone, style, and format upfront
- Improves consistency - Same structure = similar quality outputs
The COSTAR Framework
Few-Shot Prompting That Scales
The idea: Show the LLM examples of what you want, then ask it to do the same for new input. When to use:- Few-shot (2-5 examples): When you have examples and want consistent formatting
- Zero-shot (no examples): When the task is simple or you want creative freedom
- Many-shot (10+ examples): When you need very specific behavior (but watch token costs!)
Chain of Thought (CoT) for Complex Reasoning
Research finding (Wei et al., 2022): LLMs perform significantly better on complex tasks when asked to “think step by step”. Why it works: Breaking down reasoning into steps helps the model:- Avoid jumping to conclusions
- Show its work (so you can debug)
- Handle multi-step logic
- Reduce errors on math and reasoning tasks
- Math problems
- Logic puzzles
- Multi-step reasoning
- Code debugging
- Planning tasks
Production Patterns
Why These Patterns Matter
In development, you can retry manually when an API call fails. In production with thousands of users, you need automatic handling for:- Rate limits - APIs have request limits per minute
- Transient errors - Network blips, server restarts
- Cost optimization - Avoid redundant calls
Retry with Exponential Backoff
The problem: APIs rate-limit you (e.g., 3,500 requests/minute for GPT-4). If you hit the limit, requests fail. The solution: Wait and retry, with increasing delays (1s, 2s, 4s, 8s…). This gives the rate limit time to reset.Response Caching
The problem: If 100 users ask “What is Python?”, you’re paying for 100 identical API calls. The solution: Cache responses by request. First user pays, next 99 get instant free responses. When to cache:- ✅ Factual questions with stable answers
- ✅ Common queries (FAQs, documentation)
- ✅ Expensive prompts with long context
- ❌ Personalized responses (different per user)
- ❌ Time-sensitive data (stock prices, news)
- ❌ Creative content (you want variety)
Mini-Project: Cost-Aware Chat Application
Let’s build a complete chat app that tracks costs and automatically optimizes token usage. This demonstrates:- Token counting and cost calculation
- Budget enforcement
- Automatic conversation summarization when approaching limits
- Production-ready error handling
- Why summarize at 10K tokens? Leaves room for the response while staying well under the 128K limit
- Why keep last 5 messages? Recent context is most important for coherent conversation
- Why track per-message tokens? Enables smart decisions about what to keep vs. summarize
Key Takeaways
Tokens = Money
Every token costs. Cache, truncate, and choose models wisely. gpt-4o-mini is 17x cheaper than gpt-4o.
Embeddings Power Search
Convert text to vectors for semantic similarity. Cache embeddings to avoid repeated costs.
Temperature = Creativity Dial
0 for deterministic (code), 0.7 for balanced, 1+ for creative. Match to your task.
Context Has Limits
128K tokens sounds like a lot, but attention degrades. Put important info at edges, not middle.
What’s Next
OpenAI API Deep Dive
Master function calling, structured outputs, streaming, and production patterns
Interview Deep-Dive
Explain how the attention mechanism works in Transformers and why it replaced RNNs.
Explain how the attention mechanism works in Transformers and why it replaced RNNs.
Strong Answer:
- The core idea behind attention is that when processing any given token, the model should be able to “look at” every other token in the input and decide how much weight to assign each one. In a Transformer, we compute this through three learned projections — Query, Key, and Value matrices. Each token’s Query is dot-producted against every other token’s Key to produce an attention score, then those scores are normalized via softmax and used to create a weighted sum of the Values. That weighted sum becomes the representation of that token.
- RNNs processed text sequentially, left-to-right, compressing everything into a fixed-size hidden state. This meant that by the time the model reached token 500, it had largely forgotten token 10. Attention removes that bottleneck entirely — token 500 can directly attend to token 10 with a single matrix multiplication, no sequential chain required.
- The trade-off is computational complexity. Self-attention is O(n^2) with respect to sequence length because every token attends to every other token. For a 128K context window, that is 128K times 128K = roughly 16 billion attention computations per layer. This is why long-context models are slower and more expensive, and why approaches like FlashAttention, sliding-window attention, and ring attention exist to make this tractable.
- In practice, this also explains the “lost in the middle” phenomenon. While attention theoretically allows equal access to all positions, the learned attention patterns tend to favor tokens near the beginning and end of the context. If you are building a RAG pipeline, this means you should place your most relevant retrieved chunks at the start and end of the prompt, not buried in the middle.
A client is seeing unexpectedly high API costs. Walk me through how you would diagnose and reduce their spend.
A client is seeing unexpectedly high API costs. Walk me through how you would diagnose and reduce their spend.
Strong Answer:
- First thing I would do is instrument their pipeline to log token counts per request, broken down by input tokens and output tokens. Most cost blowups come from the input side — someone is stuffing their entire database into the context on every call. I have seen a RAG system where the retrieval step was returning 50 chunks of 500 tokens each, consuming 25K input tokens per query. Cutting that to the top 5 most relevant chunks dropped costs by 80% with no quality degradation.
- Second, I would audit their model selection. A common pattern is using gpt-4o for everything including simple classification or extraction tasks that gpt-4o-mini handles just as well. The pricing difference is roughly 17x on input and 17x on output. In one project I worked on, we routed simple intent-classification queries to gpt-4o-mini and only escalated complex reasoning tasks to gpt-4o, which cut the monthly bill from around 1,200.
- Third, response caching. If the same question or a semantically similar question gets asked frequently, cache the response. Semantic caching with an embedding similarity threshold of 0.95 or higher can safely serve cached results for about 30-40% of traffic in many customer-support-style applications.
- Fourth, I would look at output token waste. Are they asking for verbose explanations when a JSON object with three fields would suffice? Setting
max_tokensappropriately and using structured output formats can cut output tokens by 50-70%. - Finally, check for non-English text. Arabic, Chinese, Japanese, and Korean tokenize at roughly 2-4x the token count of English for the same semantic content. If your user base is multilingual, this can be a hidden cost multiplier.
What is the 'lost in the middle' problem, and how does it affect real-world LLM application design?
What is the 'lost in the middle' problem, and how does it affect real-world LLM application design?
Strong Answer:
- The “lost in the middle” finding, from Liu et al. in 2023, showed that LLMs have a U-shaped recall curve across their context window. They are best at remembering and using information placed at the very beginning and very end of the context, while information buried in the middle gets significantly less attention. In their experiments, accuracy dropped by as much as 20% for information placed in the middle versus the edges.
- This has direct implications for RAG pipeline design. If you retrieve 10 document chunks and concatenate them in arbitrary order, the chunks in positions 4-7 are essentially second-class citizens. The practical fix is to re-rank your chunks by relevance and then interleave them: place the most relevant at position 1, the second-most relevant at the last position, the third-most relevant at position 2, and so on. This “edges-first” ordering consistently improves answer quality in my experience.
- It also affects system prompt design. If your system prompt is very long — say 2,000 tokens of instructions — the instructions in the middle tend to be followed less reliably. I have seen teams restructure their system prompts to put the most critical behavioral instructions at the very beginning and repeat the single most important constraint at the very end, right before the user message.
- The deeper reason this happens is related to how positional encodings and attention patterns are learned during training. The model sees the beginning of sequences and the most recent tokens very frequently, so attention heads learn strong patterns for those positions. The middle positions are more variable during training and thus get weaker learned attention.
Compare temperature, top-p, and top-k sampling. When would you use each, and what are the failure modes?
Compare temperature, top-p, and top-k sampling. When would you use each, and what are the failure modes?
Strong Answer:
- Temperature scales the logits before the softmax operation. A temperature of 0 makes the distribution extremely peaked — essentially greedy decoding where the highest-probability token always wins. A temperature of 1.0 preserves the original distribution. Above 1.0, the distribution flattens and unlikely tokens get a meaningful chance of being selected. The failure mode of high temperature is incoherent output — at temperature 2.0, the model can produce grammatically valid but semantically nonsensical text because it is sampling from the tail of the distribution too often.
- Top-p (nucleus sampling) takes a different approach: instead of scaling probabilities, it dynamically selects the smallest set of tokens whose cumulative probability exceeds p. So top-p of 0.9 means “consider only the tokens that make up the top 90% of the probability mass.” This adapts naturally to the model’s confidence — when the model is very certain (one token has 95% probability), top-p 0.9 gives you essentially just that one token. When the model is uncertain (many tokens with similar probabilities), top-p 0.9 gives you a wider selection. The failure mode is that very low top-p values (like 0.1) can create repetitive, degenerate text because you are restricting the vocabulary too aggressively.
- Top-k is the simplest: only consider the top k tokens by probability, regardless of their actual probability values. Top-k of 50 means “pick from the 50 most likely tokens.” The weakness is that it is not adaptive. Sometimes the top 50 tokens are all reasonable continuations, sometimes only 3 are. Top-k treats both situations identically.
- In production, I typically use temperature 0 for any deterministic task — code generation, structured extraction, classification. For conversational applications, I use temperature 0.7 with top-p 0.9, which gives natural variation without risking incoherence. I almost never use top-k alone because top-p is strictly more adaptive. The one exception is when combining top-k with temperature for creative applications — top-k 100 with temperature 1.2 gives wide but not unbounded creativity.
seed parameter to address this — when you set both temperature 0 and a fixed seed, you get a system_fingerprint in the response that tells you which model version served the request. If the fingerprint changes, your output may differ. In production, when I need true reproducibility — for example, in a test suite or an audit log — I use temperature 0, a fixed seed, and I cache the response keyed on the input plus the system fingerprint. If the fingerprint changes, I invalidate the cache and accept that the output may have shifted. For compliance-sensitive applications, I also log the full request and response so that any output can be reproduced or at least explained.