Context Window Limits
Token Counting
Using tiktoken
Context Compression
Context compression is the art of saying the same thing in fewer tokens. It is like writing a good executive summary: the CEO doesn’t need the full 40-page report, just the parts that matter for their decision. Similarly, your LLM doesn’t need every sentence from a retrieved document — it needs the sentences relevant to the user’s question. The two approaches below represent different trade-offs: LLMLingua does mechanical compression (fast, no API calls), while extractive compression uses an LLM to pick the best sentences (smarter, but costs a small API call).LLMLingua Compression
Extractive Compression
Sliding Window Strategies
The sliding window is the simplest memory strategy: keep the last N messages, drop everything older. It is the “goldfish memory” approach, and it works surprisingly well for many chatbot use cases. The key design decision is what to preserve: always keep the system prompt (it defines behavior), always keep the most recent messages (they carry the current intent), and let the middle messages fall off as the window slides. A practical pitfall: if the user referenced something from 15 messages ago (“like I said earlier about the budget”), a pure sliding window loses that context. The summarization strategies later in this chapter solve that problem.Summarization Strategies
Summarization is how you fit a book into a context window that only holds a chapter. The two approaches below represent different philosophies: hierarchical summarization works bottom-up (summarize each section, then summarize the summaries), while map-reduce works in two passes (extract key points, then synthesize). Hierarchical is better for preserving structure; map-reduce is better when you have a specific question and want to focus the summary.Hierarchical Summarization
Conversation Memory Management
This is where it all comes together. A real chatbot conversation can run for dozens of turns, easily blowing through any context window. TheConversationMemory class below implements a progressive summarization strategy: recent messages are kept verbatim (they carry nuance and exact wording), while older messages are compressed into a running summary. Think of it like human memory — you remember the last few minutes in vivid detail but last week is compressed into “we talked about the project timeline and agreed on March.”
The critical subtlety is the _merge_summaries method: as old summaries get merged with new ones, information slowly degrades. In practice, this works well for conversational context but is not reliable for exact figures or commitments. If your use case requires perfect recall of specific facts, store them in a structured side-channel (a dictionary of key facts) rather than relying on summaries.
Dynamic Context Selection
Dynamic context selection is the strategy you use in RAG pipelines: you have 20 retrieved chunks but only room for 5 in the context window. The approach is essentially a knapsack problem — maximize relevance within a fixed token budget. Thereserve_tokens parameter is important and often forgotten: you need to leave room for the user’s question AND the model’s response, not just the context.
Strategy Selection Framework
Choosing the wrong context management strategy wastes tokens on the cheap end and loses critical information on the expensive end. Use this decision table.| Scenario | Recommended Strategy | Why | Watch Out For |
|---|---|---|---|
| Chatbot with short sessions (under 10 turns) | Sliding window, keep all | No compression needed — the full history fits | Users who paste large blocks of text in a single message |
| Chatbot with long sessions (50+ turns) | Sliding window + progressive summarization | Keeps recent detail while preserving older context in compressed form | Summary drift — facts from turn 5 may degrade after 3 summarization passes |
| RAG with many retrieved chunks | Dynamic context selection | Maximizes relevance within a fixed token budget | Accidentally filtering out the one chunk that contains the answer |
| Processing a 100-page PDF | Chunked processing + map-reduce summarization | Document is too large for any single context window | Lost cross-references (“as mentioned in Chapter 3”) when chunks are processed independently |
| Multi-document Q&A | Dynamic selection + extractive compression | Multiple documents compete for limited context space | Source attribution — compressed context loses provenance if you don’t track which doc each sentence came from |
| Structured data extraction from long forms | Sliding window with field-specific passes | Each pass focuses on extracting one field, so context is used efficiently | Fields that depend on each other (e.g., “same as billing address”) require a consolidation pass |
- Estimate your typical input size (tokens). If it fits within 50% of your model’s context window (leaving room for system prompt, history, and response), you don’t need context management yet. Ship without it.
- If input exceeds 50% of the window, determine whether the excess is from conversation history or from retrieved/uploaded content.
- For conversation history: start with a sliding window (simplest). Add summarization only when users report the bot “forgetting” things from earlier in the conversation.
- For retrieved content: implement dynamic context selection with a token budget. Add compression only if your top-K chunks consistently exceed the budget.
Edge Cases in Context Management
The “as I mentioned earlier” problem. A user references something from 20 messages ago that has been summarized away. The summary may have lost the specific detail. Mitigation: maintain a structured “key facts” dictionary alongside the running summary. When the user says “my budget is 5000," store `{"budget": "5000”}` in a side channel that never gets summarized. Token counting mismatches across models. If you switch from GPT-4o to Claude mid-conversation, your token counts are wrong — they use different tokenizers. Theo200k_base tokenizer for GPT-4o produces different counts than Claude’s tokenizer for the same text. Always count tokens with the tokenizer that matches your current model.
System prompt bloat. Developers keep adding instructions to the system prompt over time. A 2000-token system prompt in a 4000-token budget leaves only 2000 tokens for everything else. Audit your system prompt monthly. If it exceeds 500 tokens, ask whether each instruction is pulling its weight.
Context window =/= effective context. Models perform worse on information buried in the middle of long contexts (the “lost in the middle” effect documented by Liu et al., 2023). Even if you have 128K tokens available, information at position 40K-80K gets less attention than information at the beginning or end. Place your most important context (system instructions, key facts) at the start, and the most recent user query at the end.
Token Usage Summary
| Strategy | Use Case | Token Savings |
|---|---|---|
| Sliding Window | Long conversations | 50-70% |
| Summarization | Document processing | 60-80% |
| Compression | Context reduction | 30-70% |
| Dynamic Selection | RAG context | Variable |
| Chunked Processing | Long documents | N/A (enables) |
What is Next
LLM Testing
Interview Deep-Dive
You have a RAG system that retrieves 20 chunks from a vector database, but your model's context window only fits 5. How do you decide which chunks to keep, and what are the failure modes of a naive top-K approach?
You have a RAG system that retrieves 20 chunks from a vector database, but your model's context window only fits 5. How do you decide which chunks to keep, and what are the failure modes of a naive top-K approach?
- The naive approach is sorting by relevance score and taking the top 5. This fails in two common ways. First, the top 5 chunks might all come from the same section of the same document, giving you redundant coverage of one subtopic while missing others entirely. Second, a single highly relevant but 2,000-token chunk might crowd out three shorter chunks that together provide better coverage — you are solving a knapsack problem, not a sorting problem.
- The production approach is diversity-aware selection. After ranking by relevance, apply Maximal Marginal Relevance (MMR): for each candidate chunk, discount its score based on how similar it is to chunks already selected. This naturally picks chunks that are both relevant and non-overlapping. A lambda of 0.7 is a good starting point for the relevance-vs-diversity trade-off.
- Token budgeting is the second critical piece. Pre-compute token counts for every chunk (not character counts — they diverge by 10-30% from actual tokens). Then solve a greedy knapsack: pick the highest-scoring chunk that fits the remaining budget, repeat until full. Reserve at least 500 tokens for the question and 1,000-2,000 for the response.
- A subtle failure mode: relevance scores from different queries are not comparable. A 0.85 score for a vague query might be less useful than a 0.72 score for a specific one. If you hard-code a threshold like “only include above 0.8,” you over-filter specific queries and under-filter vague ones. Use relative ranking, not absolute thresholds.
Your chatbot uses progressive summarization. A user says 'I told you my budget was exactly $47,500' but the summary only says 'discussed budget constraints.' How do you prevent this information loss?
Your chatbot uses progressive summarization. A user says 'I told you my budget was exactly $47,500' but the summary only says 'discussed budget constraints.' How do you prevent this information loss?
- This is the fundamental limitation of summarization-based memory: summaries are lossy compression, and specific numbers are exactly the kind of detail that gets lost. The model summarizes “$47,500” into “budget constraints” because it optimizes for brevity, not precision.
- The solution is a dual-memory architecture. Layer one is the running summary (captures themes and context). Layer two is a structured fact store — a key-value dictionary that extracts and preserves specific data points: numbers, dates, names, commitments. After every user turn, run a cheap extraction call asking what specific facts the user mentioned. Store results like
{'"budget": "$47,500", "deadline": "March 15"'}. - The fact store is injected into the system prompt separately from the summary. It is never summarized or compressed — it persists verbatim for the entire conversation. The token cost is small (a few hundred tokens) but the reliability improvement is massive.
- The architecture becomes: system prompt (never dropped) + fact store (never summarized) + running summary (compressed older context) + recent messages (verbatim last N turns). Each layer has different compression characteristics.
You are processing 50-page legal contracts. The full document is 80,000 tokens but the model's context is 128K. Do you stuff it all in, or chunk and retrieve?
You are processing 50-page legal contracts. The full document is 80,000 tokens but the model's context is 128K. Do you stuff it all in, or chunk and retrieve?
- The instinct to “just stuff it all in since it fits” is the most common mistake. You still need room for the system prompt, the question, and the response. At 85-90% context utilization, quality degrades due to the “lost in the middle” problem.
- The answer depends on query patterns. For specific clause questions (“What does Section 4.2 say about indemnification?”), chunk-and-retrieve is clearly better. Embed each section, retrieve the 3-5 most relevant, and the model focuses on what matters. This costs less, runs faster, and is more accurate because the signal-to-noise ratio is higher.
- For cross-referencing questions (“Are there contradictions between termination and renewal clauses?”), use a two-stage pipeline: first retrieve the 5-8 relevant sections, then present only those. You get cross-section reasoning without 40 irrelevant pages.
- Full-document stuffing only makes sense for open-ended summarization where every section might be relevant. Even then, hierarchical summarization (summarize each section, then summarize summaries) often produces better results because the model focuses on each section independently.
Explain the difference between LLMLingua-style mechanical compression and LLM-based extractive compression. When would you choose each?
Explain the difference between LLMLingua-style mechanical compression and LLM-based extractive compression. When would you choose each?
- LLMLingua uses a small language model to remove tokens that contribute least to meaning — intelligent truncation at the token level. It is fast (runs locally, no API call), cheap, and predictable. The output reads oddly to humans but retains information the LLM needs.
- LLM-based extractive compression selects the most relevant sentences given a specific query. It produces clean, readable output but costs an API call and adds latency. It is smarter because it can prioritize based on the query.
- Choose LLMLingua for compressing large contexts quickly and cheaply (e.g., 10 retrieved documents before a RAG call) when readability does not matter. Choose extractive compression when output might be shown to users or when query-specific relevance matters more than uniform compression.
- The fundamental risk of any compression: you irreversibly discard information before the model sees it. If the compression removes a critical sentence, no prompt engineering recovers it. Mitigation is conservative compression ratios (keep 50-70%) and always compressing with the user’s query as context.