Managing context windows effectively is critical for LLM applications that handle long documents, conversations, or complex queries.
Think of a context window like a desk. You can only spread out so many papers before things start falling off the edge. A 128K-token context window sounds enormous, but a 50-page PDF eats half of it, leaving little room for the conversation history and the system prompt. Every strategy in this chapter is about deciding which papers stay on the desk, which get filed in a summary drawer, and which get thrown away, all without losing the information your LLM needs to give a good answer.
Context compression is the art of saying the same thing in fewer tokens. It is like writing a good executive summary: the CEO doesn’t need the full 40-page report, just the parts that matter for their decision. Similarly, your LLM doesn’t need every sentence from a retrieved document — it needs the sentences relevant to the user’s question. The two approaches below represent different trade-offs: LLMLingua does mechanical compression (fast, no API calls), while extractive compression uses an LLM to pick the best sentences (smarter, but costs a small API call).
from typing import Optional

from llmlingua import PromptCompressor


class ContextCompressor:
    """Compress context while preserving meaning"""

    def __init__(
        self,
        model_name: str = "microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
        target_ratio: float = 0.5
    ):
        self.compressor = PromptCompressor(
            model_name=model_name,
            use_llmlingua2=True
        )
        self.target_ratio = target_ratio

    def compress(
        self,
        context: str,
        question: Optional[str] = None,
        rate: Optional[float] = None
    ) -> dict:
        """Compress context text"""
        result = self.compressor.compress_prompt(
            context,
            instruction=question or "",
            question=question or "",
            rate=rate or self.target_ratio,
            condition_compare=True,
            condition_in_question="after"
        )
        return {
            "compressed": result["compressed_prompt"],
            "original_tokens": result["origin_tokens"],
            "compressed_tokens": result["compressed_tokens"],
            "ratio": result["ratio"]
        }


# Usage
compressor = ContextCompressor(target_ratio=0.3)

long_context = """Machine learning is a subset of artificial intelligence
that enables systems to learn and improve from experience without being
explicitly programmed. It focuses on developing algorithms that can access
data and use it to learn for themselves..."""

result = compressor.compress(
    context=long_context,
    question="What is machine learning?"
)

# "ratio" is passed through from LLMLingua unchanged. It is printed without a
# numeric format spec because the library may return it as a preformatted string.
print(f"Compression ratio: {result['ratio']}")
print(f"Original: {result['original_tokens']} tokens")
print(f"Compressed: {result['compressed_tokens']} tokens")
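The LLM-based extractive approach mentioned above has no code in this section. As a rough sketch of the idea, the hypothetical `extractive_compress` below splits the context into sentences, scores each against the question, and keeps the best ones in their original order until a budget is spent. The `keyword_overlap` scorer is only a stand-in so the example runs offline, and the budget is counted in words rather than tokens for simplicity; in practice you would pass a `score_fn` backed by an LLM or embedding call.

```python
import re
from typing import Callable, List


def keyword_overlap(sentence: str, question: str) -> float:
    """Stand-in scorer: fraction of question words found in the sentence."""
    q_words = set(re.findall(r"\w+", question.lower()))
    s_words = set(re.findall(r"\w+", sentence.lower()))
    return len(q_words & s_words) / len(q_words) if q_words else 0.0


def extractive_compress(
    context: str,
    question: str,
    max_words: int = 40,
    score_fn: Callable[[str, str], float] = keyword_overlap,
) -> str:
    """Keep the highest-scoring sentences, in original order, within max_words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    # Rank sentence indices by score, best first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_fn(sentences[i], question),
        reverse=True,
    )
    kept: List[int] = []
    used = 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_words:
            kept.append(i)
            used += n
    # Restore document order so the result still reads coherently.
    return " ".join(sentences[i] for i in sorted(kept))


doc = (
    "Machine learning is a subset of artificial intelligence. "
    "The office cafeteria serves lunch at noon. "
    "Machine learning systems improve from experience."
)
compressed = extractive_compress(doc, "What is machine learning?", max_words=15)
print(compressed)
```

The off-topic cafeteria sentence scores zero and is the one that no longer fits the budget, so it is dropped.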
The sliding window is the simplest memory strategy: keep the last N messages, drop everything older. It is the “goldfish memory” approach, and it works surprisingly well for many chatbot use cases. The key design decision is what to preserve: always keep the system prompt (it defines behavior), always keep the most recent messages (they carry the current intent), and let the middle messages fall off as the window slides.
A practical pitfall: if the user referenced something from 15 messages ago (“like I said earlier about the budget”), a pure sliding window loses that context. The summarization strategies later in this chapter solve that problem.
from typing import List, Optional
from dataclasses import dataclass

# TokenCounter is not defined in this section. The code assumes a helper that
# provides count(), count_messages(), and split_by_tokens().


@dataclass
class WindowConfig:
    max_tokens: int = 4000        # Total budget for all messages
    overlap_tokens: int = 200     # Not used in chat; relevant for document chunking
    preserve_system: bool = True  # System prompt is sacred -- never drop it
    preserve_recent: int = 5      # Always keep last N messages (the "working memory")


class SlidingWindowManager:
    """Manage conversation with sliding window"""

    def __init__(self, config: Optional[WindowConfig] = None):
        self.config = config or WindowConfig()
        self.counter = TokenCounter()
        self.messages: List[dict] = []
        self.system_message: Optional[dict] = None

    def add_message(self, role: str, content: str):
        """Add message and apply window if needed"""
        message = {"role": role, "content": content}
        if role == "system":
            self.system_message = message
        else:
            self.messages.append(message)
        self._apply_window()

    def _apply_window(self):
        """Trim messages to fit window"""
        if not self.messages:
            return

        # Calculate current token count
        all_messages = self._get_all_messages()
        total_tokens = self.counter.count_messages(all_messages)
        if total_tokens <= self.config.max_tokens:
            return

        # Split into trimmable (older) and preserved (recent) messages.
        # An explicit index avoids the [-0:] slicing pitfall when
        # preserve_recent is 0.
        split = max(len(self.messages) - self.config.preserve_recent, 0)
        trimmable = self.messages[:split]
        preserved = self.messages[split:]

        # Remove oldest messages until within limit
        while trimmable and total_tokens > self.config.max_tokens:
            trimmable.pop(0)
            self.messages = trimmable + preserved
            all_messages = self._get_all_messages()
            total_tokens = self.counter.count_messages(all_messages)

    def _get_all_messages(self) -> List[dict]:
        messages = []
        if self.system_message:
            messages.append(self.system_message)
        messages.extend(self.messages)
        return messages

    def get_messages(self) -> List[dict]:
        return self._get_all_messages()

    def get_token_count(self) -> int:
        return self.counter.count_messages(self._get_all_messages())


# Chunked processing for long documents
class ChunkedProcessor:
    """Process long documents in chunks with overlap"""

    def __init__(
        self,
        max_chunk_tokens: int = 4000,
        overlap_tokens: int = 200
    ):
        self.max_chunk_tokens = max_chunk_tokens
        self.overlap_tokens = overlap_tokens
        self.counter = TokenCounter()

    def process_document(
        self,
        document: str,
        process_fn,
        aggregate_fn=None
    ) -> List:
        """Process document in chunks"""
        chunks = self.counter.split_by_tokens(
            document,
            self.max_chunk_tokens,
            self.overlap_tokens
        )
        results = []
        for i, chunk in enumerate(chunks):
            # process_fn receives the chunk plus its position in the document
            result = process_fn(chunk, chunk_index=i, total_chunks=len(chunks))
            results.append(result)
        if aggregate_fn:
            return aggregate_fn(results)
        return results
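The classes above call a `TokenCounter` helper that this section does not define. If you need something to run them against, a minimal stand-in might look like the sketch below. The method names are inferred from the calls above; the whitespace tokenizer and the 4-token per-message overhead are assumptions chosen for illustration, not real model tokenization. Swap in the tokenizer that matches your model before relying on the counts.

```python
from typing import Callable, List


class TokenCounter:
    """Illustrative stand-in: counts whitespace-separated words as tokens."""

    def __init__(
        self,
        tokenize: Callable[[str], List[str]] = str.split,
        per_message_overhead: int = 4,  # assumed value, not model-exact
    ):
        self.tokenize = tokenize
        self.per_message_overhead = per_message_overhead

    def count(self, text: str) -> int:
        return len(self.tokenize(text))

    def count_messages(self, messages: List[dict]) -> int:
        return sum(
            self.count(m["content"]) + self.per_message_overhead
            for m in messages
        )

    def split_by_tokens(
        self, text: str, max_tokens: int, overlap_tokens: int = 0
    ) -> List[str]:
        """Split text into chunks of at most max_tokens, overlapping by overlap_tokens."""
        if overlap_tokens >= max_tokens:
            raise ValueError("overlap_tokens must be smaller than max_tokens")
        tokens = self.tokenize(text)
        step = max_tokens - overlap_tokens
        chunks = []
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + max_tokens]))
            if start + max_tokens >= len(tokens):
                break  # the last chunk already reaches the end of the text
        return chunks


counter = TokenCounter()
chunks = counter.split_by_tokens("a b c d e f g h i j", max_tokens=4, overlap_tokens=1)
print(chunks)
```

Each chunk repeats the last token of the previous one, which is the overlap that keeps sentences spanning a boundary visible to both chunks.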
Summarization is how you fit a book into a context window that only holds a chapter. The two approaches below represent different philosophies: hierarchical summarization works bottom-up (summarize each section, then summarize the summaries), while map-reduce works in two passes (extract key points, then synthesize). Hierarchical is better for preserving structure; map-reduce is better when you have a specific question and want to focus the summary.
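No code for these two approaches appears in this section, so here is a minimal sketch of the hierarchical (bottom-up) variant under stated assumptions. `hierarchical_summarize`, `summarize_fn`, and `group_size` are hypothetical names; `summarize_fn` stands in for whatever LLM call you use, and the demo passes a trivial function that keeps the first few words so the example runs without an API key.

```python
from typing import Callable, List


def hierarchical_summarize(
    sections: List[str],
    summarize_fn: Callable[[str], str],
    group_size: int = 3,
) -> str:
    """Summarize each section, then repeatedly summarize groups of summaries."""
    if not sections:
        return ""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    # Level 0: one summary per section.
    level = [summarize_fn(s) for s in sections]
    # Higher levels: merge neighbouring summaries until one remains.
    while len(level) > 1:
        groups = [level[i:i + group_size] for i in range(0, len(level), group_size)]
        level = [summarize_fn("\n".join(g)) for g in groups]
    return level[0]


# Stand-in summarizer: keeps the first 5 words. Replace with a real LLM call.
def first_words(text: str) -> str:
    return " ".join(text.split()[:5])


sections = [
    f"Section {i} covers topic {i} in great and exhaustive detail"
    for i in range(1, 8)
]
final_summary = hierarchical_summarize(sections, first_words, group_size=3)
print(final_summary)
```

Grouping neighbours keeps the document's structure visible at every level, which is the property the text attributes to the hierarchical approach.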
This is where it all comes together. A real chatbot conversation can run for dozens of turns, easily blowing through any context window. The ConversationMemory class below implements a progressive summarization strategy: recent messages are kept verbatim (they carry nuance and exact wording), while older messages are compressed into a running summary. Think of it like human memory: you remember the last few minutes in vivid detail but last week is compressed into “we talked about the project timeline and agreed on March.”
The critical subtlety is the _merge_summaries method: as old summaries get merged with new ones, information slowly degrades. In practice, this works well for conversational context but is not reliable for exact figures or commitments. If your use case requires perfect recall of specific facts, store them in a structured side-channel (a dictionary of key facts) rather than relying on summaries.
from typing import List, Optional
from dataclasses import dataclass
from datetime import datetime

from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
client = OpenAI()

# TokenCounter is not defined in this section. The code assumes a helper that
# provides count().


@dataclass
class ConversationTurn:
    role: str
    content: str
    timestamp: datetime
    token_count: int
    summary: Optional[str] = None


class ConversationMemory:
    """Manage long conversations with summarization."""

    def __init__(
        self,
        max_tokens: int = 4000,
        summary_threshold: int = 2000,
        keep_recent: int = 4
    ):
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold
        self.keep_recent = keep_recent
        self.counter = TokenCounter()
        self.turns: List[ConversationTurn] = []
        self.running_summary: str = ""
        self.system_message: Optional[str] = None

    def set_system(self, content: str):
        self.system_message = content

    def add_turn(self, role: str, content: str):
        """Add a conversation turn"""
        turn = ConversationTurn(
            role=role,
            content=content,
            timestamp=datetime.now(),
            token_count=self.counter.count(content)
        )
        self.turns.append(turn)

        # Check if summarization needed
        self._maybe_summarize()

    def _maybe_summarize(self):
        """Summarize old turns if needed"""
        total = self._calculate_total_tokens()
        if total <= self.max_tokens:
            return

        # Summarize older turns. An explicit index avoids the [-0:] slicing
        # pitfall when keep_recent is 0.
        split = max(len(self.turns) - self.keep_recent, 0)
        to_summarize = self.turns[:split]
        if not to_summarize:
            return

        # Create summary
        summary_text = self._summarize_turns(to_summarize)

        # Update state
        self.running_summary = self._merge_summaries(
            self.running_summary,
            summary_text
        )
        self.turns = self.turns[split:]

    def _calculate_total_tokens(self) -> int:
        total = 0
        if self.system_message:
            total += self.counter.count(self.system_message)
        if self.running_summary:
            total += self.counter.count(self.running_summary)
        for turn in self.turns:
            total += turn.token_count
        return total

    def _summarize_turns(self, turns: List[ConversationTurn]) -> str:
        """Summarize a list of turns"""
        conversation = "\n".join([
            f"{t.role}: {t.content}" for t in turns
        ])
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation, preserving key information, decisions, and context needed for continuity."
                },
                {"role": "user", "content": conversation}
            ],
            temperature=0.3
        )
        return response.choices[0].message.content

    def _merge_summaries(self, old: str, new: str) -> str:
        """Merge old and new summaries"""
        if not old:
            return new
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Merge these two conversation summaries into one coherent summary."
                },
                {"role": "user", "content": f"Previous summary:\n{old}\n\nNew summary:\n{new}"}
            ],
            temperature=0.3
        )
        return response.choices[0].message.content

    def get_messages(self) -> List[dict]:
        """Get messages for API call"""
        messages = []
        if self.system_message:
            messages.append({
                "role": "system",
                "content": self.system_message
            })
        if self.running_summary:
            messages.append({
                "role": "system",
                "content": f"Previous conversation summary: {self.running_summary}"
            })
        for turn in self.turns:
            messages.append({
                "role": turn.role,
                "content": turn.content
            })
        return messages
Dynamic context selection is the strategy you use in RAG pipelines: you have 20 retrieved chunks but only room for 5 in the context window. The approach is essentially a knapsack problem — maximize relevance within a fixed token budget. The reserve_tokens parameter is important and often forgotten: you need to leave room for the user’s question AND the model’s response, not just the context.
from typing import Callable, List, Optional
from dataclasses import dataclass

# TokenCounter is not defined in this section; it is assumed to exist elsewhere.


@dataclass
class ContextItem:
    content: str
    relevance: float  # From your similarity search (0 to 1)
    token_count: int  # Pre-computed to avoid recounting
    source: str       # Track provenance for citations


class DynamicContextManager:
    """Select relevant context within token budget.

    Pitfall: Don't just stuff the top-K results in. A 0.95 relevance chunk
    with 2000 tokens might be less valuable than three 0.88 chunks at 400
    tokens each -- you get more coverage for the same budget.
    """

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.counter = TokenCounter()

    def select_context(
        self,
        query: str,
        items: List[ContextItem],
        reserve_tokens: int = 500  # Reserve for query and response
    ) -> List[ContextItem]:
        """Select context items within budget"""
        available_tokens = self.max_tokens - reserve_tokens

        # Sort by relevance
        sorted_items = sorted(items, key=lambda x: x.relevance, reverse=True)

        selected = []
        used_tokens = 0
        for item in sorted_items:
            # Skip items that do not fit; smaller ones later may still fit
            if used_tokens + item.token_count <= available_tokens:
                selected.append(item)
                used_tokens += item.token_count
        return selected

    def build_context(
        self,
        query: str,
        items: List[ContextItem],
        format_fn: Optional[Callable[[List[ContextItem]], str]] = None
    ) -> str:
        """Build context string from selected items"""
        selected = self.select_context(query, items)
        if format_fn:
            return format_fn(selected)

        # Default formatting
        context_parts = []
        for item in selected:
            context_parts.append(f"[Source: {item.source}]\n{item.content}")
        return "\n\n---\n\n".join(context_parts)
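The docstring's pitfall is worth making concrete: the greedy loop above ranks purely by relevance, so one long chunk can absorb most of the budget. One possible alternative, sketched here with a hypothetical `select_by_density` helper, ranks by relevance per token instead. This is a heuristic rather than an optimal knapsack solver, and the `Chunk` type is a simplified stand-in for `ContextItem`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    source: str
    relevance: float
    token_count: int


def select_by_density(items: List[Chunk], budget: int) -> List[Chunk]:
    """Greedy pick by relevance-per-token until the budget is spent."""
    ranked = sorted(
        items,
        key=lambda c: c.relevance / max(c.token_count, 1),  # guard against 0 tokens
        reverse=True,
    )
    selected: List[Chunk] = []
    used = 0
    for chunk in ranked:
        if used + chunk.token_count <= budget:
            selected.append(chunk)
            used += chunk.token_count
    return selected


items = [
    Chunk("long.pdf", 0.95, 2000),
    Chunk("a.md", 0.88, 400),
    Chunk("b.md", 0.88, 400),
    Chunk("c.md", 0.88, 400),
]
picked = select_by_density(items, budget=2000)
print([c.source for c in picked])
```

With a 2,000-token budget, relevance-first ordering would select only `long.pdf`, while density ordering selects the three shorter chunks. Density ranking has its own bias toward short fragments, so treat it as one option to evaluate rather than a default.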
Choosing the wrong context management strategy wastes tokens on the cheap end and loses critical information on the expensive end. Use this decision table.
| Scenario | Recommended Strategy | Why | Watch Out For |
|---|---|---|---|
| Chatbot with short sessions (under 10 turns) | Sliding window, keep all | No compression needed; the full history fits | Users who paste large blocks of text in a single message |
| Chatbot with long sessions (50+ turns) | Sliding window + progressive summarization | Keeps recent detail while preserving older context in compressed form | Summary drift: facts from turn 5 may degrade after 3 summarization passes |
| RAG with many retrieved chunks | Dynamic context selection | Maximizes relevance within a fixed token budget | Accidentally filtering out the one chunk that contains the answer |
| Processing a 100-page PDF | Chunked processing + map-reduce summarization | Document is too large for any single context window | Lost cross-references (“as mentioned in Chapter 3”) when chunks are processed independently |
| Multi-document Q&A | Dynamic selection + extractive compression | Multiple documents compete for limited context space | Source attribution: compressed context loses provenance if you don’t track which doc each sentence came from |
| Structured data extraction from long forms | Sliding window with field-specific passes | Each pass focuses on extracting one field, so context is used efficiently | Fields that depend on each other (e.g., “same as billing address”) require a consolidation pass |
Decision flowchart for new projects:
1. Estimate your typical input size (tokens). If it fits within 50% of your model’s context window (leaving room for system prompt, history, and response), you don’t need context management yet. Ship without it.
2. If input exceeds 50% of the window, determine whether the excess is from conversation history or from retrieved/uploaded content.
3. For conversation history: start with a sliding window (simplest). Add summarization only when users report the bot “forgetting” things from earlier in the conversation.
4. For retrieved content: implement dynamic context selection with a token budget. Add compression only if your top-K chunks consistently exceed the budget.
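The steps above can be sketched as a small helper. The function name, parameters, and return strings are hypothetical; only the 50% threshold and the two branches come from the text.

```python
def choose_strategy(
    input_tokens: int,
    context_window: int,
    excess_source: str = "history",
) -> str:
    """Map the decision steps to a recommendation.

    excess_source is "history" for conversation turns or "content" for
    retrieved/uploaded material. It is only consulted when the input
    exceeds half of the context window.
    """
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    if input_tokens <= 0.5 * context_window:
        return "none: ship without context management"
    if excess_source == "history":
        return "sliding window; add summarization if users report forgetting"
    if excess_source == "content":
        return "dynamic context selection; add compression if top-K exceeds budget"
    raise ValueError(f"unknown excess_source: {excess_source!r}")


print(choose_strategy(30_000, 128_000))
print(choose_strategy(90_000, 128_000, "content"))
```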
The “as I mentioned earlier” problem. A user references something from 20 messages ago that has been summarized away. The summary may have lost the specific detail. Mitigation: maintain a structured “key facts” dictionary alongside the running summary. When the user says “my budget is 5000,” store `{"budget": "5000"}` in a side channel that never gets summarized.
Token counting mismatches across models. If you switch from GPT-4o to Claude mid-conversation, your token counts are wrong because the two models use different tokenizers. The o200k_base tokenizer for GPT-4o produces different counts than Claude’s tokenizer for the same text. Always count tokens with the tokenizer that matches your current model.
System prompt bloat. Developers keep adding instructions to the system prompt over time. A 2000-token system prompt in a 4000-token budget leaves only 2000 tokens for everything else. Audit your system prompt monthly. If it exceeds 500 tokens, ask whether each instruction is pulling its weight.
Context window ≠ effective context. Models perform worse on information buried in the middle of long contexts (the “lost in the middle” effect documented by Liu et al., 2023). Even if you have 128K tokens available, information at position 40K-80K gets less attention than information at the beginning or end. Place your most important context (system instructions, key facts) at the start, and the most recent user query at the end.
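One way to act on the last point is to reorder ranked chunks so the strongest ones sit at the edges of the context. The hypothetical `order_for_edges` below assumes its input is already sorted from most to least relevant and alternates items between the front and the back, which leaves the weakest items in the middle.

```python
from typing import List, TypeVar

T = TypeVar("T")


def order_for_edges(ranked: List[T]) -> List[T]:
    """Place the best items at the start and end, the weakest in the middle.

    `ranked` must be sorted from most to least relevant.
    """
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        # Even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(item)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]


ordered = order_for_edges(["r1", "r2", "r3", "r4", "r5"])
print(ordered)
```

Here `r1` lands first, `r2` lands last, and the lowest-ranked `r5` sits in the middle where attention is weakest.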
You have a RAG system that retrieves 20 chunks from a vector database, but your model's context window only fits 5. How do you decide which chunks to keep, and what are the failure modes of a naive top-K approach?
Strong Answer:
The naive approach is sorting by relevance score and taking the top 5. This fails in two common ways. First, the top 5 chunks might all come from the same section of the same document, giving you redundant coverage of one subtopic while missing others entirely. Second, a single highly relevant but 2,000-token chunk might crowd out three shorter chunks that together provide better coverage — you are solving a knapsack problem, not a sorting problem.
The production approach is diversity-aware selection. After ranking by relevance, apply Maximal Marginal Relevance (MMR): for each candidate chunk, discount its score based on how similar it is to chunks already selected. This naturally picks chunks that are both relevant and non-overlapping. A lambda of 0.7 is a good starting point for the relevance-vs-diversity trade-off.
Token budgeting is the second critical piece. Pre-compute token counts for every chunk (not character counts — they diverge by 10-30% from actual tokens). Then solve a greedy knapsack: pick the highest-scoring chunk that fits the remaining budget, repeat until full. Reserve at least 500 tokens for the question and 1,000-2,000 for the response.
A subtle failure mode: relevance scores from different queries are not comparable. A 0.85 score for a vague query might be less useful than a 0.72 score for a specific one. If you hard-code a threshold like “only include above 0.8,” you over-filter specific queries and under-filter vague ones. Use relative ranking, not absolute thresholds.
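To make the MMR step concrete, here is a minimal sketch assuming each chunk carries a precomputed embedding and a query relevance score. `mmr_select`, the `cosine` helper, and the toy 2-D vectors are illustrative assumptions; `lam=0.7` is the starting point suggested above.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    relevance: List[float],
    embeddings: List[Sequence[float]],
    k: int,
    lam: float = 0.7,
) -> List[int]:
    """Return indices of up to k chunks chosen by Maximal Marginal Relevance."""
    selected: List[int] = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            # Penalize similarity to the closest already-selected chunk.
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Chunks 0 and 1 are near-duplicates; chunk 2 covers a different subtopic.
relevance = [0.90, 0.89, 0.60]
embeddings = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
chosen = mmr_select(relevance, embeddings, k=2)
print(chosen)
```

Plain top-2 by relevance would return chunks 0 and 1, which say the same thing. MMR picks chunk 0 and then prefers the less relevant but non-overlapping chunk 2. To combine this with token budgeting, skip any candidate whose token count exceeds the remaining budget before scoring.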
Follow-up: Users report the system sometimes gives correct answers but cites the wrong chunk. How do you debug this?
This is a context position bias problem. LLMs disproportionately attend to the beginning and end of the context window (“lost in the middle” phenomenon). If chunk 1 is first and chunk 3 is buried in the middle, the model might synthesize from chunk 3 but attribute to chunk 1. The fix is placing the most relevant chunks at the beginning and end with less relevant ones in the middle, or randomizing order to eliminate systematic bias. For citations, ask the model to quote the exact sentence it cites, then verify that sentence actually appears in the claimed source chunk as a post-processing validation step.
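The post-processing check described above can be sketched as follows. `verify_citation` and the whitespace and case normalization are assumptions; stricter or fuzzier matching may suit your data better.

```python
import re
from typing import Dict


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citation(quote: str, claimed_source: str, chunks: Dict[str, str]) -> dict:
    """Check that a quoted sentence appears in the chunk the model cited."""
    needle = _normalize(quote)
    found_in = [
        chunk_id
        for chunk_id, text in chunks.items()
        if needle and needle in _normalize(text)
    ]
    return {
        "valid": claimed_source in found_in,
        "found_in": found_in,  # where the quote actually appears, if anywhere
    }


chunks = {
    "chunk_1": "Refunds are processed within 14 days.",
    "chunk_3": "Shipping is free for orders\nover $50.",
}
check = verify_citation("Shipping is free for orders over $50.", "chunk_1", chunks)
print(check)
```

An invalid result with a non-empty `found_in` indicates misattribution (the answer is grounded, the citation is wrong). An empty `found_in` suggests the model paraphrased or invented the quote.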
Your chatbot uses progressive summarization. A user says 'I told you my budget was exactly $47,500' but the summary only says 'discussed budget constraints.' How do you prevent this information loss?
Strong Answer:
This is the fundamental limitation of summarization-based memory: summaries are lossy compression, and specific numbers are exactly the kind of detail that gets lost. The model summarizes “$47,500” into “budget constraints” because it optimizes for brevity, not precision.
The solution is a dual-memory architecture. Layer one is the running summary (captures themes and context). Layer two is a structured fact store: a key-value dictionary that preserves specific data points such as numbers, dates, names, and commitments. After every user turn, run a cheap extraction call asking what specific facts the user mentioned. Store results like `{"budget": "$47,500", "deadline": "March 15"}`.
The fact store is injected into the system prompt separately from the summary. It is never summarized or compressed — it persists verbatim for the entire conversation. The token cost is small (a few hundred tokens) but the reliability improvement is massive.
The architecture becomes: system prompt (never dropped) + fact store (never summarized) + running summary (compressed older context) + recent messages (verbatim last N turns). Each layer has different compression characteristics.
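A rough sketch of the four-layer assembly described above. The `LayeredMemory` class and its method names are hypothetical, and the fact extraction step (the cheap LLM call) is left out: facts are assumed to arrive already extracted.

```python
import json
from typing import Dict, List


class LayeredMemory:
    """System prompt + verbatim fact store + running summary + recent turns."""

    def __init__(self, system_prompt: str, keep_recent: int = 4):
        self.system_prompt = system_prompt
        self.keep_recent = keep_recent
        self.facts: Dict[str, str] = {}  # never summarized or compressed
        self.running_summary = ""        # lossy, compressed older context
        self.turns: List[dict] = []

    def remember(self, key: str, value: str) -> None:
        """Store or overwrite a fact; the latest value wins."""
        self.facts[key] = value

    def add_turn(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def get_messages(self) -> List[dict]:
        messages = [{"role": "system", "content": self.system_prompt}]
        if self.facts:
            messages.append({
                "role": "system",
                "content": "Known facts (verbatim): " + json.dumps(self.facts),
            })
        if self.running_summary:
            messages.append({
                "role": "system",
                "content": "Previous conversation summary: " + self.running_summary,
            })
        recent = self.turns[-self.keep_recent:] if self.keep_recent > 0 else []
        return messages + recent


memory = LayeredMemory("You are a helpful planner.")
memory.remember("budget", "$47,500")
memory.running_summary = "Discussed budget constraints."
memory.add_turn("user", "What can I afford?")
messages = memory.get_messages()
print(messages[1]["content"])
```

Even though the summary has degraded to “discussed budget constraints,” the exact figure still reaches the model through the fact layer.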
Follow-up: The fact store grows to 800 tokens over a 50-turn conversation. When and how do you prune it?
Never delete facts the user has referenced more than once. For the rest, apply recency-weighted relevance: full weight for facts from the last 10 turns, half weight for 10-30 turns ago, and facts older than 30 turns that have never been re-referenced are candidates for archival into the summary. The safeguard is that pruning never silently drops a fact: if “budget: $47,500” moves to the summary, explicitly tell the summarizer to include the exact figure. Also deduplicate: if the user corrected the budget from $47,500 to $50,000, keep only the latest value.
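The pruning rule above could be expressed roughly like this. The `Fact` fields and `archival_candidates` are hypothetical names; the turn cutoffs (10 and 30) and the weights come from the text. Candidates are returned rather than deleted so the caller can hand their exact values to the summarizer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Fact:
    value: str
    last_turn: int        # turn where the fact was stated or last referenced
    reference_count: int  # how many times the user has mentioned it


def recency_weight(age: int) -> float:
    """Full weight within 10 turns, half weight up to 30, zero beyond."""
    if age <= 10:
        return 1.0
    if age <= 30:
        return 0.5
    return 0.0


def archival_candidates(facts: Dict[str, Fact], current_turn: int) -> List[str]:
    """Keys that may move into the summary; re-referenced facts are never listed."""
    return [
        key
        for key, fact in facts.items()
        if fact.reference_count <= 1
        and recency_weight(current_turn - fact.last_turn) == 0.0
    ]


facts = {
    "budget": Fact("$50,000", last_turn=12, reference_count=3),  # old but re-referenced
    "venue": Fact("Lisbon", last_turn=5, reference_count=1),     # old, mentioned once
    "deadline": Fact("March 15", last_turn=40, reference_count=1),  # recent
}
to_archive = archival_candidates(facts, current_turn=50)
print(to_archive)
```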
You are processing 50-page legal contracts. The full document is 80,000 tokens but the model's context is 128K. Do you stuff it all in, or chunk and retrieve?
Strong Answer:
The instinct to “just stuff it all in since it fits” is the most common mistake. You still need room for the system prompt, the question, and the response, and the contract alone already fills about 62% of the window (80,000 of 128,000 tokens). Quality also degrades as utilization climbs toward 85-90%, partly because of the “lost in the middle” problem.
The answer depends on query patterns. For specific clause questions (“What does Section 4.2 say about indemnification?”), chunk-and-retrieve is clearly better. Embed each section, retrieve the 3-5 most relevant, and the model focuses on what matters. This costs less, runs faster, and is more accurate because the signal-to-noise ratio is higher.
For cross-referencing questions (“Are there contradictions between termination and renewal clauses?”), use a two-stage pipeline: first retrieve the 5-8 relevant sections, then present only those. You get cross-section reasoning without 40 irrelevant pages.
Full-document stuffing only makes sense for open-ended summarization where every section might be relevant. Even then, hierarchical summarization (summarize each section, then summarize summaries) often produces better results because the model focuses on each section independently.
Follow-up: A lawyer says contract analysis requires exact original wording — any paraphrasing is unacceptable. How does this change your approach?
This eliminates summarization and forces retrieval-only with verbatim chunks. The key adaptation is chunking on section boundaries rather than fixed character counts, since legal clauses have specific structure. When asked about Section 4.2, retrieve it verbatim plus any cross-referenced sections and the definitions section (legal text depends heavily on defined terms). Add a citation layer showing exactly which section, page, and paragraph each piece comes from so lawyers can verify against the original.
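Chunking on section boundaries might look like the sketch below. The heading pattern (`Section N.N` at the start of a line) is an assumption about how the contract is formatted; real contracts vary and usually need a pattern tuned to the document.

```python
import re
from typing import Dict

# Assumed heading format: "Section 4.2 ..." at the start of a line.
SECTION_HEADING = re.compile(r"^Section (\d+(?:\.\d+)*)\b", re.MULTILINE)


def split_by_section(contract: str) -> Dict[str, str]:
    """Map section numbers to their verbatim text, split at heading lines."""
    matches = list(SECTION_HEADING.finditer(contract))
    sections: Dict[str, str] = {}
    for i, match in enumerate(matches):
        # Each section runs until the next heading or the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(contract)
        sections[match.group(1)] = contract[match.start():end].strip()
    return sections


contract = (
    "Section 4.1 Term\nThis Agreement lasts two years.\n"
    "Section 4.2 Indemnification\n"
    "Each party shall indemnify the other as set out in Section 9.\n"
    "Section 9 Definitions\n\"Party\" means a signatory.\n"
)
sections = split_by_section(contract)
print(sections["4.2"])
```

The inline mention of “Section 9” inside 4.2 is not treated as a boundary because it does not start a line, so the cross-reference stays in the verbatim chunk where a later step can follow it. Text before the first heading is not captured by this sketch.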
Explain the difference between LLMLingua-style mechanical compression and LLM-based extractive compression. When would you choose each?
Strong Answer:
LLMLingua uses a small language model to remove tokens that contribute least to meaning — intelligent truncation at the token level. It is fast (runs locally, no API call), cheap, and predictable. The output reads oddly to humans but retains information the LLM needs.
LLM-based extractive compression selects the most relevant sentences given a specific query. It produces clean, readable output but costs an API call and adds latency. It is smarter because it can prioritize based on the query.
Choose LLMLingua for compressing large contexts quickly and cheaply (e.g., 10 retrieved documents before a RAG call) when readability does not matter. Choose extractive compression when output might be shown to users or when query-specific relevance matters more than uniform compression.
The fundamental risk of any compression: you irreversibly discard information before the model sees it. If the compression removes a critical sentence, no prompt engineering recovers it. Mitigation is conservative compression ratios (keep 50-70%) and always compressing with the user’s query as context.
Follow-up: You compress to 50% and answer quality drops 15%. How do you determine whether compression or retrieval is the bottleneck?
Run an ablation on 100 queries with four conditions: (1) uncompressed top-5 chunks, (2) compressed top-5 chunks, (3) top-10 chunks compressed to the same token budget as condition 1, (4) manually selected gold-standard context. If conditions 1 and 2 have similar quality, compression is fine and retrieval is the bottleneck. If condition 2 is much worse, compression is removing important information. If condition 4 is dramatically better than all others, retrieval is the bottleneck regardless. This ablation costs $5-10 in API calls and saves weeks of optimizing the wrong component.