Managing context windows effectively is critical for LLM applications that handle long documents, conversations, or complex queries. Think of a context window like a desk. You can only spread out so many papers before things start falling off the edge. A 128K-token context window sounds enormous, but a 50-page PDF eats half of it, leaving little room for the conversation history and the system prompt. Every strategy in this chapter is about deciding which papers stay on the desk, which get filed in a summary drawer, and which get thrown away — all without losing the information your LLM needs to give a good answer.

Context Window Limits

Model                    Context Window    Output Limit
-----------------------------------------------------------
GPT-4o                   128,000          16,384
GPT-4o-mini              128,000          16,384
Claude 3.5 Sonnet        200,000          8,192
Gemini 1.5 Pro           2,000,000        8,192
Llama 3.3 70B            128,000          4,096
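
The two columns interact: on most chat APIs the response is generated inside the same window, so the room left for input is smaller than the headline number. A minimal sketch of that arithmetic, using the figures from the table above. The `input_budget` helper and its `safety_margin` parameter are illustrative assumptions, not part of any SDK, and some providers account for input and output limits separately, so check yours.

```python
from typing import Optional

# (context window, output limit) copied from the table above
MODEL_LIMITS = {
    "GPT-4o": (128_000, 16_384),
    "GPT-4o-mini": (128_000, 16_384),
    "Claude 3.5 Sonnet": (200_000, 8_192),
    "Gemini 1.5 Pro": (2_000_000, 8_192),
    "Llama 3.3 70B": (128_000, 4_096),
}

def input_budget(
    model: str,
    reserved_output: Optional[int] = None,
    safety_margin: int = 0,
) -> int:
    """Tokens left for system prompt, history, and context (hypothetical helper)."""
    window, output_limit = MODEL_LIMITS[model]
    # Reserve the full output limit unless the caller asks for less
    reserved = output_limit if reserved_output is None else min(reserved_output, output_limit)
    return max(window - reserved - safety_margin, 0)

print(input_budget("GPT-4o"))
print(input_budget("Claude 3.5 Sonnet", reserved_output=1_000))
```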

Token Counting

Using tiktoken

import tiktoken
from typing import List, Dict

class TokenCounter:
    """Accurate token counting for OpenAI models.
    
    Why this matters: Tokens are not words. "ChatGPT" is 1 token, but
    "antidisestablishmentarianism" is 6 tokens. Code and non-English text
    often tokenize worse than English prose. If you estimate by word count
    (dividing by 0.75), you will undercount by 10-30% and hit context
    limits unexpectedly. Always count with the actual tokenizer.
    """
    
    ENCODINGS = {
        "gpt-4o": "o200k_base",
        "gpt-4o-mini": "o200k_base",
        "gpt-4-turbo": "cl100k_base",
        "gpt-3.5-turbo": "cl100k_base",
        "text-embedding-3-small": "cl100k_base",
        "text-embedding-3-large": "cl100k_base"
    }
    
    def __init__(self, model: str = "gpt-4o"):
        encoding_name = self.ENCODINGS.get(model, "o200k_base")
        self.encoding = tiktoken.get_encoding(encoding_name)
        self.model = model
    
    def count(self, text: str) -> int:
        """Count tokens in text"""
        return len(self.encoding.encode(text))
    
    def count_messages(self, messages: List[Dict[str, str]]) -> int:
        """Count tokens in chat messages (includes overhead).
        
        Each message carries ~4 tokens of invisible overhead for the
        role/delimiter tokens. This adds up fast: a 20-message conversation
        wastes ~80 tokens just on framing. Factor this in when budgeting.
        """
        tokens = 0
        
        # Per-message overhead
        for message in messages:
            tokens += 4  # <|im_start|>{role}\n{content}<|im_end|>\n
            for key, value in message.items():
                tokens += self.count(str(value))
        
        tokens += 2  # Priming for assistant response
        
        return tokens
    
    def truncate_to_limit(
        self,
        text: str,
        max_tokens: int,
        from_end: bool = False
    ) -> str:
        """Truncate text to token limit"""
        tokens = self.encoding.encode(text)
        
        if len(tokens) <= max_tokens:
            return text
        
        if from_end:
            truncated = tokens[-max_tokens:]
        else:
            truncated = tokens[:max_tokens]
        
        return self.encoding.decode(truncated)
    
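    def split_by_tokens(
        self,
        text: str,
        chunk_size: int,
        overlap: int = 0
    ) -> List[str]:
        """Split text into chunks by token count.
        
        Consecutive chunks share `overlap` tokens. The overlap must be
        smaller than chunk_size, otherwise the window never advances.
        """
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        
        tokens = self.encoding.encode(text)
        chunks = []
        
        start = 0
        while start < len(tokens):
            end = start + chunk_size
            chunks.append(self.encoding.decode(tokens[start:end]))
            if end >= len(tokens):
                break  # Final chunk emitted; skip the redundant overlap-only tail
            start = end - overlap
        
        return chunks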

# Usage
counter = TokenCounter("gpt-4o")

text = "Your long document here..."
token_count = counter.count(text)
print(f"Token count: {token_count}")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
]
message_tokens = counter.count_messages(messages)
print(f"Message tokens: {message_tokens}")
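
The stepping logic behind overlapping chunks is easier to check on plain integers than on real token IDs. The sketch below is a standalone illustration, not part of `TokenCounter`: `chunk_token_ids` is a hypothetical helper that takes any sequence of IDs, so it runs without a tokenizer installed. It assumes the overlap must be strictly smaller than the chunk size, since otherwise the window would never advance.

```python
from typing import List, Sequence

def chunk_token_ids(
    tokens: Sequence[int],
    chunk_size: int,
    overlap: int = 0,
) -> List[List[int]]:
    """Split a token-ID sequence into chunks that share `overlap` IDs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # How far the window advances each time
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(list(tokens[start:end]))
        if end >= len(tokens):
            break  # The last chunk already reaches the end of the input
        start += step
    return chunks

# Integers stand in for token IDs
chunks = chunk_token_ids(list(range(10)), chunk_size=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk starts with the last ID of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.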

Context Compression

Context compression is the art of saying the same thing in fewer tokens. It is like writing a good executive summary: the CEO doesn’t need the full 40-page report, just the parts that matter for their decision. Similarly, your LLM doesn’t need every sentence from a retrieved document — it needs the sentences relevant to the user’s question. The two approaches below represent different trade-offs: LLMLingua does mechanical compression (fast, no API calls), while extractive compression uses an LLM to pick the best sentences (smarter, but costs a small API call).

LLMLingua Compression

from llmlingua import PromptCompressor

class ContextCompressor:
    """Compress context while preserving meaning"""
    
    def __init__(
        self,
        model_name: str = "microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
        target_ratio: float = 0.5
    ):
        self.compressor = PromptCompressor(
            model_name=model_name,
            use_llmlingua2=True
        )
        self.target_ratio = target_ratio
    
    def compress(
        self,
        context: str,
        question: str = None,
        rate: float = None
    ) -> dict:
        """Compress context text"""
        
        result = self.compressor.compress_prompt(
            context,
            instruction=question or "",
            question=question or "",
            rate=rate or self.target_ratio,
            condition_compare=True,
            condition_in_question="after"
        )
        
        return {
            "compressed": result["compressed_prompt"],
            "original_tokens": result["origin_tokens"],
            "compressed_tokens": result["compressed_tokens"],
            "ratio": result["ratio"]
        }

# Usage
compressor = ContextCompressor(target_ratio=0.3)

long_context = """
Machine learning is a subset of artificial intelligence that enables 
systems to learn and improve from experience without being explicitly 
programmed. It focuses on developing algorithms that can access data 
and use it to learn for themselves...
"""

result = compressor.compress(
    context=long_context,
    question="What is machine learning?"
)

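# The "ratio" field may be returned as a preformatted string (e.g. "2.0x"),
# so derive the percentage from the token counts instead of formatting it.
kept = result["compressed_tokens"] / result["original_tokens"]
print(f"Kept {kept:.2%} of the original tokens")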
print(f"Original: {result['original_tokens']} tokens")
print(f"Compressed: {result['compressed_tokens']} tokens")

Extractive Compression

from openai import OpenAI
from typing import List

client = OpenAI()

class ExtractiveCompressor:
    """Extract relevant sentences for compression"""
    
    def __init__(self, target_sentences: int = 5):
        self.target_sentences = target_sentences
    
    def compress(
        self,
        context: str,
        query: str
    ) -> str:
        """Extract most relevant sentences"""
        
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": f"""Extract the {self.target_sentences} most relevant sentences from the context that help answer the query. Return only the extracted sentences, one per line."""
                },
                {
                    "role": "user",
                    "content": f"Query: {query}\n\nContext:\n{context}"
                }
            ],
            temperature=0
        )
        
        return response.choices[0].message.content

# Usage
long_document = "Your long document here..."  # Placeholder: load the text to compress
extractor = ExtractiveCompressor(target_sentences=3)
compressed = extractor.compress(
    context=long_document,
    query="What are the key benefits?"
)
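
When an API call per query is too expensive, the same "keep only the relevant sentences" idea can be approximated locally. The sketch below swaps the LLM for a much cruder technique, word overlap with the query, so it is a baseline for comparison rather than a replacement. The function name, the regex sentence splitter, and the tiny stopword list are all assumptions made for illustration.

```python
import re
from typing import List

# Tiny illustrative stopword list (an assumption; use a real one in practice)
STOPWORDS = {"the", "a", "an", "of", "is", "are", "what", "in", "at", "to", "and"}

def _content_words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS

def extract_relevant_sentences(
    context: str,
    query: str,
    target_sentences: int = 3,
) -> List[str]:
    """Keep the sentences sharing the most content words with the query."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    query_words = _content_words(query)

    # Rank sentence indices by overlap; sorted() is stable, so ties keep document order
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: len(_content_words(sentences[i]) & query_words),
        reverse=True,
    )
    # Restore original order so the extract still reads coherently
    keep = sorted(ranked[:target_sentences])
    return [sentences[i] for i in keep]

context = (
    "Caching reduces latency. The office is in Berlin. "
    "Caching also lowers API cost. Lunch is at noon."
)
print(extract_relevant_sentences(context, "What are the benefits of caching?", 2))
```

Lexical overlap misses paraphrases ("speed" versus "latency"), which is exactly the gap the LLM-based extractor is paying to close.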

Sliding Window Strategies

The sliding window is the simplest memory strategy: keep the last N messages, drop everything older. It is the “goldfish memory” approach, and it works surprisingly well for many chatbot use cases. The key design decision is what to preserve: always keep the system prompt (it defines behavior), always keep the most recent messages (they carry the current intent), and let the middle messages fall off as the window slides. A practical pitfall: if the user referenced something from 15 messages ago (“like I said earlier about the budget”), a pure sliding window loses that context. The summarization strategies later in this chapter solve that problem.
from typing import List, Optional
from dataclasses import dataclass

@dataclass
class WindowConfig:
    max_tokens: int = 4000       # Total budget for all messages
    overlap_tokens: int = 200     # Not used in chat; relevant for document chunking
    preserve_system: bool = True  # System prompt is sacred -- never drop it
    preserve_recent: int = 5      # Always keep last N messages (the "working memory")

class SlidingWindowManager:
    """Manage conversation with sliding window"""
    
    def __init__(self, config: WindowConfig = None):
        self.config = config or WindowConfig()
        self.counter = TokenCounter()
        self.messages: List[dict] = []
        self.system_message: Optional[dict] = None
    
    def add_message(self, role: str, content: str):
        """Add message and apply window if needed"""
        message = {"role": role, "content": content}
        
        if role == "system":
            self.system_message = message
        else:
            self.messages.append(message)
        
        self._apply_window()
    
    def _apply_window(self):
        """Trim messages to fit window"""
        if not self.messages:
            return
        
        # Calculate current token count
        all_messages = self._get_all_messages()
        total_tokens = self.counter.count_messages(all_messages)
        
        if total_tokens <= self.config.max_tokens:
            return
        
        # Split into trimmable (older) and preserved (most recent) messages.
        # Computing the index explicitly avoids the [-0:] pitfall, which would
        # treat every message as preserved when preserve_recent is 0.
        split = max(len(self.messages) - self.config.preserve_recent, 0)
        trimmable = self.messages[:split]
        preserved = self.messages[split:]
        
        # Remove oldest messages until within limit
        while trimmable and total_tokens > self.config.max_tokens:
            trimmable.pop(0)
            self.messages = trimmable + preserved
            total_tokens = self.counter.count_messages(self._get_all_messages())
    
    def _get_all_messages(self) -> List[dict]:
        messages = []
        if self.system_message:
            messages.append(self.system_message)
        messages.extend(self.messages)
        return messages
    
    def get_messages(self) -> List[dict]:
        return self._get_all_messages()
    
    def get_token_count(self) -> int:
        return self.counter.count_messages(self._get_all_messages())

# Chunked processing for long documents
class ChunkedProcessor:
    """Process long documents in chunks with overlap"""
    
    def __init__(
        self,
        max_chunk_tokens: int = 4000,
        overlap_tokens: int = 200
    ):
        self.max_chunk_tokens = max_chunk_tokens
        self.overlap_tokens = overlap_tokens
        self.counter = TokenCounter()
    
    def process_document(
        self,
        document: str,
        process_fn,
        aggregate_fn = None
    ) -> List:
        """Process document in chunks"""
        
        chunks = self.counter.split_by_tokens(
            document,
            self.max_chunk_tokens,
            self.overlap_tokens
        )
        
        results = []
        for i, chunk in enumerate(chunks):
            result = process_fn(chunk, chunk_index=i, total_chunks=len(chunks))
            results.append(result)
        
        if aggregate_fn:
            return aggregate_fn(results)
        
        return results
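
The trimming rule from the start of this section (keep the system prompt, keep the last N messages, drop the oldest first) can be exercised without a tokenizer. This is a minimal sketch under stated assumptions: `trim_to_window` is a hypothetical standalone function, and `count_words` is a whitespace word count standing in for a real token counter.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def count_words(text: str) -> int:
    """Stand-in token counter: whitespace-separated words (an assumption)."""
    return len(text.split())

def trim_to_window(
    messages: List[Message],
    max_tokens: int,
    preserve_recent: int = 2,
    count: Callable[[str], int] = count_words,
) -> List[Message]:
    """Drop the oldest non-system messages until the budget is met."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # Explicit index so preserve_recent=0 means "nothing is protected"
    split = max(len(rest) - preserve_recent, 0)
    trimmable, preserved = rest[:split], rest[split:]

    def total(msgs: List[Message]) -> int:
        return sum(count(m["content"]) for m in msgs)

    while trimmable and total(system + trimmable + preserved) > max_tokens:
        trimmable.pop(0)  # Oldest message falls off first
    return system + trimmable + preserved

history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five six"},
    {"role": "user", "content": "seven eight nine"},
    {"role": "assistant", "content": "ten eleven twelve"},
]
trimmed = trim_to_window(history, max_tokens=9, preserve_recent=2)
print([m["content"] for m in trimmed])
```

Note that the protected messages are never dropped, so the result can still exceed the budget if the system prompt and the last N messages are large on their own.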

Summarization Strategies

Summarization is how you fit a book into a context window that only holds a chapter. The two approaches below represent different philosophies: hierarchical summarization works bottom-up (summarize each section, then summarize the summaries), while map-reduce works in two passes (extract key points, then synthesize). Hierarchical is better for preserving structure; map-reduce is better when you have a specific question and want to focus the summary.

Hierarchical Summarization

from openai import OpenAI
from typing import List

client = OpenAI()

class HierarchicalSummarizer:
    """Summarize long documents hierarchically"""
    
    def __init__(
        self,
        chunk_size: int = 4000,
        summary_ratio: float = 0.3
    ):
        self.chunk_size = chunk_size
        self.summary_ratio = summary_ratio
        self.counter = TokenCounter()
    
    def _summarize_chunk(self, chunk: str, max_tokens: int) -> str:
        """Summarize a single chunk"""
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Summarize the following text concisely while preserving key information."
                },
                {"role": "user", "content": chunk}
            ],
            max_tokens=max_tokens,
            temperature=0.3
        )
        return response.choices[0].message.content
    
    def summarize(self, document: str) -> str:
        """Hierarchically summarize document"""
        
        doc_tokens = self.counter.count(document)
        
        # If short enough, summarize directly
        if doc_tokens <= self.chunk_size:
            # Floor at 100 so very short inputs never request max_tokens=0
            target_tokens = max(int(doc_tokens * self.summary_ratio), 100)
            return self._summarize_chunk(document, target_tokens)
        
        # Split into chunks
        chunks = self.counter.split_by_tokens(document, self.chunk_size)
        
        # Summarize each chunk
        summaries = []
        for chunk in chunks:
            chunk_tokens = self.counter.count(chunk)
            target_tokens = int(chunk_tokens * self.summary_ratio)
            summary = self._summarize_chunk(chunk, max(target_tokens, 100))
            summaries.append(summary)
        
        # Combine summaries
        combined = "\n\n".join(summaries)
        
        # Recursively summarize if still too long
        if self.counter.count(combined) > self.chunk_size:
            return self.summarize(combined)
        
        return combined

# Map-Reduce Summarization
class MapReduceSummarizer:
    """Map-reduce style summarization"""
    
    def __init__(self, chunk_size: int = 4000):
        self.chunk_size = chunk_size
        self.counter = TokenCounter()
    
    def summarize(
        self,
        document: str,
        focus_query: str = None
    ) -> str:
        """Summarize with optional focus"""
        
        chunks = self.counter.split_by_tokens(document, self.chunk_size)
        
        # Map: Extract key points from each chunk
        key_points = []
        for chunk in chunks:
            system = "Extract key points from this text."
            if focus_query:
                system += f" Focus on information related to: {focus_query}"
            
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": system},
                    {"role": "user", "content": chunk}
                ],
                temperature=0.3
            )
            key_points.append(response.choices[0].message.content)
        
        # Reduce: Combine key points
        combined_points = "\n\n".join(key_points)
        
        reduce_prompt = "Synthesize these key points into a coherent summary."
        if focus_query:
            reduce_prompt += f" Focus on: {focus_query}"
        
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": reduce_prompt},
                {"role": "user", "content": combined_points}
            ],
            temperature=0.3
        )
        
        return response.choices[0].message.content
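
The recursive shape of hierarchical summarization is easier to test when the model call is replaced by a plain function. The sketch below is an illustration under assumptions, not the class above: `summarize_fn` is any callable you supply, word counts stand in for token counts, and `max_depth` is an added guard so the recursion stops even if the summaries fail to shrink.

```python
from typing import Callable

def hierarchical_summarize(
    text: str,
    summarize_fn: Callable[[str], str],
    chunk_words: int = 50,
    max_depth: int = 5,
) -> str:
    """Summarize chunks, then summarize the joined summaries if still too long."""
    words = text.split()
    # Base case: short enough, or depth budget exhausted
    if len(words) <= chunk_words or max_depth == 0:
        return summarize_fn(text)

    chunks = [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    combined = "\n\n".join(summarize_fn(chunk) for chunk in chunks)

    if len(combined.split()) > chunk_words:
        return hierarchical_summarize(combined, summarize_fn, chunk_words, max_depth - 1)
    return combined

def first_ten_words(chunk: str) -> str:
    """Stub summarizer for testing: keeps the first ten words."""
    return " ".join(chunk.split()[:10])

document = " ".join(f"w{i}" for i in range(200))
summary = hierarchical_summarize(document, first_ten_words)
print(len(summary.split()))  # 40: four chunks, ten words kept from each
```

In a real pipeline `summarize_fn` would wrap the chat completion call, and the depth guard turns a potential runaway loop of API calls into a bounded cost.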

Conversation Memory Management

This is where it all comes together. A real chatbot conversation can run for dozens of turns, easily blowing through any context window. The ConversationMemory class below implements a progressive summarization strategy: recent messages are kept verbatim (they carry nuance and exact wording), while older messages are compressed into a running summary. Think of it like human memory — you remember the last few minutes in vivid detail but last week is compressed into “we talked about the project timeline and agreed on March.” The critical subtlety is the _merge_summaries method: as old summaries get merged with new ones, information slowly degrades. In practice, this works well for conversational context but is not reliable for exact figures or commitments. If your use case requires perfect recall of specific facts, store them in a structured side-channel (a dictionary of key facts) rather than relying on summaries.
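from typing import List, Optional
from dataclasses import dataclass
from datetime import datetime

from openai import OpenAI

# TokenCounter is the class defined in the Token Counting section
client = OpenAI()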

@dataclass
class ConversationTurn:
    role: str
    content: str
    timestamp: datetime
    token_count: int
    summary: Optional[str] = None

class ConversationMemory:
    """Manage long conversations with summarization."""
    
    def __init__(
        self,
        max_tokens: int = 4000,
        summary_threshold: int = 2000,
        keep_recent: int = 4
    ):
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold
        self.keep_recent = keep_recent
        self.counter = TokenCounter()
        
        self.turns: List[ConversationTurn] = []
        self.running_summary: str = ""
        self.system_message: Optional[str] = None
    
    def set_system(self, content: str):
        self.system_message = content
    
    def add_turn(self, role: str, content: str):
        """Add a conversation turn"""
        turn = ConversationTurn(
            role=role,
            content=content,
            timestamp=datetime.now(),
            token_count=self.counter.count(content)
        )
        self.turns.append(turn)
        
        # Check if summarization needed
        self._maybe_summarize()
    
    def _maybe_summarize(self):
        """Summarize old turns if needed"""
        total = self._calculate_total_tokens()
        
        if total <= self.max_tokens:
            return
        
        # Summarize older turns; computing the index explicitly keeps this
        # correct even when keep_recent is 0
        split = max(len(self.turns) - self.keep_recent, 0)
        to_summarize = self.turns[:split]
        
        if not to_summarize:
            return
        
        # Create summary
        summary_text = self._summarize_turns(to_summarize)
        
        # Update state
        self.running_summary = self._merge_summaries(
            self.running_summary,
            summary_text
        )
        self.turns = self.turns[split:]
    
    def _calculate_total_tokens(self) -> int:
        total = 0
        if self.system_message:
            total += self.counter.count(self.system_message)
        if self.running_summary:
            total += self.counter.count(self.running_summary)
        for turn in self.turns:
            total += turn.token_count
        return total
    
    def _summarize_turns(self, turns: List[ConversationTurn]) -> str:
        """Summarize a list of turns"""
        conversation = "\n".join([
            f"{t.role}: {t.content}"
            for t in turns
        ])
        
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation, preserving key information, decisions, and context needed for continuity."
                },
                {"role": "user", "content": conversation}
            ],
            temperature=0.3
        )
        
        return response.choices[0].message.content
    
    def _merge_summaries(self, old: str, new: str) -> str:
        """Merge old and new summaries"""
        if not old:
            return new
        
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "Merge these two conversation summaries into one coherent summary."
                },
                {"role": "user", "content": f"Previous summary:\n{old}\n\nNew summary:\n{new}"}
            ],
            temperature=0.3
        )
        
        return response.choices[0].message.content
    
    def get_messages(self) -> List[dict]:
        """Get messages for API call"""
        messages = []
        
        if self.system_message:
            messages.append({
                "role": "system",
                "content": self.system_message
            })
        
        if self.running_summary:
            messages.append({
                "role": "system",
                "content": f"Previous conversation summary: {self.running_summary}"
            })
        
        for turn in self.turns:
            messages.append({
                "role": turn.role,
                "content": turn.content
            })
        
        return messages
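
The structured side-channel for exact facts can be as small as a dictionary that is rendered into its own message and never passed through the summarizer. A minimal sketch, assuming hypothetical names (`FactStore`, `build_messages`) and leaving out how facts get extracted from user turns:

```python
from typing import Dict, List

class FactStore:
    """Verbatim key facts that are never summarized (hypothetical helper)."""

    def __init__(self) -> None:
        self._facts: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        # A later value overwrites the earlier one, so corrections win
        self._facts[key] = value

    def __len__(self) -> int:
        return len(self._facts)

    def render(self) -> str:
        return "\n".join(f"- {key}: {value}" for key, value in self._facts.items())

def build_messages(
    system_prompt: str,
    facts: FactStore,
    running_summary: str,
    recent_turns: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Layer the prompt: system, then facts, then summary, then recent turns."""
    messages = [{"role": "system", "content": system_prompt}]
    if len(facts):
        messages.append({
            "role": "system",
            "content": "Known facts (use verbatim):\n" + facts.render(),
        })
    if running_summary:
        messages.append({
            "role": "system",
            "content": f"Previous conversation summary: {running_summary}",
        })
    messages.extend(recent_turns)
    return messages

facts = FactStore()
facts.set("budget", "$47,500")
facts.set("budget", "$50,000")  # User corrected the figure
facts.set("deadline", "March 15")

messages = build_messages(
    "You are a helpful assistant.",
    facts,
    "User is planning a project.",
    [{"role": "user", "content": "What was my budget?"}],
)
print(messages[1]["content"])
```

Because the fact message is rebuilt from the dictionary on every call, summary drift cannot touch it.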

Dynamic Context Selection

Dynamic context selection is the strategy you use in RAG pipelines: you have 20 retrieved chunks but only room for 5 in the context window. The approach is essentially a knapsack problem — maximize relevance within a fixed token budget. The reserve_tokens parameter is important and often forgotten: you need to leave room for the user’s question AND the model’s response, not just the context.
from typing import List, Dict
from dataclasses import dataclass

@dataclass
class ContextItem:
    content: str
    relevance: float   # From your similarity search (0 to 1)
    token_count: int   # Pre-computed to avoid recounting
    source: str        # Track provenance for citations

class DynamicContextManager:
    """Select relevant context within token budget.
    
    Pitfall: Don't just stuff the top-K results in. A 0.95 relevance chunk
    with 2000 tokens might be less valuable than three 0.88 chunks at 400
    tokens each -- you get more coverage for the same budget.
    """
    
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.counter = TokenCounter()
    
    def select_context(
        self,
        query: str,
        items: List[ContextItem],
        reserve_tokens: int = 500  # Reserve for query and response
    ) -> List[ContextItem]:
        """Select context items within budget"""
        
        available_tokens = self.max_tokens - reserve_tokens
        
        # Sort by relevance
        sorted_items = sorted(items, key=lambda x: x.relevance, reverse=True)
        
        selected = []
        used_tokens = 0
        
        for item in sorted_items:
            if used_tokens + item.token_count <= available_tokens:
                selected.append(item)
                used_tokens += item.token_count
        
        return selected
    
    def build_context(
        self,
        query: str,
        items: List[ContextItem],
        format_fn = None
    ) -> str:
        """Build context string from selected items"""
        
        selected = self.select_context(query, items)
        
        if format_fn:
            return format_fn(selected)
        
        # Default formatting
        context_parts = []
        for item in selected:
            context_parts.append(f"[Source: {item.source}]\n{item.content}")
        
        return "\n\n---\n\n".join(context_parts)
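
The pitfall in the class docstring (one long chunk crowding out several shorter ones) suggests ranking by relevance per token rather than by raw relevance. The sketch below shows that variant as one possible heuristic, not as the selection method above: `select_by_density` is a hypothetical function, and `ContextItem` is redefined here only so the example runs on its own.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ContextItem:
    content: str
    relevance: float
    token_count: int
    source: str

def select_by_density(items: List[ContextItem], budget: int) -> List[ContextItem]:
    """Greedy knapsack heuristic: best relevance-per-token first."""
    ranked = sorted(
        items,
        key=lambda item: item.relevance / max(item.token_count, 1),
        reverse=True,
    )
    selected: List[ContextItem] = []
    used = 0
    for item in ranked:
        if used + item.token_count <= budget:
            selected.append(item)
            used += item.token_count
    return selected

items = [
    ContextItem("long chunk", 0.95, 2000, "big"),
    ContextItem("short chunk a", 0.88, 400, "a"),
    ContextItem("short chunk b", 0.88, 400, "b"),
    ContextItem("short chunk c", 0.88, 400, "c"),
]
picked = select_by_density(items, budget=2000)
print([item.source for item in picked])  # ['a', 'b', 'c']
```

Sorting by raw relevance would spend the whole 2,000-token budget on the single long chunk. The density heuristic has its own bias toward tiny fragments, so a minimum chunk length is a sensible companion rule.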

Strategy Selection Framework

Choosing the wrong context management strategy wastes tokens on the cheap end and loses critical information on the expensive end. Use this decision table.
| Scenario | Recommended Strategy | Why | Watch Out For |
|---|---|---|---|
| Chatbot with short sessions (under 10 turns) | Sliding window, keep all | No compression needed: the full history fits | Users who paste large blocks of text in a single message |
| Chatbot with long sessions (50+ turns) | Sliding window + progressive summarization | Keeps recent detail while preserving older context in compressed form | Summary drift: facts from turn 5 may degrade after 3 summarization passes |
| RAG with many retrieved chunks | Dynamic context selection | Maximizes relevance within a fixed token budget | Accidentally filtering out the one chunk that contains the answer |
| Processing a 100-page PDF | Chunked processing + map-reduce summarization | Document is too large for any single context window | Lost cross-references (“as mentioned in Chapter 3”) when chunks are processed independently |
| Multi-document Q&A | Dynamic selection + extractive compression | Multiple documents compete for limited context space | Source attribution: compressed context loses provenance if you don’t track which doc each sentence came from |
| Structured data extraction from long forms | Sliding window with field-specific passes | Each pass focuses on extracting one field, so context is used efficiently | Fields that depend on each other (e.g., “same as billing address”) require a consolidation pass |

Decision flowchart for new projects:
  1. Estimate your typical input size (tokens). If it fits within 50% of your model’s context window (leaving room for system prompt, history, and response), you don’t need context management yet. Ship without it.
  2. If input exceeds 50% of the window, determine whether the excess is from conversation history or from retrieved/uploaded content.
  3. For conversation history: start with a sliding window (simplest). Add summarization only when users report the bot “forgetting” things from earlier in the conversation.
  4. For retrieved content: implement dynamic context selection with a token budget. Add compression only if your top-K chunks consistently exceed the budget.
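
The four steps above can be sketched as a small function. This is an illustration of the decision rule only: the function name, the `excess_source` labels, and the returned strategy strings are assumptions made for the example.

```python
def choose_strategy(
    typical_input_tokens: int,
    context_window: int,
    excess_source: str = "history",
) -> str:
    """Return a starting strategy following the 50% rule of thumb."""
    # Step 1: input fits in half the window, so no context management yet
    if typical_input_tokens <= 0.5 * context_window:
        return "none"
    # Steps 2-4: pick the simplest strategy for where the excess comes from
    if excess_source == "history":
        return "sliding_window"
    if excess_source == "retrieved":
        return "dynamic_selection"
    raise ValueError("excess_source must be 'history' or 'retrieved'")

print(choose_strategy(30_000, 128_000))               # none
print(choose_strategy(90_000, 128_000, "history"))    # sliding_window
print(choose_strategy(90_000, 128_000, "retrieved"))  # dynamic_selection
```

Summarization and compression are deliberately absent from the return values: per steps 3 and 4, they are added later, once the simpler strategy is shown to fall short.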

Edge Cases in Context Management

The “as I mentioned earlier” problem. A user references something from 20 messages ago that has been summarized away. The summary may have lost the specific detail. Mitigation: maintain a structured “key facts” dictionary alongside the running summary. When the user says “my budget is 5000,” store `{"budget": "5000"}` in a side channel that never gets summarized.

Token counting mismatches across models. If you switch from GPT-4o to Claude mid-conversation, your token counts are wrong because the two models use different tokenizers. The o200k_base tokenizer for GPT-4o produces different counts than Claude’s tokenizer for the same text. Always count tokens with the tokenizer that matches your current model.

System prompt bloat. Developers keep adding instructions to the system prompt over time. A 2000-token system prompt in a 4000-token budget leaves only 2000 tokens for everything else. Audit your system prompt monthly. If it exceeds 500 tokens, ask whether each instruction is pulling its weight.

Context window ≠ effective context. Models perform worse on information buried in the middle of long contexts (the “lost in the middle” effect documented by Liu et al., 2023). Even if you have 128K tokens available, information at position 40K-80K gets less attention than information at the beginning or end. Place your most important context (system instructions, key facts) at the start, and the most recent user query at the end.
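
One way to act on the “lost in the middle” effect when ordering retrieved chunks is to put the strongest chunks at the two edges of the context and let the weakest sink to the middle. A minimal sketch, assuming the input list is already sorted from most to least relevant; the function name is hypothetical.

```python
from typing import List, TypeVar

T = TypeVar("T")

def reorder_for_position_bias(chunks_by_relevance: List[T]) -> List[T]:
    """Alternate chunks between the front and the back of the context."""
    front: List[T] = []
    back: List[T] = []
    for index, chunk in enumerate(chunks_by_relevance):
        # Even ranks fill from the start, odd ranks fill from the end
        (front if index % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = reorder_for_position_bias(["r1", "r2", "r3", "r4", "r5"])
print(ordered)  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

The top-ranked chunk opens the context, the second-ranked chunk closes it, and the lowest-ranked chunk lands in the middle, where reduced attention costs the least.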

Token Usage Summary

Strategy              Use Case               Token Savings
-----------------------------------------------------------
Sliding Window        Long conversations     50-70%
Summarization         Document processing    60-80%
Compression           Context reduction      30-70%
Dynamic Selection     RAG context            Variable
Chunked Processing    Long documents         N/A (enables)

What is Next

LLM Testing

Learn testing strategies for LLM applications

Interview Deep-Dive

Strong Answer:
  • The naive approach is sorting by relevance score and taking the top 5. This fails in two common ways. First, the top 5 chunks might all come from the same section of the same document, giving you redundant coverage of one subtopic while missing others entirely. Second, a single highly relevant but 2,000-token chunk might crowd out three shorter chunks that together provide better coverage — you are solving a knapsack problem, not a sorting problem.
  • The production approach is diversity-aware selection. After ranking by relevance, apply Maximal Marginal Relevance (MMR): for each candidate chunk, discount its score based on how similar it is to chunks already selected. This naturally picks chunks that are both relevant and non-overlapping. A lambda of 0.7 is a good starting point for the relevance-vs-diversity trade-off.
  • Token budgeting is the second critical piece. Pre-compute token counts for every chunk (not character counts — they diverge by 10-30% from actual tokens). Then solve a greedy knapsack: pick the highest-scoring chunk that fits the remaining budget, repeat until full. Reserve at least 500 tokens for the question and 1,000-2,000 for the response.
  • A subtle failure mode: relevance scores from different queries are not comparable. A 0.85 score for a vague query might be less useful than a 0.72 score for a specific one. If you hard-code a threshold like “only include above 0.8,” you over-filter specific queries and under-filter vague ones. Use relative ranking, not absolute thresholds.
Follow-up: Users report the system sometimes gives correct answers but cites the wrong chunk. How do you debug this?
This is a context position bias problem. LLMs disproportionately attend to the beginning and end of the context window (“lost in the middle” phenomenon). If chunk 1 is first and chunk 3 is buried in the middle, the model might synthesize from chunk 3 but attribute to chunk 1. The fix is placing the most relevant chunks at the beginning and end with less relevant ones in the middle, or randomizing order to eliminate systematic bias. For citations, ask the model to quote the exact sentence it cites, then verify that sentence actually appears in the claimed source chunk as a post-processing validation step.
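
The MMR step described in the answer above can be sketched in a few lines. To keep the example dependency-free, Jaccard word overlap stands in for embedding similarity, which is an assumption made purely for illustration; in a real pipeline the similarity would come from the same embeddings used for retrieval. The function names are hypothetical.

```python
import re
from typing import List, Tuple

def jaccard(a: str, b: str) -> float:
    """Word-set overlap, standing in for embedding similarity."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def mmr_select(
    candidates: List[Tuple[str, float]],  # (chunk text, relevance score)
    k: int,
    lam: float = 0.7,
) -> List[str]:
    """Pick k chunks, trading relevance against redundancy with earlier picks."""
    remaining = list(candidates)
    selected: List[str] = []
    while remaining and len(selected) < k:
        def mmr_score(candidate: Tuple[str, float]) -> float:
            text, relevance = candidate
            # Penalize by similarity to the closest already-selected chunk
            redundancy = max((jaccard(text, s) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best[0])
        remaining.remove(best)
    return selected

candidates = [
    ("refund policy allows returns within 30 days", 0.92),
    ("refund policy allows returns within 30 days of purchase", 0.90),
    ("shipping takes five business days", 0.80),
]
print(mmr_select(candidates, k=2))
```

A plain top-2 by relevance would return the two near-duplicate refund chunks. With `lam=0.7`, the second pick goes to the shipping chunk because the redundancy penalty outweighs its lower relevance.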
Strong Answer:
  • This is the fundamental limitation of summarization-based memory: summaries are lossy compression, and specific numbers are exactly the kind of detail that gets lost. The model summarizes “$47,500” into “budget constraints” because it optimizes for brevity, not precision.
  • The solution is a dual-memory architecture. Layer one is the running summary (captures themes and context). Layer two is a structured fact store: a key-value dictionary that extracts and preserves specific data points such as numbers, dates, names, and commitments. After every user turn, run a cheap extraction call asking what specific facts the user mentioned. Store results like {"budget": "$47,500", "deadline": "March 15"}.
  • The fact store is injected into the system prompt separately from the summary. It is never summarized or compressed — it persists verbatim for the entire conversation. The token cost is small (a few hundred tokens) but the reliability improvement is massive.
  • The architecture becomes: system prompt (never dropped) + fact store (never summarized) + running summary (compressed older context) + recent messages (verbatim last N turns). Each layer has different compression characteristics.
Follow-up: The fact store grows to 800 tokens over a 50-turn conversation. When and how do you prune it?
Never delete facts the user has referenced more than once. For the rest, apply recency-weighted relevance: full weight for facts from the last 10 turns, half weight for 10-30 turns ago, and facts older than 30 turns that have never been re-referenced are candidates for archival into the summary. The safeguard is that pruning never silently drops a fact: if “budget: $47,500” moves to the summary, explicitly tell the summarizer to include the exact figure. Also deduplicate: if the user corrected the budget from $47,500 to $50,000, keep only the latest value.
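
The pruning rule in the follow-up answer above can be sketched as a partition over the fact store. The `Fact` fields and the function name are hypothetical, and giving pinned (re-referenced) facts a weight of 1.0 is an assumption; the 10-turn and 30-turn thresholds come from the answer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Fact:
    value: str
    last_turn: int        # Turn at which the fact was last stated or referenced
    reference_count: int  # How many times the user has mentioned it

def partition_facts(
    facts: Dict[str, Fact],
    current_turn: int,
) -> Tuple[Dict[str, float], List[str]]:
    """Return (weights for kept facts, keys to archive into the summary)."""
    weights: Dict[str, float] = {}
    archive: List[str] = []
    for key, fact in facts.items():
        age = current_turn - fact.last_turn
        if fact.reference_count > 1 or age <= 10:
            weights[key] = 1.0   # Re-referenced or recent: full weight
        elif age <= 30:
            weights[key] = 0.5   # 10-30 turns old: half weight
        else:
            archive.append(key)  # Old and never re-referenced
    return weights, archive

facts = {
    "budget": Fact("$50,000", last_turn=5, reference_count=3),
    "deadline": Fact("March 15", last_turn=45, reference_count=1),
    "color": Fact("blue", last_turn=30, reference_count=1),
    "pet": Fact("a cat named Milo", last_turn=2, reference_count=1),
}
weights, archive = partition_facts(facts, current_turn=50)
print(weights)  # {'budget': 1.0, 'deadline': 1.0, 'color': 0.5}
print(archive)  # ['pet']
```

Archived keys are not deleted here. The caller is expected to pass each archived value to the summarizer with an instruction to keep the exact figure.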
Strong Answer:
  • LLMLingua uses a small language model to remove tokens that contribute least to meaning — intelligent truncation at the token level. It is fast (runs locally, no API call), cheap, and predictable. The output reads oddly to humans but retains information the LLM needs.
  • LLM-based extractive compression selects the most relevant sentences given a specific query. It produces clean, readable output but costs an API call and adds latency. It is smarter because it can prioritize based on the query.
  • Choose LLMLingua for compressing large contexts quickly and cheaply (e.g., 10 retrieved documents before a RAG call) when readability does not matter. Choose extractive compression when output might be shown to users or when query-specific relevance matters more than uniform compression.
  • The fundamental risk of any compression: you irreversibly discard information before the model sees it. If the compression removes a critical sentence, no prompt engineering recovers it. Mitigation is conservative compression ratios (keep 50-70%) and always compressing with the user’s query as context.
Follow-up: You compress to 50% and answer quality drops 15%. How do you determine whether compression or retrieval is the bottleneck?
Run an ablation on 100 queries with four conditions: (1) uncompressed top-5 chunks, (2) compressed top-5 chunks, (3) top-10 chunks compressed to the same token budget as condition 1, (4) manually selected gold-standard context. If conditions 1 and 2 have similar quality, compression is fine and retrieval is the bottleneck. If condition 2 is much worse, compression is removing important information. If condition 4 is dramatically better than all others, retrieval is the bottleneck regardless. This ablation costs $5-10 in API calls and saves weeks of optimizing the wrong component.